Hi Luis, just an opinion (I worked with Nutch intensively from 2005 to 2008).
Web crawling is a bitch, and Nutch won't make it any easier.

Some problems you'll find along the way:

   1. Spider traps and crawler tunnels
   2. Duplicate and near-duplicate content removal
   3. GET parameter explosion in dynamic pages
   4. Compromises between breadth and depth of crawl (you only have so
   much time, and every site has its own unique link geometry)

Nutch has its own set of tools (URL filters, depth control, ...) to cope
with each of these, but sometimes you solve, say, #3, only to have #4 come
back again.
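
For instance, taming #3 usually starts in Nutch's stock URL filter file,
conf/regex-urlfilter.txt ('-' skips matching URLs, '+' accepts them, first
match wins). A minimal sketch; the trap pattern is only a hypothetical
example:

    # skip URLs with query strings (the GET parameter explosion)
    -[?*!@=]
    # skip a known calendar-style spider trap (hypothetical pattern)
    -.*calendar.*
    # accept everything else
    +.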

My advice would be to use "some popular search engines" as a way to mine the
web (you can always ask for all the pages indexed in a domain). They have
done this job, and done it nicely. In fact, due to their ranking algorithms
(based on link geometry), a 'popular' page will always be indexed, and to
me that's a good thing (i.e. you can always claim that your own web crawler
covered more URLs for a specific site, but what's the value if the extra
URLs are *not that important*?).
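
Most engines expose that through the site: query operator (the domain below
is just the one from Luis's example):

    site:www.ign.com            everything they have indexed for the domain
    site:www.ign.com review     the same, narrowed down by a keyword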

If I'm absolutely forced to crawl a site, I use plain old 'curl' or 'wget':
open source, tunable via a vast array of parameters, and no 'black boxes'. I
see no justification for deploying 'the Nutch monster' just to crawl some
portion of the web that "popular search engines" have already crawled.
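
For reference, this is the kind of invocation I'd start from (all standard
wget options; the domain, depth, and delay are placeholders you'd tune per
site):

    wget --recursive --level=3 --no-parent \
         --wait=1 --random-wait \
         --accept html,htm --convert-links \
         http://www.ign.com/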

On the scraping / XHTML mining front, the 'mechanize' library (Python, Perl,
Ruby, whatever flavour) and 'Beautiful Soup' for Python have always fed my
hunger for web scraping.
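
A minimal sketch of that combo (Python 2, as both libraries were commonly
used back then; the URL and User-Agent are just examples):

    import mechanize
    from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3

    br = mechanize.Browser()
    br.set_handle_robots(True)                # honour robots.txt
    br.addheaders = [('User-Agent', 'my-scraper/0.1')]

    response = br.open('http://www.ign.com')  # fetch the page
    soup = BeautifulSoup(response.read())     # parse the HTML

    # dump every outgoing link on the page
    for a in soup.findAll('a', href=True):
        print a['href']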

Good luck :D


On Tue, Oct 18, 2011 at 9:16 AM, Marco Martinez <mmarti...@paradigmatecnologico.com> wrote:

> Hi Luis,
>
> Have you tried the copyField directive with custom analyzers and tokenizers?
>
> bye,
>
> Marco Martínez Bautista
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>
> 2011/10/18 Luis Cappa Banda <luisca...@gmail.com>
>
> > Hello everyone.
> >
> > I've been thinking about a way to retrieve information from a domain (for
> > example, http://www.ign.com) to process and index. My idea is to use Solr
> > as a searcher. I'm familiar with Apache Nutch, and I know that the latest
> > version has a gateway to Solr to retrieve and index information with it. I
> > tried it and it worked fine, but it's a little bit complex to develop
> > plugins that process the info and index it into a new desired field.
> > Perhaps one of you has tried another (and better) alternative for mining
> > web data. What is your recommendation? Can you give me any scraping
> > suggestions?
> >
> > Thank you very much.
> >
> > Luis Cappa.
> >
>



-- 
Whether it's science, technology, personal experience, true love, astrology,
or gut feelings, each of us has confidence in something that we will never
fully comprehend.
 --Roy H. Williams
