Hello all,

I'm trying to figure out the best way to crawl a site without picking up any of the irrelevant bits such as Flash widgets, JavaScript, links to ad networks, and so on. The objective is to index all the relevant textual data. (This could be extended to other forms of data too, of course.)

My main question is: should this sort of elimination be done during the crawl, which would mean modifying the crawler, or should everything be crawled and indexed first, with a text-parsing step afterwards that applies some logic to extract the relevant bits?

Using the crawl-urlfilter seems like the way to do the first option, but I believe it has its drawbacks. Firstly, it needs regexps that match URLs, and these would have to be handwritten (even automated scripts would need human intervention at some point). For instance, the scripts or images may be hosted at scripts.foo.com or at foo.com/bar/foobar/scripts, and those two cases are different enough to make automation tough. Any such customization would also have to be tailor-made for each site crawled, which is a tall task. Is there a way to extend the crawler itself to do this? I remember seeing something in the list archives about extending the crawler, but I can't find it now. Any pointers?
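
For concreteness, the kind of handwritten, per-site entries I mean would look roughly like this, in the usual one-regex-per-line, +/- prefixed format of conf/crawl-urlfilter.txt (the foo.com hosts and paths are just the hypothetical examples above):

  # skip script/media hosts and paths for this particular site
  -^http://scripts\.foo\.com/
  -.*/scripts(/|$)
  -\.(js|swf|gif|jpg|png|css)$

  # accept everything else on foo.com
  +^http://([a-z0-9]*\.)*foo\.com/

Getting entries like these right for every new site is exactly the part that seems hard to automate.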

The second option would be to write some sort of custom class for the indexer (a variant of the plugin example on the wiki, I guess).
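
To make that concrete, the logic I'd want such a class to run is roughly the following. This is only a standalone Java sketch: a real version would presumably sit behind one of the Nutch plugin extension points (IndexingFilter or HtmlParseFilter) rather than a main(), and the ad-host list here is obviously just a placeholder.

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Rough sketch of the "relevant text only" logic for option two.
 * A real version would live inside a Nutch plugin; the ad-host list
 * and regexes are placeholders.
 */
public class RelevantTextExtractor {

  // Hypothetical ad/script hosts whose links we don't want indexed.
  private static final List<String> AD_HOSTS =
      Arrays.asList("ads.example.com", "doubleclick.net");

  // Drop whole <script> and <style> blocks, then any remaining tags.
  private static final Pattern SCRIPT_OR_STYLE =
      Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
  private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

  /** Returns the plain text of a page with scripts, styles and markup removed. */
  public static String extractText(String html) {
    String cleaned = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
    cleaned = ANY_TAG.matcher(cleaned).replaceAll(" ");
    return cleaned.replaceAll("\\s+", " ").trim();
  }

  /** True if an outlink points at one of the ad networks we want to skip. */
  public static boolean isAdLink(String url) {
    for (String host : AD_HOSTS) {
      if (url.contains(host)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String page = "<html><head><style>p{}</style></head>"
        + "<body><script>var x=1;</script><p>Actual article text.</p></body></html>";
    System.out.println(extractText(page));                         // "Actual article text."
    System.out.println(isAdLink("http://ads.example.com/banner")); // true
  }
}

The idea is simply to strip script/style blocks and markup before the text reaches the index, and to drop outlinks that point at known ad networks.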

Either way, I'm not sure which method is better. Any ideas would be appreciated!

Cheers,
Viksit

PS: Cross-posted to nutch-user and nutch-agent, since I wasn't sure which list was the better option.
