Hello all,

I'm trying to figure out the best way to crawl a site without picking up any of the irrelevant bits such as Flash widgets, JavaScript, links to ad networks, and so on. The objective is to index all the relevant textual data. (This could be extended to other forms of data too, of course.)

My main question is: should this sort of elimination be done during the crawl, which would mean modifying the crawler, or should everything be crawled and indexed first, with a text-parsing step afterwards that applies some logic to extract the relevant bits?

Using the crawl-urlfilter seems like the way to do the first option, but I believe it has its drawbacks. Firstly, it needs regexps that match URLs, and these would have to be handwritten (even automated scripts would need human intervention at some point). For instance, the scripts or images may be hosted at scripts.foo.com or at foo.com/bar/foobar/scripts, and those two cases are different enough to make automation tough. Any such customization would also have to be tailor-made for each site crawled, which is a tall task. Is there a way to extend the crawler itself to do this? I remember seeing something in the list archives about extending the crawler, but I can't find it now. Any pointers?
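
For concreteness, the kind of handwritten, per-site entries I mean would look roughly like this, in the usual one-regex-per-line, +/- prefixed format of conf/crawl-urlfilter.txt (the foo.com hosts and paths are just the hypothetical examples above):

  # skip script/media hosts and paths for this particular site
  -^http://scripts\.foo\.com/
  -.*/scripts(/|$)
  -\.(js|swf|gif|jpg|png|css)$

  # accept everything else on foo.com
  +^http://([a-z0-9]*\.)*foo\.com/

Getting entries like these right for every new site is exactly the part that seems hard to automate.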

The second option would be to write some sort of custom class for the indexer (a variant of the plugin example on the wiki, I guess).
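
To make that concrete, the logic I'd want such a class to run is roughly the following. This is only a standalone Java sketch: a real version would presumably sit behind one of the Nutch plugin extension points (IndexingFilter or HtmlParseFilter) rather than a main(), and the ad-host list here is obviously just a placeholder.

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Rough sketch of the "relevant text only" logic for option two.
 * A real version would live inside a Nutch plugin; the ad-host list
 * and regexes are placeholders.
 */
public class RelevantTextExtractor {

  // Hypothetical ad/script hosts whose links we don't want indexed.
  private static final List<String> AD_HOSTS =
      Arrays.asList("ads.example.com", "doubleclick.net");

  // Drop whole <script> and <style> blocks, then any remaining tags.
  private static final Pattern SCRIPT_OR_STYLE =
      Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
  private static final Pattern ANY_TAG = Pattern.compile("(?s)<[^>]+>");

  /** Returns the plain text of a page with scripts, styles and markup removed. */
  public static String extractText(String html) {
    String cleaned = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
    cleaned = ANY_TAG.matcher(cleaned).replaceAll(" ");
    return cleaned.replaceAll("\\s+", " ").trim();
  }

  /** True if an outlink points at one of the ad networks we want to skip. */
  public static boolean isAdLink(String url) {
    for (String host : AD_HOSTS) {
      if (url.contains(host)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String page = "<html><head><style>p{}</style></head>"
        + "<body><script>var x=1;</script><p>Actual article text.</p></body></html>";
    System.out.println(extractText(page));                         // "Actual article text."
    System.out.println(isAdLink("http://ads.example.com/banner")); // true
  }
}

The idea is simply to strip script/style blocks and markup before the text reaches the index, and to drop outlinks that point at known ad networks.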

Either way, I'm not sure which method is better. Any ideas would be appreciated!

Cheers,
Viksit

PS: Cross-posted to nutch-user and nutch-agent, since I wasn't sure which list was the better option.
