Hey Martin,

Thanks for the link - that's pretty close to what I was looking for, I'll give it a shot! The discussion which led to the thread you pointed out was even better!

Cheers,
Viksit

On Jan 7, 2008 3:28 AM, Martin Kuen <[EMAIL PROTECTED]> wrote:
> Hi Viksit,
>
> maybe you are looking for this thread:
> http://www.nabble.com/Re%3A-The-ranking-is-wrong-tf4360656.html#a12436465
>
> Cheers,
>
> Martin
>
>
> PS: nutch-user is the correct option. nutch-agent is primarily for
> site-owners who want to report misbehaving nutch bots.
>
> On Jan 7, 2008 4:52 AM, Viksit Gaur <[EMAIL PROTECTED]> wrote:
> > Hello all,
> >
> > I was trying to figure out the best method to crawl a site without
> > getting any of the irrelevant bits such as flash widgets, javascript,
> > links to ad networks, and others. The objective is to index all relevant
> > textual data. (This may be extrapolated to other forms of data of course)
> >
> > My main question is - should this sort of elimination be done during the
> > crawl, which would mean modifying the crawler; or should everything be
> > crawled, indexed, and then have a text parsing system with some logic to
> > extract the relevant bits?
> >
> > Using the crawl-urlfilter seems like the first option, but I believe it
> > has its drawbacks. Firstly, it needs regexps which match URLs, which
> > would have to be handwritten (even automated scripts would need human
> > manipulation at some point). For instance, the scripts or images may be
> > hosted at scripts.foo.com or foo.com/bar/foobar/scripts - both entries
> > are far apart to make automation tough. And any such customizations
> > would need to be tailor made for each site crawled - a tall task. Is
> > there a way to extend the crawler itself to do this? I remember seeing
> > something on the list archives about extending the crawler, but I can't
> > find it again anymore.. Any pointers?
> >
> > The second option was to write some sort of a custom class for the
> > indexer (a form of the pluginexample on the wiki I guess).
> >
> > Either way, I'm not sure what the better method is. Any ideas would be
> > appreciated!
> >
> > Cheers,
> > Viksit
> >
> > PS, Cross posted on nutch-user and nutch-agent, since I wasn't sure
> > which one was a better option.
> >
>
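
For reference, the crawl-urlfilter option discussed in the quoted message can be sketched roughly as below. This is a minimal Python illustration, not Nutch code. It assumes a rule list of sign-prefixed regexps where the first matching rule decides and a URL matching no rule is skipped. The rules themselves and the helper names parse_rules and accepts are hypothetical; the hosts are the foo.com examples from the question.

```python
import re

# Hypothetical rules, written in the sign-prefixed style of a URL filter file:
# '-' excludes, '+' includes. These patterns are illustrative assumptions.
RULES = r"""
# skip scripts, images and flash by file extension
-\.(gif|jpg|png|css|js|swf)$
# skip the two script locations mentioned in the question
-^http://scripts\.foo\.com/
-^http://foo\.com/bar/foobar/scripts
# accept everything else on foo.com and its subdomains
+^http://([a-z0-9]*\.)*foo\.com/
"""


def parse_rules(text):
    """Turn the rule text into a list of (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches anywhere in the URL."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # assumption: a URL that matches no rule is not crawled


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://foo.com/articles/1.html",
        "http://scripts.foo.com/widget.js",
        "http://foo.com/bar/foobar/scripts/x",
        "http://ads.example.net/banner",
    ):
        print(url, "->", "crawl" if accepts(rules, url) else "skip")
```

The sketch also shows the drawback raised in the question: the two script locations need two separate handwritten exclusion rules, and another site would need its own set.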
