Hi there, I got a chance to look at this. For a first cut I extended the HTML parser to simply ignore the <li>, <h1>, etc. and only parse <p> paragraphs. It works quite well, considering it was a quick and easy fix.
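
In case the shape of the change is useful to anyone, here is a rough standalone sketch of what I did. It is not the actual patch - the class and method names are made up, and the real change just works on whatever DOM the parser plugin already builds:

    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    /*
     * Rough sketch of the "paragraphs only" extraction: walk the DOM the
     * parser produces, skip <li>/<h1>..<h6> subtrees entirely, and only
     * collect text that sits inside a <p> element.
     */
    public class ParagraphOnlyExtractor {

      public static String getText(Node root) {
        StringBuilder sb = new StringBuilder();
        collect(root, sb, false);
        return sb.toString().trim();
      }

      private static void collect(Node node, StringBuilder sb, boolean inParagraph) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
          String name = node.getNodeName().toLowerCase();
          if ("li".equals(name) || name.matches("h[1-6]")) {
            return; // ignore list items and headings entirely
          }
          if ("p".equals(name)) {
            inParagraph = true;
          }
        } else if (node.getNodeType() == Node.TEXT_NODE && inParagraph) {
          sb.append(node.getNodeValue()).append(' ');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          collect(children.item(i), sb, inParagraph);
        }
      }
    }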
The next issue is that I am ending up with a lot of empty fields in the Lucene index. A quick look around suggests that the indexed URLs in question really did carry useless information, so that much is fine. My question is: what is the best way to discard documents with empty content before I even get to index them? My first take was to do something in the HTMLParser class, but I am not sure how to go about this as early as possible during a crawl. Any suggestions? (I've put a rough sketch of the check I have in mind below the quoted mail.)

On Fri, 2009-05-22 at 12:12 +0200, Andrzej Bialecki wrote:
> Iain Downs wrote:
> > There's a half a dozen approaches in the competition. What's useful is the
> > paper which came out of it (I think there may have been another competition
> > since then also) which details the approaches taken.
> >
> > I have my own approach to this (not entered in CleanEval), but it's
> > commercial and not yet ready for prime-time, I'm afraid.
> >
> > One simple approach (from Serge Sharroff), is to estimate the density of
> > tags. The lower the density of tags, the more likely it is to be proper
> > text.
> >
> > What is absolutely clear is that you have to play the odds. There is no way
> > at the moment that you can get near 100% success. And I reckon if there
> > was, Google would be doing it (their results quality is somewhat poorer for
> > including navigation text - IMHO).
>
> I described a simple method that works reasonably well here:
>
> http://article.gmane.org/gmane.comp.search.nutch.devel/25020
>
> But I agree, in general case the problem is hard. Algorithms that work
> in the context of a single page are usually worse than the ones that
> work on a whole corpus (or a subset of it, e.g. all pages from a site,
> or from a certain hierarchy in a site), but they are also much faster.
> If the quick & dirty gives you 80% of what you want, then maybe there's
> no reason in getting too sophisticated ;)
>
>
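
To make the question concrete, this is the kind of check I mean - just plain Java, not wired into any particular extension point yet, because which hook (the parser itself, a parse filter, or an indexing filter that drops the document) is the earliest sensible place for it is exactly what I'm unsure about:

    /*
     * Sketch of the "is this worth indexing at all?" check. The length
     * threshold is arbitrary and would need tuning; where to call this from
     * in the crawl/index pipeline is the open question.
     */
    public class EmptyContentGate {

      private static final int MIN_TEXT_LENGTH = 40; // arbitrary cut-off

      public static boolean isWorthIndexing(String extractedText) {
        if (extractedText == null) {
          return false;
        }
        return extractedText.trim().length() >= MIN_TEXT_LENGTH;
      }

      public static void main(String[] args) {
        System.out.println(isWorthIndexing("   "));  // false -> skip document
        System.out.println(isWorthIndexing(
            "A real paragraph of extracted text that is long enough to keep.")); // true
      }
    }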
