You may always return null during the parsing phase for documents which you don't like.
So if you see that the content is empty, you may return null and the document won't be indexed nor stored.

Best Regards
Alexander Aristov

2009/5/26 Fadzi Ushewokunze <[email protected]>

> hi there,
>
> got a chance to look at this. for a first cut i extended the html parser
> to simply ignore the <li>, <h1>, etc. and only parse <p> paragraphs. it
> works quite well considering it was a quick and easy fix.
>
> the next issue is that i am ending up with a lot of empty fields in the
> lucene index. a quick look around seems to show that those indexed urls
> actually had useless information anyway, which is a good thing.
>
> my question is: what is the best way to discard documents that have
> empty content before i even get to index them? my first take was to do
> something in the HTMLParser class, but i am not sure how to go about
> this as early as possible during a crawl.
>
> any suggestions?
>
> On Fri, 2009-05-22 at 12:12 +0200, Andrzej Bialecki wrote:
> > Iain Downs wrote:
> > > There are half a dozen approaches in the competition. What's useful
> > > is the paper which came out of it (I think there may have been
> > > another competition since then) which details the approaches taken.
> > >
> > > I have my own approach to this (not entered in CleanEval), but it's
> > > commercial and not yet ready for prime time, I'm afraid.
> > >
> > > One simple approach (from Serge Sharoff) is to estimate the density
> > > of tags. The lower the density of tags, the more likely it is to be
> > > proper text.
> > >
> > > What is absolutely clear is that you have to play the odds. There is
> > > no way at the moment that you can get near 100% success. And I
> > > reckon if there was, Google would be doing it (their results quality
> > > is somewhat poorer for including navigation text - IMHO).
> >
> > I described a simple method that works reasonably well here:
> >
> > http://article.gmane.org/gmane.comp.search.nutch.devel/25020
> >
> > But I agree, in the general case the problem is hard. Algorithms that
> > work in the context of a single page are usually worse than the ones
> > that work on a whole corpus (or a subset of it, e.g. all pages from a
> > site, or from a certain hierarchy in a site), but they are also much
> > faster. If the quick & dirty approach gives you 80% of what you want,
> > then maybe there's no reason to get too sophisticated ;)
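
For reference, below is a minimal sketch of the "return null from the parser" idea described at the top of this message, written against Nutch 1.x-era classes (HtmlParser from the parse-html plugin, ParseResult, Parse, Content). The package and class names are illustrative, and the exact plugin interfaces and accessors may differ between Nutch versions, so treat this as a sketch rather than a drop-in replacement:

    // Sketch only: assumes Nutch 1.x-style parse API; names like
    // NonEmptyHtmlParser and org.example.parse are hypothetical.
    package org.example.parse;

    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.parse.html.HtmlParser;
    import org.apache.nutch.protocol.Content;

    public class NonEmptyHtmlParser extends HtmlParser {

      @Override
      public ParseResult getParse(Content content) {
        // Let the stock HTML parser do its work first.
        ParseResult result = super.getParse(content);
        if (result == null) {
          return null;
        }
        // Look up the parse for this URL (lookup-by-URL accessor assumed
        // from the 1.x ParseResult) and inspect the extracted text.
        Parse parse = result.get(content.getUrl());
        if (parse == null || parse.getText() == null
            || parse.getText().trim().length() == 0) {
          // Returning null drops the document: neither stored nor indexed.
          return null;
        }
        return result;
      }
    }

You would then register this class through the plugin system in place of the stock parse-html parser (roughly: its own plugin descriptor plus an entry in plugin.includes), but the wiring details depend on your Nutch version.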

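The tag-density heuristic that Iain Downs attributes to Serge Sharoff in the quoted thread can be illustrated in a few lines of plain Java. The regex and the 0.5 threshold below are arbitrary choices for the sketch, not values taken from the thread:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagDensity {

      private static final Pattern TAG = Pattern.compile("<[^>]*>");

      /** Fraction of characters in the block that belong to markup (0.0 - 1.0). */
      public static double tagDensity(String html) {
        if (html == null || html.length() == 0) {
          return 1.0; // treat empty input as all markup, i.e. no useful text
        }
        int tagChars = 0;
        Matcher m = TAG.matcher(html);
        while (m.find()) {
          tagChars += m.end() - m.start();
        }
        return (double) tagChars / html.length();
      }

      /** Blocks dominated by markup are likely navigation or boilerplate. */
      public static boolean looksLikeBoilerplate(String htmlBlock) {
        return tagDensity(htmlBlock) > 0.5; // arbitrary threshold for the sketch
      }
    }

Blocks with a high density are candidates for navigation or other boilerplate; as the thread notes, this only plays the odds and won't get anywhere near 100% accuracy.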