Hi there,

Got a chance to look at this. For a first cut I extended the HTML parser
to simply ignore <li>, <h1>, etc. and only parse <p> (paragraph) elements.
It works quite well considering it was a quick and easy fix.
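
In case it helps to see it, the core of the change boils down to
something like this (a simplified sketch rather than the actual patch;
it assumes you already have an org.w3c.dom.Document like the one the
parse-html plugin builds, and the class name is my own):

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ParagraphExtractor {

  // Collect text from <p> elements only, ignoring <li>, <h1>, etc.
  public static String extractParagraphText(Document doc) {
    StringBuilder sb = new StringBuilder();
    NodeList paragraphs = doc.getElementsByTagName("p");
    for (int i = 0; i < paragraphs.getLength(); i++) {
      String text = paragraphs.item(i).getTextContent().trim();
      if (text.length() > 0) {
        sb.append(text).append('\n');
      }
    }
    return sb.toString();
  }
}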

The next issue is that I am ending up with a lot of empty fields in the
Lucene index. A quick look around suggests that those indexed URLs really
did contain nothing useful, so the parser seems to be doing its job.

My question is: what is the best way to discard documents that have
empty content before they ever reach the index? My first take was to do
something in the HTMLParser class, but I am not sure how to go about
this as early as possible during a crawl.
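
To make the question concrete, here is the kind of check I have in mind,
sketched as an indexing filter instead of an HTMLParser change (this
assumes Nutch's IndexingFilter extension point, where returning null
discards a document; the class name is made up and the exact interface
differs between Nutch versions, so treat it as an outline):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class EmptyContentFilter implements IndexingFilter {

  private Configuration conf;

  // Returning null here tells Nutch to drop the document before indexing.
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
                              CrawlDatum datum, Inlinks inlinks) {
    String text = parse.getText();
    if (text == null || text.trim().length() == 0) {
      return null; // nothing worth indexing
    }
    return doc;
  }

  public void addIndexBackendOptions(Configuration conf) {
    // no-op; part of the 1.0-era interface, check against your version
  }

  public Configuration getConf() { return conf; }

  public void setConf(Configuration conf) { this.conf = conf; }
}

That still happens at indexing time though, so it does not save the
fetch or the parse; doing it earlier in the crawl is exactly what I am
asking about.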

Any suggestions?

On Fri, 2009-05-22 at 12:12 +0200, Andrzej Bialecki wrote:
> Iain Downs wrote:
> > There are half a dozen approaches in the competition.  What's useful is the
> > paper which came out of it (I think there may have been another competition
> > since then) which details the approaches taken.
> > 
> > I have my own approach to this (not entered in CleanEval), but it's
> > commercial and not yet ready for prime-time, I'm afraid.
> > 
> > One simple approach (from Serge Sharroff), is to estimate the density of
> > tags.  The lower the density of tags, the more likely it is to be proper
> > text.
> > 
> > What is absolutely clear is that you have to play the odds.  There is no way
> > at the moment that you can get near 100% success.  And I reckon if there
> > was, Google would be doing it (their results quality is somewhat poorer for
> > including navigation text - IMHO).
> 
> I described a simple method that works reasonably well here:
> 
> http://article.gmane.org/gmane.comp.search.nutch.devel/25020
> 
> But I agree, in the general case the problem is hard. Algorithms that work 
> in the context of a single page are usually worse than the ones that 
> work on a whole corpus (or a subset of it, e.g. all pages from a site, 
> or from a certain hierarchy in a site), but they are also much faster. 
> If the quick & dirty approach gives you 80% of what you want, then maybe 
> there's no reason to get too sophisticated ;)
> 
> 
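
P.S. On the tag-density idea Iain mentions above: I imagine even
something as crude as the ratio of markup characters to total characters
per block would do as a first cut, keeping blocks below some threshold
as probable body text (a rough sketch with my own names; the threshold
would need tuning on real pages):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagDensity {

  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  // Fraction of the chunk taken up by tags, in [0, 1].
  // Lower values suggest real prose, higher ones navigation/boilerplate.
  public static double tagDensity(String html) {
    if (html.length() == 0) {
      return 0.0;
    }
    int tagChars = 0;
    Matcher m = TAG.matcher(html);
    while (m.find()) {
      tagChars += m.end() - m.start();
    }
    return (double) tagChars / html.length();
  }
}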
