Iain Downs wrote:
> There are half a dozen approaches in the competition. What's useful is the
> paper that came out of it (I think there may have been another competition
> since then, too), which details the approaches taken.
> I have my own approach to this (not entered in CleanEval), but it's
> commercial and not yet ready for prime time, I'm afraid.
> One simple approach (from Serge Sharoff) is to estimate the density of
> tags: the lower the density of tags, the more likely the text is to be
> proper text.
> What is absolutely clear is that you have to play the odds. There is no
> way at the moment to get near 100% success. And I reckon that if there
> were, Google would be doing it (their results quality is somewhat poorer
> for including navigation text - IMHO).
I described a simple method that works reasonably well here:
http://article.gmane.org/gmane.comp.search.nutch.devel/25020
But I agree, in the general case the problem is hard. Algorithms that work
in the context of a single page are usually worse than the ones that
work on a whole corpus (or a subset of it, e.g. all pages from a site,
or from a certain hierarchy in a site), but they are also much faster.
If the quick & dirty approach gives you 80% of what you want, then maybe
there's no point in getting too sophisticated ;)
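Incidentally, the tag-density heuristic is easy to prototype. Here's a toy
sketch of how I read the idea (not Serge's actual implementation - the class
name, the way the page is split into blocks, and the 0.5 threshold are all
made up for illustration): score each block by the fraction of its characters
that sit inside markup, and keep the blocks where that fraction is low.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagDensity {
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  // Fraction of a block's characters that belong to markup tags.
  static double tagDensity(String block) {
    if (block.length() == 0) return 1.0;
    int tagChars = 0;
    Matcher m = TAG.matcher(block);
    while (m.find()) {
      tagChars += m.end() - m.start();
    }
    return (double) tagChars / block.length();
  }

  public static void main(String[] args) {
    // Hypothetical input: one block of navigation, one block of prose.
    String[] blocks = {
      "<div><a href=\"/home\">Home</a> | <a href=\"/about\">About</a></div>",
      "<p>This paragraph is mostly plain prose, so its tag density is low.</p>"
    };
    double threshold = 0.5; // arbitrary cut-off, would need tuning on real data
    for (String block : blocks) {
      double d = tagDensity(block);
      System.out.printf("%.2f %s%n", d, d < threshold ? "keep" : "drop");
    }
  }
}

With a threshold tuned on a handful of pages this already separates
navigation-heavy blocks from body text surprisingly often, though it says
nothing about how to split the page into blocks in the first place.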
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com