Iain Downs wrote:
> There are half a dozen approaches in the competition. What's useful is the
> paper that came out of it (I think there may have been another competition
> since then, too), which details the approaches taken.
> I have my own approach to this (not entered in CleanEval), but it's
> commercial and not yet ready for prime time, I'm afraid.
> One simple approach (from Serge Sharoff) is to estimate the density of
> tags: the lower the density of tags, the more likely the text is to be
> proper text.
> What is absolutely clear is that you have to play the odds. There is no
> way at the moment to get near 100% success. And I reckon that if there
> were, Google would be doing it (their results quality is somewhat poorer
> for including navigation text - IMHO).
I described a simple method that works reasonably well here:
http://article.gmane.org/gmane.comp.search.nutch.devel/25020
But I agree, in the general case the problem is hard. Algorithms that work
in the context of a single page are usually worse than the ones that
work on a whole corpus (or a subset of it, e.g. all pages from a site,
or from a certain hierarchy in a site), but they are also much faster.
If the quick & dirty approach gives you 80% of what you want, then maybe
there's no point in getting too sophisticated ;)
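Incidentally, the tag-density heuristic is easy to prototype. Here's a toy
sketch of how I read the idea (not Serge's actual implementation - the class
name, the way the page is split into blocks, and the 0.5 threshold are all
made up for illustration): score each block by the fraction of its characters
that sit inside markup, and keep the blocks where that fraction is low.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagDensity {
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  // Fraction of a block's characters that belong to markup tags.
  static double tagDensity(String block) {
    if (block.length() == 0) return 1.0;
    int tagChars = 0;
    Matcher m = TAG.matcher(block);
    while (m.find()) {
      tagChars += m.end() - m.start();
    }
    return (double) tagChars / block.length();
  }

  public static void main(String[] args) {
    // Hypothetical input: one block of navigation, one block of prose.
    String[] blocks = {
      "<div><a href=\"/home\">Home</a> | <a href=\"/about\">About</a></div>",
      "<p>This paragraph is mostly plain prose, so its tag density is low.</p>"
    };
    double threshold = 0.5; // arbitrary cut-off, would need tuning on real data
    for (String block : blocks) {
      double d = tagDensity(block);
      System.out.printf("%.2f %s%n", d, d < threshold ? "keep" : "drop");
    }
  }
}

With a threshold tuned on a handful of pages this already separates
navigation-heavy blocks from body text surprisingly often, though it says
nothing about how to split the page into blocks in the first place.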
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com