There are half a dozen approaches in the competition. What's useful is the paper which came out of it (I think there may have been another competition since then as well), which details the approaches taken.
I have my own approach to this (not entered in CleanEval), but it's commercial and not yet ready for prime-time, I'm afraid.

One simple approach (from Serge Sharroff) is to estimate the density of tags: the lower the density of tags in a block, the more likely it is to be proper text. (A rough sketch of this idea follows at the end of the thread.)

What is absolutely clear is that you have to play the odds. There is no way at the moment to get near 100% success, and I reckon if there were, Google would be doing it (their result quality is somewhat poorer for including navigation text, IMHO).

Iain

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: 22 May 2009 06:08
To: [email protected]
Subject: RE: clean text

will definitely have a look at this CLEANEVAL thing. looks interesting. have you used it before? thanks for the suggestion.

i guess the best bet might be a combination of Alexander's suggestion, ie stripping down the <li> and <h1> etc tags, plus some text cleaning application. i have been playing around with summarization, topic generation and tf_idf techniques to no avail.

Quoting Iain Downs <[email protected]>:

> There is an academic competition, CLEANEVAL, which assesses and publishes
> alternative approaches to this problem.
>
> It's not easy.
>
> Iain
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[email protected]]
> Sent: 21 May 2009 13:24
> To: [email protected]; [email protected]
> Subject: Re: clean text
>
> There is no easy way. If your pages are from the same site or share the
> same structure, then you can implement a special parser which extracts
> text from certain parts of a page, but if you want this for arbitrary
> HTML pages then I think there is no way.
>
> Look at the HTML parser which is responsible for parsing; it extracts
> text from tags. I suggest you ignore tags like <li>, which are often
> used for menus, and allow only tags like <p> or <h1>, <h2> ...
>
> Either way, you will need to tune the HTML parser for the purpose.
>
>
> Best Regards
> Alexander Aristov
>
>
> 2009/5/21 fadzi ushewokunze <[email protected]>
>
>> hi all,
>>
>> does anyone know a way of cleaning up text that has been crawled from
>> the web? for example, most web pages have a lot of noise, ie text from
>> menus, footers, adverts, etc. i am looking for a way to clean this up
>> and end up with clean text, say continuous paragraphs that actually have
>> some information in them. that's all i want to index.
>>
>> thanks.
>>
>> fadzi
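
To make the tag-density heuristic above concrete, here is a minimal sketch in Java. The line-based segmentation, the tag regex, and the 0.3 threshold are illustrative assumptions rather than anything from a published CleanEval entry, and a regex is only a rough stand-in for a real HTML tokenizer:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagDensityFilter {

    // Crude tag matcher; a real tokenizer would also handle comments,
    // scripts, attributes containing '>', and so on.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Fraction of a line's characters taken up by markup (0.0 to 1.0).
    static double tagDensity(String line) {
        if (line.isEmpty()) return 1.0;
        int tagChars = 0;
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            tagChars += m.end() - m.start();
        }
        return (double) tagChars / line.length();
    }

    // Keep lines whose tag density falls below the threshold,
    // stripped of any remaining tags.
    static String extractText(String html, double threshold) {
        StringBuilder out = new StringBuilder();
        for (String line : html.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            if (tagDensity(trimmed) < threshold) {
                out.append(TAG.matcher(trimmed).replaceAll("")).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<ul><li><a href=\"/home\">Home</a></li><li><a href=\"/faq\">FAQ</a></li></ul>\n"
                    + "<p>This paragraph carries the actual content of the page, so\n"
                    + "markup accounts for only a small share of its characters.</p>";
        // The menu line is almost entirely markup and is dropped;
        // the paragraph lines survive.
        System.out.print(extractText(html, 0.3));
    }
}

On the navigation example in main, the menu line is dropped and the paragraph survives, which is about all the odds-playing this heuristic can promise.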

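Alexander's tag-whitelist suggestion can be sketched with the old Swing HTML parser that ships with the JDK (javax.swing.text.html): keep text only while inside <p>, <h1> or <h2>, and ignore everything else, <li> included. The whitelist here is just the one proposed in the thread and is exactly the part that needs per-site tuning:

import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class WhitelistExtractor extends HTMLEditorKit.ParserCallback {

    private int depth = 0; // > 0 while inside a whitelisted element
    private final StringBuilder text = new StringBuilder();

    // The whitelist from the thread: <p>, <h1>, <h2>. Tune as needed.
    private static boolean allowed(HTML.Tag t) {
        return t == HTML.Tag.P || t == HTML.Tag.H1 || t == HTML.Tag.H2;
    }

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (allowed(t)) depth++;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (allowed(t) && depth > 0) {
            depth--;
            text.append('\n');
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Text inside <li> (or any other non-whitelisted tag) is skipped.
        if (depth > 0) text.append(data);
    }

    public static String extract(String html) throws IOException {
        WhitelistExtractor cb = new WhitelistExtractor();
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return cb.text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<ul><li>Home</li><li>About</li></ul>"
                    + "<h1>Title</h1><p>Real body text lives here.</p>";
        System.out.print(extract(html)); // prints the heading and paragraph only
    }
}

Neither sketch gets near 100%; as Iain says, you are playing the odds, and combining the two signals (whitelist plus density) is a reasonable next experiment.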