will definitely have a look at this CLEANEVAL thing. looks interesting. have you used it before? thanks for the suggestion.

i guess the best bet might be a combination of Alexander's suggestion, i.e. stripping out the <li>, <h1> etc tags, plus some text cleaning application. i have been playing around with summarization, topic generation and tf-idf techniques, to no avail.


Quoting Iain Downs <[email protected]>:

There is an academic competition CLEANEVAL which assesses and publishes
alternate approaches to this problem.

It's not easy.

Iain

-----Original Message-----
From: Alexander Aristov [mailto:[email protected]]
Sent: 21 May 2009 13:24
To: [email protected]; [email protected]
Subject: Re: clean text

There is no easy way.  If your pages are from the same site or share the
same structure, then you can implement a special parser which extracts
text from certain parts of a page; but if you want this for any html page,
then I think there is no general way.

Look at the html parser, which is responsible for parsing; it extracts text
from tags. I suggest you ignore tags like <li>, which are often
used for menus, and allow only tags like <p>, <h1>, <h2> ...

Anyway, you will need to tune the html parser for the purpose.
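A minimal sketch of that tag-filtering idea in Python, using the standard library's html.parser (the whitelist and blacklist of tags here are just example choices, not a definitive list):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep text only inside content tags; skip tags often used for menus."""
    KEEP = {"p", "h1", "h2", "h3"}      # example whitelist of content tags
    SKIP = {"li", "script", "style"}    # example blacklist of noisy tags

    def __init__(self):
        super().__init__()
        self.keep_depth = 0   # how deep we are inside whitelisted tags
        self.skip_depth = 0   # how deep we are inside blacklisted tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.keep_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.keep_depth:
            self.keep_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is inside a whitelisted tag and not
        # inside any blacklisted one.
        if self.keep_depth and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

def clean_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

sample = ("<html><body><ul><li>Home</li><li>About</li></ul>"
          "<h1>Title</h1><p>Real content here.</p></body></html>")
print(clean_text(sample))  # menu items are dropped, heading and paragraph kept
```

Real-world pages are messier (nested divs, text outside <p>, malformed markup), so this only works as a starting point to tune per site.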


Best Regards
Alexander Aristov


2009/5/21 fadzi ushewokunze <[email protected]>

hi all,

does anyone know a way of cleaning up text that has been crawled from
the web? for example, most web pages have a lot of noise, i.e. text from
menus, footers, adverts, etc. i am looking for a way to clean this up
and end up with clean text, say continuous paragraphs that actually have
some information in them. that's all i want to index.

thanks.

fadzi
