There is no easy way. If your pages come from the same site or share the same structure, you can implement a special parser that extracts text from specific parts of a page. If you want this to work for any HTML page, I don't think there is a general way.
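For the same-structure case, a rough sketch of such a special parser might look like the following. It is written in Python with the standard library's html.parser, not against any particular crawler's parser. The tag lists KEEP_TAGS and SKIP_TAGS and the names ContentTextExtractor and extract_clean_text are made up for illustration, and the sketch assumes the markup closes its tags properly.

```python
from html.parser import HTMLParser

# Hypothetical tag lists; tune them for the pages you actually crawl.
KEEP_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"li", "script", "style"}


class ContentTextExtractor(HTMLParser):
    """Collects text inside KEEP_TAGS, ignoring anything nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.keep_depth = 0   # how many KEEP_TAGS are currently open
        self.skip_depth = 0   # how many SKIP_TAGS are currently open
        self.current = []     # text pieces of the block being read
        self.blocks = []      # finished text blocks, in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in KEEP_TAGS:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in KEEP_TAGS and self.keep_depth > 0:
            self.keep_depth -= 1
            if self.keep_depth == 0:
                self._flush()

    def handle_data(self, data):
        # Keep text only when inside a content tag and outside a skipped one.
        if self.keep_depth > 0 and self.skip_depth == 0:
            self.current.append(data)

    def _flush(self):
        # Collapse runs of whitespace and store the block if it is non-empty.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []


def extract_clean_text(html):
    """Return the list of text blocks found in paragraphs and headings."""
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling extract_clean_text on a page returns the paragraph and heading text as a list of strings, which can then be joined and indexed. Real pages often leave <li> or <p> unclosed, so the depth counters here would need extra handling before this is usable on arbitrary sites.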
Look at the HTML parser, which is responsible for parsing. It extracts text from tags. I suggest ignoring tags such as <li>, which are often used for menus, and allowing only tags such as <p> or <h1>, <h2>, ... Either way, you will need to tune the HTML parser for this purpose.

Best Regards
Alexander Aristov

2009/5/21 fadzi ushewokunze <[email protected]>

> hi all,
>
> does anyone know a way of cleaning up text that has been crawled from
> the web? for example, most web pages have a lot of noise ie text from
> menus, footers, adverts, etc.. i am looking for a way to clean this up
> and end up with clean text say continuous paragraphs that actually have
> some information in them. thats all i want to index.
>
> thanks.
>
> fadzi
