Hi all, does anyone know a way of cleaning up text that has been crawled from the web? Most web pages contain a lot of noise, i.e. text from menus, footers, adverts, etc. I am looking for a way to strip this out and end up with clean text, say continuous paragraphs that actually carry some information. That is all I want to index.
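
To make the goal concrete, here is a rough sketch of the kind of filter I mean. It assumes Python with only the standard library. The names (BlockExtractor, extract_paragraphs) and the thresholds (min_words, max_link_density) are made up for illustration and untuned. The idea is to split the page into blocks at block-level tags, then keep only blocks that are long enough and are not mostly link text.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries; an illustrative list, not exhaustive.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol", "br",
              "h1", "h2", "h3", "h4", "section", "article",
              "nav", "header", "footer"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class BlockExtractor(HTMLParser):
    """Collects (text, link_chars) pairs, one per block of visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Collapse whitespace and store the block if anything is left.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            # Count characters that sit inside <a> elements.
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html, min_words=10, max_link_density=0.3):
    """Return blocks that look like content rather than navigation."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text.split()) < min_words:
            continue  # too short: likely a menu item, footer or caption
        if link_chars / len(text) > max_link_density:
            continue  # mostly link text: likely navigation
        kept.append(text)
    return kept
```

Calling extract_paragraphs(page_html) would return a list of paragraph strings ready for indexing. This obviously misses a lot (short but meaningful paragraphs, adverts written as prose), so I would still like to hear about anything more robust.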
Thanks,
Fadzi
