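As far as I know, after crawling and parsing, Nutch writes parse data and parse text for each page. These live in the segment directories (parse_data and parse_text) rather than in the webdb or crawldb itself. Parse data holds the metadata extracted from the page (title, outlinks, meta information), and parse text holds the plain text with the HTML tags stripped out.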
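
To make that concrete, here is a rough sketch in Python of pulling the plain text per URL out of a segment dump, assuming the parse text has already been dumped to a plain-text file with Nutch's segment reader tool. The dump layout is an assumption on my part: each record starts with a `Recno::` line, has a `URL::` line, and the text follows a `ParseText::` marker. Check it against your own dump, since the layout may differ between versions. The function name `parse_text_by_url` is made up for illustration and is not part of Nutch.

```python
import re


def parse_text_by_url(dump):
    """Map each URL in a segment dump to its plain text.

    Assumed (not guaranteed) layout: records begin with "Recno:: <n>",
    contain a "URL:: <url>" line, and the text follows "ParseText::".
    """
    pages = {}
    # Split the dump into records on the assumed "Recno::" header lines.
    for record in re.split(r"^Recno::.*$", dump, flags=re.MULTILINE):
        url_match = re.search(r"^URL::\s*(\S+)", record, flags=re.MULTILINE)
        marker = "ParseText::"
        if url_match is None or marker not in record:
            continue  # skip leading junk or records without parse text
        # Everything after the marker is treated as the page's plain text.
        pages[url_match.group(1)] = record.split(marker, 1)[1].strip()
    return pages
```

You would read the dump file into a string and pass it in; the result is a plain dict from URL to text.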
Yes, what you are asking for can surely be done, though I am not sure of the exact steps. I think that if you go through the parse_text directory and read its contents, you will get what you are looking for. You can use the segread tool to read the parse text (I believe it is called readseg in newer Nutch releases).

Regards,
Ratnesh

Anton Beza wrote:
>
> Does Nutch have the ability to filter out HTML tags from a web page and
> return the raw text from that page?
>
> Thanks
> -Anton
>

--
View this message in context: http://www.nabble.com/Nutch-HTML-Tag-Filter-tf3455785.html#a9661526
Sent from the Nutch - User mailing list archive at Nabble.com.
