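As far as I know, after crawling Nutch generates parse data and parse text.
I believe these live inside each segment directory (as parse_data and
parse_text) rather than in the webdb or crawldb itself.
Parse data contains the metadata extracted from the page, I mean the
information taken from the HTML tags, and parse text contains the pure text.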
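
To make the split between "information from the tags" and "pure text" concrete,
here is a minimal sketch in Python. It is only an illustration of the idea and
not Nutch's own parser: the class TextAndTags and the function split_page are
hypothetical names I made up, and it uses only Python's standard html.parser
module.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Hypothetical helper: collects tag names and visible text separately."""

    # Text inside these elements is not visible page text, so it is skipped.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.tags = []        # rough analogue of the tag-derived information
        self.chunks = []      # rough analogue of the pure text
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def split_page(html):
    """Return (list of tag names, plain text) for an HTML string."""
    parser = TextAndTags()
    parser.feed(html)
    parser.close()
    return parser.tags, " ".join(parser.chunks)


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>Nutch</b> users</p></body></html>"
    tags, text = split_page(page)
    print(tags)
    print(text)
```

The text returned by split_page plays the role that parse text plays in Nutch,
so reading the parse text of a segment should give you the page without its
HTML tags.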

Yes, what you are asking for can surely be done, although I am not sure
exactly how it would be done.
I think that if you go through the parse text directory and read its
content, you will get the answer to your question.

You can use the segread tool to read the parse text (as far as I know, newer
Nutch versions call this tool readseg).

Regards,
Ratnesh

Anton Beza wrote:
> 
> Does Nutch have the ability to filter out HTML tags from a web page and
> return the raw text from that page?
> 
> Thanks
> -Anton
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Nutch-HTML-Tag-Filter-tf3455785.html#a9661526
Sent from the Nutch - User mailing list archive at Nabble.com.


