Hi Folks,

A very happy new year to all of you.

I am currently using Apache Nutch 1.16 and am successfully extracting the HTML
content for a given set of seed URLs. One of my requirements is to extract all
the image and video links from the HTML into a separate object. Since I already
have the HTML content, I could use a library like jsoup to parse it and extract
the img tags.

I was wondering whether there is a way to do this within Nutch itself. I am
assuming I would have to write a plugin that implements the HtmlParseFilter
interface and put my extraction logic there. Is my understanding correct? Any
sample code reference would be helpful as well.
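
To make the question concrete, here is a rough sketch of the extraction logic
I have in mind. As far as I can tell, HtmlParseFilter.filter(...) is handed the
parsed page as an org.w3c.dom.DocumentFragment, so the sketch walks a W3C DOM
tree rather than using jsoup. The class name MediaLinkSketch, its methods, and
the sample markup are all made up for illustration, and the DOM is built from a
well-formed XHTML string only so that the example runs on its own without Nutch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects image and video links from a DOM tree.
class MediaLinkSketch {

    /** Recursively walks the tree and appends src values to the matching list. */
    static void collect(Node node, List<String> images, List<String> videos) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Lower-case the name because HTML parsers may report upper-case tag names.
            String tag = el.getTagName().toLowerCase(Locale.ROOT);
            String src = el.getAttribute("src"); // empty string when the attribute is absent
            if (!src.isEmpty()) {
                if (tag.equals("img")) {
                    images.add(src);
                } else if (tag.equals("video")) {
                    videos.add(src);
                } else if (tag.equals("source") && isInsideVideo(el)) {
                    // <source> also appears under <picture> and <audio>; keep only video ones.
                    videos.add(src);
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, images, videos);
        }
    }

    /** True when the element's direct parent is a <video> element. */
    static boolean isInsideVideo(Element el) {
        Node parent = el.getParentNode();
        return parent != null
                && parent.getNodeType() == Node.ELEMENT_NODE
                && parent.getNodeName().toLowerCase(Locale.ROOT).equals("video");
    }

    /** Demo-only: builds a DOM from well-formed XHTML. In Nutch the DOM would be supplied. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<img src=\"http://example.com/a.png\"/>"
                + "<video src=\"http://example.com/v.mp4\">"
                + "<source src=\"http://example.com/v.webm\"/></video>"
                + "</body></html>";
        List<String> images = new ArrayList<>();
        List<String> videos = new ArrayList<>();
        collect(parse(page), images, videos);
        System.out.println("images: " + images);
        System.out.println("videos: " + videos);
    }
}
```

Inside a real plugin I would presumably call something like collect(doc, ...)
from filter(...) and then store the two lists in the parse metadata, but I have
not verified that part, and relative URLs would still need to be resolved
against the page's base URL.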

Thanks
Prateek
