Hi Prateek,

Please see my comments inline below.

On Thu, Jan 14, 2021 at 6:39 AM <[email protected]> wrote:

> One of the requirements I have is to extract all the image and video links
> from the html in a separate object. Since I have the html content, I can use
> a library like jsoup to parse the content and extract img tags.
> I was wondering if there is a way in nutch to do this?

The problem here is your requirement of "... in a separate object". Will this
separate object be a new record?

> I am assuming I will have to override HtmlParseFilter class and then add my
> extraction logic there. Is my understanding correct? Any sample code
> reference will be helpful as well.

I think you can simply add (parse-html OR parse-tika) AND parse-xsl to the
'plugin.includes' configuration property and then use the ordered
HTMLParseFilter configuration option 'htmlparsefilter.order', as described here:
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599

You can take a look at the parse-xsl plugin:
https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36

N.B. This patch is not yet merged into the Nutch master branch, so it is not
available in an official Nutch release. You would need to upgrade to the Nutch
1.18-SNAPSHOT master branch and then apply the patch from that pull request.

Any feedback would be greatly appreciated.

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
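
P.S. If you do end up writing your own HtmlParseFilter rather than using parse-xsl, the extraction logic itself could be a plain DOM walk. Below is a minimal sketch, assuming your filter is handed the parsed page as a W3C DOM node (in Nutch 1.x the filter method receives a DocumentFragment, as far as I recall). The class name MediaLinkCollector is made up for illustration and is not part of Nutch; wiring it into a plugin and storing the result in the parse metadata is left out. It uses only the JDK, and it compares tag names case-insensitively because the HTML parsers may upper-case element names.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not a Nutch class): collects img and video URLs from a DOM tree. */
public class MediaLinkCollector {
    public final List<String> images = new ArrayList<>();
    public final List<String> videos = new ArrayList<>();

    /** Walks the tree under root and returns the collected links in document order. */
    public static MediaLinkCollector collect(Node root) {
        MediaLinkCollector collector = new MediaLinkCollector();
        collector.walk(root, false);
        return collector;
    }

    private void walk(Node node, boolean insideVideo) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String name = el.getNodeName();
            // getAttribute returns an empty string when the attribute is absent.
            String src = el.getAttribute("src");
            if ("img".equalsIgnoreCase(name)) {
                if (!src.isEmpty()) images.add(src);
            } else if ("video".equalsIgnoreCase(name)) {
                insideVideo = true;
                if (!src.isEmpty()) videos.add(src);
            } else if ("source".equalsIgnoreCase(name) && insideVideo) {
                // <source> counts only when nested in <video>, not in <audio> or <picture>.
                if (!src.isEmpty()) videos.add(src);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), insideVideo);
        }
    }
}
```

Inside the filter you would call MediaLinkCollector.collect(doc) and then decide where the two lists should go, which brings us back to the question above of whether the "separate object" is a new record or just extra fields on the existing one. Note that the src values are returned as they appear in the page, so relative URLs would still need to be resolved against the page URL.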

