Hi Lewis,

Thanks for your reply. Unfortunately, I am not able to upgrade my current version to an unreleased one, so the suggestion to use parse-xsl won't help at this time. Is the only other option to override HtmlParseFilter and add a new plugin?
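
For concreteness, here is a rough sketch of the extraction logic I have in mind for such a filter. It uses only the JDK's org.w3c.dom classes and has not been tested against Nutch. The class name MediaLinkExtractor, its methods, and the choice of tags (img, video, source) are my own assumptions. As far as I understand, HtmlParseFilter.filter receives the parsed page as a DocumentFragment, which could be passed to extract() directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: the class and method names are illustrative, not part of Nutch.
class MediaLinkExtractor {

    // Walk the DOM depth-first and collect the src attribute of every
    // img, video and source element. Tag names are compared
    // case-insensitively because HTML parsers may report them in upper case.
    static List<String> extract(Node root) {
        List<String> links = new ArrayList<>();
        collect(root, links);
        return links;
    }

    private static void collect(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String tag = el.getNodeName();
            boolean isMedia = tag.equalsIgnoreCase("img")
                    || tag.equalsIgnoreCase("video")
                    || tag.equalsIgnoreCase("source");
            String src = el.getAttribute("src"); // empty string when absent
            if (isMedia && !src.isEmpty()) {
                links.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, links);
        }
    }

    // Demo only: build a DOM from a well-formed snippet with the JDK XML parser.
    // Inside a parse filter the DOM would come from Nutch instead.
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo snippet", e);
        }
    }

    public static void main(String[] args) {
        Node doc = parse("<html><body><img src='a.png'/><a href='x.html'>x</a>"
                + "<video src='b.mp4'/></body></html>");
        // Only the img and video sources are collected; the <a> link is ignored.
        System.out.println(extract(doc));
    }
}
```

Inside the filter, the returned links could then be added to the parse metadata under a custom key (for example "media_links", a name I made up) rather than to the outlinks. I believe the metadata is reachable through parseResult.get(content.getUrl()).getData().getParseMeta(), but please correct me if that is wrong.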

Also, regarding separate objects: what I meant is that if I store the image links as Outlinks, those links will also be stored in the DB (because all outlinks are stored for the next crawl when depth > 1). I don't want to store them in the crawldb; I just want to output them in some other object within the record. I hope this makes sense.

Regards
Prateek

On Thu, Jan 14, 2021 at 6:28 PM lewis john mcgibbney <[email protected]> wrote:

> Hi prateek,
> Please see my comment inline below
>
> On Thu, Jan 14, 2021 at 6:39 AM <[email protected]> wrote:
>
> > One of the requirements I have is to extract all the image and video
> > links from the html in a separate object. Since I have the html content,
> > I can use a library like jsoup to parse the content and extract img tags.
> > I was wondering if there is a way in nutch to do this?
>
> The problem here is your requirement of "... in a separate object". Will
> this separate object be a new record?
>
> > I am assuming I will have to override HtmlParseFilter class and then add
> > my extraction logic there. Is my understanding correct? Any sample code
> > reference will be helpful as well.
>
> I think you can simply add parse-html OR parse-tika AND parse-xsl to the
> 'plugin.includes' configuration property and then use the ordered
> HTMLParseFilter configuration option 'htmlparsefilter.order' as follows
> https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599
>
> You can take a look at the parse-xsl plugin
> https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36
>
> N.B. This patch is not yet merged into the Nutch master branch so it is not
> available in an official Nutch release. You would need to upgrade to Nutch
> 1.18-SNAPSHOT master branch and then apply the branch. Any feedback would
> be greatly appreciated.
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc

