Hi Lewis, Thanks for your suggestion.
I looked at the class that extracts outlinks and saw that "img" is already included there: https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90

So I am confused as to why I don't see any images in the outlinks. I have double-checked that the property parser.html.outlinks.ignore_tags is not set, so images should already be part of the outlinks. But when I run "bin/nutch readseg" to inspect the segment data, I don't see any images being captured. Any idea what I am missing?

If there is a way to get all images in the outlinks, then maybe I don't even need a plugin for that.

Regards
Prateek

On Wed, Jan 20, 2021 at 5:37 PM Lewis John McGibbney <[email protected]> wrote:

> Hi Prateek,
>
> On 2021/01/19 15:58:29, prateek <[email protected]> wrote:
> > Is the only other option is to
> > override HtmlParseFilter and add a new plugin?
>
> Yes I think it is.
>
> > Also regarding separate objects, what i meant is if i store the image links
> > in Outlink, then those links will also be stored in DB (because all outlink
> > are stored for next crawl of depth > 1). I don't want to store those in
> > crawldb and just output in some other object within the record. I hope this
> > makes sense
>
> I understand. Seeing as you cannot upgrade then yes I think you need to
> implement a new plugin to capture the outlinks as a new field in the
> NutchDocument. You should also look into using the
> 'parser.html.outlinks.ignore_tags' configuration setting. You can specify
> which tags are filtered.
>
> lewismc

