Hi Prateek,

On 2021/01/19 15:58:29, prateek <[email protected]> wrote:
> Is the only other option is to
> override HtmlParseFilter and add a new plugin?

Yes, I think it is.

>
> Also regarding separate objects, what i meant is if i store the image links
> in Outlink, then those links will also be stored in DB (because all outlink
> are stored for next crawl of depth > 1). I don't want to store those in
> crawldb and just output in some other object within the record. I hope this
> makes sense

I understand. Seeing as you cannot upgrade, then yes, I think you need to implement a new plugin that captures the outlinks as a new field in the NutchDocument.

You should also look into the 'parser.html.outlinks.ignore_tags' configuration setting. It lets you specify the HTML tags from which outlinks are not extracted, so links from those tags never reach the crawldb.

lewismc
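
P.S. In case it helps, here is a rough sketch of the DOM walk such a plugin might do to pull out the image links. It is plain JDK code, not Nutch code: the class name ImageLinkCollector and both method names are made up for illustration, and the input is assumed to be well-formed XHTML so the JDK parser can read it. In a real plugin I would expect the walk to run over the DocumentFragment that is handed to your HtmlParseFilter, with the results stored in the parse metadata rather than as Outlinks, but please check that against your Nutch version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
class ImageLinkCollector {

  // Recursively collects the src attribute of every <img> element below node.
  static void collect(Node node, List<String> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      String src = ((Element) node).getAttribute("src");
      if (!src.isEmpty()) {
        out.add(src);
      }
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collect(c, out);
    }
  }

  // Parses well-formed XHTML and returns the image links in document order.
  static List<String> imageLinks(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xhtml)));
    // Move the root into a DocumentFragment to mimic what a parse filter sees.
    DocumentFragment fragment = doc.createDocumentFragment();
    fragment.appendChild(doc.getDocumentElement());
    List<String> links = new ArrayList<>();
    collect(fragment, links);
    return links;
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><body><a href=\"page.html\">next</a>"
        + "<img src=\"a.png\"/><p><img src=\"b.jpg\"/></p></body></html>";
    System.out.println(imageLinks(page)); // prints [a.png, b.jpg]
  }
}
```

The anchor's href is deliberately ignored, so only the image links end up in the separate list and the normal outlink handling is left alone.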

