Re: Extract all image and video links from a web page

Lewis John McGibbney Wed, 20 Jan 2021 09:37:13 -0800

Hi Prateek,

On 2021/01/19 15:58:29, prateek <[email protected]> wrote: 
> Is the only other option to
> override HtmlParseFilter and add a new plugin?


Yes, I think it is.

> 
> Also, regarding separate objects, what I meant is that if I store the image
> links in Outlink, then those links will also be stored in the DB (because all
> outlinks are stored for the next crawl of depth > 1). I don't want to store
> those in the crawldb, just output them in some other object within the record.
> I hope this makes sense.

I understand. Since you cannot upgrade, then yes, I think you need to
implement a new plugin that captures those links as a new field in the
NutchDocument instead of as outlinks. You should also look into the
'parser.html.outlinks.ignore_tags' configuration setting. It takes a
comma-separated list of HTML tags (for example img) from which outlinks
should not be extracted, so links from those tags do not end up in the crawldb.
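
As a rough sketch of what the core of such a plugin could look like: in Nutch
1.x, as far as I recall, an HtmlParseFilter is handed the parsed page as a DOM
DocumentFragment, so the filter can walk that tree and pick out the src
attributes itself. Everything named below (MediaLinkSketch, collect, parse) is
hypothetical, and the code uses only the JDK so that it runs outside Nutch. It
also assumes that only img, video, and source-inside-video elements count as
media. In a real plugin you would call something like collect() from your
filter and put the result into the parse metadata rather than into Outlinks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects media links from a DOM tree.
class MediaLinkSketch {

    /** Returns the src values found under root, grouped as "image" and "video". */
    static Map<String, List<String>> collect(Node root) {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("image", new ArrayList<>());
        links.put("video", new ArrayList<>());
        walk(root, links);
        return links;
    }

    private static void walk(Node node, Map<String, List<String>> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers may report tag names in upper case, so normalise them.
            String tag = el.getNodeName().toLowerCase(Locale.ROOT);
            String src = el.getAttribute("src"); // empty string when absent
            if (!src.isEmpty()) {
                if (tag.equals("img")) {
                    links.get("image").add(src);
                } else if (tag.equals("video") || isSourceInsideVideo(el, tag)) {
                    links.get("video").add(src);
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), links);
        }
    }

    private static boolean isSourceInsideVideo(Element el, String tag) {
        Node parent = el.getParentNode();
        return tag.equals("source") && parent != null
                && parent.getNodeName().equalsIgnoreCase("video");
    }

    /** Demo-only parser for well-formed markup; Nutch would supply the DOM itself. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><img src=\"a.png\"/><a href=\"x.html\">x</a>"
                + "<video src=\"v.mp4\"><source src=\"v.webm\"/></video></body></html>";
        System.out.println(collect(parse(page)));
    }
}
```

Keeping the links in their own map, separate from the Outlink array, is what
stops them from being written to the crawldb.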

lewismc
