Re: Extract all image and video links from a web page

prateek Tue, 26 Jan 2021 06:14:39 -0800

Hi Lewis,

Thanks for your suggestion.

I looked at the class that collects outlinks and saw that "img" is already
one of the tags it handles:
https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90
So I am confused as to why I don't see any images in the outlinks.
I have double-checked that the property parser.html.outlinks.ignore_tags is
not set either, so images should already be part of the outlinks. But
when I run "bin/nutch readseg" to inspect the segment data, I don't see any
images being captured. Any idea what I am missing?

If there is a way I can get all images in outlinks, then maybe I don't even
need a plugin for that.
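
For reference, if a plugin does turn out to be necessary, the extraction step
itself is small. Below is a minimal sketch of it in plain Java, assuming the
page is available as a DOM tree. It does not use any Nutch classes: the class
name MediaLinkSketch, the method names, and the choice of tags ("img",
"video", "source") are my own assumptions for illustration, and the sample
input is well-formed XHTML so that the JDK's XML parser can read it. In a real
plugin the same walk would presumably run over the DOM that the HTML parser
hands to the parse filter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects the "src" attribute of
// media elements from a DOM tree in document order.
public class MediaLinkSketch {

    // Assumption: these are the tags of interest; adjust as needed.
    private static final Set<String> MEDIA_TAGS = Set.of("img", "video", "source");

    // Depth-first walk that appends every media "src" value to out.
    static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String tag = el.getTagName().toLowerCase(Locale.ROOT);
            if (MEDIA_TAGS.contains(tag) && el.hasAttribute("src")) {
                out.add(el.getAttribute("src"));
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Parses a well-formed XHTML string and returns the media links found.
    static List<String> extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        List<String> out = new ArrayList<>();
        collect(doc, out);
        return out;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<img src=\"a.png\"/>"
                + "<a href=\"x.html\">x</a>"
                + "<video src=\"b.mp4\"/>"
                + "</body></html>";
        // Prints the collected media links, one per line.
        for (String link : extract(page)) {
            System.out.println(link);
        }
    }
}
```

The links collected this way could then be written to a separate field of the
record instead of being added as outlinks, which would keep them out of the
crawldb.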

Regards
Prateek

On Wed, Jan 20, 2021 at 5:37 PM Lewis John McGibbney <[email protected]>
wrote:

> Hi Prateek,
>
> On 2021/01/19 15:58:29, prateek <[email protected]> wrote:
> > Is the only other option to
> > override HtmlParseFilter and add a new plugin?
>
> Yes I think it is.
>
> >
> > Also, regarding separate objects, what I meant is that if I store the
> > image links in Outlink, then those links will also be stored in the DB
> > (because all outlinks are stored for the next crawl at depth > 1). I
> > don't want to store those in crawldb; I just want to output them in some
> > other object within the record. I hope this makes sense.
>
> I understand. Seeing as you cannot upgrade then yes I think you need to
> implement a new plugin to capture the outlinks as a new field in the
> NutchDocument. You should also look into using the
> 'parser.html.outlinks.ignore_tags' configuration setting. You can specify
> which tags are filtered.
>
> lewismc
>
