Re: Extract all image and video links from a web page

Sebastian Nagel Wed, 27 Jan 2021 00:29:08 -0800

Hi Prateek,

Are there any URL filters configured that filter out image links?


You can verify this using the URL filter checker:

 echo "https://example.com/image.jpg" \
   | bin/nutch filterchecker -stdin

The default rules in conf/regex-urlfilter.txt exclude common
image suffixes. Note that there can be more URL filters activated
in the property plugin.includes.
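
As a rough illustration of how such a suffix rule behaves, the sketch below tests URLs against an exclusion pattern using plain grep. The pattern and the check_url helper are assumptions made up for this example, not the actual contents of conf/regex-urlfilter.txt and not part of Nutch, so check your own copy of the file for the real rule.

```shell
#!/bin/sh
# Illustrative suffix pattern (an assumption modelled on the kind of rule
# found in conf/regex-urlfilter.txt; the real rule may list other suffixes).
SUFFIX_PATTERN='\.(gif|jpg|jpeg|png|bmp|ico)$'

# Hypothetical helper: prints "rejected" if the URL ends in one of the
# suffixes above (case-insensitive), otherwise "accepted".
check_url() {
  if printf '%s\n' "$1" | grep -Eiq "$SUFFIX_PATTERN"; then
    echo "rejected"
  else
    echo "accepted"
  fi
}

check_url "https://example.com/image.jpg"
check_url "https://example.com/page.html"
```

If a rule like this turns out to be what drops the image links, something like grep -n 'jpg' conf/regex-urlfilter.txt should help locate it, and removing the image suffixes from that rule would be the next thing to try.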

Best,
Sebastian

On 1/26/21 3:14 PM, prateek wrote:

Hi Lewis,

Thanks for your suggestion.

I looked at the class fetching outlinks and saw that "img" is already part
of that -
https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90.
So I am confused as to why I don't see any images in outlinks.
I have double checked that the property parser.html.outlinks.ignore_tags is
also not set. So ideally images should be part of outlinks already. But
when I run "bin/nutch readseg" to see the segments data, I don't see any
images being captured. Any idea what I am missing?

If there is a way I can get all images in outlinks, then maybe I don't even
need a plugin for that.

Regards
Prateek

On Wed, Jan 20, 2021 at 5:37 PM Lewis John McGibbney <[email protected]>
wrote:

Hi Prateek,

On 2021/01/19 15:58:29, prateek <[email protected]> wrote:

Is the only other option to
override HtmlParseFilter and add a new plugin?


Yes I think it is.


Also, regarding separate objects, what I meant is that if I store the image
links in Outlink, then those links will also be stored in the DB (because all
outlinks are stored for the next crawl at depth > 1). I don't want to store
those in the crawldb; I just want to output them in some other object within
the record. I hope this makes sense.


I understand. Seeing as you cannot upgrade, then yes, I think you need to
implement a new plugin to capture the outlinks as a new field in the
NutchDocument. You should also look into using the
'parser.html.outlinks.ignore_tags' configuration setting, which lets you
specify which tags are ignored when outlinks are extracted.

lewismc
