Hi Prateek,
Are there any URL filters that filter out image links?
You can verify this using the URL filter checker:
echo "https://example.com/image.jpg" \
| bin/nutch filterchecker -stdin
The default rules in conf/regex-urlfilter.txt exclude common
image suffixes. Note that additional URL filters may be activated
via the property plugin.includes.
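
To illustrate why an image URL can vanish at this stage, here is a minimal
sketch of first-match-wins filtering in the style of regex-urlfilter.txt.
The class name, the rule list and the suffixes are assumptions made up for
the example; the authoritative rules are the ones in your own conf file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch only: mimics "+regex" / "-regex" rules where the first match decides. */
class SuffixFilterSketch {

    /** One rule: accept ('+') or reject ('-') URLs in which the pattern is found. */
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Hypothetical rule list, not a copy of the shipped file.
    static final List<Rule> RULES = Arrays.asList(
        new Rule(false, "(?i)\\.(gif|jpg|jpeg|png|bmp|ico)$"), // reject image suffixes
        new Rule(true, "."));                                  // accept anything else

    /** Returns the verdict of the first matching rule; a URL matching no rule is rejected. */
    static boolean isAccepted(String url) {
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("https://example.com/image.jpg"));  // prints false
        System.out.println(isAccepted("https://example.com/index.html")); // prints true
    }
}
```

If a rule of this kind is active, image links are dropped before they show
up as outlinks, even though the parser extracted them.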
Best,
Sebastian
On 1/26/21 3:14 PM, prateek wrote:
Hi Lewis,
Thanks for your suggestion.
I looked at the class fetching outlinks and saw that "img" is already part
of that -
https://github.com/apache/nutch/blob/680df6ba1dc68ad5ede5fca743304593d4d5b0a3/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java#L90.
So I am confused as to why I don't see any images in outlinks.
I have double checked that the property parser.html.outlinks.ignore_tags is
also not set. So ideally images should be part of outlinks already. But
when I run "bin/nutch readseg" to see the segments data, I don't see any
images being captured. Any idea what I am missing?
If there is a way I can get all images in outlinks, then maybe I don't even
need a plugin for that.
Regards
Prateek
On Wed, Jan 20, 2021 at 5:37 PM Lewis John McGibbney wrote:
Hi Prateek,
On 2021/01/19 15:58:29, prateek wrote:
Is the only other option to
override HtmlParseFilter and add a new plugin?
Yes I think it is.
Also, regarding separate objects: what I meant is that if I store the image
links in Outlink, those links will also be stored in the DB (because all
outlinks are stored for the next crawl at depth > 1). I don't want to store
them in the crawldb, just output them in some other object within the record.
I hope this makes sense.
I understand. Seeing as you cannot upgrade, then yes, I think you need to
implement a new plugin to capture the outlinks as a new field in the
NutchDocument. You should also look into using the
'parser.html.outlinks.ignore_tags' configuration setting. It lets you specify
the tags from which outlinks are not extracted.
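
As a rough illustration of the core of such a plugin, the sketch below walks
a DOM tree and collects the src attribute of every img element, using only
the JDK. The class and method names are hypothetical, and the wiring into
HtmlParseFilter and into a separate field (rather than Outlink) is left out
on purpose, since it depends on the Nutch version in use.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Sketch only: gathers image links from a DOM without touching the outlinks. */
class ImageLinkCollector {

    /** Returns the non-empty src values of all img elements below the given node. */
    static List<String> collectImageLinks(Node root) {
        List<String> links = new ArrayList<>();
        collect(root, links);
        return links;
    }

    private static void collect(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "img".equalsIgnoreCase(node.getNodeName())) {
            String src = ((Element) node).getAttribute("src"); // "" when absent
            if (!src.isEmpty()) {
                links.add(src);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), links);
        }
    }

    /** Test helper: parses a well-formed XHTML string (real pages need an HTML parser). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xhtml, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>hi</p><img src=\"a.png\"/>"
            + "<div><IMG src=\"b.jpg\"/></div><img/></body></html>");
        System.out.println(collectImageLinks(doc)); // prints [a.png, b.jpg]
    }
}
```

The collected values are returned as found, so relative links would still
need to be resolved against the page URL before being stored.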
lewismc