Hi Lewis,

Thanks for your reply. Unfortunately, I don't have the liberty to update my
current version to an unreleased version, so the suggestion to use
parse-xsl won't be useful at this time. Is the only other option to
override HtmlParseFilter and add a new plugin?

Also, regarding separate objects, what I meant is that if I store the image
links as Outlinks, those links will also be stored in the DB (because all
outlinks are stored for the next crawl at depth > 1). I don't want to store
them in the crawldb; I just want to output them in some other object within
the record. I hope this makes sense.
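
To make the idea concrete, here is a minimal sketch of the extraction step
only, written against the plain org.w3c.dom API so it runs without Nutch on
the classpath. The class name MediaLinkCollector and the choice of tags
(img, video, source) are assumptions for illustration. In Nutch 1.x the
HtmlParseFilter.filter(...) method receives the parsed page as a
DocumentFragment, so a custom plugin could presumably call something like
this from there and add the results to the parse metadata (under a
hypothetical key such as "media.links") instead of adding them as Outlinks.
That wiring is not shown and has not been tested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch: collects the "src" attribute of
// media elements found anywhere below a DOM node.
public class MediaLinkCollector {

    // Assumption: these are the tags of interest; adjust as needed.
    private static final Set<String> MEDIA_TAGS =
        new HashSet<>(Arrays.asList("img", "video", "source"));

    // Returns the src values in document order (duplicates are kept).
    public static List<String> collect(Node root) {
        List<String> links = new ArrayList<>();
        walk(root, links);
        return links;
    }

    private static void walk(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // HTML parsers differ on tag-name case, so compare in lower case.
            String tag = el.getTagName().toLowerCase(Locale.ROOT);
            if (MEDIA_TAGS.contains(tag)) {
                // getAttribute returns "" when the attribute is absent.
                String src = el.getAttribute("src").trim();
                if (!src.isEmpty()) {
                    links.add(src);
                }
            }
        }
        // Depth-first traversal of all children.
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            walk(child, links);
        }
    }
}
```

The values are returned exactly as they appear in the page, so relative
URLs would still need to be resolved against the page's base URL before
being written out.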

Regards
Prateek

On Thu, Jan 14, 2021 at 6:28 PM lewis john mcgibbney <[email protected]>
wrote:

> Hi prateek,
> Please see my comment inline below
>
> On Thu, Jan 14, 2021 at 6:39 AM <[email protected]> wrote:
>
> >
> > One of the requirements I have is to extract all
> > the image and video links from the html in a separate object. Since I have
> > the html content, I can use a library like jsoup to parse the content and
> > extract img tags.
> > I was wondering if there is a way in nutch to do this?
> >
>
> The problem here is your requirement of "... in a separate object". Will
> this separate object be a new record?
>
>
> > I am assuming I will have to override HtmlParseFilter class and then add my
> > extraction logic there. Is my understanding correct? Any sample code
> > reference will be helpful as well.
> >
> >
> I think you can simply add parse-html OR parse-tika AND parse-xsl to the
> 'plugin.includes' configuration property and then use the ordered
> HTMLParseFilter configuration option 'htmlparsefilter.order' as follows
> https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599
>
> You can take a look at the parse-xsl plugin
>
> https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36
>
> N.B. This patch is not yet merged into the Nutch master branch so it is not
> available in an official Nutch release. You would need to upgrade to Nutch
> 1.18-SNAPSHOT master branch and then apply the branch. Any feedback would
> be greatly appreciated.
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>
