Re: Adding html field to NutchDocument

Sebastian Nagel Tue, 01 Jun 2021 07:34:48 -0700

Hi Kieran,

thanks for the feedback!


> I didn't realise that it is intended for users to edit the bin/crawl file.

Maybe we should add a comment to encourage users to adapt the shell scripts
to their needs.  Almost 10 years ago, the Java "Crawl" class was replaced
by the scripts because a shell script is easy to modify and deploy, see
  https://issues.apache.org/jira/browse/NUTCH-1087

Best,
Sebastian


On 6/1/21 2:37 PM, Kieran Munday wrote:

Hi Sebastian,

Thank you for your response. It was a great help.
I didn't realise that it is intended for users to edit the bin/crawl file.
Although looking at it now it's clear.

This makes it easier for me to access the HTML content within my plugin.
Thanks again.

On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel
<[email protected]> wrote:

Hi Kieran,

see the command-line options

          -addBinaryContent
            index raw/binary content in field `binaryContent`
          -base64
            use Base64 encoding for binary content

of the Nutch index job [1]. Note that the content may indeed be
binary, e.g. for PDF documents but also for HTML pages which use
a different encoding than UTF-8.

Best,
Sebastian

[1]
https://wiki.apache.org/confluence/pages/viewpage.action?pageId=122916842
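
To illustrate what the `binaryContent` field looks like on the consuming side when both options are set, here is a minimal Python sketch that decodes a Base64 value and derives the two fields asked about in this thread (charset and image count). It is not a Nutch plugin and does not run at crawl time. The function names `sniff_charset` and `derive_fields` and the output keys are hypothetical, and the charset detection is a naive regex over the first bytes, assumed to be good enough for illustration only.

```python
import base64
import re
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Counts <img> start tags (self-closing <img/> tags are counted too)."""

    def __init__(self):
        super().__init__()
        self.images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Naive guess: first 'charset=...' found in the first 1024 bytes."""
    head = raw[:1024].decode("ascii", errors="replace")
    match = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else default


def derive_fields(binary_content_b64: str) -> dict:
    """Decode a Base64 `binaryContent` value and derive charset and image count."""
    raw = base64.b64decode(binary_content_b64)
    charset = sniff_charset(raw)
    try:
        html = raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 rather than fail.
        html = raw.decode("utf-8", errors="replace")
    counter = ImageCounter()
    counter.feed(html)
    counter.close()
    return {"charset": charset, "image_count": counter.images}


if __name__ == "__main__":
    page = ('<html><head><meta charset="ISO-8859-1"></head>'
            '<body><img src="a.png"><img src="b.png"/>caf\xe9</body></html>')
    encoded = base64.b64encode(page.encode("iso-8859-1")).decode("ascii")
    print(derive_fields(encoded))
```

The raw bytes are decoded only after a charset has been guessed, which is why the non-UTF-8 case mentioned above matters: treating the content as UTF-8 text up front would garble such pages.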


On 5/28/21 5:28 PM, Kieran Munday wrote:

Hi users@,

I am new to Nutch (v.1.17) and my current project requires the indexing of
the html of crawled pages. It also requires fields that can be derived from
the raw html such as image count, and charset.

I have looked on StackOverflow for how to achieve this and most people from
my understanding seem to be recommending processing the segments to extract
the html and modify the documents post-crawl. This doesn't fit my use case
as I need to calculate these fields at crawl time before they are indexed
into Elasticsearch.

The other recommendations I have seen mention creating a plugin to override
the parse-html plugin. However, I have found rather limited documentation
on how to do this correctly and am not sure on how to return from the
plugin in a way that the field propagates into the NutchDocument which will
be processed in the Indexers' write method.

Do any of you have any advice or links to documentation that explains how
to modify what gets set in the NutchDocument?

Thank you in advance
