Hi Kieran,
thanks for the feedback!
> I didn't realise that it is intended for users to edit the bin/crawl file.
Maybe we should add a comment to encourage users to adapt the shell scripts
to their needs. Almost 10 years ago, the Java "Crawl" class was replaced
by the scripts because a shell script is easy to modify and deploy, see
https://issues.apache.org/jira/browse/NUTCH-1087
Best,
Sebastian
On 6/1/21 2:37 PM, Kieran Munday wrote:
Hi Sebastian,
Thank you for your response. It was a great help.
I didn't realise that it is intended for users to edit the bin/crawl file.
Although, looking at it now, it's clear.
This makes it easier for me to access the html content within my plugin,
thanks again
On Fri, May 28, 2021 at 8:36 PM Sebastian Nagel
<[email protected]> wrote:
Hi Kieran,
see the command-line options
  -addBinaryContent
      index raw/binary content in field `binaryContent`
  -base64
      use Base64 encoding for binary content
of the Nutch index job [1]. Note that the content may indeed be
binary, e.g. for PDF documents but also for HTML pages which use
an encoding other than UTF-8.
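
For illustration only, here is a minimal sketch of how a consumer of the index could turn a Base64-encoded `binaryContent` value back into text. It is not part of Nutch: the class name `BinaryContentDecoder` and the method `decode` are hypothetical, and it assumes standard (not URL-safe) Base64 and that the page's charset is already known from elsewhere, falling back to UTF-8 otherwise.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not a Nutch class.
class BinaryContentDecoder {

    /**
     * Decodes a Base64 string and interprets the bytes with the named charset.
     * Falls back to UTF-8 when the name is null, illegal, or unsupported.
     */
    static String decode(String base64, String charsetName) {
        byte[] raw = Base64.getDecoder().decode(base64);
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the UTF-8 default.
            }
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Simulate a page that was stored as ISO-8859-1 bytes, then Base64-encoded.
        String html = "<p>caf\u00e9</p>";
        String encoded = Base64.getEncoder()
                .encodeToString(html.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(decode(encoded, "ISO-8859-1"));
    }
}
```

Decoding the same bytes as UTF-8 would garble the accented character, which is why the raw content is best treated as bytes until the charset is known.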
Best,
Sebastian
[1] https://wiki.apache.org/confluence/pages/viewpage.action?pageId=122916842
On 5/28/21 5:28 PM, Kieran Munday wrote:
Hi users@,
I am new to Nutch (v1.17) and my current project requires indexing the
HTML of crawled pages. It also requires fields that can be derived from
the raw HTML, such as image count and charset.
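
To make the two derived fields concrete, here is a rough sketch of how they could be computed from a raw HTML string. It uses only the Java standard library. The class name `RawHtmlStats` and its methods are hypothetical and not part of Nutch, and the regex approach is an assumption chosen for brevity: it also counts `<img` tags inside HTML comments and only sees a charset declared in a `<meta>` tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class.
class RawHtmlStats {

    // Matches the start of an <img> tag, case-insensitively; \b avoids e.g. <image>.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b", Pattern.CASE_INSENSITIVE);

    // Matches charset=... inside a <meta> tag, covering both
    // <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta\\b[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
                    Pattern.CASE_INSENSITIVE);

    /** Counts occurrences of <img ...> tags in the raw HTML. */
    static int countImages(String html) {
        Matcher m = IMG_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Returns the charset declared in the first matching meta tag, or null if none. */
    static String detectCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta charset=\"utf-8\"></head>"
                + "<body><img src=\"a.png\"><IMG src='b.png'/></body></html>";
        System.out.println(countImages(html));
        System.out.println(detectCharset(html));
    }
}
```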
I have looked on StackOverflow for how to achieve this, and most people,
as far as I understand, recommend processing the segments to extract the
HTML and modifying the documents post-crawl. This doesn't fit my use case,
as I need to calculate these fields at crawl time, before they are indexed
into Elasticsearch.
The other recommendations I have seen mention creating a plugin to override
the parse-html plugin. However, I have found rather limited documentation
on how to do this correctly, and I am not sure how to return from the
plugin in a way that the field propagates into the NutchDocument which will
be processed in the Indexers' write method.
Do any of you have any advice or links to documentation that explains how
to modify what gets set in the NutchDocument?
Thank you in advance