Re: CloudSearch Index Writer

Fritsch, Michael Thu, 05 Sep 2024 04:11:15 -0700

Thanks Markus,

I know that the title field has content, because it is indexed in 
CloudSearch and I can see it there.
The raw field in CloudSearch has the same settings as the title field (and as 
the content field, which I tried before), so it is searchable and returned in 
results. I do not see a "Stored" setting in the indexing scheme of CloudSearch.


I tried the indexchecker tool and got results for the tstamp, digest, host, id, 
title, url and content fields, but nothing for the "raw" field.
I do not know when this "raw" field should be created: during the crawl, 
during indexing, or "on the fly" when everything is written to CloudSearch?

I wonder whether the copy functionality is broken in Nutch 1.19, or at least 
in the CloudSearch index writer.
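
One thing that may be worth checking is the mapping syntax itself. As far as I 
can tell, the index-writers.xml.template shipped with Nutch 1.x nests <field> 
elements inside <copy>, with "source" and "dest" attributes, rather than 
putting "source" and "target" attributes on <copy> directly. A minimal sketch 
under that assumption (not verified against the CloudSearch writer in 1.19; 
the field names are the ones from my setup):

```xml
<mapping>
  <copy>
    <!-- Assumed template syntax: copy the value of "title" into "raw" -->
    <field source="title" dest="raw"/>
  </copy>
  <rename />
  <remove />
</mapping>
```

If that is the expected form, indexchecker should list a "raw" field next to 
"title" once the mapping is picked up.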

Best regards,
Michael
________________________________
From: Markus Jelsma <[email protected]>
Sent: Thursday, September 5, 2024 11:23
To: [email protected] 
<[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: CloudSearch Index Writer

Hello Michael,

That is impossible to say. Maybe the original data had no value for the
title field (the source of raw), or maybe the raw field in CloudSearch is
configured to be indexed only, not stored.

What you can do is use the Nutch indexchecker tool (bin/nutch indexchecker
<URL>); it prints the exact fields that Nutch would index to CloudSearch.

Markus

On Wed, 4 Sep 2024 at 17:54, Fritsch, Michael wrote
<[email protected]>:

> Hello,
> I use Nutch 1.19 to crawl my website and to index the data into AWS
> CloudSearch.
> For this, I use the CloudSearch Index writer.
> Everything works fine.
> Now I want to copy the content of the "content" field into a different
> field in CloudSearch.
> I've created this field in CloudSearch with the name "raw" and the same
> settings (except for the analysis scheme) as the "content" field.
> In the index-writers.xml configuration file, I used the following
> configuration in order to copy the content:
>
> <writer id="indexer_cloud_search_1" 
> class="org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter">
>   <parameters>
>     <param name="endpoint" value="MyEndpointAddress"/>
>     <param name="region" value="eu-west-1"/>
>     <param name="batch.dump" value="false"/>
>     <param name="batch.maxSize" value="-1"/>
>   </parameters>
>   <mapping>
>     <copy source="title" target="raw"/>
>     <rename />
>     <remove />
>   </mapping>
> </writer>
>
> Everything runs without errors, that is, the standard content is indexed
> into CloudSearch, but I do not see any content in the "raw" field.
> Does anyone have an idea why this happens?
>
> Best regards,
> Michael
>
>
> Dr. Michael Fritsch
> Technical Editor
>
> E-Mail: [email protected]
>
> Phone: +49 (0) 40 325 587 0
> *www.coremedia.com <https://www.coremedia.com/>*
>
> CoreMedia GmbH
>
> Rödingsmarkt 9, 20459 Hamburg, Germany
>
> Managing Director: Sören Stamer
>
> Commercial Register: Amtsgericht Hamburg, HRB 162480
