Thanks, Markus. I know that there is content in the title field, because it is indexed in CloudSearch and I can see it there. The raw field in CloudSearch has the same settings as the title field (and as the content field, which I tried before), so it is searchable and will be returned. I do not see a "Stored" setting in the indexing options of CloudSearch.
I tried the indexchecker tool and got results for the tstamp, digest, host, id, title, url and content fields, but nothing for the "raw" field. I do not know when this "raw" field is supposed to be created: during the crawl, during indexing, or "on the fly" when everything is written to CloudSearch? I wonder whether the copy functionality does not work in Nutch 1.19, or at least not in the CloudSearch index writer.

Best regards,
Michael

________________________________
From: Markus Jelsma <[email protected]>
Sent: Thursday, September 5, 2024 11:23
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: CloudSearch Index Writer

Hello Michael,

That is impossible to say. Maybe the original data had no value for the title>raw fields, or maybe the raw field in CloudSearch is not configured to be stored, only indexed.

What you can do is use the Nutch indexchecker <URL> tool; it will print the exact fields that Nutch would index to CloudSearch.

Markus

On Wed, Sep 4, 2024 at 17:54, Fritsch, Michael <[email protected]> wrote:

> Hello,
>
> I use Nutch 1.19 to crawl my website and to index the data into AWS
> CloudSearch. For this, I use the CloudSearch index writer.
> Everything works fine.
>
> Now I want to copy the content of the "content" field into a different
> field in CloudSearch.
> I've created this field in CloudSearch with the name "raw" and the same
> settings (except for the analysis scheme) as the "content" field.
> In the index-writers.xml configuration file, I used the following
> configuration in order to copy the content:
>
> <writer id="indexer_cloud_search_1"
>         class="org.apache.nutch.indexwriter.cloudsearch.CloudSearchIndexWriter">
>   <parameters>
>     <param name="endpoint" value="MyEndpointAddress"/>
>     <param name="region" value="eu-west-1"/>
>     <param name="batch.dump" value="false"/>
>     <param name="batch.maxSize" value="-1"/>
>   </parameters>
>   <mapping>
>     <copy source="title" target="raw"/>
>     <rename />
>     <remove />
>   </mapping>
> </writer>
>
> Everything runs without errors, that is, the standard content is
> indexed into CloudSearch, but I do not see any content in the "raw" field.
> Does anyone have an idea why this happens?
>
> Best regards,
> Michael
>
> Dr. Michael Fritsch
> Technical Editor
> CoreMedia GmbH
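
For comparison, the index-writers.xml template that ships with Nutch 1.x appears to express copy rules as nested <field> elements with "source" and "dest" attributes, rather than as attributes on <copy> itself. The fragment below is only a sketch under that assumption and has not been checked against 1.19 or against the CloudSearch writer specifically; the field names "title" and "raw" are taken from the quoted configuration, and the layout should be verified against the index-writers.xml.template in the local installation.

```xml
<!-- Sketch only: layout assumed from the stock Nutch 1.x template; verify locally. -->
<mapping>
  <copy>
    <!-- copy rule: source field "title", destination field "raw" -->
    <field source="title" dest="raw"/>
  </copy>
  <rename />
  <remove />
</mapping>
```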

