Yes, this happens if you use a recent Solr with the managed schema: it apparently 
treats text fields as string types. There is a ticket open to change that to 
TextField, though.
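
A minimal sketch of what the tokenized definition could look like once that is
fixed, or if you switch the field yourself (this is just the stock text_general
pattern, not the patch from the ticket; exact attributes vary across Solr
versions):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- StandardTokenizer splits text into words, so no single term can grow to document size -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="content" type="text_general" indexed="true" stored="true"/>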
Markus

-----Original message-----
> From:Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Tuesday 21st June 2016 23:15
> To: user@nutch.apache.org
> Subject: Re: immense term,Correcting analyzer
> 
> Hi,
> 
> you are right, it looks like the field "content" is indexed as one single term
> and is not split ("tokenized") into words.  The best way would be
> to use the schema.xml shipped with Nutch ($NUTCH_HOME/conf/schema.xml),
> see https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
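> For reference, the "content" field in that schema is a tokenized text type,
> roughly along these lines (a sketch from memory, check your copy of
> schema.xml for the exact analyzer chain):
> 
>   <field name="content" type="text" stored="false" indexed="true"/>
>   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>     <analyzer>
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>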
> 
> > BTW, I am using Nutch 1.11 and Solr 6.0.0.
> Nutch 1.11 requires Solr 4.10.2; other versions may work (or may not!)
> 
> Sebastian
> 
> 
> On 06/21/2016 08:04 PM, shakiba davari wrote:
> > Hi guys, I'm trying to index my nutch crawled data by running:
> > 
> > bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate"
> > crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
> > 
> > At first it was working totally OK: I indexed my data, sent a few queries,
> > and received good results. But then I ran the crawl again so that it crawls
> > to a greater depth and fetches more pages. The last time, the crawler's
> > status showed 1051 unfetched and 151 fetched pages in my db, and now when I
> > run the nutch index command I get "java.io.IOException: Job failed!"
> > here is my log:
> > 
> > java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> > Exception writing document id
> > http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index;
> > possible analysis error: Document contains at least one immense term in
> > field="content" (whose UTF8 encoding is longer than the max length 32766),
> > all of which were skipped.  Please correct the analyzer to not produce such
> > terms.  The prefix of the first immense term is: '[70, 114, 97, 110, 107,
> > 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32,
> > 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at
> > most 32766 in length; got 40063. Perhaps the document has an indexed string
> > field (solr.StrField) which is too large
> > at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> > Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> > Exception writing document id
> > http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index;
> > possible analysis error: Document contains at least one immense term in
> > field="content" (whose UTF8 encoding is longer than the max length 32766),
> > all of which were skipped.  Please correct the analyzer to not produce such
> > terms.  The prefix of the first immense term is: '[70, 114, 97, 110, 107,
> > 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32,
> > 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at
> > most 32766 in length; got 40063. Perhaps the document has an indexed string
> > field (solr.StrField) which is too large
> > at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
> > at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
> > at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
> > at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
> > at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
> > at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
> > at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
> > at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
> > at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> > at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> > 
> > 
> > I realize that the mentioned page does contain a really long term, so in
> > schema.xml and managed-schema I changed the type of "id", "content", and
> > "text" from "strings" to "text_general":
> > <field name="id" type="text_general"/>
> > but it didn't solve the problem.
> > I'm no expert, so I'm not sure how to correct the analyzer without screwing
> > up something else. I've read somewhere that I can:
> > 1. use (in the index analyzer) a LengthFilterFactory to filter out tokens
> > that don't fall within a requested length range (sketched below), or
> > 2. use (in the index analyzer) a TruncateTokenFilterFactory to cap the max
> > length of indexed tokens.
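> > For option 1, what I have in mind is roughly this (untested; max="255" is
> > an arbitrary limit I picked, and only the index-time analyzer gets the
> > extra filter):
> > 
> >   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> >     <analyzer type="index">
> >       <tokenizer class="solr.StandardTokenizerFactory"/>
> >       <!-- drop tokens shorter than 1 or longer than 255 characters -->
> >       <filter class="solr.LengthFilterFactory" min="1" max="255"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >     </analyzer>
> >     <analyzer type="query">
> >       <tokenizer class="solr.StandardTokenizerFactory"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >     </analyzer>
> >   </fieldType>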
> > 
> > But there are so many analyzers in the schema. Should I change the analyzer
> > defined for <fieldType name="text_general"...>? If yes, since "content" and
> > the other fields are of type text_general, isn't that going to affect all
> > of them too?
> > 
> > I would really appreciate any help.
> > BTW, I am using Nutch 1.11 and Solr 6.0.0.
> > 
> > Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>
> > 
> 
> 
