Yes, this happens if you use a recent Solr with the managed schema: it apparently treats text as string types. There's a ticket to change that to TextField, though.

Markus
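Until that's fixed you can work around the field guessing by declaring the field explicitly as a tokenized text type. A minimal sketch, assuming the stock text_general type from the default configset:

  <!-- managed-schema: declare "content" explicitly so field guessing
       doesn't map it to a string type -->
  <field name="content" type="text_general" indexed="true" stored="true"/>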
-----Original message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: Tuesday 21st June 2016 23:15
> To: user@nutch.apache.org
> Subject: Re: immense term, Correcting analyzer
>
> Hi,
>
> you are right, looks like the field "content" is indexed as one single term
> and is not split ("tokenized") into words. The best way would be
> to use the schema.xml shipped with Nutch ($NUTCH_HOME/conf/schema.xml), see
> https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
>
> > BTW, am using nutch 1.11 and solr 6.0.0
>
> Nutch 1.11 requires Solr 4.10.2; other versions may work (or may not!)
>
> Sebastian
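One note on Solr 6, since that's what is used below: the managed schema is the default there, so after copying Nutch's schema.xml into the core's conf directory, the core also has to be switched to the classic schema in solrconfig.xml. A sketch (remove any leftover managed-schema file in the same directory as well):

  <!-- solrconfig.xml: read the classic schema.xml instead of the managed schema -->
  <schemaFactory class="ClassicIndexSchemaFactory"/>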
> On 06/21/2016 08:04 PM, shakiba davari wrote:
> > Hi guys, I'm trying to index my Nutch-crawled data by running:
> >
> > bin/nutch index -D solr.server.url="http://localhost:8983/solr/carerate" crawl/crawldb -linkdb crawl/linkdb crawl/segments/2016*
> >
> > At first it was working totally OK. I indexed my data, sent a few queries, and received good results. But then I ran the crawl again so that it crawls to a greater depth and fetches more pages; the last time, the crawler's status showed 1051 unfetched and 151 fetched pages in my db. And now when I run the nutch index command, it fails with "java.io.IOException: Job failed!"
> > Here is my log:
> >
> > java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
> > at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> > at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> > Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Exception writing document id http://www.cs.toronto.edu/~frank/About_Me/about_me.html to the index; possible analysis error: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[70, 114, 97, 110, 107, 32, 82, 117, 100, 122, 105, 99, 122, 32, 45, 32, 65, 98, 111, 117, 116, 32, 77, 101, 32, 97, 98, 111, 117, 116]...', original message: bytes can be at most 32766 in length; got 40063. Perhaps the document has an indexed string field (solr.StrField) which is too large
> > at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
> > at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
> > at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
> > at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
> > at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:153)
> > at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
> > at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
> > at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
> > at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> > at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
> > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.Thread.run(Thread.java:745)
> > 2016-06-21 13:27:37,994 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
> >
> > I realize that on the mentioned page there is a really long term, so in schema.xml and managed-schema I changed the type of "id", "content", and "text" from "strings" to "text_general":
> >
> > <field name="id" type="text_general">
> >
> > but it didn't solve the problem.
> > I'm no expert, so I'm not sure how to correct the analyzer without screwing up something else. I've read somewhere that I can
> > 1. use (in the index analyzer) a LengthFilterFactory in order to filter out those tokens that don't fall within a requested length range, or
> > 2. use (in the index analyzer) a TruncateTokenFilterFactory to cap the length of indexed tokens.
> >
> > But there are so many analyzers in the schema. Should I change the analyzer defined for <fieldType name="text_general"...>? If yes, since the content and other fields' type is text_general, isn't it going to affect all of them too?
> >
> > I would really appreciate any help.
> > BTW, I'm using Nutch 1.11 and Solr 6.0.0.
> >
> > Shakiba Davari <https://ca.linkedin.com/pub/shakiba-davari/84/417/b57>
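To the last question above: yes, changing the analyzer of text_general affects every field of that type. A safer route is a dedicated field type used only by "content". A sketch putting both suggested filters in context; the type name "text_content" and the 256-character limit are made-up values, adjust as needed:

  <!-- a separate tokenized type for "content", so text_general stays untouched -->
  <fieldType name="text_content" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- option 1: silently drop tokens longer than 256 characters -->
      <filter class="solr.LengthFilterFactory" min="1" max="256"/>
      <!-- option 2, instead of the LengthFilterFactory: keep long tokens
           but cut them down to a fixed prefix:
           <filter class="solr.TruncateTokenFilterFactory" prefixLength="256"/> -->
    </analyzer>
  </fieldType>
  <field name="content" type="text_content" indexed="true" stored="true"/>

Either filter keeps any single term well under Lucene's 32766-byte limit, which is what the "immense term" error is complaining about.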