_How_ does it fail? You must be seeing something in the logs....
On Wed, Sep 2, 2015 at 8:29 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> Hi Erick,
>
> Yes, I'm trying out the De-Duplication too. But I'm facing a problem with
> it: indexing stops working once I put the following De-Duplication
> configuration into solrconfig.xml. The problem seems to be with the
> <str name="update.chain">dedupe</str> line.
>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">dedupe</str>
>   </lst>
> </requestHandler>
>
> <updateRequestProcessorChain name="dedupe">
>   <processor class="solr.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">signature</str>
>     <bool name="overwriteDupes">false</bool>
>     <str name="fields">content</str>
>     <str name="signatureClass">solr.processor.Lookup3Signature</str>
>   </processor>
> </updateRequestProcessorChain>
>
> Regards,
> Edwin
>
> On 2 September 2015 at 23:10, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Yes, that is an intentional limit for the size of a single token,
>> which strings are.
>>
>> Why not use de-duplication? See:
>> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>>
>> You don't have to replace the existing documents; Solr will compute
>> a hash that can be used to identify identical documents, and you can
>> use _that_.
>>
>> Best,
>> Erick
>>
>> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
>> <edwinye...@gmail.com> wrote:
>> > Hi,
>> >
>> > I would like to check: must a string field's value be at most 32766
>> > bytes in length?
>> >
>> > I'm trying to do a copyField of my rich-text documents' content to a
>> > field with fieldType=string, to try to get distinct results for
>> > content, as there are several documents with the exact same content
>> > and we only want to list one of them during searching.
>> > However, I get the following errors for some of the documents when I
>> > try to index them with the copyField. Some of my documents are quite
>> > large, and it is possible that they exceed 32766 bytes. Are there any
>> > other ways to overcome this problem?
>> >
>> > org.apache.solr.common.SolrException: Exception writing document id
>> > collection1_polymer100 to the index; possible analysis error.
>> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
>> >     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>> >     at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
>> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
>> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
>> >     at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
>> >     at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> >     at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
>> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
>> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
>> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
>> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
>> >     at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>> >     at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>> >     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>> >     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>> >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> >     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>> >     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>> >     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>> >     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>> >     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>> >     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>> >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>> >     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>> >     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>> >     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>> >     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>> >     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>> >     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>> >     at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>> >     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>> >     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>> >     at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.lang.IllegalArgumentException: Document contains at least
>> > one immense term in field="signature" (whose UTF8 encoding is longer than
>> > the max length 32766), all of which were skipped. Please correct the
>> > analyzer to not produce such terms. The prefix of the first immense term
>> > is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
>> > 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
>> > original message: bytes can be at most 32766 in length; got 49960
>> >     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
>> >     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
>> >     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
>> >     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
>> >     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
>> >     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
>> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
>> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
>> >     ... 38 more
>> > Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
>> > bytes can be at most 32766 in length; got 49960
>> >     at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
>> >     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
>> >     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
>> >     ... 45 more
>> >
>> > Regards,
>> > Edwin
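[Editor's note: the "prefix of the first immense term" in the IllegalArgumentException above is a list of UTF-8 byte values, and decoding them shows what actually ended up in the signature field. A minimal sketch in plain Java (the class name DecodePrefix is just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class DecodePrefix {
    public static void main(String[] args) {
        // Byte values copied verbatim from the stack trace above.
        int[] prefix = {32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60,
                        98, 114, 62, 56, 48, 56, 32, 72, 97, 110, 100, 98,
                        111, 111, 107, 32, 111, 102};
        byte[] bytes = new byte[prefix.length];
        for (int i = 0; i < prefix.length; i++) {
            bytes[i] = (byte) prefix[i];
        }
        // Prints " <p><br>   <br>808 Handbook of" -- raw extracted document
        // text, not a short hash.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The decoded prefix is raw HTML-ish document content, which suggests the immense term in the signature field is the document text itself rather than the short hex digest a Lookup3Signature would produce.]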
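[Editor's note: as Erick says, the 32766 limit is Lucene's cap on the UTF-8 encoded byte length of a single indexed term, not on the character count, so multi-byte text hits it with far fewer characters. A quick illustration with hypothetical strings (the MAX_TERM_BYTES constant is just the value from the error message; requires Java 11+ for String.repeat):

```java
import java.nio.charset.StandardCharsets;

public class TermByteLength {
    // Per-term byte limit quoted in the exception message above.
    static final int MAX_TERM_BYTES = 32766;

    public static void main(String[] args) {
        // 20,000 ASCII characters encode to 20,000 UTF-8 bytes: fits.
        String ascii = "a".repeat(20_000);
        // 20,000 CJK characters encode to 3 bytes each (60,000): too big.
        String cjk = "\u4e2d".repeat(20_000);

        int asciiBytes = ascii.getBytes(StandardCharsets.UTF_8).length;
        int cjkBytes = cjk.getBytes(StandardCharsets.UTF_8).length;

        System.out.println(asciiBytes + " <= " + MAX_TERM_BYTES + ": "
                + (asciiBytes <= MAX_TERM_BYTES)); // 20000 <= 32766: true
        System.out.println(cjkBytes + " <= " + MAX_TERM_BYTES + ": "
                + (cjkBytes <= MAX_TERM_BYTES));   // 60000 <= 32766: false
    }
}
```

So checking a value's length in characters before indexing is not enough; the check that matters is on getBytes(UTF_8).length.]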