_How_ does it fail? You must be seeing something in the logs....
On Wed, Sep 2, 2015 at 8:29 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> Hi Erick,
>
> Yes, I'm trying out the De-Duplication too. But I'm facing a problem with
> it: indexing stops working once I put the following De-Duplication
> configuration into solrconfig.xml. The problem seems to be with the
> <str name="update.chain">dedupe</str> line.
>
> <requestHandler name="/update" class="solr.UpdateRequestHandler">
>   <lst name="defaults">
>     <str name="update.chain">dedupe</str>
>   </lst>
> </requestHandler>
>
> <updateRequestProcessorChain name="dedupe">
>   <processor class="solr.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">signature</str>
>     <bool name="overwriteDupes">false</bool>
>     <str name="fields">content</str>
>     <str name="signatureClass">solr.processor.Lookup3Signature</str>
>   </processor>
> </updateRequestProcessorChain>
>
> Regards,
> Edwin
>
> On 2 September 2015 at 23:10, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> Yes, that is an intentional limit for the size of a single token,
>> which strings are.
>>
>> Why not use de-duplication? See:
>> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>>
>> You don't have to replace the existing documents; Solr will compute
>> a hash that can be used to identify identical documents, and you can
>> use _that_.
>>
>> Best,
>> Erick
>>
>> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
>> <edwinye...@gmail.com> wrote:
>> > Hi,
>> >
>> > I would like to check: must a string field's value be at most 32766
>> > bytes in length?
>> >
>> > I'm trying to do a copyField of my rich-text documents' content to a
>> > field with fieldType=string, to try to get distinct results for
>> > content, as there are several documents with the exact same content
>> > and we only want to list one of them during searching.
>> > However, I get the following errors for some of the documents when I
>> > try to index them with the copyField. Some of my documents are quite
>> > large, and it is possible that they exceed 32766 bytes. Are there any
>> > other ways to overcome this problem?
>> >
>> > org.apache.solr.common.SolrException: Exception writing document id
>> > collection1_polymer100 to the index; possible analysis error.
>> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
>> >     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>> >     at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
>> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
>> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
>> >     at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
>> >     at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> >     at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
>> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
>> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
>> >     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
>> >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>> >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
>> >     at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>> >     at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
>> >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>> >     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>> >     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>> >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> >     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>> >     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>> >     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>> >     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>> >     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>> >     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>> >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>> >     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>> >     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>> >     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>> >     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>> >     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>> >     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>> >     at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>> >     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>> >     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>> >     at java.lang.Thread.run(Thread.java:745)
>> > Caused by: java.lang.IllegalArgumentException: Document contains at least
>> > one immense term in field="signature" (whose UTF8 encoding is longer than
>> > the max length 32766), all of which were skipped. Please correct the
>> > analyzer to not produce such terms. The prefix of the first immense term
>> > is: '[32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60, 98, 114, 62, 56,
>> > 48, 56, 32, 72, 97, 110, 100, 98, 111, 111, 107, 32, 111, 102]...',
>> > original message: bytes can be at most 32766 in length; got 49960
>> >     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
>> >     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
>> >     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
>> >     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
>> >     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
>> >     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
>> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
>> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
>> >     ... 38 more
>> > Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
>> > bytes can be at most 32766 in length; got 49960
>> >     at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
>> >     at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
>> >     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
>> >     ... 45 more
>> >
>> > Regards,
>> > Edwin
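[Editor's note: the "prefix of the first immense term" in the IllegalArgumentException above is a list of UTF-8 byte values, and decoding them shows what actually ended up in the signature field. A minimal sketch in plain Java (the class name DecodePrefix is just for illustration):

```java
import java.nio.charset.StandardCharsets;

public class DecodePrefix {
    public static void main(String[] args) {
        // Byte values copied verbatim from the stack trace above.
        int[] prefix = {32, 60, 112, 62, 60, 98, 114, 62, 32, 32, 32, 60,
                        98, 114, 62, 56, 48, 56, 32, 72, 97, 110, 100, 98,
                        111, 111, 107, 32, 111, 102};
        byte[] bytes = new byte[prefix.length];
        for (int i = 0; i < prefix.length; i++) {
            bytes[i] = (byte) prefix[i];
        }
        // Prints " <p><br>   <br>808 Handbook of" -- raw extracted document
        // text, not a short hash.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The decoded prefix is raw HTML-ish document content, which suggests the immense term in the signature field is the document text itself rather than the short hex digest a Lookup3Signature would produce.]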
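[Editor's note: as Erick says, the 32766 limit is Lucene's cap on the UTF-8 encoded byte length of a single indexed term, not on the character count, so multi-byte text hits it with far fewer characters. A quick illustration with hypothetical strings (the MAX_TERM_BYTES constant is just the value from the error message; requires Java 11+ for String.repeat):

```java
import java.nio.charset.StandardCharsets;

public class TermByteLength {
    // Per-term byte limit quoted in the exception message above.
    static final int MAX_TERM_BYTES = 32766;

    public static void main(String[] args) {
        // 20,000 ASCII characters encode to 20,000 UTF-8 bytes: fits.
        String ascii = "a".repeat(20_000);
        // 20,000 CJK characters encode to 3 bytes each (60,000): too big.
        String cjk = "\u4e2d".repeat(20_000);

        int asciiBytes = ascii.getBytes(StandardCharsets.UTF_8).length;
        int cjkBytes = cjk.getBytes(StandardCharsets.UTF_8).length;

        System.out.println(asciiBytes + " <= " + MAX_TERM_BYTES + ": "
                + (asciiBytes <= MAX_TERM_BYTES)); // 20000 <= 32766: true
        System.out.println(cjkBytes + " <= " + MAX_TERM_BYTES + ": "
                + (cjkBytes <= MAX_TERM_BYTES));   // 60000 <= 32766: false
    }
}
```

So checking a value's length in characters before indexing is not enough; the check that matters is on getBytes(UTF_8).length.]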