Probably because your signatureField and your fields are the same field!
You need to point signatureField at a new field of its own (not the ID
field, and not the content field being hashed).
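A sketch of what that could look like (untested; the field name
"signature" and its schema attributes are my assumptions, adjust to
your setup):

```xml
<!-- schema.xml: a dedicated field to receive the computed hash -->
<field name="signature" type="string" indexed="true" stored="true" />

<!-- solrconfig.xml: same chain as yours, but signatureField now points
     at the new field rather than at the content field being hashed -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```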

You will still get duplicates in the index, as you requested in your
other emails, but now you will be able to group on that new signature
field.
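For example, a query along these lines should then collapse the
duplicates at search time (a sketch, assuming the new field is named
"signature"):

```
q=*:*&group=true&group.field=signature&group.limit=1
```

Each group holds the documents sharing one content hash, and only the
top document of each group is returned.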

If you have any further problems, please start a new thread with a new
subject, as they would no longer be related to the current question.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 2 September 2015 at 22:21, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
> Hi Alexandre,
>
> Thanks for pointing out the error. I'm able to get the documents to be
> indexed after adding in the two processors.
>
> However, I'm still seeing all the similar documents returned in the
> search results without being de-duplicated. My content is currently
> indexed as fieldType=text_general.
>
>     <updateRequestProcessorChain name="dedupe">
>       <processor class="solr.processor.SignatureUpdateProcessorFactory">
>         <bool name="enabled">true</bool>
>         <str name="signatureField">content</str>
>         <bool name="overwriteDupes">false</bool>
>         <str name="fields">content</str>
>         <str name="signatureClass">solr.processor.Lookup3Signature</str>
>       </processor>
>       <processor class="solr.LogUpdateProcessorFactory" />
>       <processor class="solr.RunUpdateProcessorFactory" />
>     </updateRequestProcessorChain>
>
> Regards,
> Edwin
>
>
> On 3 September 2015 at 09:46, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> And that's because you have an incomplete chain. If you look at the
>> full example in solrconfig.xml, it shows:
>>      <updateRequestProcessorChain name="dedupe">
>>        <processor class="solr.processor.SignatureUpdateProcessorFactory">
>>          <bool name="enabled">true</bool>
>>          <str name="signatureField">id</str>
>>          <bool name="overwriteDupes">false</bool>
>>          <str name="fields">name,features,cat</str>
>>          <str name="signatureClass">solr.processor.Lookup3Signature</str>
>>        </processor>
>>        <processor class="solr.LogUpdateProcessorFactory" />
>>        <processor class="solr.RunUpdateProcessorFactory" />
>>      </updateRequestProcessorChain>
>>
>>
>> Notice the last two processors. If you don't have those, nothing gets
>> indexed. Your chain is missing them, for whatever reason. Try adding
>> them back in, reloading the core and reindexing.
>>
>> Regards,
>>    Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 2 September 2015 at 11:29, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
>> wrote:
>> > Hi Erick,
>> >
>> > Yes, I'm trying out the De-Duplication too. But I'm facing a problem
>> > with it: the indexing stops working once I put the following
>> > De-Duplication code in solrconfig.xml. The problem seems to be with
>> > the <str name="update.chain">dedupe</str> line.
>> >
>> >   <requestHandler name="/update" class="solr.UpdateRequestHandler">
>> >     <lst name="defaults">
>> >       <str name="update.chain">dedupe</str>
>> >     </lst>
>> >   </requestHandler>
>> >
>> >
>> >     <updateRequestProcessorChain name="dedupe">
>> >       <processor class="solr.processor.SignatureUpdateProcessorFactory">
>> >         <bool name="enabled">true</bool>
>> >         <str name="signatureField">signature</str>
>> >         <bool name="overwriteDupes">false</bool>
>> >         <str name="fields">content</str>
>> >         <str name="signatureClass">solr.processor.Lookup3Signature</str>
>> >       </processor>
>> >     </updateRequestProcessorChain>
>> >
>> >
>> > Regards,
>> > Edwin
>> >
>> > On 2 September 2015 at 23:10, Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Yes, that is an intentional limit for the size of a single token,
>> >> which strings are.
>> >>
>> >> Why not use deduplication? See:
>> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>> >>
>> You don't have to replace the existing documents; Solr will
>> compute a hash that can be used to identify identical documents,
>> and you can use _that_.
>> >>
>> Best,
>> >> Erick
>> >>
>> >> On Wed, Sep 2, 2015 at 2:53 AM, Zheng Lin Edwin Yeo
>> >> <edwinye...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I would like to check: must a string field value be at most 32766
>> >> > bytes in length?
>> >> >
>> >> > I'm trying to do a copyField of my rich-text documents' content to
>> >> > a field with fieldType=string, to get distinct results for content,
>> >> > as there are several documents with the exact same content and we
>> >> > only want to list one of them during searching.
>> >> >
>> >> > However, I get the following errors for some of the documents when
>> >> > I try to index them with the copyField. Some of my documents are
>> >> > quite large, and it is possible that they exceed 32766 bytes. Is
>> >> > there any other way to overcome this problem?
>> >> >
>> >> >
>> >> > org.apache.solr.common.SolrException: Exception writing document id
>> >> > collection1_polymer100 to the index; possible analysis error.
>> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
>> >> >   at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
>> >> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
>> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
>> >> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
>> >> >   at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
>> >> >   at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
>> >> >   at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:207)
>> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:122)
>> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:127)
>> >> >   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:235)
>> >> >   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>> >> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>> >> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
>> >> >   at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
>> >> >   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
>> >> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
>> >> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
>> >> >   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>> >> >   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>> >> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> >> >   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>> >> >   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>> >> >   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>> >> >   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>> >> >   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>> >> >   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>> >> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>> >> >   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>> >> >   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>> >> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>> >> >   at org.eclipse.jetty.server.Server.handle(Server.java:497)
>> >> >   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>> >> >   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>> >> >   at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>> >> >   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>> >> >   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>> >> >   at java.lang.Thread.run(Thread.java:745)
>> >> > Caused by: java.lang.IllegalArgumentException: Document contains at
>> >> > least one immense term in field="signature" (whose UTF8 encoding is
>> >> > longer than the max length 32766), all of which were skipped.
>> >> > Please correct the analyzer to not produce such terms.  The prefix
>> >> > of the first immense term is: '[32, 60, 112, 62, 60, 98, 114, 62,
>> >> > 32, 32, 32, 60, 98, 114, 62, 56, 48, 56, 32, 72, 97, 110, 100, 98,
>> >> > 111, 111, 107, 32, 111, 102]...', original message: bytes can be at
>> >> > most 32766 in length; got 49960
>> >> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:670)
>> >> >   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
>> >> >   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
>> >> >   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
>> >> >   at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:458)
>> >> >   at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1363)
>> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
>> >> >   at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
>> >> >   ... 38 more
>> >> > Caused by:
>> >> > org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException:
>> >> > bytes can be at most 32766 in length; got 49960
>> >> >   at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
>> >> >   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:154)
>> >> >   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:660)
>> >> >   ... 45 more
>> >> >
>> >> >
>> >> > Regards,
>> >> > Edwin
>> >>
>>
