Re: String bytes can be at most 32766 characters in length?

2015-09-03 Thread Zheng Lin Edwin Yeo
Thanks for your advice, Alexandre. On 3 September 2015 at 20:29, Alexandre Rafalovitch wrote: > Probably because your signatureField and your fields are the same! You > need to point signatureField at a new (not-ID) field. > > You will still get duplicates, as you requested that in your other > e

Re: String bytes can be at most 32766 characters in length?

2015-09-03 Thread Alexandre Rafalovitch
Probably because your signatureField and your fields are the same! You need to point signatureField at a new (not-ID) field. You will still get duplicates, as you requested that in your other emails, but now you would be able to group on that new signature field. If you have any further problems,
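
A minimal sketch of the setup Alexandre describes. It assumes a dedicated field named `signature` (a hypothetical name that must also be declared in the schema) and that the hash is computed over `content`, the field discussed elsewhere in this thread. With `overwriteDupes` set to false, duplicates stay in the index but share the same signature value:

```xml
<!-- solrconfig.xml: hash the "content" field into a separate, non-ID field -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Assumed field name; it must differ from the uniqueKey and from "fields" -->
    <str name="signatureField">signature</str>
    <!-- false keeps duplicate documents instead of replacing them -->
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- schema.xml: the signature field is indexed so that it can be grouped on -->
<field name="signature" type="string" indexed="true" stored="true" multiValued="false" />
```

Grouping on that field at query time, for example with `group=true&group.field=signature`, should then collapse documents whose content is identical.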

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi Alexandre, Thanks for pointing out the error. I'm able to get the documents indexed after adding the two processors. However, I'm still seeing all the similar documents returned when searching the content, without being de-duplicated. My content is currently indexed as fieldType=text_general.

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Alexandre Rafalovitch
And that's because you have an incomplete chain. If you look at the full example in solrconfig.xml, it shows: true id false name,features,cat solr.processor.Lookup3Signature Notice, the last two processors. I
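
The values above have lost their XML tags in this archive. They line up with the commented-out dedupe chain that ships in Solr's sample solrconfig.xml, so the following is a reconstruction from that stock example rather than a copy of the original message:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- The last two processors: logging, then the step that actually applies the update -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

A chain that ends after the signature processor never reaches `RunUpdateProcessorFactory`, which is the step that executes the update. That would be consistent with indexing appearing to run normally while nothing ends up in the index.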

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi Erick, I couldn't really find anything special in the logs. The indexing process just went on normally, but after that when I check the index, there is nothing indexed. This is what I see from the logs. Looks the same as when the indexing works fine. INFO - 2015-09-03 01:24:35.316; [collecti

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Erick Erickson
_How_ does it fail? You must be seeing something in the logs. On Wed, Sep 2, 2015 at 8:29 AM, Zheng Lin Edwin Yeo wrote: > Hi Erick, > > Yes, I'm trying out the De-Duplication too. But I'm facing a problem with > that, which is the indexing stops working once I put in the following > De-Dupl

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi Erick, Yes, I'm trying out the De-Duplication too. But I'm facing a problem with that, which is that the indexing stops working once I put the following De-Duplication code in solrconfig.xml. The problem seems to be with this dedupe line. dedupe true signature false content

Re: String bytes can be at most 32766 characters in length?

2015-09-02 Thread Erick Erickson
Yes, that is an intentional limit for the size of a single token, which strings are. Why not use deduplication? See: https://cwiki.apache.org/confluence/display/solr/De-Duplication You don't have to replace the existing documents, and Solr will compute a hash that can be used to identify identica

String bytes can be at most 32766 characters in length?

2015-09-02 Thread Zheng Lin Edwin Yeo
Hi, I would like to check: must string bytes be at most 32766 characters in length? I'm trying to do a copyField of my rich-text documents' content to a field with fieldType=string to try getting distinct results for content, as there are several documents with the exact same content,