I followed the De-Duplication guide, but after configuring it in solrconfig.xml and schema.xml, nothing gets indexed into Solr and no error message appears. I'm using SimplePostTool to index rich-text documents.

Below are my configurations:

In solrconfig.xml:

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">id</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">content</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
  </updateRequestProcessorChain>

In schema.xml:

  <field name="signature" type="string" stored="true" indexed="true" multiValued="false" />

Is there anything I might have missed or done wrong?

Regards,
Edwin

On 1 September 2015 at 10:46, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Thank you for your advice Alexandre.
>
> Will try out the de-duplication from the link you gave.
>
> Regards,
> Edwin
>
> On 1 September 2015 at 10:34, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>
>> Re-read the question. You want to de-dupe on the full text-content.
>>
>> I would actually try to use the dedupe chain as per the link I gave
>> but put results into a separate string field. Then, you group on that
>> field. You cannot actually group on the long text field, that would
>> kill any performance. So a signature is your proxy.
>>
>> Regards,
>> Alex
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>> On 31 August 2015 at 22:26, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>>
>> > Hi Alexandre,
>> >
>> > Will treating it as String affect the search or other functions like
>> > highlighting?
>> >
>> > Yes, the content must be in my index, unless I do a copyField to do
>> > de-duplication on that field.. Will that help?
>> >
>> > Regards,
>> > Edwin
>> >
>> > On 1 September 2015 at 10:04, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> >
>> >> Can't you just treat it as String?
>> >>
>> >> Also, do you actually want those documents in your index in the first
>> >> place? If not, have you looked at De-duplication:
>> >> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>> >>
>> >> Regards,
>> >> Alex.
>> >> ----
>> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> >> http://www.solr-start.com/
>> >>
>> >> On 31 August 2015 at 22:00, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> >>
>> >> > Thanks Jan.
>> >> >
>> >> > But I read that the field that is being collapsed on must be a single
>> >> > valued String, Int or Float. As I'm required to get the distinct results
>> >> > from "content" field that was indexed from a rich text document, I got the
>> >> > following error:
>> >> >
>> >> > "error":{
>> >> > "msg":"java.io.IOException: 64 bit numeric collapse fields are not supported",
>> >> > "trace":"java.lang.RuntimeException: java.io.IOException: 64 bit
>> >> > numeric collapse fields are not supported\r\n\tat
>> >> >
>> >> > Is it possible to collapsed on fields which has a long integer of data,
>> >> > like content from a rich text document?
>> >> >
>> >> > Regards,
>> >> > Edwin
>> >> >
>> >> > On 31 August 2015 at 18:59, Jan Høydahl <jan....@cominvent.com> wrote:
>> >> >
>> >> >> Hi
>> >> >>
>> >> >> Check out the CollapsingQParser
>> >> >> (https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results).
>> >> >> As long as you have a field that will be the same for all duplicates, you
>> >> >> can “collapse” on that field. If you not have a “group id”, you can create
>> >> >> one using e.g. an MD5 signature of the identical body text
>> >> >> (https://cwiki.apache.org/confluence/display/solr/De-Duplication).
>> >> >>
>> >> >> --
>> >> >> Jan Høydahl, search solution architect
>> >> >> Cominvent AS - www.cominvent.com
>> >> >>
>> >> >> > On 31 Aug 2015 at 12:03, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > I'm using Solr 5.2.1, and I would like to find out, what is the best way to
>> >> >> > get Solr to return only distinct results?
>> >> >> >
>> >> >> > Currently, I've indexed several exact similar documents into Solr, with
>> >> >> > just different id and title, but the content is exactly the same. When I do
>> >> >> > a search, Solr will return all these documents several time in the list.
>> >> >> >
>> >> >> > What is the most suitable way to get Solr to return only one of the
>> >> >> > document during the search?
>> >> >> > I understand that there is result grouping and faceting, but I'm not sure
>> >> >> > if that is the best way.
>> >> >> >
>> >> >> > Regards,
>> >> >> > Edwin
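
A note on the dedupe chain in the configuration at the top of this message: as far as I can tell, the example chain in the De-Duplication guide ends with LogUpdateProcessorFactory and RunUpdateProcessorFactory, and a custom chain without RunUpdateProcessorFactory does not hand documents on to the index. That would be consistent with the "nothing indexed, no error" symptom, though it has not been verified against this setup. The sketch below is only a possible variant, not a confirmed fix. It assumes the signature should be written to the separate `signature` field already declared in schema.xml instead of `id`, following Alexandre's suggestion, and it keeps `content` as the source field from the original configuration.

```xml
<!-- solrconfig.xml: sketch of a dedupe chain; field names are taken from the
     configuration in this thread and are assumptions about the schema. -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- Assumption: store the hash in the dedicated "signature" field
         rather than overwriting the unique key "id". -->
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <!-- Without these two processors the chain stops after computing the
       signature and the document is never written to the index. -->
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With the signature in its own string field, duplicates could then be collapsed at query time with a filter such as fq={!collapse field=signature}, as described on the Collapse and Expand Results page. Rich documents sent through SimplePostTool usually go to /update/extract rather than /update, so the same update.chain default may also need to be set on that handler; that is an assumption about this setup.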