Re: Removing duplicates during a query
OK - I see that this can be done with Field Collapsing/Grouping. I also see the mentions in the Wiki for avoiding duplicates using a 16-byte hash. So, question withdrawn...

On Thu, Aug 22, 2013 at 10:21 PM, Dan Davis wrote:
> Suppose I have two documents with different id, and there is another
> field, for instance "content-hash" which is something like a 16-byte hash
> of the content.
>
> Can Solr be configured to return just one copy, and drop the other if both
> are relevant?
>
> If Solr does drop one result, do you get any indication in the document
> that was kept that there was another copy?
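For illustration only, here is a minimal Python sketch of how a "content-hash" value like the one Dan describes might be computed on the client before indexing. It assumes MD5 (whose digest happens to be 16 bytes) and a whitespace normalization step; neither choice comes from the thread, and the function name is hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a hex-encoded 16-byte digest of the document content."""
    # Assumption: collapse runs of whitespace so copies that differ only
    # in formatting produce the same hash.
    normalized = " ".join(text.split())
    # MD5 yields a 16-byte digest; it is used here for duplicate
    # detection only, not for anything security related.
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Two documents with different ids but the same content would then carry the same value in the hash field, which is what grouping or faceting can key off.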
RE: removing duplicates
This would describe the facet parameters we're talking about:
http://wiki.apache.org/solr/SimpleFacetParameters

Query something like this:
http://localhost:8983/solr/select?q=*:*&fl=id&rows=0&facet=true&facet.limit=-1&facet.field=&facet.mincount=2

Then filter on each facet returned with a filter query described here:
http://wiki.apache.org/solr/CommonQueryParameters

Example: q=*:*&fq=:

Then you would have to get all ids returned and delete all but the first one using some app...

Thanks
Robi

-----Original Message-----
From: Ali, Saqib [mailto:docbook@gmail.com]
Sent: Wednesday, August 21, 2013 2:34 PM
To: solr-user@lucene.apache.org
Subject: Re: removing duplicates

Thanks Aloke and Robert. Can you please give me code/query snippets? (newbie here)

On Wed, Aug 21, 2013 at 2:31 PM, Aloke Ghoshal wrote:
> Hi,
>
> Facet by one of the duplicate fields (probably by the numeric field
> that you mentioned) and set facet.mincount=2.
>
> Regards,
> Aloke
>
> On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib wrote:
> > hello,
> >
> > We have documents that are duplicates i.e. the ID is different, but
> > rest of the fields are same. Is there a query that can remove duplicate,
> > and just leave one copy of the document on solr? There is one numeric
> > field that we can key off for find duplicates.
> >
> > Please advise.
> >
> > Thanks
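As a rough sketch of the two queries Robi describes (the facet query, then one filter query per facet value), the URLs could be assembled in Python like this. The field name `dup_key`, the function names, the `rows` default, and the added `wt=json` parameter are assumptions for illustration; the base URL is the one from the message.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # URL from the message
DUP_FIELD = "dup_key"  # hypothetical name of the field duplicates share


def facet_query_url(field, base=SOLR_SELECT):
    """URL that lists every value of `field` occurring in 2+ documents."""
    params = [
        ("q", "*:*"),
        ("fl", "id"),
        ("rows", "0"),            # only the facet counts are needed
        ("facet", "true"),
        ("facet.limit", "-1"),    # no cap on the number of facet values
        ("facet.field", field),
        ("facet.mincount", "2"),  # keep only values seen at least twice
        ("wt", "json"),           # assumption: ask for a JSON response
    ]
    return base + "?" + urlencode(params)


def filter_query_url(field, value, base=SOLR_SELECT, rows=1000):
    """URL that returns the ids of all documents sharing one facet value."""
    # Note: values containing Solr query syntax characters would need
    # escaping; a plain numeric value does not.
    params = [
        ("q", "*:*"),
        ("fq", f"{field}:{value}"),
        ("fl", "id"),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return base + "?" + urlencode(params)
```

`urlencode` percent-encodes `*` and `:`, so the resulting URLs look different from the hand-written ones but carry the same parameters.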
Re: removing duplicates
Hi,

This will help you identify the duplicates:
q=*:*&fl=id&facet=true&facet.mincount=2&rows=0&facet.field=

To actually remove them from Solr, you will have to do something like Robert suggested. Write an application that uses the results to build a delete by id query (http://wiki.apache.org/solr/UpdateXmlMessages#A.22delete.22_documents_by_ID_and_by_Query).

Regards,
Aloke

On Thu, Aug 22, 2013 at 3:04 AM, Ali, Saqib wrote:
> Thanks Aloke and Robert. Can you please give me code/query snippets?
> (newbie here)
>
> On Wed, Aug 21, 2013 at 2:31 PM, Aloke Ghoshal wrote:
> > Hi,
> >
> > Facet by one of the duplicate fields (probably by the numeric field that
> > you mentioned) and set facet.mincount=2.
> >
> > Regards,
> > Aloke
> >
> > On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib wrote:
> > > hello,
> > >
> > > We have documents that are duplicates i.e. the ID is different, but rest of
> > > the fields are same. Is there a query that can remove duplicate, and just
> > > leave one copy of the document on solr? There is one numeric field that we
> > > can key off for find duplicates.
> > >
> > > Please advise.
> > >
> > > Thanks
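A small sketch of the "build a delete by id query" step Aloke mentions, assuming the XML update message format described on the linked wiki page. The function name is hypothetical, and actually posting the message to Solr's update handler is left out.

```python
from xml.sax.saxutils import escape


def delete_by_id_xml(ids):
    """Build one XML <delete> message covering several document ids."""
    # Escape each id so characters such as & or < cannot break the XML.
    body = "".join(f"<id>{escape(str(doc_id))}</id>" for doc_id in ids)
    return f"<delete>{body}</delete>"
```

The resulting string would be sent to the update handler, followed by a commit once all deletes are in.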
Re: removing duplicates
Thanks Aloke and Robert. Can you please give me code/query snippets? (newbie here)

On Wed, Aug 21, 2013 at 2:31 PM, Aloke Ghoshal wrote:
> Hi,
>
> Facet by one of the duplicate fields (probably by the numeric field that
> you mentioned) and set facet.mincount=2.
>
> Regards,
> Aloke
>
> On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib wrote:
> > hello,
> >
> > We have documents that are duplicates i.e. the ID is different, but rest of
> > the fields are same. Is there a query that can remove duplicate, and just
> > leave one copy of the document on solr? There is one numeric field that we
> > can key off for find duplicates.
> >
> > Please advise.
> >
> > Thanks
RE: removing duplicates
Hi,

Perhaps you could query for all documents asking for the id field to be returned and then facet on the field you say you can key off of for duplicates. Set the facet mincount to 2, then you would have to filter on each facet value and page through all doc IDs (except skip the first document) for each returned facet and delete by ID using a small app or something like that. Spin all the deletes into the index and then do a commit at the end. I think that would do it.

Thanks
Robi

-----Original Message-----
From: Ali, Saqib [mailto:docbook@gmail.com]
Sent: Wednesday, August 21, 2013 2:15 PM
To: solr-user@lucene.apache.org
Subject: removing duplicates

hello,

We have documents that are duplicates i.e. the ID is different, but rest of the fields are same. Is there a query that can remove duplicate, and just leave one copy of the document on solr? There is one numeric field that we can key off for find duplicates.

Please advise.

Thanks
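The "skip the first document, delete the rest" logic Robi outlines can be sketched as two pure functions over already-parsed JSON responses. This assumes Solr's default JSON layout, where a facet field comes back as a flat list alternating value and count; the function names and the field name `dup_key` in the example are made up for illustration.

```python
def duplicate_values(facet_response, field):
    """Return the values of `field` that occur in two or more documents."""
    flat = facet_response["facet_counts"]["facet_fields"][field]
    # The flat list alternates value, count, value, count, ...
    pairs = zip(flat[0::2], flat[1::2])
    return [value for value, count in pairs if count >= 2]


def ids_to_delete(select_response):
    """Return every id except the first, so one copy stays in the index."""
    docs = select_response["response"]["docs"]
    return [doc["id"] for doc in docs[1:]]


# Example with made-up responses:
facet = {"facet_counts": {"facet_fields": {"dup_key": ["42", 3, "7", 2]}}}
select = {"response": {"docs": [{"id": "a"}, {"id": "b"}, {"id": "c"}]}}
print(duplicate_values(facet, "dup_key"))  # ['42', '7']
print(ids_to_delete(select))               # ['b', 'c']
```

Which copy counts as "the first" depends on the sort order of the filter query, so a small app would probably want an explicit sort before trusting this.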
Re: removing duplicates
Hi,

Facet by one of the duplicate fields (probably by the numeric field that you mentioned) and set facet.mincount=2.

Regards,
Aloke

On Thu, Aug 22, 2013 at 2:44 AM, Ali, Saqib wrote:
> hello,
>
> We have documents that are duplicates i.e. the ID is different, but rest of
> the fields are same. Is there a query that can remove duplicate, and just
> leave one copy of the document on solr? There is one numeric field that we
> can key off for find duplicates.
>
> Please advise.
>
> Thanks
Re: Removing duplicates
> I know that I can use the
> SignatureUpdateProcessorFactory to remove duplicates but I
> would like the duplicates in the index but remove them
> conditionally at query time.
>
> Is there any easy way I could accomplish this?

The closest thing would be to group documents by the signature field.
http://wiki.apache.org/solr/FieldCollapsing
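A hedged sketch of what such a grouped query might look like, using the `group` and `group.field` parameters from the FieldCollapsing page. The base URL, the field name `signature`, the function name, and the `wt=json` parameter are assumptions for illustration.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # hypothetical endpoint


def grouped_query_url(user_query, signature_field, base=SOLR_SELECT):
    """URL that collapses results sharing the same signature value."""
    params = [
        ("q", user_query),
        ("group", "true"),                 # turn on result grouping
        ("group.field", signature_field),  # one group per signature value
        ("group.limit", "1"),              # return one document per group
        ("group.ngroups", "true"),         # also report the number of groups
        ("wt", "json"),                    # assumption: JSON response
    ]
    return base + "?" + urlencode(params)
```

As far as I recall, each group in the response reports how many documents matched it, which gives some indication that other copies exist; check the FieldCollapsing page for the exact response layout.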