Removing duplicates during a query

2013-08-22 Thread Dan Davis
Suppose I have two documents with different ids, and there is another field,
for instance content-hash, which is something like a 16-byte hash of the
content.

Can Solr be configured to return just one copy, and drop the other if both
are relevant?

If Solr does drop one result, do you get any indication in the document
that was kept that there was another copy?


Re: Removing duplicates during a query

2013-08-22 Thread Dan Davis
OK - I see that this can be done with Field Collapsing/Grouping.  I also
see the mentions in the Wiki for avoiding duplicates using a 16-byte hash.

So, question withdrawn...
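
For readers landing on this thread, a minimal sketch of what the Field
Collapsing/Grouping approach could look like from a client. The field name
content_hash, the base URL, and both function names are assumptions made for
illustration; only the group, group.field and group.limit request parameters
and the grouped JSON response layout come from Solr's result grouping feature.
The numFound value of each group is what indicates that other copies existed.

```python
from urllib.parse import urlencode


def build_grouped_query(base_url, query, hash_field):
    """Build a select URL that collapses results sharing the same hash value."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("group", "true"),            # turn on result grouping
        ("group.field", hash_field),  # one group per distinct hash value
        ("group.limit", "1"),         # return only one document per group
    ]
    return base_url + "?" + urlencode(params)


def copies_per_result(response, hash_field):
    """Return (kept_document, total_copies) pairs from a grouped JSON response."""
    groups = response["grouped"][hash_field]["groups"]
    return [(g["doclist"]["docs"][0], g["doclist"]["numFound"]) for g in groups]


if __name__ == "__main__":
    # Hypothetical host and field name, for illustration only.
    print(build_grouped_query("http://localhost:8983/solr/select",
                              "some text", "content_hash"))
```

A total_copies value greater than 1 means the kept document had duplicates
that were collapsed out of the result list.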






removing duplicates

2013-08-21 Thread Ali, Saqib
Hello,

We have documents that are duplicates, i.e. the ID is different but the rest
of the fields are the same. Is there a query that can remove the duplicates
and leave just one copy of each document in Solr? There is one numeric field
that we can key off to find duplicates.

Please advise.

Thanks


Re: removing duplicates

2013-08-21 Thread Aloke Ghoshal
Hi,

Facet by one of the duplicate fields (probably by the numeric field that
you mentioned) and set facet.mincount=2.

Regards,
Aloke





RE: removing duplicates

2013-08-21 Thread Petersen, Robert
Hi

Perhaps you could query for all documents, asking for only the id field to be
returned, and facet on the field you say you can key off for duplicates. Set
facet.mincount to 2. Then you would have to filter on each facet value, page
through all doc IDs for that value (skipping the first document), and delete
by ID using a small app or something like that. Send all the deletes to the
index and then do a single commit at the end. I think that would do it.
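
The "skip the first document, delete the rest" step can be sketched roughly as
follows. This is an illustration under assumptions, not a tested tool: the
function names and the ids_by_value input (a mapping from each duplicated
field value to the list of doc IDs that share it) are hypothetical, and
fetching those IDs and posting the messages to Solr's update handler is left
out.

```python
import xml.etree.ElementTree as ET


def ids_to_delete(ids_by_value):
    """For each duplicated value keep the first ID and return all the others."""
    doomed = []
    for ids in ids_by_value.values():
        doomed.extend(ids[1:])  # skip the first document, delete the rest
    return doomed


def delete_message(doc_ids):
    """Build an XML <delete> update message listing each ID to remove."""
    root = ET.Element("delete")
    for doc_id in doc_ids:
        ET.SubElement(root, "id").text = str(doc_id)  # text is XML-escaped
    return ET.tostring(root, encoding="unicode")


COMMIT_MESSAGE = "<commit/>"  # sent once, after all deletes have been posted

if __name__ == "__main__":
    # Hypothetical data: two duplicated values and the IDs that carry them.
    duplicates = {"42": ["a", "b", "c"], "7": ["x", "y"]}
    print(delete_message(ids_to_delete(duplicates)))
    print(COMMIT_MESSAGE)
```

Which copy counts as "the first" depends on the order in which the IDs were
fetched, so it is worth deciding on a sort order before deleting anything.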

Thanks
Robi




Re: removing duplicates

2013-08-21 Thread Ali, Saqib
Thanks Aloke and Robert. Can you please give me code/query snippets?
(newbie here)





Re: removing duplicates

2013-08-21 Thread Aloke Ghoshal
Hi,

This will help you identify the duplicates:
q=*:*&fl=id&facet=true&facet.mincount=2&rows=0&facet.field=One_Of_The_Duplicated_Fields

To actually remove them from Solr, you will have to do something like
Robert suggested. Write an application that uses the results to build a
delete by id query (
http://wiki.apache.org/solr/UpdateXmlMessages#A.22delete.22_documents_by_ID_and_by_Query
).
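
As a rough sketch of the identification step, the facet query can be built and
its result read like this. The function names, the wt=json parameter, and the
example field name are assumptions added for illustration; facet.limit=-1 is
added so that all values are returned rather than only the default top 100.
The parsing assumes Solr's default JSON layout, in which each facet field is a
flat list of alternating values and counts.

```python
from urllib.parse import urlencode


def build_duplicate_facet_query(base_url, field):
    """Build a select URL that facets on `field` and hides values seen once."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),             # only the facet counts are needed
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", "2"),   # keep only values shared by 2+ documents
        ("facet.limit", "-1"),     # return every value, not just the top 100
    ]
    return base_url + "?" + urlencode(params)


def duplicated_values(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    # Hypothetical host and field name, for illustration only.
    print(build_duplicate_facet_query("http://localhost:8983/solr/select",
                                      "price"))
```

Each key of the returned dict is a value that occurs in more than one
document; the count tells you how many copies there are.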

Regards,
Aloke





RE: removing duplicates

2013-08-21 Thread Petersen, Robert
This would describe the facet parameters we're talking about:

http://wiki.apache.org/solr/SimpleFacetParameters

Query something like this:
http://localhost:8983/solr/select?q=*:*&fl=id&rows=0&facet=true&facet.limit=-1&facet.field=<your field name>&facet.mincount=2

Then filter on each facet returned with a filter query described here: 
http://wiki.apache.org/solr/CommonQueryParameters
Example: q=*:*&fq=<your field name>:<your field value>

Then you would have to get all ids returned and delete all but the first one 
using some app...
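
A minimal sketch of the filter-and-page step, for illustration only. The
function names, the page size, and the wt=json parameter are assumptions; the
field value is inserted into the fq as-is, which is fine for a plain numeric
field but would need escaping or quoting for values with special characters.

```python
from urllib.parse import urlencode


def build_filter_page_url(base_url, field, value, start=0, rows=100):
    """Build a select URL returning one page of IDs for a single facet value."""
    params = [
        ("q", "*:*"),
        ("fq", f"{field}:{value}"),  # restrict to one duplicated value
        ("fl", "id"),                # only the IDs are needed
        ("wt", "json"),
        ("start", str(start)),
        ("rows", str(rows)),
    ]
    return base_url + "?" + urlencode(params)


def page_starts(num_found, rows=100):
    """Return the start offsets needed to page through num_found documents."""
    return list(range(0, num_found, rows))


if __name__ == "__main__":
    # Hypothetical host, field and value, for illustration only.
    for start in page_starts(250, rows=100):
        print(build_filter_page_url("http://localhost:8983/solr/select",
                                    "price", 42, start=start, rows=100))
```

The facet count for the value gives num_found, so the number of pages is known
before the first filter query is sent.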

Thanks 
Robi





Re: removing duplicates

2013-08-21 Thread Liu
This picture is extracted from apache-solr-ref-guide-4.4.pdf; maybe it will
help you.
You can download the document from
https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/



Re: Removing duplicates

2011-02-19 Thread Ahmet Arslan
 I know that I can use the
 SignatureUpdateProcessorFactory to remove duplicates, but I
 would like to keep the duplicates in the index and remove them
 conditionally at query time.
 
 Is there any easy way I could accomplish this?


The closest thing would be to group documents by the signature field.
http://wiki.apache.org/solr/FieldCollapsing


  


Removing duplicates

2011-02-18 Thread Mark
I know that I can use the SignatureUpdateProcessorFactory to remove
duplicates, but I would like to keep the duplicates in the index and remove
them conditionally at query time.


Is there any easy way I could accomplish this?