We are running a six-node SolrCloud cluster with 3 shards and 3 replicas. The
Solr version is 4.0.0.2012.08.06.22.50.47. We use the Python PySolr client to
interact with Solr. Every document we add to Solr has a unique id, so the
index should never contain duplicates.
Our use case is to query the index for a given search term and pull all
documents that match the query. A typical query hits over 40K documents.
While we iterate through all 40K+ documents, after a few iterations we see
the same document ids repeated over and over, and by the end some 20-33% of
the records are duplicates.
In the code snippet below, after some iterations the length of idslist
diverges from the length of idsset. Any insight into how to troubleshoot this
issue is greatly appreciated.

from pysolr import Solr

# Core URL, without the admin UI's '#/collection1' fragment
solr = Solr('http://solrhost/solr/collection1')

if __name__ == '__main__':
    idslist = list()
    idsset = set()
    query = 'snow'
    skip = 0
    limit = 500
    i = 0
    while True:
        response = solr.search(q=query, rows=limit, start=skip,
                               shards='host1:7575/solr,host2:7575/solr,host3:7575/solr',
                               fl='id,source')
        if skip == 0:
            hits = response.hits
            print "Solr Hits Count: (%s)" % hits
        if len(response.docs) == 0:
            break
        for result in response:
            idslist.append(result['id'])
            idsset.add(result['id'])
            if i % 500 == 0:
                print len(idslist), len(idsset)
            i += 1
        skip += limit
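For anyone trying to reproduce or narrow this down, the paging-and-counting
logic can be factored into a small helper that also reports which ids repeat
and how often. This is only a sketch for troubleshooting; the fetch_page
callable is an assumption of this sketch, not part of the PySolr API:

```python
from collections import Counter


def collect_ids(fetch_page, limit=500):
    """Page through a result source and tally duplicate ids.

    fetch_page(start, rows) must return the list of ids in that
    window, or an empty list when results are exhausted. Returns
    (all_ids, duplicates), where duplicates maps each repeated id
    to the number of times it was seen across all pages.
    """
    all_ids = []
    start = 0
    while True:
        page = fetch_page(start, limit)
        if not page:
            break
        all_ids.extend(page)
        start += limit
    counts = Counter(all_ids)
    duplicates = {doc_id: n for doc_id, n in counts.items() if n > 1}
    return all_ids, duplicates
```

With pysolr it could be driven by something like
fetch_page = lambda start, rows: [d['id'] for d in solr.search(q='snow', start=start, rows=rows, fl='id')].
One thing worth checking in parallel: the query above passes no sort
parameter, and paging through a distributed result set without a stable sort
order (for example adding sort='id asc' as a tiebreaker) can return the same
document on more than one page.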




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-records-while-retrieving-documents-tp4039776.html
Sent from the Solr - User mailing list archive at Nabble.com.