We are running a six-node SolrCloud cluster with 3 shards and 3 replicas. The SolrCloud version is 4.0.0.2012.08.06.22.50.47. We use the Python pysolr client to interact with Solr. Every document we add to Solr has a unique id, so the index should never contain duplicates. Our use case is to query the index for a given search term and pull all documents that match; such a query usually hits over 40K documents. While we page through all 40K+ documents, after a few iterations we see the same document ids repeated over and over, and by the end some 20-33% of the records are duplicates. In the code snippet below, after some iterations we see a difference between the lengths of idslist and idsset. Any insight into how to troubleshoot this issue would be greatly appreciated.
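As a small diagnostic sketch (not part of our original script), counting id occurrences with collections.Counter shows exactly which ids repeat and how many times, which can help tell whether the duplicates cluster on particular result pages or shards. The report_duplicates helper name and the sample ids are made up for illustration:

    # Diagnostic sketch: report which ids appear more than once in the
    # accumulated idslist, and how often each one repeats.
    from collections import Counter

    def report_duplicates(ids):
        """Return {id: count} for every id seen more than once."""
        counts = Counter(ids)
        return {doc_id: n for doc_id, n in counts.items() if n > 1}

    # Hypothetical example: 'a' is seen 3 times, 'b' twice, 'c' once.
    dupes = report_duplicates(['a', 'b', 'a', 'c', 'b', 'a'])
    # dupes == {'a': 3, 'b': 2}

Feeding idslist from the script below into report_duplicates after the loop finishes would show whether the repeats are a few ids seen many times or many ids seen twice.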
from pysolr import Solr

solr = Solr('http://solrhost/solr/#/collection1')

if __name__ == '__main__':
    idslist = list()
    idsset = set()
    query = 'snow'
    skip = 0
    limit = 500
    i = 0
    while True:
        response = solr.search(q=query, rows=limit, start=skip,
                               shards='host1:7575/solr,host2:7575/solr,host3:7575/solr',
                               fl="id,source")
        if skip == 0:
            hits = response.hits
            line = "Solr Hits Count: (%s)\n" % (hits)
            print line
        if len(response.docs) == 0:
            break
        for result in response:
            idslist.append(result['id'])
            idsset.add(result['id'])
            if i % 500 == 0:
                print len(idslist), len(idsset)
            i += 1
        skip += limit

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-records-while-retrieving-documents-tp4039776.html
Sent from the Solr - User mailing list archive at Nabble.com.