Also, the function used to generate hashes is org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a 32-bit value. The hash ranges assigned to each shard are stored in ZooKeeper. Since you are using only a single hash component, all 32 bits of the hash are computed from the entire ID field value.
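For reference, here is a rough Python port of the MurmurHash3 x86 32-bit algorithm (the same algorithm behind Hash.murmurhash3_x86_32). This is a sketch for experimenting with ID distribution, not Solr's code; it assumes you hash the UTF-8 bytes of the id with seed 0, so values may not match Solr's exact output byte-for-byte.

```python
def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as an unsigned 32-bit int."""
    c1, c2 = 0xcc9e2d51, 0x1b873593
    length = len(data)
    h = seed & 0xffffffff

    # Process the input four bytes (one little-endian block) at a time.
    rounded_end = length & 0xfffffffc
    for i in range(0, rounded_end, 4):
        k = data[i] | (data[i+1] << 8) | (data[i+2] << 16) | (data[i+3] << 24)
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff   # rotl32(k, 15)
        k = (k * c2) & 0xffffffff
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xffffffff   # rotl32(h, 13)
        h = (h * 5 + 0xe6546b64) & 0xffffffff

    # Handle the 1-3 remaining tail bytes, if any.
    k = 0
    rem = length & 3
    if rem == 3:
        k ^= data[rounded_end + 2] << 16
    if rem >= 2:
        k ^= data[rounded_end + 1] << 8
    if rem >= 1:
        k ^= data[rounded_end]
        k = (k * c1) & 0xffffffff
        k = ((k << 15) | (k >> 17)) & 0xffffffff
        k = (k * c2) & 0xffffffff
        h ^= k

    # Finalization mix: force all bits to avalanche.
    h ^= length
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xffffffff
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xffffffff
    h ^= h >> 16
    return h
```

With this you can hash a batch of your IDs and eyeball how evenly they spread over the 32-bit range.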
I.e., I see no routing delimiter (!) in your example ID value: "possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30". A delimiter isn't required, but without one documents (logs?) will be distributed round-robin over the shards, not grouped by host or environment (if I am reading it right).

You might consider the following: <environment>!<hostname>!UUID, e.g. "intl-staging!possting.mongo-v2.services.com!c2d2a376-5e4a-11e2-8963-0026b9414f30". This way documents from the same host will be grouped together, most likely on the same shard. Further, within the same environment, documents will be grouped on the same subset of shards. This allows client applications to set _route_=<environment>! or _route_=<environment>!<hostname>! and limit queries to the shards containing relevant data when the corresponding filter queries are applied.

If you were using route delimiters, the default for a 2-part key (1 delimiter) is to use 16 bits of the hash for each part. The default for a 3-part key (2 delimiters) is 8 bits each for the first two parts and 16 bits for the third. In any case, the high-order bytes of the hash dominate the distribution of data.

-----Original Message-----
From: Reitzel, Charles
Sent: Tuesday, July 21, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud: Duplicate documents in multiple shards

When are you generating the UUID, exactly? If you set the unique ID field on an "update" and it contains a new UUID, you have effectively created a new document. Just a thought.

-----Original Message-----
From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com]
Sent: Tuesday, July 21, 2015 4:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud: Duplicate documents in multiple shards

We are unable to delete by passing distrib=false as well. Also, it is difficult to identify the duplicate documents among the 130 million.
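The bit allocation described above (16/16 for a 2-part key, 8/8/16 for a 3-part key) can be sketched as follows. Note the hedges: zlib.crc32 is only a stand-in for Solr's MurmurHash3, and the masks are my reading of the compositeId defaults, so this illustrates the grouping behavior rather than reproducing Solr's exact hash values.

```python
import zlib


def composite_hash(doc_id: str) -> int:
    """Combine per-part hashes the way a composite-id router would.

    NOTE: Solr actually uses MurmurHash3 (Hash.murmurhash3_x86_32);
    zlib.crc32 is a stand-in here just to keep the sketch self-contained.
    """
    def h(s: str) -> int:
        return zlib.crc32(s.encode("utf-8")) & 0xffffffff

    parts = doc_id.split("!")
    if len(parts) == 1:
        # No delimiter: the whole 32 bits come from the entire ID.
        return h(parts[0])
    if len(parts) == 2:
        # 2-part key: top 16 bits from part 1, bottom 16 from part 2.
        return (h(parts[0]) & 0xFFFF0000) | (h(parts[1]) & 0x0000FFFF)
    # 3-part key: 8 bits each for the first two parts, 16 for the third.
    return ((h(parts[0]) & 0xFF000000) |
            (h(parts[1]) & 0x00FF0000) |
            (h(parts[2]) & 0x0000FFFF))
```

Because the high-order bits come from the environment and hostname, any two IDs sharing the same "env!host!" prefix land in the same narrow slice of the hash range, i.e. on the same shard (or subset of shards).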
Is there a way we can see the generated hash keys and map them to specific shards?

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cloud-Duplicate-documents-in-multiple-shards-tp4218162p4218317.html
Sent from the Solr - User mailing list archive at Nabble.com.