RE: Solr Cloud: Duplicate documents in multiple shards

Reitzel, Charles Tue, 21 Jul 2015 07:02:10 -0700

Also, the function used to generate hashes is 
org.apache.solr.common.util.Hash.murmurhash3_x86_32(), which produces a 32-bit 
value.   The hash ranges assigned to each shard are stored in ZooKeeper.   
Since your ID has only a single hash component, all 32 bits of the hash are 
derived from the entire ID field value.


I.e. I see no routing delimiter (!) in your example ID value:

"possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30"

A delimiter isn't required, but without one, documents (logs?) will be spread 
pseudo-randomly, and roughly evenly, over the shards by hash.  They will not be 
grouped by host or environment (if I am reading it right).
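
For what it's worth, the hash-to-shard mapping can be checked offline.  Below 
is a minimal Python sketch, not Solr code: it re-implements the public 
MurmurHash3 x86 32-bit algorithm with seed 0 (which I believe is what the 
compositeId router passes), hashes the UTF-8 bytes of the ID, and looks up the 
signed result in a table of shard ranges.  The `ranges` dictionary is a 
hypothetical two-shard layout; substitute the range values from your own 
cluster state in ZooKeeper.

```python
def murmurhash3_x86_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit, returned as a signed 32-bit int (like a Java int)."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    rounded = n & ~3  # length of the part made of whole 4-byte blocks
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left 15
        k = (k * c2) & mask
        h ^= k
        h = ((h << 13) | (h >> 19)) & mask  # rotate left 13
        h = (h * 5 + 0xE6546B64) & mask
    tail = data[rounded:]
    if tail:  # 1 to 3 leftover bytes
        k = int.from_bytes(tail, "little")
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        k = (k * c2) & mask
        h ^= k
    # Finalization mix
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h - (1 << 32) if h & 0x80000000 else h


def to_signed(hex_str: str) -> int:
    """Parse a hex bound such as '80000000' as a signed 32-bit int."""
    v = int(hex_str, 16)
    return v - (1 << 32) if v & 0x80000000 else v


def shard_for(doc_id: str, ranges: dict):
    """Return (shard_name, hash) for an ID that contains no '!' delimiter."""
    h = murmurhash3_x86_32(doc_id.encode("utf-8"))
    for name, rng in ranges.items():
        lo, hi = (to_signed(part) for part in rng.split("-"))
        if lo <= h <= hi:
            return name, h
    return None, h


# Hypothetical two-shard layout; replace with the ranges from your cluster state.
ranges = {"shard1": "80000000-ffffffff", "shard2": "0-7fffffff"}
doc_id = "possting.mongo-v2.services.com-intl-staging-c2d2a376-5e4a-11e2-8963-0026b9414f30"
name, h = shard_for(doc_id, ranges)
print(name, format(h & 0xFFFFFFFF, "08x"))
```

If a document with a given ID shows up on a shard other than the one this 
reports, that would be worth investigating.  The sketch only applies to IDs 
without a routing delimiter.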

You might consider the following:  <environment>!<hostname>!<UUID>

E.g. 
"intl-staging!possting.mongo-v2.services.com!c2d2a376-5e4a-11e2-8963-0026b9414f30"

This way documents from the same host will be grouped together, most likely on 
the same shard.  Further, within the same environment, documents will be 
grouped on the same subset of shards. This will allow client applications to 
set _route_=<environment>!  or _route_=<environment>!<hostname>! and limit 
queries to those shards containing relevant data when the corresponding filter 
queries are applied.
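
As a rough illustration, a client could build the routed query string like 
this.  It is only a sketch: the field names `environment` and `hostname` used 
in the filter queries are assumptions about your schema, and the function name 
is made up for the example.

```python
from urllib.parse import parse_qs, urlencode


def routed_query(environment, hostname=None, q="*:*"):
    """Build a query string that sets _route_ alongside matching filter queries."""
    route = environment + "!"
    # Field names below are hypothetical; use the ones from your schema.
    filters = ['environment:"%s"' % environment]
    if hostname is not None:
        route += hostname + "!"
        filters.append('hostname:"%s"' % hostname)
    params = [("q", q), ("_route_", route)]
    params += [("fq", fq) for fq in filters]
    return urlencode(params)


qs = routed_query("intl-staging", "possting.mongo-v2.services.com")
print(qs)
print(parse_qs(qs)["_route_"])
```

The resulting string would be appended to the collection's select URL.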

If you were using route delimiters, then the default for a 2-part key (1 
delimiter) is to use 16 bits for each part.  The default for a 3-part key (2 
delimiters) is to use 8 bits each for the first two parts and 16 bits for the 
third part.   In any case, the high-order bits of the hash dominate the 
distribution of data.
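
The bit layout described above can be sketched in a few lines of Python.  This 
is my reading of the default behavior, not code taken from Solr: the inputs 
are the 32-bit hashes of the individual key parts (as unsigned integers), and 
the function names are invented for the illustration.

```python
MASK32 = 0xFFFFFFFF


def compose_two_level(h_key, h_id):
    """2-part key: top 16 bits from the first part, low 16 bits from the second."""
    return ((h_key & 0xFFFF0000) | (h_id & 0x0000FFFF)) & MASK32


def compose_three_level(h1, h2, h3):
    """3-part key: 8 bits from part 1, 8 bits from part 2, 16 bits from part 3."""
    return ((h1 & 0xFF000000) | (h2 & 0x00FF0000) | (h3 & 0x0000FFFF)) & MASK32


def prefix_range(h, bits):
    """Lowest and highest 32-bit values that share the top `bits` bits of h."""
    low_mask = MASK32 >> bits
    lower = h & ~low_mask & MASK32
    return lower, lower | low_mask


# Made-up part hashes, chosen so the masking is easy to follow by eye.
h_env, h_host, h_uuid = 0xAABBCCDD, 0x11223344, 0x55667788
combined = compose_three_level(h_env, h_host, h_uuid)
print(format(combined, "08x"))                               # aa227788
print([format(v, "08x") for v in prefix_range(combined, 8)])   # same environment
print([format(v, "08x") for v in prefix_range(combined, 16)])  # same env and host
```

Every document sharing the first part falls inside the 8-bit prefix range, and 
every document sharing the first two parts falls inside the narrower 16-bit 
range, which is why the leading parts decide where the data ends up.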

-----Original Message-----
From: Reitzel, Charles 
Sent: Tuesday, July 21, 2015 9:55 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr Cloud: Duplicate documents in multiple shards

When are you generating the UUID exactly?   If you set the unique ID field on 
an "update", and it contains a new UUID, you have effectively created a new 
document.   Just a thought.

-----Original Message-----
From: mesenthil1 [mailto:senthilkumar.arumu...@viacomcontractor.com] 
Sent: Tuesday, July 21, 2015 4:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cloud: Duplicate documents in multiple shards

I am unable to delete them even when passing distrib=false. It is also 
difficult to identify the duplicate documents among the 130 million. 

Is there a way to see the generated hash key and map it to a specific shard?



