Re: Finding out optimal hash ranges for shard split
Nope, there is no way to find that out without actually doing the split. If you have composite keys, then you could also split using the prefix of a composite id via the split.key parameter.

On Wed, May 6, 2015 at 9:32 AM, anand.mahajan an...@zerebral.co.in wrote:
> Looks like it's not possible to find out the optimal hash ranges for a split before you actually split it. So the only way out is to keep splitting the large sub-shards?
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Finding-out-optimal-hash-ranges-for-shard-split-tp4203609p4204045.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Regards,
Shalin Shekhar Mangar.
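[Editor's note: for readers unfamiliar with split.key, such a request can be formed as below. This is a minimal sketch only: the localhost URL and the cars collection are assumptions, while action, collection, and split.key are the actual Collections API parameter names. When split.key is given, the target shard is inferred from the key's hash.]

```python
from urllib.parse import urlencode

def build_split_by_key_url(base_url, collection, split_key):
    """Form a Collections API SPLITSHARD request that splits on a
    composite-id prefix. With split.key, Solr infers the target shard
    from the key's hash, so no shard parameter is passed."""
    params = urlencode({
        "action": "SPLITSHARD",
        "collection": collection,
        "split.key": split_key,   # e.g. "2013Ford!" - the '!' ends the routing prefix
    })
    return f"{base_url}/admin/collections?{params}"

# Hypothetical cluster and collection, for illustration only:
url = build_split_by_key_url("http://localhost:8983/solr", "cars", "2013Ford!")
print(url)
```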
Re: Finding out optimal hash ranges for shard split
Okay - thanks for the confirmation, Shalin.

Could this be a feature request for the Collections API: a split-shard dry-run API that accepts a sub-shard count as a request param and returns the optimal shard ranges for the requested number of sub-shards, along with the respective document counts for each sub-shard? Users could then use these shard ranges for the actual split.
Re: Finding out optimal hash ranges for shard split
Hi Anand,

The nature of the hash function (MurmurHash3) should lead to an approximately uniform distribution of documents across sub-shards. Have you investigated why, if at all, the sub-shards are not balanced? Do you use composite keys, e.g. abc!id1, which could cause the imbalance?

I don't think there is a (cheap) way to implement what you are asking in the current scheme of things, because unless we go through each id and calculate its hash, we have no way of knowing the optimal distribution. However, if we were to store the hash of the key as a separate field in the index, then it should be possible to binary search for hash ranges that lead to an approximately equal distribution of docs across sub-shards. We could implement something like that inside Solr.

On Wed, May 6, 2015 at 4:42 PM, anand.mahajan an...@zerebral.co.in wrote:
> Okay - thanks for the confirmation, Shalin. Could this be a feature request for the Collections API: a split-shard dry-run API that accepts a sub-shard count as a request param and returns the optimal shard ranges for the requested number of sub-shards, along with the respective document counts for each sub-shard? Users could then use these shard ranges for the actual split.

--
Regards,
Shalin Shekhar Mangar.
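[Editor's note: the idea above can be sketched outside Solr. If each document's 32-bit route hash were stored and exportable, equal-count split points are simply quantiles of the sorted hash values. A toy illustration with made-up hash values (not real MurmurHash3 output), assuming the whole hash list fits in memory:]

```python
def propose_ranges(hashes, num_subshards, range_start, range_end):
    """Sort the stored route hashes and cut at equal-count quantiles,
    returning (start, end) hash ranges for each proposed sub-shard."""
    hs = sorted(hashes)
    n = len(hs)
    # Each cut is the hash value at the i-th equal-count quantile.
    cuts = [hs[(i * n) // num_subshards] for i in range(1, num_subshards)]
    starts = [range_start] + cuts
    ends = [c - 1 for c in cuts] + [range_end]
    return list(zip(starts, ends))

# Skewed data: most docs hash low, a few hash high. The proposed cut
# still puts 5 docs on each side, unlike a plain midpoint split.
docs = [1, 2, 3, 4, 5, 6, 7, 8, 100, 200]
print(propose_ranges(docs, 2, 0, 255))  # → [(0, 5), (6, 255)]
```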
Re: Finding out optimal hash ranges for shard split
Yes - I'm using 2-level composite ids, and that has caused the imbalance for some shards. It's car data, and the composite ids are of the form year-make!model-and-a-couple-of-other-specifications, e.g. 2013Ford!Edge!123456 - but there are just far too many 2013 or 2011 Ford cars, and they all occupy the same shards. This was done because co-location of these docs is required for a few of the search requirements - to avoid hitting all shards all the time. All queries always specify the year and make combination, so it's easy to work out the target shard for a query.

Regarding storing the hash against each document and then querying to find the optimal ranges - could it be done so that Solr maintains incremental counters for each hash in the shard's range, and the Collections API SPLITSHARD command then uses these internally to propose the optimal shard ranges for the split?
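[Editor's note: the incremental-counter idea could look roughly like this. A sketch only - the class name is invented, a fixed bucket count stands in for per-hash counters, and nothing like this exists inside Solr; it just shows how cumulative bucket counts would yield a balanced split point.]

```python
class HashHistogram:
    """Maintain per-bucket doc counters over a shard's hash range and
    derive a split point from the cumulative counts."""

    def __init__(self, range_start, range_end, num_buckets=1024):
        self.start = range_start
        self.end = range_end
        self.width = (range_end - range_start + 1) / num_buckets
        self.counts = [0] * num_buckets

    def add(self, doc_hash):
        """Called once per indexed document with its route hash."""
        idx = int((doc_hash - self.start) / self.width)
        self.counts[min(idx, len(self.counts) - 1)] += 1

    def split_point(self):
        """Return the bucket boundary where half the documents fall below."""
        total = sum(self.counts)
        running = 0
        for i, c in enumerate(self.counts):
            running += c
            if running >= total / 2:
                return self.start + int((i + 1) * self.width) - 1
        return self.end

# Skewed shard: 10 docs hash to 0, only 2 hash to 600. The proposed
# split point lands near the low end, where the documents actually are.
h = HashHistogram(0, 1023, num_buckets=4)
for _ in range(10):
    h.add(0)
h.add(600)
h.add(600)
print(h.split_point())  # → 255
```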
Re: Finding out optimal hash ranges for shard split
Looks like it's not possible to find out the optimal hash ranges for a split before you actually split it. So the only way out is to keep splitting the large sub-shards?
Finding out optimal hash ranges for shard split
Hi all,

Before doing a SPLITSHARD - is there a way to figure out the optimal hash ranges for a shard, i.e. ranges that would evenly split the documents across the new sub-shards that get created? Sort of a dry run for the actual split-shard command, with the ranges parameter specified, that just shows the number of docs that would reside on each new sub-shard if the split were executed with a given hash range?

Thanks,
Anand
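[Editor's note: for context on why the default split can come out uneven - a SPLITSHARD without an explicit ranges parameter divides the parent shard's hash range into equal halves at the midpoint, regardless of where the documents actually fall. A sketch of that midpoint rule (the helper function is illustrative; Solr internally uses signed 32-bit range arithmetic):]

```python
def halve_range(range_start, range_end):
    """Split a hash range into two sub-ranges at the midpoint, the way a
    default (rangeless) SPLITSHARD divides the parent shard's range."""
    mid = (range_start + range_end) // 2
    return (range_start, mid), (mid + 1, range_end)

# Splitting the upper half of a 32-bit hash ring:
lo, hi = halve_range(0x80000000, 0xFFFFFFFF)
print([hex(b) for b in lo + hi])  # → ['0x80000000', '0xbfffffff', '0xc0000000', '0xffffffff']
```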