Re: Finding out optimal hash ranges for shard split

2015-05-06 Thread Shalin Shekhar Mangar
Nope, there is no way to find that out without actually doing the split. If
you have composite keys then you could also split using the prefix of a
composite id via the split.key parameter.
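For illustration, here is a minimal sketch of how a SPLITSHARD request using split.key might be assembled. The host, the collection name "cars" and the helper name splitshard_url are hypothetical; only the action, collection, shard, ranges and split.key parameter names come from the Collections API. The function only builds the URL and does not send anything.

```python
from urllib.parse import urlencode


def splitshard_url(base_url, collection, shard=None, split_key=None, ranges=None):
    """Build (but do not send) a Collections API SPLITSHARD request URL."""
    params = {"action": "SPLITSHARD", "collection": collection}
    if shard is not None:
        params["shard"] = shard
    if split_key is not None:
        # Prefix of a composite id, including the trailing "!"
        params["split.key"] = split_key
    if ranges is not None:
        # Comma-separated hexadecimal hash ranges
        params["ranges"] = ranges
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection, for illustration only.
url = splitshard_url("http://localhost:8983/solr", "cars", split_key="2013Ford!")
print(url)
```

The "!" is percent-encoded as %21 by urlencode, which is the safe form to put on a query string.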

On Wed, May 6, 2015 at 9:32 AM, anand.mahajan an...@zerebral.co.in wrote:

 Looks like it's not possible to find out the optimal hash ranges for a split
 before you actually split it. So the only way out is to keep splitting out
 the large sub-shards?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Finding-out-optimal-hash-ranges-for-shard-split-tp4203609p4204045.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Regards,
Shalin Shekhar Mangar.


Re: Finding out optimal hash ranges for shard split

2015-05-06 Thread anand.mahajan
Okay - thanks for the confirmation, Shalin. Could this be a feature request
for the Collections API: a SPLITSHARD dry-run call that accepts the
sub-shard count as a request parameter and returns the optimal hash ranges
for that many sub-shards, along with the document count for each of the
sub-shards? Users could then pass these ranges to the actual split.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Finding-out-optimal-hash-ranges-for-shard-split-tp4203609p4204100.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Finding out optimal hash ranges for shard split

2015-05-06 Thread Shalin Shekhar Mangar
Hi Anand,

The nature of the hash function (murmur3) should lead to an approximately
uniform distribution of documents across sub-shards. Have you investigated
why, if at all, the sub-shards are not balanced? Do you use composite keys,
e.g. abc!id1, which could cause the imbalance?

I don't think there is a (cheap) way to implement what you are asking in
the current scheme of things because unless we go through each id and
calculate the hash, we have no way of knowing the optimal distribution.
However, if we were to store the hash of the key as a separate field in the
index then it should be possible to binary search for hash ranges which
lead to approx. equal distribution of docs in sub-shards. We can implement
something like that inside Solr.
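A rough sketch of the binary-search idea, under stated assumptions: count_upto(h) is a hypothetical callback that returns how many documents in the shard have a hash less than or equal to h. With a stored hash field it could be backed by a range query; in this sketch an in-memory sorted list stands in for the index. Formatting the result as unsigned 32-bit hexadecimal is also an assumption about what the ranges parameter expects.

```python
from bisect import bisect_right


def find_cut(count_upto, lo, hi, target):
    """Binary search: smallest h in [lo, hi] with count_upto(h) >= target."""
    while lo < hi:
        mid = (lo + hi) // 2
        if count_upto(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


def balanced_ranges(count_upto, lo, hi, parts):
    """Cut [lo, hi] into `parts` contiguous ranges with near-equal doc counts."""
    total = count_upto(hi)
    ranges, start = [], lo
    for i in range(1, parts):
        if start >= hi:
            raise ValueError("hash range too narrow for this many sub-ranges")
        cut = find_cut(count_upto, start, hi - 1, i * total // parts)
        ranges.append((start, cut))
        start = cut + 1
    ranges.append((start, hi))
    return ranges


def to_hex(ranges):
    """Render signed 32-bit bounds as unsigned hex, e.g. '0-4,5-3e8'."""
    return ",".join(f"{a & 0xFFFFFFFF:x}-{b & 0xFFFFFFFF:x}" for a, b in ranges)


# Stand-in for the index: a sorted list of per-document hash values.
hashes = [1, 2, 3, 4, 100, 200, 300, 400]
count_upto = lambda h: bisect_right(hashes, h)
print(to_hex(balanced_ranges(count_upto, 0, 1000, 2)))
```

Each cut costs a logarithmic number of count lookups, so the approach is only cheap if a count over a hash range is itself cheap, which is why the hash would need to be indexed.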

On Wed, May 6, 2015 at 4:42 PM, anand.mahajan an...@zerebral.co.in wrote:

 Okay - thanks for the confirmation, Shalin. Could this be a feature request
 for the Collections API: a SPLITSHARD dry-run call that accepts the
 sub-shard count as a request parameter and returns the optimal hash ranges
 for that many sub-shards, along with the document count for each of the
 sub-shards? Users could then pass these ranges to the actual split.








-- 
Regards,
Shalin Shekhar Mangar.


Re: Finding out optimal hash ranges for shard split

2015-05-06 Thread anand.mahajan
Yes - I'm using 2-level composite ids, and that has caused the imbalance for
some shards.
It's car data, and the composite ids are of the form year-make!model plus a
couple of other specifications, e.g. 2013Ford!Edge!123456 - but there are
just far too many 2013 or 2011 Ford cars, and they all end up on the same
shards. The ids were designed this way because a few of the search
requirements need these docs co-located, so that queries do not hit all
shards every time. All queries always specify the year and make combination,
which makes it easy to work out the target shard for a query.
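To see how concentrated the ids are, one option is to tally documents by composite-id prefix. This is only a sketch: prefix_counts is a hypothetical helper, the sample ids are made up in the format described above, and in practice the ids would have to be exported from the index first.

```python
from collections import Counter


def prefix_counts(doc_ids, levels=1):
    """Count documents per composite-id prefix (the first `levels` parts)."""
    counts = Counter()
    for doc_id in doc_ids:
        parts = doc_id.split("!")
        # Ids with too few "!" separators have no such prefix: group under None.
        key = "!".join(parts[:levels]) if len(parts) > levels else None
        counts[key] += 1
    return counts


# Made-up ids in the year-make!model!number format described above.
ids = ["2013Ford!Edge!1", "2013Ford!Edge!2", "2013Ford!Focus!3", "2011Audi!A4!4"]
print(prefix_counts(ids).most_common(1))
print(prefix_counts(ids, levels=2).most_common(1))
```

A prefix whose count dwarfs the others is a candidate for splitting out on its own with split.key.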

Regarding storing the hash against each document and then querying to find
out the optimal ranges - could Solr instead maintain incremental counters
for each hash value in the shard's range, so that the Collections API
SPLITSHARD command could use them internally to propose the optimal shard
ranges for the split?
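A sketch of what such counters could look like, with one loud assumption: a counter per individual hash value would be impractical across a 32-bit range, so this version keeps one counter per fixed-width bucket of the shard's range instead. HashHistogram and its methods are hypothetical names, not anything that exists in Solr, and proposed cuts can only fall on bucket boundaries.

```python
class HashHistogram:
    """Incremental per-bucket document counters over one shard's hash range."""

    def __init__(self, lo, hi, buckets=1024):
        self.lo, self.hi = lo, hi
        # Bucket width, rounded up so that `buckets` buckets cover [lo, hi].
        self.width = -(-(hi - lo + 1) // buckets)
        self.counts = [0] * buckets

    def add(self, h, delta=1):
        """Record an added document (delta=1) or a deleted one (delta=-1)."""
        self.counts[(h - self.lo) // self.width] += delta

    def propose(self, parts):
        """Propose up to `parts` contiguous ranges with roughly equal counts."""
        total = sum(self.counts)
        ranges, start, running, k = [], self.lo, 0, 1
        for i, count in enumerate(self.counts):
            running += count
            end = min(self.lo + (i + 1) * self.width - 1, self.hi)
            # Cut at the end of the bucket where the k-th quantile is reached.
            if k < parts and running >= k * total / parts and end < self.hi:
                ranges.append((start, end))
                start, k = end + 1, k + 1
        ranges.append((start, self.hi))
        return ranges


hist = HashHistogram(0, 99, buckets=10)
for h in [1, 2, 3, 4, 55, 65, 75, 85]:
    hist.add(h)
print(hist.propose(2))
```

The trade-off is resolution: a single hot bucket cannot be divided, so heavily skewed prefixes would need more buckets or a second pass over that bucket.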



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Finding-out-optimal-hash-ranges-for-shard-split-tp4203609p4204124.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Finding out optimal hash ranges for shard split

2015-05-05 Thread anand.mahajan
Looks like it's not possible to find out the optimal hash ranges for a split
before you actually split it. So the only way out is to keep splitting out
the large sub-shards?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Finding-out-optimal-hash-ranges-for-shard-split-tp4203609p4204045.html
Sent from the Solr - User mailing list archive at Nabble.com.


Finding out optimal hash ranges for shard split

2015-05-03 Thread anand.mahajan
Hi all,

Before doing a SPLITSHARD, is there a way to figure out the optimal hash
ranges that will split the shard's documents evenly across the new
sub-shards? Something like a dry run of the split shard command with the
ranges parameter specified, which just shows the number of docs that would
reside on each new sub-shard if the command were executed with the given
hash ranges?
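The kind of dry run described here could be approximated outside Solr along these lines, assuming the per-document hash values have already been computed or exported. dry_run_counts and to_signed are hypothetical helpers, and treating the hexadecimal bounds as signed 32-bit integers is an assumption about how the ranges are interpreted.

```python
def to_signed(hex_text):
    """Parse a hex bound such as 'ffffffff' as a signed 32-bit integer."""
    value = int(hex_text, 16)
    return value - (1 << 32) if value >= (1 << 31) else value


def dry_run_counts(hashes, ranges):
    """Count how many hashes fall in each range of a 'lo-hi,lo-hi' hex spec."""
    result = []
    for spec in ranges.split(","):
        lo_text, hi_text = spec.split("-")
        lo, hi = to_signed(lo_text), to_signed(hi_text)
        result.append(sum(1 for h in hashes if lo <= h <= hi))
    return result


# Made-up hash values; real ones would come from the documents' ids.
print(dry_run_counts([1, 2, 3, 4, 100, 200, 300, 400], "0-4,5-3e8"))
```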

Thanks,
Anand



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Finding-out-optimal-hash-ranges-for-shard-split-tp4203609.html
Sent from the Solr - User mailing list archive at Nabble.com.