Hi Simon,

I might be wrong, but I'm pretty sure the splits file you specify is assumed to 
be full of strings.  So even though your split keys look like bytes, they're 
being interpreted as the literal string (the four characters '\x00') instead 
of the actual byte 0x00.  The only way I could get the byte representation of 
ints (in my case) to be used for pre-splitting was to do it programmatically.
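
For what it's worth, here is a minimal sketch of what "programmatically" could 
look like for your 16-byte UUID keys. It is plain Java with no HBase 
dependency; the class and method names (UuidSplits, splitKeys, toKey, 
regionFor) are made up for illustration, and it assumes you split on the 
first key byte only. The resulting byte[][] is the kind of argument that 
HBaseAdmin.createTable(HTableDescriptor, byte[][]) accepts as split keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper, not part of HBase: builds real byte split keys
// instead of the escaped text that ends up in a splits file.
final class UuidSplits {

    // Returns numRegions - 1 single-byte split keys, evenly spaced over 0x00..0xFF.
    static byte[][] splitKeys(int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return splits;
    }

    // Encodes a UUID as 16 big-endian bytes: most significant bits first.
    static byte[] toKey(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Index of the region a key falls into, comparing the first byte as
    // unsigned (HBase orders row keys as unsigned bytes).
    static int regionFor(byte[] key, byte[][] splits) {
        int first = key[0] & 0xFF;
        int region = 0;
        for (byte[] split : splits) {
            if ((split[0] & 0xFF) <= first) {
                region++;
            }
        }
        return region;
    }

    public static void main(String[] args) {
        byte[][] splits = splitKeys(32);
        // With real HBase you would pass 'splits' to
        // HBaseAdmin.createTable(tableDescriptor, splits).
        System.out.println("split keys: " + splits.length);

        // The escaped text \x00 is four ASCII characters, not one zero byte.
        byte[] asText = "\\x00".getBytes(StandardCharsets.US_ASCII);
        System.out.println("text form length: " + asText.length);

        int region = regionFor(toKey(UUID.randomUUID()), splits);
        System.out.println("random UUID lands in region " + region);
    }
}
```

Splitting on the first byte alone should be enough here, since the leading 
byte of a UUID.randomUUID() value is random.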

Hope that helps,
Oliver

On 2012-06-12, at 11:41 AM, Simon Kelly wrote:

> Yes, I'm aware that UUIDs are designed to be unique and not evenly
> distributed, but I wouldn't expect a big gap in their distribution either.
> 
> The other thing that is really confusing me is that the region splits
> aren't lexicographically sorted. Perhaps there is a problem with the way I'm
> specifying the splits in the split file. I haven't been able to find any
> docs on what format the split keys should be in, so I've used what's
> produced by Bytes.toStringBinary. Is that correct?
> 
> Simon
> 
> On 12 June 2012 10:23, Michael Segel <michael_se...@hotmail.com> wrote:
> 
>> UUIDs are unique but not necessarily random and even in random samplings,
>> you may not see an even distribution except over time.
>> 
>> 
>> Sent from my iPhone
>> 
>> On Jun 12, 2012, at 3:18 AM, "Simon Kelly" <simongdke...@gmail.com> wrote:
>> 
>>> Hi
>>> 
>>> I'm getting some unexpected results with a pre-split table where some of
>>> the regions are not getting any data.
>>> 
>>> The table keys are UUIDs (generated using Java's UUID.randomUUID()), which
>>> I'm storing as a byte[16]:
>>> 
>>>   key[0-7] = uuid most significant bits
>>>   key[8-15] = uuid least significant bits
>>> 
>>> The table is created via the shell as follows:
>>> 
>>>   create 'table', {NAME => 'cf'}, {SPLITS_FILE => 'splits.txt'}
>>> 
>>> The splits.txt is generated using the code here:
>>> http://pastebin.com/DAExXMDz which generates 32 regions split between x00
>>> and xFF. I have also tried with 16-byte region keys (x00x00... to
>>> xFFxFF...).
>>> 
>>> As far as I understand, this should distribute the rows evenly across the
>>> regions, but I'm getting a bunch of regions with no rows. I'm also confused
>>> as to the ordering of the regions, since it seems the start and end keys
>>> aren't really matching up correctly. You can see the regions and the
>>> requests they are getting here: http://pastebin.com/B4771g5X
>>> 
>>> Thanks in advance for the help.
>>> Simon
>> 


--
Oliver Meyn
Software Developer
Global Biodiversity Information Facility (GBIF)
+45 35 32 15 12
http://www.gbif.org
