Re: Is it necessary to set MD5 on rowkey?

Michael Segel Wed, 19 Dec 2012 11:47:08 -0800

Ok, 

Maybe I'm missing something.
Why don't you walk me through the use of a salt example.



On Dec 19, 2012, at 12:37 PM, lars hofhansl <[email protected]> wrote:

> I would disagree here.
> It depends on what you are doing and blanket statements about "this is very, 
> very bad" typically do not help.
> 
> Salting (even round robin) is very nice to distribute write load *and* it 
> gives you a natural way to parallelize scans assuming scans are of reasonable 
> size.
> 
> If the typical use case is point gets then hashing or inverting keys would be 
> preferable. As usual: It depends.
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Michael Segel <[email protected]>
> To: [email protected] 
> Sent: Tuesday, December 18, 2012 3:29 PM
> Subject: Re: Is it necessary to set MD5 on rowkey?
> 
> Alex, 
> And that's the point. A salt, as you explain it, is a number added to the 
> key to ensure a better distribution, and that implies inefficiencies in 
> terms of both scans and gets. 
> 
> Using a hash as either the full key, or taking the hash, truncating it and 
> appending the key may screw up scans, but your get() is intact. 
> 
> There are other options like inverting the numeric key ... 
> 
> And of course doing nothing. 
> 
> Using a salt as part of the design pattern is bad. 
> 
> With respect to the OP, I was discussing the use of hash and some 
> alternatives to how to implement the hash of a key. 
> Again, doing nothing may also make sense, if you understand the risks and 
> you know how your data is going to be used.
> 
> 
> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:
> 
>> Mike,
>> 
>> Please read the *full post* before judging. In particular, the "Hash-based
>> distribution" section. You can find the same in HBaseWD small README file
>> [1] (not sure if you read it at all before commenting on the lib). Round
>> robin is mainly for explaining the concept/idea (though not only for that).
>> 
>> Thank you,
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>> Solr
>> 
>> [1] https://github.com/sematext/HBaseWD
>> 
>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>> <[email protected]> wrote:
>> 
>>> Quick answer...
>>> 
>>> Look at the salt.
>>> It's just a number from a round robin counter.
>>> There is no tie between the salt and row.
>>> 
>>> So when you want to fetch a single row, how do you do it?
>>> ...
>>> ;-)
>>> 
>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[email protected]>
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> @Mike:
>>>> 
>>>> I'm the author of that post :).
>>>> 
>>>> Quick reply to your last comment:
>>>> 
>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>>> idea" in a more specific way than "Fetching data takes more effort"? It
>>>> would be helpful for anyone who is looking into using this approach.
>>>> 
>>>> 2) The approach described in the post also says you can prefix with the
>>>> hash, you probably missed that.
>>>> 
>>>> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
>>>> Please re-read the question: the intention is to distribute the load while
>>>> still being able to do "partial key scans". The blog post linked above
>>>> explains one possible solution for that, while your answer doesn't.
>>>> 
>>>> @bigdata:
>>>> 
>>>> Basically, when it comes to solving two issues, distributing writes and
>>>> being able to read data sequentially, you have to balance between being
>>>> good at both of them. See this very good presentation by Lars:
>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
>>>> (slide 22). You will see how these are correlated. In short:
>>>> * having an md5/other hash prefix on the key does better w.r.t. distributing
>>>> writes, but compromises the ability to do range scans efficiently
>>>> * having a very limited number of 'salt' prefixes still allows range scans
>>>> (less efficiently than normal range scans, of course, but still good
>>>> enough in many cases) while providing worse distribution of writes
>>>> 
>>>> In the latter case, by choosing the number of possible 'salt' prefixes (which
>>>> could be derived from hashed values, etc.) you can balance the efficiency of
>>>> distributing writes against the ability to run fast range scans.
>>>> 
>>>> Hope this helps
>>>> 
>>>> Alex Baranau
>>>> ------
>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>>>> Solr
>>>> 
>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <[email protected]> wrote:
>>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> First, the use of a 'Salt' is a very, very bad idea and I would really
>>>>> hope that the author of that blog takes it down.
>>>>> While it may solve an initial problem in terms of region hot spotting, it
>>>>> creates another problem when it comes to fetching data. Fetching data takes
>>>>> more effort.
>>>>> 
>>>>> With respect to using a hash (MD5 or SHA-1), you are creating a more random
>>>>> key that is unique to the record.  Some would argue that with MD5 or SHA-1
>>>>> you could mathematically have a collision; however, you could then
>>>>> append the key to the hash to guarantee uniqueness. You could also do
>>>>> things like take the hash, truncate it to the first byte, and then
>>>>> append the record key. This should give you enough randomness to avoid hot
>>>>> spotting after the initial region completion and you could pre-split out
>>>>> any number of regions. (First byte 0-255 for values, so you can program the
>>>>> split...
>>>>> 
>>>>> 
>>>>> Having said that... yes, you lose the ability to perform a sequential scan
>>>>> of the data.  At least to a point.  It depends on your schema.
>>>>> 
>>>>> Note that you need to think about how you are primarily going to access
>>>>> the data.  You can then determine the best way to store the data to gain
>>>>> the best performance. For some applications... the region hot spotting
>>>>> isn't an important issue.
>>>>> 
>>>>> Note YMMV
>>>>> 
>>>>> HTH
>>>>> 
>>>>> -Mike
>>>>> 
>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> There is a middle ground between sequential keys (hot spotting risk) and
>>>>>> md5 (heavy scan):
>>>>>> * you can use composed keys with a field that can segregate data
>>>>>> (hostname, productname, metric name) like OpenTSDB
>>>>>> * or use a Salt with a limited number of values (example
>>>>>> substr(md5(rowid),0,1) = 16 values)
>>>>>>   so that a scan is a combination of 16 filters, one on each salt value
>>>>>>   you can base your code on HBaseWD by sematext
>>>>>> 
>>>>>> 
>>>>>>     http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>     https://github.com/sematext/HBaseWD
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> 
>>>>>> 2012/12/18 bigdata <[email protected]>
>>>>>> 
>>>>>>> Many articles tell me that an MD5 rowkey, or an MD5 part of it, is a good
>>>>>>> method to balance the records stored in different parts. But what if I want
>>>>>>> to search some sequential rowkey records, such as with a date as the rowkey
>>>>>>> or part of it? I cannot use a rowkey filter to scan a range of date values
>>>>>>> in one pass when the date is hashed by MD5. How to balance this issue?
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Damien HARDY
>>>>> 
>>>>> 
>>>
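
For readers following the thread, here is a minimal Python sketch of a
hash-derived salt, in the spirit of Damien's substr(md5(rowid),0,1) and the
"prefix with the hash" variant Alex mentions. It is not HBase or HBaseWD
client code. ToyTable is a hypothetical in-memory stand-in for a sorted table,
and NUM_BUCKETS, salt_for, salted_key, scan_ranges, put_row, get_row and
scan_rows are names made up for illustration. The sketch assumes fixed-width
row keys (dates as YYYYMMDD) so that byte order matches date order.

```python
import bisect
import hashlib

NUM_BUCKETS = 16  # assumed number of salt values; pick to suit the cluster


def salt_for(row_key, buckets=NUM_BUCKETS):
    """Derive the salt from the key itself so a reader can recompute it."""
    return hashlib.md5(row_key).digest()[0] % buckets


def salted_key(row_key, buckets=NUM_BUCKETS):
    """Prefix the original key with its one-byte salt."""
    return bytes([salt_for(row_key, buckets)]) + row_key


def scan_ranges(start, stop, buckets=NUM_BUCKETS):
    """One [start, stop) range per bucket; a logical scan fans out over all."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(buckets)]


class ToyTable:
    """Hypothetical sorted key-value store used only for this illustration."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def scan(self, start, stop):
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


def put_row(table, row_key, value):
    table.put(salted_key(row_key), value)


def get_row(table, row_key):
    # Point get stays a single lookup because the salt is recomputable.
    return table.get(salted_key(row_key))


def scan_rows(table, start, stop):
    # Range scan becomes NUM_BUCKETS small scans, merged back into key order.
    hits = []
    for lo, hi in scan_ranges(start, stop):
        hits.extend((k[1:], v) for k, v in table.scan(lo, hi))
    return sorted(hits)
```

With a salt taken from a round-robin counter instead, salt_for could not be
recomputed from the key, so get_row would have to probe every bucket. That is
the cost Mike points at, and it is the reason the hash-derived variant keeps
get() intact while scans pay the fan-out.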

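Mike's remark about pre-splitting on the first byte of a truncated hash can be
sketched the same way. This is an illustration only: split_points,
prefixed_key and region_for are hypothetical helper names, the evenly spaced
boundaries are an assumption, and nothing here calls the HBase admin API.

```python
import bisect
import hashlib


def split_points(num_regions):
    """Evenly spaced one-byte split keys over the 0-255 prefix range."""
    if not 1 <= num_regions <= 256:
        raise ValueError("num_regions must be between 1 and 256")
    return [bytes([(i * 256) // num_regions]) for i in range(1, num_regions)]


def prefixed_key(row_key):
    """First byte of the MD5 digest, followed by the original record key."""
    return hashlib.md5(row_key).digest()[:1] + row_key


def region_for(key, splits):
    """Index of the region whose key range contains the given key."""
    return bisect.bisect_right(splits, key[:1])
```

For example, split_points(4) yields the boundaries 0x40, 0x80 and 0xC0, so
keys whose hash byte falls in 0x00-0x3F land in the first region, and so on.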