I think you missed the point. 
You seem to think that salting is ok. 
I want you to walk through an example so that we can discuss it. ;-)


On Dec 19, 2012, at 2:51 PM, lars hofhansl <lhofha...@yahoo.com> wrote:

> Doesn't Alex's blog post do that?
> 
> 
> 
> 
> ________________________________
> From: Michael Segel <michael_se...@hotmail.com>
> To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com> 
> Sent: Wednesday, December 19, 2012 11:46 AM
> Subject: Re: Is it necessary to set MD5 on rowkey?
> 
> Ok, 
> 
> Maybe I'm missing something.
> Why don't you walk me through an example of using a salt?
> 
> 
> On Dec 19, 2012, at 12:37 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
> 
>> I would disagree here.
>> It depends on what you are doing, and blanket statements like "this is very, 
>> very bad" typically do not help.
>> 
>> Salting (even round robin) is very nice for distributing write load, *and* it 
>> gives you a natural way to parallelize scans, assuming the scans are of 
>> reasonable size.
>> 
>> If the typical use case is point gets then hashing or inverting keys would 
>> be preferable. As usual: It depends.
>> 
>> -- Lars
>> 
>> 
>> 
>> ________________________________
>> From: Michael Segel <michael_se...@hotmail.com>
>> To: user@hbase.apache.org 
>> Sent: Tuesday, December 18, 2012 3:29 PM
>> Subject: Re: Is it necessary to set MD5 on rowkey?
>> 
>> Alex, 
>> And that's the point. A salt, as you explain it, is a number added to the key 
>> purely to get a better distribution, and that means you take on inefficiencies 
>> in terms of scans and gets. 
>> 
>> Using a hash as the full key, or truncating the hash and appending the original 
>> key to it, may screw up scans, but your get() stays intact. 
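>> 
>> For illustration, a minimal sketch of that hash-prefix idea (hypothetical key 
>> and method names; java.security.MessageDigest and the HBase Bytes utility 
>> assumed): 
>> 
>>   import java.security.MessageDigest;
>>   import org.apache.hadoop.hbase.util.Bytes;
>> 
>>   // Row key = first byte of MD5(userKey) + userKey. The prefix is derived
>>   // from the key itself, so a point read can recompute it and issue a
>>   // single Get; no extra lookups are needed.
>>   public static byte[] hashPrefixedKey(String userKey) throws Exception {
>>     byte[] raw = Bytes.toBytes(userKey);
>>     byte[] md5 = MessageDigest.getInstance("MD5").digest(raw);
>>     return Bytes.add(new byte[] { md5[0] }, raw);
>>   }
>> 
>>   // Point read stays a single Get:
>>   //   Get get = new Get(hashPrefixedKey("order-20121218-0001"));
>>   //   Result r = table.get(get);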
>> 
>> There are other options like inverting the numeric key ... 
>> 
>> And of course doing nothing. 
>> 
>> Using a salt as part of the design pattern is bad. 
>> 
>> With respect to the OP, I was discussing the use of a hash and some 
>> alternatives for how to implement the hash of a key. 
>> Again, doing nothing may also make sense, if you understand the risks 
>> and you know how your data is going to be used.
>> 
>> 
>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <alex.barano...@gmail.com> wrote:
>> 
>>> Mike,
>>> 
>>> Please read the *full post* before judging. In particular, the "Hash-based
>>> distribution" section. You can find the same in the HBaseWD small README file
>>> [1] (not sure if you read it at all before commenting on the lib). Round
>>> robin is mainly for explaining the concept/idea (though not only for that).
>>> 
>>> Thank you,
>>> Alex Baranau
>>> ------
>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>>> Solr
>>> 
>>> [1] https://github.com/sematext/HBaseWD
>>> 
>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>>> <michael_se...@hotmail.com> wrote:
>>> 
>>>> Quick answer...
>>>> 
>>>> Look at the salt.
>>>> It's just a number from a round-robin counter.
>>>> There is no tie between the salt and the row.
>>>> 
>>>> So when you want to fetch a single row, how do you do it?
>>>> ...
>>>> ;-)
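>>>> 
>>>> To make that concrete, a rough sketch (hypothetical names; standard HTable/Get
>>>> client API assumed): with a purely round-robin salt the prefix cannot be
>>>> derived from the key, so one logical read turns into one Get per salt bucket.
>>>> 
>>>>   import java.io.IOException;
>>>>   import java.util.ArrayList;
>>>>   import java.util.List;
>>>>   import org.apache.hadoop.hbase.client.Get;
>>>>   import org.apache.hadoop.hbase.client.HTable;
>>>>   import org.apache.hadoop.hbase.client.Result;
>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>> 
>>>>   // With N round-robin buckets there is no way to know which prefix a
>>>>   // given key was written under, so every bucket has to be probed.
>>>>   public static Result fetchSaltedRow(HTable table, String key, int buckets)
>>>>       throws IOException {
>>>>     List<Get> gets = new ArrayList<Get>();
>>>>     for (int b = 0; b < buckets; b++) {
>>>>       gets.add(new Get(Bytes.add(new byte[] { (byte) b }, Bytes.toBytes(key))));
>>>>     }
>>>>     for (Result r : table.get(gets)) {  // N reads for one logical row
>>>>       if (r != null && !r.isEmpty()) {
>>>>         return r;
>>>>       }
>>>>     }
>>>>     return null;
>>>>   }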
>>>> 
>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <alex.barano...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> @Mike:
>>>>> 
>>>>> I'm the author of that post :).
>>>>> 
>>>>> Quick reply to your last comment:
>>>>> 
>>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>>>> idea" in a more specific way than "Fetching data takes more effort"? It
>>>>> would be helpful for anyone who is looking into using this approach.
>>>>> 
>>>>> 2) The approach described in the post also says you can prefix with the
>>>>> hash; you probably missed that.
>>>>> 
>>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata.
>>>>> Please re-read the question: the intention is to distribute the load while
>>>>> still being able to do "partial key scans". The blog post linked above
>>>>> explains one possible solution for that, while your answer doesn't.
>>>>> 
>>>>> @bigdata:
>>>>> 
>>>>> Basically, when it comes to solving the two issues of distributing writes
>>>>> and being able to read data sequentially, you have to balance how good you
>>>>> are at each of them. There is a very good presentation by Lars (see slide 22):
>>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
>>>>> You will see how the two are correlated. In short:
>>>>> * an md5 (or other hash) prefix on the key does better at distributing
>>>>> writes, while compromising the ability to do range scans efficiently
>>>>> * a very limited number of 'salt' prefixes still allows range scans (less
>>>>> efficiently than normal range scans, of course, but still good enough in
>>>>> many cases) while providing a worse distribution of writes
>>>>> 
>>>>> In the latter case, by choosing the number of possible 'salt' prefixes
>>>>> (which could be derived from hashed values, etc.) you can balance the
>>>>> efficiency of write distribution against the ability to run fast range scans.
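>>>>> 
>>>>> For illustration only, a conceptual sketch of the salted range scan (this is
>>>>> not the HBaseWD API; hypothetical names, standard Scan client API assumed):
>>>>> 
>>>>>   import java.io.IOException;
>>>>>   import java.util.ArrayList;
>>>>>   import java.util.List;
>>>>>   import org.apache.hadoop.hbase.client.HTable;
>>>>>   import org.apache.hadoop.hbase.client.Result;
>>>>>   import org.apache.hadoop.hbase.client.ResultScanner;
>>>>>   import org.apache.hadoop.hbase.client.Scan;
>>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>>> 
>>>>>   // One logical range scan over salted keys = one physical Scan per salt
>>>>>   // prefix; if ordering matters, merge results by the unsalted key part.
>>>>>   public static List<Result> saltedRangeScan(HTable table, byte[] start,
>>>>>       byte[] stop, int buckets) throws IOException {
>>>>>     List<Result> out = new ArrayList<Result>();
>>>>>     for (int b = 0; b < buckets; b++) {
>>>>>       byte[] prefix = new byte[] { (byte) b };
>>>>>       Scan scan = new Scan(Bytes.add(prefix, start), Bytes.add(prefix, stop));
>>>>>       ResultScanner scanner = table.getScanner(scan);
>>>>>       try {
>>>>>         for (Result r : scanner) {
>>>>>           out.add(r);
>>>>>         }
>>>>>       } finally {
>>>>>         scanner.close();
>>>>>       }
>>>>>     }
>>>>>     return out;
>>>>>   }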
>>>>> 
>>>>> Hope this helps
>>>>> 
>>>>> Alex Baranau
>>>>> ------
>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>> 
>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>> 
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> First, the use of a 'Salt' is a very, very bad idea, and I would really
>>>>>> hope that the author of that blog takes it down.
>>>>>> While it may solve an initial problem in terms of region hot spotting, it
>>>>>> creates another problem when it comes to fetching data. Fetching data
>>>>>> takes more effort.
>>>>>> 
>>>>>> With respect to using a hash (MD5 or SHA-1), you are creating a more random
>>>>>> key that is unique to the record. Some would argue that with MD5 or SHA-1
>>>>>> you could mathematically have a collision; however, you could then append
>>>>>> the original key to the hash to guarantee uniqueness. You could also do
>>>>>> things like take the hash, truncate it to the first byte, and then append
>>>>>> the record key. This should give you enough randomness to avoid hot
>>>>>> spotting after the initial region fills, and you could pre-split out any
>>>>>> number of regions. (The first byte has 256 possible values, 0-255, so you
>>>>>> can program the split.)
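>>>>>> 
>>>>>> A rough sketch of such a pre-split (hypothetical table and family names;
>>>>>> HBaseAdmin.createTable with explicit split keys assumed):
>>>>>> 
>>>>>>   import java.io.IOException;
>>>>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>   import org.apache.hadoop.hbase.HColumnDescriptor;
>>>>>>   import org.apache.hadoop.hbase.HTableDescriptor;
>>>>>>   import org.apache.hadoop.hbase.client.HBaseAdmin;
>>>>>> 
>>>>>>   // Because the leading key byte is a truncated hash, its values spread
>>>>>>   // over 0-255, so the table can be pre-split on those byte boundaries.
>>>>>>   public static void createPreSplitTable(int regions) throws IOException {
>>>>>>     HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>>>>>>     HTableDescriptor desc = new HTableDescriptor("mytable");
>>>>>>     desc.addFamily(new HColumnDescriptor("cf"));
>>>>>>     byte[][] splits = new byte[regions - 1][];
>>>>>>     for (int i = 1; i < regions; i++) {
>>>>>>       splits[i - 1] = new byte[] { (byte) (i * 256 / regions) };
>>>>>>     }
>>>>>>     admin.createTable(desc, splits);  // e.g. regions = 16, or up to 256
>>>>>>   }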
>>>>>> 
>>>>>> 
>>>>>> Having said that... yes, you lose the ability to perform a sequential scan
>>>>>> of the data. At least to a point. It depends on your schema.
>>>>>> 
>>>>>> Note that you need to think about how you are primarily going to access
>>>>>> the data.  You can then determine the best way to store the data to gain
>>>>>> the best performance. For some applications... the region hot spotting
>>>>>> isn't an important issue.
>>>>>> 
>>>>>> Note YMMV
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>> -Mike
>>>>>> 
>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <dha...@viadeoteam.com> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> There is a middle ground between sequential keys (hot spotting risk) and
>>>>>>> md5 (heavy scans):
>>>>>>> * you can use composed keys with a field that can segregate data
>>>>>>> (hostname, product name, metric name), like OpenTSDB (a small sketch
>>>>>>> follows below)
>>>>>>> * or use a salt with a limited number of values (for example
>>>>>>> substr(md5(rowid),0,1) = 16 values)
>>>>>>>    so that a scan is a combination of 16 filters, one for each salt value
>>>>>>>    you can base your code on HBaseWD by sematext
>>>>>>> 
>>>>>>> 
>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>> https://github.com/sematext/HBaseWD
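>>>>>>> 
>>>>>>> As a small sketch of the composed-key option above, loosely in the spirit
>>>>>>> of OpenTSDB but not its actual key format (hypothetical field layout, HBase
>>>>>>> Bytes utility assumed):
>>>>>>> 
>>>>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>>>>> 
>>>>>>>   // Composed key: <metric id><timestamp>. Different metrics spread across
>>>>>>>   // regions, while one metric over a time window stays a contiguous scan.
>>>>>>>   public static byte[] composedKey(int metricId, long timestamp) {
>>>>>>>     return Bytes.add(Bytes.toBytes(metricId), Bytes.toBytes(timestamp));
>>>>>>>   }
>>>>>>> 
>>>>>>>   // Scan metric 42 over [startTs, stopTs) as a plain range scan:
>>>>>>>   //   Scan scan = new Scan(composedKey(42, startTs), composedKey(42, stopTs));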
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> 
>>>>>>> 2012/12/18 bigdata <bigdatab...@outlook.com>
>>>>>>> 
>>>>>>>> Many articles tell me that using an MD5 of the rowkey, or of part of it,
>>>>>>>> is a good method to balance how records are distributed across regions.
>>>>>>>> But if I want to search sequential rowkey records, such as a date used as
>>>>>>>> the rowkey or as part of it, I cannot use a rowkey filter to scan a range
>>>>>>>> of date values in one pass once the date has been MD5'd. How do I balance
>>>>>>>> this trade-off?
>>>>>>>> Thanks.
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Damien HARDY
>>>>>> 
>>>>>> 
