Nick,

Yes, there is an implicit definition of the term 'salting' which those with a CS or software engineering background will take to heart. However, the usage here goes beyond that definition.

Per Lars and Alex, they are talking about bucketing the data. Again, this is not a good idea. As you point out, even using a modulo function (e.g. the % operator: 45 % 10 = 5) on something like a running insert counter, you are still creating a value in the key that cannot be recomputed from the key alone. While this puts the rows into 10 buckets, you can no longer do a single partial scan or a single get() to find a row. That is what is implied by 'salting'.

Now this is much different from taking the hash of the key, truncating the hash, and then appending the key (roughly: hash(key).toString().substring(0, 2) + key -- ok, not real code, but you get the idea). Using this, I can still use a single get() to fetch the row if I know the key. Again, with salting you can't do that.
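For concreteness, a minimal sketch of that deterministic prefix in plain Java (the class name, the choice of MD5, and the two-byte prefix width are illustrative, not something prescribed in this thread):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Deterministic "hash prefix": prepend the first byte or two of MD5(key)
// to the key itself. The prefix is a pure function of the key, so anyone
// who knows the key can rebuild the full rowkey and issue a single get().
public final class HashPrefixedKey {

    // Two prefix bytes is an illustrative choice; one is often enough.
    private static final int PREFIX_BYTES = 2;

    public static byte[] rowkey(byte[] key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key);
            byte[] rowkey = new byte[PREFIX_BYTES + key.length];
            System.arraycopy(digest, 0, rowkey, 0, PREFIX_BYTES);
            System.arraycopy(key, 0, rowkey, PREFIX_BYTES, key.length);
            return rowkey;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is a required JDK algorithm", e);
        }
    }
}

Because rowkey() depends only on the key, the reader rebuilds the same rowkey at read time, so a point lookup stays a single get().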
What I find troublesome is that there are better design patterns which solve the same problem.

HTH

-Mike

On Dec 20, 2012, at 12:15 PM, Nick Dimiduk <[email protected]> wrote:

> I think there's some hair-splitting going on here. The term "salting," by
> strict definition [0] from the cryptographic context, means the
> introduction of randomness to produce a one-way encoding of a value. The
> technique for rowkey design described here does not include the
> introduction of said randomness, nor is it a one-way operation. Instead, it
> divides the input more or less evenly across a fixed number of buckets.
>
> If you've introduced one-way randomness, you have no way to reproduce the
> rowkey after it's been inserted. Thus, accessing a specific row would be
> O(n) where n is the number of rows in the table -- you'd have to scan the
> table looking for the desired row. Use of a bucketing technique, on the
> other hand, adds a small amount of computation to calculate the true
> rowkey, making access to a single row O(1) + C in the number of rows in
> the table.
>
> -n
>
> [0]: http://en.wikipedia.org/wiki/Salt_(cryptography)
>
> On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[email protected]> wrote:
>
>> Lars,
>>
>> Ok... he's talking about buckets.
>>
>> So when you have N buckets, what is the least number of get()s you need
>> to fetch a single row?
>> (Hint: the answer is N.)
>>
>> How many scans? (N again.)
>>
>> Do you disagree?
>>
>> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[email protected]> wrote:
>>
>>> Mike, please think about what you write before you write it.
>>> You will most definitely not need a full table scan (much less a *FULL*
>>> *TABLE* *SCAN* ;-) ).
>>>
>>> Read Alex's blog post again, it's a good post (IMHO). He is talking
>>> about buckets.
>>>
>>> -- Lars
>>>
>>> ________________________________
>>> From: Michael Segel <[email protected]>
>>> To: [email protected]
>>> Sent: Wednesday, December 19, 2012 5:23 PM
>>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>>
>>> Ok,
>>>
>>> Let's try this one more time...
>>>
>>> If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
>>> retrieve the row.
>>> If you do something like a salt that uses only a preset of N
>>> combinations, you will have to do N get()s in order to fetch the row.
>>>
>>> This is bad. VERY BAD.
>>>
>>> If you hash the row key, you will get a consistent value each time you
>>> hash it. If you use SHA-1, a collision is mathematically possible but
>>> highly improbable. So people have recommended appending the key to the
>>> hash to form the new key. Here, you might as well truncate the hash to
>>> just the most significant byte or two and then append the key. This
>>> will give you enough of an even distribution that you can avoid hot
>>> spotting.
>>>
>>> So if I use the hash, I can effectively still get the row of data back
>>> with a single get(). Otherwise it's a full table scan.
>>>
>>> Do you see the difference?
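To make the cost concrete: under a round-robin or random salt with N possible prefixes, the prefix is not derivable from the key, so a point lookup has to try every bucket. A sketch against the 0.94-era HBase client API (the single-byte salt and the class name are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedPointLookup {
    // Under round-robin/random salting with N buckets, the bucket a key
    // landed in cannot be derived from the key, so a single logical get()
    // turns into N physical gets, one per possible salt prefix.
    public static Result fetch(HTable table, byte[] key, int buckets)
            throws IOException {
        List<Get> gets = new ArrayList<Get>(buckets);
        for (int b = 0; b < buckets; b++) {
            gets.add(new Get(Bytes.add(new byte[] { (byte) b }, key)));
        }
        for (Result r : table.get(gets)) { // N lookups' worth of work
            if (r != null && !r.isEmpty()) {
                return r; // at most one bucket actually holds the row
            }
        }
        return null;
    }
}

With the deterministic hash prefix sketched earlier, the loop disappears and the same lookup is one Get.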
>>> On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> If in your business case the only thing you need when you retrieve
>>>> your data is to do full scans over MR jobs, then you can salt with
>>>> whatever you want: hashes, random values, etc.
>>>>
>>>> If you know you have x regions, then you can simply do a round-robin
>>>> salting, or a random salting, over those x regions.
>>>>
>>>> Then when you run your MR job, you discard the first bytes and do
>>>> what you want with your data.
>>>>
>>>> So I also think that salting can still be useful. It all depends on
>>>> what you do with your data.
>>>>
>>>> Just my opinion.
>>>>
>>>> JM
>>>>
>>>> 2012/12/19, Michael Segel <[email protected]>:
>>>>> Ok...
>>>>>
>>>>> So you use a random byte or two at the front of the row.
>>>>> How do you then use get() to find the row?
>>>>> How do you do a partial scan()?
>>>>>
>>>>> Do you start to see the problem?
>>>>> The only way to get to the row is to do a full table scan. That kills
>>>>> HBase, and you would be better off going with a partitioned Hive table.
>>>>>
>>>>> Using a hash of the key, or a portion of the hash, is not a salt.
>>>>> That's not what I have a problem with. Each time you want to fetch the
>>>>> key, you just hash it, truncate the hash, and then prepend it to the
>>>>> key. You will then be able to use get().
>>>>>
>>>>> Using a salt would imply using some form of modulo math to get a
>>>>> round-robin prefix, or a random number generator.
>>>>>
>>>>> That's the issue.
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>> On Dec 19, 2012, at 3:26 PM, David Arthur <[email protected]> wrote:
>>>>>
>>>>>> Let's say you want to decompose a URL into domain and path to include
>>>>>> in your row key.
>>>>>>
>>>>>> You could of course just use the URL as the key, but you will see
>>>>>> hotspotting since most will start with "http". To mitigate this, you
>>>>>> could add a random byte or two at the beginning (a random salt) to
>>>>>> improve the distribution of keys, but you break single-record Gets
>>>>>> (and arguably Scans). Another approach is to use a hash-based salt:
>>>>>> hash the whole key and use a few of those bytes as a salt. This fixes
>>>>>> Gets, but Scans are still not effective.
>>>>>>
>>>>>> One approach I've taken is to hash only a part of the key. Consider
>>>>>> the following key structure:
>>>>>>
>>>>>> <2 bytes of hash(domain)><domain><path>
>>>>>>
>>>>>> With this you get 16 bits for a hash-based salt. The salt is
>>>>>> deterministic, so Gets work fine, and for a single domain the salt is
>>>>>> the same, so you can easily do Scans across a domain. If you had some
>>>>>> further structure to your key that you wished to scan across, you
>>>>>> could do something like:
>>>>>>
>>>>>> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
>>>>>>
>>>>>> It really boils down to identifying your access patterns and
>>>>>> read/write requirements and constructing a row key accordingly.
>>>>>>
>>>>>> HTH,
>>>>>> David
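A sketch of David's <2 bytes of hash(domain)><domain><path> layout in plain Java (class and helper names are illustrative; a real schema would also want a delimiter or length field between domain and path so keys parse unambiguously):

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class UrlRowkey {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // <2 bytes of hash(domain)><domain><path>: the prefix is computed from
    // the domain alone, so every row for a domain shares one prefix and a
    // domain-wide scan is still a single contiguous range.
    public static byte[] rowkey(String domain, String path) {
        byte[] d = domain.getBytes(UTF8);
        byte[] p = path.getBytes(UTF8);
        byte[] key = new byte[2 + d.length + p.length];
        System.arraycopy(hashPrefix(d), 0, key, 0, 2);
        System.arraycopy(d, 0, key, 2, d.length);
        System.arraycopy(p, 0, key, 2 + d.length, p.length);
        return key;
    }

    // First two bytes of MD5(domain); any stable hash would work as well.
    private static byte[] hashPrefix(byte[] domain) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(domain);
            return new byte[] { md5[0], md5[1] };
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}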
>>>>>> On 12/18/12 6:29 PM, Michael Segel wrote:
>>>>>>> Alex,
>>>>>>>
>>>>>>> And that's the point. Salt, as you explain it conceptually, implies
>>>>>>> that the number you are adding to the key to ensure a better
>>>>>>> distribution means that you will have inefficiencies in terms of
>>>>>>> scans and gets.
>>>>>>>
>>>>>>> Using a hash as the full key, or taking the hash, truncating it, and
>>>>>>> appending the key, may hurt scans, but your get() is intact.
>>>>>>>
>>>>>>> There are other options, like inverting a numeric key ...
>>>>>>>
>>>>>>> And of course doing nothing.
>>>>>>>
>>>>>>> Using a salt as part of the design pattern is bad.
>>>>>>>
>>>>>>> With respect to the OP, I was discussing the use of a hash and some
>>>>>>> alternatives for how to implement the hash of a key.
>>>>>>> Again, doing nothing may also make sense, if you understand the risks
>>>>>>> and you know how your data is going to be used.
>>>>>>>
>>>>>>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:
>>>>>>>
>>>>>>>> Mike,
>>>>>>>>
>>>>>>>> Please read the *full post* before judging it. In particular, the
>>>>>>>> "Hash-based distribution" section. You can find the same in the
>>>>>>>> small HBaseWD README file [1] (not sure if you read it at all before
>>>>>>>> commenting on the lib). Round robin is mainly for explaining the
>>>>>>>> concept/idea (though not only for that).
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Alex Baranau
>>>>>>>> ------
>>>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>>>>>
>>>>>>>> [1] https://github.com/sematext/HBaseWD
>>>>>>>>
>>>>>>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Quick answer...
>>>>>>>>>
>>>>>>>>> Look at the salt.
>>>>>>>>> It's just a number from a round-robin counter.
>>>>>>>>> There is no tie between the salt and the row.
>>>>>>>>>
>>>>>>>>> So when you want to fetch a single row, how do you do it?
>>>>>>>>> ...
>>>>>>>>> ;-)
>>>>>>>>>
>>>>>>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> @Mike:
>>>>>>>>>>
>>>>>>>>>> I'm the author of that post :).
>>>>>>>>>>
>>>>>>>>>> Quick reply to your last comment:
>>>>>>>>>>
>>>>>>>>>> 1) Could you please describe why "the use of a 'Salt' is a very,
>>>>>>>>>> very bad idea" in a more specific way than "Fetching data takes
>>>>>>>>>> more effort"? It would be helpful for anyone who is looking into
>>>>>>>>>> using this approach.
>>>>>>>>>>
>>>>>>>>>> 2) The approach described in the post also says you can prefix
>>>>>>>>>> with the hash; you probably missed that.
>>>>>>>>>>
>>>>>>>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help
>>>>>>>>>> bigdata. Please re-read the question: the intention is to
>>>>>>>>>> distribute the load while still being able to do "partial key
>>>>>>>>>> scans". The blog post linked above explains one possible solution
>>>>>>>>>> for that, while your answer doesn't.
>>>>>>>>>>
>>>>>>>>>> @bigdata:
>>>>>>>>>>
>>>>>>>>>> Basically, when it comes to solving the two issues -- distributing
>>>>>>>>>> writes and having the ability to read data sequentially -- you
>>>>>>>>>> have to balance how good you are at each of them. Very good
>>>>>>>>>> presentation by Lars:
>>>>>>>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
>>>>>>>>>> slide 22. You will see how this is correlated. In short:
>>>>>>>>>>
>>>>>>>>>> * having an md5/other hash prefix on the key does better at
>>>>>>>>>> distributing writes, while compromising the ability to do range
>>>>>>>>>> scans efficiently
>>>>>>>>>> * having a very limited number of 'salt' prefixes still allows
>>>>>>>>>> range scans (less efficiently than normal range scans, of course,
>>>>>>>>>> but still good enough in many cases) while providing a worse
>>>>>>>>>> distribution of writes
>>>>>>>>>>
>>>>>>>>>> In the latter case, by choosing the number of possible 'salt'
>>>>>>>>>> prefixes (which could be derived from hashed values, etc.) you can
>>>>>>>>>> balance write-distribution efficiency against the ability to run
>>>>>>>>>> fast range scans.
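As a concrete illustration of the trade-off Alex describes: with a small fixed set of salt prefixes, a logical range scan becomes one scan per prefix, merged client-side. A sketch against the 0.94-era client API (the one-byte salt and the class name are illustrative; the HBaseWD library packages the same idea):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRangeScan {
    // With a small fixed set of salt prefixes, a logical range scan
    // [start, stop) is still possible: run one scan per prefix over the
    // salted keyspace and merge the results on the client.
    public static List<Result> scanRange(HTable table, byte[] start,
            byte[] stop, int buckets) throws IOException {
        List<Result> merged = new ArrayList<Result>();
        for (int b = 0; b < buckets; b++) {
            byte[] salt = new byte[] { (byte) b };
            Scan scan = new Scan();
            scan.setStartRow(Bytes.add(salt, start));
            scan.setStopRow(Bytes.add(salt, stop));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    merged.add(r); // per-bucket order; re-sort by unsalted
                }                  // key if global ordering matters
            } finally {
                scanner.close();
            }
        }
        return merged;
    }
}

Each scan touches only the requested range within its bucket, so this is far cheaper than a full table scan, but it still costs N scanner round trips plus a client-side merge.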
>>>>>>>>>> Hope this helps,
>>>>>>>>>>
>>>>>>>>>> Alex Baranau
>>>>>>>>>> ------
>>>>>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> First, the use of a 'Salt' is a very, very bad idea, and I would
>>>>>>>>>>> really hope that the author of that blog takes it down.
>>>>>>>>>>> While it may solve an initial problem in terms of region hot
>>>>>>>>>>> spotting, it creates another problem when it comes to fetching
>>>>>>>>>>> data. Fetching data takes more effort.
>>>>>>>>>>>
>>>>>>>>>>> With respect to using a hash (MD5 or SHA-1), you are creating a
>>>>>>>>>>> more random key that is unique to the record. Some would argue
>>>>>>>>>>> that with MD5 or SHA-1 you could mathematically have a collision;
>>>>>>>>>>> however, you could then append the key to the hash to guarantee
>>>>>>>>>>> uniqueness. You could also do things like take the hash, truncate
>>>>>>>>>>> it to the first byte, and then append the record key. This should
>>>>>>>>>>> give you enough randomness to avoid hot spotting after the
>>>>>>>>>>> initial region completion, and you could pre-split out any number
>>>>>>>>>>> of regions. (The first byte gives you 256 values, 0-255, so you
>>>>>>>>>>> can program the split...)
>>>>>>>>>>>
>>>>>>>>>>> Having said that... yes, you lose the ability to perform a
>>>>>>>>>>> sequential scan of the data. At least to a point. It depends on
>>>>>>>>>>> your schema.
>>>>>>>>>>>
>>>>>>>>>>> Note that you need to think about how you are primarily going to
>>>>>>>>>>> access the data. You can then determine the best way to store the
>>>>>>>>>>> data to gain the best performance. For some applications, region
>>>>>>>>>>> hot spotting isn't an important issue.
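The pre-split Mike mentions falls out naturally once the leading byte is a hash byte: split points are just evenly spaced byte values. A sketch, assuming the 0.94-era admin API (table and family names are illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PreSplitTable {
    // With a hash byte leading the rowkey, keys spread uniformly over
    // 0x00-0xFF, so the table can be pre-split at evenly spaced byte
    // boundaries instead of waiting for the first region to fill and split.
    public static void create(String tableName, int regions) throws IOException {
        byte[][] splits = new byte[regions - 1][]; // assumes regions >= 2
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / regions) };
        }
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            HTableDescriptor desc = new HTableDescriptor(tableName);
            desc.addFamily(new HColumnDescriptor("f"));
            admin.createTable(desc, splits);
        } finally {
            admin.close();
        }
    }
}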
>>>>>>>>>>> Note YMMV.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> -Mike
>>>>>>>>>>>
>>>>>>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> There is a middle ground between sequential keys (hot-spotting
>>>>>>>>>>>> risk) and md5 (heavy scans):
>>>>>>>>>>>> * you can use composite keys with a field that segregates the
>>>>>>>>>>>> data (hostname, product name, metric name), like OpenTSDB, or
>>>>>>>>>>>> * use a salt with a limited number of values (for example
>>>>>>>>>>>> substr(md5(rowid), 0, 1) = 16 values),
>>>>>>>>>>>> so that a scan is a combination of 16 filters, one on each salt
>>>>>>>>>>>> value.
>>>>>>>>>>>> You can base your code on HBaseWD by Sematext:
>>>>>>>>>>>>
>>>>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/12/18 bigdata <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>>> Many articles tell me that an MD5 rowkey, or MD5 as part of it,
>>>>>>>>>>>>> is a good method to balance the records stored across different
>>>>>>>>>>>>> regions. But if I want to search some sequential rowkey
>>>>>>>>>>>>> records, such as a date as the rowkey (or part of it), I cannot
>>>>>>>>>>>>> use a rowkey filter to scan a range of date values in one pass
>>>>>>>>>>>>> once the date is MD5'd. How do I balance this issue?
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Damien HARDY
