Nick,

Yes, there is an implicit definition of the term 'salting' which those with a CS or software engineering background will take to heart. However, the usage here goes beyond that definition.

Per Lars and Alex, they are talking about bucketing the data. Again, this is not a good idea. As you point out, even using a modulo function (e.g. the % operator: 45 % 10 = 5) on something like a running insert counter, you are still creating a value in the key that cannot be recomputed from the key alone. While this puts the rows into 10 buckets, you can no longer do a single partial scan or a single get() to find a row. That is what is implied by 'salting'.

Now this is much different from taking the hash of the key, truncating the hash, and then appending the key (roughly: hash(key).toString().substring(0, 2) + key -- ok, not real code, but you get the idea). Using this, I can still use a single get() to fetch the row if I know the key. Again, with salting you can't do that.
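For concreteness, a minimal sketch of that deterministic prefix in plain Java (the class name, the choice of MD5, and the two-byte prefix width are illustrative, not something prescribed in this thread):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Deterministic "hash prefix": prepend the first byte or two of MD5(key)
// to the key itself. The prefix is a pure function of the key, so anyone
// who knows the key can rebuild the full rowkey and issue a single get().
public final class HashPrefixedKey {

    // Two prefix bytes is an illustrative choice; one is often enough.
    private static final int PREFIX_BYTES = 2;

    public static byte[] rowkey(byte[] key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(key);
            byte[] rowkey = new byte[PREFIX_BYTES + key.length];
            System.arraycopy(digest, 0, rowkey, 0, PREFIX_BYTES);
            System.arraycopy(key, 0, rowkey, PREFIX_BYTES, key.length);
            return rowkey;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is a required JDK algorithm", e);
        }
    }
}

Because rowkey() depends only on the key, the reader rebuilds the same rowkey at read time, so a point lookup stays a single get().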
What I find troublesome is that there are better design patterns which solve the same problem.

HTH

-Mike

On Dec 20, 2012, at 12:15 PM, Nick Dimiduk <[email protected]> wrote:

> I think there's some hair-splitting going on here. The term "salting," by
> strict definition [0] from the cryptographic context, means the
> introduction of randomness to produce a one-way encoding of a value. The
> technique for rowkey design described here does not include the
> introduction of said randomness, nor is it a one-way operation. Instead, it
> divides the input more or less evenly across a fixed number of buckets.
>
> If you've introduced one-way randomness, you have no way to reproduce the
> rowkey after it's been inserted. Thus, accessing a specific row would be
> O(n) where n is the number of rows in the table -- you'd have to scan the
> table looking for the desired row. Use of a bucketing technique, on the
> other hand, adds a small amount of computation to calculate the true
> rowkey, making access to a single row O(1) + C in the number of rows in
> the table.
>
> -n
>
> [0]: http://en.wikipedia.org/wiki/Salt_(cryptography)
>
> On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[email protected]> wrote:
>
>> Lars,
>>
>> Ok... he's talking about buckets.
>>
>> So when you have N buckets, what is the least number of get()s you need
>> to fetch a single row?
>> (Hint: the answer is N.)
>>
>> How many scans? (N again.)
>>
>> Do you disagree?
>>
>> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[email protected]> wrote:
>>
>>> Mike, please think about what you write before you write it.
>>> You will most definitely not need a full table scan (much less a *FULL*
>>> *TABLE* *SCAN* ;-) ).
>>>
>>> Read Alex's blog post again, it's a good post (IMHO). He is talking
>>> about buckets.
>>>
>>> -- Lars
>>>
>>> ________________________________
>>> From: Michael Segel <[email protected]>
>>> To: [email protected]
>>> Sent: Wednesday, December 19, 2012 5:23 PM
>>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>>
>>> Ok,
>>>
>>> Let's try this one more time...
>>>
>>> If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
>>> retrieve the row.
>>> If you do something like a salt that uses only a preset of N
>>> combinations, you will have to do N get()s in order to fetch the row.
>>>
>>> This is bad. VERY BAD.
>>>
>>> If you hash the row key, you will get a consistent value each time you
>>> hash it. If you use SHA-1, a collision is mathematically possible but
>>> highly improbable. So people have recommended appending the key to the
>>> hash to form the new key. Here, you might as well truncate the hash to
>>> just the most significant byte or two and then append the key. This
>>> will give you enough of an even distribution that you can avoid hot
>>> spotting.
>>>
>>> So if I use the hash, I can effectively still get the row of data back
>>> with a single get(). Otherwise it's a full table scan.
>>>
>>> Do you see the difference?
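To make the cost concrete: under a round-robin or random salt with N possible prefixes, the prefix is not derivable from the key, so a point lookup has to try every bucket. A sketch against the 0.94-era HBase client API (the single-byte salt and the class name are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedPointLookup {
    // Under round-robin/random salting with N buckets, the bucket a key
    // landed in cannot be derived from the key, so a single logical get()
    // turns into N physical gets, one per possible salt prefix.
    public static Result fetch(HTable table, byte[] key, int buckets)
            throws IOException {
        List<Get> gets = new ArrayList<Get>(buckets);
        for (int b = 0; b < buckets; b++) {
            gets.add(new Get(Bytes.add(new byte[] { (byte) b }, key)));
        }
        for (Result r : table.get(gets)) { // N lookups' worth of work
            if (r != null && !r.isEmpty()) {
                return r; // at most one bucket actually holds the row
            }
        }
        return null;
    }
}

With the deterministic hash prefix sketched earlier, the loop disappears and the same lookup is one Get.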
>>> On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> If in your business case the only thing you need when you retrieve
>>>> your data is to do full scans over MR jobs, then you can salt with
>>>> whatever you want: hashes, random values, etc.
>>>>
>>>> If you know you have x regions, then you can simply do a round-robin
>>>> salting, or a random salting, over those x regions.
>>>>
>>>> Then when you run your MR job, you discard the first bytes and do
>>>> what you want with your data.
>>>>
>>>> So I also think that salting can still be useful. It all depends on
>>>> what you do with your data.
>>>>
>>>> Just my opinion.
>>>>
>>>> JM
>>>>
>>>> 2012/12/19, Michael Segel <[email protected]>:
>>>>> Ok...
>>>>>
>>>>> So you use a random byte or two at the front of the row.
>>>>> How do you then use get() to find the row?
>>>>> How do you do a partial scan()?
>>>>>
>>>>> Do you start to see the problem?
>>>>> The only way to get to the row is to do a full table scan. That kills
>>>>> HBase, and you would be better off going with a partitioned Hive table.
>>>>>
>>>>> Using a hash of the key, or a portion of the hash, is not a salt.
>>>>> That's not what I have a problem with. Each time you want to fetch the
>>>>> key, you just hash it, truncate the hash, and then prepend it to the
>>>>> key. You will then be able to use get().
>>>>>
>>>>> Using a salt would imply using some form of modulo math to get a
>>>>> round-robin prefix, or a random number generator.
>>>>>
>>>>> That's the issue.
>>>>>
>>>>> Does that make sense?
>>>>>
>>>>> On Dec 19, 2012, at 3:26 PM, David Arthur <[email protected]> wrote:
>>>>>
>>>>>> Let's say you want to decompose a URL into domain and path to include
>>>>>> in your row key.
>>>>>>
>>>>>> You could of course just use the URL as the key, but you will see
>>>>>> hotspotting since most will start with "http". To mitigate this, you
>>>>>> could add a random byte or two at the beginning (a random salt) to
>>>>>> improve the distribution of keys, but you break single-record Gets
>>>>>> (and arguably Scans). Another approach is to use a hash-based salt:
>>>>>> hash the whole key and use a few of those bytes as a salt. This fixes
>>>>>> Gets, but Scans are still not effective.
>>>>>>
>>>>>> One approach I've taken is to hash only a part of the key. Consider
>>>>>> the following key structure:
>>>>>>
>>>>>> <2 bytes of hash(domain)><domain><path>
>>>>>>
>>>>>> With this you get 16 bits for a hash-based salt. The salt is
>>>>>> deterministic, so Gets work fine, and for a single domain the salt is
>>>>>> the same, so you can easily do Scans across a domain. If you had some
>>>>>> further structure to your key that you wished to scan across, you
>>>>>> could do something like:
>>>>>>
>>>>>> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
>>>>>>
>>>>>> It really boils down to identifying your access patterns and
>>>>>> read/write requirements and constructing a row key accordingly.
>>>>>>
>>>>>> HTH,
>>>>>> David
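A sketch of David's <2 bytes of hash(domain)><domain><path> layout in plain Java (class and helper names are illustrative; a real schema would also want a delimiter or length field between domain and path so keys parse unambiguously):

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class UrlRowkey {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // <2 bytes of hash(domain)><domain><path>: the prefix is computed from
    // the domain alone, so every row for a domain shares one prefix and a
    // domain-wide scan is still a single contiguous range.
    public static byte[] rowkey(String domain, String path) {
        byte[] d = domain.getBytes(UTF8);
        byte[] p = path.getBytes(UTF8);
        byte[] key = new byte[2 + d.length + p.length];
        System.arraycopy(hashPrefix(d), 0, key, 0, 2);
        System.arraycopy(d, 0, key, 2, d.length);
        System.arraycopy(p, 0, key, 2 + d.length, p.length);
        return key;
    }

    // First two bytes of MD5(domain); any stable hash would work as well.
    private static byte[] hashPrefix(byte[] domain) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(domain);
            return new byte[] { md5[0], md5[1] };
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}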
>>>>>> On 12/18/12 6:29 PM, Michael Segel wrote:
>>>>>>> Alex,
>>>>>>>
>>>>>>> And that's the point. Salt, as you explain it conceptually, implies
>>>>>>> that the number you are adding to the key to ensure a better
>>>>>>> distribution means that you will have inefficiencies in terms of
>>>>>>> scans and gets.
>>>>>>>
>>>>>>> Using a hash as the full key, or taking the hash, truncating it, and
>>>>>>> appending the key, may hurt scans, but your get() is intact.
>>>>>>>
>>>>>>> There are other options, like inverting a numeric key ...
>>>>>>>
>>>>>>> And of course doing nothing.
>>>>>>>
>>>>>>> Using a salt as part of the design pattern is bad.
>>>>>>>
>>>>>>> With respect to the OP, I was discussing the use of a hash and some
>>>>>>> alternatives for how to implement the hash of a key.
>>>>>>> Again, doing nothing may also make sense, if you understand the risks
>>>>>>> and you know how your data is going to be used.
>>>>>>>
>>>>>>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:
>>>>>>>
>>>>>>>> Mike,
>>>>>>>>
>>>>>>>> Please read the *full post* before judging it. In particular, the
>>>>>>>> "Hash-based distribution" section. You can find the same in the
>>>>>>>> small HBaseWD README file [1] (not sure if you read it at all before
>>>>>>>> commenting on the lib). Round robin is mainly for explaining the
>>>>>>>> concept/idea (though not only for that).
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Alex Baranau
>>>>>>>> ------
>>>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>>>>>
>>>>>>>> [1] https://github.com/sematext/HBaseWD
>>>>>>>>
>>>>>>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Quick answer...
>>>>>>>>>
>>>>>>>>> Look at the salt.
>>>>>>>>> It's just a number from a round-robin counter.
>>>>>>>>> There is no tie between the salt and the row.
>>>>>>>>>
>>>>>>>>> So when you want to fetch a single row, how do you do it?
>>>>>>>>> ...
>>>>>>>>> ;-)
>>>>>>>>>
>>>>>>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> @Mike:
>>>>>>>>>>
>>>>>>>>>> I'm the author of that post :).
>>>>>>>>>>
>>>>>>>>>> Quick reply to your last comment:
>>>>>>>>>>
>>>>>>>>>> 1) Could you please describe why "the use of a 'Salt' is a very,
>>>>>>>>>> very bad idea" in a more specific way than "Fetching data takes
>>>>>>>>>> more effort"? It would be helpful for anyone who is looking into
>>>>>>>>>> using this approach.
>>>>>>>>>>
>>>>>>>>>> 2) The approach described in the post also says you can prefix
>>>>>>>>>> with the hash; you probably missed that.
>>>>>>>>>>
>>>>>>>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help
>>>>>>>>>> bigdata. Please re-read the question: the intention is to
>>>>>>>>>> distribute the load while still being able to do "partial key
>>>>>>>>>> scans". The blog post linked above explains one possible solution
>>>>>>>>>> for that, while your answer doesn't.
>>>>>>>>>>
>>>>>>>>>> @bigdata:
>>>>>>>>>>
>>>>>>>>>> Basically, when it comes to solving the two issues -- distributing
>>>>>>>>>> writes and having the ability to read data sequentially -- you
>>>>>>>>>> have to balance how good you are at each of them. Very good
>>>>>>>>>> presentation by Lars:
>>>>>>>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
>>>>>>>>>> slide 22. You will see how this is correlated. In short:
>>>>>>>>>>
>>>>>>>>>> * having an md5/other hash prefix on the key does better at
>>>>>>>>>> distributing writes, while compromising the ability to do range
>>>>>>>>>> scans efficiently
>>>>>>>>>> * having a very limited number of 'salt' prefixes still allows
>>>>>>>>>> range scans (less efficiently than normal range scans, of course,
>>>>>>>>>> but still good enough in many cases) while providing a worse
>>>>>>>>>> distribution of writes
>>>>>>>>>>
>>>>>>>>>> In the latter case, by choosing the number of possible 'salt'
>>>>>>>>>> prefixes (which could be derived from hashed values, etc.) you can
>>>>>>>>>> balance write-distribution efficiency against the ability to run
>>>>>>>>>> fast range scans.
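As a concrete illustration of the trade-off Alex describes: with a small fixed set of salt prefixes, a logical range scan becomes one scan per prefix, merged client-side. A sketch against the 0.94-era client API (the one-byte salt and the class name are illustrative; the HBaseWD library packages the same idea):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRangeScan {
    // With a small fixed set of salt prefixes, a logical range scan
    // [start, stop) is still possible: run one scan per prefix over the
    // salted keyspace and merge the results on the client.
    public static List<Result> scanRange(HTable table, byte[] start,
            byte[] stop, int buckets) throws IOException {
        List<Result> merged = new ArrayList<Result>();
        for (int b = 0; b < buckets; b++) {
            byte[] salt = new byte[] { (byte) b };
            Scan scan = new Scan();
            scan.setStartRow(Bytes.add(salt, start));
            scan.setStopRow(Bytes.add(salt, stop));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    merged.add(r); // per-bucket order; re-sort by unsalted
                }                  // key if global ordering matters
            } finally {
                scanner.close();
            }
        }
        return merged;
    }
}

Each scan touches only the requested range within its bucket, so this is far cheaper than a full table scan, but it still costs N scanner round trips plus a client-side merge.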
>>>>>>>>>> Hope this helps,
>>>>>>>>>>
>>>>>>>>>> Alex Baranau
>>>>>>>>>> ------
>>>>>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>>>>>>>
>>>>>>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> First, the use of a 'Salt' is a very, very bad idea, and I would
>>>>>>>>>>> really hope that the author of that blog takes it down.
>>>>>>>>>>> While it may solve an initial problem in terms of region hot
>>>>>>>>>>> spotting, it creates another problem when it comes to fetching
>>>>>>>>>>> data. Fetching data takes more effort.
>>>>>>>>>>>
>>>>>>>>>>> With respect to using a hash (MD5 or SHA-1), you are creating a
>>>>>>>>>>> more random key that is unique to the record. Some would argue
>>>>>>>>>>> that with MD5 or SHA-1 you could mathematically have a collision;
>>>>>>>>>>> however, you could then append the key to the hash to guarantee
>>>>>>>>>>> uniqueness. You could also do things like take the hash, truncate
>>>>>>>>>>> it to the first byte, and then append the record key. This should
>>>>>>>>>>> give you enough randomness to avoid hot spotting after the
>>>>>>>>>>> initial region completion, and you could pre-split out any number
>>>>>>>>>>> of regions. (The first byte gives you 256 values, 0-255, so you
>>>>>>>>>>> can program the split...)
>>>>>>>>>>>
>>>>>>>>>>> Having said that... yes, you lose the ability to perform a
>>>>>>>>>>> sequential scan of the data. At least to a point. It depends on
>>>>>>>>>>> your schema.
>>>>>>>>>>>
>>>>>>>>>>> Note that you need to think about how you are primarily going to
>>>>>>>>>>> access the data. You can then determine the best way to store the
>>>>>>>>>>> data to gain the best performance. For some applications, region
>>>>>>>>>>> hot spotting isn't an important issue.
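The pre-split Mike mentions falls out naturally once the leading byte is a hash byte: split points are just evenly spaced byte values. A sketch, assuming the 0.94-era admin API (table and family names are illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PreSplitTable {
    // With a hash byte leading the rowkey, keys spread uniformly over
    // 0x00-0xFF, so the table can be pre-split at evenly spaced byte
    // boundaries instead of waiting for the first region to fill and split.
    public static void create(String tableName, int regions) throws IOException {
        byte[][] splits = new byte[regions - 1][]; // assumes regions >= 2
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / regions) };
        }
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            HTableDescriptor desc = new HTableDescriptor(tableName);
            desc.addFamily(new HColumnDescriptor("f"));
            admin.createTable(desc, splits);
        } finally {
            admin.close();
        }
    }
}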
>>>>>>>>>>> Note YMMV.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> -Mike
>>>>>>>>>>>
>>>>>>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> There is a middle ground between sequential keys (hot-spotting
>>>>>>>>>>>> risk) and md5 (heavy scans):
>>>>>>>>>>>> * you can use composite keys with a field that segregates the
>>>>>>>>>>>> data (hostname, product name, metric name), like OpenTSDB, or
>>>>>>>>>>>> * use a salt with a limited number of values (for example
>>>>>>>>>>>> substr(md5(rowid), 0, 1) = 16 values),
>>>>>>>>>>>> so that a scan is a combination of 16 filters, one on each salt
>>>>>>>>>>>> value.
>>>>>>>>>>>> You can base your code on HBaseWD by Sematext:
>>>>>>>>>>>>
>>>>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/12/18 bigdata <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>>> Many articles tell me that an MD5 rowkey, or MD5 as part of it,
>>>>>>>>>>>>> is a good method to balance the records stored across different
>>>>>>>>>>>>> regions. But if I want to search some sequential rowkey
>>>>>>>>>>>>> records, such as a date as the rowkey (or part of it), I cannot
>>>>>>>>>>>>> use a rowkey filter to scan a range of date values in one pass
>>>>>>>>>>>>> once the date is MD5'd. How do I balance this issue?
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Damien HARDY
