Hi Mike,

If in your business case the only thing you need when you retrieve your data is to do full scans over MR jobs, then you can salt with whatever you want: hash, random values, etc.
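Since a salt in this full-scan case carries no information, a tiny Python sketch of the idea (names like `salt_round_robin` are mine, and plain byte strings stand in for HBase row keys) might look like:

```python
from itertools import count

NUM_BUCKETS = 16  # e.g. the number of pre-split regions
_counter = count()

def salt_round_robin(row_key: bytes) -> bytes:
    """Prepend a one-byte round-robin salt; the salt has no tie to the key."""
    salt = next(_counter) % NUM_BUCKETS
    return bytes([salt]) + row_key

def strip_salt(salted_key: bytes) -> bytes:
    """In the MR job, discard the salt byte before using the key."""
    return salted_key[1:]
```

Because the salt is just a counter (or a random byte), writes spread across buckets, but as discussed below, a get() for a known key has no way to recompute the prefix.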
If you know you have x regions, then you can simply do round-robin salting, or random salting, over those x regions. Then when you run your MR job, you discard the first bytes and do what you want with your data. So I also think that salting can still be useful. It all depends on what you do with your data. Just my opinion.

JM

2012/12/19, Michael Segel <[email protected]>:
> Ok...
>
> So you use a random byte or two at the front of the row.
> How do you then use get() to find the row?
> How do you do a partial scan()?
>
> Do you start to see the problem?
> The only way to get to the row is to do a full table scan. That kills HBase, and you would be better off going with a partitioned Hive table.
>
> Using a hash of the key, or a portion of the hash, is not a salt. That's not what I have a problem with. Each time you want to fetch the key, you just hash it, truncate the hash, and then prepend it to the key. You will then be able to use get().
>
> Using a salt would imply using some form of modulo math to get a round-robin prefix. Or a random number generator.
>
> That's the issue.
>
> Does that make sense?
>
> On Dec 19, 2012, at 3:26 PM, David Arthur <[email protected]> wrote:
>
>> Let's say you want to decompose a URL into domain and path to include in your row key.
>>
>> You could of course just use the URL as the key, but you will see hotspotting since most will start with "http". To mitigate this, you could add a random byte or two at the beginning (a random salt) to improve the distribution of keys, but you break single-record Gets (and arguably Scans). Another approach is to use a hash-based salt: hash the whole key and use a few of those bytes as a salt. This fixes Gets, but Scans are still not effective.
>>
>> One approach I've taken is to hash only part of the key. Consider the following key structure:
>>
>> <2 bytes of hash(domain)><domain><path>
>>
>> With this you get 16 bits for a hash-based salt.
>> The salt is deterministic, so Gets work fine, and for a single domain the salt is the same, so you can easily do Scans across a domain. If you had some further structure to your key that you wished to scan across, you could do something like:
>>
>> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
>>
>> It really boils down to identifying your access patterns and read/write requirements, and constructing a row key accordingly.
>>
>> HTH,
>> David
>>
>> On 12/18/12 6:29 PM, Michael Segel wrote:
>>> Alex,
>>> And that's the point. A salt, as you explain it, conceptually implies that the number you are adding to the key to ensure a better distribution means you will have inefficiencies in terms of scans and gets.
>>>
>>> Using a hash as the full key, or taking the hash, truncating it, and appending the key, may screw up scans, but your get() is intact.
>>>
>>> There are other options, like inverting the numeric key...
>>>
>>> And of course doing nothing.
>>>
>>> Using a salt as part of the design pattern is bad.
>>>
>>> With respect to the OP, I was discussing the use of a hash and some alternatives for how to implement the hash of a key. Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.
>>>
>>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:
>>>
>>>> Mike,
>>>>
>>>> Please read the *full post* before judging. In particular, the "Hash-based distribution" section. You can find the same in the small HBaseWD README file [1] (not sure if you read it at all before commenting on the lib). Round robin is mainly for explaining the concept/idea (though not only for that).
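David's `<2 bytes of hash(domain)><domain><path>` layout described above can be sketched in Python (md5 is my choice of hash here, and `make_key`/`prefix_stop` are names of my own; real code would issue the Get/Scan through the HBase client):

```python
import hashlib

def make_key(domain: bytes, path: bytes) -> bytes:
    """Build <2 bytes of hash(domain)><domain><path>; the salt is deterministic."""
    salt = hashlib.md5(domain).digest()[:2]
    return salt + domain + path

def prefix_stop(prefix: bytes) -> bytes:
    """Smallest byte string sorting after every key that starts with prefix."""
    p = bytearray(prefix)
    while p and p[-1] == 0xFF:
        p.pop()
    if p:
        p[-1] += 1
    return bytes(p)

def domain_scan_range(domain: bytes) -> tuple:
    """Start/stop rows covering every key of one domain, for a range Scan."""
    prefix = hashlib.md5(domain).digest()[:2] + domain
    return prefix, prefix_stop(prefix)
```

A Get recomputes the same two salt bytes from the domain, and all rows of one domain stay contiguous in the table, which is exactly why single-domain Scans still work.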
>>>>
>>>> Thank you,
>>>> Alex Baranau
>>>> ------
>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>
>>>> [1] https://github.com/sematext/HBaseWD
>>>>
>>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel <[email protected]> wrote:
>>>>
>>>>> Quick answer...
>>>>>
>>>>> Look at the salt. It's just a number from a round-robin counter. There is no tie between the salt and the row.
>>>>>
>>>>> So when you want to fetch a single row, how do you do it?
>>>>> ...
>>>>> ;-)
>>>>>
>>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> @Mike:
>>>>>>
>>>>>> I'm the author of that post :).
>>>>>>
>>>>>> Quick reply to your last comment:
>>>>>>
>>>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad idea" in a more specific way than "Fetching data takes more effort"? That would be helpful for anyone who is looking into using this approach.
>>>>>>
>>>>>> 2) The approach described in the post also says you can prefix with the hash; you probably missed that.
>>>>>>
>>>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata. Please re-read the question: the intention is to distribute the load while still being able to do "partial key scans". The blog post linked above explains one possible solution for that, while your answer doesn't.
>>>>>>
>>>>>> @bigdata:
>>>>>>
>>>>>> Basically, when it comes to solving the two issues of distributing writes and keeping the ability to read data sequentially, you have to balance between being good at both of them. Very good presentation by Lars: http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012, slide 22. You will see how these are correlated. In short:
>>>>>> * having an md5/other hash prefix of the key does better at distributing writes, while compromising the ability to do range scans efficiently
>>>>>> * having a very limited number of 'salt' prefixes still allows you to do range scans (less efficiently than normal range scans, of course, but still good enough in many cases) while providing worse distribution of writes
>>>>>>
>>>>>> In the latter case, by choosing the number of possible 'salt' prefixes (which could be derived from hashed values, etc.) you can balance between write-distribution efficiency and the ability to run fast range scans.
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> Alex Baranau
>>>>>> ------
>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>>>
>>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <[email protected]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> First, the use of a 'Salt' is a very, very bad idea, and I would really hope that the author of that blog takes it down. While it may solve an initial problem in terms of region hotspotting, it creates another problem when it comes to fetching data. Fetching data takes more effort.
>>>>>>>
>>>>>>> With respect to using a hash (MD5 or SHA-1), you are creating a more random key that is unique to the record. Some would argue that with MD5 or SHA-1 you could mathematically have a collision; however, you could then append the key to the hash to guarantee uniqueness. You could also do things like take the hash, truncate it to the first byte, and then append the record key. This should give you enough randomness to avoid hotspotting after the initial region completion, and you could pre-split out any number of regions. (First byte 0-255 for values, so you can program the split...
>>>>>>>
>>>>>>> Having said that... yes, you lose the ability to perform a sequential scan of the data. At least to a point. It depends on your schema.
>>>>>>>
>>>>>>> Note that you need to think about how you are primarily going to access the data. You can then determine the best way to store the data to gain the best performance. For some applications... the region hotspotting isn't an important issue.
>>>>>>>
>>>>>>> Note YMMV
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> There is a middle ground between sequential keys (hotspotting risk) and md5 (heavy scans):
>>>>>>>> * you can use composed keys with a leading field that segregates data (hostname, product name, metric name), like OpenTSDB
>>>>>>>> * or use a salt with a limited number of values (for example substr(md5(rowid),0,1) = 16 values), so that a scan is a combination of 16 scans, one per salt value; you can base your code on HBaseWD by Sematext
>>>>>>>>
>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> 2012/12/18 bigdata <[email protected]>
>>>>>>>>
>>>>>>>>> Many articles tell me that using an MD5 rowkey, or part of one, is a good method to balance the records stored across different regions. But if I want to search some sequential rowkey records, such as a date as the rowkey (or part of it), I cannot use a rowkey filter to scan a range of date values in one pass once the date is MD5-hashed. How do I balance this issue?
>>>>>>>>> Thanks.
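Damien's limited-salt scheme (salt = first hex character of md5(rowid), 16 possible values) can be emulated in plain Python; `distributed_scan` is my own name for the merge step, and a sorted list stands in for the table — HBaseWD wraps this same idea around real HBase Scans:

```python
import hashlib
from bisect import bisect_left

SALTS = [format(i, "x").encode() for i in range(16)]  # b"0" .. b"f"

def salt(row_id: bytes) -> bytes:
    """First hex character of md5(rowid): 16 values, deterministic per row."""
    return hashlib.md5(row_id).hexdigest()[:1].encode()

def salted_key(row_id: bytes) -> bytes:
    return salt(row_id) + row_id

def distributed_scan(table_keys, start: bytes, stop: bytes):
    """Emulate a salted range scan: one sub-scan per salt value, results merged.

    table_keys is a sorted list of salted keys standing in for the table."""
    out = []
    for s in SALTS:
        lo = bisect_left(table_keys, s + start)
        for k in table_keys[lo:]:
            if not k.startswith(s) or k >= s + stop:
                break
            out.append(k[1:])  # strip the salt before returning the row
    return sorted(out)
```

A Get stays a single lookup because the salt is recomputable from the rowid; the price of the distribution is 16 sub-scans per range scan instead of one.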
>>>>>>>>
>>>>>>>> --
>>>>>>>> Damien HARDY
