OK, maybe I'm missing something. Why don't you walk me through an example of using a salt?
On Dec 19, 2012, at 12:37 PM, lars hofhansl <[email protected]> wrote:

> I would disagree here.
> It depends on what you are doing and blanket statements about "this is very,
> very bad" typically do not help.
>
> Salting (even round robin) is very nice to distribute write load *and* it
> gives you a natural way to parallelize scans assuming scans are of reasonable
> size.
>
> If the typical use case is point gets then hashing or inverting keys would be
> preferable. As usual: It depends.
>
> -- Lars
>
> ________________________________
> From: Michael Segel <[email protected]>
> To: [email protected]
> Sent: Tuesday, December 18, 2012 3:29 PM
> Subject: Re: Is it necessary to set MD5 on rowkey?
>
> Alex,
> And that's the point. Salt as you explain it conceptually implies that the
> number you are adding to the key to ensure a better distribution means that
> you will have inefficiencies in terms of scans and gets.
>
> Using a hash as either the full key, or taking the hash, truncating it and
> appending the key may screw up scans, but your get() is intact.
>
> There are other options like inverting the numeric key ...
>
> And of course doing nothing.
>
> Using a salt as part of the design pattern is bad.
>
> With respect to the OP, I was discussing the use of hash and some
> alternatives to how to implement the hash of a key.
> Again, doing nothing may also make sense too, if you understand the risks and
> you know how your data is going to be used.
>
> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:
>
>> Mike,
>>
>> Please read the *full post* before judging. In particular, the "Hash-based
>> distribution" section. You can find the same in HBaseWD small README file
>> [1] (not sure if you read it at all before commenting on the lib). Round
>> robin is mainly for explaining the concept/idea (though not only for that).
>>
>> Thank you,
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>> Solr
>>
>> [1] https://github.com/sematext/HBaseWD
>>
>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>> <[email protected]> wrote:
>>
>>> Quick answer...
>>>
>>> Look at the salt.
>>> It's just a number from a round robin counter.
>>> There is no tie between the salt and row.
>>>
>>> So when you want to fetch a single row, how do you do it?
>>> ...
>>> ;-)
>>>
>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> @Mike:
>>>>
>>>> I'm the author of that post :).
>>>>
>>>> Quick reply to your last comment:
>>>>
>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>>> idea" in a more specific way than "Fetching data takes more effort"? It
>>>> would be helpful for anyone who is looking into using this approach.
>>>>
>>>> 2) The approach described in the post also says you can prefix with the
>>>> hash, you probably missed that.
>>>>
>>>> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
>>>> Please re-read the question: the intention is to distribute the load while
>>>> still being able to do "partial key scans". The blog post linked above
>>>> explains one possible solution for that, while your answer doesn't.
>>>>
>>>> @bigdata:
>>>>
>>>> Basically when it comes to solving two issues: distributing writes and
>>>> having ability to read data sequentially, you have to balance between being
>>>> good at both of them. Very good presentation by Lars (see slide 22):
>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
>>>> You will see how this is correlated. In short:
>>>> * having md5/other hash prefix of the key does better w.r.t.
>>>> distributing writes, while compromising the ability to do range scans
>>>> efficiently
>>>> * having a very limited number of 'salt' prefixes still allows range
>>>> scans (less efficiently than normal range scans, of course, but still good
>>>> enough in many cases) while providing worse distribution of writes
>>>>
>>>> In the latter case, by choosing the number of possible 'salt' prefixes
>>>> (which could be derived from hashed values, etc.) you can balance between
>>>> write distribution efficiency and the ability to run fast range scans.
>>>>
>>>> Hope this helps
>>>>
>>>> Alex Baranau
>>>> ------
>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>>>> Solr
>>>>
>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> First, the use of a 'Salt' is a very, very bad idea and I would really
>>>>> hope that the author of that blog take it down.
>>>>> While it may solve an initial problem in terms of region hot spotting, it
>>>>> creates another problem when it comes to fetching data. Fetching data takes
>>>>> more effort.
>>>>>
>>>>> With respect to using a hash (MD5 or SHA-1) you are creating a more random
>>>>> key that is unique to the record. Some would argue that with MD5 or SHA-1
>>>>> you could mathematically have a collision; however, you could then
>>>>> append the key to the hash to guarantee uniqueness. You could also do
>>>>> things like take the hash and then truncate it to the first byte and then
>>>>> append the record key. This should give you enough randomness to avoid hot
>>>>> spotting after the initial region completion and you could pre-split out
>>>>> any number of regions. (First byte 0-255 for values, so you can program the
>>>>> split...
>>>>>
>>>>> Having said that... yes, you lose the ability to perform a sequential scan
>>>>> of the data. At least to a point. It depends on your schema.
>>>>>
>>>>> Note that you need to think about how you are primarily going to access
>>>>> the data. You can then determine the best way to store the data to gain
>>>>> the best performance. For some applications... the region hot spotting
>>>>> isn't an important issue.
>>>>>
>>>>> Note YMMV
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> There is a middle ground between sequential keys (hot spotting risk) and
>>>>>> md5 (heavy scan):
>>>>>> * you can use composed keys with a field that can segregate data
>>>>>> (hostname, productname, metric name) like OpenTSDB
>>>>>> * or use Salt with a limited number of values (example
>>>>>> substr(md5(rowid),0,1) = 16 values)
>>>>>> so that a scan is a combination of 16 filters, one on each salt value
>>>>>> you can base your code on HBaseWD by sematext
>>>>>>
>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>> https://github.com/sematext/HBaseWD
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> 2012/12/18 bigdata <[email protected]>
>>>>>>
>>>>>>> Many articles tell me that an MD5 rowkey, or an MD5 part of it, is a good
>>>>>>> method to balance the records stored in different parts. But if I want to
>>>>>>> search some sequential rowkey records, such as with a date as the rowkey
>>>>>>> or part of it, I cannot use a rowkey filter to scan a range of date
>>>>>>> values in one go once the date is hashed with MD5. How to balance this
>>>>>>> issue?
>>>>>>> Thanks.
>>>>>>>
>>>>>> --
>>>>>> Damien HARDY
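
For anyone following the thread, here is a rough sketch of the "limited number
of salt values derived from a hash" idea discussed above. It is plain Python
and does not talk to HBase or use HBaseWD; the names salted_key, scan_ranges
and NUM_BUCKETS are made up for illustration, and 16 buckets is only an
assumption borrowed from the substr(md5(rowid),0,1) example. Because the
prefix is computed from the row id rather than from a round robin counter, a
point get can rebuild the full key, while a range scan becomes one scan per
bucket.

```python
import hashlib

NUM_BUCKETS = 16  # assumption: mirrors the 16-value example in the thread


def salted_key(row_id: str, buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix row_id with a one-byte bucket derived from its MD5 hash.

    The bucket depends only on row_id, so a reader that knows row_id
    can recompute the exact stored key for a point get.
    """
    raw = row_id.encode("utf-8")
    bucket = hashlib.md5(raw).digest()[0] % buckets
    return bytes([bucket]) + raw


def scan_ranges(start: str, stop: str, buckets: int = NUM_BUCKETS):
    """Return one (start_key, stop_key) pair per bucket.

    Within a bucket the keys keep their original sort order, so the
    union of these scans covers the logical range [start, stop).
    """
    lo, hi = start.encode("utf-8"), stop.encode("utf-8")
    return [(bytes([b]) + lo, bytes([b]) + hi) for b in range(buckets)]


if __name__ == "__main__":
    key = salted_key("20121218-0001")
    ranges = scan_ranges("20121218", "20121219")
    print(key, len(ranges))
```

The trade-off is the one described in the thread: more buckets spread writes
over more regions but multiply the number of scans needed per range, and the
per-bucket results have to be merged by the client if a global order matters.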
