I think you missed the point. You seem to think that salting is ok. I want you to walk through an example so that we can discuss it. ;-)
On Dec 19, 2012, at 2:51 PM, lars hofhansl <lhofha...@yahoo.com> wrote:

> Doesn't Alex' blog post do that?
>
> ________________________________
> From: Michael Segel <michael_se...@hotmail.com>
> To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> Sent: Wednesday, December 19, 2012 11:46 AM
> Subject: Re: Is it necessary to set MD5 on rowkey?
>
> Ok,
>
> Maybe I'm missing something.
> Why don't you walk me through an example of using a salt?
>
> On Dec 19, 2012, at 12:37 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
>
>> I would disagree here.
>> It depends on what you are doing, and blanket statements like "this is very, very bad" typically do not help.
>>
>> Salting (even round robin) is very nice for distributing write load, *and* it gives you a natural way to parallelize scans, assuming the scans are of reasonable size.
>>
>> If the typical use case is point gets, then hashing or inverting keys would be preferable. As usual: it depends.
>>
>> -- Lars
>>
>> ________________________________
>> From: Michael Segel <michael_se...@hotmail.com>
>> To: user@hbase.apache.org
>> Sent: Tuesday, December 18, 2012 3:29 PM
>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>
>> Alex,
>> And that's the point. A salt, as you explain it, is a number added to the key to ensure a better distribution, which means you will have inefficiencies in terms of scans and gets.
>>
>> Using a hash as the full key, or truncating the hash and appending the key to it, may screw up scans, but your get() is intact.
>>
>> There are other options, like inverting the numeric key ...
>>
>> And of course doing nothing.
>>
>> Using a salt as part of the design pattern is bad.
>>
>> With respect to the OP, I was discussing the use of a hash and some alternatives for how to implement the hash of a key.
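The distinction Michael draws above (a hash prefix keeps get() intact, a round-robin salt does not) can be made concrete with a minimal sketch. Plain Python, no HBase client; the key formats and names are illustrative only:

```python
import hashlib

def hash_prefixed_key(key: str) -> str:
    # Deterministic: the prefix is derived from the key itself, so a
    # point get can rebuild the full row key from the logical key alone.
    prefix = hashlib.md5(key.encode()).hexdigest()[:2]
    return f"{prefix}-{key}"

class RoundRobinSalter:
    # Non-deterministic w.r.t. the key: the salt depends on write order,
    # so a reader cannot reconstruct the stored row key from the logical key.
    def __init__(self, buckets: int):
        self.buckets = buckets
        self.counter = 0

    def salted_key(self, key: str) -> str:
        salt = self.counter % self.buckets
        self.counter += 1
        return f"{salt:02d}-{key}"

# The hash prefix is stable across calls...
assert hash_prefixed_key("row-0001") == hash_prefixed_key("row-0001")

# ...while the round-robin salt is not: the same logical key lands in a
# different bucket on each write, so get() must try every bucket.
s = RoundRobinSalter(buckets=4)
print(s.salted_key("row-0001"))  # 00-row-0001
print(s.salted_key("row-0001"))  # 01-row-0001
```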
>> Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.
>>
>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <alex.barano...@gmail.com> wrote:
>>
>>> Mike,
>>>
>>> Please read the *full post* before judging. In particular, the "Hash-based distribution" section. You can find the same in the small HBaseWD README file [1] (not sure if you read it at all before commenting on the lib). Round robin is mainly for explaining the concept/idea (though not only for that).
>>>
>>> Thank you,
>>> Alex Baranau
>>> ------
>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>
>>> [1] https://github.com/sematext/HBaseWD
>>>
>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>
>>>> Quick answer...
>>>>
>>>> Look at the salt.
>>>> It's just a number from a round-robin counter.
>>>> There is no tie between the salt and the row.
>>>>
>>>> So when you want to fetch a single row, how do you do it?
>>>> ...
>>>> ;-)
>>>>
>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <alex.barano...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> @Mike:
>>>>>
>>>>> I'm the author of that post :).
>>>>>
>>>>> Quick reply to your last comment:
>>>>>
>>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad idea" in a more specific way than "Fetching data takes more effort"? That would be helpful for anyone who is looking into using this approach.
>>>>>
>>>>> 2) The approach described in the post also says you can prefix with the hash; you probably missed that.
>>>>>
>>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata guy. Please re-read the question: the intention is to distribute the load while still being able to do "partial key scans". The blog post linked above explains one possible solution for that, while your answer doesn't.
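Michael's "how do you fetch a single row?" objection comes down to this: with a pure round-robin salt, a point get degrades into one lookup per possible salt. A minimal in-memory stand-in (keys, values, and bucket count are illustrative, not HBase API):

```python
NUM_SALTS = 4
# Write order happened to assign salt 02 to this logical row.
table = {"02-row-0007": "payload"}

def salted_get(logical_key):
    # Because there is no tie between the salt and the row, the client
    # must try every salt: worst case NUM_SALTS lookups (in HBase, that
    # would be NUM_SALTS get() calls) to fetch a single logical row.
    for salt in range(NUM_SALTS):
        stored = f"{salt:02d}-{logical_key}"
        if stored in table:
            return table[stored]
    return None

print(salted_get("row-0007"))  # payload
```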
>>>>> @bigdata:
>>>>>
>>>>> Basically, when it comes to solving the two issues of distributing writes and keeping the ability to read data sequentially, you have to balance how good you are at each of them. There is a very good presentation by Lars:
>>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012, slide 22. You will see how these are correlated. In short:
>>>>> * having an md5/other hash prefix on the key does better at distributing writes, while compromising the ability to do range scans efficiently
>>>>> * having a very limited number of 'salt' prefixes still allows you to do range scans (less efficiently than normal range scans, of course, but still good enough in many cases) while providing a worse distribution of writes
>>>>>
>>>>> In the latter case, by choosing the number of possible 'salt' prefixes (which could be derived from hashed values, etc.) you can balance between write-distribution efficiency and the ability to run fast range scans.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Alex Baranau
>>>>> ------
>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>>>>>
>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> First, the use of a 'Salt' is a very, very bad idea, and I would really hope that the author of that blog takes it down.
>>>>>> While it may solve an initial problem in terms of region hot spotting, it creates another problem when it comes to fetching data. Fetching data takes more effort.
>>>>>>
>>>>>> With respect to using a hash (MD5 or SHA-1), you are creating a more random key that is unique to the record. Some would argue that with MD5 or SHA-1 you could mathematically have a collision; however, you could then append the key to the hash to guarantee uniqueness.
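Alex's middle ground (a small, fixed set of salt prefixes derived from a hash of the key) can be sketched as follows. This is a plain-Python illustration of the pattern, not the HBaseWD API; the bucket count and key format are assumptions:

```python
import hashlib

NUM_BUCKETS = 16  # small, fixed set of 'salt' prefixes

def bucket_for(key):
    # Deterministic salt derived from a hash of the key: writes spread
    # across NUM_BUCKETS region ranges, yet point gets can recompute it.
    return hashlib.md5(key.encode()).digest()[0] % NUM_BUCKETS

def stored_key(key):
    return f"{bucket_for(key):02d}-{key}"

def range_scan(table, start, stop):
    # A logical range scan becomes NUM_BUCKETS per-bucket scans whose
    # results are merged client-side (the pattern HBaseWD automates).
    out = []
    for b in range(NUM_BUCKETS):
        lo, hi = f"{b:02d}-{start}", f"{b:02d}-{stop}"
        out.extend(v for k, v in sorted(table.items()) if lo <= k < hi)
    return out

# Sequential date keys end up distributed, but remain range-scannable:
table = {stored_key(f"2012-12-{d:02d}"): d for d in range(1, 20)}
assert sorted(range_scan(table, "2012-12-05", "2012-12-10")) == [5, 6, 7, 8, 9]
```

The trade-off Alex describes is visible here: a larger NUM_BUCKETS spreads writes better but multiplies the per-scan cost, and vice versa.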
>>>>>> You could also do things like take the hash, truncate it to the first byte, and then append the record key. This should give you enough randomness to avoid hot spotting after the initial region completion, and you could pre-split out any number of regions. (The first byte has 256 possible values, 0-255, so you can program the split.)
>>>>>>
>>>>>> Having said that... yes, you lose the ability to perform a sequential scan of the data. At least to a point. It depends on your schema.
>>>>>>
>>>>>> Note that you need to think about how you are primarily going to access the data. You can then determine the best way to store the data to gain the best performance. For some applications, the region hot spotting isn't an important issue.
>>>>>>
>>>>>> Note: YMMV
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <dha...@viadeoteam.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> There is a middle ground between sequential keys (hot-spotting risk) and md5 (heavy scans):
>>>>>>> * you can use composed keys with a field that can segregate data (hostname, product name, metric name), like OpenTSDB
>>>>>>> * or use a salt with a limited number of values (for example, substr(md5(rowid),0,1) = 16 values), so that a scan is a combination of 16 filters, one for each salt value; you can base your code on HBaseWD by Sematext
>>>>>>>
>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> 2012/12/18 bigdata <bigdatab...@outlook.com>
>>>>>>>
>>>>>>>> Many articles tell me that an MD5 rowkey, or part of it, is a good method to balance the records stored across different regions.
>>>>>>>> But if I want to search some sequential rowkey records, such as when the date is the rowkey (or part of it), I cannot use a rowkey filter to scan a range of date values in one pass, because of the MD5. How do I balance this trade-off?
>>>>>>>> Thanks.
>>>>>>>
>>>>>>> --
>>>>>>> Damien HARDY
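Michael's truncate-the-hash variant and his "program the split" remark can be sketched as follows (plain Python, no HBase client; the helper names are illustrative, and real pre-splits would be passed as byte[] split keys at table creation):

```python
import hashlib

def prefixed(key: bytes) -> bytes:
    # Truncate MD5 to its first byte and prepend it to the record key.
    # One byte gives 256 possible prefixes (0-255): enough spread to
    # avoid hot spotting, and cheap to recompute for a point get().
    return hashlib.md5(key).digest()[:1] + key

def split_points(num_regions: int):
    # Evenly spaced pre-split boundaries over the 0-255 prefix space;
    # because the prefix range is known up front, any region count works.
    step = 256 // num_regions
    return [bytes([i * step]) for i in range(1, num_regions)]

print(split_points(4))  # [b'@', b'\x80', b'\xc0']
```

Damien's substr(md5(rowid),0,1) suggestion is the same idea with a smaller alphabet: one hex character instead of one raw byte, hence 16 salt values and 16 scan filters instead of 256.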