Mike,

Please read the *full post* before judging — in particular, the "Hash-based
distribution" section. You can find the same in HBaseWD's small README file
[1] (I'm not sure you read it at all before commenting on the lib). Round
robin is mainly there for explaining the concept/idea (though not only for that).

Thank you,
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

[1] https://github.com/sematext/HBaseWD

On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
<michael_se...@hotmail.com> wrote:

> Quick answer...
>
> Look at the salt.
> It's just a number from a round-robin counter.
> There is no tie between the salt and the row.
>
> So when you want to fetch a single row, how do you do it?
> ...
> ;-)
>
> On Dec 18, 2012, at 11:12 AM, Alex Baranau <alex.barano...@gmail.com>
> wrote:
>
> > Hello,
> >
> > @Mike:
> >
> > I'm the author of that post :).
> >
> > Quick reply to your last comment:
> >
> > 1) Could you please describe why "the use of a 'Salt' is a very, very bad
> > idea" in a more specific way than "Fetching data takes more effort"? That
> > would be helpful for anyone who is looking into using this approach.
> >
> > 2) The approach described in the post also says you can prefix with a
> > hash; you probably missed that.
> >
> > 3) I believe your answer, "use MD5 or SHA-1", doesn't help the bigdata
> > guy. Please re-read the question: the intention is to distribute the load
> > while still being able to do "partial key scans". The blog post linked
> > above explains one possible solution for that, while your answer doesn't.
> >
> > @bigdata:
> >
> > Basically, when it comes to solving the two issues — distributing writes
> > and keeping the ability to read data sequentially — you have to balance
> > how good you are at each of them. There is a very good presentation by
> > Lars:
> > http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
> > (slide 22), which shows how the two are correlated. In short:
> > * having an md5/other hash prefix in the key does better at distributing
> > writes, while compromising the ability to do range scans efficiently
> > * having a very limited number of 'salt' prefixes still allows range
> > scans (less efficient than normal range scans, of course, but still good
> > enough in many cases) while providing a worse distribution of writes
> >
> > In the latter case, by choosing the number of possible 'salt' prefixes
> > (which could be derived from hashed values, etc.) you can balance between
> > write-distribution efficiency and the ability to run fast range scans.
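The salting idea Alex describes can be sketched in plain Java with no HBase dependency. This is an illustrative sketch, not HBaseWD's actual code; the class name and bucket count are mine. The key point is that the salt is derived from the key itself, so the same row always maps to the same bucket — which is what makes single-row fetches possible, unlike a pure round-robin counter.

```java
// Illustrative sketch: a deterministic salt with a small, fixed number of
// prefixes. Writes spread over BUCKETS regions; a get for a known key can
// recompute its salt, so no "tie between salt and row" is lost.
public class SaltedKey {
    static final int BUCKETS = 16; // small, fixed number of salt prefixes

    // Prefix the key with a bucket id derived deterministically from the key.
    static String salt(String key) {
        int bucket = Math.floorMod(key.hashCode(), BUCKETS);
        return String.format("%02d-%s", bucket, key);
    }
}
```

Because the prefix is a function of the key, a range scan over the original key space becomes BUCKETS sub-scans, one per prefix — the trade-off discussed above.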
> >
> > Hope this helps
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
> >
> > On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel
> > <michael_se...@hotmail.com> wrote:
> >
> >>
> >> Hi,
> >>
> >> First, the use of a 'Salt' is a very, very bad idea, and I would really
> >> hope that the author of that blog takes it down.
> >> While it may solve an initial problem in terms of region hot spotting,
> >> it creates another problem when it comes to fetching data: fetching data
> >> takes more effort.
> >>
> >> With respect to using a hash (MD5 or SHA-1), you are creating a more
> >> random key that is unique to the record. Some would argue that with MD5
> >> or SHA-1 you could mathematically have a collision; however, you could
> >> then append the key to the hash to guarantee uniqueness. You could also
> >> take the hash, truncate it to the first byte, and then append the record
> >> key. This should give you enough randomness to avoid hot spotting after
> >> the initial region fills, and you could pre-split out any number of
> >> regions (the first byte has 256 possible values, so you can program the
> >> split).
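The one-byte hash prefix Mike describes can be sketched as follows. This is an illustrative sketch (the class name is mine): truncate the key's MD5 to its first byte and prepend it to the full key, giving 256 possible prefixes so the table can be pre-split on byte 0, while the original key survives intact at the tail of the row key.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch: one-byte MD5 prefix + full original key.
public class HashPrefixKey {
    static byte[] prefix(byte[] key) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(key);
            byte[] out = new byte[key.length + 1];
            out[0] = md5[0];                           // first hash byte only
            System.arraycopy(key, 0, out, 1, key.length);
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);        // MD5 is always available
        }
    }
}
```

Since the prefix is recomputable from the key, single-row gets still work; only sequential scans over the original key order are lost, as Mike notes below.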
> >>
> >>
> >> Having said that... yes, you lose the ability to perform a sequential
> >> scan of the data. At least to a point. It depends on your schema.
> >>
> >> Note that you need to think about how you are primarily going to access
> >> the data.  You can then determine the best way to store the data to gain
> >> the best performance. For some applications... the region hot spotting
> >> isn't an important issue.
> >>
> >> Note YMMV
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >> On Dec 18, 2012, at 3:33 AM, Damien Hardy <dha...@viadeoteam.com>
> wrote:
> >>
> >>> Hello,
> >>>
> >>> There is a middle ground between sequential keys (hot-spotting risk)
> >>> and md5 (heavy scans):
> >>> * you can use composite keys with a field that segregates the data
> >>> (hostname, product name, metric name), like OpenTSDB does
> >>> * or use a salt with a limited number of values (for example,
> >>> substr(md5(rowid),0,1) = 16 values),
> >>>   so that a scan is a combination of 16 scans, one per salt value;
> >>>   you can base your code on HBaseWD by Sematext:
> >>>
> >>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >>> https://github.com/sematext/HBaseWD
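Damien's 16-salt scan can be sketched without any HBase API — here a TreeMap stands in for the sorted table, and the class name is mine. With 16 one-character salt prefixes, one logical range scan becomes 16 physical sub-scans, one per prefix, whose results are concatenated (HBaseWD wraps the same idea behind its scanner abstraction).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative sketch: a range scan over salted keys = one sub-scan per salt.
public class SaltedScan {
    static final String HEX = "0123456789abcdef"; // 16 possible salt chars

    static List<String> scan(TreeMap<String, String> table,
                             String start, String stop) {
        List<String> rows = new ArrayList<>();
        for (char salt : HEX.toCharArray()) {          // one sub-scan per salt
            rows.addAll(table.subMap(salt + start, salt + stop).values());
        }
        return rows;
    }
}
```

Each sub-scan is cheap (it covers a contiguous key range), which is why this stays "good enough in many cases" despite being slower than a single unsalted range scan.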
> >>>
> >>> Cheers,
> >>>
> >>>
> >>> 2012/12/18 bigdata <bigdatab...@outlook.com>
> >>>
> >>>> Many articles tell me that using an MD5 of the rowkey, or of part of
> >>>> it, is a good method to balance the records stored across different
> >>>> regions. But I also want to scan some sequential rowkey records, such
> >>>> as when a date is the rowkey or part of it. With MD5 I cannot use a
> >>>> rowkey filter to scan a range of date values in one pass. How do I
> >>>> balance these two needs?
> >>>> Thanks.
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Damien HARDY
> >>
> >>
>
>
