Let's say you want to decompose a URL into domain and path to include in your row key.

You could of course just use the URL as the key, but you will see hotspotting since most keys will start with "http". To mitigate this, you could add a random byte or two at the beginning (a random salt) to improve the distribution of keys, but then you break single-record Gets (and arguably Scans). Another approach is a hash-based salt: hash the whole key and use a few of its bytes as the salt. This fixes Gets, but Scans are still not effective.

One approach I've taken is to hash only part of the key. Consider the following key structure:

<2 bytes of hash(domain)><domain><path>

With this you get 16 bits for a hash-based salt. The salt is deterministic so Gets work fine, and for a single domain the salt is the same so you can easily do Scans across a domain. If you had some further structure to your key that you wished to scan across, you could do something like:

<2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
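
As a concrete illustration, here is a minimal sketch of building the first structure in Java. The hash choice (MD5), the class name, and the method names are mine; any stable hash works:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class UrlRowKey {
        // Builds <2 bytes of hash(domain)><domain><path>. The prefix is
        // deterministic, so Gets work, and every row for one domain shares
        // the same prefix, so per-domain Scans work too.
        public static byte[] rowKey(String domain, String path) {
            byte[] domainBytes = domain.getBytes(StandardCharsets.UTF_8);
            byte[] pathBytes = path.getBytes(StandardCharsets.UTF_8);
            byte[] key = new byte[2 + domainBytes.length + pathBytes.length];
            System.arraycopy(hashPrefix(domainBytes), 0, key, 0, 2);
            System.arraycopy(domainBytes, 0, key, 2, domainBytes.length);
            System.arraycopy(pathBytes, 0, key, 2 + domainBytes.length, pathBytes.length);
            return key;
        }

        // First 2 bytes of MD5(input): the 16 bits of hash-based salt.
        private static byte[] hashPrefix(byte[] input) {
            try {
                byte[] digest = MessageDigest.getInstance("MD5").digest(input);
                return new byte[] { digest[0], digest[1] };
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e); // MD5 is always available
            }
        }
    }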

It really boils down to identifying your access patterns and read/write requirements and constructing a row key accordingly.

HTH,
David

On 12/18/12 6:29 PM, Michael Segel wrote:
Alex,
And that's the point. A salt, as you explain it, conceptually means that the number you are adding to the key to ensure a better distribution will cost you efficiency in scans and gets.

Using a hash as the full key, or truncating the hash and appending the key to it, may screw up scans, but your get() is intact.

There are other options like inverting the numeric key ...
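
(One common reading of "inverting the numeric key", as a minimal sketch with a made-up id: reverse the decimal digits so sequential ids no longer share a prefix.)

    import org.apache.hadoop.hbase.util.Bytes;

    public class InvertedKey {
        // Reversing the digits of sequential ids 10001, 10002, 10003 yields
        // 10001, 20001, 30001, which no longer sort next to each other.
        public static byte[] rowKey(long id) {
            String reversed = new StringBuilder(Long.toString(id)).reverse().toString();
            return Bytes.toBytes(reversed);
        }
    }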

And of course doing nothing.

Using a salt as part of the design pattern is bad.

With respect to the OP, I was discussing the use of a hash and some alternatives for how to implement the hash of a key.
Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.


On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:

Mike,

Please read the *full post* before judging. In particular, the "Hash-based
distribution" section. You can find the same in HBaseWD's small README file
[1] (not sure if you read it at all before commenting on the lib). Round
robin is mainly there to explain the concept/idea (though not only for that).

Thank you,
Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] https://github.com/sematext/HBaseWD

On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel <[email protected]> wrote:

Quick answer...

Look at the salt.
It's just a number from a round-robin counter.
There is no tie between the salt and the row.

So when you want to fetch a single row, how do you do it?
...
;-)

On Dec 18, 2012, at 11:12 AM, Alex Baranau <[email protected]> wrote:

Hello,

@Mike:

I'm the author of that post :).

Quick reply to your last comment:

1) Could you please describe why "the use of a 'Salt' is a very, very bad
idea" in a more specific way than "Fetching data takes more effort"? That
would be helpful for anyone who is looking into using this approach.

2) The approach described in the post also says you can prefix with the
hash; you probably missed that.

3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata. Please
re-read the question: the intention is to distribute the load while still
being able to do "partial key scans". The blog post linked above explains
one possible solution for that, while your answer doesn't.

@bigdata:

Basically, when it comes to solving the two issues of distributing writes
and having the ability to read data sequentially, you have to balance how
good you are at each of them. Slide 22 of this very good presentation by
Lars shows how the two are correlated:

http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012

In short:
* having an md5/other hash prefix on the key does better w.r.t. distributing
writes, while compromising the ability to do range scans efficiently
* having a very limited number of 'salt' prefixes still allows range scans
(less efficiently than normal range scans, of course, but still good enough
in many cases), while providing worse distribution of writes

In the latter case, by choosing the number of possible 'salt' prefixes
(which could be derived from hashed values, etc.), you can balance write
distribution against the ability to run fast range scans.
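
To make that knob concrete, a minimal sketch (the names are mine) of a salt derived from a hash but restricted to numBuckets values:

    import java.security.MessageDigest;

    public class SaltedKey {
        // e.g. 16: more buckets = better write distribution,
        // but more physical scans per logical range read.
        private final int numBuckets;

        public SaltedKey(int numBuckets) {
            this.numBuckets = numBuckets;
        }

        // <1-byte salt><original key>; the salt is deterministic, so Gets
        // work, and a range scan fans out over at most numBuckets prefixes.
        public byte[] rowKey(byte[] originalKey) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(originalKey);
            byte salt = (byte) ((md5[0] & 0xFF) % numBuckets);
            byte[] key = new byte[1 + originalKey.length];
            key[0] = salt;
            System.arraycopy(originalKey, 0, key, 1, originalKey.length);
            return key;
        }
    }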

Hope this helps

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <[email protected]> wrote:
Hi,

First, the use of a 'Salt' is a very, very bad idea and I would really
hope that the author of that blog takes it down.
While it may solve an initial problem in terms of region hotspotting, it
creates another problem when it comes to fetching data: fetching data takes
more effort.

With respect to using a hash (MD5 or SHA-1), you are creating a more random
key that is unique to the record. Some would argue that with MD5 or SHA-1
you could mathematically have a collision; however, you could then append
the key to the hash to guarantee uniqueness. You could also do things like
take the hash, truncate it to the first byte, and then append the record
key. This should give you enough randomness to avoid hotspotting after the
initial region completion, and you could pre-split out any number of
regions. (The first byte has 256 possible values, 0-255, so you can program
the split.)
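
(A minimal sketch of that layout against the 0.94-era client API; the table and family names are made up:)

    import java.io.IOException;
    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class HashPrefixKeys {
        // <first byte of MD5(recordKey)><recordKey>: random-looking but
        // deterministic, so get() stays intact.
        public static byte[] rowKey(byte[] recordKey) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(recordKey);
            byte[] key = new byte[1 + recordKey.length];
            key[0] = md5[0];
            System.arraycopy(recordKey, 0, key, 1, recordKey.length);
            return key;
        }

        // Pre-split on the 256 possible prefix values (255 split points).
        public static void createPresplitTable(HBaseAdmin admin) throws IOException {
            HTableDescriptor desc = new HTableDescriptor("mytable"); // hypothetical
            desc.addFamily(new HColumnDescriptor("d"));              // hypothetical
            byte[][] splits = new byte[255][];
            for (int i = 1; i <= 255; i++) {
                splits[i - 1] = new byte[] { (byte) i };
            }
            admin.createTable(desc, splits);
        }
    }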


Having said that... yes, you lose the ability to perform a sequential scan
of the data. At least to a point. It depends on your schema.

Note that you need to think about how you are primarily going to access
the data.  You can then determine the best way to store the data to gain
the best performance. For some applications... the region hot spotting
isn't an important issue.

Note YMMV

HTH

-Mike

On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
Hello,

There is a middle ground between sequential keys (hotspotting risk) and md5
(heavy scans):
* you can use composed keys with a field that can segregate data
(hostname, product name, metric name), like OpenTSDB
* or use a salt with a limited number of values (for example
substr(md5(rowid),0,1) = 16 values), so that a scan is a combination of 16
scans, one on each salt value (see the sketch below the links); you can
base your code on HBaseWD by sematext


http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
https://github.com/sematext/HBaseWD
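
A minimal sketch of that fan-out (against the 0.94-era client API, and assuming a one-byte salt in the range 0-15 rather than a hex character; HBaseWD packages this pattern up properly):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class SaltedScanner {
        // One logical range scan becomes 16 physical scans, one per salt
        // value. Results come back grouped by salt, not in global key order.
        public static List<Result> scanAllSalts(HTable table, byte[] start, byte[] stop)
                throws Exception {
            List<Result> results = new ArrayList<Result>();
            for (int salt = 0; salt < 16; salt++) {
                byte[] prefix = new byte[] { (byte) salt };
                ResultScanner scanner =
                        table.getScanner(new Scan(concat(prefix, start), concat(prefix, stop)));
                try {
                    for (Result r : scanner) {
                        results.add(r);
                    }
                } finally {
                    scanner.close();
                }
            }
            return results;
        }

        private static byte[] concat(byte[] a, byte[] b) {
            byte[] out = new byte[a.length + b.length];
            System.arraycopy(a, 0, out, 0, a.length);
            System.arraycopy(b, 0, out, a.length, b.length);
            return out;
        }
    }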

Cheers,


2012/12/18 bigdata <[email protected]>

Many articles tell me that an MD5 rowkey, or a partial one, is a good
method to balance how records are stored across different regions. But if I
want to search some sequential rowkey records, such as a date used as the
rowkey (or as part of it), I cannot use a rowkey filter to scan a range of
date values in one pass once the key is MD5-hashed. How do I balance this
issue?
Thanks.





--
Damien HARDY


