[ 
https://issues.apache.org/jira/browse/HBASE-11682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101308#comment-14101308
 ] 

Jonathan Hsieh commented on HBASE-11682:
----------------------------------------

{code}
+      <para>Salting in this sense has nothing to do with cryptography, but 
refers to adding random
+        data to the start of a row key. In this case, salting refers to adding 
a prefix to the row
+        key to cause it to sort differently than it otherwise would. Salting 
can be helpful if you
+        have a few keys that come up over and over, along with other rows that 
don't fit those keys.
+        In that case, the regions holding rows with the "hot" keys would be 
overloaded, compared to
+        the other regions. Salting completely removes ordering, so is often a 
poorer choice than
+        hashing. Using totally random row keys for data which is accessed 
sequentially would remove
+        the benefit of HBase's row-sorting algorithm and cause very poor 
performance, as each get or
+        scan would need to query all regions.</para>
{code}

I don't think this salting example is correct about the ramifications.  Both 
Nick and I agree that salting is putting some random value in front of the 
actual value.  This means that instead of one sorted list of entries, we'd 
have n sorted lists of entries if the cardinality of the salt is n.

Example:  naively we have rowkeys like this:

foo0001
foo0002
foo0003
foo0004

If we use a 4-way salt (a,b,c,d), we could end up with data re-sorted like this:

a-foo0003
b-foo0001
c-foo0004
d-foo0002

Let's say we add some new values to row foo0003.  It could get salted with a 
new salt, let's say 'c'.

a-foo0003
b-foo0001
*c-foo0003*
c-foo0004
d-foo0002

To read, we could still get things back in the original order, but we'd need 
a reader starting from each salt in parallel to get the rows back in order 
(and we'd likely need to do some coalescing of foo0003 to combine the 
a-foo0003 and c-foo0003 rows back into one).  The effect in this situation is 
that we could now be writing with 4x the throughput, since we would be on 4 
different machines (assuming that a, b, c, and d are balanced onto different 
machines).
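
As a rough illustration of the read path described above, here is a minimal 
Python sketch. It is not HBase client code: the dict-backed `table`, the 
helper names (`put`, `scan_bucket`, `read_in_order`), and the cell names are 
all assumptions made up for this example. It opens one sorted reader per salt, 
merges them by the original row key, and coalesces rows that were written 
under more than one salt.

```python
import heapq
import random

SALTS = ["a", "b", "c", "d"]  # the 4-way salt from the example above


def put(table, row_key, cells, rng=random):
    """Write cells under a randomly salted key (hypothetical helper)."""
    key = f"{rng.choice(SALTS)}-{row_key}"
    table.setdefault(key, {}).update(cells)
    return key


def scan_bucket(table, salt):
    """Yield (original_row_key, cells) for one salt, in sorted order."""
    prefix = salt + "-"
    for key in sorted(k for k in table if k.startswith(prefix)):
        yield key[len(prefix):], table[key]


def read_in_order(table):
    """Merge one reader per salt and coalesce duplicate row keys."""
    readers = [scan_bucket(table, salt) for salt in SALTS]
    rows = []
    for row_key, cells in heapq.merge(*readers, key=lambda kv: kv[0]):
        if rows and rows[-1][0] == row_key:
            rows[-1][1].update(cells)  # e.g. a-foo0003 + c-foo0003
        else:
            rows.append((row_key, dict(cells)))
    return rows


# The salted layout from the example, with made-up cell contents.
table = {
    "a-foo0003": {"c1": "old"},
    "b-foo0001": {"c1": "x"},
    "c-foo0003": {"c2": "new"},
    "c-foo0004": {"c1": "y"},
    "d-foo0002": {"c1": "z"},
}
rows = read_in_order(table)
```

Here `rows` comes back ordered foo0001 through foo0004, with the two foo0003 
entries combined into one. The cost is that every ordered read has to touch 
all n salt buckets.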

Nick's point of view (please correct me if I am wrong) says that you could 
"salt" the original row key with a one-way hash so that foo0003 would always 
get salted with 'a'.  This would spread rowkeys that are lexicographically 
close (foo0001 and foo0002) to different machines, which could help reduce 
contention and increase overall throughput, but would never allow a single 
row to have 4x the throughput like the other approach.
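
A minimal Python sketch of this deterministic variant, for comparison. The 
choice of MD5 and the helper names (`salt_for`, `salted_key`) are assumptions 
for illustration only; the thread does not name a particular hash function.

```python
import hashlib

SALTS = ["a", "b", "c", "d"]  # same 4-way salt as in the example above


def salt_for(row_key):
    """Derive the salt from a one-way hash of the row key (MD5 is assumed)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return SALTS[digest[0] % len(SALTS)]


def salted_key(row_key):
    """The same row key always maps to the same salted key."""
    return f"{salt_for(row_key)}-{row_key}"


# A point read can recompute the salted key instead of probing every bucket.
key_for_get = salted_key("foo0003")
```

Because the salt is a pure function of the row key, all writes to foo0003 
land in a single bucket, which is why this approach cannot give one row the 
n-way write throughput of a random salt.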

{code}
+      <para>Hashing refers to applying a random one-way function to the row 
key, such that a
+        particular row always gets the same arbitrary value applied. This 
preserves the sort order
+        so that scans are effective, but spreads out load across a region. One 
example where hashing
+        is the right strategy would be if for some reason, a large proportion 
of rows started with
+        the same letter. Normally, these would all be sorted into the same 
region. You can apply a
+        hash to artificially differentiate them and spread them out.</para>
{code}

Hashing actually totally trashes the sort order -- in fact the goal of hashing 
is to disperse entries that are near each other lexicographically as evenly 
as possible.
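
To make that concrete, here is a small Python sketch that compares the 
lexicographic order of some row keys with the order of their hashed forms. 
MD5 and the helper name `hashed_key` are assumptions for illustration; any 
well-mixing hash would behave similarly.

```python
import hashlib


def hashed_key(row_key):
    """Replace the row key with a hex digest of it (MD5 is assumed)."""
    return hashlib.md5(row_key.encode("utf-8")).hexdigest()


# 100 lexicographically adjacent row keys: foo0001 .. foo0100.
keys = [f"foo{i:04d}" for i in range(1, 101)]

# The order in which the store would hold them if keyed by the hash.
stored_order = sorted(keys, key=hashed_key)

# True: neighbors in the original order are scattered once hashed.
order_changed = stored_order != keys
```

A range scan from foo0001 to foo0100 over the hashed keys would therefore no 
longer be a contiguous read.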

> Explain hotspotting
> -------------------
>
>                 Key: HBASE-11682
>                 URL: https://issues.apache.org/jira/browse/HBASE-11682
>             Project: HBase
>          Issue Type: Task
>          Components: documentation
>            Reporter: Misty Stanley-Jones
>            Assignee: Misty Stanley-Jones
>         Attachments: HBASE-11682-1.patch, HBASE-11682.patch, HBASE-11682.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
