Hi Mike,

If in your business case, the only thing you need when you retreive
your data is to do full scan over MR jobs, then you can salt with
what-ever you want. Hash, random values, etc.

If you know you have x regions, then you can simply do a round-robin
salting, or a random salting over those x regions.

Then when you run your MR job, you discard the first bytes, and do
what you want with your data.

So I also think that salting can still be usefull. All depend on what
you do with your data.

Must my opinion.

JM

2012/12/19, Michael Segel <[email protected]>:
> Ok...
>
> So you use a random byte or two at the front of the row.
> How do you then use get() to find the row?
> How do you do a partial scan()?
>
> Do you start to see the problem?
> The only way to get to the row is to do a full table scan. That kills HBase
> and you would be better off going with a partitioned Hive table.
>
> Using a hash of the key or a portion of the hash is not a salt.
> That's not what I have a problem with. Each time you want to fetch the key,
> you just hash it, truncate the hash and then prepend it to the key. You will
> then be able to use get().
>
> Using a salt would imply using some form of a modulo math to get a round
> robin prefix.  Or a random number generator.
>
> That's the issue.
>
> Does that make sense?
>
>
>
> On Dec 19, 2012, at 3:26 PM, David Arthur <[email protected]> wrote:
>
>> Let's say you want to decompose a url into domain and path to include in
>> your row key.
>>
>> You could of course just use the url as the key, but you will see
>> hotspotting since most will start with "http". To mitigate this, you could
>> add a random byte or two at the beginning (random salt) to improve
>> distribution of keys, but you break single record Gets (and Scans
>> arguably). Another approach is to use a hash-based salt: hash the whole
>> key and use a few of those bytes as a salt. This fixes Gets but Scans are
>> still not effective.
>>
>> One approach I've taken is to hash only a part of the key. Consider the
>> following key structure
>>
>> <2 bytes of hash(domain)><domain><path>
>>
>> With this you get 16 bits for a hash-based salt. The salt is deterministic
>> so Gets work fine, and for a single domain the salt is the same so you can
>> easily do Scans across a domain. If you had some further structure to your
>> key that you wished to scan across, you could do something like:
>>
>> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
>>
>> It really boils down to identifying your access patterns and read/write
>> requirements and constructing a row key accordingly.
>>
>> HTH,
>> David
>>
>> On 12/18/12 6:29 PM, Michael Segel wrote:
>>> Alex,
>>> And that's the point. Salt as you explain it conceptually implies that
>>> the number you are adding to the key to ensure a better distribution
>>> means that you will have inefficiencies in terms of scans and gets.
>>>
>>> Using a hash as either the full key, or taking the hash, truncating it
>>> and appending the key may screw up scans, but your get() is intact.
>>>
>>> There are other options like inverting the numeric key ...
>>>
>>> And of course doing nothing.
>>>
>>> Using a salt as part of the design pattern is bad.
>>>
>>> With respect to the OP, I was discussing the use of hash and some
>>> alternatives to how to implement the hash of a key.
>>> Again, doing nothing may also make sense too, if you understand the risks
>>> and you know how your data is going to be used.
>>>
>>>
>>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]>
>>> wrote:
>>>
>>>> Mike,
>>>>
>>>> Please read *full post* before judge. In particular, "Hash-based
>>>> distribution" section. You can find the same in HBaseWD small README
>>>> file
>>>> [1] (not sure if you read it at all before commenting on the lib).
>>>> Round
>>>> robin is mainly for explaining the concept/idea (though not only for
>>>> that).
>>>>
>>>> Thank you,
>>>> Alex Baranau
>>>> ------
>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
>>>> -
>>>> Solr
>>>>
>>>> [1] https://github.com/sematext/HBaseWD
>>>>
>>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>>>> <[email protected]>wrote:
>>>>
>>>>> Quick answer...
>>>>>
>>>>> Look at the salt.
>>>>> Its just a number from a round robin counter.
>>>>> There is no tie between the salt and row.
>>>>>
>>>>> So when you want to fetch a single row, how do you do it?
>>>>> ...
>>>>> ;-)
>>>>>
>>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> @Mike:
>>>>>>
>>>>>> I'm the author of that post :).
>>>>>>
>>>>>> Quick reply to your last comment:
>>>>>>
>>>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very
>>>>>> bad
>>>>>> idea" in more specific way than "Fetching data takes more effort".
>>>>>> Would
>>>>> be
>>>>>> helpful for anyone who is looking into using this approach.
>>>>>>
>>>>>> 2) The approach described in the post also says you can prefix with
>>>>>> the
>>>>>> hash, you probably missed that.
>>>>>>
>>>>>> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata
>>>>>> guy.
>>>>>> Please re-read the question: the intention is to distribute the load
>>>>> while
>>>>>> still being able to do "partial key scans". The blog post linked
>>>>>> above
>>>>>> explains one possible solution for that, while your answer doesn't.
>>>>>>
>>>>>> @bigdata:
>>>>>>
>>>>>> Basically when it comes to solving two issues: distributing writes
>>>>>> and
>>>>>> having ability to read data sequentially, you have to balance between
>>>>> being
>>>>>> good at both of them. Very good presentation by Lars:
>>>>>>
>>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
>>>>> ,
>>>>>> slide 22. You will see how this is correlated. In short:
>>>>>> * having md5/other hash prefix of the key does better w.r.t.
>>>>>> distributing
>>>>>> writes, while compromises ability to do range scans efficiently
>>>>>> * having very limited number of 'salt' prefixes still allows to do
>>>>>> range
>>>>>> scans (less efficiently than normal range scans, of course, but still
>>>>> good
>>>>>> enough in many cases) while providing worse distribution of writes
>>>>>>
>>>>>> In the latter case by choosing number of possible 'salt' prefixes
>>>>>> (which
>>>>>> could be derived from hashed values, etc.) you can balance between
>>>>>> distributing writes efficiency and ability to run fast range scans.
>>>>>>
>>>>>> Hope this helps
>>>>>>
>>>>>> Alex Baranau
>>>>>> ------
>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
>>>>>> ElasticSearch
>>>>> -
>>>>>> Solr
>>>>>>
>>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <
>>>>> [email protected]>wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> First, the use of a 'Salt' is a very, very bad idea and I would
>>>>>>> really
>>>>>>> hope that the author of that blog take it down.
>>>>>>> While it may solve an initial problem in terms of region hot
>>>>>>> spotting,
>>>>> it
>>>>>>> creates another problem when it comes to fetching data. Fetching
>>>>>>> data
>>>>> takes
>>>>>>> more effort.
>>>>>>>
>>>>>>> With respect to using a hash (MD5 or SHA-1) you are creating a more
>>>>> random
>>>>>>> key that is unique to the record.  Some would argue that using MD5
>>>>>>> or
>>>>> SHA-1
>>>>>>> that mathematically you could have a collision, however you could
>>>>>>> then
>>>>>>> append the key to the hash to guarantee uniqueness. You could also
>>>>>>> do
>>>>>>> things like take the hash and then truncate it to the first byte and
>>>>> then
>>>>>>> append the record key. This should give you enough randomness to
>>>>>>> avoid
>>>>> hot
>>>>>>> spotting after the initial region completion and you could pre-split
>>>>>>> out
>>>>>>> any number of regions. (First byte 0-255 for values, so you can
>>>>>>> program
>>>>> the
>>>>>>> split...
>>>>>>>
>>>>>>>
>>>>>>> Having said that... yes, you lose the ability to perform a
>>>>>>> sequential
>>>>> scan
>>>>>>> of the data.  At least to a point.  It depends on your schema.
>>>>>>>
>>>>>>> Note that you need to think about how you are primarily going to
>>>>>>> access
>>>>>>> the data.  You can then determine the best way to store the data to
>>>>>>> gain
>>>>>>> the best performance. For some applications... the region hot
>>>>>>> spotting
>>>>>>> isn't an important issue.
>>>>>>>
>>>>>>> Note YMMV
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]>
>>>>> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> There is middle term betwen sequecial keys (hot spoting risk) and
>>>>>>>> md5
>>>>>>>> (heavy scan):
>>>>>>>> * you can use composed keys with a field that can segregate data
>>>>>>>> (hostname, productname, metric name) like OpenTSDB
>>>>>>>> * or use Salt with a limited number of values (example
>>>>>>>> substr(md5(rowid),0,1) = 16 values)
>>>>>>>>  so that a scan is a combination of 16 filters on on each salt
>>>>>>>> values
>>>>>>>>  you can base your code on HBaseWD by sematext
>>>>>>>>
>>>>>>>>
>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>     https://github.com/sematext/HBaseWD
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>>
>>>>>>>> 2012/12/18 bigdata <[email protected]>
>>>>>>>>
>>>>>>>>> Many articles tell me that MD5 rowkey or part of it is good method
>>>>>>>>> to
>>>>>>>>> balance the records stored in different parts. But If I want to
>>>>>>>>> search
>>>>>>> some
>>>>>>>>> sequential rowkey records, such as date as rowkey or partially. I
>>>>>>>>> can
>>>>>>> not
>>>>>>>>> use rowkey filter to scan a range of date value one time on the
>>>>>>>>> date
>>>>> by
>>>>>>>>> MD5. How to balance this issue?
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Damien HARDY
>>>>>>>
>>>>>
>>
>>
>
>

Reply via email to