RE: Is it necessary to set MD5 on rowkey?

bigdata Tue, 18 Dec 2012 07:21:11 -0800

Thanks to all of you! 
Actually, I want to build reports on device access counts, both daily and over 
selected date ranges. I designed a table like this:
row key:  date_deviceid
This rowkey lets me calculate the daily count of devices that logged in. I can add a 
prefix (such as 2 bytes of MD5(date)) and still calculate a specific day quickly. 
But it is not suitable when I calculate over a date range. It's hard to balance 
the two, because I expect about 50% of my queries to be daily reports and the 
other 50% to be range reports.
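
A minimal sketch of the prefixed key, to show why a single day stays one scan but a date range does not. It assumes dates formatted as YYYYMMDD and the 2 MD5 bytes hex-encoded for readability; the function names are hypothetical and no HBase client is involved.

```python
import hashlib
from datetime import date, timedelta


def md5_prefix(day):
    # First 2 bytes of MD5(date), hex-encoded (4 characters). Assumed encoding.
    return hashlib.md5(day.encode("ascii")).digest()[:2].hex()


def prefixed_key(day, device_id):
    # Row key layout: <prefix>_<date>_<deviceid>
    return "{}_{}_{}".format(md5_prefix(day), day, device_id).encode("ascii")


def day_scan_range(day):
    # Start row (inclusive) and stop row (exclusive) covering one day.
    start = "{}_{}_".format(md5_prefix(day), day).encode("ascii")
    stop = start[:-1] + bytes([start[-1] + 1])  # bump the trailing '_' by one
    return start, stop


def range_scan_ranges(first, last):
    # Each day hashes to an unrelated prefix, so a date range becomes
    # one separate scan per day instead of one contiguous scan.
    days = (last - first).days + 1
    return [
        day_scan_range((first + timedelta(days=n)).strftime("%Y%m%d"))
        for n in range(days)
    ]
```

For a 30-day report this yields 30 start/stop pairs, which is the extra cost of the prefix for range reports.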
Another question: I also have a report on the daily count of new deviceids 
(devices that have never accessed the system before). This means I have to 
search by deviceid across all dates. I've met several problems like this: a 
rowkey design works for one query but not for another, so I would need a 
different rowkey format for the other query. But I cannot create two original 
tables with different rowkeys!
Any suggestions, or better solutions for my questions? Thanks
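
To make the two-rowkey problem concrete, here is a rough sketch in which every access event is written under both key formats. Plain Python dicts stand in for two tables, and all names are hypothetical; this only illustrates the idea and is not a tested HBase design.

```python
def record_access(access, first_seen, day, device_id):
    # Table 1, keyed by date_deviceid: serves the daily and range reports.
    access["{}_{}".format(day, device_id)] = b""
    # Table 2, keyed by deviceid: remembers the earliest day a device was seen.
    previous = first_seen.get(device_id)
    if previous is None or day < previous:
        first_seen[device_id] = day


def daily_device_count(access, day):
    # Devices that accessed the system on the given day (prefix match on date).
    prefix = day + "_"
    return sum(1 for key in access if key.startswith(prefix))


def new_device_count(first_seen, day):
    # Devices whose first recorded access is the given day.
    return sum(1 for first_day in first_seen.values() if first_day == day)
```

In this sketch the new-device count still walks the whole deviceid table, so a real design would probably need a per-day counter or similar on top of it.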


> Subject: Re: Is it necessary to set MD5 on rowkey?
> From: [email protected]
> Date: Tue, 18 Dec 2012 07:52:53 -0600
> To: [email protected]
> 
> 
> Hi,
> 
> First, the use of a 'Salt' is a very, very bad idea and I would really hope 
> that the author of that blog take it down.
> While it may solve an initial problem in terms of region hot spotting, it 
> creates another problem when it comes to fetching data. Fetching data takes 
> more effort.
> 
> With respect to using a hash (MD5 or SHA-1), you are creating a more random 
> key that is unique to the record.  Some would argue that with MD5 or SHA-1 
> you could mathematically have a collision; however, you could then append 
> the key to the hash to guarantee uniqueness. You could also do things like 
> take the hash, truncate it to the first byte, and then append the 
> record key. This should give you enough randomness to avoid hot spotting 
> after the initial region completion, and you could pre-split out any number of 
> regions. (The first byte takes values 0-255, so you can program the split...)
> 
> 
> Having said that... yes, you lose the ability to perform a sequential scan of 
> the data.  At least to a point.  It depends on your schema. 
> 
> Note that you need to think about how you are primarily going to access the 
> data.  You can then determine the best way to store the data to gain the best 
> performance. For some applications... the region hot spotting isn't an 
> important issue. 
> 
> Note YMMV
> 
> HTH
> 
> -Mike
> 
> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
> 
> > Hello,
> > 
> > There is a middle ground between sequential keys (hot spotting risk) and md5
> > (heavy scan):
> >  * you can use composed keys with a field that can segregate data
> > (hostname, productname, metric name) like OpenTSDB
> >  * or use Salt with a limited number of values (example
> > substr(md5(rowid),0,1) = 16 values)
> >    so that a scan is a combination of 16 filters, one on each salt value
> >    you can base your code on HBaseWD by sematext
> > 
> > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >       https://github.com/sematext/HBaseWD
> > 
> > Cheers,
> > 
> > 
> > 2012/12/18 bigdata <[email protected]>
> > 
> > >> Many articles tell me that using MD5 for the rowkey, or part of it, is a good
> > >> method to balance the records stored in different parts. But what if I want to
> > >> search sequential rowkey records, such as with the date as the rowkey or part
> > >> of it? I cannot use a rowkey filter to scan a range of date values in one pass
> > >> when the date is hashed with MD5. How to balance this issue?
> >> Thanks.
> >> 
> >> 
> > 
> > 
> > 
> > 
> > -- 
> > Damien HARDY
>
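
The limited-salt idea quoted above (substr(md5(rowid),0,1), 16 values) can be sketched as follows. This is only an illustration under stated assumptions: a sorted Python list stands in for the table, the function names are hypothetical, and HBaseWD's actual API is not used. With a full first byte as the prefix there would be 256 buckets instead of 16.

```python
import hashlib
import heapq
from bisect import bisect_left

BUCKETS = "0123456789abcdef"  # the 16 possible first hex characters of an MD5


def bucketed_key(row_id):
    # The prefix is derived from the row id itself, so a point lookup can recompute it.
    salt = hashlib.md5(row_id.encode("utf-8")).hexdigest()[0]
    return salt + "_" + row_id


def bucket_scan_ranges(start_id, stop_id):
    # One (start, stop) pair per bucket; stop is exclusive.
    return [(b + "_" + start_id, b + "_" + stop_id) for b in BUCKETS]


def scan_original_range(sorted_keys, start_id, stop_id):
    # Run the 16 sub-scans over a sorted key list, strip the 2-character
    # prefix, and merge the results back into original key order.
    per_bucket = []
    for lo, hi in bucket_scan_ranges(start_id, stop_id):
        i = bisect_left(sorted_keys, lo)
        j = bisect_left(sorted_keys, hi)
        per_bucket.append([key[2:] for key in sorted_keys[i:j]])
    return list(heapq.merge(*per_bucket))


def presplit_points():
    # 15 split points give 16 regions, one per bucket.
    return list(BUCKETS[1:])
```

A range read costs 16 sub-scans here, which is the trade-off against a single sequential scan on unprefixed keys.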
