Thanks to all of you!

Actually, I want to build reports on device access, both daily and over a selected range of days. I designed a table like this:

row key: date_deviceid

This rowkey lets me calculate the daily count of devices that logged in. I can add a prefix (such as a 2-byte prefix taken from MD5(date)) and still calculate a specific day quickly. But it is not suitable when I calculate over a range of dates. It is hard to balance, because I expect about 50% of the queries to be daily reports and the other 50% to be range reports.

Another question: I also have a report on the daily count of new deviceids (devices that have never accessed the system before). That means I have to use deviceid as the search condition across all dates.

I have met several problems like this: a rowkey works for one query but not for another, so I would need a different rowkey format for the other query. But I cannot create two copies of the original table with different rowkeys!

Any suggestions? Or better solutions for my questions?

Thanks
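To make the trade-off concrete, here is a rough Python sketch of the key layout described above. It assumes dates formatted as YYYYMMDD and a prefix of two hex characters of MD5(date); the function names are made up for illustration and nothing here talks to HBase. It only shows which key ranges each report would have to scan.

```python
import hashlib
from datetime import date, timedelta
from typing import List, Tuple

PREFIX_LEN = 2  # assumption: two hex characters taken from MD5(date)


def date_prefix(day: str) -> str:
    """Prefix derived from the date only, so it can be recomputed at query time."""
    return hashlib.md5(day.encode("ascii")).hexdigest()[:PREFIX_LEN]


def row_key(day: str, device_id: str) -> str:
    """Hypothetical layout: <md5 prefix>_<date>_<deviceid>."""
    return f"{date_prefix(day)}_{day}_{device_id}"


def day_scan_range(day: str) -> Tuple[str, str]:
    """Start (inclusive) and stop (exclusive) keys covering one day's rows."""
    start = f"{date_prefix(day)}_{day}_"
    # Replace the trailing "_" with the next ASCII character to close the range.
    stop = start[:-1] + chr(ord("_") + 1)
    return start, stop


def range_scan_ranges(first: date, last: date) -> List[Tuple[str, str]]:
    """A date range cannot be one contiguous scan: it needs one scan per day."""
    days = (last - first).days + 1
    return [
        day_scan_range((first + timedelta(days=n)).strftime("%Y%m%d"))
        for n in range(days)
    ]
```

With this layout a daily report is a single prefix scan, while a range report over N days turns into N separate scans. The new-device report is not served by this key at all, because deviceid is the last component of the key.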
> Subject: Re: Is it necessary to set MD5 on rowkey?
> From: [email protected]
> Date: Tue, 18 Dec 2012 07:52:53 -0600
> To: [email protected]
>
> Hi,
>
> First, the use of a 'Salt' is a very, very bad idea and I would really hope
> that the author of that blog take it down.
> While it may solve an initial problem in terms of region hot spotting, it
> creates another problem when it comes to fetching data. Fetching data takes
> more effort.
>
> With respect to using a hash (MD5 or SHA-1) you are creating a more random
> key that is unique to the record. Some would argue that using MD5 or SHA-1
> that mathematically you could have a collision, however you could then append
> the key to the hash to guarantee uniqueness. You could also do things like
> take the hash and then truncate it to the first byte and then append the
> record key. This should give you enough randomness to avoid hot spotting
> after the initial region completion and you could pre-split out any number of
> regions. (First byte 0-255 for values, so you can program the split...
>
> Having said that... yes, you lose the ability to perform a sequential scan of
> the data. At least to a point. It depends on your schema.
>
> Note that you need to think about how you are primarily going to access the
> data. You can then determine the best way to store the data to gain the best
> performance. For some applications... the region hot spotting isn't an
> important issue.
>
> Note YMMV
>
> HTH
>
> -Mike
>
> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>
> > Hello,
> >
> > There is a middle ground between sequential keys (hot spotting risk) and md5
> > (heavy scan):
> > * you can use composed keys with a field that can segregate data
> > (hostname, productname, metric name) like OpenTSDB
> > * or use Salt with a limited number of values (example
> > substr(md5(rowid),0,1) = 16 values)
> > so that a scan is a combination of 16 filters, one on each salt value
> > you can base your code on HBaseWD by sematext
> >
> > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > https://github.com/sematext/HBaseWD
> >
> > Cheers,
> >
> >
> > 2012/12/18 bigdata <[email protected]>
> >
> >> Many articles tell me that an MD5 rowkey, or part of it, is a good method to
> >> balance the records stored in different parts. But if I want to search some
> >> sequential rowkey records, such as with the date as all or part of the rowkey, I cannot
> >> use a rowkey filter to scan a range of date values in one pass over the dates hashed by
> >> MD5. How to balance this issue?
> >> Thanks.
> >>
> >
> > --
> > Damien HARDY
>
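
For reference, the 16-value salt suggested in the quoted reply (substr(md5(rowid),0,1)) can be sketched in Python as below. This is only an in-memory illustration: the sorted list stands in for a table, the helper names are invented, and HBaseWD's actual API is not used or reproduced here. It shows how one logical range scan becomes 16 physical scans whose results are merged, and which split points would pre-split the table into one region per salt.

```python
import bisect
import hashlib
from typing import List, Tuple

SALTS = "0123456789abcdef"  # one hex character gives 16 possible salt values


def salted_key(row_id: str) -> str:
    """Prepend the first hex character of MD5(row_id) to the original key."""
    salt = hashlib.md5(row_id.encode("utf-8")).hexdigest()[0]
    return salt + row_id


def fan_out(start: str, stop: str) -> List[Tuple[str, str]]:
    """Turn one logical range [start, stop) into one range per salt value."""
    return [(s + start, s + stop) for s in SALTS]


def split_points() -> List[str]:
    """Boundaries that would pre-split a table into 16 regions, one per salt."""
    return list(SALTS[1:])


def scan_range(sorted_keys: List[str], start: str, stop: str) -> List[str]:
    """Stand-in for a range scan over a sorted key space."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def range_query(sorted_keys: List[str], start: str, stop: str) -> List[str]:
    """Run the 16 salted scans, strip the salt, and merge the results."""
    hits: List[str] = []
    for s_start, s_stop in fan_out(start, stop):
        hits.extend(k[1:] for k in scan_range(sorted_keys, s_start, s_stop))
    return sorted(hits)
```

The cost that the quoted reply warns about is visible here: every range query pays for 16 scans plus a merge, in exchange for writes being spread across 16 key ranges.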
