JM, have a look at https://github.com/sematext/HBaseWD (this comes up often.... Doug, maybe you could add it to the Ref Guide?)
Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

>________________________________
> From: Jean-Marc Spaggiari <jean-m...@spaggiari.org>
>To: user@hbase.apache.org
>Sent: Wednesday, June 13, 2012 12:16 PM
>Subject: Timestamp as a key good practice?
>
>I watched Lars George's video about HBase and read the documentation.
>Both say that it's not a good idea to use the timestamp as a key,
>because all writes will hit the same region until the timestamp
>reaches a certain value and moves on to the next region (hotspotting).
>
>I have a table with a unique key, a file path and a "last update" field.
>I can easily find a file by its ID and see when it was last updated.
>
>But I also need to find the files that have not been updated for more
>than a certain period of time.
>
>If I want to retrieve that from this single table, I will have to do a
>full scan of the table, which might take a while.
>
>So I thought of building a second table to reference that (a kind of
>secondary index). The key is the "last update" value, with one column
>family, and each column holds the ID of a file with dummy content.
>
>When a file is updated, I remove its cell from this table and
>insert a new cell with the new timestamp as the key.
>
>And so on.
>
>With this schema, I can find the files by ID very quickly and I can
>find the files which need to be updated pretty quickly too. But it's
>hotspotting one region.
>
>From the video (0:45:10) I can see 4 situations.
>1) Hotspotting.
>2) Salting.
>3) Key field swap/promotion
>4) Randomization.
>
>I need to avoid hotspotting, so I looked at the 3 other options.
>
>I can do salting, e.g. prefix the timestamp with a number between 0
>and 9. That will distribute the load over 10 servers. To find all
>the files with a timestamp below a specific value, I will need to run
>10 requests instead of one. But when the load becomes too big for
>10 servers, I will have to prefix with a byte between 0 and 99? Which
>means 100 requests? And the more regions I have, the more requests
>I will have to do. Is that really a good approach?
>
>Key field swap is close to salting. I can add the first few bytes of
>the path before the timestamp, but the issue will remain the same.
>
>I looked at randomization, and I can't do that. Otherwise I will have
>no way to retrieve the information I'm looking for.
>
>So the question is: is there a good way to store the data so that I can
>retrieve it based on the date?
>
>Thanks,
>
>JM
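
For readers following the thread, here is a rough sketch of the salting option described in the quoted message: prefix each "last update" key with a bucket number, then answer "older than X" with one range scan per bucket and merge the results. It is plain Python over a sorted list standing in for the index table, not HBase client code. The names salted_key, scan_bucket and stale_files, the key layout, the 13-digit zero padding and the choice of deriving the bucket from a CRC of the file ID are all assumptions made for illustration.

```python
import bisect
import heapq
import zlib

NUM_BUCKETS = 10  # assumed bucket count, matching the "0 to 9" prefix in the message


def salted_key(last_update_ms, file_id, buckets=NUM_BUCKETS):
    """Build 'bucket-timestamp-fileid' (hypothetical layout).

    The bucket is derived from the file ID so the old cell can be located
    again when the file is updated. Zero padding keeps string order equal
    to numeric order.
    """
    bucket = zlib.crc32(file_id.encode("utf-8")) % buckets
    return f"{bucket:02d}-{last_update_ms:013d}-{file_id}"


def scan_bucket(sorted_keys, bucket, cutoff_ms):
    """Range scan of one bucket: yield (timestamp, file_id) with timestamp < cutoff_ms."""
    start = f"{bucket:02d}-"
    stop = f"{bucket:02d}-{cutoff_ms:013d}"
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    for key in sorted_keys[lo:hi]:
        _, ts, file_id = key.split("-", 2)
        yield int(ts), file_id


def stale_files(sorted_keys, cutoff_ms, buckets=NUM_BUCKETS):
    """One scan per bucket, merged back into global timestamp order."""
    scans = [scan_bucket(sorted_keys, b, cutoff_ms) for b in range(buckets)]
    return list(heapq.merge(*scans))


# Sorted list of keys standing in for the index table.
table = sorted(
    salted_key(ts, fid)
    for ts, fid in [(1000, "a"), (5000, "b"), (2000, "c"), (9000, "d")]
)
print(stale_files(table, cutoff_ms=5000))
```

The number of scans grows with the number of buckets, which is exactly the cost raised in the question. Each scan only touches the rows below the cutoff in its own bucket, though, and on a real cluster the scans could presumably be issued in parallel.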