You can play tricks with the arrangement of the key. For instance, you can put the date at the end of the key, after the user id. That would let you pull data for a particular user for a particular date range. The date should not be a full time stamp but a low-resolution version of the time (day-level resolution might be OK), so that you keep the number of rows to a minimum.
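Here is a rough sketch of that key layout in plain Python, with no HBase client involved. The `|` separator, the `YYYYMMDD` day format, and the names `row_key` and `scan_range` are all made up for illustration; adjust them to your schema. It only assumes that row keys sort as raw bytes and that a scan's start row is inclusive while its stop row is exclusive.

```python
from datetime import date, timedelta


def row_key(user_id: str, day: date) -> bytes:
    """Build a composite key: user id first, day-level date last.

    YYYYMMDD is fixed width, so byte order matches chronological order
    for keys that share the same user id prefix.
    """
    return f"{user_id}|{day:%Y%m%d}".encode("ascii")


def scan_range(user_id: str, first_day: date, last_day: date) -> tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering first_day..last_day for one user.

    The stop row is treated as exclusive, so it points at the day
    after last_day.
    """
    start = row_key(user_id, first_day)
    stop = row_key(user_id, last_day + timedelta(days=1))
    return start, stop


if __name__ == "__main__":
    # Hypothetical user id; in your case this would be the cookie UUID.
    start, stop = scan_range("user-1234", date(2011, 7, 1), date(2011, 7, 14))
    print(start, stop)
```

With keys like this, one row holds one user's data for one day, and the start/stop pair can be handed to whatever scan call your client offers. Note that this helps per-user range reads; it does not by itself let you scan all users for a given day.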

On Thu, Jul 14, 2011 at 12:52 PM, Andre Reiter <a.rei...@web.de> wrote:

> Hi everybody,
>
> we have our hadoop + hbase cluster running at the moment with 6 servers
>
> everything is working just fine. We have a web application, where data is
> stored with the row key = user id (meaningless UUID). So our users have a
> cookie, which is the row key, behind this key are families with items, i.e.
> family "impressions", where every impression is stored with its time stamp
> etc...
>
> the row key is defined with the user id, to make the real time request
> possible, so we can retrieve all user data very fast
>
> new we are running mapreduce jobs, to generate a report: for example we
> want to know how many impressions were done by all users in last x days.
> therefore the scan of the MR job is running over all data in our hbase table
> for the particular family. this takes at the moment about 70 seconds, which
> is actually a bit too long, and with the data growing, the time will
> increase, unless we add new workers to the cluster. we have right now 22
> regions
>
> the problem i see, is that we can not define a filter for the scan, the row
> key (user id) is just an UUID, nothing meaningfull in it
>
> what can we do, to however improve (accelerate) the scan process? is it
> maybe advisable to store the data more redundant. so for example we create
> second table and store every impression twice, one time with the user id as
> row key in the first table, and the second one with a time stamp as a row
> key in the second table.
> the data volume would grow twice as fast, but our scans will work x times
> faster on the second table compared to now
>
> comments are very appreciated
>
> andre