First, I recommend upgrading to the latest HBase 0.19 release, 0.19.3.
You have a few choices, but in short you want to use filters.
http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/filter/package-summary.html
Specifically, you should look at the RegExpRowFilter:
http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/filter/RegExpRowFilter.html
You could set up the regular expression so that the scan only returns
rows from the month you want. Inside the MR job you would then know
that every row returned comes from the month in question, and you
could parse the key to determine the agency_id and day.
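To make that concrete, here is a minimal sketch in plain Java (no HBase
classes) of a regular expression for one month, and of pulling the
agency_id and day back out of a key. The names MonthKeys, monthRegex and
parse are made up for illustration. I am also assuming the filter uses
java.util.regex syntax and matches against the whole row key, so check
the RegExpRowFilter javadoc before relying on it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase.
class MonthKeys {
    // Daily keys look like "<agency_id>_yyyy-MM-dd", e.g. "208_2009-06-08".
    // Assumption: agency_id never contains an underscore.
    private static final Pattern DAILY_KEY =
        Pattern.compile("([^_]+)_(\\d{4}-\\d{2})-(\\d{2})");

    /** Builds a regex that matches every daily key of one month, for any agency. */
    static String monthRegex(String yearMonth) {
        if (!yearMonth.matches("\\d{4}-\\d{2}")) {
            throw new IllegalArgumentException("expected yyyy-MM, got " + yearMonth);
        }
        return "[^_]+_" + yearMonth + "-\\d{2}";
    }

    /** Splits a daily key into {agencyId, yearMonth, day}, or returns null if it does not fit. */
    static String[] parse(String rowKey) {
        Matcher m = DAILY_KEY.matcher(rowKey);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String regex = monthRegex("2009-06");
        String[] parts = parse("208_2009-06-08");
        System.out.println(regex);
        System.out.println(parts[0] + " / " + parts[1] + " / " + parts[2]);
    }
}
```

With this, monthRegex("2009-06") accepts '208_2009-06-08' and rejects
'208_2009-07-01', and the monthly row key is simply the first two parts
of the parsed key joined with an underscore.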
There's an example in the TableInputFormatBase (TIFB) docs:
http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
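The summing step itself does not depend on the HBase API, so here is a
rough sketch of it in plain Java. It assumes each cell value is stored
as a decimal string (your mail does not say how the values are encoded),
and MonthlySum and addDay are hypothetical names. In a real job the
per-day maps would come from the rows your map or reduce step receives.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of HBase.
class MonthlySum {
    /**
     * Adds one day's columns into the running monthly totals.
     * Both maps are keyed by "family:qualifier", e.g. "looks:FRAMUC".
     */
    static void addDay(Map<String, BigDecimal> totals, Map<String, String> dayColumns) {
        for (Map.Entry<String, String> e : dayColumns.entrySet()) {
            // BigDecimal keeps turnover values like 1534.34 exact.
            BigDecimal value = new BigDecimal(e.getValue());
            totals.merge(e.getKey(), value, BigDecimal::add);
        }
    }

    public static void main(String[] args) {
        Map<String, String> day1 = new TreeMap<>();
        day1.put("looks:FRAMUC", "123");
        day1.put("turnover:FRAMUC", "1534.34");

        Map<String, String> day2 = new TreeMap<>();
        day2.put("looks:FRAMUC", "456");
        day2.put("turnover:FRAMUC", "4574.35");

        Map<String, BigDecimal> totals = new TreeMap<>();
        addDay(totals, day1);
        addDay(totals, day2);
        System.out.println(totals);
    }
}
```

The totals map then holds exactly what you would write to the monthly
row, one cell per "family:qualifier" entry.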
Best of luck!
JG
Michael Hauck wrote:
Hi,
I'm new to HBase MapReduce and want to do the following:
- create daily statistics with sql queries against a sql database
- store statistic results in hbase
- run daily MapReduce on that results to compute monthly statistics
I stored this data in hbase table 'route_conversion_statistics'.
My keys have the format '<agency_id>_yyyy-MM-dd' like '208_2009-06-08'
My ColumnFamilies are 'looks', 'bookings', 'turnover', 'paxcount'
For example:
The row '208_2009-06-08' has about 30000 columns like this:
looks:FRAMUC, value=123
looks:FRALAX, value=456
...
bookings:FRAMUC, value=15
bookings:FRALAX, value=34
...
turnover:FRAMUC, value=1534.34
turnover:FRALAX, value=4574.35
...
paxcount:FRAMUC, value=356
paxcount:FRALAX, value=5676
...
Now I want to create a new row with the corresponding key '208_2009-06'
and put the sum of all columns from '208_2009-06-01' to '208_2009-06-30'
into it.
What is the best practice to do this?
How can I scan over this monthly range?
I use Hadoop 0.19.1 and HBase 0.19.1.
Thanks,
Michael