Also to note: I followed these guidelines (http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html) and did not observe the same speedup. I'm not quite sure what sort of hardware is required to get those numbers.
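For anyone comparing notes, the knobs that write-heavy import guides from that era generally point at are client-side batching: turn off auto-flush and enlarge the write buffer so puts go over the wire in bulk. A minimal sketch against the 0.20-style Java client (the "events" table name is made up; the family/qualifier follow the schema described below):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedImport {
        public static void main(String[] args) throws IOException {
            HBaseConfiguration conf = new HBaseConfiguration();
            HTable table = new HTable(conf, "events");   // hypothetical table name

            // Buffer puts on the client instead of round-tripping one at a time.
            table.setAutoFlush(false);
            table.setWriteBufferSize(12 * 1024 * 1024);  // 12 MB; tune to taste

            for (long ts = 0; ts < 100000; ts++) {
                Put put = new Put(Bytes.toBytes(ts));    // row key = timestamp
                put.add(Bytes.toBytes("ColFam1"), Bytes.toBytes("col1"),
                        Bytes.toBytes("value-" + ts));
                table.put(put);                          // queued in the write buffer
            }

            table.flushCommits();   // push whatever is still buffered
            table.close();
        }
    }

Whether this helps much probably depends on how far the regionservers are from being the bottleneck, which may be why I didn't see the same speedup on our small boxes.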
On Wed, Feb 3, 2010 at 1:40 PM, Chris Bates < [email protected]> wrote:

> Hello all,
>
> I've read around about table indexing and it seems there are a couple of
> approaches, on which I'd like clarification.
>
> What I have been doing is near-full table scans, because I'm doing a lot of
> aggregation and statistics for our analytics project. We have nearly 7.5 GB
> of data per day to load into HBase. My schema has been Row: Timestamp,
> ColFam1: col1..., ColFam2: col1.... It takes close to 5 hours to load in all
> the data I need from an HDFS MapReduce job. We are currently running HBase
> on only 3 machines, with about 1.5-2 GB of RAM each. We're going to scale
> out in the next month to 2 8-core machines with 30 GB of RAM each.
>
> With this in mind, I'm now focusing on performance. I'm working on getting
> LZO compression enabled on all the machines, but I was more curious about
> the best way to index.
>
> It seems there are two strategies: use the tableindexed package, or roll my
> own, where I'd create a new table with the rowIDs as the values, keyed by
> the primary-table column being looked up. Then, when I do a scan on the
> main table, I would grab one value that satisfies my filters and use that
> value to scan over the index table to grab all the rows that satisfy it.
>
> Does anyone know about the performance of these two approaches, or whether
> there are others? How does indexing affect loading? I'd like to load my
> 7.5 GB of data per day in a matter of minutes, not hours, and then be able
> to query columns in seconds, not tens of minutes.
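For what it's worth, a rough sketch of the second ("roll your own") strategy described above, against the 0.20 Java client: one extra table keyed by <column value>/<main row key>, written alongside each put to the main table, then range-scanned at query time. Table names, the "ref" qualifier, and the separator are made up for illustration, and this ignores keeping the two tables in sync on failures, which is roughly what the tableindexed contrib tries to handle for you:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ManualSecondaryIndex {
        private static final byte[] FAM = Bytes.toBytes("ColFam1");
        private static final byte[] COL = Bytes.toBytes("col1");
        private static final byte[] REF = Bytes.toBytes("ref");  // holds the main-table row key
        private static final byte[] SEP = Bytes.toBytes("/");

        /** On write: put into the main table, then mirror into the index table. */
        static void indexedPut(HTable main, HTable index, byte[] rowKey, byte[] colValue)
                throws IOException {
            Put p = new Put(rowKey);
            p.add(FAM, COL, colValue);
            main.put(p);

            // Index row key = <column value>/<main row key>, so all rows sharing a
            // value sort next to each other and one range scan finds them all.
            Put ip = new Put(Bytes.add(colValue, SEP, rowKey));
            ip.add(FAM, REF, rowKey);
            index.put(ip);
        }

        /** On read: scan the index for one value, then fetch the matching main rows. */
        static void lookup(HTable main, HTable index, byte[] colValue) throws IOException {
            byte[] start = Bytes.add(colValue, SEP);
            byte[] stop  = Bytes.add(colValue, Bytes.toBytes("0"));  // '0' sorts just after '/'
            ResultScanner scanner = index.getScanner(new Scan(start, stop));
            try {
                for (Result r : scanner) {
                    byte[] mainKey = r.getValue(FAM, REF);
                    Result row = main.get(new Get(mainKey));
                    // ... aggregate over 'row' instead of scanning the whole main table ...
                }
            } finally {
                scanner.close();
            }
        }
    }

The write path does two puts per record, so indexing will slow loading somewhat; the gain is that reads become a narrow index scan plus point gets instead of a near-full table scan.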
