I have a table with two column families, call them A and B, with new data regularly being added. They differ greatly in size: B is 100x the size of A. Among other uses for this data, I have a MapReduce job that needs to read all of A, but only recent data from B (e.g. the last day). Here are some methods I've considered:

1. Use a Filter to throw out older data from B (this is what I currently do). However, all the data from B still needs to be read from disk, causing a disk IO bottleneck.

2. Configure the table input format to read from B only, using a TimeRange for recent data, and have each map task open a separate scanner for A (without a TimeRange), then merge the data in the map task. However, this adds complexity to the job and gives up the atomicity/consistency guarantees, since new writes hit both column families.

3. Add a new column family C to the table holding an additional copy of the data in B, but with a TTL set on it. Every write to B is duplicated into C. Change the scan to include C instead of B. However, this adds all the overhead of another column family, doubles the writes, and requires setting the TTL to the longest time window I ever want to scan efficiently.

4. Implement an enhancement to HBase's Scan to allow giving each column family its own TimeRange. The job would then be able to skip most of the old, large store files (hopefully all of them, once tiered compaction is available).

Does anyone have other suggestions? Would the HBase project be willing to accept a change to Scan that allows a different TimeRange for each column family?

Dave
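
P.S. To make option 4 concrete, here is a rough toy model of the store-file skipping I have in mind. None of this is HBase code: `StoreFile`, `Range`, and `filesToRead` are made-up names for illustration, the timestamps are invented, and a real implementation would have to hook into the region server's scanner selection. The only thing borrowed from HBase is the half-open [min, max) convention that TimeRange uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these are hypothetical types, not HBase classes.
class PerFamilyTimeRangeSketch {

    // A store file, tagged with the oldest and newest cell timestamps it holds.
    static final class StoreFile {
        final String family;
        final long minTs;
        final long maxTs;

        StoreFile(String family, long minTs, long maxTs) {
            this.family = family;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    // Half-open time range [min, max).
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        // True if the file could contain at least one cell inside the range.
        boolean overlaps(StoreFile f) {
            return f.maxTs >= min && f.minTs < max;
        }
    }

    // Families with no entry in perFamily are read in full; for the others,
    // files whose timestamps fall entirely outside the range are skipped.
    static List<StoreFile> filesToRead(List<StoreFile> files, Map<String, Range> perFamily) {
        List<StoreFile> selected = new ArrayList<>();
        for (StoreFile f : files) {
            Range r = perFamily.get(f.family);
            if (r == null || r.overlaps(f)) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 100 * day;

        List<StoreFile> files = new ArrayList<>();
        files.add(new StoreFile("A", 0, 99 * day));
        files.add(new StoreFile("A", 99 * day, now));
        files.add(new StoreFile("B", 0, 50 * day));        // old and large
        files.add(new StoreFile("B", 50 * day, 98 * day)); // old and large
        files.add(new StoreFile("B", 99 * day, now));      // last day only

        // Restrict only B to the last day; A has no range and is read in full.
        Map<String, Range> ranges = new HashMap<>();
        ranges.put("B", new Range(now - day, Long.MAX_VALUE));

        int read = filesToRead(files, ranges).size();
        System.out.println(read + " of " + files.size() + " store files would be read");
    }
}
```

In this made-up layout the two old B files are never opened, which is the disk IO saving that option 1 cannot give me. How well it works in practice depends on compaction keeping old and new data in separate store files.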