I have a table with 2 column families, call them A and B, with new data
regularly being added. They are very different sizes: B is 100x the size of
A.  Among other uses for this data, I have a MapReduce job that needs to
read all of A, but only recent data from B (e.g. last day).  Here are some
methods I've considered:

   1. Use a Filter to throw out older data from B (this is what I
   currently do).  However, all the data from B still needs to be read from
   disk, causing a disk IO bottleneck.
   2. Configure the table input format to read from B only, using a
   TimeRange for recent data, and have each map task open a separate scanner
   for A (without a TimeRange) then merge the data in the map task.  However,
   this adds complexity to the job and gives up the atomicity/consistency
   guarantees as new writes hit both column families.
   3. Add a new column family C to the table with an additional copy of the
   data in B, but set a TTL on it.  Every write to B is also written to C.
   Change the scan to include C instead of B.  However, this adds all
   the overhead of another column family, more writes, and having to set the
   TTL to the maximum of any time window I want to scan efficiently.
   4. Implement an enhancement to HBase's Scan to allow giving each column
   family its own TimeRange.  The job would then be able to skip most old
   large store files (hopefully all of them with tiered compaction at some
   point).
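
To make the file-skipping idea in option 4 concrete, here is a toy Java
model. It is only a sketch under stated assumptions: StoreFile, Range and
filesToRead are hypothetical names invented for illustration, not HBase
classes or the way HBase implements scans. It assumes each store file knows
the minimum and maximum cell timestamp it holds, and that a family with no
range configured is read in full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model only: every type here is hypothetical and is NOT part of HBase.
final class PerFamilyTimeRangeSketch {

    // A store file summarized by its column family and the oldest/newest
    // cell timestamps it contains.
    static final class StoreFile {
        final String family;
        final long minTs;
        final long maxTs;

        StoreFile(String family, long minTs, long maxTs) {
            this.family = family;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    // Half-open time range [minTs, maxTs) requested for one column family.
    static final class Range {
        final long minTs;
        final long maxTs;

        Range(long minTs, long maxTs) {
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        // True if the file may contain at least one cell inside the range.
        boolean overlaps(StoreFile f) {
            return f.maxTs >= minTs && f.minTs < maxTs;
        }
    }

    // Returns the files a scan would have to open: every file of a family
    // without a configured range, and only the overlapping files otherwise.
    static List<StoreFile> filesToRead(List<StoreFile> files,
                                       Map<String, Range> rangeByFamily) {
        List<StoreFile> selected = new ArrayList<>();
        for (StoreFile f : files) {
            Range r = rangeByFamily.get(f.family);
            if (r == null || r.overlaps(f)) {
                selected.add(f);
            }
        }
        return selected;
    }
}
```

With a range set only for B, all of A's files are selected while any B file
whose newest cell is older than the range start is never opened. This is
the disk IO saving that option 1 cannot provide.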

Does anyone have other suggestions?  Would HBase be willing to accept a
change to Scan that allows a different TimeRange for each column family?


Dave
