Hi Ted,

Thanks for the suggestion, but I'm not sure it helps my case much. I wasn't
very familiar with the feature, and it doesn't seem very well documented - I
had to go to the source and the originating JIRA to understand how it works.
It sounds like it lets you mark which column families the filter operates on
("essential" seems an odd name). If any data from those families passes the
filter, the scan loads and includes data from the remaining families without
filtering it. In my case, it's not clear from a row's family A whether or not
family B for that row is required (though that could probably be added).
Moreover, even if a row has recent data, we don't want to load all the old
data from that row. We'd prefer to skip reading the data off disk for the old
store files entirely.
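For anyone following the thread, here is a minimal sketch of how the
mechanism is wired up, as I understand it from the source. It assumes the
HBase 1.x client API; the RecentRowFilter name, the A/B family names, and
the time-window logic are illustrative, not code from our actual job:

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class RecentRowFilter extends FilterBase {
    private static final byte[] FAMILY_A = Bytes.toBytes("A");
    private final long minTimestamp;

    public RecentRowFilter(long minTimestamp) {
        this.minTimestamp = minTimestamp;
    }

    // Only family A is "essential": non-essential families are loaded
    // lazily, and only for rows where some cell in A passed the filter.
    @Override
    public boolean isFamilyEssential(byte[] name) {
        return Bytes.equals(name, FAMILY_A);
    }

    // Include only cells at or after the window start.
    @Override
    public ReturnCode filterKeyValue(Cell cell) {
        return cell.getTimestamp() >= minTimestamp
                ? ReturnCode.INCLUDE
                : ReturnCode.SKIP;
    }

    public static Scan buildScan(long minTimestamp) {
        Scan scan = new Scan();
        scan.setFilter(new RecentRowFilter(minTimestamp));
        // Lazy (on-demand) family loading must be enabled for
        // isFamilyEssential to take effect.
        scan.setLoadColumnFamiliesOnDemand(true);
        return scan;
    }
}
```

A real custom filter would also need the protobuf serialization plumbing and
its jar deployed on the region servers, which I've omitted here - and, as
noted above, once a row's A cells pass, B is joined in without further
filtering or any store file skipping.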
Dave

On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Have you considered using the essential column family feature (through a
> Filter)?
> In your case A would be the essential column family.
> Within the TimeRange for recent data, the filter would return both column
> families.
> Outside the TimeRange, only family A is returned.
>
> Cheers
>
> On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <lat...@davelink.net> wrote:
>
> > I have a table with 2 column families, call them A and B, with new data
> > regularly being added. They are very different sizes: B is 100x the size
> > of A. Among other uses for this data, I have a MapReduce job that needs
> > to read all of A, but only recent data from B (e.g. the last day). Here
> > are some methods I've considered:
> >
> > 1. Use a Filter to throw out older data from B (this is what I currently
> >    do). However, all the data from B still needs to be read from disk,
> >    causing a disk IO bottleneck.
> > 2. Configure the table input format to read from B only, using a
> >    TimeRange for recent data, and have each map task open a separate
> >    scanner for A (without a TimeRange), then merge the data in the map
> >    task. However, this adds complexity to the job and gives up the
> >    atomicity/consistency guarantees as new writes hit both column
> >    families.
> > 3. Add a new column family C to the table with an additional copy of the
> >    data in B, but set a TTL on it. All writes duplicate the data written
> >    to B and C. Change the scan to include C instead of B. However, this
> >    adds all the overhead of another column family, more writes, and
> >    having to set the TTL to the maximum of any time window I want to
> >    scan efficiently.
> > 4. Implement an enhancement to HBase's Scan to allow giving each column
> >    family its own TimeRange. The job would then be able to skip most old
> >    large store files (hopefully all of them with tiered compaction at
> >    some point).
> >
> > Does anyone have other suggestions? Would HBase be willing to accept
> > updating Scan to have different TimeRanges for each column family?
> >
> > Dave
> >
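[For readers arriving at this thread later: the enhancement proposed in
option 4 did subsequently land in HBase as Scan#setColumnFamilyTimeRange.
A sketch of what the job's scan could look like with that API - the family
names and window parameters are illustrative:]

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PerFamilyTimeRangeScan {
    // Read all of family A, but only the recent window of family B.
    public static Scan build(long windowStartMs, long nowMs) {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("A")); // no time restriction on A
        scan.addFamily(Bytes.toBytes("B"));
        // Restrict only family B to the recent window; the region server
        // can then skip B's store files whose timestamp range lies
        // entirely outside the window.
        scan.setColumnFamilyTimeRange(Bytes.toBytes("B"),
                windowStartMs, nowMs);
        return scan;
    }
}
```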