Ted,

That sounds like allowing the Filter to indicate that it wants to include the "essential" column family data but skip the other column family. I still don't think that would help very much. In my case, I currently can't tell from the values in family A which data in family B to include. Moreover, many rows have a lot of data in family B, and I want to include the recent data but not the old data from the same row.

Dave

On Sun, Aug 2, 2015 at 8:06 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Here is a refined version:
> http://pastebin.com/WXjYKmBG
>
> Cheers
>
> On Sun, Aug 2, 2015 at 2:57 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Dave:
> > I wonder if the Filter response can be enhanced in the following manner:
> >
> > http://pastebin.com/sb6apTPm
> >
> > My approach is based on using the essential column family (column
> > family A in your case) to guide whether the remaining column families
> > should be loaded.
> > To be specific, if outside the TimeRange you specify (last day), your
> > filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW.
> >
> > What do you think?
> >
> > Cheers
> >
> > On Sat, Aug 1, 2015 at 8:06 PM, Dave Latham <lat...@davelink.net> wrote:
> >
> >> Thanks for brainstorming, Ted. That sounds like option 2 that I
> >> listed, using a separate scanner for A vs B, which "adds complexity
> >> to the job and gives up the atomicity/consistency guarantees as new
> >> writes hit both column families".
> >>
> >> On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>
> >> > Can you achieve your goal with two scans?
> >> > The first scan specifies a TimeRange corresponding to the last day.
> >> > This scan returns both column families.
> >> > The other scan specifies a TimeRange excluding the last day. This
> >> > scan returns column family A.
> >> >
> >> > Cheers
> >> >
> >> > On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham <lat...@davelink.net> wrote:
> >> >
> >> > > Hi Ted,
> >> > >
> >> > > Thanks for the suggestion, but I'm not sure that it helps my case
> >> > > much. I wasn't very familiar with the feature, and it doesn't seem
> >> > > very well documented - I had to go to the source and the
> >> > > originating JIRA to understand how it works. It sounds like it
> >> > > allows you to mark which column families the filter operates on
> >> > > ("essential" seems an odd name). If any data from those column
> >> > > families passes the filter, then the scan loads and includes data
> >> > > from the remaining families without filtering it. In my case,
> >> > > it's not clear from a row's family A whether or not family B for
> >> > > that row is required (though that could probably be added).
> >> > > Moreover, even if a row has recent data, we don't want to load
> >> > > all the old data from that row. We'd prefer to be able to
> >> > > entirely skip reading the data off disk for the old store files.
> >> > >
> >> > > Dave
> >> > >
> >> > > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >> > >
> >> > > > Have you considered using the essential column family feature
> >> > > > (through Filter)?
> >> > > > In your case A would be the essential column family.
> >> > > > Within the TimeRange for recent data, the filter would return
> >> > > > both column families.
> >> > > > Outside the TimeRange, only family A is returned.
> >> > > >
> >> > > > Cheers
> >> > > >
> >> > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <lat...@davelink.net> wrote:
> >> > > >
> >> > > > > I have a table with 2 column families, call them A and B,
> >> > > > > with new data regularly being added. They are very different
> >> > > > > sizes: B is 100x the size of A. Among other uses for this
> >> > > > > data, I have a MapReduce job that needs to read all of A,
> >> > > > > but only recent data from B (e.g. last day). Here are some
> >> > > > > methods I've considered:
> >> > > > >
> >> > > > >    1. Use a Filter to throw out older data from B (this is
> >> > > > >    what I currently do). However, all the data from B still
> >> > > > >    needs to be read from disk, causing a disk IO bottleneck.
> >> > > > >    2. Configure the table input format to read from B only,
> >> > > > >    using a TimeRange for recent data, and have each map task
> >> > > > >    open a separate scanner for A (without a TimeRange) then
> >> > > > >    merge the data in the map task. However, this adds
> >> > > > >    complexity to the job and gives up the
> >> > > > >    atomicity/consistency guarantees as new writes hit both
> >> > > > >    column families.
> >> > > > >    3. Add a new column family C to the table with an
> >> > > > >    additional copy of the data in B, but set a TTL on it.
> >> > > > >    All writes duplicate the data written to B and C. Change
> >> > > > >    the scan to include C instead of B. However, this adds
> >> > > > >    all the overhead of another column family, more writes,
> >> > > > >    and having to set the TTL to the maximum of any time
> >> > > > >    window I want to scan efficiently.
> >> > > > >    4. Implement an enhancement to HBase's Scan to allow
> >> > > > >    giving each column family its own TimeRange. The job
> >> > > > >    would then be able to skip most old large store files
> >> > > > >    (hopefully all of them with tiered compaction at some
> >> > > > >    point).
> >> > > > >
> >> > > > > Does anyone have other suggestions? Would HBase be willing
> >> > > > > to accept updating Scan to have a different TimeRange for
> >> > > > > each column family?
> >> > > > >
> >> > > > > Dave
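
To make option 4 from the original message concrete, here is a rough sketch of the store-file pruning it is after. This is a toy model in Python, not HBase code: `StoreFile`, `files_to_read`, the file names, and the timestamps are all made up for illustration, and it only assumes that each store file knows the minimum and maximum timestamp of the cells it holds.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Half-open interval [min_ts, max_ts) in milliseconds (assumed convention).
TimeRange = Tuple[int, int]


@dataclass(frozen=True)
class StoreFile:
    """Hypothetical stand-in for a store file and its timestamp bounds."""
    family: str
    name: str
    min_ts: int  # oldest cell timestamp in the file
    max_ts: int  # newest cell timestamp in the file (inclusive)


def overlaps(f: StoreFile, tr: TimeRange) -> bool:
    """True if the file may contain cells inside the time range."""
    lo, hi = tr
    return f.max_ts >= lo and f.min_ts < hi


def files_to_read(files: List[StoreFile],
                  default_range: TimeRange,
                  family_ranges: Dict[str, TimeRange]) -> List[StoreFile]:
    """Keep only files whose timestamps overlap their family's range.

    Families without an entry in family_ranges use default_range.
    """
    selected = []
    for f in files:
        tr = family_ranges.get(f.family, default_range)
        if overlaps(f, tr):
            selected.append(f)
    return selected


ALL_TIME: TimeRange = (0, 2**63 - 1)
DAY_MS = 24 * 60 * 60 * 1000
now = 10 * DAY_MS  # made-up "current time" for the example

files = [
    StoreFile("A", "a-old", 0, 8 * DAY_MS),
    StoreFile("A", "a-new", 8 * DAY_MS, now - 1),
    StoreFile("B", "b-old-1", 0, 5 * DAY_MS),
    StoreFile("B", "b-old-2", 5 * DAY_MS, 9 * DAY_MS - 1),
    StoreFile("B", "b-new", 9 * DAY_MS, now - 1),
]

# Read all of A, but only the last day of B.
selected = files_to_read(files, ALL_TIME, {"B": (now - DAY_MS, now)})
names = [f.name for f in selected]
print(names)
```

In this model both old B files are skipped without being opened, while every A file is still read. A file that does overlap the range can still contain older cells, so per-cell timestamp filtering would remain necessary; the gain is only in the files that never have to come off disk.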