Ted,
That sounds like allowing the Filter to indicate it wants to include the
"essential" column family data but skip the other column family.  I still
don't think that would help very much.  In my case, I currently don't know
which data in family B to include based on the values in family A.
Moreover, for many rows there is a lot of data in family B and I want to
include recent data but not the old data from the same row.
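
To make the behavior I'm after concrete, here is a rough sketch in plain Java.
It uses no HBase classes, and every name in it (Cell, Range, select, overlaps)
is made up for illustration, so treat it as a model of the semantics under
those assumptions rather than a proposed patch. Each column family may carry
its own half-open time range [min, max), and a family without a range (A in my
case) is returned in full. The overlaps check models the part that matters for
disk IO: a store file whose timestamps fall entirely outside a family's range
would not need to be opened at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerFamilyTimeRangeSketch {

    /** Hypothetical stand-in for a stored cell; not an HBase class. */
    static final class Cell {
        final String row;
        final String family;
        final long timestamp;

        Cell(String row, String family, long timestamp) {
            this.row = row;
            this.family = family;
            this.timestamp = timestamp;
        }
    }

    /** Half-open time range [min, max); hypothetical, for illustration only. */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        boolean contains(long ts) {
            return ts >= min && ts < max;
        }

        /** True if a file whose cells span [fileMin, fileMax] (inclusive) could hold a match. */
        boolean overlaps(long fileMin, long fileMax) {
            return fileMax >= min && fileMin < max;
        }
    }

    /** Keeps a cell if its family has no range, or if its timestamp is inside that family's range. */
    static List<Cell> select(List<Cell> cells, Map<String, Range> rangeByFamily) {
        List<Cell> out = new ArrayList<>();
        for (Cell c : cells) {
            Range r = rangeByFamily.get(c.family);
            if (r == null || r.contains(c.timestamp)) {
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Timestamps are abstract ticks; 900 stands in for "one day ago".
        Map<String, Range> rangeByFamily = new HashMap<>();
        rangeByFamily.put("B", new Range(900L, Long.MAX_VALUE));

        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("row1", "A", 100L)); // old A: kept, A has no range
        cells.add(new Cell("row1", "B", 100L)); // old B: dropped
        cells.add(new Cell("row1", "B", 950L)); // recent B: kept
        cells.add(new Cell("row2", "A", 950L)); // recent A: kept

        for (Cell c : select(cells, rangeByFamily)) {
            System.out.println(c.row + " " + c.family + " @" + c.timestamp);
        }

        // An old B store file covering ticks 0..500 would be skipped entirely.
        System.out.println(rangeByFamily.get("B").overlaps(0L, 500L));
    }
}
```

Note that select only shows which cells come back. The saving I care about
comes from the overlaps check being applied per family before any store file
is read, which is what a single scan-wide TimeRange cannot express.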

Dave

On Sun, Aug 2, 2015 at 8:06 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Here is a refined version:
> http://pastebin.com/WXjYKmBG
>
> Cheers
>
> On Sun, Aug 2, 2015 at 2:57 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> > Dave:
> > I wonder if Filter response can be enhanced in the following manner:
> >
> > http://pastebin.com/sb6apTPm
> >
> > My approach is based on using essential column family (column family A in
> > your case) to guide whether the remaining column families should be
> loaded.
> > To be specific, if outside the TimeRange you specify (last day), your
> > filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW.
> >
> > What do you think ?
> >
> > Cheers
> >
> > On Sat, Aug 1, 2015 at 8:06 PM, Dave Latham <lat...@davelink.net> wrote:
> >
> >> Thanks for brainstorming, Ted.  That sounds like option 2 I listed
> using a
> >> separate scanner for A vs B which "adds complexity to the job and gives
> up
> >> the atomicity/consistency guarantees as new writes hit both column
> >> families".
> >>
> >> On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>
> >> > Can you achieve your goal with two scans ?
> >> > The first scan specifies TimeRange corresponding to last day. This
> scan
> >> > returns both column families.
> >> > The other scan specifies TimeRange excluding last day. This scan
> returns
> >> > column family A.
> >> >
> >> > Cheers
> >> >
> >> > On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham <lat...@davelink.net>
> >> wrote:
> >> >
> >> > > Hi Ted,
> >> > >
> >> > > Thanks for the suggestion, but I'm not sure that it helps my case
> >> much.
> >> > I
> >> > > wasn't very familiar with the feature, and it doesn't seem very well
> >> > > documented - I had to go to the source and the originating JIRA to
> >> > > understand how it works.  It sounds like it allows you to mark which
> >> > column
> >> > > families the filter operates on ("essential" seems an odd name).  If
> >> any
> >> > > data from those column families passes the filter, then the scan
> loads
> >> > and
> >> > > includes data from the remaining families without filtering it.  In
> my
> >> > > case, it's not clear from a row's family A whether or not family B
> for
> >> > that
> >> > > row is required (though that could probably be added).  Moreover,
> even
> >> > if a
> >> > > row has recent data, we don't want to load all the old data from
> that
> >> > row.
> >> > > We'd prefer to be able to entirely skip reading the data off disk
> for
> >> the
> >> > > old store files.
> >> > >
> >> > > Dave
> >> > >
> >> > > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >> > >
> >> > > > Have you considered using essential column family feature (through
> >> > > Filter)
> >> > > > ?
> >> > > > In your case A would be the essential column family.
> >> > > > Within TimeRange for recent data, the filter would return both
> >> column
> >> > > > families.
> >> > > > Outside the TimeRange, only family A is returned.
> >> > > >
> >> > > > Cheers
> >> > > >
> >> > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <lat...@davelink.net>
> >> > wrote:
> >> > > >
> >> > > > > I have a table with 2 column families, call them A and B, with
> new
> >> > data
> >> > > > > regularly being added. They are very different sizes: B is 100x
> >> the
> >> > > size
> >> > > > of
> >> > > > > A.  Among other uses for this data, I have a MapReduce job that
> >> needs
> >> > > to
> >> > > > > read all of A, but only recent data from B (e.g. last day).
> Here
> >> are
> >> > > > some
> >> > > > > methods I've considered:
> >> > > > >
> >> > > > >    1. Use a Filter to throw out older data from B (this is
> >> what I
> >> > > > >    currently do).  However, all the data from B still needs to
> be
> >> > read
> >> > > > from
> >> > > > >    disk, causing a disk IO bottleneck.
> >> > > > >    2. Configure the table input format to read from B only,
> using
> >> a
> >> > > > >    TimeRange for recent data, and have each map task open a
> >> separate
> >> > > > > scanner
> >> > > > >    for A (without a TimeRange) then merge the data in the map
> >> task.
> >> > > > > However,
> >> > > > >    this adds complexity to the job and gives up the
> >> > > atomicity/consistency
> >> > > > >    guarantees as new writes hit both column families.
> >> > > > >    3. Add a new column family C to the table with an additional
> >> copy
> >> > of
> >> > > > the
> >> > > > >    data in B, but set a TTL on it.  All writes duplicate the
> data
> >> > > written
> >> > > > > to B
> >> > > > >    and C.  Change the scan to include C instead of B.  However,
> >> this
> >> > > adds
> >> > > > > all
> >> > > > >    the overhead of another column family, more writes, and
> having
> >> to
> >> > > set
> >> > > > > the
> >> > > > >    TTL to the maximum of any time window I want to scan
> >> efficiently.
> >> > > > >    4. Implement an enhancement to HBase's Scan to allow giving
> >> each
> >> > > > column
> >> > > > >    family its own TimeRange.  The job would then be able to skip
> >> most
> >> > > old
> >> > > > >    large store files (hopefully all of them with tiered
> >> compaction at
> >> > > > some
> >> > > > >    point).
> >> > > > >
> >> > > > > Does anyone have other suggestions?  Would HBase be willing to
> >> accept
> >> > > > > updating Scan to have a different TimeRange for each column
> >> family?
> >> > > > >
> >> > > > >
> >> > > > > Dave
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
