Re: scan column families with different time ranges

Ted Yu Sun, 02 Aug 2015 08:06:55 -0700

Here is a refined version:
http://pastebin.com/WXjYKmBG


Cheers

On Sun, Aug 2, 2015 at 2:57 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Dave:
> I wonder if Filter response can be enhanced in the following manner:
>
> http://pastebin.com/sb6apTPm
>
> My approach is based on using essential column family (column family A in
> your case) to guide whether the remaining column families should be loaded.
> To be specific, if outside the TimeRange you specify (last day), your
> filter returns ReturnCode.INCLUDE_AND_SEEK_NEXT_ROW.
>
> What do you think?
>
> Cheers
>
> On Sat, Aug 1, 2015 at 8:06 PM, Dave Latham <lat...@davelink.net> wrote:
>
>> Thanks for brainstorming, Ted.  That sounds like option 2 I listed using a
>> separate scanner for A vs B which "adds complexity to the job and gives up
>> the atomicity/consistency guarantees as new writes hit both column
>> families".
>>
>> On Sat, Aug 1, 2015 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> > Can you achieve your goal with two scans?
>> > The first scan specifies TimeRange corresponding to last day. This scan
>> > returns both column families.
>> > The other scan specifies TimeRange excluding last day. This scan returns
>> > column family A.
>> >
>> > Cheers
>> >
>> > On Sat, Aug 1, 2015 at 8:35 AM, Dave Latham <lat...@davelink.net> wrote:
>> >
>> > > Hi Ted,
>> > >
>> > > Thanks for the suggestion, but I'm not sure that it helps my case much.  I
>> > > wasn't very familiar with the feature, and it doesn't seem very well
>> > > documented - I had to go to the source and the originating JIRA to
>> > > understand how it works.  It sounds like it allows you to mark which column
>> > > families the filter operates on ("essential" seems an odd name).  If any
>> > > data from those column families passes the filter, then the scan loads and
>> > > includes data from the remaining families without filtering it.  In my
>> > > case, it's not clear from a row's family A whether or not family B for that
>> > > row is required (though that could probably be added).  Moreover, even if a
>> > > row has recent data, we don't want to load all the old data from that row.
>> > > We'd prefer to be able to entirely skip reading the data off disk for the
>> > > old store files.
>> > >
>> > > Dave
>> > >
>> > > On Sat, Aug 1, 2015 at 7:53 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> > >
>> > > > Have you considered using the essential column family feature (through
>> > > > Filter)?
>> > > > In your case A would be the essential column family.
>> > > > Within TimeRange for recent data, the filter would return both column
>> > > > families.
>> > > > Outside the TimeRange, only family A is returned.
>> > > >
>> > > > Cheers
>> > > >
>> > > > On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <lat...@davelink.net> wrote:
>> > > >
>> > > > > I have a table with 2 column families, call them A and B, with new data
>> > > > > regularly being added. They are very different sizes: B is 100x the size
>> > > > > of A.  Among other uses for this data, I have a MapReduce job that needs
>> > > > > to read all of A, but only recent data from B (e.g. last day).  Here are
>> > > > > some methods I've considered:
>> > > > >
>> > > > >    1. Use a Filter to throw out older data from B (this is what I
>> > > > >    currently do).  However, all the data from B still needs to be read
>> > > > >    from disk, causing a disk IO bottleneck.
>> > > > >    2. Configure the table input format to read from B only, using a
>> > > > >    TimeRange for recent data, and have each map task open a separate
>> > > > >    scanner for A (without a TimeRange) then merge the data in the map
>> > > > >    task.  However, this adds complexity to the job and gives up the
>> > > > >    atomicity/consistency guarantees as new writes hit both column
>> > > > >    families.
>> > > > >    3. Add a new column family C to the table with an additional copy of
>> > > > >    the data in B, but set a TTL on it.  All writes duplicate the data
>> > > > >    written to B and C.  Change the scan to include C instead of B.
>> > > > >    However, this adds all the overhead of another column family, more
>> > > > >    writes, and having to set the TTL to the maximum of any time window I
>> > > > >    want to scan efficiently.
>> > > > >    4. Implement an enhancement to HBase's Scan to allow giving each
>> > > > >    column family its own TimeRange.  The job would then be able to skip
>> > > > >    most old large store files (hopefully all of them with tiered
>> > > > >    compaction at some point).
>> > > > >
>> > > > > Does anyone have other suggestions?  Would HBase be willing to accept
>> > > > > updating Scan to have a different TimeRange for each column family?
>> > > > >
>> > > > >
>> > > > > Dave
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
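
For illustration only: option 4 in the original message (giving each column
family its own TimeRange so that old store files of B can be skipped) can be
modeled with a toy sketch in plain Java. Nothing below is HBase API. The Range
and StoreFile classes, the selectFiles method, and all file names and
timestamps are hypothetical, and the sketch assumes each store file records
the minimum and maximum timestamp of the cells it holds.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only; none of these classes are HBase API. */
final class PerFamilyTimeRangeSketch {

    /** Half-open timestamp interval [min, max). */
    static final class Range {
        static final Range ALL = new Range(0L, Long.MAX_VALUE);
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        /** True if a file whose cells span [fileMin, fileMax] may hold matching cells. */
        boolean overlaps(long fileMin, long fileMax) {
            return fileMax >= min && fileMin < max;
        }
    }

    /** Hypothetical store file summary: owning family plus min/max cell timestamps. */
    static final class StoreFile {
        final String family;
        final String name;
        final long minTs;
        final long maxTs;

        StoreFile(String family, String name, long minTs, long maxTs) {
            this.family = family;
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /**
     * Returns the names of the files that must be read. A family without an
     * entry in perFamily is read in full (assumed default for this sketch).
     */
    static List<String> selectFiles(List<StoreFile> files, Map<String, Range> perFamily) {
        List<String> selected = new ArrayList<>();
        for (StoreFile f : files) {
            Range r = perFamily.getOrDefault(f.family, Range.ALL);
            if (r.overlaps(f.minTs, f.maxTs)) {
                selected.add(f.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long day = 86_400_000L;   // milliseconds per day
        long now = 10 * day;      // made-up "current" time

        List<StoreFile> files = new ArrayList<>();
        files.add(new StoreFile("A", "A-old", 0L, 8 * day));
        files.add(new StoreFile("A", "A-new", 9 * day, now));
        files.add(new StoreFile("B", "B-old", 0L, 8 * day));
        files.add(new StoreFile("B", "B-new", 9 * day, now));

        // Only family B is restricted to the last day; A keeps the full range.
        Map<String, Range> perFamily = new HashMap<>();
        perFamily.put("B", new Range(now - day, Long.MAX_VALUE));

        System.out.println(selectFiles(files, perFamily));
    }
}
```

In this model both files of A are selected, while the old file of B is never
opened because its newest cell predates the range requested for B. That is the
disk IO saving option 4 is after, within a single scan.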
