Hi Dave,

>  Would HBase be willing to accept updating Scan to have different
>  TimeRanges for each column family?

We could try it. I'm not sure how familiar you are with the relevant code.
I'm guessing somewhat? Look at ScanQueryMatcher (SQM). This and related
classes govern how we search through store files. TimeRange handling is
done at the top level, in the SQM. Then for each column we have a leaf
tracker (implementing ColumnTracker) that tracks column-specific state,
like the number of versions seen for each column. We'd need to push
TimeRange handling down into the column trackers. This would be a tricky
refactor on delicate code. I suspect we could be comfortable making this
change in master and on branch-1 for the upcoming, as yet unscheduled, 1.3
minor release line. Would that work? Or would this change need to go
further back?
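
To make the client-facing side concrete, usage might look something like
the sketch below. setColumnFamilyTimeRange is a hypothetical new method on
Scan, not an existing API, and the A and B family names just follow your
example below:

  import java.util.concurrent.TimeUnit;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  // Sketch only: setColumnFamilyTimeRange is a hypothetical addition to Scan.
  Scan scan = new Scan();
  scan.addFamily(Bytes.toBytes("A"));   // all of A, no time restriction
  scan.addFamily(Bytes.toBytes("B"));
  long now = System.currentTimeMillis();
  // Only the last day of B; store files holding only older B data
  // could then be skipped at read time.
  scan.setColumnFamilyTimeRange(Bytes.toBytes("B"),
      now - TimeUnit.DAYS.toMillis(1), now);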

Maybe someone else has another suggestion.


On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <lat...@davelink.net> wrote:

> I have a table with 2 column families, call them A and B, with new data
> regularly being added. They are very different sizes: B is 100x the size of
> A.  Among other uses for this data, I have a MapReduce job that needs to
> read all of A, but only recent data from B (e.g. last day).  Here are some
> methods I've considered:
>
>    1. Use a Filter to throw out older data from B (this is what I
>    currently do).  However, all the data from B still needs to be read
>    from disk, causing a disk IO bottleneck.
>    2. Configure the table input format to read from B only, using a
>    TimeRange for recent data, and have each map task open a separate
>    scanner for A (without a TimeRange), then merge the data in the map
>    task.  However, this adds complexity to the job and gives up the
>    atomicity/consistency guarantees as new writes hit both column
>    families.
>    3. Add a new column family C to the table with an additional copy of
>    the data in B, but set a TTL on it.  All writes duplicate the data
>    written to B and C.  Change the scan to include C instead of B.
>    However, this adds all the overhead of another column family, more
>    writes, and having to set the TTL to the maximum of any time window
>    I want to scan efficiently.
>    4. Implement an enhancement to HBase's Scan to allow giving each
>    column family its own TimeRange.  The job would then be able to skip
>    most old large store files (hopefully all of them with tiered
>    compaction at some point).
>
> Does anyone have other suggestions?  Would HBase be willing to accept
> updating Scan to have different TimeRanges for each column family?
>
>
> Dave
>
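
For reference, a rough sketch of the map-task side of option 2 above, just
to make the added complexity concrete. All names are illustrative (table
"my_table", families "A" and "B"); it assumes the job was set up with
TableMapReduceUtil.initTableMapperJob and a Scan restricted to B with a
TimeRange, and it emits per-family Results keyed by row rather than merging
rows present in both families:

  import java.io.IOException;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.mapreduce.TableSplit;
  import org.apache.hadoop.hbase.util.Bytes;

  public class MergeFamiliesMapper
      extends TableMapper<ImmutableBytesWritable, Result> {

    private Connection connection;
    private Table table;
    private ResultScanner scannerA; // full-history scan of family A for this split
    private Result pendingA;        // next A row not yet emitted

    @Override
    protected void setup(Context context) throws IOException {
      connection = ConnectionFactory.createConnection(context.getConfiguration());
      table = connection.getTable(TableName.valueOf("my_table")); // illustrative

      // Bound the A scan by this split's key range so mappers don't overlap.
      TableSplit split = (TableSplit) context.getInputSplit();
      Scan scanA = new Scan();
      scanA.addFamily(Bytes.toBytes("A"));
      scanA.setStartRow(split.getStartRow());
      scanA.setStopRow(split.getEndRow());
      scannerA = table.getScanner(scanA);
      pendingA = scannerA.next();
    }

    @Override
    protected void map(ImmutableBytesWritable rowB, Result resultB, Context context)
        throws IOException, InterruptedException {
      // Emit A rows that sort before the current B row, then the B row itself.
      // A row present in both families comes out as two Results with the same
      // key; a real job would merge them here or downstream.
      while (pendingA != null
          && Bytes.compareTo(pendingA.getRow(), resultB.getRow()) < 0) {
        context.write(new ImmutableBytesWritable(pendingA.getRow()), pendingA);
        pendingA = scannerA.next();
      }
      context.write(rowB, resultB);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // Drain A rows past the last B row seen in this split.
      while (pendingA != null) {
        context.write(new ImmutableBytesWritable(pendingA.getRow()), pendingA);
        pendingA = scannerA.next();
      }
      scannerA.close();
      table.close();
      connection.close();
    }
  }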



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
