> Have you seen: https://issues.apache.org/jira/browse/HBASE-4241 ?

Heh, missed that one for some reason. Thanks for pointing to it.

And HBASE-5311 looks even more promising. It actually addresses the questions
I had about "real changes" to KVs in the memstore during its compaction (as
opposed to just omitting redundant edits, as in HBASE-4241).
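
To make "omitting redundant edits" concrete, here is a toy model in plain
Python. It is not HBase code: `Edit`, `flush_snapshot` and `max_versions` are
names invented for illustration, and deletes and TTLs are ignored. For each
cell it keeps only the newest `max_versions` edits, which is roughly the
effect described for HBASE-4241.

```python
from collections import defaultdict
from typing import NamedTuple


class Edit(NamedTuple):
    """One memstore edit in this toy model (hypothetical, not an HBase type)."""
    row: str
    column: str
    timestamp: int
    value: bytes


def flush_snapshot(edits, max_versions=1):
    """Return only the edits a flush would need to write.

    For every (row, column) cell, keep the newest `max_versions` edits and
    drop the older ones, which a later compaction would discard anyway.
    """
    by_cell = defaultdict(list)
    for edit in edits:
        by_cell[(edit.row, edit.column)].append(edit)

    kept = []
    for cell in sorted(by_cell):
        newest_first = sorted(by_cell[cell], key=lambda e: e.timestamp, reverse=True)
        kept.extend(newest_first[:max_versions])
    return kept


# Ten writes to the same cell, as in "Example behavior 1" of the first message.
edits = [Edit("row1", "cf:q", ts, b"v%d" % ts) for ts in range(1, 11)]
survivors = flush_snapshot(edits, max_versions=1)
```

With versions set to 1, only the edit with the highest timestamp survives, so
the flush writes one cell instead of ten.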

From what I can tell (the discussion in HBASE-5311 and the commit in
HBASE-4241), the main problem to solve is making these memstore compactions
fast and, ideally, non-locking for updates to the memstore. And what I
described in point 3 of the initial message isn't really a problem.

Perhaps we could even plug a coprocessor into such memstore compactions, so
that even more could be compacted away based on application logic? Would
that be valuable?
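
A rough sketch of what such a hook could look like, again as a toy model in
plain Python. Everything here is an assumption made for illustration:
`compact_memstore` and `merge_fn` are invented names, not an existing HBase or
coprocessor API. The hook receives all values of one cell and returns the
single value to keep, which is a "real change" to the KVs rather than just
dropping old versions.

```python
from collections import defaultdict


def compact_memstore(edits, merge_fn):
    """Collapse the edits of each cell into one, using application logic.

    `edits` is an iterable of (cell_key, timestamp, value) tuples.
    `merge_fn(values)` receives a cell's values, oldest first, and returns
    the single value to keep. It stands in for the application hook.
    """
    by_cell = defaultdict(list)
    for cell_key, timestamp, value in edits:
        by_cell[cell_key].append((timestamp, value))

    compacted = []
    for cell_key in sorted(by_cell):
        versions = sorted(by_cell[cell_key], key=lambda tv: tv[0])
        latest_ts = versions[-1][0]  # the merged cell takes the newest timestamp
        merged = merge_fn([value for _, value in versions])
        compacted.append((cell_key, latest_ts, merged))
    return compacted


# Application logic for a counter whose edits are stored as deltas: sum them.
increments = [
    ("page:home", 1, 5),
    ("page:home", 2, 3),
    ("page:about", 1, 1),
    ("page:home", 3, 2),
]
result = compact_memstore(increments, merge_fn=sum)
```

In this model the three deltas for `page:home` collapse into a single cell
holding their sum. The WAL would still have to record every original edit, as
noted in the quoted reply below; only the in-memory state and the flushed
files would shrink.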

Thank you!

Alex Baranau

On Wed, May 2, 2012 at 1:29 AM, lars hofhansl <lhofha...@yahoo.com> wrote:

> HBASE-4241 solves part of the problem. It avoids flushing cells from the
> memstore to disk that would be collected during the next compaction anyway.
> Unfortunately it does not reduce the number of memstore flushes; it just
> leads to smaller HFiles.
>
>
> There's HBASE-5311 to discuss ways to address the latter problem.
>
> Note that in any case *all* edits need to be written to the WAL, as you
> cannot anticipate future edits.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Igal Shilman <ig...@wix.com>
> To: dev@hbase.apache.org
> Cc:
> Sent: Tuesday, May 1, 2012 10:11 PM
> Subject: Re: Understanding compacting memstore/HLog before flush
>
> Hi Alex,
> Have you seen: https://issues.apache.org/jira/browse/HBASE-4241 ?
>
> Igal.
> On May 2, 2012 7:01 AM, "Alex Baranau" <alex.barano...@gmail.com> wrote:
>
> > Hello,
> >
> > Could you please tell me if I correctly understand this problem...
> >
> > Example behavior 1:
> > * create table
> > * do 10 operations: insert a cell, overwrite it (given that the number of
> > versions is configured to 1), overwrite, ... overwrite.
> > * after the memstore with these edits is flushed, all of them get written
> > to HFiles
> >
> > Ideally, in this situation only one edit should be written (the resulting
> > value of the cell). I.e. only the "current visible state" of the memstore
> > should be flushed, as opposed to flushing all the edits from the HLog. This
> > would have a lot of benefits (e.g. less data to flush -> possibly less
> > frequent flushing -> less frequent compactions, etc.), especially in
> > particular use cases (like using counters, or updating some "aggregated
> > values").
> >
> > The problem, as I understand it (please correct me if I'm wrong), is that
> > it is not an easy thing to do, mainly because of:
> > 1) the additional resource management burden (flushing a large memstore
> > isn't cheap)
> > 2) compaction may add a lot of unnecessary overhead (so that in some cases
> > there will be no actual benefit from it) and may make flushing much slower,
> > which can cause a lot of issues
> > 3) edits flushed from the memstore and HLog edits should be kept in sync,
> > because we want the flush process to be reliable. I.e. if it fails in the
> > middle, we should be able to restore the state from the HLog. Keeping the
> > memstore and the HLog in sync during compaction (and we would need partial
> > compaction of some older data in the memstore) is difficult.
> > 4) anything else?
> >
> > Esp. 3rd point - am I getting it right?
> >
> > Thanx,
> > Alex Baranau
> >
>
>
