Re: [DISCUSSION] Merge Backup / Restore - Branch HBASE-7912

Stack Mon, 12 Sep 2016 12:01:44 -0700

Late to the game. A few comments after rereading this thread as a 'user'.

+ Before merge, a user-facing feature like this should work (if this is
"higher-bar for new features", bring it on -- smile).
+ As a user, I tried the branch with tools after reviewing the just-posted
doc. I had an 'interesting' experience (left comments up on the issue). I
think the tooling/doc. is important to get right. If it breaks easily or is
inconsistent (or lacks 'polish'), operators will judge the whole
backup/restore tooling chain as not trustworthy and abandon it. Let's not
have this happen to this feature.
+ Per Matteo's suggestion (with a helpful starter list), an explicit
qualification of what is actually being delivered -- including a listing of
limitations (some look serious, such as data bleed from other regions in
WALs, but maybe I don't care for my use case...) -- needs to accompany the
merge. Let's fold these into the user doc. in the technical overview area as
suggested so user expectations are properly managed (otherwise, they expect
the world and will just give up when we fall short). Vladimir did a list of
what is in each of the phases above, which would serve as a good start.
+ Is this feature 'experimental' (Matteo asks above)? I'd prefer it is not.
If it is, it should be labelled as such all over. I see the current state
called out as a '... technical preview feature'. Does this mean
not-for-users?


St.Ack

On Mon, Sep 12, 2016 at 8:03 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Sean:
> Do you have more comments ?
>
> Cheers
>
> On Fri, Sep 9, 2016 at 1:42 PM, Vladimir Rodionov <vladrodio...@gmail.com>
> wrote:
>
> > Sean,
> >
> > Backup/Restore can fail for various reasons: network outage (cluster
> > wide), various time-outs in the HBase and HDFS layers, M/R failure due to
> > "HDFS exceeded quota", user error (manual deletion of data), and so on.
> > It is impossible to enumerate all possible types of failure in a
> > distributed system - that is not our goal/task.
> >
> > We focus completely on backup system table consistency in the presence of
> > any type of failure. That is what I call "tolerance to failures".
> >
> > On a failure:
> >
> > BACKUP. All backup system information (prior to backup) will be restored
> > and all temporary data in HDFS related to the failed session will be
> > deleted.
> > RESTORE. We do not care about system data, because restore does not
> > change it. Temporary data in HDFS will be cleaned up and the table will
> > be back in the state it was in before the operation started.
> >
> > This is what a user should expect in case of a failure.
> >
> > -Vlad
> >
> >
> > On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <bus...@apache.org> wrote:
> >
> > > Failing in a consistent way, with docs that explain the various
> > > expected failures would be sufficient.
> > >
> > > On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov
> > > <vladrodio...@gmail.com> wrote:
> > > > Do not worry Sean, doc is coming today as a preview and our writer
> > > > Frank will be working on putting it into the Apache repo. The timeline
> > > > depends on Frank's schedule, but I hope we will get it sooner rather
> > > > than later.
> > > >
> > > > As for failure testing, we are focusing only on a consistent state of
> > > > backup system data in the presence of any type of failure. We are not
> > > > going to implement anything more "fancy" than that. We allow both
> > > > backup and restore to fail. What we do not allow is for system data to
> > > > be corrupted. Will that suffice for you? Do you have any other concerns
> > > > you want us to address?
> > > >
> > > > -Vlad
> > > >
> > > >
> > > > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <bus...@apache.org>
> > wrote:
> > > >
> > > >> "docs will come to Apache soon" does not address my concern around
> > > >> docs at all, unless said docs have already made it into the project
> > > >> repo. I don't want third party resources for using a major and
> > > >> important feature of the project, I want us to provide end users with
> > > >> what they need to get the job done.
> > > >>
> > > >> I see some calls for patience on the failure testing, but the appeal
> > > >> to us having done a bad job of requiring proper tests of previous
> > > >> features just makes me more concerned about not getting them here. I
> > > >> don't want to set yet another bad example that will then be pointed to
> > > >> in the future.
> > > >>
> > > >> On Sep 8, 2016 10:50, "Ted Yu" <yuzhih...@gmail.com> wrote:
> > > >>
> > > >> > Is there any concern which is not addressed ?
> > > >> >
> > > >> > Do we need another Vote thread ?
> > > >> >
> > > >> > Thanks
> > > >> >
> > > >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <
> apurt...@apache.org
> > >
> > > >> > wrote:
> > > >> >
> > > >> > > Vlad,
> > > >> > >
> > > >> > > I apologize for using the term 'half-baked' in a way that could
> > > >> > > seem a description of HBASE-7912. I meant that as a general
> > > >> > > hypothetical.
> > > >> > >
> > > >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov <
> > > >> > vladrodio...@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > >> I'm not sure that "There is already lots of half-baked code
> > > >> > > > in the branch, so what's the harm in adding more?"
> > > >> > > >
> > > >> > > > I meant not production-ready yet. This is the 2.0 development
> > > >> > > > branch and hence many features are in the works, not being
> > > >> > > > tested well, etc. I do not consider backup a half baked feature -
> > > >> > > > it has passed our internal QA and has a very good doc, which we
> > > >> > > > will provide to Apache shortly.
> > > >> > > >
> > > >> > > > -Vlad
> > > >> > > >
> > > >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell <
> > > apurt...@apache.org>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > We shouldn't admit half baked changes that won't be finished.
> > > >> > > > > However, in this case the crew working on this feature are long
> > > >> > > > > timers and less likely than just about anyone to leave something
> > > >> > > > > in a half baked state. Of course there is no guarantee how
> > > >> > > > > anything will turn out, but I am willing to take a little on
> > > >> > > > > faith if they feel their best path forward now is to merge to
> > > >> > > > > trunk. I only wish I had bandwidth to have done some real
> > > >> > > > > kicking of the tires by now. Maybe this week.
> > > >> > > > >
> > > >> > > > > (Yes, I'm using some of that time for this email :-) but I type
> > > >> > > > > fast.)
> > > >> > > > >
> > > >> > > > > That said, I would like to agitate for making 2.0 more real and
> > > >> > > > > spend some time on it now that I'm winding down with 0.98. I
> > > >> > > > > think that means branching for 2.0 real soon now and even
> > > >> > > > > evicting things from 2.0 branch that aren't finished or stable,
> > > >> > > > > leaving them only once again in the master branch. Or, maybe
> > > >> > > > > just evicting them. Let's take it case by case.
> > > >> > > > >
> > > >> > > > > I think this feature can come in relatively safely. As added
> > > >> > > > > insurance, let's admit the possibility it could be reverted on
> > > >> > > > > the 2.0 branch if folks working on stabilizing 2.0 decide to
> > > >> > > > > evict it because it is unfinished or unstable, because that
> > > >> > > > > certainly can happen. I would expect if talk like that starts,
> > > >> > > > > we'd get help finishing or stabilizing what's under discussion
> > > >> > > > > for revert. Or, we'd have a revert. Either way the outcome is
> > > >> > > > > acceptable.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak <
> > > dimaspi...@apache.org
> > > >> >
> > > >> > > > wrote:
> > > >> > > > >
> > > >> > > > > > I'm not sure that "There is already lots of half-baked code
> > > >> > > > > > in the branch, so what's the harm in adding more?" is a good
> > > >> > > > > > code commit philosophy for a fault-tolerant distributed data
> > > >> > > > > > store. ;)
> > > >> > > > > >
> > > >> > > > > > More seriously, a lack of test coverage for existing features
> > > >> > > > > > shouldn't be used as justification for introducing new
> > > >> > > > > > features with the same shortcomings. Ultimately, it's the end
> > > >> > > > > > user who will feel the pain, so shouldn't we do everything we
> > > >> > > > > > can to mitigate that?
> > > >> > > > > >
> > > >> > > > > > -Dima
> > > >> > > > > >
> > > >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov <
> > > >> > > > > vladrodio...@gmail.com>
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Sean,
> > > >> > > > > > >
> > > >> > > > > > > * have docs
> > > >> > > > > > >
> > > >> > > > > > > Agree. We have a doc and backup is the most documented
> > > >> > > > > > > feature :), we will release it shortly to Apache.
> > > >> > > > > > >
> > > >> > > > > > > * have sunny-day correctness tests
> > > >> > > > > > >
> > > >> > > > > > > The feature has close to 60 test cases, which run for
> > > >> > > > > > > approx 30 min. We can add more, if the community does not
> > > >> > > > > > > mind :)
> > > >> > > > > > >
> > > >> > > > > > > * have correctness-in-face-of-failure tests
> > > >> > > > > > >
> > > >> > > > > > > Any examples of these tests in existing features? This is
> > > >> > > > > > > in the works; we have a clear understanding of what should
> > > >> > > > > > > be done by the time of the 2.0 release. That is a very
> > > >> > > > > > > near-term goal for us: to verify IT monkey for existing
> > > >> > > > > > > code.
> > > >> > > > > > >
> > > >> > > > > > > * don't rely on things outside of HBase for normal
> > > >> > > > > > > operation (okay for advanced operation)
> > > >> > > > > > >
> > > >> > > > > > > We do not.
> > > >> > > > > > >
> > > >> > > > > > > Enormous time has already been spent on developing and
> > > >> > > > > > > testing the feature; it has passed our internal tests and
> > > >> > > > > > > many rounds of code reviews by HBase committers. We do not
> > > >> > > > > > > mind if someone from the HBase community (outside of HW)
> > > >> > > > > > > reviews the code, but it will probably take forever to
> > > >> > > > > > > wait for a volunteer; the feature is quite large (1MB+
> > > >> > > > > > > cumulative patch).
> > > >> > > > > > >
> > > >> > > > > > > The 2.0 branch is full of half baked features, most of
> > > >> > > > > > > them in active development, therefore I am not following
> > > >> > > > > > > you here, Sean. Why is HBASE-7912 not good enough yet to
> > > >> > > > > > > be integrated into the 2.0 branch?
> > > >> > > > > > >
> > > >> > > > > > > -Vlad
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey <
> > > bus...@apache.org
> > > >> >
> > > >> > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser <
> > > >> > > josh.el...@gmail.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > > > > So, the answer to Sean's original question is "as
> > > >> > > > > > > > > robust as snapshots presently are"? (independence of
> > > >> > > > > > > > > backup/restore failure tolerance from snapshot failure
> > > >> > > > > > > > > tolerance)
> > > >> > > > > > > > >
> > > >> > > > > > > > > Is this just a question WRT context of the change, or
> > > >> > > > > > > > > is it means for a veto from you, Sean? Just trying to
> > > >> > > > > > > > > make sure I'm following along adequately.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > I'd say ATM I'm -0, bordering on -1 but not for reasons
> > > >> > > > > > > > I can articulate well.
> > > >> > > > > > > >
> > > >> > > > > > > > Here's an attempt.
> > > >> > > > > > > >
> > > >> > > > > > > > We've been trying to move, as a community, towards
> > > >> > > > > > > > minimizing risk to downstream folks by getting "complete
> > > >> > > > > > > > enough for use" gates in place before we introduce new
> > > >> > > > > > > > features. This was spurred by some features getting in
> > > >> > > > > > > > half-baked and never making it to "can really use" status
> > > >> > > > > > > > (I'm thinking of distributed log replay and the zk-less
> > > >> > > > > > > > assignment stuff, I don't recall if there was more).
> > > >> > > > > > > >
> > > >> > > > > > > > The gates, generally, included things like:
> > > >> > > > > > > >
> > > >> > > > > > > > * have docs
> > > >> > > > > > > > * have sunny-day correctness tests
> > > >> > > > > > > > * have correctness-in-face-of-failure tests
> > > >> > > > > > > > * don't rely on things outside of HBase for normal
> > > >> > > > > > > > operation (okay for advanced operation)
> > > >> > > > > > > >
> > > >> > > > > > > > As an example, we kept the MOB work off in a branch and
> > > >> > > > > > > > out of master until it could pass these criteria. The big
> > > >> > > > > > > > exemption we've had to this was the hbase-spark
> > > >> > > > > > > > integration, where we all agreed it could land in master
> > > >> > > > > > > > because it was very well isolated (the slide away from
> > > >> > > > > > > > including docs as a first-class part of building up that
> > > >> > > > > > > > integration has led me to doubt the wisdom of this
> > > >> > > > > > > > decision).
> > > >> > > > > > > >
> > > >> > > > > > > > We've also been treating inclusion in "probably will be
> > > >> > > > > > > > released to downstream" branches as a higher bar,
> > > >> > > > > > > > requiring
> > > >> > > > > > > >
> > > >> > > > > > > > * don't moderately impact performance when the feature
> > > >> > > > > > > > isn't in use
> > > >> > > > > > > > * don't severely impact performance when the feature is
> > > >> > > > > > > > in use
> > > >> > > > > > > > * either default-to-on or show enough demand to believe a
> > > >> > > > > > > > non-trivial number of folks will turn the feature on
> > > >> > > > > > > >
> > > >> > > > > > > > The above has kept MOB and hbase-spark integration out of
> > > >> > > > > > > > branch-1, presumably while they've "gotten more stable"
> > > >> > > > > > > > in master from the odd vendor inclusion.
> > > >> > > > > > > >
> > > >> > > > > > > > Are we going to have a 2.0 release before the end of the
> > > >> > > > > > > > year? We're coming up on 1.5 years since the release of
> > > >> > > > > > > > version 1.0; seems like it's about time, though I haven't
> > > >> > > > > > > > seen any concrete plans this year. Presuming we are going
> > > >> > > > > > > > to have one by the end of the year, it seems a bit close
> > > >> > > > > > > > to still be adding in "features that need maturing" on
> > > >> > > > > > > > the branch.
> > > >> > > > > > > >
> > > >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from
> > > >> > > > > > > > considering these things blockers at the moment. But I
> > > >> > > > > > > > know first hand how much trouble folks have had with
> > > >> > > > > > > > other features that have gone into downstream facing
> > > >> > > > > > > > releases without robustness checks (i.e. replication),
> > > >> > > > > > > > and I'm concerned about what we're setting up if 2.0 goes
> > > >> > > > > > > > out with this feature in its current state.
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > --
> > > >> > > > > Best regards,
> > > >> > > > >
> > > >> > > > >    - Andy
> > > >> > > > >
> > > >> > > > > Problems worthy of attack prove their worth by hitting
> back. -
> > > Piet
> > > >> > > Hein
> > > >> > > > > (via Tom White)
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > > Best regards,
> > > >> > >
> > > >> > >    - Andy
> > > >> > >
> > > >> > > Problems worthy of attack prove their worth by hitting back. -
> > Piet
> > > >> Hein
> > > >> > > (via Tom White)
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>
