I would be interested to be the RM for 2.1 :)

On Wed, Mar 14, 2018 at 02:48, Stack <st...@duboce.net> wrote:
> I took a look. It's great. Our replication is in bad need of loving. The
> patch does a refactor/revamp/fix of an internal facility, putting our peer
> management up on a Pv2 basis. It also moves forward the long-rumored
> distributed procedure project.
>
> It's a big change though. Would have been great to have it make 2.0.0, but
> it didn't. I'd be up for it in a minor release rather than waiting on
> 3.0.0, given we allow ourselves some leeway adding facility in minors, and
> that it is at core a solidifying fix. It needs to be doc'd and tested,
> verified on a deploy beyond unit tests. I could help test. Is it proven
> compatible with existing replication deploys?
>
> Who's the 3.0.0 RM? When is it going to roll? Having this in place is the
> best argument if folks propose back-ports. Without it, we are doomed to
> repeat the 2.0.0 experience.
>
> Thanks,
> St.Ack
>
> On Tue, Mar 13, 2018 at 3:44 AM, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>
>> I've already merged it to branch-2...
>>
>> And as for the procedure based replication peer modification, the feature
>> is finished; at least there are no critical TODOs. I will not say it has
>> no bugs, but I do not think it will block the 2.1 or 3.0 release too much.
>>
>> And please trust my judgement; I'm not a man who only wants to show off.
>> For example, I just reverted the serial replication feature from branch-2
>> before we released beta2, and tried to redo it on master, because we found
>> some critical problems. Since the basic idea has not changed, we decided
>> to redo it on master, as it will not take too much time to finish. And
>> HBASE-19064 is a big feature, so we decided to do it on a feature branch.
>> We will select the branches we want to merge back to when we finish the
>> feature.
>>
>> For the CP cleanup, I think the problem is that we started too late, and
>> in the past we made mistakes and let the related projects inject too much
>> into the internals of HBase. And B&R is something like the serial
>> replication feature: serial replication we tested and found some critical
>> problems; B&R was not well tested (IIRC), so we were afraid there would be
>> critical problems. And for AMv2, I think the problem was premature
>> optimization. We even implemented our own AvlTree, and also a very
>> complicated scheduler which always deadlocks on us...
>>
>> These are all very good cases for software engineering in the real
>> world... They should be avoided in the future... That's also my duty...
>>
>> Thanks.
>>
>> 2018-03-13 17:50 GMT+08:00 Apekshit Sharma <a...@cloudera.com>:
>>
>>> I thought the problem with releasing 2.0 was that there were too many
>>> open features dragging along, not coming to a finish - AMv2 (the biggest
>>> one), CP cleanup, B&R, etc. If that were not the case, it wouldn't have
>>> been a year between the first alpha and GA, but rather just a few months.
>>>
>>> That said, I don't mind new but hardened features which are already
>>> ready to ship (implementation only or not) going into a minor version,
>>> but that's my personal opinion. Going too aggressive on that can indeed
>>> lead to the trap Josh mentioned above.
>>>
>>> For this particular change, my +1 was based on the following aspects:
>>> - it's internal
>>> - it moves ops onto the procv2 framework (gives failure recovery,
>>>   locking, etc.)
>>> - Duo's reasoning: "...the synchronous guarantee is really a good thing
>>>   for our replication related UTs..." and "For the replication peer
>>>   tracking, it is the same problem. It is hard to do fencing with a zk
>>>   watcher since it is asynchronous, so the UTs are always kind of flaky,
>>>   in theory."
>>>
>>> That said, it's pending review.
>>> @Duo: As you know, it's not possible to spend cycles on it right now -
>>> pending 2.0 GA - so can you please hold it off for a few weeks (ideally,
>>> until GA + 2-3 weeks)? That will give the community (whoever is
>>> interested; at least me.. smile) a decent chance to review it.
>>> Thanks
>>>
>>> -- Appy
>>>
>>> On Tue, Mar 13, 2018 at 1:02 AM, Josh Elser <els...@apache.org> wrote:
>>>
>>>> (Sorry for something other than just a vote)
>>>>
>>>> I worry about seeing a "big" feature branch merge as us falling back
>>>> into the 1.x trap. Starting to backport features into 2.x will keep us
>>>> delaying 3.0, as we have less and less incentive to push to release 3.0
>>>> in a timely manner.
>>>>
>>>> That said, I also don't want my worries to bar a feature which appears
>>>> to be related to implementation only (based on one high-level read of
>>>> the changes). Perhaps we need to re-think what is allowable for a Y
>>>> release in x.y.z...
>>>>
>>>> +1 for master (which already happened, maybe?)
>>>> +0 for branch-2 (simply because I haven't looked closely enough at the
>>>> changes; I can read through and try to change to +1 if you need the
>>>> votes)
>>>>
>>>> On 3/9/18 2:41 AM, 张铎(Duo Zhang) wrote:
>>>>
>>>>> Since branch-2.0 has been cut and branch-2 is now 2.1.0-SNAPSHOT, I
>>>>> will merge branch HBASE-19397-branch-2 back to branch-2.
>>>>>
>>>>> 2018-01-10 9:20 GMT+08:00 张铎(Duo Zhang) <palomino...@gmail.com>:
>>>>>
>>>>>> If branch-2.0 will be out soon, then let's target this to 2.1. No
>>>>>> problem.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> 2018-01-10 1:28 GMT+08:00 Stack <st...@duboce.net>:
>>>>>>
>>>>>>> On Mon, Jan 8, 2018 at 8:19 PM, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>>
>>>>>>>> OK, let me merge it to master first, and then create a
>>>>>>>> HBASE-19397-branch-2 which will keep rebasing on the newest
>>>>>>>> branch-2, to see if it is stable enough. Since we can define this
>>>>>>>> as a bug fix/refactoring rather than a big new feature, it is OK to
>>>>>>>> integrate it at any time. If we think it is stable enough before
>>>>>>>> cutting branch-2.0 then we can include it in the 2.0.0 release;
>>>>>>>> else let's include it in 2.1 (maybe we can backport it to 2.0
>>>>>>>> later?).
>>>>>>>
>>>>>>> I need to cut the Appy-suggested branch-2.0. Shout if
>>>>>>> HBASE-19397-branch-2 gets to be too much work and I'll do it sooner
>>>>>>> rather than later. Or, if easier on you, just say so and I'll make
>>>>>>> branch-2.0 now so you can just commit to branch-2 (branch-2.0 will
>>>>>>> become hbase2.0, branch-2 will become hbase2.1...).
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>>> Thanks all here.
>>>>>>>>
>>>>>>>> 2018-01-09 12:06 GMT+08:00 ashish singhi <ashish.sin...@huawei.com>:
>>>>>>>>
>>>>>>>>> +1 to merge on master and 2.1.
>>>>>>>>> Great work.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ashish
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: 张铎(Duo Zhang) [mailto:palomino...@gmail.com]
>>>>>>>>> Sent: Tuesday, January 09, 2018 6:53 AM
>>>>>>>>> To: dev@hbase.apache.org
>>>>>>>>> Subject: Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.
>>>>>>>>>
>>>>>>>>>> Anyway, if there are no objections to merging this into master,
>>>>>>>>>> let's do it, so that we can start working on the follow-on
>>>>>>>>>> features, such as table based replication storage, synchronous
>>>>>>>>>> replication, etc.
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>> 2018-01-09 7:19 GMT+08:00 Stack <st...@duboce.net>:
>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This 'new' feature only changes the DDL part, not the core part
>>>>>>>>>>>> of replication, i.e., how to read wal entries and how to
>>>>>>>>>>>> replicate them to the remote cluster, etc. And there is no pb
>>>>>>>>>>>> message/storage layout change either, so you can think of this
>>>>>>>>>>>> as a big refactoring. Theoretically we do not even need to add
>>>>>>>>>>>> new UTs for this feature, i.e., no extra stability work. The
>>>>>>>>>>>> only visible change to users is that it may take more time to
>>>>>>>>>>>> modify peers in the shell. So in general I think it does little
>>>>>>>>>>>> harm to include it in the coming release?
>>>>>>>>>>>>
>>>>>>>>>>>> And the reason I think it SHOULD be included in our 2.0 release
>>>>>>>>>>>> is that the synchronous guarantee is really a good thing for
>>>>>>>>>>>> our replication related UTs. The correctness of the current
>>>>>>>>>>>> Test***Replication tests usually depends on a flaky condition:
>>>>>>>>>>>> that no log rolling happens between the modification on zk and
>>>>>>>>>>>> the actual loading of the modification on the RS. And we have
>>>>>>>>>>>> also done hard work to clean up the locking in
>>>>>>>>>>>> ReplicationSourceManager and add a fat comment to say why it
>>>>>>>>>>>> should be synchronized in this way. In general, the new code is
>>>>>>>>>>>> much easier to read, test and debug, and also reduces the
>>>>>>>>>>>> possibility of flakiness, which could help us a lot when we
>>>>>>>>>>>> start to stabilize our build.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
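For context, the peer modifications in question are the replication DDL
commands in the HBase shell; each maps onto one of the five procedures
listed in the release note further down. A usage sketch, where the cluster
key and config key are made-up placeholders:

    hbase> add_peer '1', CLUSTER_KEY => "zk1.example.com:2181:/hbase"
    hbase> disable_peer '1'
    hbase> update_peer_config '1', CONFIG => { 'some.config.key' => 'value' }
    hbase> enable_peer '1'
    hbase> remove_peer '1'

With the procedure-based implementation, each command returns only after the
change has taken effect on every region server, which is the synchronous
guarantee Duo describes.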
>>>>>>>>>>>
>>>>>>>>>>> You see it as a big bug fix, Duo?
>>>>>>>>>>
>>>>>>>>>> Kind of. Just like AM v1: we could do lots of fixes to make it
>>>>>>>>>> more stable, but we know we can not fix all the problems, since
>>>>>>>>>> we store state in several places and it is 'mission impossible'
>>>>>>>>>> to keep all the states in sync under every situation... So we
>>>>>>>>>> introduced AM v2. For the replication peer tracking, it is the
>>>>>>>>>> same problem. It is hard to do fencing with a zk watcher since it
>>>>>>>>>> is asynchronous, so the UTs are always kind of flaky, in theory.
>>>>>>>>>> And we depend on replication heavily at Xiaomi; it has always
>>>>>>>>>> been a pain for us.
>>>>>>>>>>
>>>>>>>>>>> I'm late to review. Will have a look after beta-1 goes out. This
>>>>>>>>>>> stuff looks great from the outside, especially the distributed
>>>>>>>>>>> procedure framework, which we need all over the place.
>>>>>>>>>>>
>>>>>>>>>>> In general I have no problem w/ this in master and an hbase 2.1
>>>>>>>>>>> (2.1 could be soon after 2.0). It's late for big stuff in
>>>>>>>>>>> 2.0.... but let me take a looksee, sir.
>>>>>>>>>>
>>>>>>>>>> Thanks, sir. All the concerns here about whether we should merge
>>>>>>>>>> this into 2.0 are reasonable, I know. Although I really want this
>>>>>>>>>> in 2.0, because I believe it will help a lot with stabilizing,
>>>>>>>>>> I'm OK with merging it to 2.1 only, if you all think so.
>>>>>>>>>>
>>>>>>>>>>> St.Ack
>>>>>>>>>>>
>>>>>>>>>>>> 2018-01-09 4:53 GMT+08:00 Apekshit Sharma <a...@cloudera.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Same questions as Josh's.
>>>>>>>>>>>>> 1) We have RCs for beta1 now, which means the only commits
>>>>>>>>>>>>> that can go in are bug fixes. This change - although
>>>>>>>>>>>>> important, needed for a long time, and well done (testing,
>>>>>>>>>>>>> summary, etc.) - seems rather too large to get into 2.0 now.
>>>>>>>>>>>>> It needs good justification for why it has to be 2.0 instead
>>>>>>>>>>>>> of 2.1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- Appy
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jan 8, 2018 at 8:34 AM, Josh Elser <els...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -0 From a general project-planning point of view (not based
>>>>>>>>>>>>>> on the technical merit of the code), I am uncomfortable about
>>>>>>>>>>>>>> pulling in a brand new feature after we've already made one
>>>>>>>>>>>>>> beta RC.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Duo -- can you expand on why this feature is so important
>>>>>>>>>>>>>> that we should break our release plan? Are there problems
>>>>>>>>>>>>>> that would make including this in a 2.1/3.0 unnecessarily
>>>>>>>>>>>>>> difficult? Any kind of color you can provide on "why does
>>>>>>>>>>>>>> this need to go into 2.0?" would be helpful.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 1/6/18 1:54 AM, Duo Zhang wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HBASE-19397
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We aim to move the peer modification framework from zk
>>>>>>>>>>>>>>> watcher to procedure v2 in this issue, and the work is done
>>>>>>>>>>>>>>> now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Copying the release note here:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Introduce 5 procedures to do peer modifications:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AddPeerProcedure
>>>>>>>>>>>>>>> RemovePeerProcedure
>>>>>>>>>>>>>>> UpdatePeerConfigProcedure
>>>>>>>>>>>>>>> EnablePeerProcedure
>>>>>>>>>>>>>>> DisablePeerProcedure
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The procedures all execute in the following stages:
>>>>>>>>>>>>>>> 1. Call the pre CP hook; if an exception is thrown, give up.
>>>>>>>>>>>>>>> 2. Check whether the operation is valid; if not, give up.
>>>>>>>>>>>>>>> 3. Update the peer storage. Note that once we have entered
>>>>>>>>>>>>>>>    this stage, we can not roll back any more.
>>>>>>>>>>>>>>> 4. Schedule sub procedures to refresh the peer config on
>>>>>>>>>>>>>>>    every RS.
>>>>>>>>>>>>>>> 5. Do post cleanup, if any.
>>>>>>>>>>>>>>> 6. Call the post CP hook. An exception thrown here will be
>>>>>>>>>>>>>>>    ignored, since we have already done the work.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The procedure holds an exclusive lock on the peer id, so now
>>>>>>>>>>>>>>> there are no concurrent modifications on a single peer. And
>>>>>>>>>>>>>>> it is now guaranteed that once the procedure is done, the
>>>>>>>>>>>>>>> peer modification has already taken effect on all RSes.
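A minimal, self-contained Java sketch of the staged execution described in
the release note above. The names here are illustrative, not the actual
HBASE-19397 classes; the real procedures are built on HBase's procedure-v2
framework, which also persists state and re-drives execution after a master
crash - machinery omitted here:

    // Models the six stages above as a simple state machine. Concrete
    // procedures (add/remove/enable/disable/update-config) supply the
    // per-stage behavior.
    abstract class PeerModificationSketch {

      enum Stage {
        PRE_CP_HOOK, PRE_CHECK, UPDATE_STORAGE, REFRESH_RS,
        POST_CLEANUP, POST_CP_HOOK, DONE
      }

      private Stage stage = Stage.PRE_CP_HOOK;

      protected abstract void preCpHook() throws Exception;             // stage 1
      protected abstract void checkOperationIsValid() throws Exception; // stage 2
      protected abstract void updatePeerStorage() throws Exception;     // stage 3
      protected abstract void refreshPeerConfigOnEveryRs();             // stage 4
      protected abstract void postCleanup();                            // stage 5
      protected abstract void postCpHook() throws Exception;            // stage 6

      final void run() throws Exception {
        while (stage != Stage.DONE) {
          switch (stage) {
            case PRE_CP_HOOK:    // 1. an exception aborts before anything changes
              preCpHook();
              stage = Stage.PRE_CHECK;
              break;
            case PRE_CHECK:      // 2. invalid operation -> give up, nothing changed yet
              checkOperationIsValid();
              stage = Stage.UPDATE_STORAGE;
              break;
            case UPDATE_STORAGE: // 3. point of no return: no rollback past here
              updatePeerStorage();
              stage = Stage.REFRESH_RS;
              break;
            case REFRESH_RS:     // 4. sub procedures, one per region server
              refreshPeerConfigOnEveryRs();
              stage = Stage.POST_CLEANUP;
              break;
            case POST_CLEANUP:   // 5. optional cleanup
              postCleanup();
              stage = Stage.POST_CP_HOOK;
              break;
            default:             // 6. work is done, so a post-hook failure is ignored
              try {
                postCpHook();
              } catch (Exception ignored) {
              }
              stage = Stage.DONE;
              break;
          }
        }
      }
    }

The exclusive lock on the peer id and the "done means applied on every RS"
guarantee come from the procedure framework itself, which is what removes
the race the old zk-watcher approach suffered from.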
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Abstracted a storage layer for replication peer/queue
>>>>>>>>>>>>>>> management, and refactored the upper layer to remove
>>>>>>>>>>>>>>> zk-related naming/code/comments.
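A sketch of what such a storage abstraction might look like - hypothetical
method names, not the actual HBase interface. The point is that callers no
longer see zk at all, so a 'table based' implementation could later be
swapped in behind the same interface:

    import java.util.List;

    // Hypothetical peer-storage abstraction. Implementations (zk-backed
    // today, possibly hbase-table-backed later) hide where peer state lives.
    interface PeerStorageSketch {
      /** Persist a new peer with its config and initial enabled/disabled state. */
      void addPeer(String peerId, String clusterKey, boolean enabled) throws Exception;

      /** Remove the peer and all of its state. */
      void removePeer(String peerId) throws Exception;

      /** Flip the enabled/disabled flag for the peer. */
      void setPeerState(String peerId, boolean enabled) throws Exception;

      /** Replace the stored config for the peer. */
      void updatePeerConfig(String peerId, String clusterKey) throws Exception;

      /** List the ids of all known peers. */
      List<String> listPeerIds() throws Exception;
    }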
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Added pre/postExecuteProcedures CP hooks to
>>>>>>>>>>>>>>> RegionServerObserver, and added a permission check for the
>>>>>>>>>>>>>>> executeProcedures method which requires the caller to be the
>>>>>>>>>>>>>>> system user or a super user.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On rolling upgrade: just do not make any replication peer
>>>>>>>>>>>>>>> modifications during the rolling upgrade. There are no
>>>>>>>>>>>>>>> pb/layout changes to the peer/queue storage on zk.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And there are other benefits. First, we have introduced a
>>>>>>>>>>>>>>> general procedure framework to send tasks to the RSes and
>>>>>>>>>>>>>>> report the results back to the Master. It can be used to
>>>>>>>>>>>>>>> implement other operations, such as ACL changes. Second,
>>>>>>>>>>>>>>> since zk is now used only as external storage and we do not
>>>>>>>>>>>>>>> depend on zk watchers any more, it will be much easier to
>>>>>>>>>>>>>>> implement a 'table based' replication peer/queue storage.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please vote:
>>>>>>>>>>>>>>> [+1] Agree
>>>>>>>>>>>>>>> [-1] Disagree
>>>>>>>>>>>>>>> [0] Neutral
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks.