+1 on merging to master branch. On Tue, Jan 9, 2018 at 6:51 AM, Josh Elser <els...@apache.org> wrote:
> +1 on merge to Master. > > I appreciate the details you shared, Duo, but I think I'm still -0 on a > branch-2 merge at this point. I'm with Stack: would rather pull it into a > fast 2.1 release. > > > On 1/8/18 8:22 PM, 张铎(Duo Zhang) wrote: > >> Anyway, if no objections on merging this into master, let's do it? So that >> we can start working on the follow-on features, such as table based >> replication storage, and synchronous replication, etc. >> >> Thanks. >> >> 2018-01-09 7:19 GMT+08:00 Stack <st...@duboce.net>: >> >> On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <palomino...@gmail.com> >>> wrote: >>> >>> This 'new' feature only changes DDL part, not the core part of >>>> >>> replication, >>> >>>> i.e, how to read wal entries and how to replicate it to the remote >>>> >>> cluster, >>> >>>> etc. And also there is no pb message/storage layout change, you can >>>> think >>>> of this as a big refactoring. Theoretically we even do not need to add >>>> >>> new >>> >>>> UTs for this feature, i.e, no extra stability works. The only visible >>>> change to users is that it may require more time on modifying peers in >>>> shell. So in general I think it is less hurt to include it in the coming >>>> release? >>>> >>>> And why I think it SHOULD be included in our 2.0 release is that, the >>>> synchronous guarantee is really a good thing for our replication related >>>> UTs. The correctness of the current Test***Replication usually depends >>>> >>> on a >>> >>>> flakey condition - we will not do a log rolling between the modification >>>> >>> on >>> >>>> zk and the actual loading of the modification on RS. And we have also >>>> >>> done >>> >>>> a hard work to cleanup the lockings in ReplicationSourceManager and add >>>> a >>>> fat comment to say why it should be synchronized in this way. In >>>> general, >>>> the new code is much easier to read, test and debug, and also reduce the >>>> possibility of flakeyness, which could help us a lot when we start to >>>> stabilize our build. >>>> >>>> Thanks. >>>> >>>> >>>> You see it as a big bug fix Duo? >>> >>> Kind of. Just like the AM v1, we can do lots of fix to make it more >> stable, >> but we know that we can not fix all the problems since we store state in >> several places and it is a 'mission impossible' to make all the states >> stay >> in sync under every situation... So we introduce AM v2. >> For the replication peer tracking, it is the same problem. It is hard to >> do >> fencing with zk watcher since it is asynchronous, so the UTs are always >> kind of flakey in theoretical. And we depend on replication heavily at >> Xiaomi, it is always a pain for us. >> >> >>> I'm late to review. Will have a look after beta-1 goes out. This stuff >>> looks great from outside, especially distributed procedure framework >>> which >>> we need all over the place. >>> >>> In general I have no problem w/ this in master and an hbase 2.1 (2.1 >>> could >>> be soon after 2.0). Its late for big stuff in 2.0.... but let me take a >>> looksee sir. >>> >>> Thanks sir. All the concerns here about whether we should merge this into >> 2.0 are reasonable, I know. Although I really want this in 2.0 because I >> believe it will help a lot for stabilizing, I'm OK with merge it to 2.1 >> only if you guys all think so. >> >> >>> St.Ack >>> >>> >>> >>> >>> >>> 2018-01-09 4:53 GMT+08:00 Apekshit Sharma <a...@cloudera.com>: >>>> >>>> Same questions as Josh's. >>>>> 1) We have RCs for beta1 now, which means only commits that can go in >>>>> >>>> are >>> >>>> bug fixes only. This change - although important, needed from long time >>>>> >>>> and >>>> >>>>> well done (testing, summary, etc) - seems rather very large to get into >>>>> >>>> 2.0 >>>> >>>>> now. Needs good justification why it has to be 2.1 instead of 2.0. >>>>> >>>>> -- Appy >>>>> >>>>> >>>>> On Mon, Jan 8, 2018 at 8:34 AM, Josh Elser <els...@apache.org> wrote: >>>>> >>>>> -0 From a general project planning point-of-view (not based on the >>>>>> technical merit of the code) I am uncomfortable about pulling in a >>>>>> >>>>> brand >>>> >>>>> new feature after we've already made one beta RC. >>>>>> >>>>>> Duo -- can you expand on why this feature is so important that we >>>>>> >>>>> should >>>> >>>>> break our release plan? Are there problems that would make including >>>>>> >>>>> this >>>> >>>>> in a 2.1/3.0 unnecessarily difficult? Any kind of color you can >>>>>> >>>>> provide >>> >>>> on >>>>> >>>>>> "why does this need to go into 2.0?" would be helpful. >>>>>> >>>>>> >>>>>> On 1/6/18 1:54 AM, Duo Zhang wrote: >>>>>> >>>>>> https://issues.apache.org/jira/browse/HBASE-19397 >>>>>>> >>>>>>> We aim to move the peer modification framework from zk watcher to >>>>>>> procedure >>>>>>> v2 in this issue and the work is done now. >>>>>>> >>>>>>> Copy the release note here: >>>>>>> >>>>>>> Introduce 5 procedures to do peer modifications: >>>>>>> >>>>>>> AddPeerProcedure >>>>>>>> RemovePeerProcedure >>>>>>>> UpdatePeerConfigProcedure >>>>>>>> EnablePeerProcedure >>>>>>>> DisablePeerProcedure >>>>>>>> >>>>>>>> The procedures are all executed with the following stage: >>>>>>>> 1. Call pre CP hook, if an exception is thrown then give up >>>>>>>> 2. Check whether the operation is valid, if not then give up >>>>>>>> 3. Update peer storage. Notice that if we have entered this stage, >>>>>>>> >>>>>>> then >>>> >>>>> we >>>>>>>> can not rollback any more. >>>>>>>> 4. Schedule sub procedures to refresh the peer config on every RS. >>>>>>>> 5. Do post cleanup if any. >>>>>>>> 6. Call post CP hook. The exception thrown will be ignored since we >>>>>>>> >>>>>>> have >>>>> >>>>>> already done the work. >>>>>>>> >>>>>>>> The procedure will hold an exclusive lock on the peer id, so now >>>>>>>> >>>>>>> there >>>> >>>>> is >>>>> >>>>>> no concurrent modifications on a single peer. >>>>>>>> >>>>>>>> And now it is guaranteed that once the procedure is done, the peer >>>>>>>> modification has already taken effect on all RSes. >>>>>>>> >>>>>>>> Abstracte a storage layer for replication peer/queue manangement, >>>>>>>> >>>>>>> and >>> >>>> refactored the upper layer to remove zk related >>>>>>>> >>>>>>> naming/code/comment. >>> >>>> >>>>>>>> Add pre/postExecuteProcedures CP hooks to RegionServerObserver, and >>>>>>>> >>>>>>> add >>>> >>>>> permission check for executeProcedures method which requires the >>>>>>>> >>>>>>> caller >>>> >>>>> to >>>>>>>> be system user or super user. >>>>>>>> >>>>>>>> On rolling upgrade: just do not do any replication peer >>>>>>>> >>>>>>> modifications >>> >>>> during the rolling upgrading. There is no pb/layout changes on the >>>>>>>> peer/queue storage on zk. >>>>>>>> >>>>>>>> >>>>>>>> And there are other benefits. >>>>>>> First, we have introduced a general procedure framework to send >>>>>>> >>>>>> tasks >>> >>>> to >>>> >>>>> RS >>>>>>> and report the report back to Master. It can be used to implement >>>>>>> >>>>>> other >>>> >>>>> operations such as ACL change. >>>>>>> Second, zk is used as a external storage now since we do not depend >>>>>>> >>>>>> on >>> >>>> zk >>>>> >>>>>> watcher any more, it will be much easier to implement a 'table >>>>>>> >>>>>> based' >>> >>>> replication peer/queue storage. >>>>>>> >>>>>>> Please vote: >>>>>>> [+1] Agree >>>>>>> [-1] Disagree >>>>>>> [0] Neutral >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> -- >>>>> >>>>> -- Appy >>>>> >>>>> >>>> >>> >>