This 'new' feature only changes the DDL part, not the core part of replication, i.e., how we read WAL entries and replicate them to the remote cluster, etc. There is also no pb message or storage layout change, so you can think of this as a big refactoring. In theory we do not even need to add new UTs for this feature, i.e., no extra stabilization work. The only change visible to users is that modifying peers in the shell may take more time. So in general I think there is little harm in including it in the coming release?
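For example, a peer modification issued through the client now runs as a Master procedure and only returns after every RS has loaded the new config. A minimal sketch against the 2.0-era Admin API (the peer id and cluster key below are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

    public class PeerModificationExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
            Admin admin = conn.getAdmin()) {
          ReplicationPeerConfig peer = new ReplicationPeerConfig();
          peer.setClusterKey("zk1.example.com:2181:/hbase"); // made-up key
          // Runs as an AddPeerProcedure on the Master; the call returns
          // only after the config has been refreshed on every RS, so it
          // may take a bit longer than the old fire-and-forget zk write.
          admin.addReplicationPeer("remote", peer);
          admin.disableReplicationPeer("remote"); // also synchronous now
        }
      }
    }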
And the reason I think it SHOULD be included in our 2.0 release is that the synchronous guarantee is really a good thing for our replication-related UTs. The correctness of the current Test***Replication tests usually depends on a flaky condition - that no log rolling happens between the modification on zk and the actual loading of that modification on the RS. We have also done hard work to clean up the locking in ReplicationSourceManager and added a fat comment explaining why it must be synchronized in that way. In general, the new code is much easier to read, test and debug, and it also reduces the possibility of flakiness, which could help us a lot when we start to stabilize our build.
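To make the staged execution concrete, here is a toy sketch of how one of these procedures runs, mirroring the six stages listed in the release note quoted below. Every class and method name in the sketch is made up for illustration; only the stage order and the fan-out-then-wait on every RS reflect what the real procedures do (the real ones also hold an exclusive lock on the peer id):

    import java.util.List;

    // Toy model of the staged execution; all names are invented.
    public class AddPeerSketch {

      interface RegionServerStub {
        void refreshPeerConfig(String peerId) throws Exception;
      }

      private final List<RegionServerStub> regionServers;

      public AddPeerSketch(List<RegionServerStub> regionServers) {
        this.regionServers = regionServers;
      }

      public void addPeer(String peerId, String clusterKey) throws Exception {
        preCpHook(peerId);                      // 1. throws => give up cleanly
        checkValid(peerId, clusterKey);         // 2. throws => give up cleanly
        updatePeerStorage(peerId, clusterKey);  // 3. point of no return
        // 4. The real code schedules one RefreshPeerProcedure per RS and
        // the parent procedure waits for all of them; that wait is what
        // makes peer modifications synchronous.
        for (RegionServerStub rs : regionServers) {
          rs.refreshPeerConfig(peerId);
        }
        postCleanup(peerId);                    // 5. post cleanup, if any
        try {
          postCpHook(peerId);                   // 6. exceptions only logged
        } catch (Exception e) {
          System.err.println("post CP hook failed, ignoring: " + e);
        }
      }

      // No-op stubs so the sketch compiles; the real logic lives in the
      // Master and the replication peer storage.
      private void preCpHook(String peerId) throws Exception {}
      private void checkValid(String peerId, String clusterKey) throws Exception {}
      private void updatePeerStorage(String peerId, String clusterKey) {}
      private void postCleanup(String peerId) {}
      private void postCpHook(String peerId) throws Exception {}
    }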
Thanks.

2018-01-09 4:53 GMT+08:00 Apekshit Sharma <a...@cloudera.com>:

> Same questions as Josh's.
> 1) We have RCs for beta1 now, which means the only commits that can go
> in are bug fixes. This change - although important, needed for a long
> time and well done (testing, summary, etc.) - seems very large to get
> into 2.0 now. It needs good justification why it has to be 2.0 instead
> of 2.1.
>
> -- Appy
>
>
> On Mon, Jan 8, 2018 at 8:34 AM, Josh Elser <els...@apache.org> wrote:
>
> > -0 From a general project planning point-of-view (not based on the
> > technical merit of the code) I am uncomfortable about pulling in a
> > brand new feature after we've already made one beta RC.
> >
> > Duo -- can you expand on why this feature is so important that we
> > should break our release plan? Are there problems that would make
> > including this in a 2.1/3.0 unnecessarily difficult? Any kind of
> > color you can provide on "why does this need to go into 2.0?" would
> > be helpful.
> >
> >
> > On 1/6/18 1:54 AM, Duo Zhang wrote:
> >
> >> https://issues.apache.org/jira/browse/HBASE-19397
> >>
> >> We aim to move the peer modification framework from zk watcher to
> >> procedure v2 in this issue and the work is done now.
> >>
> >> Copy the release note here:
> >>
> >> Introduce 5 procedures to do peer modifications:
> >>
> >>> AddPeerProcedure
> >>> RemovePeerProcedure
> >>> UpdatePeerConfigProcedure
> >>> EnablePeerProcedure
> >>> DisablePeerProcedure
> >>>
> >>> The procedures are all executed in the following stages:
> >>> 1. Call the pre CP hook; if an exception is thrown then give up.
> >>> 2. Check whether the operation is valid; if not then give up.
> >>> 3. Update the peer storage. Notice that once we have entered this
> >>> stage, we can not roll back any more.
> >>> 4. Schedule sub procedures to refresh the peer config on every RS.
> >>> 5. Do post cleanup, if any.
> >>> 6. Call the post CP hook. Any exception thrown here is ignored
> >>> since we have already done the work.
> >>>
> >>> The procedure holds an exclusive lock on the peer id, so there are
> >>> now no concurrent modifications on a single peer.
> >>>
> >>> And it is now guaranteed that once the procedure is done, the peer
> >>> modification has already taken effect on all RSes.
> >>>
> >>> Abstracted a storage layer for replication peer/queue management,
> >>> and refactored the upper layer to remove zk-related
> >>> naming/code/comments.
> >>>
> >>> Added pre/postExecuteProcedures CP hooks to RegionServerObserver,
> >>> and added a permission check for the executeProcedures method,
> >>> which requires the caller to be the system user or a super user.
> >>>
> >>> On rolling upgrade: just do not do any replication peer
> >>> modifications during the rolling upgrade. There are no pb/layout
> >>> changes on the peer/queue storage on zk.
> >>>
> >>>
> >> And there are other benefits.
> >> First, we have introduced a general procedure framework to send
> >> tasks to RSes and report the results back to the Master. It can be
> >> used to implement other operations such as ACL changes.
> >> Second, since we no longer depend on zk watchers, zk is now used as
> >> pure external storage, which makes it much easier to implement a
> >> 'table based' replication peer/queue storage.
> >>
> >> Please vote:
> >> [+1] Agree
> >> [-1] Disagree
> >> [0] Neutral
> >>
> >> Thanks.
> >>
> >>
>
> --
> -- Appy
>