[ https://issues.apache.org/jira/browse/HBASE-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509344#comment-13509344 ]
stack commented on HBASE-7212:
------------------------------

On the curator double-barrier, it would seem there is no 'abort', as you say. They do have timeouts on barrier enter and leave. Would that be enough? See http://www.jarvana.com/jarvana/view/com/netflix/curator/curator-recipes/0.6.4/curator-recipes-0.6.4-javadoc.jar!/com/netflix/curator/framework/recipes/barriers/DistributedDoubleBarrier.html#leave()

Rather than 'abort', could you just time out? That might be simpler still, i.e. it would cover your "Need to be able to force failure after a specified timeout elapses".

The double-barrier does not seem to be enough on its own, though. There needs to be a means of telling cluster members to go for a particular snapshot barrier. To this end, I suppose all members need to be watching a snapshot dir, and when a new snapshot appears, all try to 'enter' its barrier?

bq. Yes -- reached is sent when the coordinator figures out that it has "reached" the global barrier point because all members have taken their part of the global barrier.

Is it true that you do not want members to start 'snapshotting' until ALL participants have 'entered' the barrier? Does it matter if they start doing their work as soon as they 'enter' the barrier (using curator/zk recipe terms)? Reading on, it seems like it's fine if members just go about their merry way, working on their part of the snapshot. If not all members complete, the coordinator will clean up the incomplete snapshot.

What do you think of the terms in the zk recipe, i.e. rather than 'reach' a barrier, 'enter' it?

Some of the answers you give above should go into the doc for this feature. They are quality.

I buy your argument for going w/ the more basic barrier rather than a 2pc function for snapshots. (Yeah, 2pc would be useful for other distributed ops like table enable/disable, w/ us 'failing forward' an interrupted table enable or disable.)

On 'Comms', it was just unclear to me what it was. Makes sense now.
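
To make the "time out rather than abort" idea concrete, here is a minimal in-process sketch. It does not use Curator or ZooKeeper; `LocalDoubleBarrier` and `BarrierTimeoutSketch` are hypothetical names, and the barrier is only loosely modeled on a double barrier with timed enter and leave. It assumes a member that never shows up is detected purely by the other members' waits expiring.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical in-process stand-in for a double barrier with timed enter/leave. */
final class LocalDoubleBarrier {
    private final CountDownLatch entered;
    private final CountDownLatch left;

    LocalDoubleBarrier(int members) {
        entered = new CountDownLatch(members);
        left = new CountDownLatch(members);
    }

    /** Announce arrival, then wait for everyone; false means the wait timed out. */
    boolean enter(long maxWait, TimeUnit unit) throws InterruptedException {
        entered.countDown();
        return entered.await(maxWait, unit);
    }

    /** Announce completion, then wait for everyone; false means the wait timed out. */
    boolean leave(long maxWait, TimeUnit unit) throws InterruptedException {
        left.countDown();
        return left.await(maxWait, unit);
    }
}

final class BarrierTimeoutSketch {
    /** Starts `present` of `expected` members; true only if all of them got through both barriers. */
    static boolean runSnapshot(int expected, int present, long timeoutMs) {
        LocalDoubleBarrier barrier = new LocalDoubleBarrier(expected);
        AtomicBoolean allOk = new AtomicBoolean(true);
        Thread[] threads = new Thread[present];
        for (int i = 0; i < present; i++) {
            threads[i] = new Thread(() -> {
                try {
                    if (!barrier.enter(timeoutMs, TimeUnit.MILLISECONDS)) {
                        allOk.set(false); // no explicit abort: the timeout is the failure signal
                        return;
                    }
                    // The member's part of the snapshot would run here.
                    if (!barrier.leave(timeoutMs, TimeUnit.MILLISECONDS)) {
                        allOk.set(false);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    allOk.set(false);
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return allOk.get();
    }

    public static void main(String[] args) {
        System.out.println("all members present: " + runSnapshot(3, 3, 2000));
        System.out.println("one member missing:  " + runSnapshot(3, 2, 200));
    }
}
```

With one member missing, the two present members fail only after their own waits expire, which is the trade-off against an explicit abort: simpler, but failure is noticed no sooner than the timeout.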
> Globally Barriered Procedure mechanism
> --------------------------------------
>
> Key: HBASE-7212
> URL: https://issues.apache.org/jira/browse/HBASE-7212
> Project: HBase
> Issue Type: Sub-task
> Components: snapshots
> Affects Versions: hbase-6055
> Reporter: Jonathan Hsieh
> Assignee: Jonathan Hsieh
> Fix For: hbase-6055
>
> Attachments: 121127-global-barrier-proc.pdf, hbase-7212.patch, pre-hbase-7212.patch
>
>
> This is a simplified version of what was proposed in HBASE-6573. Instead of
> claiming to be a 2pc or 3pc implementation (which implies logging at each
> actor, and recovery operations), this just provides a best-effort global
> barrier mechanism called a Procedure.
> Users need only implement methods to acquireBarrier, to act when
> insideBarrier, and to releaseBarrier that use the ExternalException
> cooperative error checking mechanism.
> Globally consistent snapshots require the ability to quiesce writes to a set
> of region servers before the snapshot operation is executed. Also, if any
> node fails, it needs to be able to notify them so that they abort.
> The first cut of other online snapshots doesn't need the full barrier but may
> still use this for its error propagation mechanisms.
> This version removes the extra layer incurred in the previous implementation
> due to the use of generics, separates the coordinator and members, and
> reduces the amount of inheritance used in favor of composition.
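
The three member steps named in the issue description can be sketched roughly as follows. This is not the code from hbase-7212.patch: the `BarrierMember` interface, the `ErrorMonitor` class, and all signatures are hypothetical, chosen only to show how acquire, inside, and release might interleave with a cooperative error check that the worker polls between steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical member-side hooks, named after the three steps in the description. */
interface BarrierMember {
    void acquireBarrier() throws Exception;
    void insideBarrier() throws Exception;
    void releaseBarrier();
}

/** Hypothetical cooperative error check: anyone may post an error, workers poll for it. */
final class ErrorMonitor {
    private final AtomicReference<Exception> error = new AtomicReference<>();

    /** Records the first error reported; later reports are ignored. */
    void receive(Exception e) {
        error.compareAndSet(null, e);
    }

    /** Throws the recorded error, if any, so the caller stops at a safe point. */
    void rethrowIfFailed() throws Exception {
        Exception e = error.get();
        if (e != null) {
            throw e;
        }
    }
}

final class ProcedureSketch {
    /** Drives one member through the steps and returns a log of what ran. */
    static List<String> run(ErrorMonitor monitor, boolean failBeforeInside) {
        final List<String> log = new ArrayList<>();
        BarrierMember member = new BarrierMember() {
            @Override public void acquireBarrier() { log.add("acquire"); }
            @Override public void insideBarrier() { log.add("inside"); }
            @Override public void releaseBarrier() { log.add("release"); }
        };
        try {
            monitor.rethrowIfFailed();
            member.acquireBarrier();
            if (failBeforeInside) {
                // Simulates another node reporting a failure while we wait at the barrier.
                monitor.receive(new Exception("remote member failed"));
            }
            monitor.rethrowIfFailed();
            member.insideBarrier();
        } catch (Exception e) {
            log.add("aborted: " + e.getMessage());
        } finally {
            member.releaseBarrier(); // always release, whether or not the barrier was completed
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(new ErrorMonitor(), false));
        System.out.println(run(new ErrorMonitor(), true));
    }
}
```

Polling the monitor between steps is what makes the error handling cooperative in this sketch: a failure elsewhere never interrupts a step midway, it only prevents the next step from starting.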