[ https://issues.apache.org/jira/browse/HBASE-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509465#comment-13509465 ]
Jonathan Hsieh commented on HBASE-7212:
---------------------------------------

bq. What happens when the coordinator dies (in this case hmaster). Does the new HMaster discover the prev procedure and abort?

The new HMaster will delete all znodes associated with the procedure class (here, all znodes associated with snapshotting procedures). Any members still using them should time out and fail, and new operations need to be issued.

For snapshots in particular, there isn't really a chance of a partial snapshot being present, because all the snapshot work is done in a temp dir and atomically put into place with a directory rename after the coordinator realizes that all the members have released the barrier successfully. Junk will be left over in these temp dirs, but it gets cleaned up on the next snapshot attempt or when the new master starts.

> Globally Barriered Procedure mechanism
> --------------------------------------
>
>                 Key: HBASE-7212
>                 URL: https://issues.apache.org/jira/browse/HBASE-7212
>             Project: HBase
>          Issue Type: Sub-task
>          Components: snapshots
>    Affects Versions: hbase-6055
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>             Fix For: hbase-6055
>
>         Attachments: 121127-global-barrier-proc.pdf, hbase-7212.patch, pre-hbase-7212.patch
>
>
> This is a simplified version of what was proposed in HBASE-6573. Instead of claiming to be a 2pc or 3pc implementation (which implies logging at each actor, and recovery operations), this just provides a best-effort global barrier mechanism called a Procedure.
> Users need only implement methods to acquireBarrier, to act when insideBarrier, and to releaseBarrier; these use the ExternalException cooperative error checking mechanism.
> Globally consistent snapshots require the ability to quiesce writes to a set of region servers before the snapshot operation is executed. Also, if any node fails, the mechanism needs to be able to notify the other nodes so that they abort.
> The first cut of other online snapshots doesn't need the full barrier but may still use this mechanism for its error propagation.
> This version removes the extra layer incurred in the previous implementation due to the use of generics, separates the coordinator and members, and reduces the amount of inheritance used in favor of composition.
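
To make the acquireBarrier / insideBarrier / releaseBarrier flow from the description concrete, here is a minimal in-process sketch. It is not the code from hbase-7212.patch. `BarrierMember` and `BarrierCoordinatorSketch` are hypothetical names, the three hook names are taken from the issue text, and threads with latches stand in for the region servers and znodes that the real mechanism would use. A shared failure slot plays the role of the cooperative error check: a member looks at it before acting inside the barrier.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical member-side hooks; names follow the issue text, not the actual patch. */
interface BarrierMember {
    void acquireBarrier() throws Exception; // e.g. quiesce writes
    void insideBarrier() throws Exception;  // e.g. do the local snapshot work
    void releaseBarrier();                  // e.g. resume writes and clean up
}

/** In-process stand-in for the coordinator (assumption: no ZooKeeper, no recovery). */
final class BarrierCoordinatorSketch {

    /** Returns true only if every member acquired, acted, and released without error. */
    static boolean run(List<BarrierMember> members, long timeoutMs) {
        CountDownLatch acquired = new CountDownLatch(members.size());
        CountDownLatch reached = new CountDownLatch(1); // the global barrier
        CountDownLatch released = new CountDownLatch(members.size());
        AtomicReference<Exception> failure = new AtomicReference<>();

        for (BarrierMember m : members) {
            Thread t = new Thread(() -> {
                try {
                    m.acquireBarrier();
                } catch (Exception e) {
                    failure.compareAndSet(null, e); // record before reporting in
                }
                acquired.countDown();
                try {
                    reached.await();
                    // Cooperative error check: skip the work if anyone failed.
                    if (failure.get() == null) {
                        m.insideBarrier();
                    }
                } catch (Exception e) {
                    failure.compareAndSet(null, e);
                } finally {
                    m.releaseBarrier();
                    released.countDown();
                }
            });
            t.setDaemon(true); // a stuck member must not keep the JVM alive
            t.start();
        }

        try {
            if (!acquired.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                failure.compareAndSet(null, new TimeoutException("acquire timed out"));
            }
            reached.countDown(); // let members proceed (or abort) together
            boolean allReleased = released.await(timeoutMs, TimeUnit.MILLISECONDS);
            return allReleased && failure.get() == null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            failure.compareAndSet(null, e);
            reached.countDown(); // unblock members so they can release
            return false;
        }
    }
}
```

In this sketch a `true` result corresponds to the point where the coordinator could rename the temp dir into place. A `false` result means the attempt is abandoned and a new one has to be issued, which matches the best-effort behavior described above. Nothing is logged or recovered.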