[ 
https://issues.apache.org/jira/browse/HBASE-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509344#comment-13509344
 ] 

stack commented on HBASE-7212:
------------------------------

On curator double-barrier, it would seem there is no 'abort' as you say.  They 
do have timeouts on barrier enter and leave.  Would that be enough?  See 
http://www.jarvana.com/jarvana/view/com/netflix/curator/curator-recipes/0.6.4/curator-recipes-0.6.4-javadoc.jar!/com/netflix/curator/framework/recipes/barriers/DistributedDoubleBarrier.html#leave()

Rather than 'abort', you could just timeout?  That might be simpler still?  
i.e. your "Need to be able to force failure after a specified 
timeout elapses"
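
To make the "timeout instead of abort" idea concrete, here is a minimal 
in-process sketch.  It is not curator code and it does not talk to zk: 
TimedDoubleBarrier is a hypothetical class built on java.util.concurrent, 
loosely modeled on the enter/leave-with-timeout shape of the double-barrier 
recipe.  A false return is treated as the forced failure.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical in-process stand-in for a double barrier with timeouts. */
class TimedDoubleBarrier {
    private final CountDownLatch entered;
    private final CountDownLatch left;

    TimedDoubleBarrier(int memberCount) {
        this.entered = new CountDownLatch(memberCount);
        this.left = new CountDownLatch(memberCount);
    }

    /**
     * Each member calls this exactly once. Returns true if all members
     * entered before the timeout elapsed, false otherwise.
     */
    boolean enter(long maxWait, TimeUnit unit) throws InterruptedException {
        entered.countDown();
        return entered.await(maxWait, unit);
    }

    /**
     * Each member calls this exactly once. Returns true if all members
     * left before the timeout elapsed, false otherwise.
     */
    boolean leave(long maxWait, TimeUnit unit) throws InterruptedException {
        left.countDown();
        return left.await(maxWait, unit);
    }
}
```

A member would treat enter(...) == false as "this snapshot failed" and skip 
its part of the work, so no separate abort message is needed in this sketch.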

double-barrier does not seem to be enough though. There needs to be a means of 
telling cluster members to go for a particular snapshot barrier.  To this end, 
I suppose all members need to be watching a snapshot dir and when a new 
snapshot appears, all try to 'enter' its barrier?
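
A rough sketch of that "watch a snapshot dir" idea, again in-process only.  
SnapshotDirectory and its methods are hypothetical names standing in for a 
watched zk directory; the point is just that announcing a new snapshot is what 
tells every member which barrier to enter.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for a watched snapshot directory. */
class SnapshotDirectory {
    private final List<Consumer<String>> watchers = new CopyOnWriteArrayList<>();
    private final Map<String, AtomicInteger> entered = new ConcurrentHashMap<>();

    /** A member registers interest in new snapshots. */
    void watch(Consumer<String> onNewSnapshot) {
        watchers.add(onNewSnapshot);
    }

    /** The coordinator announces a snapshot; every watcher is notified. */
    void announce(String snapshotName) {
        entered.putIfAbsent(snapshotName, new AtomicInteger());
        for (Consumer<String> watcher : watchers) {
            watcher.accept(snapshotName);
        }
    }

    /** A member records that it has entered the barrier for this snapshot. */
    void enter(String snapshotName) {
        entered.computeIfAbsent(snapshotName, k -> new AtomicInteger()).incrementAndGet();
    }

    /** Number of members that have entered the barrier for this snapshot. */
    int enteredCount(String snapshotName) {
        AtomicInteger count = entered.get(snapshotName);
        return count == null ? 0 : count.get();
    }
}
```

For example, three members that each call dir.watch(name -> dir.enter(name)) 
would all enter the barrier for "snapshot-1" once the coordinator calls 
dir.announce("snapshot-1").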

bq. Yes -- reached is sent when the coordinator figures out that it has 
"reached" the global barrier point because all members have taken their part of 
the global barrier.

Is it true that you do not want members to start 'snapshotting' until ALL 
participants have 'entered' the barrier?  Does it matter if they start doing 
their work as soon as they 'enter' the barrier (using curator/zk recipe terms)?  
Reading on, it seems like it's fine if members just go about their merry 
way, working on their part of the snapshot.  If not all members complete, the 
coordinator will clean up whatever is incomplete.

What do you think of the terms in the zk recipe: i.e. rather than 'reach' a 
barrier, 'enter' it?

Some of the answers you give above should go into the doc for this feature.  
They are quality.

I buy your argument for going w/ the more basic barrier rather than a 2pc 
function for snapshots.  (Yeah, 2pc would be useful for other distributed ops 
like table enable/disable w/ us 'failing forward' an interrupted table enable 
or disable.)

On 'Comms', it was just unclear to me what it was.  Makes sense now.

                
> Globally Barriered Procedure mechanism
> --------------------------------------
>
>                 Key: HBASE-7212
>                 URL: https://issues.apache.org/jira/browse/HBASE-7212
>             Project: HBase
>          Issue Type: Sub-task
>          Components: snapshots
>    Affects Versions: hbase-6055
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>             Fix For: hbase-6055
>
>         Attachments: 121127-global-barrier-proc.pdf, hbase-7212.patch, 
> pre-hbase-7212.patch
>
>
> This is a simplified version of what was proposed in HBASE-6573.  Instead of 
> claiming to be a 2pc or 3pc implementation (which implies logging at each 
> actor, and recovery operations) this just provides a best effort global 
> barrier mechanism called a Procedure.  
> Users need only to implement methods to acquireBarrier, to act when 
> insideBarrier, and to releaseBarrier that use the ExternalException 
> cooperative error checking mechanism.
> Globally consistent snapshots require the ability to quiesce writes to a set 
> of region servers before the snapshot operation is executed.  Also if any 
> node fails, it needs to be able to notify them so that they abort.
> The first cut of other online snapshots don't need the full barrier but may 
> still use this for its error propagation mechanisms.
> This version removes the extra layer incurred in the previous implementation 
> due to the use of generics, separates the coordinator and members, and 
> reduces the amount of inheritance used in favor of composition.
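
For readers of the description above, a minimal single-process sketch of the 
three hooks it names.  The method names acquireBarrier, insideBarrier and 
releaseBarrier come from the description; BarrierMember, BarrierProcedure, 
fail and checkFailure are hypothetical names, plain Exception stands in for 
the error type, and nothing here is taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical member-side hooks, named after the issue description. */
interface BarrierMember {
    void acquireBarrier() throws Exception;
    void insideBarrier() throws Exception;
    void releaseBarrier();
}

/** Hypothetical best-effort coordinator: no logging, no recovery. */
class BarrierProcedure {
    private final AtomicReference<Exception> failure = new AtomicReference<>();

    /** Any participant may report a failure; only the first one is kept. */
    void fail(Exception e) {
        failure.compareAndSet(null, e);
    }

    /** Cooperative error check: rethrow the first reported failure, if any. */
    void checkFailure() throws Exception {
        Exception e = failure.get();
        if (e != null) {
            throw e;
        }
    }

    /** Returns true if every member acquired and ran inside the barrier. */
    boolean run(List<BarrierMember> members) {
        List<BarrierMember> acquired = new ArrayList<>();
        try {
            for (BarrierMember m : members) {
                checkFailure();
                m.acquireBarrier();
                acquired.add(m);
            }
            // Every member has acquired, so the global barrier is reached.
            for (BarrierMember m : acquired) {
                checkFailure();
                m.insideBarrier();
            }
            return true;
        } catch (Exception e) {
            fail(e);
            return false;
        } finally {
            // Release every member that acquired, on success or failure.
            for (BarrierMember m : acquired) {
                m.releaseBarrier();
            }
        }
    }
}
```

The sketch runs members one after another in a single thread; a real 
deployment would have members on separate servers, which is where the 
cooperative checkFailure() calls would matter.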
