[ 
https://issues.apache.org/jira/browse/HBASE-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14537462#comment-14537462
 ] 

Jerry He commented on HBASE-13217:
----------------------------------

Hi, [~syuanjiang]

Thanks for your review and comment.

I think the original intention of controllerConnectionFailure() matches what we 
are doing.
{code}
  /**
   * The connection to the rest of the procedure group (member and coordinator) has been
   * broken/lost/failed. This should fail any interested subprocedure, but not attempt to notify
   * other members since we cannot reach them anymore.
{code}
Most of the sub-exceptions of KeeperException are more serious than 
NoNodeException, e.g. connection, ACL, timeout, etc.
In these cases, we have an even stronger reason not to attempt to notify the 
master or other members via ZK abort.

bq. snapshot procedure might give us incomplete snapshot
Can you explain?
We are not ignoring the KeeperException. We still abort locally. We only skip 
proactively trying to notify the master via ZK abort.
We have seen that the proactive notification causes a mess.
If this procedure member is needed in the snapshot, the entire procedure will 
still fail. The master coordinator is the overall judge and coordinator.
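
To make the behavior described above concrete, here is a minimal, self-contained sketch. It is not the patch and does not use the HBase or ZooKeeper classes. CoordinatorChannel, Member and BrokenChannel are hypothetical stand-ins with simplified signatures; only the idea (cancel locally on a connection failure, do not try to send an abort over the broken channel) is taken from this comment.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LocalAbortSketch {

    /** Hypothetical stand-in for the ZK-backed channel to the coordinator. */
    interface CoordinatorChannel {
        void sendMemberAcquired(String procName) throws IOException;
        void sendMemberAborted(String procName, Throwable cause) throws IOException;
    }

    /** Hypothetical procedure member that tracks which subprocedures it cancelled. */
    static class Member {
        final CoordinatorChannel channel;
        final List<String> locallyCancelled = new ArrayList<>();

        Member(CoordinatorChannel channel) {
            this.channel = channel;
        }

        /** Try to report "acquired"; on failure, treat the channel as broken. */
        void acquire(String procName) {
            try {
                channel.sendMemberAcquired(procName);
            } catch (IOException e) {
                controllerConnectionFailure(procName, e);
            }
        }

        /**
         * The channel is considered unusable: fail the local subprocedure only.
         * Deliberately no call to channel.sendMemberAborted(...) here.
         */
        void controllerConnectionFailure(String procName, Throwable cause) {
            locallyCancelled.add(procName);
        }
    }

    /** Channel whose "acquired" write always fails; counts abort notifications. */
    static class BrokenChannel implements CoordinatorChannel {
        int abortNotifications = 0;

        @Override
        public void sendMemberAcquired(String procName) throws IOException {
            throw new IOException("simulated ZK failure for " + procName);
        }

        @Override
        public void sendMemberAborted(String procName, Throwable cause) {
            abortNotifications++;
        }
    }

    public static void main(String[] args) {
        BrokenChannel channel = new BrokenChannel();
        Member member = new Member(channel);
        member.acquire("flush-TestTable");
        System.out.println("cancelled locally: " + member.locallyCancelled);
        System.out.println("abort notifications sent: " + channel.abortNotifications);
    }
}
```

In this sketch the member records the local cancellation and sends nothing over the failed channel. Deciding the fate of the overall procedure is left to the coordinator, which is not modeled here.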

> Procedure fails due to ZK issue
> -------------------------------
>
>                 Key: HBASE-13217
>                 URL: https://issues.apache.org/jira/browse/HBASE-13217
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.0.1, 1.1.0, 0.98.12
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: Stephen Yuan Jiang
>         Attachments: HBASE-13217-v2.patch, HBASE-13217.patch
>
>
> Whenever I try to flush explicitly in the trunk code, the flush procedure 
> fails due to a ZK issue
> {code}
> ERROR: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via stobdtserver3,16040,1426172670959:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/flush-table-proc/acquired/TestTable/stobdtserver3,16040,1426172670959
>         at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>         at org.apache.hadoop.hbase.procedure.Procedure.isCompleted(Procedure.java:368)
>         at org.apache.hadoop.hbase.procedure.flush.MasterFlushTableProcedureManager.isProcedureDone(MasterFlushTableProcedureManager.java:196)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.isProcedureDone(MasterRpcServices.java:905)
>         at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:47019)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2073)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/flush-table-proc/acquired/TestTable/stobdtserver3,16040,1426172670959
>         at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:273)
>         at org.apache.hadoop.hbase.procedure.ProcedureMember.controllerConnectionFailure(ProcedureMember.java:225)
>         at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberAcquired(ZKProcedureMemberRpcs.java:254)
>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:166)
>         at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         ... 1 more
> {code}
> Once this occurs, the RS becomes unusable, even after restarting it. I have 
> verified that ZK remains intact and there is no problem with it. A somewhat 
> older version of trunk (3 months) does not have this problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
