[ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611452#comment-14611452
 ] 

Matteo Bertozzi commented on HBASE-13832:
-----------------------------------------

Rethrowing the exception is fine with me, but it is not equivalent to the 
while (isRunning()) loop.

In one of your comments above you said: "I think there is no guarantee that when 
master.abort() returns, the WALProcedureStore may still be not stopped".
That is true in general, even though we know that currently running is at 
least set to false immediately.
The while (isRunning()) loop was just spinning until running was set to false, 
in case the abort stops being "sync" as it is now.

To get the same behavior, you need to force running to false when you set 
syncException, so that no other procedure can be added.
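
A minimal sketch of what I mean, not the actual patch. SketchProcedureStore, 
onSyncFailure() and insert() are hypothetical names used only for 
illustration; only running, syncException and isRunning() come from the 
discussion above. The point is that the failure path flips running itself 
instead of relying on abort() being synchronous.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the store; this is NOT the real WALProcedureStore API.
class SketchProcedureStore {
  private final AtomicBoolean running = new AtomicBoolean(false);
  private final AtomicReference<Throwable> syncException = new AtomicReference<>();
  private final List<Long> pending = new ArrayList<>();

  void start() {
    running.set(true);
  }

  boolean isRunning() {
    return running.get();
  }

  // Called by the (imagined) sync thread when a sync fails.
  // Record the first failure, then force running to false right away,
  // so writers are rejected even if the abort completes later.
  void onSyncFailure(Throwable t) {
    syncException.compareAndSet(null, t);
    running.set(false);
  }

  // Writers check running before adding anything to the store.
  synchronized void insert(long procId) {
    if (!isRunning()) {
      throw new IllegalStateException("store is not running", syncException.get());
    }
    pending.add(procId);
  }

  synchronized int pendingCount() {
    return pending.size();
  }

  public static void main(String[] args) {
    SketchProcedureStore store = new SketchProcedureStore();
    store.start();
    store.insert(1L);
    store.onSyncFailure(new IOException("simulated sync failure"));
    try {
      store.insert(2L);
    } catch (IllegalStateException e) {
      System.out.println("rejected after sync failure: " + e.getCause().getMessage());
    }
    System.out.println("pending procedures: " + store.pendingCount());
  }
}
```

With this shape nobody has to spin on isRunning() after calling abort: once 
syncException is set, running is already false and any later insert fails fast.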

> Procedure V2: master fail to start due to WALProcedureStore sync failures 
> when HDFS data nodes count is low
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13832
>                 URL: https://issues.apache.org/jira/browse/HBASE-13832
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, proc-v2
>    Affects Versions: 2.0.0, 1.1.0, 1.2.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Matteo Bertozzi
>            Priority: Critical
>             Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1
>
>         Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
> HBASE-13832-v2.patch, HDFSPipeline.java, hbase-13832-test-hang.patch, 
> hbase-13832-v3.patch
>
>
> When the number of data nodes is < 3, we got a failure in 
> WALProcedureStore#syncLoop() during master start. The failure prevents the 
> master from starting.
> {noformat}
> 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort.
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
> {noformat}
> One proposal is to implement logic similar to FSHLog: if an IOException is 
> thrown during syncLoop in WALProcedureStore#start(), instead of aborting 
> immediately, we could try to roll the log and see whether this resolves the 
> issue; if the new log cannot be created, or rolling the log throws more 
> exceptions, we then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
