[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

Matteo Bertozzi (JIRA) Thu, 02 Jul 2015 17:57:41 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612722#comment-14612722
 ]


Matteo Bertozzi commented on HBASE-13832:
-----------------------------------------

Calling stop() directly was not what I was proposing. What I was suggesting 
was just exiting from syncLoop(). Before, with the while (isRunning()), we 
were spinning after the signal, which made it clear that there would be no 
other run of the syncLoop(). In this case we may do another round of the loop 
and execute stuff, which in theory is not what you expect after sending the 
abort signal.
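
To make the point concrete, here is a rough, single-threaded sketch of the 
two behaviors. It is not the WALProcedureStore code: the class name, the 
method, and the failAtRound/stopDelayRounds parameters are all made up for 
illustration, and the "abort signal" is just a boolean flag.

```java
public final class SyncLoopSketch {
    /**
     * Simulates a sync loop and returns how many rounds ran AFTER the
     * abort signal was sent. Everything here is hypothetical.
     *
     * @param exitOnFailure   true: leave the loop right after signalling abort
     * @param failAtRound     round (1-based) at which the simulated sync fails
     * @param stopDelayRounds rounds it takes for the abort handler to stop
     *                        the loop from outside (expected to be >= 1)
     */
    static int roundsAfterAbort(boolean exitOnFailure, int failAtRound,
                                int stopDelayRounds) {
        boolean running = true;     // stand-in for isRunning()
        boolean abortSent = false;  // stand-in for the abort signal
        int extraRounds = 0;
        int round = 0;
        while (running) {
            round++;
            if (abortSent) {
                // Work executed after the abort signal: the unexpected part.
                extraRounds++;
                if (extraRounds >= stopDelayRounds) {
                    running = false; // the abort handler finally stops us
                }
                continue;
            }
            boolean syncOk = round != failAtRound; // simulated sync result
            if (!syncOk) {
                abortSent = true;
                if (exitOnFailure) {
                    return extraRounds; // just exit from the loop
                }
            }
        }
        return extraRounds;
    }

    public static void main(String[] args) {
        System.out.println("exit on failure:  " + roundsAfterAbort(true, 3, 2));
        System.out.println("keep looping:     " + roundsAfterAbort(false, 3, 2));
    }
}
```

With the early exit, no round runs after the signal. Without it, the loop 
keeps going until something else flips the running state.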

The test does not rely on the 1s/2s timing; it passes even without it. I was 
just trying to make the problem clearer to someone looking at the code.

> Procedure V2: master fail to start due to WALProcedureStore sync failures 
> when HDFS data nodes count is low
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13832
>                 URL: https://issues.apache.org/jira/browse/HBASE-13832
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, proc-v2
>    Affects Versions: 2.0.0, 1.1.0, 1.2.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Matteo Bertozzi
>            Priority: Critical
>             Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1
>
>         Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
> HBASE-13832-v2.patch, HBASE-13832-v4.patch, HDFSPipeline.java, 
> hbase-13832-test-hang.patch, hbase-13832-v3.patch
>
>
> When the number of data nodes is < 3, we get a failure in 
> WALProcedureStore#syncLoop() during master start. The failure prevents the 
> master from starting.
> {noformat}
> 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort.
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
> {noformat}
> One proposal is to implement logic similar to FSHLog: if an IOException is 
> thrown during syncLoop in WALProcedureStore#start(), instead of aborting 
> immediately, we could try to roll the log and see whether this resolves the 
> issue; if the new log cannot be created, or rolling the log throws further 
> exceptions, we then abort.
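
For reference, the roll-then-abort proposal quoted above could look roughly 
like the sketch below. This is only an illustration under assumptions: the 
LogStore interface, FakeStore, and syncOrRoll() are hypothetical names and 
are not the WALProcedureStore or FSHLog API, and retrying the sync once after 
a successful roll is my reading of "see whether this resolves the issue".

```java
import java.io.IOException;

public final class RollOnSyncFailureSketch {
    /** Hypothetical minimal view of a log-backed store. */
    interface LogStore {
        void sync() throws IOException;
        void rollLog() throws IOException;
    }

    /**
     * Tries to sync; on failure rolls the log and retries once.
     * Returns false when the caller should abort.
     */
    static boolean syncOrRoll(LogStore store) {
        try {
            store.sync();
            return true;
        } catch (IOException syncFailure) {
            try {
                store.rollLog(); // a new log may get a healthy pipeline
                store.sync();    // retry on the new log
                return true;
            } catch (IOException rollOrRetryFailure) {
                return false;    // rolling did not help: abort
            }
        }
    }

    /** In-memory fake used only to exercise the sketch. */
    static final class FakeStore implements LogStore {
        private int syncFailuresLeft;
        private final boolean rollFails;
        int rolls = 0;

        FakeStore(int syncFailuresLeft, boolean rollFails) {
            this.syncFailuresLeft = syncFailuresLeft;
            this.rollFails = rollFails;
        }

        @Override
        public void sync() throws IOException {
            if (syncFailuresLeft > 0) {
                syncFailuresLeft--;
                throw new IOException("simulated sync failure");
            }
        }

        @Override
        public void rollLog() throws IOException {
            if (rollFails) {
                throw new IOException("simulated roll failure");
            }
            rolls++;
        }
    }

    public static void main(String[] args) {
        System.out.println("recovers after roll: " + syncOrRoll(new FakeStore(1, false)));
        System.out.println("roll fails, abort:   " + !syncOrRoll(new FakeStore(1, true)));
    }
}
```

The caller would send the abort signal only when syncOrRoll() returns false, 
so a single bad pipeline does not take the master down by itself.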



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
