[ https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612713#comment-14612713 ]

Enis Soztutar commented on HBASE-13832:
---------------------------------------

bq. we know that we must die there. why not exit from that loop?
We can call {{stop(true)}} directly from the sync loop; that is fine. It was not 
in your original patch, which is why I did not change it. 
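
Roughly, the sync loop could just call {{stop(true)}} and break out when a sync slot fails. The sketch below is only an illustration of that idea; the names loosely follow WALProcedureStore and it is not code from any of the attached patches:

{code:java}
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustration only: exiting the sync loop by calling stop(true) when a
// sync slot fails. Names loosely follow WALProcedureStore; this is not
// code from any of the attached patches.
class SyncLoopSketch {
  private final AtomicBoolean running = new AtomicBoolean(true);

  void syncLoop() {
    while (running.get()) {
      try {
        syncSlots();            // flush the pending sync slots
      } catch (IOException e) {
        // We know we must die here: abort the store and leave the loop
        // instead of waiting for 'running' to flip on the next iteration.
        stop(true);
        break;
      }
    }
  }

  void syncSlots() throws IOException { /* write and hflush the slots */ }

  void stop(boolean abort) {
    running.set(false);
    // on abort, the real store also notifies its ProcedureStoreListeners
  }
}
{code}
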
bq. with the actual implementation of abort we know that running will be false 
after a sendAbortProcessSignal() but that may not be the case in the future
The store can abort the whole procedure executor or the master itself. Right 
now, it does this through the ProcedureStoreListener calls. I'm fine with 
passing an {{Abortable}} directly to the store. These parts mostly carry over 
from the initial proc v2 patch. 
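
If we go that way, the store would just hold an {{Abortable}} and call it on a sync failure. A rough sketch of the idea, not code from the attached patches:

{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.Abortable;

// Rough sketch: hand an Abortable (e.g. the master) to the store and let it
// decide how to die, instead of fanning out through
// ProcedureStoreListener.abortProcess(). Not code from the attached patches.
class AbortableStoreSketch {
  private final Abortable abortable;

  AbortableStoreSketch(Abortable abortable) {
    this.abortable = abortable;
  }

  void onSyncFailure(IOException cause) {
    abortable.abort("WALProcedureStore sync failed", cause);
  }
}
{code}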

Does the test rely on 1s / 2s timing? It may end up being flaky on slow Jenkins 
hosts. Other than that, +1 for the v4 patch. If you want to do the abort 
changes, we can do them here or in a follow-up. 
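
On the timing concern: instead of fixed 1s / 2s sleeps, the test could poll the condition against a generous deadline. Something along these lines, as a generic helper not tied to the test in the patch:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Generic polling helper: wait for a condition up to a generous deadline
// instead of sleeping a fixed 1s/2s, which can flake on slow Jenkins hosts.
final class WaitUtil {
  static boolean waitFor(BooleanSupplier condition, long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (System.nanoTime() < deadline) {
      if (condition.getAsBoolean()) {
        return true;
      }
      Thread.sleep(50);   // short poll keeps the happy path fast
    }
    return condition.getAsBoolean();
  }
}
{code}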

> Procedure V2: master fail to start due to WALProcedureStore sync failures 
> when HDFS data nodes count is low
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13832
>                 URL: https://issues.apache.org/jira/browse/HBASE-13832
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, proc-v2
>    Affects Versions: 2.0.0, 1.1.0, 1.2.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Matteo Bertozzi
>            Priority: Critical
>             Fix For: 2.0.0, 1.1.2, 1.3.0, 1.2.1
>
>         Attachments: HBASE-13832-v0.patch, HBASE-13832-v1.patch, 
> HBASE-13832-v2.patch, HBASE-13832-v4.patch, HDFSPipeline.java, 
> hbase-13832-test-hang.patch, hbase-13832-v3.patch
>
>
> When the datanode count is below 3, we get a failure in WALProcedureStore#syncLoop() 
> during master start. The failure prevents the master from starting.
> {noformat}
> 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort.
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
> {noformat}
> One proposal is to implement logic similar to FSHLog's: if an IOException is 
> thrown during the sync loop in WALProcedureStore#start(), instead of aborting 
> immediately, we could try to roll the log and see whether that resolves the 
> issue; if the new log cannot be created, or rolling the log throws further 
> exceptions, we then abort.
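
As a rough illustration of the proposal above, the failure handling could look like the sketch below; rollWriter() and abort() here are placeholders, not the real WALProcedureStore API:

{code:java}
import java.io.IOException;

// Sketch of the proposed recovery path: on a sync failure, try to roll the
// log once (as FSHLog does) and only abort if the roll itself fails.
// rollWriter() and abort() are placeholders, not the real store API.
class SyncRecoverySketch {
  void handleSyncFailure(IOException cause) {
    try {
      rollWriter();   // close the old log and open a new one on a fresh pipeline
    } catch (IOException rollFailed) {
      abort("sync failed and the log could not be rolled", rollFailed);
    }
  }

  void rollWriter() throws IOException { /* create the new log file */ }

  void abort(String why, Throwable cause) { /* signal the master to shut down */ }
}
{code}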



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
