[ 
https://issues.apache.org/jira/browse/HADOOP-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639965#action_12639965
 ] 

Sameer Paranjpye commented on HADOOP-4278:
------------------------------------------

@Dhruba: I agree that this is not a blocker for 0.19. Out-of-phase thread 
deaths typically don't occur in real deployments, and we haven't yet observed 
this condition occurring frequently on our grids.

However, I think there are real deficiencies in error recovery for HDFS writes:
# the client does not correctly detect which link in the write pipeline failed
# the client tries to initiate block recovery from the dead Datanode, fails to 
do so, and causes the write to fail. This is mostly due to item 1, but it can 
also occur if the recovery primary fails following a link failure.

Ideally, a writer should fail only if
# the writer itself dies for some reason
# the writer loses all its replicas
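
The rule above can be sketched as follows. This is only an illustration under 
stated assumptions, not the actual HDFS client code: the class 
PipelineRecoverySketch, the method recover, and the tryRecoverFrom callback 
are hypothetical names, and Datanodes are modeled as plain strings. It assumes 
the client already knows which pipeline link failed (deficiency 1), picks the 
recovery primary only from surviving nodes, and moves on to the next survivor 
if the primary also fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the desired write-recovery rule; not DFSClient code. */
final class PipelineRecoverySketch {

    /**
     * Drops the failed Datanode from the pipeline, then tries each survivor in
     * turn as the recovery primary. The write fails only when no replica is left.
     *
     * @param pipeline       Datanodes in the write pipeline (modeled as names)
     * @param failedIndex    index of the link that failed (assumed correctly detected)
     * @param tryRecoverFrom hypothetical callback: true if block recovery
     *                       driven by the given primary succeeded
     * @return the Datanodes that remain in the pipeline after recovery
     */
    static List<String> recover(List<String> pipeline, int failedIndex,
                                Predicate<String> tryRecoverFrom) {
        List<String> survivors = new ArrayList<>(pipeline);
        survivors.remove(failedIndex); // never use the dead node as primary

        while (!survivors.isEmpty()) {
            String primary = survivors.get(0);
            if (tryRecoverFrom.test(primary)) {
                return survivors; // recovery succeeded; the write continues
            }
            survivors.remove(0); // the primary failed too; try the next survivor
        }
        throw new IllegalStateException("writer lost all its replicas");
    }
}
```

With a three-node pipeline where the middle node dies and the first survivor 
then fails as recovery primary, this sketch continues the write on the one 
remaining node instead of failing it.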

This should be the subject of a different JIRA but I think we should spend some 
energy making it happen. For this issue, the best course might be to disable 
testSimple until we have a complete recovery story.


> TestDatanodeDeath failed occasionally
> -------------------------------------
>
>                 Key: HADOOP-4278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4278
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: dhruba borthakur
>            Priority: Blocker
>             Fix For: 0.19.0
>
>
> TestDatanodeDeath keeps failing occasionally.  For example, see
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3365/testReport/

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
