[ 
https://issues.apache.org/jira/browse/HDFS-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated HDFS-6231:
--------------------------------

    Attachment: HDFS-6231.1.patch

I found this problem while investigating runs of {{TestPread}} that were hanging.  
It turns out that on most fast machines, {{TestPread}} doesn't actually end up 
triggering a hedged read.  The initial read completes before the hedged read 
threshold, so we don't bother.  On one of my slower VMs, I was seeing the test 
hang.  I was then able to repro even on my fast machines by aggressively 
down-tuning the hedged read threshold.

Here is a patch to fix the bug.
# {{DFSInputStream#getFromOneDataNode}}: This was the main problem.  The 
returned {{Callable}} needs to release a {{CountDownLatch}}, but it was doing 
so only in the success case, not in the failure case.  I changed it to release 
the latch inside a finally clause.
# {{DFSInputStream#hedgedFetchBlockByteRange}}: After I applied the first 
change, it exposed another problem here.  If all datanodes die, then we need to 
refetch block locations from the NameNode.  That wasn't happening, because this 
code used the condition {{futures == null}} to decide whether or not to refetch 
block locations via a call to {{chooseDataNode}}.  After a hedged read has been 
issued, {{futures}} is always non-null, so this wasn't sufficient.  I changed 
the code to check for empty {{futures}}.  The reason this works is that 
{{getFirstToComplete}} removes failed futures from the list.  This means that 
if all datanodes die, then {{futures}} drops back to an empty list, and then we 
go into {{chooseDataNode}} to refetch block locations.
# In {{TestPread}}, I downtuned the hedged read threshold a lot so that this 
test really does issue hedged reads even on fast machines.  That ought to help 
us catch regressions in the future.  Now that hedged reads are really happening 
during the test runs, I found that I needed to reset the metrics counts in 
order to satisfy some assertions.  This is required because the metrics 
instance is static/global.
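
To illustrate the first item above, here is a minimal, self-contained sketch of releasing a {{CountDownLatch}} inside a finally clause.  It is not the actual {{DFSInputStream}} code: the class {{LatchReleaseSketch}}, the methods {{readTask}} and {{latchReleased}}, and the simulated failure are all hypothetical names made up for this example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchReleaseSketch {
    // Hypothetical stand-in for the Callable returned by getFromOneDataNode.
    // The 'fail' flag simulates a datanode read that throws.
    static Callable<byte[]> readTask(CountDownLatch latch, boolean fail) {
        return () -> {
            try {
                if (fail) {
                    throw new IOException("simulated datanode failure");
                }
                return new byte[] {1, 2, 3};
            } finally {
                // Runs on both the success path and the failure path, so a
                // thread waiting on the latch is always woken up.
                latch.countDown();
            }
        };
    }

    // Returns true if the latch was released within the timeout.
    static boolean latchReleased(boolean fail) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch latch = new CountDownLatch(1);
            pool.submit(readTask(latch, fail));
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

If the {{countDown}} call sat only at the end of the try body, the failure case would leave the waiter blocked until the timeout, or forever with an untimed {{await}}.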

I've had multiple successful test runs of {{TestPread}} with this patch on both 
my fast Mac and my slow Windows VM.
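
For the {{futures}} check described in the second item of the list above, the difference between the two conditions can be sketched as follows.  This is only an illustration under stated assumptions: {{RefetchCheckSketch}}, {{oldCheck}}, {{newCheck}}, and {{afterAllReadsFail}} are hypothetical names, and plain strings stand in for the real future objects.

```java
import java.util.ArrayList;
import java.util.List;

class RefetchCheckSketch {
    // Old condition: only true before any hedged read has been issued.
    static boolean oldCheck(List<String> futures) {
        return futures == null;
    }

    // New condition: true whenever no read is still in flight.
    static boolean newCheck(List<String> futures) {
        return futures.isEmpty();
    }

    // Simulates the state after every in-flight hedged read has failed.
    static List<String> afterAllReadsFail() {
        List<String> futures = new ArrayList<>();
        futures.add("read-from-dn1");
        futures.add("read-from-dn2");
        // Stand-in for getFirstToComplete removing each failed future.
        futures.remove("read-from-dn1");
        futures.remove("read-from-dn2");
        return futures;
    }
}
```

With the list from {{afterAllReadsFail}}, {{oldCheck}} returns false, so block locations are never refetched, while {{newCheck}} returns true and the retry path is taken.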

> DFSClient hangs infinitely if using hedged reads and all eligible datanodes 
> die.
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-6231
>                 URL: https://issues.apache.org/jira/browse/HDFS-6231
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HDFS-6231.1.patch
>
>
> When using hedged reads, and all eligible datanodes for the read get flagged 
> as dead or ignored, then the client is supposed to refetch block locations 
> from the NameNode to retry the read.  Instead, we've seen that the client can 
> hang indefinitely.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
