[ https://issues.apache.org/jira/browse/HDFS-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HDFS-6231:
--------------------------------
    Attachment: HDFS-6231.1.patch

I found this problem while observing runs of {{TestPread}} that were hanging. It turns out that on most fast machines, {{TestPread}} never actually triggers a hedged read: the initial read completes before the hedged read threshold, so no hedged read is issued. On one of my slower VMs, I was seeing the test hang. I was then able to repro even on my fast machines by aggressively down-tuning the hedged read threshold.

Here is a patch to fix the bug.
# {{DFSInputStream#getFromOneDataNode}}: This was the main problem. The returned {{Callable}} needs to release a {{CountDownLatch}}, but it was doing so only in the success case, not in the failure case. I changed it to release the latch inside a finally clause.
# {{DFSInputStream#hedgedFetchBlockByteRange}}: Applying the first change exposed another problem here. If all datanodes die, then we need to refetch block locations from the NameNode. That wasn't happening, because this code used the condition {{futures == null}} to decide whether or not to refetch block locations via a call to {{chooseDataNode}}. After a hedged read has been issued, {{futures}} is always non-null, so this check wasn't sufficient. I changed the code to check for empty {{futures}}. This works because {{getFirstToComplete}} removes failed futures from the list. If all datanodes die, {{futures}} drops back to an empty list, and then we go into {{chooseDataNode}} to refetch block locations.
# In {{TestPread}}, I down-tuned the hedged read threshold a lot so that this test really does issue hedged reads even on fast machines. That ought to help us catch regressions in the future. Now that hedged reads are really happening during the test runs, I found that I needed to reset the metrics counts in order to satisfy some assertions. This is required because the metrics instance is static/global.
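
To illustrate the first change, here is a minimal standalone sketch of the latch-in-finally pattern. It is not the code from the patch: the class {{LatchReleaseSketch}}, the methods {{readTask}} and {{latchReleased}}, and the {{releaseInFinally}} flag are all hypothetical names made up for this example, and the datanode failure is simulated with a thrown {{IOException}}.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class LatchReleaseSketch {

    // Hypothetical stand-in for the Callable returned by getFromOneDataNode.
    // releaseInFinally == false mimics the old behavior (latch released on
    // success only); true mimics the fixed behavior.
    static Callable<Integer> readTask(boolean fail, CountDownLatch latch,
                                      boolean releaseInFinally) {
        return () -> {
            try {
                if (fail) {
                    throw new IOException("simulated datanode failure");
                }
                if (!releaseInFinally) {
                    latch.countDown(); // old behavior: success path only
                }
                return 42; // placeholder for the number of bytes read
            } finally {
                if (releaseInFinally) {
                    latch.countDown(); // fixed behavior: success and failure
                }
            }
        };
    }

    // Submits one read task and reports whether the latch was released
    // before the timeout. A real caller without a timeout would hang instead.
    static boolean latchReleased(boolean fail, boolean releaseInFinally,
                                 long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch latch = new CountDownLatch(1);
        try {
            pool.submit(readTask(fail, latch, releaseInFinally));
            return latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("failure, release in finally: "
            + latchReleased(true, true, 2000));
        System.out.println("failure, release on success only: "
            + latchReleased(true, false, 200));
    }
}
```

In this sketch, a failing task that counts down only on the success path never releases the latch, so the waiter sees a timeout. That is the bounded analogue of the hang described above.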

I've had multiple successful test runs of {{TestPread}} with this patch on both my fast Mac and my slow Windows VM.

> DFSClient hangs infinitely if using hedged reads and all eligible datanodes die.
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-6231
>                 URL: https://issues.apache.org/jira/browse/HDFS-6231
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 3.0.0, 2.4.0
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HDFS-6231.1.patch
>
>
> When using hedged reads, and all eligible datanodes for the read get flagged
> as dead or ignored, then the client is supposed to refetch block locations
> from the NameNode to retry the read. Instead, we've seen that the client can
> hang indefinitely.

--
This message was sent by Atlassian JIRA
(v6.2#6252)