[ 
https://issues.apache.org/jira/browse/HDFS-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Rose updated HDFS-10597:
--------------------------------
    Description: 
If hedged reads are enabled and only a single datanode is available, the hedged 
read loop respects the ignored nodes list and never sends more than one 
request, yet it keeps retrying for quite some time while choosing a datanode.

This is unfortunate, as the ignored nodes list is only ever added to and never 
removed from within the scope of a single request. As a result, a single failed 
read either fails the entire request *or* delays the response.

There's actually a secondary undesirable behavior here too. If a hedged read 
can't find a datanode, it will delay a successful response considerably. To set 
the stage, let's say 10ms is the hedged read timeout and we only have a single 
replica available, that is, nodes=[DN1]. 

1. [0ms] {{DFSInputStream#hedgedFetchBlockByteRange}}: the first (non-hedged) 
read is sent to DN1. This read will take 50ms to succeed. 
ignoredNodes=[DN1]
2. [10ms] Poll timeout. Send a hedged request.
3. [10ms] {{DFSInputStream#chooseDataNode}} is called to find a node for the 
hedged request. As ignoredNodes includes DN1, there are no nodes available, so 
we re-query the NameNode for block locations, sleep, and try again.
4. [+3000ms] {{DFSInputStream#chooseDataNode}} is called. As ignoredNodes 
includes DN1, we re-query the NameNode for block locations, sleep, and try 
again.
5. [+3000ms+6000ms] {{DFSInputStream#chooseDataNode}} is called. As 
ignoredNodes includes DN1, we re-query the NameNode for block locations, sleep, 
and try again.
6. [+6000ms+9000ms] {{DFSInputStream#chooseDataNode}} is called. As 
ignoredNodes includes DN1, we re-query the NameNode for block locations, sleep, 
and try again.
7. [27010ms] Control flow returns to 
{{DFSInputStream#hedgedFetchBlockByteRange}}, the completion service is polled, 
and the read that succeeded at [50ms] is returned successfully, except +27000ms 
late (worst case; the expected value would be half given RNG).
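
To make the arithmetic behind the timeline concrete, here is a minimal sketch 
of the worst-case delay. It assumes each sleep is 
{{window * failures + window * (failures + 1) * r}} with {{r}} in [0, 1) and a 
3000ms window, which is how the offsets in steps 4-6 read; the class and method 
names are hypothetical and this is not the Hadoop code itself.

```java
/** Worst-case retry delay implied by the timeline (a sketch, not Hadoop code). */
class HedgedReadBackoffSketch {

    /**
     * Upper bound on the sleep after `failures` earlier failed attempts,
     * ASSUMING sleep = window * failures + window * (failures + 1) * r, r in [0, 1).
     */
    static long maxSleepMs(long windowMs, int failures) {
        long fixedPart = windowMs * failures;        // grows with every failure
        long randomPart = windowMs * (failures + 1); // upper bound of the random part
        return fixedPart + randomPart;
    }

    /** Sum of the per-round upper bounds over `retries` rounds. */
    static long worstCaseTotalMs(long windowMs, int retries) {
        long total = 0;
        for (int failures = 0; failures < retries; failures++) {
            total += maxSleepMs(windowMs, failures);
        }
        return total;
    }

    public static void main(String[] args) {
        long windowMs = 3000;      // assumed base window, matching step 4
        long hedgeTimeoutMs = 10;  // hedged read timeout from the scenario
        for (int failures = 0; failures < 3; failures++) {
            System.out.println("round " + failures + ": up to "
                + maxSleepMs(windowMs, failures) + "ms");
        }
        System.out.println("worst case: "
            + (hedgeTimeoutMs + worstCaseTotalMs(windowMs, 3)) + "ms");
    }
}
```

Under these assumptions the three rounds are bounded by 3000ms, 9000ms and 
15000ms, which together with the 10ms hedge timeout gives the 27010ms in step 7.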

This is only one scenario (a happy scenario). Supposing that the first read 
eventually fails, the DFSClient will still retry inside of 
{{DFSInputStream#hedgedFetchBlockByteRange}} for the same number of retries 
before failing.

I've identified one way to fix the behavior, but I'd be interested in thoughts:

In {{DFSInputStream#getBestNodeDNAddrPair}}, there's a check to see if a node 
is in the ignored list before allowing it to be returned. Amending this check 
to short-circuit if there's only a single available node avoids the regrettably 
useless retries, that is:

{{nodes.length == 1 || ignoredNodes == null || 
!ignoredNodes.contains(nodes[i])}}
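
As a rough illustration of how the amended check behaves, the sketch below 
applies the same condition to plain strings standing in for datanodes. The 
class name, method name and use of {{String}} are assumptions made for the 
example; the real method works on HDFS datanode types and does more than this.

```java
import java.util.Arrays;
import java.util.Collection;

/** Stand-alone model of the amended ignored-node check (not Hadoop code). */
class IgnoredNodeCheckSketch {

    /** Returns the first eligible node, or null when every node is ignored. */
    static String chooseNode(String[] nodes, Collection<String> ignoredNodes) {
        for (int i = 0; i < nodes.length; i++) {
            // Same condition as proposed above: a lone replica is always
            // eligible, even if it is already in the ignored list.
            if (nodes.length == 1 || ignoredNodes == null
                    || !ignoredNodes.contains(nodes[i])) {
                return nodes[i];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Collection<String> ignored = Arrays.asList("DN1");
        // Single replica: DN1 is returned despite being ignored.
        System.out.println(chooseNode(new String[] {"DN1"}, ignored));
        // Two replicas: the ignored DN1 is skipped in favor of DN2.
        System.out.println(chooseNode(new String[] {"DN1", "DN2"}, ignored));
    }
}
```

With more than one node the ignored list is still honored, so the 
short-circuit only changes the single-replica case.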

However, with this change, if there's only one DN available, it'll send the 
hedged request to it as well. Better behavior would be to fail hedged requests 
quickly *or* push the waiting work into the hedge pool so that successful, fast 
reads aren't blocked by this issue.
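
One way the "fail hedged requests quickly" idea could look is sketched below 
using only {{java.util.concurrent}}. Everything here is hypothetical (the class, 
{{hedgedRead}}, the string node names and the simulated latencies); it only 
shows the control flow in which a missing hedge target is skipped immediately 
instead of sleeping and retrying, and it is not a proposed patch.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Toy hedged read that never blocks while looking for a hedge target. */
class FailFastHedgeSketch {

    /** Simulated datanode read: sleeps for the node's latency, then returns. */
    static String read(String node, Map<String, Long> latencyMs)
            throws InterruptedException {
        Thread.sleep(latencyMs.get(node));
        return "data from " + node;
    }

    static String hedgedRead(List<String> nodes, Map<String, Long> latencyMs,
                             long hedgeTimeoutMs) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            CompletionService<String> reads = new ExecutorCompletionService<>(pool);
            Set<String> ignored = new HashSet<>();
            String first = nodes.get(0);
            ignored.add(first);
            reads.submit(() -> read(first, latencyMs));

            // Wait up to the hedge timeout for the first read.
            Future<String> done = reads.poll(hedgeTimeoutMs, TimeUnit.MILLISECONDS);
            if (done != null) {
                return done.get();
            }
            // Look for a hedge target once, without sleeping or retrying.
            for (String node : nodes) {
                if (!ignored.contains(node)) {
                    ignored.add(node);
                    reads.submit(() -> read(node, latencyMs));
                    break;
                }
            }
            // Hedge or no hedge, return whichever outstanding read finishes first.
            return reads.take().get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("read failed", e);
        } finally {
            pool.shutdownNow(); // cancel any read that lost the race
        }
    }

    public static void main(String[] args) {
        // Single replica, 50ms read, 10ms hedge timeout, as in the scenario above.
        System.out.println(hedgedRead(
            Collections.singletonList("DN1"),
            Collections.singletonMap("DN1", 50L), 10));
    }
}
```

In this toy version the single-replica read returns as soon as the original 
request completes, rather than after the datanode selection retries finish.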

In our situation, we run an HBase cluster with HDFS RF=2 and hedged reads 
enabled; stopping a single datanode brings the cluster to a grinding halt.

You can observe this behavior yourself by editing 
{{TestPread#testMaxOutHedgedReadPool}}'s MiniDFSCluster to have a single 
datanode.

  was:
If hedged reads are enabled, even if there is only a single datanode available, 
the hedged read loop will respect the ignored nodes list and never send more 
than one request, but retry for quite some time choosing a datanode.

This is unfortunate, as the ignored nodes list is only ever added to and never 
removed from in the scope of a single request, therefore a single failed read 
fails the entire request *or* delays responses.

There's actually a secondary undesirable behavior here too. If a hedged read 
can't find a datanode, it will delay a successful response considerably. To set 
the stage, lets say 10ms is the hedged read timeout and we only have a single 
replica available, that is, nodes=[DN1]. 

1. [0ms] {{DFSInputStream#hedgedFetchBlockByteRange}} First (not-hedged) read 
is sent to DN1. In the future, the read takes 50ms to succeed. 
ignoredNodes=[DN1]
2. [10ms] Poll timeout. Send hedged request
3. [10ms] {{DFSInputStream#chooseDataNode}} is called to find a node for the 
hedged request. As ignoredNodes includes DN1, there are no nodes available and 
we re-query the NameNode for block locations and sleep, trying again.
4. [+3000ms] {{DFSInputStream#chooseDataNode}} is called. As ignoredNodes 
includes DN1, we re-query the NameNode for block locations and sleep, trying 
again.
5. [+3000+6000ms] {{DFSInputStream#chooseDataNode}} is called. As ignoredNodes 
includes DN1, we re-query the NameNode for block locations and sleep, trying 
again.
6. [+6000ms+9000ms] {{DFSInputStream#chooseDataNode}} is called. As 
ignoredNodes includes DN1, we re-query the NameNode for block locations and 
sleep, trying again.
7. [27010ms] Control flow restored to 
{{DFSInputStream#hedgedFetchBlockByteRange}}, completion service is polled and 
read that succeeded at [50ms] returned successfully, except +27000ms extra 
(worst case, expected value would be half).

This is only one scenario (a happy scenario). Supposing that the first read 
eventually fails, the DFSClient will still retry inside of 
{{DFSInputStream#hedgedFetchBlockByteRange}} for the same retries before 
failing.

I've identified one way to fix the behavior, but I'd be interested in thoughts:

{{DFSInputStream#getBestNodeDNAddrPair}}, there's a check to see if a node is 
in the ignored list before allowing it to be returned. Amending this check to 
short-circuit if there's only a single available node avoids the regrettably 
useless retries, that is:

{{nodes.length == 1 || ignoredNodes == null || 
!ignoredNodes.contains(nodes[i])}}

However, with this change, if there's only one DN available, it'll send the 
hedged request to it as well. Better behavior would be to fail hedged requests 
quickly *or* push the waiting work into the hedge pool so that successful, fast 
reads aren't blocked by this issue.

In our situation, we run a HBase cluster with HDFS RF=2 and hedged reads 
enabled, stopping a single datanode leads to the cluster coming to a grinding 
halt.

You can observe this behavior yourself by editing 
{{TestPread#testMaxOutHedgedReadPool}}'s MiniDFSCluster to have a single 
datanode.


> DFSClient hangs if using hedged reads and all but one eligible replica is 
> down 
> -------------------------------------------------------------------------------
>
>                 Key: HDFS-10597
>                 URL: https://issues.apache.org/jira/browse/HDFS-10597
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0, 2.7.0
>            Reporter: Michael Rose
>


