[ 
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452375#comment-17452375
 ] 

Yuanxin Zhu commented on HDFS-16293:
------------------------------------

[~tasanuma]  Thanks for your review.
Without the DataStreamer fix, the DataStreamer sleeps after executing 
"congestedNodes.clear()" but does not release "dataQueue". The congestedNodes 
thread needs "dataQueue" to execute 
"congestedNodes.add(mock(DatanodeInfo.class))". Once the DataStreamer releases 
"dataQueue", the congestedNodes thread executes once and then sleeps again, 
so it seems there will only ever be one congested node, much like a serial 
process.
With the DataStreamer fix, the DataStreamer releases "dataQueue" before it 
sleeps (for 50s at most), so the congestedNodes thread has time to execute 
multiple times and the verification succeeds.
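
The effect described above can be reproduced outside HDFS with a small
standalone sketch. It is not the HDFS test itself: the class name
LockedSleepDemo, the method addsDuringBackOff, the 300 ms back-off and the
plain Object elements are all assumptions made for illustration. One thread
plays the DataStreamer (clear, then sleep), the other plays the congestedNodes
thread (add under the same lock), and the method counts how many adds complete
before the back-off ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class LockedSleepDemo {
    static final int ATTEMPTS = 5;        // adds the producer tries to perform
    static final long BACKOFF_MS = 300;   // illustrative back-off, not the HDFS value

    /** Returns how many adds completed before the streamer's back-off ended. */
    static int addsDuringBackOff(boolean holdLockWhileSleeping) {
        final List<Object> congestedNodes = new ArrayList<>();
        final Object dataQueue = new Object();       // stands in for the real queue lock
        final CountDownLatch backOffStarted = new CountDownLatch(1);
        final boolean[] backOffDone = {false};       // guarded by dataQueue
        final int[] added = {0};                     // guarded by dataQueue

        Thread streamer = new Thread(() -> {
            try {
                if (holdLockWhileSleeping) {
                    // Unfixed behaviour: sleep while still holding the lock.
                    synchronized (dataQueue) {
                        congestedNodes.clear();
                        backOffStarted.countDown();
                        Thread.sleep(BACKOFF_MS);
                        backOffDone[0] = true;
                    }
                } else {
                    // Fixed behaviour: release the lock, then sleep.
                    synchronized (dataQueue) {
                        congestedNodes.clear();
                    }
                    backOffStarted.countDown();
                    Thread.sleep(BACKOFF_MS);
                    synchronized (dataQueue) {
                        backOffDone[0] = true;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread producer = new Thread(() -> {
            try {
                backOffStarted.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            for (int i = 0; i < ATTEMPTS; i++) {
                synchronized (dataQueue) {
                    congestedNodes.add(new Object());
                    if (!backOffDone[0]) {
                        added[0]++;   // this add happened during the back-off
                    }
                }
            }
        });

        streamer.start();
        producer.start();
        try {
            streamer.join();
            producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return added[0];
    }

    public static void main(String[] args) {
        System.out.println("sleep inside lock:  " + addsDuringBackOff(true));
        System.out.println("sleep outside lock: " + addsDuringBackOff(false));
    }
}
```

When the sleep happens inside the synchronized block, the producer cannot
acquire the lock until the back-off is over, so no add completes during it.
When the lock is released first, the producer normally finishes all of its
adds while the streamer is still sleeping.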

> Client sleeps and holds 'dataQueue' when DataNodes are congested
> ----------------------------------------------------------------
>
>                 Key: HDFS-16293
>                 URL: https://issues.apache.org/jira/browse/HDFS-16293
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 3.2.2, 3.3.1, 3.2.3
>            Reporter: Yuanxin Zhu
>            Assignee: Yuanxin Zhu
>            Priority: Major
>         Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch, 
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When I enable ECN and run Terasort (500G data, 8 DataNodes, 76 vcores/DN) 
> for testing, the DataNodes become congested (HDFS-8008). After receiving 
> ACKs many times, the client enters the sleep state but does not release 
> 'dataQueue'. The ResponseProcessor thread needs 'dataQueue' to execute 
> 'ackQueue.getFirst()', so it waits for the client to release 'dataQueue'. 
> In effect the ResponseProcessor thread also sleeps, which delays ACKs. 
> MapReduce tasks can be delayed by tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue.getFirst()', 
> release 'dataQueue', and then decide whether to execute 
> 'backOffIfNecessary()' according to 'one.isHeartbeatPacket()'.
>  
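
The ordering proposed in the description can be sketched roughly as follows.
This is a simplified illustration, not the actual DataStreamer code or the
attached patch: StreamerSketch, its Packet class, the counters and the empty
back-off body are hypothetical stand-ins, and only the order of operations
(take the packet under the lock, release the lock, then back off for
non-heartbeat packets) comes from the description.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StreamerSketch {
    /** Hypothetical stand-in for a queued packet; not the HDFS packet class. */
    static final class Packet {
        private final boolean heartbeat;
        Packet(boolean heartbeat) { this.heartbeat = heartbeat; }
        boolean isHeartbeatPacket() { return heartbeat; }
    }

    final Deque<Packet> dataQueue = new ArrayDeque<>();
    int backOffCalls = 0;                    // how often the back-off ran
    boolean heldLockDuringBackOff = false;   // true if it ever ran under the lock

    /** Placeholder for the congestion back-off; real code would sleep here. */
    void backOffIfNecessary() {
        heldLockDuringBackOff |= Thread.holdsLock(dataQueue);
        backOffCalls++;
    }

    /** Picks the next packet, backing off only after the lock is released. */
    Packet nextPacket() {
        Packet one;
        synchronized (dataQueue) {
            // An empty queue yields a heartbeat; otherwise peek at the head.
            one = dataQueue.isEmpty() ? new Packet(true) : dataQueue.getFirst();
        }
        // The lock is no longer held, so other threads can use dataQueue
        // while this thread backs off.
        if (!one.isHeartbeatPacket()) {
            backOffIfNecessary();
        }
        return one;
    }
}
```

With this ordering, a thread that needs 'dataQueue' (such as the
ResponseProcessor in the description) is not blocked for the duration of the
back-off.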



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

