[ https://issues.apache.org/jira/browse/HDFS-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363358#comment-17363358 ]
zhengchenyu commented on HDFS-16070:
------------------------------------

[~ayushsaxena] [~inigoiri] I have submitted a pull request. Could you help review the patch?

> DataTransfer block storm when datanode's io is busy.
> ----------------------------------------------------
>
>                 Key: HDFS-16070
>                 URL: https://issues.apache.org/jira/browse/HDFS-16070
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>   Affects Versions: 3.3.0, 3.2.1
>            Reporter: zhengchenyu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> While speeding up a decommission, I found that some datanodes' I/O was busy. The hosts' load was very high, and tens of thousands of data transfer threads were running.
> The datanode log contained entries like the following:
> {code}
> # Log entries for starting the transfer threads
> 2021-06-08 13:42:37,620 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.201.4.49:9866, datanodeUuid=6c55b7cb-f8ef-445b-9cca-d82b5b077ed1, infoPort=9864, infoSecurePort=0, ipcPort=9867, storageInfo=lv=-57;cid=CID-37e80bd5-733a-4d7b-ba3d-b46269573c72;nsid=215490653;c=1584525570797) Starting thread to transfer BP-852924019-10.201.1.32-1584525570797:blk_-9223372036449848858_30963611 to 10.201.7.52:9866
> 2021-06-08 13:52:36,345 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.201.4.49:9866, datanodeUuid=6c55b7cb-f8ef-445b-9cca-d82b5b077ed1, infoPort=9864, infoSecurePort=0, ipcPort=9867, storageInfo=lv=-57;cid=CID-37e80bd5-733a-4d7b-ba3d-b46269573c72;nsid=215490653;c=1584525570797) Starting thread to transfer BP-852924019-10.201.1.32-1584525570797:blk_-9223372036449848858_30963611 to 10.201.7.31:9866
> 2021-06-08 14:02:37,197 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.201.4.49:9866, datanodeUuid=6c55b7cb-f8ef-445b-9cca-d82b5b077ed1, infoPort=9864, infoSecurePort=0, ipcPort=9867, storageInfo=lv=-57;cid=CID-37e80bd5-733a-4d7b-ba3d-b46269573c72;nsid=215490653;c=1584525570797) Starting thread to transfer BP-852924019-10.201.1.32-1584525570797:blk_-9223372036449848858_30963611 to 10.201.16.50:9866
> # Entries marking completed transfers
> 2021-06-08 13:54:08,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at bd-tz1-hadoop-004049.zeus.lianjia.com:9866: Transmitted BP-852924019-10.201.1.32-1584525570797:blk_-9223372036449848858_30963611 (numBytes=7457424) to /10.201.7.52:9866
> 2021-06-08 14:10:47,170 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at bd-tz1-hadoop-004049.zeus.lianjia.com:9866: Transmitted BP-852924019-10.201.1.32-1584525570797:blk_-9223372036449848858_30963611 (numBytes=7457424) to /10.201.16.50:9866
> {code}
> The previous data transfer thread finished at 13:54:08, but the next data transfer for the same block had already started at 13:52:36.
> If a data transfer does not finish within 10 minutes (pending timeout + check interval), another data transfer for the same block is started, which puts even more load on the disk and network.
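The timing described in the issue can be checked directly from the quoted log timestamps. The following is a minimal sketch, not part of the patch: the names `TRANSFERS`, `parse`, `start_gaps`, and `overlap_seconds` are hypothetical helpers introduced here for illustration, and the only data used are the timestamps from the log excerpt. The excerpt has no "Transmitted" entry for 10.201.7.31, so its finish time is left as `None`.

```python
from datetime import datetime

# Timestamps copied from the log excerpt in the issue description.
# Each value is (start, finish); finish is None when the excerpt
# shows no "Transmitted" entry for that target.
TRANSFERS = {
    "10.201.7.52:9866": ("2021-06-08 13:42:37,620", "2021-06-08 13:54:08,134"),
    "10.201.7.31:9866": ("2021-06-08 13:52:36,345", None),
    "10.201.16.50:9866": ("2021-06-08 14:02:37,197", "2021-06-08 14:10:47,170"),
}

# Log timestamp layout: date, time, then milliseconds after a comma.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse(ts):
    """Parse a log timestamp such as '2021-06-08 13:42:37,620'."""
    return datetime.strptime(ts, LOG_TIME_FORMAT)


def start_gaps(transfers):
    """Seconds between consecutive transfer starts, in start order."""
    starts = sorted(parse(start) for start, _ in transfers.values())
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]


def overlap_seconds(earlier, later):
    """Seconds the later transfer ran while the earlier one was still running."""
    earlier_finish = parse(earlier[1])
    later_start = parse(later[0])
    return max(0.0, (earlier_finish - later_start).total_seconds())


if __name__ == "__main__":
    first = TRANSFERS["10.201.7.52:9866"]
    second = TRANSFERS["10.201.7.31:9866"]
    duration = (parse(first[1]) - parse(first[0])).total_seconds()
    print("first transfer duration (s):", duration)
    print("gaps between starts (s):", start_gaps(TRANSFERS))
    print("overlap of first and second (s):", overlap_seconds(first, second))
```

With these timestamps, the first transfer takes roughly 11.5 minutes, the three starts are each about 10 minutes apart, and the second transfer runs for about a minute and a half alongside the first, which matches the overlap described in the issue.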