[ https://issues.apache.org/jira/browse/HDFS-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133095#comment-17133095 ]

liuyanyu commented on HDFS-15407:
---------------------------------

One reason is that hedged read cancels the timed-out request with 
future.cancel(false). This does not cancel the request immediately; in-progress 
tasks are allowed to complete. So the requests to the slow datanode keep 
occupying the thread pool, and further requests cannot be submitted.

The other reason is that the slow datanode is not remembered when reading the 
next buffer. So each read request again chooses the slow datanode, times out, 
and then chooses another one. Eventually the requests to the slow datanode 
occupy the whole thread pool and other requests cannot be submitted.
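
One way to avoid re-picking the slow node, sketched below, is to keep a per-stream memory of datanodes that timed out and consult it when choosing a replica for the next buffer. The class and method names here are hypothetical illustrations, not the actual DFSInputStream API.

```java
import java.util.*;

public class SlowNodeTracker {
    // Hypothetical per-stream memory of datanodes that timed out.
    private final Set<String> slowNodes = new HashSet<>();

    // Record a datanode whose read exceeded the hedged-read threshold.
    void markSlow(String datanode) {
        slowNodes.add(datanode);
    }

    // Prefer a replica that has not timed out before.
    String chooseDatanode(List<String> replicas) {
        for (String dn : replicas) {
            if (!slowNodes.contains(dn)) {
                return dn;
            }
        }
        // All replicas marked slow: fall back to the first and retry.
        return replicas.get(0);
    }

    public static void main(String[] args) {
        SlowNodeTracker tracker = new SlowNodeTracker();
        List<String> replicas = Arrays.asList("dn-28:25009", "dn-29:25009");
        tracker.markSlow("dn-28:25009");  // the first read timed out here
        System.out.println(tracker.chooseDatanode(replicas));
    }
}
```

With such a memory, the second buffer read would go straight to the healthy replica instead of waiting out another timeout on the slow one.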

> Hedged read will not work if a datanode slow for a long time
> ------------------------------------------------------------
>
>                 Key: HDFS-15407
>                 URL: https://issues.apache.org/jira/browse/HDFS-15407
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.1.1
>            Reporter: liuyanyu
>            Assignee: liuyanyu
>            Priority: Major
>
> I used cgroups to limit the datanode's IO to 1024 bytes/s and used hedged 
> read to read a file (with dfs.client.hedged.read.threadpool.size set to 5 and 
> dfs.client.hedged.read.threshold.millis set to 500). The first 5 buffer reads 
> timed out, and the client switched to other datanodes and read successfully. 
> Then it got stuck for a long time with a SocketTimeoutException. Log as follows:
> 2020-06-11 16:40:07,832 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:08,562 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:09,102 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:09,642 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:10,182 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:10,182 | INFO  | main | Execution rejected, Executing in 
> current thread | DFSClient.java:3049
> 2020-06-11 16:40:10,219 | INFO  | main | Execution rejected, Executing in 
> current thread | DFSClient.java:3049
> 2020-06-11 16:50:07,638 | WARN  | hedgedRead-0 | I/O error constructing 
> remote block reader. | BlockReaderFactory.java:764
> java.net.SocketTimeoutException: 600000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009]
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>       at java.io.FilterInputStream.read(FilterInputStream.java:83)
>       at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:551)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:418)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:661)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1063)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1035)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1031)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> 2020-06-11 16:50:07,638 | WARN  | hedgedRead-0 | Connection failure: Failed 
> to connect to /xx.xx.xx.28:25009 for file /testhdfs/test2.jar for block 
> BP-1820384660-xx.xx.xx.74-1585533043013:blk_1082582662_8861386:java.net.SocketTimeoutException:
>  600000 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 
> remote=/xx.xx.xx.28:25009] | DFSInputStream.java:1118
>  
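
For reference, the hedged-read settings used in the reproduction above correspond to the following client-side configuration in hdfs-site.xml (values taken from the description):

```xml
<!-- Hedged-read thread pool size; 0 disables hedged reads. -->
<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>5</value>
</property>
<!-- Wait this long before spawning a hedged read to another replica. -->
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>500</value>
</property>
```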



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
