[ 
https://issues.apache.org/jira/browse/HDFS-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141463#comment-17141463
 ] 

Ayush Saxena commented on HDFS-15407:
-------------------------------------

Thanks [~rain_lyy] for the report, and sorry for coming late. As I understand 
it, the problem you are facing is that the slow datanodes end up occupying the 
whole hedged read thread pool?
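
To make the suspected saturation concrete, here is a minimal sketch. It assumes 
a bounded pool with a SynchronousQueue and the JDK's CallerRunsPolicy as a 
stand-in for the hedged read pool; the class and method names are hypothetical 
and this is not the actual DFSClient code. Once every worker is blocked, the 
next task is rejected and runs on the submitting thread, which matches the 
"Execution rejected, Executing in current thread" lines in the log quoted below.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; a stand-in for a saturated hedged read pool.
final class SaturatedPoolDemo {

    /** Fills the pool with blocked tasks, then returns the name of the thread that ran one more task. */
    static String threadOfOverflowTask(int poolSize) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, poolSize, 60, TimeUnit.SECONDS,
                new SynchronousQueue<>(),                    // no queueing: hand off or reject
                new ThreadPoolExecutor.CallerRunsPolicy());  // rejected tasks run on the caller
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch allStarted = new CountDownLatch(poolSize);
        try {
            // Occupy every worker with a task that blocks, like a read from a slow datanode.
            for (int i = 0; i < poolSize; i++) {
                pool.execute(() -> {
                    allStarted.countDown();
                    try {
                        release.await();
                    } catch (InterruptedException ignored) {
                        // interrupted at shutdown: just finish
                    }
                });
            }
            allStarted.await();

            // The pool is full, so this task is rejected and executed synchronously by the caller.
            AtomicReference<String> ranOn = new AtomicReference<>();
            pool.execute(() -> ranOn.set(Thread.currentThread().getName()));
            return ranOn.get();
        } finally {
            release.countDown();
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("overflow task ran on: " + threadOfOverflowTask(5));
    }
}
```

With all workers stuck, the overflow task runs on the calling thread, so the 
caller itself blocks for as long as that read takes.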

{{future.cancel(false)}} was chosen deliberately back then, so that in-progress 
calls are allowed to complete, in order to overcome certain issues mentioned here:

https://issues.apache.org/jira/browse/HDFS-5776?focusedCommentId=13905955&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13905955

 

That is pretty old stuff, and I am not sure whether those problems still exist. 
You can research the present situation: if they are fixed now, find out what 
fixed them; if they still persist, then just changing to 
{{future.cancel(true)}} isn't a solution.
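
For reference, the difference between the two calls can be sketched as follows. 
This is a minimal illustration using plain java.util.concurrent, with 
Thread.sleep standing in for a blocked read; the class and method names are 
hypothetical. Whether the real HDFS read path responds cleanly to interruption 
is exactly the open question from the HDFS-5776 comment linked above, and the 
sketch does not answer it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; Thread.sleep is a stand-in for a read from a slow datanode.
final class CancelDemo {

    /** Cancels a running task and reports whether the single worker was freed within 500 ms. */
    static boolean threadFreedAfterCancel(boolean mayInterruptIfRunning) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch started = new CountDownLatch(1);
        try {
            Future<?> slow = pool.submit(() -> {
                started.countDown();
                try {
                    Thread.sleep(5_000); // blocked "read"
                } catch (InterruptedException e) {
                    // cancel(true) interrupts the worker: return and release it
                }
            });
            started.await(); // make sure the task is actually running
            slow.cancel(mayInterruptIfRunning);

            // A follow-up task can only run once the single worker is free again.
            Future<String> next = pool.submit(() -> "done");
            try {
                next.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("cancel(false) frees the worker: " + threadFreedAfterCancel(false));
        System.out.println("cancel(true) frees the worker: " + threadFreedAfterCancel(true));
    }
}
```

With cancel(false) the future is marked cancelled but the worker keeps running 
the task, so the pool slot stays occupied; with cancel(true) the worker is 
interrupted and, for an interruptible call like the one above, becomes 
available again.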

 

As for remembering the slow datanodes, I am not sure how gracefully we can do 
that, nor whether it would solve your problem, considering that a bunch of 
datanodes may behave differently under load.

 

If you have other solutions in mind or have done some analysis, do let me know.

 

[~stack] you got this change in. Have you faced this issue, or do you have any 
pointers on this?

> Hedged read will not work if a datanode slow for a long time
> ------------------------------------------------------------
>
>                 Key: HDFS-15407
>                 URL: https://issues.apache.org/jira/browse/HDFS-15407
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: 3.1.1, datanode
>    Affects Versions: 3.1.1
>            Reporter: liuyanyu
>            Assignee: liuyanyu
>            Priority: Major
>
> I use cgroups to limit the datanode's IO to 1024 bytes/s and use hedged read 
> to read a file (with dfs.client.hedged.read.threadpool.size set to 5 and 
> dfs.client.hedged.read.threshold.millis set to 500). The first 5 buffer reads 
> time out and switch to other datanodes, which are read successfully. The read 
> then gets stuck for a long time because of a SocketTimeoutException. The log 
> is as follows:
> 2020-06-11 16:40:07,832 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:08,562 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:09,102 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:09,642 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:10,182 | INFO  | main | Waited 500ms to read from 
> DatanodeInfoWithStorage[xx.xx.xx.28:25009,DS-9c843ac6-4ea1-4791-a1af-54c1ae3d5daf,DISK];
>  spawning hedged read | DFSInputStream.java:1188
> 2020-06-11 16:40:10,182 | INFO  | main | Execution rejected, Executing in 
> current thread | DFSClient.java:3049
> 2020-06-11 16:40:10,219 | INFO  | main | Execution rejected, Executing in 
> current thread | DFSClient.java:3049
> 2020-06-11 16:50:07,638 | WARN  | hedgedRead-0 | I/O error constructing 
> remote block reader. | BlockReaderFactory.java:764
> java.net.SocketTimeoutException: 600000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/xx.xx.xx.113:62750 remote=/xx.xx.xx.28:25009]
>       at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>       at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
>       at java.io.FilterInputStream.read(FilterInputStream.java:83)
>       at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:551)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:418)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
>       at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:661)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1063)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1035)
>       at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1031)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> 2020-06-11 16:50:07,638 | WARN  | hedgedRead-0 | Connection failure: Failed 
> to connect to /xx.xx.xx.28:25009 for file /testhdfs/test2.jar for block 
> BP-1820384660-xx.xx.xx.74-1585533043013:blk_1082582662_8861386:java.net.SocketTimeoutException:
>  600000 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/xx.xx.xx.113:62750 
> remote=/xx.xx.xx.28:25009] | DFSInputStream.java:1118
>  



