[ 
https://issues.apache.org/jira/browse/HDFS-17332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xing Lin updated HDFS-17332:
----------------------------
    Description: 
In DFSInputStream#actualGetFromOneDataNode(), it would send the exception 
stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most 
cases, the read request will be served successfully by reading from the next 
available DN. The existence of exception stacktrace in the log has caused 
multiple hadoop users at Linkedin to consider this WARN message as the RC/fatal 
error for their jobs.  We would like to improve the log message and avoid 
sending the stacktrace to dfsClient.LOG when a read succeeds. The stackTrace 
when reading reach DN is sent to the log only when we really need to fail a 
read request (when chooseDataNode()/refetchLocations() throws a 
BlockMissingException). 

 

Example stack trace
{code:java}
[12]<stderr>:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: Failed 
to connect to 10.150.91.13/10.150.91.13:71 for file 
/jobs/kgemb/holistic/dev/ywang11/pcv2/runs/2850541/artifacts/jobAction-train-importer/featurized_dataset/part-109247-95b9909c-b6ab-41aa-bb87-7e76f4aad35f-c000.avro
 for block 
BP-364971551-10.150.4.19-1448516588954:blk_130854761734_129864739321:java.net.SocketTimeoutException:
 60000 millis timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 
remote=10.150.91.13/10.150.91.13:71] 
[12]<stderr>:java.net.SocketTimeoutException: 60000 millis timeout while 
waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 
remote=10.150.91.13/10.150.91.13:71] [12]<stderr>: at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
[12]<stderr>: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
[12]<stderr>: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
[12]<stderr>: at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
[12]<stderr>: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
[12]<stderr>: at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
 [12]<stderr>: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
 [12]<stderr>: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
 [12]<stderr>: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
 [12]<stderr>: at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
 [12]<stderr>: at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
[12]<stderr>: at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
 [12]<stderr>: at 
org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
 [12]<stderr>: at 
org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
[12]<stderr>: at 
org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
[12]<stderr>: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
[12]<stderr>: at 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
 [12]<stderr>: at 
com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
 [12]<stderr>: at 
hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
 [12]<stderr>: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
[12]<stderr>: at 
org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
 [12]<stderr>: at 
org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) [12]<stderr>: 
at org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
[12]<stderr>: at 
org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}

  was:In DFSInputStream#actualGetFromOneDataNode(), it would send the exception 
stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most 
cases, the read request will be served successfully by reading from the next 
available DN. The existence of exception stacktrace in the log has caused 
multiple hadoop users at Linkedin to consider this WARN message as the RC/fatal 
error for their jobs.  We would like to improve the log message and avoid 
sending the stacktrace to dfsClient.LOG when a read succeeds. The stackTrace 
when reading reach DN is sent to the log only when we really need to fail a 
read request (when chooseDataNode()/refetchLocations() throws a 
BlockMissingException). 


> DFSInputStream: avoid logging stacktrace until when we really need to fail a 
> read request with a MissingBlockException
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17332
>                 URL: https://issues.apache.org/jira/browse/HDFS-17332
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Minor
>
> In DFSInputStream#actualGetFromOneDataNode(), it would send the exception 
> stacktrace to the dfsClient.LOG whenever we fail on a DN. However, in most 
> cases, the read request will be served successfully by reading from the next 
> available DN. The existence of exception stacktrace in the log has caused 
> multiple hadoop users at Linkedin to consider this WARN message as the 
> RC/fatal error for their jobs.  We would like to improve the log message and 
> avoid sending the stacktrace to dfsClient.LOG when a read succeeds. The 
> stackTrace when reading reach DN is sent to the log only when we really need 
> to fail a read request (when chooseDataNode()/refetchLocations() throws a 
> BlockMissingException). 
>  
> Example stack trace
> {code:java}
> [12]<stderr>:23/11/30 23:01:33 WARN hdfs.DFSClient: Connection failure: 
> Failed to connect to 10.150.91.13/10.150.91.13:71 for file 
> /jobs/kgemb/holistic/dev/ywang11/pcv2/runs/2850541/artifacts/jobAction-train-importer/featurized_dataset/part-109247-95b9909c-b6ab-41aa-bb87-7e76f4aad35f-c000.avro
>  for block 
> BP-364971551-10.150.4.19-1448516588954:blk_130854761734_129864739321:java.net.SocketTimeoutException:
>  60000 millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 
> remote=10.150.91.13/10.150.91.13:71] 
> [12]<stderr>:java.net.SocketTimeoutException: 60000 millis timeout while 
> waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/100.101.37.108:40492 
> remote=10.150.91.13/10.150.91.13:71] [12]<stderr>: at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) 
> [12]<stderr>: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) 
> [12]<stderr>: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) 
> [12]<stderr>: at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118) 
> [12]<stderr>: at java.io.FilterInputStream.read(FilterInputStream.java:83) 
> [12]<stderr>: at 
> org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:458)
>  [12]<stderr>: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderRemote2.newBlockReader(BlockReaderRemote2.java:412)
>  [12]<stderr>: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:864)
>  [12]<stderr>: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
>  [12]<stderr>: at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:387)
>  [12]<stderr>: at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:736) 
> [12]<stderr>: at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1268)
>  [12]<stderr>: at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1216)
>  [12]<stderr>: at 
> org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1608) 
> [12]<stderr>: at 
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1568) 
> [12]<stderr>: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]<stderr>: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.lambda$read$0(InstrumentedFSDataInputStream.java:108)
>  [12]<stderr>: at 
> com.linkedin.hadoop.metrics.fs.PerformanceTrackingFSDataInputStream.process(PerformanceTrackingFSDataInputStream.java:39)
>  [12]<stderr>: at 
> hdfs_metrics_shade.org.apache.hadoop.fs.InstrumentedFSDataInputStream$InstrumentedFilterInputStream.read(InstrumentedFSDataInputStream.java:108)
>  [12]<stderr>: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93) 
> [12]<stderr>: at 
> org.apache.hadoop.fs.RetryingInputStream.lambda$read$2(RetryingInputStream.java:153)
>  [12]<stderr>: at 
> org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36) 
> [12]<stderr>: at 
> org.apache.hadoop.fs.RetryingInputStream.read(RetryingInputStream.java:149) 
> [12]<stderr>: at 
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:93){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to