[ https://issues.apache.org/jira/browse/HDFS-6532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445195#comment-16445195 ]
Lars Francke commented on HDFS-6532:
------------------------------------

Hi, I know this is old, but we're seeing this same error message on a production cluster and are a bit confused by it as well. Do you happen to have any more information or ideas on the root cause?

This is from Spark writing to HDFS, and Spark is killing tasks with that same exception (see below). Looking at the code, I also don't know why things would be interrupted there. The DataNode logs look normal to me at the same time (unfortunately I don't have the verbatim logs for those):

01:12:26 - Receiving Block
01:13:17 - Thread is interrupted
01:13:17 - Terminating
01:13:17 - Premature EOF from inputStream

{code:java}
18/04/20 01:12:29 INFO Executor: Executor is trying to kill task 66.0 in stage 231.0 (TID 204526)
18/04/20 01:12:29 INFO DFSClient: Exception in createBlockOutputStream
java.io.InterruptedIOException: Interrupted while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.194.211.44:52770 remote=/10.194.211.44:1019]. 215000 millis timeout left.
	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:342)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2462)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1461)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1380)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:558)
18/04/20 01:12:29 INFO DFSClient: Abandoning BP-1887265555-10.194.210.65-1478836813700:blk_5197463151_4124323282
18/04/20 01:12:29 WARN Client: interrupted waiting to send rpc request to server
java.lang.InterruptedException
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1094)
	at org.apache.hadoop.ipc.Client.call(Client.java:1457)
	at org.apache.hadoop.ipc.Client.call(Client.java:1398)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
	at com.sun.proxy.$Proxy12.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:436)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
	at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1384)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:558)
18/04/20 01:12:29 WARN DFSClient: DataStreamer Exception
java.io.IOException: java.lang.InterruptedException
	at org.apache.hadoop.ipc.Client.call(Client.java:1463)
	at org.apache.hadoop.ipc.Client.call(Client.java:1398)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
	at com.sun.proxy.$Proxy12.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.abandonBlock(ClientNamenodeProtocolTranslatorPB.java:436)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
	at com.sun.proxy.$Proxy13.abandonBlock(Unknown Source)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1384)
	at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:558)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1094)
	at org.apache.hadoop.ipc.Client.call(Client.java:1457)
	... 14 more
18/04/20 01:12:29 ERROR Executor: Exception in task 66.0 in stage 231.0 (TID 204526)
java.io.InterruptedIOException: Interrupted while waiting for data to be acknowledged by pipeline
	at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2346)
	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2325)
	at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2461)
	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2431)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
	at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108)
	at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:103)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1203)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1295)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1203)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1183)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}

> Intermittent test failure
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6532
>                 URL: https://issues.apache.org/jira/browse/HDFS-6532
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.4.0
>            Reporter: Yongjun Zhang
>            Assignee: Yiqun Lin
>            Priority: Major
>         Attachments: HDFS-6532.001.patch, HDFS-6532.002.patch,
>                      PreCommit-HDFS-Build #16770 test - testCorruptionDuringWrt [Jenkins].pdf,
>                      TEST-org.apache.hadoop.hdfs.TestCrcCorruption-select_timeout.xml,
>                      TEST-org.apache.hadoop.hdfs.TestCrcCorruption.xml, jstack
>
> Per https://builds.apache.org/job/Hadoop-Hdfs-trunk/1774/testReport, we had the following failure. Local rerun is successful.
> {code}
> Regression
> org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt
> Failing for the past 1 build (Since Failed#1774 )
> Took 50 sec.
> Error Message
> test timed out after 50000 milliseconds
> Stacktrace
> java.lang.Exception: test timed out after 50000 milliseconds
> 	at java.lang.Object.wait(Native Method)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2024)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2008)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2107)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:98)
> 	at org.apache.hadoop.hdfs.TestCrcCorruption.testCorruptionDuringWrt(TestCrcCorruption.java:133)
> {code}
> See relevant exceptions in log
> {code}
> 2014-06-14 11:56:15,283 WARN datanode.DataNode (BlockReceiver.java:verifyChunks(404)) - Checksum error in block BP-1675558312-67.195.138.30-1402746971712:blk_1073741825_1001 from /127.0.0.1:41708
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1139495951_8 at 64512 exp: 1379611785 got: -12163112
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:353)
> 	at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:284)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:402)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:537)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:734)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:741)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
> 	at java.lang.Thread.run(Thread.java:662)
> 2014-06-14 11:56:15,285 WARN datanode.DataNode (BlockReceiver.java:run(1207)) - IOException in BlockReceiver.run():
> java.io.IOException: Shutting down writer and responder due to a checksum error in received data. The error response has been sent upstream.
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1352)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1278)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1199)
> 	at java.lang.Thread.run(Thread.java:662)
> ...
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
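The sequence in the logs above is consistent with Spark's task kill (`Executor is trying to kill task`) interrupting the DataStreamer thread while it is blocked on socket I/O, which Hadoop's `SocketIOWithTimeout` then surfaces as `InterruptedIOException`. A minimal standalone sketch of that interrupt mechanism (hypothetical demo code, not Hadoop code; the 215-second wait stands in for the selector timeout seen above):

```java
public class InterruptDemo {
    // A worker blocks in a long wait (standing in for socket I/O with a
    // 215s timeout) until another thread interrupts it, the way Spark's
    // task kill interrupts the DataStreamer thread.
    static String waitUntilKilled() throws InterruptedException {
        final StringBuilder result = new StringBuilder();
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(215_000);  // stands in for the blocking selector wait
                result.append("completed");
            } catch (InterruptedException e) {
                // Hadoop's SocketIOWithTimeout rethrows this condition as
                // java.io.InterruptedIOException; here we just record it.
                result.append("interrupted while waiting for IO");
            }
        });
        worker.start();
        Thread.sleep(100);   // let the worker block first
        worker.interrupt();  // analogous to the executor killing the task
        worker.join();
        return result.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitUntilKilled());
    }
}
```

This would suggest the interruption originates from the task kill itself rather than from the DataNode side, which would match the normal-looking DataNode logs.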