Hi,
I am hitting an error/exception in the distributed runtime. The job sometimes
succeeds and sometimes fails.
The Web UI says:
Error: java.lang.Exception: The slot in which the task was scheduled has
been killed (probably loss of TaskManager).
Looking in the taskmanager logs, I find:
25.Apr. 00:12:42 WARN DFSClient - DFSOutputStream
ResponseProcessor exception for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1492)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)
25.Apr. 00:12:43 WARN DFSClient - DataStreamer Exception
java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:278)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:568)
25.Apr. 00:12:42 WARN DFSClient - DFSOutputStream
ResponseProcessor exception for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988
java.io.IOException: Bad response ERROR for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 from
datanode 172.16.19.81:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:732)
25.Apr. 00:12:46 WARN DFSClient - Error Recovery for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716650_1975996 in
pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.19.109:50010: bad
datanode 172.16.20.112:50010
25.Apr. 00:12:47 WARN DFSClient - Error Recovery for block
BP-1944967336-172.16.21.111-1412785070309:blk_1075716642_1975988 in
pipeline 172.16.20.112:50010, 172.16.19.81:50010, 172.16.20.105:50010: bad
datanode 172.16.20.112:50010
25.Apr. 00:12:48 WARN RemoteWatcher - Detected unreachable:
[akka.tcp://[email protected]:6123]
25.Apr. 00:12:53 INFO TaskManager - Disconnecting from
JobManager: JobManager is no longer reachable
25.Apr. 00:12:53 INFO TaskManager - Cancelling all computations
and discarding all cached data.
The jobmanager's logs say:
25.Apr. 00:07:37 WARN RemoteWatcher - Detected unreachable:
[akka.tcp://[email protected]:41265]
25.Apr. 00:07:37 INFO JobManager - Task manager akka.tcp://
[email protected]:41265/user/taskmanager terminated.
25.Apr. 00:07:37 INFO InstanceManager - Unregistered task manager
akka.tcp://[email protected]:41265. Number of registered task managers 9.
Number of available slots 18.
25.Apr. 00:07:37 INFO JobManager - Status of job
5f021c291483cdf7e7fae3271bfeacb1 (Wikipedia Extraction (dataset = full))
changed to FAILING The slot in which the task was scheduled has been killed
(probably loss of TaskManager)..
Any idea what I can do? Should I change some config settings?
I already have:
taskmanager.heartbeat-interval: 10000
jobmanager.max-heartbeat-delay-before-failure.sec: 90
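Would raising the Akka death-watch/timeout settings from the configuration
page help? I was thinking of something like the following in flink-conf.yaml
(the values are just guesses on my part, not something I have tested):

```yaml
# Akka death-watch settings (key names taken from the Flink
# configuration docs; values below are untested guesses):
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 100 s
akka.ask.timeout: 100 s
```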
Just in case you suspect this is related to FLINK-1916, which I reported a
while ago: this is a different job, running on different data.
Best,
Stefan