Hi,

We are hitting the "TaskManager deadlock on NetworkBufferPool" bug <https://issues.apache.org/jira/browse/FLINK-2685> in Flink 1.3.2. We run a set of ETL merge jobs for a number of tables, and we get stuck with the above issue at random, every day.

I'm attaching the thread dumps of the JobManager and of one of the TaskManagers (T1) running the stuck job. We have also observed that sometimes a new job scheduled on T1 makes progress even while another job is stuck there.

"CHAIN DataSource (at createInput(ExecutionEnvironment.java:553) (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Map (Map at main(MergeTableSecond.java:175)) -> Map (Key Extractor) (6/9)" #1501 daemon prio=5 os_prio=0 tid=0x00007f9ea84d2fb0 nid=0x22fe in Object.wait() [0x00007f9ebf102000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:224)
    - locked <0x00000005e28fe218> (a java.util.ArrayDeque)
    at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:193)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
    - locked <0x00000005e29125f0> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
    at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:89)
    at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
    at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
    at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
    at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
    at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
    at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:168)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)

--
Thanks,
Amit