Re: TaskManager deadlock on NetworkBufferPool

Fabian Hueske Wed, 04 Apr 2018 14:01:44 -0700

Hi Amit,

The network stack has been redesigned for the upcoming Flink 1.5 release.
The issue might have been fixed by that.


There's already a first release candidate for Flink 1.5.0 available [1].
It would be great if you could check whether the bug is still present in
that release candidate.

Best, Fabian

[1]
https://lists.apache.org/thread.html/a6b6fb1a42a975608fa8641c86df30b47f022985ade845f1f1ec542a@%3Cdev.flink.apache.org%3E

2018-04-04 20:23 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:

> I searched for 0x00000005e28fe218 in the two files you attached
> to FLINK-2685 but didn't find any hits.
>
> Was this the same instance as the attachment to FLINK-2685?
>
> Thanks
>
> On Wed, Apr 4, 2018 at 10:21 AM, Amit Jain <aj201...@gmail.com> wrote:
>
> > +u...@flink.apache.org
> >
> > On Wed, Apr 4, 2018 at 11:33 AM, Amit Jain <aj201...@gmail.com> wrote:
> > > Hi,
> > >
> > > We are hitting the TaskManager deadlock on NetworkBufferPool bug in
> > > Flink 1.3.2. We have a set of ETL merge jobs for a number of tables,
> > > and they get stuck on the above issue at random, daily.
> > >
> > > I'm attaching the thread dumps of the JobManager and one of the
> > > TaskManagers (T1) running the stuck job.
> > > We also observed that a new job scheduled on T1 sometimes progresses
> > > even while another job is stuck there.
> > >
> > > "CHAIN DataSource (at createInput(ExecutionEnvironment.java:553) (org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> Map (Map at main(MergeTableSecond.java:175)) -> Map (Key Extractor) (6/9)" #1501 daemon prio=5 os_prio=0 tid=0x00007f9ea84d2fb0 nid=0x22fe in Object.wait() [0x00007f9ebf102000]
> > >    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> > >     at java.lang.Object.wait(Native Method)
> > >     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:224)
> > >     - locked <0x00000005e28fe218> (a java.util.ArrayDeque)
> > >     at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:193)
> > >     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
> > >     - locked <0x00000005e29125f0> (a org.apache.flink.runtime.io.network.api.serialization.SpanningRecordSerializer)
> > >     at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:89)
> > >     at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
> > >     at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> > >     at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
> > >     at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> > >     at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:79)
> > >     at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> > >     at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:168)
> > >     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
> > >     at java.lang.Thread.run(Thread.java:748)
> > >
> > > --
> > > Thanks,
> > > Amit
> >
>
