We are also seeing something very similar. Looks like a bug.

It seems to get stuck in LocalBufferPool forever and the job has to be
restarted.

Is anyone else facing this too?
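
In case it helps anyone compare notes, this is roughly how we confirm it from a
TaskManager thread dump (just a sketch; the dump file name and the frame we grep
for are our own choices):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Rough sketch: scan a thread dump taken with `jstack <taskmanager-pid> > dump.txt`
// for threads currently parked in LocalBufferPool.
public class FindBufferPoolWaiters {
    public static void main(String[] args) throws Exception {
        String dumpFile = args.length > 0 ? args[0] : "dump.txt";
        List<String> lines = Files.readAllLines(Paths.get(dumpFile));

        String currentThreadHeader = null;
        for (String line : lines) {
            if (line.startsWith("\"")) {
                // jstack starts every thread section with: "thread-name" ...
                currentThreadHeader = line;
            } else if (currentThreadHeader != null
                    && line.contains("LocalBufferPool.requestMemorySegment")) {
                System.out.println("waiting in LocalBufferPool: " + currentThreadHeader);
                currentThreadHeader = null; // report each thread once
            }
        }
    }
}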

On Tue, Apr 9, 2019 at 9:04 PM Indraneel R <vascodaga...@gmail.com> wrote:

> Hi,
>
> We are trying to run a very simple Flink pipeline that sessionizes
> events from a Kinesis stream. It's an
>  - event-time window with a 30 min gap,
>  - a trigger interval of 15 mins, and
>  - an allowed lateness of 10 hrs for late-arriving events.
> This is how the graph looks (a rough code sketch follows after the screenshot).
>
> [image: Screenshot 2019-04-10 at 12.08.25 AM.png]
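>
> A rough sketch of the windowing logic above (class and field names are
> illustrative, the watermark bound is arbitrary, and the Kinesis source,
> window function and sink are elided, not our actual code):
>
> import org.apache.flink.streaming.api.TimeCharacteristic;
> import org.apache.flink.streaming.api.datastream.DataStream;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
> import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
> import org.apache.flink.streaming.api.windowing.time.Time;
> import org.apache.flink.streaming.api.windowing.triggers.ContinuousEventTimeTrigger;
>
> public class SessionizeJob {
>
>     // minimal event shape, just for the sketch
>     public static class Event {
>         public String userId;
>         public long eventTime;
>     }
>
>     public static void main(String[] args) throws Exception {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
>
>         DataStream<Event> events = createKinesisSource(env)   // FlinkKinesisConsumer, elided
>                 .assignTimestampsAndWatermarks(
>                         new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.minutes(1)) {
>                             @Override
>                             public long extractTimestamp(Event e) {
>                                 return e.eventTime;
>                             }
>                         });
>
>         events.keyBy(e -> e.userId)
>               // event-time session windows with a 30 min gap
>               .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
>               // early/periodic firings every 15 mins while a session is open
>               .trigger(ContinuousEventTimeTrigger.of(Time.minutes(15)))
>               // keep window state around for events arriving up to 10 hrs late
>               .allowedLateness(Time.hours(10))
>               .process(new SessionizeFunction())   // ProcessWindowFunction, elided
>               .addSink(createSink());              // elided
>
>         env.execute("sessionization");
>     }
> }
>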
> But what we are observing is that after 2-3 days of continuous running, the job
> becomes progressively unstable and eventually freezes completely.
>
> And the thread dump analysis revealed that it is actually waiting indefinitely
> at
>     `LocalBufferPool.requestMemorySegment(LocalBufferPool.java:261)`
> for a memory segment to become available.
> And while it is waiting it holds the checkpoint lock, and therefore blocks
> all other threads as well, since they are all waiting for a lock on the
> `checkpointLock` object.
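>
> To make sure we are describing the same shape of the problem, here is a very
> simplified illustration of that pattern (NOT Flink's actual code, just the
> structure we see in the dump):
>
> import java.util.concurrent.ArrayBlockingQueue;
>
> // Simplified stall pattern: the task thread blocks waiting for a buffer
> // while holding the checkpoint lock, so anything else that needs that
> // lock (checkpointing, timers, etc.) is stuck behind it.
> public class StallPattern {
>     private final Object checkpointLock = new Object();
>     private final ArrayBlockingQueue<byte[]> bufferPool = new ArrayBlockingQueue<>(4);
>
>     void taskThread() throws InterruptedException {
>         synchronized (checkpointLock) {
>             // blocks forever if no buffer is ever recycled back into the pool ...
>             byte[] segment = bufferPool.take();
>             emit(segment);
>         }
>     }
>
>     void checkpointThread() {
>         synchronized (checkpointLock) {   // ... so this thread never gets the lock
>             triggerCheckpoint();
>         }
>     }
>
>     private void emit(byte[] segment) {}
>     private void triggerCheckpoint() {}
> }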
>
> But we are not able to figure out why it is not able to get any segment,
> because there is no indication of backpressure, at least in the Flink UI.
> And here are our job configurations:
>
> number of TaskManagers: 4
> jobmanager.heap.size: 8000m
> taskmanager.heap.size: 11000m
> taskmanager.numberOfTaskSlots: 4
> parallelism.default: 16
> taskmanager.network.memory.max: 5gb
> taskmanager.network.memory.min: 3gb
> taskmanager.network.memory.buffers-per-channel: 8
> taskmanager.network.memory.floating-buffers-per-gate: 16
> taskmanager.memory.size: 13gb
>
> data rate: 250 messages/sec
> (about 1 MB/sec)
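>
> As a back-of-envelope check of what that network config should give us,
> assuming the default 32 KiB segment size (the numbers are just our reading of
> the settings above):
>
> public class NetworkBufferMath {
>     public static void main(String[] args) {
>         // assumes the default taskmanager.memory.segment-size of 32 KiB
>         long segmentSize = 32 * 1024L;
>         long networkMin  = 3L * 1024 * 1024 * 1024;   // taskmanager.network.memory.min: 3gb
>         long networkMax  = 5L * 1024 * 1024 * 1024;   // taskmanager.network.memory.max: 5gb
>
>         System.out.println("segments per TM at min: " + networkMin / segmentSize);   // 98304
>         System.out.println("segments per TM at max: " + networkMax / segmentSize);   // 163840
>
>         // rough upper bound for one input gate of a 16-way shuffle:
>         int channels   = 16;   // parallelism.default: 16
>         int perChannel = 8;    // buffers-per-channel: 8
>         int floating   = 16;   // floating-buffers-per-gate: 16
>         System.out.println("buffers per input gate: " + (channels * perChannel + floating));   // 144
>     }
> }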
>
>
> Any ideas on what could be the issue?
>
> regards
> -Indraneel
>
>
>
