Hi Yufei,
My prime suspect would be changes to the memory configuration introduced in
1.11 [1].
Piotrek
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html#memory-management
On Mon, 28 Dec 2020 at 09:52, Till Rohrmann wrote:
Hi Yufei,
I cannot remember the exact changes in this area between Flink 1.10.0 and
Flink 1.12.0. It sounds a bit as if we were not releasing memory segments
fast enough or had a memory leak. One thing to try is increasing the
restart delay to see whether it is the former.
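
A minimal sketch of what increasing the restart delay could look like in
flink-conf.yaml, assuming the job uses the fixed-delay restart strategy. The
attempt count and the 30 s delay are illustrative values, not taken from this
thread:

```yaml
# Assumption: fixed-delay restarts; adjust if the job uses another strategy.
restart-strategy: fixed-delay
# Illustrative value: number of restart attempts before the job fails.
restart-strategy.fixed-delay.attempts: 3
# Illustrative value: a longer pause between attempts gives the TaskManagers
# more time to release network buffers before tasks are redeployed.
restart-strategy.fixed-delay.delay: 30 s
```

If the exception disappears with a longer delay, that would point towards
buffers being released too slowly rather than a leak.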
Hi, Yufei.
Can you reproduce this issue in 1.10.0? The deterministic slot sharing
introduced in 1.12.0 is one possible reason. Before 1.12.0, the
distribution of tasks in slots was not deterministic. Even if the network
buffers are sufficient from the perspective of the cluster, a bad
distribution of tasks
Hey,
I’ve found that the job throws “java.io.IOException: Insufficient number of
network buffers: required 51, but only 1 available” after a job restart, and
I’ve observed that the TM uses many more network buffers than before.
My internal branch, which is based on 1.10.0, can easily reproduce this, but I use 1.12.0