Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-17 Thread Piotr Nowojski
Hi, if the TM is not responding, check the TM logs for long gaps between entries. There are three main reasons for such gaps: 1. The machine is swapping: configure your machine/processes so that the machine never swaps (best to disable swap altogether). 2. Long full-GC pauses: look how to
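The log check suggested above can be partly automated. The following is only a minimal sketch: it assumes every log entry starts with a timestamp of the form `2017-11-16 12:34:56,789` (your logging pattern may differ), and the function name `find_log_gaps` and the 30-second threshold are hypothetical choices for illustration, not anything taken from Flink.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp prefix, e.g. "2017-11-16 12:34:56,789 INFO ...".
# Adjust the pattern if your logging configuration uses another layout.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_log_gaps(lines, threshold=timedelta(seconds=30)):
    """Return (previous_ts, next_ts, gap) for each gap longer than threshold."""
    gaps = []
    prev = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            # Stack traces and continuation lines carry no timestamp; skip them.
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

To use it, pass an open log file, for example `find_log_gaps(open(path))` with `path` pointing at the TM log, and compare the reported gaps with the time at which the JM lost the TM.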

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Hao Sun
Sorry, by "killed" I mean that the JM lost the TM. The TM instance is still running inside Kubernetes, but it is not responding to any requests, probably due to high load. On the JM side, the JM lost heartbeat tracking of the TM, so it marked the TM as dead. The "volume" of Kafka topics, I mean,

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Stefan Richter
Hi,
> In addition to your comments, what are the items retained by NetworkEnvironment? They seem to grow indefinitely; do they ever shrink?
Mostly the network buffers, which should be OK. They are always recycled and should not be released until the network environment is GCed.
> I

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Hao Sun
Thanks a lot! This is very helpful. In addition to your comments, what are the items retained by NetworkEnvironment? They seem to grow indefinitely; do they ever shrink? I think there is a GC issue because my task manager is somehow killed after a job runs. The duration correlates to the volume

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Stefan Richter
Hi, I cannot spot anything in your screenshots that indicates a leak. Maybe you are misinterpreting the numbers? In your heap dump, there is only a single instance of org.apache.flink.runtime.io.network.NetworkEnvironment and it retains about 400,000,000 bytes from being GCed because it holds
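For a sense of scale, the retained size can be converted into a buffer count. This is a rough sketch only: it assumes the retained memory consists of fixed-size network buffers (as described elsewhere in this thread) and that each buffer is 32 KiB, which is Flink's default memory segment size but may be configured differently on your cluster. The helper name `buffers_for` is hypothetical.

```python
# Assumption: 32 KiB per network buffer (Flink's default segment size);
# the real value depends on the cluster configuration.
SEGMENT_SIZE_BYTES = 32 * 1024


def buffers_for(retained_bytes, segment_size=SEGMENT_SIZE_BYTES):
    """Approximate number of whole fixed-size buffers within retained_bytes."""
    return retained_bytes // segment_size


retained = 400_000_000  # approximate figure quoted from the heap dump
print(buffers_for(retained))              # 12207 buffers of 32 KiB
print(round(retained / (1024 ** 2), 1))   # 381.5 (MiB)
```

A fixed pool of roughly this size that is recycled rather than freed would look like a large, constant retained heap in a dump, which is different from a leak that keeps growing.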