Re: Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-26 Thread Richard Deurwaarder
Hello, We have run into the same problem. We've done most of the same steps/observations: - increase memory - increase CPU - no noticeable increase in GC activity - little network I/O. Our current setup has the liveness probe disabled and we've increased the (Akka) timeouts; this seems to help

Re: Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-26 Thread Biao Liu
Hi Prakhar, Sorry, I don't have much experience with k8s. Maybe someone else can help. On Fri, Jul 26, 2019 at 6:20 PM Prakhar Mathur wrote: > Hi, > > So we were deploying our Flink clusters on YARN earlier but then we moved > to Kubernetes, but then our clusters were not this big. Have you

Re: Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-26 Thread Prakhar Mathur
Hi, We were deploying our Flink clusters on YARN earlier and then moved to Kubernetes, but back then our clusters were not this big. Have you guys seen issues with the JobManager REST server becoming unresponsive on Kubernetes before? On Fri, Jul 26, 2019, 14:28 Biao Liu wrote: > Hi Prakhar, > >

Re: Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-26 Thread Biao Liu
Hi Prakhar, Sorry, I could not find anything abnormal in your GC log or stack trace. Have you tried deploying the cluster some other way, not on Kubernetes, such as on YARN or standalone, just to narrow down the scope? On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur wrote: > > On

Re: Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-22 Thread Prakhar Mathur
On Mon, Jul 22, 2019, 16:08 Prakhar Mathur wrote: > Hi, > > We enabled GC logging, here are the logs > > [GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)] > 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00, > real=0.06 secs] > [GC (Allocation Failure)

Re: Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-18 Thread Biao Liu
Hi, Your GC metrics look fine. You could double-check the GC log if you enable it; the GC log is more direct. I'm not sure what's happening in your JobManager, but I'm pretty sure that Flink can support far larger clusters than yours. Have you ever checked the log

Re: Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-18 Thread Biao Liu
Hi Prakhar, Have you checked the garbage collection of the master? Which version of Flink are you using? How many TaskManagers are in your cluster? On Thu, Jul 18, 2019 at 1:54 PM Prakhar Mathur wrote: > Hello, > > We have deployed multiple Flink clusters on Kubernetes with 1 replica of > Jobmanager and

Job Manager becomes irresponsive if the size of the session cluster grows

2019-07-17 Thread Prakhar Mathur
Hello, We have deployed multiple Flink clusters on Kubernetes, each with 1 JobManager replica and as many TaskManager replicas as required. Recently we have observed that when we increase the number of TaskManagers for a cluster, the JobManager becomes unresponsive. It stops sending statsd