Hi,

So with Flink 1.5.3 but a smaller parallelism, the job works fine?
Best,
Aljoscha

> On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
>
> Hello,
>
> I am trying to run my Beam application on a newer version of Flink
> (1.5.3) but am having trouble with that. When I submit the application,
> everything works fine, but after a few minutes (as soon as 2 minutes after
> job start) the cluster just goes bad. Logs are full of heartbeat timeouts,
> "JobManager lost leadership", "TaskExecutor timed out", etc.
>
> At that time, the WebUI is also unusable. Looking into the job manager, I
> noticed that all of the "flink-akka.actor.default-dispatcher" threads are
> busy or blocked. Most blocks are on metrics:
>
> =======================================
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
> - waiting to lock <0x000000053df75510> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
> at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
> at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown Source)
> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> ...
> =======================================
>
> I tried to increase memory, as the MetricStore seems to hold quite a lot
> of stuff, but it is not helping. On 1.4.0 the job manager was running with
> a 4 GB heap; now this behaviour also occurs with 10 GB.
>
> Any suggestions?
>
> Best,
> Jozef
>
> P.S.: The executed Beam app hits this problem in a setup with 100
> parallelism, 100 task slots, 2100 running tasks, streaming mode. A smaller
> job runs without problems.
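For context, the quoted thread dump shows dispatcher threads BLOCKED while waiting for the MetricStore's intrinsic lock: one thread holds the monitor inside a synchronized method, and every other caller queues behind it. A minimal standalone sketch of that contention pattern (the `SlowStore` class below is a hypothetical stand-in, not Flink code; the 50 ms sleep just simulates a slow update while the monitor is held):

```java
import java.util.ArrayList;
import java.util.List;

public class LockContentionDemo {
    // Hypothetical stand-in for MetricStore: a single synchronized method
    // serializes all callers on the object's monitor.
    static class SlowStore {
        private final List<String> entries = new ArrayList<>();

        synchronized void addAll(List<String> batch) {
            entries.addAll(batch);
            try {
                Thread.sleep(50); // simulate slow work while holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        synchronized int size() {
            return entries.size();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SlowStore store = new SlowStore();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            Thread t = new Thread(() -> store.addAll(List.of("metric")));
            threads.add(t);
            t.start();
        }
        // While one thread is inside addAll, the rest report BLOCKED,
        // matching the "waiting to lock" lines in the job manager dump.
        Thread.sleep(25);
        long blocked = threads.stream()
                .filter(t -> t.getState() == Thread.State.BLOCKED)
                .count();
        System.out.println("blocked threads: " + blocked);
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("entries: " + store.size()); // prints "entries: 8"
        assert store.size() == 8;
    }
}
```

With many REST/metric-fetcher callers hitting one such monitor, the dispatcher pool drains the same way the dump shows, regardless of heap size, which is consistent with the observation that going from 4 GB to 10 GB did not help.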