Hi, How many task slots do you have in the cluster and per machine, and what parallelism are you using?
Piotrek > On 23 Aug 2018, at 16:21, Jozef Vilcek <jozo.vil...@gmail.com> wrote: > > Yes, on smaller data and therefore smaller resources and parallelism > exactly same job runs fine > > On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljos...@apache.org> wrote: > >> Hi, >> >> So with Flink 1.5.3 but a smaller parallelism the job works fine? >> >> Best, >> Aljoscha >> >>> On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vil...@gmail.com> wrote: >>> >>> Hello, >>> >>> I am trying to get my Beam application (run on newer version of Flink >>> (1.5.3) but having trouble with that. When I submit application, >> everything >>> works fine but after a few mins (as soon as 2 minutes after job start) >>> cluster just goes bad. Logs are full of timeouts for heartbeats, >> JobManager >>> lost leadership, TaskExecutor timed out etc. >>> >>> At that time, also WebUI is not usable. Looking into job manager, I did >>> notice that all of "flink-akka.actor.default-dispatcher" threads are busy >>> or blocked. Most blocks are on metrics: >>> >>> ======================================= >>> java.lang.Thread.State: BLOCKED (on object monitor) >>> at >>> >> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84) >>> - waiting to lock <0x000000053df75510> (a >>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore) >>> at >>> >> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205) >>> at >>> >> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown >>> Source) >>> at >>> >> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) >>> ... >>> ======================================= >>> >>> I tried to increase memory, as MetricStore seems to hold quite a lot >> stuff, >>> but it is not helping. On 1.4.0 job manager was running with 4GB heap, >> now, >>> this behaviour also occur with 10G. >>> >>> Any suggestions? >>> >>> Best, >>> Jozef >>> >>> P.S.: Executed Beam app has problem in setup with 100 parallelism, 100 >> task >>> slots, 2100 running task, streaming mode. Smaller job runs without >> problem >> >>