In 1.5 the latency metric was changed to be reported at the job level,
which is why you now see it under /jobs/.../metrics, but not in 1.4.
In 1.4 you would see something similar under /jobs/.../vertices/.../metrics, for each vertex.

Additionally, it is now a proper histogram, which significantly increases the number of accesses to the ConcurrentHashMaps that store metrics for the UI. It could be that this code is simply too slow for this number of metrics.
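If the latency histograms turn out to be the culprit, one possible mitigation (a sketch only; please check the ExecutionConfig Javadoc for your exact Flink version, as defaults have changed across releases) is to turn off latency tracking for the job:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DisableLatencyTracking {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Setting the interval to a non-positive value should stop the
        // periodic latency markers, and with them the per-(source, operator,
        // subtask) latency histograms that the MetricStore has to hold.
        env.getConfig().setLatencyTrackingInterval(-1L);
        // ... build and execute the job as usual ...
    }
}
```

For a Beam pipeline on the Flink runner you would need the equivalent pipeline/runner option rather than calling this directly.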

On 23.08.2018 19:06, Jozef Vilcek wrote:
parallelism is 100. I tried clusters with 1 and 2 slots per TM, yielding
100 or 50 TMs in the cluster.

I did notice that the URL  http://jobmanager:port/jobs/job_id/metrics  in 1.5.x
returns a huge list of "latency.source_id. ...." IDs. A heap dump shows that
the hash map takes 1.6GB for me. I am guessing that is the one the dispatcher
threads keep updating. Not sure what those are. In 1.4.0 that URL returns
something else, a very short list.
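The size of that list is roughly what you would expect if Flink 1.5 registers one latency histogram per (source, operator, subtask) combination (an assumption about the naming scheme; the exact cardinality may differ, and the operator count below is a hypothetical example value, not taken from the actual job). A quick back-of-the-envelope sketch:

```java
public class LatencyMetricEstimate {
    // Rough cardinality estimate, assuming one latency histogram per
    // (source, operator, subtask) combination.
    static long estimate(int sources, int operators, int parallelism) {
        return (long) sources * operators * parallelism;
    }

    public static void main(String[] args) {
        // With 1 source, 21 downstream operators and parallelism 100 this
        // already yields 2100 histograms, each exposing several
        // percentile/min/max fields in the metric store.
        System.out.println(estimate(1, 21, 100));
    }
}
```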

On Thu, Aug 23, 2018 at 6:44 PM Piotr Nowojski <pi...@data-artisans.com>
wrote:

Hi,

How many task slots do you have in the cluster and per machine, and what
parallelism are you using?

Piotrek

On 23 Aug 2018, at 16:21, Jozef Vilcek <jozo.vil...@gmail.com> wrote:

Yes, with smaller data, and therefore smaller resources and parallelism,
the exact same job runs fine.

On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljos...@apache.org>
wrote:
Hi,

So with Flink 1.5.3 but a smaller parallelism the job works fine?

Best,
Aljoscha

On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vil...@gmail.com> wrote:

Hello,

I am trying to get my Beam application to run on a newer version of Flink
(1.5.3), but am having trouble with that. When I submit the application,
everything works fine, but after a few minutes (as soon as 2 minutes after
job start) the cluster just goes bad. Logs are full of timeouts for
heartbeats, JobManager lost leadership, TaskExecutor timed out, etc.

At that time the WebUI is also unusable. Looking into the job manager, I
noticed that all of the "flink-akka.actor.default-dispatcher" threads are
busy or blocked. Most blocks are on metrics:

=======================================
java.lang.Thread.State: BLOCKED (on object monitor)
       at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
       - waiting to lock <0x000000053df75510> (a org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
       at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
       at org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown Source)
       at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
       ...
=======================================

I tried increasing memory, as the MetricStore seems to hold quite a lot of
stuff, but it is not helping. On 1.4.0 the job manager was running with a
4GB heap; now this behaviour also occurs with 10GB.

Any suggestions?

Best,
Jozef

P.S.: The Beam app has this problem in a setup with parallelism 100, 100
task slots, 2100 running tasks, streaming mode. A smaller job runs without
problems.