For my small job, I see ~24k of those latency metrics at '/jobs/.../metrics'. That job runs with much smaller parallelism than our production jobs.

Are there any options here? Can it be turned off, the histogram metrics reduced, the update frequency lowered, ...? Also, keeping the names flat seems to use quite some memory on the JM:

{"id":"latency.source_id.2f6436c1f4f0c70e401663acf945a822.source_subtask_index.2.operator_id.4060d9664a78e1d82671ac80921843cd.operator_subtask_index.1.latency_stddev"}
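For now I am thinking of turning the tracking down (or off) from the job itself. A minimal sketch of what I have in mind, assuming the 1.5 ExecutionConfig still treats a non-positive latency tracking interval as "disabled" (for the Beam job I would still have to find where the Flink runner exposes this):

=======================================
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NoLatencyTracking {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // My assumption: an interval <= 0 disables the latency markers
        // (and with them the per-source/per-operator histograms) entirely.
        env.getConfig().setLatencyTrackingInterval(-1L);

        // Alternatively, keep the metric but reduce the update frequency,
        // e.g. emit latency markers only once per minute:
        // env.getConfig().setLatencyTrackingInterval(60_000L);

        // ... build and execute the pipeline as usual ...
        env.execute("no-latency-tracking");
    }
}
=======================================

If I read it right, lowering the interval would only reduce the update rate; the ~24k metric names would still be registered, so only disabling it would help the JM memory.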
On Fri, Aug 24, 2018 at 10:08 AM Chesnay Schepler <ches...@apache.org> wrote:

> In 1.5 the latency metric was changed to be reported on the job-level,
> that's why you see it under /jobs/.../metrics now, but not in 1.4.
> In 1.4 you would see something similar under
> /jobs/.../vertices/.../metrics, for each vertex.
>
> Additionally it is now a proper histogram, which significantly increases
> the number of accesses to the ConcurrentHashMaps that store metrics for
> the UI. It could be that this code is just too slow for the amount of
> metrics.
>
> On 23.08.2018 19:06, Jozef Vilcek wrote:
> > parallelism is 100. I tried clusters with 1 and 2 slots per TM yielding
> > 100 or 50 TMs in the cluster.
> >
> > I did notice that the URL http://jobmanager:port/jobs/job_id/metrics in
> > 1.5.x returns a huge list of "latency.source_id. ...." IDs. A heap dump
> > shows that this hash map takes 1.6GB for me. I am guessing that is the
> > one the dispatcher threads keep updating. Not sure what those are. In
> > 1.4.0 that URL returns something else, a very short list.
> >
> > On Thu, Aug 23, 2018 at 6:44 PM Piotr Nowojski <pi...@data-artisans.com>
> > wrote:
> >
> >> Hi,
> >>
> >> How many task slots do you have in the cluster and per machine, and what
> >> parallelism are you using?
> >>
> >> Piotrek
> >>
> >>> On 23 Aug 2018, at 16:21, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
> >>>
> >>> Yes, on smaller data, and therefore with smaller resources and
> >>> parallelism, exactly the same job runs fine.
> >>>
> >>> On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek <aljos...@apache.org>
> >>> wrote:
> >>>> Hi,
> >>>>
> >>>> So with Flink 1.5.3 but a smaller parallelism the job works fine?
> >>>>
> >>>> Best,
> >>>> Aljoscha
> >>>>
> >>>>> On 23. Aug 2018, at 15:25, Jozef Vilcek <jozo.vil...@gmail.com> wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> I am trying to run my Beam application on a newer version of Flink
> >>>>> (1.5.3) but am having trouble with that. When I submit the
> >>>>> application, everything works fine, but after a few minutes (as soon
> >>>>> as 2 minutes after job start) the cluster just goes bad. Logs are
> >>>>> full of timeouts for heartbeats, JobManager lost leadership,
> >>>>> TaskExecutor timed out, etc.
> >>>>>
> >>>>> At that time the WebUI is also not usable. Looking into the job
> >>>>> manager, I did notice that all of the
> >>>>> "flink-akka.actor.default-dispatcher" threads are busy or blocked.
> >>>>> Most blocks are on metrics:
> >>>>>
> >>>>> =======================================
> >>>>> java.lang.Thread.State: BLOCKED (on object monitor)
> >>>>> at
> >>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore.addAll(MetricStore.java:84)
> >>>>> - waiting to lock <0x000000053df75510> (a
> >>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricStore)
> >>>>> at
> >>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher.lambda$queryMetrics$5(MetricFetcher.java:205)
> >>>>> at
> >>>>> org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher$$Lambda$201/995076607.accept(Unknown
> >>>>> Source)
> >>>>> at
> >>>>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> >>>>> ...
> >>>>> =======================================
> >>>>>
> >>>>> I tried to increase memory, as the MetricStore seems to hold quite a
> >>>>> lot of stuff, but it is not helping. On 1.4.0 the job manager was
> >>>>> running with a 4GB heap; now this behaviour also occurs with 10GB.
> >>>>>
> >>>>> Any suggestions?
> >>>>>
> >>>>> Best,
> >>>>> Jozef
> >>>>>
> >>>>> P.S.: The Beam app has this problem in a setup with 100 parallelism,
> >>>>> 100 task slots, 2100 running tasks, streaming mode. A smaller job
> >>>>> runs without problems.