Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-27 Thread Chesnay Schepler
There are a few separate issues in here that we should tackle/investigate in parallel. Improve storage of latency metrics Given how absurdly the number of latency metrics scale with # of operators / parallelism it makes sense to introduce a special case here. I'm not quite sure yet ho

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-24 Thread Jozef Vilcek
With `latencyTrackingInterval` set to `0` cluster runs fine. So, is this something which make sense to be improved? JIRA I can track or file one? On Fri, Aug 24, 2018 at 11:50 AM Chesnay Schepler wrote: > I believe the only thing you can do is disable latency tracking, by > setting the `latencyT

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-24 Thread Chesnay Schepler
I believe the only thing you can do is disable latency tracking, by setting the `latencyTrackingInterval` in `env.getExecutionConfig()` to 0 or a negative value. The update frequency is not configurable and currently set to 10 seconds. Latency metrics are tracked as the cross-product of all su

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-24 Thread Jozef Vilcek
For my small job, I see ~24k those latency metrics @ '/jobs/.../metrics'. That job is much smaller in terms of production parallelism. Are there any options here. Can it be turned off, reduced histogram metrics, reduced update frequency, ... ? Also, keeping it flat seems to use quite some memory o

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-24 Thread Chesnay Schepler
In 1.5 the latency metric was changed to be reported on the job-level, that's why you see it under /jobs/.../metrics now, but not in 1.4. In 1.4 you would see something similar under /jobs/.../vertices/.../metrics, for each vertex. Additionally it is now a proper histogram, which significantly

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-23 Thread Jozef Vilcek
parallelism is 100. I tried clusters with 1 and 2 slots per TM yielding 100 or 50 TMs in cluster. I did notice that URL http://jobmanager:port/jobs/job_id/metrics in 1.5.x returns huge list of "latency.source_id. " IDs. Heap dump shows that hash map takes 1.6GB for me. I am guessing that is

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-23 Thread Piotr Nowojski
Hi, How many task slots do you have in the cluster and per machine, and what parallelism are you using? Piotrek > On 23 Aug 2018, at 16:21, Jozef Vilcek wrote: > > Yes, on smaller data and therefore smaller resources and parallelism > exactly same job runs fine > > On Thu, Aug 23, 2018, 16:1

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-23 Thread Jozef Vilcek
Yes, on smaller data and therefore smaller resources and parallelism exactly same job runs fine On Thu, Aug 23, 2018, 16:11 Aljoscha Krettek wrote: > Hi, > > So with Flink 1.5.3 but a smaller parallelism the job works fine? > > Best, > Aljoscha > > > On 23. Aug 2018, at 15:25, Jozef Vilcek wrot

Re: Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-23 Thread Aljoscha Krettek
Hi, So with Flink 1.5.3 but a smaller parallelism the job works fine? Best, Aljoscha > On 23. Aug 2018, at 15:25, Jozef Vilcek wrote: > > Hello, > > I am trying to get my Beam application (run on newer version of Flink > (1.5.3) but having trouble with that. When I submit application, everyth

Flink cluster crashing going from 1.4.0 -> 1.5.3

2018-08-23 Thread Jozef Vilcek
Hello, I am trying to get my Beam application (run on newer version of Flink (1.5.3) but having trouble with that. When I submit application, everything works fine but after a few mins (as soon as 2 minutes after job start) cluster just goes bad. Logs are full of timeouts for heartbeats, JobManage