Counter metrics for Prometheus having unexpected gaps in Grafana

2021-01-19 Thread Manish G
Hi All, I am facing an issue with counter metrics I have added to a flatmap function. My application is deployed in Kubernetes, and hence the Prometheus metrics it generates have the pod ID as one of their labels. Now if a pod dies and a new pod comes up, we get a brand new metric starting from 0. As a resul
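One way to reason about a counter that restarts from 0 when a pod is replaced is to sum only the increments and treat any drop as a reset, which is the general idea behind reset-aware query functions such as Prometheus's `increase()`. The following is a minimal Python sketch of that idea, assuming the per-pod series are viewed as one logical counter; the function name `total_increase` and the sample values are hypothetical, and this is not Prometheus's actual implementation.

```python
from typing import Optional, Sequence


def total_increase(samples: Sequence[float]) -> float:
    """Sum the increases of a counter series, treating any drop as a reset to 0."""
    total = 0.0
    previous: Optional[float] = None
    for value in samples:
        if previous is not None:
            if value >= previous:
                # Normal case: the counter moved forward.
                total += value - previous
            else:
                # The counter went down: assume a restart from 0,
                # so everything counted since the restart is new.
                total += value
        previous = value
    return total


# Hypothetical samples: the old pod counts up to 10, then a new pod starts at 0.
print(total_increase([0, 5, 10, 0, 2, 4]))  # 14.0
```

The sketch only shows why a restart does not have to lose the accumulated total; whether it applies depends on how the series are labelled and aggregated in the dashboard query.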

Re: Restoring from a checkpoint or savepoint on a different Kafka consumer group

2021-01-19 Thread Piotr Nowojski
Hi Rex, Sorry, I might have misled you. I think you were right in your previous email > So from the sounds of things, regardless of the consumer group's offsets, it will always start from a checkpoint or savepoints offsets if there are some (unless checkpointing offsets is turned off). > > Is thi

Re: Why do we need to use synchronized(checkpointLock) in SourceFunction.run ?

2021-01-19 Thread Piotr Nowojski
No problem :) Piotrek On Wed, 20 Jan 2021 at 02:12, Kazunori Shinhira wrote: > Hi, > > > Thank you for your explanation. > > I now understand the need for checkpoint lock :) > > > > Best, > > On Tue, 19 Jan 2021 at 18:00, Piotr Nowojski wrote: > >> Hi, >> >> yes exactly :) >> >> > As a result, Source may sav

Re: Flink Jobmanager HA deployment on k8s

2021-01-19 Thread Amit Bhatia
Hi Yang, I tried the deployment of Flink with three replicas of JobManager to test a faster job recovery scenario. Below is my deployment: $ kubectl get po -namit | grep zk eric-data-coordinator-zk-0 1/1 Running 0 6d21h eric-data-coor

Re: Flink Jobmanager HA deployment on k8s

2021-01-19 Thread Chirag Dewan
Hi, Can we have multiple replicas with ZK HA in K8s as well? In this case, how do Task Managers and clients recover the Job Manager RPC address? Are they updated in ZK? Also, since there are 3 replicas behind the same service endpoint and only one of them is the leader, how should clients reach

Re: What is checkpoint start delay?

2021-01-19 Thread Rex Fenley
Ok, this makes sense. I'm guessing loading state from S3 into RocksDB is a large contributor to start delay then. Thanks! On Tue, Jan 19, 2021 at 12:16 PM Piotr Nowojski wrote: > Hi Rex, > > start delay is not the same as the alignment time. Start delay is the time > between creation of the che

Flink upgrade to Flink-1.12

2021-01-19 Thread ??????
Hi all, As flink doc says: https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/upgrading.html#preconditions We do not support migration for state in RocksDB that was checkpointed using semi-asynchronous mode. In case your old job was using this mode, you can still change your job

Re: Flink Jobmanager HA deployment on k8s

2021-01-19 Thread Yang Wang
If you do not want to run multiple JobManagers simultaneously, then I think the "Job" for application cluster with HA enabled is enough. K8s will also launch a new pod/container when the old one terminates exceptionally. Best, Yang On Wed, Jan 20, 2021 at 10:08 AM, Yang Wang wrote: > Yes. Using a "Deployment" i

Re: Flink Jobmanager HA deployment on k8s

2021-01-19 Thread Yang Wang
Yes. Using a "Deployment" instead of "Job" for the application cluster also makes sense. Actually, in the native K8s integration, we always use the deployment for JobManager. But please note that the deployment may relaunch the JobManager pod even though you cancel the Flink job. Best, Yang Ashi

Re: Why do we need to use synchronized(checkpointLock) in SourceFunction.run ?

2021-01-19 Thread Kazunori Shinhira
Hi, Thank you for your explanation. I now understand the need for checkpoint lock :) Best, On Tue, 19 Jan 2021 at 18:00, Piotr Nowojski wrote: > Hi, > > yes exactly :) > > > As a result, Source may save a wrong offset and lose records if job > recreation occurs at that timing. > > This is just one of the po

Trying to create a generic aggregate UDF

2021-01-19 Thread Dylan Forciea
I am attempting to create an aggregate UDF that takes a generic parameter T, but for the life of me, I can’t seem to get it to work. The UDF I’m trying to implement takes two input arguments, a value that is generic, and a date. It will choose the non-null value with the latest associated date.
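The aggregation described here (keep the non-null value with the latest associated date) can be sketched independently of any UDF framework. The following is a minimal plain-Python sketch of the accumulator logic only; the names `LatestAcc`, `accumulate`, `merge`, and `get_value` are hypothetical and are not the Flink or PyFlink UDF API, and the sketch says nothing about how Flink resolves the generic type.

```python
from dataclasses import dataclass
from datetime import date
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class LatestAcc(Generic[T]):
    """Accumulator holding the latest non-null value seen so far and its date."""
    value: Optional[T] = None
    as_of: Optional[date] = None


def accumulate(acc: LatestAcc[T], value: Optional[T], as_of: Optional[date]) -> None:
    """Keep `value` only if it is non-null and strictly newer than the stored one."""
    if value is None or as_of is None:
        return
    if acc.as_of is None or as_of > acc.as_of:
        acc.value, acc.as_of = value, as_of


def merge(acc: LatestAcc[T], other: LatestAcc[T]) -> None:
    """Fold another partial accumulator into this one."""
    accumulate(acc, other.value, other.as_of)


def get_value(acc: LatestAcc[T]) -> Optional[T]:
    return acc.value


acc: LatestAcc[str] = LatestAcc()
accumulate(acc, "a", date(2021, 1, 1))
accumulate(acc, None, date(2021, 1, 5))  # ignored: null value
accumulate(acc, "b", date(2021, 1, 3))
print(get_value(acc))  # b
```

On equal dates this sketch keeps the value seen first; a real implementation would need to pick a tie-breaking rule deliberately.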

question about timers

2021-01-19 Thread Marco Villalobos
If there are timers that have been checkpointed (we use rocksdb), and the system goes down, and then the system goes back up after the timers should have fired, do those timers that were skipped still fire, even though we are past that time? For example, if the current time is 1:00 p.m.

Re: Flink Jobmanager HA deployment on k8s

2021-01-19 Thread Ashish Nigam
Yang, For Application clusters, does it make sense to deploy JobManager as "Deployment" rather than as a "Job", as suggested in docs? I am asking this because I am thinking of deploying a job manager in HA mode even for application clusters. Thanks Ashish On Tue, Jan 19, 2021 at 6:16 AM Yang Wan

Re: What is checkpoint start delay?

2021-01-19 Thread Piotr Nowojski
Hi Rex, start delay is not the same as the alignment time. Start delay is the time between creation of the checkpoint barrier and the time a task/subtask sees a first checkpoint barrier from any of its inputs. Alignment time is the time between receiving the first checkpoint barrier on a given sub
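To make the distinction concrete, both quantities can be computed from three kinds of timestamps for a single subtask. This is a minimal Python sketch with hypothetical names and numbers; it assumes that alignment ends when the barrier has arrived on the last input, which is an assumption on my part because the definition in the message above is cut off.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SubtaskCheckpointTimes:
    """Hypothetical timestamps (in ms) for one subtask and one checkpoint."""
    barrier_created_ms: int              # when the checkpoint barrier was created
    barrier_arrivals_ms: Sequence[int]   # barrier arrival time on each input

    @property
    def start_delay_ms(self) -> int:
        # Time from barrier creation until the subtask sees the first barrier.
        return min(self.barrier_arrivals_ms) - self.barrier_created_ms

    @property
    def alignment_ms(self) -> int:
        # Assumed: time from the first barrier until the last input's barrier.
        return max(self.barrier_arrivals_ms) - min(self.barrier_arrivals_ms)


times = SubtaskCheckpointTimes(barrier_created_ms=1_000,
                               barrier_arrivals_ms=[4_000, 4_200, 4_900])
print(times.start_delay_ms, times.alignment_ms)  # 3000 900
```

In this made-up example the barrier takes 3 s to reach the subtask (start delay) but the inputs are aligned within 0.9 s, so a long start delay does not by itself imply a long alignment.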

Re: Restoring from a checkpoint or savepoint on a different Kafka consumer group

2021-01-19 Thread Rex Fenley
Thank you, That's unfortunate, because I imagine we often will want to duplicate a job in order to do some testing out-of-band from the normal job while slightly tweaking / tuning things. Is there any way to transfer offsets between consumer groups? On Tue, Jan 19, 2021 at 5:45 AM Piotr Nowojski

Re: Why use ListView?

2021-01-19 Thread Rex Fenley
Thanks! On Tue, Jan 19, 2021 at 12:55 AM Timo Walther wrote: > As always, this depends on the use case ;-) > > In general, you should not get a performance regression in using them. > But keep in mind that ListViews/MapViews cannot be backed by a state > backend in every operator, so sometimes t

Re: What is checkpoint start delay?

2021-01-19 Thread Rex Fenley
Thanks for the input. This seems odd though, if start delay is the same as alignment then (1) why is it only ever prominent right after recovering from a checkpoint? (2) Why is the first checkpoint during the recovery process 10x as long as every other checkpoint? Something else must be going

Re: What is checkpoint start delay?

2021-01-19 Thread Piotr Nowojski
Hey Rex, What do you mean by "Start Delay" when recovering from a checkpoint? Did you mean when taking a checkpoint? If so: 1. https://www.google.com/search?q=flink+checkpoint+start+delay 2. top 3 result (at least for me) https://ci.apache.org/projects/flink/flink-docs-stable/ops/monitoring/check

Re: Flink Jobmanager HA deployment on k8s

2021-01-19 Thread Yang Wang
Usually, you do not need to start multiple JobManagers simultaneously. The JobManager is a deployment. A new pod/container will be launched once the old one terminates exceptionally. If you still want to start multiple JobManagers to get a faster recovery, you could set the replica greater than 1 for st

Re: Restoring from a checkpoint or savepoint on a different Kafka consumer group

2021-01-19 Thread Piotr Nowojski
Hi, > I read this as, "The offsets committed to Kafka are ignored, the offsets committed within a checkpoint are used". yes, exactly > So from the sounds of things, regardless of the consumer group's offsets, it will always start from a checkpoint or savepoints offsets if there are some (unless
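The precedence described in this exchange (offsets stored in a checkpoint or savepoint win over whatever the consumer group has committed to Kafka) can be summarised in a few lines. This is a minimal Python sketch of the decision rule only, with hypothetical names and partition keys; it is not the Flink Kafka connector's code and it ignores cases such as newly discovered partitions.

```python
from typing import Dict, Mapping, Optional


def resolve_start_offsets(
    restored: Optional[Mapping[str, int]],
    committed: Mapping[str, int],
) -> Dict[str, int]:
    """Pick start offsets: restored checkpoint/savepoint state wins if present."""
    if restored:
        # Restoring from state: the group's committed offsets are ignored.
        return dict(restored)
    # Fresh start without state: fall back to the group's committed offsets.
    return dict(committed)


checkpointed = {"topic-0": 120, "topic-1": 98}
group_offsets = {"topic-0": 500, "topic-1": 500}
print(resolve_start_offsets(checkpointed, group_offsets))
# {'topic-0': 120, 'topic-1': 98}
```

Under this rule, changing the consumer group name on a restored job does not change where it starts reading, which matches the behaviour confirmed above.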

Flink Jobmanager HA deployment on k8s

2021-01-19 Thread Amit Bhatia
Hi, I am deploying Flink 1.12 on K8s. Can anyone confirm whether we can deploy multiple JobManager pods in K8s for HA, or whether it should always be only a single JobManager pod? Regards, Amit Bhatia

Re: [ANNOUNCE] Apache Flink 1.12.1 released

2021-01-19 Thread Wei Zhong
Thanks Xintong for the great work! Best, Wei > On Jan 19, 2021, at 18:00, Guowei Ma wrote: > > Thanks Xintong's effort! > Best, > Guowei > > > On Tue, Jan 19, 2021 at 5:37 PM Yangze Guo > wrote: > Thanks Xintong for the great work! > > Best, > Yangze Guo > > On Tue, Jan 19,

AW: Pyflink Join with versioned view / table

2021-01-19 Thread Barth, Torben
Hi Leonard, I have just realized that the "last_value" operator does not work since it produces updates if the right side changes. I just need the current state at the moment I receive a message on the left side. It is indeed a lookup which I want to perform and not a real join. Since my topic onl

Test failed in flink-end-to-end-tests/flink-end-to-end-tests-common-kafka

2021-01-19 Thread Smile@LETTers
Hi, I got an error when I tried to compile & package Flink (version 1.12 & current master). It can be reproduced by running 'mvn clean test' under flink-end-to-end-tests/flink-end-to-end-tests-common-kafka. It seems that a necessary dependency for test scope was missing and some classes can not be

Re: [ANNOUNCE] Apache Flink 1.12.1 released

2021-01-19 Thread Guowei Ma
Thanks Xintong's effort! Best, Guowei On Tue, Jan 19, 2021 at 5:37 PM Yangze Guo wrote: > Thanks Xintong for the great work! > > Best, > Yangze Guo > > On Tue, Jan 19, 2021 at 4:47 PM Till Rohrmann > wrote: > > > > Thanks a lot for driving this release Xintong. This was indeed a release > with

Re: [ANNOUNCE] Apache Flink 1.12.1 released

2021-01-19 Thread Yangze Guo
Thanks Xintong for the great work! Best, Yangze Guo On Tue, Jan 19, 2021 at 4:47 PM Till Rohrmann wrote: > > Thanks a lot for driving this release Xintong. This was indeed a release with > some obstacles to overcome and you did it very well! > > Cheers, > Till > > On Tue, Jan 19, 2021 at 5:59 A

Re: Pyflink Join with versioned view / table

2021-01-19 Thread Leonard Xu
Hi Torben, Happy to hear you addressed your problem. The first option can resolve the situation where some partitions of the Kafka topic did not receive data, but if all partitions didn’t receive data, the watermark won’t be pushed forward, and the temporal join won’t be triggered. Otherwise,

Re: Declaring and using ROW data type in PyFlink DataStream

2021-01-19 Thread Xingbo Huang
Hi meneldor, Yes. As the first version of Python DataStream, release-1.12 has not yet covered all scenarios. In release-1.13, we will extend the function of Python DataStream to cover most scenarios, and CoProcessFunction will obviously be in it. Best, Xingbo On Tue, Jan 19, 2021 at 4:52 PM, meneldor wrote: >

AW: Pyflink Join with versioned view / table

2021-01-19 Thread Barth, Torben
Hi Leonard, thanks for your answer. My data source is Kafka so I cannot use the second option. The first option is unfortunately not working. I introduced the parameter but the updates are still only triggered by a change on the right side. As a workaround I use the last_value operator right n

Re: Why do we need to use synchronized(checkpointLock) in SourceFunction.run ?

2021-01-19 Thread Piotr Nowojski
Hi, yes exactly :) > As a result, Source may save a wrong offset and lose records if job recreation occurs at that timing. This is just one of the possible race conditions that could happen. As offsets are probably 64 bit integers, I'm pretty sure corrupted writes/reads can also happen, when only h
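The race discussed in this thread comes down to two steps that must be seen together: emitting a record and advancing the offset that a checkpoint will save. The following is a minimal Python analogue using `threading.Lock` in place of Flink's checkpoint lock; the class `CountingSource` and its methods are hypothetical and only illustrate the locking pattern, not the `SourceFunction` API.

```python
import threading
from typing import List, Tuple


class CountingSource:
    """Toy source: emits its own offset and advances it, both under one lock."""

    def __init__(self) -> None:
        self.checkpoint_lock = threading.Lock()
        self.offset = 0
        self.emitted: List[int] = []

    def emit_next(self) -> None:
        # Emit and advance atomically with respect to snapshot().
        with self.checkpoint_lock:
            self.emitted.append(self.offset)
            self.offset += 1

    def snapshot(self) -> Tuple[int, int]:
        # A snapshot taken under the same lock never sees a record that was
        # emitted without its offset update, or the other way around.
        with self.checkpoint_lock:
            return self.offset, len(self.emitted)


source = CountingSource()
worker = threading.Thread(target=lambda: [source.emit_next() for _ in range(1000)])
worker.start()
snapshots = [source.snapshot() for _ in range(100)]
worker.join()
print(all(offset == count for offset, count in snapshots))  # True
```

Without the lock in `emit_next`, a snapshot could land between the append and the increment and record an offset that does not match what was emitted, which on restore would mean a duplicated or lost record.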

Re: Restoring from a savepoint, constraining factors

2021-01-19 Thread Piotr Nowojski
Hi, Savepoints internally work in the exact same way as checkpoints. I'm not sure what you are referring to as a repartitioning step. If you mean rescaling (changing the parallelism), then this can happen both for the checkpoints and savepoints. You can find answers to a lot of such questions by

Re: Why use ListView?

2021-01-19 Thread Timo Walther
As always, this depends on the use case ;-) In general, you should not get a performance regression in using them. But keep in mind that ListViews/MapViews cannot be backed by a state backend in every operator, so sometimes they are represented as List/Maps on heap. Regards, Timo On 18.01.

Re: Declaring and using ROW data type in PyFlink DataStream

2021-01-19 Thread meneldor
Thank you Xingbo! Do you plan to implement CoProcess functions too? Right now I can't find a convenient method to connect and merge two streams. Regards On Tue, Jan 19, 2021 at 4:16 AM Xingbo Huang wrote: > Hi meneldor, > > 1. Yes. Although release 1.12.1 has not been officially released, it is

Re: Flink Application cluster/standalone job: some JVM Options added to Program Arguments

2021-01-19 Thread Matthias Pohl
You're right. Thinking about it and looking through the code, I agree: The dynamic properties shouldn't be exposed in the main method. I was able to reproduce the described behavior. I created FLINK-21024 covering this. Thanks for reporting this issue, Alexey. Best, Matthias [1] https://issues.a

Re: [ANNOUNCE] Apache Flink 1.12.1 released

2021-01-19 Thread Till Rohrmann
Thanks a lot for driving this release Xintong. This was indeed a release with some obstacles to overcome and you did it very well! Cheers, Till On Tue, Jan 19, 2021 at 5:59 AM Xingbo Huang wrote: > Thanks Xintong for the great work! > > Best, > Xingbo > > On Tue, Jan 19, 2021 at 12:51 PM, Peter Huang wrote: >