unaligned checkpoint for job with large start delay

2021-12-14 Thread Mason Chen
Hi all, I'm using Flink 1.13 and my job is experiencing a high start delay, more so than a high alignment time (our FLIP-27 Kafka source is heavily backpressured). Since our alignment timeout is set to 1s, the unaligned checkpoint never triggers, because the alignment delay is always below the threshold. I

Re: unaligned checkpoint for job with large start delay

2021-12-16 Thread Piotr Nowojski
Hi Mason, In Flink 1.14 we have also changed the timeout behavior from checking against the alignment duration to simply checking how old the checkpoint barrier is (so it also accounts for the start delay) [1]. This was done to solve problems like the one you are describing. Unfortunately we ca
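The difference between the two timeout rules described above can be sketched in plain Java. This is only an illustrative model, not Flink code: the class and method names are hypothetical, the sample durations are made up, and treating the barrier age as "start delay plus alignment duration" is a simplifying assumption.

```java
import java.time.Duration;

// Illustrative model only; none of these names exist in Flink.
final class AlignmentTimeoutSketch {

    // Rule described for Flink 1.13: only the time spent aligning is compared to the timeout.
    // A timeout of zero is modelled as "always start unaligned".
    static boolean switchesByAlignmentDuration(Duration alignmentDuration, Duration timeout) {
        return timeout.isZero() || alignmentDuration.compareTo(timeout) > 0;
    }

    // Rule described for Flink 1.14: the age of the barrier is compared to the timeout.
    // Assumption: barrier age is approximated as start delay + alignment duration.
    static boolean switchesByBarrierAge(Duration startDelay, Duration alignmentDuration, Duration timeout) {
        Duration barrierAge = startDelay.plus(alignmentDuration);
        return timeout.isZero() || barrierAge.compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(1);      // alignment timeout from the thread
        Duration startDelay = Duration.ofMinutes(5);   // hypothetical large start delay
        Duration alignment = Duration.ofMillis(200);   // hypothetical short alignment

        System.out.println("alignment-duration rule switches: "
                + switchesByAlignmentDuration(alignment, timeout));
        System.out.println("barrier-age rule switches: "
                + switchesByBarrierAge(startDelay, alignment, timeout));
        System.out.println("zero timeout switches: "
                + switchesByAlignmentDuration(alignment, Duration.ZERO));
    }
}
```

Under this model, a job whose delay sits almost entirely in the start delay never crosses a 1s threshold under the first rule, which matches the symptom reported in the original message.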

Re: unaligned checkpoint for job with large start delay

2021-12-17 Thread Mason Chen
Hi Piotr, Thanks for the link to the JIRA ticket. We actually don’t see much state size overhead between aligned and unaligned checkpoints, so we will go with your recommendation of using unaligned checkpoints with 0s alignment timeout. For context, we are testing unaligned checkpoints with o

Re: unaligned checkpoint for job with large start delay

2021-12-20 Thread Piotr Nowojski
Hi Mason, Have you already observed those checkpoint timeouts (30 minutes) with the alignment timeout set to 0ms, or only while you were previously running with the 1s alignment timeout? If the latter, it might be because unaligned checkpoints are failing to kick in in the first place. Setting the timeout

Re: unaligned checkpoint for job with large start delay

2022-01-04 Thread Mason Chen
Hi Piotrek,
> In other words, something (presumably a watermark) has fired more than 151 200 windows at once, which is taking ~1h 10minutes to process and during this time the checkpoint can not make any progress. Is this number of triggered windows plausible in your scenario?
It seems p

Re: unaligned checkpoint for job with large start delay

2022-01-10 Thread Piotr Nowojski
Hi Mason, Sorry for the late reply, I was OoO. I think you could confirm it with more custom metrics, counting how many windows have been registered/fired and plotting that over time. I think it would be more helpful in this case to check how long a task has been blocked being "busy" processin
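As a rough illustration of the suggestion above, fired windows can be counted per time bucket so that a burst shows up clearly when plotted. This is a minimal plain-Java sketch with hypothetical names (`WindowFireCounter`, `recordFire`); in a real job the counts would presumably be reported through Flink's metric system instead of being kept in a map.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Flink: counts fired windows per fixed-size time bucket.
final class WindowFireCounter {
    private final long bucketMillis;
    private final TreeMap<Long, Long> firedPerBucket = new TreeMap<>();

    WindowFireCounter(long bucketMillis) {
        if (bucketMillis <= 0) {
            throw new IllegalArgumentException("bucketMillis must be positive");
        }
        this.bucketMillis = bucketMillis;
    }

    // Call once per fired window, e.g. from the code path where the trigger fires.
    synchronized void recordFire(long timestampMillis) {
        long bucketStart = Math.floorDiv(timestampMillis, bucketMillis) * bucketMillis;
        firedPerBucket.merge(bucketStart, 1L, Long::sum);
    }

    // Copy of the counts, keyed by bucket start time, ordered for plotting.
    synchronized Map<Long, Long> snapshot() {
        return new TreeMap<>(firedPerBucket);
    }

    // Largest number of windows fired within a single bucket (0 if nothing was recorded).
    synchronized long maxFiredInOneBucket() {
        return firedPerBucket.values().stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    public static void main(String[] args) {
        WindowFireCounter counter = new WindowFireCounter(60_000L); // one-minute buckets
        long now = System.currentTimeMillis();
        for (int i = 0; i < 5; i++) {
            counter.recordFire(now);
        }
        System.out.println("fired per bucket: " + counter.snapshot());
        System.out.println("largest burst: " + counter.maxFiredInOneBucket());
    }
}
```

A sudden spike in one bucket would be consistent with a single watermark firing a very large number of windows at once.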

Re: unaligned checkpoint for job with large start delay

2022-01-11 Thread Mason Chen
Hi Piotrek, No worries, I hope you had a good break.
> Counting how many windows have been registered/fired and plotting that over time.
It’s straightforward to count windows that are fired (the trigger exposes the runtime context and we can collect the information in that code path). Howeve

RE: unaligned checkpoint for job with large start delay

2022-01-11 Thread Schwalbe Matthias
[1] https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/#state-backend-rocksdb-metrics-estimate-num-keys
From: Mason Chen
Sent: Tuesday, 11 January 2022 19:20
To: Piotr Nowojski
Cc: Mason Chen; user
Subject: Re: unaligned checkpoint for job with large start delay
Hi

Re: unaligned checkpoint for job with large start delay

2022-01-11 Thread Piotr Nowojski
> [1] https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/config/#state-backend-rocksdb-metrics-estimate-num-keys
> *From:* Mason Chen
> *Sent:* Tuesday, 11 January 2022 19:20
> *To:* Piotr Nowojski