Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-02 Thread Alexander Filipchik
Hi, Trying to figure out what happened with our Flink job. We use Flink 1.11.1 and run a job with unaligned checkpoints and the RocksDB backend. The whole state is around 300 GB, judging by the size of the savepoints. The job ran OK. At some point we tried to deploy new code, but we couldn't take a save p

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-03 Thread Chesnay Schepler
Is there anything in the Flink logs indicating issues with writing the checkpoint data? When the savepoint could not be created, was anything logged from Flink? How did you shut down the cluster?
On 6/3/2021 5:56 AM, Alexander Filipchik wrote:
Hi, Trying to figure out what happened with our F

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-03 Thread Alexander Filipchik
On the checkpoints: what kind of issues should I check for? I looked at the metrics, and they were reporting successful checkpoints. It looks like some files were removed from the shared folder, but I'm not sure how to check what caused it. Savepoints were failing due to savepoi

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-04 Thread Alexander Filipchik
Looked through the logs and didn't see anything fishy that indicated an exception during checkpointing. To make it clearer, here is the timeline (we use unaligned checkpoints, and the state size is around 300 GB):
T1: Job1 was running.
T2: Job1 was savepointed, brought down and replaced with Job2.
T3: Atte

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-05 Thread afilipchik
Small correction: in T4 and T5 I mean Job2, not Job1 (as Job1 was savepointed). Thank you, Alex
> On Jun 4, 2021, at 3:07 PM, Alexander Filipchik wrote:
> Looked through the logs and didn't see anything fishy that indicated an exception during checkpointing.
> To make it clearer,

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-07 Thread Piotr Nowojski
Hi Alex, A quick question: are you using incremental checkpoints? Best, Piotrek
Sat, 5 Jun 2021 at 21:23 wrote:
> Small correction: in T4 and T5 I mean Job2, not Job1 (as Job1 was savepointed).
> Thank you,
> Alex
> On Jun 4, 2021, at 3:07 PM, Alexander Filipchik wrote:

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-08 Thread Piotr Nowojski
Re-adding the user mailing list. Hey Alex, In that case I can see two scenarios that could lead to missing files. Keep in mind that incremental checkpoints reference previous checkpoints in order to minimise the size of the checkpoint (roughly speaking, only changes since the previous checkpoint

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-06-17 Thread Alexander Filipchik
Did some more digging.
1) is not an option as we are not doing any cleanups at the moment. We keep the last 4 checkpoints per job + all the savepoints.
2) I looked at job deployments that happened 1 week before the incident. We have 23 deployments in total and each resulted in a unique job id. I al

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-07-03 Thread Alexander Filipchik
Bumping this up: is there any known way to catch it if it happens again? Any logs we should enable?
On Thu, Jun 17 2021 at 7:52 AM, Alexander Filipchik wrote:
> Did some more digging.
> 1) is not an option as we are not doing any clean

Re: Corrupted unaligned checkpoints in Flink 1.11.1

2021-07-05 Thread Piotr Nowojski
Hey Alex, Sorry, I missed your previous email. I've spent a bit more time searching our Jira for relevant bugs, and maybe you were hit by this one: https://issues.apache.org/jira/browse/FLINK-21351 ?
> T2: Job1 was savepointed, brought down and replaced with Job2.
This in combination with FLINK