Hi,
I may have missed something, so could you share a bit more detail?

> I have recently migrated from 1.13.6 to 1.16.1, I can see there is a
> performance degradation...


Are you referring to a drop in checkpoint performance when you say
performance degradation?
Does it happen simply by upgrading from 1.13.6 to 1.16.1, without any
changes to the configuration or the job?
Could you share the configuration from before and after the upgrade?
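
For reference, here is a minimal sketch (Java, DataStream API) of the
checkpoint-related settings that are usually worth comparing between the
two versions. All values and paths below are placeholders, not taken from
your job:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint interval and mode (placeholder: 20 minutes, aligned exactly-once).
        env.enableCheckpointing(20 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

        // Timeout: "Checkpoint expired before completing" means this limit was exceeded.
        env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);

        // State backend: incremental RocksDB checkpoints usually keep the uploaded size bounded.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Checkpoint storage location (placeholder path).
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // ... job graph and env.execute() would follow here.
    }
}

Comparing these settings (interval, timeout, mode, incremental vs. full,
storage) between the 1.13.6 and 1.16.1 jobs would already narrow things down.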

> Is there any issue with this Flink version or the new RocksDB version? What
> should be the action item for this Exception?
> The maximum savepoint size is 80.2 GB and we periodically (every 20
> minutes) take the savepoint for the job.
>

The RocksDB version was upgraded in Flink 1.14, but in theory that should
not increase the checkpoint size.
So you found that the checkpoint size has increased after upgrading? Could
you also share some checkpoint metrics / configuration from before and
after the upgrade?
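
If it helps, the checkpoint statistics can also be pulled from the Flink
REST API (GET /jobs/<jobid>/checkpoints) and compared between the two runs.
A minimal sketch, assuming the JobManager REST endpoint is reachable at
localhost:8081 (placeholder) and using the job id from your posted log:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointStatsFetcher {
    public static void main(String[] args) throws Exception {
        // Placeholder REST address; the job id below is the one from the log excerpt.
        String restAddress = "http://localhost:8081";
        String jobId = "d0e1a940adab2981dbe0423efe83f140";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(restAddress + "/jobs/" + jobId + "/checkpoints"))
                .GET()
                .build();

        // The JSON response includes checkpoint counts, end-to-end duration and
        // state size, which makes the before/after comparison concrete.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}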

On Fri, May 12, 2023 at 9:06 PM neha goyal <nehagoy...@gmail.com> wrote:

> Hi everyone, can someone please shed some light on when the "Checkpoint
> Coordinator is suspending" error occurs and what I should do to avoid it?
> It is impacting the production pipeline after the version upgrade. Is it
> related to a resource crunch in the pipeline?
> Thank you
>
> On Thu, May 11, 2023 at 10:35 AM neha goyal <nehagoy...@gmail.com> wrote:
>
>> I have recently migrated from 1.13.6 to 1.16.1 and I can see a
>> performance degradation for a Flink pipeline that uses Flink's managed
>> state (ListState, MapState, etc.). Pipelines are frequently failing
>> with the exception:
>>
>> 06:59:42.021 [Checkpoint Timer] WARN  o.a.f.r.c.CheckpointFailureManager
>> - Failed to trigger or complete checkpoint 36755 for job
>> d0e1a940adab2981dbe0423efe83f140. (0 consecutive failed attempts so far)
>>  org.apache.flink.runtime.checkpoint.CheckpointFailureManager
>> org.apache.flink.runtime.checkpoint.CheckpointFailureManagerorg.apache.flink.runtime.checkpoint.CheckpointException:
>> Checkpoint expired before completing.
>> at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2165)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>> at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> at java.lang.Thread.run(Thread.java:750)
>> 07:18:15.257 [flink-akka.actor.default-dispatcher-31] WARN
>>  a.remote.ReliableDeliverySupervisor - Association with remote system
>> [akka.tcp://fl...@ip-172-31-73-135.ap-southeast-1.compute.internal:43367]
>> has failed, address is now gated for [50] ms. Reason: [Disassociated]
>>  akka.event.slf4j.Slf4jLogger$$anonfun$receive$1
>> akka.remote.ReliableDeliverySupervisor07:18:15.257 [flink-metrics-23] WARN
>>  a.remote.ReliableDeliverySupervisor - Association with remote system
>> [akka.tcp://flink-metr...@ip-172-31-73-135.ap-southeast-1.compute.internal:33639]
>> has failed, address is now gated for [50] ms. Reason: [Disassociated]
>>  akka.event.slf4j.Slf4jLogger$$anonfun$receive$1
>> akka.remote.ReliableDeliverySupervisor07:18:15.331
>> [flink-akka.actor.default-dispatcher-31] WARN
>>  o.a.f.r.c.CheckpointFailureManager - Failed to trigger or complete
>> checkpoint 36756 for job d0e1a940adab2981dbe0423efe83f140. (0 consecutive
>> failed attempts so far)
>>  org.apache.flink.runtime.checkpoint.CheckpointFailureManager
>> org.apache.flink.runtime.checkpoint.CheckpointFailureManagerorg.apache.flink.runtime.checkpoint.CheckpointException:
>> Checkpoint Coordinator is suspending.
>> at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1926)
>> at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:46)
>> at
>> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifyJobStatusChange(DefaultExecutionGraph.java:1566)
>> at
>> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.transitionState(DefaultExecutionGraph.java:1161)
>>
>> Is there any issue with this Flink version or the new RocksDB version?
>> What should be the action item for this Exception?
>> The maximum savepoint size is 80.2 GB and we periodically (every 20
>> minutes) take the savepoint for the job.
>> Checkpoint Type: aligned checkpoint
>>
>

-- 
Best,
Hangxiang.
