>>> Best,
>>> rui
>>>
>>> Feng Jin wrote on Wed, Sep 27, 2023 at 19:19:
>>>
>>>> hi rui,
>>>>
>>>> In general, checkpoint timeouts are typically associated with degraded
>>>> data processing performance. When using jemalloc, performance degradation
>>>> is generally not observed.
>>>>
>>>> It is advisable to analyze whether the job's garbage collection (GC)
>>>> has become more frequent.
>>>>
>>>> Best,
>>>> Feng
>>>>
>>>> On Mon, Sep 25, 2023 at 1:21 PM rui chen wrote:
>>>>
>>>>> After using the jemalloc memory allocator for a period of time,
>>>>> checkpoint timeout occurs and tasks are stuck. Who has encountered this?
>>>>> flink version: 1.13.2, jemalloc version: 5.3.0
>>>>>
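Feng's suggestion to check whether GC has become more frequent can be followed up via the TaskManager metrics (Flink exposes JVM GC counts and times per TaskManager) or by sampling the standard JVM beans directly. A minimal sketch using only the JDK's GarbageCollectorMXBean API (not a Flink API; the 10-second interval is an arbitrary choice):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcSampler {
        public static void main(String[] args) throws InterruptedException {
            // Print collector counts/times periodically; a rate that keeps climbing after
            // the job has run for a while supports the "GC became more frequent" theory.
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("%s: count=%d, timeMs=%d%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(10_000);
            }
        }
    }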
> Below is the TaskManager log around the time of the checkpoint timeout:
>
> ==
>
> 2023-05-25 14:47:30,248 INFO
> org.apache.parquet.hadoop.InternalParquetRecordWriter[] -
> Flushing mem column
Hello all,
We recently started to experience checkpoint timeouts randomly. Here is some
background information:
1. We are on Flink 1.13.1.
2. We have been running this type of streaming job for years. When a checkpoint
succeeds, it only takes a few seconds. Since a week ago, we have started to see random
Thank you for the help. To follow up, the issue went away when we reverted
back to Flink 1.13. It may be related to FLINK-27481. Before reverting, we
tested unaligned checkpoints with a timeout of 10 minutes, which still timed out.
Thanks.
On Thu, Apr 28, 2022, 5:38 PM Guowei Ma wrote:
> Hi Sam
>
> I
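For reference, enabling the unaligned-checkpoint test described above takes only a couple of settings. A minimal sketch, assuming Flink 1.11+ (the one-minute interval is an arbitrary placeholder; the 10-minute timeout matches the value mentioned in the mail):

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class UnalignedCheckpointSetup {
        public static void configure(StreamExecutionEnvironment env) {
            env.enableCheckpointing(60_000);             // trigger a checkpoint every minute (placeholder)
            CheckpointConfig cc = env.getCheckpointConfig();
            cc.setCheckpointTimeout(10 * 60 * 1000);     // the 10-minute timeout mentioned above
            cc.enableUnalignedCheckpoints();             // barriers may overtake buffered in-flight records
        }
    }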
Hi Sam,
I think the first step is to see which part of your Flink app is blocking
the completion of the checkpoint. Specifically, you can refer to the
"Checkpoint Details" section of the documentation [1]. Using these methods, you
should be able to observe where the checkpoint is blocked, for example, it
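The per-checkpoint breakdown that the "Checkpoint Details" page shows is also served by the JobManager's REST API, which is handy when the web UI is not reachable. A hedged sketch (the address and job ID are placeholders; the endpoint returns JSON with checkpoint counts and duration statistics):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class CheckpointStatsProbe {
        public static void main(String[] args) throws Exception {
            String jobManager = "http://localhost:8081";          // placeholder JobManager address
            String jobId = args.length > 0 ? args[0] : "<job-id>"; // placeholder job ID

            HttpRequest request = HttpRequest.newBuilder(
                    URI.create(jobManager + "/jobs/" + jobId + "/checkpoints")).build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // Inspect the counts and per-checkpoint timings to see where checkpoints get stuck.
            System.out.println(response.body());
        }
    }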
Hello,
I am running into checkpoint timeouts and am looking for guidance on
troubleshooting. What should I be looking at? What configuration parameters
would affect this? I am afraid I am a Flink newbie so I am still picking up
the concepts. Additional notes are below, anything else I can
I see that with every consecutive checkpoint timeout failure, the number of tasks
that completed checkpointing keeps decreasing. Why would that happen? Does
Flink try to process data beyond the old checkpoint barrier that failed to
complete due to the timeout?
On Tue, Mar 8, 2022 at 12:48 AM yidan zhao wrote:
If the checkpoint timeout leads to a job failure, the job will be
recovered and data will be reprocessed from the last completed checkpoint.
If the job doesn't fail, it will not be.
Mahantesh Patil wrote on Tue, Mar 8, 2022 at 14:47:
> Hello Team,
>
> What happens after checkpoint timeout?
>
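Whether a failed or expired checkpoint brings the job down in the first place, as discussed above, is governed by the tolerable-failure setting on the checkpoint config. A minimal sketch, assuming a reasonably recent Flink version (whether an expired checkpoint counts toward this limit has changed across versions):

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointFailurePolicy {
        public static void configure(StreamExecutionEnvironment env) {
            CheckpointConfig cc = env.getCheckpointConfig();
            // With 0, the first checkpoint failure fails the job, which then restarts from the
            // last completed checkpoint and reprocesses data from there; a value > 0 lets the
            // job keep running and simply attempt the next checkpoint.
            cc.setTolerableCheckpointFailureNumber(0);
        }
    }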
Hello Team,
What happens after a checkpoint timeout?
Does Flink reprocess data from the previous checkpoint for all tasks?
I have one compute-intensive operator with a parallelism of 20, and only one
of the parallel tasks seems to get stuck because of data skew. On
checkpoint timeout, will data
checkpoints. Thanks again for all the info.
Regards,
Alexis.
From: Piotr Nowojski
Sent: Monday, 25 October 2021 15:51
To: Alexis Sarda-Espinosa
Cc: Parag Somani ; Caizhi Weng ;
Flink ML
Subject: Re: Troubleshooting checkpoint timeout
Hi Alexis,
> Should I understand these metrics as a prope
that are behind more data than before it
> restarted, no?
>
> Regards,
>
> Alexis.
>
> *From:* Piotr Nowojski
> *Sent:* Monday, 25 October 2021 13:35
> *To:* Alexis Sarda-Espinosa
> *Cc:* Parag Somani ; Caizhi Weng <
> tsreape...@gmail.com>
stream operator has lower parallelism?
Regards,
Alexis.
From: Piotr Nowojski
Sent: Monday, 25 October 2021 09:59
To: Alexis Sarda-Espinosa
Cc: Parag Somani ; Caizhi Weng ;
Flink ML
Subject: Re: Troubleshooting checkpoint timeout
Hi Alexis,
You can read about those metrics in the documentation
those metrics don't really help me know in which areas to look
> for issues.
>
> Regards,
>
> Alexis.
>
From: Alexis Sarda-Espinosa
Sent: Wednesday, 20 October 2021 09:43
To: Parag Somani ; Caizhi Weng
Cc: Flink ML
Subject: RE: Troubleshooting checkpoint timeout
Currently the windows are 10 minutes in size with a 1-minute slide time. The
approximate 500 event/minute throughput is already
I had a similar problem, where two concurrent checkpoints were
configured. Also, I used to save them in S3 (using minio) on a k8s 1.18 env.
The Flink service was getting restarted and timeouts were happening. It got
resolved:
1. minio ran out of disk space, which caused checkpoint failures (this was
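Related to the "two concurrent checkpoints" point above: limiting checkpoint concurrency and enforcing a pause between checkpoints often takes pressure off a slow or nearly full object store. A sketch (the S3 path is a placeholder, and setCheckpointStorage assumes Flink 1.13+; on older versions the equivalent is the state.checkpoints.dir setting):

    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTuning {
        public static void configure(StreamExecutionEnvironment env) {
            CheckpointConfig cc = env.getCheckpointConfig();
            cc.setMaxConcurrentCheckpoints(1);         // avoid two checkpoints writing to S3/minio at once
            cc.setMinPauseBetweenCheckpoints(30_000);  // give the storage backend room between checkpoints
            cc.setCheckpointStorage("s3://<bucket>/checkpoints");  // placeholder location
        }
    }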
Hi!
I see you're using sliding event time windows. What's the exact value of
windowLengthMinutes and windowSlideTimeMinutes? If windowLengthMinutes is
large and windowSlideTimeMinutes is small then each record may be assigned
to a large number of windows as the pipeline proceeds, thus gradually
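To make that concrete: with sliding event-time windows each record lands in roughly windowLength / slide windows, so the 10-minute/1-minute combination mentioned above stores every record in about 10 windows' state. A sketch of such a window definition (the tuple type and the sum aggregation are placeholders, not the original job's code):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class SlidingWindowExample {
        public static DataStream<Tuple2<String, Long>> count(DataStream<Tuple2<String, Long>> events) {
            // windowLength / slide = 10 min / 1 min = 10 windows per element, so keyed state
            // (and with it checkpoint size and duration) grows roughly tenfold compared to a
            // tumbling window of the same length.
            return events
                    .keyBy(t -> t.f0)
                    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
                    .sum(1);
        }
    }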
Hello everyone,
I am doing performance tests for one of our streaming applications and, after
increasing the throughput a bit (~500 events per minute), it has started
failing because checkpoints cannot be completed within 10 minutes. The Flink
cluster is not exactly under my control and is
>> checkpoints keep timing out since migrating to 1.10 from 1.9
>> --
>> *From:* Deshpande, Omkar
>> *Sent:* Wednesday, September 16, 2020 5:27 PM
>> *To:* Congxian Qiu
>> *Cc:* user@flink.apache.org ; Yun Tang <
>> myas...@live.com>
>> *Subject:* Re: flink checkpoint timeout
instead of
taskmanager.heap.size
From: Deshpande, Omkar
Sent: Monday, September 14, 2020 6:23 PM
To: user@flink.apache.org
Subject: flink checkpoint timeout
Hello,
I recently upgraded from flink 1.9 to 1.10. The checkpointing succeeds
What should I be looking for in the thread dump?
>
> --
> *From:* Yun Tang
> *Sent:* Monday, September 14, 2020 8:52 PM
> *To:* Deshpande, Omkar ; user@flink.apache.org
>
> *Subject:* Re: flink checkpoint timeout
Hello,
I recently upgraded from flink 1.9 to 1.10. The checkpointing succeeds first
couple of times and then starts failing because of timeouts. The checkpoint
time grows with every checkpoint and starts exceeding 10 minutes. I do not see
any exceptions in the logs. I have enabled debug
/browse/FLINK-14816
Best
Yun Tang
From: Deshpande, Omkar
Sent: Tuesday, September 15, 2020 10:25
To: user@flink.apache.org
Subject: Re: flink checkpoint timeout
I have followed this
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory
Hi John,
which version of Flink are you using? I just tried it out with the current
snapshot version and I could configure the checkpoint timeout via
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointTimeout(1337L);
Could you provide us the logs
Hi John,
Setting the checkpoint timeout is done through this API. The default timeout for
checkpoints is 10 minutes [1], not one minute. So, I think it must be
something else.
You can set the log level of the JM and TM to DEBUG, and then see more
checkpoint details. If there is no way to analyze it, you
I have a flink job with a state big enough that checkpointing takes long (~70
seconds).
I have configured the checkpoint timeout to 180 seconds
(setCheckpointTimeout(180000)).
But as you can see from the following logs, the timeout seems to be ~60 seconds.
Is there another timeout configuration I
checking the counter happens under the lock. Disposing the
> RocksDB instance can then only start when the "client count" is zero, and
> after it started, no new clients can register. So it is similar to
> reader/writer locking, where all ops on the DB are "reader" and disposing
> the instance is the "writer".
>
> I am currently on holidays, mayb
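A small generic illustration of the register/dispose protocol described above (an illustrative sketch only, not Flink's actual RocksDB resource-management code): clients register under a lock, disposal waits until the client count drops to zero, and once disposal has started no new client can register.

    public class RefCountedResource {
        private final Object lock = new Object();
        private int clients = 0;
        private boolean disposing = false;

        /** "Reader" side: register a client; refused once disposal has started. */
        public boolean register() {
            synchronized (lock) {
                if (disposing) {
                    return false;      // no new clients once the "writer" (dispose) has begun
                }
                clients++;
                return true;
            }
        }

        /** "Reader" side: unregister and wake up a potentially waiting disposer. */
        public void unregister() {
            synchronized (lock) {
                clients--;
                lock.notifyAll();
            }
        }

        /** "Writer" side: blocks until the client count is zero, then frees the resource. */
        public void dispose() throws InterruptedException {
            synchronized (lock) {
                disposing = true;
                while (clients > 0) {
                    lock.wait();
                }
                // at this point it is safe to release the underlying (native) resource
            }
        }
    }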
>
> Hi Stefan,
>
> It seems that a similar situation, in which the job blocked after a checkpoint
> timeout, happened to my job as well. BTW, this is another job for which I raised
> the parallelism and the input throughput.
>
> After chk #8 started, the whole operator seems blo
Hi Stefan,
That reason makes sense to me. Thanks for pointing it out.
About my job: the database is currently not used at all (I disabled it for some
reasons), but the output to S3 was implemented with async I/O.
I used a ForkJoinPool with a capacity of 50.
I have tried to rebalance after the count window to monitor the
Hi,
the gap between the sync and the async part does not mean too much. What
happens per task is that all operators go through their sync part, and then one
thread executes all the async parts, one after the other. So if an async part
starts late, this is just because it started only after
Hi,
Sorry. This is the correct one.
Best Regards,
Tony Wei
2017-09-28 18:55 GMT+08:00 Tony Wei :
> Hi Stefan,
>
> Sorry for providing partial information. The attachment is the full logs
> for checkpoint #1577.
>
> Why I would say it seems that asynchronous part was not
Hi Stefan,
Sorry for providing partial information. The attachment is the full logs
for checkpoint #1577.
The reason I would say the asynchronous part was not executed
immediately is that all synchronous parts had already finished at 2017-09-27
13:49.
Did that mean the checkpoint barrier event
Hi,
I agree that the memory consumption looks good. If there is only one TM, it
will run inside one JVM. As for the 7 minutes, you mean the reported end-to-end
time? This time measurement starts when the checkpoint is triggered on the job
manager, the first contributor is then the time that it
Hi Stefan,
Here is some telemetry information, but I don't have historical information
about GC.
[image: inline image 2]
[image: inline image 1]
1) Yes, my state is not large.
2) My DFS is S3, but my cluster is outside AWS. It might be a problem. Since
this is a POC, we might move to AWS in the future or use
Hi,
when the async part takes that long I would have 3 things to look at:
1) Is your state so large? I don’t think this applies in your case, right?
2) Is something wrong with writing to DFS (network, disks, etc)?
3) Are we running low on memory on that task manager?
Do you have telemetry
Hi Tony,
are your checkpoints typically close to the timeout boundary? From what I see,
writing the checkpoint is relatively fast but the time from the checkpoint
trigger to execution seems very long. This is typically the case if your job
has a lot of backpressure and therefore the checkpoint
Hi Stefan,
It seems that I found something strange from JM's log.
It had happened more than once before, but all subtasks would finish their
checkpoint attempts in the end.
2017-09-26 01:23:28,690 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering
checkpoint 1140 @
Hi,
thanks for the information. Unfortunately, I have no immediate idea what the
reason is from the given information. I think the most helpful thing would be a thread
dump, but also metrics on the operator level to figure out which part
of the pipeline is the culprit.
Best,
Stefan
> Am
Hi Stefan,
There is no unknown exception in my full log. The Flink version is 1.3.2.
My job is roughly like this.
env.addSource(Kafka)
.map(ParseKeyFromRecord)
.keyBy()
.process(CountAndTimeoutWindow)
.asyncIO(UploadToS3)
.addSink(UpdateDatabase)
It seemed all tasks stopped like the
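For readers unfamiliar with the .asyncIO(UploadToS3) step in the sketch above: on recent Flink versions that stage is typically expressed with AsyncDataStream. A hedged sketch (the upload body, the 60-second timeout, and the capacity of 50 mentioned elsewhere in the thread are placeholders; Flink 1.3 used a slightly different async API):

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class UploadToS3 extends RichAsyncFunction<String, String> {

        @Override
        public void asyncInvoke(String record, ResultFuture<String> resultFuture) {
            // Placeholder: the real upload to S3 would run inside this future.
            CompletableFuture
                    .supplyAsync(() -> record, ForkJoinPool.commonPool())
                    .thenAccept(key -> resultFuture.complete(Collections.singleton(key)));
        }

        @Override
        public void timeout(String record, ResultFuture<String> resultFuture) {
            // Called when the async call does not finish within the timeout below.
            resultFuture.complete(Collections.emptyList());
        }

        public static DataStream<String> attach(DataStream<String> uploads) {
            // Capacity 50 bounds in-flight requests; once the queue is full the operator
            // backpressures, which can also delay checkpoint barriers behind it.
            return AsyncDataStream.unorderedWait(
                    uploads, new UploadToS3(), 60, TimeUnit.SECONDS, 50);
        }
    }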
Hi,
that is very strange indeed. I had a look at the logs and there is no error or
exception reported. I assume there is also no exception in your full logs?
Which version of flink are you using and what operators were running in the
task that stopped? If this happens again, would it be