Re: After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck

2023-09-28 Thread Feng Jin
>>> Best, >>>>> rui >>>>> >>>>> On Wed, Sep 27, 2023 at 19:19, Feng Jin wrote: >>>>> >>>>>> >>>>>> hi rui, >>>>>> >>>>>> In general, checkpoint timeouts are typically associat

Re: After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck

2023-09-28 Thread rui chen
performance. When using jemalloc, performance degradation >>>>> is generally not observed. >>>>> >>>>> It is advisable to analyze whether the job's garbage collection (GC) >>>>> has become more frequent. >>>>> >>>>> >>>>> Best, >>>>> Feng >>>>> >>>>> >>>>> On Mon, Sep 25, 2023 at 1:21 PM rui chen wrote: >>>>> >>>>>> After using the jemalloc memory allocator for a period of time, >>>>>> checkpoint timeout occurs and tasks are stuck. Who has encountered this? >>>>>> Flink version: 1.13.2, jemalloc version: 5.3.0 >>>>>> >>>>>

Re: After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck

2023-09-27 Thread rui chen
rocessing performance. When using jemalloc, performance degradation is >>>> generally not observed. >>>> >>>> It is advisable to analyze whether the job's garbage collection (GC) >>>> has become more frequent. >>>> >>>> >>>> Best, >>>> Feng >>>> >>>> >>>> On Mon, Sep 25, 2023 at 1:21 PM rui chen wrote: >>>> >>>>> After using the jemalloc memory allocator for a period of time, >>>>> checkpoint timeout occurs and tasks are stuck. Who has encountered this? >>>>> Flink version: 1.13.2, jemalloc version: 5.3.0 >>>>> >>>>

Re: After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck

2023-09-27 Thread Feng Jin
>>> generally not observed. >>> >>> It is advisable to analyze whether the job's garbage collection (GC) has >>> become more frequent. >>> >>> >>> Best, >>> Feng >>> >>> >>> On Mon, Sep 25, 2023 a

Re: After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck

2023-09-27 Thread rui chen
>> >> It is advisable to analyze whether the job's garbage collection (GC) has >> become more frequent. >> >> >> Best, >> Feng >> >> >> On Mon, Sep 25, 2023 at 1:21 PM rui chen wrote: >> >>> After using the jemalloc memory

Re: After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck

2023-09-27 Thread Feng Jin
, Sep 25, 2023 at 1:21 PM rui chen wrote: > After using the jemalloc memory allocator for a period of time, checkpoint > timeout occurs and tasks are stuck. Who has encountered this? Flink > version: 1.13.2, jemalloc version: 5.3.0 >

After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck

2023-09-24 Thread rui chen
After using the jemalloc memory allocator for a period of time, checkpoint timeout occurs and tasks are stuck. Who has encountered this? Flink version: 1.13.2, jemalloc version: 5.3.0

Re: Flink checkpoint timeout

2023-06-01 Thread Hangxiang Yu
e shooting. > Below is TaskManager log around the time when checkpoint timeout > > == > > 2023-05-25 14:47:30,248 INFO > org.apache.parquet.hadoop.InternalParquetRecordWriter[] - > Flushing mem column

Flink checkpoint timeout

2023-05-30 Thread Ethan T Yang
Hello all, We recently started to experience checkpoint timeouts at random. Here is some background information: 1. We are on Flink 1.13.1. 2. We have been running this type of streaming job for years. When a checkpoint succeeds, it takes only a few seconds. About a week ago, we started to see random

Re: Checkpoint Timeout Troubleshooting

2022-05-05 Thread Sam Ch
Thank you for the help. To follow up, the issue went away when we reverted to Flink 1.13. It may be related to FLINK-27481. Before reverting, we tested unaligned checkpoints with a timeout of 10 minutes, which also timed out. Thanks. On Thu, Apr 28, 2022, 5:38 PM Guowei Ma wrote: > Hi Sam > > I

Re: Checkpoint Timeout Troubleshooting

2022-04-28 Thread Guowei Ma
Hi Sam, I think the first step is to see which part of your Flink application is blocking the completion of the checkpoint. Specifically, you can refer to the "Checkpoint Details" section of the document [1]. Using these methods, you should be able to observe where the checkpoint is blocked, for example, it

Checkpoint Timeout Troubleshooting

2022-04-28 Thread Sam Ch
Hello, I am running into checkpoint timeouts and am looking for guidance on troubleshooting. What should I be looking at? What configuration parameters would affect this? I am afraid I am a Flink newbie so I am still picking up the concepts. Additional notes are below, anything else I can

Re: Flink Checkpoint Timeout

2022-03-08 Thread Mahantesh Patil
I see that with every consecutive checkpoint timeout failure, the number of tasks that completed checkpointing keeps decreasing. Why would that happen? Does Flink try to process data beyond the old checkpoint barrier that failed to complete due to the timeout? On Tue, Mar 8, 2022 at 12:48 AM yidan zhao wrote

Re: Flink Checkpoint Timeout

2022-03-08 Thread yidan zhao
If the checkpoint timeout causes the job to fail, the job will be recovered and data will be reprocessed from the last completed checkpoint. If the job doesn't fail, nothing is reprocessed. On Tue, Mar 8, 2022 at 14:47, Mahantesh Patil wrote: > Hello Team, > > What happens after checkpoint timeout? >

Flink Checkpoint Timeout

2022-03-07 Thread Mahantesh Patil
Hello Team, What happens after a checkpoint timeout? Does Flink reprocess data from the previous checkpoint for all tasks? I have one compute-intensive operator with a parallelism of 20, and only one of the parallel tasks seems to get stuck because of data skew. On checkpoint timeout, will data

Re: Troubleshooting checkpoint timeout

2021-10-26 Thread Piotr Nowojski
m:* Piotr Nowojski > *Sent:* Monday, October 25, 2021 15:51 > *To:* Alexis Sarda-Espinosa > *Cc:* Parag Somani ; Caizhi Weng < > tsreape...@gmail.com>; Flink ML > *Subject:* Re: Troubleshooting checkpoint timeout > > > > Hi Alexis, > > > > > Should I under

RE: Troubleshooting checkpoint timeout

2021-10-25 Thread Alexis Sarda-Espinosa
checkpoints. Thanks again for all the info. Regards, Alexis. From: Piotr Nowojski Sent: Monday, October 25, 2021 15:51 To: Alexis Sarda-Espinosa Cc: Parag Somani ; Caizhi Weng ; Flink ML Subject: Re: Troubleshooting checkpoint timeout Hi Alexis, > Should I understand these metrics as a prope

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
t are behind more data than before it > restarted, no? > > > > Regards, > > Alexis. > > > > *From:* Piotr Nowojski > *Sent:* Monday, October 25, 2021 13:35 > *To:* Alexis Sarda-Espinosa > *Cc:* Parag Somani ; Caizhi Weng < > tsreape...@gmail.com>

RE: Troubleshooting checkpoint timeout

2021-10-25 Thread Alexis Sarda-Espinosa
g>> Sent: Monday, October 25, 2021 09:59 To: Alexis Sarda-Espinosa <alexis.sarda-espin...@microfocus.com> Cc: Parag Somani <somanipa...@gmail.com>; Caizhi Weng <tsreape...@gmail.com>; Flink ML <user@flink.apache.org> Subject: Re: Troubleshooting check

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
stream operator has lower parallelism? > > > > Regards, > > Alexis. > > > > *From:* Piotr Nowojski > *Sent:* Monday, October 25, 2021 09:59 > *To:* Alexis Sarda-Espinosa > *Cc:* Parag Somani ; Caizhi Weng < > tsreape...@gmail.com>; Flink ML > *Subject

RE: Troubleshooting checkpoint timeout

2021-10-25 Thread Alexis Sarda-Espinosa
eam operator has lower parallelism? Regards, Alexis. From: Piotr Nowojski Sent: Monday, October 25, 2021 09:59 To: Alexis Sarda-Espinosa Cc: Parag Somani ; Caizhi Weng ; Flink ML Subject: Re: Troubleshooting checkpoint timeout Hi Alexis, You can read about those metrics in the documentation

Re: Troubleshooting checkpoint timeout

2021-10-25 Thread Piotr Nowojski
those metrics don’t really help me know in which areas to look > for issues. > > > > Regards, > > Alexis. > > > > *From:* Alexis Sarda-Espinosa > *Sent:* Wednesday, October 20, 2021 09:43 > *To:* Parag Somani ; Caizhi Weng < > tsreape...@gmail.com> >

RE: Troubleshooting checkpoint timeout

2021-10-21 Thread Alexis Sarda-Espinosa
, Alexis. From: Alexis Sarda-Espinosa Sent: Wednesday, October 20, 2021 09:43 To: Parag Somani ; Caizhi Weng Cc: Flink ML Subject: RE: Troubleshooting checkpoint timeout Currently the windows are 10 minutes in size with a 1-minute slide time. The throughput of approximately 500 events/minute is already

RE: Troubleshooting checkpoint timeout

2021-10-20 Thread Alexis Sarda-Espinosa
; Flink ML Subject: Re: Troubleshooting checkpoint timeout I had a similar problem, where two concurrent checkpoints were configured. Also, I saved them to S3 (using MinIO) in a k8s 1.18 environment. The Flink service kept getting restarted and timeouts were happening. It got resolved: 1. As MinIO ran

Re: Troubleshooting checkpoint timeout

2021-10-20 Thread Parag Somani
I had a similar problem, where two concurrent checkpoints were configured. Also, I saved them to S3 (using MinIO) in a k8s 1.18 environment. The Flink service kept getting restarted and timeouts were happening. It got resolved: 1. MinIO ran out of disk space, which caused checkpoints to fail (this was

Re: Troubleshooting checkpoint timeout

2021-10-19 Thread Caizhi Weng
Hi! I see you're using sliding event time windows. What's the exact value of windowLengthMinutes and windowSlideTimeMinutes? If windowLengthMinutes is large and windowSlideTimeMinutes is small then each record may be assigned to a large number of windows as the pipeline proceeds, thus gradually
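To make the window-assignment point concrete: with sliding windows, each record belongs to about windowLength / windowSlide windows, so the 10-minute window with a 1-minute slide mentioned elsewhere in this thread puts every record into 10 windows. A minimal sketch in plain Java (no Flink dependency; the class and method names are made up for illustration, and it assumes the length is an exact multiple of the slide):

```java
public class SlidingWindowCount {

    // Number of sliding windows a single record falls into, assuming the
    // window length is an exact multiple of the slide (an assumption of this
    // sketch, not a Flink requirement).
    static long windowsPerRecord(long windowLengthMinutes, long windowSlideMinutes) {
        if (windowLengthMinutes <= 0 || windowSlideMinutes <= 0) {
            throw new IllegalArgumentException("length and slide must be positive");
        }
        if (windowLengthMinutes % windowSlideMinutes != 0) {
            throw new IllegalArgumentException("length must be a multiple of slide");
        }
        return windowLengthMinutes / windowSlideMinutes;
    }

    public static void main(String[] args) {
        System.out.println(windowsPerRecord(10, 1)); // prints 10
        System.out.println(windowsPerRecord(60, 1)); // prints 60
    }
}
```

The larger this ratio, the more window state each incoming record touches, which is one way checkpoint size and duration can grow as the pipeline runs.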

Troubleshooting checkpoint timeout

2021-10-19 Thread Alexis Sarda-Espinosa
Hello everyone, I am doing performance tests for one of our streaming applications and, after increasing the throughput a bit (~500 events per minute), it has started failing because checkpoints cannot be completed within 10 minutes. The Flink cluster is not exactly under my control and is

Re: flink checkpoint timeout

2020-10-12 Thread Arvid Heise
>> checkpoints keep timing out since migrating to 1.10 from 1.9 >> -- >> *From:* Deshpande, Omkar >> *Sent:* Wednesday, September 16, 2020 5:27 PM >> *To:* Congxian Qiu >> *Cc:* user@flink.apache.org ; Yun Tang < >> myas...@live.com> >> *Subject:* R

Re: flink checkpoint timeout

2020-10-05 Thread Yu Li
om 1.9 > -- > *From:* Deshpande, Omkar > *Sent:* Wednesday, September 16, 2020 5:27 PM > *To:* Congxian Qiu > *Cc:* user@flink.apache.org ; Yun Tang < > myas...@live.com> > *Subject:* Re: flink checkpoint timeout > > This email is from an

Re: flink checkpoint timeout

2020-09-15 Thread Deshpande, Omkar
nstead of taskmanager.heap.size From: Deshpande, Omkar Sent: Monday, September 14, 2020 6:23 PM To: user@flink.apache.org Subject: flink checkpoint timeout This email is from an external sender. Hello, I recently upgraded from flink 1.9 to 1.10. The checkpointing succeeds

Re: flink checkpoint timeout

2020-09-14 Thread Congxian Qiu
ould I be looking for in the thread dump? > > -- > *From:* Yun Tang > *Sent:* Monday, September 14, 2020 8:52 PM > *To:* Deshpande, Omkar ; user@flink.apache.org > > *Subject:* Re: flink checkpoint timeout > > This email is from an external sender.

flink checkpoint timeout

2020-09-14 Thread Deshpande, Omkar
Hello, I recently upgraded from Flink 1.9 to 1.10. Checkpointing succeeds the first couple of times and then starts failing because of timeouts. The checkpoint time grows with every checkpoint and starts exceeding 10 minutes. I do not see any exceptions in the logs. I have enabled debug

Re: flink checkpoint timeout

2020-09-14 Thread Yun Tang
/browse/FLINK-14816 Best Yun Tang From: Deshpande, Omkar Sent: Tuesday, September 15, 2020 10:25 To: user@flink.apache.org Subject: Re: flink checkpoint timeout I have followed this https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory

Re: checkpoint timeout

2018-08-30 Thread Till Rohrmann
Hi John, which version of Flink are you using? I just tried it out with the current snapshot version and I could configure the checkpoint timeout via CheckpointConfig checkpointConfig = env.getCheckpointConfig(); checkpointConfig.setCheckpointTimeout(1337L); Could you provide us the logs

Re: checkpoint timeout

2018-08-29 Thread vino yang
Hi John, Setting the checkpoint timeout is through this API. The default timeout for checkpoints is 10 minutes [1], not one minute. So, I think it must be something else. You can set the log level of JM and TM to Debug, and then see more checkpoint details. If there is no way to analyze it, you

checkpoint timeout

2018-08-29 Thread John O
I have a Flink job with state big enough to make checkpointing slow (~70 seconds). I have configured the checkpoint timeout to 180 seconds (setCheckpointTimeout(18)). But as you can see from the following logs, the timeout seems to be ~60 seconds. Is there another timeout configuration I
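For reference, CheckpointConfig.setCheckpointTimeout takes its argument in milliseconds, so a 180-second timeout corresponds to 180000. A minimal sketch of the conversion in plain Java (the class and helper names are made up for illustration; the Flink call is shown only as a comment because no environment is constructed here):

```java
import java.time.Duration;

public class CheckpointTimeoutMillis {

    // Converts a timeout given in seconds to the millisecond value that
    // setCheckpointTimeout expects.
    static long timeoutMillis(long seconds) {
        return Duration.ofSeconds(seconds).toMillis();
    }

    public static void main(String[] args) {
        long timeout = timeoutMillis(180);
        System.out.println(timeout); // prints 180000

        // With a StreamExecutionEnvironment named env (assumed, not built here):
        // env.getCheckpointConfig().setCheckpointTimeout(timeout);
    }
}
```

Going through Duration rather than typing the literal makes the unit explicit, which helps rule out a unit or digit mistake before looking for other timeouts.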

Re: Stream Task seems to be blocked after checkpoint timeout

2017-10-03 Thread Stefan Richter
checking the counter happens under the lock. Disposing the > RocksDB instance can then only start when the "client count" is zero, and > after it started, no new clients can register. So it is similar to > reader/writer locking, where all ops on the DB are "reader" and disposing

Re: Stream Task seems to be blocked after checkpoint timeout

2017-10-03 Thread Tony Wei
instance can then only start when the "client count" is zero, > and after it started, no new clients can register. So it is similar to > reader/writer locking, where all ops on the DB are "reader" and disposing > the instance is the "writer". > > I am currently on holidays, mayb
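The client-count scheme quoted above can be sketched roughly as follows. This is an illustrative model in plain Java, not Flink's or RocksDB's actual code; the class and method names are hypothetical, and disposal here is a non-blocking attempt rather than a wait:

```java
public class DisposalGuard {

    private final Object lock = new Object();
    private int clientCount = 0;      // operations currently using the resource ("readers")
    private boolean disposed = false; // set once disposal has started ("writer")

    // Registers a client under the lock; refused once disposal has started.
    public boolean tryRegister() {
        synchronized (lock) {
            if (disposed) {
                return false;
            }
            clientCount++;
            return true;
        }
    }

    // Unregisters a previously registered client.
    public void release() {
        synchronized (lock) {
            if (clientCount == 0) {
                throw new IllegalStateException("no registered client to release");
            }
            clientCount--;
        }
    }

    // Disposal may only start when the client count is zero.
    public boolean tryDispose() {
        synchronized (lock) {
            if (disposed || clientCount > 0) {
                return false;
            }
            disposed = true;
            return true;
        }
    }

    public static void main(String[] args) {
        DisposalGuard guard = new DisposalGuard();
        System.out.println(guard.tryRegister()); // prints true
        System.out.println(guard.tryDispose());  // prints false: a client is active
        guard.release();
        System.out.println(guard.tryDispose());  // prints true
        System.out.println(guard.tryRegister()); // prints false: already disposed
    }
}
```

Because both the counter check and the disposed flag are read and written under the same lock, a client can never register between the zero check and the start of disposal.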

Re: Stream Task seems to be blocked after checkpoint timeout

2017-10-03 Thread Stefan Richter
> > Hi Stefan, > > It seems that the similar situation, in which job blocked after checkpoint > timeout, came across to my job. BTW, this is another job that I raised > parallelism and throughput of input. > > After chk #8 started, the whole operator seems blo

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Tony Wei
Hi Stefan, That reason makes sense to me. Thanks for pointing it out. About my job: the database is currently not used (I disabled it for some reason), but the output to S3 is implemented with async I/O. I used a ForkJoinPool with a capacity of 50. I have tried to rebalance after the count window to monitor the

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Stefan Richter
Hi, the gap between the sync and the async part does not mean too much. What happens per task is that all operators go through their sync part, and then one thread executes all the async parts, one after the other. So if an async part starts late, this is just because it started only after
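The scheduling described above, where one thread runs all async parts one after the other, can be modelled with a single-threaded executor. This is only a toy model of the behaviour as described in this message, not Flink's implementation; the class, method, and operator names are placeholders:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SequentialAsyncParts {

    // Runs one "async part" per operator on a single thread and returns the
    // order in which they executed.
    static List<String> runAsyncParts(List<String> operators) {
        List<String> executionOrder = Collections.synchronizedList(new ArrayList<>());
        ExecutorService asyncThread = Executors.newSingleThreadExecutor();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String operator : operators) {
                // Each async part is queued behind the ones submitted before it.
                pending.add(asyncThread.submit(() -> { executionOrder.add(operator); }));
            }
            for (Future<?> part : pending) {
                part.get(); // wait for every async part to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("async part failed", e);
        } finally {
            asyncThread.shutdown();
        }
        return new ArrayList<>(executionOrder);
    }

    public static void main(String[] args) {
        // Placeholder operator names, not taken from the job in this thread.
        System.out.println(runAsyncParts(Arrays.asList("operator-A", "operator-B", "operator-C")));
    }
}
```

In this model a late start for operator-C says nothing about operator-C itself; it only reflects how long the parts queued ahead of it took.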

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Tony Wei
Hi, Sorry. This is the correct one. Best Regards, Tony Wei 2017-09-28 18:55 GMT+08:00 Tony Wei : > Hi Stefan, > > Sorry for providing partial information. The attachment is the full logs > for checkpoint #1577. > > Why I would say it seems that asynchronous part was not

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Tony Wei
Hi Stefan, Sorry for providing partial information. The attachment is the full logs for checkpoint #1577. The reason I said the asynchronous part seemed not to be executed immediately is that all synchronous parts had finished at 2017-09-27 13:49. Did that mean the checkpoint barrier event

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Stefan Richter
Hi, I agree that the memory consumption looks good. If there is only one TM, it will run inside one JVM. As for the 7 minutes, you mean the reported end-to-end time? This time measurement starts when the checkpoint is triggered on the job manager, the first contributor is then the time that it

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Tony Wei
Hi Stefan, Here is some telemetry information, but I don't have historical information about GC. [image: inline image 2] [image: inline image 1] 1) Yes, my state is not large. 2) My DFS is S3, but my cluster is outside AWS. That might be a problem. Since this is a POC, we might move to AWS in the future or use

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-28 Thread Stefan Richter
Hi, when the async part takes that long I would have 3 things to look at: 1) Is your state so large? I don’t think this applies in your case, right? 2) Is something wrong with writing to DFS (network, disks, etc)? 3) Are we running low on memory on that task manager? Do you have telemetry

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-27 Thread Stefan Richter
Hi Tony, are your checkpoints typically close to the timeout boundary? From what I see, writing the checkpoint is relatively fast but the time from the checkpoint trigger to execution seems very long. This is typically the case if your job has a lot of backpressure and therefore the checkpoint

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-27 Thread Tony Wei
Hi Stefan, It seems that I found something strange from JM's log. It had happened more than once before, but all subtasks would finish their checkpoint attempts in the end. 2017-09-26 01:23:28,690 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 1140 @

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-27 Thread Stefan Richter
Hi, thanks for the information. Unfortunately, I have no immediate idea what the reason is from the given information. I think a thread dump would be most helpful, but also metrics on the operator level to figure out which part of the pipeline is the culprit. Best, Stefan > Am

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-26 Thread Tony Wei
Hi Stefan, There is no unknown exception in my full log. The Flink version is 1.3.2. My job is roughly like this. env.addSource(Kafka) .map(ParseKeyFromRecord) .keyBy() .process(CountAndTimeoutWindow) .asyncIO(UploadToS3) .addSink(UpdateDatabase) It seemed all tasks stopped like the

Re: Stream Task seems to be blocked after checkpoint timeout

2017-09-26 Thread Stefan Richter
Hi, that is very strange indeed. I had a look at the logs and there is no error or exception reported. I assume there is also no exception in your full logs? Which version of flink are you using and what operators were running in the task that stopped? If this happens again, would it be