[jira] [Commented] (FLINK-37375) Checkpoint supports the Operator to customize asynchronous snapshot state

Zakelly Lan (Jira) Sun, 23 Mar 2025 20:21:19 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937775#comment-17937775
 ]


Zakelly Lan commented on FLINK-37375:
-------------------------------------

[~hejufang001] But it is still required to finish before the checkpoint is 
marked complete, right? Otherwise it won't affect Flink in any way, and there 
is no need to introduce such a method.
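
To make the ordering in question concrete, here is a minimal sketch in plain 
Java. It is not Flink API and not the code from the pull request: the class 
AsyncSnapshotSketch, its snapshot method, and the file names are all 
hypothetical. It only illustrates a cheap synchronous step on the main thread 
followed by a slow asynchronous step, with the checkpoint acknowledged only 
after the asynchronous step has finished.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical illustration only; none of these names exist in Flink. */
class AsyncSnapshotSketch {
    // Files written during the synchronous phase (main thread).
    final List<String> localFiles = new ArrayList<>();
    // Files "uploaded" during the asynchronous phase; read only after the future completes.
    final List<String> remoteFiles = new ArrayList<>();

    /**
     * Synchronous phase: do the cheap local write on the caller's thread.
     * Asynchronous phase: hand the slow upload to asyncExecutor.
     * The returned future completes with the checkpoint id only after the
     * upload is done, so the caller cannot acknowledge the checkpoint earlier.
     */
    CompletableFuture<Long> snapshot(long checkpointId, ExecutorService asyncExecutor) {
        String file = "chk-" + checkpointId + ".data";
        localFiles.add(file); // stands in for a fast local write

        return CompletableFuture
                .runAsync(() -> remoteFiles.add("remote/" + file), asyncExecutor) // stands in for a slow upload
                .thenApply(ignored -> checkpointId); // acknowledgement happens after the upload
    }

    public static void main(String[] args) {
        AsyncSnapshotSketch operator = new AsyncSnapshotSketch();
        ExecutorService asyncExecutor = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Long> ack = operator.snapshot(1L, asyncExecutor);
            // The main thread is free to keep processing records here.
            long acknowledged = ack.join(); // the checkpoint counts as complete only now
            System.out.println("acknowledged checkpoint " + acknowledged
                    + ", remote files: " + operator.remoteFiles);
        } finally {
            asyncExecutor.shutdown();
        }
    }
}
```

In this sketch the asynchronous part only moves work off the main thread; it 
does not relax the requirement that the work completes before the checkpoint 
is acknowledged.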

> Checkpoint supports the Operator to customize asynchronous snapshot state
> -------------------------------------------------------------------------
>
>                 Key: FLINK-37375
>                 URL: https://issues.apache.org/jira/browse/FLINK-37375
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.1
>            Reporter: Jufang He
>            Priority: Major
>              Labels: pull-request-available
>
> In some Flink task operators, slow operations such as file uploads or data 
> flushing may be performed during the synchronous phase of Checkpoint. Due to 
> performance issues with external storage components, the synchronous phase 
> may take too long to execute, significantly impacting the job's throughput. 
> For example, during our internal use of Paimon, we observed that uploading 
> files to HDFS during the Checkpoint synchronous phase could encounter random 
> HDFS slow node issues, leading to a substantial negative impact on task 
> throughput.
> To address this issue, I propose supporting a generic operator custom 
> asynchronous snapshot feature, allowing users to move time-consuming logic to 
> the asynchronous phase of Checkpoint, thereby minimizing the blocking of the 
> main thread and improving task throughput. For instance, the Paimon writer 
> operator could write data locally during the Checkpoint synchronous phase and 
> upload files to remote storage during the asynchronous phase. Beyond the 
> Paimon data upload scenario, other operator logic may also experience slow 
> execution during the synchronous phase. This approach aims to uniformly 
> optimize such issues.
> I drafted a FLIP for this issue: 
> [https://docs.google.com/document/d/1lwxLEQjD6jVhZUBMRGhzQNWKSvdbPbYNQsV265gR4kw]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
