KevinyhZou created HUDI-5159: -------------------------------- Summary: Support write a success file to partition when it finished in flink streaming append writer Key: HUDI-5159 URL: https://issues.apache.org/jira/browse/HUDI-5159 Project: Apache Hudi Issue Type: New Feature Components: connectors Reporter: KevinyhZou
When we use flink streaming job to consume data from mq and wtite to hudi partition table, we can not know when a partition is wite finished. And this is often necessary to tell the downstream offline task scheduler to run while partition is finished. I think we can use the flink watermark mechanism to implment this. As watermark represents the minimum timestamp in flink streaming job, when the watermark is greater than the hudi partition time, it always means the data is write finished to hudi parititon in an ordered streaming data, and then it is the time to write a success file to the parititon path to represent it finished wirte. It can be designed as below. # Get the field of partitions and values in flink append streaming data, this can be implements in AppendWriteFunction, the emit it to downstream; # Implement a SuccessFileWriteSink to receive these partition values and store them to activePartitions set, # Compare the watermark timestamp and the partition timestamp values converted from activeParitions set, if the wartermark is greater, set the partition to finished partitons set; # Iterate the finished partition set, and get the partition path, and write success file to it while flink job make checkpoint; # Store the active partition set and finished partition set in flink state, avoid the data loss while the job failver. -- This message was sent by Atlassian Jira (v8.20.10#820010)