hudi-bot opened a new issue, #15543:
URL: https://github.com/apache/hudi/issues/15543
When a Flink streaming job consumes data from a message queue and writes it to
a partitioned Hudi table, there is no way to know when a partition has been
fully written. Downstream offline task schedulers often need this signal so
that they run only after the partition is finished.
I think we can use the Flink watermark mechanism to implement this. A
watermark marks the event time before which no more records are expected in
the streaming job, so once the watermark is greater than the Hudi partition
time, all data for that partition has been written, provided the stream is
ordered. That is the time to write a success file to the partition path to
mark the partition as finished.
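The completion rule can be illustrated with a minimal sketch. It assumes daily partitions named like `2022-11-07` in UTC and treats the "partition time" as the exclusive end of that day; both are assumptions, since the issue does not fix a partition format. The class and method names are hypothetical and are not part of Hudi or Flink.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper, not a Hudi or Flink class.
final class PartitionCompletion {

    // Exclusive end of a daily partition such as "2022-11-07", in epoch millis (UTC).
    // Assumption: partitions are daily and named in ISO yyyy-MM-dd form.
    static long partitionEndMillis(String partitionDay) {
        return LocalDate.parse(partitionDay)
                .plusDays(1)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    // A partition is treated as complete once the watermark has reached its end.
    static boolean isComplete(long watermarkMillis, String partitionDay) {
        return watermarkMillis >= partitionEndMillis(partitionDay);
    }
}
```

Comparing against the end of the partition rather than its start is a conservative choice: a watermark inside the day would still allow more records for that day to arrive.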
It can be designed as follows.
# Extract the partition fields and their values from the Flink append
streaming data. This can be implemented in AppendWriteFunction, which then
emits them downstream;
# Implement a SuccessFileWriteSink that receives these partition values and
stores them in an activePartitions set;
# Compare the watermark timestamp with the partition timestamps derived from
the activePartitions set. If the watermark is greater, move the partition to
the finished partitions set;
# Iterate over the finished partitions set, resolve each partition path, and
write a success file to it when the Flink job takes a checkpoint;
# Store the active partitions set and the finished partitions set in Flink
state to avoid data loss when the job fails over.
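
The bookkeeping in these steps can be sketched without any Flink or Hudi dependency. This is an illustration under stated assumptions, not the proposed implementation: `SuccessFileTracker` and its methods are hypothetical names, partitions are assumed to be daily (`yyyy-MM-dd`, UTC), the marker file name `_SUCCESS` is an assumption, and the local filesystem stands in for the table's storage. In a real job the callbacks would be driven by the sink's element, watermark, and checkpoint hooks, and the two sets would be kept in Flink state.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the sink-side bookkeeping; not a Hudi or Flink class.
final class SuccessFileTracker {
    private final Path tableBasePath;
    private final Set<String> activePartitions = new TreeSet<>();
    private final Set<String> finishedPartitions = new TreeSet<>();

    SuccessFileTracker(Path tableBasePath) {
        this.tableBasePath = tableBasePath;
    }

    // Step 2: remember every partition value received from upstream.
    void onPartition(String partitionDay) {
        activePartitions.add(partitionDay);
    }

    // Step 3: move partitions whose end time the watermark has reached.
    void onWatermark(long watermarkMillis) {
        Iterator<String> it = activePartitions.iterator();
        while (it.hasNext()) {
            String partition = it.next();
            long endMillis = LocalDate.parse(partition).plusDays(1)
                    .atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
            if (watermarkMillis >= endMillis) {
                it.remove();
                finishedPartitions.add(partition);
            }
        }
    }

    // Step 4: on checkpoint, write a marker file into each finished partition.
    void onCheckpoint() {
        try {
            for (String partition : finishedPartitions) {
                Path dir = tableBasePath.resolve(partition);
                Files.createDirectories(dir);
                Path marker = dir.resolve("_SUCCESS"); // assumed marker name
                if (Files.notExists(marker)) {
                    Files.createFile(marker);
                }
            }
            finishedPartitions.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 5: copies that a real job would persist in Flink state.
    Set<String> snapshotActive() {
        return new TreeSet<>(activePartitions);
    }

    Set<String> snapshotFinished() {
        return new TreeSet<>(finishedPartitions);
    }
}
```

Deferring the file write to the checkpoint callback, as the proposal describes, keeps the marker aligned with data that has actually been committed rather than data that is only buffered.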
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5159
- Type: New Feature
---
## Comments
07/Nov/22 14:52;complone;Hi KevinyhZou, this also looks helpful for our
company's needs. Do you mind if I take part in the development of this task
with you?;;;
---
08/Nov/22 05:16;zouyunhe;OK, I have implemented this feature and will submit
a PR in the next few days. You can help review it or see what else needs to be
added. [~complone] ;;;
---
11/Nov/22 12:33;complone;[~zouyunhe] Okay, I know;;;