ChiehFu opened a new issue, #10914:
URL: https://github.com/apache/hudi/issues/10914

   **Describe the problem you faced**
   
   Hi, 
   
   My team wants to build Flink pipelines that generate financial reports and save 
the results into a Hudi COW table.
   
   The data sources for the report consist of two types of data — snapshot and 
incremental data. To produce a complete report we need to ingest both, so we are 
considering running two Flink jobs against the same Hudi table sequentially: a 
batch job that processes all snapshot data up to the current time, and a stream 
job that continuously processes new incremental data.
   
   According to Hudi's documentation, the Flink writer uses Flink state to store 
index information for the records it has processed and relies on that information 
to perform upserts correctly. My question: since the stream job does not have 
access to the batch job's state, would Hudi in the stream job still be able to 
correctly upsert records that were previously ingested by the batch job? If not, 
do you have any recommendations on how to set up the Flink–Hudi workflow for 
our use case?
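   
   For context, here is a sketch of one direction we have been looking at: Hudi's 
Flink connector documents a bucket index, which derives file placement from the 
record key rather than from Flink state, so it would not depend on state carried 
over between jobs. The table name, schema, and path below are placeholders, and 
the option values are illustrative only:
   
   ```sql
   -- Hypothetical table; options follow Hudi's Flink SQL connector docs.
   CREATE TABLE financial_report (
     report_id STRING PRIMARY KEY NOT ENFORCED,
     amount    DECIMAL(18, 2),
     ts        TIMESTAMP(3)
   ) WITH (
     'connector'  = 'hudi',
     'path'       = 's3://my-bucket/warehouse/financial_report',  -- placeholder
     'table.type' = 'COPY_ON_WRITE',
     -- Bucket index: hashing the record key to a bucket, not Flink state,
     -- so a new job can upsert records written by a previous job.
     'index.type' = 'BUCKET',
     'hoodie.bucket.index.num.buckets' = '64'  -- illustrative value
   );
   ```
   
   We also noticed the `index.bootstrap.enabled` option, which appears to reload 
the existing table's index into Flink state when a job starts — would either of 
these be the recommended approach here?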
   
   
   **To Reproduce**
   
   **Environment Description**
   
   * EMR emr-6.15.0, Flink 1.17.1, Hadoop 3.3.6, Hive 3.1.3, Zeppelin 0.10.1
   
   
   

