garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-594096377

@pratyakshsharma So let's forget about my homebrew Spark data source reader. Let's assume I am using delta streamer to consume a DFS source, and now I'd like to switch to delta streamer consuming a Kafka source. Data arrives at DFS and Kafka asynchronously: the DFS source lags Kafka by 30 minutes. So basically I'd like to switch from **Kafka -> HDFS raw parquet -> Hudi table** to **Kafka -> Hudi table**. If you have a good solution for this case, please let me know.

- The problem I have here is that the Kafka retention time is long, but not long enough to cover all the data. All the raw data I have is in DFS, and it keeps coming in. If I simply do a BULK_INSERT from the EARLIEST checkpoint in Kafka, I will lose data. If I do the HDFS import first and then UPSERT from:
  - the EARLIEST checkpoint, it could eat up the resources of both my Spark cluster and my Kafka cluster, because the data volume is huge.
  - the LATEST checkpoint, I will lose data (the 30-minute gap).
- I believe there are some Hudi users who are not using Delta Streamer in the first place and would like to switch to it later. I am one of them, because from a user's perspective, I won't fully trust a framework until I fully understand it and gain enough experience with it. Currently, I couldn't find a perfect way to switch to delta streamer, because I need to make a non-deltastreamer commit to append the gap data to the Hudi dataset, but this commit will make me lose the checkpoint. Let's not call this a parallel pipeline, because that's confusing. This is a one-time fix for the data gap between two different sources, and the delta streamer will be the only one writing to the sink afterwards.
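To make the trade-off between the starting points concrete, here is a minimal sketch in plain Python. It is not Hudi or Kafka code: the `Timeline` class, both helper functions, and every number except the 30-minute lag (including the 7-day retention) are assumptions chosen only for illustration. Event times are modeled as minutes on a single axis, and the HDFS import is assumed to have already loaded everything up to `dfs_latest`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Timeline:
    """Hypothetical event-time positions, in minutes."""
    kafka_earliest: int  # oldest event still retained in Kafka
    dfs_latest: int      # newest event already landed in DFS
    now: int             # newest event available in Kafka


def missing_range(t: Timeline, start: int) -> Optional[Tuple[int, int]]:
    """Event-time range lost if Kafka consumption begins at `start`
    after the DFS data has been imported; None if nothing is lost."""
    if start > t.dfs_latest:
        return (t.dfs_latest, start)
    return None


def replayed_minutes(t: Timeline, start: int) -> int:
    """Minutes of data read from Kafka that the DFS import already covered."""
    return max(0, t.dfs_latest - start)


# Assumed numbers: 7 days of Kafka retention, DFS lagging Kafka by 30 minutes.
now = 20_000
t = Timeline(kafka_earliest=now - 7 * 24 * 60, dfs_latest=now - 30, now=now)

# LATEST: nothing is replayed, but the 30-minute gap is lost.
latest_gap = missing_range(t, t.now)                    # (19970, 20000)

# EARLIEST: nothing is lost, but the whole retention window is re-upserted.
earliest_gap = missing_range(t, t.kafka_earliest)       # None
earliest_replay = replayed_minutes(t, t.kafka_earliest) # 10050

# Starting exactly where the DFS data ends: no loss and no replay.
aligned_gap = missing_range(t, t.dfs_latest)            # None
aligned_replay = replayed_minutes(t, t.dfs_latest)      # 0
```

Under these assumptions, only a start position aligned with the end of the DFS data avoids both the data loss and the replay cost, which is why being able to carry a checkpoint across the non-deltastreamer commit matters here.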
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services