garyli1019 commented on issue #1362: HUDI-644 Enable user to get checkpoint 
from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-594096377
 
 
   @pratyakshsharma So let's forget about my homebrew Spark data source reader. 
Let's assume I am using delta streamer to consume a DFS source, and now I'd 
like to switch to delta streamer consuming a Kafka source. Data arrives at DFS 
and Kafka asynchronously: the DFS source lags Kafka by 30 minutes.  
   So basically I'd like to switch from: **Kafka -> HDFS raw parquet -> Hudi 
table** to **Kafka -> Hudi table**. If you have a good solution for this case, 
please let me know. 
   
   - The problem I have here is that the Kafka retention time is long but not 
long enough to cover all the data. All the raw data I have is in DFS, and more 
keeps coming in. If I simply do a BULK_INSERT from the EARLIEST Kafka 
checkpoint, I will lose the data that is no longer retained in Kafka. If I do 
the HDFS import first and then UPSERT from:
      - the EARLIEST checkpoint, it could eat up the resources of both my Spark 
cluster and Kafka cluster, because the data volume is huge. 
      - the LATEST checkpoint, I will lose data (the 30-minute gap). 
   - I believe some Hudi users are not using Delta Streamer in the first place 
and would like to switch to it later, and I am one of them. From a user's 
perspective, I won't fully trust a framework until I fully understand it and 
have gained enough experience with it.   
   
   Currently, I couldn't find a perfect way to switch to delta streamer, 
because I need to make a non-deltastreamer commit to append the gap data to the 
Hudi dataset, and that commit will make me lose the checkpoint. Let's not call 
this a parallel pipeline, because that's confusing. This is a one-time fix for 
the data gap between two different sources, and delta streamer will be the only 
writer to the table afterwards. 
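
   To make the EARLIEST/LATEST trade-off above concrete, here is a minimal 
sketch in Python of a third option: resuming each Kafka partition from the 
first record newer than the last data already imported from DFS. Everything in 
it is hypothetical and only for illustration: the helper names, the in-memory 
`(offset, timestamp)` index, and the `topic,partition:offset,...` checkpoint 
string, which is assumed here to be the format delta streamer expects and 
should be verified against the Hudi version in use. It also assumes timestamps 
are non-decreasing within a partition.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical index: partition -> list of (offset, timestamp_ms),
# assumed sorted by offset with non-decreasing timestamps.
OffsetIndex = Dict[int, List[Tuple[int, int]]]


def offsets_covering_gap(index: OffsetIndex, dfs_watermark_ms: int) -> Dict[int, int]:
    """Return, per partition, the first offset whose timestamp is strictly
    newer than the newest record already imported from DFS."""
    start: Dict[int, int] = {}
    for partition, records in index.items():
        timestamps = [ts for _, ts in records]
        pos = bisect_right(timestamps, dfs_watermark_ms)
        if pos < len(records):
            start[partition] = records[pos][0]
        elif records:
            # Nothing newer than the watermark: resume right after the last record.
            start[partition] = records[-1][0] + 1
        else:
            # Empty partition: start from offset 0.
            start[partition] = 0
    return start


def format_checkpoint(topic: str, offsets: Dict[int, int]) -> str:
    """Render offsets as 'topic,partition:offset,...' (assumed format)."""
    parts = [f"{p}:{o}" for p, o in sorted(offsets.items())]
    return ",".join([topic] + parts)


if __name__ == "__main__":
    index = {
        0: [(100, 1000), (101, 2000), (102, 3000)],
        1: [(50, 1500), (51, 2500)],
    }
    # Newest record already imported from DFS has timestamp 2000 ms.
    print(format_checkpoint("events", offsets_covering_gap(index, 2000)))
```

   In practice the per-partition lookup would come from Kafka itself rather 
than an in-memory list (the Java consumer has `offsetsForTimes` for this), and 
the resulting string would only help if delta streamer can be told to start 
from it, which is exactly the missing piece discussed in this PR.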

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
