pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get 
checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593813910
 
 
   @garyli1019 I still feel all these challenges arise because you are 
trying to ingest data into the same dataset using 2 different Spark jobs. A 
few questions: 
   
   1. If the Kafka cluster retention time is too long, have you tried using 
the BULK_INSERT mode of Hudi? If not, you can tune the Spark and Hudi 
parameters to increase the source limit and then ingest the data. Otherwise, 
you can also try running DeltaStreamer in continuous mode. 
   2. I would also like to know the reason for switching every time from 
homebrew Spark to Hudi. Are you doing a POC on Hudi? Why don't you simply 
use DeltaStreamer and never switch to the other data source? The data loss 
will not happen if you rely on just one of the data sources :) 
   
   I am a bit skeptical about using 2 pipelines to write to the same 
destination path. Additionally, there are options available for backing up 
your Hudi dataset or for migrating an existing dataset to Hudi. Anyway, if 
you strongly feel the need to write this checkPointGenerator, let us hear the 
opinions of @leesf and @vinothchandar on this before proceeding. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
