pratyakshsharma commented on issue #1362: HUDI-644 Enable user to get checkpoint from previous commits in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-593813910

@garyli1019 I still feel all these challenges arise because you are trying to ingest data into the same dataset using two different Spark jobs. A few questions:

1. If the Kafka cluster retention time is too long, have you tried using the BULK_INSERT mode of Hudi? If not, you can tune the Spark and Hudi parameters to increase the source limit and then ingest the data. Alternatively, you can try running DeltaStreamer in continuous mode.
2. I would also like to know the reason for switching every time from homebrew Spark to Hudi. Are you doing a POC on Hudi? Why don't you simply use DeltaStreamer and never switch to the other data source? The data loss will not happen if you rely on just one of the data sources :)

I am a bit skeptical about using two pipelines to write to the same destination path. Additionally, there are options available for backing up your Hudi dataset or for migrating an existing dataset to Hudi.

Anyway, if you strongly feel the need to write this checkPointGenerator, let us hear the opinions of @leesf and @vinothchandar on this before proceeding.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services