[ https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456010#comment-17456010 ]
sivabalan narayanan commented on HUDI-1214:
-------------------------------------------

I guess this is the ask: add the ability to serialize checkpoints via Spark datasource writes, so that when a user later starts up DeltaStreamer, it automatically resumes from the last known checkpoint.

Here is my take on this ask: DeltaStreamer uses the Source interface, so we have ways to determine checkpoints for different sources, and the checkpoint format also differs from one source to another. Spark datasource writers don't use any of these. A typical use case could be to bootstrap data from a source folder using the Spark datasource and then start a DeltaStreamer with a Kafka source, so the checkpoint formats may also differ. In any case, as of today, the Spark datasource does not have a way to determine the checkpoints.

I will close this ticket out, but please feel free to re-open it if my understanding is wrong or if you have ideas on how to go about this.

> Need ability to set deltastreamer checkpoints when doing Spark datasource writes
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-1214
>                 URL: https://issues.apache.org/jira/browse/HUDI-1214
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Spark Integration
>            Reporter: Balaji Varadarajan
>            Assignee: Trevorzhang
>            Priority: Major
>              Labels: sev:high, user-support-issues
>             Fix For: 0.11.0
>
>
> Such support is needed for bootstrapping cases when users use spark write to
> do initial bootstrap and then subsequently use deltastreamer.
> DeltaStreamer manages checkpoints inside hoodie commit files and expects
> checkpoints in previously committed metadata. Users are expected to pass
> checkpoint or initial checkpoint provider when performing bootstrap through
> deltastreamer. Such support is not present when doing bootstrap using Spark
> Datasource.
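
To make the gap described in this ticket concrete, here is a minimal Python sketch of the resume logic, not Hudi's actual implementation. It models each commit as a plain dict and looks only at the newest commit for a checkpoint. The metadata key name, the `latest_checkpoint` helper, the dict layout, and the Kafka-style checkpoint value are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumed name of the commit-metadata key holding the checkpoint;
# verify against the Hudi version in use.
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"


def latest_checkpoint(commits: List[Dict]) -> Optional[str]:
    """Return the checkpoint stored in the newest commit, or None.

    `commits` is ordered oldest to newest. Each entry is a simplified,
    hypothetical stand-in for a commit's extra metadata.
    """
    if not commits:
        return None
    return commits[-1].get("extraMetadata", {}).get(CHECKPOINT_KEY)


# A bootstrap commit written through the Spark datasource carries no checkpoint.
datasource_commit = {"instant": "001", "extraMetadata": {}}

# A later DeltaStreamer commit records one; the value here is a made-up
# Kafka-style string, and other sources would use a different format.
kafka_commit = {
    "instant": "002",
    "extraMetadata": {CHECKPOINT_KEY: "events,0:120,1:98"},
}

after_bootstrap = latest_checkpoint([datasource_commit])
after_streaming = latest_checkpoint([datasource_commit, kafka_commit])
```

In this model, `after_bootstrap` is `None`, which is the situation the ticket describes: after a datasource-only bootstrap there is nothing to resume from, so the user has to supply a starting checkpoint themselves.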