[ 
https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456010#comment-17456010
 ] 

sivabalan narayanan commented on HUDI-1214:
-------------------------------------------

I guess this is the ask:

Add the ability to serialize a checkpoint via Spark datasource writes, so that
when a user later starts up deltastreamer, it automatically resumes from the
last known checkpoint.

Here is my take on this ask:

Deltastreamer uses the Source interface, so each source has its own way of 
determining checkpoints, and the checkpoint format also differs from one 
source to another. Spark datasource writers don't use any of these. 

A typical use case could be: 

bootstrap data from a source folder using the Spark datasource, and then start 
a deltastreamer with a Kafka source. So the checkpoint formats may also differ. 
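
To illustrate why a checkpoint written for one source cannot simply be reused 
by another, here is a minimal sketch. The two checkpoint strings and the 
helper `parse_kafka_checkpoint` are hypothetical and only for illustration: 
the sketch assumes a Kafka-style checkpoint of the form 
"topic,partition:offset,..." and a folder-style checkpoint that is a single 
timestamp. Please check the actual formats against the Source implementations 
in your Hudi version.

```python
from typing import Dict, Tuple

# Hypothetical checkpoint strings; the exact formats are assumptions.
KAFKA_CHECKPOINT = "events,0:120,1:98"  # topic,partition:offset,...
DFS_CHECKPOINT = "1638921600000"        # e.g. a file modification time in ms


def parse_kafka_checkpoint(checkpoint: str) -> Tuple[str, Dict[int, int]]:
    """Parse 'topic,partition:offset,...' into (topic, {partition: offset})."""
    topic, *pairs = checkpoint.split(",")
    if not topic or not pairs:
        raise ValueError(f"not a Kafka-style checkpoint: {checkpoint!r}")
    offsets: Dict[int, int] = {}
    for pair in pairs:
        partition, sep, offset = pair.partition(":")
        if not sep:
            raise ValueError(f"bad partition:offset pair: {pair!r}")
        # int() raises ValueError on non-numeric input
        offsets[int(partition)] = int(offset)
    return topic, offsets


if __name__ == "__main__":
    print(parse_kafka_checkpoint(KAFKA_CHECKPOINT))
    try:
        parse_kafka_checkpoint(DFS_CHECKPOINT)
    except ValueError as err:
        # A folder-style checkpoint is meaningless to a Kafka source.
        print("cannot resume:", err)
```

A checkpoint produced while bootstrapping from a folder would fail this kind 
of parsing, so whoever writes the checkpoint has to know which source will 
read it later.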

Anyway, as of today, the Spark datasource does not have a way to determine 
checkpoints. I will close this ticket out. But please feel free to re-open if 
my understanding is wrong, or if you have ideas on how to go about this. 

 

> Need ability to set deltastreamer checkpoints when doing Spark datasource 
> writes
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-1214
>                 URL: https://issues.apache.org/jira/browse/HUDI-1214
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Spark Integration
>            Reporter: Balaji Varadarajan
>            Assignee: Trevorzhang
>            Priority: Major
>              Labels: sev:high, user-support-issues
>             Fix For: 0.11.0
>
>
> Such support is needed for bootstrapping cases where users use a Spark write 
> to do the initial bootstrap and then subsequently use deltastreamer.
> DeltaStreamer manages checkpoints inside hoodie commit files and expects 
> checkpoints in previously committed metadata. Users are expected to pass a 
> checkpoint or an initial checkpoint provider when performing bootstrap 
> through deltastreamer. Such support is not present when doing bootstrap 
> using the Spark datasource.
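
As a rough sketch of the behaviour described in the issue, the snippet below 
looks up a checkpoint in the JSON of a commit file. It is not Hudi code: the 
"extraMetadata" field name, the "deltastreamer.checkpoint.key" key, and the 
sample commit contents are assumptions made for illustration and should be 
verified against the Hudi version in use.

```python
import json
from typing import Optional

# Assumed key under which deltastreamer stores its checkpoint.
CHECKPOINT_KEY = "deltastreamer.checkpoint.key"


def read_checkpoint(commit_json: str) -> Optional[str]:
    """Return the checkpoint stored in a commit's extra metadata, if any."""
    metadata = json.loads(commit_json)
    extra = metadata.get("extraMetadata") or {}
    return extra.get(CHECKPOINT_KEY)


# Hypothetical commit written by deltastreamer: carries a checkpoint.
deltastreamer_commit = json.dumps(
    {"extraMetadata": {CHECKPOINT_KEY: "events,0:120,1:98"}}
)

# Hypothetical commit written by a Spark datasource job: no checkpoint.
datasource_commit = json.dumps({"extraMetadata": {"schema": "{}"}})

if __name__ == "__main__":
    print(read_checkpoint(deltastreamer_commit))
    print(read_checkpoint(datasource_commit))
```

The second lookup finds nothing, which corresponds to the gap in this ticket: 
after a datasource bootstrap there is no checkpoint for deltastreamer to 
resume from.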



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
