Hi Vinoth,

Thanks for looking into this.

The source delta file had 7877 records in it. The driver log shows the record count:

20/01/17 03:51:04 INFO HoodieBloomIndex: TotalRecords 7877, TotalFiles 30, TotalAffectedPartitions 29, TotalComparisons 7877, SafeParallelism 1

Looking at the commit history, it looks like some changes were made on the partition (especially a delete):

20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.duration, value=93112
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.numFilesDeleted, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.commitTime, value=1579233069000
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.duration, value=127138
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalBytesWritten, value=6908850
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalCompactedRecordsUpdated, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalCreateTime, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalFilesInsert, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalFilesUpdate, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalInsertRecordsWritten, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalLogFilesCompacted, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalLogFilesSize, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalPartitionsWritten, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalRecordsWritten, value=95705
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalScanTime, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalUpdateRecordsWritten, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalUpsertTime, value=7013
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.deltastreamer.duration, value=153732
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.deltastreamer.hiveSyncDuration, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.finalize.duration, value=437
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.finalize.numFilesFinalized, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.index.update.duration, value=0

Assuming it was an empty commit, shouldn't it still write the checkpoint key read from the last commit into the empty commit file? The checkpoint key is always needed in the most recent commit file to avoid this exception (*HoodieDeltaStreamerException: Unable to find previous checkpoint. Please double check if this table was indeed built via delta streamer*).

Can I work around the problem by passing the most recent checkpoint key in the config while calling deltastreamer.sync()? I have added a rough sketch of what I mean at the bottom of this mail.

Thanks
Venkatesh

On Mon, Jan 20, 2020 at 5:07 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Venki,
>
> Thanks for reporting this. The latest commit file seems to be empty? I am
> wondering if this is happening because there was no new data to process
> and the tool wrote an empty commit file..
> Can you confirm if this seems to match the case?
>
> Thanks
> Vinoth
>
> On Mon, Jan 20, 2020 at 4:00 PM Venki g <venke...@gmail.com> wrote:
>
> > Correcting the link to the commit file
> >
> > On Mon, Jan 20, 2020 at 3:50 PM Venki g <venke...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am using a Spark job to upsert the incremental delta files from S3
> > > into Hudi storage using the HoodieDeltaStreamer.sync() API. The
> > > incremental Spark job is failing with the exception below:
> > >
> > > java.lang.RuntimeException:
> > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to
> > > find previous checkpoint. Please double check if this table was indeed
> > > built via delta streamer
> > >   at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> > >   at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> > >   at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> > >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >   at java.lang.reflect.Method.invoke(Method.java:498)
> > >   at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> > > Caused by:
> > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable to
> > > find previous checkpoint. Please double check if this table was indeed
> > > built via delta streamer
> > >   at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
> > >   at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
> > >   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
> > >   at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
> > >   ... 7 more
> > >
> > > I found that the most recent commit file does not have the
> > > "deltastreamer.checkpoint.key" entry in it. I checked the second-to-last
> > > commit file and it does have this key.
> > >
> > > Link to the driver log (has the delta streamer config passed and other info) -
> > > https://pastebin.pl/view/raw/9606beb0
> > >
> > > Link to the most recent commit - https://pastebin.pl/view/raw/defc32ae
> > >
> > > When this happened for the first time, I was able to roll back the
> > > latest commit, load the data again, and get past this exception. Since
> > > this exception has started occurring again, I would like to understand
> > > the issue here and find the fix, if any.
> > >
> > > Would highly appreciate any help on this.
> > >
> > > Thanks
> > > Venkatesh
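For reference, here is roughly what I had in mind for that workaround. This is only a minimal sketch: the class, bucket, and path names are made up, the source/schema-provider settings we normally pass via the props file are left out, and the Config field names are taken from the Hudi version we are running, so they may differ slightly in other releases:

import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class HiveDeltaStreamerWithCheckpoint {
  public static void main(String[] args) throws Exception {
    JavaSparkContext jssc =
        new JavaSparkContext(new SparkConf().setAppName("hudi-incremental-load"));

    HoodieDeltaStreamer.Config cfg = new HoodieDeltaStreamer.Config();
    cfg.targetBasePath = "s3://my-bucket/hudi/HZ_PARTIES";               // hypothetical base path
    cfg.targetTableName = "HZ_PARTIES";
    cfg.tableType = "COPY_ON_WRITE";                                     // named storageType in older releases
    cfg.propsFilePath = "s3://my-bucket/config/hz-parties.properties";   // hypothetical props file

    // The workaround I am asking about: seed this run with the checkpoint key
    // copied from the second-to-last commit file (the last one that still has
    // "deltastreamer.checkpoint.key"), instead of relying on the latest commit.
    cfg.checkpoint = "<checkpoint value copied from the second-to-last commit>";

    new HoodieDeltaStreamer(cfg, jssc).sync();
  }
}

From a quick read of DeltaSync.readFromSource(), my understanding is that an explicitly passed checkpoint takes precedence over whatever is found (or not found) in the latest commit file, but please correct me if that is not the case.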