Hi Vinoth,

Thanks for looking into this.

The source delta file had 7877 records in it.

Driver log showing the record count:

20/01/17 03:51:04 INFO HoodieBloomIndex: TotalRecords 7877, TotalFiles 30,
TotalAffectedPartitions 29, TotalComparisons 7877, SafeParallelism 1

Looking at the commit history, it looks like some changes were made to the
partition (especially deletes). Metrics from that commit:

20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.duration, value=93112
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.numFilesDeleted, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.commitTime, value=1579233069000
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.duration, value=127138
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalBytesWritten, value=6908850
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalCompactedRecordsUpdated, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalCreateTime, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalFilesInsert, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalFilesUpdate, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalInsertRecordsWritten, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalLogFilesCompacted, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalLogFilesSize, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalPartitionsWritten, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalRecordsWritten, value=95705
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalScanTime, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalUpdateRecordsWritten, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.commit.totalUpsertTime, value=7013
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.deltastreamer.duration, value=153732
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.deltastreamer.hiveSyncDuration, value=0
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.finalize.duration, value=437
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.finalize.numFilesFinalized, value=1
20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.index.update.duration, value=0

Assuming it was an empty commit, shouldn't it still write the checkpoint key
read from the last commit into the empty commit file? The checkpoint key is
always needed in the most recent commit file to avoid this exception
(*HoodieDeltaStreamerException: Unable to find previous checkpoint. Please
double check if this table was indeed built via delta streamer*).
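
For reference, this is roughly how I am checking which commit files still
carry the checkpoint key. It is only a sketch against what I believe are the
0.5.x APIs (class and method names may differ slightly by version), and the
base path below is a placeholder for our S3 table path:

import java.util.stream.Collectors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.model.HoodieCommitMetadata;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

public class CheckpointKeyCheck {

  // The extraMetadata entry the delta streamer looks for on the latest commit.
  private static final String CHECKPOINT_KEY = "deltastreamer.checkpoint.key";

  public static void main(String[] args) throws Exception {
    // Placeholder: base path of the Hudi table (an S3 path in our case).
    String basePath = "s3://my-bucket/hudi/HZ_PARTIES";

    HoodieTableMetaClient metaClient =
        new HoodieTableMetaClient(new Configuration(), basePath);
    HoodieTimeline commits =
        metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();

    // Print, for every completed commit, whether it carries the checkpoint key.
    for (HoodieInstant instant : commits.getInstants().collect(Collectors.toList())) {
      byte[] details = commits.getInstantDetails(instant).get();
      HoodieCommitMetadata metadata =
          HoodieCommitMetadata.fromBytes(details, HoodieCommitMetadata.class);
      String checkpoint = metadata.getMetadata(CHECKPOINT_KEY);
      System.out.println(instant.getTimestamp() + " -> "
          + (checkpoint == null ? "NO CHECKPOINT KEY" : checkpoint));
    }
  }
}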

Can I work around the problem by passing the most recent checkpoint key in
the config when calling deltastreamer.sync()?
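
Something along these lines is what I have in mind. It is only a rough
sketch: the wrapper class name is made up, cfg.checkpoint is what I
understand backs the --checkpoint option, and the checkpoint value shown is
a placeholder I would copy from the last commit file that still has the key.

import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
import org.apache.spark.api.java.JavaSparkContext;

public class HiveDeltaStreamerWithCheckpoint {

  public void loadData(JavaSparkContext jssc) throws Exception {
    HoodieDeltaStreamer.Config cfg = new HoodieDeltaStreamer.Config();
    // ... the existing target path / table / source / schema provider
    // settings stay exactly as they are in our current job ...

    // Placeholder: checkpoint value copied from the last commit file that
    // still carries "deltastreamer.checkpoint.key" in its extraMetadata.
    cfg.checkpoint = "<checkpoint-value-from-previous-commit>";

    // Resume the delta streamer from that checkpoint instead of relying on
    // the latest commit file, which is missing the key.
    new HoodieDeltaStreamer(cfg, jssc).sync();
  }
}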

Thanks
Venkatesh

On Mon, Jan 20, 2020 at 5:07 PM Vinoth Chandar <vin...@apache.org> wrote:

> Hi Venki,
>
> Thanks for reporting this. The latest commit file seems to be empty? I am
> wondering if this is happening because there was no new data to process and
> the tool wrote an empty commit file.
> Can you confirm if this seems to match your case?
>
> Thanks
> Vinoth
>
>
> On Mon, Jan 20, 2020 at 4:00 PM Venki g <venke...@gmail.com> wrote:
>
> > Correcting the link to commit file
> >
> > On Mon, Jan 20, 2020 at 3:50 PM Venki g <venke...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am using a Spark job to upsert the incremental delta files from S3 into
> > > Hudi storage using the HoodieDeltaStreamer.sync() API. The incremental
> > > Spark job is failing with the below exception
> > >
> > > java.lang.RuntimeException:
> > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable
> > > to find previous checkpoint. Please double check if this table was indeed
> > > built via delta streamer
> > >   at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> > >   at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> > >   at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> > >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >   at java.lang.reflect.Method.invoke(Method.java:498)
> > >   at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> > > Caused by:
> > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException: Unable
> > > to find previous checkpoint. Please double check if this table was indeed
> > > built via delta streamer
> > >   at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
> > >   at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
> > >   at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
> > >   at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
> > >   ... 7 more
> > >
> > > I found that the most recent commit file does not have
> > > "deltastreamer.checkpoint.key" in it. I checked the second-last commit
> > > file and it does have this key.
> > >
> > > Link to driver log (has the delta streamer config passed and other info) -
> > > https://pastebin.pl/view/raw/9606beb0
> > >
> > > Link to most recent commit - https://pastebin.pl/view/raw/defc32ae
> > >
> > > When this happened the first time, I was able to roll back the latest
> > > commit, load the data again, and get past this exception. Since this
> > > exception has started occurring again, I would like to understand the
> > > issue here and find a fix, if any.
> > >
> > > Would highly appreciate any help on this.
> > >
> > > Thanks
> > > Venkatesh
> > >
> >
>
