Hi Venkatesh,

It should keep writing an empty commit with checkpoints.. I think your
situation is different though.. You are seeing a commit wth no checkpoint
key.
>From the code, we always set this.

Since I see com.emr.java on the stack trace, can any of the amazon folks
confirm any code changes on the EMR code that is getting invoked?

Thanks
VInoth

On Tue, Jan 21, 2020 at 12:57 PM Venki g <venke...@gmail.com> wrote:

> Hi Vinoth,
>
> Thanks for looking into this.
>
> The source delta file had 7877 records in it,
>
> Driver log showing the number of records - 20/01/17 03:51:04 INFO
> HoodieBloomIndex: TotalRecords 7877, TotalFiles 30, TotalAffectedPartitions
> 29, TotalComparisons 7877, SafeParallelism 1
>
> Looking at the commit history, looks like some changes were done on the
> partition(esp delete).
>
> 20/01/17 03:53:18 INFO metrics: type=GAUGE, name=HZ_PARTIES.clean.duration,
> value=93112
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.clean.numFilesDeleted, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.commitTime, value=1579233069000
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.duration, value=127138
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalBytesWritten, value=6908850
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalCompactedRecordsUpdated, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalCreateTime, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalFilesInsert, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalFilesUpdate, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalInsertRecordsWritten, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalLogFilesCompacted, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalLogFilesSize, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalPartitionsWritten, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalRecordsWritten, value=95705
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalScanTime, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalUpdateRecordsWritten, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.commit.totalUpsertTime, value=7013
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.deltastreamer.duration, value=153732
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.deltastreamer.hiveSyncDuration, value=0
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.finalize.duration, value=437
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.finalize.numFilesFinalized, value=1
> 20/01/17 03:53:18 INFO metrics: type=GAUGE,
> name=HZ_PARTIES.index.update.duration, value=0
>
> Assuming, it was an empty commit. Shouldn't it be still writing the
> checkpoint key read from the last commit to the empty commit file? Since
> checkpoint key is always needed on the recent commit file to avoid this
> exception(*HoodieDeltaStreamerException: Unable to find previous
> checkpoint. Please double check if this table was indeed built via delta
> streamer*).
>
> Can I workaround the problem by passing the most recent checkpoint key in
> the config while calling deltastreamer.sync()?
>
> Thanks
> Venkatesh
>
> On Mon, Jan 20, 2020 at 5:07 PM Vinoth Chandar <vin...@apache.org> wrote:
>
> > Hi Venki,
> >
> > Thanks for reporting this. The latest commit file seems to be empty? I am
> > wondering if this is happening because there was no new data to process
> and
> > the tool wrote an empty commit file..
> > Can you confirm if this seems to match the case?
> >
> > Thanks
> > Vinoth
> >
> >
> > On Mon, Jan 20, 2020 at 4:00 PM Venki g <venke...@gmail.com> wrote:
> >
> > > Correcting the link to commit file
> > >
> > > On Mon, Jan 20, 2020 at 3:50 PM Venki g <venke...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using a spark job to upsert the incremental delta files from S3
> > into
> > > > Hudi storage using HoodieDeltaStreamer.sync() API , The incremental
> > spark
> > > > job is failing with below exception
> > > >
> > > > java.lang.RuntimeException:
> > > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException:
> > Unable
> > > to
> > > > find previous checkpoint. Please double check if this table was
> indeed
> > > > built via delta streamer
> > > > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:36)
> > > > at com.emr.java.HudiDataLoadJob.run(HudiDataLoadJob.java:28)
> > > > at com.emr.java.HiveDeltaStreamer.main(HiveDeltaStreamer.java:19)
> > > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > at
> > > >
> > >
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> > > > at
> > > >
> > >
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > at java.lang.reflect.Method.invoke(Method.java:498)
> > > > at
> > > >
> > >
> >
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> > > > Caused by:
> > > > org.apache.hudi.utilities.exception.HoodieDeltaStreamerException:
> > Unable
> > > to
> > > > find previous checkpoint. Please double check if this table was
> indeed
> > > > built via delta streamer
> > > > at
> > > >
> > >
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:252)
> > > > at
> > > >
> > >
> >
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:214)
> > > > at
> > > >
> > >
> >
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:120)
> > > > at com.emr.java.HiveDeltaStreamer.loadData(HiveDeltaStreamer.java:30)
> > > > ... 7 more
> > > >
> > > > I found the recent commit file does not have
> > > > ""deltastreamer.checkpoint.key" in the commit file. I checked the
> > second
> > > > last commit file and it has this key.
> > > >
> > > > Link to driver log(has delta streamer config passed and other info) -
> > > > https://pastebin.pl/view/raw/9606beb0
> > > >
> > > > *Link to most recent commit - https://pastebin.pl/view/raw/defc32ae
> > > > <https://pastebin.pl/view/raw/defc32ae> *
> > > >
> > > > When this happened for the first time, I was able to rollback the
> > latest
> > > > commit and loaded the data again and went past this exception. Since,
> > > this
> > > > exception has started occurring again, I would like to understand the
> > > issue
> > > > here and find the fix if any.
> > > >
> > > > Would highly appreciate any help on this.
> > > >
> > > > Thanks
> > > > Venkatesh
> > > >
> > >
> >
>

Reply via email to