Just use Delta.

Best,
Tufan
> On 24 Jul 2022, at 12:20, Shay Elbaz <shay.el...@gm.com> wrote:
>
> This is a known issue. Apache Iceberg, Hudi, and Delta Lake are among the
> possible solutions.
> Alternatively, instead of writing the output directly to the "official"
> location, write it to a staging directory. Once the job is done,
> rename the staging directory to the official location.
>
> From: kineret M <kiner...@gmail.com>
> Sent: Sunday, July 24, 2022 1:06 PM
> To: user@spark.apache.org <user@spark.apache.org>
> Subject: [EXTERNAL] Partial data with ADLS Gen2
>
> I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).
> When designing the application I assumed Spark would perform a global
> commit once the job completed, but what it really does is commit on each
> task: once a task finishes writing, its output moves from temp to target
> storage. So if the batch fails we have partial data, and on retry we
> get data duplication.
> Our scale is really huge, so rolling back (deleting data) is not an option
> for us; the scan alone would take a lot of time.
> Is there any "built-in" solution, something we can use out of the box?
>
> Thanks.