Just use Delta 

Best 
Tufan
Sent from my iPhone

> On 24 Jul 2022, at 12:20, Shay Elbaz <shay.el...@gm.com> wrote:
> 
> 
> This is a known issue. Apache Iceberg, Hudi, and Delta Lake are among the 
> possible solutions.
> Alternatively, instead of writing the output directly to the "official" 
> location, write it to a staging directory. Once the job is done, 
> rename the staging dir to the official location.
> From: kineret M <kiner...@gmail.com>
> Sent: Sunday, July 24, 2022 1:06 PM
> To: user@spark.apache.org <user@spark.apache.org>
> Subject: [EXTERNAL] Partial data with ADLS Gen2
>  
> I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace). 
> When designing the application I assumed Spark would perform a global 
> commit once the job is committed, but what it really does is commit each 
> task: once a task completes writing, its output is moved from temp to 
> target storage. So if the batch fails we have partial data, and on retry 
> we get data duplication. 
> Our scale is really huge, so rolling back (deleting data) is not an option 
> for us; searching for the partial output would take a lot of time. 
> Is there any built-in solution, something we can use out of the box?
> 
> Thanks. 
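The staging-directory approach suggested above can be sketched in a few lines. This is a minimal, hypothetical illustration (the function and directory names are made up, and local `os.rename` stands in for the storage-level directory rename): write everything under a staging path, then publish it with a single rename, so readers see either the complete output or nothing. This is exactly why it works on ADLS Gen2 with a hierarchical namespace, where a directory rename is a single metadata operation rather than a per-file copy.

```python
import os
import shutil

def commit_job(write_partition, partitions, final_dir):
    """Write all output under a staging directory, then publish it with a
    single directory rename (the "commit"). If any partition write fails,
    only the staging directory is affected, and retries never duplicate
    data in the official location."""
    staging_dir = final_dir + ".staging"
    os.makedirs(staging_dir, exist_ok=True)
    try:
        for i, part in enumerate(partitions):
            # write_partition is a caller-supplied writer for one partition
            write_partition(os.path.join(staging_dir, f"part-{i:05d}"), part)
        os.rename(staging_dir, final_dir)  # the atomic commit step
    except Exception:
        # Cheap rollback: the official location was never touched.
        shutil.rmtree(staging_dir, ignore_errors=True)
        raise
```

In a real Spark job the `write_partition` loop would be the job itself writing to the staging path, and the rename would be issued once the job succeeds, e.g. via the Hadoop `FileSystem.rename` API against the ABFS path.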
