I have a Spark batch application writing to ADLS Gen2 (hierarchical namespace).
When designing the application I assumed Spark would perform a single global
commit once the whole job succeeds, but what it actually does is commit per
task, meaning *as soon as a task finishes writing, its output is moved from
the temp directory to the target storage*. So if the batch fails we are left
with partial data, and on retry we get duplicates.
Our scale is really huge, so rolling back (deleting the partial data) is not
an option for us; just finding the affected files would take a long time.
Is there any "built-in" solution, something we can use out of the box?

Thanks.
