It’s out of the box in Spark. When you write data into HDFS (or any storage), Spark only creates the final Parquet output directory properly if the job succeeded. Otherwise you are left with only a _temporary folder inside it, marking that the job never committed (for example, Spark was killed), or with nothing inside at all if the job failed.
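For what it’s worth, here is a minimal sketch of how a reader can rely on that behaviour: Spark’s default file output committer drops a _SUCCESS marker into the output directory only after the whole job commits, so a consumer can check for that marker before trusting the data. The path and application name below are made up for illustration.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("success-marker-check").getOrCreate()

    // Hypothetical output location written by a previous Spark job.
    val outputPath = "hdfs:///data/events_parquet"
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // The committer moves task output out of _temporary and writes _SUCCESS
    // only when the whole job commits; a killed or failed job leaves no marker.
    val committed = fs.exists(new Path(outputPath, "_SUCCESS"))

    val df =
      if (committed) spark.read.parquet(outputPath)
      else sys.error(s"$outputPath was not committed; refusing to read partial output")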
> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>
> Hello,
>
> The use case is as follows:
>
> Say I am inserting 200K rows with dataframe.write.format("parquet") etc.
> (like a basic write-to-HDFS command), but due to some reason or rhyme my
> job got killed in the middle of the run, so that I was only able to insert
> 100K rows before it died.
>
> The twist is that I might actually be upserting, and even in append-only
> cases the delta change data being inserted / written in this run might
> span several partitions.
>
> What I am looking for is a way to roll the changes back: the batch
> insertion should be all or nothing, and even when it is partitioned it
> must be atomic for each row / unit of insertion.
>
> Kindly help.
>
> Thanks,
> Sumit
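The reply above covers what Spark itself guarantees per job. If, beyond that, the whole batch needs to become visible to readers atomically even when it spans several partitions, one common pattern (not a Spark built-in) is to write the batch to a staging directory and swap it into place only after the write succeeds. This is only a rough sketch under assumed names (deltaDf, the paths, an existing SparkSession called spark), and it replaces the whole output directory rather than doing a true per-partition upsert.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical locations; adjust to the real table layout.
    val finalPath   = "hdfs:///data/table_parquet"
    val stagingPath = finalPath + "_staging"

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // Clear any leftover staging data from a previously killed run.
    fs.delete(new Path(stagingPath), true)

    // Write the whole batch to the staging location first.
    deltaDf.write.format("parquet").mode("overwrite").save(stagingPath)

    // Only if the write above succeeded do we swap the directories;
    // a killed or failed job never touches finalPath.
    fs.delete(new Path(finalPath), true)
    fs.rename(new Path(stagingPath), new Path(finalPath))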