But you have to be careful: that is only the default setting. There is a way to override it so that writing to the _temp folder does not take place and you write directly to the main folder.
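For example, a minimal sketch of what such an override can look like (these are my assumptions about the relevant knobs, not necessarily what was meant above): with Hadoop's FileOutputCommitter "algorithm version 2", task output is moved into the destination directory at task commit rather than at job commit, which cuts down the copy/rename overhead on S3 -- at the cost that a killed job can leave partial files behind in the final path.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("direct-commit-sketch")
  // Commit task output straight into the destination directory.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // Speculative task attempts could write duplicates to the final location,
  // so speculation is usually disabled with this commit mode.
  .config("spark.speculation", "false")
  .getOrCreate()

import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// "s3a://my-bucket/events/" is a hypothetical destination path.
df.write.mode("append").parquet("s3a://my-bucket/events/")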
Moving files from _temp folders to main folders is an additional overhead when you are working on S3, as there is no move operation. I generally have a set of Data Quality checks after each job to ascertain whether everything went fine; the results are stored so that they can be published in a graph for monitoring, thus serving two purposes (a rough sketch follows after the quoted thread below).

Regards,
Gourav Sengupta

On Mon, Aug 8, 2016 at 7:41 AM, Chanh Le <giaosu...@gmail.com> wrote:

> It's *out of the box* in Spark.
> When you write data into HDFS or any storage, it only creates the new
> parquet folder properly if your Spark job succeeded; otherwise there is
> only a *_temp* folder inside to mark that it is still not successful
> (Spark was killed), or nothing inside (the Spark job failed).
>
>
> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>
> Hello,
>
> The use case is as follows:
>
> Say I am inserting 200K rows with dataframe.write.format("parquet"), etc.
> (like a basic write-to-HDFS command), but due to some reason or rhyme my
> job got killed when the run was in the middle of it, meaning let's say I
> was only able to insert 100K rows when my job got killed.
>
> The twist is that I might actually be upserting, and even in append-only
> cases, the delta change data being inserted/written in this run might
> actually span various partitions.
>
> Now what I am looking for is something to roll the changes back: the
> batch insertion should be all or nothing, and even if it is partitioned,
> it must be atomic for each row/unit of insertion.
>
> Kindly help.
>
> Thanks,
> Sumit
>
>
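As an illustration of the Data Quality checks mentioned above (an assumed example, not my exact production code): after the write, read the output back, run a couple of sanity checks, and persist the results so they can be charted for monitoring.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-check-sketch").getOrCreate()
import spark.implicits._

val expectedCount = 200000L  // e.g. the size of the batch you tried to write
// Hypothetical output path written by the previous job.
val written = spark.read.parquet("s3a://my-bucket/events/")

val writtenCount = written.count()
val nullKeys     = written.filter($"id".isNull).count()

val report = Seq(
  ("row_count_matches", writtenCount == expectedCount, writtenCount),
  ("no_null_keys",      nullKeys == 0L,                nullKeys)
).toDF("check", "passed", "observed")

// Store the results so they can be published to a monitoring graph,
// and fail the pipeline if any check did not pass.
report.write.mode("append").parquet("s3a://my-bucket/dq-reports/")  // hypothetical path
require(report.filter(!$"passed").count() == 0, "Data Quality checks failed")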