Thank you Gourav,

> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.

Good catch. Is GCS the same?

> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went fine, the results are stored so that they can be
> published in a graph for monitoring, thus serving two purposes.

So that means after the job is done you query the data back to check, right?
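For what it's worth, this is roughly the kind of post-write check I have in mind. Just a sketch: the output path and expected count are placeholders, and it assumes a plain SQLContext is in scope (with Spark 2.0 it would be spark.read instead):

// Rough sketch of a post-job data quality check (placeholder path and count).
val outputPath = "s3a://my-bucket/events/date=2016-08-08"   // hypothetical output location
val expectedCount = 200000L                                  // size of the batch we just wrote

val written = sqlContext.read.parquet(outputPath)
val actualCount = written.count()

if (actualCount != expectedCount) {
  // Fail loudly so the scheduler / monitoring graph picks it up.
  sys.error(s"DQ check failed: expected $expectedCount rows, found $actualCount at $outputPath")
}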
> On Aug 8, 2016, at 1:46 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> But you have to be careful, that is the default setting. There is a way you
> can override it so that the writing to the _temp folder does not take place
> and you write directly to the main folder.
>
> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.
>
> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went fine, the results are stored so that they can be
> published in a graph for monitoring, thus serving two purposes.
>
> Regards,
> Gourav Sengupta
>
> On Mon, Aug 8, 2016 at 7:41 AM, Chanh Le <giaosu...@gmail.com> wrote:
> It's out of the box in Spark.
> When you write data into HDFS or any storage, the new parquet folder is only
> created properly if your Spark job succeeded; otherwise there is only a _temp
> folder inside, marking that the job is still not successful (Spark was
> killed), or nothing inside (the Spark job failed).
>
>> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>>
>> Hello,
>>
>> the use case is as follows:
>>
>> say I am inserting 200K rows with dataframe.write.format("parquet") etc.
>> (like a basic write-to-HDFS command), but say due to some rhyme or reason
>> my job got killed in the middle of the run, meaning let's say I was only
>> able to insert 100K rows when my job got killed.
>>
>> The twist is that I might actually be upserting, and even in append-only
>> cases, the delta change data being inserted / written in this run might
>> actually span various partitions.
>>
>> Now what I am looking for is something to roll the changes back, i.e. the
>> batch insertion should be all or nothing, and even if it is partitioned, it
>> must be atomic per row / unit of insertion.
>>
>> Kindly help.
>>
>> Thanks,
>> Sumit
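On the committer point Gourav mentions above, the knob I'm aware of is the Hadoop output committer algorithm version: version 2 commits each task's files straight into the destination directory at task commit, which avoids the extra job-level move out of _temporary (older Spark 1.x builds also shipped a DirectParquetOutputCommitter for this, but as far as I know it was dropped in 2.0 because it can leave corrupt output behind on failure). A rough sketch, where df, the partition column and the bucket path are placeholders:

// Rough sketch: reduce the _temporary -> final "rename" cost on S3.
// Algorithm version 2 commits each task's files straight into the destination
// at task commit, avoiding the extra job-level move. It does NOT make the
// write atomic; partial output can still be left behind on failure.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

df.write
  .mode("append")
  .partitionBy("date")                  // hypothetical partition column
  .parquet("s3a://my-bucket/events")    // hypothetical destination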
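And on Sumit's original all-or-nothing question: as far as I know Spark itself has no transactional write or rollback, so the usual pattern is to write each batch into its own staging directory, check the _SUCCESS marker that the default FileOutputCommitter leaves on job commit (plus any Data Quality checks), and only publish the directory to readers once that passes; otherwise delete it. A rough sketch with hypothetical paths:

import org.apache.hadoop.fs.{FileSystem, Path}

// Rough sketch: treat a run directory as committed only if the job wrote its
// _SUCCESS marker (written by the default FileOutputCommitter on job commit).
val runPath = new Path("s3a://my-bucket/events_staging/run_20160808")  // hypothetical staging dir
val fs = FileSystem.get(runPath.toUri, sc.hadoopConfiguration)

if (fs.exists(new Path(runPath, "_SUCCESS"))) {
  // Safe to publish: e.g. repoint a table/partition location or copy into the live folder.
} else {
  // Job died mid-write; the half-written run directory can simply be deleted.
  fs.delete(runPath, true)
}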