Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Gourav Sengupta
There is a mv command in GCS, but I am not quite sure (because of the limited data I work with and my limited budget) whether the mv command actually copies and deletes, or just re-points the files to a new directory by changing their metadata. Yes, the Data Quality checks are done after the

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
Thank you Gourav, > Moving files from _temp folders to main folders is an additional overhead > when you are working on S3 as there is no move operation. Good catch. Is GCS the same? > I generally have a set of Data Quality checks after each job to ascertain > whether everything went
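
The thread does not show what those Data Quality checks look like; the sketch below is only an illustration of the idea, re-reading the committed output and asserting an expected row count. The path and the expected count are made-up placeholders, not taken from the thread.

  import org.apache.spark.sql.SparkSession

  // Hypothetical post-write data-quality check in the spirit described above.
  // The output path and expected row count are illustrative placeholders.
  val spark = SparkSession.builder().getOrCreate()
  val expectedRows = 200000L

  val written = spark.read.parquet("hdfs:///tmp/example_output")
  val actualRows = written.count()

  require(actualRows == expectedRows,
    s"Data quality check failed: expected $expectedRows rows, found $actualRows")
  // Further checks could assert non-null keys, value ranges, distinct counts, etc.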

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Gourav Sengupta
But you have to be careful, that is the default setting. There is a way you can override it so that the write to the _temp folder does not take place and you write directly to the main folder. Moving files from _temp folders to main folders is an additional overhead when you are working on S3, as
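
Gourav does not name the exact setting here, so the following is only a guess at one commonly used knob from that era: Hadoop's FileOutputCommitter algorithm version 2, which commits task output straight into the destination directory instead of renaming everything again at job commit. Note that this trades away some of the all-or-nothing behaviour discussed in this thread, since a killed job can leave partial files in the final folder.

  import org.apache.spark.sql.SparkSession

  // Assumption: the thread never names the setting being referred to.
  // FileOutputCommitter algorithm version 2 moves task output into the final
  // directory at task commit, skipping the extra job-level rename pass.
  // A job killed mid-run can then leave partial files in the destination.
  val spark = SparkSession.builder()
    .appName("direct-commit-sketch")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()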

Re: hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Chanh Le
It’s out of the box in Spark. When you write data into HDFS or any storage, it only creates the new parquet folder properly if your Spark job succeeded; otherwise there is only a _temporary folder inside, marking that it did not succeed (Spark was killed), or nothing inside (the Spark job failed). > On Aug 8, 2016,
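
A downstream consumer can lean on exactly those markers. The sketch below (paths are illustrative, not from the thread) checks for the _SUCCESS file written by a committed job and for a leftover _temporary folder before trusting the output:

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.SparkSession

  // Sketch of deciding whether an output folder is safe to read, based on the
  // markers described above. The path is a hypothetical example.
  val spark = SparkSession.builder().getOrCreate()
  val out = new Path("hdfs:///tmp/example_output")
  val fs: FileSystem = out.getFileSystem(spark.sparkContext.hadoopConfiguration)

  val committed   = fs.exists(new Path(out, "_SUCCESS"))
  val stillInTemp = fs.exists(new Path(out, "_temporary"))

  if (committed && !stillInTemp) {
    val df = spark.read.parquet(out.toString)   // safe to consume downstream
  } else {
    // Job was killed or failed mid-write: ignore this folder or clean it up.
  }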

hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Sumit Khanna
Hello, the use case is as follows: say I am inserting 200K rows with dataframe.write.format("parquet") etc. (a basic write-to-HDFS command), but for some reason my job got killed midway through the run, meaning let's say I was only able to insert 100K rows
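
For concreteness, a minimal sketch of the kind of write described above (the DataFrame and output path are placeholders, not the original job): while the job runs, Spark's committer writes files under <path>/_temporary and only moves them into place when the job commits successfully, which is the behaviour the rest of this thread discusses.

  import org.apache.spark.sql.{SaveMode, SparkSession}

  // Placeholder for the real 200K-row DataFrame and HDFS path.
  val spark = SparkSession.builder().appName("parquet-write-sketch").getOrCreate()
  val df = spark.range(200000L).toDF("id")

  // Output accumulates under hdfs:///tmp/example_output/_temporary/... while
  // tasks run; files are renamed into the folder only on successful job commit.
  df.write.mode(SaveMode.Overwrite).format("parquet").save("hdfs:///tmp/example_output")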