It’s out of the box in Spark. When you write data into HDFS (or any storage), Spark only creates the final Parquet output directory properly if the job succeeded. Otherwise you are left with only a _temporary folder inside it, marking that the job never committed (for example, Spark was killed), or with nothing inside at all if the job failed.
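For what it’s worth, here is a minimal sketch of how a reader can rely on that behaviour: Spark’s default file output committer drops a _SUCCESS marker into the output directory only after the whole job commits, so a consumer can check for that marker before trusting the data. The path and application name below are made up for illustration.

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("success-marker-check").getOrCreate()

    // Hypothetical output location written by a previous Spark job.
    val outputPath = "hdfs:///data/events_parquet"
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // The committer moves task output out of _temporary and writes _SUCCESS
    // only when the whole job commits; a killed or failed job leaves no marker.
    val committed = fs.exists(new Path(outputPath, "_SUCCESS"))

    val df =
      if (committed) spark.read.parquet(outputPath)
      else sys.error(s"$outputPath was not committed; refusing to read partial output")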
> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>
> Hello,
>
> The use case is as follows:
>
> Say I am inserting 200K rows with dataframe.write.format("parquet") etc.
> (like a basic write-to-HDFS command), but due to some reason or rhyme my
> job got killed in the middle of the run, so that I was only able to insert
> 100K rows before it died.
>
> The twist is that I might actually be upserting, and even in append-only
> cases the delta change data being inserted / written in this run might
> span several partitions.
>
> What I am looking for is a way to roll the changes back: the batch
> insertion should be all or nothing, and even when it is partitioned it
> must be atomic for each row / unit of insertion.
>
> Kindly help.
>
> Thanks,
> Sumit
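The reply above covers what Spark itself guarantees per job. If, beyond that, the whole batch needs to become visible to readers atomically even when it spans several partitions, one common pattern (not a Spark built-in) is to write the batch to a staging directory and swap it into place only after the write succeeds. This is only a rough sketch under assumed names (deltaDf, the paths, an existing SparkSession called spark), and it replaces the whole output directory rather than doing a true per-partition upsert.

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical locations; adjust to the real table layout.
    val finalPath   = "hdfs:///data/table_parquet"
    val stagingPath = finalPath + "_staging"

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

    // Clear any leftover staging data from a previously killed run.
    fs.delete(new Path(stagingPath), true)

    // Write the whole batch to the staging location first.
    deltaDf.write.format("parquet").mode("overwrite").save(stagingPath)

    // Only if the write above succeeded do we swap the directories;
    // a killed or failed job never touches finalPath.
    fs.delete(new Path(finalPath), true)
    fs.rename(new Path(stagingPath), new Path(finalPath))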