Hello,

The use case is as follows:

Say I am inserting 200K rows with dataframe.write.format("parquet"), etc.
(like a basic write-to-HDFS command), but due to some rhyme or reason my
job gets killed mid-run, meaning let's say I was only able to insert 100K
rows before the job died.
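
For reference, the write looks roughly like this (the session setup,
source, and paths are just placeholders for what my job actually does):

    from pyspark.sql import SparkSession

    # placeholder names throughout; this just mirrors the shape of my job
    spark = SparkSession.builder.appName("bulk-load").getOrCreate()
    df = spark.read.parquet("hdfs:///staging/changes/")  # ~200K rows
    df.write.format("parquet").mode("append").save("hdfs:///warehouse/my_table/")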

The twist is that I might actually be upserting, and even in append-only
cases the delta change data being inserted/written in this run might span
several partitions.
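
Concretely, something like this, where one batch can touch many partitions
(the partition column name here is made up):

    (df.write
       .format("parquet")
       .mode("append")
       .partitionBy("event_date")   # one batch may write into many date partitions
       .save("hdfs:///warehouse/my_table/"))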

Now what I am looking for is a way to roll the changes back: the batch
insertion should be all or nothing, and even if the table is partitioned,
the write must be atomic down to each row/unit of insertion.
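
Conceptually, the behaviour I want looks like this (a purely hypothetical
API, just to illustrate the semantics I am after):

    txn = table.begin_transaction()   # hypothetical: nothing visible to readers yet
    try:
        txn.upsert(df)                # may span many partitions
        txn.commit()                  # all 200K rows become visible at once
    except Exception:
        txn.rollback()                # a killed job leaves no partial data behind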

Kindly help.

Thanks,
Sumit
