But you have to be careful: that is only the default setting. There is a way to override it so that writing to the _temp folder does not take place and you write directly to the main folder.
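For example, a minimal sketch of what such an override can look like (these are my assumptions about the relevant knobs, not necessarily what was meant above): with Hadoop's FileOutputCommitter "algorithm version 2", task output is moved into the destination directory at task commit rather than at job commit, which cuts down the copy/rename overhead on S3 -- at the cost that a killed job can leave partial files behind in the final path.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("direct-commit-sketch")
  // Commit task output straight into the destination directory.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // Speculative task attempts could write duplicates to the final location,
  // so speculation is usually disabled with this commit mode.
  .config("spark.speculation", "false")
  .getOrCreate()

import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// "s3a://my-bucket/events/" is a hypothetical destination path.
df.write.mode("append").parquet("s3a://my-bucket/events/")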
Moving files from _temp folders to main folders is an additional overhead when you are working on S3, as there is no move operation. I generally have a set of Data Quality checks after each job to ascertain whether everything went fine; the results are stored so that they can be published in a graph for monitoring, thus serving two purposes (a rough sketch follows after the quoted thread below).

Regards,
Gourav Sengupta

On Mon, Aug 8, 2016 at 7:41 AM, Chanh Le <giaosu...@gmail.com> wrote:

> It's *out of the box* in Spark.
> When you write data into HDFS or any storage, it only creates the new
> parquet folder properly if your Spark job succeeded; otherwise there is
> only a *_temp* folder inside to mark that it is still not successful
> (Spark was killed), or nothing inside (the Spark job failed).
>
>
> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>
> Hello,
>
> The use case is as follows:
>
> Say I am inserting 200K rows with dataframe.write.format("parquet"), etc.
> (like a basic write-to-HDFS command), but due to some reason or rhyme my
> job got killed when the run was in the middle of it, meaning let's say I
> was only able to insert 100K rows when my job got killed.
>
> The twist is that I might actually be upserting, and even in append-only
> cases, the delta change data being inserted/written in this run might
> actually span various partitions.
>
> Now what I am looking for is something to roll the changes back: the
> batch insertion should be all or nothing, and even if it is partitioned,
> it must be atomic for each row/unit of insertion.
>
> Kindly help.
>
> Thanks,
> Sumit
>
>
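As an illustration of the Data Quality checks mentioned above (an assumed example, not my exact production code): after the write, read the output back, run a couple of sanity checks, and persist the results so they can be charted for monitoring.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-check-sketch").getOrCreate()
import spark.implicits._

val expectedCount = 200000L  // e.g. the size of the batch you tried to write
// Hypothetical output path written by the previous job.
val written = spark.read.parquet("s3a://my-bucket/events/")

val writtenCount = written.count()
val nullKeys     = written.filter($"id".isNull).count()

val report = Seq(
  ("row_count_matches", writtenCount == expectedCount, writtenCount),
  ("no_null_keys",      nullKeys == 0L,                nullKeys)
).toDF("check", "passed", "observed")

// Store the results so they can be published to a monitoring graph,
// and fail the pipeline if any check did not pass.
report.write.mode("append").parquet("s3a://my-bucket/dq-reports/")  // hypothetical path
require(report.filter(!$"passed").count() == 0, "Data Quality checks failed")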