Thank you Gourav,

> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.

Good catch. Is GCS the same?

> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went fine, the results are stored so that they can be
> published in a graph for monitoring, thus serving two purposes.

So that means after the job is done you query the data back to check, right?
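For what it's worth, this is roughly the kind of post-write check I have in mind. Just a sketch: the output path and expected count are placeholders, and it assumes a plain SQLContext is in scope (with Spark 2.0 it would be spark.read instead):

// Rough sketch of a post-job data quality check (placeholder path and count).
val outputPath = "s3a://my-bucket/events/date=2016-08-08"   // hypothetical output location
val expectedCount = 200000L                                  // size of the batch we just wrote

val written = sqlContext.read.parquet(outputPath)
val actualCount = written.count()

if (actualCount != expectedCount) {
  // Fail loudly so the scheduler / monitoring graph picks it up.
  sys.error(s"DQ check failed: expected $expectedCount rows, found $actualCount at $outputPath")
}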
> On Aug 8, 2016, at 1:46 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> But you have to be careful, that is the default setting. There is a way you
> can override it so that the writing to the _temp folder does not take place
> and you write directly to the main folder.
>
> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.
>
> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went fine, the results are stored so that they can be
> published in a graph for monitoring, thus serving two purposes.
>
> Regards,
> Gourav Sengupta
>
> On Mon, Aug 8, 2016 at 7:41 AM, Chanh Le <giaosu...@gmail.com> wrote:
> It's out of the box in Spark.
> When you write data into HDFS or any storage, the new parquet folder is only
> created properly if your Spark job succeeded; otherwise there is only a _temp
> folder inside, marking that the job is still not successful (Spark was
> killed), or nothing inside (the Spark job failed).
>
>> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>>
>> Hello,
>>
>> the use case is as follows:
>>
>> say I am inserting 200K rows with dataframe.write.format("parquet") etc.
>> (like a basic write-to-HDFS command), but say due to some rhyme or reason
>> my job got killed in the middle of the run, meaning let's say I was only
>> able to insert 100K rows when my job got killed.
>>
>> The twist is that I might actually be upserting, and even in append-only
>> cases, the delta change data being inserted / written in this run might
>> actually span various partitions.
>>
>> Now what I am looking for is something to roll the changes back, i.e. the
>> batch insertion should be all or nothing, and even if it is partitioned, it
>> must be atomic per row / unit of insertion.
>>
>> Kindly help.
>>
>> Thanks,
>> Sumit
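On the committer point Gourav mentions above, the knob I'm aware of is the Hadoop output committer algorithm version: version 2 commits each task's files straight into the destination directory at task commit, which avoids the extra job-level move out of _temporary (older Spark 1.x builds also shipped a DirectParquetOutputCommitter for this, but as far as I know it was dropped in 2.0 because it can leave corrupt output behind on failure). A rough sketch, where df, the partition column and the bucket path are placeholders:

// Rough sketch: reduce the _temporary -> final "rename" cost on S3.
// Algorithm version 2 commits each task's files straight into the destination
// at task commit, avoiding the extra job-level move. It does NOT make the
// write atomic; partial output can still be left behind on failure.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")

df.write
  .mode("append")
  .partitionBy("date")                  // hypothetical partition column
  .parquet("s3a://my-bucket/events")    // hypothetical destination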
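And on Sumit's original all-or-nothing question: as far as I know Spark itself has no transactional write or rollback, so the usual pattern is to write each batch into its own staging directory, check the _SUCCESS marker that the default FileOutputCommitter leaves on job commit (plus any Data Quality checks), and only publish the directory to readers once that passes; otherwise delete it. A rough sketch with hypothetical paths:

import org.apache.hadoop.fs.{FileSystem, Path}

// Rough sketch: treat a run directory as committed only if the job wrote its
// _SUCCESS marker (written by the default FileOutputCommitter on job commit).
val runPath = new Path("s3a://my-bucket/events_staging/run_20160808")  // hypothetical staging dir
val fs = FileSystem.get(runPath.toUri, sc.hadoopConfiguration)

if (fs.exists(new Path(runPath, "_SUCCESS"))) {
  // Safe to publish: e.g. repoint a table/partition location or copy into the live folder.
} else {
  // Job died mid-write; the half-written run directory can simply be deleted.
  fs.delete(runPath, true)
}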