There is a mv command in GCS, but I am not quite sure (because of the limited
data I work with and my limited budget for testing) whether the mv command
actually copies and deletes the files or just re-points them to the new
directory by changing their metadata.

Yes, the Data Quality checks are done after the job has completed
successfully (without quitting). If the Data Quality check failures are
within a certain threshold, the data is not deleted and only a warning is
generated. If they exceed that threshold, the data is deleted and then a
warning is raised.
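
As an illustration only (the output path, the null-id rule and the 5%
threshold below are my own assumptions for the sketch, not the actual
pipeline), a post-job check of that kind in Scala could look roughly like this:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dq-check").getOrCreate()

val outputPath = "s3a://bucket/output/run_1"         // assumed output location
val df = spark.read.parquet(outputPath)

val total    = df.count()
val badRows  = df.filter(df("id").isNull).count()    // example rule: ids must not be null
val failRate = if (total == 0) 1.0 else badRows.toDouble / total

val threshold = 0.05                                  // assumed threshold
if (failRate > threshold) {
  // failures above the threshold: delete the written data, then warn
  val path = new Path(outputPath)
  val fs   = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  fs.delete(path, true)
  println(s"WARNING: DQ failure rate $failRate above threshold, output deleted")
} else if (badRows > 0) {
  // failures within the threshold: keep the data, just warn
  // (the result can also be stored and published to a monitoring graph)
  println(s"WARNING: DQ failure rate $failRate within threshold, output kept")
}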



Regards,
Gourav Sengupta

On Mon, Aug 8, 2016 at 7:51 AM, Chanh Le <giaosu...@gmail.com> wrote:

> Thank you Gourav,
>
> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.
>
> Good catch. Is it the same for GCS?
>
> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went fine; the results are stored so that they can be
> published in a graph for monitoring, thus serving two purposes.
>
>
> So that means after the job is done you query the data to check, right?
>
>
>
> On Aug 8, 2016, at 1:46 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> But you have to be careful: that is the default setting. There is a way
> you can override it so that writing to the _temp folder does not take
> place and you write directly to the main folder.
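>
> As a rough sketch only (the committer setting and the paths below are
> assumptions on my side, not something verified in this thread), it could
> look like:
>
> import org.apache.spark.sql.SparkSession
>
> // Assumption: Hadoop's FileOutputCommitter "algorithm version 2", which has
> // tasks commit their output straight into the destination directory instead
> // of renaming everything out of _temporary at job commit time.
> val spark = SparkSession.builder()
>   .appName("direct-commit-sketch")
>   .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
>   .getOrCreate()
>
> spark.read.parquet("s3a://bucket/input")            // hypothetical paths
>   .write.mode("overwrite").parquet("s3a://bucket/output")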
>
> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.
>
> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went fine; the results are stored so that they can be
> published in a graph for monitoring, thus serving two purposes.
>
>
> Regards,
> Gourav Sengupta
>
> On Mon, Aug 8, 2016 at 7:41 AM, Chanh Le <giaosu...@gmail.com> wrote:
>
>> It’s *out of the box* in Spark.
>> When you write data into HDFS or any storage, Spark only creates the new
>> parquet folder properly if the job succeeded; otherwise there is only a
>> *_temp* folder inside to mark that it did not finish (Spark was killed), or
>> nothing inside at all (the Spark job failed).
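>>
>> For example (just a hypothetical check, the path is made up), a downstream
>> job can refuse to read the folder unless the _SUCCESS marker written on a
>> successful commit is present:
>>
>> import org.apache.hadoop.fs.Path
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder().appName("success-marker-check").getOrCreate()
>>
>> // Only read the output if the previous job committed successfully; an
>> // interrupted job leaves no _SUCCESS marker (only the _temp folder, or nothing).
>> val outputDir = new Path("hdfs:///data/events")    // assumed path
>> val fs = outputDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
>> if (fs.exists(new Path(outputDir, "_SUCCESS"))) {
>>   val df = spark.read.parquet(outputDir.toString)
>>   df.show()
>> }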
>>
>>
>>
>>
>>
>> On Aug 8, 2016, at 1:35 PM, Sumit Khanna <sumit.kha...@askme.in> wrote:
>>
>> Hello,
>>
>> the use case is as follows :
>>
>> say I am inserting 200K rows with dataframe.write.format("parquet") etc.
>> (like a basic write-to-HDFS command), but for some reason or other my job
>> got killed in the middle of the run, meaning, let’s say, I was only able to
>> insert 100K rows before the job was killed.
>>
>> The twist is that I might actually be upserting, and even in append-only
>> cases, the delta/change data being inserted / written in this run might
>> actually span various partitions.
>>
>> Now what I am looking for is something to roll the changes back: the
>> batch insertion should be all or nothing, and even if it is partitioned,
>> it must be atomic for each row / unit of insertion.
>>
>> Kindly help.
>>
>> Thanks,
>> Sumit
>>
>>
>>
>
>
