There is a mv command in GCS, but I am not quite sure (because of the limited
data I work with, and my limited budget) whether the mv command actually
copies and deletes, or just re-points the files to a new directory by
changing their metadata.
Yes, the Data Quality checks are done after the job completes.
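For instance, a minimal sketch of such a post-job check in Scala, assuming
Spark 2.x and a hypothetical output path and expected row count:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("dq-check").getOrCreate()

  val outputPath = "hdfs:///data/output/run-2016-08-08"  // hypothetical path
  val expectedRows = 200000L                             // hypothetical expectation

  // Read the freshly written data back and verify the row count
  // before any downstream job is allowed to consume it.
  val written = spark.read.parquet(outputPath)
  val actual = written.count()
  require(actual == expectedRows,
    s"DQ check failed: expected $expectedRows rows, found $actual")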
Thank you Gourav,
> Moving files from _temp folders to main folders is an additional overhead
> when you are working on S3 as there is no move operation.
Good catch. Is that the same for GCS?
> I generally have a set of Data Quality checks after each job to ascertain
> whether everything went well.
But you have to be careful: that is the default setting. There is a way you
can override it so that the write to the _temp folder does not take place
and you write directly to the main folder.
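For example, on Hadoop 2.7+ one way to change this is the "version 2" commit
algorithm, which renames each task's output straight into the destination
folder at task commit instead of doing one big rename out of _temporary at
job commit; Spark 1.6 also had a DirectParquetOutputCommitter (removed in
Spark 2.0) that skips _temporary entirely. A minimal sketch, assuming Spark 2.x:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("direct-commit")
    // Hadoop setting, passed through via the spark.hadoop. prefix.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    .getOrCreate()

The trade-off is exactly the point above: without the final rename step, a
killed job can leave partial files sitting in the main folder, which makes
the Data Quality checks all the more important.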
Moving files from _temp folders to main folders is an additional overhead
when you are working on S3 as there is no move operation.
It’s out of the box in Spark.
When you write data into HDFS or any storage, Spark only creates the new
parquet folder properly if the job was a success; otherwise there is only a
_temp folder inside it, to mark that it is still not successful (Spark was
killed), or nothing inside it (the Spark job failed).
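A downstream consumer can tell these states apart by looking for the _SUCCESS
marker that the default committer drops into the folder on a successful job
commit. A minimal sketch, assuming an existing SparkSession named spark and a
hypothetical output path:

  import org.apache.hadoop.fs.{FileSystem, Path}

  val outputDir = new Path("hdfs:///data/output/run-2016-08-08")  // hypothetical
  val fs = outputDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

  if (fs.exists(new Path(outputDir, "_SUCCESS"))) {
    println("job committed cleanly, safe to read")
  } else if (fs.exists(new Path(outputDir, "_temporary"))) {
    println("job was killed mid-write, only _temporary left behind")
  } else {
    println("job failed before committing anything")
  }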
> On Aug 8, 2016,
Hello,
the use case is as follows:
say I am inserting 200K rows with dataframe.write.format("parquet") etc. etc.
(like a basic write-to-HDFS command), but say that for some rhyme or reason
my job got killed when the run was in the middle of it, meaning let's say I
was only able to insert 100K rows.
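For concreteness, a minimal sketch of that write, assuming a hypothetical
DataFrame df holding the 200K rows and a hypothetical output path; with the
default committer, a job killed halfway leaves a _temporary folder under the
target path rather than 100K committed rows:

  df.write
    .format("parquet")
    .mode("overwrite")  // hypothetical choice of save mode
    .save("hdfs:///data/output/run-2016-08-08")  // hypothetical path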