It should be safe for Spark 1.4.1 and later versions.

Spark SQL now adds a job-wise UUID to output file names to distinguish files written by different write jobs, so the two write jobs you described should play well with each other. The job that commits later will generate a summary file covering all Parquet data files it sees. (However, Parquet summary file generation can fail for various reasons and is generally not reliable.)
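
To make the concurrency concrete, here is a minimal sketch (assuming an existing sqlContext and two DataFrames df1 and df2 with partition columns a and b; the output path is just a placeholder) that appends both DataFrames to the same directory in parallel and reads the result back:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.SaveMode

    val dir = "/tmp/parquet_out"  // placeholder output path

    // Launch both append jobs concurrently from the same driver. Each write
    // job embeds its own UUID in the part file names, so the two jobs never
    // overwrite each other's output files.
    val writes = Seq(df1, df2).map { df =>
      Future {
        df.write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir)
      }
    }
    Await.result(Future.sequence(writes), Duration.Inf)

    // Reading the directory back should give the same rows as the union,
    // ignoring row order and physical file layout.
    val combined = sqlContext.read.parquet(dir)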

Cheng

On 8/4/15 10:37 AM, Philip Weaver wrote:
I think this question applies regardless of whether I have two completely separate Spark jobs or tasks on different machines, or two cores that are part of the same task on the same machine.

If two jobs/tasks/cores/stages both save to the same Parquet directory in parallel like this:

    df1.write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir)

    df2.write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir)


Will the result be equivalent to this?

    df1.unionAll(df2).write.mode(SaveMode.Append).partitionBy("a", "b").parquet(dir)


What if we ensure that 'dir' does not exist first?

- Philip

