This is about a situation where multiple workflows write different partitions
of the same table.

Example:

10 different processes are writing Parquet or ORC files for different
partitions of the same table foo, at /staging/tables/foo/partition_field=1,
/staging/tables/foo/partition_field=2, /staging/tables/foo/partition_field=3,
and so on.

It appears to me that it is currently not possible to do this simultaneously
into the same directory in a reliable way: whenever a DataFrame writer
starts, it stages temporary files under /staging/tables/foo/_temporary, a
directory shared by all the writers, and each writer deletes it when it
finishes. As a result, whichever DataFrame writer finishes first deletes the
temporary files of all the other writers that have not finished yet.
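To make the setup concrete, each process does roughly the following; the
input path and DataFrame contents are placeholders, the key point being that
every process targets the same table root:

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder()
    .appName("write-foo-partition-1")
    .getOrCreate()

  // Each process only holds the rows for its own partition value.
  val df = spark.read.parquet("/staging/input/source_for_partition_1")

  // Writing through the table root makes the output committer stage files
  // under /staging/tables/foo/_temporary, which every concurrent writer
  // shares.
  df.write
    .mode(SaveMode.Append)
    .partitionBy("partition_field")
    .parquet("/staging/tables/foo")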

I believe this could be bypassed by having each writer stage its files in its
own /staging/tables/foo/_temporary_someHash directory instead.
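For illustration, one way to give each writer its own staging directory
without editing Spark itself would be to write each partition's files
directly into its own partition directory rather than going through
partitionBy at the table root. A rough sketch of what I mean, where the
paths, column name and partition value are placeholders, and the new
partitions would still have to be registered with the metastore afterwards
(e.g. with ALTER TABLE ... ADD PARTITION):

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder()
    .appName("write-foo-partition-1")
    .getOrCreate()

  val partitionValue = 1
  val df = spark.read.parquet("/staging/input/source_for_partition_1")

  // Drop the partition column and write straight into the partition
  // directory. The committer now stages under
  // /staging/tables/foo/partition_field=1/_temporary, so concurrent jobs
  // for other partition values no longer share a staging directory.
  df.drop("partition_field")
    .write
    .mode(SaveMode.Overwrite)
    .parquet(s"/staging/tables/foo/partition_field=$partitionValue")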

Is there currently a way to achieve this without having to edit the source
code?


