Hi All,

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

df = spark.read.csv(PATH)
df.repartition('col1', 'col2') \
  .write.mode('overwrite') \
  .partitionBy('col1') \
  .parquet(OUT_PATH)

This works fine and overwrites the existing partition directories as expected.

However, the overwrite doesn't happen when the previous run was abruptly
interrupted and a partition directory was left with only a _started marker
file, with no _SUCCESS or _committed file. In that case the second run
doesn't overwrite the partition, so it ends up with duplicate files. Could
someone please help?
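
In the meantime I've been experimenting with manually cleaning up such
orphaned partitions before the write, along the lines of the sketch below.
It goes through the Hadoop FileSystem API via the JVM gateway, and it
assumes the _started/_committed markers sit inside each partition directory
(OUT_PATH is the same placeholder as above), so treat it as untested:

conf = spark._jsc.hadoopConfiguration()
root = spark._jvm.org.apache.hadoop.fs.Path(OUT_PATH)
fs = root.getFileSystem(conf)

for status in fs.listStatus(root):
    if not status.isDirectory():
        continue
    part_dir = status.getPath()
    names = [f.getPath().getName() for f in fs.listStatus(part_dir)]
    started = any(n.startswith('_started') for n in names)
    committed = any(n.startswith('_committed') or n == '_SUCCESS'
                    for n in names)
    # A _started marker with no matching commit marker means the previous
    # run died mid-write; drop the partition dir so the next overwrite
    # starts clean.
    if started and not committed:
        fs.delete(part_dir, True)  # recursive delete

Is there a supported way to make the dynamic overwrite handle this on its
own?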

-- 
Regards,

Rishi Shah
