Hi All,

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

df = spark.read.csv(PATH)
df.repartition('col1', 'col2') \
  .write.mode('overwrite') \
  .partitionBy('col1') \
  .parquet(OUT_PATH)

This works fine and overwrites the existing partition directories as expected.

However, the overwrite doesn't happen when the previous run was abruptly
interrupted and a partition directory was left with only a _started marker
file, with no _SUCCESS or _committed file. In that case the second run
doesn't overwrite the partition, so it ends up with duplicate files. Could
someone please help?
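
In the meantime I've been experimenting with manually cleaning up such
orphaned partitions before the write, along the lines of the sketch below.
It goes through the Hadoop FileSystem API via the JVM gateway, and it
assumes the _started/_committed markers sit inside each partition directory
(OUT_PATH is the same placeholder as above), so treat it as untested:

conf = spark._jsc.hadoopConfiguration()
root = spark._jvm.org.apache.hadoop.fs.Path(OUT_PATH)
fs = root.getFileSystem(conf)

for status in fs.listStatus(root):
    if not status.isDirectory():
        continue
    part_dir = status.getPath()
    names = [f.getPath().getName() for f in fs.listStatus(part_dir)]
    started = any(n.startswith('_started') for n in names)
    committed = any(n.startswith('_committed') or n == '_SUCCESS'
                    for n in names)
    # A _started marker with no matching commit marker means the previous
    # run died mid-write; drop the partition dir so the next overwrite
    # starts clean.
    if started and not committed:
        fs.delete(part_dir, True)  # recursive delete

Is there a supported way to make the dynamic overwrite handle this on its
own?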

-- 
Regards,

Rishi Shah
