Hi All,

I have a DataFrame of 2.7 TB (Parquet) that I need to partition by date; however, the Spark program below keeps failing with a *FileAlreadyExistsException*:
    df = spark.read.parquet(INPUT_PATH)
    df.repartition('date_field').write.partitionBy('date_field').mode('overwrite').parquet(PATH)

I did notice that a couple of tasks failed; could Spark have spun up retry attempts that then wrote into the same .staging directory?
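One thing I'm planning to try, just a sketch on the assumption that retried or speculative task attempts are what collide in .staging (I'm not even sure speculation is enabled on our cluster), is to disable speculative execution explicitly before the write:

    from pyspark.sql import SparkSession

    # Sketch: build the session with speculative execution disabled, so
    # duplicate speculative attempts can't race on the same output files.
    spark = (SparkSession.builder
             .appName('partition-by-date')
             .config('spark.speculation', 'false')
             .getOrCreate())

    # INPUT_PATH / PATH are the same placeholders as above.
    df = spark.read.parquet(INPUT_PATH)

    # Hash-shuffle by date_field so all rows for a given date land in the
    # same task, then write a Hive-style date_field=.../ directory layout.
    (df.repartition('date_field')
       .write
       .partitionBy('date_field')
       .mode('overwrite')
       .parquet(PATH))

--
Regards,
Rishi Shah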