Hello,

I am testing writing my DataFrame to S3 using the DataFrame `write`
method. It mostly does a great job, but it fails one of my
requirements:

- Write to S3
- Use `partitionBy` to automatically create folders based on my chosen
partition columns
- Control the resultant filename (in whole or in part)

I can get the first two requirements met but not the third.

Here's an example. When I use the commands...

df.write.partitionBy("year", "month").mode("append") \
    .json("s3a://bucket_name/test_folder/")

... I get the partitions I need. However, the filenames come out
looking something like:
part-00000-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json
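
For reference, here's a self-contained version of what I'm running
(the SparkSession setup and the input path are just illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-write-test").getOrCreate()

    # df is assumed to have "year" and "month" columns among its fields
    df = spark.read.json("s3a://bucket_name/input/")  # illustrative source

    (df.write
        .partitionBy("year", "month")  # creates year=.../month=... subfolders
        .mode("append")                # append to any existing partitions
        .json("s3a://bucket_name/test_folder/"))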


Now, I understand Spark's need to include the partition number in the
filename. However, it sure would be nice to control the rest of the
filename.
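
The closest fallback I can think of is renaming the part files after
the write. A rough sketch (this leans on Spark's internal JVM gateway
via `_jvm`/`_jsc`, and the "mydata-NNNN.json" naming scheme is just an
example):

    # Rename each part file in place via Hadoop's FileSystem API.
    # _jvm and _jsc are Spark internals, so this is a sketch, not a stable API.
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()

    pattern = jvm.org.apache.hadoop.fs.Path(
        "s3a://bucket_name/test_folder/*/*/part-*.json")
    fs = pattern.getFileSystem(conf)

    for i, status in enumerate(fs.globStatus(pattern)):
        src = status.getPath()
        # Keep the file in its partition folder, just change the name
        dst = jvm.org.apache.hadoop.fs.Path(src.getParent(),
                                            "mydata-%04d.json" % i)
        fs.rename(src, dst)

But I'd much rather have Spark write the names I want in the first
place than rename everything after the fact.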


Any advice? Please and thank you.

Marco.
