I have a DataFrame that I need to write to S3 according to a specific
partitioning. The code looks like this:

import org.apache.spark.sql.SaveMode

// Append, laid out as year=/month=/date=/country=/predicate= folders
dataframe
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)

The partitionBy splits the data into a fairly large number of folders (~400),
each holding only a small amount of data (~1GB). Here is the problem: because
the DataFrame is shuffled into 200 partitions (the default value of
spark.sql.shuffle.partitions), each of those partitions writes its own file
into every folder it has rows for, so the ~1GB in each folder gets split into
up to 200 small Parquet files, roughly 80000 files in total. This is not
optimal for a number of reasons and I would like to avoid it.
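As a sanity check (nothing more), the file count per folder seems to track the
number of partitions of the DataFrame at write time, which can be inspected
like this:

  // Number of partitions of the DataFrame right before the write;
  // each partition writes one file into every folder it has rows for.
  val numPartitions = dataframe.rdd.partitions.length
  println(s"DataFrame has $numPartitions partitions")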

I could perhaps set spark.sql.shuffle.partitions to a much smaller number,
say 10, but as I understand it this setting also controls the number of
partitions used for shuffles in joins and aggregations, so I don't really
want to change it globally.

Is there another way to control how many files are written?
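For example, would repartitioning by the partition columns right before the
write, roughly as sketched below, be a reasonable way to do this? (Column
names are from my job above; I haven't verified that this repartition
overload behaves the way I expect on every Spark version.)

  import org.apache.spark.sql.functions.col

  // Put all rows for a given (year, month, date, country, predicate)
  // combination into the same partition, so each output folder should get
  // a single Parquet file per write instead of up to 200.
  dataframe
    .repartition(col("year"), col("month"), col("date"),
      col("country"), col("predicate"))
    .write
    .mode(SaveMode.Append)
    .partitionBy("year", "month", "date", "country", "predicate")
    .parquet(outputPath)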


