It turns out that calling repartition(numberOfParquetFilesPerPartition) just before the write creates exactly numberOfParquetFilesPerPartition Parquet files in each output folder.
dataframe
  .repartition(10)
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)

I'm not sure why this works - I would have expected repartition(10) to split the original dataframe into 10 partitions BEFORE partitionBy does its magic, but apparently that is not the case...

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-number-of-parquet-files-generated-when-using-partitionBy-tp25436p25437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
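A plausible explanation (an assumption about the mechanics, not verified against the Spark source): repartition(10) really does run first and shuffles the data into 10 task partitions; partitionBy then takes effect during the write, where each task writes one file per partition-key combination it happens to hold. So every output folder receives at most one file from each of the 10 tasks, i.e. at most 10 files - and exactly 10 whenever every task contains rows for every key. The pure-Python sketch below simulates that bookkeeping; the row data, key fields, and round-robin shuffle are all illustrative, not Spark's actual implementation:

```python
import random
from collections import defaultdict

# Illustrative rows: the tuple (year, country) plays the role of the
# partitionBy("year", "country") key. Values are made up.
random.seed(0)
rows = [(random.choice([2014, 2015]), random.choice(["US", "DE", "FR"]))
        for _ in range(1000)]

num_tasks = 10  # analogue of repartition(10)

# "repartition": spread rows round-robin into num_tasks partitions
# (a stand-in for Spark's shuffle into 10 partitions).
partitions = defaultdict(list)
for i, row in enumerate(rows):
    partitions[i % num_tasks].append(row)

# "write with partitionBy": each task emits one file per distinct key
# it contains, so a given key's folder gets at most one file per task.
files_per_folder = defaultdict(int)
for task_rows in partitions.values():
    for key in set(task_rows):
        files_per_folder[key] += 1

# No folder can end up with more files than there are tasks.
assert max(files_per_folder.values()) <= num_tasks
```

With enough rows per key, every task holds every key and each folder gets exactly num_tasks files, matching the behaviour observed above.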