It turns out that calling repartition(numberOfParquetFilesPerPartition) just
before the write creates up to numberOfParquetFilesPerPartition files in each
output folder (exactly that many when every task has rows for every key):

import org.apache.spark.sql.SaveMode

dataframe
  .repartition(10)
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)

I initially wasn't sure why this works - I would have thought that
repartition(10) partitions the original dataframe into 10 partitions BEFORE
partitionBy does its magic. That is in fact what happens: repartition(10)
shuffles the data into 10 partitions, and partitionBy does not reshuffle at
all. The write stage runs one task per partition, and each task splits its own
rows across the matching year=/month=/... directories, appending at most one
part file per directory. Since only 10 tasks exist, each folder ends up with
at most 10 files.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-number-of-parquet-files-generated-when-using-partitionBy-tp25436p25437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.