It turns out that calling repartition(numberOfParquetFilesPerPartition) just before the write creates exactly numberOfParquetFilesPerPartition Parquet files in each output folder.
dataframe
  .repartition(10)
  .write
  .mode(SaveMode.Append)
  .partitionBy("year", "month", "date", "country", "predicate")
  .parquet(outputPath)

I'm not sure why this works - I would have expected repartition(10) to split the original dataframe into 10 partitions BEFORE partitionBy does its magic, but apparently that is not the case...

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-control-number-of-parquet-files-generated-when-using-partitionBy-tp25436p25437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
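A plausible explanation (an assumption about the mechanics, not verified against the Spark source): repartition(10) really does run first and shuffles the data into 10 task partitions; partitionBy then takes effect during the write, where each task writes one file per partition-key combination it happens to hold. So every output folder receives at most one file from each of the 10 tasks, i.e. at most 10 files - and exactly 10 whenever every task contains rows for every key. The pure-Python sketch below simulates that bookkeeping; the row data, key fields, and round-robin shuffle are all illustrative, not Spark's actual implementation:

```python
import random
from collections import defaultdict

# Illustrative rows: the tuple (year, country) plays the role of the
# partitionBy("year", "country") key. Values are made up.
random.seed(0)
rows = [(random.choice([2014, 2015]), random.choice(["US", "DE", "FR"]))
        for _ in range(1000)]

num_tasks = 10  # analogue of repartition(10)

# "repartition": spread rows round-robin into num_tasks partitions
# (a stand-in for Spark's shuffle into 10 partitions).
partitions = defaultdict(list)
for i, row in enumerate(rows):
    partitions[i % num_tasks].append(row)

# "write with partitionBy": each task emits one file per distinct key
# it contains, so a given key's folder gets at most one file per task.
files_per_folder = defaultdict(int)
for task_rows in partitions.values():
    for key in set(task_rows):
        files_per_folder[key] += 1

# No folder can end up with more files than there are tasks.
assert max(files_per_folder.values()) <= num_tasks
```

With enough rows per key, every task holds every key and each folder gets exactly num_tasks files, matching the behaviour observed above.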