Hey Friends, I am trying to use df.write.parquet() (on DataFrames created through sqlContext) to write DataFrames out as Parquet files. I have the following questions.
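
For context, here is roughly what I am doing, as a minimal sketch (the DataFrame contents, app name, and output paths below are placeholders, not my real job):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-write-sketch")
sqlContext = SQLContext(sc)

# Toy stand-in for my real DataFrame.
df = sqlContext.createDataFrame(
    [(2006, "a"), (2016, "b")],
    ["year", "value"])

# Write the DataFrame out as Parquet, one subdirectory per distinct year.
df.write.partitionBy("year").parquet("/tmp/example_output")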
1. Number of partitions

The default number of partitions seems to be 200. Is there any way other than df.repartition(n) to change this number? I was told repartition can be very expensive. (The first P.S. below sketches the knobs I already know about.)

2. Partition by size

When I use df.write.partitionBy('year') and only a few rows have "year=2006", the files under the "year=2006" directory come out very small. It would be very helpful if we could set a target size for each partition file. (The second P.S. below sketches the only workaround I have found.)

Thank you,
Wei
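
P.S. For question 1, here is a sketch of the two knobs I already know about, continuing from the placeholder df and sqlContext above, in case it clarifies what I am asking:

# (a) Lower the shuffle default of 200, so DataFrames produced by joins,
# aggregations, and other shuffles have fewer partitions to begin with.
sqlContext.setConf("spark.sql.shuffle.partitions", "50")

# (b) coalesce() merges existing partitions without the full shuffle that
# repartition(n) performs, so it should be considerably cheaper.
df.coalesce(10).write.parquet("/tmp/example_output_coalesced")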
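
P.S. For question 2, the only workaround I have found is to shuffle by the partition column before writing, so that all rows for a given year land in one task and therefore in one file. I believe this requires a Spark version where repartition() accepts a column, and it is exactly the kind of expensive shuffle I was hoping to avoid:

# All rows with the same year end up in a single partition, so
# partitionBy("year") writes one file per year directory instead of
# many tiny ones.
df.repartition("year") \
  .write.partitionBy("year") \
  .parquet("/tmp/example_output_compacted")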