Hi.
The number of partitions is determined by the RDD Spark uses in the plan it creates. Here it uses NewHadoopRDD, which derives its partitions from getSplits of the input format it is given; for Parquet that is FilteringParquetRowInputFormat, a subclass of ParquetInputFormat. To change the number of partitions you would have to write a custom input format.
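
Before writing a whole input format, a lighter knob worth trying is the FileInputFormat split-size settings that getSplits consults. The sketch below is only an illustration, assuming Spark 1.3.x; whether FilteringParquetRowInputFormat actually honors these keys depends on your Parquet/Hadoop versions, and the app name and path are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-splits"))
val sqlContext = new SQLContext(sc)

// Bound the size of each input split to 128 MB (values are in bytes).
val splitSize = (128L * 1024 * 1024).toString
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", splitSize)
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize", splitSize)

val df = sqlContext.parquetFile("/path/to/parquet")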
Funnily enough, I observe different behaviour on EC2 vs EMR (Spark on EMR
installed with
https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark), both
with Spark 1.3.1/Hadoop 2.
Reading a folder with 12 Parquet files gives a different number of partitions on each setup.
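
For reference, the number being compared is just the partition count of the resulting RDD; a minimal way to check it (app name and path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("partition-count"))
val sqlContext = new SQLContext(sc)

// Read the folder and print how many partitions the scan produced.
val df = sqlContext.parquetFile("/data/folder-with-12-parquet-files")
println(df.rdd.partitions.length)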
Hi Eric.
Q1:
In my tests, when reading Parquet files, Spark generates as many partitions
as there are Parquet files in the path.
Q2:
To reduce the number of partitions you can use rdd.repartition(x), where x is
the desired number of partitions. Depending on your case, repartition can be a
heavy task, since it shuffles all the data across the cluster.
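
For illustration, here is a sketch of both options side by side, reusing the df = sqlContext.parquetFile(path) from your snippet; coalesce merges existing partitions without a full shuffle, so it is usually cheaper when you only want fewer partitions:

// repartition(200): full shuffle, redistributes rows into 200 even partitions.
val shuffled = df.rdd.repartition(200)
println(shuffled.partitions.length) // 200

// coalesce(200): merges existing partitions, no full shuffle by default;
// the result has min(current partition count, 200) partitions.
val merged = df.rdd.coalesce(200)
println(merged.partitions.length)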
Regards.
Hello guys
Q1: How does Spark determine the number of partitions when reading a Parquet
file?
val df = sqlContext.parquetFile(path)
Is it somehow related to the number of Parquet row groups in my input?
Q2: How can I reduce this number of partitions? Doing this:
df.rdd.coalesce(200).count