Hi Eric. Q1: When I read Parquet files, I have observed that Spark creates as many partitions as there are Parquet files in the path.
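
A quick way to check this from the spark-shell (a minimal sketch using the same Spark 1.3-style API as in your mail; the path is just a hypothetical placeholder):

// hypothetical input directory containing several Parquet files
val path = "/data/events"
val df = sqlContext.parquetFile(path)

// count the partitions of the underlying RDD; in my tests this matches
// the number of Parquet files under the path
println(df.rdd.partitions.length)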
Q2: To reduce the number of partitions you can use rdd.repartition(x), where x is the target number of partitions. Depending on your case, repartition can be a heavy task, since it triggers a full shuffle (see the short sketch at the bottom of this message).

Regards,
Miguel

On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom <eric.eijkelenb...@gmail.com> wrote:

> Hello guys
>
> Q1: How does Spark determine the number of partitions when reading a Parquet file?
>
> val df = sqlContext.parquetFile(path)
>
> Is it somehow related to the number of Parquet row groups in my input?
>
> Q2: How can I reduce this number of partitions? Doing this:
>
> df.rdd.coalesce(200).count
>
> from the spark-shell causes job execution to hang…
>
> Any ideas? Thank you in advance.
>
> Eric

--
Saludos. Miguel Ángel
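
PS: a minimal sketch of both options, assuming the df from the snippet above. coalesce(n) avoids the full shuffle that repartition(n) performs, so it is usually the cheaper way to reduce the partition count; the target of 200 is only an example value.

// narrow dependency, no shuffle: merges existing partitions
val merged = df.rdd.coalesce(200)

// full shuffle: redistributes rows evenly, heavier but better balanced
val reshuffled = df.rdd.repartition(200)

println(merged.partitions.length)
println(reshuffled.partitions.length)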