Re: Parquet number of partitions

2015-05-07 Thread Archit Thakur
Hi. The number of partitions is determined by the RDD used in the plan Spark creates. It uses NewHadoopRDD, which derives its partitions from the getSplits method of the input format in use. Here that is FilteringParquetRowInputFormat, a subclass of ParquetInputFormat. To change the number of partitions, write a new input format and
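A minimal sketch of the kind of input format Archit describes, assuming the pre-1.7 parquet-mr package (parquet.hadoop) used by Spark 1.3; the class name and the pass-through split logic are hypothetical, purely for illustration:

    import java.util.{List => JList}
    import org.apache.hadoop.mapreduce.{InputSplit, JobContext}
    import parquet.hadoop.ParquetInputFormat

    // Hypothetical subclass: Spark creates one partition per split that
    // getSplits returns, so custom merging/regrouping logic here would
    // change the partition count. The splits are returned unchanged below.
    class CustomParquetInputFormat[T] extends ParquetInputFormat[T] {
      override def getSplits(context: JobContext): JList[InputSplit] = {
        val splits = super.getSplits(context)
        // regroup or merge splits here to change the number of partitions
        splits
      }
    }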

Re: Parquet number of partitions

2015-05-07 Thread Eric Eijkelenboom
Funny enough, I observe different behaviour on EC2 vs EMR (Spark on EMR installed with https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark), both with Spark 1.3.1/Hadoop 2. Reading a folder with 12 Parquet files gives
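A quick way to reproduce the comparison across environments (not from the original mail; path stands in for the folder holding the 12 Parquet files):

    // Print the partition count Spark actually produced for this folder
    val df = sqlContext.parquetFile(path)
    println("partitions: " + df.rdd.partitions.length)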

Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric. Q1: When I read Parquet files, I've observed that Spark generates as many partitions as there are Parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), where x = the desired number of partitions. Depending on your case, repartition can be a heavy task, since it shuffles all the data. Regards.
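A short sketch of that suggestion, assuming the same Spark 1.3-era sqlContext.parquetFile API used elsewhere in the thread (the value 200 is illustrative):

    val df = sqlContext.parquetFile(path)
    val repartitioned = df.rdd.repartition(200) // triggers a full shuffle
    println(repartitioned.partitions.length)    // 200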

Parquet number of partitions

2015-05-05 Thread Eric Eijkelenboom
Hello guys. Q1: How does Spark determine the number of partitions when reading a Parquet file? val df = sqlContext.parquetFile(path) Is it somehow related to the number of Parquet row groups in my input? Q2: How can I reduce this number of partitions? Doing this: df.rdd.coalesce(200).count
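Eric's snippet, completed into a runnable form. Unlike repartition, coalesce merges existing partitions without a shuffle when decreasing the count, which is why it is usually the cheaper way to reduce partitions:

    val df = sqlContext.parquetFile(path)
    // coalesce only plans fewer partitions; count() triggers the job
    val n = df.rdd.coalesce(200).count()
    println(n)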