Re: Spark DataFrames uses too many partitions

2015-08-12 Thread Alasdair McBride
Thanks Silvio! On 11 Aug 2015 17:44, Silvio Fiorito silvio.fior...@granturing.com wrote: You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200. On 8/11/15, 11:31 AM, Al M alasdair.mcbr...@gmail.com wrote: I am using DataFrames with
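
A minimal sketch of this fix on Spark 1.4.x, assuming a spark-shell session where sqlContext is already defined (the value 8 is illustrative, not a recommendation):

    // Lower the post-shuffle partition count from the default of 200.
    // 8 is an arbitrary illustrative value; tune it to your data volume.
    sqlContext.setConf("spark.sql.shuffle.partitions", "8")

    // Equivalent via SQL:
    sqlContext.sql("SET spark.sql.shuffle.partitions=8")

The same value can also be set at launch time, e.g. with --conf spark.sql.shuffle.partitions=8 on spark-submit.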

RE: Spark DataFrames uses too many partitions

2015-08-12 Thread Alasdair McBride
Thank you Hao; that was a fantastic response. I have raised SPARK-9872 for this. I would also love to have dynamic partitioning; I mentioned it in the JIRA. On 12 Aug 2015 02:19, Cheng, Hao hao.ch...@intel.com wrote: That's a good question; we don't support reading small files in a single

Re: Spark DataFrames uses too many partitions

2015-08-12 Thread Al M
The DataFrames parallelism is currently controlled through the configuration option spark.sql.shuffle.partitions; the default value is 200. I have raised an improvement JIRA to make it possible to specify the number of partitions: https://issues.apache.org/jira/browse/SPARK-9872
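
Until SPARK-9872 lands, one workaround sketch, again assuming a 1.4.x spark-shell with sqlContext in scope and two hypothetical DataFrames df1 and df2 sharing an id column (the counts 4 and 400 are illustrative):

    // Any shuffle operation (join, groupBy, ...) produces
    // spark.sql.shuffle.partitions output partitions.
    val joined = df1.join(df2, "id")

    // Reshape the result explicitly: coalesce shrinks the partition
    // count without a full shuffle...
    val compact = joined.coalesce(4)

    // ...while repartition can grow (or shrink) it, at the cost of a shuffle.
    val wider = joined.repartition(400)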

Re: Spark DataFrames uses too many partitions

2015-08-11 Thread Silvio Fiorito
You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200. On 8/11/15, 11:31 AM, Al M alasdair.mcbr...@gmail.com wrote: I am using DataFrames with Spark 1.4.1. I really like DataFrames but the partitioning makes no sense to me. I am loading
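
A short sketch reproducing the behaviour Al describes on 1.4.1, assuming a spark-shell session; the input path and the column name "key" are hypothetical:

    // Input partitioning follows the source files/splits...
    val df = sqlContext.read.parquet("/hypothetical/input")
    println(df.rdd.partitions.size)

    // ...but any aggregation or join shuffles into 200 partitions by default.
    val counts = df.groupBy("key").count()
    println(counts.rdd.partitions.size)  // 200 unless spark.sql.shuffle.partitions is overridden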

RE: Spark DataFrames uses too many partitions

2015-08-11 Thread Cheng, Hao
That's a good question; we don't support reading small files in a single partition yet, but it's definitely an issue we need to optimize. Do you mind creating a JIRA issue for this? Hopefully we can merge that into the 1.6 release. 200 is the default partition number for parallel tasks after the
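
Since small input files are not yet coalesced automatically, a sketch of the manual workaround (the path and the partition count 16 are hypothetical):

    // Reading many small files yields roughly one partition per file.
    val logs = sqlContext.read.json("/hypothetical/many-small-files")
    println(logs.rdd.partitions.size)

    // Collapse them by hand; coalesce avoids a full shuffle.
    val fewer = logs.coalesce(16)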