Set the spark.sql.shuffle.partitions parameter to a smaller value; it defaults
to 200, which is why every DataFrame join produces 200 partitions regardless of
the input size.
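For example, something along these lines (just a sketch for Spark 1.4, assuming
an existing SparkContext named sc and two hypothetical DataFrames df1 and df2;
the value 8 is arbitrary and should be tuned to your data size):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // Lower the number of post-shuffle partitions before running the joins.
  // (8 is only an illustration; pick something close to your core count.)
  sqlContext.setConf("spark.sql.shuffle.partitions", "8")

  // Equivalent alternatives:
  //   sqlContext.sql("SET spark.sql.shuffle.partitions=8")
  //   spark-submit --conf spark.sql.shuffle.partitions=8 ...

  // Subsequent joins now shuffle into 8 partitions instead of 200.
  val joined = df1.join(df2, "id")

Note that this is a setting on the SQLContext, so every shuffle (joins as well
as aggregations) picks it up.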




On 8/11/15, 11:31 AM, "Al M" <alasdair.mcbr...@gmail.com> wrote:

>I am using DataFrames with Spark 1.4.1.  I really like DataFrames but the
>partitioning makes no sense to me.
>
>I am loading lots of very small files and joining them together.  Every file
>is loaded by Spark with just one partition.  Each time I join two small
>files the partition count increases to 200.  This makes my application take
>10x as long as if I coalesce everything to 1 partition after each join.
>
>With normal RDDs, joining two files with one partition each would not expand
>the result to 200 partitions.  It would either stay at one partition or grow
>to two.
>
>Why do DataFrames expand out the partitions so much?
>
>
>
