You need to set the spark.sql.shuffle.partitions configuration parameter to a value that fits your data. It defaults to 200, which is why every DataFrame join produces 200 partitions.
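A minimal PySpark sketch of both options (Spark 1.4.x API; the file paths and column names are illustrative, and an existing SQLContext named sqlContext is assumed):

```python
# Sketch for Spark 1.4.x. Assumes an existing SQLContext named sqlContext;
# file paths and column names are illustrative.
# spark.sql.shuffle.partitions controls how many partitions every DataFrame
# shuffle (join, groupBy, sort, ...) produces. The default is 200, which is
# why each join in the quoted question jumps to 200 partitions.
sqlContext.setConf("spark.sql.shuffle.partitions", "4")

a = sqlContext.read.json("small_a.json")
b = sqlContext.read.json("small_b.json")
joined = a.join(b, a.id == b.id)   # shuffle output now has 4 partitions, not 200

# Alternatively, keep the default and collapse partitions after each join;
# coalesce(1) merges partitions without triggering another full shuffle.
joined_one = a.join(b, a.id == b.id).coalesce(1)
```

Unlike RDD joins, which derive their output partitioning from the input partition counts, a DataFrame join goes through a shuffle exchange whose partition count is fixed by this setting, so the input files' partitioning is not preserved.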
On 8/11/15, 11:31 AM, "Al M" <alasdair.mcbr...@gmail.com> wrote:
>I am using DataFrames with Spark 1.4.1. I really like DataFrames but the
>partitioning makes no sense to me.
>
>I am loading lots of very small files and joining them together. Every file
>is loaded by Spark with just one partition. Each time I join two small
>files the partition count increases to 200. This makes my application take
>10x as long as if I coalesce everything to 1 partition after each join.
>
>With normal RDDs it would not expand out the partitions to 200 after joining
>two files with one partition each. It would either keep it at one or expand
>it to two.
>
>Why do DataFrames expand out the partitions so much?
>
>--
>View this message in context:
>http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrames-uses-too-many-partition-tp24214.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.