Thanks Silvio!

On 11 Aug 2015 17:44, "Silvio Fiorito" <silvio.fior...@granturing.com> wrote:
> You need to configure the spark.sql.shuffle.partitions parameter to a
> different value. It defaults to 200.
>
> On 8/11/15, 11:31 AM, "Al M" <alasdair.mcbr...@gmail.com> wrote:
>
> >I am using DataFrames with Spark 1.4.1. I really like DataFrames but the
> >partitioning makes no sense to me.
> >
> >I am loading lots of very small files and joining them together. Every file
> >is loaded by Spark with just one partition. Each time I join two small
> >files the partition count increases to 200. This makes my application take
> >10x as long as if I coalesce everything to 1 partition after each join.
> >
> >With normal RDDs it would not expand out the partitions to 200 after joining
> >two files with one partition each. It would either keep it at one or expand
> >it to two.
> >
> >Why do DataFrames expand out the partitions so much?
> >
> >--
> >View this message in context:
> >http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrames-uses-too-many-partition-tp24214.html
> >Sent from the Apache Spark User List mailing list archive at Nabble.com.
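
A minimal sketch of the fix Silvio suggests, assuming the Spark 1.4-era SQLContext API; the input paths and join column here are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("small-file-joins")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Default is 200; a handful of partitions is plenty for tiny inputs.
    sqlContext.setConf("spark.sql.shuffle.partitions", "4")

    // Hypothetical small input files, for illustration only.
    val a = sqlContext.read.json("/data/small/a.json")
    val b = sqlContext.read.json("/data/small/b.json")

    // The shuffle behind this join now produces 4 partitions instead of 200.
    val joined = a.join(b, "id")

Setting the config on the SQLContext applies to every subsequent DataFrame shuffle (joins, aggregations), so there is no need to coalesce after each join just to keep the task count down.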