Thanks Silvio!

On 11 Aug 2015 17:44, "Silvio Fiorito" <silvio.fior...@granturing.com> wrote:
> You need to configure the spark.sql.shuffle.partitions parameter to a
> different value. It defaults to 200.
>
> On 8/11/15, 11:31 AM, "Al M" <alasdair.mcbr...@gmail.com> wrote:
>
> >I am using DataFrames with Spark 1.4.1. I really like DataFrames but the
> >partitioning makes no sense to me.
> >
> >I am loading lots of very small files and joining them together. Every file
> >is loaded by Spark with just one partition. Each time I join two small
> >files the partition count increases to 200. This makes my application take
> >10x as long as if I coalesce everything to 1 partition after each join.
> >
> >With normal RDDs it would not expand out the partitions to 200 after joining
> >two files with one partition each. It would either keep it at one or expand
> >it to two.
> >
> >Why do DataFrames expand out the partitions so much?
> >
> >--
> >View this message in context:
> >http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrames-uses-too-many-partition-tp24214.html
> >Sent from the Apache Spark User List mailing list archive at Nabble.com.
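
A minimal sketch of the fix Silvio suggests, assuming the Spark 1.4-era SQLContext API; the input paths and join column here are hypothetical placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("small-file-joins")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Default is 200; a handful of partitions is plenty for tiny inputs.
    sqlContext.setConf("spark.sql.shuffle.partitions", "4")

    // Hypothetical small input files, for illustration only.
    val a = sqlContext.read.json("/data/small/a.json")
    val b = sqlContext.read.json("/data/small/b.json")

    // The shuffle behind this join now produces 4 partitions instead of 200.
    val joined = a.join(b, "id")

Setting the config on the SQLContext applies to every subsequent DataFrame shuffle (joins, aggregations), so there is no need to coalesce after each join just to keep the task count down.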