RE: Spark DataFrames uses too many partition

Cheng, Hao Tue, 11 Aug 2015 18:19:44 -0700

That's a good question, we don't support reading small files in a single 
partition yet, but it's definitely an issue we need to optimize, do you mind to 
create a jira issue for this? Hopefully we can merge that in 1.6 release.


200 is the default partition number for parallel tasks after the data shuffle, 
and we have to change that value according to the file size, cluster size etc..

Ideally, this number would be set dynamically and automatically, however, spark 
sql doesn't support the complex cost based model yet, particularly for the 
multi-stages job. (https://issues.apache.org/jira/browse/SPARK-4630)

Hao

-----Original Message-----
From: Al M [mailto:alasdair.mcbr...@gmail.com] 
Sent: Tuesday, August 11, 2015 11:31 PM
To: user@spark.apache.org
Subject: Spark DataFrames uses too many partition

I am using DataFrames with Spark 1.4.1.  I really like DataFrames but the 
partitioning makes no sense to me.

I am loading lots of very small files and joining them together.  Every file is 
loaded by Spark with just one partition.  Each time I join two small files the 
partition count increases to 200.  This makes my application take 10x as long 
as if I coalesce everything to 1 partition after each join.

With normal RDDs it would not expand out the partitions to 200 after joining 
two files with one partition each.  It would either keep it at one or expand it 
to two.

Why do DataFrames expand out the partitions so much?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-DataFrames-uses-too-many-partition-tp24214.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: Spark DataFrames uses too many partition

Reply via email to