Hi,

I would like to know how you coalesce the partitions down to one to improve performance.


Thanks
On 2015-08-11 23:31, Al M wrote:
I am using DataFrames with Spark 1.4.1.  I really like DataFrames but the
partitioning makes no sense to me.

I am loading lots of very small files and joining them together.  Every file
is loaded by Spark into just one partition.  Each time I join two small
files, the partition count jumps to 200.  This makes my application take
10x as long as when I coalesce everything back to one partition after each join.
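For reference, a minimal sketch of that coalesce-after-join pattern (assuming
an existing SparkContext `sc`; the parquet format, the paths and the "id" join
key are made up for illustration):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    val a = sqlContext.read.parquet("hdfs:///tmp/small-a")  // 1 partition
    val b = sqlContext.read.parquet("hdfs:///tmp/small-b")  // 1 partition

    // The join shuffles the data out into many partitions; coalescing
    // right after the join collapses the result back down to one.
    val joined = a.join(b, a("id") === b("id")).coalesce(1)
    println(joined.rdd.partitions.length)  // -> 1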

With normal RDDs, joining two files with one partition each does not blow the
partition count out to 200; the result either stays at one partition or grows
to two.
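A quick illustration of that RDD behaviour, with toy keys and values:

    // Two pair RDDs with one partition each.
    val left  = sc.parallelize(Seq(1 -> "a"), 1)
    val right = sc.parallelize(Seq(1 -> "b"), 1)

    // With no explicit partitioner, join() picks the largest parent's
    // partition count (or spark.default.parallelism if that is set),
    // so the result keeps one partition instead of jumping to 200.
    println(left.join(right).partitions.length)  // -> 1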

Why do DataFrames expand out the partitions so much?
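For what it's worth, the 200 here is the spark.sql.shuffle.partitions setting,
which DataFrame joins and aggregations use for their shuffle; its default is
200 regardless of the input size. A sketch of lowering it instead of coalescing
by hand, assuming the same sqlContext as above:

    // DataFrame joins and aggregations shuffle into
    // spark.sql.shuffle.partitions partitions; the default is 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "2")

With that set before the joins, joining two one-partition files should come out
at two partitions rather than 200.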



