Hi,

Is there a document that gives guidelines, with examples, for when the
different performance optimizations are useful? I am interested in
guidelines for using optimizations like cache(), persist(), repartition(),
coalesce(), and broadcast variables. I have studied the online programming
guide, but I would like more detail (something along the lines of Aaron
Davidson's Spark Summit talk, which illustrates the use of repartition()
with an example).
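
Here is roughly how I currently understand each of these (a minimal Scala
sketch; the paths, partition counts, and lookup map are all made up for
illustration) -- please correct me if my understanding is wrong:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  // In spark-shell, sc is already provided.
  val sc = new SparkContext(new SparkConf().setAppName("optimization-sketch"))

  // Hypothetical input path; replace with real data.
  val lines = sc.textFile("hdfs:///data/events/*")

  // cache(): keep an RDD in memory when it is reused across actions.
  // Shorthand for persist(StorageLevel.MEMORY_ONLY).
  val parsed = lines.map(_.split(",")).cache()
  parsed.count()   // first action materializes the cache
  parsed.take(10)  // second action reads from memory

  // persist(): like cache(), but with an explicit storage level.
  // MEMORY_AND_DISK spills partitions that do not fit in memory.
  val pairs = parsed.map(f => (f(0), f.length))
    .persist(StorageLevel.MEMORY_AND_DISK)

  // repartition(): full shuffle to raise parallelism when the input
  // arrives in too few partitions for the cluster.
  val wide = pairs.repartition(400)

  // coalesce(): narrow dependency that reduces the partition count
  // without a full shuffle, e.g. before writing a small output.
  wide.map { case (k, n) => s"$k\t$n" }.coalesce(10)
    .saveAsTextFile("hdfs:///out")

  // Broadcast variable: ship a read-only lookup table to each executor
  // once, instead of inside every task closure.
  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
  pairs.map { case (k, n) => (k, n + lookup.value.getOrElse(k, 0)) }.count()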

In particular, I have a dataset of about 1.2 TB (roughly 30 files) that I
am trying to load with sc.textFile on a cluster with 3 TB of total memory
(170 GB per node), but I cannot get the load to complete. The program stays
continuously active in the mapPartitions task and does not get past it,
even after a long time. I have tried some of the above optimizations, but
that has not helped, and I am not sure whether I am using them correctly or
which of them is most appropriate for this problem. So I would appreciate
any guidelines.
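
For reference, this is roughly what my loading code looks like, with the
optimizations I have been trying (the path, partition count, and delimiter
are placeholders):

  import org.apache.spark.storage.StorageLevel

  // ~30 files totalling ~1.2 TB, so ~40 GB per file. Asking for more
  // splits keeps any single task from handling tens of GB at once
  // (this only helps if the files are splittable, i.e. not gzipped).
  val raw = sc.textFile("hdfs:///data/big/*", minPartitions = 4800)

  // Serialized memory+disk storage reduces the in-memory footprint and
  // spills whatever still does not fit, rather than stalling.
  val records = raw.map(_.split("\t"))
    .persist(StorageLevel.MEMORY_AND_DISK_SER)

  records.count()  // the job stays active here and never completes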

Thanks




