Hi,

Is there a document with guidelines and examples that illustrate when the different performance optimizations are useful? I am interested in guidelines for using optimizations like cache(), persist(), repartition(), coalesce(), and broadcast variables. I have studied the online programming guide, but I would like more detail, something along the lines of Aaron Davidson's Spark Summit talk, which illustrates the use of repartition() with an example.
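To make the question concrete, here is the kind of usage I am asking about. The paths, the lookup map, and the partition counts below are all made up; I am only sketching where each optimization would plausibly apply:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("optimization-sketch")
val sc = new SparkContext(conf)

// cache()/persist(): keep an RDD that several actions reuse.
// MEMORY_AND_DISK_SER stores serialized blocks and spills to disk
// instead of recomputing the lineage when memory runs out.
val lines = sc.textFile("hdfs:///data/input")              // placeholder path
val records = lines.map(_.split("\t")).persist(StorageLevel.MEMORY_AND_DISK_SER)
records.count()                                            // first action materializes it
records.filter(_.length > 3).count()                       // later actions reuse it

// repartition(): full shuffle to raise parallelism when the input
// arrives in too few partitions (e.g. a handful of huge files).
val spread = records.repartition(2000)

// coalesce(): reduce the partition count without a full shuffle,
// e.g. to avoid writing thousands of tiny output files.
spread.map(_.mkString("\t")).coalesce(200).saveAsTextFile("hdfs:///data/output")

// Broadcast variable: ship a small read-only lookup table to each
// executor once, instead of serializing it with every task.
val smallMap = Map("a" -> 1, "b" -> 2)                     // made-up lookup table
val lookup = sc.broadcast(smallMap)
val tagged = records.map(r => (r(0), lookup.value.getOrElse(r(0), 0)))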
In particular, I have a dataset of about 1.2 TB (roughly 30 files) that I am trying to load with sc.textFile on a cluster with 3 TB of total memory (170 GB per node), but I am unable to complete the load: the job stays continuously active in the mapPartitions task and does not get past it even after a long time. I have tried some of the optimizations above, but they have not helped, and I am not sure whether I am applying them correctly or which of them is most appropriate for this problem. I would appreciate any guidelines.

thanks
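P.S. For context, here is a simplified version of the load I am attempting. The path is a placeholder, and the partition count is just a guess (4000 partitions would be roughly 300 MB each for 1.2 TB):

import org.apache.spark.storage.StorageLevel

// second argument to textFile is the minimum number of partitions
val raw = sc.textFile("hdfs:///data/mydataset", 4000)
// serialized storage with disk spill, so blocks that do not fit in
// memory go to disk rather than being dropped and recomputed
val data = raw.persist(StorageLevel.MEMORY_AND_DISK_SER)
data.count()  // force materialization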