Here are the tuning guidelines, if you haven't seen them already: http://spark.apache.org/docs/latest/tuning.html
You could try the following to get it loaded:

- Use Kryo serialization <http://spark.apache.org/docs/latest/tuning.html#data-serialization>
- Enable RDD compression
- Set the storage level to MEMORY_AND_DISK_SER

A minimal config sketch of these settings follows after the quoted message below.

Thanks
Best Regards

On Tue, Nov 25, 2014 at 8:02 AM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> Is there any document that provides guidelines, with some examples, that
> illustrate when the different performance optimizations would be useful? I am
> interested in the guidelines for using optimizations like cache(),
> persist(), repartition(), coalesce(), and broadcast variables. I studied
> the online programming guide, but I would like some more details (something
> along the lines of Aaron Davidson's Spark Summit talk, which illustrates the
> use of repartition() with an example).
>
> In particular, I have a dataset of about 1.2 TB (about 30 files) that I
> am trying to load using sc.textFile on a cluster with a total memory of 3 TB
> (170 GB per node), but I am not able to complete the loading successfully.
> The program stays active in the mapPartitions task and does not
> get past it even after a long time. I have tried some of the above
> optimizations, but that has not helped, and I am not sure whether I am
> using them correctly or which of them would be most appropriate for this
> problem. So I would appreciate any guidelines.
>
> thanks
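Here is a minimal sketch of how the three suggestions above could be wired together when loading a large text dataset. The app name and the HDFS path are placeholders, not from the original thread; adjust them to your environment.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LoadLargeDataset {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("LoadLargeDataset") // placeholder app name
      // Use Kryo instead of Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Compress serialized RDD partitions
      .set("spark.rdd.compress", "true")

    val sc = new SparkContext(conf)

    // Placeholder path standing in for the ~1.2 TB input described in the question
    val lines = sc.textFile("hdfs:///data/*.txt")

    // Store partitions in serialized form, spilling to disk when they
    // do not fit in memory, instead of the default MEMORY_ONLY
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(lines.count())

    sc.stop()
  }
}
```

With MEMORY_AND_DISK_SER, partitions that exceed the available storage memory spill to local disk rather than failing or being recomputed, which is usually the safer choice when the dataset size is close to (or larger than) the cluster's memory.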