Here are the tuning guidelines, if you haven't seen them already:
http://spark.apache.org/docs/latest/tuning.html
You could try the following to get it loaded:
- Use Kryo serialization
<http://spark.apache.org/docs/latest/tuning.html#data-serialization>
- Enable RDD compression (spark.rdd.compress)
- Set the storage level to MEMORY_AND_DISK_SER
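Putting those three together, a minimal sketch might look like the following (the app name and input path are placeholders you'd replace with your own; this obviously needs a running Spark cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("LoadLargeDataset") // placeholder name
  // Kryo is faster and more compact than the default Java serializer
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Compress serialized RDD partitions (extra CPU, less memory)
  .set("spark.rdd.compress", "true")

val sc = new SparkContext(conf)

// MEMORY_AND_DISK_SER keeps partitions serialized in memory and spills
// to disk when they don't fit, instead of recomputing or failing
val data = sc.textFile("hdfs:///path/to/dataset") // placeholder path
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
```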
Best Regards
On Tue, Nov 25, 2014 at 8:02 AM, SK wrote:
> Hi,
>
> Is there any document that provides some guidelines with some examples that
> illustrate when different performance optimizations would be useful? I am
> interested in knowing the guidelines for using optimizations like cache(),
> persist(), repartition(), coalesce(), and broadcast variables. I studied
> the online programming guide, but I would like some more details (something
> along the lines of Aaron Davidson's talk which illustrates the use of
> repartition() with an example during the Spark summit).
>
> In particular, I have a dataset that is about 1.2TB (about 30 files) that I
> am trying to load using sc.textFile on a cluster with a total memory of 3TB
> (170 GB per node). But I am not able to successfully complete the loading.
> The program is continuously active in the mapPartitions task but does not
> get past that even after a long time. I have tried some of the above
> optimizations. But that has not helped and I am not sure if I am using
> these
> optimizations in the right way or which of the above optimizations would be
> most appropriate to this problem. So I would appreciate any guidelines.
>
> thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-performance-optimization-examples-tp19707.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>