Here are the tuning guidelines, if you haven't seen them already:
http://spark.apache.org/docs/latest/tuning.html

You could try the following to get it loaded:

- Use Kryo serialization
<http://spark.apache.org/docs/latest/tuning.html#data-serialization>
- Enable RDD compression (spark.rdd.compress)
- Set the storage level to MEMORY_AND_DISK_SER
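
Putting those three together, something like this should work (a rough
sketch in Scala; the app name and input path are placeholders, so adjust
them for your setup):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("LargeTextLoad")  // placeholder app name
      // 1. Kryo serialization: more compact and faster than Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // 2. Compress serialized RDD partitions (trades CPU time for memory)
      .set("spark.rdd.compress", "true")

    val sc = new SparkContext(conf)

    // 3. Keep partitions serialized in memory and spill to disk when
    //    they don't fit, instead of recomputing them
    val data = sc.textFile("hdfs:///path/to/input")  // placeholder path
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    data.count()  // force an action so the persisted data is materialized

With MEMORY_AND_DISK_SER the partitions are stored in serialized form, so
they take much less memory, and anything that still doesn't fit spills to
disk rather than stalling the load.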


Thanks
Best Regards

On Tue, Nov 25, 2014 at 8:02 AM, SK <skrishna...@gmail.com> wrote:

> Hi,
>
> Is there any document that provides guidelines, with examples, that
> illustrate when different performance optimizations would be useful? I am
> interested in the guidelines for using optimizations like cache(),
> persist(), repartition(), coalesce(), and broadcast variables. I studied
> the online programming guide, but I would like more details (something
> along the lines of Aaron Davidson's Spark Summit talk, which illustrates
> the use of repartition() with an example).
>
> In particular, I have a dataset that is about 1.2 TB (about 30 files) that
> I am trying to load using sc.textFile on a cluster with a total memory of
> 3 TB (170 GB per node), but I am not able to complete the loading
> successfully. The program stays continuously active in the mapPartitions
> task but does not get past it even after a long time. I have tried some of
> the above optimizations, but that has not helped, and I am not sure whether
> I am using these optimizations in the right way or which of them would be
> most appropriate for this problem. So I would appreciate any guidelines.
>
> thanks
>
