Thanks a lot!
I just realized Spark is not really an in-memory version of MapReduce :)
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, January 13, 2015 3:53 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Why always spilling to disk and how to improve it?
Hi All,
I am trying this with a small data set. It is only 200 MB, and all I am
doing is a distinct count on it.
But there is a lot of spilling happening in the log (attached at the end of
this email).
Basically I use 10 GB of memory, running on a one-node EMR cluster with an
r3.8xlarge instance.
You could try setting the following to tweak the application a little bit:
.set("spark.rdd.compress", "true")
.set("spark.storage.memoryFraction", "1")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
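For reference, a minimal sketch of how these settings might be applied when building the SparkConf for the distinct-count job described above. The app name and input path are hypothetical placeholders, not from the original mail; this is a config fragment requiring a Spark runtime, not a standalone program:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only; "DistinctCountExample" and the input path are assumptions.
val conf = new SparkConf()
  .setAppName("DistinctCountExample")
  .set("spark.rdd.compress", "true")
  // Value of 1 as suggested above; note it leaves no fraction for shuffle.
  .set("spark.storage.memoryFraction", "1")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
val count = sc.textFile("s3://bucket/input") // hypothetical path
  .distinct()
  .count()
println(s"distinct records: $count")
sc.stop()
```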
For shuffle behavior, you can look at this document