RE: Why always spilling to disk and how to improve it?

2015-01-14 Thread Shuai Zheng
Thanks a lot! I just realized that Spark is not really an in-memory version of MapReduce :)

Why always spilling to disk and how to improve it?

2015-01-13 Thread Shuai Zheng
Hi All, I am experimenting with a small data set. It is only 200 MB, and all I am doing is a distinct count on it. But there is a lot of spilling in the log (attached at the end of the email). Basically I use 10 GB of memory and run on a one-node EMR cluster with an r3.8xlarge instance
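For context, a minimal sketch of the kind of job being described, in Spark 1.x Scala (the input path and app name are placeholders, not taken from the original email):

    import org.apache.spark.{SparkConf, SparkContext}

    object DistinctCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("DistinctCount")
        val sc = new SparkContext(conf)

        // distinct() requires a shuffle; in Spark 1.x the shuffle writes
        // spill to disk once the in-memory buffers fill up, which is the
        // spilling reported in the attached log.
        val count = sc.textFile("s3://my-bucket/input/") // hypothetical path
          .distinct()
          .count()

        println("Distinct count: " + count)
        sc.stop()
      }
    }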

Re: Why always spilling to disk and how to improve it?

2015-01-13 Thread Akhil Das
You could try setting the following to tweak the application a little bit: .set("spark.rdd.compress", "true") .set("spark.storage.memoryFraction", "1") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") (see the sketch below). For shuffle behavior, you can look at this document
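Put together on a Spark 1.x SparkConf, the suggested settings would look something like this sketch (the app name is a placeholder; spark.storage.memoryFraction is a legacy Spark 1.x setting, later superseded by unified memory management in 1.6):

    import org.apache.spark.SparkConf

    // The suggested settings applied to a SparkConf; note that both
    // keys and values must be passed as strings.
    val conf = new SparkConf()
      .setAppName("DistinctCountTuned") // hypothetical app name
      .set("spark.rdd.compress", "true") // compress serialized RDD partitions
      .set("spark.storage.memoryFraction", "1") // give the whole executor heap to storage
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // compact, fast serialization

One caveat: spark.storage.memoryFraction only governs cache storage; shuffle buffers are sized separately in Spark 1.x by spark.shuffle.memoryFraction, so a shuffle-heavy operation like distinct() can still spill. The shuffle document mentioned above covers those settings.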