Hi All,
I am experimenting with a small data set. It is only about 200 MB, and all I am doing is a distinct count on it, yet the log shows a lot of spilling (attached at the end of this email). I am using 10 GB of executor memory on a one-node EMR cluster with an r3.8xlarge instance (244 GB memory, 32 vCPUs).

My code is simple and runs in the spark-shell (~/spark/bin/spark-shell --executor-cores 4 --executor-memory 10G):

    val llg = sc.textFile("s3://./part-r-00000") // file is around 210.5 MB, 4.7M rows
    //val llg = sc.parallelize(List("-240990|161327,9051480,0,2,30.48,75", "-240990|161324,9051480,0,2,30.48,75"))
    val ids = llg.flatMap(line => line.split(",").slice(0, 1)) // take the first column as the key
    val counts = ids.distinct.count

I think I should have enough memory, so there should not be any spilling. Can anyone give me an idea why it happens, or where I can tune the system to reduce the spilling? (It is not a real problem on this data set, but I want to learn how to tune it.) The Spark UI shows only 24.2 MB of shuffle write. If the executor has 10 GB of memory, why does it need to spill?
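For reference, this is the kind of tuning I was planning to try next, a sketch only, assuming the Spark 1.x property name spark.shuffle.memoryFraction is the right knob here (default 0.2); please correct me if I should be looking elsewhere:

```shell
# Hypothetical tuning attempt: give the shuffle aggregation a larger
# share of the 10G executor heap before it starts spilling to disk.
# spark.shuffle.memoryFraction is the fraction of the heap that
# shuffle data structures (e.g. ExternalAppendOnlyMap) may use.
~/spark/bin/spark-shell \
  --executor-cores 4 \
  --executor-memory 10G \
  --conf spark.shuffle.memoryFraction=0.4
```

My understanding is that this fraction is shared by all concurrently running tasks on the executor, so with 4 cores each task gets roughly a quarter of it, which may be why the individual spills are so small.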
2015-01-13 20:01:53,010 INFO [sparkDriver-akka.actor.default-dispatcher-2] storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Updated info of block broadcast_2_piece0
2015-01-13 20:01:53,011 INFO [Spark Context Cleaner] spark.ContextCleaner (Logging.scala:logInfo(59)) - Cleaned broadcast 2
2015-01-13 20:01:53,399 INFO [Executor task launch worker-5] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 149 spilling in-memory map of 23.4 MB to disk (3 times so far)
2015-01-13 20:01:53,516 INFO [Executor task launch worker-7] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 151 spilling in-memory map of 23.4 MB to disk (3 times so far)
2015-01-13 20:01:53,531 INFO [Executor task launch worker-6] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 150 spilling in-memory map of 23.2 MB to disk (3 times so far)
2015-01-13 20:01:53,793 INFO [Executor task launch worker-4] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 148 spilling in-memory map of 23.4 MB to disk (3 times so far)
2015-01-13 20:01:54,460 INFO [Executor task launch worker-5] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 149 spilling in-memory map of 23.2 MB to disk (4 times so far)
2015-01-13 20:01:54,469 INFO [Executor task launch worker-7] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 151 spilling in-memory map of 23.2 MB to disk (4 times so far)
2015-01-13 20:01:55,144 INFO [Executor task launch worker-6] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 150 spilling in-memory map of 24.2 MB to disk (4 times so far)
2015-01-13 20:01:55,192 INFO [Executor task launch worker-4] collection.ExternalAppendOnlyMap (Logging.scala:logInfo(59)) - Thread 148 spilling in-memory map of 23.2 MB to disk (4 times so far)

I am trying to collect more benchmarks before moving on to bigger data sets and more complex logic.

Regards,
Shuai