Why the length of each task varies

2015-07-27 Thread Gavin Liu
I am running wordcount on a Spark cluster (1 master, 3 slaves) in standalone mode. I have 546 GB of data, and the dfs.blocksize I set is 256 MB. Therefore, the number of tasks is 2186. Each of my 3 slaves uses 22 cores and 72 GB of memory for processing, so the computing ability of each slave should
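The task count in the post follows from dividing the input size by the HDFS block size, since Spark launches roughly one map task per block. A quick sketch of that arithmetic (assuming binary units, i.e. 546 GiB of input and 256 MiB blocks; the small gap to the reported 2186 would come from partial blocks at file boundaries):

```python
import math

# Figures from the post (binary units assumed).
input_bytes = 546 * 1024**3   # 546 GiB of input data
block_bytes = 256 * 1024**2   # 256 MiB dfs.blocksize

# Roughly one map task per HDFS block.
tasks = math.ceil(input_bytes / block_bytes)
print(tasks)  # 2184 -- a few short of the reported 2186, since files
              # that don't align to block boundaries add partial blocks
```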

Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-04 Thread Gavin Liu
Hi, I am using the TeraSort benchmark from ehiggs's branch: https://github.com/ehiggs/spark-terasort . I noticed that TeraSort.scala uses the Kryo serializer, so I made a small change from "org.apache.spark.serializer.KryoSerializer" to "org.apac
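The change being described amounts to swapping the value of Spark's `spark.serializer` property. A hypothetical sketch of the two configurations as plain key/value settings (the property names and serializer class names are real Spark identifiers; the dicts themselves are only illustrative, standing in for SparkConf, spark-defaults.conf, or `--conf` flags):

```python
# Serializer selection is a single Spark property; these are the two
# values compared in the thread.
kryo_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Kryo typically performs best when classes are registered up front;
    # this property (default "false") controls whether that is enforced.
    "spark.kryo.registrationRequired": "false",
}

java_conf = {
    "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
}

# Equivalent spark-submit form:
#   --conf spark.serializer=org.apache.spark.serializer.JavaSerializer
```

One common explanation for Kryo appearing slower on TeraSort is that the records are raw byte arrays, which Java serialization already handles cheaply, so Kryo's per-object overhead buys little.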