Why the length of each task varies

2015-07-27 Thread Gavin Liu
I am implementing wordcount on a Spark cluster (1 master, 3 slaves) in standalone mode. I have 546 GB of data, and the dfs.blocksize I set is 256 MB, so the number of tasks is 2186. Each of my 3 slaves uses 22 cores and 72 GB of memory for the processing, so the computing ability of each slave
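For reference, Spark schedules one input task per HDFS block, so the task count above follows directly from the data size and dfs.blocksize. A quick sketch of that arithmetic (the slightly higher reported count of 2186 likely comes from files whose last blocks are partial, which is an assumption on my part):

```python
import math

# Estimate the number of input tasks: one task per HDFS block.
data_size_mb = 546 * 1024   # 546 GB of input, expressed in MB
block_size_mb = 256         # dfs.blocksize as set on the cluster

num_tasks = math.ceil(data_size_mb / block_size_mb)
print(num_tasks)  # 2184 -- close to the 2186 tasks reported in the thread
```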

Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Gavin Liu
Hi, I am using the TeraSort benchmark from ehiggs's branch: https://github.com/ehiggs/spark-terasort . I noticed that in TeraSort.scala it uses the Kryo serializer, so I made a small change from org.apache.spark.serializer.KryoSerializer to
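Both serializers in this comparison are selected through the same Spark property, so the change amounts to swapping one class name; a minimal config sketch (property name as documented in the standard Spark configuration guide):

```
# spark-defaults.conf fragment: choose the serializer Spark uses for
# shuffled data and cached RDDs
spark.serializer  org.apache.spark.serializer.KryoSerializer

# or, for the stock Java serialization the thread compares against:
# spark.serializer  org.apache.spark.serializer.JavaSerializer
```

The same setting can equally be passed via SparkConf in code, which is how TeraSort.scala does it.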