Why I didn't see the benefits of using KryoSerializer

java8964 Tue, 17 Mar 2015 09:06:15 -0700

Hi, I am new to Spark. I tried to understand the memory benefits of using 
KryoSerializer.
I have this one box standalone test environment, which is 24 cores with 24G 
memory. I installed Hadoop 2.2 plus Spark 1.2.0.
I put one text file in the hdfs about 1.2G.  Here is the settings in the 
spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"export 
SPARK_WORKER_MEMORY=32gexport SPARK_DRIVER_MEMORY=2gexport 
SPARK_EXECUTOR_MEMORY=4g
First test case:val 
log=sc.textFile("hdfs://namenode:9000/test_1g/")log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)log.count()log.count()
The data is about 3M rows. For the first test case, from the storage in the web 
UI, I can see "Size in Memory" is 1787M, and "Fraction Cached" is 70% with 7 
cached partitions.This matched with what I thought, and first count finished 
about 17s, and 2nd count finished about 6s.
2nd test case after restart the spark-shell:val 
log=sc.textFile("hdfs://namenode:9000/test_1g/")log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)log.count()log.count()
Now from the web UI, I can see "Size in Memory" is 1231M, and "Fraction Cached" 
is 100% with 10 cached partitions. It looks like caching the default "java 
serialized format" reduce the memory usage, but coming with a cost that first 
count finished around 39s and 2nd count finished around 9s. So the job runs 
slower, with less memory usage.
So far I can understand all what happened and the tradeoff.
Now the problem comes with when I tried to test with KryoSerializer
SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" 
/opt/spark/bin/spark-shellval 
log=sc.textFile("hdfs://namenode:9000/test_1g/")log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)log.count()log.count()
First, I saw that the new serializer setting passed in, as proven in the Spark 
Properties of "Environment" shows "












spark.driver.extraJavaOptions


      -Dspark.serializer=org.apache.spark.serializer.KryoSerializer



". This is not there for first 2 test cases.But in the web UI of "Storage", the 
"Size in Memory" is 1234M, with 100% "Fraction Cached" and 10 cached 
partitions. The first count took 46s and 2nd count took 23s.
I don't get much less memory size as I expected, but longer run time for both 
counts. Anything I did wrong? Why the memory foot print of "MEMORY_ONLY_SER" 
for KryoSerializer still use the same size as default Java serializer, with 
worse duration?
Thanks
Yong

Why I didn't see the benefits of using KryoSerializer

Reply via email to