Hi, I am new to Spark and am trying to understand the memory benefits of using KryoSerializer.

I have a single-box standalone test environment with 24 cores and 24G of memory, running Hadoop 2.2 and Spark 1.2.0. I put one text file of about 1.2G into HDFS.

Here are the settings in spark-env.sh:

    export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4"
    export SPARK_WORKER_MEMORY=32g
    export SPARK_DRIVER_MEMORY=2g
    export SPARK_EXECUTOR_MEMORY=4g

First test case:

    val log = sc.textFile("hdfs://namenode:9000/test_1g/")
    log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
    log.count()
    log.count()

The data is about 3M rows. For the first test case, the "Storage" page of the web UI shows "Size in Memory" as 1787M and "Fraction Cached" as 70%, with 7 cached partitions. This matched what I expected. The first count finished in about 17s, and the 2nd count in about 6s.

Second test case, after restarting the spark-shell:

    val log = sc.textFile("hdfs://namenode:9000/test_1g/")
    log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
    log.count()
    log.count()

Now the web UI shows "Size in Memory" as 1231M and "Fraction Cached" as 100%, with 10 cached partitions. It looks like caching in the default Java serialized format reduces the memory usage, but at a cost: the first count finished in around 39s and the 2nd count in around 9s. So the job runs slower, with less memory usage.

So far I can understand everything that happened, and the tradeoff. The problem comes when I test with KryoSerializer:

    SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" /opt/spark/bin/spark-shell

    val log = sc.textFile("hdfs://namenode:9000/test_1g/")
    log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
    log.count()
    log.count()

First, I saw that the new serializer setting was passed in: the Spark Properties section of the "Environment" page shows

    spark.driver.extraJavaOptions    -Dspark.serializer=org.apache.spark.serializer.KryoSerializer

This entry is not there for the first 2 test cases. But on the "Storage" page of the web UI, "Size in Memory" is 1234M, with 100% "Fraction Cached" and 10 cached partitions. The first count took 46s and the 2nd count took 23s.

I did not get the much smaller memory size I expected, and both counts ran longer. Did I do anything wrong? Why does the memory footprint of MEMORY_ONLY_SER with KryoSerializer stay the same as with the default Java serializer, with worse run times?

Thanks,
Yong