RE: Why I didn't see the benefits of using KryoSerializer

2015-03-20 Thread java8964
Hi, Imran:
Thanks for your information.
I found a benchmark online that compares Java serialization, Kryo, and 
GridGain here: 
http://gridgain.blogspot.com/2012/12/java-serialization-good-fast-and-faster.html
From my test results, for the SimpleObject case in that benchmark, Kryo is 
slightly faster than Java serialization but uses only half the space.
So now I understand better what kind of benefits I should expect from 
KryoSerializer.
But I have some questions related to Spark SQL. If I use Spark SQL, should I 
expect less memory usage? I mean, in Spark SQL everything is controlled by 
Spark. If I pass in 
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer and cache the 
table, will it use much less memory? Do I also need to specify 
StorageLevel.MEMORY_ONLY_SER if I want to use less memory? Where can I set 
that in Spark SQL?
Thanks
Yong


Re: Why I didn't see the benefits of using KryoSerializer

2015-03-20 Thread Imran Rashid
Hi Yong,

yes I think your analysis is correct.  I'd imagine almost all serializers
out there will just convert a string to its utf-8 representation.  You
might be interested in adding compression on top of a serializer, which
would probably bring the string size down in almost all cases, but then you
also need to take the time for compression.  Kryo is generally more
efficient than the java serializer on complicated object types.
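
To make the point about strings concrete, here is a minimal sketch that uses only the JDK, not Spark or Kryo. It compares the UTF-8 size of one line with its Java-serialized size and its gzip-compressed size. The 140-character sample line and the names StringSizeSketch, javaSerializedSize and gzippedSize are made up for illustration, and each object gets a fresh ObjectOutputStream, which is not necessarily how Spark writes a whole partition.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPOutputStream

object StringSizeSketch {
  // Bytes produced by plain Java serialization of a single object.
  def javaSerializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    buffer.size()
  }

  // Bytes left after gzip-compressing a byte array.
  def gzippedSize(bytes: Array[Byte]): Int = {
    val buffer = new ByteArrayOutputStream()
    val gzip = new GZIPOutputStream(buffer)
    gzip.write(bytes)
    gzip.close()
    buffer.size()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical 140-character log line standing in for one row of the test file.
    val line = ("2015-03-17 12:01:35 INFO request served " * 4).take(140)
    val utf8 = line.getBytes(StandardCharsets.UTF_8)
    println(s"UTF-8 bytes:           ${utf8.length}")
    println(s"Java serialized bytes: ${javaSerializedSize(line)}")
    println(s"gzip of UTF-8 bytes:   ${gzippedSize(utf8)}")
  }
}
```

If the serialized size comes out only a few bytes above the UTF-8 size, there is little left for a different serializer to save on plain strings, while compression can still shrink repetitive text at the cost of CPU time.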

I guess I'm still a little surprised that kryo is slower than java
serialization for you.  You might try setting
spark.kryo.referenceTracking to false if you are just serializing objects
with no circular references.  I think that will improve the performance a
little, though I dunno how much.
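
For reference, a sketch of setting both properties programmatically through SparkConf instead of -D options. It assumes Spark is on the classpath; the object name and application name are hypothetical. Whether turning off reference tracking helps noticeably for this workload is untested here.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-reference-tracking-sketch") // hypothetical application name
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Only safe when the serialized objects contain no circular references.
      .set("spark.kryo.referenceTracking", "false")

    // Print the effective settings; pass conf to a SparkContext to use them.
    println(conf.toDebugString)
  }
}
```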

It might be worth running your experiments again with slightly more
complicated objects and see what you observe.
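
As a rough illustration of why object types behave differently from plain strings, the sketch below (JDK only, no Spark or Kryo) serializes a small case class with Java serialization and compares the result with the raw size of its fields. The Visit class and the other names are hypothetical. The Kryo side is not measured here; it would need the Kryo library on the classpath.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical record type; any small case class would do.
case class Visit(userId: Int, pageId: Int, durationMs: Long)

object ObjectOverheadSketch {
  // Raw size of the three fields: two Ints and one Long.
  val payloadBytes: Int = 4 + 4 + 8

  // Bytes produced by plain Java serialization of a single object.
  def javaSerializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    buffer.size()
  }

  def main(args: Array[String]): Unit = {
    val visit = Visit(userId = 1, pageId = 2, durationMs = 3000L)
    println(s"Raw field bytes:       $payloadBytes")
    println(s"Java serialized bytes: ${javaSerializedSize(visit)}")
  }
}
```

Java serialization writes a class descriptor (class name, field names and types) alongside the data, so a small object costs far more than its fields. That descriptor overhead is the part a more compact format can cut, which is one reason to rerun the experiment with structured records rather than plain text lines.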

Imran


On Thu, Mar 19, 2015 at 12:57 PM, java8964 java8...@hotmail.com wrote:

 I read the Spark code a little bit, trying to understand my own question.

 It looks like the difference is really between
 org.apache.spark.serializer.JavaSerializer and
 org.apache.spark.serializer.KryoSerializer, both having a method named
 writeObject.

 In my test case, each line of my text file is a String of about 140 bytes.
 With either JavaSerializer.writeObject(140 bytes of String) or
 KryoSerializer.writeObject(140 bytes of String), I didn't see a difference in
 the space used in the underlying OutputStream.

 Does this mean that KryoSerializer really doesn't give us any benefit for
 the String type? I understand that for primitive types it shouldn't have any
 benefit, but how about the String type?

 When we talk about lowering memory usage with KryoSerializer in Spark, in
 what cases can it bring significant benefits? It is my first experience with
 the KryoSerializer, so maybe I am totally wrong about its usage.

 Thanks

 Yong

 --
 From: java8...@hotmail.com
 To: user@spark.apache.org
 Subject: Why I didn't see the benefits of using KryoSerializer
 Date: Tue, 17 Mar 2015 12:01:35 -0400


 Hi, I am new to Spark. I tried to understand the memory benefits of using
 KryoSerializer.

 I have this one box standalone test environment, which is 24 cores with
 24G memory. I installed Hadoop 2.2 plus Spark 1.2.0.

 I put one text file of about 1.2G in HDFS.  Here are the settings in
 spark-env.sh:

 export SPARK_MASTER_OPTS=-Dspark.deploy.defaultCores=4
 export SPARK_WORKER_MEMORY=32g
 export SPARK_DRIVER_MEMORY=2g
 export SPARK_EXECUTOR_MEMORY=4g

 First test case:
 val log = sc.textFile("hdfs://namenode:9000/test_1g/")
 log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
 log.count()
 log.count()

 The data is about 3M rows. For the first test case, from the Storage page of
 the web UI, I can see Size in Memory is 1787M, and Fraction Cached is
 70% with 7 cached partitions.
 This matched what I expected. The first count finished in about 17s, and
 the 2nd count finished in about 6s.

 2nd test case, after restarting the spark-shell:
 val log = sc.textFile("hdfs://namenode:9000/test_1g/")
 log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
 log.count()
 log.count()

 Now from the web UI, I can see Size in Memory is 1231M, and Fraction
 Cached is 100% with 10 cached partitions. It looks like caching in the
 default Java serialized format reduces the memory usage, but at a cost:
 the first count finished in around 39s and the 2nd count in around 9s.
 So the job runs slower, with less memory usage.

 So far I can understand everything that happened and the tradeoff.

 Now the problem comes when I test with KryoSerializer:

 SPARK_JAVA_OPTS=-Dspark.serializer=org.apache.spark.serializer.KryoSerializer \
   /opt/spark/bin/spark-shell
 val log = sc.textFile("hdfs://namenode:9000/test_1g/")
 log.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY_SER)
 log.count()
 log.count()

 First, I saw that the new serializer setting was passed in: the Spark
 Properties section of the Environment page shows

   spark.driver.extraJavaOptions
   -Dspark.serializer=org.apache.spark.serializer.KryoSerializer

 This is not there for the first 2 test cases.
 But in the Storage page of the web UI, the Size in Memory is 1234M, with 100%
 Fraction Cached and 10 cached partitions. The first count took 46s and the
 2nd count took 23s.

 I don't get the much smaller memory size I expected, but I do get longer run
 times for both counts. Did I do anything wrong? Why does the memory footprint
 of MEMORY_ONLY_SER with KryoSerializer stay the same as with the default Java
 serializer, with worse duration?

 Thanks

 Yong



RE: Why I didn't see the benefits of using KryoSerializer

2015-03-19 Thread java8964
I read the Spark code a little bit, trying to understand my own question.
It looks like the difference is really between 
org.apache.spark.serializer.JavaSerializer and 
org.apache.spark.serializer.KryoSerializer, both having a method named 
writeObject.
In my test case, each line of my text file is a String of about 140 bytes. 
With either JavaSerializer.writeObject(140 bytes of String) or 
KryoSerializer.writeObject(140 bytes of String), I didn't see a difference in 
the space used in the underlying OutputStream.
Does this mean that KryoSerializer really doesn't give us any benefit for 
the String type? I understand that for primitive types it shouldn't have any 
benefit, but how about the String type?
When we talk about lowering memory usage with KryoSerializer in Spark, in 
what cases can it bring significant benefits? It is my first experience with 
the KryoSerializer, so maybe I am totally wrong about its usage.
Thanks
Yong 