Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-06 Thread Gylfi
Hi. 

Just a few quick comments on your question. 

If you drill into the job (click the link for the subtasks) you can get a more
detailed view of the tasks. 
One of the things reported is the time spent on serialization. 
If that is your dominant factor, it should be reflected there, right? 

Are you sure the input data is not getting cached between runs? That is, does
the order of the experiments matter, and did you explicitly flush the
operating system's memory cache between runs? 
If you now run the old experiment again, does it take the same amount of
time? 

Did you validate that the results were actually correct? 

Hope this helps.

Regards, 
Gylfi.  



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Kryo-Serializer-is-slower-than-Java-Serializer-in-TeraSort-tp23621p23659.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Gavin Liu
Hi,

I am using the TeraSort benchmark from ehiggs's branch,
https://github.com/ehiggs/spark-terasort. I noticed that TeraSort.scala
uses the Kryo serializer, so I made a small change from
org.apache.spark.serializer.KryoSerializer to
org.apache.spark.serializer.JavaSerializer to see the time difference.
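
For reference, a minimal sketch of such a switch, assuming the serializer is
selected through the spark.serializer setting on a SparkConf. How
TeraSort.scala actually sets it is not shown in this thread, and the confFor
helper and the application name below are hypothetical. Spark must be on the
classpath for this to compile.

```scala
import org.apache.spark.SparkConf

object SerializerSwitch {
  // Hypothetical helper: build a SparkConf that differs only in the
  // serializer, so two benchmark runs are otherwise configured identically.
  def confFor(serializerClass: String): SparkConf =
    new SparkConf()
      .setAppName("TeraSort-serializer-comparison") // hypothetical name
      .set("spark.serializer", serializerClass)

  // Run 1: Kryo, as in the benchmark as described above.
  val kryoConf: SparkConf =
    confFor("org.apache.spark.serializer.KryoSerializer")

  // Run 2: Spark's default Java serialization.
  val javaConf: SparkConf =
    confFor("org.apache.spark.serializer.JavaSerializer")
}
```

Keeping everything else in the configuration identical makes it easier to
attribute a timing difference to the serializer alone.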

Curiously, the Java serializer is much quicker than Kryo, and no error is
reported when I run the program. Here are the records from the history
server: the first is Kryo, the second is the Java default. 

1. http://apache-spark-user-list.1001560.n3.nabble.com/file/n23621/kryo.png 

2. http://apache-spark-user-list.1001560.n3.nabble.com/file/n23621/java.png 

I am wondering whether I did something wrong or whether there is some other
reason behind this result.

Thanks for any help,
Gavin



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-Kryo-Serializer-is-slower-than-Java-Serializer-in-TeraSort-tp23621.html



Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Akhil Das
It looks like it spent more time writing/transferring the 40GB of shuffle
data when you used Kryo. And, surprisingly, JavaSerializer has only 700MB of
shuffle?

Thanks
Best Regards





Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Will Briggs
That code doesn't appear to be registering classes with Kryo, which means the 
fully qualified class name is stored with every Kryo record. The Spark 
documentation has more on this: 
https://spark.apache.org/docs/latest/tuning.html#data-serialization
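
A minimal sketch of what registration could look like, using
SparkConf.registerKryoClasses as described in the tuning guide. The
SortRecord class and the application name are hypothetical stand-ins: the
actual record types used in TeraSort.scala are not shown in this thread, so
the list of classes to register is an assumption. Spark must be on the
classpath for this to compile.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type, for illustration only; replace it with the
// classes the job really shuffles.
case class SortRecord(key: Array[Byte], value: Array[Byte])

object KryoRegistration {
  val conf: SparkConf = new SparkConf()
    .setAppName("TeraSort-kryo-registered") // hypothetical name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Optional: make Kryo throw an exception when it meets an unregistered
    // class, instead of writing the class name with each object.
    .set("spark.kryo.registrationRequired", "true")
    // Register the classes that appear in shuffled records (assumed list).
    .registerKryoClasses(
      Array[Class[_]](classOf[SortRecord], classOf[Array[Byte]]))
}
```

Setting spark.kryo.registrationRequired to true is one way to find out which
classes are still unregistered, since the job then fails on the first one
instead of quietly falling back to writing class names.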

Regards,
Will
