Kryo won't make a major impact on PySpark, because PySpark stores data in the JVM as byte[] 
objects, which are fast to serialize even with Java serialization. But it may be worth a try: 
you would just set spark.serializer and not try to register any classes. What 
might make more of an impact is persisting data as MEMORY_ONLY_SER and turning on 
spark.rdd.compress, which compresses the cached partitions. In Java this can add some CPU 
overhead, but Python already runs quite a bit slower, so it might not matter, and it 
might speed things up by reducing GC pressure or letting you cache more data.
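For concreteness, a minimal sketch of that setup in PySpark (the two config keys and the
KryoSerializer class name are real; the app name and input path are placeholders, and
StorageLevel.MEMORY_ONLY_SER assumes a Spark release of this era that still exposes it):

    from pyspark import SparkConf, SparkContext, StorageLevel

    # Use Kryo on the JVM side and compress serialized cached partitions.
    conf = (SparkConf()
            .setAppName("kryo-test")  # hypothetical app name
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
            .set("spark.rdd.compress", "true"))
    sc = SparkContext(conf=conf)

    lines = sc.textFile("hdfs:///path/to/input")  # hypothetical input path
    lines.persist(StorageLevel.MEMORY_ONLY_SER)   # keep blocks serialized in memory
    print(lines.count())

Note there's no class registration here: since everything PySpark ships across the
Python/JVM boundary is already byte[], Kryo would only touch Spark's own internal classes.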

Matei

On Apr 14, 2014, at 12:24 PM, Diana Carroll <dcarr...@cloudera.com> wrote:

> I'm looking at the Tuning Guide suggestion to use Kryo instead of default 
> serialization.  My questions:
> 
> Does pyspark use Java serialization by default, as Scala Spark does?  If so, 
> then...
> can I use Kryo with pyspark instead?  The instructions say I should register 
> my classes with Kryo serialization, but that's in Java/Scala.  If I 
> simply set the spark.serializer variable for my SparkContext, will it at 
> least use Kryo for Spark's own classes, even if I can't register any of my 
> own classes?
> 
> Thanks,
> Diana
