Kryo won't make a major impact on PySpark, because PySpark stores data on the JVM side as byte[] arrays, which are fast to serialize even with Java serialization. But it may be worth a try: you would just set spark.serializer and not try to register any classes.

What might make more of an impact is caching data as MEMORY_ONLY_SER and turning on spark.rdd.compress, which will compress the cached partitions. In Java this adds some CPU overhead, but Python runs quite a bit slower anyway, so it might not matter, and it could speed things up by reducing GC pressure or letting you cache more data.
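For example, something like this should work (an untested sketch against the PySpark API; the app name and input path are just placeholders):

    from pyspark import SparkConf, SparkContext, StorageLevel

    conf = (SparkConf()
            .setAppName("serialization-tuning")  # placeholder name
            # Use Kryo for Spark's own JVM-side serialization;
            # no class registration is needed from Python
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer")
            # Compress serialized cached partitions (costs a bit of CPU)
            .set("spark.rdd.compress", "true"))
    sc = SparkContext(conf=conf)

    # Placeholder input; cache it in serialized (and compressed) form
    rdd = sc.textFile("hdfs:///path/to/data")
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)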
Matei

On Apr 14, 2014, at 12:24 PM, Diana Carroll <dcarr...@cloudera.com> wrote:

> I'm looking at the Tuning Guide suggestion to use Kryo instead of default
> serialization. My questions:
>
> Does pyspark use Java serialization by default, as Scala spark does? If so,
> then...
> can I use Kryo with pyspark instead? The instructions say I should register
> my classes with the Kryo Serialization, but that's in Java/Scala. If I
> simply set the spark.serializer variable for my SparkContext, will it at
> least use Kryo for Spark's own classes, even if I can't register any of my
> own classes?
>
> Thanks,
> Diana