Hi all,

I am trying to analyze the performance overhead of PySpark. The common
explanation is that PySpark is slower than Scala because of
serialization/deserialization overhead. I tried the example in this post:
https://0x0fff.com/spark-dataframes-are-faster-arent-they/. That post, and
many other articles, say the straightforward Python implementation is the
slowest because of serialization/deserialization overhead.
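To make it concrete, this is roughly the kind of comparison I mean. It is
only a minimal sketch in the spirit of that post, not its exact benchmark;
the session setup and row count are placeholders of mine:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pickling-overhead").getOrCreate()
    df = spark.range(0, 10_000_000)   # single numeric column "id"

    # 1) "Straightforward Python" path: every row is pickled between the
    #    JVM and the Python worker so the lambda can run on it.
    rdd_sum = df.rdd.map(lambda row: row.id * 2).sum()

    # 2) DataFrame path: the same computation compiled to JVM expressions,
    #    so no per-row pickling should be needed.
    df_sum = df.select(F.sum(F.col("id") * 2)).collect()[0][0]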

However, when I actually looked at the logs in the Web UI, the serialization
and deserialization times for PySpark did not seem to be any larger than for
Scala. The main contributor was "Executor Computing Time". So I cannot tell
whether the slowdown comes from serialization or simply because Python code
runs slower than Scala code.

So my question is: does "Task Deserialization Time" in the Spark Web UI
actually include the serialization/deserialization time spent in PySpark? If
not, how can I measure the serialization/deserialization overhead directly?
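The closest thing I have found so far is to time the pickling of a sampled
chunk of rows directly at the driver, outside of Spark, as a rough proxy.
This assumes driver-side pickle speed is comparable to the workers', which
may not hold, and the sample size below is an arbitrary choice of mine:

    import pickle
    import time

    sample = df.limit(100_000).collect()   # list of Row objects

    start = time.perf_counter()
    blob = pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(blob)
    elapsed = time.perf_counter() - start

    print(f"pickle round-trip for {len(sample)} rows: "
          f"{elapsed:.3f}s ({len(blob)} bytes)")

Is there a better way to get this per task, from Spark itself?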

Thanks,
Yeoul


