Hi all,

I am trying to analyze PySpark's performance overhead. People often say PySpark is slower than Scala because of serialization/deserialization overhead. I tried the example from this post: https://0x0fff.com/spark-dataframes-are-faster-arent-they/. That post, and many other articles, claim that a straightforward Python implementation is the slowest precisely because of serialization/deserialization overhead.
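For reference, the comparison I am running is roughly the following (my own simplified version of the idea in that post, not the exact code from the article):

    from pyspark.sql import SparkSession, functions as F
    import time

    spark = SparkSession.builder.appName("pyspark-overhead-test").getOrCreate()
    sc = spark.sparkContext

    # Plain RDD version: every record is shipped to a Python worker process,
    # so it gets pickled/unpickled on the way in and out.
    start = time.time()
    rdd_sum = sc.range(0, 10 ** 7).map(lambda x: x * 2).sum()
    print("RDD:", rdd_sum, time.time() - start)

    # DataFrame version: the expression runs inside the JVM (Catalyst/Tungsten),
    # so no per-record Python serialization is involved.
    start = time.time()
    df_sum = spark.range(0, 10 ** 7).select(F.sum(F.col("id") * 2)).collect()[0][0]
    print("DataFrame:", df_sum, time.time() - start)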
However, when I actually looked at the logs in the Web UI, the serialization and deserialization times for PySpark did not seem to be any larger than those for Scala; the main contributor was "Executor Computing Time". So I cannot tell whether the slowdown comes from serialization/deserialization or simply from the Python code being slower than the equivalent Scala code.

My question is: does "Task Deserialization Time" in the Spark Web UI actually include the pickling/unpickling that PySpark does? If it does not, how can I measure the serialization/deserialization overhead?

Thanks,
Yeoul
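P.S. One thing I have been considering, to isolate the time spent inside my own Python code, is timing it directly with an accumulator, roughly like this (rough, untested sketch; note that iterating the input iterator also pulls records through PySpark's deserializer, so it is not a perfectly clean separation):

    import time

    # sc: the SparkContext from the snippet above
    python_time = sc.accumulator(0.0)

    def timed_partition(it):
        start = time.time()
        out = [x * 2 for x in it]   # the actual Python work per partition
        python_time.add(time.time() - start)
        return out

    total = sc.range(0, 10 ** 7).mapPartitions(timed_partition).sum()
    print("sum:", total)
    print("time spent in Python functions (s):", python_time.value)

Comparing python_time.value against the task durations shown in the Web UI would at least tell me how much of "Executor Computing Time" is my own Python code versus everything else, but I am not sure this is the right way to attribute the serialization cost.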