Hi,

The Spark documentation for pyspark.StorageLevel says: "Since the data is always serialized on the Python side, all the constants use the serialized formats."

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.StorageLevel
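
For reference, the predefined constants themselves do seem to match that sentence. A minimal check, assuming the constructor signature StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1):

from pyspark import StorageLevel

# The fourth flag is "deserialized"; the Python-side constants are
# declared with it set to False (i.e. serialized).
print(StorageLevel.MEMORY_AND_DISK)
# I would expect: StorageLevel(True, True, False, False, 1)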

But when I cache a DataFrame and check its StorageLevel, it shows that the cached DataFrame is stored deserialized (the fourth flag is True). This looks like behavior that contradicts the documentation. Could anyone comment on this?

Here is code to reproduce it:

from pyspark.sql.functions import col, length

# run in the pyspark shell, where `spark` is the SparkSession
source = [
    ("frighten",), ("watch",), ("dish",),
    ("reflect",), ("serious",), ("summer",),
    ("embrace",), ("transition",), ("Venus",)]
dfa = spark.createDataFrame(source, ["word"]).withColumn("num", length(col("word")))
dfa2 = dfa.cache()

print(dfa.storageLevel)   # shows StorageLevel(True, True, False, True, 1)
print(dfa2.storageLevel)  # same as above
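
For comparison, here is an unverified sketch that persists with an explicit Python-side constant; if the constant is honored, I would expect the reported level to show the deserialized flag as False:

from pyspark import StorageLevel

dfb = spark.createDataFrame(source, ["word"]).withColumn("num", length(col("word")))
dfb.persist(StorageLevel.MEMORY_AND_DISK)
print(dfb.storageLevel)  # expected: StorageLevel(True, True, False, False, 1)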


Thanks for reading!

Mitsutoshi Kiuchi
