Hi,
The PySpark documentation for StorageLevel says: "Since the data is always serialized on the
Python side, all the constants use the serialized formats."
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.StorageLevel
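For reference, the Python-side constants can be checked directly. This is just a quick sketch (assuming a plain pyspark shell); based on the quoted sentence I would expect their deserialized flag to be False:

from pyspark import StorageLevel
# Field order in the repr is:
# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication)
print(StorageLevel.MEMORY_ONLY)
print(StorageLevel.MEMORY_AND_DISK)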
However, when I cache a DataFrame and look at its StorageLevel, it reports that the
cached DataFrame is stored deserialized. This looks like behavior that contradicts the
documentation. Could anyone please comment on this?
Here is the code to reproduce it:
from pyspark.sql.functions import col, length

source = [
    ("frighten",), ("watch",), ("dish",),
    ("reflect",), ("serious",), ("summer",),
    ("embrace",), ("transition",), ("Venus",)]

dfa = spark.createDataFrame(source, ["word"]).withColumn("num", length(col("word")))
dfa2 = dfa.cache()

dfa.storageLevel   # shows StorageLevel(True, True, False, True, 1)
dfa2.storageLevel  # same as above
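To make that repr easier to read, the individual flags can also be accessed as attributes. This is a minimal sketch of how I interpreted the output above (field order: useDisk, useMemory, useOffHeap, deserialized, replication):

lvl = dfa.storageLevel
print(lvl.useDisk, lvl.useMemory, lvl.useOffHeap, lvl.deserialized, lvl.replication)
# the fourth flag, deserialized, is the one that comes back True after cache()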
Thanks for reading!
Mitsutoshi Kiuchi