When I first learnt Spark, I was told that *cache()* is desirable anytime
one performs more than one Action on an RDD or DataFrame. For example,
consider the PySpark toy example below; it shows two approaches to doing
the same thing.

# Approach 1 (bad?)
df2 = someTransformation(df1)
a = df2.count()
b = df2.first()  # This step could take long, because df2 has to be
                 # created all over again

# Approach 2 (good?)
df2 = someTransformation(df1)
df2.cache()
a = df2.count()
b = df2.first() # Because df2 is already cached, this action is quick
df2.unpersist()
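
(As an aside, when I want to double-check that the cache actually took
effect, I poke at the DataFrame's is_cached flag and, in Spark >= 2.1, its
storageLevel property; a rough sketch against the df2 above:)

# Rough sanity check that cache() registered (is_cached is a plain
# attribute; the storageLevel property needs Spark >= 2.1)
df2.cache()
print(df2.is_cached)      # True once cache() has been called
print(df2.storageLevel)   # should show MEMORY_AND_DISK, the DataFrame default
df2.unpersist()
print(df2.is_cached)      # back to False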

The second approach shown above is somewhat clunky: it requires caching
every DataFrame that will be Acted on more than once, and then remembering
to call *unpersist()* afterwards to free up the memory.
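
One way I've tried to tame that bookkeeping is a small context-manager
wrapper of my own (just a sketch, not anything Spark provides):

from contextlib import contextmanager

@contextmanager
def cached(df):
    # Hypothetical helper: cache on entry, always unpersist on exit,
    # even if an Action in the body raises
    df.cache()
    try:
        yield df
    finally:
        df.unpersist()

# Usage: both Actions hit the cached df2, and unpersist() is handled
# automatically
with cached(someTransformation(df1)) as df2:
    a = df2.count()
    b = df2.first()

But that still begs the original question of whether the caching is needed
at all.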

*So my question is: is the second approach still necessary/desirable when
operating on DataFrames in newer versions of Spark (>=1.6)?*

Thanks!!

Apu
