Hi,
As I understand it, Spark's memory is divided between execution memory and
storage memory. Their sum is fixed (in the simplest form, the total memory
allocated), so by caching into storage you eat into the share left for
execution.
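For reference, a minimal sketch of the configuration knobs behind that
split (the values shown are the Spark defaults; the app name is made up):

from pyspark.sql import SparkSession

# Unified memory pool = spark.memory.fraction * (heap - reserved).
# Execution and storage share this pool; spark.memory.storageFraction
# is the part of it within which cached blocks are safe from eviction.
spark = (SparkSession.builder
    .appName("memory-demo")                         # made-up name
    .config("spark.memory.fraction", "0.6")         # default
    .config("spark.memory.storageFraction", "0.5")  # default
    .getOrCreate())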
Now
1. cache() is an alias for persist(StorageLevel.MEMORY_ONLY) on RDDs (for
DataFrames, cache() defaults to MEMORY_AND_DISK)
2. caching is lazy and materializes only once, on the first action that
computes the data; later actions reuse it (see the sketch after this list)
3. caching an RDD/DataFrame always has some cost. In this case I'd suggest
not caching the DataFrame; first() is usually fast enough on its own, since
it only computes the partitions it needs.
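To make points 1 and 2 concrete, a small self-contained sketch (the data
here is a toy placeholder, just for illustration):

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(1000))        # toy data

rdd.persist(StorageLevel.MEMORY_ONLY)    # exactly what rdd.cache() does

rdd.count()      # first action: computes the RDD and fills the cache
rdd.count()      # later actions read the cached partitions instead

rdd.unpersist()  # release the storage memory when done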
On Fri, Sep 2, 2016 at 1:05 PM, apu wrote:
When I first learnt Spark, I was told that *cache()* is desirable anytime
one performs more than one Action on an RDD or DataFrame. For example,
consider the PySpark toy example below; it shows two approaches to doing
the same thing.
# Approach 1 (bad?)
df2 = someTransformation(df1)
a =