Re: Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread Mich Talebzadeh
Hi, As I understand it, Spark's memory allocation is divided into execution memory and storage memory, and the sum is deterministic (the total memory allocated, in the simplest form), so by using storage for caching you eat into that sum. Now: 1. cache() is an alias for persist(MEMORY_ONLY). 2. Caching is only done once. 3.
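
A minimal PySpark sketch of the cache()/persist() relationship Mich describes; the session and data here are illustrative, and note that in Spark 2.x the default storage level for a DataFrame's cache() is MEMORY_AND_DISK, while for an RDD it is MEMORY_ONLY:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-vs-persist").getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000))
rdd.cache()                               # same as rdd.persist(StorageLevel.MEMORY_ONLY)

df = spark.range(1000)
df.persist(StorageLevel.MEMORY_AND_DISK)  # explicit storage level for the DataFrame
df.count()                                # an action materialises the cached data (done once)
df.unpersist()                            # release storage memory when no longer needed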

Re: Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread Davies Liu
Caching an RDD/DataFrame always has some cost. In this case I'd suggest not caching the DataFrame; first() is usually fast enough (it only computes the partitions it needs). On Fri, Sep 2, 2016 at 1:05 PM, apu wrote: > When I first learnt Spark, I was told that
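
A short illustration of Davies' point, assuming a trivial stand-in for the someTransformation(df1) from apu's example; the names and data are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("no-cache-first").getOrCreate()
df1 = spark.range(10000000)                       # illustrative input
df2 = df1.withColumn("double", F.col("id") * 2)   # stands in for someTransformation(df1)

a = df2.first()   # fast even without caching: only the partitions needed for one row are computed
b = df2.count()   # recomputes df2, but that can still cost less than caching it first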

Is cache() still necessary for Spark DataFrames?

2016-09-02 Thread apu
When I first learnt Spark, I was told that *cache()* is desirable anytime one performs more than one Action on an RDD or DataFrame. For example, consider the PySpark toy example below; it shows two approaches to doing the same thing.

# Approach 1 (bad?)
df2 = someTransformation(df1)
a =
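
The archived message is cut off above; a hedged sketch of the two approaches it appears to contrast, treating someTransformation and the chosen actions as placeholders rather than the original code:

# Approach 1 (no cache): each action triggers recomputation of someTransformation(df1)
df2 = someTransformation(df1)
a = df2.count()
b = df2.first()

# Approach 2 (with cache): df2 is computed once on the first action, then reused
df2 = someTransformation(df1).cache()
a = df2.count()
b = df2.first()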