Hi,

As I understand it, Spark's memory is divided between execution memory and storage memory. Their sum is fixed (in the simplest form, the total memory allocated to the unified region), so by caching data into storage memory you reduce what is left for execution.
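In Spark >= 1.6 that split is governed by spark.memory.fraction (the share of the heap given to execution plus storage combined) and spark.memory.storageFraction (the part of that region within which cached blocks are protected from eviction). A minimal PySpark sketch, with illustrative values (0.6 and 0.5 happen to be the Spark 2.x defaults; the app name is made up):

from pyspark.sql import SparkSession

# spark.memory.fraction: heap share for execution + storage combined
# spark.memory.storageFraction: portion of that region in which cached
# blocks are safe from eviction by execution
spark = (SparkSession.builder
         .appName("memory-demo")
         .config("spark.memory.fraction", "0.6")
         .config("spark.memory.storageFraction", "0.5")
         .getOrCreate())

Execution can reclaim cached blocks down to the storageFraction boundary, so heavy caching competes directly with shuffles and joins for the same pool.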
Now:

1. cache() is an alias for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
2. Caching is lazy and only done once: the data is materialised on the first action after cache().
3. Both DataFrames and RDDs can be cached.

If you cache an RDD or DataFrame, it will persist in memory until it is evicted, as Spark uses an LRU (Least Recently Used) eviction policy. So if your RDD is moderately small and is accessed iteratively, caching it is advantageous for faster access. Otherwise, leave it as it is. The Spark documentation explains this:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

You can perform some tests by running both approaches and checking the Spark UI (default port 4040) under the Storage tab to see the amount of data cached.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 2 September 2016 at 21:21, Davies Liu <dav...@databricks.com> wrote:

> Caching an RDD/DataFrame always has some cost. In this case, I'd suggest
> that you do not cache the DataFrame; the first() is usually fast enough
> (it only computes the partitions as needed).
>
> On Fri, Sep 2, 2016 at 1:05 PM, apu <apumishra...@gmail.com> wrote:
> > When I first learnt Spark, I was told that cache() is desirable anytime
> > one performs more than one Action on an RDD or DataFrame. For example,
> > consider the PySpark toy example below; it shows two approaches to
> > doing the same thing.
> >
> > # Approach 1 (bad?)
> > df2 = someTransformation(df1)
> > a = df2.count()
> > b = df2.first()  # This step could take long, because df2 has to be
> >                  # created all over again
> >
> > # Approach 2 (good?)
> > df2 = someTransformation(df1)
> > df2.cache()
> > a = df2.count()
> > b = df2.first()  # Because df2 is already cached, this action is quick
> > df2.unpersist()
> >
> > The second approach shown above is somewhat clunky, because it requires
> > one to cache any dataframe that will be Acted on more than once,
> > followed by the need to call unpersist() later to free up memory.
> >
> > So my question is: is the second approach still necessary/desirable
> > when operating on DataFrames in newer versions of Spark (>=1.6)?
> >
> > Thanks!!
> >
> > Apu
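For reference, a minimal PySpark sketch of the side-by-side test suggested above (it assumes an existing SparkSession named spark; someTransformation is a hypothetical stand-in for the question's transformation, and the wall-clock timing is only a rough illustration):

import time
from pyspark.sql import functions as F

def someTransformation(df):
    # hypothetical stand-in for the transformation in the question
    return df.withColumn("squared", F.col("id") * F.col("id"))

df1 = spark.range(10 ** 7)

# Approach 1: no cache -- first() re-runs the lineage, but only for
# the partitions it actually needs
start = time.time()
df2 = someTransformation(df1)
a = df2.count()
b = df2.first()
print("uncached: %.2fs" % (time.time() - start))

# Approach 2: cache -- count() materialises the cache, first() reads it
start = time.time()
df2 = someTransformation(df1)
df2.cache()
a = df2.count()
b = df2.first()
print("cached:   %.2fs" % (time.time() - start))
df2.unpersist()  # free the storage memory when done

Between the cache() and unpersist() calls, the Storage tab of the Spark UI (port 4040 by default) will show how much of df2 is held in memory.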