Hi,

As I understand it, Spark's memory is divided between execution memory and storage memory. Their sum is fixed (in the simplest form, the total memory allocated to the unified region), so by caching data into storage memory you reduce what is left for execution.
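In Spark >= 1.6 that split is governed by spark.memory.fraction (the share of the heap given to execution plus storage combined) and spark.memory.storageFraction (the part of that region within which cached blocks are protected from eviction). A minimal PySpark sketch, with illustrative values (0.6 and 0.5 happen to be the Spark 2.x defaults; the app name is made up):

from pyspark.sql import SparkSession

# spark.memory.fraction: heap share for execution + storage combined
# spark.memory.storageFraction: portion of that region in which cached
# blocks are safe from eviction by execution
spark = (SparkSession.builder
         .appName("memory-demo")
         .config("spark.memory.fraction", "0.6")
         .config("spark.memory.storageFraction", "0.5")
         .getOrCreate())

Execution can reclaim cached blocks down to the storageFraction boundary, so heavy caching competes directly with shuffles and joins for the same pool.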
Now:

1. cache() is an alias for persist() with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames).
2. Caching is lazy and only done once: the data is materialised on the first action after cache().
3. Both DataFrames and RDDs can be cached.

If you cache an RDD or DataFrame, it will persist in memory until it is evicted, as Spark uses an LRU (Least Recently Used) eviction policy. So if your RDD is moderately small and is accessed iteratively, caching it is advantageous for faster access. Otherwise, leave it as it is. The Spark documentation explains this:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

You can perform some tests by running both approaches and checking the Spark UI (default port 4040) under the Storage tab to see the amount of data cached.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 2 September 2016 at 21:21, Davies Liu <dav...@databricks.com> wrote:

> Caching an RDD/DataFrame always has some cost. In this case, I'd suggest
> that you do not cache the DataFrame; the first() is usually fast enough
> (it only computes the partitions as needed).
>
> On Fri, Sep 2, 2016 at 1:05 PM, apu <apumishra...@gmail.com> wrote:
> > When I first learnt Spark, I was told that cache() is desirable anytime
> > one performs more than one Action on an RDD or DataFrame. For example,
> > consider the PySpark toy example below; it shows two approaches to
> > doing the same thing.
> >
> > # Approach 1 (bad?)
> > df2 = someTransformation(df1)
> > a = df2.count()
> > b = df2.first()  # This step could take long, because df2 has to be
> >                  # created all over again
> >
> > # Approach 2 (good?)
> > df2 = someTransformation(df1)
> > df2.cache()
> > a = df2.count()
> > b = df2.first()  # Because df2 is already cached, this action is quick
> > df2.unpersist()
> >
> > The second approach shown above is somewhat clunky, because it requires
> > one to cache any dataframe that will be Acted on more than once,
> > followed by the need to call unpersist() later to free up memory.
> >
> > So my question is: is the second approach still necessary/desirable
> > when operating on DataFrames in newer versions of Spark (>=1.6)?
> >
> > Thanks!!
> >
> > Apu
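For reference, a minimal PySpark sketch of the side-by-side test suggested above (it assumes an existing SparkSession named spark; someTransformation is a hypothetical stand-in for the question's transformation, and the wall-clock timing is only a rough illustration):

import time
from pyspark.sql import functions as F

def someTransformation(df):
    # hypothetical stand-in for the transformation in the question
    return df.withColumn("squared", F.col("id") * F.col("id"))

df1 = spark.range(10 ** 7)

# Approach 1: no cache -- first() re-runs the lineage, but only for
# the partitions it actually needs
start = time.time()
df2 = someTransformation(df1)
a = df2.count()
b = df2.first()
print("uncached: %.2fs" % (time.time() - start))

# Approach 2: cache -- count() materialises the cache, first() reads it
start = time.time()
df2 = someTransformation(df1)
df2.cache()
a = df2.count()
b = df2.first()
print("cached:   %.2fs" % (time.time() - start))
df2.unpersist()  # free the storage memory when done

Between the cache() and unpersist() calls, the Storage tab of the Spark UI (port 4040 by default) will show how much of df2 is held in memory.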