If you have two different RDDs (as the two different references and RDD ids in 
your example show), then yes, Spark will cache two copies of exactly the same 
thing in memory.


There is no way for Spark to compare them and know that they hold the same 
content. If you define them as two RDDs, then they are two different RDDs, and 
they will be cached individually.
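
You can see this with plain RDDs; a minimal sketch (assuming a spark-shell 
with its SparkContext sc):

    val rdd1 = sc.parallelize(1 to 100).cache()
    val rdd2 = sc.parallelize(1 to 100).cache()
    rdd1.count()                        // materializes and caches rdd1
    rdd2.count()                        // materializes and caches rdd2
    println(sc.getPersistentRDDs.size)  // 2 -- same data, two separate cache entries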


Yong


________________________________
From: Taotao.Li <charles.up...@gmail.com>
Sent: Sunday, November 20, 2016 6:18 AM
To: Rabin Banerjee
Cc: Yong Zhang; user; Mich Talebzadeh; Tathagata Das
Subject: Re: Will spark cache table once even if I call read/cache on the same 
table multiple times

Hi, you can check my Stack Overflow question:
http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812

On Sat, Nov 19, 2016 at 3:16 AM, Rabin Banerjee 
<dev.rabin.baner...@gmail.com> wrote:
Hi Yong,

  But every time val tabdf = sqlContext.table(tablename) is called, tabdf.rdd 
has a new id, which can be checked by calling tabdf.rdd.id.
And,
https://github.com/apache/spark/blob/b6de0c98c70960a97b07615b0b08fbd8f900fbe7/core/src/main/scala/org/apache/spark/SparkContext.scala#L268

Spark maintains a Map of [RDD_ID, RDD]. As the RDD id keeps changing, will 
Spark cache the same data again and again?

For example,

val tabdf = sqlContext.table("employee")
tabdf.cache()
tabdf.someTransformation.someAction
println(tabdf.rdd.id)
val tabdf1 = sqlContext.table("employee")
tabdf1.cache() // <= will Spark go to disk and load the data into memory again, or look in the cache?
tabdf1.someTransformation.someAction
println(tabdf1.rdd.id)

Regards,
R Banerjee




On Fri, Nov 18, 2016 at 9:14 PM, Yong Zhang 
<java8...@hotmail.com> wrote:

That's correct, as long as you don't change the StorageLevel.


https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166
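
For example (a rough sketch, assuming a SparkContext sc in the spark-shell):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 10).persist(StorageLevel.MEMORY_ONLY)
    rdd.persist(StorageLevel.MEMORY_ONLY)  // same level again: allowed, nothing changes
    rdd.persist(StorageLevel.DISK_ONLY)    // different level: throws UnsupportedOperationException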



Yong

________________________________
From: Rabin Banerjee <dev.rabin.baner...@gmail.com>
Sent: Friday, November 18, 2016 10:36 AM
To: user; Mich Talebzadeh; Tathagata Das
Subject: Will spark cache table once even if I call read/cache on the same 
table multiple times

Hi All,

  I am working on a project where the code is divided into multiple reusable 
modules. I am not able to understand Spark persist/cache in that context.

My question is: will Spark cache a table once even if I call read/cache on the 
same table multiple times?

 Sample Code ::

  TableReader::

   def getTableDF(tablename: String, persist: Boolean = false): DataFrame = {
     val tabdf = sqlContext.table(tablename)
     if (persist) {
       tabdf.cache()
     }
     tabdf
   }

 Now
Module1::
 val emp = TableReader.getTableDF("employee")
 emp.someTransformation.someAction

Module2::
 val emp = TableReader.getTableDF("employee")
 emp.someTransformation.someAction

....

ModuleN::
 val emp = TableReader.getTableDF("employee")
 emp.someTransformation.someAction

Will Spark cache the emp table once, or will it cache the data every time I 
call it? Shall I maintain a global hashmap to handle that? Something like 
Map[String, DataFrame]?
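
For instance, a rough sketch of what I mean (assuming all modules share the 
ambient sqlContext; the names are just illustrative):

    import scala.collection.mutable
    import org.apache.spark.sql.DataFrame

    object TableReader {
      private val tables = mutable.Map[String, DataFrame]()

      // hand back the same DataFrame instance for repeated requests of a
      // table, so cache() is only ever called once per table
      def getTableDF(tablename: String, persist: Boolean = false): DataFrame =
        tables.getOrElseUpdate(tablename, {
          val tabdf = sqlContext.table(tablename)  // sqlContext assumed to be in scope
          if (persist) tabdf.cache()
          tabdf
        })
    }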

 Regards,
Rabin Banerjee







--
___________________
Quant | Engineer | Boy
___________________
blog:   http://litaotao.github.io
github: www.github.com/litaotao
