If you have 2 different RDDs (as the 2 different references and RDD ids in your example show), then YES, Spark will cache the exact same data twice in memory.
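For example, here is a minimal sketch of that behavior in spark-shell (the input path is hypothetical, and sc is the usual SparkContext):

  // Two RDDs built from the same source get different RDD ids
  // and are tracked as separate cache entries.
  val rdd1 = sc.textFile("/data/employee.txt").cache()
  val rdd2 = sc.textFile("/data/employee.txt").cache()
  rdd1.count()   // materializes one copy of the data in memory
  rdd2.count()   // materializes a second, identical copy

  // Reusing the same reference caches the data only once:
  val rdd = sc.textFile("/data/employee.txt").cache()
  rdd.count()    // first action populates the cache
  rdd.count()    // second action is served from the cache

Cached blocks are keyed by RDD id, so the two different ids above mean two separate cache entries.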
There is no way for Spark to compare them and know that they hold the same content. You defined them as 2 RDDs, so they are different RDDs and will be cached individually.

Yong

________________________________
From: Taotao.Li <charles.up...@gmail.com>
Sent: Sunday, November 20, 2016 6:18 AM
To: Rabin Banerjee
Cc: Yong Zhang; user; Mich Talebzadeh; Tathagata Das
Subject: Re: Will spark cache table once even if I call read/cache on the same table multiple times

Hi, you can check my Stack Overflow question:
http://stackoverflow.com/questions/36195105/what-happens-if-i-cache-the-same-rdd-twice-in-spark/36195812#36195812

On Sat, Nov 19, 2016 at 3:16 AM, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:

Hi Yong,

But every time val tabdf = sqlContext.table(tablename) is called, tabdf.rdd has a new id, which can be checked by calling tabdf.rdd.id. And, per
https://github.com/apache/spark/blob/b6de0c98c70960a97b07615b0b08fbd8f900fbe7/core/src/main/scala/org/apache/spark/SparkContext.scala#L268
Spark maintains a Map of [RDD_ID, RDD]. As the RDD id keeps changing, will Spark cache the same data again and again?

For example:

  val tabdf = sqlContext.table("employee")
  tabdf.cache()
  tabdf.someTransformation.someAction
  println(tabdf.rdd.id)

  val tabdf1 = sqlContext.table("employee")
  tabdf1.cache()  // <= Will Spark again go to disk and load the data into memory, or look in the cache?
  tabdf1.someTransformation.someAction
  println(tabdf1.rdd.id)

Regards,
R Banerjee

On Fri, Nov 18, 2016 at 9:14 PM, Yong Zhang <java8...@hotmail.com> wrote:

That's correct, as long as you don't change the StorageLevel.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166

Yong

________________________________
From: Rabin Banerjee <dev.rabin.baner...@gmail.com>
Sent: Friday, November 18, 2016 10:36 AM
To: user; Mich Talebzadeh; Tathagata Das
Subject: Will spark cache table once even if I call read/cache on the same table multiple times

Hi All,

I am working on a project where the code is divided into multiple reusable modules, and I am not able to understand how Spark persist/cache behaves in that context. My question is: will Spark cache a table once, even if I call read/cache on the same table multiple times?

Sample code:

TableReader::

  def getTableDF(tablename: String, persist: Boolean = false): DataFrame = {
    val tabdf = sqlContext.table(tablename)
    if (persist) {
      tabdf.cache()
    }
    tabdf
  }

Now

Module1::

  val emp = TableReader.getTableDF("employee")
  emp.someTransformation.someAction

Module2::

  val emp = TableReader.getTableDF("employee")
  emp.someTransformation.someAction

....

ModuleN::

  val emp = TableReader.getTableDF("employee")
  emp.someTransformation.someAction

Will Spark cache the emp table once, or will it cache it again every time I call it? Shall I maintain a global hashmap to handle that, something like Map[String, DataFrame]? (See the sketch at the end of this thread.)

Regards,
Rabin Banerjee

--
___________________
Quant | Engineer | Boy
___________________
blog: http://litaotao.github.io
github: www.github.com/litaotao
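For completeness, here is a minimal sketch of the Map[String, DataFrame] registry idea raised above. The object name and structure are illustrative rather than an established API, and it is not thread-safe as written:

  import scala.collection.mutable
  import org.apache.spark.sql.{DataFrame, SQLContext}

  object TableRegistry {
    // One entry per table name: every module that asks for "employee"
    // gets back the same DataFrame reference, so cache() is applied
    // (and the cache populated) at most once per table.
    private val tables = mutable.Map[String, DataFrame]()

    def getTableDF(sqlContext: SQLContext, tablename: String,
                   persist: Boolean = false): DataFrame =
      tables.getOrElseUpdate(tablename, {
        val tabdf = sqlContext.table(tablename)
        if (persist) tabdf.cache()
        tabdf
      })
  }

  // Usage from any module:
  //   val emp = TableRegistry.getTableDF(sqlContext, "employee", persist = true)
  //   emp.someTransformation.someAction

Handing every module the same reference per table guarantees the cache-once behavior Yong confirms above, no matter how many modules ask for the same table.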