Hi,

With the following code snippet, I cache both the raw RDD (which is already in memory, but just for illustration) and a DataFrame derived from it. I expected the DataFrame cache to take less space than the RDD cache, but the opposite happens: in the UI I see that the RDD cache takes 168 B, while the DataFrame cache takes 272 B. What data is actually cached when df.cache() is called, and when does the caching happen? It looks like the df only contains avg(age), which should be much smaller in size.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Case class backing the test data (defined outside main for the REPL).
case class Student(name: String, age: Int)

val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Cache the raw RDD of two Student records.
val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
rdd.cache()
rdd.toDF().registerTempTable("TBL_STUDENT")

// Cache the DataFrame holding only the single aggregate value.
val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
df.cache()
df.show()
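For what it's worth, one way I tried to see what gets cached (just a sketch, assuming the Spark 1.x SQLContext API) is to print the query plans before and after df.cache(); a cached DataFrame should show up in the physical plan as an InMemoryColumnarTableScan over an InMemoryRelation:

// Prints parsed, analyzed, optimized, and physical plans to stdout.
// Before df.cache() the physical plan is a plain aggregate; after
// df.cache() and a first action, it should scan the in-memory relation.
df.explain(true)

// For tables cached via sqlContext.cacheTable, this reports cache status:
sqlContext.isCached("TBL_STUDENT")

Note that df.cache() is lazy: nothing is materialized until an action such as df.show() runs, so the storage numbers in the UI only appear after the first action.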