Hi,
With the following code snippet, I cache a raw RDD (which is already in
memory, but just for illustration) and a DataFrame derived from it.
I expected the DataFrame cache to take less space than the RDD cache, but
the UI shows the opposite: the RDD cache takes 168 B, while the DataFrame
cache takes 272 B.
What data is actually cached when df.cache() is called? It looks as though
the DataFrame only caches the avg(age) result, which should be much smaller.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Student(name: String, age: Int)

val conf = new SparkConf().setMaster("local").setAppName("SparkSQL_Cache")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Array(Student("Jack", 21), Student("Mary", 22)))
rdd.cache()                                  // cache the raw RDD of Student objects
rdd.toDF().registerTempTable("TBL_STUDENT")
val df = sqlContext.sql("select avg(age) from TBL_STUDENT")
df.cache()                                   // cache the DataFrame holding the aggregate
df.show()                                    // action that materializes both caches
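For what it's worth, one way to dig into the size difference yourself is to
look at the physical plan and the per-RDD storage info. This is a minimal
sketch assuming the Spark 1.x APIs used above (run it after df.show, in the
same shell or app); the exact plan output depends on your Spark version.

// Print the extended plan; after cache() + an action, scanning the cached
// DataFrame should go through an in-memory (columnar) relation node.
df.explain(true)

// List every cached RDD with its memory footprint -- this is the same
// information the Storage tab of the UI (168 B vs 272 B) is reporting.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes, " +
    s"${info.numCachedPartitions} of ${info.numPartitions} partitions cached")
}

The DataFrame entry covers the cached query output plus Spark SQL's columnar
storage overhead, which may explain why it is larger than the tiny raw RDD
even though it holds only one row.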
