Hi all, my question is about the lazy execution mode of SchemaRDD, I guess. I know lazy evaluation is good; however, I still have this requirement.
For example, here is the first SchemaRDD, named result (select * from table where num > 1 and num < 4):

    results: org.apache.spark.sql.SchemaRDD =
    SchemaRDD[59] at RDD at SchemaRDD.scala:103
    == Query Plan ==
    == Physical Plan ==
    Filter ((num#0 > 1) && (num#0 < 4))
     ExistingRdd [num#0,str1#1,str2#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208

Then, from result, I create the second RDD with: select num, str1 from table

    results1: org.apache.spark.sql.SchemaRDD =
    SchemaRDD[60] at RDD at SchemaRDD.scala:103
    == Query Plan ==
    == Physical Plan ==
    Project [num#0,str1#1]
     Filter ((num#0 > 1) && (num#0 < 4))
      ExistingRdd [num#0,str1#1,str2#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208

Actually, I want the second RDD's plan to be based on result, not on the original table. How can I create a new SchemaRDD whose plan starts from the last RDD?

Thanks,
Tim
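For context, a minimal sketch of the workflow being described (this is an editor's reconstruction, not code from the original post; the context name sqlContext, the table name "table", and the use of cacheTable are all assumptions). The printed plan will still show the full logical lineage, but caching the first SchemaRDD's table makes later queries read its materialized rows rather than rescanning the base data:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SchemaRddLineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "sketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // First SchemaRDD, as in the post. Nothing runs yet: SchemaRDDs are lazy,
    // so only the query plan is built here.
    val result = sql("SELECT * FROM table WHERE num > 1 AND num < 4")

    // Register the first result as its own table and cache it; subsequent
    // queries against "result" then reuse the cached rows even though
    // their printed plans still include the original Filter lineage.
    // (In Spark 1.0 the method was named registerAsTable.)
    result.registerTempTable("result")
    sqlContext.cacheTable("result")

    // Second SchemaRDD, built on top of the first.
    val results1 = sql("SELECT num, str1 FROM result")
    results1.collect()
  }
}
```

Note this is a sketch under stated assumptions; it does not change how Catalyst prints the plan, only where the data is read from at execution time.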