Hi All,

I have a use case where my SchemaRDD is cached and I want to launch tasks only on the partition that I already know holds the data (the prime use case for PartitionPruningRDD).
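
To make the intent concrete, here is a minimal sketch of the pruning I mean, on a plain RDD rather than the cached table. It assumes Spark is on the classpath and uses a local master; PruneSketch, pruneTo and the sample data are made-up names for illustration only, not part of my actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

object PruneSketch {
  // Hypothetical helper: keep only the partition with the given index,
  // so a job on the result schedules a single task.
  def pruneTo[T](rdd: RDD[T], partitionIdx: Int): PartitionPruningRDD[T] =
    PartitionPruningRDD.create(rdd, (idx: Int) => idx == partitionIdx)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("prune-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100, 4) // 4 partitions of sample data
      val pruned = pruneTo(rdd, 2)
      // Only one partition survives the pruning.
      println(s"partitions after pruning: ${pruned.partitions.length}")
      // collect() runs tasks only for the surviving partition.
      println(s"rows in that partition: ${pruned.collect().length}")
    } finally {
      sc.stop()
    }
  }
}
```

This is the behaviour I would like to get for the cached table as well: one task, on the one partition of interest.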

I tried something like the following:

    import org.apache.spark.rdd.PartitionPruningRDD

    val partitionIdx = 2
    val schemaRdd = hiveContext.table("myTable") // myTable is cached in memory
    // keep only the partition whose index equals partitionIdx
    val partitionPrunedRDD = new PartitionPruningRDD(schemaRdd, _ == partitionIdx)
    // wrap the pruned rows in a SchemaRDD again so they can be queried with SQL
    val partitionSchemaRDD = hiveContext.applySchema(partitionPrunedRDD, schemaRdd.schema)
    partitionSchemaRDD.registerTempTable("myTablePartition2")
    hiveContext.hql("select * from myTablePartition2 where id=10001")

With this, the query takes 3000-4000 ms where I expect it to take about 500 ms. I think this happens because I called applySchema and lost the original queryExecution plan. If I also call partitionSchemaRDD.cache, I do get the 500 ms performance, but then the same partition's data is cached twice.

My question: can we create something like a PartitionPruningCachedSchemaRDD class that prunes the partitions of InMemoryColumnarTableScan's RDD[CachedBatch] and launches tasks on just the selected partition(s)?

Thanks
-Nitin