Spark SQL - Point lookup optimisation in SchemaRDD?

nitin Mon, 09 Feb 2015 07:23:39 -0800

Hi All,

I have a use case where I have cached my SchemaRDD and want to launch tasks
only on the partition that I know holds the data (the prime use case for
PartitionPruningRDD).


I tried something like the following:

import org.apache.spark.rdd.PartitionPruningRDD

val partitionIdx = 2
val schemaRdd = hiveContext.table("myTable") // myTable is cached in memory
// Keep only the partition whose index equals partitionIdx
val partitionPrunedRDD = new PartitionPruningRDD(schemaRdd, _ == partitionIdx)
// Re-attach the original schema so the pruned RDD can be queried with SQL
val partitionSchemaRDD = hiveContext.applySchema(partitionPrunedRDD, schemaRdd.schema)
partitionSchemaRDD.registerTempTable("myTablePartition2")
hiveContext.hql("select * from myTablePartition2 where id=10001")

With this approach, the query takes 3000-4000 ms, whereas I expect it to run
in about 500 ms. I think this happens because I called "applySchema" and
lost the original queryExecution plan.

If I also call partitionSchemaRDD.cache, I do get the 500 ms performance,
but then the same partition's data is cached twice.

My question: can we create a class like PartitionPruningCachedSchemaRDD
that prunes the partitions of InMemoryColumnarTableScan's RDD[CachedBatch]
and launches tasks on only the selected partition(s)?
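
To make the pruning behaviour I am after concrete, here is a minimal sketch
on a plain cached RDD rather than a SchemaRDD. The local master, the app
name, the object name and the integer data are all assumptions made up for
illustration; only PartitionPruningRDD.create comes from Spark core.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.PartitionPruningRDD

object PruneOnePartition {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with a made-up app name, just for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("prune-sketch")
    val sc = new SparkContext(conf)
    try {
      // Four partitions of plain integers stand in for the cached table.
      val data = sc.parallelize(1 to 100, 4).cache()
      val partitionIdx = 2

      // Keep only the partition whose index matches; jobs on the pruned RDD
      // schedule tasks for the retained partitions only.
      val pruned = PartitionPruningRDD.create(data, (idx: Int) => idx == partitionIdx)

      // Number of partitions (and therefore tasks) left after pruning.
      println(pruned.partitions.length)

      // Point lookup restricted to the retained partition; 60 is an
      // arbitrary value chosen for illustration.
      println(pruned.filter(_ == 60).collect().mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```

What I would like is the same effect on the cached columnar batches, without
going through applySchema and without caching the data a second time.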

Thanks
-Nitin




