Re: Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-11 Thread nitin
I was able to resolve this use case (thanks, Cheng Lian), where I wanted to
launch an executor on just the specific partition while also getting the batch
pruning optimisations of Spark SQL, by doing the following:

// take the RDD produced by the optimised physical plan instead of the SchemaRDD itself
val query = sql("SELECT * FROM cachedTable WHERE key = 1")
val plannedRDD = query.queryExecution.toRdd
// keep only partition 3, so a task is launched for that partition alone
val prunedRDD = PartitionPruningRDD.create(plannedRDD, _ == 3)
prunedRDD.collect()
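
For completeness, the partition index does not have to be hard-coded; the sketch
below is a hypothetical variant that derives it from the lookup key, and it only
holds under the assumption (mine, not part of the original setup) that the
underlying data was hash-partitioned on `key` with Spark's HashPartitioner before
caching. With any other partitioning, the index has to be found another way, as
with the hard-coded 3 above.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.PartitionPruningRDD

// assumption: data was hash-partitioned on `key` before caching;
// otherwise this index will not match the partition holding key = 1
val numPartitions = plannedRDD.partitions.length
val targetPartition = new HashPartitioner(numPartitions).getPartition(1)
val prunedByKey = PartitionPruningRDD.create(plannedRDD, _ == targetPartition)
prunedByKey.collect()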

Thanks a lot, Cheng, for suggesting the approach of doing things the other way round.




Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-09 Thread nitin
Hi All,

I have a use case where I have cached my schemaRDD and I want to launch
executors just on the partition which I know of (prime use-case of
PartitionPruningRDD).
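
(For anyone unfamiliar with PartitionPruningRDD: below is a minimal sketch of
what it does on a plain RDD, assuming an existing SparkContext `sc`. Only the
partitions whose index passes the filter are scheduled, so a task runs on just
that partition's data.)

import org.apache.spark.rdd.PartitionPruningRDD

val data = sc.parallelize(1 to 100, 4)                        // 4 partitions
val onlyPartition2 = PartitionPruningRDD.create(data, _ == 2)
onlyPartition2.collect()                                      // launches a task on partition 2 only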

I tried something like the following:

val partitionIdx = 2
val schemaRdd = hiveContext.table("myTable") // myTable is cached in memory
val partitionPrunedRDD = new PartitionPruningRDD(schemaRdd, _ == partitionIdx)
val partitionSchemaRDD = hiveContext.applySchema(partitionPrunedRDD, schemaRdd.schema)
partitionSchemaRDD.registerTempTable("myTablePartition2")
hiveContext.hql("select * from myTablePartition2 where id = 10001")

With this approach, a query I would expect to run in about 500 ms takes
3000-4000 ms. I think this is happening because the applySchema call loses the
queryExecution plan of the cached table.

But if I call partitionSchemaRDD.cache as well, I do get the ~500 ms
performance; the downside is that the same partition/data is then cached twice.

My question: can we create a PartitionPruningCachedSchemaRDD-like class which
prunes the partitions of InMemoryColumnarTableScan's RDD[CachedBatch] and
launches an executor on just the selected partition(s)?
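
(Note: the 2015-02-11 reply above is how this was eventually resolved, without
applySchema and without caching anything twice: prune the RDD produced by the
query's physical plan, so the cached, batch-pruned plan is reused. A minimal
sketch of that approach, assuming the target partition index is 2:)

import org.apache.spark.rdd.PartitionPruningRDD

val query = hiveContext.sql("select * from myTable where id = 10001")
// operate on the planned RDD directly; the in-memory columnar scan stays cached
val prunedRDD = PartitionPruningRDD.create(query.queryExecution.toRdd, _ == 2)
prunedRDD.collect()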

Thanks
-Nitin


