Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-02 Thread Michael David Pedersen
Awesome, thank you Michael for the detailed example! I'll look into whether I can use this approach for my use case. If so, I could avoid the overhead of repeatedly registering a temp table for one-off queries, instead registering the table once and relying on the injected strategy. Don't know
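For context, the "injected strategy" being referred to is the planner-extension hook Spark 2.0 exposes as spark.experimental.extraStrategies. A minimal sketch of wiring one in (the HBaseKeyLookup name and its matching logic are hypothetical, not from this thread):

```scala
import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy that would recognize an equality filter on the
// sorted key and plan it as a keyed lookup instead of a full scan.
object HBaseKeyLookup extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Match e.g. Filter(EqualTo(key, literal), relation) here and return
    // a custom SparkPlan; returning Nil falls through to the built-in
    // strategies for everything else.
    case _ => Nil
  }
}

val spark = SparkSession.builder.appName("demo").getOrCreate()
// Register the table once, then rely on the injected strategy:
spark.experimental.extraStrategies = HBaseKeyLookup :: Nil
```

This avoids re-registering a temp table per query, since the strategy fires for every plan the session produces.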

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael Armbrust
registerTempTable is backed by an in-memory hash table that maps table name (a string) to a logical query plan. Fragments of that logical query plan may or may not be cached (but calling register alone will not result in any materialization of results). In Spark 2.0 we renamed this function to
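In other words, registration only records a name-to-logical-plan mapping; nothing is computed until an action runs. A small spark-shell sketch of the distinction, using the Spark 1.6-era API discussed in this thread (the JSON path is illustrative):

```scala
// Registration just maps a name to the DataFrame's logical plan;
// no Spark job runs at this point.
val df = sqlContext.read.json("people.json")  // illustrative source
df.registerTempTable("people")                // no materialization here

// Execution (and any caching of plan fragments) happens only when an
// action is invoked:
sqlContext.sql("SELECT count(*) FROM people").show()

// Pinning the table in memory is a separate, explicit step:
sqlContext.cacheTable("people")
```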

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
It would be great if we establish this. I know in Hive these temporary tables ("CREATE TEMPORARY TABLE ...") are private to the session and are put in a hidden staging directory as below: /user/hive/warehouse/.hive-staging_hive_2016-07-10_22-58-47_319_5605745346163312826-10, and removed when the

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Thanks for the link, I hadn't come across this. According to https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html, and I quote: "registerTempTable() creates an in-memory table that is scoped to the cluster in which

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Mich Talebzadeh
A bit of a gray area here I am afraid; I was trying to experiment with it. According to https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html, and I quote: "registerTempTable() creates an in-memory table that is scoped to the cluster

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-11-01 Thread Michael David Pedersen
Hi again Mich. "But the thing is that I don't explicitly cache the tempTables ...". "I believe tempTable is created in-memory and is already cached." That surprises me, since there is a sqlContext.cacheTable method to explicitly cache a table in memory. Or am I missing something? This could
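For reference, the explicit-caching API mentioned here can be checked with sqlContext.isCached, which makes the point that registration alone does not cache. A sketch (assuming df is some existing DataFrame):

```scala
df.registerTempTable("tmp")

// Registration alone does not cache anything:
println(sqlContext.isCached("tmp"))   // expected: false

// Explicitly pin the table in memory (materialized when the table is
// next computed; in Spark 2.0 this moves to spark.catalog.cacheTable):
sqlContext.cacheTable("tmp")

// ...and release it again when done:
sqlContext.uncacheTable("tmp")
```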

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
well I suppose one can drop tempTable as below

scala> df.registerTempTable("tmp")
scala> spark.sql("select count(1) from tmp").show
+--------+
|count(1)|
+--------+
|  904180|
+--------+
scala> spark.sql("drop table if exists tmp")
res22: org.apache.spark.sql.DataFrame = []

Also your point
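Worth noting: rather than issuing DROP TABLE through SQL, there is a dedicated API for removing the registration itself (Spark 1.x naming; renamed in Spark 2.0):

```scala
// Spark 1.x: remove the temp table registration from the session catalog.
sqlContext.dropTempTable("tmp")

// Spark 2.0 equivalent:
spark.catalog.dropTempView("tmp")
```

Dropping the registration removes the name-to-plan mapping; it does not touch any underlying source data.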

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich, Thank you again for your reply. "As I see you are caching the table already sorted: val keyValRDDSorted = keyValRDD.sortByKey().cache and the next stage is you are creating multiple tempTables (different ranges) that cache a subset of rows already cached in RDD." The data stored

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
Hi Michael, As I see it, you are caching the table already sorted (val keyValRDDSorted = keyValRDD.sortByKey().cache), and the next stage is that you are creating multiple tempTables (different ranges) that cache a subset of rows already cached in the RDD. The data stored in tempTable is in Hive columnar

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
Hi Mich, Thank you for your quick reply! "What type of table is the underlying table? Is it Hbase, Hive ORC or what?" It is a custom datasource, but ultimately backed by HBase. "By Key you mean a UNIQUE ID or something similar and then you do multiple scans on the tempTable which stores

Re: Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Mich Talebzadeh
here a more efficient way of achieving the desired result? Any pointers would be much appreciated. Many thanks, Michael PS: This question was also asked on StackOverflow - http://stackoverflow.com/questions/40129411/efficient-filtering-on-spark-sql-dataframes-with-ordered-keys.

Efficient filtering on Spark SQL dataframes with ordered keys

2016-10-31 Thread Michael David Pedersen
sult? Any pointers would be much appreciated. Many thanks, Michael PS: This question was also asked on StackOverflow - http://stackoverflow.com/questions/40129411/efficient-filtering-on-spark-sql-dataframes-with-ordered-keys.