Awesome, thank you Michael for the detailed example!
I'll look into whether I can use this approach for my use case. If so, I
could avoid the overhead of repeatedly registering a temp table for one-off
queries, instead registering the table once and relying on the injected
strategy. Don't know
Thanks for the link, I hadn't come across this.
> According to
> https://forums.databricks.com/questions/400/what-is-the-difference-between-registertemptable-a.html
>
> and I quote
>
> "registerTempTable()
>
> registerTempTable() creates an in-memory table that is scoped to the
> cluster in which
Hi again Mich,
"But the thing is that I don't explicitly cache the tempTables ..".
>
> I believe tempTable is created in-memory and is already cached
>
That surprises me since there is a sqlContext.cacheTable method to
explicitly cache a table in memory. Or am I missing something? This could
Hi Mich,
Thank you again for your reply.
> As I see it, you are caching the table already sorted:
>
> val keyValRDDSorted = keyValRDD.sortByKey().cache
>
> and in the next stage you are creating multiple tempTables (different
> ranges) that cache a subset of the rows already cached in the RDD. The data stored
Hi Mich,
Thank you for your quick reply!
> What type of table is the underlying table? Is it HBase, Hive ORC, or something else?
>
It is a custom datasource, but ultimately backed by HBase.
> By key, you mean a UNIQUE ID or something similar, and then you do multiple
> scans on the tempTable which stores
Hello,
I've got a Spark SQL DataFrame containing a "key" column. The queries I
want to run start by filtering on the key range. My question, in outline: is
it possible to sort the dataset by key so as to do efficient key-range
filters, before running a more complex SQL query?
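The setup being asked about can be sketched as follows, assuming the Spark 1.x-era API used elsewhere in the thread (sqlContext, registerTempTable); the table name kv and the sample data are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical standalone example: sort the pair RDD once, cache it,
// register it as a temp table, then run a key-range-filtered SQL query.
object SortedRangeQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sorted-range").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // sortByKey range-partitions the data; cache keeps the sorted result.
    val keyValRDD = sc.parallelize(Seq((3, "c"), (1, "a"), (9, "z"), (2, "b")))
    val keyValRDDSorted = keyValRDD.sortByKey().cache()

    // Register once, then issue range-filtered queries against it.
    keyValRDDSorted.toDF("key", "value").registerTempTable("kv")
    val result = sqlContext.sql(
      "SELECT key, value FROM kv WHERE key BETWEEN 1 AND 3 ORDER BY key")
    result.show()

    sc.stop()
  }
}
```

As far as I can tell, the SQL filter here does not by itself exploit the sort order: Spark SQL applies the predicate across all cached partitions, which is precisely why the question arises.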
I'm
Hi,
I want to take a SQL string as user input, then transform it before
execution. In particular, I want to modify the top-level projection (the
select clause), injecting additional columns to be retrieved by the query.
I was hoping to achieve this by hooking into Catalyst using
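One rough sketch of such a projection rewrite via Catalyst's logical-plan API (this leans on Spark 2.x internals that are not a stable public interface; the function name and the injected column are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Sketch (Spark 2.x internals): parse the user's SQL string and, when the
// plan's root is a projection, append one extra column to the select list.
// The column name passed in is a placeholder chosen by the caller.
def injectColumn(spark: SparkSession, sql: String, column: String): LogicalPlan =
  spark.sessionState.sqlParser.parsePlan(sql) match {
    case Project(projectList, child) =>
      Project(projectList :+ UnresolvedAttribute(column), child)
    case other => other // no top-level projection; leave the plan unchanged
  }
```

As far as I know, turning the rewritten plan back into a DataFrame relies on helpers that are package-private to org.apache.spark.sql (e.g. Dataset.ofRows), so a hook like this typically has to live inside that package.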