Hello,

I asked the following question [1] on Stack Overflow but haven't received an answer yet. I am now using this channel to give it more visibility and hopefully find someone who can help.

"*Context.* I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an application that iterates through those query files and passes each one to a standalone Spark application. The Spark application first parses the query, extracts the tables it uses, registers them (using registerTempTable() in Spark < 2 and createOrReplaceTempView() in Spark 2), and then executes the query (spark.sql()).

*Challenge.* Since registering the tables can sometimes be time-consuming, I would like to register each table only once, when it is first used, and keep that registration as metadata that subsequent queries can readily use without re-registering the table. It's a sort of intra-job caching, but as far as I know it is not any of the caching Spark offers (table caching).

Is that possible? If not, can anyone suggest another approach to accomplish the same goal (i.e., iterating through separate query files and running a querying Spark application without re-registering tables that have already been registered)?"
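
To make the goal concrete, here is a minimal sketch of the bookkeeping described above. It assumes all queries run inside one long-lived driver process, since temporary tables and views are scoped to the session that created them. The names extract_tables, TableRegistry, run_queries, register_fn and execute_fn are hypothetical, and the regex-based table extraction is a naive stand-in for the real query parser. In Spark, register_fn would be whatever loads a table and calls createOrReplaceTempView() on it, and execute_fn would wrap spark.sql().

```python
import re

# Hypothetical helper: naive table-name extraction, for illustration only.
# It does not handle subqueries, CTEs, or quoted identifiers.
TABLE_PATTERN = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE
)


def extract_tables(query):
    """Return the set of names that directly follow FROM or JOIN."""
    return set(TABLE_PATTERN.findall(query))


class TableRegistry:
    """Remembers which tables were already registered during this run."""

    def __init__(self, register_fn):
        # register_fn is an assumption: a callable taking a table name,
        # e.g. one that loads the data and registers a temp view for it.
        self._register_fn = register_fn
        self._registered = set()

    def ensure_registered(self, tables):
        """Register only unseen tables; return the newly registered names."""
        new_tables = sorted(set(tables) - self._registered)
        for name in new_tables:
            self._register_fn(name)
            self._registered.add(name)
        return new_tables


def run_queries(queries, registry, execute_fn):
    """Run each query, registering a table only the first time it appears."""
    for query in queries:
        registry.ensure_registered(extract_tables(query))
        execute_fn(query)
```

With this shape, the outer benchmarking loop would hand all query files to a single Spark application instead of launching one application per file, so the registry and the registered views survive from one query to the next.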

[1]: http://stackoverflow.com/questions/40549924/sparksql-intra-sparksql-application-table-registration

Cheers,
Mohamed
