Spark 1.2, no Hive; we prefer not to use HiveContext, to avoid it creating a
metastore_db.

Use case: a Spark-on-YARN app will start and serve as a query server for
multiple users, i.e. it is always up and running. At startup, there is the
option to cache data and also pre-compute some result sets, hash maps, etc.
that clients are likely to ask for via the API. In other words, we can
spend startup time on precomputing/caching, but the query response time
requirement on a large data set is very stringent.
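
To make the startup phase concrete, here is a minimal sketch of what I
mean (the HDFS path, the Record case class, and the keying logic are
placeholder assumptions, not our real schema):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // PairRDDFunctions (reduceByKey etc.)

    case class Record(key: Int, value: String)

    val sc = new SparkContext(new SparkConf().setAppName("QueryServer"))

    // Base data, cached once at startup so per-query work avoids HDFS reads.
    val records = sc.textFile("hdfs:///data/records")   // placeholder path
      .map(_.split(","))
      .map(a => Record(a(0).toInt, a(1)))
      .cache()
    records.count()   // force materialization at startup, not on first query

    // Pre-compute one small result set and broadcast it to all executors
    // for constant-time lookups while serving queries.
    val countsByKey = sc.broadcast(
      records.map(r => (r.key, 1L)).reduceByKey(_ + _).collectAsMap())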

Hoping to use Spark SQL (but a combination of the SQL and RDD APIs is also OK).
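
A minimal sketch of the Spark SQL side, reusing the records RDD from above
with a plain SQLContext (so no metastore_db); the table name is again a
placeholder:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // plain SQLContext, no Hive metastore
    import sqlContext.createSchemaRDD     // implicit RDD[Record] -> SchemaRDD

    records.registerTempTable("records")
    sqlContext.cacheTable("records")      // in-memory columnar cache

    // Subsequent SQL over "records" is served from the cached columnar data.
    val counts = sqlContext.sql(
      "SELECT key, COUNT(*) AS cnt FROM records GROUP BY key")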

* Does Spark SQL execution use the underlying partition information? (The
data comes from HDFS.)
* Are there any ways to give "hints" to Spark SQL execution about
precomputed/pre-cached RDDs, beyond cacheTable as in the sketch above?
* Packages spark.sql.execution, spark.sql.execution.joins and the other
sql.xxx packages: is using these to tune the query plan recommended? I
would like to keep that on an as-needed basis if possible (the only plan
inspection I do today is sketched after this list).
* Features not in the current release but scheduled for an upcoming one
would also be good to know about.
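
For context on the third point, the only plan inspection I do today is
from the outside, via the SchemaRDD developer API; a sketch against the
cached "records" table from above:

    // Print the plan for a query to check whether the cached table
    // (and, ideally, the HDFS partitioning) is actually being used.
    val q = sqlContext.sql("SELECT key, COUNT(*) FROM records GROUP BY key")
    println(q.queryExecution)                // parsed -> analyzed -> optimized -> physical
    println(q.queryExecution.executedPlan)   // physical plan only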

Thanks,

PS: This is not a small topic, so if someone prefers to start an offline
thread on the details, I can do that and summarize the conclusions back to
this thread.
