On 1/27/15 11:38 AM, Manoj Samel wrote:
> Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.
>
> The use case is a Spark-on-YARN app that starts up and then serves as a
> query server for multiple users, i.e. it is always up and running. At
> startup, there is an option to cache data and also pre-compute some
> result sets, hash maps, etc. that are likely to be requested through
> client APIs. In other words, startup time can be spent on
> precomputing/caching, but the query response time requirement on the
> large data set is very stringent.
>
> Hoping to use Spark SQL (but a combination of SQL and RDD APIs is also
> OK).
> * Does Spark SQL execution use underlying partition information? (Data
>   is from HDFS)
No. For example, if the underlying data has already been partitioned by
some key, Spark SQL doesn't know it and can't leverage that information
to avoid a shuffle when aggregating on that key. However, partitioning
the data ahead of time does help minimize shuffle network IO. There's a
JIRA ticket to make Spark SQL aware of the underlying data distribution.
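
For reference, here's a minimal RDD-level sketch of the workaround
mentioned above, assuming an existing SparkContext sc (the HDFS path,
two-column line layout, and partition count are hypothetical). An RDD
aggregation that reuses the pre-set partitioner avoids the shuffle
entirely, which is exactly what Spark SQL can't do on its own yet:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed in compiled 1.2 apps)

    // Hypothetical comma-separated input keyed by the aggregation column.
    val pairs = sc.textFile("hdfs:///data/records").map { line =>
      val Array(k, v) = line.split(",", 2)
      (k, v.toInt)
    }

    // Pre-partition by key once and keep the result in memory.
    val prePartitioned = pairs.partitionBy(new HashPartitioner(64)).cache()

    // reduceByKey sees the existing HashPartitioner and avoids a shuffle;
    // an equivalent GROUP BY in Spark SQL would still shuffle today.
    val sums = prePartitioned.reduceByKey(_ + _)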
> * Are there any ways to give "hints" to Spark SQL execution about any
>   precomputed/pre-cached RDDs?
Instead of caching the raw RDD, it's recommended to transform the raw
RDD into a SchemaRDD and then cache that, so that the in-memory columnar
storage format can be used. Also, Spark SQL recognizes cached SchemaRDDs
automatically.
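
For concreteness, here's a minimal sketch of that pattern on Spark 1.2,
again assuming an existing SparkContext sc (the Record case class, HDFS
path, and table name are made up for illustration):

    import org.apache.spark.sql.SQLContext

    case class Record(key: String, value: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Record] -> SchemaRDD

    // Build a SchemaRDD from a raw RDD of case classes (schema inferred
    // via reflection).
    val records = sc.textFile("hdfs:///data/records").map { line =>
      val Array(k, v) = line.split(",", 2)
      Record(k, v.toInt)
    }

    // Register and cache as a table: cacheTable stores the data in the
    // in-memory columnar format instead of as raw Java objects.
    records.registerTempTable("records")
    sqlContext.cacheTable("records")

    // Later queries against "records" automatically hit the cached
    // columnar data.
    val totals = sqlContext.sql("SELECT key, SUM(value) FROM records GROUP BY key")

Note that cacheTable is lazy: the columnar buffers are built the first
time the cached table is scanned, so a warm-up query at startup fits
your precompute window nicely.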
> * Packages spark.sql.execution, spark.sql.execution.joins, and other
>   sql.xxx packages - is using these to tune the query plan recommended?
>   Would like to keep this as-needed if possible.
Not sure whether I understood this question correctly. Are you trying to
use internal APIs to perform custom optimizations? Note that these
packages are internal, and their APIs may change between releases.
> * Features not in the current release but scheduled for an upcoming
>   release would also be good to know about.
> Thanks,
>
> PS: This is not a small topic, so if someone prefers to start an
> offline thread on the details, I can do that and summarize the
> conclusions back to this thread.