SparkSQL Performance Tuning Options

2015-01-27 Thread Manoj Samel
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.

The use case is a Spark-on-YARN app that will start up and serve as a query
server for multiple users, i.e. it is always up and running. At startup there
is the option to cache data and also pre-compute some result sets, hash maps,
etc. that are likely to be asked for by client APIs. In other words, there is
some room to use startup time to precompute/cache - but the query response time
requirement on the large data set is very stringent.

Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also OK).
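
For concreteness, here is a rough sketch of the startup/serving phases I have
in mind (the class, table, and path names below are made up, not real code we
have):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object QueryServer {
  // Hypothetical record type for the HDFS data set
  case class Txn(accountId: String, amount: Double, day: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("query-server"))
    val sqlContext = new SQLContext(sc)   // plain SQLContext, no HiveContext/metastore_db
    import sqlContext.createSchemaRDD     // implicit RDD[Txn] -> SchemaRDD conversion

    // Startup phase: load the data once and make it queryable.
    val txns = sc.textFile("hdfs:///data/txns").map { line =>
      val Array(acct, amt, day) = line.split(',')
      Txn(acct, amt.toDouble, day)
    }
    txns.registerTempTable("txns")
    sqlContext.cacheTable("txns")          // keep it in Spark SQL's in-memory cache

    // Also pre-compute a small result set clients are likely to ask for.
    val totalsByAccount = sqlContext
      .sql("SELECT accountId, SUM(amount) FROM txns GROUP BY accountId")
      .collect()

    // Serving phase: keep the app running and answer client queries from here,
    // e.g. via sqlContext.sql(...) against the registered table.
  }
}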

* Does Spark SQL execution use underlying partition information? (Data is
from HDFS.)
* Are there any ways to give hints to Spark SQL execution about any
precomputed/pre-cached RDDs?
* Packages spark.sql.execution, spark.sql.execution.joins and other sql.xxx
packages - is using these to tune the query plan recommended? I would
like to keep this as-needed if possible.
* Features not in the current release but scheduled for an upcoming release
would also be good to know about.

Thanks,

PS: This is not a small topic, so if someone prefers to start an offline
thread on the details, I can do that and summarize the conclusions back to this
thread.


Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian


On 1/27/15 5:55 PM, Cheng Lian wrote:


On 1/27/15 11:38 AM, Manoj Samel wrote:

Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db.

The use case is a Spark-on-YARN app that will start up and serve as a query
server for multiple users, i.e. it is always up and running. At startup
there is the option to cache data and also pre-compute some result sets,
hash maps, etc. that are likely to be asked for by client APIs. In other
words, there is some room to use startup time to precompute/cache - but the
query response time requirement on the large data set is very stringent.


Hoping to use SparkSQL (but a combination of SQL and RDD APIs is also 
OK).


* Does Spark SQL execution use underlying partition information?
(Data is from HDFS.)
No. For example, if the underlying data has already been partitioned
by some key, Spark SQL doesn't know it, and can't leverage that
information to avoid a shuffle when doing aggregation on that key.
However, partitioning the data ahead of time does help minimize
shuffle network IO. There's a JIRA ticket to make Spark SQL aware of
the underlying data distribution.
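
To make that concrete, a rough sketch of what "partitioning ahead of time"
could look like at the RDD level (rawRdd, accountId, and the partition count
are just placeholders):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.2

// Co-locate records with the same key before turning the RDD into a SchemaRDD.
// Spark SQL 1.2 still plans its own shuffle, but having records with the same
// key already on the same node can reduce shuffle network IO.
val prePartitioned = rawRdd
  .map(r => (r.accountId, r))
  .partitionBy(new HashPartitioner(200))
  .values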


Maybe you are asking about locality? If that's the case, I just want to
add that Spark SQL does understand the locality information of the
underlying data. It's obtained from the Hadoop InputFormat.


* Are there any ways to give hints to Spark SQL execution about
any precomputed/pre-cached RDDs?
Instead of caching the raw RDD, it's recommended to transform the raw RDD
into a SchemaRDD and then cache that, so that the in-memory columnar storage
can be used. Also, Spark SQL recognizes cached SchemaRDDs automatically.
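
For example, with a plain SQLContext (the RDD and table names here are just
placeholders):

import sqlContext.createSchemaRDD        // implicit conversion for RDDs of case classes

eventsRdd.registerTempTable("events")    // eventsRdd / "events" are placeholder names
sqlContext.cacheTable("events")          // stored in the in-memory columnar format

// Later queries pick up the cached columnar table automatically:
val counts = sqlContext.sql(
  "SELECT eventType, COUNT(*) FROM events GROUP BY eventType")
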
* Packages spark.sql.execution, spark.sql.execution.joins and other
sql.xxx packages - is using these to tune the query plan
recommended? Would like to keep this as-needed if possible.
Not sure whether I understood this question. Are you trying to use 
internal APIs to do customized optimizations?
* Features not in the current release but scheduled for an upcoming release
would also be good to know about.


Thanks,

PS: This is not a small topic, so if someone prefers to start an
offline thread on the details, I can do that and summarize the
conclusions back to this thread.







