Hi Spark devs, It is hard to track everything going on in Spark with so many pull requests and JIRA tickets. Below are 4 major improvements that will likely be in Spark 1.6. We have already done prototyping for all of them, and want feedback on their design.
1. SPARK-9850 Adaptive query execution in Spark https://issues.apache.org/jira/browse/SPARK-9850 Historically, query planning is done using statistics before the execution begins. However, the query engine doesn't always have perfect statistics before execution, especially on fresh data with blackbox UDFs. SPARK-9850 proposes adaptively picking executions plans based on runtime statistics. 2. SPARK-9999 Type-safe API on top of Catalyst/DataFrame https://issues.apache.org/jira/browse/SPARK-9999 A high level, typed API built on top of Catalyst/DataFrames. This API can leverage all the work in Project Tungsten to have more robust and efficient execution (including memory management, code generation, and query optimization). This API is tentatively named Dataset (i.e. the last D in RDD). 3. SPARK-10000 Unified memory management (by consolidating cache and execution memory) https://issues.apache.org/jira/browse/SPARK-10000 Spark statically divides memory into multiple fractions. The two biggest ones are cache (aka storage) memory and execution memory. Out of the box, only 16% of the memory is used for execution. That is to say, if an application is not using caching, it is wasting majority of the memory resource with the default configuration. SPARK-10000 proposes a solution to dynamically allocate memory for these two fractions, and should improve performance for large workloads without configuration tuning. 4. SPARK-10810 Improved session management in Spark SQL and DataFrames https://issues.apache.org/jira/browse/SPARK-10810 Session isolation & management is important in SQL query engines. In Spark, this is slightly more complicated since users can also use DataFrames interactively beyond SQL. SPARK-10810 implements session management for both SQL's JDBC/ODBC servers, as well as the DataFrame API. Most of this work has been merged already in this pull request: https://github.com/apache/spark/pull/8909