Hi Spark devs,

It is hard to track everything going on in Spark with so many pull requests
and JIRA tickets. Below are 4 major improvements that will likely be in
Spark 1.6. We have already prototyped all of them and would like feedback
on their design.


1. SPARK-9850 Adaptive query execution in Spark
https://issues.apache.org/jira/browse/SPARK-9850

Historically, query planning has been done using statistics before execution
begins. However, the query engine doesn't always have perfect statistics
before execution, especially on fresh data with blackbox UDFs. SPARK-9850
proposes adaptively picking execution plans based on runtime statistics.
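
As a rough illustration of the idea (not the design in SPARK-9850), the toy
sketch below picks a join strategy once from a plan-time size estimate and
once from a size observed at runtime. The function name, the strategy labels
and the 10 MB cutoff are all assumptions made up for this example.

```python
# Toy model of adaptive planning: the same decision rule gives a different
# answer once real sizes are known. Names and threshold are hypothetical.

BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # illustrative cutoff only


def choose_join_strategy(left_bytes, right_bytes,
                         threshold=BROADCAST_THRESHOLD_BYTES):
    """Return "broadcast" if the smaller input fits under the threshold."""
    if min(left_bytes, right_bytes) <= threshold:
        return "broadcast"
    return "shuffle"


# Plan time: a blackbox UDF filters the right side, so its selectivity is
# unknown and the planner assumes the input keeps its full 2 GB size.
planned = choose_join_strategy(50 * 1024**3, 2 * 1024**3)

# Runtime: after the filter stage has run, the right side is only 3 MB.
adapted = choose_join_strategy(50 * 1024**3, 3 * 1024**2)

print(planned, "->", adapted)
```

With static planning the shuffle join is locked in before any data is read;
an adaptive planner can switch once the real size is known.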


2. SPARK-9999 Type-safe API on top of Catalyst/DataFrame
https://issues.apache.org/jira/browse/SPARK-9999

A high-level, typed API built on top of Catalyst/DataFrames. This API can
leverage all the work in Project Tungsten to have more robust and efficient
execution (including memory management, code generation, and query
optimization). This API is tentatively named Dataset (i.e. the last D in
RDD).


3. SPARK-10000 Unified memory management (by consolidating cache and
execution memory)
https://issues.apache.org/jira/browse/SPARK-10000

Spark statically divides memory into multiple fractions. The two biggest
ones are cache (aka storage) memory and execution memory. Out of the box,
only 16% of the memory is used for execution. That is to say, if an
application is not using caching, it is wasting the majority of its memory
with the default configuration. SPARK-10000 proposes dynamically allocating
memory between these two fractions, which should improve performance for
large workloads without configuration tuning.
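
For reference, here is a small sketch of where the 16% figure comes from. It
assumes the current default settings (spark.shuffle.memoryFraction = 0.2 with
spark.shuffle.safetyFraction = 0.8, and spark.storage.memoryFraction = 0.6
with spark.storage.safetyFraction = 0.9); those values are not stated in this
email, so please check them against your version. The
unified_execution_limit function is a hypothetical toy model of a shared
pool, not the proposed implementation.

```python
# Arithmetic behind the static split, under the assumed default fractions.

def static_split(heap_bytes, storage_fraction=0.6, storage_safety=0.9,
                 shuffle_fraction=0.2, shuffle_safety=0.8):
    """Return (storage_bytes, execution_bytes) for a statically split heap."""
    storage = heap_bytes * storage_fraction * storage_safety
    execution = heap_bytes * shuffle_fraction * shuffle_safety
    return storage, execution


def unified_execution_limit(pool_bytes, storage_used_bytes):
    """Toy model: execution may use whatever the cache is not using."""
    return max(pool_bytes - storage_used_bytes, 0)


heap = 4 * 1024**3  # a 4 GB executor heap, chosen only for illustration
storage, execution = static_split(heap)
storage_share = storage / heap
execution_share = execution / heap

print(f"storage: {storage_share:.0%}, execution: {execution_share:.0%}")
```

In the static model an application that caches nothing is still limited to
the execution share; in the toy shared-pool model the same application could
use the whole pool.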


4. SPARK-10810 Improved session management in Spark SQL and DataFrames
https://issues.apache.org/jira/browse/SPARK-10810

Session isolation & management is important in SQL query engines. In Spark,
this is slightly more complicated since users can also use DataFrames
interactively beyond SQL. SPARK-10810 implements session management for
both the SQL JDBC/ODBC servers and the DataFrame API.
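
To make "session isolation" concrete, here is a minimal toy model. It assumes
each session keeps its own configuration and temporary tables while sharing
one underlying context; that split is an assumption for illustration, and the
SharedContext and Session classes are hypothetical, not Spark's API.

```python
# Toy model of session isolation. All class names are made up.

class SharedContext:
    """State shared by every session, e.g. cached data."""

    def __init__(self):
        self.cached_tables = {}


class Session:
    """Per-user state: settings and temporary tables are private."""

    def __init__(self, shared):
        self.shared = shared
        self.conf = {}
        self.temp_tables = {}

    def set_conf(self, key, value):
        self.conf[key] = value

    def register_temp_table(self, name, rows):
        self.temp_tables[name] = rows


shared = SharedContext()
alice = Session(shared)
bob = Session(shared)

# Changes made in one session are invisible to the other.
alice.set_conf("spark.sql.shuffle.partitions", "10")
alice.register_temp_table("events", [("click", 1), ("view", 2)])

print("events" in alice.temp_tables, "events" in bob.temp_tables)
```

The point of the sketch is only that two users connected to the same server
do not overwrite each other's settings or temporary tables.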

Most of this work has been merged already in this pull request:
https://github.com/apache/spark/pull/8909
