[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-15687:
--------------------------------
    Description: 
This ticket tracks progress in making the entire engine columnar, especially in 
the context of nested data type support.

In Spark 2.0, we have used the internal column batch interface in Parquet 
reading (via a vectorized Parquet decoder) and low cardinality aggregation. 
Other parts of the engine are already using whole-stage code generation, which 
is in many ways more efficient than a columnar execution engine for flat data 
types.

The goal here is to figure out a story to work towards making column batch the 
common data exchange format between operators outside whole-stage code 
generation, as well as with external systems (e.g. Pandas).
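The exchange-format idea above can be sketched in plain Python. This is an illustrative model only, not a Spark API: `ColumnBatch`, `rows_to_batch`, and `batch_to_rows` are hypothetical names showing how operators would pass one array per column instead of one tuple per row.

```python
# Hypothetical sketch of a column batch: operators exchange one flat
# array per column rather than one tuple per record.
from dataclasses import dataclass
from typing import List

@dataclass
class ColumnBatch:
    names: List[str]
    columns: List[list]  # one flat array per column, all the same length

    def num_rows(self) -> int:
        return len(self.columns[0]) if self.columns else 0

def rows_to_batch(names, rows):
    # Pivot row-oriented tuples into per-column arrays.
    return ColumnBatch(names, [list(col) for col in zip(*rows)])

def batch_to_rows(batch):
    # Pivot back to rows at the boundary with a row-oriented operator.
    return list(zip(*batch.columns))

batch = rows_to_batch(["id", "score"], [(1, 0.5), (2, 0.9), (3, 0.1)])
# batch.columns == [[1, 2, 3], [0.5, 0.9, 0.1]]
```

A columnar layout like this is also what makes zero-copy handoff to external systems such as Pandas plausible, since each column is already a contiguous array.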

Some of the important questions to answer are:

From the architectural perspective:
- What is the end state architecture?
- Should aggregation be columnar?
- Should sorting be columnar?
- How do we encode nested data? What are the operations on nested data, and how 
do we handle these operations in a columnar format?
- What is the transition plan towards the end state?

From an external API perspective:
- Can we expose a more efficient column batch user-defined function API?
- How do we leverage this to integrate with 3rd party tools?
- Can we have a spec for a fixed version of the column batch format that can be 
externalized and use that in data source API v2?
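The UDF question above can be illustrated with a toy comparison. This is a sketch under assumed names (`scalar_udf`, `columnar_udf`), not a proposed Spark signature: the point is that a column-batch UDF receives a whole column at once, amortizing per-row invocation overhead.

```python
# Row-at-a-time UDF: invoked once per value.
def scalar_udf(x):
    return x * 2

# Hypothetical column-batch UDF: invoked once per column array.
def columnar_udf(xs):
    return [x * 2 for x in xs]

column = [1, 2, 3, 4]
row_result = [scalar_udf(x) for x in column]  # N function calls
col_result = columnar_udf(column)             # 1 function call
assert row_result == col_result
```

In a real engine the per-column call would additionally operate on a fixed, externally specified batch encoding, which is what the data source API v2 question is getting at.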

  was:
This ticket tracks progress in making the entire engine columnar, especially in 
the context of nested data type support.

In Spark 2.0, we have used the internal column batch interface in Parquet 
reading (via a vectorized Parquet decoder) and low cardinality aggregation. 
Other parts of the engine are already using whole-stage code generation, which 
is in many ways more efficient than a columnar execution engine for flat data 
types.

The goal here is to figure out a story to work towards making column batch the 
common data exchange format between operators outside whole-stage code 
generation, as well as with external systems (e.g. Pandas).

Some of the important questions to answer are:

- What is the end state architecture?
- Should aggregation be columnar?
- Should sorting be columnar?
- How do we encode nested data? What are the operations on nested data, and how 
do we handle these operations in a columnar format?
- What is the transition plan towards the end state?



> Columnar execution engine
> -------------------------
>
>                 Key: SPARK-15687
>                 URL: https://issues.apache.org/jira/browse/SPARK-15687
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
