[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376999#comment-15376999 ]
Kazuaki Ishizaki edited comment on SPARK-15687 at 7/14/16 2:25 PM:
-------------------------------------------------------------------

It would be good to introduce a trait for {{Iterator[T]}}. According to my experiment in my [WIP|https://github.com/apache/spark/pull/13171/files#diff-28cb12941b992ff680c277c651b59aa0R445], the following external APIs are necessary:
{code:java}
def numColumns: Integer
def column(column: Integer): org.apache.spark.sql.execution.vectorized.ColumnVector
def numRows: Integer
def hasNextRow: Boolean
{code}
For the column structure, since {{ColumnVector}} is declared {{abstract}}, we can have different implementations; for example, I implemented {{ByteBufferColumnVector}} to wrap {{CachedBatch}}. Is there a good implementation of the new columnar storage that uses a trait/interface instead of an abstract class? I am interested in its advantages. A sketch of what such a trait could look like follows the issue description below.


> Columnar execution engine
> -------------------------
>
>                 Key: SPARK-15687
>                 URL: https://issues.apache.org/jira/browse/SPARK-15687
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially
> in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet
> reading (via a vectorized Parquet decoder) and in low-cardinality aggregation.
> Other parts of the engine already use whole-stage code generation, which is
> in many ways more efficient than a columnar execution engine for flat data
> types.
> The goal here is to work out a story for making the column batch the common
> data exchange format between operators outside whole-stage code generation,
> as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective:
> - What is the end-state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and
>   how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd-party tools?
> - Can we have a spec for a fixed version of the column batch format that can
>   be externalized and used in the data source API v2?
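For concreteness, here is a minimal sketch, in Scala, of what such a trait could look like, together with a consumer loop that stays independent of the concrete column implementation. Only {{ColumnVector}} and its {{isNullAt}}/{{getInt}} accessors come from Spark; the trait name {{ColumnarBatchIterator}}, the {{sumIntColumn}} helper, and the use of Scala's {{Int}} in place of {{Integer}} are hypothetical choices for illustration, not the actual API in the WIP:
{code:scala}
import org.apache.spark.sql.execution.vectorized.ColumnVector

object ColumnarTraitSketch {
  // Hypothetical trait mirroring the four external APIs listed above.
  trait ColumnarBatchIterator {
    def numColumns: Int                    // number of columns in the batch
    def column(ordinal: Int): ColumnVector // vector backing the given column
    def numRows: Int                       // number of rows in the batch
    def hasNextRow: Boolean                // whether another row remains
  }

  // A consumer written against the trait; it never sees the concrete
  // ColumnVector subclass, so a ByteBufferColumnVector over CachedBatch
  // or an off-heap vector could be swapped in without changing this code.
  def sumIntColumn(batch: ColumnarBatchIterator, ordinal: Int): Long = {
    val col = batch.column(ordinal)
    var sum = 0L
    var row = 0
    while (row < batch.numRows) {
      if (!col.isNullAt(row)) sum += col.getInt(row)
      row += 1
    }
    sum
  }
}
{code}
Because callers depend only on the trait and on the abstract {{ColumnVector}} accessors, the trait-vs-abstract-class question mostly comes down to mix-in flexibility on the implementation side rather than to any change in consumer code like the loop above.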