[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376999#comment-15376999 ]
Kazuaki Ishizaki edited comment on SPARK-15687 at 7/14/16 2:25 PM:
-------------------------------------------------------------------

It would be good to introduce a trait for {{Iterator[T]}}. According to my experiment in my [WIP|https://github.com/apache/spark/pull/13171/files#diff-28cb12941b992ff680c277c651b59aa0R445], the following external APIs are necessary:
{code:java}
def numColumns: Integer
def column(column: Integer): org.apache.spark.sql.execution.vectorized.ColumnVector
def numRows: Integer
def hasNextRow: Boolean
{code}
For the column structure, since {{ColumnVector}} is declared {{abstract}}, we can have different implementations; for example, I implemented {{ByteBufferColumnVector}} to wrap {{CachedBatch}}. Is there a good implementation of the new columnar storage that uses a trait/interface instead of an abstract class? I am interested in its advantages. A sketch of what such a trait could look like follows the issue description below.


> Columnar execution engine
> -------------------------
>
>                 Key: SPARK-15687
>                 URL: https://issues.apache.org/jira/browse/SPARK-15687
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially
> in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet
> reading (via a vectorized Parquet decoder) and in low-cardinality aggregation.
> Other parts of the engine already use whole-stage code generation, which is
> in many ways more efficient than a columnar execution engine for flat data
> types.
> The goal here is to work out a story for making the column batch the common
> data exchange format between operators outside whole-stage code generation,
> as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective:
> - What is the end-state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and
>   how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd-party tools?
> - Can we have a spec for a fixed version of the column batch format that can
>   be externalized and used in the data source API v2?
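For concreteness, here is a minimal sketch, in Scala, of what such a trait could look like, together with a consumer loop that stays independent of the concrete column implementation. Only {{ColumnVector}} and its {{isNullAt}}/{{getInt}} accessors come from Spark; the trait name {{ColumnarBatchIterator}}, the {{sumIntColumn}} helper, and the use of Scala's {{Int}} in place of {{Integer}} are hypothetical choices for illustration, not the actual API in the WIP:
{code:scala}
import org.apache.spark.sql.execution.vectorized.ColumnVector

object ColumnarTraitSketch {
  // Hypothetical trait mirroring the four external APIs listed above.
  trait ColumnarBatchIterator {
    def numColumns: Int                    // number of columns in the batch
    def column(ordinal: Int): ColumnVector // vector backing the given column
    def numRows: Int                       // number of rows in the batch
    def hasNextRow: Boolean                // whether another row remains
  }

  // A consumer written against the trait; it never sees the concrete
  // ColumnVector subclass, so a ByteBufferColumnVector over CachedBatch
  // or an off-heap vector could be swapped in without changing this code.
  def sumIntColumn(batch: ColumnarBatchIterator, ordinal: Int): Long = {
    val col = batch.column(ordinal)
    var sum = 0L
    var row = 0
    while (row < batch.numRows) {
      if (!col.isNullAt(row)) sum += col.getInt(row)
      row += 1
    }
    sum
  }
}
{code}
Because callers depend only on the trait and on the abstract {{ColumnVector}} accessors, the trait-vs-abstract-class question mostly comes down to mix-in flexibility on the implementation side rather than to any change in consumer code like the loop above.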