[ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310663#comment-15310663 ]
Kazuaki Ishizaki commented on SPARK-15687:
------------------------------------------

Thank you for creating this interesting JIRA entry. Based on my experiments (SPARK-13805, SPARK-14098, SPARK-15117, and SPARK-15380) with enabling columnar storage in whole-stage codegen, I have some questions, mostly from an implementation perspective:
* How do we pass the columnar format among operators? Currently, we use {{Iterator(Row)}} to pass data between operators.
* Who decides which data format (columnar or row-oriented) to use? The logical planner, the physical planner, or another component?
* Will we use the Apache Arrow format as an internal format?
* We have two internal columnar formats: {{ColumnarBatch}} and {{CachedBatch}}. Will we integrate these two into one?

> Columnar execution engine
> -------------------------
>
>                 Key: SPARK-15687
>                 URL: https://issues.apache.org/jira/browse/SPARK-15687
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support.
>
> In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and in low-cardinality aggregation. Other parts of the engine already use whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types.
>
> The goal here is to figure out a story to work towards making the column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas).
>
> Some of the important questions to answer are:
>
> From the architectural perspective:
> - What is the end-state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
>
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd-party tools?
> - Can we have a spec for a fixed version of the column batch format that can be externalized, and use that in Data Source API v2?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
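To make the first question above concrete (how operators would exchange columnar data instead of {{Iterator(Row)}}), here is a toy sketch of the two exchange styles. All class names here are invented for illustration; they are not Spark's actual internal classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

public class ColumnarExchangeSketch {
    // Row-oriented exchange: each operator consumes and produces an
    // Iterator of rows, one virtual call per tuple.
    record Row(Object[] values) {}

    static Iterator<Row> rowFilter(Iterator<Row> in, Predicate<Row> p) {
        List<Row> out = new ArrayList<>();
        while (in.hasNext()) {
            Row r = in.next();
            if (p.test(r)) out.add(r);
        }
        return out.iterator();
    }

    // Columnar exchange: a batch holds one primitive array per column, and
    // operators pass an Iterator of batches. Per-tuple overhead is amortized
    // across the batch, and the inner loop runs over a single flat array.
    record Batch(int[][] columns, int numRows) {}

    static Batch addOne(Batch b, int col) {
        int[] out = new int[b.numRows()];
        for (int i = 0; i < b.numRows(); i++) {
            out[i] = b.columns()[col][i] + 1;   // tight loop over one column
        }
        int[][] cols = b.columns().clone();      // shallow copy; swap one column
        cols[col] = out;
        return new Batch(cols, b.numRows());
    }

    public static void main(String[] args) {
        Batch in = new Batch(new int[][] {{1, 2, 3}, {10, 20, 30}}, 3);
        Batch out = addOne(in, 0);
        System.out.println(Arrays.toString(out.columns()[0]));  // prints [2, 3, 4]
    }
}
```

In the columnar style, the format question from the comment becomes: what is the concrete layout of {{Batch}} (Arrow buffers, {{ColumnarBatch}} vectors, or something else), and which planner stage decides when an operator pair exchanges batches rather than rows.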
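On the external-API question ("a more efficient column batch user-defined function API"), one possible shape is a UDF invoked once per column batch instead of once per row. The interface and names below are invented for illustration and are not a proposed Spark API.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

public class BatchUdfSketch {
    // Row-at-a-time UDF application: one function call per value.
    static int[] applyScalar(int[] column, IntUnaryOperator f) {
        int[] out = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = f.applyAsInt(column[i]);
        }
        return out;
    }

    // Batch UDF: one call per column batch. The UDF body is a plain loop
    // over a primitive array, which the JIT can unroll or vectorize, and
    // per-call overhead (boxing, dispatch) is paid once per batch.
    interface IntBatchUdf {
        int[] apply(int[] column);
    }

    public static void main(String[] args) {
        IntBatchUdf times2 = column -> {
            int[] out = new int[column.length];
            for (int i = 0; i < column.length; i++) out[i] = column[i] * 2;
            return out;
        };
        System.out.println(Arrays.toString(times2.apply(new int[] {1, 2, 3})));
        // prints [2, 4, 6]
    }
}
```

A fixed, externalized spec for the batch layout (the last question above) is what would let such a UDF be implemented outside the JVM, e.g. in Pandas, against the same buffers.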