[ https://issues.apache.org/jira/browse/SPARK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162147#comment-17162147 ]
Robert Joseph Evans commented on SPARK-32334:
---------------------------------------------

I think I can get the conversation started here. {{SparkPlan}} supports a few APIs for columnar processing right now:
 * {{supportsColumnar}} returns true if {{executeColumnar}} should be called to process columnar data.
 * {{vectorTypes}} an optional set of class names for the columnar output of this stage, which is a performance improvement for the code-generation phase of converting the data to rows.
 * {{executeColumnar}} the main entry point to columnar execution.
 * {{doExecuteColumnar}} what users are expected to implement if {{supportsColumnar}} returns true.

When {{supportsColumnar}} returns true, it is assumed that both the input and the output of the stage will be columnar data. With this information {{ApplyColumnarRulesAndInsertTransitions}} will insert {{RowToColumnarExec}} and {{ColumnarToRowExec}} transitions where the formats do not line up. {{ColumnarToRowExec}} is by far the more optimized of the two because it is widely used today.

One of the goals of this issue is to try and make something like {{ArrowEvalPythonExec}} be columnar. If we just made {{supportsColumnar}} return true for it, the incoming data layout would be columnar, but it most likely would not be Arrow formatted, so it would still require some kind of transition from one columnar format to an Arrow-based format. There is also no guarantee that the size of the batch will correspond to what this operator wants: {{RowToColumnarExec}} goes off of the {{spark.sql.inMemoryColumnarStorage.batchSize}} config, but {{ArrowEvalPythonExec}} uses {{spark.sql.execution.arrow.maxRecordsPerBatch}}. To get around both of these issues I would propose that we let {{SparkPlan}} optionally ask for both a specific type of columnar input and a specific target batch size. We might also want a better way for a stage to declare what type of output it is going to produce, so we can optimize away transitions that are not needed.
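To make the transition-insertion behavior concrete, here is a toy Scala model of what {{ApplyColumnarRulesAndInsertTransitions}} does conceptually. None of these classes are Spark's real ones ({{Plan}}, {{Node}}, etc. are hypothetical stand-ins); the point is just the rule: wherever a child's output format differs from what its parent consumes, a transition node is inserted.

{code:scala}
// Toy model, NOT Spark's actual planner classes.
sealed trait Plan {
  def children: Seq[Plan]
  def supportsColumnar: Boolean // true => this node consumes and produces columnar batches
}

case class Node(name: String, supportsColumnar: Boolean, children: Plan*) extends Plan

case class RowToColumnar(child: Plan) extends Plan {
  val children = Seq(child)
  val supportsColumnar = true // produces columns from a row-based child
}

case class ColumnarToRow(child: Plan) extends Plan {
  val children = Seq(child)
  val supportsColumnar = false // produces rows from a columnar child
}

def insertTransitions(plan: Plan): Plan = plan match {
  case Node(name, wantsColumnar, kids @ _*) =>
    val fixedKids = kids.map { k =>
      val fixed = insertTransitions(k)
      (wantsColumnar, fixed.supportsColumnar) match {
        case (true, false) => RowToColumnar(fixed) // parent wants columns, child emits rows
        case (false, true) => ColumnarToRow(fixed) // parent wants rows, child emits columns
        case _             => fixed                // formats already match
      }
    }
    Node(name, wantsColumnar, fixedKids: _*)
  case other => other
}

// A row-based project over a columnar scan gets a ColumnarToRow inserted:
val planned = insertTransitions(Node("Project", false, Node("Scan", true)))
// planned == Node("Project", false, ColumnarToRow(Node("Scan", true)))
{code}

The proposal in the comment would essentially enrich the match on formats: instead of a boolean, each node would advertise the columnar format it produces and optionally request a format and target batch size for its input, so a transition (or re-batching step) is only inserted when those actually disagree.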
> Investigate commonizing Columnar and Row data transformations
> --------------------------------------------------------------
>
>                 Key: SPARK-32334
>                 URL: https://issues.apache.org/jira/browse/SPARK-32334
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> We introduced more Columnar Support with SPARK-27396.
> With that we recognized that there is code that is doing very similar
> transformations from ColumnarBatch or Arrow into InternalRow and vice versa.
> For instance:
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L56-L58]
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L389]
> We should investigate if we can commonize that code.
> We are also looking at making the internal caching serialization pluggable to
> allow for different cache implementations
> ([https://github.com/apache/spark/pull/29067]).
> It was recently brought up that we should investigate if using the data
> source v2 API makes sense and is feasible for some of these transformations,
> to allow them to be easily extended.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
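The duplication the description points at is the column-to-row walk that several call sites each re-implement. A rough sketch of the kind of shared abstraction being asked about, in toy Scala (names like {{ColumnAccessor}} and {{batchToRows}} are hypothetical, not Spark's API):

{code:scala}
// Toy sketch, NOT Spark's actual classes. One shared column->row loop that
// both a cache-serialization path and an Arrow-style path could reuse,
// instead of duplicating the per-row, per-column access pattern.
trait ColumnAccessor {
  def numRows: Int
  def getInt(row: Int): Int
}

case class IntColumn(values: Array[Int]) extends ColumnAccessor {
  def numRows: Int = values.length
  def getInt(row: Int): Int = values(row)
}

// The common piece: pivot a batch of columns into row-wise values once.
def batchToRows(columns: Seq[ColumnAccessor]): Iterator[Seq[Int]] = {
  val n = columns.headOption.map(_.numRows).getOrElse(0)
  Iterator.tabulate(n)(r => columns.map(_.getInt(r)))
}

val rows = batchToRows(Seq(IntColumn(Array(1, 2)), IntColumn(Array(10, 20))))
// rows.toList == List(Seq(1, 10), Seq(2, 20))
{code}

Spark's real {{ColumnVector}} / {{ColumnarBatch}} interfaces are richer (nulls, all data types, off-heap backing), but commonizing at an interface like this is what would let the cache serializer and the Arrow conversion share one implementation.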