[ https://issues.apache.org/jira/browse/SPARK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162147#comment-17162147 ]

Robert Joseph Evans commented on SPARK-32334:
---------------------------------------------

I think I can get the conversation started here.

{{SparkPlan}} supports a few APIs for columnar processing right now (sketched 
in the code below):
* {{supportsColumnar}}, which returns true if {{executeColumnar}} should be 
called to process columnar data.
* {{vectorTypes}}, an optional set of class names for the columnar output of 
this stage, used as a performance improvement for the code-generation phase of 
converting the data to rows.
* {{executeColumnar}}, the main entry point to columnar execution.
* {{doExecuteColumnar}}, which is what users are expected to implement if 
{{supportsColumnar}} returns true.
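
As a reference point, here is a minimal sketch of those hooks. It is simplified 
from the actual {{SparkPlan}} source (the real {{executeColumnar}} also wraps 
the call in {{executeQuery}} and does extra bookkeeping), but the names and 
signatures match what exists today:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.vectorized.ColumnarBatch

// Simplified sketch of the existing columnar hooks on SparkPlan.
abstract class SparkPlanColumnarHooks {
  // Opt in to columnar processing; when true, executeColumnar() is used
  // instead of execute().
  def supportsColumnar: Boolean = false

  // Optional concrete ColumnVector class names for this plan's output,
  // which lets ColumnarToRowExec generate tighter code.
  def vectorTypes: Option[Seq[String]] = None

  // Public entry point; the real SparkPlan wraps this in executeQuery { ... }.
  final def executeColumnar(): RDD[ColumnarBatch] = doExecuteColumnar()

  // What an operator implements when supportsColumnar returns true.
  protected def doExecuteColumnar(): RDD[ColumnarBatch] =
    throw new IllegalStateException(s"$getClass does not override doExecuteColumnar")
}
{code}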

When {{supportsColumnar}} returns true it is assumed that both the input and 
the output of the stage will be columnar data. With this information 
{{ApplyColumnarRulesAndInsertTransitions}} inserts {{RowToColumnarExec}} and 
{{ColumnarToRowExec}} transitions where the row/columnar formats do not line 
up. {{ColumnarToRowExec}} is by far the more optimized of the two because it is 
widely used today.
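
The real rule is more involved, but the core idea of where transitions land can 
be sketched like this (assuming, as the rule does, that {{supportsColumnar}} 
describes both a plan's input and output):

{code:scala}
import org.apache.spark.sql.execution.{ColumnarToRowExec, RowToColumnarExec, SparkPlan}

// Minimal sketch of the idea behind ApplyColumnarRulesAndInsertTransitions,
// not the actual implementation.
def insertTransitions(plan: SparkPlan): SparkPlan = {
  val newChildren = plan.children.map { c =>
    val child = insertTransitions(c)
    (plan.supportsColumnar, child.supportsColumnar) match {
      case (false, true) => ColumnarToRowExec(child)  // columnar child, row-based parent
      case (true, false) => RowToColumnarExec(child)  // row-based child, columnar parent
      case _             => child                     // formats already agree
    }
  }
  plan.withNewChildren(newChildren)
}
{code}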

One of the goals of this issue is to try and make something like 
{{ArrowEvalPythonExec}} be columnar.  If we just made {{supportsColumnar}} 
return true for it, the incoming data layout would be columnar, but it most 
likely would not be Arrow formatted, so it would still require some kind of 
transition from one columnar format to an Arrow-based format.  There is also no 
guarantee that the size of the incoming batches will correspond to what this 
operator wants: {{RowToColumnarExec}} goes off of the 
{{spark.sql.inMemoryColumnarStorage.batchSize}} config, but 
{{ArrowEvalPythonExec}} uses {{spark.sql.execution.arrow.maxRecordsPerBatch}}.
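
Just to make the mismatch concrete, the two batch sizes are controlled by 
unrelated configs today, so nothing ties the batches {{RowToColumnarExec}} 
produces to what {{ArrowEvalPythonExec}} wants (illustration only):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Rows per batch produced by RowToColumnarExec.
val rowToColumnarBatchSize = spark.conf.get("spark.sql.inMemoryColumnarStorage.batchSize")
// Rows per Arrow batch expected by ArrowEvalPythonExec.
val arrowBatchSize = spark.conf.get("spark.sql.execution.arrow.maxRecordsPerBatch")

println(s"RowToColumnarExec batches: $rowToColumnarBatchSize rows, " +
  s"Arrow batches: $arrowBatchSize rows")
{code}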

To get around both of these issues I would propose that we let {{SparkPlan}} 
optionally ask for both a specific type of columnar input and a specific target 
batch size. We might also want a better way for a plan to declare what type of 
output it is going to produce, so we can optimize away transitions that are not 
needed. A rough sketch of what that could look like is below.
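
This is purely hypothetical; none of these names exist in Spark today, they are 
just meant to show the shape of the proposal. A plan would advertise the 
columnar format it consumes/produces and a preferred batch size, so the 
transition rule could insert format conversions and re-batching only where they 
are actually needed:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.vectorized.ColumnarBatch

// Hypothetical columnar formats a plan could declare (names are illustrative).
sealed trait ColumnarFormat
case object DefaultColumnarFormat extends ColumnarFormat // e.g. On/OffHeapColumnVector batches
case object ArrowColumnarFormat extends ColumnarFormat   // ArrowColumnVector-backed batches

// Hypothetical additions a columnar SparkPlan could mix in.
trait ColumnarSupport {
  // Format this operator requires for its input batches, if any.
  def requiredInputFormat: Option[ColumnarFormat] = None

  // Format of the batches this operator produces, if it is columnar.
  def outputFormat: Option[ColumnarFormat] = None

  // Preferred number of rows per input batch,
  // e.g. arrow.maxRecordsPerBatch for Python UDF operators.
  def preferredBatchRows: Option[Int] = None

  def doExecuteColumnar(): RDD[ColumnarBatch]
}
{code}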



> Investigate commonizing Columnar and Row data transformations 
> --------------------------------------------------------------
>
>                 Key: SPARK-32334
>                 URL: https://issues.apache.org/jira/browse/SPARK-32334
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> We introduced more Columnar Support with SPARK-27396.
> With that we recognized that there is code that is doing very similar 
> transformations from ColumnarBatch or Arrow into InternalRow and vice versa.  
> For instance: 
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L56-L58]
> [https://github.com/apache/spark/blob/a4382f7fe1c36a51c64f460c6cb91e93470e0825/sql/core/src/main/scala/org/apache/spark/sql/execution/Columnar.scala#L389]
> We should investigate if we can commonize that code.
> We are also looking at making the internal caching serialization pluggable to 
> allow for different cache implementations. 
> ([https://github.com/apache/spark/pull/29067]). 
> It was recently brought up that we should investigate if using the data 
> source v2 api makes sense and is feasible for some of these transformations 
> to allow it to be easily extended.


