[jira] [Updated] (SPARK-37124) Support RowToColumnarExec with Arrow

Chendi.Xue (Jira) Mon, 01 Nov 2021 20:23:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chendi.Xue updated SPARK-37124:
-------------------------------
    Description: 
This Jira is aim to support Arrow format in RowToColumnarExec 

Current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector 
in spark, so RowToColumnarExec doesn't support write to Arrow format so far.

since Arrow API is now being more stable, and using pandas udf will perform 
much better than python udf.

I am  proposing to support RowToColumnarExec with Arrow.

What I did in this PR is to add a load api in ArrowColumnVector to load 
arrowRecordBatch to ArrowColumnVector, then called inside RowToColumnarExec 
doExecute.

 

UTs are also added to test this new API and RowToColumnarExec with ArrowFormat

  was:
This Jira is aim to add Arrow format as an alternative for ColumnVector 
solution.

Current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector 
in spark, and since Arrow API is now being more stable, and using pandas udf 
will perform much better than python udf.

I am  proposing to fully support arrow format as an alternative to ColumnVector 
just like the other two.

What I did in this PR is to create a new class in the same package with 
OnHeap/OffHeapColumnVector and extend from WritableColumnVector to support all 
put APIs.

UTs are covering all Data Format with testing on writing to columnVector and 
reading from columnVector. I also added 3 UTs for testing on loading from 
ArrowRecordBatch and allocateColumns .

        Summary: Support RowToColumnarExec with Arrow  (was: Support Writable 
ArrowColumnarVector)

> Support RowToColumnarExec with Arrow
> ------------------------------------
>
>                 Key: SPARK-37124
>                 URL: https://issues.apache.org/jira/browse/SPARK-37124
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chendi.Xue
>            Priority: Major
>
> This Jira is aim to support Arrow format in RowToColumnarExec 
> Current ArrowColumnVector is not fully equivalent to 
> OnHeap/OffHeapColumnVector in spark, so RowToColumnarExec doesn't support 
> write to Arrow format so far.
> since Arrow API is now being more stable, and using pandas udf will perform 
> much better than python udf.
> I am  proposing to support RowToColumnarExec with Arrow.
> What I did in this PR is to add a load api in ArrowColumnVector to load 
> arrowRecordBatch to ArrowColumnVector, then called inside RowToColumnarExec 
> doExecute.
>  
> UTs are also added to test this new API and RowToColumnarExec with ArrowFormat



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-37124) Support RowToColumnarExec with Arrow

Reply via email to