[ https://issues.apache.org/jira/browse/SPARK-37124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chendi.Xue updated SPARK-37124: ------------------------------- Description: This Jira is aim to support Arrow format in RowToColumnarExec Current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in spark, so RowToColumnarExec doesn't support write to Arrow format so far. since Arrow API is now being more stable, and using pandas udf will perform much better than python udf. I am proposing to support RowToColumnarExec with Arrow. What I did in this PR is to add a load api in ArrowColumnVector to load arrowRecordBatch to ArrowColumnVector, then called inside RowToColumnarExec doExecute. UTs are also added to test this new API and RowToColumnarExec with ArrowFormat was: This Jira is aim to add Arrow format as an alternative for ColumnVector solution. Current ArrowColumnVector is not fully equivalent to OnHeap/OffHeapColumnVector in spark, and since Arrow API is now being more stable, and using pandas udf will perform much better than python udf. I am proposing to fully support arrow format as an alternative to ColumnVector just like the other two. What I did in this PR is to create a new class in the same package with OnHeap/OffHeapColumnVector and extend from WritableColumnVector to support all put APIs. UTs are covering all Data Format with testing on writing to columnVector and reading from columnVector. I also added 3 UTs for testing on loading from ArrowRecordBatch and allocateColumns . Summary: Support RowToColumnarExec with Arrow (was: Support Writable ArrowColumnarVector) > Support RowToColumnarExec with Arrow > ------------------------------------ > > Key: SPARK-37124 > URL: https://issues.apache.org/jira/browse/SPARK-37124 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.2.0 > Reporter: Chendi.Xue > Priority: Major > > This Jira is aim to support Arrow format in RowToColumnarExec > Current ArrowColumnVector is not fully equivalent to > OnHeap/OffHeapColumnVector in spark, so RowToColumnarExec doesn't support > write to Arrow format so far. > since Arrow API is now being more stable, and using pandas udf will perform > much better than python udf. > I am proposing to support RowToColumnarExec with Arrow. > What I did in this PR is to add a load api in ArrowColumnVector to load > arrowRecordBatch to ArrowColumnVector, then called inside RowToColumnarExec > doExecute. > > UTs are also added to test this new API and RowToColumnarExec with ArrowFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org