[ 
https://issues.apache.org/jira/browse/SPARK-56350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-56350:
----------------------------------
        Parent: SPARK-54137
    Issue Type: Sub-task  (was: Improvement)

> Skip ColumnarToRow for Arrow-backed input to Python UDFs
> --------------------------------------------------------
>
>                 Key: SPARK-56350
>                 URL: https://issues.apache.org/jira/browse/SPARK-56350
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> When a Spark operator produces Arrow-backed ColumnarBatches (e.g., connectors 
> that read columnar formats like Parquet into Arrow vectors), the current 
> execution path for Arrow Python UDFs performs a wasteful columnar → row → 
> columnar round-trip: ColumnarToRowExec converts the columnar data to 
> InternalRow, then ArrowWriter converts each row back to Arrow columnar format 
> for IPC serialization to the Python worker.
> This round-trip is expensive due to per-row virtual dispatch, type 
> conversion, null checking, and poor cache locality in ArrowWriter.write(). 
> Since the data is already in Arrow format, the conversion is entirely 
> unnecessary.
> Arrow FieldVectors can be extracted directly from ArrowColumnVector and 
> serialized to IPC via VectorUnloader, skipping both ColumnarToRowExec and 
> ArrowWriter. Pass-through columns are kept as ColumnVector references 
> throughout, avoiding row materialization entirely in the fully columnar path.
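
The round-trip described above can be illustrated with a toy model. This is only a sketch: plain Python lists stand in for Arrow vectors, and the function names (columnar_to_rows, rows_to_columnar, round_trip, pass_through) are hypothetical stand-ins for the roles of ColumnarToRowExec, ArrowWriter, and the proposed direct path. It uses neither Spark nor Arrow APIs and is not the implementation in the pull request. It shows that both paths yield the same columnar data, while only the round-trip touches every value row by row and allocates new buffers.

```python
from typing import List, Optional, Sequence, Tuple

# A "column" here is a plain list of nullable ints (stand-in for an Arrow vector).
Column = List[Optional[int]]
Row = Tuple[Optional[int], ...]


def columnar_to_rows(columns: Sequence[Column]) -> List[Row]:
    """Stand-in for the columnar -> row step: build one tuple per row."""
    return list(zip(*columns))


def rows_to_columnar(rows: Sequence[Row], num_columns: int) -> List[Column]:
    """Stand-in for the row -> columnar step: append values field by field."""
    columns: List[Column] = [[] for _ in range(num_columns)]
    for row in rows:                      # per-row loop
        for i, value in enumerate(row):   # per-field work inside each row
            columns[i].append(value)
    return columns


def round_trip(columns: Sequence[Column]) -> List[Column]:
    """Current path in this toy model: columnar -> row -> columnar."""
    return rows_to_columnar(columnar_to_rows(columns), len(columns))


def pass_through(columns: Sequence[Column]) -> List[Column]:
    """Direct path in this toy model: reuse the existing column buffers."""
    return list(columns)


batch = [[1, 2, None], [10, 20, 30]]
# Same data either way...
assert round_trip(batch) == pass_through(batch)
# ...but only the direct path hands over the original buffers untouched.
assert pass_through(batch)[0] is batch[0]
assert round_trip(batch)[0] is not batch[0]
```

In this model the round-trip does work proportional to rows times columns and copies every value, whereas the direct path only copies references to the existing columns, which is the saving the description argues for.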


