[
https://issues.apache.org/jira/browse/SPARK-56350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
L. C. Hsieh reassigned SPARK-56350:
-----------------------------------
Assignee: L. C. Hsieh
> Skip ColumnarToRow for Arrow-backed input to Python UDFs
> --------------------------------------------------------
>
> Key: SPARK-56350
> URL: https://issues.apache.org/jira/browse/SPARK-56350
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: L. C. Hsieh
> Assignee: L. C. Hsieh
> Priority: Major
>
> When a Spark operator produces Arrow-backed ColumnarBatch (e.g., connectors
> that read columnar formats like Parquet into Arrow vectors), the current
> execution path for Arrow Python UDFs performs a wasteful columnar → row →
> columnar round-trip: ColumnarToRowExec converts the columnar data to
> InternalRow, then ArrowWriter converts each row back to Arrow columnar format
> for IPC serialization to the Python worker.
> This round-trip is expensive due to per-row virtual dispatch, type
> conversion, null checking, and poor cache locality in ArrowWriter.write().
> Since the data is already in Arrow format, the conversion is entirely
> unnecessary.
> Arrow FieldVectors can be extracted directly from ArrowColumnVector and
> serialized to IPC via VectorUnloader — skipping both ColumnarToRowExec and
> ArrowWriter. Pass-through columns are kept as ColumnVector references
> throughout, avoiding row materialization entirely in the fully columnar path.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]