Yicong Huang created SPARK-54639:
------------------------------------
Summary: Optimize Arrow serializers by avoiding unnecessary Table creation
Key: SPARK-54639
URL: https://issues.apache.org/jira/browse/SPARK-54639
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
Several serializers in pyspark.sql.pandas.serializers unnecessarily create
pa.Table objects when processing single RecordBatch instances. When converting
Arrow RecordBatches to pandas Series, the code wraps each batch in a pa.Table
just to iterate over its columns, which adds unnecessary object creation and
function-call overhead and increases GC pressure.
The issue appears in multiple serializers:
{code:python}
# ArrowStreamPandasSerializer.load_stream()
# ArrowStreamAggPandasUDFSerializer.load_stream()
# GroupPandasUDFSerializer.load_stream()
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(c, i)
        for i, c in enumerate(pa.Table.from_batches([batch]).itercolumns())
    ]
{code}
We can optimize this by accessing columns directly on the RecordBatch instead:
{code:python}
for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(batch.column(i), i)
        for i in range(batch.num_columns)
    ]
{code}
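For illustration, here is a minimal micro-benchmark sketch of the two paths. The batch shape and iteration count are arbitrary choices, not from any Spark benchmark. Note one semantic difference: Table.itercolumns() yields ChunkedArray objects while RecordBatch.column() yields Array objects; both support to_pandas(), so arrow_to_pandas should accept either.
{code:python}
import time

import pyarrow as pa

# Illustrative batch: 10 int64 columns x 1,000 rows (arbitrary shape).
batch = pa.RecordBatch.from_pydict({f"c{i}": list(range(1000)) for i in range(10)})

# Current path: wrap the batch in a Table, then iterate ChunkedArrays.
start = time.perf_counter()
for _ in range(10_000):
    cols = list(pa.Table.from_batches([batch]).itercolumns())
table_time = time.perf_counter() - start

# Proposed path: access Arrays directly on the RecordBatch.
start = time.perf_counter()
for _ in range(10_000):
    cols = [batch.column(i) for i in range(batch.num_columns)]
direct_time = time.perf_counter() - start

print(f"via pa.Table: {table_time:.3f}s  direct: {direct_time:.3f}s")
{code}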