Yicong Huang created SPARK-54639:
------------------------------------

             Summary: Optimize Arrow serializers by avoiding unnecessary Table 
creation
                 Key: SPARK-54639
                 URL: https://issues.apache.org/jira/browse/SPARK-54639
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


Several serializers in pyspark.sql.pandas.serializers unnecessarily create 
pa.Table objects when processing single RecordBatch instances. When converting 
an Arrow RecordBatch to pandas Series, the code wraps each batch in a pa.Table 
just to iterate over its columns, which adds needless object creation and 
function-call overhead and increases GC pressure.
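The cost can be observed in isolation. Below is a rough micro-benchmark sketch 
(not Spark code; it assumes only pyarrow and pandas are installed, substitutes 
a plain to_pandas() call for self.arrow_to_pandas, and its absolute numbers 
will vary by machine and batch shape). It times the two column-access patterns 
shown further down:

{code:python}
import timeit

import pyarrow as pa

# Many narrow batches with many columns is where the per-batch
# wrapper cost should be most visible.
batch = pa.RecordBatch.from_pydict({"c%d" % i: [0.0] * 10 for i in range(20)})

def via_table():
    # Current pattern: build a single-batch pa.Table for every batch.
    return [c.to_pandas() for c in pa.Table.from_batches([batch]).itercolumns()]

def direct():
    # Proposed pattern: read the columns straight off the RecordBatch.
    return [batch.column(i).to_pandas() for i in range(batch.num_columns)]

print("via table:", timeit.timeit(via_table, number=10_000))
print("direct:   ", timeit.timeit(direct, number=10_000))
{code}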

The issue appears in multiple serializers:

{code:python}
# Pattern shared by ArrowStreamPandasSerializer.load_stream(),
# ArrowStreamAggPandasUDFSerializer.load_stream(), and
# GroupPandasUDFSerializer.load_stream():

for batch in batches:
    pandas_batches = [
        self.arrow_to_pandas(c, i)
        # Wraps the single RecordBatch in a pa.Table just to iterate
        # its columns, which come back as pa.ChunkedArray objects.
        for i, c in enumerate(pa.Table.from_batches([batch]).itercolumns())
    ]
{code}

We can optimize this by accessing the columns directly on the RecordBatch, 
which returns plain pa.Array objects and skips both the pa.Table and the 
per-column pa.ChunkedArray wrappers:

{code:python}
for batch in batches:
    pandas_batches = [
        # batch.column(i) returns the underlying pa.Array directly,
        # with no intermediate Table or ChunkedArray allocation.
        self.arrow_to_pandas(batch.column(i), i)
        for i in range(batch.num_columns)
    ]
{code}
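
For completeness, a minimal standalone sketch (again assuming only pyarrow and 
pandas are available) confirming that the two access patterns produce identical 
pandas Series, the only difference being the wrapper objects the Table route 
allocates:

{code:python}
import pyarrow as pa

batch = pa.RecordBatch.from_pydict(
    {"a": list(range(1000)), "b": [float(i) for i in range(1000)]}
)

# Old route: columns come back as pa.ChunkedArray via the Table.
via_table = [
    c.to_pandas() for c in pa.Table.from_batches([batch]).itercolumns()
]

# New route: columns come back as pa.Array, no Table involved.
direct = [batch.column(i).to_pandas() for i in range(batch.num_columns)]

# The resulting pandas Series are identical either way.
for s1, s2 in zip(via_table, direct):
    assert s1.equals(s2)
{code}

An equivalent form is for i, c in enumerate(batch.columns), which also skips 
the Table; batch.column(i) with batch.num_columns is used above to match the 
proposed change.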


