Yicong Huang created SPARK-55183:
------------------------------------

             Summary: Extract assign_cols_by_name logic into ArrowBatchTransformer
                 Key: SPARK-55183
                 URL: https://issues.apache.org/jira/browse/SPARK-55183
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


Problem:
ArrowStreamGroupUDFSerializer.dump_stream contains inline logic to reorder 
RecordBatch columns to match the expected schema order when 
`assign_cols_by_name=True`. This pattern mixes data transformation with 
serialization logic.

Current code (ArrowStreamGroupUDFSerializer.dump_stream):
```python
if self._assign_cols_by_name:
    batch_iter = (
        (
            pa.RecordBatch.from_arrays(
                [batch.column(field.name) for field in arrow_type],
                names=[field.name for field in arrow_type],
            ),
            arrow_type,
        )
        for batch, arrow_type in batch_iter
    )
```


Proposal:
Extract this column reordering transformation into ArrowBatchTransformer as a 
reusable pure function.


