Yicong Huang created SPARK-55183:
------------------------------------
Summary: Extract assign_cols_by_name logic into ArrowBatchTransformer
Key: SPARK-55183
URL: https://issues.apache.org/jira/browse/SPARK-55183
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
Problem:
ArrowStreamGroupUDFSerializer.dump_stream contains inline logic to reorder
RecordBatch columns to match the expected schema order when
`assign_cols_by_name=True`. This pattern mixes data transformation with
serialization logic.
Current code (ArrowStreamGroupUDFSerializer.dump_stream):
```python
if self._assign_cols_by_name:
    batch_iter = (
        (
            pa.RecordBatch.from_arrays(
                [batch.column(field.name) for field in arrow_type],
                names=[field.name for field in arrow_type],
            ),
            arrow_type,
        )
        for batch, arrow_type in batch_iter
    )
```
Proposal:
Extract this column reordering transformation into ArrowBatchTransformer as a
reusable pure function.