[ 
https://issues.apache.org/jira/browse/SPARK-55183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-55183:
-----------------------------------
    Labels: pull-request-available  (was: )

> Extract assign_cols_by_name transformer from ArrowStreamGroupUDFSerializer
> --------------------------------------------------------------------------
>
>                 Key: SPARK-55183
>                 URL: https://issues.apache.org/jira/browse/SPARK-55183
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Priority: Major
>              Labels: pull-request-available
>
> Problem:
> ArrowStreamGroupUDFSerializer.dump_stream contains inline logic to reorder 
> RecordBatch columns to match the expected schema order when 
> `assign_cols_by_name=True`. This pattern mixes data transformation with 
> serialization logic.
> Current code (ArrowStreamGroupUDFSerializer.dump_stream):
> ```python
> if self._assign_cols_by_name:
>     batch_iter = (
>         (
>             pa.RecordBatch.from_arrays(
>                 [batch.column(field.name) for field in arrow_type],
>                 names=[field.name for field in arrow_type],
>             ),
>             arrow_type,
>         )
>         for batch, arrow_type in batch_iter
>     )
> ```
> Proposal:
> Extract this column reordering transformation into ArrowBatchTransformer as a 
> reusable pure function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
