Yicong Huang created SPARK-55222:
------------------------------------
Summary: Unify _create_batch with transformer composition
Key: SPARK-55222
URL: https://issues.apache.org/jira/browse/SPARK-55222
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
There are three `_create_batch` overrides in serializers.py (line numbers in parentheses):
- `ArrowStreamPandasSerializer._create_batch` (L470) — Series only
- `ArrowStreamPandasUDFSerializer._create_batch` (L601) — Series + DataFrame
- `ArrowStreamPandasUDTFSerializer._create_batch` (L900) — DataFrame only
Goals:
1. Merge into a single implementation
2. Use transformer composition instead of inline logic
3. Standardize input format
Approach:
For struct outputs (DataFrame → StructArray), compose existing transformers:
{code:python}
ArrowBatchTransformer.wrap_struct(
    PandasBatchTransformer.to_arrow(df, schema, ...)
).column(0)
{code}
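To make the intended shape of the unified function concrete, here is a dependency-free sketch of the composition. It is only a toy model under stated assumptions: `Batch`, `to_arrow`, `wrap_struct`, and `create_batch` below are hypothetical stand-ins written for illustration (plain lists and dicts instead of pandas and Arrow objects), and they do not reflect the actual signatures of `PandasBatchTransformer.to_arrow` or `ArrowBatchTransformer.wrap_struct`.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Toy stand-ins, NOT the pandas or PyArrow types:
# a "series" is a list of values, a "frame" is a dict of equal-length columns.
Series = List[object]
Frame = Dict[str, List[object]]


@dataclass
class Batch:
    """Hypothetical columnar batch: parallel lists of names and columns."""
    names: List[str]
    columns: List[List[object]]

    def column(self, i: int) -> List[object]:
        return self.columns[i]


def to_arrow(frame: Frame) -> Batch:
    """Stand-in for PandasBatchTransformer.to_arrow: frame -> flat batch."""
    return Batch(list(frame.keys()), [list(col) for col in frame.values()])


def wrap_struct(batch: Batch) -> Batch:
    """Stand-in for ArrowBatchTransformer.wrap_struct: fold all columns
    into a single struct column (modeled as one dict per row)."""
    rows = [dict(zip(batch.names, values)) for values in zip(*batch.columns)]
    return Batch(["_0"], [rows])


def create_batch(inputs: List[Union[Series, Frame]]) -> Batch:
    """One code path for all three cases: series only, frames only, or mixed."""
    columns = []
    for data in inputs:
        if isinstance(data, dict):
            # Struct output: compose the two transformers, keep the struct column.
            columns.append(wrap_struct(to_arrow(data)).column(0))
        else:
            # Plain series: copied through as a single column.
            columns.append(list(data))
    return Batch([f"_{i}" for i in range(len(columns))], columns)
```

In this model, a call such as `create_batch([[1, 2], {"a": [10, 20], "b": ["x", "y"]}])` yields one plain column and one struct column from the same function. That is the property the three existing overrides would share after merging.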