[
https://issues.apache.org/jira/browse/SPARK-55194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-55194:
-----------------------------------
Labels: pull-request-available (was: )
> Remove GroupArrowUDFSerializer by moving flatten logic to mapper
> ----------------------------------------------------------------
>
> Key: SPARK-55194
> URL: https://issues.apache.org/jira/browse/SPARK-55194
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Yicong Huang
> Priority: Major
> Labels: pull-request-available
>
> {{GroupArrowUDFSerializer}} exists only to add a {{flatten_struct}} call in
> {{load_stream}}, inheriting everything else from
> {{ArrowStreamGroupUDFSerializer}}:
> {code:python}
> class GroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer):
>     def load_stream(self, stream):
>         for (batches,) in self._load_group_dataframes(stream, num_dfs=1):
>             batch_iter = map(ArrowBatchTransformer.flatten_struct, batches)
>             yield batch_iter
> {code}
> This creates an unnecessary inheritance layer. The flatten operation is a
> data transformation that belongs closer to where it's used (the mapper), not
> in the serializer.
> Proposal: Move {{flatten_struct}} to the mapper and delete
> {{GroupArrowUDFSerializer}}.
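
For illustration only, the proposal could look roughly like the sketch below. It is not the actual PySpark change. Plain dicts stand in for Arrow record batches, and {{GroupSerializer}}, {{wrap_with_flatten}}, and {{count_rows}} are hypothetical names invented for this example. The {{flatten_struct}} here is a simplified stand-in that assumes each batch has exactly one top-level struct column.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-ins: a "batch" is a dict of column name -> values,
# and the nested form wraps all columns in a single struct column.
FlatBatch = Dict[str, list]
NestedBatch = Dict[str, FlatBatch]


def flatten_struct(batch: NestedBatch) -> FlatBatch:
    """Simplified stand-in for ArrowBatchTransformer.flatten_struct."""
    (struct_col,) = batch.values()  # assumes exactly one top-level column
    return dict(struct_col)


class GroupSerializer:
    """Stand-in for the base group serializer: yields raw batches per group."""

    def load_stream(
        self, stream: Iterable[List[NestedBatch]]
    ) -> Iterator[Iterator[NestedBatch]]:
        for group in stream:
            yield iter(group)


def wrap_with_flatten(
    udf: Callable[[Iterator[FlatBatch]], int]
) -> Callable[[Iterator[NestedBatch]], int]:
    """Mapper-side wrapper: flatten each batch before calling the UDF."""

    def mapper(batches: Iterator[NestedBatch]) -> int:
        return udf(map(flatten_struct, batches))

    return mapper


def count_rows(batches: Iterator[FlatBatch]) -> int:
    """Example UDF that only understands flattened batches."""
    return sum(len(batch["id"]) for batch in batches)


# Two groups: the first has one batch, the second has two.
stream = [
    [{"_0": {"id": [1, 2], "v": [0.1, 0.2]}}],
    [{"_0": {"id": [3], "v": [0.3]}}, {"_0": {"id": [4, 5], "v": [0.4, 0.5]}}],
]

mapper = wrap_with_flatten(count_rows)
results = [mapper(batches) for batches in GroupSerializer().load_stream(stream)]
```

With the flatten step applied inside the mapper, the serializer needs no subclass that overrides {{load_stream}}, which is the simplification the issue asks for.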
--
This message was sent by Atlassian Jira
(v8.20.10#820010)