Yicong Huang created SPARK-55194:
------------------------------------
Summary: Remove GroupArrowUDFSerializer by moving flatten logic to mapper
Key: SPARK-55194
URL: https://issues.apache.org/jira/browse/SPARK-55194
Project: Spark
Issue Type: Sub-task
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
{{GroupArrowUDFSerializer}} exists only to add a {{flatten_struct}} call in
{{load_stream}}, inheriting everything else from
{{ArrowStreamGroupUDFSerializer}}:
{code:python}
class GroupArrowUDFSerializer(ArrowStreamGroupUDFSerializer):
    def load_stream(self, stream):
        for (batches,) in self._load_group_dataframes(stream, num_dfs=1):
            batch_iter = map(ArrowBatchTransformer.flatten_struct, batches)
            yield batch_iter
{code}
This creates an unnecessary inheritance layer. The flatten operation is a data
transformation that belongs closer to where it's used (the mapper), not in the
serializer.
Proposal: Move {{flatten_struct}} to the mapper and delete
{{GroupArrowUDFSerializer}}.
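
A rough sketch of the proposed shape, not the actual patch. It uses plain dicts in place of Arrow record batches so that it runs without PySpark or PyArrow. The names {{Batch}}, {{load_groups}} and {{make_mapper}} are hypothetical, and the local {{flatten_struct}} is only a stand-in for {{ArrowBatchTransformer.flatten_struct}}. It assumes the base serializer can hand each group's raw batches to the mapper unchanged.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Hypothetical stand-in for an Arrow record batch: column name -> list of values.
Batch = Dict[str, List[Any]]


def flatten_struct(batch: Batch) -> Batch:
    """Stand-in for ArrowBatchTransformer.flatten_struct.

    Assumes the batch holds exactly one struct column (a list of row dicts)
    and unwraps it into top-level columns.
    """
    (struct_rows,) = batch.values()
    if not struct_rows:
        return {}
    return {name: [row[name] for row in struct_rows] for name in struct_rows[0]}


def load_groups(stream: Iterable[List[Batch]]) -> Iterator[Iterator[Batch]]:
    """Stand-in for the base serializer: yields each group's batches unflattened."""
    for batches in stream:
        yield iter(batches)


def make_mapper(udf: Callable[[Iterator[Batch]], Any]) -> Callable[[Iterator[Batch]], Any]:
    """Build a mapper that flattens each batch right before the UDF consumes it."""

    def mapper(batches: Iterator[Batch]) -> Any:
        # The flatten step now lives here instead of in a serializer subclass.
        return udf(map(flatten_struct, batches))

    return mapper


def sum_v(batches: Iterator[Batch]) -> int:
    """Toy grouped UDF: sum column "v" across all batches of one group."""
    return sum(sum(batch["v"]) for batch in batches)


# Two groups, each with one batch wrapping its rows in a single struct column.
stream = [
    [{"_0": [{"id": 1, "v": 10}, {"id": 1, "v": 20}]}],
    [{"_0": [{"id": 2, "v": 5}]}],
]
mapper = make_mapper(sum_v)
results = [mapper(group) for group in load_groups(stream)]
```

With this arrangement the serializer stays generic, and only the mapper that needs flattened columns pays for the transformation.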