Yicong Huang created SPARK-54657:
------------------------------------
Summary: Refactor pyspark.sql.pandas.serializers for improved
maintainability
Key: SPARK-54657
URL: https://issues.apache.org/jira/browse/SPARK-54657
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Yicong Huang
The {{serializers.py}} file has grown to ~2200 lines with 25+ serializer
classes. Many share duplicated patterns that could be consolidated.
The main issues:
1. *Duplicated load_stream patterns* - The "dataframes_in_group" reading loop
is repeated in 6+ classes:
{code:python}
# This pattern appears in GroupArrowUDFSerializer, ArrowStreamAggArrowUDFSerializer,
# ArrowStreamAggPandasUDFSerializer, GroupPandasUDFSerializer,
# CogroupArrowUDFSerializer, etc.
dataframes_in_group = None
while dataframes_in_group is None or dataframes_in_group > 0:
    dataframes_in_group = read_int(stream)
    if dataframes_in_group == 1:
        # process batches...
    elif dataframes_in_group != 0:
        raise PySparkValueError(...)
{code}
2. *Duplicated dump_stream patterns* - The START_ARROW_STREAM writing appears
in 4+ classes:
{code:python}
# Repeated in ArrowStreamUDFSerializer, ArrowStreamPandasUDFSerializer,
# ArrowStreamArrowUDFSerializer, ApplyInPandasWithStateSerializer, etc.
should_write_start_length = True
for batch in iterator:
    if should_write_start_length:
        write_int(SpecialLengths.START_ARROW_STREAM, stream)
        should_write_start_length = False
    yield batch
{code}
3. *Cogroup and single group handling are separate* -
{{GroupArrowUDFSerializer}} and {{CogroupArrowUDFSerializer}} have nearly
identical logic except one reads 1 dataframe per group, the other reads 2.
4. *File is too large* to navigate easily.
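To make items 1 and 3 concrete, here is a minimal sketch of how the group-count loop might be written once and parameterized by the expected number of dataframes per group (1 for grouped, 2 for cogrouped). The names {{iter_groups}}, {{expected}}, and {{read_one}} are hypothetical and not part of the current code; {{read_int}} is a local stand-in for the helper in {{pyspark.serializers}}, and plain integers stand in for Arrow batches so the example runs without Spark.

```python
import io
import struct
from typing import BinaryIO, Callable, Iterator, List


def read_int(stream: BinaryIO) -> int:
    # Local stand-in for pyspark.serializers.read_int: one big-endian int32.
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError
    return struct.unpack("!i", data)[0]


def iter_groups(
    stream: BinaryIO,
    expected: int,
    read_one: Callable[[BinaryIO], object],
) -> Iterator[List[object]]:
    """Hypothetical shared loop: yield `expected` payloads per group.

    A count of 0 ends the stream; any other count than `expected` is an error.
    """
    while True:
        count = read_int(stream)
        if count == 0:
            return
        if count != expected:
            raise ValueError(
                f"expected {expected} dataframes in group, got {count}"
            )
        yield [read_one(stream) for _ in range(expected)]


# Grouped case: each group carries 1 payload (here an int instead of batches).
grouped = io.BytesIO(struct.pack("!5i", 1, 10, 1, 20, 0))
grouped_result = list(iter_groups(grouped, 1, read_int))

# Cogrouped case: each group carries 2 payloads.
cogrouped = io.BytesIO(struct.pack("!4i", 2, 1, 2, 0))
cogrouped_result = list(iter_groups(cogrouped, 2, read_int))
```

With a helper of this shape, the grouped and cogrouped serializers would differ only in the value passed for {{expected}} and in how each payload is decoded.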
Proposed refactoring:
- Extract common patterns into mixins ({{GroupedLoadStreamMixin}},
{{StartArrowStreamDumpMixin}})
- Unify cogroup/single group handling logic
- Split into submodules
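As a rough illustration of the first bullet, {{StartArrowStreamDumpMixin}} could look something like the sketch below. Only the mixin name comes from this ticket; the method name {{_with_start_marker}}, the {{DemoSerializer}} class, and the local {{write_int}} and {{START_ARROW_STREAM}} definitions are assumptions made so the example runs without Spark. It is one possible shape, not the proposed implementation.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator

# Stand-in for SpecialLengths.START_ARROW_STREAM; the value is assumed here.
START_ARROW_STREAM = -6


def write_int(value: int, stream: BinaryIO) -> None:
    # Local stand-in for pyspark.serializers.write_int: big-endian int32.
    stream.write(struct.pack("!i", value))


class StartArrowStreamDumpMixin:
    """Hypothetical mixin: write the start marker once, before the first batch."""

    def _with_start_marker(
        self, iterator: Iterable[bytes], stream: BinaryIO
    ) -> Iterator[bytes]:
        should_write_start_length = True
        for batch in iterator:
            if should_write_start_length:
                write_int(START_ARROW_STREAM, stream)
                should_write_start_length = False
            yield batch


class DemoSerializer(StartArrowStreamDumpMixin):
    """Toy serializer: batches are raw bytes instead of Arrow record batches."""

    def dump_stream(self, iterator: Iterable[bytes], stream: BinaryIO) -> None:
        for batch in self._with_start_marker(iterator, stream):
            stream.write(batch)


out = io.BytesIO()
DemoSerializer().dump_stream([b"a", b"b"], out)

empty_out = io.BytesIO()
DemoSerializer().dump_stream([], empty_out)
```

As in the existing pattern, nothing is written when the iterator is empty, because the marker is emitted lazily on the first batch.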