Yicong Huang created SPARK-54657:
------------------------------------

             Summary: Refactor pyspark.sql.pandas.serializers for improved 
maintainability
                 Key: SPARK-54657
                 URL: https://issues.apache.org/jira/browse/SPARK-54657
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.2.0
            Reporter: Yicong Huang


The {{serializers.py}} file has grown to ~2200 lines with 25+ serializer 
classes. Many share duplicated patterns that could be consolidated.

The main issues:

1. *Duplicated load_stream patterns* - The "dataframes_in_group" reading loop 
is repeated in 6+ classes:
{code:python}
# This pattern appears in GroupArrowUDFSerializer, 
ArrowStreamAggArrowUDFSerializer,
# ArrowStreamAggPandasUDFSerializer, GroupPandasUDFSerializer, 
CogroupArrowUDFSerializer, etc.
dataframes_in_group = None
while dataframes_in_group is None or dataframes_in_group > 0:
    dataframes_in_group = read_int(stream)
    if dataframes_in_group == 1:
        # process batches...
    elif dataframes_in_group != 0:
        raise PySparkValueError(...)
{code}

2. *Duplicated dump_stream patterns* - The START_ARROW_STREAM writing appears 
in 4+ classes:
{code:python}
# Repeated in ArrowStreamUDFSerializer, ArrowStreamPandasUDFSerializer, 
# ArrowStreamArrowUDFSerializer, ApplyInPandasWithStateSerializer, etc.
should_write_start_length = True
for batch in iterator:
    if should_write_start_length:
        write_int(SpecialLengths.START_ARROW_STREAM, stream)
        should_write_start_length = False
    yield batch
{code}

3. *Cogroup and single group handling are separate* - 
{{GroupArrowUDFSerializer}} and {{CogroupArrowUDFSerializer}} have nearly 
identical logic except one reads 1 dataframe per group, the other reads 2.

4. *File is too large* to navigate easily.

Proposed refactoring:
- Extract common patterns into mixins ({{GroupedLoadStreamMixin}}, 
{{StartArrowStreamDumpMixin}})
- Unify cogroup/single group handling logic
- Split into submodules



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to