[ 
https://issues.apache.org/jira/browse/SPARK-55349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-55349.
-----------------------------------
      Assignee: Yicong Huang
    Resolution: Done

Issue resolved by pull request 54125
https://github.com/apache/spark/pull/54125

> Consolidate pandas-to-Arrow conversion utilities in serializers
> ---------------------------------------------------------------
>
>                 Key: SPARK-55349
>                 URL: https://issues.apache.org/jira/browse/SPARK-55349
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.2.0
>            Reporter: Yicong Huang
>            Assignee: Yicong Huang
>            Priority: Major
>
> The pandas UDF serializers contain significant code duplication for 
> converting pandas data to Arrow format. Multiple `_create_batch` and 
> `_create_array` methods exist across different serializer classes with nearly 
> identical logic:
> {code:python}
> # ArrowStreamPandasSerializer
> def _create_batch(self, series):
>     arrs = []
>     for s, t in series:
>         ...  # conversion logic
>     return pa.RecordBatch.from_arrays(arrs, ...)
>
> # ArrowStreamPandasUDFSerializer
> def _create_batch(self, series):
>     ...  # similar conversion logic
>
> # ArrowStreamPandasUDTFSerializer
> def _create_array(self, series, spark_type):
>     ...  # conversion logic
> {code}
> This duplication makes the code harder to maintain and increases the risk of 
> inconsistent behavior.
> Proposal: Extract the common conversion logic into a dedicated 
> `PandasToArrowConversion` class in `conversion.py`:
> {code:python}
> class PandasToArrowConversion:
>     @classmethod
>     def dataframe_to_batch(cls, data, schema, ...) -> pa.RecordBatch: ...
>     
>     @classmethod
>     def series_to_array(cls, series, spark_type, ...) -> pa.Array: ...
> {code}
> This reduces code duplication and provides a single, well-tested conversion 
> path.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
