[PR] [PYTHON] Restore string-to-decimal type mismatch error in Arrow Python UDF on Pandas 3 [spark]

via GitHub Wed, 06 May 2026 00:14:46 -0700


zhengruifeng opened a new pull request, #55701:
URL: https://github.com/apache/spark/pull/55701


   ### What changes were proposed in this pull request?
   
   Coerce non-`object` string-dtype pandas series to `object` dtype inside 
`_create_converter_from_pandas` when the target Spark type is `DecimalType`. 
This restores the type-mismatch error that the legacy Arrow Python UDF path 
relies on, which Pandas 3's Arrow-backed string dtype silently bypasses.
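
The coercion described above can be sketched roughly as follows. This is a minimal illustration, not the actual PySpark code: the helper name `coerce_string_dtype_to_object` is hypothetical, and the real change lives inside `_create_converter_from_pandas` and applies only when the target Spark type is `DecimalType`.

```python
import pandas as pd


def coerce_string_dtype_to_object(series: pd.Series) -> pd.Series:
    """Hypothetical helper (an assumption, not PySpark's implementation).

    Returns the series with `object` dtype when it uses a dedicated string
    dtype, so that later Arrow conversion sees plain Python `str` values.
    """
    dtype = series.dtype
    # Leave `object` series alone; only dedicated string dtypes are coerced.
    if not pd.api.types.is_object_dtype(dtype) and pd.api.types.is_string_dtype(dtype):
        return series.astype(object)
    return series


# A string-dtype series is coerced; a numeric series passes through unchanged.
coerced = coerce_string_dtype_to_object(pd.Series(["1", "2"], dtype="string"))
untouched = coerce_string_dtype_to_object(pd.Series([1, 2]))
```

Restricting the coercion to the decimal target keeps other conversions on their existing code path.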
   
   ### Why are the changes needed?
   
   In Pandas 3 (or any Pandas with `future.infer_string=True`), 
`pd.Series(['1', '2'])` is backed by `ArrowStringArrayNumpySemantics`. 
`pa.Array.from_pandas(series, type=pa.decimal128(...))` then silently casts 
those strings to the decimal target, where Pandas 2's `object` series would 
have raised `ArrowTypeError`. The legacy `SQL_ARROW_BATCHED_UDF` path goes 
through `PandasToArrowConversion.convert(...)` and depends on that exception to 
surface a `PythonException` for invalid UDF return types. The CI failure 
surfaces in `ArrowPythonUDFLegacyTests::test_type_coercion_string_to_numeric` 
(and its connect parity sibling) as `AssertionError: PythonException not 
raised`.
   
   This was originally observed in [the Pandas-3 build for `master` at SHA 
ca4d88d](https://github.com/apache/spark/actions/runs/25402959034/job/74508177559).
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Behavior on Pandas 2 is unchanged. On Pandas 3, an Arrow Python UDF 
declaring a `DecimalType` return type and returning string values will now 
raise a `PythonException` (matching Pandas 2 behavior) instead of silently 
producing an Arrow-cast decimal column.
   
   ### How was this patch tested?
   
   Added `test_converter_from_pandas_decimal_string_dtype` in 
`python/pyspark/sql/tests/pandas/test_converter.py`.
   
   Verified end-to-end against `pandas==3.0.2`, `pyarrow==23.0.1`, Python 3.13:
   - `ArrowPythonUDFLegacyTests::test_type_coercion_string_to_numeric` 
reproduces the CI failure without the fix and passes with the fix.
   - `ArrowPythonUDFParityLegacyTests::test_type_coercion_string_to_numeric` 
(connect) passes with the fix.
   - Full `python/pyspark/sql/tests/arrow/test_arrow_python_udf.py` run: 278 
passed, 14 skipped. The only failures are pre-existing 
`test_udf_with_input_file_name*` cases, which are unrelated to this change.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (claude-opus-4-7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

