zhengruifeng opened a new pull request, #55701: URL: https://github.com/apache/spark/pull/55701
### What changes were proposed in this pull request?

Coerce non-`object` string-dtype pandas series to `object` dtype inside `_create_converter_from_pandas` when the target Spark type is `DecimalType`. This restores the type-mismatch error that the legacy Arrow Python UDF path relies on, which Pandas 3's Arrow-backed string dtype silently bypasses.

### Why are the changes needed?

In Pandas 3 (or any Pandas with `future.infer_string=True`), `pd.Series(['1', '2'])` is backed by `ArrowStringArrayNumpySemantics`. `pa.Array.from_pandas(series, type=pa.decimal128(...))` then silently casts those strings to the decimal target, where Pandas 2's `object` series would have raised `ArrowTypeError`. The legacy `SQL_ARROW_BATCHED_UDF` path goes through `PandasToArrowConversion.convert(...)` and depends on that exception to surface a `PythonException` for invalid UDF return types.

The CI failure surfaces in `ArrowPythonUDFLegacyTests::test_type_coercion_string_to_numeric` (and its connect parity sibling) as `AssertionError: PythonException not raised`. This was originally observed in [the Pandas-3 build for `master` at SHA ca4d88d](https://github.com/apache/spark/actions/runs/25402959034/job/74508177559).

### Does this PR introduce _any_ user-facing change?

No. Behavior on Pandas 2 is unchanged. On Pandas 3, an Arrow Python UDF declaring a `DecimalType` return type and returning string values will now raise a `PythonException` (matching Pandas 2 behavior) instead of silently producing an Arrow-cast decimal column.

### How was this patch tested?

Added `test_converter_from_pandas_decimal_string_dtype` in `python/pyspark/sql/tests/pandas/test_converter.py`.

Verified end-to-end against `pandas==3.0.2`, `pyarrow==23.0.1`, Python 3.13:

- `ArrowPythonUDFLegacyTests::test_type_coercion_string_to_numeric` reproduces the CI failure without the fix and passes with the fix.
- `ArrowPythonUDFParityLegacyTests::test_type_coercion_string_to_numeric` (connect) passes with the fix.
- Full `python/pyspark/sql/tests/arrow/test_arrow_python_udf.py` run: 278 passed, 14 skipped (the remaining failures are pre-existing `test_udf_with_input_file_name*` cases unrelated to this change).

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)
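
For reviewers who want to see the shape of the coercion described under "What changes were proposed", here is a minimal sketch. It is not the code in this PR: the helper name `coerce_string_dtype_to_object` is hypothetical, and the `isinstance(..., pd.StringDtype)` check is an assumption about how a non-`object` string dtype might be detected. It only shows that a string-dtype series can be turned back into an `object` series before it is handed to Arrow.

```python
import pandas as pd


def coerce_string_dtype_to_object(series: pd.Series) -> pd.Series:
    """Hypothetical helper: return an object-dtype copy of a string-dtype series."""
    # Assumption: pd.StringDtype covers the explicit "string" dtype as well as
    # the string dtype inferred when future.infer_string is enabled.
    if isinstance(series.dtype, pd.StringDtype):
        return series.astype(object)
    # Object-dtype and non-string series are returned unchanged.
    return series


# A series with an explicit (non-object) string dtype.
strings = pd.Series(["1", "2"], dtype="string")
coerced = coerce_string_dtype_to_object(strings)

# A numeric series is passed through as-is.
numbers = pd.Series([1, 2])
untouched = coerce_string_dtype_to_object(numbers)
```

Per the description above, passing an `object` series of strings such as `coerced` to `pa.Array.from_pandas(..., type=pa.decimal128(...))` is expected to raise `ArrowTypeError` rather than cast silently, which is the behavior the legacy UDF path depends on.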
