zhengruifeng opened a new pull request, #55974: URL: https://github.com/apache/spark/pull/55974

### What changes were proposed in this pull request?

This PR makes `pyspark.sql.tests.coercion.test_pandas_udf_return_type.PandasUDFReturnTypeTests` work under pandas >= 3.0 and on systems whose `tzdata` package no longer ships the legacy `US/*` aliases (e.g. Ubuntu 24.04 / noble).

Two changes:

1. **Switch the tz-aware fixture from `US/Eastern` to `America/New_York`.** The values returned by `pd.date_range(...).values` are identical for the two aliases (both are UTC-5 with the same DST rules), so the golden file does not need to be regenerated.
2. **Remap the loaded golden DataFrame in memory for pandas >= 3.0.** The on-disk golden file was generated under pandas 2, where the default datetime ndarray resolution is `datetime64[ns]` and `pd.Categorical` keeps `object`-dtyped categories. Under pandas 3 those defaults are `datetime64[us]` and `str`-dtyped categories. The lookup keys built by `repr_value` therefore no longer match the golden column names. We rebuild the affected column names at load time (without touching the file on disk) so the same golden works for both pandas versions.

### Why are the changes needed?

Currently scheduled CI runs on the `python-312-pandas-3` image fail in this suite:

- `pd.date_range("19700101", periods=2, tz="US/Eastern").values` raises `zoneinfo._common.ZoneInfoNotFoundError: 'No time zone found with key US/Eastern'`. pandas 3 dropped `pytz` as a hard dependency and resolves tz names through stdlib `zoneinfo`, which on Ubuntu 24.04 cannot find `US/Eastern` because Ubuntu moved the legacy aliases out of `tzdata` into a separate `tzdata-legacy` package that the CI image does not install.
  Example failure: https://github.com/apache/spark/actions/runs/26002965955/job/76430490989
- After the alias fix, `golden.loc[str_t, str_v]` raises `KeyError` because the column keys in the golden file are pandas-2-shaped (`datetime64[ns]`, `Categorical(..., object)`) but the lookup keys built at runtime are pandas-3-shaped (`datetime64[us]`, `Categorical(..., str)`).

### Does this PR introduce _any_ user-facing change?

No. Test-only change.

### How was this patch tested?

Ran the suite locally with pandas 3.0.2 (Python 3.13):

```
python/run-tests --testnames "pyspark.sql.tests.coercion.test_pandas_udf_return_type PandasUDFReturnTypeTests"
```

The previous `ZoneInfoNotFoundError` and `KeyError` errors are gone.

Note: there are still a few remaining pandas-3 assertion mismatches caused by the underlying nanosecond->microsecond resolution change propagating into cast results (e.g. the `bigint` row for datetime/timedelta columns), and one cell where pandas 3 succeeds where pandas 2 errored (`['12', '34']@list` vs `decimal(10,0)`). Those are pre-existing pandas-3 incompatibilities and are out of scope for this PR.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
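
For readers who want a feel for the in-memory remap described in change 2, here is a minimal sketch of the idea. It is not the code in this PR: `remap_key`, `remap_columns`, and the example key strings are hypothetical, and the real keys produced by `repr_value` may differ from pandas 2 to pandas 3 in other ways (for example, in the number of fractional-second digits) that this sketch does not handle.

```python
import re

# Assumed shape of the category-dtype tag inside a repr-style key.
# Illustrative only; the real keys come from the suite's repr_value helper.
_CATEGORY_OBJECT = re.compile(r"(Categories \(\d+, )object(\))")


def remap_key(key: str) -> str:
    """Rewrite one pandas-2-shaped key into its assumed pandas-3 shape."""
    # Default datetime resolution: nanoseconds under pandas 2, microseconds under pandas 3.
    key = key.replace("datetime64[ns]", "datetime64[us]")
    # Categorical categories: object-dtyped under pandas 2, str-dtyped under pandas 3.
    return _CATEGORY_OBJECT.sub(r"\1str\2", key)


def remap_columns(columns, pandas_major: int):
    """Return column labels suitable for lookups under the given pandas major version.

    Nothing is written back to disk; only the in-memory labels change.
    """
    if pandas_major < 3:
        return list(columns)
    return [remap_key(c) for c in columns]


# Hypothetical golden column labels, shaped like pandas 2 output.
golden_columns = [
    "array(['1970-01-01T05:00:00'], dtype='datetime64[ns]')",
    "Categories (2, object): ['A', 'B']",
    "['12', '34']@list",
]
pandas3_columns = remap_columns(golden_columns, pandas_major=3)
```

With a real DataFrame, the same function could presumably be applied via `golden.rename(columns=remap_key)`, which leaves the golden file on disk untouched.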
