EwoutH commented on issue #20719: URL: https://github.com/apache/arrow/issues/20719#issuecomment-4153772926
I just encountered this issue; it seems to still exist. Here's how I ran into it in an SPSS → pandas → parquet pipeline:

When reading SPSS `.sav` files via `pandas.read_spss()` (which uses [pyreadstat](https://github.com/Roche/pyreadstat)), numeric variables with partial value labels produce `object`-dtype columns containing a mix of floats and strings. For example, a household size variable holds codes 1–9 as floats and `"10 personen of meer"` ("10 persons or more") as a label string. Calling `df.to_parquet()` then triggers the exact error described in this issue:

```
ArrowInvalid: ("Could not convert '10 personen of meer' with type str: tried to convert to double", 'Conversion failed for column HHLft1 with type object')
```

This also affects `object`-dtype columns (not just categoricals), so it's not caught by the fixes in #33727 or https://github.com/pandas-dev/pandas/issues/46863.

The required workaround, casting all non-numeric columns to `str` before writing, works but is easy to miss, since the error only surfaces at serialization time, not when the mixed-type column is created.

The original suggestion from 2018 (fall back to string when type inference encounters mixed types, https://github.com/apache/arrow/issues/3280) would resolve this entire class of issues cleanly.

I've also filed https://github.com/Roche/pyreadstat/issues/323 to address this at the source (pyreadstat should produce string columns when applying value labels to numeric variables).

CC: @pitrou @jorisvandenbossche

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
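For readers hitting the same error, the workaround mentioned in the comment above (casting to `str` before writing) might look roughly like the sketch below. It is narrower than "all non-numeric columns": it only touches `object` columns that actually hold more than one Python type. The helper name `stringify_mixed_object_columns` and the example frame are hypothetical, not part of pandas, pyarrow, or pyreadstat, and the sketch assumes the cells are scalars (no lists or dicts).

```python
import pandas as pd


def stringify_mixed_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which mixed-type object columns are cast to str.

    Hypothetical helper for illustration; assumes scalar cell values.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue
        # Collect the Python types of the non-missing values in this column.
        kinds = {type(v) for v in col.dropna()}
        if len(kinds) > 1:
            # Leave missing values untouched; convert everything else to str.
            out[name] = col.map(lambda v: v if pd.isna(v) else str(v))
    return out


# Example frame mimicking the partially labelled SPSS variable described above.
df = pd.DataFrame({
    "HHLft1": pd.Series([1.0, 2.0, "10 personen of meer"], dtype=object),
    "weight": [0.5, 1.5, 2.5],
})
clean = stringify_mixed_object_columns(df)
# clean.to_parquet("out.parquet")  # would need pyarrow (or fastparquet) installed
```

Note that the float codes become strings such as `"1.0"`, so whether this is acceptable depends on how the column is consumed downstream.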
