EwoutH commented on issue #20719:
URL: https://github.com/apache/arrow/issues/20719#issuecomment-4153772926

   I just ran into this issue; it still seems to exist.
   
   Here's how I encountered it in an SPSS → pandas → parquet pipeline:
   
   When reading SPSS `.sav` files via `pandas.read_spss()` (which uses 
[pyreadstat](https://github.com/Roche/pyreadstat)), numeric variables with 
partial value labels produce `object`-dtype columns containing a mix of floats 
and strings. For example, a household size variable comes back with codes 1–9 
as floats and `"10 personen of meer"` (Dutch for "10 people or more") as a 
label string. Calling `df.to_parquet()` then triggers the exact error 
described in this issue:
   
   ```
   ArrowInvalid: ("Could not convert '10 personen of meer' with type str: tried to convert to double",
   'Conversion failed for column HHLft1 with type object')
   ```
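
Columns like this can be spotted before the write fails. The following is a rough sketch, not part of pandas or pyarrow: `mixed_type_columns` is a hypothetical helper, and the small frame only imitates the shape of the `HHLft1` column (the `weight` column is made up for contrast).

```python
import pandas as pd

# Hypothetical stand-in for what read_spss() returned:
# float codes plus one value label in the same object-dtype column.
df = pd.DataFrame({
    "HHLft1": pd.Series([1.0, 2.0, 9.0, "10 personen of meer"], dtype=object),
    "weight": [0.5, 1.5, 1.0, 2.0],
})


def mixed_type_columns(frame: pd.DataFrame) -> list:
    """Return object-dtype columns whose non-null values have more than one Python type."""
    mixed = []
    for name in frame.columns:
        if frame[name].dtype != object:
            continue  # numeric, datetime, etc. columns are homogeneous already
        kinds = {type(value) for value in frame[name].dropna()}
        if len(kinds) > 1:
            mixed.append(name)
    return mixed


print(mixed_type_columns(df))  # ['HHLft1']
```

Running a check like this right after loading moves the failure to the point where the mixed-type column is created, instead of the later `to_parquet()` call.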
   
   This also affects `object`-dtype columns (not just categoricals), so it's 
not caught by the fixes in #33727 or 
https://github.com/pandas-dev/pandas/issues/46863.
   
   The required workaround — casting all non-numeric columns to `str` before 
writing — works but is easy to miss since the error only surfaces at 
serialization time, not when the mixed-type column is created. The original 
suggestion from 2018 (fall back to string when type inference encounters mixed 
types, https://github.com/apache/arrow/issues/3280) would resolve this entire 
class of issues cleanly.
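
For reference, a minimal sketch of the workaround as I apply it, assuming that every `object`-dtype column should end up as strings. `stringify_object_columns` is a hypothetical helper name, the sample frame is made up, and the output path in the last comment is only a placeholder.

```python
import pandas as pd


def stringify_object_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which non-missing values of object-dtype columns are str."""
    out = frame.copy()
    for name in out.columns:
        if out[name].dtype == object:
            # Keep missing values missing instead of turning them into "nan"/"None".
            out[name] = out[name].map(lambda v: v if pd.isna(v) else str(v))
    return out


# Hypothetical frame imitating the problematic column, including a missing value.
df = pd.DataFrame({
    "HHLft1": pd.Series([1.0, 9.0, None, "10 personen of meer"], dtype=object),
    "weight": [0.5, 1.5, 1.0, 2.0],
})

clean = stringify_object_columns(df)
print(clean["HHLft1"].dropna().tolist())

# clean.to_parquet("households.parquet")  # placeholder path; needs pyarrow installed
```

Note that the float codes are written as `"1.0"`, `"9.0"`, and so on, not `"1"`, so downstream consumers see the string form of the float rather than the original integer code.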
   
   I've also filed https://github.com/Roche/pyreadstat/issues/323 to address 
this at the source (pyreadstat should produce string columns when applying 
value labels to numeric variables).
   
   CC: @pitrou @jorisvandenbossche

