Re: [I] [Python] Hive partition columns being forced to dict type [arrow]

via GitHub Thu, 18 Sep 2025 14:13:36 -0700


JasonTam commented on issue #47592:
URL: https://github.com/apache/arrow/issues/47592#issuecomment-3309744219


   @AlenkaF Thanks for the details.
   
   I'm worried that I might have misdiagnosed my issue. Perhaps it's more of a pandas conversion problem. I tried to make a minimal example here:
   
   
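   ```python
   import os
   
   import pandas as pd
   from pyarrow.dataset import dataset
   
   # Note the integer column labels (0 and 1).
   df = pd.DataFrame({
       0: [1, 2, 3],
       1: [4, 5, 6],
   })
   
   # The hive-style partition directory must exist before writing.
   os.makedirs("./test-pd-data/run_date=2025-09-17", exist_ok=True)
   df.to_parquet("./test-pd-data/run_date=2025-09-17/0.parquet")
   
   ds = dataset("test-pd-data", format="parquet", partitioning="hive")
   table = ds.to_table()
   print(table.schema)
   
   table.to_pandas()  # raises the ValueError shown below
   ```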
   
   table.schema:
   ```
   0: int64
   1: int64
   run_date: string
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 437
   ```
   
   Error:
   ```
   ValueError                                Traceback (most recent call last)
   table.to_pandas()
   
   File .venv/lib/python3.12/site-packages/pyarrow/array.pxi:1020, in pyarrow.lib._PandasConvertible.to_pandas()
   
   File .venv/lib/python3.12/site-packages/pyarrow/table.pxi:5177, in pyarrow.lib.Table._to_pandas()
   
   File .venv/lib/python3.12/site-packages/pyarrow/pandas_compat.py:803, in table_to_dataframe(options, table, categories, ignore_metadata, types_mapper)
       798     ext_columns_dtypes = _get_extension_dtypes(
       799         table, [], types_mapper, options, categories
       800     )
       802 _check_data_column_metadata_consistency(all_columns)
   --> 803 columns = _deserialize_column_index(table, all_columns, column_indexes)
       805 column_names = table.column_names
       806 result = pa.lib.table_to_blocks(options, table, categories,
       807                                 list(ext_columns_dtypes.keys()))
   
   File .venv/lib/python3.12/site-packages/pyarrow/pandas_compat.py:967, in _deserialize_column_index(block_table, all_columns, column_indexes)
       965 # if we're reconstructing the index
       966 if len(column_indexes) > 0:
   --> 967     columns = _reconstruct_columns_from_metadata(columns, column_indexes)
       969 return columns
   ...
       132     # Explicit copy, or required since NumPy can't view from / to object.
   --> 133     return arr.astype(dtype, copy=True)
       135 return arr.astype(dtype, copy=copy)
   
   ValueError: invalid literal for int() with base 10: 'run_date'
   ```
   
   The thing is, if I create and write the Parquet file with pyarrow instead of pandas, I don't run into this issue.
   
   Pandas version: 2.2.3


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]