AlenkaF commented on issue #47592:
URL: https://github.com/apache/arrow/issues/47592#issuecomment-3306544041
When you call `read_table` from the `parquet` module, it uses the dataset
functionality under the hood if that is available. In that case hive partitioning
is applied by default, and the partition fields are inferred as dictionary types.
See `infer_dictionary=True` in:
https://github.com/apache/arrow/blob/479662e88d81661d772d40435ae251425a697757/python/pyarrow/parquet/core.py#L1419-L1422
See also:
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.HivePartitioning.html#pyarrow.dataset.HivePartitioning.discover
If you want the partition fields as regular string/int columns, you can use the
`dataset` module directly. Arrow then infers plain types for the partition
columns rather than dictionary types:
```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import os
>>> t = pa.table({"col1": [1, 2, 3], "col2": [4, 5, 6]})
>>> partition_dir = os.path.join("data", "run_date=2025-09-17",
...                              "job_id=abc123")
>>> os.makedirs(partition_dir, exist_ok=True)
>>> pq.write_table(t, os.path.join(partition_dir, "0.parquet"))
>>> from pyarrow.dataset import dataset
>>> ds = dataset("data", format="parquet", partitioning="hive")
>>> table = ds.to_table()
>>> print(table.schema)
col1: int64
col2: int64
run_date: string
job_id: string
>>> table
pyarrow.Table
col1: int64
col2: int64
run_date: string
job_id: string
----
col1: [[1,2,3]]
col2: [[4,5,6]]
run_date: [["2025-09-17","2025-09-17","2025-09-17"]]
job_id: [["abc123","abc123","abc123"]]
```
Compare that with using the `parquet` module:
```python
>>> pq.read_table("data")
pyarrow.Table
col1: int64
col2: int64
run_date: dictionary<values=string, indices=int32, ordered=0>
job_id: dictionary<values=string, indices=int32, ordered=0>
----
col1: [[1,2,3]]
col2: [[4,5,6]]
run_date: [ -- dictionary:
["2025-09-17"] -- indices:
[0,0,0]]
job_id: [ -- dictionary:
["abc123"] -- indices:
[0,0,0]]
```