Re: [I] Hive partition columns being forced to int when data columns are int [arrow]

via GitHub Thu, 18 Sep 2025 02:48:26 -0700


AlenkaF commented on issue #47592:
URL: https://github.com/apache/arrow/issues/47592#issuecomment-3306544041


   `read_table` from the `parquet` module uses the dataset functionality under 
the hood when it is available. In that case, Hive partitioning is applied by 
default and the partition fields are inferred as dictionary types. 
See `infer_dictionary=True` in:
    
https://github.com/apache/arrow/blob/479662e88d81661d772d40435ae251425a697757/python/pyarrow/parquet/core.py#L1419-L1422
   
   See also: 
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.HivePartitioning.html#pyarrow.dataset.HivePartitioning.discover
   
   If you want the partition fields as regular string/int columns, you can use 
the `dataset` module directly. Arrow then infers plain value types for the 
partition columns instead of dictionary types:
   
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.parquet as pq
   >>> import os
   
   >>> t = pa.table({"col1": [1, 2, 3], "col2": [4, 5, 6]})
   >>> partition_dir = os.path.join("data", "run_date=2025-09-17", "job_id=abc123")
   >>> os.makedirs(partition_dir, exist_ok=True)
   >>> pq.write_table(t, os.path.join(partition_dir, "0.parquet"))
   
   >>> from pyarrow.dataset import dataset
   >>> ds = dataset("data", format="parquet", partitioning="hive")
   >>> table = ds.to_table()
   >>> print(table.schema)
   col1: int64
   col2: int64
   run_date: string
   job_id: string
   
   >>> table
   pyarrow.Table
   col1: int64
   col2: int64
   run_date: string
   job_id: string
   ----
   col1: [[1,2,3]]
   col2: [[4,5,6]]
   run_date: [["2025-09-17","2025-09-17","2025-09-17"]]
   job_id: [["abc123","abc123","abc123"]]
   ```
   
   Compare this with reading through the `parquet` module:
   
   ```python
   >>> pq.read_table("data")
   pyarrow.Table
   col1: int64
   col2: int64
   run_date: dictionary<values=string, indices=int32, ordered=0>
   job_id: dictionary<values=string, indices=int32, ordered=0>
   ----
   col1: [[1,2,3]]
   col2: [[4,5,6]]
   run_date: [  -- dictionary:
   ["2025-09-17"]  -- indices:
   [0,0,0]]
   job_id: [  -- dictionary:
   ["abc123"]  -- indices:
   [0,0,0]]
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
