JasonTam commented on issue #47592:
URL: https://github.com/apache/arrow/issues/47592#issuecomment-3312967056
Yes, I'm only seeing issues on the read side; the error/stacktrace from the
minimal example is in my previous message.
Maybe I'm trying to do something that isn't the intended way. The goal is to
write Parquet partitions with pandas, where the partition information exists
only in the file path (just like my previous example):
```python
import os

import pandas as pd

df = pd.DataFrame({
    0: [1, 2, 3],
    1: [4, 5, 6],
})

# pandas won't create the hive-style directories, so make them first
os.makedirs("./test-pd-data/run_date=2025-09-17", exist_ok=True)
os.makedirs("./test-pd-data/run_date=2025-09-18", exist_ok=True)

# the partition value exists only in the file path, not in the data
df.to_parquet("./test-pd-data/run_date=2025-09-17/0.parquet")
df.to_parquet("./test-pd-data/run_date=2025-09-18/0.parquet")
df.to_parquet("./test-pd-data/run_date=2025-09-18/1.parquet")
# etc.
```
A separate system will then read this `test-pd-data` dataset and infer the
hive partitions:
```python
pd.read_parquet("./test-pd-data/", engine="pyarrow", partitioning="hive")
```
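For reference, the same read can be reproduced without pandas; this is roughly
what the pyarrow engine delegates to for a directory path:

```python
import pyarrow.dataset as ds

# Hive partition fields (run_date=...) are inferred from the directory names.
dataset = ds.dataset("./test-pd-data/", format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()
```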
But are you saying that the writer and reader need to agree on metadata
(especially for the partition fields)? In my use case, the writer doesn't have
any information about the partition field aside from it being in the file
path. A sketch of what I could try instead is below.
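
If the expectation is that the reader must declare the partition fields
explicitly rather than infer them, something like this sketch is what I'd
reach for (I'm assuming here that reading `run_date` back as a plain string
is acceptable; I don't know whether this sidesteps the error):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition schema explicitly instead of relying on inference.
# The choice of pa.string() for run_date is an assumption on my part.
part = ds.partitioning(
    pa.schema([("run_date", pa.string())]),
    flavor="hive",
)
dataset = ds.dataset("./test-pd-data/", format="parquet", partitioning=part)
df = dataset.to_table().to_pandas()
```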