JasonTam commented on issue #47592:
URL: https://github.com/apache/arrow/issues/47592#issuecomment-3312967056
Yes, I'm only seeing issues on the read side; the error/stacktrace from the
minimal example is in my previous message.
Maybe I'm trying to do something that isn't the intended way. The goal is to
write Parquet partitions with pandas, where the partition information exists
only in the file path (just like my previous example):
```python
import os

import pandas as pd

df = pd.DataFrame({
    0: [1, 2, 3],
    1: [4, 5, 6],
})

# pandas won't create the hive-style directories, so make them first
os.makedirs("./test-pd-data/run_date=2025-09-17", exist_ok=True)
os.makedirs("./test-pd-data/run_date=2025-09-18", exist_ok=True)

# the partition value exists only in the file path, not in the data
df.to_parquet("./test-pd-data/run_date=2025-09-17/0.parquet")
df.to_parquet("./test-pd-data/run_date=2025-09-18/0.parquet")
df.to_parquet("./test-pd-data/run_date=2025-09-18/1.parquet")
# etc.
```
A separate system will then read this `test-pd-data` dataset and infer the
hive partitions:
```python
pd.read_parquet("./test-pd-data/", engine="pyarrow", partitioning="hive")
```
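For reference, the same read can be reproduced without pandas; this is roughly
what the pyarrow engine delegates to for a directory path:

```python
import pyarrow.dataset as ds

# Hive partition fields (run_date=...) are inferred from the directory names.
dataset = ds.dataset("./test-pd-data/", format="parquet", partitioning="hive")
df = dataset.to_table().to_pandas()
```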
But are you saying that the writer and reader need to agree on metadata
(especially for the partition fields)? In my use case, the writer doesn't have
any information about the partition field aside from it being in the file
path. A sketch of what I could try instead is below.
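
If the expectation is that the reader must declare the partition fields
explicitly rather than infer them, something like this sketch is what I'd
reach for (I'm assuming here that reading `run_date` back as a plain string
is acceptable; I don't know whether this sidesteps the error):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition schema explicitly instead of relying on inference.
# The choice of pa.string() for run_date is an assumption on my part.
part = ds.partitioning(
    pa.schema([("run_date", pa.string())]),
    flavor="hive",
)
dataset = ds.dataset("./test-pd-data/", format="parquet", partitioning=part)
df = dataset.to_table().to_pandas()
```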