asfimport opened a new issue, #30800:
URL: https://github.com/apache/arrow/issues/30800
Add a note to the docs: if both `partitioning` and `schema` are specified when opening a dataset, and the partition fields are not stored in the data files themselves, then the schema must include the partition field names (this applies to both directory and hive partitioning) whenever those fields will be used for filtering.
Example:
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
# Define the data
table = pa.table({'one': [-1, np.nan, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

# Write to a partitioned dataset; the data files themselves will only
# contain columns "two" and "three" ("one" is encoded in the paths)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one'])

# Reading the partitioned dataset with a schema that omits the partition
# field "one" will error when filtering on that field
schema = pa.schema([("three", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

# Including the partition field in the schema makes the filter work:
schema = pa.schema([("three", "double"), ("one", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)
```
**Reporter**: [Alenka
Frim](https://issues.apache.org/jira/browse/ARROW-15311) / @AlenkaF
<sub>**Note**: *This issue was originally created as
[ARROW-15311](https://issues.apache.org/jira/browse/ARROW-15311). Please see
the [migration documentation](https://github.com/apache/arrow/issues/14542) for
further details.*</sub>