&res created ARROW-17228:
----------------------------

             Summary: dataset.write_dataset should use Scanner.projected_schema when passed a scanner with projected columns
                 Key: ARROW-17228
                 URL: https://issues.apache.org/jira/browse/ARROW-17228
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
         Environment: Python 3.9.13, pyarrow 8.0.0
            Reporter: &res

The code below projects the column 'region' to 'Region' in a scanner and then tries to partition the written dataset on the projected name:

{code:java}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', "Other"]
)
table_dataset = ds.dataset(table)

# Rename 'region' to 'Region' through a projection.
columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)

ds.write_dataset(
    scanner,
    'newpath',
    partitioning=['Region'],
    partitioning_flavor='hive',
    format='parquet')
{code}

It fails with this exception:

{code:java}
KeyError: 'Column Region does not exist in schema'
{code}

I suspect this is because write_dataset isn't looking at the correct schema. It should use scanner.projected_schema rather than scanner.dataset_schema, since only the projected schema contains the renamed 'Region' column.

I think it's just a matter of updating this line:
https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967

The issue was raised here:
https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning