[ https://issues.apache.org/jira/browse/ARROW-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Li updated ARROW-17228:
-----------------------------
    Summary: [Python] dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns  (was: dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns)

> [Python] dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-17228
>                 URL: https://issues.apache.org/jira/browse/ARROW-17228
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>         Environment: Python 3.9.13
>                      pyarrow 8.0.0
>            Reporter: &res
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In the code below:
> {code:java}
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> table = pa.Table.from_arrays(
>     [
>         pa.array(['a', 'b', 'c'], pa.string()),
>         pa.array(['a', 'b', 'c'], pa.string()),
>     ],
>     names=['region', "Other"]
> )
> table_dataset = ds.dataset(table)
> columns = {
>     "Region": ds.field('region'),
>     "Other": ds.field('Other'),
> }
> scanner = table_dataset.scanner(columns=columns)
> ds.write_dataset(
>     scanner,
>     'newpath',
>     partitioning=['Region'], partitioning_flavor='hive',
>     format='parquet')
> {code}
> I get this exception:
> {code:java}
> KeyError: 'Column Region does not exist in schema'
> {code}
> I suspect this is because write_dataset isn't looking at the correct schema. It should look at scanner.projected_schema (rather than scanner.dataset_schema).
> I think it's just a matter of updating this line:
> https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967
>
> The issue was raised here:
> https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning

--
This message was sent by Atlassian Jira
(v8.20.10#820010)