[jira] [Updated] (ARROW-17228) [Python] dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns

David Li (Jira) Tue, 02 Aug 2022 05:50:07 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


David Li updated ARROW-17228:
-----------------------------
    Summary: [Python] dataset.write_data should use Scanner.projected_schema 
when passed a scanner with projected columns  (was: dataset.write_data should 
use Scanner.projected_schema when passed a scanner with projected columns)

> [Python] dataset.write_data should use Scanner.projected_schema when passed a 
> scanner with projected columns
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-17228
>                 URL: https://issues.apache.org/jira/browse/ARROW-17228
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>         Environment: Python 3.9.13
> pyarrow 8.0.0
>            Reporter: &res
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> In the code below:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
> table = pa.Table.from_arrays(
>     [
>         pa.array(['a', 'b', 'c'], pa.string()),
>         pa.array(['a', 'b', 'c'], pa.string()),
>     ],
>     names=['region', 'Other']
> )
> table_dataset = ds.dataset(table)
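> # Project the columns, renaming 'region' to 'Region'.
> columns = {
>     "Region": ds.field('region'),
>     "Other": ds.field('Other'),
> }
> scanner = table_dataset.scanner(columns=columns)
> # Partition on the projected name 'Region', which exists only in the
> # scanner's projected schema, not in the underlying dataset schema.
> ds.write_dataset(
>     scanner,
>     'newpath',
>     partitioning=['Region'], partitioning_flavor='hive',
>     format='parquet')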
>  {code}
> I get this exception:
> {code:java}
> KeyError: 'Column Region does not exist in schema'
>  {code}
> I suspect this is because write_dataset resolves the partitioning columns 
> against the wrong schema. It should use scanner.projected_schema (rather 
> than scanner.dataset_schema), since the column Region exists only in the 
> projection.
> I think it's just a matter of updating this line: 
> https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967
>  
> The issue was raised here: 
> https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
