[jira] [Commented] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

Rok Mihevc (Jira) Tue, 10 Jan 2023 23:51:14 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662243#comment-17662243
 ]


Rok Mihevc commented on ARROW-5220:
-----------------------------------

This issue has been migrated to [issue 
#21694|https://github.com/apache/arrow/issues/21694] on GitHub. Please see the 
[migration documentation|https://github.com/apache/arrow/issues/14542] for 
further details.

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -------------------------------------------------------------------------
>
>                 Key: ARROW-5220
>                 URL: https://issues.apache.org/jira/browse/ARROW-5220
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The {{Table.from_pandas}} method allows to specify a schema ("This can be 
> used to indicate the type of columns if we cannot infer it automatically.").
> But, if you also want to specify the type of the index, you get an error:
> {code:python}
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>                        ('a', pa.int64()),
>                        ('b', pa.float64()),
>                       ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
> from the schema in the dataframe, and thus does not find column 'index').
> This also has the consequence that re-using the schema does not work: 
> {{table1 = pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
> schema=table1.schema)}}
> Extra note: also unknown columns in general give this error (column specified 
> in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (eg noticed this from 
> the code in example in ARROW-3861). So before, unknown columns in the 
> specified schema were ignored, while now they raise an error. Was this a 
> conscious change?  
> So before also specifying the index in the schema "worked" in the sense that 
> it didn't raise an error, but it was also ignored, so didn't actually do what 
> you would expect)
> Questions:
> - I think that we should support specifying the index in the passed 
> {{schema}} ? So that the example above works (although this might be 
> complicated with RangeIndex that is not serialized any more)
> - But what to do in general with additional columns in the schema that are 
> not in the DataFrame? Are we fine with keep raising an error as it is now 
> (the error message could be improved then)? Or do we again want to ignore 
> them? (or, it could actually also add them as all nulls to the table)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

Reply via email to