[ https://issues.apache.org/jira/browse/ARROW-5220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17662243#comment-17662243 ]
Rok Mihevc commented on ARROW-5220:
-----------------------------------

This issue has been migrated to [issue #21694|https://github.com/apache/arrow/issues/21694] on GitHub. Please see the [migration documentation|https://github.com/apache/arrow/issues/14542] for further details.

> [Python] index / unknown columns in specified schema in Table.from_pandas
> -------------------------------------------------------------------------
>
>                 Key: ARROW-5220
>                 URL: https://issues.apache.org/jira/browse/ARROW-5220
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The {{Table.from_pandas}} method allows specifying a schema ("This can be
> used to indicate the type of columns if we cannot infer it automatically.").
> But if you also want to specify the type of the index, you get an error:
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
> df.index = pd.Index(['a', 'b', 'c'], name='index')
> my_schema = pa.schema([('index', pa.string()),
>                        ('a', pa.int64()),
>                        ('b', pa.float64()),
>                       ])
> table = pa.Table.from_pandas(df, schema=my_schema)
> {code}
> This gives {{KeyError: 'index'}} (because it tries to look up the "column
> names" from the schema in the dataframe, and thus does not find a column
> 'index').
> This also has the consequence that re-using the schema does not work:
> {{table1 = pa.Table.from_pandas(df1); table2 = pa.Table.from_pandas(df2,
> schema=table1.schema)}}
> Extra note: unknown columns in general also give this error (columns
> specified in the schema that are not in the dataframe).
> At least in pyarrow 0.11, this did not give an error (e.g. I noticed this
> from the example code in ARROW-3861). So before, unknown columns in the
> specified schema were ignored, while now they raise an error. Was this a
> conscious change?
> (So before, specifying the index in the schema also "worked" in the sense
> that it didn't raise an error, but it was ignored, so it didn't actually do
> what you would expect.)
> Questions:
> - I think that we should support specifying the index in the passed
> {{schema}}? So that the example above works (although this might be
> complicated with a RangeIndex that is not serialized any more).
> - But what to do in general with additional columns in the schema that are
> not in the DataFrame? Are we fine with continuing to raise an error as it is
> now (the error message could then be improved)? Or do we again want to
> ignore them? (Or it could actually also add them as all nulls to the table.)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
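
The three options raised in the questions of the quoted issue (raise, ignore, or fill with nulls) can be compared with a small pure-Python sketch. This is only an illustration under stated assumptions, not pyarrow's implementation: the function {{resolve_schema_fields}}, its {{on_missing}} parameter, and the policy names are hypothetical and do not exist in pyarrow.

```python
from typing import List, Sequence, Tuple


def resolve_schema_fields(
    columns: Sequence[str],
    index_names: Sequence[str],
    schema_names: Sequence[str],
    on_missing: str = "raise",
) -> List[Tuple[str, str]]:
    """Decide where the data for each schema field would come from.

    Hypothetical helper, for illustration only. Returns (name, source)
    pairs, where source is "column", "index" or "null".
    """
    if on_missing not in ("raise", "ignore", "null"):
        raise ValueError(f"unknown on_missing policy: {on_missing!r}")

    resolved = []
    for name in schema_names:
        if name in columns:
            # Regular DataFrame column.
            resolved.append((name, "column"))
        elif name in index_names:
            # Fall back to a named index level instead of failing.
            resolved.append((name, "index"))
        elif on_missing == "raise":
            raise KeyError(
                f"name {name!r} in the schema is neither a column "
                "nor an index level of the DataFrame"
            )
        elif on_missing == "null":
            # Keep the field and mark it to be filled with nulls.
            resolved.append((name, "null"))
        # on_missing == "ignore": drop the field silently.
    return resolved


# Mirrors the example from the issue, plus one unknown field "extra".
print(resolve_schema_fields(["a", "b"], ["index"],
                            ["index", "a", "b", "extra"],
                            on_missing="null"))
```

With {{on_missing="ignore"}} the sketch reproduces the behaviour described for pyarrow 0.11 for unknown names, while {{on_missing="raise"}} corresponds to the current behaviour with a more descriptive error message.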