[ https://issues.apache.org/jira/browse/ARROW-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105308#comment-16105308 ]
Phillip Cloud commented on ARROW-1291:
--------------------------------------

I'm -1 on allowing numeric column names since it adds an IMO unnecessary coupling to pandas semantics. With such a change, any tool that wants to read data out of an arrow array must now consider the origin of the data's column names, and cannot simply assume that the columns in the schema are always a simple list of strings. I don't think it's easy to make this behavior transparent to tools that use arrow, while OTOH a list of strings is easy to deal with in pretty much any system that arrow is a part of or will be a part of.

Since this is really only useful when doing pandas -> arrow -> pandas, and users of pandas can already refer to columns by positional index with {{.iloc}}, I'm not convinced we should allow this.

I think adding metadata for indexes has less far-reaching effects because it's an optional feature of pandas that isn't a core part of arrow, while column names are non-negotiable. I don't think it's too much to ask people to explicitly write out their column names as strings.
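
A minimal sketch of what writing the names out as strings could look like on the pandas side. It assumes pandas is installed; {{with_string_columns}} is a hypothetical helper name, not a pandas or pyarrow API, and the pyarrow call is skipped when pyarrow is not available:

```python
import pandas as pd


def with_string_columns(df):
    """Return a copy of df whose column labels are all str.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    out = df.copy()
    out.columns = [str(name) for name in df.columns]
    return out


df = pd.DataFrame([1])            # default column label is the integer 0
df_str = with_string_columns(df)  # column label is now the string "0"

try:
    import pyarrow as pa
except ImportError:
    pa = None  # pyarrow not installed; skip the conversion step

if pa is not None:
    # With string labels, the column name check from the report is satisfied.
    batch = pa.RecordBatch.from_pandas(df_str)
```

Positional access is unaffected by the renaming: {{df_str.iloc[:, 0]}} still selects the first column.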
I *am* willing to be convinced though :)

> [Python] pa.RecordBatch.from_pandas doesn't accept DataFrame with numeric column names
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-1291
>                 URL: https://issues.apache.org/jira/browse/ARROW-1291
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.5.0
>            Reporter: Li Jin
>            Priority: Minor
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame([1])
> pa.RecordBatch.from_pandas(df)
> {code}
> Exception:
> {code}
> TypeError                                 Traceback (most recent call last)
> <ipython-input-5-670ba4a2ddb2> in <module>()
>       3
>       4 df = pd.DataFrame([1])
> ----> 5 pa.RecordBatch.from_pandas(df)
> table.pxi in pyarrow.lib.RecordBatch.from_pandas()
> table.pxi in pyarrow.lib._dataframe_to_arrays()
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in construct_metadata(df, index_levels, preserve_index, types)
>     187                 arrow_type=arrow_type
>     188             )
> --> 189             for name, arrow_type in zip(df.columns, df_types)
>     190         ] + (
>     191             [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
>     187                 arrow_type=arrow_type
>     188             )
> --> 189             for name, arrow_type in zip(df.columns, df_types)
>     190         ] + (
>     191             [
> /home/icexelloss/miniconda3/envs/spark-dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in get_column_metadata(column, name, arrow_type)
>     125         raise TypeError(
>     126             'Column name must be a string. Got column {} of type {}'.format(
> --> 127                 name, type(name).__name__
>     128             )
>     129         )
> TypeError: Column name must be a string. Got column 0 of type int64
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)