[ https://issues.apache.org/jira/browse/ARROW-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-5104:
-----------------------------------------
    Fix Version/s: 0.14.0

> [Python/C++] Schema for empty tables include index column as integer
> --------------------------------------------------------------------
>
>                 Key: ARROW-5104
>                 URL: https://issues.apache.org/jira/browse/ARROW-5104
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Priority: Minor
>             Fix For: 0.14.0
>
>
> The schema for an empty table/dataframe still includes the index as an
> integer column instead of serializing it solely as a metadata reference
> (see ARROW-1639).
> In the example below, the table built from the empty dataframe still holds
> `__index_level_0__` as an integer column. The proper behavior would be to
> exclude it and describe the index only in the pandas metadata, as is the
> case for a non-empty dataframe.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: non_empty = pd.DataFrame({"col": [1]})
> In [4]: empty = non_empty.drop(0)
> In [5]: empty
> Out[5]:
> Empty DataFrame
> Columns: [col]
> Index: []
> In [6]: pa.Table.from_pandas(non_empty)
> Out[6]:
> pyarrow.Table
> col: int64
> metadata
> --------
> OrderedDict([(b'pandas',
>               b'{"index_columns": [{"kind": "range", "name": null, "start": '
>               b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
>               b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
>               b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
>               b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
>               b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
>               b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
>               b'll}')])
> In [7]: pa.Table.from_pandas(empty)
> Out[7]:
> pyarrow.Table
> col: int64
> __index_level_0__: int64
> metadata
> --------
> OrderedDict([(b'pandas',
>               b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
>               b'{"name": null, "field_name": null, "pandas_type": "unicode",'
>               b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
>               b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
>               b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
>               b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
>               b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
>               b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
>               b'rsion": null}')])
> In [8]: pa.__version__
> Out[8]: '0.13.0'
> In [9]: ! python --version
> Python 3.6.7
> {code}