[ 
https://issues.apache.org/jira/browse/ARROW-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929121#comment-16929121
 ] 

Joris Van den Bossche commented on ARROW-5104:
----------------------------------------------

Yeah, I don't think there is anything we can do on our side (I don't think we 
want to special-case an empty Int64 index).

With the new {{preserve_index=True}} you can ensure that the index is _always_ 
stored as a column, but for the other way around (ensuring it is always 
serialized as metadata), the user needs to make sure the index is actually a 
RangeIndex.

 

So [~fjetter], if you have other ideas for how to deal with this, they are very 
welcome, but I am closing this issue for now.

> [Python/C++] Schema for empty tables include index column as integer
> --------------------------------------------------------------------
>
>                 Key: ARROW-5104
>                 URL: https://issues.apache.org/jira/browse/ARROW-5104
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.13.0
>            Reporter: Florian Jetter
>            Priority: Minor
>             Fix For: 0.15.0
>
>
> The schema for an empty table/dataframe still includes the index as an 
> integer column instead of being serialized solely as a metadata reference 
> (see ARROW-1639).
> In the example below, the empty dataframe still holds `__index_level_0__` as 
> an integer column. Proper behavior would be to exclude it and reference the 
> index information in the pandas metadata, as is the case for a non-empty 
> dataframe.
> {code}
> In [1]: import pandas as pd
> In [2]: import pyarrow as pa
> In [3]: non_empty =  pd.DataFrame({"col": [1]})
> In [4]: empty = non_empty.drop(0)
> In [5]: empty
> Out[5]:
> Empty DataFrame
> Columns: [col]
> Index: []
> In [6]: pa.Table.from_pandas(non_empty)
> Out[6]:
> pyarrow.Table
> col: int64
> metadata
> --------
> OrderedDict([(b'pandas',
>               b'{"index_columns": [{"kind": "range", "name": null, "start": '
>               b'0, "stop": 1, "step": 1}], "column_indexes": [{"name": null,'
>               b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
>               b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
>               b'{"name": "col", "field_name": "col", "pandas_type": "int64",'
>               b' "numpy_type": "int64", "metadata": null}], "creator": {"lib'
>               b'rary": "pyarrow", "version": "0.13.0"}, "pandas_version": nu'
>               b'll}')])
> In [7]: pa.Table.from_pandas(empty)
> Out[7]:
> pyarrow.Table
> col: int64
> __index_level_0__: int64
> metadata
> --------
> OrderedDict([(b'pandas',
>               b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
>               b'{"name": null, "field_name": null, "pandas_type": "unicode",'
>               b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
>               b', "columns": [{"name": "col", "field_name": "col", "pandas_t'
>               b'ype": "int64", "numpy_type": "int64", "metadata": null}, {"n'
>               b'ame": null, "field_name": "__index_level_0__", "pandas_type"'
>               b': "int64", "numpy_type": "int64", "metadata": null}], "creat'
>               b'or": {"library": "pyarrow", "version": "0.13.0"}, "pandas_ve'
>               b'rsion": null}')])
> In [8]: pa.__version__
> Out[8]: '0.13.0'
> In [9]: ! python --version
> Python 3.6.7
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
