[jira] [Comment Edited] (ARROW-5139) [Python/C++] Empty column selection no longer restores index

Joris Van den Bossche (JIRA) Mon, 06 May 2019 02:38:20 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833597#comment-16833597
 ]


Joris Van den Bossche edited comment on ARROW-5139 at 5/6/19 9:37 AM:
----------------------------------------------------------------------

[~fjetter] thanks for the report! Here is a slightly simpler reproducible 
example without parquet (the underlying cause is the same: the RangeIndex is 
indeed not reconstructed for tables without columns). With pyarrow 0.12.0:

{code}
In [1]: import pyarrow as pa

In [2]: pa.__version__  
Out[2]: '0.12.0'

In [3]: df = pd.DataFrame( 
   ...:     {"a": [1, 2]} 
   ...: ) 

In [4]: table = pa.Table.from_pandas(df, columns=[], preserve_index=True)     

In [5]: table
Out[5]: 
pyarrow.Table
__index_level_0__: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
              b'{"name": null, "field_name": null, "pandas_type": "unicode",'
              b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
              b', "columns": [{"name": null, "field_name": "__index_level_0_'
              b'_", "pandas_type": "int64", "numpy_type": "int64", "metadata'
              b'": null}], "pandas_version": "0.23.4"}')])

In [6]: print(table.to_pandas())
Empty DataFrame
Columns: []
Index: [0, 1]

In [7]: table.to_pandas().index
Out[7]: Int64Index([0, 1], dtype='int64')
{code}
 
With a development build (0.13.1.dev126), the same code now gives:

{code}
In [4]: table 
Out[4]: 
pyarrow.Table

metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [], "creator": {"'
            b'library": "pyarrow", "version": "0.13.1.dev126+ga9ae4a9f.d201905'
            b'03"}, "pandas_version": "0.24.2"}'}

In [5]: print(table.to_pandas())   
Empty DataFrame
Columns: []
Index: []

In [6]: table.to_pandas().index    
Out[6]: RangeIndex(start=0, stop=0, step=1)
{code}
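
Note that the information needed to rebuild the index is still present in the metadata shown above. As a minimal sketch (standard library only), the range descriptor could be read back as follows. `range_from_metadata` is a hypothetical helper written for illustration, not a pyarrow function, and the metadata bytes are copied from the output above with the "creator" entry trimmed:

```python
import json

# The b'pandas' metadata value from the table output above
# (the "creator" entry is trimmed for brevity).
pandas_metadata = (
    b'{"index_columns": [{"kind": "range", "name": null, "start": 0, '
    b'"stop": 2, "step": 1}], "column_indexes": [{"name": null, '
    b'"field_name": null, "pandas_type": "unicode", "numpy_type": "object", '
    b'"metadata": {"encoding": "UTF-8"}}], "columns": [], '
    b'"pandas_version": "0.24.2"}'
)


def range_from_metadata(raw):
    """Hypothetical helper: return the range described by the first
    range-kind index descriptor, or None if there is none."""
    meta = json.loads(raw.decode("utf-8"))
    for descr in meta["index_columns"]:
        # Older metadata stores a column name (a string) here instead of a dict.
        if isinstance(descr, dict) and descr.get("kind") == "range":
            return range(descr["start"], descr["stop"], descr["step"])
    return None


expected_index = range_from_metadata(pandas_metadata)
print(list(expected_index))  # [0, 1]
```

So the descriptor still says the index should have two entries (0 and 1), while `to_pandas()` returns an index of length 0. For the 0.12.0-style metadata, where "index_columns" holds the string "__index_level_0__", the helper returns None because the index is stored as a regular column.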
 



> [Python/C++] Empty column selection no longer restores index
> ------------------------------------------------------------
>
>                 Key: ARROW-5139
>                 URL: https://issues.apache.org/jira/browse/ARROW-5139
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.12.1
>            Reporter: Florian Jetter
>            Priority: Minor
>              Labels: parquet
>
> The index of a dataframe is no longer reconstructed when an empty column 
> selection is used. This is a regression compared to 0.12.1 and probably only 
> affects pd.RangeIndex.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> 
> print(pa.__version__)
> df = pd.DataFrame(
>     {"a": [1, 2]}
> )
> print(df.index)
> # Round-trip the dataframe through an in-memory parquet file
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> # Read back with an empty column selection; only the index should remain
> table_restored = pq.read_pandas(reader, columns=[])
> df_restored = table_restored.to_pandas()
> print(len(df_restored))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
