[jira] [Updated] (ARROW-1941) Table <–> DataFrame roundtrip failing

Wes McKinney (JIRA) Thu, 21 Dec 2017 10:03:34 -0800

     [ 
https://issues.apache.org/jira/browse/ARROW-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wes McKinney updated ARROW-1941:
--------------------------------
    Fix Version/s: 0.9.0

> Table <–> DataFrame roundtrip failing
> -------------------------------------
>
>                 Key: ARROW-1941
>                 URL: https://issues.apache.org/jira/browse/ARROW-1941
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Thomas Buhrmann
>             Fix For: 0.9.0
>
>
> An Arrow table can hold a column containing only empty lists, cast to a 
> particular type (e.g. list of string). However, a roundtrip through pandas 
> appears to lose the original element type, and a subsequent attempt to 
> convert the result back to pandas fails.
> To reproduce in PyArrow 0.8.0:
> {code}
> import pyarrow as pa
> # Create table with array of empty lists, forced to have type list(string)
> arrays = {
>     'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
>     'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
> }
> rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
> tbl = pa.Table.from_batches([rb])
> print("Schema 1 (correct):\n{}".format(tbl.schema))
> # First roundtrip changes schema
> df = tbl.to_pandas()
> tbl2 = pa.Table.from_pandas(df)
> print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))
> # Second roundtrip explodes
> df2 = tbl2.to_pandas()
> {code}
> This results in the following output:
> {code}
> Schema 1 (correct):
> c1: list<item: string>
>   child 0, item: string
> c2: list<item: string>
>   child 0, item: string
> Schema 2 (wrong):
> c1: list<item: string>
>   child 0, item: string
> c2: list<item: null>
>   child 0, item: null
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
>             b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
>             b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
>             b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
>             b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
>             b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
>             b' "object", "metadata": null}, {"name": null, "field_name": "__in'
>             b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
>             b'metadata": null}], "pandas_version": "0.21.1"}'}
> ...
> > ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: null
> {code}
> That is, the array of empty lists of strings is converted into an array of 
> lists of type null, and the pandas metadata records it as lists of type 
> float64.
> If the empty lists are replaced with None when creating the record batch, 
> the roundtrip no longer raises, but the column is silently converted to a 
> plain float column in pandas (i.e. the list type is lost). This doesn't help 
> either, since other batches from the same source might have non-empty lists 
> and would end up with a different inferred schema, so they can't be 
> concatenated into a single table.
> (If this attempt at a double roundtrip seems weird, in my use case I receive 
> data from a server in RecordBatches, which I convert to pandas for 
> manipulation. I then serialize this data to disk using Arrow, and later need 
> to read it back into pandas again for further manipulation. So I need to be 
> able to go through various rounds of table->df->table->df->table etc., where 
> at any time a record batch may have columns that contain only empty lists).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
