[jira] [Created] (ARROW-1941) Table <–> DataFrame roundtrip failing

Thomas Buhrmann (JIRA) Wed, 20 Dec 2017 01:56:04 -0800

Thomas Buhrmann created ARROW-1941:
--------------------------------------

             Summary: Table <–> DataFrame roundtrip failing
                 Key: ARROW-1941
                 URL: https://issues.apache.org/jira/browse/ARROW-1941
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
            Reporter: Thomas Buhrmann



Although it is possible to create an Arrow table with a column containing only 
empty lists (cast to a particular type, e.g. string), the original type appears 
to be lost in a roundtrip through pandas, and a subsequent attempt to convert 
the resulting table to pandas then fails.

To reproduce in PyArrow 0.8.0:

{code}
import pyarrow as pa

# Create table with array of empty lists, forced to have type list(string)
arrays = {
    'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
    'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
}
rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
tbl = pa.Table.from_batches([rb])
print("Schema 1 (correct):\n{}".format(tbl.schema))

# First roundtrip changes schema
df = tbl.to_pandas()
tbl2 = pa.Table.from_pandas(df)
print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))

# Second roundtrip explodes
df2 = tbl2.to_pandas()
{code}

This results in the following output:

{code}
Schema 1 (correct):
c1: list<item: string>
  child 0, item: string
c2: list<item: string>
  child 0, item: string

Schema 2 (wrong):
c1: list<item: string>
  child 0, item: string
c2: list<item: null>
  child 0, item: null
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
            b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
            b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
            b' "object", "metadata": null}, {"name": null, "field_name": "__in'
            b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
            b'metadata": null}], "pandas_version": "0.21.1"}'}

...

> ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: 
> null
{code}

I.e., the array of empty lists of strings gets converted into an array of lists 
of type null, and the pandas metadata records it as list[float64].

If one changes the empty lists to None when creating the record batch, the 
roundtrip doesn't explode, but the column is silently converted to a plain 
float column in pandas (i.e. the list type is lost). This doesn't help, since 
other batches from the same source might have non-empty lists and would end up 
with a different inferred schema, and so can't be concatenated into a single 
table.

(If this attempt at a double roundtrip seems weird, in my use case I receive 
data from a server in RecordBatches, which I convert to pandas for 
manipulation. I then serialize this data to disk using Arrow, and later need to 
read it back into pandas again for further manipulation. So I need to be able 
to go through various rounds of table->df->table->df->table etc., where at any 
time a record batch may have columns that contain only empty lists).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
