[ https://issues.apache.org/jira/browse/ARROW-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated ARROW-1941:
----------------------------------
    Labels: pull-request-available  (was: )

> Table <–> DataFrame roundtrip failing
> -------------------------------------
>
>                 Key: ARROW-1941
>                 URL: https://issues.apache.org/jira/browse/ARROW-1941
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Thomas Buhrmann
>            Assignee: Licht Takeuchi
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> Although it is possible to create an Arrow table with a column containing
> only empty lists (cast to a particular type, e.g. string), in a roundtrip
> through pandas the original type is lost, it seems, and subsequently attempts
> to convert to pandas then fail.
> To reproduce in PyArrow 0.8.0:
> {code}
> import pyarrow as pa
> # Create table with array of empty lists, forced to have type list(string)
> arrays = {
>     'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
>     'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
> }
> rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
> tbl = pa.Table.from_batches([rb])
> print("Schema 1 (correct):\n{}".format(tbl.schema))
> # First roundtrip changes schema
> df = tbl.to_pandas()
> tbl2 = pa.Table.from_pandas(df)
> print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))
> # Second roundtrip explodes
> df2 = tbl2.to_pandas()
> {code}
> This results in the following output:
> {code}
> Schema 1 (correct):
> c1: list<item: string>
>   child 0, item: string
> c2: list<item: string>
>   child 0, item: string
>
> Schema 2 (wrong):
> c1: list<item: string>
>   child 0, item: string
> c2: list<item: null>
>   child 0, item: null
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
>             b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
>             b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
>             b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
>             b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
>             b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
>             b' "object", "metadata": null}, {"name": null, "field_name": "__in'
>             b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
>             b'metadata": null}], "pandas_version": "0.21.1"}'}
> ...
>
> ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: null
> {code}
> I.e., the array of empty lists of strings gets converted into an array of
> lists of type null, and in the pandas schema to lists of type float64.
> If one changes the empty lists to values of None in the creation of the
> record batches, the roundtrip doesn't explode, but it will silently convert
> the column to a simple column of type float (i.e. I lose the list type) in
> pandas. This doesn't help, since other batches from the same source might
> have non-empty lists and would end up with a different inferred schema, and
> so can't be concatenated into a single table.
> (If this attempt at a double roundtrip seems weird, in my use case I receive
> data from a server in RecordBatches, which I convert to pandas for
> manipulation. I then serialize this data to disk using Arrow, and later need
> to read it back into pandas again for further manipulation. So I need to be
> able to go through various rounds of table->df->table->df->table etc., where
> at any time a record batch may have columns that contain only empty lists).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
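
One way to read the report: if the element type of a list column is inferred from the values alone, a column that holds only empty lists gives the inference nothing to look at. The sketch below is a toy model of that behaviour in plain Python. It is not PyArrow's actual inference code, and `infer_item_type` is a hypothetical helper invented purely for illustration.

```python
from typing import Iterable, Optional, Sequence


def infer_item_type(column: Iterable[Optional[Sequence[object]]]) -> str:
    """Toy value-based inference of a list column's element type.

    Hypothetical helper for illustration only; not PyArrow's real logic.
    """
    seen = set()
    for cell in column:
        if cell is None:
            continue  # a missing list carries no element information
        for item in cell:
            if item is not None:
                seen.add(type(item).__name__)
    if not seen:
        return "null"  # nothing was observed, so no type can be inferred
    if len(seen) > 1:
        raise TypeError("mixed element types: {}".format(sorted(seen)))
    # Only str is mapped to a named type in this toy model.
    return {"str": "string"}.get(seen.pop(), "other")


# Same values as the columns in the report.
c1 = [["test"], ["a", "b"], None]
c2 = [[], [], []]

print("c1: list<item: {}>".format(infer_item_type(c1)))
print("c2: list<item: {}>".format(infer_item_type(c2)))
```

Under this model, two batches from the same source can disagree on the type of `c2` (null in a batch with only empty lists, string in a batch with non-empty ones), which matches the concatenation problem described in the report.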