Thomas Buhrmann created ARROW-1941:
--------------------------------------

             Summary: Table <-> DataFrame roundtrip failing
                 Key: ARROW-1941
                 URL: https://issues.apache.org/jira/browse/ARROW-1941
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
            Reporter: Thomas Buhrmann

Although it is possible to create an Arrow table with a column containing only empty lists (cast to a particular type, e.g. list of string), the original type appears to be lost in a roundtrip through pandas, and a subsequent attempt to convert back to pandas then fails.

To reproduce in PyArrow 0.8.0:

{code}
import pyarrow as pa

# Create table with array of empty lists, forced to have type list(string)
arrays = {
    'c1': pa.array([["test"], ["a", "b"], None], type=pa.list_(pa.string())),
    'c2': pa.array([[], [], []], type=pa.list_(pa.string())),
}
rb = pa.RecordBatch.from_arrays(list(arrays.values()), list(arrays.keys()))
tbl = pa.Table.from_batches([rb])
print("Schema 1 (correct):\n{}".format(tbl.schema))

# First roundtrip changes schema
df = tbl.to_pandas()
tbl2 = pa.Table.from_pandas(df)
print("\nSchema 2 (wrong):\n{}".format(tbl2.schema))

# Second roundtrip explodes
df2 = tbl2.to_pandas()
{code}

This results in the following output:

{code}
Schema 1 (correct):
c1: list<item: string>
  child 0, item: string
c2: list<item: string>
  child 0, item: string

Schema 2 (wrong):
c1: list<item: string>
  child 0, item: string
c2: list<item: null>
  child 0, item: null
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
            b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
            b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
            b' "object", "metadata": null}, {"name": null, "field_name": "__in'
            b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
            b'metadata": null}], "pandas_version": "0.21.1"}'}
...
> ArrowNotImplementedError: Not implemented type for list in DataFrameBlock: null
{code}

I.e., the array of empty lists of strings is converted into an array of lists of type null, and the pandas metadata records it as lists of type float64.

If one replaces the empty lists with None when creating the record batch, the roundtrip no longer fails, but the column is silently converted to a plain column of type float in pandas (i.e. the list type is lost). This doesn't help, since other batches from the same source might contain non-empty lists and would end up with a different inferred schema, so they can't be concatenated into a single table.

(If this attempt at a double roundtrip seems odd: in my use case I receive data from a server as RecordBatches, which I convert to pandas for manipulation. I then serialize this data to disk using Arrow, and later need to read it back into pandas for further manipulation. So I need to be able to go through several rounds of table->df->table->df->table etc., where at any time a record batch may have columns that contain only empty lists.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)