Giora Simchoni created ARROW-6607:
-------------------------------------

             Summary: Support for set/list columns in python
                 Key: ARROW-6607
                 URL: https://issues.apache.org/jira/browse/ARROW-6607
             Project: Apache Arrow
          Issue Type: Wish
          Components: Python
         Environment: python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10
            Reporter: Giora Simchoni

Hi,

Using python 3.6.7, pandas 0.24.2, pyarrow 0.14.1 on WSL in Windows 10...

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [set([1,2]), set([2,3]), set([3,4,5])]})
df.to_feather('test.ft')
```

I get:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2131, in to_feather
    to_feather(self, fname)
  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
    feather.write_feather(df, path)
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
    writer.write(df)
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 93, in write
    table = Table.from_pandas(df, preserve_index=False)
  File "pyarrow/table.pxi", line 1174, in pyarrow.lib.Table.from_pandas
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 496, in dataframe_to_arrays
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 496, in <listcomp>
    for c, f in zip(columns_to_convert, convert_fields)]
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 487, in convert_column
    raise e
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 481, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 191, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 78, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ('Could not convert {1, 2} with type set: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column b with type object')
```

And obviously `df.drop('b', axis=1).to_feather('test.ft')` works.

Questions:

(1) Is it possible to support these kinds of set/list columns?

(2) Does anyone have an idea how to deal with this? I *cannot* unnest these set/list columns, as this would explode the DataFrame. My only other idea is to convert the set `{1,2}` into a string `1,2` and parse it after reading the file, and hope it won't be slow.

Update:

With a list column the error is different:

```python
import pandas as pd

df = pd.DataFrame({'a': [1,2,3], 'b': [[1,2], [2,3], [3,4,5]]})
df.to_feather('test.ft')
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2131, in to_feather
    to_feather(self, fname)
  File "/home/gioras/.local/lib/python3.6/site-packages/pandas/io/feather_format.py", line 83, in to_feather
    feather.write_feather(df, path)
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 182, in write_feather
    writer.write(df)
  File "/home/gioras/.local/lib/python3.6/site-packages/pyarrow/feather.py", line 97, in write
    self.writer.write_array(name, col.data.chunk(0))
  File "pyarrow/feather.pxi", line 67, in pyarrow.lib.FeatherWriter.write_array
  File "pyarrow/error.pxi", line 93, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: list<item: int64>
```

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
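
For reference, a minimal sketch of the string workaround from question (2), written in plain Python so it runs without pandas or pyarrow. The helper names `encode_int_set` and `decode_int_set` are hypothetical, and the sketch assumes every element is an integer; it is not a fix for the underlying conversion error, and I have not measured its speed.

```python
from typing import Set


def encode_int_set(values: Set[int]) -> str:
    """Serialize a set of ints as a sorted, comma-separated string."""
    # Sorting makes the output deterministic, since sets are unordered.
    return ",".join(str(v) for v in sorted(values))


def decode_int_set(text: str) -> Set[int]:
    """Parse a string produced by encode_int_set back into a set of ints."""
    # An empty string stands for the empty set; "".split(",") would yield [""].
    if not text:
        return set()
    return {int(part) for part in text.split(",")}


# Same values as column 'b' in the report.
column = [{1, 2}, {2, 3}, {3, 4, 5}]
encoded = [encode_int_set(s) for s in column]   # plain strings
decoded = [decode_int_set(t) for t in encoded]  # sets again
```

On a DataFrame the same helpers could be applied with `df['b'].map(encode_int_set)` before `to_feather`, and with `.map(decode_int_set)` on the column after reading the file back.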