[ https://issues.apache.org/jira/browse/ARROW-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs resolved ARROW-9976.
------------------------------------
    Resolution: Fixed

> [Python] ArrowCapacityError when doing Table.from_pandas with large dataframe
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-9976
>                 URL: https://issues.apache.org/jira/browse/ARROW-9976
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: quentin lhoest
>            Assignee: Krisztian Szucs
>            Priority: Minor
>
> When calling Table.from_pandas() on a large dataframe that has a column of
> vectors (np.array), an `ArrowCapacityError` is raised.
> To reproduce:
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> n = 1713614
> df = pd.DataFrame.from_dict({"a": list(np.zeros((n, 128))), "b": range(n)})
> pa.Table.from_pandas(df)
> {code}
> With a smaller n it works.
> Error raised:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowCapacityError                        Traceback (most recent call last)
> <ipython-input-7-1a7b68a179a0> in <module>
> ----> 1 _ = pa.Table.from_pandas(df)
>
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
>
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
>     591     for i, maybe_fut in enumerate(arrays):
>     592         if isinstance(maybe_fut, futures.Future):
> --> 593             arrays[i] = maybe_fut.result()
>     594
>     595     types = [x.type for x in arrays]
>
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
>     423                 raise CancelledError()
>     424             elif self._state == FINISHED:
> --> 425                 return self.__get_result()
>     426
>     427             self._condition.wait(timeout)
>
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
>     382     def __get_result(self):
>     383         if self._exception:
> --> 384             raise self._exception
>     385         else:
>     386             return self._result
>
> ~/.pyenv/versions/3.7.2/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py in run(self)
>      55
>      56         try:
> ---> 57             result = self.fn(*self.args, **self.kwargs)
>      58         except BaseException as exc:
>      59             self.future.set_exception(exc)
>
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, field)
>     557
>     558     try:
> --> 559         result = pa.array(col, type=type_, from_pandas=True, safe=safe)
>     560     except (pa.ArrowInvalid,
>     561             pa.ArrowNotImplementedError,
>
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
>
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
>
> ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
>
> ArrowCapacityError: List array cannot contain more than 2147483646 child elements, have 2147483648
> {noformat}
> I guess one needs to chunk the data before creating the arrays?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)