[ https://issues.apache.org/jira/browse/ARROW-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Krisztian Szucs updated ARROW-10739: ------------------------------------ Fix Version/s: 10.0.0 > [Python] Pickling a sliced array serializes all the buffers > ----------------------------------------------------------- > > Key: ARROW-10739 > URL: https://issues.apache.org/jira/browse/ARROW-10739 > Project: Apache Arrow > Issue Type: Bug > Components: Python > Reporter: Maarten Breddels > Assignee: Alessandro Molina > Priority: Critical > Fix For: 10.0.0 > > > If a large array is sliced, and pickled, it seems the full buffer is > serialized, this leads to excessive memory usage and data transfer when using > multiprocessing or dask. > {code:java} > >>> import pyarrow as pa > >>> ar = pa.array(['foo'] * 100_000) > >>> ar.nbytes > 700004 > >>> import pickle > >>> len(pickle.dumps(ar.slice(10, 1))) > 700165 > NumPy for instance > >>> import numpy as np > >>> ar_np = np.array(ar) > >>> ar_np > array(['foo', 'foo', 'foo', ..., 'foo', 'foo', 'foo'], dtype=object) > >>> import pickle > >>> len(pickle.dumps(ar_np[10:11])) > 165{code} > I think this makes sense if you know arrow, but kind of unexpected as a user. > Is there a workaround for this? For instance copy an arrow array to get rid > of the offset, and trim the buffers? -- This message was sent by Atlassian Jira (v8.20.10#820010)