[ https://issues.apache.org/jira/browse/ARROW-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-2449: ---------------------------------- Priority: Minor (was: Major) > [Python] Efficiently serialize functions containing NumPy arrays > ----------------------------------------------------------------- > > Key: ARROW-2449 > URL: https://issues.apache.org/jira/browse/ARROW-2449 > Project: Apache Arrow > Issue Type: Improvement > Components: Python > Affects Versions: 0.9.0 > Reporter: Richard Shin > Priority: Minor > > It is my understanding that pyarrow falls back to serializing functions (and > other complex Python objects) using cloudpickle, which means that the > contents of those functions are also serialized using the fallback method, > rather than the efficient method described in > [https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.] > It would be good to get the benefit of fast zero-copy (de)serialization for > objects like NumPy arrays contained inside functions. > {code:java} > In [1]: import numpy as np, pyarrow as pa > In [2]: pa.__version__ > Out[2]: '0.9.0' > In [3]: arr = np.random.rand(10000) > In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer()) > The slowest run took 38.29 times longer than the fastest. This could mean > that an intermediate result is being cached. > 10000 loops, best of 3: 68.7 µs per loop > In [5]: def arr_f(): return arr > In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer()) > The slowest run took 5.89 times longer than the fastest. This could mean that > an intermediate result is being cached. > 1000 loops, best of 3: 539 µs per loop > {code} > For comparison: > {code:java} > In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr)) > 1000 loops, best of 3: 193 µs per loop > In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f)) > The slowest run took 4.02 times longer than the fastest. This could mean that > an intermediate result is being cached. > 1000 loops, best of 3: 429 µs per loop > {code} > cc [~pcmoritz] -- This message was sent by Atlassian JIRA (v7.6.3#76005)