[jira] [Created] (ARROW-2449) [Python] Efficiently serialize functions containing NumPy arrays

Richard Shin (JIRA) Wed, 11 Apr 2018 23:32:41 -0700

Richard Shin created ARROW-2449:
-----------------------------------

             Summary: [Python] Efficiently serialize functions containing NumPy arrays
                 Key: ARROW-2449
                 URL: https://issues.apache.org/jira/browse/ARROW-2449
             Project: Apache Arrow
          Issue Type: Improvement
    Affects Versions: 0.9.0
            Reporter: Richard Shin



As I understand it, pyarrow falls back to cloudpickle when serializing 
functions (and other complex Python objects). As a result, everything those 
functions contain is also serialized through the fallback path, rather than 
with the efficient method described in 
[https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html].
It would be good to get the benefit of fast zero-copy (de)serialization for 
objects like NumPy arrays contained inside functions. In the timings below, 
round-tripping a function that returns a 10000-element array takes about 
539 µs, versus about 68.7 µs for the array alone.

{code}
In [1]: import numpy as np, pyarrow as pa

In [2]: pa.__version__
Out[2]: '0.9.0'

In [3]: arr = np.random.rand(10000)

In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer())
The slowest run took 38.29 times longer than the fastest. This could mean that 
an intermediate result is being cached.
10000 loops, best of 3: 68.7 µs per loop

In [5]: def arr_f(): return arr

In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer())
The slowest run took 5.89 times longer than the fastest. This could mean that 
an intermediate result is being cached.
1000 loops, best of 3: 539 µs per loop
{code}

For comparison, using cloudpickle directly (after {{import cloudpickle}} in 
the same session):

{code}
In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr))
1000 loops, best of 3: 193 µs per loop

In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f))
The slowest run took 4.02 times longer than the fastest. This could mean that 
an intermediate result is being cached.
1000 loops, best of 3: 429 µs per loop
{code}
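
The request above amounts to keeping large array buffers out of the fallback 
pickle stream. The sketch below illustrates that idea only; it is not pyarrow's 
API or implementation. It assumes Python 3.8+ and a NumPy version that supports 
pickle protocol 5, and it uses the standard library's out-of-band buffer 
mechanism as a stand-in. The {{captured_state}} dict is a hypothetical 
placeholder for whatever a function such as {{arr_f}} carries along with it.

```python
import pickle

import numpy as np

# Hypothetical stand-in for the state a function such as arr_f carries along.
arr = np.random.rand(10000)
captured_state = {"name": "arr_f", "globals": {"arr": arr}}

# In-band: the array bytes are copied into the pickle stream itself.
in_band = pickle.dumps(captured_state, protocol=5)

# Out-of-band: large buffers are handed to buffer_callback instead of being
# copied into the stream, so the stream mostly holds metadata.
buffers = []
out_of_band = pickle.dumps(
    captured_state, protocol=5, buffer_callback=buffers.append
)

# The receiver must be given the same buffers, in the same order.
restored = pickle.loads(out_of_band, buffers=buffers)

print("in-band stream bytes:    ", len(in_band))
print("out-of-band stream bytes:", len(out_of_band))
print("buffers collected:       ", len(buffers))
```

With this split, the surrounding object graph still goes through a generic 
pickler while the array data can be transported separately (for example 
through shared memory) without an extra copy into the stream.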



