[jira] [Updated] (ARROW-2449) [Python] Efficiently serialize functions containing NumPy arrays

Antoine Pitrou (JIRA) Wed, 17 Apr 2019 08:06:23 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-2449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Antoine Pitrou updated ARROW-2449:
----------------------------------
    Priority: Minor  (was: Major)

> [Python] Efficiently serialize functions containing NumPy arrays 
> -----------------------------------------------------------------
>
>                 Key: ARROW-2449
>                 URL: https://issues.apache.org/jira/browse/ARROW-2449
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Richard Shin
>            Priority: Minor
>
> It is my understanding that pyarrow falls back to serializing functions (and 
> other complex Python objects) using cloudpickle, which means that the 
> contents of those functions are also serialized using the fallback method, 
> rather than the efficient method described in 
> [https://ray-project.github.io/2017/10/15/fast-python-serialization-with-ray-and-arrow.html.]
>  It would be good to get the benefit of fast zero-copy (de)serialization for 
> objects like NumPy arrays contained inside functions.
> {code:java}
> In [1]: import numpy as np, pyarrow as pa
> In [2]: pa.__version__
> Out[2]: '0.9.0'
> In [3]: arr = np.random.rand(10000)
> In [4]: %timeit pa.deserialize(pa.serialize(arr).to_buffer())
> The slowest run took 38.29 times longer than the fastest. This could mean 
> that an intermediate result is being cached.
> 10000 loops, best of 3: 68.7 µs per loop
> In [5]: def arr_f(): return arr
> In [6]: %timeit pa.deserialize(pa.serialize(arr_f).to_buffer())
> The slowest run took 5.89 times longer than the fastest. This could mean that 
> an intermediate result is being cached.
> 1000 loops, best of 3: 539 µs per loop
> {code}
> For comparison:
> {code:java}
> In [7]: %timeit cloudpickle.loads(cloudpickle.dumps(arr))
> 1000 loops, best of 3: 193 µs per loop
> In [8]: %timeit cloudpickle.loads(cloudpickle.dumps(arr_f))
> The slowest run took 4.02 times longer than the fastest. This could mean that 
> an intermediate result is being cached.
> 1000 loops, best of 3: 429 µs per loop
> {code}
> cc [~pcmoritz]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (ARROW-2449) [Python] Efficiently serialize functions containing NumPy arrays

Reply via email to