[ 
https://issues.apache.org/jira/browse/ARROW-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266194#comment-16266194
 ] 

ASF GitHub Bot commented on ARROW-1854:
---------------------------------------

pcmoritz commented on issue #1360: ARROW-1854: [Python] Use pickle to serialize 
numpy arrays of objects.
URL: https://github.com/apache/arrow/pull/1360#issuecomment-347040022
 
 
   That's a very nice speedup!
   
   However, I'd vote against merging this for now. Making numpy arrays of 
custom objects fast does not seem that crucial, and it is better to stay 
general. Also, I'd like to fall back to pickle as little as possible so that 
we can transfer data between languages easily in the future.
   
   If somebody finds this too slow, we should of course reinvestigate. There 
are some speedups possible even without falling back to pickle (like getting 
rid of the temporary copy of the list and using an iterator instead).
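
   To illustrate the iterator idea, here is a minimal sketch; it is not 
Arrow's actual serialization code. `encode_element` is a hypothetical stand-in 
for whatever per-element serialization is used. The only point is that 
`arr.flat` walks the elements lazily, whereas `arr.tolist()` first builds a 
temporary list.

```python
import pickle
import numpy as np


def encode_element(obj):
    # Hypothetical per-element encoder; stands in for real serialization.
    return pickle.dumps(obj)


def encode_with_copy(arr):
    # Materializes a temporary Python list holding every element first.
    return [encode_element(x) for x in arr.tolist()]


def encode_with_iterator(arr):
    # arr.flat yields the elements one at a time; no intermediate list.
    return [encode_element(x) for x in arr.flat]


arr = np.array(['foo', 'bar', None] * 4, dtype=object)
# Both paths produce the same encoded elements.
assert encode_with_copy(arr) == encode_with_iterator(arr)
```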

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Improve performance of serializing object dtype ndarrays
> -----------------------------------------------------------------
>
>                 Key: ARROW-1854
>                 URL: https://issues.apache.org/jira/browse/ARROW-1854
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> I haven't looked carefully at the hot path for this, but I would expect these 
> statements to have roughly the same performance (offloading the ndarray 
> serialization to pickle).
> {code}
> In [1]: import pickle
> In [2]: import numpy as np
> In [3]: import pyarrow as pa
> In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
> In [5]: timeit serialized = pa.serialize(arr).to_buffer()
> 10 loops, best of 3: 27.1 ms per loop
> In [6]: timeit pickled = pickle.dumps(arr)
> 100 loops, best of 3: 6.03 ms per loop
> {code}
> [~robertnishihara] [~pcmoritz] I encountered this while working on 
> ARROW-1783, but it can likely be resolved independently.
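
For reference, the pickle side of the comparison round-trips an object-dtype 
array as in the sketch below. It uses only `pickle` and NumPy; the 
`pa.serialize` call is left out so the snippet has no pyarrow dependency, and 
the array is smaller than the one in the report, so it says nothing about 
timings.

```python
import pickle
import numpy as np

# Same element pattern as in the report, with fewer repetitions.
arr = np.array(['foo', 'bar', None] * 1000, dtype=object)

# Serialize the whole ndarray in one call, then restore it.
payload = pickle.dumps(arr)
restored = pickle.loads(payload)

# The round trip preserves dtype, shape, and contents.
assert restored.dtype == object
assert restored.shape == arr.shape
assert restored.tolist() == arr.tolist()
```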



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
