[I] [Python] Large performance difference in conversion of binary array to object dtype array in to_pandas vs to_numpy [arrow]

via GitHub Fri, 07 Jun 2024 02:10:33 -0700


jorisvandenbossche opened a new issue, #42026:
URL: https://github.com/apache/arrow/issues/42026


   For a binary array, converting to an object dtype array with `to_pandas()` 
(e.g. when converting a table to pandas) is considerably slower than with 
`to_numpy()` (or calling `np.asarray(..)` on a pyarrow array), even though both 
produce exactly the same numpy object dtype array (for `to_pandas` it is just 
wrapped in a pandas Series, which should not add much overhead).
   
   Example:
   ```python
   import numpy as np
   import pyarrow as pa
   
   def random_ascii(length):
       return bytes(np.random.randint(65, 123, size=length, dtype='i1'))
   
   # 10 chunks of 1,000,000 random byte strings, each 20 to 99 bytes long
   arr = pa.chunked_array([
       pa.array(random_ascii(i) for i in np.random.randint(20, 100, 1_000_000))
       for _ in range(10)
   ])
   ```
   
   ```
   In [60]: %timeit _ = arr.to_pandas()
   1.98 s ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   In [61]: %timeit _ = arr.to_numpy(zero_copy_only=False)
   382 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
   
   (noticed this in https://github.com/geopandas/geopandas/pull/3322)


