jorisvandenbossche opened a new issue, #42026:
URL: https://github.com/apache/arrow/issues/42026

   For a binary array, converting to a numpy object dtype array with 
`to_pandas()` (e.g. when converting a table to pandas) is considerably slower 
than with `to_numpy()` (or calling `np.asarray(..)` on a pyarrow array): about 
5x in the example below. Both produce exactly the same numpy object dtype 
array (for `to_pandas` just wrapped in a pandas Series, but that should not 
add much overhead).
   
   Example:
   ```python
   import numpy as np
   import pyarrow as pa
   
   def random_ascii(length):
       return bytes(np.random.randint(65, 123, size=length, dtype='i1'))
   
   # 10 chunks of 1,000,000 random binary values, each 20 to 99 bytes long
   arr = pa.chunked_array([
       pa.array(random_ascii(i) for i in np.random.randint(20, 100, 1_000_000))
       for _ in range(10)
   ])
   ```
   
   ```
   In [60]: %timeit _ = arr.to_pandas()
   1.98 s ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
   
   In [61]: %timeit _ = arr.to_numpy(zero_copy_only=False)
   382 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   ```
   
   (noticed this in https://github.com/geopandas/geopandas/pull/3322)

