jorisvandenbossche opened a new issue, #42026:
URL: https://github.com/apache/arrow/issues/42026
For a binary array, converting to object dtype with `to_pandas()` (e.g. when
converting a table to pandas) is considerably slower than with `to_numpy()`
(or with `np.asarray(..)` on a pyarrow array), roughly 5x in the timings
below, even though both produce exactly the same numpy object dtype array
(for `to_pandas()` just wrapped in a pandas Series, which should not add much
overhead).
Example:
```python
import numpy as np
import pyarrow as pa
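
def random_ascii(length):
    # Random bytes with values in the range 65..122
    return bytes(np.random.randint(65, 123, size=length, dtype='i1'))

# 10 chunks of 1,000,000 binary values, each 20 to 99 bytes long
arr = pa.chunked_array([
    pa.array(random_ascii(i) for i in np.random.randint(20, 100, 1_000_000))
    for _ in range(10)
])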
```
```
In [60]: %timeit _ = arr.to_pandas()
1.98 s ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit _ = arr.to_numpy(zero_copy_only=False)
382 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
(noticed this in https://github.com/geopandas/geopandas/pull/3322)