[ https://issues.apache.org/jira/browse/ARROW-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Balanca updated ARROW-11006:
---------------------------------
    Description:
The `to_numpy` method is quite slow compared to NumPy slicing and viewing. For instance:
{code:python}
import numpy as np
import pyarrow as pa

N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

# Baseline: create N NumPy views of the same array (IPython %timeit).
%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Same loop through PyArrow's zero-copy conversion.
%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The benchmark above is clearly an extreme case, but the idea is that for any operation not available in PyArrow, falling back on NumPy is a good option, and the cost of extracting the view should be as small as possible (there are scenarios where you can't easily cache this view, so you end up calling `to_numpy` a fair number of times).

I believe that a big part of this overhead is due to PyArrow implementing a very generic Pandas conversion, and using it even for very simple NumPy-like dense arrays.

There are a lot of projects built on PyArrow <=> NumPy interaction, and I think most of them would prefer not to pay any additional cost for Pandas compatibility. In this particular case, it could be valuable to implement a direct NumPy conversion method for some Array subclasses (starting with the simple `NumericArray`).

  was:
The `to_numpy` method is quite slow compared to NumPy slicing and viewing. For instance:
{code:python}
import numpy as np
import pyarrow as pa

N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The benchmark above is clearly an extreme case, but the idea is that for any operation not available in PyArrow, falling back on NumPy is a good option, and the cost of extracting the view should be as small as possible (there are scenarios where you can't easily cache this view, so you end up calling `to_numpy` a fair number of times).

I believe that part of this overhead is probably due to PyArrow implementing a very generic Pandas conversion, and using it even for very simple NumPy-like dense arrays.


> [Python] Array to_numpy slow compared to Numpy.view
> ---------------------------------------------------
>
>                 Key: ARROW-11006
>                 URL: https://issues.apache.org/jira/browse/ARROW-11006
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Paul Balanca
>            Assignee: Paul Balanca
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)