[jira] [Updated] (ARROW-11006) [Python] Array to_numpy slow compared to Numpy.view

Paul Balanca (Jira) Tue, 22 Dec 2020 08:02:13 -0800


     [ https://issues.apache.org/jira/browse/ARROW-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Paul Balanca updated ARROW-11006:
---------------------------------
    Description: 
The `to_numpy` method is quite slow compared to NumPy slicing and viewing. 
For instance:
{code:python}
import numpy as np
import pyarrow as pa

N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The benchmark above is clearly an extreme case, but the idea is that for any 
operation not available in PyArrow, falling back on NumPy is a good option, and 
the cost of extracting the data should be as small as possible (there are 
scenarios where you can't easily cache this view, so you end up calling 
`to_numpy` fairly often).

I believe that a big part of this overhead is due to PyArrow implementing a 
very generic Pandas conversion and using it even for very simple NumPy-like 
dense arrays.

There are a lot of projects built on PyArrow <=> NumPy interaction, and I think 
most of them would prefer not to pay any additional cost for Pandas 
compatibility. In this particular case, it could be valuable to implement a 
direct NumPy conversion method for some Array subclasses (starting with the 
simple `NumericArray`).


  was:
The `to_numpy` method is quite slow compared to NumPy slicing and viewing. 
For instance:
{code:python}
import numpy as np
import pyarrow as pa

N = 1000000
np_arr = np.arange(N)
pa_arr = pa.array(np_arr)

%timeit l = [np_arr.view() for _ in range(N)]
251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
{code}
The benchmark above is clearly an extreme case, but the idea is that for any 
operation not available in PyArrow, falling back on NumPy is a good option, and 
the cost of extracting the data should be as small as possible (there are 
scenarios where you can't easily cache this view, so you end up calling 
`to_numpy` fairly often).

I believe that part of this overhead is probably due to PyArrow implementing a 
very generic Pandas conversion and using it even for very simple NumPy-like 
dense arrays.


> [Python] Array to_numpy slow compared to Numpy.view
> ---------------------------------------------------
>
>                 Key: ARROW-11006
>                 URL: https://issues.apache.org/jira/browse/ARROW-11006
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Paul Balanca
>            Assignee: Paul Balanca
>            Priority: Minor
>
> The `to_numpy` method is quite slow compared to NumPy slicing and viewing. 
> For instance:
> {code:python}
> import numpy as np
> import pyarrow as pa
> N = 1000000
> np_arr = np.arange(N)
> pa_arr = pa.array(np_arr)
> %timeit l = [np_arr.view() for _ in range(N)]
> 251 ms ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %timeit l = [pa_arr.to_numpy(zero_copy_only=True) for _ in range(N)]
> 1.2 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> {code}
> The benchmark above is clearly an extreme case, but the idea is that for any 
> operation not available in PyArrow, falling back on NumPy is a good option, 
> and the cost of extracting the data should be as small as possible (there are 
> scenarios where you can't easily cache this view, so you end up calling 
> `to_numpy` fairly often).
> I believe that a big part of this overhead is due to PyArrow implementing a 
> very generic Pandas conversion and using it even for very simple NumPy-like 
> dense arrays.
> There are a lot of projects built on PyArrow <=> NumPy interaction, and I 
> think most of them would prefer not to pay any additional cost for Pandas 
> compatibility. In this particular case, it could be valuable to implement a 
> direct NumPy conversion method for some Array subclasses (starting with the 
> simple `NumericArray`).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
