On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <zachary.pin...@yale.edu> wrote:
> Hello all,
>
> As a result of the "fast greyscale conversion" thread, I noticed an anomaly 
> with numpy.ndarray.sum(): summing along certain axes is much slower with 
> sum() than doing it explicitly, but only with integer dtypes, and only when 
> the size of the dtype is less than the machine word. I checked in 32-bit and 
> 64-bit modes, and in both cases the speed difference went away only once 
> the dtype got that large. See below...
>
> Is this something to do with numpy, or something inherent to machine / 
> memory architecture?
>
> Zach
>
> Timings -- 64-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 2.57 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.75 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 6.37 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 100 loops, best of 3: 16.6 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 15.1 ms per loop
>
>
>
> Timings -- 32-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 138 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 3.68 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 140 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.17 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 22.4 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 12.2 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 29.2 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 10 loops, best of 3: 23.8 ms per loop

One difference is that i.sum() upcasts the output dtype when the input's
integer dtype is smaller than the default integer dtype:

    >> i.dtype
       dtype('int32')
    >> i.sum(axis=-1).dtype
       dtype('int64') #  <-- dtype changed
    >> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
       dtype('int32')
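The upcast can be avoided by passing sum()'s dtype argument explicitly. A minimal sketch of both behaviors (output dtypes assume a 64-bit platform where the default integer is int64):

```python
import numpy as np

i = np.ones((1024, 1024, 4), np.int32)

# Default behavior: integer inputs smaller than the platform default int
# are accumulated and returned in the default int (int64 on most 64-bit
# platforms), which is why the output dtype changes above.
upcast = i.sum(axis=-1)

# Passing dtype= pins the accumulator and output to int32, trading the
# upcast (and its extra memory traffic) for a risk of overflow.
pinned = i.sum(axis=-1, dtype=np.int32)

print(upcast.dtype, pinned.dtype)
```

For values this small the two results are identical; the dtype keyword only matters when the totals would overflow int32.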

Here are my timings:

    >> i = numpy.ones((1024,1024,4), numpy.int32)
    >> timeit i.sum(axis=-1)
    1 loops, best of 3: 278 ms per loop
    >> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
    100 loops, best of 3: 12.1 ms per loop
    >> import bottleneck as bn
    >> timeit bn.func.nansum_3d_int32_axis2(i)
    100 loops, best of 3: 8.27 ms per loop

Does making an extra copy of the input explain all of the speed
difference? (Is this what np.sum does internally?)

    >> timeit i.astype(numpy.int64)
    10 loops, best of 3: 29.2 ms per loop

No.
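The comparison above can be reproduced with the timeit module. This is only a rough check of the copy hypothesis: if sum() merely did an upcasting copy plus a same-dtype reduction, the copy alone should account for most of its time. Absolute numbers are machine- and NumPy-version-dependent.

```python
import timeit
import numpy as np

i = np.ones((1024, 1024, 4), np.int32)

# Time the int64 copy and the full sum separately; each is the best of
# three repeats of ten runs, reported per-run.
t_copy = min(timeit.repeat(lambda: i.astype(np.int64), number=10, repeat=3)) / 10
t_sum = min(timeit.repeat(lambda: i.sum(axis=-1), number=10, repeat=3)) / 10
print(f"astype copy: {t_copy * 1e3:.2f} ms, sum: {t_sum * 1e3:.2f} ms")

# Sanity check: copying first then reducing gives the same answer.
assert np.array_equal(i.astype(np.int64).sum(axis=-1), i.sum(axis=-1))
```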

Initializing the output also adds some time:

    >> timeit np.empty((1024,1024,4), dtype=np.int32)
    100000 loops, best of 3: 2.67 us per loop
    >> timeit np.empty((1024,1024,4), dtype=np.int64)
    100000 loops, best of 3: 12.8 us per loop

Moving back and forth between the input and output arrays also costs more
memory-bandwidth time with int64 arrays than with int32.
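The memory-traffic point is easy to make concrete: an int64 array with the same element count is exactly twice the bytes of the int32 version, so every pass over it moves twice as much data.

```python
import numpy as np

# Same shape and element count, twice the bytes per element.
i32 = np.ones((1024, 1024, 4), np.int32)
i64 = np.ones((1024, 1024, 4), np.int64)
print(i32.nbytes // 2**20, "MiB vs", i64.nbytes // 2**20, "MiB")  # 16 MiB vs 32 MiB
```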
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
