On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <zachary.pin...@yale.edu> wrote:
> Hello all,
>
> As a result of the "fast greyscale conversion" thread, I noticed an anomaly
> with numpy.ndarray.sum(): summing along certain axes is much slower with
> sum() than doing it explicitly, but only with integer dtypes and when the
> size of the dtype is less than the machine word. I checked in 32-bit and
> 64-bit modes, and in both cases the speed difference only went away once
> the dtype got that large. See below...
>
> Is this something to do with numpy, or something inexorable about machine /
> memory architecture?
>
> Zach
>
> Timings -- 64-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 2.57 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.75 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 6.37 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 100 loops, best of 3: 16.6 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 15.1 ms per loop
>
> Timings -- 32-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 138 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 3.68 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 140 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.17 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 22.4 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 12.2 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 29.2 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 10 loops, best of 3: 23.8 ms per loop
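For anyone who wants to check the quoted numbers on their own machine, the same benchmark can be run outside IPython with the stdlib timeit module; a minimal sketch (absolute times will of course differ by hardware and NumPy version):

```python
import timeit
import numpy as np

# Same setup as the quoted int8 case.
i = np.ones((1024, 1024, 4), np.int8)

# Time per call, in seconds, taking the best of three repeats as
# %timeit does.
t_sum = min(timeit.repeat(lambda: i.sum(axis=-1),
                          number=10, repeat=3)) / 10
t_add = min(timeit.repeat(
    lambda: i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3],
    number=10, repeat=3)) / 10

print("i.sum(axis=-1): %.2f ms" % (t_sum * 1e3))
print("explicit adds:  %.2f ms" % (t_add * 1e3))

# Sanity check: both approaches compute the same values here
# (all-ones input, so no overflow is possible).
assert (i.sum(axis=-1) == i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]).all()
```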
One difference is that i.sum() changes the output dtype of int input when the
int dtype is less than the default int dtype:

>> i.dtype
dtype('int32')
>> i.sum(axis=-1).dtype
dtype('int64')   # <-- dtype changed
>> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
dtype('int32')

Here are my timings:

>> i = numpy.ones((1024,1024,4), numpy.int32)
>> timeit i.sum(axis=-1)
1 loops, best of 3: 278 ms per loop
>> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
100 loops, best of 3: 12.1 ms per loop
>> import bottleneck as bn
>> timeit bn.func.nansum_3d_int32_axis2(i)
100 loops, best of 3: 8.27 ms per loop

Does making an extra copy of the input explain all of the speed difference
(is this what np.sum does internally?):

>> timeit i.astype(numpy.int64)
10 loops, best of 3: 29.2 ms per loop

No. Initializing the output also adds some time:

>> timeit np.empty((1024,1024,4), dtype=np.int32)
100000 loops, best of 3: 2.67 us per loop
>> timeit np.empty((1024,1024,4), dtype=np.int64)
100000 loops, best of 3: 12.8 us per loop

Switching back and forth between the input and output arrays also costs more
memory bandwidth with int64 than with int32.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
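If the upcast accumulator is indeed part of the cost, note that ndarray.sum() accepts a dtype argument, so the upcast can be suppressed explicitly. A quick sketch of the three dtype behaviors discussed above (the default upcast target depends on the platform's default int, so it is only printed, not asserted):

```python
import numpy as np

i = np.ones((1024, 1024, 4), np.int32)

# Default: sum() upcasts small int dtypes to the platform default int
# accumulator (int64 on most 64-bit builds), as observed above.
print(i.sum(axis=-1).dtype)

# Explicit adds keep the input dtype.
print((i[..., 0] + i[..., 1] + i[..., 2] + i[..., 3]).dtype)  # int32

# sum() can be told to keep the input dtype via its dtype argument,
# at the cost of possible overflow for large sums.
print(i.sum(axis=-1, dtype=np.int32).dtype)  # int32
```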