I agree that documenting this better would be useful to many people. So if someone wants to summarize this and put it in the docs, I think many people will appreciate it.
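As a possible starting point for such a summary, here is a minimal sketch of the packed vs. aligned distinction, reusing the ('a', 'i1'), ('b', 'i8') layout from the timings quoted below. The variable names are only illustrative, and the aligned itemsize depends on the platform ABI, so treat the numbers it prints as machine-specific. Note that the keyword on `np.dtype` is `align`.

```python
import numpy as np

# Same two-field record layout as in the quoted timings.
fields = [('a', 'i1'), ('b', 'i8')]
packed_dt = np.dtype(fields, align=False)   # NumPy's default: no padding
aligned_dt = np.dtype(fields, align=True)   # pad fields like a C struct would

# Packed records are 1 + 8 = 9 bytes, with 'b' starting at byte offset 1.
print("packed :", packed_dt.itemsize, "bytes, offset of b =", packed_dt.fields['b'][1])
# Aligned records are larger (often 16 bytes on x86-64; platform dependent).
print("aligned:", aligned_dt.itemsize, "bytes, offset of b =", aligned_dt.fields['b'][1])

packed_arr = np.ones(10**6, dtype=packed_dt)
aligned_arr = np.ones(10**6, dtype=aligned_dt)

# The 'aligned' flag of a field view tells you which case you are in.
print("packed  column aligned?", packed_arr['b'].flags.aligned)
print("aligned column aligned?", aligned_arr['b'].flags.aligned)

# One way to get aligned data back: copy just the column ...
b_copy = packed_arr['b'].copy()
# ... or convert the whole array to the aligned dtype.
converted = packed_arr.astype(aligned_dt)
print("copy aligned?", b_copy.flags.aligned, "| converted aligned?", converted['b'].flags.aligned)
```

Whether the copy pays off depends on how many times the column is reused afterwards, so it is worth timing both variants on the target machine.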
Fred

On Thu, Mar 7, 2013 at 10:28 PM, Kurt Smith <kwmsm...@gmail.com> wrote:
> On Thu, Mar 7, 2013 at 12:26 PM, Frédéric Bastien <no...@nouiz.org> wrote:
>> Hi,
>>
>> It is normal that unaligned accesses are slower. The hardware has been
>> optimized for aligned access, so this is a user choice: space vs. speed.
>
> The quantitative difference is still important, so this thread is
> useful for future reference, I think. If reading data into a
> packed array is 3x faster than reading into an aligned array, but the
> core computation is 4x slower with a packed array...you get the idea.
>
> I would have benefitted years ago from knowing (1) numpy structured dtypes
> are packed by default, and (2) computations with unaligned data can be
> several times slower than with aligned data. That's strong motivation to
> always make sure I'm using 'align=True' except when memory usage is
> an issue, or for file IO with packed binary data, etc.
>
>> We can't get around that. We can only minimize the cost of unaligned
>> access in some cases, but not all, and those optimizations depend on the
>> CPU. Newer CPUs have lowered the cost of unaligned access, though.
>>
>> I'm surprised that Theano worked with the unaligned input. I added
>> some checks to make this raise an error, as we do not support that!
>> Francesc, can you check whether Theano gives the correct result? It is
>> possible that someone (maybe me) just copies the input to an aligned
>> ndarray when we receive an unaligned one. That could explain why it
>> worked, but my memory tells me that we raise an error.
>>
>> As you saw in the numbers, this is a bad example for Theano, as the
>> compiled function is too fast. There is more Theano overhead than
>> computation time in that example. We have recently reduced the
>> overhead, but we can do more to lower it.
>>
>> Fred
>>
>> On Thu, Mar 7, 2013 at 1:06 PM, Francesc Alted <franc...@continuum.io> wrote:
>>> On 3/7/13 6:47 PM, Francesc Alted wrote:
>>>> On 3/6/13 7:42 PM, Kurt Smith wrote:
>>>>> And regarding performance, doing simple timings shows a 30%-ish
>>>>> slowdown for unaligned operations:
>>>>>
>>>>> In [36]: %timeit packed_arr['b']**2
>>>>> 100 loops, best of 3: 2.48 ms per loop
>>>>>
>>>>> In [37]: %timeit aligned_arr['b']**2
>>>>> 1000 loops, best of 3: 1.9 ms per loop
>>>>
>>>> Hmm, that clearly depends on the architecture. On my machine:
>>>>
>>>> In [1]: import numpy as np
>>>>
>>>> In [2]: aligned_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=True)
>>>>
>>>> In [3]: packed_dt = np.dtype([('a', 'i1'), ('b', 'i8')], align=False)
>>>>
>>>> In [4]: aligned_arr = np.ones((10**6,), dtype=aligned_dt)
>>>>
>>>> In [5]: packed_arr = np.ones((10**6,), dtype=packed_dt)
>>>>
>>>> In [6]: baligned = aligned_arr['b']
>>>>
>>>> In [7]: bpacked = packed_arr['b']
>>>>
>>>> In [8]: %timeit baligned**2
>>>> 1000 loops, best of 3: 1.96 ms per loop
>>>>
>>>> In [9]: %timeit bpacked**2
>>>> 100 loops, best of 3: 7.84 ms per loop
>>>>
>>>> That is, the unaligned column is 4x slower (!). numexpr allows
>>>> somewhat better results:
>>>>
>>>> In [11]: %timeit numexpr.evaluate('baligned**2')
>>>> 1000 loops, best of 3: 1.13 ms per loop
>>>>
>>>> In [12]: %timeit numexpr.evaluate('bpacked**2')
>>>> 1000 loops, best of 3: 865 us per loop
>>>
>>> Just for completeness, here it is what Theano gets:
>>>
>>> In [18]: import theano
>>>
>>> In [20]: a = theano.tensor.vector()
>>>
>>> In [22]: f = theano.function([a], a**2)
>>>
>>> In [23]: %timeit f(baligned)
>>> 100 loops, best of 3: 7.74 ms per loop
>>>
>>> In [24]: %timeit f(bpacked)
>>> 100 loops, best of 3: 12.6 ms per loop
>>>
>>> So yeah, Theano is also slower for the unaligned case (but less than 2x
>>> in this case).
>>>
>>>>
>>>> Yes, in this case, the unaligned array goes faster (as much as 30%).
>>>> I think the reason is that numexpr optimizes the unaligned access by
>>>> doing a copy of the different chunks in internal buffers that fits in
>>>> L1 cache. Apparently this is very beneficial in this case (not sure
>>>> why, though).
>>>>
>>>>>
>>>>> Whereas summing shows just a 10%-ish slowdown:
>>>>>
>>>>> In [38]: %timeit packed_arr['b'].sum()
>>>>> 1000 loops, best of 3: 1.29 ms per loop
>>>>>
>>>>> In [39]: %timeit aligned_arr['b'].sum()
>>>>> 1000 loops, best of 3: 1.14 ms per loop
>>>>
>>>> On my machine:
>>>>
>>>> In [14]: %timeit baligned.sum()
>>>> 1000 loops, best of 3: 1.03 ms per loop
>>>>
>>>> In [15]: %timeit bpacked.sum()
>>>> 100 loops, best of 3: 3.79 ms per loop
>>>>
>>>> Again, the 4x slowdown is here. Using numexpr:
>>>>
>>>> In [16]: %timeit numexpr.evaluate('sum(baligned)')
>>>> 100 loops, best of 3: 2.16 ms per loop
>>>>
>>>> In [17]: %timeit numexpr.evaluate('sum(bpacked)')
>>>> 100 loops, best of 3: 2.08 ms per loop
>>>
>>> And with Theano:
>>>
>>> In [26]: f2 = theano.function([a], a.sum())
>>>
>>> In [27]: %timeit f2(baligned)
>>> 100 loops, best of 3: 2.52 ms per loop
>>>
>>> In [28]: %timeit f2(bpacked)
>>> 100 loops, best of 3: 7.43 ms per loop
>>>
>>> Again, the unaligned case is significantly slower (as much as 3x here!).
>>>
>>> --
>>> Francesc Alted
>>>
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion