On 3/7/13 7:26 PM, Frédéric Bastien wrote:
> Hi,
>
> It is normal that unaligned access are slower. The hardware have been
> optimized for aligned access. So this is a user choice space vs speed.
> We can't go around that.
Well, my benchmarks apparently say that numexpr can get better performance when tackling computations on unaligned arrays (30% faster). This puzzled me a bit yesterday, but after thinking about what was happening, the explanation is clear to me now.

Neither the aligned nor the unaligned array was contiguous, as both had a gap between elements (a consequence of the layout of structured arrays): 8 bytes of padding for the aligned case and a 1-byte offset for the packed one. The hardware of modern machines fetches a complete cache line (64 bytes typically) whenever an element is accessed, which means that, even though we only make use of one field in the computations, both fields are brought into cache. So for the aligned object, 16 MB (16 bytes * 1 million elements) are transmitted to the cache, while the packed object only has to transmit 9 MB (9 bytes * 1 million). Of course, transmitting 16 MB is quite a bit more work than just 9 MB.

Now, the elements land in cache aligned for the aligned case and unaligned for the packed case, and as you say, unaligned access in cache is pretty slow for the CPU; this is the reason why NumPy can take up to 4x more time to perform the computation on the packed array.

So why is numexpr performing much better for the packed case? Well, it turns out that numexpr has machinery to detect that an array is unaligned, and it makes an internal copy of every block that is brought to the cache to be computed. This block size is between 1024 elements (8 KB for double precision) and 4096 elements when linked with VML support, which means that the copy normally happens at L1 or L2 cache speed, much faster than a memory-to-memory copy. After the copy, numexpr can perform operations on aligned data at full CPU speed. The paradox is that, by doing more copies, you may end up performing faster computations. This is the joy of programming with the memory hierarchy in mind.

This is to say that there is more in the equation than just whether an array is aligned or not.
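To make the two layouts concrete, here is a small sketch (the field names and block size are my own choices, not from the benchmark) that builds the aligned and packed structured arrays described above and imitates numexpr's blocked-copy trick in pure NumPy:

```python
import numpy as np

n = 1_000_000

# Aligned layout: the float64 field is padded to an 8-byte boundary,
# so each record takes 16 bytes (8-byte gap per element).
aligned_dt = np.dtype([('i', np.int8), ('x', np.float64)], align=True)
# Packed layout: no padding, the float64 starts at offset 1,
# so each record takes only 9 bytes -- but the field is unaligned.
packed_dt = np.dtype([('i', np.int8), ('x', np.float64)])

ra = np.zeros(n, dtype=aligned_dt)
rp = np.zeros(n, dtype=packed_dt)
ra['x'] = 1.0
rp['x'] = 1.0
baligned = ra['x']   # aligned, strided view (stride 16)
bpacked = rp['x']    # unaligned, strided view (stride 9)

# Sketch of numexpr-style machinery: copy each block of the unaligned
# view into a small contiguous buffer (the copy lands in L1/L2 cache),
# then compute on aligned data at full CPU speed.
BLOCK = 4096         # numexpr uses 1024-4096 elements per block
out = np.empty(n)
buf = np.empty(BLOCK)
for start in range(0, n, BLOCK):
    stop = min(start + BLOCK, n)
    m = stop - start
    buf[:m] = bpacked[start:stop]     # unaligned -> aligned copy
    out[start:stop] = buf[:m] ** 2    # aligned, contiguous compute
```

On a typical machine `baligned.flags.aligned` is True while `bpacked.flags.aligned` is False, and the record itemsizes (16 vs. 9 bytes) account for the difference in bytes pulled through the cache.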
You must take into account how (and how much!) data travels from storage to the CPU before making assumptions about the performance of your programs.

> We can only minimize the cost of unaligned
> access in some cases, but not all and those optimization depend of the
> CPU. But newer CPU have lowered in cost of unaligned access.
>
> I'm surprised that Theano worked with the unaligned input. I added
> some check to make this raise an error, as we do not support that!
> Francesc, can you check if Theano give the good result? It is possible
> that someone (maybe me), just copy the input to an aligned ndarray
> when we receive an not aligned one. That could explain why it worked,
> but my memory tell me that we raise an error.

It seems to work for me:

In [10]: f = theano.function([a], a**2)

In [11]: f(baligned)
Out[11]: array([ 1., 1., 1., ..., 1., 1., 1.])

In [12]: f(bpacked)
Out[12]: array([ 1., 1., 1., ..., 1., 1., 1.])

In [13]: f2 = theano.function([a], a.sum())

In [14]: f2(baligned)
Out[14]: array(1000000.0)

In [15]: f2(bpacked)
Out[15]: array(1000000.0)

> As you saw in the number, this is a bad example for Theano as the
> function compiled is too fast . Their is more Theano overhead then
> computation time in that example. We have reduced recently the
> overhead, but we can do more to lower it.

Yeah. I was mainly curious about how different packages handle unaligned arrays.

-- Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion