Thanks. I think the alignment requirement in IPC is different from this one: we enforce 8/64-byte alignment when serializing for IPC, but we only recommend 64-byte alignment for in-memory addresses (at least that is my understanding of the link above).
I did test adding two arrays, and the result is independent of the alignment (on my machine, compiler, etc.).

Yibo, thanks a lot for that example. I am unsure whether it captures the cache-alignment concept, though: in the example we read a long (8 bytes) from a pointer that is not aligned to 8 bytes (63 % 8 != 0), which is both slow and often undefined behavior. I think the bench we want changes 63 to 64-8, which is still not 64-byte cache-aligned but is aligned for a long; with that change the difference vanishes (under the same gotchas that you mentioned): https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE. Alternatively, add an int32 at an offset of 4.

I benched both with explicit SIMD (via intrinsics) and without (i.e. letting the compiler do it for us), and the alignment does not impact the benches.

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303

On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yibo....@arm.com> wrote:

> Did a quick bench of accessing a buffer of longs that is not 8-byte
> aligned. Given enough conditions, it looks like unaligned access does
> show some penalty over aligned access. But I don't think this is an
> issue in practice.
>
> Please be very skeptical of this benchmark. It's hard to get it right
> given the complexity of hardware, compiler, benchmark tool, and env.
>
> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
>
> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >> My own impression is that the emphasis may be slightly exaggerated. But
> >> perhaps some other benchmarks would prove differently.
> >
> > This is probably true. [1] is the original mailing-list discussion. I
> > think the lack of measurable differences and the high overhead of
> > 64-byte alignment were the reasons for relaxing to 8-byte alignment.
> >> Specifically, I performed two types of tests: a "random sum", where we
> >> compute the sum of the values taken at random indices, and a "sum",
> >> where we sum all values of the array (buffer[1] of the primitive
> >> array), both for arrays ranging from 2^10 to 2^25 elements. I was
> >> expecting that, at least in the latter, prefetching would help, but I
> >> do not observe any difference.
> >
> > The most likely place I think this could make a difference would be
> > for operations on wider types (Decimal128 and Decimal256). Another
> > place where I think alignment could help is when adding two primitive
> > arrays (it sounds like this was summing a single array?).
> >
> > [1]
> > https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >
> > On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >> On 06/09/2021 at 23:20, Jorge Cardoso Leitão wrote:
> >>> Thanks a lot Antoine for the pointers. Much appreciated!
> >>>
> >>>> Generally, it should not hurt to align allocations to 64 bytes anyway,
> >>>> since you are generally dealing with large enough data that the
> >>>> (small) memory overhead doesn't matter.
> >>>
> >>> Not for performance. However, 64-byte alignment in Rust requires
> >>> maintaining a custom container and a custom allocator, and gives up
> >>> interoperability with `std::Vec` and the ecosystem built on it, since
> >>> `std::Vec` allocates with the alignment of T (e.g. int32), not 64
> >>> bytes. For anyone interested, the background for this is this old PR
> >>> [1] and this in arrow2 [2].
> >>
> >> I see. In the C++ implementation, we are not compatible with the
> >> default allocator either (but C++ allocators as defined by the
> >> standard library don't support resizing, which doesn't make them
> >> terribly useful for Arrow anyway).
> >>> Neither I, in micro benches, nor Ritchie from polars (a query
> >>> engine), in large-scale benches, observe any difference on the archs
> >>> we have available. This is not consistent with the emphasis we put on
> >>> the memory-alignment discussion [3], and I am trying to understand
> >>> the root cause of this inconsistency.
> >>
> >> My own impression is that the emphasis may be slightly exaggerated.
> >> But perhaps some other benchmarks would prove differently.
> >>
> >>> By prefetching I mean implicit; no intrinsics involved.
> >>
> >> Well, I'm not aware that implicit prefetching depends on alignment.
> >>
> >> Regards
> >>
> >> Antoine.