Thanks Yibo,

Yes, I expect aligned SIMD loads to be faster.

My understanding is that we do not need an alignment requirement for this,
though: split the buffer in three, [unaligned][aligned][unaligned], and use
aligned loads for the middle and unaligned loads (or no SIMD at all) for the
prefix and suffix. This is generic over the SIMD width and over buffer
slicing, where alignment can be lost.
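In Rust, for example, slice::align_to expresses exactly this split. A minimal
sketch (the 32-byte `Chunk` type here is illustrative; any SIMD width works
the same way):

    // A 32-byte-aligned block of eight f32s (e.g. one AVX register).
    #[repr(C, align(32))]
    #[derive(Clone, Copy)]
    struct Chunk([f32; 8]);

    // Sum a slice as [unaligned][aligned][unaligned]: scalar code for the
    // ragged edges, aligned 32-byte chunks for the middle.
    fn sum(values: &[f32]) -> f32 {
        // Sound: any sequence of f32s is a valid Chunk.
        let (prefix, middle, suffix) = unsafe { values.align_to::<Chunk>() };
        let mut acc = [0.0f32; 8];
        for Chunk(lanes) in middle {
            // `lanes` is 32-byte aligned, so the compiler is free to emit
            // aligned SIMD loads when it vectorizes this loop.
            for i in 0..8 {
                acc[i] += lanes[i];
            }
        }
        prefix.iter().chain(suffix).sum::<f32>() + acc.iter().sum::<f32>()
    }

The prefix and suffix are each shorter than one chunk, so their scalar cost
is negligible for large buffers.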
Or am I missing something?

Best,
Jorge

On Wed, Sep 8, 2021 at 4:26 AM Yibo Cai <yibo....@arm.com> wrote:
> Thanks Jorge,
>
> I'm wondering whether the 64-byte alignment requirement is for the cache
> or for the SIMD registers (AVX-512?).
>
> For SIMD, it looks like register-width alignment does help. E.g.,
> _mm_load_si128 can only load 128-bit-aligned data, and it performs better
> than _mm_loadu_si128, which supports unaligned loads.
>
> Again, be very skeptical of the benchmark :)
> https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk
>
> On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:
>> Thanks,
>>
>> I think that the alignment requirement in IPC is different from this
>> one: we enforce 8/64-byte alignment when serializing for IPC, but we
>> (only) recommend 64-byte alignment for in-memory addresses (at least,
>> this is my understanding of the above link).
>>
>> I did test adding two arrays, and the result is independent of the
>> alignment (on my machine, compiler, etc.).
>>
>> Yibo, thanks a lot for that example. I am unsure whether it captures
>> the cache-alignment concept, though: in the example we read a long (8
>> bytes) from a pointer that is not aligned to 8 bytes (63 % 8 != 0),
>> which is both slow and often undefined behavior. I think the bench we
>> want is to change 63 to 64-8 (still not 64-byte cache-aligned, but
>> aligned for a long); with that change the difference vanishes (under
>> the same gotchas that you mentioned):
>> https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE. Alternatively,
>> add an int32 with an offset of 4.
>>
>> I benched both with explicit SIMD (via intrinsics) and without (i.e.
>> letting the compiler do it for us), and the alignment does not impact
>> the benches.
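>>
>> For reference, a sketch of what such an intrinsics-based sum can look
>> like (illustrative only, not the exact bench code; x86_64 with SSE2):
>>
>>     #[cfg(target_arch = "x86_64")]
>>     unsafe fn sum_aligned(data: &[i64]) -> i64 {
>>         use std::arch::x86_64::*;
>>         // Assumes data.as_ptr() is 16-byte aligned and len is even;
>>         // _mm_load_si128 on a misaligned pointer is undefined behavior.
>>         let mut acc = _mm_setzero_si128();
>>         for chunk in data.chunks_exact(2) {
>>             // Swap in _mm_loadu_si128 to get the unaligned variant.
>>             let v = _mm_load_si128(chunk.as_ptr() as *const __m128i);
>>             acc = _mm_add_epi64(acc, v);
>>         }
>>         let mut out = [0i64; 2];
>>         _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, acc);
>>         out[0] + out[1]
>>     }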
>>
>> Best,
>> Jorge
>>
>> [1] https://stackoverflow.com/a/27184001/931303
>>
>> On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yibo....@arm.com> wrote:
>>> Did a quick bench of accessing a buffer of longs that is not 8-byte
>>> aligned. Given the right conditions, it does look like unaligned
>>> access has some penalty over aligned access. But I don't think this is
>>> an issue in practice.
>>>
>>> Please be very skeptical of this benchmark. It's hard to get it right
>>> given the complexity of hardware, compiler, benchmark tool and
>>> environment.
>>>
>>> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
>>>
>>> On 9/7/21 7:55 AM, Micah Kornfield wrote:
>>>>> My own impression is that the emphasis may be slightly exaggerated.
>>>>> But perhaps some other benchmarks would prove differently.
>>>>
>>>> This is probably true. [1] is the original mailing list discussion.
>>>> I think the lack of measurable differences and the high overhead of
>>>> 64-byte alignment were the reasons for relaxing to 8-byte alignment.
>>>>
>>>>> Specifically, I performed two types of tests: a "random sum", where
>>>>> we compute the sum of the values taken at random indices, and a
>>>>> "sum", where we sum all values of the array (buffer[1] of the
>>>>> primitive array), both for arrays ranging from 2^10 to 2^25
>>>>> elements. I was expecting that, at least in the latter, prefetching
>>>>> would help, but I do not observe any difference.
>>>>
>>>> The most likely place I think where this could make a difference
>>>> would be for operations on wider types (Decimal128 and Decimal256).
>>>> Another place where I think alignment could help is when adding two
>>>> primitive arrays (it sounds like this was summing a single array?).
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
>>>>
>>>> On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <anto...@python.org> wrote:
>>>>> On 06/09/2021 at 23:20, Jorge Cardoso Leitão wrote:
>>>>>> Thanks a lot Antoine for the pointers. Much appreciated!
>>>>>>
>>>>>>> Generally, it should not hurt to align allocations to 64 bytes
>>>>>>> anyway, since you are generally dealing with large enough data
>>>>>>> that the (small) memory overhead doesn't matter.
>>>>>>
>>>>>> Not for performance. However, 64-byte alignment in Rust requires
>>>>>> maintaining a custom container and a custom allocator, and
>>>>>> accepting the inability to interoperate with `std::Vec` and the
>>>>>> ecosystem built on it, since std::Vec allocates with the alignment
>>>>>> of T (e.g. int32), not 64 bytes. For anyone interested, the
>>>>>> background for this is this old PR [1] and this in arrow2 [2].
>>>>>
>>>>> I see. In the C++ implementation, we are not compatible with the
>>>>> default allocator either (but C++ allocators as defined by the
>>>>> standard library don't support resizing, which doesn't make them
>>>>> terribly useful for Arrow anyway).
>>>>>
>>>>>> Neither my micro-benches nor Ritchie's large-scale benches on
>>>>>> polars (a query engine) show any difference on the archs we have
>>>>>> available. This is not consistent with the emphasis we put on the
>>>>>> memory-alignment discussion [3], and I am trying to understand the
>>>>>> root cause of this inconsistency.
>>>>>
>>>>> My own impression is that the emphasis may be slightly exaggerated.
>>>>> But perhaps some other benchmarks would prove differently.
>>>>>
>>>>>> By prefetching I mean implicit; no intrinsics involved.
>>>>>
>>>>> Well, I'm not aware that implicit prefetching depends on alignment.
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
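
PS: for anyone curious, the custom 64-byte-aligned allocation mentioned
above (the reason a plain std::Vec does not cut it) boils down to something
like this sketch (illustrative only, not arrow2's actual code):

    use std::alloc::{alloc, dealloc, Layout};

    /// Allocate `len` i32s with 64-byte alignment. Vec<i32> only
    /// guarantees align_of::<i32>() == 4, hence the custom allocation.
    fn alloc_aligned(len: usize) -> (*mut i32, Layout) {
        assert!(len > 0, "zero-sized alloc is undefined behavior");
        let layout =
            Layout::from_size_align(len * std::mem::size_of::<i32>(), 64)
                .expect("invalid layout");
        let ptr = unsafe { alloc(layout) } as *mut i32;
        assert!(!ptr.is_null(), "allocation failed");
        (ptr, layout)
    }

    /// The buffer must be freed with the same layout it was allocated with.
    unsafe fn free_aligned(ptr: *mut i32, layout: Layout) {
        dealloc(ptr as *mut u8, layout);
    }

Keeping that pointer alive, growing it, and exposing it safely is what forces
the custom container.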