Thanks Yibo,

Yes, I expect aligned SIMD loads to be faster.

My understanding is that we do not need an alignment requirement for this,
though: split the buffer into three regions, [unaligned][aligned][unaligned],
use aligned loads for the middle and unaligned loads (or not even SIMD) for
the prefix and suffix. This is generic over the SIMD register width and also
covers buffer slicing, where alignment can be lost. Or am I missing something?

Best,
Jorge





On Wed, Sep 8, 2021 at 4:26 AM Yibo Cai <yibo....@arm.com> wrote:

> Thanks Jorge,
>
> I'm wondering if the 64 byte alignment requirement is for the cache or for
> the simd register (avx512?).
>
> For simd, it looks like register width alignment does help.
> E.g., _mm_load_si128 can only load 128-bit aligned data; it performs
> better than _mm_loadu_si128, which supports unaligned loads.
>
> Again, be very skeptical of the benchmark :)
> https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk
>
>
> On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:
> > Thanks,
> >
> > I think that the alignment requirement in IPC is different from this one:
> > we enforce 8/64 byte alignment when serializing for IPC, but we (only)
> > recommend 64 byte alignment in memory addresses (at least this is my
> > understanding from the above link).
> >
> > I did test adding two arrays and the result is independent of the
> > alignment (on my machine, compiler, etc).
> >
> > Yibo, thanks a lot for that example. I am unsure whether it captures the
> > cache alignment concept, though: in the example we are reading a long (8
> > bytes) from a pointer that is not aligned to 8 bytes (63 % 8 != 0), which
> > is both slow and often undefined behavior. I think that the bench we want
> > is to change 63 to 64-8 (which is still not 64-byte cache aligned but is
> > aligned with a long); the difference then vanishes (under the same gotchas
> > that you mentioned): https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
> > Alternatively, add an int32 with an offset of 4.
> >
> > I benched both with explicit SIMD (via intrinsics) and without (i.e. let
> > the compiler do it for us), and the alignment does not impact the benches.
> >
> > Best,
> > Jorge
> >
> > [1] https://stackoverflow.com/a/27184001/931303
> >
> >
> >
> >
> >
> > On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yibo....@arm.com> wrote:
> >
> >> Did a quick bench of accessing a long buffer that is not 8 bytes aligned.
> >> Given enough conditions, it looks like unaligned access does have some
> >> penalty over aligned access. But I don't think this is an issue in
> >> practice.
> >>
> >> Please be very skeptical of this benchmark. It's hard to get it right
> >> given the complexity of the hardware, compiler, benchmark tool and env.
> >>
> >> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
> >>
> >>
> >> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >>>>
> >>>> My own impression is that the emphasis may be slightly exaggerated. But
> >>>> perhaps some other benchmarks would prove differently.
> >>>
> >>>
> >>> This is probably true.  [1] is the original mailing list discussion.  I
> >>> think the lack of measurable differences and the high overhead of 64 byte
> >>> alignment was the reason for relaxing to 8 byte alignment.
> >>>
> >>>> Specifically, I performed two types of tests, a "random sum" where we
> >>>> compute the sum of the values taken at random indices, and "sum", where
> >>>> we sum all values of the array (buffer[1] of the primitive array), both
> >>>> for arrays ranging from 2^10 to 2^25 elements. I was expecting that, at
> >>>> least in the latter, prefetching would help, but I do not observe any
> >>>> difference.
> >>>
> >>>
> >>> The most likely place I think where this could make a difference would
> >>> be for operations on wider types (Decimal128 and Decimal256).   Another
> >>> place where I think alignment could help is when adding two primitive
> >>> arrays (it sounds like this was summing a single array?).
> >>>
> >>> [1]
> >>> https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >>>
> >>> On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <anto...@python.org>
> >>> wrote:
> >>>
> >>>>
> >>>> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
> >>>>> Thanks a lot Antoine for the pointers. Much appreciated!
> >>>>>
> >>>>>> Generally, it should not hurt to align allocations to 64 bytes anyway,
> >>>>>> since you are generally dealing with large enough data that the
> >>>>>> (small) memory overhead doesn't matter.
> >>>>>
> >>>>> Not for performance. However, 64 byte alignment in Rust requires
> >>>>> maintaining a custom container and a custom allocator, and implies the
> >>>>> inability to interoperate with `std::Vec` and the ecosystem that is
> >>>>> based on it, since std::Vec allocates with the alignment of T (e.g.
> >>>>> int32), not 64 bytes. For anyone interested, the background for this is
> >>>>> this old PR [1] and this in arrow2 [2].
> >>>>
> >>>> I see. In the C++ implementation, we are not compatible with the
> >>>> default allocator either (but C++ allocators as defined by the standard
> >>>> library don't support resizing, which doesn't make them terribly useful
> >>>> for Arrow anyway).
> >>>>
> >>>>> Neither I in micro benches nor Ritchie from polars (a query engine) in
> >>>>> large scale benches observe any difference on the archs we have
> >>>>> available. This is not consistent with the emphasis we put on the
> >>>>> memory alignment discussion [3], and I am trying to understand the
> >>>>> root cause for this inconsistency.
> >>>>
> >>>> My own impression is that the emphasis may be slightly exaggerated. But
> >>>> perhaps some other benchmarks would prove differently.
> >>>>
> >>>>> By prefetching I mean implicit; no intrinsics involved.
> >>>>
> >>>> Well, I'm not aware that implicit prefetching depends on alignment.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>
> >>
> >
>
