Thanks,

I think that the alignment requirement in IPC is different from this one:
we enforce 8/64-byte alignment when serializing for IPC, but we (only)
recommend 64-byte alignment for in-memory addresses (at least this is my
understanding from the above link).

I did test adding two arrays, and the result is independent of the alignment
(on my machine, compiler, etc.).
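
For concreteness, a sketch of the kind of test I mean (not the exact code I
ran; the 1-element offset is there only to knock the pointers off 64-byte
alignment):

#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise sum of two int64 arrays.
void add(const int64_t* a, const int64_t* b, int64_t* out, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

int main() {
  constexpr size_t n = size_t{1} << 20;
  // Over-allocate so the start of the data can be shifted by one element:
  // offset = 0 keeps whatever alignment the allocator returned,
  // offset = 1 guarantees the pointers are not 64-byte aligned.
  std::vector<int64_t> a(n + 1, 1), b(n + 1, 2), out(n + 1, 0);
  const size_t offset = 1;
  add(a.data() + offset, b.data() + offset, out.data() + offset, n);
  return out[offset] == 3 ? 0 : 1;
}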

Yibo, thanks a lot for that example. I am unsure whether it captures the
cache-alignment concept, though: in the example we read a long (8 bytes)
from a pointer that is not 8-byte aligned (63 % 8 != 0), which is both slow
and often undefined behavior. I think the bench we want is to change 63 to
64-8 = 56, which is still not 64-byte (cache-line) aligned but is aligned
for a long; with that change the difference vanishes (under the same caveats
you mentioned): https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
Alternatively, use an int32 at an offset of 4.
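
To make the offsets concrete, a small sketch (not Yibo's benchmark itself) of
why 63 vs 56 matters for the load itself, independently of cache-line
alignment:

#include <cstdint>
#include <cstring>

int main() {
  // 64-byte-aligned backing buffer.
  alignas(64) static unsigned char buf[4096] = {};

  const unsigned char* p63 = buf + 63;  // 63 % 8 != 0: the int64 load is unaligned
  const unsigned char* p56 = buf + 56;  // 56 % 8 == 0: the load itself is aligned,
                                        // but 56 % 64 != 0: not cache-line aligned

  // Dereferencing reinterpret_cast<const int64_t*>(p63) would be UB;
  // memcpy is the well-defined way to express a (possibly unaligned) load.
  int64_t a, b;
  std::memcpy(&a, p63, sizeof(a));
  std::memcpy(&b, p56, sizeof(b));
  return static_cast<int>(a + b);
}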

I benched both with explicit SIMD (via intrinsics) and without (i.e. letting
the compiler auto-vectorize for us), and the alignment does not impact the
benches.
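
By explicit SIMD I mean something along these lines (a sketch assuming AVX2,
not the exact code I ran; _mm256_loadu_si256 has no alignment requirement,
which is one reason alignment ends up not mattering for the loads):

#include <immintrin.h>

#include <cstddef>
#include <cstdint>

// Sum an int64 array with AVX2 intrinsics. The auto-vectorized version of
// the plain scalar loop ends up very similar at -O3 -mavx2.
int64_t sum_avx2(const int64_t* data, size_t n) {
  __m256i acc = _mm256_setzero_si256();
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    const __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
    acc = _mm256_add_epi64(acc, v);
  }
  alignas(32) int64_t lanes[4];
  _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
  int64_t total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
  for (; i < n; ++i) total += data[i];  // scalar tail
  return total;
}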

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303

On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yibo....@arm.com> wrote:

> Did a quick bench of accessing a long buffer that is not 8-byte aligned.
> Given the right conditions, it looks like unaligned access does have some
> penalty over aligned access. But I don't think this is an issue in practice.
>
> Please be very skeptical of this benchmark. It's hard to get it right
> given the complexity of the hardware, compiler, benchmark tool and env.
>
> https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk
>
>
> On 9/7/21 7:55 AM, Micah Kornfield wrote:
> >>
> >> My own impression is that the emphasis may be slightly exaggerated. But
> >> perhaps some other benchmarks would prove differently.
> >
> >
> > This is probably true.  [1] is the original mailing list discussion.  I
> > think the lack of measurable differences and the high overhead of 64-byte
> > alignment were the reasons for relaxing to 8-byte alignment.
> >
> >> Specifically, I performed two types of tests, a "random sum" where we
> >> compute the sum of the values taken at random indices, and "sum", where
> >> we sum all values of the array (buffer[1] of the primitive array), both
> >> for arrays ranging from 2^10 to 2^25 elements. I was expecting that, at
> >> least in the latter, prefetching would help, but I do not observe any
> >> difference.
> >
> >
> > The most likely place I think where this could make a difference would be
> > for operations on wider types (Decimal128 and Decimal256). Another place
> > where I think alignment could help is when adding two primitive arrays
> > (it sounds like this was summing a single array?).
> >
> > [1]
> > https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E
> >
> > On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> Le 06/09/2021 à 23:20, Jorge Cardoso Leitão a écrit :
> >>> Thanks a lot Antoine for the pointers. Much appreciated!
> >>>
> >>>> Generally, it should not hurt to align allocations to 64 bytes anyway,
> >>>> since you are generally dealing with large enough data that the
> >>>> (small) memory overhead doesn't matter.
> >>>
> >>> Not for performance. However, 64-byte alignment in Rust requires
> >>> maintaining a custom container and a custom allocator, and it forgoes
> >>> interoperability with `std::Vec` and the ecosystem based on it, since
> >>> std::Vec allocates with the alignment of T (e.g. int32), not 64 bytes.
> >>> For anyone interested, the background for this is in this old PR [1]
> >>> and in arrow2 [2].
> >>
> >> I see. In the C++ implementation, we are not compatible with the default
> >> allocator either (but C++ allocators as defined by the standard library
> >> don't support resizing, which doesn't make them terribly useful for
> >> Arrow anyway).
> >>
> >>> Neither I, in micro benches, nor Ritchie from polars (query engine), in
> >>> large-scale benches, observe any difference in the archs we have
> >>> available. This is not consistent with the emphasis we put on the memory
> >>> alignment discussion [3], and I am trying to understand the root cause
> >>> for this inconsistency.
> >>
> >> My own impression is that the emphasis may be slightly exaggerated. But
> >> perhaps some other benchmarks would prove differently.
> >>
> >>> By prefetching I mean implicit; no intrinsics involved.
> >>
> >> Well, I'm not aware that implicit prefetching depends on alignment.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >
>
