Thanks Jorge,

I'm wondering whether the 64-byte alignment requirement is for the cache line or for SIMD registers (AVX-512?).

For SIMD, it looks like register-width alignment does help.
E.g., _mm_load_si128 can only load 128-bit-aligned data; it performs better than _mm_loadu_si128, which supports unaligned loads.

Again, be very skeptical of the benchmark :)
https://quick-bench.com/q/NxyDu89azmKJmiVxF29Ei8FybWk
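Roughly what the two versions compare (a sketch, not the exact quick-bench code; the function name and shape are mine, assuming SSE2 and n divisible by 4):

#include <emmintrin.h>  // SSE2: _mm_load_si128 / _mm_loadu_si128
#include <cstdint>
#include <cstddef>

// Sum n int32 values; `aligned` selects the load intrinsic. p must be
// 16-byte aligned when `aligned` is true, or the load is undefined.
int32_t sum_sse2(const int32_t* p, std::size_t n, bool aligned) {
    __m128i acc = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128i* src = reinterpret_cast<const __m128i*>(p + i);
        acc = _mm_add_epi32(acc, aligned ? _mm_load_si128(src)
                                         : _mm_loadu_si128(src));
    }
    alignas(16) int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}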


On 9/7/21 7:16 PM, Jorge Cardoso Leitão wrote:
Thanks,

I think that the alignment requirement in IPC is different from this one:
we enforce 8/64-byte alignment when serializing for IPC, but we (only)
recommend 64-byte alignment for in-memory addresses (at least this is my
understanding of the above link).

I did test adding two arrays and the result is independent of the alignment
(on my machine, compiler, etc).

Yibo, thanks a lot for that example. I am unsure whether it captures the
cache alignment concept, though: in the example we are reading a long (8
bytes) from a pointer that is not aligned to 8 bytes (63 % 8 != 0), which
is both slow and often undefined behavior. I think the bench we want
changes 63 to 64-8 (which is still not 64-byte cache-aligned, but is
aligned for a long); with that change, the difference vanishes (under the
same gotchas that you mentioned):
https://quick-bench.com/q/EKIpQFJsAogSHXXLqamoWSTy-eE.
Alternatively, add an int32 at an offset of 4.
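To make that concrete, a sketch (my naming, not the quick-bench source) of the two accesses:

#include <cstdint>
#include <cstring>
#include <cstddef>

// Read one long from base + offset. memcpy avoids the undefined
// behavior of dereferencing a misaligned int64_t pointer.
int64_t read_long_at(const uint8_t* base, std::size_t offset) {
    int64_t v;
    std::memcpy(&v, base + offset, sizeof(v));
    return v;
}

// With a 64-byte-aligned base:
//   read_long_at(base, 63): misaligned, straddles a cache line (slow case)
//   read_long_at(base, 56): not cache-line aligned, but long-aligned --
//                           here the difference vanishes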

I benched both with explicit SIMD (via intrinsics) and without (i.e.
letting the compiler do it for us), and the alignment does not impact the
benches.

Best,
Jorge

[1] https://stackoverflow.com/a/27184001/931303

On Tue, Sep 7, 2021 at 4:29 AM Yibo Cai <yibo....@arm.com> wrote:

Did a quick bench of accessing a long buffer that is not 8-byte aligned.
Given enough conditions, it looks like unaligned access does show some
penalty over aligned access. But I don't think this is an issue in practice.

Please be very skeptical of this benchmark. It's hard to get it right
given the complexity of the hardware, compiler, benchmark tool, and
environment.

https://quick-bench.com/q/GmyqRk6saGfRu8XnMUyoSXs4SCk


On 9/7/21 7:55 AM, Micah Kornfield wrote:

My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.


This is probably true. [1] is the original mailing list discussion. I
think the lack of measurable differences and the high overhead of 64-byte
alignment were the reasons for relaxing to 8-byte alignment.

Specifically, I performed two types of tests: a "random sum", where we
compute the sum of the values taken at random indices, and a "sum", where
we sum all values of the array (buffer[1] of the primitive array), both
for arrays ranging from 2^10 to 2^25 elements. I was expecting that, at
least in the latter, prefetching would help, but I do not observe any
difference.


The most likely place where I think this could make a difference is
operations on wider types (Decimal128 and Decimal256). Another place
where I think alignment could help is when adding two primitive arrays
(it sounds like this was summing a single array?).
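For clarity, the access patterns in question (a sketch; names and loop shapes are mine, not the actual bench code):

#include <cstdint>
#include <cstddef>

// "sum": sequential reduction over one buffer (prefetch-friendly).
int64_t sum(const int32_t* a, std::size_t n) {
    int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// "random sum": gather at random indices (defeats the prefetcher).
int64_t random_sum(const int32_t* a, const uint32_t* idx, std::size_t n) {
    int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += a[idx[i]];
    return s;
}

// Adding two primitive arrays: two input buffers plus an output buffer,
// where the buffers' relative alignment could matter for vectorization.
void add(const int32_t* a, const int32_t* b, int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}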

[1]

https://lists.apache.org/thread.html/945b65fb4bc8bcdab695b572f9e9c2dca4cd89012fdbd896a6f2d886%401460092304%40%3Cdev.arrow.apache.org%3E

On Mon, Sep 6, 2021 at 3:05 PM Antoine Pitrou <anto...@python.org>
wrote:


On 06/09/2021 at 23:20, Jorge Cardoso Leitão wrote:
Thanks a lot Antoine for the pointers. Much appreciated!

Generally, it should not hurt to align allocations to 64 bytes anyway,
since you are generally dealing with large enough data that the
(small) memory overhead doesn't matter.

Not for performance. However, 64-byte alignment in Rust requires
maintaining a custom container and a custom allocator, and it means the
inability to interoperate with `std::Vec` and the ecosystem that is based
on it, since std::Vec allocates with the alignment of T (e.g. int32), not
64 bytes. For anyone interested, the background for this is this old PR
[1] and arrow2 [2].

I see. In the C++ implementation, we are not compatible with the default
allocator either (but C++ allocators as defined by the standard library
don't support resizing, which doesn't make them terribly useful for
Arrow anyway).
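For illustration, with only standard C++17 facilities (a sketch, not Arrow's actual allocator; the function name is mine):

#include <cstdlib>
#include <cstdint>

// std::aligned_alloc (C++17) requires the size to be a multiple of the
// alignment; release the memory with std::free.
int32_t* alloc_values(std::size_t n) {
    std::size_t bytes = ((n * sizeof(int32_t) + 63) / 64) * 64;  // round up
    return static_cast<int32_t*>(std::aligned_alloc(64, bytes));
}

// Note: the standard library has no aligned realloc, which is one reason
// resizing needs a custom allocation path.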

Neither myself in micro benches nor Ritchie from polars (query engine) in
large-scale benches observe any difference on the archs we have
available. This is not consistent with the emphasis we put on the memory
alignment discussion [3], and I am trying to understand the root cause of
this inconsistency.

My own impression is that the emphasis may be slightly exaggerated. But
perhaps some other benchmarks would prove differently.

By prefetching I mean implicit; no intrinsics involved.

Well, I'm not aware that implicit prefetching depends on alignment.

Regards

Antoine.



