On 2026-03-13 Bruno Haible wrote:
> Thanks for these references. I'm applying the attached fix to Gnulib.
> 
> In particular, I appreciate your finding that the combination of
> memcpy and __builtin_assume_aligned produces the best possible code
> (with gcc >= 4.7 and clang).

Another way is __attribute__((may_alias)) which is supported in GCC
since 3.3. The example in the 3.3.6 manual[1] is still there in the
15.2.0 manual[2].

I don't remember why I didn't use may_alias. Perhaps I missed it
because I focused more on the unaligned uses, and for those one would
also need the aligned(1) attribute. The memcpy method feels simpler and
is more portable too.

[1] https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Type-Attributes.html
[2] https://gcc.gnu.org/onlinedocs/gcc-15.2.0/gcc/Common-Type-Attributes.html
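
For illustration, here is a minimal sketch of the may_alias approach,
assuming GCC or Clang. The typedef and function names are hypothetical
and don't come from Gnulib or xz. The second typedef adds aligned(1),
which is what the unaligned case would need.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */

/* Accesses through this type may alias objects of any other type. */
typedef uint32_t __attribute__((may_alias)) u32_alias;

/* Same, but the compiler may not assume more than 1-byte alignment. */
typedef uint32_t __attribute__((may_alias, aligned(1))) u32_unaligned;

/* buf must be suitably aligned for uint32_t. */
static inline uint32_t read32_aligned(const void *buf)
{
    return *(const u32_alias *)buf;
}

/* buf may have any alignment. */
static inline uint32_t read32_unaligned(const void *buf)
{
    return *(const u32_unaligned *)buf;
}
```

Both functions return the value in native byte order. Compilers that
don't support these attributes would still need the memcpy method.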

> What I'm doing differently than you did:
>   - I don't distinguish "strict-align" and "non-strict-align"
>     architectures, because in most "non-strict-align" architectures,
>     unaligned accesses are slow. Compilers know this, and they prefer
>     to emit a few instructions that each uses 1 cycle, than a single
>     instruction which uses 10 or 20 cycles.
>   - So, the only distinction we need to make is regarding the
>     compiler:
>       - gcc >= 4.7, clang,
>       - MSVC,
>       - other compilers.
>     In Gnulib, we don't care much about optimizing for 10 years old
>     gcc versions. Making sure to get good code for gcc versions >= 10
>     (and clang) is what we care about.

An important (perhaps the main?) use case for the aligned functions is
to avoid strict aliasing violations. For example, see how longword_ptr
is used in lib/memchr.c and lib/memchr2.c in Gnulib. The type punning
could be replaced with native-endian stdc_load8_aligned calls (although
the GNU C may_alias attribute is a simpler fix when it's supported).

If stdc_load8_aligned is used to fix aliasing issues, it's essential
that the resulting code is still fast. The current Gnulib code falls
back to byte-by-byte access if __builtin_assume_aligned isn't supported
(and the compiler isn't MSVC), so at least Oracle Developer Studio on
SPARC will produce slow code.
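
As a rough sketch of the memcpy plus __builtin_assume_aligned
combination for a native-endian aligned load (the helper and macro names
are hypothetical, and this is not the Gnulib code): the fallback branch
here uses a plain memcpy instead of byte-by-byte access, which is my
assumption about a reasonable alternative. Whether a given compiler
turns that memcpy into a single load is not something the sketch can
guarantee.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical feature macro: detect __builtin_assume_aligned. */
#if defined __has_builtin
# if __has_builtin(__builtin_assume_aligned)
#  define HAVE_ASSUME_ALIGNED 1
# endif
#endif
#if !defined HAVE_ASSUME_ALIGNED && defined __GNUC__ \
    && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7))
# define HAVE_ASSUME_ALIGNED 1
#endif

/* Hypothetical helper: p must be 8-byte aligned. The memcpy avoids the
   strict aliasing violation that a pointer cast would cause. */
static inline uint64_t load64_aligned(const void *p)
{
    uint64_t v;
#ifdef HAVE_ASSUME_ALIGNED
    /* Promise the alignment so that a single load can be emitted even
       on strict-align targets. */
    memcpy(&v, __builtin_assume_aligned(p, 8), sizeof v);
#else
    /* Assumption: plain memcpy as the fallback, not byte-by-byte. */
    memcpy(&v, p, sizeof v);
#endif
    return v;
}
```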

Developer Studio supports #pragmas to control aliasing.[3] However, my
alias.c test program doesn't "miscompile" with Developer Studio 12.6. I
tried setting -xalias_level=std and even =strong at -O5 optimization
level (-O3 or higher is needed for inlining to happen at all). Thus, I
couldn't test if a #pragma would make a difference.

I don't know if, in Gnulib context, there is any other possibly-relevant
strict-align compiler that doesn't support __builtin_assume_aligned or
the may_alias attribute.

[3] https://docs.oracle.com/cd/E77782_01/html/E77788/bjaiu.html

>     Find attached the test program, with which I evaluated which
>     variant produces the best code.

Thanks!

I didn't explain when the performance of unaligned access matters. In
xz, if unaligned access is known to be fast, different code is used in
a few places. It can reduce compression time by a double-digit
percentage without any arch-specific code. But if the unaligned code
paths are enabled when the inline functions for unaligned loads aren't
optimized to a single instruction, the result is a major
deoptimization. See this commit message:

    https://github.com/tukaani-project/xz/commit/7971566247914ec1854b125ff99c2a617f5c1e3a

Gnulib uses byte-by-byte code in the unaligned stdc_load8 functions.
It's not only old compilers that don't optimize those properly. Based
on testing on godbolt.org, current MSVC on all archs produces bad code
from the byte-by-byte code, but memcpy is fine. GCC 15.2.0 on s390x[4]
might be worse with the byte-by-byte code too.

[4] https://gcc.godbolt.org/z/a4s8PEdrP
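
To make the comparison concrete, here is a sketch of the two styles for
an unaligned 32-bit little-endian load. The function names are
hypothetical and neither is copied from Gnulib or xz. The memcpy variant
assumes a little-endian host when the compiler doesn't define
__BYTE_ORDER__ (for example MSVC).

```c
#include <stdint.h>
#include <string.h>

/* Byte-by-byte: independent of host endianness, but relies on the
   compiler recognizing the pattern to emit a single load. */
static inline uint32_t load32_le_bytes(const unsigned char *p)
{
    return (uint32_t)p[0]
           | ((uint32_t)p[1] << 8)
           | ((uint32_t)p[2] << 16)
           | ((uint32_t)p[3] << 24);
}

/* memcpy: a native-endian load followed by a byte swap on big-endian
   hosts. If __BYTE_ORDER__ is undefined, little endian is assumed. */
static inline uint32_t load32_le_memcpy(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
#if defined __BYTE_ORDER__ && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    v = (v >> 24) | ((v >> 8) & 0x0000FF00u)
        | ((v << 8) & 0x00FF0000u) | (v << 24);
#endif
    return v;
}
```

Both return the same value for the same input. Only the generated code
differs, which is what the godbolt.org comparison is about.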

The above use case might not be common outside compressors, so I'm not
saying that the unaligned stdc_ functions should be optimized better in
Gnulib (it might take more effort than one expects). I just wanted to
highlight that in some very specific situations the unaligned functions
aren't merely convenience functions; they can help with performance too.

-- 
Lasse Collin
