On 2026-03-13 Bruno Haible wrote:
> Thanks for these references. I'm applying the attached fix to Gnulib.
>
> In particular, I appreciate your finding that the combination of
> memcpy and __builtin_assume_aligned produces the best possible code
> (with gcc >= 4.7 and clang).

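For reference, a minimal sketch of the memcpy plus
__builtin_assume_aligned pattern follows. It is not the code from the
Gnulib patch. The name load64_aligned is hypothetical, and the sketch
assumes the caller guarantees 8-byte alignment.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only; not Gnulib's
   stdc_load8_aligned implementation.  The caller must guarantee
   that p is 8-byte aligned.  Returns the value in native byte order.  */
static inline uint64_t
load64_aligned (const void *p)
{
  uint64_t v;
#if (defined __GNUC__ \
     && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7))) \
    || defined __clang__
  /* Alignment hint: lets the compiler turn the memcpy below into a
     single aligned load even on strict-align targets.  */
  p = __builtin_assume_aligned (p, 8);
#endif
  /* memcpy has no effective-type restrictions, so this avoids the
     strict aliasing problem of casting p to uint64_t *.  */
  memcpy (&v, p, sizeof v);
  return v;
}
```

Without the hint the code is still correct; the compiler just may
not be able to merge the copy into one load.
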
Another way is __attribute__((may_alias)) which is supported in GCC
since 3.3. The example in the 3.3.6 manual[1] is still there in the
15.2.0 manual[2]. I don't remember why I didn't use may_alias. Perhaps
I missed it because I focused more on the unaligned uses, and for those
one would also need the aligned(1) attribute. The memcpy method feels
simpler and is more portable too.

[1] https://gcc.gnu.org/onlinedocs/gcc-3.3.6/gcc/Type-Attributes.html
[2] https://gcc.gnu.org/onlinedocs/gcc-15.2.0/gcc/Common-Type-Attributes.html

> What I'm doing differently than you did:
>   - I don't distinguish "strict-align" and "non-strict-align"
>     architectures, because in most "non-strict-align" architectures,
>     unaligned accesses are slow. Compilers know this, and they prefer
>     to emit a few instructions that each uses 1 cycle, than a single
>     instruction which uses 10 or 20 cycles.
>   - So, the only distinction we need to make is regarding the
>     compiler:
>       - gcc >= 4.7, clang,
>       - MSVC,
>       - other compilers.
>     In Gnulib, we don't care much about optimizing for 10 years old
>     gcc versions. Making sure to get good code for gcc versions >= 10
>     (and clang) is what we care about.

An important (perhaps the main?) use case for the aligned functions is
to avoid strict aliasing violations. For example, see how longword_ptr
is used in lib/memchr.c and lib/memchr2.c in Gnulib. The type punning
could be replaced with native-endian stdc_load8_aligned calls (although
the GNU C may_alias attribute is a simpler fix when it's supported).

If stdc_load8_aligned is used to fix aliasing issues, it's essential
that the resulting code is still fast. The current Gnulib code falls
back to byte-by-byte access if __builtin_assume_aligned isn't supported
(and the compiler isn't MSVC), so at least Oracle Developer Studio on
SPARC will produce slow code.

Developer Studio supports #pragmas to control aliasing.[3] However, my
alias.c test program doesn't "miscompile" with Developer Studio 12.6.
I tried setting -xalias_level=std and even =strong at -O5 optimization
level (-O3 or higher is needed for inlining to happen at all). Thus, I
couldn't test if a #pragma would make a difference.

I don't know if, in Gnulib context, there is any other
possibly-relevant strict-align compiler that doesn't support
__builtin_assume_aligned or the may_alias attribute.

[3] https://docs.oracle.com/cd/E77782_01/html/E77788/bjaiu.html

> Find attached the test program, with which I evaluated which
> variant produces the best code.

Thanks!

I didn't explain when the performance of unaligned access matters. In
xz, if unaligned access is known to be fast, different code is used in
a few places. It can reduce compression time by a double-digit
percentage without any arch-specific code. But if the unaligned code
paths are enabled when the inline functions for unaligned loads aren't
optimized to a single instruction, the result is a major
deoptimization. See this commit message:

    https://github.com/tukaani-project/xz/commit/7971566247914ec1854b125ff99c2a617f5c1e3a

Gnulib uses byte-by-byte code in the unaligned stdc_load8 functions.
It's not only old compilers that don't optimize those properly. Based
on testing on godbolt.org, current MSVC on all archs produces bad code
from the byte-by-byte code, but memcpy is fine. GCC 15.2.0 on s390x[4]
might be worse with the byte-by-byte code too.

[4] https://gcc.godbolt.org/z/a4s8PEdrP

The above use case might not be common outside compressors, so I'm not
saying that the unaligned stdc_ functions should be optimized better in
Gnulib (it might take more effort than one expects). I just wanted to
highlight that in some very specific situations the unaligned functions
aren't merely convenience functions; they can help with performance
too.

-- 
Lasse Collin
