On Fri, Jul 10, 2020 at 11:45 PM H.J. Lu via Gcc <gcc@gcc.gnu.org> wrote:
>
> On Fri, Jul 10, 2020 at 10:30 AM Florian Weimer <fwei...@redhat.com> wrote:
> >
> > Most Linux distributions still compile against the original x86-64
> > baseline that was based on the AMD K8 (minus the 3DNow! parts, for Intel
> > EM64T compatibility).
> >
> > There has been an attempt to use the existing AT_PLATFORM-based loading
> > mechanism in the glibc dynamic linker to enable a selection of optimized
> > libraries. But the general selection mechanism in glibc is problematic:
> >
> >   hwcaps subdirectory selection in the dynamic loader
> >   <https://sourceware.org/pipermail/libc-alpha/2020-May/113757.html>
> >
> > We also have the problem that the glibc version of "haswell" is distinct
> > from GCC's -march=haswell (and presumably other compilers):
> >
> >   Definition of "haswell" platform is inconsistent with GCC
> >   <https://sourceware.org/bugzilla/show_bug.cgi?id=24080>
> >
> > And that the selection criteria are not what people expect:
> >
> >   Epyc and other current AMD CPUs do not select the "haswell" platform
> >   subdirectory
> >   <https://sourceware.org/bugzilla/show_bug.cgi?id=23249>
> >
> > Since the hwcaps-based selection does not work well regardless of
> > architecture (even in cases the kernel provides glibc with data), I
> > worked on a new mechanism that does not have the problems associated
> > with the old mechanism:
> >
> >   [PATCH 00/30] RFC: elf: glibc-hwcaps support
> >   <https://sourceware.org/pipermail/libc-alpha/2020-June/115250.html>
> >
> > (Don't be concerned that these patches have not been reviewed; we are
> > busy preparing the glibc 2.32 release, and these changes do not alter
> > the glibc ABI itself, so they do not have immediate priority. I'm
> > fairly confident that a version of these changes will make it into glibc
> > 2.33, and I hope to backport them into Fedora 33, Fedora 32, and Red Hat
> > Enterprise Linux 8.4.
> > Debian as well, but I have never done anything
> > like it there, so I don't know if the patches will be accepted.)
> >
> > Out of the box, this should work fairly well for IBM POWER and Z, where
> > there is a clear progression of silicon versions (at least on paper
> > —virtualization may blur the picture somewhat).
> >
> > However, for x86, we do not have such a clear progression of
> > micro-architecture versions. This is not just as a result of the
> > AMD/Intel competition, but also due to ongoing product differentiation
> > within one chip vendor. I think we need these levels broadly for the
> > following reasons:
> >
> > * Selecting on individual CPU features (similar to the old hwcaps
> >   mechanism) in glibc has scalability issues, particularly for
> >   LD_LIBRARY_PATH processing.
> >
> > * Developers need guidance about useful targets for optimization. I
> >   think there is value in limiting the choices, in the sense that “if
> >   you are able to test three builds in total, these are the things you
> >   should build”.
> >
> > * glibc and the compilers should align in their definition of the
> >   levels, so that developers can use an -march= option to build for a
> >   particular level that is recognized by glibc. This is why I think the
> >   description of the levels should go into the psABI supplement.
> >
> > * A preference order for these levels avoids falling back to the K8
> >   baseline if the platform progresses to a new version due to
> >   glibc/kernel/hypervisor/hardware upgrades.
> >
> > I'm including a proposal for the levels below. I use single letters for
> > them, but I expect that the concrete implementation of this proposal
> > will use names like “x86-100”, “x86-101”, like in the glibc patch
> > referenced above. (But we can discuss other approaches.)
> >
> > I looked at various machines in the Red Hat labs and talked to Intel and
> > AMD engineers about this, but this concrete proposal is based on my own
> > analysis of the situation.
> > I excluded CPU features related to
> > cryptography and cache management, including hardware transactional
> > memory, and CPU timing. I assume that we will see some of these
> > features being disabled by the firmware or the kernel over time. That
> > would eliminate entire levels from selection, which is not desirable.
> > For cryptographic code, I expect that localized selection of an
> > optimized implementation works because such code tends to be isolated
> > blocks, running for dozens of cycles each time, not something that gets
> > scattered all over the place by the compiler.
> >
> > We previously discussed not emitting VZEROUPPER at later levels, but I
> > don't think this is beneficial because the ABI does not have
> > callee-saved vector registers, so it can only be useful with local
> > functions (or whatever LTO considers local), where there is no ABI
> > impact anyway.
> >
> > I did not include FSGSBASE because the FS base is already available at
> > %fs:0. Changing the FS base in userspace breaks too much, so the main
> > benefit is the tighter encoding of rdfsbase, which seems very slim.
> >
> > Not covered in this are tuning decisions. I think we can benefit from
> > some variance in this area between implementations; it should not affect
> > correctness. 32-bit support is also a separate matter.
> >
> > * Level A
> >
> > CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3
> >
> > This is one step above the K8 baseline and corresponds to a mainline CPU
> > model ca. 2008 to 2011. It is also implemented by recent-ish
> > generations of Intel Atom server CPUs (although I haven't tested the
> > latest version). A 32-bit variant would have to list many additional
> > CPU features here.
> >
> > * Level B
> >
> > AVX, plus everything in level A.
> >
> > This step is so small that it probably can be dropped, unless the
> > benefits from using VEX encoding are truly significant.
> >
> > For AVX and some of the following features, it is assumed that the
> > run-time selection takes full support coverage (from silicon to the
> > kernel) into account.
> >
> > * Level C
> >
> > AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, plus everything in level B.
> >
> > This is close to what glibc currently calls "haswell".
> >
> > * Level D
> >
> > AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL, plus everything in
> > level C.
> >
> > This is the AVX-512 level implemented by Xeon Scalable Processors, not
> > the Xeon Phi variant.
> >
> >
> > glibc (or an alternative loader implementation) would search for
> > libraries starting at level D, going back to level A, and finally the
> > baseline implementation in the default library location.
> >
> > I expect that some distributions will also use these levels to set a
> > baseline for the entire distribution (i.e., everything would be built to
> > level A or maybe even level C), and these libraries would then be
> > installed in the default location.
> >
> > I'll be glad if I can get any feedback on this proposal. I plan to turn
> > it into a merge request for the x86-64 psABI document eventually.
> >
>
> Looks good. I like it.

Likewise. Btw, did you check that VIA family chips slot into Level A at
least? Where do AMD bdverN slot in?

> My only concerns are
>
> 1. Names like “x86-100”, “x86-101”, what features do they support?

Indeed I didn't get the -100, -101 part. On the GCC side I'd have
suggested -march=generic-{A,B,C,D} implying the respective -mtune.

Do the patches end up annotating ELF binaries with the architecture
level and does ld.so check that info? For example IIRC there's a
penalty to switch between VEX and not VEX encoded instructions so even
on AVX capable hardware it might be profitable to use non-AVX libraries
if the program is using only architecture level A? On that side, does
architecture level B+ suggest using VEX encoding everywhere?

It would be indeed nice to have the architecture levels documented in
the psABI.

> 2. I have a library with AVX2 and FMA, which directory should it go?

Eventually GCC/gas can annotate objects with the lowest architecture
level that is applicable?

Thanks for doing this,
Richard.

> Can we pass such info to ld.so and ld.so prints out the best directory
> name?
>
> --
> H.J.
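
To make the level arithmetic discussed above concrete, here is a rough
Python sketch. It is for illustration only and is not taken from the
glibc patches or from GCC. The feature names are copied from the
proposal's lists; the function names and the "baseline" label are
hypothetical. It assumes the levels are strictly cumulative, as the
proposal describes, and it only knows about features the proposal
names.

```python
# Hypothetical sketch of the proposed levels. Feature names come from the
# proposal; function names and the "baseline" label are made up here.

# Features that each level adds on top of the previous one, lowest first.
LEVEL_FEATURES = [
    ("A", {"CMPXCHG16B", "LAHF/SAHF", "POPCNT", "SSE3", "SSE4.1",
           "SSE4.2", "SSSE3"}),
    ("B", {"AVX"}),
    ("C", {"AVX2", "BMI1", "BMI2", "F16C", "FMA", "LZCNT", "MOVBE"}),
    ("D", {"AVX512F", "AVX512BW", "AVX512CD", "AVX512DQ", "AVX512VL"}),
]


def supported_levels(cpu_features):
    """Return the levels a CPU satisfies, lowest first.

    Levels are cumulative, so checking stops at the first level whose
    added features are not all present.
    """
    available = set(cpu_features)
    levels = []
    for name, added in LEVEL_FEATURES:
        if not added <= available:
            break
        levels.append(name)
    return levels


def search_order(cpu_features):
    """Return the lookup order: highest usable level first, baseline last."""
    return list(reversed(supported_levels(cpu_features))) + ["baseline"]


def lowest_level_for(used_features):
    """Return the lowest level that covers every feature a library uses.

    Returns "baseline" if no listed feature is used, and None if a
    feature is not part of any level in the proposal.
    """
    used = set(used_features)
    if not used:
        return "baseline"
    cumulative = set()
    for name, added in LEVEL_FEATURES:
        cumulative |= added
        if used <= cumulative:
            return name
    return None
```

Under these assumptions, lowest_level_for({"AVX2", "FMA"}) yields "C",
which is one possible reading of where a library using AVX2 and FMA
would go. A CPU that has AVX2 but lacks one of the level A features
falls back to the baseline only, because the levels are cumulative.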