On Wednesday, 19 January 2022 00:13:32 PST Lars Knoll wrote:
> The main thing I’m wondering about is how much performance we gain from a
> multi-arch build of Qt for different x86_64 architectures as opposed to
> building maybe for v2 and detecting/using AVX and AVX512 at runtime. I
> would assume it’s little we gain, as there are very few places where the
> compiler’s auto-vectorizer will be able to emit AVX instructions, but I
> might be wrong here.
>
> AVX is only used by a couple of classes in Qt Core and the drawhelper in
> Qt Gui. Qt Gui already does runtime detection, so it would be only about
> adding that to the methods in Qt Core.
Hello Lars,

That's a misconception. AVX and especially AVX2 introduce a lot of codegen
opportunities for the compilers, which they've been able to use for years.
The v3 level also introduces some lesser-known features like MOVBE and BMI,
which are new scalar instructions, not SIMD. But even v2 brings in
interesting instructions that match practically 1:1 some C library
functions, like ROUNDPS and ROUNDPD.

The QtGui draw helpers do runtime detection, but QtCore does not. And this
is the issue: each of those individual optimisations is small enough that
the overhead of selecting at runtime is higher than the benefit. We're
talking about very hot functions such as QString::fromLatin1 taking an
(extra) indirect function call for everything. But the aggregate of those
optimisations is worth it, for negligible cost in producing them: build the
entire library twice and let the dynamic linker choose.

I do plan on looking into runtime detection in QtCore for 6.5, but I'm not
convinced I can make it work with sufficiently low overhead on all
platforms. I know I can for Linux with IFUNC support (see the sketch
further down), but doing so for Linux only is not worth our collective time
(development, review, long-term maintenance).

For the QtGui runtime detection, we have a ticking time bomb. See below.

> > I propose we remove the tests for the intrinsics of each individual CPU
> > feature. Instead, let's just assume they all have everything up to 2016.
> > This will shorten cmake time a little and fix the macOS universal
> > builds. It'll also change how 32-bit non-SSE2 builds are selected (see
> > below).
> >
> > The change https://codereview.qt-project.org/c/qt/qtbase/+/386738 is
> > going in this direction but retains a test (all or nothing). I'm
> > proposing now we remove the test completely and just assume.
>
> I’m fine with that, I don’t think we need to support a compiler that
> doesn’t support those.

As I said, for the record: all the compilers we support do support all of
them, based on the CI runs of those changes. There's only one issue, and
it's the QCC compiler lacking the RDSEED intrinsics, but that's easily
worked around (it does support inline assembly, and it might support the
low-level __builtin_ia32_rdseed_si_step() intrinsic).

> Can we at the same time do the same thing for NEON, btw? While there are
> some platforms that don’t support NEON, I believe all compilers do
> support them.

Neon is different. First, it's never selected at runtime. The issue that
affected the NEON code generation is actually present on x86 too, but
we've been lucky to avoid it.

This is the ticking time bomb I was talking about: whenever you compile
C++ sources and the compiler decides not to inline an inlineable function,
it will create a copy of it and call that. This may happen in multiple
translation units, so the linker chooses any one of them to emit in the
final binary (usually it's the one from the first .o that offered it, but
that's just implementation behaviour). So what happens if the copy came
from the higher-target .o? Kaboom. This is what affected our Neon builds,
and this is why we don't detect it at runtime.

I've spoken to GCC and GNU Binutils maintainers about this issue. There's
little incentive for them to provide a workaround, so it won't happen. The
only solution they offer is to have different libraries or plugins.

So, can we change how we detect Neon? Sure. We can simply assume it's
there all the time, since it is there all the time anyway.
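
To illustrate the v2 codegen point from the top of this mail: with SSE4.1
(part of x86-64-v2), GCC and Clang compile std::floor on floats down to
single instructions instead of libm calls. A minimal example (nothing
Qt-specific, just plain C++):

    #include <cmath>

    // With -msse4.1 (or -march=x86-64-v2), this becomes a single ROUNDSS
    // instruction; at the baseline, it's a call into libm.
    float floorOne(float x)
    {
        return std::floor(x);
    }

    // And the auto-vectorizer turns this loop into ROUNDPS, flooring a
    // whole vector of floats at a time.
    void floorAll(float *v, int n)
    {
        for (int i = 0; i < n; ++i)
            v[i] = std::floor(v[i]);
    }
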
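
And since I mentioned IFUNC: this is roughly what the mechanism looks like
on glibc-based Linux. This is only a sketch, not what QtCore would ship;
the function names are invented for illustration:

    #include <cstddef>

    // Baseline implementation, compiled with the default target flags.
    static std::size_t convertLatin1_generic(char16_t *dst, const char *src,
                                             std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = char16_t(static_cast<unsigned char>(src[i]));
        return n;
    }

    // Same source, but GCC is allowed to auto-vectorize this copy with
    // AVX2 instructions.
    __attribute__((target("avx2")))
    static std::size_t convertLatin1_avx2(char16_t *dst, const char *src,
                                          std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = char16_t(static_cast<unsigned char>(src[i]));
        return n;
    }

    // The resolver runs once, inside the dynamic linker, while the
    // relocations are processed; afterwards, every call goes straight to
    // the chosen implementation, with no per-call branch or extra
    // indirection.
    extern "C" decltype(&convertLatin1_generic) resolveConvertLatin1()
    {
        __builtin_cpu_init();   // resolvers may run before constructors
        if (__builtin_cpu_supports("avx2"))
            return convertLatin1_avx2;
        return convertLatin1_generic;
    }

    // The exported symbol binds to whatever the resolver returned.
    std::size_t convertLatin1(char16_t *dst, const char *src, std::size_t n)
        __attribute__((ifunc("resolveConvertLatin1")));

That's why the overhead can be made low enough on Linux; the open question
is every other platform.
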
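
And to make the time bomb concrete, here's a minimal sketch of the hazard
(the file and function names are invented; this is not actual Qt code):

    // simd_helper.h -- a header included by every translation unit.
    // "inline" makes this a COMDAT symbol: every .o whose compiler
    // declined to inline it emits its own out-of-line copy under the
    // very same name.
    inline int sumArray(const int *data, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; ++i)     // auto-vectorizable loop
            sum += data[i];
        return sum;
    }

    // generic.cpp -- compiled with the baseline flags.
    #include "simd_helper.h"
    int genericSum(const int *d, int n) { return sumArray(d, n); }

    // avx2.cpp -- compiled with -mavx2 (on 32-bit ARM, think -mfpu=neon).
    // The out-of-line copy of sumArray emitted here may use AVX2
    // instructions, under the exact same symbol name as generic.cpp's.
    #include "simd_helper.h"
    int avx2Sum(const int *d, int n) { return sumArray(d, n); }

The linker keeps exactly one out-of-line sumArray. If it happens to keep
the one from avx2.o, then genericSum() executes AVX2 instructions even on
a CPU that doesn't have them: SIGILL.
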
Back to Neon: I know the ARMv8 architecture (read: 64-bit, AArch64)
requires it, like 64-bit x86 requires SSE2, so that's inescapable. The
question is therefore whether we want to always enable it on 32-bit ARMv7.
I'd say it should get the same answer as 32-bit i386: yes, enable it by
default, but allow disabling it.

> See my comment above. We also need to think about non-Linux platforms.
> Multi-arch is difficult on Windows as far as I know, so a v2 baseline
> build and runtime detection might be preferable.

Indeed, this solution is specific to glibc-based Linux. It does not apply
to other OSes, because it's predicated on the system's dynamic linker
being able to select different files, or different sections of a file,
based on CPU identification. If Microsoft wants their OS to have
better-performing content, they'll have to come up with a solution. I do
plan to reach out to them via the team that works with them at Intel, but
I don't expect to see any solution, at least not before 2030. There are
some workarounds (search for "delay-loaded DLL"), but they've left me with
a bad taste in my mouth.

> > 5) for glibc-based Linux, add v3 sub-arch by default
> >
> > I'd like to raise the default on Linux from baseline to v2 *and* add a
> > v3 sub-arch build, as described by point #3 above.
> >
> > Device-specific Qt builds (Yocto Project, Boot2Qt) would need to turn
> > this off and select a single architecture, if they don't want the
> > extra files.
>
> This complicates the build system and deployment in quite a few places
> and is a Linux-specific solution. Can you give some numbers on how much
> of an improvement this would give over runtime detection where we have
> AVX-optimised code?

Open Phoronix.com and search for Clear Linux. Almost every single time
we've won a benchmark, it was because Clear Linux defaults to:
 * v2 as the minimum
 * v3 for quite a lot of libraries (including qtbase and qt3d)
 * v4 for a few, especially libm

I don't expect this solution to complicate the build THAT much. I expect
it's basically making the qt_internal_add_module() CMake function create
two targets instead of one, based on some opt-in flag we set for the
handful of libraries we think there's value in doing this for. Then it
builds the same sources twice and installs two sets of binaries and their
symlinks.

Clear Linux uses a heuristic to guess which libraries are worth keeping
the AVX2 version of. To see which ones it chose for qtbase, see
https://github.com/clearlinux-pkgs/qtbase/blob/e16f08be736d28351219b05e807a6468ea39341b/qtbase.spec#L5771-L5902

For my openSUSE desktop, I didn't go nearly as far: I simply built QtCore
and QtGui twice. See lines 938-944 of
https://build.opensuse.org/package/view_file/home:thiagomacieira:branches:openSUSE:Factory/libqt5-qtbase/libqt5-qtbase.spec?expand=1

And this is why I want to do this inside Qt itself and do it by default:
because I've had to manually solve it twice, for two different distros.
It's not reasonable to expect every one of them to copy the solutions.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering

_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development