https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92964
Bug ID: 92964
Summary: order of base class members generates vastly different code
Product: gcc
Version: 9.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: john at drouhard dot dev
Target Milestone: ---

Created attachment 47510
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47510&action=edit
reproducible c++ source

I'm new at submitting bug reports, so sorry if this one is invalid.

I have attached a repro file that exhibits what I believe to be two issues, although I'm unsure about the second. If I need to open a second report for it (or if it's really not a problem), just let me know. I've reproduced this behavior on gcc 8.3, gcc 9.2, and gcc trunk. Godbolt link if interested: https://godbolt.org/z/F756Pe

Looking at the assembly generated with -std=c++17 -O2 (and -O3), there seems to be a missed optimization of some kind. bar1() and bar3() are equivalent except for the order of the union and the bool in the base storage class. I believe gcc should generate essentially the same assembly for both (apart from %rax and %rdx being swapped), but bar3 uses the stack and even emits SIMD instructions in the false branch, while bar1 does neither. clang generates equivalent code for bar1 and bar3. I'm assuming it has something to do with alignment, but what's strange is that if the intermediary Payload class is bypassed by using a PayloadBase directly in Foo<>, bar1 and bar3 compile to (almost) equivalent code, as expected, with no SIMD instructions or stack usage.

The other issue (please let me know if I should open another report for it, or if this is expected) is the difference within each of the pairs bar1/bar2 and bar3/bar4. The two functions in each pair differ only in how they return the Foo<> object: bar1 and bar3 return a default-constructed Foo<>, while bar2 and bar4 return a sentinel and rely on its conversion to Foo<> through the user-defined constructor.
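The attachment itself is not reproduced here. As a rough aid to following the description, the following is a hypothetical sketch of the kind of code it describes: a base storage class holding a union (an empty member plus a value) and a bool flag, an intermediary Payload class deriving from it, a Foo<> wrapper, and functions returning Foo<> either default-constructed or through a sentinel converted by a user-defined constructor. Apart from Payload, Foo, and bar1 through bar4, every name (UnionFirstBase, FlagFirstBase, Nothing, bar3_direct) and every member type is an assumption, and so is the choice of which member ordering bar1 and bar3 use. It is meant to be compiled to assembly (for example with g++ -std=c++17 -O2 -S) and is not guaranteed to reproduce the exact codegen described in this report.

```cpp
// Hypothetical sketch of the pattern described in this report; it is NOT
// attachment 47510. Names, member types and orderings are assumptions.
#include <cassert>
#include <cstdint>

struct Empty {};    // placeholder for the "empty" union member
struct Nothing {};  // sentinel type, converted to Foo by a user-defined ctor
inline constexpr Nothing nothing{};

// Storage with the union declared before the bool flag.
struct UnionFirstBase {
    union {
        Empty empty;
        std::uint64_t value;
    };
    bool engaged;
    UnionFirstBase() : empty{}, engaged{false} {}
    explicit UnionFirstBase(std::uint64_t v) : value{v}, engaged{true} {}
};

// Same members, with the bool flag declared before the union.
struct FlagFirstBase {
    bool engaged;
    union {
        Empty empty;
        std::uint64_t value;
    };
    FlagFirstBase() : engaged{false}, empty{} {}
    explicit FlagFirstBase(std::uint64_t v) : engaged{true}, value{v} {}
};

// Intermediary layer that only forwards to its base class.
template <typename Base>
struct Payload : Base {
    using Base::Base;  // inherit the value constructor
};

// Wrapper returned by value; Storage is Payload<...> or a base used directly.
template <typename Storage>
struct Foo {
    Storage payload;

    Foo() = default;
    Foo(Nothing) noexcept {}  // implicit conversion from the sentinel
    explicit Foo(std::uint64_t v) : payload(v) {}

    bool has_value() const { return payload.engaged; }
    std::uint64_t get() const {
        assert(payload.engaged);  // only read the value member when engaged
        return payload.value;
    }
};

using FooA = Foo<Payload<UnionFirstBase>>;
using FooB = Foo<Payload<FlagFirstBase>>;
using FooBDirect = Foo<FlagFirstBase>;  // bypasses the Payload layer

// Assumed mapping: bar1/bar2 use the union-first layout, bar3/bar4 flag-first.
FooA bar1(bool b) { if (b) return FooA(42); return {}; }       // default-constructed
FooA bar2(bool b) { if (b) return FooA(42); return nothing; }  // via sentinel
FooB bar3(bool b) { if (b) return FooB(42); return {}; }
FooB bar4(bool b) { if (b) return FooB(42); return nothing; }
FooBDirect bar3_direct(bool b) { if (b) return FooBDirect(42); return {}; }
```

To look at the bypass case, compare the assembly for bar3 with bar3_direct, which stores FlagFirstBase directly instead of going through Payload<>.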
The generated assembly differs considerably within these pairs, at least at -O2. At -O2, bar2 does not set %rdx in the false branch, whereas bar1 does; clang sets it in both. I realize %rdx doesn't strictly need any particular value there, since the empty union member is the one being initialized, but bar1 takes care to set it to 0 (as does clang for both functions). Raising the optimization level to -O3 actually causes bar2 to explicitly set both registers to 0. bar3 and bar4, however, both deliberately load uninitialized junk from the stack into %rax in the false branch (at both -O2 and -O3). This is easier to see in bar4, since it doesn't use SIMD instructions there (possibly related to the first issue?). If initializing %rax isn't necessary, why does the compiler move uninitialized stack memory into the return register at all? I apologize if this is not well thought out or if there's something obvious I'm missing. Let me know if you need me to provide any more information.