https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85730
Bug ID: 85730
Summary: complex code for modifying lowest byte in a 4-byte vector
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: zsojka at seznam dot cz
Target Milestone: ---
Host: x86_64-pc-linux-gnu
Target: x86_64-pc-linux-gnu

Created attachment 44109
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44109&action=edit
reduced testcase

The attached testcase has 3 implementations of the same function, yet the compiled code differs (@ -O3):

foo:
        movsx   edx, dil
        mov     eax, edi
        add     edx, edx
        mov     al, dl
        ret
bar:
        mov     eax, edi
        add     al, al
        ret
baz:
        movsx   edx, dil
        mov     eax, edi
        add     edx, edx
        mov     al, dl
        ret

bar() has the shortest code and also uses fewer registers. I tried benchmarking all 3 functions on a Skylake CPU, but I could not determine which function is the fastest (the jitter was too high).

The difference between foo() and bar() is that bar() is compiled with -fno-tree-ccp -fno-tree-fre. baz() has one extra constant in the code, which needs to be propagated in foo() and bar().
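
The attached testcase is not reproduced in this report. Purely as an illustration of the kind of operation the assembly above performs (doubling the lowest byte of a 4-byte vector and leaving the other three bytes alone), here is a minimal sketch using the GCC vector extension. The type name v4qi, the function names, and the helper check_variants() are hypothetical and are not taken from the attachment; the actual testcase may be written quite differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 4-byte vector of signed chars (GCC vector extension). */
typedef signed char v4qi __attribute__((vector_size(4)));

/* Variant 1: double the lowest element directly. */
v4qi double_low_direct(v4qi v)
{
    /* Arithmetic is done in unsigned char to make the wrap-around explicit. */
    v[0] = (signed char)(unsigned char)((unsigned char)v[0] * 2u);
    return v;
}

/* Variant 2: same operation, but the factor is a named constant
   that the optimizer has to propagate (an assumption about what
   "one extra constant" could look like). */
v4qi double_low_const(v4qi v)
{
    const unsigned factor = 2u;
    unsigned low = (unsigned char)v[0];
    v[0] = (signed char)(unsigned char)(low * factor);
    return v;
}

/* Checks that both variants agree for every value of the lowest byte
   and that the upper three bytes are left untouched. */
void check_variants(void)
{
    for (int b = -128; b <= 127; b++) {
        v4qi in = { (signed char)b, 1, 2, 3 };
        v4qi r1 = double_low_direct(in);
        v4qi r2 = double_low_const(in);
        assert(memcmp(&r1, &r2, sizeof r1) == 0);
        assert(r1[1] == 1 && r1[2] == 2 && r1[3] == 3);
    }
}
```

To compare the generated code for such variants, one could compile with something like gcc -O3 -S -masm=intel, with and without -fno-tree-ccp -fno-tree-fre. Whether this sketch reproduces the exact instruction sequences shown above is not verified here.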