[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Xi Ruoyao changed:

           What    |Removed                     |Added
----------------------------------------------------------------
 Ever confirmed    |0                           |1
 CC                |                            |xry111 at gcc dot gnu.org
 Last reconfirmed  |                            |2024-01-04
 Summary           |SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
                   |is almost 40% slower vs.    |is almost 40% slower vs.
                   |Clang on AMD Zen 4          |Clang
 Status            |UNCONFIRMED                 |NEW

--- Comment #3 from Xi Ruoyao ---
GCC trunk still gets around 200 MiB/s (on a Tiger Lake, but I've not used
-march) with -fno-semantic-interposition.

Confirmed, and I'm removing "on xxx" from the subject as the uarch seems
irrelevant.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka changed:

           What    |Removed |Added
----------------------------------------------------------------
 CC                |        |hubicka at gcc dot gnu.org

--- Comment #4 from Jan Hubicka ---
I keep mentioning to Larabel that he should use -fno-semantic-interposition,
but he doesn't.

The profile is very simple:

  96.75%  SMHasher  [.] keccakf.lto_priv.0

All time goes to one simple loop. On Zen 3 with GCC 13 -march=native -Ofast
-flto I get:

  3.85 │330:  mov    %r8,%rdi
  7.68 │      movslq (%rsi,%r9,1),%rcx
  3.85 │      lea    (%rax,%rcx,8),%r10
  3.86 │      mov    (%rdx,%r9,1),%ecx
  3.83 │      add    $0x4,%r9
  3.86 │      mov    (%r10),%r8
  7.37 │      rol    %cl,%rdi
  7.37 │      mov    %rdi,(%r10)
  4.76 │      cmp    $0x60,%r9
  0.00 │    ↑ jne    330

Clang seems to unroll it:

  0.25 │ d0:  mov    -0x48(%rsp),%rdx
  0.25 │      xor    %r12,%rcx
  0.25 │      mov    %r13,%r12
  0.25 │      mov    %r13,0x10(%rsp)
  0.25 │      mov    %rax,%r13
  0.26 │      xor    %r15,%r13
  0.23 │      mov    %r11,-0x70(%rsp)
  0.25 │      mov    %r8,0x8(%rsp)
  0.25 │      mov    %r15,-0x40(%rsp)
  0.25 │      mov    %r10,%r15
  0.26 │      mov    %r10,(%rsp)
  0.26 │      mov    %r14,%r10
  0.25 │      xor    %r12,%r10
  0.26 │      xor    %rsi,%r15
  0.24 │      mov    %rbp,-0x80(%rsp)
  0.25 │      xor    %rcx,%r15
  0.26 │      mov    -0x60(%rsp),%rcx
  0.25 │      xor    -0x68(%rsp),%r15
  0.26 │      xor    %rbp,%rdx
  0.25 │      mov    -0x30(%rsp),%rbp
  0.25 │      xor    %rdx,%r13
  0.24 │      mov    -0x10(%rsp),%rdx
  0.25 │      mov    %rcx,%r12
  0.24 │      xor    %rcx,%r13
  0.25 │      mov    $0x1,%ecx
  0.25 │      xor    %r11,%rdx
  0.24 │      mov    %r8,%r11
  0.25 │      mov    -0x28(%rsp),%r8
  0.26 │      xor    -0x58(%rsp),%r8
  0.24 │      xor    %rdx,%r8
  0.26 │      mov    -0x8(%rsp),%rdx
  0.25 │      xor    %rbp,%r8
  0.26 │      xor    %r11,%rdx
  0.25 │      mov    -0x20(%rsp),%r11
  0.25 │      xor    %rdx,%r10
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang on AMD Zen 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235 --- Comment #1 from Artem S. Tashkinov --- Also valid for MTL: https://www.phoronix.com/review/intel-meteorlake-gcc-clang/2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #2 from Xi Ruoyao ---
The test file can be downloaded from
http://phoronix-test-suite.com/benchmark-files/smhasher-20220822.tar.xz.
Just build it with cmake and run "./SMHasher --test=Speed sha3-256". The
build system enables -O3 and LTO by default.

With GCC 13 I get about 180 MiB/s, but Clang 17 produces 250 MiB/s. Part of
the difference is caused by the different -fsemantic-interposition default:
if I pass -fno-semantic-interposition, GCC 13 produces about 200 MiB/s.
[Bug target/113235] SMHasher SHA3-256 benchmark is almost 40% slower vs. Clang (not enough complete loop peeling)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #9 from Jan Hubicka ---
Phoronix still claims the difference:
https://www.phoronix.com/review/gcc14-clang18-amd-zen4/2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

David Malcolm changed:

           What    |Removed |Added
----------------------------------------------------------------
 CC                |        |dmalcolm at gcc dot gnu.org

--- Comment #10 from David Malcolm ---
(In reply to Jan Hubicka from comment #4)
> I keep mentioning to Larabel that he should use -fno-semantic-interposition,
> but he doesn't.

Possibly a silly question, but how about changing the default in GCC 15?
What proportion of users actually make use of -fsemantic-interposition?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #11 from Andrew Pinski ---
(In reply to David Malcolm from comment #10)
> Possibly a silly question, but how about changing the default in GCC 15?
> What proportion of users actually make use of -fsemantic-interposition?

See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
previous discussion on this.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #12 from Andrew Pinski ---
(In reply to Andrew Pinski from comment #11)
> See https://inbox.sourceware.org/gcc-patches/ri6czn5z8mw@suse.cz/ for
> previous discussion on this.

Sorry, the correct link is
https://inbox.sourceware.org/gcc-patches/20210606231215.49899-1-mask...@google.com/
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #13 from Xi Ruoyao ---
(In reply to David Malcolm from comment #10)
> Possibly a silly question, but how about changing the default in GCC 15?
> What proportion of users actually make use of -fsemantic-interposition?

At least when building glibc with -fno-semantic-interposition, several tests
fail. I've not figured out whether these are test-suite issues or real
issues, though.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------
 Summary           |SMHasher SHA3-256 benchmark |SMHasher SHA3-256 benchmark
                   |is almost 40% slower vs.    |is almost 40% slower vs.
                   |Clang                       |Clang (not enough complete
                   |                            |loop peeling)

--- Comment #5 from Jan Hubicka ---
On my Zen 3 machine:

  - the default build gets me 180 MB/s;
  - -O3 -flto -funroll-all-loops gets me 193 MB/s;
  - -O3 -flto --param max-completely-peel-times=30 gets me 382 MB/s.

The speedup is gone with --param max-completely-peel-times=20; the default
is 16.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #6 from Jan Hubicka ---
The internal loops are:

static const unsigned keccakf_rotc[24] = {
    1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 2, 14,
    27, 41, 56, 8, 25, 43, 62, 18, 39, 61, 20, 44
};

static const unsigned keccakf_piln[24] = {
    10, 7, 11, 17, 18, 3, 5, 16, 8, 21, 24, 4,
    15, 23, 19, 13, 12, 2, 20, 14, 22, 9, 6, 1
};

static void keccakf(ulong64 s[25])
{
    int i, j, round;
    ulong64 t, bc[5];

    for (round = 0; round < SHA3_KECCAK_ROUNDS; round++) {
        /* Theta */
        for (i = 0; i < 5; i++)
            bc[i] = s[i] ^ s[i + 5] ^ s[i + 10] ^ s[i + 15] ^ s[i + 20];

        for (i = 0; i < 5; i++) {
            t = bc[(i + 4) % 5] ^ ROL64(bc[(i + 1) % 5], 1);
            for (j = 0; j < 25; j += 5)
                s[j + i] ^= t;
        }

        /* Rho Pi */
        t = s[1];
        for (i = 0; i < 24; i++) {
            j = keccakf_piln[i];
            bc[0] = s[j];
            s[j] = ROL64(t, keccakf_rotc[i]);
            t = bc[0];
        }

        /* Chi */
        for (j = 0; j < 25; j += 5) {
            for (i = 0; i < 5; i++)
                bc[i] = s[j + i];
            for (i = 0; i < 5; i++)
                s[j + i] ^= (~bc[(i + 1) % 5]) & bc[(i + 2) % 5];
        }

        s[0] ^= keccakf_rndc[round];
    }
}

I suppose with complete unrolling the table indices will constant-propagate,
the state will partly stay in registers, and the loads will fold. I think
increasing the default limits, especially at -O3, may make sense. The value
of 16 has been there for a very long time (I think since the initial
implementation).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

Richard Biener changed:

           What    |Removed |Added
----------------------------------------------------------------
 CC                |        |rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener ---
IMO it should be purely growth/unrolled-insns bound; the bound on the actual
number of unrolled iterations is somewhat artificial (it exists to avoid
really large unrolls when we estimate the unrolled body to be zero size and
thus never hit any of the other limits).

That said, I think we should get better at estimating growth - I don't think
we account for the reads from the constant arrays being elided (though
eliding them is not always the optimal thing). See the proposal on better
estimation I had last year.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113235

--- Comment #8 from Richard Biener ---
Created attachment 57006
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57006&action=edit
unroll heuristics

This one.