On Mon, Jun 12, 2023 at 4:03 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > The following simple test case, from PR 104610, shows that memcmp () == 0 > can result in some bizarre code sequences on x86. > > int foo(char *a) > { > static const char t[] = "0123456789012345678901234567890"; > return __builtin_memcmp(a, &t[0], sizeof(t)) == 0; > } > > with -O2 currently contains both: > xorl %eax, %eax > xorl $1, %eax > and also > movl $1, %eax > xorl $1, %eax > > Changing the return type of foo to _Bool results in the equally > bizarre: > xorl %eax, %eax > testl %eax, %eax > sete %al > and also > movl $1, %eax > testl %eax, %eax > sete %al > > All these sequences set the result to a constant, but this optimization > opportunity only occurs very late during compilation, by basic block > duplication in the 322r.bbro pass, too late for CSE or peephole2 to > do anything about it. The problem is that the idiom expanded by > compare_by_pieces for __builtin_memcmp_eq contains basic blocks that > can't easily be optimized by if-conversion due to the multiple > incoming edges on the fail block. > > In summary, compare_by_pieces generates code that looks like: > > if (x[0] != y[0]) goto fail_label; > if (x[1] != y[1]) goto fail_label; > ... > if (x[n] != y[n]) goto fail_label; > result = 1; > goto end_label; > fail_label: > result = 0; > end_label: > > In theory, the RTL if-conversion pass could be enhanced to tackle > arbitrarily complex if-then-else graphs, but the solution proposed > here is to allow suitable targets to perform if-conversion during > compare_by_pieces. The x86, for example, can take advantage that > all of the above comparisons set and test the zero flag (ZF), which > can then be used in combination with sete. Hence compare_by_pieces > could instead generate: > > if (x[0] != y[0]) goto fail_label; > if (x[1] != y[1]) goto fail_label; > ... > if (x[n] != y[n]) goto fail_label; > fail_label: > sete result > > which requires one less basic block, and the redundant conditional > branch to a label immediately after is cleaned up by GCC's existing > RTL optimizations. > > For the test case above, where -O2 -msse4 previously generated: > > foo: movdqu (%rdi), %xmm0 > pxor .LC0(%rip), %xmm0 > ptest %xmm0, %xmm0 > je .L5 > .L2: movl $1, %eax > xorl $1, %eax > ret > .L5: movdqu 16(%rdi), %xmm0 > pxor .LC1(%rip), %xmm0 > ptest %xmm0, %xmm0 > jne .L2 > xorl %eax, %eax > xorl $1, %eax > ret > > we now generate: > > foo: movdqu (%rdi), %xmm0 > pxor .LC0(%rip), %xmm0 > ptest %xmm0, %xmm0 > jne .L2 > movdqu 16(%rdi), %xmm0 > pxor .LC1(%rip), %xmm0 > ptest %xmm0, %xmm0 > .L2: sete %al > movzbl %al, %eax > ret > > Using a target hook allows the large amount of intelligence already in > compare_by_pieces to be re-used by the i386 backend, but this can also > help other backends with condition flags where the equality result can > be materialized. > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check, both with and without --target_board=unix{-m32} > with no new failures. Ok for mainline? > > > 2023-06-12 Roger Sayle <ro...@nextmovesoftware.com> > > gcc/ChangeLog > * config/i386/i386.cc (ix86_finish_compare_by_pieces): New > function to provide a backend specific implementation. > (TARGET_FINISH_COMPARE_BY_PIECES): Use the above function. > > * doc/tm.texi.in (TARGET_FINISH_COMPARE_BY_PIECES): New @hook. > * doc/tm.texi: Regenerate. > > * expr.cc (compare_by_pieces): Call finish_compare_by_pieces in > targetm to finalize the RTL expansion. Move the current > implementation to a default target hook. > * target.def (finish_compare_by_pieces): New target hook to allow > compare_by_pieces to be customized by the target. > * targhooks.cc (default_finish_compare_by_pieces): Default > implementation moved here from expr.cc's compare_by_pieces. > * targhooks.h (default_finish_compare_by_pieces): Prototype. > > gcc/testsuite/ChangeLog > * gcc.target/i386/pieces-memcmp-1.c: New test case.
This patch needs middle-end approval first. Uros. > > > Thanks in advance, > Roger > -- >