On Mon, Nov 15, 2021 at 2:04 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
> This patch tidies up the code that GCC generates for simple loops,
> by selecting/generating a simpler loop bound expression in ivopts.
> The original motivation came from looking at the following loop (from
> gcc.target/i386/pr90178.c):
>
> int *find_ptr (int* mem, int sz, int val)
> {
>   for (int i = 0; i < sz; i++)
>     if (mem[i] == val)
>       return &mem[i];
>   return 0;
> }
>
> which GCC currently compiles to:
>
> find_ptr:
>         movq    %rdi, %rax
>         testl   %esi, %esi
>         jle     .L4
>         leal    -1(%rsi), %ecx
>         leaq    4(%rdi,%rcx,4), %rcx
>         jmp     .L3
> .L7:    addq    $4, %rax
>         cmpq    %rcx, %rax
>         je      .L4
> .L3:    cmpl    %edx, (%rax)
>         jne     .L7
>         ret
> .L4:    xorl    %eax, %eax
>         ret
>
> Notice the relatively complex leal/leaq instructions, which result
> from ivopts using the following expression for the loop bound:
>
> inv_expr 2: ((unsigned long) ((unsigned int) sz_8(D) + 4294967295)
>             * 4 + (unsigned long) mem_9(D)) + 4
>
> This results from NITERS being (unsigned int) sz_8(D) + 4294967295,
> i.e. (sz - 1), and the logic in cand_value_at determining the bound
> as BASE + NITERS*STEP at the start of the final iteration and as
> BASE + NITERS*STEP + STEP at the end of the final iteration.
>
> Ideally, we'd like the middle-end optimizers to simplify
> BASE + NITERS*STEP + STEP as BASE + (NITERS+1)*STEP, especially
> when NITERS already has the form BOUND-1, but with type conversions
> and possible overflow to worry about, the above "inv_expr 2" is the
> best that can be done by fold (without additional context information).
>
> This patch improves ivopts' cand_value_at by passing the data
> structure that explains how the NITERS expression was derived,
> instead of just the tree expression itself.  This allows us to peek
> under the surface to check that NITERS+1 doesn't overflow, and in
> this patch to use the SSA_NAME already holding the required value.
>
> In the motivating loop above, inv_expr 2 now becomes:
>
> (unsigned long) sz_8(D) * 4 + (unsigned long) mem_9(D)
>
> And as a result, on x86_64 we now generate:
>
> find_ptr:
>         movq    %rdi, %rax
>         testl   %esi, %esi
>         jle     .L4
>         movslq  %esi, %rsi
>         leaq    (%rdi,%rsi,4), %rcx
>         jmp     .L3
> .L7:    addq    $4, %rax
>         cmpq    %rcx, %rax
>         je      .L4
> .L3:    cmpl    %edx, (%rax)
>         jne     .L7
>         ret
> .L4:    xorl    %eax, %eax
>         ret
>
> This improvement required one minor tweak to GCC's testsuite for
> gcc.dg/wrapped-binop-simplify.c, where we again generate better
> code, and therefore no longer find as many optimization opportunities
> in later passes (vrp2).
>
> Previously:
>
> void v1 (unsigned long *in, unsigned long *out, unsigned int n)
> {
>   int i;
>   for (i = 0; i < n; i++) {
>     out[i] = in[i];
>   }
> }
>
> on x86_64 generated:
>
> v1:     testl   %edx, %edx
>         je      .L1
>         movl    %edx, %edx
>         xorl    %eax, %eax
> .L3:    movq    (%rdi,%rax,8), %rcx
>         movq    %rcx, (%rsi,%rax,8)
>         addq    $1, %rax
>         cmpq    %rax, %rdx
>         jne     .L3
> .L1:    ret
>
> and now instead generates:
>
> v1:     testl   %edx, %edx
>         je      .L1
>         movl    %edx, %edx
>         xorl    %eax, %eax
>         leaq    0(,%rdx,8), %rcx
> .L3:    movq    (%rdi,%rax), %rdx
>         movq    %rdx, (%rsi,%rax)
>         addq    $8, %rax
>         cmpq    %rax, %rcx
>         jne     .L3
> .L1:    ret
Is that actually better?  IIRC the addressing modes are both complex,
and we now have an extra lea?

For this case I see we generate

  _15 = n_10(D) + 4294967295;
  _8 = (unsigned long) _15;
  _7 = _8 + 1;

where n is unsigned int, so if we know that n is not zero we can
simplify the addition, and conveniently the loop header test provides
this guarantee.  IIRC there were some attempts to enhance match.pd
for some cases of such expressions.

> This patch has been tested on x86_64-pc-linux-gnu with a make bootstrap
> and make -k check with no new failures.  Ok for mainline?

+  /* If AFTER_ADJUST is required, the code below generates the equivalent
+   * of BASE + NITER * STEP + STEP, when ideally we'd prefer the expression
+   * BASE + (NITER + 1) * STEP, especially when NITER is often of the form
+   * SSA_NAME - 1.  Unfortunately, guaranteeing that adding 1 to NITER
+   * doesn't overflow is tricky, so we peek inside the TREE_NITER_DESC
+   * class for common idioms that we know are safe.  */

No '* ' at the start of each line, please.

+  if (after_adjust
+      && desc->control.no_overflow
+      && integer_onep (desc->control.step)
+      && integer_onep (desc->control.base)
+      && desc->cmp == LT_EXPR
+      && TREE_CODE (desc->bound) == SSA_NAME)
+    {
+      niter = desc->bound;
+      after_adjust = false;
+    }

I wonder if the non-overflowing condition can be captured by

  integer_onep (iv->step) && max_stmt_executions (loop, &max)

and if we then do (niter + 1) * step instead of niter * step + step,
would that do the same?  That said, given we have 'niter' as an
expression, we might even be able to use ranger's range_of_expr to
tell that '(niter + 1) * step' does not overflow?

That said - what the change actually does is ensure that we CSE
niter + 1 with the bound of the simplified exit test?

The patch doesn't add any testcase.

Thanks,
Richard.

> 2021-11-15  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * tree-ssa-loop-ivopts.c (cand_value_at): Take a class
>         tree_niter_desc* argument instead of just a tree for NITER.
>         If we require the iv candidate value at the end of the final
>         loop iteration, try using the original loop bound as the
>         NITER for sufficiently simple loops.
>         (may_eliminate_iv): Update (only) call to cand_value_at.
>
> gcc/testsuite
>         * gcc.dg/wrapped-binop-simplify.c: Update expected test result.
>
> Thanks in advance,
> Roger
> --
>