Hi Uros,
> On Sat, Jun 29, 2024 at 6:21 PM Roger Sayle <ro...@nextmovesoftware.com>
> wrote:
> > A common idiom for implementing an integer division that rounds
> > upwards is to write (x + y - 1) / y.  Conveniently on x86, the two
> > additions to form the numerator can be performed by a single lea
> > instruction, and indeed gcc currently generates a lea when x and y are
> > both in registers.
> >
> > int foo(int x, int y) {
> >   return (x+y-1)/y;
> > }
> >
> > generates with -O2:
> >
> > foo:    leal    -1(%rsi,%rdi), %eax     // 4 bytes
> >         cltd
> >         idivl   %esi
> >         ret
> >
> > Oddly, however, if x is in memory, gcc currently uses two instructions:
> >
> > int m;
> > int bar(int y) {
> >   return (m+y-1)/y;
> > }
> >
> > generates:
> >
> > bar:    movl    m(%rip), %eax
> >         addl    %edi, %eax              // 2 bytes
> >         subl    $1, %eax                // 3 bytes
> >         cltd
> >         idivl   %edi
> >         ret
> >
> > This discrepancy is caused by the late decision (in peephole2) to
> > split an addition with a memory operand into a load followed by a
> > reg-reg addition.  This patch improves the situation by adding a
> > peephole2 to recognize consecutive additions and transform them into
> > an lea if profitable.
> >
> > My first attempt at fixing this was to use a define_insn_and_split:
> >
> > (define_insn_and_split "*lea<mode>3_reg_mem_imm"
> >   [(set (match_operand:SWI48 0 "register_operand")
> >        (plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
> >                                (match_operand:SWI48 2 "memory_operand"))
> >                    (match_operand:SWI48 3 "x86_64_immediate_operand")))]
> >   "ix86_pre_reload_split ()"
> >   "#"
> >   "&& 1"
> >   [(set (match_dup 4) (match_dup 2))
> >    (set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
> >                                  (match_dup 3)))]
> >   "operands[4] = gen_reg_rtx (<MODE>mode);")
> >
> > relying on combine to merge the instructions.  Unfortunately, this
> > approach interferes with (reload's) subtle balance of deciding when to
> > use/avoid lea, which can be observed as a code size regression in
> > CSiBE.  The peephole2 approach (proposed here) uniformly improves
> > CSiBE results.
> >
> > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> > and make -k check, both with and without --target_board=unix{-m32}
> > with no new failures.  Ok for mainline?
> >
> >
> > 2024-06-29  Roger Sayle  <ro...@nextmovesoftware.com>
> >
> > gcc/ChangeLog
> >         * config/i386/i386.md (peephole2): Transform two consecutive
> >         additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
> >
> > gcc/testsuite/ChangeLog
> >         * gcc.target/i386/lea-3.c: New test case.
> 
> Is the assumption that one LEA is always faster than two ADD instructions
> universally correct for TARGET_AVOID_LEA_FOR_ADDR?
> 
> Please note the ix86_lea_outperforms predicate and its uses in
> ix86_avoid_lea_for_add(), ix86_use_lea_for_mov(),
> ix86_avoid_lea_for_addr() and ix86_lea_for_add_ok(). IMO,
> !avoid_lea_for_addr() should be used here, but I didn't check it thoroughly.
> 
> The function comment of avoid_lea_for_addr() says:
> 
> /* Return true if we need to split lea into a sequence of
>    instructions to avoid AGU stalls during peephole2. */
> 
> And your peephole tries to reverse the above split.

I completely agree that understanding when/why i386.md converts
an lea into a sequence of additions (and avoiding reversing this split)
is vitally important to understanding my patch.  You're quite right that
the logic governing this ultimately calls ix86_lea_outperforms, but as
I'll explain below, the shape of those APIs (requiring an insn) is not as
convenient for instruction merging as it is for splitting.
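
For reference, ix86_lea_outperforms is declared along roughly these
lines in i386.cc (quoted from memory, so check the source for the exact
parameter list):

  /* Heuristic: should INSN use an lea rather than a split sequence?  */
  static bool
  ix86_lea_outperforms (rtx_insn *insn, unsigned int regno0,
                        unsigned int regno1, unsigned int regno2,
                        int split_cost, bool has_scale);

When splitting, the candidate lea insn already exists and can be passed
as the first argument; when merging two additions there is no lea insn
yet to hand to the predicate.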

The place in i386.md where gcc currently decides whether the lea in the
foo example above needs to be split is around line 6293:

(define_peephole2
  [(set (match_operand:SWI48 0 "register_operand")
    (match_operand:SWI48 1 "address_no_seg_operand"))]
  "ix86_hardreg_mov_ok (operands[0], operands[1])
   && peep2_regno_dead_p (0, FLAGS_REG)
   && ix86_avoid_lea_for_addr (peep2_next_insn (0), operands)"
...

Hence, we transform lea->add+add when ix86_avoid_lea_for_addr
returns true, so by symmetry it is not unreasonable to turn add+add->lea
when ix86_avoid_lea_for_addr would return false.  The relevant part
of ix86_avoid_lea_for_addr is around line 15974 of i386.cc:

  /* Check we need to optimize.  */
  if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
    return false;

which you'll recognize is precisely the condition under which my
proposed peephole2 fires.  Technically, we also know that this is
a 3-component lea, that "parts.disp" is non-zero, that the addition is
destructive (overwriting one of the input register operands), and that
we're not using a partial register mode (such as HImode).
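
Schematically, the kind of peephole2 I'm proposing has the following
shape (a simplified sketch for this mail rather than the literal patch
text; it shows only one operand ordering and omits some of the checks):

(define_peephole2
  [(parallel [(set (match_operand:SWI48 0 "general_reg_operand")
                   (plus:SWI48 (match_dup 0)
                               (match_operand:SWI48 1 "general_reg_operand")))
              (clobber (reg:CC FLAGS_REG))])
   (parallel [(set (match_dup 0)
                   (plus:SWI48 (match_dup 0)
                               (match_operand:SWI48 2 "x86_64_immediate_operand")))
              (clobber (reg:CC FLAGS_REG))])]
  "!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun)"
  [(set (match_dup 0)
        (plus:SWI48 (plus:SWI48 (match_dup 0) (match_dup 1))
                    (match_dup 2)))])

i.e. a destructive reg-reg addition followed by the addition of a
displacement collapses into a single 3-component lea, under the same
condition that makes ix86_avoid_lea_for_addr return false.  Since both
matched additions merely clobber the flags, nothing downstream can
depend on them, so dropping the clobbers in the replacement is safe.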

I won't argue that my peephole2 is "obviously" correct, but the observation
that it correctly reduces code size when -Os is specified indicates that
something is broken/inconsistent/missing in the current logic, and that
this change is an improvement.
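
For the bar example above, the intended effect of the peephole2 is to
collapse the add/sub pair into a single lea, i.e. (sketching the
expected output by hand rather than pasting compiler output):

bar:    movl    m(%rip), %eax
        leal    -1(%rax,%rdi), %eax     // 4 bytes (vs. 2+3 above)
        cltd
        idivl   %edi
        ret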

Perhaps the TARGET_AVOID_LEA_FOR_ADDR check in ix86_avoid_lea_for_addr
is incorrect; if so, please feel free to file a bugzilla PR.

Ok for mainline?

> Uros.
> > Thanks in advance,
> > Roger

