https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91569

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2020-01-29
          Component|rtl-optimization            |target
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to cubitect from comment #0)
> I wasn't entirely sure where to post this, but I have a very simple test 
> problem that shows some missed optimisation potential. The task is to cast 
> an integer to a long and replace the second lowest byte of the result with 
> a constant (4). Below are three ways to achieve this:
> 
> 
> long opt_test1(int num)             //  opt_test1:
> {                                   //      movslq  %edi, %rax
>     union {                         //      movb    $4, %ah
>         long q;                     //      ret
>         struct { char l,h; };
>     } a;
>     a.q = num;
>     a.h = 4;
>     return a.q;
> }
> 
> The union here is modelled after the structure of an r?x register, which 
> contains the low and high byte registers ?l and ?h. The cast and the 
> second byte assignment can each be done in one instruction. The optimiser 
> understands this and emits the optimal instructions.
> 
> 
> long opt_test2(int num)             //  opt_test2:
> {                                   //      movl    %edi, %eax
>     long a = num;                   //      xor     %ah, %ah
>     a &= (-1UL ^ 0xff00);           //      orb     $4, %ah
>     a |= (4 << 8);                  //      cltq
>     return a;                       //      ret
> }
> 
> This solution, based on a bitwise AND and OR, is interesting. The optimiser 
> recognised that I am interested in the second byte and makes use of the 'ah' 
> register, but why is there an XOR and an OR rather than a single, 
> equivalent MOV? Similarly, the (MOV + CLTQ) pair can be replaced outright 
> with MOVSLQ. Notably, some older versions (such as "gcc-4.8.5 -O3") 
> give results that correspond more closely to the C code:
>     andl    $-65281, %edi
>     orl     $1024, %edi
>     movslq  %edi, %rax
>     ret
> which is actually better than the output for gcc-9.2.

This one looks like a target issue to me.  We end up using

(insn 20 18 21 2 (parallel [
            (set (zero_extract:SI (reg:SI 0 ax [89])
                    (const_int 8 [0x8])
                    (const_int 8 [0x8]))
                (subreg:SI (xor:QI (subreg:QI (zero_extract:SI (reg:SI 0 ax [89])
                                (const_int 8 [0x8])
                                (const_int 8 [0x8])) 0)
                        (subreg:QI (zero_extract:SI (reg:SI 0 ax [89])
                                (const_int 8 [0x8])
                                (const_int 8 [0x8])) 0)) 0))
            (clobber (reg:CC 17 flags))
        ]) "t2.c":5:7 520 {*xorqi_ext_2}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

via split2

> long opt_test3(int num)             //  opt_test3:
> {                                   //      movslq  %edi, %rdi
>     long a = num;                   //      movq    %rdi, -8(%rsp)
>     ((char*)&a)[1] = 4;             //      movb    $4, -7(%rsp)
>     return a;                       //      movq    -8(%rsp), %rax
> }                                   //      ret

Here we could (and currently fail to) rewrite the store to a BIT_INSERT_EXPR
in update_address_taken.  The stack approach suffers from store-to-load
forwarding (STLF) stalls on modern CPUs.

This one could be split out (it would be tree-optimization).  Keeping this
issue for the weird splitter.
