https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78821
--- Comment #33 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 20 Nov 2017, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78821
>
> --- Comment #32 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to rguent...@suse.de from comment #20)
>
> > Index: gcc/tree-ssa-sccvn.c
> > ===================================================================
> > --- gcc/tree-ssa-sccvn.c (revision 254945)
> > +++ gcc/tree-ssa-sccvn.c (working copy)
> > @@ -3632,6 +3632,38 @@ visit_nary_op (tree lhs, gassign *stmt)
>
> The patched compiler works for this particular case (a "break" is needed
> at the top of the patch, though), but there are many other cases where a
> temporary should be calculated and its value stored, e.g.:
>
> --cut here--
> void foo_OK (char *buf, unsigned int data)
> {
>   buf[0] = ~data >> 8;
>   buf[1] = ~data;
> }
>
> void foo (char *buf, unsigned int data)
> {
>   buf[0] = ~data;
>   buf[1] = ~data >> 8;
> }
>
> void bar (char *buf, unsigned int data)
> {
>   buf[0] = -data >> 8;
>   buf[1] = -data;
> }
>
> void baz (char *buf, unsigned int data)
> {
>   buf[0] = (data+3) >> 8;
>   buf[1] = (data+3);
> }
> --cut here--
>
> Only foo_OK compiles (-O2 -march=haswell) to the expected asm:
>
>         notl    %esi
>         movbe   %si, (%rdi)
>
> all others generate a duplicated operation:
>
>         movb    %sil, %al
>         notl    %esi
>         notl    %eax
>         movb    %al, (%rdi)
>         movl    %esi, %eax
>         movb    %ah, 1(%rdi)
>
> Should I open a new PR for the above enhancement?

Ok, I fear we're missing some canonicalization here.  For foo we see

  <bb 2> :
  _1 = (char) data_7(D);
  _2 = ~_1;
  *buf_9(D) = _2;
  _3 = ~data_7(D);
  _4 = _3 >> 8;
  _6 = (char) _4;
  MEM[(char *)buf_9(D) + 1B] = _6;

for bar (the patch didn't handle negates, which I'm not sure are as easily
handled correctness-wise?)

  <bb 2> :
  _1 = -data_8(D);
  _2 = _1 >> 8;
  _3 = (char) _2;
  *buf_10(D) = _3;
  _4 = (unsigned char) data_8(D);
  _5 = -_4;
  _7 = (char) _5;
  MEM[(char *)buf_10(D) + 1B] = _7;

and for baz

  <bb 2> :
  _1 = data_8(D) + 3;
  _2 = _1 >> 8;
  _3 = (char) _2;
  *buf_10(D) = _3;
  _4 = (unsigned char) data_8(D);
  _5 = _4 + 3;
  _7 = (char) _5;
  MEM[(char *)buf_10(D) + 1B] = _7;

so it's already the FE's shortening optimization or folding that makes the
operations non-canonical.  The .original dumps show

  *buf = (char) (~data >> 8);
  *(buf + 1) = ~(char) data;

  *buf = ~(char) data;
  *(buf + 1) = (char) (~data >> 8);

  *buf = (char) (-data >> 8);
  *(buf + 1) = (char) -(unsigned char) data;

  *buf = (char) (data + 3 >> 8);
  *(buf + 1) = (char) ((unsigned char) data + 3);

Note there is a conflicting interest from targets like AVR that prefer
truncation to be moved innermost (which would argue for a CSE-only solution
together with aggressive shortening of operations).  In the above case that
probably means rewriting (char) (~data >> 8) -> ~(char)(data >> 8), or
alternatively deciding we're interested in short only, and thus
(char)~(short)data and (char)(~(short)data >> 8), so we can CSE
~(short)data.

Note that within CSE we have the difficulty that the first stmt is basically
unchangeable with the trick we're doing right now.  This means CSE requires
the first stmt to be in useful canonical form.

That said, bar is easily handled by the proposed patch (though I have to
think about correctness).  The other two can't be done easily - what would
fix them is not narrowing the ~/- operation in the FE.  In this case it is
convert_to_integer_1 doing the narrowing, and fold_unary has a related
transform.
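As a sanity check on the ~ rewrite above (a stand-alone test for illustration
only, not part of any patch): truncation commutes with bitwise NOT, so
truncating before or after the wide ~ and the shift yields the same byte.

--cut here--
#include <assert.h>

/* Byte 1 computed the "wide" way, truncating after the full-width ~.  */
static unsigned char
byte1_wide (unsigned int data)
{
  return (unsigned char) (~data >> 8);
}

/* Byte 1 computed the "narrow" way, truncating before the ~.  */
static unsigned char
byte1_narrow (unsigned int data)
{
  return (unsigned char) ~(unsigned char) (data >> 8);
}

int
main (void)
{
  unsigned int tests[] = { 0u, 1u, 0xffu, 0x100u, 0x1234u, 0xdeadbeefu,
                           0xffffffffu };
  for (unsigned int i = 0; i < sizeof tests / sizeof tests[0]; i++)
    assert (byte1_wide (tests[i]) == byte1_narrow (tests[i]));
  return 0;
}
--cut here--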
Removing the convert_to_integer_1 code fixes the missed transforms, but in baz
we have similar narrowing being performed for the add...  I suppose we do want
to narrow at some point (and in fact the user might already write ~(char)data
...).  As usual we have conflicting interests here.  Note that undoing
narrowing isn't easily done, so narrowing very early is certainly unwanted.
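For illustration, a source-level analogue of the "CSE ~(short)data" form
sketched above - a hypothetical rewrite of foo, only meant to show the shape
of the canonical code; whether store merging then emits a single 16-bit store
is a separate question:

--cut here--
void
foo_narrowed (char *buf, unsigned int data)
{
  /* ~(short)data computed once; the integer promotion of the ~ operand is
     harmless because only the low 16 bits are kept.  */
  unsigned short t = ~(unsigned short) data;
  buf[0] = (char) t;          /* low byte, as in foo's buf[0] = ~data       */
  buf[1] = (char) (t >> 8);   /* high byte, as in foo's buf[1] = ~data >> 8 */
}
--cut here--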