[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-04 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #10 from H.J. Lu  ---
(In reply to Wilco from comment #9)
> (In reply to H.J. Lu from comment #8)
> > Inlining mempcpy uses a callee-saved register:
> > 
> ...
> > 
> > Not inlining mempcpy is preferred.
> 
> If codesize is the only thing that matters... The cost is not at the caller
> side but in requiring a separate mempcpy function which causes extra I-cache
> misses. The only case where mempcpy makes sense is if you can use a shared
> implementation with zero overhead to memcpy.

Some archs have gone extra effort to implement optimized mempcpy in glibc.
There is no reason not to use it.  Sharing instructions between memcpy and
mempcpy belongs to a different discussion.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-04 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #9 from Wilco  ---
(In reply to H.J. Lu from comment #8)
> Inlining mempcpy uses a callee-saved register:
> 
...
> 
> Not inlining mempcpy is preferred.

If codesize is the only thing that matters... The cost is not at the caller
side but in requiring a separate mempcpy function which causes extra I-cache
misses. The only case where mempcpy makes sense is if you can use a shared
implementation with zero overhead to memcpy.

Btw those tests do want to use memcpy anyway, but you could use (mempcpy) to
avoid any unwanted header substitution to ensure the test passes.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #8 from H.J. Lu  ---
Inlining mempcpy uses a callee-saved register:


hjl@gnu-6 tmp]$ cat m.c 
extern char *src, *dst;

char *
foo (unsigned long i)
{
  return __builtin_mempcpy (dst, src, i);
}

char *
bar (unsigned long i)
{
  return __builtin_memcpy (dst, src, i) + i;
}
[hjl@gnu-6 tmp]$ gcc -S -O2 m.c
[hjl@gnu-6 tmp]$ cat m.s
.file   "m.c"
.section.text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl  foo
.type   foo, @function
foo:
.LFB0:
.cfi_startproc
movq%rdi, %rdx
movqsrc(%rip), %rsi
movqdst(%rip), %rdi
jmp mempcpy
.cfi_endproc
.LFE0:
.size   foo, .-foo
.section.text.unlikely
.LCOLDE0:
.text
.LHOTE0:
.section.text.unlikely
.LCOLDB1:
.text
.LHOTB1:
.p2align 4,,15
.globl  bar
.type   bar, @function
bar:
.LFB1:
.cfi_startproc
pushq   %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movq%rdi, %rdx
movq%rdi, %rbx
movqsrc(%rip), %rsi
movqdst(%rip), %rdi
callmemcpy
addq%rbx, %rax
popq%rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE1:
.size   bar, .-bar
.section.text.unlikely
.LCOLDE1:
.text
.LHOTE1:
.ident  "GCC: (GNU) 5.3.1 20160212 (Red Hat 5.3.1-4)"
.section.note.GNU-stack,"",@progbits
[hjl@gnu-6 tmp]$ 

Not inlining mempcpy is preferred.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #7 from Jakub Jelinek  ---
(In reply to Wilco from comment #6)
> (In reply to Jakub Jelinek from comment #4)
> > Note the choice of this in a header file is obviously wrong, if you at some
> > point fix this up, then apps will still call memcpy rather than mempcpy,
> > even when the latter is more efficient (because it doesn't have to save the
> > length value in some location where it survives across the call).
> > Note if you don't use the result of mempcpy, gcc is able to optimize it into
> > memcpy, and tons of other optimizations.
> 
> It's highly unlikely that calling mempcpy would ever provide a performance
> gain.

There is no reason why the memcpy vs. mempcpy calls should not be the same
speed (+/- a cycle or 3, but it can actually be that mempcpy may be faster).
The gain is on the caller side, mempcpy allows for more compact and faster code
in quite common case.
By sticking the bogus __mempcpy_inline you pessimize apps until they are
recompiled, no matter whether glibc is fixed in the mean time or not.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #6 from Wilco  ---
(In reply to Jakub Jelinek from comment #4)
> Note the choice of this in a header file is obviously wrong, if you at some
> point fix this up, then apps will still call memcpy rather than mempcpy,
> even when the latter is more efficient (because it doesn't have to save the
> length value in some location where it survives across the call).
> Note if you don't use the result of mempcpy, gcc is able to optimize it into
> memcpy, and tons of other optimizations.

It's highly unlikely that calling mempcpy would ever provide a performance
gain.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #5 from Wilco  ---
(In reply to Jakub Jelinek from comment #3)
> If some arch in glibc implements memcpy.S and does not implement mempcpy.S,
> then obviously the right fix is to add mempcpy.S for that arch, usually it
> is just a matter of #include memcpy.S with some define USE_AS_MEMPCPY, and
> change a couple of instructions in the assembly.  You don't need to remember
> the original value of dest, usually you have the address to which you store
> bytes in some register, so it is just a matter of copying it to the return
> register.

We had a long discussion on this at the time. Very few targets have implemented
mempcpy.S so it would be a lot of effort to do so. Sharing a single
implementation of memcpy is typically not possible without slowing down memcpy
(which is more important), and thus you end up with 2 separate implementations
which creates unnecessary cache pollution.

So overall I believe the best option for most targets is to defer to memcpy
with an extra add instruction to compute the end. We could do that in GCC
rather than in the GLIBC headers, but then it would be harder to actually call
mempcpy in the case that a target supports an optimized implementation.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #4 from Jakub Jelinek  ---
Note the choice of this in a header file is obviously wrong, if you at some
point fix this up, then apps will still call memcpy rather than mempcpy, even
when the latter is more efficient (because it doesn't have to save the length
value in some location where it survives across the call).
Note if you don't use the result of mempcpy, gcc is able to optimize it into
memcpy, and tons of other optimizations.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

--- Comment #3 from Jakub Jelinek  ---
If some arch in glibc implements memcpy.S and does not implement mempcpy.S,
then obviously the right fix is to add mempcpy.S for that arch, usually it is
just a matter of #include memcpy.S with some define USE_AS_MEMPCPY, and change
a couple of instructions in the assembly.  You don't need to remember the
original value of dest, usually you have the address to which you store bytes
in some register, so it is just a matter of copying it to the return register.

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-03 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

Wilco  changed:

   What|Removed |Added

 CC||wdijkstr at arm dot com

--- Comment #2 from Wilco  ---
(In reply to Jakub Jelinek from comment #1)
> These glibc inlines are just wrong, they should trust the compiler doing the
> right thing at least for contemporary versions.
> 05a910f7b420c2b831f35ba90e61c80f001c0606 should be IMHO reverted, it should
> be approached at the compiler side if the ARM folks see any performance
> issues with what is generated.

It's not about the generated code, it's about whether one calls memcpy or
mempcpy in the case where GCC doesn't inline it. Few targets support an
assembler mempcpy implementation, so by default the best option is defer to
memcpy. Targets can add an optimized mempcpy and define
_HAVE_STRING_ARCH_mempcpy to get mempcpy (as x86 does).

Is there a mechanism by which GCC is told accurately whether the library it
targets supports an optimized mempcpy?

[Bug testsuite/70055] gcc.target/i386/chkp-stropt-16.c is incompatible with glibc 2.23

2016-03-02 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70055

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek  ---
These glibc inlines are just wrong, they should trust the compiler doing the
right thing at least for contemporary versions.
05a910f7b420c2b831f35ba90e61c80f001c0606 should be IMHO reverted, it should be
approached at the compiler side if the ARM folks see any performance issues
with what is generated.