Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs

2021-04-06 Thread H.J. Lu via Gcc-patches
On Tue, Apr 6, 2021 at 2:51 AM Jan Hubicka  wrote:
>
> > > Do you know which of the three changes (preferring rep movsb/stosb,
> > > the CLEAR_RATIO change or the algorithm choice change) causes the two
> > > speedups on eembc?
> >
> > An extracted testcase from nnet_test is at https://godbolt.org/z/c8KdsohTP
> >
> > This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> >
> > Current strategy for skylake is {512, unrolled_loop, false} for such
> > size, so it will generate unrolled loops with mov, while the patch
> > generates memcpy/memset libcall and uses vector move.
>
> This is good - I originally set the table based on this
> micro-benchmarking script and apparently glibc used at that time had
> more expensive memcpy for small blocks.
>
> One thing to consider, however, is that calling an external memcpy also
> has the additional cost of clobbering all caller-saved registers.
> Especially for code that uses SSE this is painful, since everything then
> needs to go to the stack.  So I am not completely sure how representative
> the micro-benchmark is in this respect, since it does not use any SSE and
> register pressure is generally small.
>
> So with current glibc it seems a libcall is a win for blocks of size
> greater than 64 or 128, at least if the register pressure is not big.
> In this respect your change looks good.
> > >
> > > My patch generates "rep movsb" only in very limited cases:
> > >
> > > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> > >load and store for up to 16 * 16 (256) bytes when the data size is
> > >fixed and known.
> > > 2. Inline only if data size is known to be <= 256.
> > >a. Use "rep movsb/stosb" with a simple code sequence if the data size
> > >   is a constant.
> > >b. Use loop if data size is not a constant.
>
> Aha, this is very hard to read from the algorithm descriptor.  So we
> still have the check that maxsize == minsize and use rep movsb only for
> constant-sized blocks when the corresponding TARGET macro is defined.
>
> I think it would be more readable if we introduced rep_1_byte_constant.
> The descriptor is supposed to read as a sequence of rules where the
> first one that matches applies.  It is not obvious that we have another
> TARGET_* macro that makes rep_1_byte be ignored in some cases.
> (The TARGET macro will also interfere with the microbenchmarking script.)
>
> Still I do not understand why a compile-time constant makes rep
> movsb/stosb better than a loop.  Is it the CPU special-casing it at
> decode time and requiring an explicit mov instruction?  Or is it only
> because rep movsb is not good for blocks smaller than 128 bits?

A non-constant "rep movsb" triggers more machine clear events in hot
loops of some workloads:

https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference/cpu-metrics-reference/mo-machine-clear-overhead.html

> > >
> > > As a result,  "rep stosb" is generated only when 128 < data size < 256
> > > with -mno-sse.
> > >
> > > Do you have some data showing blocks of size 8...256 to be faster
> > > with rep1 than with an unrolled loop, perhaps from more real-world
> > > benchmarks?
> > >
> > > "rep movsb" isn't generated with my patch in this case since
> > > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > > XMM registers.
>
> OK, so I guess:
>   {libcall,
>    {{256, rep_1_byte, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}},
>   {libcall,
>    {{256, rep_1_loop, true},
>     {256, unrolled_loop, false},
>     {-1, libcall, false}}}};
>
> may still perform better, but the difference between loop and unrolled
> loop is within a 10% margin.
>
> So I guess the patch is OK and we should look into cleaning up the
> descriptors.  I can make a patch for that once I understand the logic above.

I am checking in my patch.  We can improve it for GCC 12.  We will also revisit:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90773

for GCC 12.

Thanks.

-- 
H.J.


Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs

2021-04-06 Thread Jan Hubicka
> > Do you know which of the three changes (preferring rep movsb/stosb,
> > the CLEAR_RATIO change or the algorithm choice change) causes the two
> > speedups on eembc?
> 
> An extracted testcase from nnet_test is at https://godbolt.org/z/c8KdsohTP
> 
> This loop is transformed to builtin_memcpy and builtin_memset with size 280.
> 
> Current strategy for skylake is {512, unrolled_loop, false} for such
> size, so it will generate unrolled loops with mov, while the patch
> generates memcpy/memset libcall and uses vector move.

This is good - I originally set the table based on this
micro-benchmarking script and apparently glibc used at that time had
more expensive memcpy for small blocks.

One thing to consider, however, is that calling an external memcpy also
has the additional cost of clobbering all caller-saved registers.
Especially for code that uses SSE this is painful, since everything then
needs to go to the stack.  So I am not completely sure how representative
the micro-benchmark is in this respect, since it does not use any SSE and
register pressure is generally small.

So with current glibc it seems a libcall is a win for blocks of size
greater than 64 or 128, at least if the register pressure is not big.
In this respect your change looks good.
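
As a contrived illustration of the register clobbering point (my own
sketch, not the benchmark code), consider an SSE accumulator that is live
across the copy; with a libcall expansion it has to be spilled and
reloaded around every call, because all xmm registers are caller-saved:

#include <immintrin.h>
#include <string.h>

__m128
sum_with_copies (float *dst, const float *src, const __m128 *v, int n)
{
  __m128 acc = _mm_setzero_ps ();
  for (int i = 0; i < n; i++)
    {
      acc = _mm_add_ps (acc, v[i]);
      /* 70 floats == 280 bytes; if expanded as a libcall, acc must go
         to the stack around the call.  */
      memcpy (dst + i * 70, src + i * 70, 280);
    }
  return acc;
}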
> >
> > My patch generates "rep movsb" only in very limited cases:
> >
> > 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
> >load and store for up to 16 * 16 (256) bytes when the data size is
> >fixed and known.
> > 2. Inline only if data size is known to be <= 256.
> >a. Use "rep movsb/stosb" with a simple code sequence if the data size
> >   is a constant.
> >b. Use loop if data size is not a constant.

Aha, this is very hard to read from the algorithm descriptor.  So we
still have the check that maxsize == minsize and use rep movsb only for
constant-sized blocks when the corresponding TARGET macro is defined.

I think it would be more readable if we introduced rep_1_byte_constant.
The descriptor is supposed to read as a sequence of rules where the
first one that matches applies.  It is not obvious that we have another
TARGET_* macro that makes rep_1_byte be ignored in some cases.
(The TARGET macro will also interfere with the microbenchmarking script.)
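
For reference, my reading of how one of these descriptors is consulted,
as a self-contained sketch (the field layout follows stringop_algs in the
i386 backend; the scan logic is paraphrased from memory, not copied from
the compiler):

/* First-match rule selection over entries such as
   {libcall, {{256, rep_prefix_1_byte, true},
              {256, loop, false},
              {-1, libcall, false}}}.  */
struct stringop_strategy { int max; int alg; int noalign; };

static int
choose_alg (int unknown_size_alg, const struct stringop_strategy *rules,
            int nrules, long size, int size_is_known)
{
  if (!size_is_known)
    return unknown_size_alg;         /* the leading libcall entry */
  for (int i = 0; i < nrules; i++)   /* first rule that fits wins */
    if (rules[i].max == -1 || size <= rules[i].max)
      return rules[i].alg;
  return unknown_size_alg;
}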

Still I do not understand why a compile-time constant makes rep
movsb/stosb better than a loop.  Is it the CPU special-casing it at
decode time and requiring an explicit mov instruction?  Or is it only
because rep movsb is not good for blocks smaller than 128 bits?

> >
> > As a result,  "rep stosb" is generated only when 128 < data size < 256
> > with -mno-sse.
> >
> > > Do you have some data showing blocks of size 8...256 to be faster
> > > with rep1 than with an unrolled loop, perhaps from more real-world
> > > benchmarks?
> >
> > "rep movsb" isn't generated with my patch in this case since
> > MOVE_RATIO == 17 can copy up to 16 * 16 (256) bytes with
> > XMM registers.

OK, so I guess:
  {libcall,
   {{256, rep_1_byte, true},
    {256, unrolled_loop, false},
    {-1, libcall, false}}},
  {libcall,
   {{256, rep_1_loop, true},
    {256, unrolled_loop, false},
    {-1, libcall, false}}}};

may still perform better, but the difference between loop and unrolled
loop is within a 10% margin.

So I guess the patch is OK and we should look into cleaning up the
descriptors.  I can make a patch for that once I understand the logic above.

Honza
> >
> > > The difference seems to get quite big for small blocks in the range
> > > 8...16 bytes.  I noticed that before and sort of concluded that it is
> > > probably the branch prediction playing relatively well for those small
> > > block sizes.  On the other hand, winding up the relatively long
> > > unrolled loop is not very cool just to catch this case.
> > >
> > > Do you know which of the three changes (preferring rep movsb/stosb,
> > > the CLEAR_RATIO change or the algorithm choice change) causes the two
> > > speedups on eembc?
> >
> > Hongyu, can you find out where the speedup came from?
> >
> > Thanks.
> >
> > --
> > H.J.


Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs

2021-04-06 Thread Hongyu Wang via Gcc-patches
> Do you know which of the three changes (preferring rep movsb/stosb,
> the CLEAR_RATIO change or the algorithm choice change) causes the two
> speedups on eembc?

An extracted testcase from nnet_test is at https://godbolt.org/z/c8KdsohTP

This loop is transformed to builtin_memcpy and builtin_memset with size 280.

Current strategy for skylake is {512, unrolled_loop, false} for such
size, so it will generate unrolled loops with mov, while the patch
generates memcpy/memset libcall and uses vector move.

For idctrn01 it is a memset with size 512, so the speedups come from the
algorithm change.
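
As a rough stand-in for the nnet_test loop (my own reduction, for
illustration only; the godbolt link above has the actual testcase), the
shape is a fixed-count copy/clear that GCC converts to the builtins:

/* Hypothetical reduction: with -O2 the first loop becomes
   __builtin_memcpy (dst, src, 280) and the second
   __builtin_memset (acc, 0, 280), since 35 * sizeof (double) == 280.  */
#define N 35

void
nnet_like (double *dst, const double *src, double *acc)
{
  for (int i = 0; i < N; i++)
    dst[i] = src[i];
  for (int i = 0; i < N; i++)
    acc[i] = 0.0;
}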

H.J. Lu via Gcc-patches wrote on Tue, Apr 6, 2021 at 5:55 AM:
>
> On Mon, Apr 5, 2021 at 2:14 PM Jan Hubicka  wrote:
> >
> > > >  /* skylake_cost should produce code tuned for Skylake familly of CPUs. 
> > > >  */
> > > >  static stringop_algs skylake_memcpy[2] =   {
> > > > -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > > > -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> > > > - {-1, libcall, false}}}};
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +{256, loop, false},
> > > > +{-1, libcall, false}}},
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +{256, loop, false},
> > > > +{-1, libcall, false}}}};
> > > >
> > > >  static stringop_algs skylake_memset[2] = {
> > > > -  {libcall, {{6, loop_1_byte, true},
> > > > - {24, loop, true},
> > > > - {8192, rep_prefix_4_byte, true},
> > > > - {-1, libcall, false}}},
> > > > -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> > > > - {-1, libcall, false}}}};
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +{256, loop, false},
> > > > +{-1, libcall, false}}},
> > > > +  {libcall,
> > > > +   {{256, rep_prefix_1_byte, true},
> > > > +{256, loop, false},
> > > > +{-1, libcall, false}}}};
> > > >
> > >
> > > If there are no objections, I will check it in on Wednesday.
> >
> > On my skylake notebook if I run the benchmarking script I get:
> >
> > jan@skylake:~/trunk/contrib> ./bench-stringop 64 64000 gcc -march=native
> > memcpy
> >   block size  libcall    rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg     sse   noalg    byte     PGO dynamic  BEST
> >      8192000  0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18  0:00.19 sse
> >       819200  0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09  0:00.09 libcall
> >        81920  0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06  0:00.06 libcall
> >        20480  0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09  0:00.05 rep1noalign
> >         8192  0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05  0:00.04 rep1noalign
> >         4096  0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07  0:00.05 libcall
> >         2048  0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07  0:00.04 libcall
> >         1024  0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06  0:00.06 libcall
> >          512  0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08  0:00.06 libcall
> >          256  0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12  0:00.10 libcall
> >          128  0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17  0:00.15 libcall
> >           64  0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28  0:00.25 loop
> >           48  0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31  0:00.32 unrl
> >           32  0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40  0:00.40 unrl
> >           24  0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50  0:00.50 unrlnoalign
> >           16  0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91  0:00.77 unrlnoalign

Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs

2021-04-05 Thread H.J. Lu via Gcc-patches
On Mon, Apr 5, 2021 at 2:14 PM Jan Hubicka  wrote:
>
> > >  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  
> > > */
> > >  static stringop_algs skylake_memcpy[2] =   {
> > > -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > > -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> > > - {-1, libcall, false}}}};
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +{256, loop, false},
> > > +{-1, libcall, false}}},
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +{256, loop, false},
> > > +{-1, libcall, false}}}};
> > >
> > >  static stringop_algs skylake_memset[2] = {
> > > -  {libcall, {{6, loop_1_byte, true},
> > > - {24, loop, true},
> > > - {8192, rep_prefix_4_byte, true},
> > > - {-1, libcall, false}}},
> > > -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> > > - {-1, libcall, false}}}};
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +{256, loop, false},
> > > +{-1, libcall, false}}},
> > > +  {libcall,
> > > +   {{256, rep_prefix_1_byte, true},
> > > +{256, loop, false},
> > > +{-1, libcall, false}}}};
> > >
> >
> > If there are no objections, I will check it in on Wednesday.
>
> On my skylake notebook if I run the benchmarking script I get:
>
> jan@skylake:~/trunk/contrib> ./bench-stringop 64 64000 gcc -march=native
> memcpy
>   block size  libcall    rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg     sse   noalg    byte     PGO dynamic  BEST
>      8192000  0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18  0:00.19 sse
>       819200  0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09  0:00.09 libcall
>        81920  0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06  0:00.06 libcall
>        20480  0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09  0:00.05 rep1noalign
>         8192  0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05  0:00.04 rep1noalign
>         4096  0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07  0:00.05 libcall
>         2048  0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07  0:00.04 libcall
>         1024  0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06  0:00.06 libcall
>          512  0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08  0:00.06 libcall
>          256  0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12  0:00.10 libcall
>          128  0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17  0:00.15 libcall
>           64  0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28  0:00.25 loop
>           48  0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31  0:00.32 unrl
>           32  0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40  0:00.40 unrl
>           24  0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50  0:00.50 unrlnoalign
>           16  0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91  0:00.77 unrlnoalign
>           14  0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99  0:00.94 unrl
>           12  0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10  0:01.02 unrl
>           10  0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38  0:01.23 unrlnoalign
>            8  0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55  0:01.38 unrl

Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs

2021-04-05 Thread Jan Hubicka
> >  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
> >  static stringop_algs skylake_memcpy[2] =   {
> > -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> > -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> > - {-1, libcall, false}}}};
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +{256, loop, false},
> > +{-1, libcall, false}}},
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +{256, loop, false},
> > +{-1, libcall, false}}}};
> >
> >  static stringop_algs skylake_memset[2] = {
> > -  {libcall, {{6, loop_1_byte, true},
> > - {24, loop, true},
> > - {8192, rep_prefix_4_byte, true},
> > - {-1, libcall, false}}},
> > -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> > - {-1, libcall, false}}}};
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +{256, loop, false},
> > +{-1, libcall, false}}},
> > +  {libcall,
> > +   {{256, rep_prefix_1_byte, true},
> > +{256, loop, false},
> > +{-1, libcall, false}}}};
> >
> 
> If there are no objections, I will check it in on Wednesday.

On my skylake notebook if I run the benchmarking script I get:

jan@skylake:~/trunk/contrib> ./bench-stringop 64 64000 gcc -march=native
memcpy
  block size  libcall    rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg     sse   noalg    byte     PGO dynamic  BEST
     8192000  0:00.23 0:00.21 0:00.21 0:00.21 0:00.21 0:00.22 0:00.24 0:00.28 0:00.22 0:00.20 0:00.21 0:00.19 0:00.19 0:00.77 0:00.18 0:00.18  0:00.19 sse
      819200  0:00.09 0:00.18 0:00.18 0:00.18 0:00.18 0:00.18 0:00.20 0:00.19 0:00.16 0:00.15 0:00.16 0:00.13 0:00.14 0:00.63 0:00.09 0:00.09  0:00.09 libcall
       81920  0:00.06 0:00.07 0:00.07 0:00.06 0:00.06 0:00.06 0:00.06 0:00.12 0:00.11 0:00.11 0:00.10 0:00.07 0:00.08 0:00.66 0:00.11 0:00.06  0:00.06 libcall
       20480  0:00.06 0:00.07 0:00.05 0:00.06 0:00.07 0:00.07 0:00.08 0:00.14 0:00.14 0:00.10 0:00.11 0:00.06 0:00.07 0:01.11 0:00.07 0:00.09  0:00.05 rep1noalign
        8192  0:00.06 0:00.05 0:00.04 0:00.05 0:00.06 0:00.07 0:00.07 0:00.12 0:00.15 0:00.11 0:00.10 0:00.06 0:00.06 0:00.64 0:00.06 0:00.05  0:00.04 rep1noalign
        4096  0:00.05 0:00.05 0:00.05 0:00.06 0:00.07 0:00.05 0:00.05 0:00.09 0:00.14 0:00.11 0:00.10 0:00.07 0:00.06 0:00.61 0:00.05 0:00.07  0:00.05 libcall
        2048  0:00.04 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.14 0:00.09 0:00.09 0:00.09 0:00.07 0:00.64 0:00.06 0:00.07  0:00.04 libcall
        1024  0:00.06 0:00.08 0:00.08 0:00.10 0:00.11 0:00.06 0:00.06 0:00.12 0:00.15 0:00.09 0:00.09 0:00.16 0:00.09 0:00.63 0:00.05 0:00.06  0:00.06 libcall
         512  0:00.06 0:00.07 0:00.08 0:00.12 0:00.08 0:00.10 0:00.09 0:00.13 0:00.16 0:00.10 0:00.10 0:00.28 0:00.18 0:00.66 0:00.13 0:00.08  0:00.06 libcall
         256  0:00.10 0:00.12 0:00.11 0:00.14 0:00.11 0:00.12 0:00.13 0:00.14 0:00.16 0:00.13 0:00.12 0:00.49 0:00.30 0:00.68 0:00.14 0:00.12  0:00.10 libcall
         128  0:00.15 0:00.19 0:00.18 0:00.20 0:00.19 0:00.20 0:00.18 0:00.19 0:00.21 0:00.17 0:00.15 0:00.49 0:00.43 0:00.72 0:00.17 0:00.17  0:00.15 libcall
          64  0:00.29 0:00.28 0:00.29 0:00.33 0:00.33 0:00.34 0:00.29 0:00.25 0:00.29 0:00.26 0:00.26 0:01.01 0:00.97 0:01.13 0:00.32 0:00.28  0:00.25 loop
          48  0:00.37 0:00.39 0:00.38 0:00.45 0:00.41 0:00.45 0:00.44 0:00.45 0:00.33 0:00.32 0:00.33 0:02.21 0:02.22 0:00.87 0:00.32 0:00.31  0:00.32 unrl
          32  0:00.54 0:00.52 0:00.50 0:00.60 0:00.62 0:00.61 0:00.52 0:00.42 0:00.43 0:00.40 0:00.42 0:01.18 0:01.16 0:01.14 0:00.39 0:00.40  0:00.40 unrl
          24  0:00.71 0:00.74 0:00.77 0:00.83 0:00.78 0:00.81 0:00.75 0:00.52 0:00.52 0:00.52 0:00.50 0:02.28 0:02.27 0:00.94 0:00.49 0:00.50  0:00.50 unrlnoalign
          16  0:00.97 0:01.03 0:01.20 0:01.52 0:01.37 0:01.84 0:01.10 0:00.90 0:00.86 0:00.79 0:00.77 0:01.27 0:01.32 0:01.25 0:00.91 0:00.91  0:00.77 unrlnoalign
          14  0:01.35 0:01.37 0:01.39 0:01.76 0:01.44 0:01.53 0:01.58 0:01.01 0:00.99 0:00.94 0:00.94 0:01.34 0:01.29 0:01.28 0:01.01 0:00.99  0:00.94 unrl
          12  0:01.48 0:01.55 0:01.55 0:01.70 0:01.55 0:02.01 0:01.52 0:01.11 0:01.07 0:01.02 0:01.04 0:02.21 0:02.25 0:01.19 0:01.11 0:01.10  0:01.02 unrl
          10  0:01.73 0:01.90 0:01.88 0:02.05 0:01.86 0:02.09 0:01.78 0:01.32 0:01.41 0:01.25 0:01.23 0:02.46 0:02.25 0:01.36 0:01.50 0:01.38  0:01.23 unrlnoalign
           8  0:02.22 0:02.17 0:02.18 0:02.43 0:02.09 0:02.55 0:01.92 0:01.54 0:01.46 0:01.38 0:01.38 0:01.51 0:01.62 0:01.54 0:01.55 0:01.55  0:01.38 unrl
So indeed rep byte seems to consistently outperform rep4/rep8; however,
the unrolled variant seems to be better than rep byte for small block
sizes.  Do you have some data showing blocks of size 8...256 to be faster
with rep1 than with an unrolled loop, perhaps from more real-world
benchmarks?

Re: [PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs

2021-04-05 Thread H.J. Lu via Gcc-patches
On Mon, Mar 22, 2021 at 6:16 AM H.J. Lu  wrote:
>
> Simplify memcpy and memset inline strategies to avoid branches for
> Skylake family CPUs:
>
> 1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
>load and store for up to 16 * 16 (256) bytes when the data size is
>fixed and known.
> 2. Inline only if data size is known to be <= 256.
>    a. Use "rep movsb/stosb" with a simple code sequence if the data size
>   is a constant.
>b. Use loop if data size is not a constant.
> 3. Use memcpy/memset library function if data size is unknown or > 256.
>
> On Cascadelake processor with -march=native -Ofast -flto,
>
> 1. Performance impacts of SPEC CPU 2017 rate are:
>
> 500.perlbench_r  0.17%
> 502.gcc_r   -0.36%
> 505.mcf_r0.00%
> 520.omnetpp_r0.08%
> 523.xalancbmk_r -0.62%
> 525.x264_r   1.04%
> 531.deepsjeng_r  0.11%
> 541.leela_r -1.09%
> 548.exchange2_r -0.25%
> 557.xz_r 0.17%
> Geomean -0.08%
>
> 503.bwaves_r 0.00%
> 507.cactuBSSN_r  0.69%
> 508.namd_r  -0.07%
> 510.parest_r 1.12%
> 511.povray_r 1.82%
> 519.lbm_r0.00%
> 521.wrf_r   -1.32%
> 526.blender_r   -0.47%
> 527.cam4_r   0.23%
> 538.imagick_r   -1.72%
> 544.nab_r   -0.56%
> 549.fotonik3d_r  0.12%
> 554.roms_r   0.43%
> Geomean  0.02%
>
> 2. Significant impacts on eembc benchmarks are:
>
> eembc/idctrn01   9.23%
> eembc/nnet_test  29.26%
>
> gcc/
>
> * config/i386/x86-tune-costs.h (skylake_memcpy): Updated.
> (skylake_memset): Likewise.
> (skylake_cost): Change CLEAR_RATIO to 17.
> * config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
> Replace m_CANNONLAKE, m_ICELAKE_CLIENT, m_ICELAKE_SERVER,
> m_TIGERLAKE and m_SAPPHIRERAPIDS with m_SKYLAKE and m_CORE_AVX512.
>
> gcc/testsuite/
>
> * gcc.target/i386/memcpy-strategy-9.c: New test.
> * gcc.target/i386/memcpy-strategy-10.c: Likewise.
> * gcc.target/i386/memcpy-strategy-11.c: Likewise.
> * gcc.target/i386/memset-strategy-7.c: Likewise.
> * gcc.target/i386/memset-strategy-8.c: Likewise.
> * gcc.target/i386/memset-strategy-9.c: Likewise.
> ---
>  gcc/config/i386/x86-tune-costs.h  | 27 ---
>  gcc/config/i386/x86-tune.def  |  3 +--
>  .../gcc.target/i386/memcpy-strategy-10.c  | 11 
>  .../gcc.target/i386/memcpy-strategy-11.c  | 18 +
>  .../gcc.target/i386/memcpy-strategy-9.c   |  9 +++
>  .../gcc.target/i386/memset-strategy-7.c   | 11 
>  .../gcc.target/i386/memset-strategy-8.c   |  9 +++
>  .../gcc.target/i386/memset-strategy-9.c   | 17 
>  8 files changed, 93 insertions(+), 12 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c
>
> diff --git a/gcc/config/i386/x86-tune-costs.h 
> b/gcc/config/i386/x86-tune-costs.h
> index 0e00ff99df3..ffe810f2bcb 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1822,17 +1822,24 @@ struct processor_costs znver3_cost = {
>
>  /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
>  static stringop_algs skylake_memcpy[2] =   {
> -  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
> -  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
> - {-1, libcall, false}}}};
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +{256, loop, false},
> +{-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +{256, loop, false},
> +{-1, libcall, false}}}};
>
>  static stringop_algs skylake_memset[2] = {
> -  {libcall, {{6, loop_1_byte, true},
> - {24, loop, true},
> - {8192, rep_prefix_4_byte, true},
> - {-1, libcall, false}}},
> -  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
> - {-1, libcall, false}}}};
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +{256, loop, false},
> +{-1, libcall, false}}},
> +  {libcall,
> +   {{256, rep_prefix_1_byte, true},
> +{256, loop, false},
> +{-1, libcall, false}}}};
>
>  static const
>  struct processor_costs skylake_cost = {
> @@ -1889,7 +1896,7 @@ struct processor_costs skylake_cost = {
>COSTS_N_INSNS (0),   /* cost of movzx */
>8,   /* "large" insn */
>17,  /* MOVE_RATIO */
> -  6,   /* CLEAR_RATIO */
> +  17,  /* CLEAR_RATIO */
>{4, 4, 4},  

[PATCH 2/3] x86: Update memcpy/memset inline strategies for Skylake family CPUs

2021-03-22 Thread H.J. Lu via Gcc-patches
Simplify memcpy and memset inline strategies to avoid branches for
Skylake family CPUs:

1. With MOVE_RATIO and CLEAR_RATIO == 17, GCC will use integer/vector
   load and store for up to 16 * 16 (256) bytes when the data size is
   fixed and known.
2. Inline only if data size is known to be <= 256.
   a. Use "rep movsb/stosb" with a simple code sequence if the data size
  is a constant.
   b. Use loop if data size is not a constant.
3. Use memcpy/memset library function if data size is unknown or > 256.
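
To make the three cases concrete, a sketch of my own (not from the patch;
the actual expansion also depends on alignment, tuning and value-range
information):

#include <string.h>

void
copy_const (char *d, const char *s)
{
  /* Cases 1/2a: constant size <= 256.  With SSE, inlined as integer/
     vector moves; with -mno-sse and 128 < size < 256, as "rep movsb".  */
  memcpy (d, s, 256);
}

void
copy_bounded (char *d, const char *s, unsigned int n)
{
  /* Case 2b: size not a constant but known to be <= 256: inline loop.  */
  if (n > 256)
    n = 256;
  memcpy (d, s, n);
}

void
copy_any (char *d, const char *s, size_t n)
{
  /* Case 3: size unknown or possibly > 256: memcpy libcall.  */
  memcpy (d, s, n);
}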

On Cascadelake processor with -march=native -Ofast -flto,

1. Performance impacts of SPEC CPU 2017 rate are:

500.perlbench_r  0.17%
502.gcc_r   -0.36%
505.mcf_r0.00%
520.omnetpp_r0.08%
523.xalancbmk_r -0.62%
525.x264_r   1.04%
531.deepsjeng_r  0.11%
541.leela_r -1.09%
548.exchange2_r -0.25%
557.xz_r 0.17%
Geomean -0.08%

503.bwaves_r 0.00%
507.cactuBSSN_r  0.69%
508.namd_r  -0.07%
510.parest_r 1.12%
511.povray_r 1.82%
519.lbm_r0.00%
521.wrf_r   -1.32%
526.blender_r   -0.47%
527.cam4_r   0.23%
538.imagick_r   -1.72%
544.nab_r   -0.56%
549.fotonik3d_r  0.12%
554.roms_r   0.43%
Geomean  0.02%

2. Significant impacts on eembc benchmarks are:

eembc/idctrn01   9.23%
eembc/nnet_test  29.26%

gcc/

* config/i386/x86-tune-costs.h (skylake_memcpy): Updated.
(skylake_memset): Likewise.
(skylake_cost): Change CLEAR_RATIO to 17.
* config/i386/x86-tune.def (X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB):
Replace m_CANNONLAKE, m_ICELAKE_CLIENT, m_ICELAKE_SERVER,
m_TIGERLAKE and m_SAPPHIRERAPIDS with m_SKYLAKE and m_CORE_AVX512.

gcc/testsuite/

* gcc.target/i386/memcpy-strategy-9.c: New test.
* gcc.target/i386/memcpy-strategy-10.c: Likewise.
* gcc.target/i386/memcpy-strategy-11.c: Likewise.
* gcc.target/i386/memset-strategy-7.c: Likewise.
* gcc.target/i386/memset-strategy-8.c: Likewise.
* gcc.target/i386/memset-strategy-9.c: Likewise.
---
 gcc/config/i386/x86-tune-costs.h  | 27 ---
 gcc/config/i386/x86-tune.def  |  3 +--
 .../gcc.target/i386/memcpy-strategy-10.c  | 11 
 .../gcc.target/i386/memcpy-strategy-11.c  | 18 +
 .../gcc.target/i386/memcpy-strategy-9.c   |  9 +++
 .../gcc.target/i386/memset-strategy-7.c   | 11 
 .../gcc.target/i386/memset-strategy-8.c   |  9 +++
 .../gcc.target/i386/memset-strategy-9.c   | 17 
 8 files changed, 93 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-10.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-11.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memcpy-strategy-9.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-7.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-8.c
 create mode 100644 gcc/testsuite/gcc.target/i386/memset-strategy-9.c

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 0e00ff99df3..ffe810f2bcb 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1822,17 +1822,24 @@ struct processor_costs znver3_cost = {
 
 /* skylake_cost should produce code tuned for Skylake familly of CPUs.  */
 static stringop_algs skylake_memcpy[2] =   {
-  {libcall, {{1024, rep_prefix_4_byte, true}, {-1, libcall, false}}},
-  {libcall, {{16, loop, false}, {512, unrolled_loop, false},
- {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+{256, loop, false},
+{-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+{256, loop, false},
+{-1, libcall, false}}}};
 
 static stringop_algs skylake_memset[2] = {
-  {libcall, {{6, loop_1_byte, true},
- {24, loop, true},
- {8192, rep_prefix_4_byte, true},
- {-1, libcall, false}}},
-  {libcall, {{24, loop, true}, {512, unrolled_loop, false},
- {-1, libcall, false}}}};
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+{256, loop, false},
+{-1, libcall, false}}},
+  {libcall,
+   {{256, rep_prefix_1_byte, true},
+{256, loop, false},
+{-1, libcall, false}}}};
 
 static const
 struct processor_costs skylake_cost = {
@@ -1889,7 +1896,7 @@ struct processor_costs skylake_cost = {
   COSTS_N_INSNS (0),   /* cost of movzx */
   8,   /* "large" insn */
   17,  /* MOVE_RATIO */
-  6,   /* CLEAR_RATIO */
+  17,  /* CLEAR_RATIO */
   {4, 4, 4},   /* cost of loading integer registers
   in QImode, HImode and SImode.
   Relative to reg-reg move (2).  */
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
inde