https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88915

            Bug ID: 88915
           Summary: Try smaller vectorisation factors in scalar fallback
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
            Blocks: 53947
  Target Milestone: ---

The hot function get_ref in 525.x264_r inlines a hot helper, pixel_avg, that
computes a rounding byte average:
void pixel_avg( unsigned char *dst, int i_dst_stride,
                unsigned char *src1, int i_src1_stride,
                unsigned char *src2, int i_src2_stride,
                int i_width, int i_height )
{
    for( int y = 0; y < i_height; y++ )
    {
        for( int x = 0; x < i_width; x++ )
            dst[x] = ( src1[x] + src2[x] + 1 ) >> 1;
        dst += i_dst_stride;
        src1 += i_src1_stride;
        src2 += i_src2_stride;
    }
}

GCC 9 already knows how to generate vector average instructions (PR 85694).
For aarch64 it generates a 16x vectorised loop.
Runtime profiling of the arguments to this function, however, shows that >50%
of the time i_width is 8, so the vector loop is skipped in favour of a scalar
fallback:
32.07%  40ed2c          ldrb    w3, [x0,x5]
11.41%  40ed30          ldrb    w11, [x4,x5]
        40ed34          add     w3, w3, w11
        40ed38          add     w3, w3, #0x1
        40ed3c          asr     w3, w3, #1
0.71%   40ed40          strb    w3, [x2,x5]
        40ed44          add     x5, x5, #0x1
        40ed48          cmp     w6, w5
        40ed4c          b.gt    <loop>

The most frequent runtime combinations of inputs to this function are:
29240545 i_height: 8, i_width: 8, i_dst_stride: 16, i_src1_stride: 1344,
i_src2_stride: 1344
22714355 i_height: 16, i_width: 16, i_dst_stride: 16, i_src1_stride: 1344,
i_src2_stride: 1344
19669512 i_height: 8, i_width: 8, i_dst_stride: 16, i_src1_stride: 704,
i_src2_stride: 704
3689216 i_height: 16, i_width: 8, i_dst_stride: 16, i_src1_stride: 1344,
i_src2_stride: 1344
3670639 i_height: 8, i_width: 16, i_dst_stride: 16, i_src1_stride: 1344,
i_src2_stride: 1344

That's a shame. AArch64 supports the V8QI form of the vector average
instruction (and advertises it through optabs).
With --param vect-epilogues-nomask=1 we already generate something like:
if (bytes_left > 16)
{
  while (bytes_left > 16)
    16x_vectorised;
  if (bytes_left > 8)
    8x_vectorised;
  unrolled_scalar_epilogue;
}
else
  scalar_loop;

Could we perhaps generate:
  while (bytes_left > 16)
    16x_vectorised;
  if (bytes_left > 8)
    8x_vectorised;
  unrolled_scalar_epilogue; // or keep it as a rolled scalar_loop to save on codesize?

Basically I'm looking for a way to take advantage of the 8x vectorised form.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
