https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85406

            Bug ID: 85406
           Summary: Unnecessary blend when vectorizing short-cutted
                    calculations
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: linux at carewolf dot com
  Target Milestone: ---

If you have something like this:

inline unsigned qPremultiply(unsigned x)
{
    const unsigned a = x >> 24;
    if (a == 255)
      return x;

    unsigned t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;

    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);

}

GCC vectorizes this so that the longer calculation is always performed, with a
blend added at the end to merge the two possible return values. The blend is
unnecessary, however: for a == 255 the longer calculation returns x unchanged,
so both paths give the same result and the blend can be dropped.
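
The claim that both paths agree can be checked by brute force. The following
is only a sketch: premultiply_long and long_path_is_identity_for_opaque are
hypothetical names, not part of the original code. The first is the function
above with the early return removed, the second runs it over every pixel
value whose alpha is 255.

```c
/* The function from the report with the a == 255 early return removed,
   so the long calculation is always performed. */
static unsigned premultiply_long(unsigned x)
{
    const unsigned a = x >> 24;

    unsigned t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;

    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);
}

/* Returns 1 if the long calculation leaves every fully opaque pixel
   (alpha == 255) unchanged, 0 as soon as a mismatch is found.
   Assumes unsigned is 32 bits wide. */
static int long_path_is_identity_for_opaque(void)
{
    for (unsigned rgb = 0; rgb <= 0xffffffu; ++rgb) {
        const unsigned x = 0xff000000u | rgb;
        if (premultiply_long(x) != x)
            return 0;
    }
    return 1;
}
```

If the check returns 1, the short-cut only saves work and never changes the
result, which is what makes the blend redundant.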

Also, vectorizing this is a bit risky in any case: the performance difference
between the two branches is substantial, and here the short-cut is likely to be
taken most of the time, so a non-vectorized loop might be faster than a
vectorized one simply by doing much less work.

The latter could be fixed if the short-cut were vectorized as well, for
instance by making the test for 4 values at a time and skipping the long route
if none of them need it.
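
Written out by hand, the idea might look roughly like the sketch below. It is
not what GCC generates; premultiply_span and premultiply_long are hypothetical
names, and the group size of 4 simply assumes a 128-bit vector of 32-bit
pixels. The group test ANDs the four pixels together, so the combined alpha
byte is 255 only if all four are opaque.

```c
#include <stddef.h>

/* The long calculation from the report, without the early return. */
static unsigned premultiply_long(unsigned x)
{
    const unsigned a = x >> 24;

    unsigned t = (x & 0xff00ff) * a;
    t = (t + ((t >> 8) & 0xff00ff) + 0x800080) >> 8;
    t &= 0xff00ff;

    x = ((x >> 8) & 0xff) * a;
    x = (x + ((x >> 8) & 0xff) + 0x80);
    x &= 0xff00;
    return x | t | (a << 24);
}

/* Premultiplies n pixels in place, testing the short-cut once per
   group of 4 instead of once per pixel. */
static void premultiply_span(unsigned *buf, size_t n)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* The alpha byte of the AND stays 0xff only if all four are opaque. */
        const unsigned all = buf[i] & buf[i + 1] & buf[i + 2] & buf[i + 3];
        if ((all >> 24) == 255)
            continue;               /* whole group takes the short-cut */

        /* Mixed group: run the long route on all four. No blend is
           needed, since the long route leaves opaque pixels unchanged. */
        for (size_t j = 0; j < 4; ++j)
            buf[i + j] = premultiply_long(buf[i + j]);
    }

    /* Remaining 0..3 pixels. */
    for (; i < n; ++i)
        buf[i] = premultiply_long(buf[i]);
}
```

With mostly opaque input, each group then costs three ANDs, a shift and a
compare, while the inner loop over a mixed group is straight-line code that
can be vectorized without a blend.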
