https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101225

            Bug ID: 101225
           Summary: Example where y % 16 == 0 seems more expensive than y
                    % 400 == 0.
           Product: gcc
           Version: 11.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cassio.neri at gmail dot com
  Target Milestone: ---

Consider this implementation of is_leap_year:

bool is_leap_year_1(short year) {
  return year % 100 == 0 ? year % 400 == 0 : year % 4 == 0;
}

If a number is multiple of 100, then it's divisible by 400 if and only if it's
divisible by 16. Since checking divisibility by 16 is cheap, one would expect
the following version to be more efficient (at least, not worse):

bool is_leap_year_2(short year) {
  return year % 100 == 0 ? year % 16 == 0 : year % 4 == 0;
}

According to [1] the latter is 1.4x slower than the former.

The emitted instructions with -O3 [2] don't seem bad and, except for a leal x
addw, the difference is a localized strength-reduction from "y % 400 == 0" to
"y % 16 == 0":

is_leap_year_1(short):
  imulw $23593, %di, %ax
  leal 1308(%rax), %edx
  rorw $2, %dx
  cmpw $654, %dx
  ja .L2
  addw $1296, %ax # Begin: year % 400 == 0
  rorw $4, %ax    #
  cmpw $162, %ax  #
  setbe %al       # End  : year % 400 == 0
  ret
.L2:
  andl $3, %edi
  sete %al
  ret

is_leap_year_2(short):
  imulw $23593, %di, %ax
  addw $1308, %ax
  rorw $2, %ax
  cmpw $654, %ax
  ja .L6
  andl $15, %edi # Begin: y % 16 == 0
  sete %al       # End  : y % 16 == 0
  ret
.L6:
  andl $3, %edi
  sete %al
  ret

FWIW: My educated **guess** is that the issue is the choice of registers: for
version 1 just after leal, the register rax/ax/al is free and regardless of the
branch taken, the CPU can continue the calculation of "y % 100 == 0" in
parallel with the other divisibility check, up to "sete %al". For version 2,
rax/ax/al is busy during the whole execution of "y % 100" and "sete %al" can't
be preemptively executed. As a test for my theory I reimplemented half of
is_leap_year_2 in inline asm (see in [1] and [2]) using similar choices of
registers as in is_leap_year_1 and I got the performance boost that I was
expecting.

[1] https://quick-bench.com/q/3U8t4qzXxtSpsehbWNOh3SWxBGQ
[2] https://godbolt.org/z/jfK3j5777

Note: [1] runs GCC 10.2 but the same happens on GCC 11.0.0.

Reply via email to