Re: [fpc-devel] Policy regarding SHL/SHR under x86

J. Gareth Moreton via fpc-devel Tue, 25 Oct 2022 05:44:42 -0700

What I want to do is the following...

Say I have the expression "(1 shl x) - 1"... under the default AMD Athlon optimisations, you might get something like this:


(x in %cl, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax

Under -CpCOREAVX2, you might get this (ignoring any zero-extensions required on the index):


(x in %ecx, Result in %eax)
movl  $1,%eax
shlxl %ecx,%eax,%eax
subl  $1,%eax

Both of these sequences take at least 3 cycles, or more accurately, have a dependency chain of length 3. Now consider using BZHI:


(x in %ecx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,%eax,%eax

A dependency chain length of 2 (I'm not sure how many cycles it takes for BZHI to complete execution).
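To make the equivalence concrete for in-range indices, here is a minimal Python model of the two sequences above. It is only a sketch: the function names are made up for illustration, and the models assume 32-bit operands, a shift count masked to 5 bits (as x86 SHL does) and BZHI reading only the low byte of its index.

```python
MASK32 = 0xFFFFFFFF

def mask_via_shl(x):
    """Models movl $1 / shll %cl / subl $1 on 32-bit operands."""
    count = x & 31                      # x86 SHL only uses the low 5 bits of the count
    return ((1 << count) - 1) & MASK32

def mask_via_bzhi(x):
    """Models movl $-1 / bzhil: zero all bits of $FFFFFFFF from position x upwards."""
    n = x & 0xFF                        # BZHI only reads the low byte of the index
    return MASK32 if n > 31 else MASK32 & ((1 << n) - 1)

# For every index 0..31 both sequences produce the same mask.
for x in range(32):
    assert mask_via_shl(x) == mask_via_bzhi(x)

print(hex(mask_via_bzhi(5)))  # 0x1f
```

The loop deliberately stops at 31; indices of 32 and above are where the two models diverge.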

The savings go further if this result is used as a mask then discarded, i.e. "Result := Input and ((1 shl x) - 1)". Under AMD Athlon, for example:


(x in %cl, Input in %edx, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax
andl %edx,%eax

Under -CpCOREAVX2:

(x in %ecx, Input in %edx, Result in %eax)
movl  $1,%eax
shlxl %ecx,%eax,%eax
subl  $1,%eax
andl  %edx,%eax

All have a dependency chain length of 4.  But with BZHI:

(x in %ecx, Input in %edx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,%edx,%eax

Once again, the dependency chain length is reduced to 2. Like with the earlier two, this sequence also works if Input is a reference rather than a register; e.g.


(x in %ecx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,(ref-to-Input),%eax
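
As a sanity check on the masking case, the following Python sketch models "Input and ((1 shl x) - 1)" next to a BZHI applied directly to Input. The helper names and the test value are hypothetical, and the model assumes 32-bit operands; whether Input lives in a register or in memory makes no difference to the arithmetic being modelled.

```python
MASK32 = 0xFFFFFFFF

def and_with_shl_mask(value, x):
    """Models movl $1 / shll / subl $1 / andl on 32-bit operands."""
    mask = ((1 << (x & 31)) - 1) & MASK32   # shift count is reduced modulo 32
    return value & mask

def bzhi32(value, x):
    """Models bzhil index,src,dest: keep the low x bits of value."""
    n = x & 0xFF                            # only the low byte of the index is read
    value &= MASK32
    return value if n > 31 else value & ((1 << n) - 1)

value = 0xDEADBEEF                          # arbitrary illustrative input
for x in range(32):
    assert and_with_shl_mask(value, x) == bzhi32(value, x)

print(hex(bzhi32(value, 12)))  # 0xeef
```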

A problem, however, arises if the index x is out of range. In the case of 32-bit operands, the shift instructions in x86 and ARM (including AArch64) essentially reduce the index modulo 32. "(1 shl 32) - 1" is often expected to return $FFFFFFFF, a mask that covers the entire bit range, but "1 shl 32" returns 1 in this case, so the resultant mask ends up being all zeroes. However, with BZHI, if the index is out of range, the carry flag is set and the output (%eax in this case) is set equal to the input, which results in "(1 shl 32) - 1" returning $FFFFFFFF. I think the same thing happens with negative indices, since BZHI is essentially unsigned (it also only reads the least significant byte of the index register). With this in mind, and "1 shl 32" being considered undefined for 32-bit operands, is this an acceptable optimisation?
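
The divergence can be seen by modelling both behaviours for a few out-of-range indices. This is a rough Python sketch with made-up function names, assuming 32-bit operands, a shift count reduced modulo 32, and BZHI reading only the low byte of the index and setting the carry flag when that byte exceeds 31.

```python
MASK32 = 0xFFFFFFFF

def shl_mask(x):
    """(1 shl x) - 1 with the shift count reduced modulo 32."""
    count = x & 31
    return ((1 << count) - 1) & MASK32

def bzhi_mask(x):
    """BZHI applied to $FFFFFFFF; returns (result, carry_flag)."""
    n = x & 0xFF                  # low byte of the index register
    if n > 31:
        return MASK32, True       # out of range: output = input, CF set
    return MASK32 & ((1 << n) - 1), False

for x in (31, 32, 33, -1, -256):
    result, carry = bzhi_mask(x)
    print(f"x={x:5d}  shl: {shl_mask(x):#010x}  bzhi: {result:#010x}  CF={int(carry)}")
```

Under these assumptions, x = 32 gives 0 for the shift sequence and $FFFFFFFF for BZHI. A negative index such as -1 has a low byte of $FF and so also yields $FFFFFFFF, but one whose low byte is below 32 (for example -256, low byte 0) would be treated like an in-range index.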

Kit

P.S. There is code in the compiler that catches undefined bitmasks and simply sets the mask to all ones if the index is 32 or 64 or whatever the integer word size is. If BZHI is used, a peephole or node optimisation can be used to eliminate this catch since it becomes unnecessary with BZHI.
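
To illustrate why the catch would become redundant, here is a small Python sketch. Note that guarded_mask is a hypothetical stand-in for the kind of check described, not the compiler's actual code, and bzhi_mask is a simplified model of BZHI applied to an all-ones source.

```python
def guarded_mask(x, bits=32):
    """Hypothetical catch: force all ones when the index equals the word size."""
    all_ones = (1 << bits) - 1
    if x == bits:
        return all_ones
    return ((1 << (x & (bits - 1))) - 1) & all_ones   # hardware-style modulo shift

def bzhi_mask(x, bits=32):
    """Simplified BZHI model: indices whose low byte is >= bits leave the source unchanged."""
    all_ones = (1 << bits) - 1
    n = x & 0xFF
    return all_ones if n >= bits else all_ones & ((1 << n) - 1)

# The guarded shift sequence and the unguarded BZHI agree for every index
# from 0 up to and including the word size.
for bits in (32, 64):
    for x in range(bits + 1):
        assert guarded_mask(x, bits) == bzhi_mask(x, bits)
```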


_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
