Re: [fpc-devel] Policy regarding SHL/SHR under x86
Correction to my last post. When applying BZHI to an input ("Result := Input and ((1 shl x) - 1)"), the initial "movl $-1,%eax" is unnecessary unless the mask itself is being preserved, so the sequence is just:

    bzhil %ecx,(ref-to-Input),%eax

Kit

On 25/10/2022 13:44, J. Gareth Moreton via fpc-devel wrote:
> [...] But with BZHI: (x in %ecx, Input in %edx, Result in %eax)
>
>     movl $-1,%eax
>     bzhil %ecx,%edx,%eax
>
> Once again, the dependency chain length is reduced to 2. [...]

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel
Re: [fpc-devel] Policy regarding SHL/SHR under x86
What I want to do is the following... Say I have the expression "(1 shl x) - 1". Under the default AMD Athlon optimisations, you might get something like this (x in %cl, Result in %eax):

    movl $1,%eax
    shll %cl,%eax
    subl $1,%eax

Under -CpCOREAVX2, you might get this, ignoring any zero-extensions required on the index (x in %ecx, Result in %eax):

    movl $1,%eax
    shlxl %ecx,%eax,%eax
    subl $1,%eax

All of these sequences take at least 3 cycles, or more accurately, have a dependency chain of length 3. Now consider using BZHI (x in %ecx, Result in %eax):

    movl $-1,%eax
    bzhil %ecx,%eax,%eax

That is a dependency chain of length 2 (I'm not sure how many cycles BZHI takes to complete execution).

The savings go further if this result is used as a mask and then discarded, i.e. "Result := Input and ((1 shl x) - 1)". Under AMD Athlon, for example (x in %cl, Input in %edx, Result in %eax):

    movl $1,%eax
    shll %cl,%eax
    subl $1,%eax
    andl %edx,%eax

Under -CpCOREAVX2 (x in %ecx, Input in %edx, Result in %eax):

    movl $1,%eax
    shlxl %ecx,%eax,%eax
    subl $1,%eax
    andl %edx,%eax

Both have a dependency chain of length 4. But with BZHI (x in %ecx, Input in %edx, Result in %eax):

    movl $-1,%eax
    bzhil %ecx,%edx,%eax

Once again, the dependency chain length is reduced to 2. As with the earlier two, this sequence also works if Input is a reference rather than a register, e.g. (x in %ecx, Result in %eax):

    movl $-1,%eax
    bzhil %ecx,(ref-to-Input),%eax

A problem, however, arises if the index x is out of range. In the case of 32-bit operands, the shift instructions in x86 and ARM (including AArch64) essentially reduce the index modulo 32. "(1 shl 32) - 1" is often expected to return $FFFFFFFF, a mask that covers the entire bit range, but "1 shl 32" returns 1 in this case, so the resultant mask ends up being all zeroes. With BZHI, however, if the index is out of range, the carry flag is set and the output (%eax in this case) is set equal to the input, which results in "(1 shl 32) - 1" returning $FFFFFFFF.

I think the same thing happens with negative indices, since BZHI is essentially unsigned (it also only reads the least significant byte of the index register).

With this in mind, and with "1 shl 32" being considered undefined for 32-bit operands, is this an acceptable optimisation?

Kit

P.S. There is code in the compiler that catches undefined bitmasks and simply sets the mask to all ones if the index is 32 or 64 or whatever the integer word size is. If BZHI is used, a peephole or node optimisation can eliminate this catch, since it becomes unnecessary with BZHI.
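The difference between the two approaches can be sketched with a small model. This is Python rather than FPC output, and the helper names are hypothetical: `shl32_x86` assumes the count is reduced modulo 32 as described above, and `bzhi32` assumes the documented 32-bit BZHI behaviour (index taken from the low byte, source returned unchanged when that byte is 32 or more). Flags are not modelled.

```python
MASK32 = 0xFFFFFFFF

def shl32_x86(value, count):
    """Model of 32-bit SHL on x86: the count is reduced modulo 32."""
    return (value << (count & 31)) & MASK32

def bzhi32(src, index):
    """Model of 32-bit BZHI: clear the bits of src from position index upwards."""
    n = index & 0xFF              # only the low byte of the index is read
    if n >= 32:                   # out of range: the source is returned unchanged
        return src & MASK32
    return src & ((1 << n) - 1)

def mask_via_shift(x):
    """(1 shl x) - 1 using the shift model, wrapped to 32 bits."""
    return (shl32_x86(1, x) - 1) & MASK32

def mask_via_bzhi(x):
    """Model of: movl $-1,%eax ; bzhil x,%eax,%eax."""
    return bzhi32(MASK32, x)

if __name__ == "__main__":
    # Compare both forms for in-range and out-of-range indices.
    for x in (0, 1, 31, 32, 33, 255, 256):
        print(f"x={x:3d}  shift: ${mask_via_shift(x):08X}  bzhi: ${mask_via_bzhi(x):08X}")
```

In this model the two forms agree for every index from 0 to 31 and diverge from 32 onwards. An index of -1 has a low byte of 255 and so yields all ones, whereas an index whose low byte is below 32 (256, for instance) does not saturate. Applying `bzhi32` directly to Input gives the same value as ANDing Input with the mask for in-range indices.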
Re: [fpc-devel] Policy regarding SHL/SHR under x86
Thanks Michael. Sven already filled me in - the more I learn!

Kit

On 24/10/2022 16:44, Michael Van Canneyt via fpc-devel wrote:
> On Mon, 24 Oct 2022, J. Gareth Moreton via fpc-devel wrote:
>> That's useful - thank you. Michael Van Canneyt mentioned he updated the documentation for this - where is this usually located? It's not here, for example: https://www.freepascal.org/docs-html/ref/refsu45.html
>
> Daily documentation: https://www.freepascal.org/daily/daily.html
> In particular: https://www.freepascal.org/daily/doc/ref/refsu46.html
>
> Michael.
Re: [fpc-devel] Policy regarding SHL/SHR under x86
On Mon, 24 Oct 2022, J. Gareth Moreton via fpc-devel wrote:
> That's useful - thank you. Michael Van Canneyt mentioned he updated the documentation for this - where is this usually located? It's not here, for example: https://www.freepascal.org/docs-html/ref/refsu45.html

Daily documentation: https://www.freepascal.org/daily/daily.html
In particular: https://www.freepascal.org/daily/doc/ref/refsu46.html

Michael.
Re: [fpc-devel] Policy regarding SHL/SHR under x86
The more I learn!

On 24/10/2022 13:06, Sven Barth wrote:
> J. Gareth Moreton via fpc-devel wrote on Mon, 24 Oct 2022, 13:52:
>> That's useful - thank you. Michael Van Canneyt mentioned he updated the documentation for this - where is this usually located? It's not here, for example: https://www.freepascal.org/docs-html/ref/refsu45.html
>
> That is for the last released version, in this case 3.2.2. A snapshot of the documentation for the development version is available at https://www.freepascal.org/daily/daily.html, so the one you want is here: https://www.freepascal.org/daily/doc/ref/refsu46.html
>
> Regards,
> Sven
Re: [fpc-devel] Policy regarding SHL/SHR under x86
J. Gareth Moreton via fpc-devel wrote on Mon, 24 Oct 2022, 13:52:
> That's useful - thank you. Michael Van Canneyt mentioned he updated the
> documentation for this - where is this usually located? It's not here,
> for example: https://www.freepascal.org/docs-html/ref/refsu45.html

That is for the last released version, in this case 3.2.2. A snapshot of the documentation for the development version is available at https://www.freepascal.org/daily/daily.html, so the one you want is here: https://www.freepascal.org/daily/doc/ref/refsu46.html

Regards,
Sven
Re: [fpc-devel] Policy regarding SHL/SHR under x86
That's useful - thank you. Michael Van Canneyt mentioned he updated the documentation for this - where is this usually located? It's not here, for example: https://www.freepascal.org/docs-html/ref/refsu45.html

Kit

On 24/10/2022 11:58, Kai Burghardt via fpc-devel wrote:
> Hi there:
>
> On 2022‑10‑24 11:51:32 +0100, J. Gareth Moreton via fpc-devel wrote:
>> [...] I've come across one situation that I need clarity on... how are SHL and SHR instructions handled if the shift value exceeds the word size?
>
> About half a year ago I raised a documentation issue regarding that: https://gitlab.com/freepascal.org/fpc/documentation/-/issues/39304
>
> Bottom line: The behavior is _undefined_. Explanation by Jonas Maebe:
>> Such behaviour is indeed undefined (it's not implementation-defined because when evaluating at compile time you may get different results compared to when it gets evaluated at run time due to architecture peculiarities).
Re: [fpc-devel] Policy regarding SHL/SHR under x86
Hi there:

On 2022‑10‑24 11:51:32 +0100, J. Gareth Moreton via fpc-devel wrote:
> [...] I've come across one situation that I need clarity on... how
> are SHL and SHR instructions handled if the shift value exceeds the word
> size?

About half a year ago I raised a documentation issue regarding that: https://gitlab.com/freepascal.org/fpc/documentation/-/issues/39304

Bottom line: The behavior is _undefined_. Explanation by Jonas Maebe:
> Such behaviour is indeed undefined (it's not implementation-defined
> because when evaluating at compile time you may get different results
> compared to when it gets evaluated at run time due to architecture
> peculiarities).

--
Sincerely yours,
Kai Burghardt
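One way a compile-time result could differ from a run-time one is sketched below. This is a Python model and not FPC's actual constant folder: `fold_shl32` is a hypothetical folder that shifts at full precision and then truncates to 32 bits, while `run_shl32_x86` models an x86 shift instruction that reduces the count modulo 32. Both names, and the assumption about how folding is done, are purely illustrative.

```python
MASK32 = 0xFFFFFFFF

def fold_shl32(value, count):
    """Hypothetical compile-time folding: shift at full precision, then truncate.

    count is assumed to be non-negative.
    """
    return (value << count) & MASK32

def run_shl32_x86(value, count):
    """Run-time model of 32-bit SHL on x86: the count is taken modulo 32."""
    return (value << (count & 31)) & MASK32

def results_differ(value, count):
    """True if the folded constant and the run-time result disagree."""
    return fold_shl32(value, count) != run_shl32_x86(value, count)
```

Under these assumptions, "1 shl 32" folds to 0 but evaluates to 1 at run time, while in-range counts such as 31 give the same answer both ways.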