Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-25 Thread J. Gareth Moreton via fpc-devel
Correction to last post.  When applying BZHI to an input ("Result := 
Input and ((1 shl x) - 1)"), the initial "mov $-1,%eax" is unnecessary 
unless the mask itself needs to be preserved, so the sequence is just:


bzhil %ecx,(ref-to-Input),%eax

Kit

On 25/10/2022 13:44, J. Gareth Moreton via fpc-devel wrote:

What I want to do is the following...

Say I have the expression "(1 shl x) - 1"... under the default AMD 
Athlon optimisations, you might get something like this:


(x in %cl, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax

Under -CpCOREAVX2, you might get this (ignoring any zero-extensions 
required on the index):


(x in %ecx, Result in %eax)
movl  $1,%eax
shlxl %ecx,%eax,%eax
subl  $1,%eax

All of these sequences take at least 3 cycles, or more accurately, 
have a dependency chain of length 3.  Now consider using BZHI:


(x in %ecx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,%eax,%eax

A dependency chain length of 2 (I'm not sure how many cycles it takes 
for BZHI to complete execution).


The savings go further if this result is used as a mask then 
discarded, i.e. "Result := Input and ((1 shl x) - 1)".  Under AMD 
Athlon, for example:


(x in %cl, Input in %edx, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax
andl %edx,%eax

Under -CpCOREAVX2:

(x in %ecx, Input in %edx, Result in %eax)
movl  $1,%eax
shlxl %ecx,%eax,%eax
subl  $1,%eax
andl  %edx,%eax

All have a dependency chain length of 4.  But with BZHI:

(x in %ecx, Input in %edx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,%edx,%eax

Once again, the dependency chain length is reduced to 2.  Like with 
the earlier two, this sequence also works if Input is a reference 
rather than a register; e.g.


(x in %ecx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,(ref-to-Input),%eax

A problem, however, arises if the index x is out of range.  In the 
case of 32-bit operands, the shift instructions in x86 and ARM 
(including AArch64) essentially reduce the index modulo 32.  "(1 shl 
32) - 1" is often expected to return $FFFFFFFF, a mask that covers the 
entire bitrange, but "1 shl 32" returns 1 in this case, so the 
resultant mask ends up being all zeroes.  However, with BZHI, if the 
index is out of range, the carry flag is set and the output (%eax in 
this case) is set equal to the input, which results in "(1 shl 32) - 
1" returning $FFFFFFFF.  I think the same thing happens with negative 
indices, since BZHI is essentially unsigned (it also only reads the 
least significant byte of the index register).  With this in mind, and 
"1 shl 32" being considered undefined for 32-bit operands, is this an 
acceptable optimisation?


Kit

P.S. There is code in the compiler that catches undefined bitmasks and 
simply sets the mask to all ones if the index is 32 or 64 or whatever 
the integer word size is.  If BZHI is used, a peephole or node 
optimisation can eliminate this catch, since it becomes unnecessary 
with BZHI.


___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel




Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-25 Thread J. Gareth Moreton via fpc-devel

What I want to do is the following...

Say I have the expression "(1 shl x) - 1"... under the default AMD 
Athlon optimisations, you might get something like this:


(x in %cl, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax

Under -CpCOREAVX2, you might get this (ignoring any zero-extensions 
required on the index):


(x in %ecx, Result in %eax)
movl  $1,%eax
shlxl %ecx,%eax,%eax
subl  $1,%eax

All of these sequences take at least 3 cycles, or more accurately, have 
a dependency chain of length 3.  Now consider using BZHI:


(x in %ecx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,%eax,%eax

A dependency chain length of 2 (I'm not sure how many cycles it takes 
for BZHI to complete execution).


The savings go further if this result is used as a mask then discarded, 
i.e. "Result := Input and ((1 shl x) - 1)".  Under AMD Athlon, for example:


(x in %cl, Input in %edx, Result in %eax)
movl $1,%eax
shll %cl,%eax
subl $1,%eax
andl %edx,%eax

Under -CpCOREAVX2:

(x in %ecx, Input in %edx, Result in %eax)
movl  $1,%eax
shlxl %ecx,%eax,%eax
subl  $1,%eax
andl  %edx,%eax

All have a dependency chain length of 4.  But with BZHI:

(x in %ecx, Input in %edx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,%edx,%eax

Once again, the dependency chain length is reduced to 2.  Like with the 
earlier two, this sequence also works if Input is a reference rather 
than a register; e.g.


(x in %ecx, Result in %eax)
movl  $-1,%eax
bzhil %ecx,(ref-to-Input),%eax
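
For indices within range (0 to 31), the equivalence these listings rely on can be checked with a small model.  The following is a minimal sketch in C, not compiler code: "bzhi32" is a hypothetical helper that mimics the documented behaviour of the 32-bit BZHI instruction in software, and "mask_and_shl" mirrors the Pascal expression "Input and ((1 shl x) - 1)" for in-range x only.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of 32-bit BZHI (hypothetical helper, not an intrinsic):
   clears the bits of src from position `index` upwards.  Only the low
   byte of the index is used; if it is 32 or more, src is returned
   unchanged (the hardware also sets the carry flag in that case). */
static uint32_t bzhi32(uint32_t src, uint32_t index)
{
    uint32_t n = index & 0xFFu;
    if (n >= 32u)
        return src;
    return src & ((UINT32_C(1) << n) - 1u);
}

/* "Input and ((1 shl x) - 1)" for x in 0..31 only; larger shifts are
   undefined in C as well, so the precondition is asserted. */
static uint32_t mask_and_shl(uint32_t input, uint32_t x)
{
    assert(x < 32u);
    return input & ((UINT32_C(1) << x) - 1u);
}
```

In this model, bzhi32(Input, x) equals mask_and_shl(Input, x) for every x from 0 to 31, and bzhi32 applied to an all-ones source gives the bare mask "(1 shl x) - 1".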

A problem, however, arises if the index x is out of range.  In the case 
of 32-bit operands, the shift instructions in x86 and ARM (including 
AArch64) essentially reduce the index modulo 32.  "(1 shl 32) - 1" is 
often expected to return $FFFFFFFF, a mask that covers the entire 
bitrange, but "1 shl 32" returns 1 in this case, so the resultant mask 
ends up being all zeroes.  However, with BZHI, if the index is out of 
range, the carry flag is set and the output (%eax in this case) is set 
equal to the input, which results in "(1 shl 32) - 1" returning 
$FFFFFFFF.  I think the same thing happens with negative indices, since 
BZHI is essentially unsigned (it also only reads the least significant 
byte of the index register).  With this in mind, and "1 shl 32" being 
considered undefined for 32-bit operands, is this an acceptable 
optimisation?


Kit

P.S. There is code in the compiler that catches undefined bitmasks and 
simply sets the mask to all ones if the index is 32 or 64 or whatever 
the integer word size is.  If BZHI is used, a peephole or node 
optimisation can eliminate this catch, since it becomes unnecessary 
with BZHI.
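
To make the out-of-range case concrete, here is a rough C model of the three behaviours discussed in this post.  All function names are made up for illustration: "shl32_x86" assumes the x86 rule that a 32-bit shift uses only the low 5 bits of the count, "mask_guarded" is only a guess at the shape of the compiler's existing catch (not its actual code), and "mask_bzhi" models BZHI applied to an all-ones source.

```c
#include <assert.h>
#include <stdint.h>

/* x86 SHL on a 32-bit operand: the count is reduced modulo 32. */
static uint32_t shl32_x86(uint32_t value, uint32_t count)
{
    return value << (count & 31u);
}

/* "(1 shl x) - 1" as the plain shift sequence evaluates it at run time. */
static uint32_t mask_shl(uint32_t x)
{
    return shl32_x86(1u, x) - 1u;
}

/* Assumed shape of the existing catch: all ones when x is the word size. */
static uint32_t mask_guarded(uint32_t x)
{
    return (x == 32u) ? UINT32_MAX : mask_shl(x);
}

/* BZHI with an all-ones source: only the low byte of x is read, and an
   index of 32 or more leaves the source unchanged. */
static uint32_t mask_bzhi(uint32_t x)
{
    uint32_t n = x & 0xFFu;
    return (n >= 32u) ? UINT32_MAX : ((UINT32_C(1) << n) - 1u);
}

/* Spot checks of the cases discussed in the post. */
static void check_masks(void)
{
    assert(mask_shl(32u) == 0u);                   /* 1 shl 32 wraps to 1 */
    assert(mask_guarded(32u) == UINT32_MAX);
    assert(mask_bzhi(32u) == UINT32_MAX);
    assert(mask_bzhi((uint32_t)-1) == UINT32_MAX); /* low byte is 255 */
}
```

In this model "mask_guarded" and "mask_bzhi" agree for every x from 0 to 32, which is the situation the P.S. describes.  They differ for larger x (for x = 33 the guarded version yields 1 and the BZHI version all ones), where the result is undefined anyway.  Note also that a negative index whose low byte is below 32, such as -256, yields a small mask under BZHI rather than a full one.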




Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-24 Thread J. Gareth Moreton via fpc-devel

Thanks Michael.  Sven already filled me in - the more I learn!

Kit

On 24/10/2022 16:44, Michael Van Canneyt via fpc-devel wrote:



On Mon, 24 Oct 2022, J. Gareth Moreton via fpc-devel wrote:

That's useful - thank you.  Michael Van Canneyt mentioned he updated 
the documentation for this - where is this usually located? It's not 
here, for example: https://www.freepascal.org/docs-html/ref/refsu45.html


Daily documentation:

https://www.freepascal.org/daily/daily.html

In particular:

https://www.freepascal.org/daily/doc/ref/refsu46.html

Michael.



Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-24 Thread Michael Van Canneyt via fpc-devel



On Mon, 24 Oct 2022, J. Gareth Moreton via fpc-devel wrote:

That's useful - thank you.  Michael Van Canneyt mentioned he updated the 
documentation for this - where is this usually located? It's not here, 
for example: https://www.freepascal.org/docs-html/ref/refsu45.html


Daily documentation:

https://www.freepascal.org/daily/daily.html

In particular:

https://www.freepascal.org/daily/doc/ref/refsu46.html

Michael.


Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-24 Thread J. Gareth Moreton via fpc-devel

The more I learn!

On 24/10/2022 13:06, Sven Barth wrote:
J. Gareth Moreton via fpc-devel wrote on Mon, 24 Oct 2022, 13:52:


That's useful - thank you.  Michael Van Canneyt mentioned he updated the
documentation for this - where is this usually located? It's not here,
for example: https://www.freepascal.org/docs-html/ref/refsu45.html


That is for the last released version, in this case 3.2.2. A snapshot 
of the documentation for the development version is available at 
https://www.freepascal.org/daily/daily.html, so the one you want is 
here: https://www.freepascal.org/daily/doc/ref/refsu46.html


Regards,
Sven


Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-24 Thread Sven Barth via fpc-devel
J. Gareth Moreton via fpc-devel wrote on Mon, 24 Oct 2022, 13:52:

> That's useful - thank you.  Michael Van Canneyt mentioned he updated the
> documentation for this - where is this usually located? It's not here,
> for example: https://www.freepascal.org/docs-html/ref/refsu45.html


That is for the last released version, in this case 3.2.2. A snapshot of
the documentation for the development version is available at
https://www.freepascal.org/daily/daily.html, so the one you want is here:
https://www.freepascal.org/daily/doc/ref/refsu46.html

Regards,
Sven


Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-24 Thread J. Gareth Moreton via fpc-devel
That's useful - thank you.  Michael Van Canneyt mentioned he updated the 
documentation for this - where is this usually located? It's not here, 
for example: https://www.freepascal.org/docs-html/ref/refsu45.html


Kit

On 24/10/2022 11:58, Kai Burghardt via fpc-devel wrote:

Hi there:

On 2022‑10‑24 11:51:32 +0100, J. Gareth Moreton via fpc-devel wrote:

[...] I've come across one situation that I need clarity on... how
are SHL and SHR instructions handled if the shift value exceeds the word
size?

About half a year ago I raised a documentation issue regarding that:
https://gitlab.com/freepascal.org/fpc/documentation/-/issues/39304

Bottom line: The behavior is _undefined_. Explanation by Jonas Maebe:


Such behaviour is indeed undefined (it's not implementation-defined
because when evaluating at compile time you may get different results
compared to when it gets evaluated at run time due to architecture
peculiarities).



Re: [fpc-devel] Policy regarding SHL/SHR under x86

2022-10-24 Thread Kai Burghardt via fpc-devel
Hi there:

On 2022‑10‑24 11:51:32 +0100, J. Gareth Moreton via fpc-devel wrote:
> [...] I've come across one situation that I need clarity on... how
> are SHL and SHR instructions handled if the shift value exceeds the word
> size?

About half a year ago I raised a documentation issue regarding that:
https://gitlab.com/freepascal.org/fpc/documentation/-/issues/39304

Bottom line: The behavior is _undefined_. Explanation by Jonas Maebe:

> Such behaviour is indeed undefined (it's not implementation-defined
> because when evaluating at compile time you may get different results
> compared to when it gets evaluated at run time due to architecture
> peculiarities).
-- 
Sincerely yours,
Kai Burghardt

