Re: Another epic optimiser failure

2023-05-29 Thread Nicholas Vinson via Gcc

On 5/29/23 15:01, Dave Blanchard wrote:


He's certainly got a few things wrong from time to time in his zeal, but his overall 
point seems to stand. Do you have any rebuttals of his argument to present yourself? Or 
do you prefer to just sit back and wait on "y'all" to do the heavy lifting?


He's gotten many details wrong, including the proper flags to set for gcc 
(and the "bad documentation" does not justify all the errors he's made) 
and his hand-generated assembly (I've personally pointed out logic errors 
in his assembly on more than one occasion), and he has failed to provide 
evidence that his solutions are better.


In almost all of his examples, he uses -O3, which is basically the "speed 
above all else" optimization level. I pointed this out before; I also 
pointed out that the smallest code (in bytes) with the fewest 
instructions is not always the fastest. He has not provided any data 
showing that his solutions execute faster than what gcc produces. He has 
also raised questions that show a distinct lack of understanding of the 
storage hierarchy, something I feel one needs to know to write fast 
assembly properly. Finally, I will admit some of the examples of 
gcc-produced code are a bit suspicious and probably should be reviewed.


In short, Stefan is not being taken seriously because he is not 
presenting himself, or his arguments, in a manner that would convince 
people to take him seriously. As long as Stefan continues to communicate 
in such a manner, we're going to see similar responses from (some of) 
the gcc devs (unfortunately).


The best next steps for Stefan would be to review the constructive 
criticism, expand on his examples by providing explanation and proof as 
to why they're better, and then present these updated findings in the 
proper manner.


Using his first example as my own, take the C code:

int ispowerof2(unsigned long long argument)
{
    return (argument & argument - 1) == 0;
}

when compiled produces:

% gcc -m32 -O3 -c ispowerof2.c && objdump -d -Mintel ispowerof2.o

ispowerof2.o: file format elf32-i386

Disassembly of section .text:

00000000 <ispowerof2>:
   0:   f3 0f 7e 4c 24 04       movq   xmm1,QWORD PTR [esp+0x4]
   6:   66 0f 76 c0             pcmpeqd xmm0,xmm0
   a:   66 0f d4 c1             paddq  xmm0,xmm1
   e:   66 0f db c1             pand   xmm0,xmm1
  12:   66 0f 7e c2             movd   edx,xmm0
  16:   66 0f 73 d0 20          psrlq  xmm0,0x20
  1b:   66 0f 7e c0             movd   eax,xmm0
  1f:   09 c2                   or     edx,eax
  21:   0f 94 c0                sete   al
  24:   0f b6 c0                movzx  eax,al
  27:   c3                      ret

Whereas he claims the following is better:

movq     xmm1, [esp+4]
pcmpeqd  xmm0, xmm0
paddq    xmm0, xmm1
pand     xmm0, xmm1
pxor     xmm1, xmm1
pcmpeqb  xmm0, xmm1
pmovmskb eax, xmm0
cmp      al, 255
sete     al
ret

because it has 10 instructions and is 36 bytes long versus gcc's 11 
instructions and 40 bytes. However, the rebuttals are: 1. his code is 
wrong (it can return values other than 0 or 1), and 2. -O3 doesn't 
optimize for instruction count or byte size (as an aside: clang's output 
uses 14 instructions but is only 32 bytes in size -- is it better or 
worse than gcc's?).


Therefore, while his version is 1 instruction shorter and 4 bytes smaller 
(1 byte smaller once the needed correction is added), he presents no 
evidence that his solution is actually faster. What he needs to do 
instead is show proof that his solution is indeed faster than what gcc 
produces.


Afterwards, he would be in a position to present this data in a proper 
manner.
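
One way to gather that kind of evidence is a small timing harness. The 
following is only a minimal sketch, assuming each candidate is callable 
as a C function with the same signature; time_candidate is a 
hypothetical helper name, and a serious measurement would also need 
warm-up, repeated runs, realistic inputs, and attention to call overhead.

```c
#include <stddef.h>
#include <time.h>

/* The function from the example above, compiled by the compiler under test. */
int ispowerof2(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}

/* Hypothetical helper: calls fn `iterations` times and returns the CPU
 * time used, in seconds. The volatile sink keeps the calls from being
 * optimized away. */
double time_candidate(int (*fn)(unsigned long long), size_t iterations)
{
    volatile int sink = 0;
    clock_t start = clock();
    for (size_t i = 0; i < iterations; i++)
        sink = sink + fn((unsigned long long)i);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Passing the compiler-generated function and a separately assembled 
hand-written routine to time_candidate with the same iteration count 
would give two numbers to compare, which is the kind of data the 
argument above asks for.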


Re: Who cares about performance (or Intel's CPU errata)?

2023-05-28 Thread Nicholas Vinson via Gcc



On 5/27/23 18:52, Stefan Kanthak wrote:

"Andrew Pinski"  wrote:


On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak  wrote:

Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:

--- dontcare.c ---
int ispowerof2(unsigned __int128 argument) {
 return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
}
--- EOF ---

GCC 13.3: gcc -march=haswell -O3

https://gcc.godbolt.org/z/PPzYsPzMc
ispowerof2(unsigned __int128):
 popcnt  rdi, rdi
 popcnt  rsi, rsi
 add     esi, edi
 xor     eax, eax
 cmp     esi, 1
 sete    al
 ret

OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs 
output?

Because the popcount is going to the same register, there is no false
dependency.
The false dependency errata only applies if the result of the popcnt
is going to a different register: the processor thinks it depends on
the result in that register from a previous instruction but it does
not (which is why it is called a false dependency). In this case it
actually does depend on the previous result since the output register
is the same as the input register.

OUCH, my fault; sorry for the confusion and the wrong accusation.

Nevertheless GCC fails to optimise code properly:

--- .c ---
int ispowerof2(unsigned long long argument) {
 return __builtin_popcountll(argument) == 1;
}
--- EOF ---

GCC 13.3: gcc -m32 -mpopcnt -O3

https://godbolt.org/z/fT7a7jP4e
ispowerof2(unsigned long long):
 xor     eax, eax
 xor     edx, edx
 popcnt  eax, [esp+4]
 popcnt  edx, [esp+8]
 add     eax, edx        # eax is less than 64!

Less than or equal to 64 (consider the case when input is (unsigned long 
long)-1)

 cmp     eax, 1    ->    dec eax  # 2 bytes shorter
 sete    al
 movzx   eax, al         # superfluous

Not when dec is used. If you use dec and omit this instruction, you may 
get a result value of 0xffffff00 (consider the case when input is 
(unsigned long long)0).

 ret
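
The "less than or equal to 64" remark in the listing above can be checked 
directly. This is a minimal sketch, assuming GCC or Clang (for 
__builtin_popcount) and a 32-bit unsigned int; popcount_sum is a 
hypothetical helper that mirrors the two POPCNT instructions and the ADD.

```c
/* Mirrors the listing: count the bits in each 32-bit half, then add. */
unsigned popcount_sum(unsigned long long argument)
{
    unsigned low  = (unsigned)__builtin_popcount((unsigned)(argument & 0xffffffffu));
    unsigned high = (unsigned)__builtin_popcount((unsigned)(argument >> 32));
    return low + high;
}
```

For an all-ones input both halves contribute 32, so the sum reaches 64 
and the range of EAX after the ADD is 0 through 64 inclusive.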

5 bytes and 1 instruction saved; 5 bytes here and there accumulate to
kilo- or even megabytes, and they can extend code to cross a cache line
or a 16-byte alignment boundary.

JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument

JFTR: GCC is notorious for generating superfluous MOVZX instructions
   where its optimiser SHOULD be able to see that the value is already
   less than 256!

Stefan


Re: Another epic optimiser failure

2023-05-28 Thread Nicholas Vinson via Gcc



On 5/27/23 17:04, Stefan Kanthak wrote:

--- .c ---
int ispowerof2(unsigned long long argument) {
 return __builtin_popcountll(argument) == 1;
}
--- EOF ---

GCC 13.3: gcc -m32 -march=alderlake -O3
          gcc -m32 -march=sapphirerapids -O3
          gcc -m32 -mpopcnt -mtune=sapphirerapids -O3

https://gcc.godbolt.org/z/cToYrrYPq
ispowerof2(unsigned long long):
 xor     eax, eax        # superfluous
 xor     edx, edx        # superfluous
 popcnt  eax, [esp+4]
 popcnt  edx, [esp+8]
 add     eax, edx
 cmp     eax, 1    ->    dec eax
 sete    al
 movzx   eax, al         # superfluous
 ret

9 instructions in 28 bytes  # 6 instructions in 20 bytes


I agree this can be done using 6 instructions, but you cannot do it 
using the dec instruction. If you use dec, "movzx eax, al" becomes a 
required instruction (consider the case when the input is 0), resulting 
in 7 instructions and 22 bytes.
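
The zero-input case can be traced by modelling EAX as a 32-bit integer. 
This is only a sketch of the register arithmetic in C, not generated 
code; tail_with_dec and tail_with_dec_movzx are hypothetical names, and 
count stands for the sum left in EAX by the ADD.

```c
#include <stdint.h>

/* Models "dec eax; sete al" with no movzx afterwards. */
uint32_t tail_with_dec(uint32_t count)
{
    uint32_t eax = count - 1u;             /* dec eax: wraps to 0xffffffff for 0 */
    uint32_t al  = (eax == 0u) ? 1u : 0u;  /* sete al: ZF is set iff the result is 0 */
    return (eax & 0xffffff00u) | al;       /* sete writes only the low byte of EAX */
}

/* Same sequence followed by "movzx eax, al". */
uint32_t tail_with_dec_movzx(uint32_t count)
{
    return tail_with_dec(count) & 0xffu;
}
```

Under this model a count of 0 leaves 0xffffff00 in EAX without the 
movzx, while every count from 1 to 64 already gives 0 or 1, which is why 
the input 0 is the case to consider.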




Re: Will GCC eventually support SSE2 or SSE4.1?

2023-05-26 Thread Nicholas Vinson via Gcc

On 5/26/23 08:42, Stefan Kanthak wrote:


I could have added PROPERLY, because that's where it CLEARLY fails, as
shown by the generated unoptimised code.


From what I've seen so far, I find your arguments unconvincing.

In this thread alone, you've shown that you don't know how to properly 
control gcc via its command-line flags, and that you don't know how to 
properly generate assembly code for your own C example ("properly" in 
this case meaning that it exhibits the behavior the ISO C standard 
requires). That makes it hard for me to accept your claims at face value 
(your C example is also logically incorrect, but that's not important to 
this discussion).
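
One reading of that parenthetical, and this is an assumption since the 
message does not spell it out, is the zero input: the expression reports 
0 as a power of two. A minimal sketch follows; ispowerof2_strict is a 
hypothetical variant, not something proposed in the thread.

```c
/* The expression from the thread, with explicit parentheses. */
int ispowerof2(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}

/* Hypothetical stricter variant that rejects zero. */
int ispowerof2_strict(unsigned long long argument)
{
    return argument != 0 && (argument & (argument - 1)) == 0;
}
```

The two functions agree on every non-zero input and differ only for 0, 
where the first returns 1 and the second returns 0.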


That said, assuming that your "optimized assembly" examples (with the 
exception of the first) are correct, all you've done is show that your 
versions are slightly smaller in both instruction count and size and 
declare your examples "proper". The optimization flag -O3 (like most of 
the -On flags) optimizes for speed over all else, and it has been shown 
that the faster code isn't necessarily the code with the fewest 
instructions or the smallest size (see the RISC vs. CISC debate).


To accept that your suggestions are the proper ways to generate code 
using SSE4.1 instructions at -O3, I insist on data that clearly 
demonstrates that your suggestions are at least as performant as what 
GCC currently produces.




Re: Will GCC eventually support SSE2 or SSE4.1?

2023-05-26 Thread Nicholas Vinson via Gcc

On 5/26/23 02:46, Stefan Kanthak wrote:


Hi,

compile the following function on a system with Core2 processor
(released January 2008) for the 32-bit execution environment:

--- demo.c ---
int ispowerof2(unsigned long long argument)
{
 return (argument & argument - 1) == 0;
}
--- EOF ---

GCC 13.3: gcc -m32 -O3 demo.c

NOTE: -mtune=native is the default!

# https://godbolt.org/z/b43cjGdY9
ispowerof2(unsigned long long):
 movq    xmm1, [esp+4]
 pcmpeqd xmm0, xmm0
 paddq   xmm0, xmm1
 pand    xmm0, xmm1
 movd    edx, xmm0          # pxor     xmm1, xmm1
 psrlq   xmm0, 32           # pcmpeqb  xmm0, xmm1
 movd    eax, xmm0          # pmovmskb eax, xmm0
 or      edx, eax           # cmp      al, 255
 sete    al                 # sete     al
 movzx   eax, al            #
 ret

11 instructions in 40 bytes # 10 instructions in 36 bytes 


You cannot delete the 'movzx eax, al' instruction. The expression 
"(argument & argument - 1) == 0" must evaluate to a 0 or a 1. The movzx 
is required to ensure that the upper 24 bits of the eax register are 
properly zeroed.
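
To see which bits are left set, the hand-written SSE2 sequence in the 
right-hand comments can be modelled byte by byte. The following is a 
sketch in portable C rather than intrinsics, written from the documented 
behaviour of PCMPEQB and PMOVMSKB; handwritten_eax is a hypothetical 
name.

```c
#include <stdint.h>

/* Models movq/pcmpeqd/paddq/pand followed by
 * pxor; pcmpeqb; pmovmskb eax; cmp al,255; sete al
 * and returns EAX as it would be without a trailing movzx. */
uint32_t handwritten_eax(uint64_t argument)
{
    /* Low quadword of xmm0 after pand. The high quadword is zero
     * because movq cleared the upper half of xmm1. */
    uint64_t low = argument & (argument - 1);

    /* pcmpeqb against zero, then pmovmskb: one mask bit per zero byte.
     * The upper eight bytes are all zero, so bits 8..15 are always set. */
    uint32_t mask = 0xff00u;
    for (int i = 0; i < 8; i++)
        if (((low >> (8 * i)) & 0xffu) == 0)
            mask |= 1u << i;

    uint32_t al = ((mask & 0xffu) == 0xffu) ? 1u : 0u;  /* cmp al,255; sete al */
    return (mask & 0xffffff00u) | al;                   /* rest of EAX is untouched */
}
```

Under this model a power of two such as 8 yields 0xff01 and a value such 
as 6 yields 0xff00, so the result is neither 0 nor 1 unless EAX is 
zero-extended from AL afterwards.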




OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set"
   here instead of the native SSE4.1 alias "Penryn New Instruction Set"
   of the Core2 (and all later processors)?

OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the
right side?
After correcting for the above error, your solution is essentially the 
same size as the solution gcc generated (39 bytes versus 40). Therefore, 
the only remaining question would be "Is your solution faster than the 
code gcc produced?"


If you claim it is, I'd like to see evidence supporting that claim.

Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1
alias "Penryn New Instruction Set" of the Core2 processor:

GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c

# https://godbolt.org/z/svhEoYT11
ispowerof2(unsigned long long):
                            # xor      eax, eax
 movq    xmm1, [esp+4]      # movq     xmm1, [esp+4]
 pcmpeqd xmm0, xmm0         # pcmpeqq  xmm0, xmm0
 paddq   xmm0, xmm1         # paddq    xmm0, xmm1
 pand    xmm0, xmm1         # ptest    xmm0, xmm1
 movd    edx, xmm0          #
 psrlq   xmm0, 32           #
 movd    eax, xmm0          #
 or      edx, eax           #
 sete    al                 # sete     al
 movzx   eax, al            #
 ret                        # ret

11 instructions in 40 bytes # 7 instructions in 26 bytes

OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.
   ~~~


As pointed out elsewhere in this thread, you used the wrong flags. With 
the proper flags, I get:


% gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c && objdump -d ispowerof2.o



ispowerof2.o: file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
   0:   f3 0f 7e 4c 24 04       movq   0x4(%esp),%xmm1
   6:   66 0f 76 c0             pcmpeqd %xmm0,%xmm0
   a:   31 c0                   xor    %eax,%eax
   c:   66 0f d4 c1             paddq  %xmm1,%xmm0
  10:   66 0f db c1             pand   %xmm1,%xmm0
  14:   66 0f 6c c0             punpcklqdq %xmm0,%xmm0
  18:   66 0f 38 17 c0          ptest  %xmm0,%xmm0
  1d:   0f 94 c0                sete   %al
  20:   c3                      ret

so with just the SSE4.1 instruction set enabled, the output is 33 bytes 
long (the final ret sits at offset 0x20).
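
Whether a given set of flags actually enables an instruction set can be 
checked from the compiler's predefined macros. This is a minimal sketch, 
assuming GCC or Clang targeting x86, where __SSE2__ and __SSE4_1__ are 
defined when the corresponding extensions are enabled; sse2_enabled and 
sse41_enabled are hypothetical helper names.

```c
/* Returns 1 when this translation unit was compiled with SSE2 enabled. */
int sse2_enabled(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 when this translation unit was compiled with SSE4.1 enabled. */
int sse41_enabled(void)
{
#ifdef __SSE4_1__
    return 1;
#else
    return 0;
#endif
}
```

The same information is available without writing any code by dumping 
the predefined macros, for example with 
"gcc -m32 -msse4.1 -dM -E - < /dev/null", and looking for the SSE 
entries.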


Last compile with -mtune=i386 for the i386 processor:

GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c

# https://godbolt.org/z/e76W6dsMj
ispowerof2(unsigned long long):
 push    ebx                #
 mov     ecx, [esp+8]       # mov     eax, [esp+4]
 mov     ebx, [esp+12]      # mov     edx, [esp+8]
 mov     eax, ecx           #
 mov     edx, ebx           #
 add     eax, -1            # add     eax, -1
 adc     edx, -1            # adc     edx, -1
 and     eax, ecx           # and     eax, [esp+4]
 and     edx, ebx           # and     edx, [esp+8]
 or      eax, edx           # or      eax, edx
 sete    al                 # neg     eax
 movzx   eax, al            # sbb     eax, eax
 pop     ebx                # inc     eax
 ret                        # ret

14 instructions in 33 bytes # 11 instructions in 32 bytes

OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous
   memory write?


At -O1 gcc produces:

% gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c && objdump -Mintel -d ispowerof2.o


ispowerof2.o: file format elf32-i386


Disassembly of section .text:

00000000 <ispowerof2>:
   0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
   4:   8b 54 24 08             mov    edx,DWORD PTR [esp+0x8]
   8:   83 c0 ff                add    eax,0xffffffff
   b:   83 d2 ff                adc    edx,0xffffffff
   e:   23 44 24 04             and    eax,DWORD PTR [esp+0x4]
  12: 

Re: More C type errors by default for GCC 14

2023-05-14 Thread Nicholas Vinson via Gcc

Jonathan Wakely  writes:

Wrong. I wouldn't bother replying to you again in this thread, but I
feel that as a gcc maintainer I should confirm that Eli S. is right
here; and nobody else I know agrees with your definition of extension
as "every non-standard aspect of the compiler's behaviour, whether
intentional or accidental". That's just silly.

GCC's support for implicit int is clearly intentional.
I never claimed that accidental GNU CC behavior was part of GNU C.


You might not have explicitly stated that, but you have made that 
argument in this thread.


You have asserted that the compiler's behavior, and not its 
documentation, determines what should be considered a language extension.


That assertion, when taken to its natural conclusion, shows support for 
the idea that "accidental GNU CC behavior" should be considered a 
language extension, and by becoming a language extension it would be 
part of GNU C.


If the behavior, and not the documentation, determines what is and is 
not an extension unless it's "accidental behavior", then how is anyone 
to know what is or is not a GNU C extension?