On 5/29/23 15:01, Dave Blanchard wrote:
> He's certainly got a few things wrong from time to time in his zeal, but his overall
> point seems to stand. Do you have any rebuttals of his argument to present yourself? Or
> do you prefer to just sit back and wait on "y'all" to do the heavy lifting?
He's gotten many details wrong, including the proper flags to set for gcc
(and the "bad documentation" does not justify all the errors he's made)
and his hand-generated assembly (I've personally pointed out logic errors
in his assembly on more than one occasion), and he has failed to provide
evidence that his solutions are better.
In almost all of his examples, he uses -O3, which is basically the "speed
above all else" optimization level. I pointed this out before; I also
pointed out that the smallest code (in bytes) with the fewest
instructions is not always the fastest. He has not provided any data
showing that his solutions result in faster executing code than what gcc
produces. He has also raised questions that show a distinct lack of
understanding when it comes to storage hierarchy, something I feel one
would need to know to properly write fast assembly. Finally, I will
admit some of the examples of gcc-produced code are a bit suspicious
and probably should be reviewed.
In short, Stefan is not being taken seriously because he is not
presenting himself, or his arguments, in a manner that would convince
people to take him seriously. As long as Stefan continues to communicate
in such a manner, we're going to see similar responses from (some of)
the gcc devs (unfortunately).
The best next steps for Stefan would be to review the constructive
criticism, expand on his examples by providing explanation and proof as
to why they're better, and then present these updated findings in the
proper manner.
Using his first example as my own, take the C code:
int ispowerof2(unsigned long long argument)
{
    return (argument & argument - 1) == 0;
}
which, when compiled, produces:
% gcc -m32 -O3 -c ispowerof2.c && objdump -d -Mintel ispowerof2.o
ispowerof2.o: file format elf32-i386
Disassembly of section .text:
00000000 <ispowerof2>:
0: f3 0f 7e 4c 24 04 movq xmm1,QWORD PTR [esp+0x4]
6: 66 0f 76 c0 pcmpeqd xmm0,xmm0
a: 66 0f d4 c1 paddq xmm0,xmm1
e: 66 0f db c1 pand xmm0,xmm1
12: 66 0f 7e c2 movd edx,xmm0
16: 66 0f 73 d0 20 psrlq xmm0,0x20
1b: 66 0f 7e c0 movd eax,xmm0
1f: 09 c2 or edx,eax
21: 0f 94 c0 sete al
24: 0f b6 c0 movzx eax,al
27: c3 ret
Whereas he claims the following is better:
movq xmm1, [esp+4]
pcmpeqd xmm0, xmm0
paddq xmm0, xmm1
pand xmm0, xmm1
pxor xmm1, xmm1
pcmpeqb xmm0, xmm1
pmovmskb eax, xmm0
cmp al, 255
sete al
ret
because it has 10 instructions and is 36 bytes long, versus gcc's 11
instructions and 40 bytes. However, the rebuttals are: 1. his code is
wrong (it can return values other than 0 or 1), and 2. -O3 doesn't
optimize for instruction count or byte size (as an aside: clang's output
uses 14 instructions but is only 32 bytes in size -- is it better or
worse than gcc's?).
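To make rebuttal 1 concrete, here is a rough C model of what the hand-written sequence leaves in EAX, as I read the instructions. The function name handwritten_eax is my own, and the model assumes that movq zeroes the upper half of xmm1 and that pmovmskb zero-extends its 16-bit mask into eax. Treat it as a sketch for checking the logic, not as a substitute for running the real assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (name is mine): the value the hand-written
   sequence leaves in EAX for a given 64-bit argument. */
uint32_t handwritten_eax(uint64_t argument)
{
    /* movq / pcmpeqd / paddq / pand:
       low qword  = argument & (argument - 1)
       high qword = 0, because movq zeroed the upper half of xmm1 */
    uint64_t low = argument & (argument - 1);

    /* pxor / pcmpeqb / pmovmskb: mask bit i is set when byte i is zero.
       The 8 high bytes are always zero, so bits 8..15 start out set. */
    uint32_t eax = 0xFF00u;
    for (int i = 0; i < 8; i++) {
        if (((low >> (8 * i)) & 0xFFu) == 0)
            eax |= 1u << i;
    }
    assert((eax & 0xFF00u) == 0xFF00u);

    /* cmp al, 255 / sete al: only the low byte is overwritten,
       so the upper mask bits remain in EAX at ret. */
    uint32_t al = ((eax & 0xFFu) == 0xFFu) ? 1u : 0u;
    return (eax & ~0xFFu) | al;
}
```

Under this model, handwritten_eax(8) gives 0xFF01 and handwritten_eax(6) gives 0xFF00, so a caller testing the int result for non-zero would see "true" in both cases. Clearing the upper bits (presumably with something like movzx eax, al, which is 3 bytes) would account for the "needed correction".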
Therefore, while his version is one instruction and 4 bytes shorter (1
byte shorter if you add the needed correction), he presents no evidence
that his solution is actually faster. What he would need to do instead is
show, with measurements, that his solution is indeed faster than what gcc
produces. Afterwards, he would be in a position to present this data in a
proper manner.
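As a starting point for that kind of evidence, a timing harness could look something like the sketch below. The helper time_predicate and the predicate_fn typedef are names I made up; the hand-written version would have to be assembled separately and linked in under some name of its own, and it is not shown here. The harness uses clock() and a simple xorshift-style generator for inputs, both of which are arbitrary choices on my part.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef int (*predicate_fn)(unsigned long long);

/* The C function from the example (parentheses added, same meaning). */
int ispowerof2(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}

/* Hypothetical helper: call fn `iterations` times on pseudo-random
   inputs and return the elapsed CPU time in seconds. The number of
   calls that returned 1 is stored in *hits_out so the calls cannot
   be optimized away as dead code. */
double time_predicate(predicate_fn fn, unsigned long iterations,
                      unsigned long long *hits_out)
{
    assert(fn != NULL && hits_out != NULL);

    unsigned long long x = 0x9E3779B97F4A7C15ULL; /* arbitrary non-zero seed */
    unsigned long long hits = 0;

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++) {
        /* xorshift-style step to vary the input on each call */
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        hits += (fn(x) == 1);
    }
    clock_t stop = clock();

    *hits_out = hits;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_predicate(ispowerof2, N, &hits) and then the same with the assembly version gives two numbers to compare. The usual caveats apply: the loop and generator overhead is included in both timings, the compiler may still inline the call when it can see the target, and random 64-bit inputs are almost never powers of two, so the input mix should be chosen to match whatever workload is being claimed.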