Okay, sit back everyone - this is a long read!

----
I'm starting with the problem as listed in https://bugs.freepascal.org/view.php?id=27870, using the source code provided there but with {$codealign varmin=16} and {$codealign localmin=16} added at the top. I'm running the latest version of the compiler with the parameters "-O3 -va -CfSSE64 -a -Sv". Find attached the source file and the generated assembly.

The first thing to note is that no vectorisation occurs when the elements are set individually - e.g. the v1[ 0] := 0.2 lines are assembled as follows:

    movl _$TESTFILE$_Ld1(%rip),%eax
    movl %eax,48(%rsp)
    movl _$TESTFILE$_Ld1(%rip),%eax
    movl %eax,52(%rsp)
    movl _$TESTFILE$_Ld1(%rip),%eax
    movl %eax,56(%rsp)
    movl _$TESTFILE$_Ld1(%rip),%eax
    movl %eax,60(%rsp)

(_$TESTFILE$_Ld1 refers to the 32-bit representation of 0.2, namely $3E4CCCCD, which sits in memory as the little-endian bytes CD CC 4C 3E, and I'm surprised the optimizer doesn't notice the redundant reloading of %eax.)

The line "v3 := v1 + v2;" is vectorised because the compiler can identify all the operands as vector types, but as already suspected, there is a missing command to write %xmm0 to the stack:

    movdqa 48(%rsp),%xmm0
    addps 64(%rsp),%xmm0

The next operation is "call fpc_get_output", which begins the call to "WriteLn", so the sum in %xmm0 is never stored to v3.

Also, there is a very slight bug in the generated code. "movdqa" is an integer move, not a floating-point move. With the floating-point "addps" that follows, this incurs a performance penalty due to switching between the integer and floating-point domains - "movaps" should be used instead.

Regarding alignment, the stack is correctly aligned here because, while no stack frame is set up, the command "pushq %rbx" brings the stack to a 16-byte boundary. Depending on how easy or tricky it is to enforce the stack alignment, it might not be necessary to switch to the unaligned move commands. Once I've figured out how it emits the vector commands, I'll see to it that it includes the missing movaps command.
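For readers more used to intrinsics than to AT&T syntax, the load/add/store sequence discussed above can be sketched in C. This is only an illustration under assumptions, not anything FPC generates: add4_unaligned and add4_aligned are hypothetical names, and a C compiler is free to pick the exact instructions, although the unaligned intrinsics usually map to movups and the aligned ones to movaps.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only */

/* Hypothetical helper: r := a + b for four packed singles.
   Unaligned moves (typically movups) accept any address. */
static void add4_unaligned(float *r, const float *a, const float *b)
{
    __m128 x = _mm_loadu_ps(a);          /* load four singles */
    x = _mm_add_ps(x, _mm_loadu_ps(b));  /* addps */
    _mm_storeu_ps(r, x);                 /* the store the generated code omits */
}

/* Same operation with aligned moves (typically movaps).
   These fault if any address is not a multiple of 16. */
static void add4_aligned(float *r, const float *a, const float *b)
{
    assert(((uintptr_t)r & 15) == 0);
    assert(((uintptr_t)a & 15) == 0);
    assert(((uintptr_t)b & 15) == 0);
    __m128 x = _mm_load_ps(a);
    x = _mm_add_ps(x, _mm_load_ps(b));
    _mm_store_ps(r, x);
}
```

Calling add4_unaligned(v3, v1, v2) on three 4-element float arrays leaves the element-wise sums in v3 wherever the arrays live; add4_aligned gives the same result but is only safe when all three arrays are known to sit on 16-byte boundaries.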
Initially I'll probably switch to using movups to ensure no segmentation faults occur, and then migrate back to movaps if I can automatically enforce the correct byte alignment with no input from the programmer. This might be done by detecting that the variables are vector types and aligning them to a 16-byte boundary if SSE is selected.

I'll let you know how it goes.

Kit

----

P.S. Depending on how the optimizer is structured, I might suggest a kind of "Deep Optimizer" that is part of -O3 (or -O4 if it's a little risky) and runs after all of the other compilation and optimisation stages, immediately prior to writing the assembler/object file. It would do things like remove the redundant writes to %eax, along with other optimizations that the peephole optimizer misses. In the .s file, there are snippets of code akin to the following:

    movq %rax,%rbx
    leaq _$TESTFILE$_Ld3(%rip),%r8
    movq %rbx,%rdx

Because of the leaq command in between, the peephole optimizer doesn't notice the performance penalty that comes from writing to %rbx and then immediately reading it again to copy into %rdx. It could be detected and changed to the following:

    movq %rax,%rbx
    leaq _$TESTFILE$_Ld3(%rip),%r8
    movq %rax,%rdx

Changing %rbx to %rax in the second movq command removes the dependency on the first movq and so takes advantage of modern processors' multiple ALUs (leaq does not modify any of the registers other than the unrelated %r8 in this instance, so it's safe), thus likely collapsing this group of three commands into a single CPU cycle instead of 2.
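The register-forwarding idea from the P.S. can be modelled with a toy pass in C. This is a rough sketch under heavy assumptions and is not FPC's optimizer code: the Insn type and forward_copies are hypothetical names, every OP_MOV is assumed to be a register-to-register move, and the code is assumed to be straight-line (no jumps, calls or memory operands that use the tracked registers).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { OP_MOV, OP_LEA } Op;

/* Toy instruction: op src,dst in AT&T operand order. */
typedef struct {
    Op op;
    const char *src;  /* source operand */
    const char *dst;  /* destination register */
} Insn;

/* For each "mov A,B", rewrite later "mov B,C" to "mov A,C" as long as
   nothing in between has overwritten A or B. Returns the number of
   instructions rewritten. */
static int forward_copies(Insn *code, int n)
{
    assert(code != NULL && n >= 0);
    int changed = 0;
    for (int i = 0; i < n; i++) {
        if (code[i].op != OP_MOV)
            continue;
        const char *a = code[i].src;
        const char *b = code[i].dst;
        for (int j = i + 1; j < n; j++) {
            /* A later copy out of B can read A directly instead. */
            if (code[j].op == OP_MOV && strcmp(code[j].src, b) == 0) {
                code[j].src = a;
                changed++;
            }
            /* Once A or B is overwritten, B no longer mirrors A. */
            if (strcmp(code[j].dst, a) == 0 || strcmp(code[j].dst, b) == 0)
                break;
        }
    }
    return changed;
}
```

Fed the three-instruction snippet above, the pass rewrites only the final movq so that it reads %rax; if the intervening instruction wrote to %rax or %rbx instead of %r8, the scan would stop and nothing would change. A real implementation would also have to track partial registers, memory operands and control flow.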
Attachments: testfile.pp, testfile.s
_______________________________________________ fpc-devel maillist - fpc-devel@lists.freepascal.org http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel