Okay, sit back everyone - this is a long read!

----

I'm starting with the problem as described in https://bugs.freepascal.org/view.php?id=27870, using the source code provided there, although with {$codealign varmin=16} and {$codealign localmin=16} added at the top.

I'm running the latest version of the compiler with the parameters "-O3 -va -CfSSE64 -a -Sv". Attached are the source file and the generated assembly.

The first thing to note is that no vectorisation occurs when the elements are set individually - e.g. the "v1[0] := 0.2" lines are assembled as follows:

movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,48(%rsp)
movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,52(%rsp)
movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,56(%rsp)
movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,60(%rsp)

(_$TESTFILE$_Ld1 refers to the 32-bit representation of 0.2, namely $3E4CCCCD, and I'm surprised the optimizer doesn't notice that %eax is redundantly reloaded with the same value)
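
At minimum, a peephole pass could reuse %eax after the first load, reducing this to:

movl    _$TESTFILE$_Ld1(%rip),%eax
movl    %eax,48(%rsp)
movl    %eax,52(%rsp)
movl    %eax,56(%rsp)
movl    %eax,60(%rsp)

Better still, since all four elements receive the same constant, a vectorised version could broadcast it - a sketch, assuming the slot at 48(%rsp) is 16-byte aligned:

movss   _$TESTFILE$_Ld1(%rip),%xmm0   # load 0.2 into the low lane
shufps  $0,%xmm0,%xmm0                # replicate it across all four lanes
movaps  %xmm0,48(%rsp)                # one aligned 16-byte store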

The line "v3 := v1 + v2;" is vectorised, because the compiler can identify all of the operands as vector types, but, as already suspected, the command that should write %xmm0 back to the stack is missing:

movdqa  48(%rsp),%xmm0
addps   64(%rsp),%xmm0

The next operation is "call fpc_get_output", which begins the call to "WriteLn", so the result is never written to v3.

Also, there is a slight flaw in the generated code: "movdqa" is an integer move, not a floating-point one. Combined with the floating-point "addps" that follows, this can incur a bypass penalty on many processors for crossing between the integer and floating-point domains - "movaps" should be used instead.
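
Putting both fixes together, the intended sequence would presumably be as follows (the destination offset for v3 is my guess, continuing on from v1 at 48 and v2 at 64):

movaps  48(%rsp),%xmm0     # load v1 with a floating-point move
addps   64(%rsp),%xmm0     # add v2
movaps  %xmm0,80(%rsp)     # the missing store of the result into v3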

Regarding alignment, the stack is correctly aligned here: while no stack frame is set up, the "pushq %rbx" in the prologue brings the stack back to a 16-byte boundary. Depending on how easy or tricky it is to enforce stack alignment in general, it might be possible to avoid switching to the unaligned move commands at all.
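
For reference, the reason the single push suffices, assuming the standard x86-64 ABI where %rsp is 16-byte aligned at every call command:

# on entry, the call has pushed the return address, so %rsp mod 16 = 8
pushq   %rbx               # another 8 bytes: %rsp mod 16 = 0 again
# ...function body...
movaps  48(%rsp),%xmm0     # 48 is a multiple of 16, so aligned accesses are safe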

Once I've figured out how the compiler emits the vector commands, I'll see that it includes the missing movaps command. Initially I'll probably switch to movups to ensure no segmentation faults occur, and then migrate back to movaps once I can enforce the correct alignment automatically, with no input from the programmer - for example, by detecting that the variables are vector types and aligning them to a 16-byte boundary whenever SSE is selected. I'll let you know how it goes.
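
As a sketch, the safe interim version would be the following (note that a memory operand to addps itself requires 16-byte alignment, so the second operand has to go through an unaligned load as well):

movups  48(%rsp),%xmm0     # unaligned loads cannot fault
movups  64(%rsp),%xmm1
addps   %xmm1,%xmm0        # register-to-register addps has no alignment requirement
movups  %xmm0,80(%rsp)     # again, the v3 offset is only a guess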


Kit

----

P.S. Depending on how the optimizer is structured, I might suggest a kind of "Deep Optimizer" as part of -O3 (or -O4 if it's considered a little risky). It would run after all of the other compilation and optimisation stages, immediately prior to writing the assembler/object file, and would remove things like the redundant writes to %eax along with other optimisations that the peephole optimizer misses. For example, the .s file contains snippets of code akin to the following:

movq    %rax,%rbx
leaq    _$TESTFILE$_Ld3(%rip),%r8
movq    %rbx,%rdx

Because of the leaq command in between, the peephole optimizer doesn't notice the dependency stall that comes from writing to %rbx and then immediately reading it back to copy into %rdx. If this were detected, the code could be changed to the following:

movq    %rax,%rbx
leaq    _$TESTFILE$_Ld3(%rip),%r8
movq    %rax,%rdx

Changing %rbx to %rax in the second movq command breaks the dependency and takes advantage of modern processors' multiple ALUs (leaq modifies nothing but the unrelated %r8 here, so the change is safe), likely allowing this group of three commands to complete in a single CPU cycle instead of two.

Attachment: testfile.pp
Attachment: testfile.s
