Hi everyone,
Since I'm masochistic in my desire to understand and improve the Free Pascal
Compiler, I would like to add some vectorisation support to its optimisation
cycle, since that is something many other compilers attempt these days. But
before I begin, does FPC already support any kind of vectorisation? If it
does, I haven't been able to find it yet, and I don't want to end up
reinventing the wheel.
I recall cases, for example, where code like the following is not vectorised
even when the compiler is set to use SSE:
type
  TVector4f = packed record
    X, Y, Z, W: Single;
  end;

function VectorAdd(A, B: TVector4f): TVector4f;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;
The resulting assembler code uses an individual MOVSS and arithmetic
instruction for each element, rather than combining the reads and writes into
MOVUPS instructions and reducing the number of arithmetic instructions by a
factor of 4. For clarity, this is the assembler produced with '-CfSSE64':
.section .text.n_p$testfile_$$_addvector$tvector4f$tvector4f$$tvector4f,"x"
.balign 16,0x90
.globl P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F
P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F:
.Lc1:
.seh_proc P$TESTFILE_$$_ADDVECTOR$TVECTOR4F$TVECTOR4F$$TVECTOR4F
leaq -56(%rsp),%rsp
.Lc3:
.seh_stackalloc 56
.seh_endprologue
movq %rcx,%rax
movq %rdx,(%rsp)
movq %r8,8(%rsp)
movq (%rsp),%rdx
movq (%rdx),%rcx
movq %rcx,16(%rsp)
movq 8(%rdx),%rdx
movq %rdx,24(%rsp)
movq 8(%rsp),%rdx
movq (%rdx),%rcx
movq %rcx,32(%rsp)
movq 8(%rdx),%rdx
movq %rdx,40(%rsp)
movss 16(%rsp),%xmm0
addss 32(%rsp),%xmm0
movss %xmm0,(%rax)
movss 20(%rsp),%xmm0
addss 36(%rsp),%xmm0
movss %xmm0,4(%rax)
movss 24(%rsp),%xmm0
addss 40(%rsp),%xmm0
movss %xmm0,8(%rax)
movss 28(%rsp),%xmm0
addss 44(%rsp),%xmm0
movss %xmm0,12(%rax)
leaq 56(%rsp),%rsp
ret
.seh_endproc
.Lc2:
A good vectoriser (for lack of a better name!) would be able to reduce the 12
movss/addss instructions to just:

  movups 16(%rsp),%xmm0
  addps 32(%rsp),%xmm0
  movups %xmm0,(%rax)

Since the stack is aligned to a 16-byte boundary, it can replace the first
movups with a movaps too; that same alignment is what allows addps to take
its second operand straight from memory. I'm not sure what to do about
everything being copied to the stack first, though.
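To make the target concrete, here is a rough hand-written sketch of the packed form, checked against the scalar function. It is only an illustration, not compiler output or a proposed design. It assumes an x86-64 target and that FPC's default calling convention passes the three pointers in the platform ABI's first three integer argument registers (rcx, rdx, r8 on Win64; rdi, rsi, rdx elsewhere). PVector4f, VectorAddPacked and the test values are hypothetical names invented for this example.

```pascal
program VecAddSketch;

{$mode objfpc}
{$asmmode att}

type
  TVector4f = packed record
    X, Y, Z, W: Single;
  end;
  PVector4f = ^TVector4f;  { hypothetical helper type, not in the original code }

{ Scalar reference version, as in the example above. }
function VectorAdd(A, B: TVector4f): TVector4f;
begin
  Result.X := A.X + B.X;
  Result.Y := A.Y + B.Y;
  Result.Z := A.Z + B.Z;
  Result.W := A.W + B.W;
end;

{ Hypothetical hand-written packed version: one unaligned load per operand,
  one addps, one unaligned store. Assumes x86-64 and that the pointers arrive
  in the platform ABI's first three integer argument registers. }
procedure VectorAddPacked(A, B, Dest: PVector4f); assembler; nostackframe;
asm
{$ifdef WIN64}
  movups (%rcx), %xmm0
  movups (%rdx), %xmm1
  addps  %xmm1, %xmm0
  movups %xmm0, (%r8)
{$else}
  movups (%rdi), %xmm0
  movups (%rsi), %xmm1
  addps  %xmm1, %xmm0
  movups %xmm0, (%rdx)
{$endif}
end;

var
  VA, VB, Expected, Actual: TVector4f;
begin
  VA.X := 1.0;  VA.Y := 2.0;  VA.Z := 3.0;  VA.W := 4.0;
  VB.X := 10.0; VB.Y := 20.0; VB.Z := 30.0; VB.W := 40.0;

  Expected := VectorAdd(VA, VB);
  VectorAddPacked(@VA, @VB, @Actual);

  { The inputs are exactly representable, so exact comparison is safe here. }
  if (Expected.X = Actual.X) and (Expected.Y = Actual.Y) and
     (Expected.Z = Actual.Z) and (Expected.W = Actual.W) then
    WriteLn('packed result matches scalar result')
  else
  begin
    WriteLn('mismatch between packed and scalar results');
    Halt(1);
  end;
end.
```

The sketch uses movups throughout because a packed record of four Singles is not guaranteed to sit on a 16-byte boundary; movaps would only be safe where the compiler can prove the alignment, as with the stack copies in the listing above.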
I'm sure it's a mammoth task, but I would like to start somewhere with it.
Are there any design plans that I should adhere to, so that I don't end up
designing something that is disliked?
Kit
___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-devel