Siarhei Siamashka <siarhei.siamas...@gmail.com> writes:

> The loops are already unrolled, so it was just a matter of packing
> 4 pixels into a single XMM register and doing aligned 128-bit
> writes to memory via MOVDQA instructions for the SRC compositing
> operator fast path. For the other fast paths, this XMM register
> is also directly routed to further processing instead of doing
> extra reshuffling. This replaces "8 PACKSSDW/PACKUSWB + 4 MOVD"
> instructions with "3 PACKSSDW/PACKUSWB + 1 MOVDQA" per 4 pixels,
> which results in a clear performance improvement.
>
> There are also some other (less important) tweaks:
>
> 1. Convert 'pixman_fixed_t' to 'intptr_t' before using it as an
>    index for addressing memory. The problem is that 'pixman_fixed_t'
>    is a 32-bit data type and it has to be extended to 64-bit
>    offsets, which needs extra instructions on 64-bit systems.
>
> 2. Dropped support for 8-bit interpolation precision to simplify
>    the code.
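
For anyone following along, the instruction counts quoted above can be
illustrated with SSE2 intrinsics. This is only a rough sketch, not the
code from the patch: the helper names pack_one_pixel and
store_four_pixels are made up for this mail, and it assumes each pixel
arrives as four 32-bit channel values (B, G, R, A from the lowest lane
up) that are already in the range 0..255.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Illustrative helpers, not pixman code. Each __m128i argument holds one
 * pixel as four 32-bit channel values (B, G, R, A from the lowest lane up),
 * each already in the range 0..255. */

/* One pixel at a time: 2 packs + 1 MOVD per pixel,
 * so 8 packs + 4 MOVD for four pixels. */
static uint32_t
pack_one_pixel (__m128i p)
{
    p = _mm_packs_epi32 (p, p);               /* PACKSSDW */
    p = _mm_packus_epi16 (p, p);              /* PACKUSWB */
    return (uint32_t) _mm_cvtsi128_si32 (p);  /* MOVD */
}

/* Four pixels at once: 3 packs + 1 MOVDQA. */
static void
store_four_pixels (uint32_t *dst,
                   __m128i p0, __m128i p1, __m128i p2, __m128i p3)
{
    __m128i lo, hi, px;

    assert (((uintptr_t) dst & 15) == 0);     /* MOVDQA needs 16-byte alignment */

    lo = _mm_packs_epi32 (p0, p1);            /* PACKSSDW: pixels 0 and 1, 8 x 16-bit */
    hi = _mm_packs_epi32 (p2, p3);            /* PACKSSDW: pixels 2 and 3, 8 x 16-bit */
    px = _mm_packus_epi16 (lo, hi);           /* PACKUSWB: 16 x 8-bit = 4 pixels */
    _mm_store_si128 ((__m128i *) dst, px);    /* MOVDQA */
}
```

Both helpers should produce the same a8r8g8b8 values; the second one
simply shares the pack instructions across four pixels and replaces
four 32-bit moves with a single aligned 128-bit store.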

If we are dropping support for 8-bit precision, let's drop it
everywhere (in a separate patch from this optimization). I'll send a
patch as a follow-up to this mail.

The other question I have is whether you have tested if this makes
the SSE2 fast paths competitive with the SSSE3 iterator. If it does,
that would allow us to postpone dealing with the
iterators-vs-fastpaths problem.

Søren