Edouard Gomez ([EMAIL PROTECTED]) wrote:
> Here is an updated patch... still no SIMD... might come later.
OK, I found some time to write the kernel of the computation in SSE:
asm volatile (
"movaps (%1), %%xmm0\n\t" // xmm0 = Lr, Lg, Lb, 0
"movaps (%2), %%xmm1\n\t" // xmm1 = R, G, B, G2
"movaps %%xmm1, %%xmm4\n\t" // xmm4 = R, G, B, G2
"mulps %%xmm0, %%xmm1\n\t" // xmm1 = Lr*R, Lg*G, Lb*B, 0
"movhlps %%xmm1, %%xmm0\n\t" // xmm0 = LbB, 0, x, x
"addps %%xmm0, %%xmm1\n\t" // xmm1 = LrR + LbB, LgG, x, x
"movaps %%xmm1, %%xmm0\n\t" // xmm0 = LrR + LbB, LgG, x, x
"shufps $0x1, %%xmm0, %%xmm0\n\t" // xmm0 = LgG, x, x, x
"addps %%xmm1, %%xmm0\n\t" // xmm0 = Y = LrR + LbB + LgG, x, x, x
"movaps %%xmm0, %%xmm1\n\t" // xmm1 = LrR + LbB + LgG, x, x, x
"maxss %%xmm2, %%xmm1\n\t" // xmm1 = max(Y, 0)
"minss %%xmm3, %%xmm1\n\t" // xmm1 = min(Y, 65535)
"cvtss2si %%xmm1, %%rax\n\t" // rax = (int)Y, rounded per MXCSR (nearest by default), not truncated
"movss (%3,%%rax, 4), %%xmm1\n\t" // xmm1 = curve[(int)Y]
"mulss %%xmm3, %%xmm1\n\t" // xmm1 = curve[(int)Y]*65535.f = Y'
"maxss %%xmm2, %%xmm1\n\t" // xmm1 = max(Y', 0)
"minss %%xmm3, %%xmm1\n\t" // xmm1 = min(Y', 65535)
"divss %%xmm0, %%xmm1\n\t" // xmm1 = Y'/Y = a
"shufps $0x0, %%xmm1, %%xmm1\n\t" // xmm1 = a, a, a, a
"mulps %%xmm4, %%xmm1\n\t" // xmm1 = a*R, a*G, a*B, a*G2
"maxps %%xmm2, %%xmm1\n\t" // xmm1 = max(xmm1, 0)
"minps %%xmm3, %%xmm1\n\t" // xmm1 = min(xmm1, 65535)
"movaps %%xmm1, %0\n\t" // result = clamped a*R, a*G, a*B, a*G2
: "=m" (result)
: "r" (luminance),
"r" (rgbg),
"r" (curve)
: "%xmm0", "%xmm1", "%xmm4", "%rax", "memory");
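For comparison, here is a minimal scalar C sketch of what the SSE kernel
computes. The names clampf and luma_curve_scalar are made up for
illustration and are not Rawstudio functions. It assumes curve has 65536
entries, and it rounds with +0.5 where cvtss2si follows the MXCSR rounding
mode, so ties may land differently.

```c
/* Clamp v to [lo, hi]. */
static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Scalar equivalent of the SSE kernel (hypothetical helper).
 *   luminance = Lr, Lg, Lb, 0
 *   rgbg      = R, G, B, G2
 *   curve     = table assumed to hold 65536 entries, nominally in [0, 1]
 * As in the SSE code, Y == 0 is not handled and divides by zero.
 */
static void luma_curve_scalar(const float luminance[4], const float rgbg[4],
                              const float *curve, float result[4])
{
    /* Y = Lr*R + Lg*G + Lb*B */
    float y = luminance[0] * rgbg[0]
            + luminance[1] * rgbg[1]
            + luminance[2] * rgbg[2];

    /* Table index: Y clamped to [0, 65535], then rounded to nearest. */
    int idx = (int)(clampf(y, 0.f, 65535.f) + 0.5f);

    /* Y' = curve[Y] scaled back to 16 bits and clamped. */
    float yp = clampf(curve[idx] * 65535.f, 0.f, 65535.f);

    /* a = Y'/Y, applied to all four channels, then clamped. */
    float a = yp / y;
    int i;
    for (i = 0; i < 4; i++)
        result[i] = clampf(a * rgbg[i], 0.f, 65535.f);
}
```

Comparing the output of such a function against the SSE path on a few
pixels would be a cheap way to validate the asm.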
Assumptions:
1 - xmm2 is assumed to hold a vector of zeros
2 - xmm3 is assumed to hold a vector of 65535.f
3 - curve is supposed to hold values in [0.f, 1.f], but as it's not clipped
to that range prior to rendering, I included some max/min clipping
magic.
4 - luminance points to an aligned vector holding the 3 Y factors for RGB
plus a fourth value of 0
5 - result is an aligned float[4]
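Assumptions 1, 2, 4 and 5 could be satisfied along these lines. This is
only a sketch: kernel_operands, kernel_operands_init and is_aligned16 are
hypothetical names, the luminance weights are left to the caller, and
_Alignas needs a C11 compiler.

```c
#include <stdint.h>

/* Returns 1 when p sits on a 16-byte boundary, as movaps requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical operand block; every member is 16-byte aligned. */
typedef struct {
    _Alignas(16) float zero[4];      /* to be loaded into xmm2 */
    _Alignas(16) float max16[4];     /* to be loaded into xmm3 */
    _Alignas(16) float luminance[4]; /* Lr, Lg, Lb, 0 */
    _Alignas(16) float result[4];    /* kernel output */
} kernel_operands;

/* Fill the constant vectors; lr, lg, lb are the caller's Y factors. */
static void kernel_operands_init(kernel_operands *k,
                                 float lr, float lg, float lb)
{
    int i;
    for (i = 0; i < 4; i++) {
        k->zero[i] = 0.f;
        k->max16[i] = 65535.f;
        k->result[i] = 0.f;
    }
    k->luminance[0] = lr;
    k->luminance[1] = lg;
    k->luminance[2] = lb;
    k->luminance[3] = 0.f; /* fourth lane must be 0 so G2 does not add to Y */
}
```

Checking is_aligned16() on each pointer before entering the loop is a
cheap guard against a movaps fault.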
Possible changes:
1 - All alignment requirements can be dropped if necessary; just use the
unaligned movups instruction instead of movaps
2 - The SSE division can be spared if we precompute a curve[Y]/Y table;
this removes the need to clip the result to [0, 65535] and avoids
the stupid case Y=0, which is not handled in this code.
I just don't know the policy for precomputed tables in the code when
they're not used on all platforms and the platform is runtime
detected.
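The precomputed table from change 2 might look like the scalar sketch
below. All names (CURVE_SIZE, clamp16, build_ratio_table, apply_ratio) are
hypothetical. Two choices here are assumptions rather than anything decided
above: the entry for Y=0 is set to 0, and the final per-channel clamp is
kept because a*R can still exceed 65535. Note that the table divides by the
rounded index rather than the exact Y, so results differ slightly from the
divss version.

```c
#include <stddef.h>

#define CURVE_SIZE 65536 /* assumed number of curve entries */

/* Clamp v to the 16-bit range [0, 65535]. */
static float clamp16(float v)
{
    return v < 0.f ? 0.f : (v > 65535.f ? 65535.f : v);
}

/* ratio[i] = clamp(curve[i] * 65535) / i, built once per curve change. */
static void build_ratio_table(const float *curve, float *ratio)
{
    size_t i;
    ratio[0] = 0.f; /* assumption: Y = 0 gets a gain of 0 */
    for (i = 1; i < CURVE_SIZE; i++)
        ratio[i] = clamp16(curve[i] * 65535.f) / (float)i;
}

/* Per-pixel work: one table lookup and multiplies, no division. */
static void apply_ratio(const float *ratio, const float luminance[4],
                        const float rgbg[4], float result[4])
{
    float y = luminance[0] * rgbg[0]
            + luminance[1] * rgbg[1]
            + luminance[2] * rgbg[2];
    float a = ratio[(int)(clamp16(y) + 0.5f)];
    int i;
    for (i = 0; i < 4; i++)
        result[i] = clamp16(a * rgbg[i]);
}
```

In the SSE version this would replace the mulss/maxss/minss/divss sequence
after the table load with the load alone.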
OK, time to get some rest :-) It's now 2 AM.
--
Edouard Gomez
_______________________________________________
Rawstudio-dev mailing list
[email protected]
http://rawstudio.org/cgi-bin/mailman/listinfo/rawstudio-dev