Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-17 Thread Rune Petersen
Roland Scheidegger wrote:
> Roland Scheidegger wrote:
>> Rune Petersen wrote:
>> Also, the comments for SCS seem a bit off. That's a pity, because 
>> without comments I can't really see what the code does at first sight
>>  :-). Looks like quite a few extra instructions though, are you sure
>> not more could be shared for calculating both sin and cos?
> I've looked a bit closer (this is an interesting optimization
> problem...) and I think it should be doable with fewer instructions,
> though ultimately I needed 2 temps instead of 1 (I don't think it's much
> of a problem, 32 is plenty, PS2.0 only exposes 12).
> 
> Ok the equation was:
> Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
> 
> Simplified to:
> y = B * x + C * x * abs(x)
> y = P * (y * abs(y) - y) + y
> 
> const0: B,C,pi,P
> const1: 0.5pi, 0.75, 1/(2pi), 2.0pi
> 
> That's what I came up with with pseudo-code:
> //should be 5 slots (I guess it might generate 6 due to force same-slot,
> //but that needs fixing elewhere)
> 
> //cos is even: cos(x) = cos(-x). So using simple trigo-fu
> //we get sin(neg(abs(x)) + pi/2)) = cos(x), no comparison needed and all
> //values for sine stay inside [-pi,pi] ([-pi/2, pi/2], actually)
> //hope it's ok to use neg+abs simultaneously?
> temp.z = add(neg(abs(src)), const1.x)
> temp.w = mul(src, C)
> 
> //temp.xy = B*x, C*x (cos), temp.w = C * x, temp2.w = B * x (sin)
> temp.xy = mul(temp.z, BC)
> temp2.w = mul(src, B)
> 
> //do cos in alpha slot not sin due to restricted swizzling
> //sin y = B * x + C * x * abs(x)
> temp2.z = mad(temp.w, abs(src), temp2.w)
> //cos
> temp2.w = mad(temp.y, abs(temp.z), temp.x)
> 
> temp.xy = mad(temp2.wzy, abs(temp2.wzy), neg(temp2.wzy))
> // now temp.x holds y * abs(y) - y for cos, temp.y same for sin
> 
> dest.xy = mad(temp.xy, P, temp2.wzy)
> 
> range reduction for cos:
> x = (x/(2*PI))+0.75
> x = frac(x)
> x = (x*2*PI)-PI
> 
> sin:
> x = (x/(2*PI))+HALF
> x = frac(x)
> x = (x*2*PI)-PI
> 
> Isn't that an elegant solution :-) There may be any number of bugs, of
> course...
> 

I have attached a patch that implements your solution. looks fine, and
it uses 6 slots until I teach emit_arith() how to pack instructions.


Rune Petersen
diff --git a/src/mesa/drivers/dri/r300/r300_fragprog.c b/src/mesa/drivers/dri/r300/r300_fragprog.c
index 8e45bd5..a1c634a 100644
--- a/src/mesa/drivers/dri/r300/r300_fragprog.c
+++ b/src/mesa/drivers/dri/r300/r300_fragprog.c
@@ -1214,8 +1214,8 @@ static void make_sin_const(struct r300_fragment_program *rp)
 	cnstv[3] = 0.2225;  // weight
 	rp->const_sin[0] = emit_const4fv(rp, cnstv);
 
-	cnstv[0] = 0.5;
-	cnstv[1] = -1.5;
+	cnstv[0] = 0.75;
+	cnstv[1] = 0.0;
 	cnstv[2] = 0.159154943; // 1/(2*PI)
 	cnstv[3] = 6.283185307; // 2*PI
 	rp->const_sin[1] = emit_const4fv(rp, cnstv);
@@ -1227,7 +1227,7 @@ static GLboolean parse_program(struct r300_fragment_program *rp)
 	struct gl_fragment_program *mp = &rp->mesa_program;
 	const struct prog_instruction *inst = mp->Base.Instructions;
 	struct prog_instruction *fpi;
-	GLuint src[3], dest, temp;
+	GLuint src[3], dest, temp[2];
 	GLuint cnst;
 	int flags, mask = 0;
 	GLfloat cnstv[4] = {0.0, 0.0, 0.0, 0.0};
@@ -1277,70 +1277,63 @@ static GLboolean parse_program(struct r300_fragment_program *rp)
 			/*
 			 * cos using a parabola (see SIN):
 			 * cos(x):
-			 *   x += PI/2
-			 *   x = (x/(2*PI))+0.5
+			 *   x = (x/(2*PI))+0.75
 			 *   x = frac(x)
 			 *   x = (x*2*PI)-PI
 			 *   result = sin(x)
 			 */
-			temp = get_temp_reg(rp);
+			temp[0] = get_temp_reg(rp);
 			make_sin_const(rp);
 			src[0] = t_scalar_src(rp, fpi->SrcReg[0]);
 
 			/* add 0.5*PI and do range reduction */
 
-			emit_arith(rp, PFS_OP_MAD, temp, WRITEMASK_X,
-   swizzle(rp->const_sin[0], Z, Z, Z, Z), //PI
-   pfs_half,
-   swizzle(keep(src[0]), X, X, X, X),
-   0);
-
-			emit_arith(rp, PFS_OP_MAD, temp, WRITEMASK_X,
-   swizzle(temp, X, X, X, X),
+			emit_arith(rp, PFS_OP_MAD, temp[0], WRITEMASK_X,
+   swizzle(src[0], X, X, X, X),
    swizzle(rp->const_sin[1], Z, Z, Z, Z),
-   pfs_half,
+   swizzle(rp->const_sin[1], X, X, X, X),
    0);
 
-			emit_arith(rp, PFS_OP_FRC, temp, WRITEMASK_X,
-   swizzle(temp, X, X, X, X),
+			emit_arith(rp, PFS_OP_FRC, temp[0], WRITEMASK_X,
+   swizzle(temp[0], X, X, X, X),
    undef,
    undef,
    0);
 
-			emit_arith(rp, PFS_OP_MAD, temp, WRITEMASK_Z,
-   swizzle(temp, X, X, X, X),
+			emit_arith(rp, PFS_OP_MAD, temp[0], WRITEMASK_Z,
+   swizzle(temp[0], X, X, X, X),
    swizzle(rp->const_sin[1], W, W, W, W), //2*PI
    negate(swizzle(rp->const_sin[0], Z, Z, Z, Z)), //-PI
    0);
 
 			/* SIN */
 
-			emit_arith(rp, PFS_OP_MAD, temp, WRITEMASK_X | WRITEMASK_Y,
-   swizzle(temp, Z, Z, Z, Z),
+			emit_arith(rp, PFS_OP_MAD, temp[0], WRITEMASK_X | WRITEMASK_Y,
+   swizzle(temp[0], Z, Z, Z, Z),
    rp->const_sin[0],
    pfs_zero,
    0);
 
-			emit_arith(rp, PFS_OP_MAD, te

Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-15 Thread Rune Petersen
Roland Scheidegger wrote:
> Roland Scheidegger wrote:
>> Rune Petersen wrote:
>>> This patch: - Fixes COS. - Does range reductions for SIN & COS. - 
>>> Adds SCS. - removes the optimized version of SIN & COS. - tweaked 
>>> weight (should help on precision). - fixed a copy paste typo in 
>>> emit_arith().
>>>
>>> Roland would you mind testing if the tweaked weight helped?
>> Well I didn't test it first time (just quoting the numbers from the
>> link you provided), but I guess that's fine too. I was actually
>> wondering myself if it's better to optimize for absolute or relative
>> error, so choosing a weight in-between should work too (the
>> difference is not that big after all).
>>
>> A couple comments though: Since ((x + PI/2)/(2*PI))+0.5 is (x/(2*PI)
>> + (1/4 + 0.5) you could optimize away the first mad for the COS case.
>>
> Ah I see you're a bit short on consts, if you want to only use 2 (btw
> I'd say there should be 32 not only 16 but I have no idea why the driver
> restricts it to 16).
> 
>> Also, the comments for SCS seem a bit off. That's a pity, because 
>> without comments I can't really see what the code does at first sight
>>  :-). Looks like quite a few extra instructions though, are you sure
>> not more could be shared for calculating both sin and cos?
> I've looked a bit closer (this is an interesting optimization
> problem...) and I think it should be doable with fewer instructions,
> though ultimately I needed 2 temps instead of 1 (I don't think it's much
> of a problem, 32 is plenty, PS2.0 only exposes 12).
> 
> Ok the equation was:
> Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2
> 
> Simplified to:
> y = B * x + C * x * abs(x)
> y = P * (y * abs(y) - y) + y
> 
> const0: B,C,pi,P
> const1: 0.5pi, 0.75, 1/(2pi), 2.0pi
> 
> That's what I came up with with pseudo-code:
> //should be 5 slots (I guess it might generate 6 due to force same-slot,
> //but that needs fixing elewhere)
> 
> //cos is even: cos(x) = cos(-x). So using simple trigo-fu
> //we get sin(neg(abs(x)) + pi/2)) = cos(x), no comparison needed and all
> //values for sine stay inside [-pi,pi] ([-pi/2, pi/2], actually)
> //hope it's ok to use neg+abs simultaneously?
> temp.z = add(neg(abs(src)), const1.x)
> temp.w = mul(src, C)
> 
> //temp.xy = B*x, C*x (cos), temp.w = C * x, temp2.w = B * x (sin)
> temp.xy = mul(temp.z, BC)
> temp2.w = mul(src, B)
> 
> //do cos in alpha slot not sin due to restricted swizzling
> //sin y = B * x + C * x * abs(x)
> temp2.z = mad(temp.w, abs(src), temp2.w)
> //cos
> temp2.w = mad(temp.y, abs(temp.z), temp.x)
> 
> temp.xy = mad(temp2.wzy, abs(temp2.wzy), neg(temp2.wzy))
> // now temp.x holds y * abs(y) - y for cos, temp.y same for sin
> 
> dest.xy = mad(temp.xy, P, temp2.wzy)
> 
> range reduction for cos:
> x = (x/(2*PI))+0.75
> x = frac(x)
> x = (x*2*PI)-PI
> 
> sin:
> x = (x/(2*PI))+HALF
> x = frac(x)
> x = (x*2*PI)-PI
> 
> Isn't that an elegant solution :-) There may be any number of bugs, of
> course...

Very elegant I must say. Thank you I'll see about implementing this.


Rune Petersen

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-15 Thread Rune Petersen
Roland Scheidegger wrote:
> Rune Petersen wrote:
>> This patch: - Fixes COS. - Does range reductions for SIN & COS. -
>> Adds SCS. - removes the optimized version of SIN & COS. - tweaked
>> weight (should help on precision). - fixed a copy paste typo in
>> emit_arith().
>>
>> Roland would you mind testing if the tweaked weight helped?
> Well I didn't test it first time (just quoting the numbers from the link
> you provided), but I guess that's fine too. I was actually wondering
> myself if it's better to optimize for absolute or relative error, so
> choosing a weight in-between should work too (the difference is not that
> big after all).
> 
> A couple comments though:
> Since ((x + PI/2)/(2*PI))+0.5 is (x/(2*PI) + (1/4 + 0.5)
> you could optimize away the first mad for the COS case.

I knew you were going to catch that.
I ran out of scalers in the 2 consts I use. Decided to trade one
instruction for one const. If you think its better to sacrifice a const,
I wont mind changing it.

> 
> Also, the comments for SCS seem a bit off. That's a pity, because
> without comments I can't really see what the code does at first sight
> :-). Looks like quite a few extra instructions though, are you sure not
> more could be shared for calculating both sin and cos?
sorry about the comment.

I see SCS as a work in progress.
I felt I was fighting how emit_arith() syncs the vector & scalar
instructions.

I intend to improve amit_arith() before taking a shot at improving SCS
any further.

I was thinking of storing when each scalar in a temp was written to and
use it to pack th instruction tighter.

> Otherwise, looks good to me. Keep up the good work!

Thank you for you comments.


Rune Petersen

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-15 Thread Roland Scheidegger
Roland Scheidegger wrote:
> Rune Petersen wrote:
>> This patch: - Fixes COS. - Does range reductions for SIN & COS. - 
>> Adds SCS. - removes the optimized version of SIN & COS. - tweaked 
>> weight (should help on precision). - fixed a copy paste typo in 
>> emit_arith().
>> 
>> Roland would you mind testing if the tweaked weight helped?
> Well I didn't test it first time (just quoting the numbers from the
> link you provided), but I guess that's fine too. I was actually
> wondering myself if it's better to optimize for absolute or relative
> error, so choosing a weight in-between should work too (the
> difference is not that big after all).
> 
> A couple comments though: Since ((x + PI/2)/(2*PI))+0.5 is (x/(2*PI)
> + (1/4 + 0.5) you could optimize away the first mad for the COS case.
> 
Ah I see you're a bit short on consts, if you want to only use 2 (btw
I'd say there should be 32 not only 16 but I have no idea why the driver
restricts it to 16).

> Also, the comments for SCS seem a bit off. That's a pity, because 
> without comments I can't really see what the code does at first sight
>  :-). Looks like quite a few extra instructions though, are you sure
> not more could be shared for calculating both sin and cos?
I've looked a bit closer (this is an interesting optimization
problem...) and I think it should be doable with fewer instructions,
though ultimately I needed 2 temps instead of 1 (I don't think it's much
of a problem, 32 is plenty, PS2.0 only exposes 12).

Ok the equation was:
Q (4/pi x - 4/pi^2 x^2) + P (4/pi x - 4/pi^2 x^2)^2

Simplified to:
y = B * x + C * x * abs(x)
y = P * (y * abs(y) - y) + y

const0: B,C,pi,P
const1: 0.5pi, 0.75, 1/(2pi), 2.0pi

That's what I came up with with pseudo-code:
//should be 5 slots (I guess it might generate 6 due to force same-slot,
//but that needs fixing elewhere)

//cos is even: cos(x) = cos(-x). So using simple trigo-fu
//we get sin(neg(abs(x)) + pi/2)) = cos(x), no comparison needed and all
//values for sine stay inside [-pi,pi] ([-pi/2, pi/2], actually)
//hope it's ok to use neg+abs simultaneously?
temp.z = add(neg(abs(src)), const1.x)
temp.w = mul(src, C)

//temp.xy = B*x, C*x (cos), temp.w = C * x, temp2.w = B * x (sin)
temp.xy = mul(temp.z, BC)
temp2.w = mul(src, B)

//do cos in alpha slot not sin due to restricted swizzling
//sin y = B * x + C * x * abs(x)
temp2.z = mad(temp.w, abs(src), temp2.w)
//cos
temp2.w = mad(temp.y, abs(temp.z), temp.x)

temp.xy = mad(temp2.wzy, abs(temp2.wzy), neg(temp2.wzy))
// now temp.x holds y * abs(y) - y for cos, temp.y same for sin

dest.xy = mad(temp.xy, P, temp2.wzy)

range reduction for cos:
x = (x/(2*PI))+0.75
x = frac(x)
x = (x*2*PI)-PI

sin:
x = (x/(2*PI))+HALF
x = frac(x)
x = (x*2*PI)-PI

Isn't that an elegant solution :-) There may be any number of bugs, of
course...

Roland


-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-15 Thread Rune Petersen
Jerome Glisse wrote:
> On 2/14/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
>> Roland Scheidegger wrote:
>>> Roland Scheidegger wrote:
>> Rune Petersen
>>
> Ok commited.
 I didn't look too closely at this but I've a couple of comments.
 - COS looks too complicated & broken. If you'd want to get 2 with a
 LOG2, you'd need 0.25 as source. But even using RCP instead, that's 5
 instructions before performing the sine, for something you can easily do
 in two, using another constant (just 1 add + 1 cmp needed, if you use
 the right constants for the add). Maybe it's not that bad though, I
 don't know how many rgb and a slots it will actually consume, but still,
 are constant slots that rare?
 Second, you'd really need to do range reduction of the input, otherwise
 results will be very wrong for inputs outside [-pi, pi]. This would be
 true for taylor approximation too, of course, unless you do an infinite
 series :-). You wouldn't need to do that for SCS.
>>> Oh, and forgot to mention, you probably really want to use the higher
>>> precision variant by default. 12% max relative error (and even absolute
>>> it's still 6%) will likely be visible in some cases depending what the
>>> shader is doing. Even the enhanced version seems to miss opengl
>>> conformance (accurate to "about 1 part in 10^5") by roughly a factor of
>>> 10, which stretches the meaning of "about" a bit probably already.
>>> You could also rely on the precision hint for fragment programs to
>>> switch to the faster version instead of a dri conf option (note though
>>> the spec explicitly states implementations are discouraged even in this
>>> case to perform optimizations which could have significant impact on the
>>> output).
>>>
>> This patch:
>>  - Fixes COS.
>>  - Does range reductions for SIN & COS.
>>  - Adds SCS.
>>  - removes the optimized version of SIN & COS.
>>  - tweaked weight (should help on precision).
>>  - fixed a copy paste typo in emit_arith().
>>
>> Roland would you mind testing if the tweaked weight helped?
>>
>> And Jerome would you mind committing this?
>>
>>
>> Rune Petersen
>>
> 
> Pushed, git isn't so frightening trust me :)
> 
I'll make you a promise: after March 1. I'll commit my own patches..
until than I can't really get a handle on it (too much on my mind).


Rune Petersen

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-14 Thread Roland Scheidegger
Rune Petersen wrote:
> This patch: - Fixes COS. - Does range reductions for SIN & COS. -
> Adds SCS. - removes the optimized version of SIN & COS. - tweaked
> weight (should help on precision). - fixed a copy paste typo in
> emit_arith().
> 
> Roland would you mind testing if the tweaked weight helped?
Well I didn't test it first time (just quoting the numbers from the link
you provided), but I guess that's fine too. I was actually wondering
myself if it's better to optimize for absolute or relative error, so
choosing a weight in-between should work too (the difference is not that
big after all).

A couple comments though:
Since ((x + PI/2)/(2*PI))+0.5 is (x/(2*PI) + (1/4 + 0.5)
you could optimize away the first mad for the COS case.

Also, the comments for SCS seem a bit off. That's a pity, because
without comments I can't really see what the code does at first sight
:-). Looks like quite a few extra instructions though, are you sure not
more could be shared for calculating both sin and cos?

Otherwise, looks good to me. Keep up the good work!

Roland

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-14 Thread Jerome Glisse
On 2/14/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
> Roland Scheidegger wrote:
> > Roland Scheidegger wrote:
>  Rune Petersen
> 
> >>> Ok commited.
> >> I didn't look too closely at this but I've a couple of comments.
> >> - COS looks too complicated & broken. If you'd want to get 2 with a
> >> LOG2, you'd need 0.25 as source. But even using RCP instead, that's 5
> >> instructions before performing the sine, for something you can easily do
> >> in two, using another constant (just 1 add + 1 cmp needed, if you use
> >> the right constants for the add). Maybe it's not that bad though, I
> >> don't know how many rgb and a slots it will actually consume, but still,
> >> are constant slots that rare?
> >> Second, you'd really need to do range reduction of the input, otherwise
> >> results will be very wrong for inputs outside [-pi, pi]. This would be
> >> true for taylor approximation too, of course, unless you do an infinite
> >> series :-). You wouldn't need to do that for SCS.
> >
> > Oh, and forgot to mention, you probably really want to use the higher
> > precision variant by default. 12% max relative error (and even absolute
> > it's still 6%) will likely be visible in some cases depending what the
> > shader is doing. Even the enhanced version seems to miss opengl
> > conformance (accurate to "about 1 part in 10^5") by roughly a factor of
> > 10, which stretches the meaning of "about" a bit probably already.
> > You could also rely on the precision hint for fragment programs to
> > switch to the faster version instead of a dri conf option (note though
> > the spec explicitly states implementations are discouraged even in this
> > case to perform optimizations which could have significant impact on the
> > output).
> >
> This patch:
>  - Fixes COS.
>  - Does range reductions for SIN & COS.
>  - Adds SCS.
>  - removes the optimized version of SIN & COS.
>  - tweaked weight (should help on precision).
>  - fixed a copy paste typo in emit_arith().
>
> Roland would you mind testing if the tweaked weight helped?
>
> And Jerome would you mind committing this?
>
>
> Rune Petersen
>

Pushed, git isn't so frightening trust me :)

best,
Jerome Glisse

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-14 Thread Rune Petersen
Roland Scheidegger wrote:
> Roland Scheidegger wrote:
 Rune Petersen

>>> Ok commited.
>> I didn't look too closely at this but I've a couple of comments.
>> - COS looks too complicated & broken. If you'd want to get 2 with a
>> LOG2, you'd need 0.25 as source. But even using RCP instead, that's 5
>> instructions before performing the sine, for something you can easily do
>> in two, using another constant (just 1 add + 1 cmp needed, if you use
>> the right constants for the add). Maybe it's not that bad though, I
>> don't know how many rgb and a slots it will actually consume, but still,
>> are constant slots that rare?
>> Second, you'd really need to do range reduction of the input, otherwise
>> results will be very wrong for inputs outside [-pi, pi]. This would be
>> true for taylor approximation too, of course, unless you do an infinite
>> series :-). You wouldn't need to do that for SCS.
> 
> Oh, and forgot to mention, you probably really want to use the higher
> precision variant by default. 12% max relative error (and even absolute
> it's still 6%) will likely be visible in some cases depending what the
> shader is doing. Even the enhanced version seems to miss opengl
> conformance (accurate to "about 1 part in 10^5") by roughly a factor of
> 10, which stretches the meaning of "about" a bit probably already.
> You could also rely on the precision hint for fragment programs to
> switch to the faster version instead of a dri conf option (note though
> the spec explicitly states implementations are discouraged even in this
> case to perform optimizations which could have significant impact on the
> output).
> 
This patch:
 - Fixes COS.
 - Does range reductions for SIN & COS.
 - Adds SCS.
 - removes the optimized version of SIN & COS.
 - tweaked weight (should help on precision).
 - fixed a copy paste typo in emit_arith().

Roland would you mind testing if the tweaked weight helped?

And Jerome would you mind committing this?


Rune Petersen
diff --git a/src/mesa/drivers/dri/r300/r300_context.h b/src/mesa/drivers/dri/r300/r300_context.h
index b140235..48b50bc 100644
--- a/src/mesa/drivers/dri/r300/r300_context.h
+++ b/src/mesa/drivers/dri/r300/r300_context.h
@@ -731,7 +731,7 @@ struct r300_fragment_program {
 	int max_temp_idx;
 
 	/* the index of the sin constant is stored here */
-	GLint const_sin;
+	GLint const_sin[2];
 	
 	GLuint optimization;
 };
diff --git a/src/mesa/drivers/dri/r300/r300_fragprog.c b/src/mesa/drivers/dri/r300/r300_fragprog.c
index b00cf9e..8e45bd5 100644
--- a/src/mesa/drivers/dri/r300/r300_fragprog.c
+++ b/src/mesa/drivers/dri/r300/r300_fragprog.c
@@ -33,7 +33,6 @@
 
 /*TODO'S
  *
- * - SCS instructions
  * - Depth write, WPOS/FOGC inputs
  * - FogOption
  * - Verify results of opcodes for accuracy, I've only checked them
@@ -1081,7 +1080,7 @@ static void emit_arith(struct r300_fragment_program *rp,
 break;
 			}
 			if (emit_sop &&
-			(s_swiz[REG_GET_VSWZ(src[i])].flags & SLOT_VECTOR)) {
+			(s_swiz[REG_GET_SSWZ(src[i])].flags & SLOT_VECTOR)) {
 vpos = spos = MAX2(vpos, spos);
 break;
 			}
@@ -1204,6 +1203,25 @@ static GLuint get_attrib(struct r300_fragment_program *rp, GLuint attr)
 }
 #endif
 
+static void make_sin_const(struct r300_fragment_program *rp)
+{
+	if(rp->const_sin[0] == -1){
+	GLfloat cnstv[4];
+
+	cnstv[0] = 1.273239545; // 4/PI
+	cnstv[1] =-0.405284735; // -4/(PI*PI)
+	cnstv[2] = 3.141592654; // PI
+	cnstv[3] = 0.2225;  // weight
+	rp->const_sin[0] = emit_const4fv(rp, cnstv);
+
+	cnstv[0] = 0.5;
+	cnstv[1] = -1.5;
+	cnstv[2] = 0.159154943; // 1/(2*PI)
+	cnstv[3] = 6.283185307; // 2*PI
+	rp->const_sin[1] = emit_const4fv(rp, cnstv);
+	}
+}
+
 static GLboolean parse_program(struct r300_fragment_program *rp)
 {	
 	struct gl_fragment_program *mp = &rp->mesa_program;
@@ -1260,84 +1278,68 @@ static GLboolean parse_program(struct r300_fragment_program *rp)
 			 * cos using a parabola (see SIN):
 			 * cos(x):
 			 *   x += PI/2
-			 *   x = (x < PI)?x : x-2*PI
+			 *   x = (x/(2*PI))+0.5
+			 *   x = frac(x)
+			 *   x = (x*2*PI)-PI
 			 *   result = sin(x)
 			 */
 			temp = get_temp_reg(rp);
-			if(rp->const_sin == -1){
-			cnstv[0] = 1.273239545;
-			cnstv[1] =-0.405284735;
-			cnstv[2] = 3.141592654;
-			cnstv[3] = 0.225;
-			rp->const_sin = emit_const4fv(rp, cnstv);
-			}
-			cnst = rp->const_sin;			
+			make_sin_const(rp);
 			src[0] = t_scalar_src(rp, fpi->SrcReg[0]);
 
-			emit_arith(rp, PFS_OP_LG2, temp, WRITEMASK_W,
-   pfs_half,
-   undef,
-   undef,
-   0);
+			/* add 0.5*PI and do range reduction */
 
 			emit_arith(rp, PFS_OP_MAD, temp, WRITEMASK_X,
-   swizzle(cnst, Z, Z, Z, Z), //PI
+   swizzle(rp->const_sin[0], Z, Z, Z, Z), //PI
    pfs_half,
    swizzle(keep(src[0]), X, X, X, X),
    0);
 
-			emit_arith(rp, PFS_OP_MAD, temp, WRITEMASK_W,
-   negate(swizzle(temp, W, W, W, W)), //-2
-   swizzle(cnst, Z, Z, Z, Z), /

Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-12 Thread Rune Petersen
Roland Scheidegger wrote:
> Roland Scheidegger wrote:
 Rune Petersen

>>> Ok commited.
>> I didn't look too closely at this but I've a couple of comments.
>> - COS looks too complicated & broken. If you'd want to get 2 with a
>> LOG2, you'd need 0.25 as source. But even using RCP instead, that's 5
>> instructions before performing the sine, for something you can easily do
>> in two, using another constant (just 1 add + 1 cmp needed, if you use
>> the right constants for the add). Maybe it's not that bad though, I
>> don't know how many rgb and a slots it will actually consume, but still,
>> are constant slots that rare?
>> Second, you'd really need to do range reduction of the input, otherwise
>> results will be very wrong for inputs outside [-pi, pi]. This would be
>> true for taylor approximation too, of course, unless you do an infinite
>> series :-). You wouldn't need to do that for SCS.

The mess of trying to get 2 from RCP was a brain fart on my part.
And since I forgot the range reduction, I'll add PI*0.5 and then do the
range reduction. And this time I will need a constant more (r300 has 16).
> 
> Oh, and forgot to mention, you probably really want to use the higher
> precision variant by default. 12% max relative error (and even absolute
> it's still 6%) will likely be visible in some cases depending what the
> shader is doing. Even the enhanced version seems to miss opengl
> conformance (accurate to "about 1 part in 10^5") by roughly a factor of
> 10, which stretches the meaning of "about" a bit probably already.
> You could also rely on the precision hint for fragment programs to
> switch to the faster version instead of a dri conf option (note though
> the spec explicitly states implementations are discouraged even in this
> case to perform optimizations which could have significant impact on the
> output).

it helps to have the numbers. when making these decisions.

Thank you for your feedback.


Rune Petersen

-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-12 Thread Roland Scheidegger
Roland Scheidegger wrote:
>>> Rune Petersen
>>>
>> Ok commited.
> 
> I didn't look too closely at this but I've a couple of comments.
> - COS looks too complicated & broken. If you'd want to get 2 with a
> LOG2, you'd need 0.25 as source. But even using RCP instead, that's 5
> instructions before performing the sine, for something you can easily do
> in two, using another constant (just 1 add + 1 cmp needed, if you use
> the right constants for the add). Maybe it's not that bad though, I
> don't know how many rgb and a slots it will actually consume, but still,
> are constant slots that rare?
> Second, you'd really need to do range reduction of the input, otherwise
> results will be very wrong for inputs outside [-pi, pi]. This would be
> true for taylor approximation too, of course, unless you do an infinite
> series :-). You wouldn't need to do that for SCS.

Oh, and forgot to mention, you probably really want to use the higher
precision variant by default. 12% max relative error (and even absolute
it's still 6%) will likely be visible in some cases depending what the
shader is doing. Even the enhanced version seems to miss opengl
conformance (accurate to "about 1 part in 10^5") by roughly a factor of
10, which stretches the meaning of "about" a bit probably already.
You could also rely on the precision hint for fragment programs to
switch to the faster version instead of a dri conf option (note though
the spec explicitly states implementations are discouraged even in this
case to perform optimizations which could have significant impact on the
output).

Roland

-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-12 Thread Roland Scheidegger
>>
>> Rune Petersen
>>
> 
> Ok commited.

I didn't look too closely at this but I've a couple of comments.
- COS looks too complicated & broken. If you'd want to get 2 with a
LOG2, you'd need 0.25 as source. But even using RCP instead, that's 5
instructions before performing the sine, for something you can easily do
in two, using another constant (just 1 add + 1 cmp needed, if you use
the right constants for the add). Maybe it's not that bad though, I
don't know how many rgb and a slots it will actually consume, but still,
are constant slots that rare?
Second, you'd really need to do range reduction of the input, otherwise
results will be very wrong for inputs outside [-pi, pi]. This would be
true for taylor approximation too, of course, unless you do an infinite
series :-). You wouldn't need to do that for SCS.

Roland


-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-11 Thread Jerome Glisse
On 2/11/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
> Rune Petersen wrote:
> > .
> >
> > Rune Petersen wrote:
> >> Jerome Glisse wrote:
> >>> On 2/11/07, Jerome Glisse <[EMAIL PROTECTED]> wrote:
>  On 2/10/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Getting proper SIN and COS wasn't as easy as it appeared. I had to make
> > make some changes to the fragment program code.
> >
> > general FP changes:
> > - support HHH swizzle for vector instructions.
> > - don't copy a source to a temp when it is not XYZW swizzled, but
> >   combine the two and have the swizzle resolve any issues.
> >   (saves temps/instructions with more elaborate shader code)
> > - Disable refcounting of temps.
> >   The temp R0 in progs/fp/tri-cos.c is freed prematurely.
> >   This should be resolved properly.
> > - fix overflow in cnstv[].
> >
> >
> > SIN & COS:
> > they are based on:
> > http://www.devmaster.net/forums/showthread.php?t=5784
> >
> > There is an fast and a slow(high precision) version of SIN & COS.
> >
> > For SIN:
> > fast = 2 vector instructions
> > slow = 5 vector instructions
> >
> > For COS:
> > fast = 5 vector instructions + 2 scaler instructions
> > slow = 8 vector instructions + 2 scaler instructions
> >
> > The fast version appears to do a fine enough job, at least with the
> > simple test I have made.
> >
> >
> > Rune Petersen
>  Nice to tackle this :) few comment, maybe we could make an driconf
>  option to switch btw fast and slow version (or a more general conf
>  option to enable or disable fragprog optimization in case we come
>  with more optimization like that in the future).
> 
>  For the refcounting i am wondering if i didn't bump into that in
>  the past, i did use gdb to trace fragprog construction at that
>  time and found some strange interaction (which lead me to
>  the rework i did on fragprog).
> 
>  Anyway here from limited testing your patch seems good,
>  you should commit it.
> 
>  best,
>  Jerome Glisse
> 
> >>> Attached a patch to fix refcounting. Basicly whenever a temporary
> >>> source was used multiple time inside an instruction that lead to
> >>> multiple call to t_hw_src which is correct but as we also decrement
> >>> use counter in that function we over decremented the refcount.
> >>>
> >>> The patch decrement refcount after instruction decoding and avoid
> >>> over decrementing refcount.
> >>>
> >>> (The patch apply over yours)
> >>>
> >>> best,
> >>> Jerome
> >> I have found that the main reason for my problem was I forgot to use the
> >> keep() on the source.
> >>
> >> I think your patch is too intrusive. As long as keep() is used at the
> >> right places, you could move the refcount inside emit_arith() making the
> >> change more contained.
> >>
> >> Update patch attached.
> >>
> >> Could I get you to commit this, since I will not be able to find the
> >> time to figure GIT out any time (and lost my sig, don't ask).
> >>
> How I managed to compile that is beyond me.
> Here is a proper patch.
>
> Rune Petersen
>

Ok commited.

best,
Jerome Glisse

-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-11 Thread Rune Petersen
Rune Petersen wrote:
> .
> 
> Rune Petersen wrote:
>> Jerome Glisse wrote:
>>> On 2/11/07, Jerome Glisse <[EMAIL PROTECTED]> wrote:
 On 2/10/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Getting proper SIN and COS wasn't as easy as it appeared. I had to make
> make some changes to the fragment program code.
>
> general FP changes:
> - support HHH swizzle for vector instructions.
> - don't copy a source to a temp when it is not XYZW swizzled, but
>   combine the two and have the swizzle resolve any issues.
>   (saves temps/instructions with more elaborate shader code)
> - Disable refcounting of temps.
>   The temp R0 in progs/fp/tri-cos.c is freed prematurely.
>   This should be resolved properly.
> - fix overflow in cnstv[].
>
>
> SIN & COS:
> they are based on:
> http://www.devmaster.net/forums/showthread.php?t=5784
>
> There is an fast and a slow(high precision) version of SIN & COS.
>
> For SIN:
> fast = 2 vector instructions
> slow = 5 vector instructions
>
> For COS:
> fast = 5 vector instructions + 2 scaler instructions
> slow = 8 vector instructions + 2 scaler instructions
>
> The fast version appears to do a fine enough job, at least with the
> simple test I have made.
>
>
> Rune Petersen
 Nice to tackle this :) few comment, maybe we could make an driconf
 option to switch btw fast and slow version (or a more general conf
 option to enable or disable fragprog optimization in case we come
 with more optimization like that in the future).

 For the refcounting i am wondering if i didn't bump into that in
 the past, i did use gdb to trace fragprog construction at that
 time and found some strange interaction (which lead me to
 the rework i did on fragprog).

 Anyway here from limited testing your patch seems good,
 you should commit it.

 best,
 Jerome Glisse

>>> Attached a patch to fix refcounting. Basicly whenever a temporary
>>> source was used multiple time inside an instruction that lead to
>>> multiple call to t_hw_src which is correct but as we also decrement
>>> use counter in that function we over decremented the refcount.
>>>
>>> The patch decrement refcount after instruction decoding and avoid
>>> over decrementing refcount.
>>>
>>> (The patch apply over yours)
>>>
>>> best,
>>> Jerome
>> I have found that the main reason for my problem was I forgot to use the
>> keep() on the source.
>>
>> I think your patch is too intrusive. As long as keep() is used at the
>> right places, you could move the refcount inside emit_arith() making the
>> change more contained.
>>
>> Update patch attached.
>>
>> Could I get you to commit this, since I will not be able to find the
>> time to figure GIT out any time (and lost my sig, don't ask).
>>
How I managed to compile that is beyond me.
Here is a proper patch.

Rune Petersen
diff --git a/src/mesa/drivers/dri/r300/r300_context.h b/src/mesa/drivers/dri/r300/r300_context.h
index 02f8e91..b140235 100644
--- a/src/mesa/drivers/dri/r300/r300_context.h
+++ b/src/mesa/drivers/dri/r300/r300_context.h
@@ -729,6 +729,11 @@ struct r300_fragment_program {
 	GLboolean params_uptodate;
 
 	int max_temp_idx;
+
+	/* the index of the sin constant is stored here */
+	GLint const_sin;
+	
+	GLuint optimization;
 };
 
 #define R300_MAX_AOS_ARRAYS		16
diff --git a/src/mesa/drivers/dri/r300/r300_fragprog.c b/src/mesa/drivers/dri/r300/r300_fragprog.c
index 6e85f0b..b00cf9e 100644
--- a/src/mesa/drivers/dri/r300/r300_fragprog.c
+++ b/src/mesa/drivers/dri/r300/r300_fragprog.c
@@ -33,7 +33,7 @@
 
 /*TODO'S
  *
- * - COS/SIN/SCS instructions
+ * - SCS instructions
  * - Depth write, WPOS/FOGC inputs
  * - FogOption
  * - Verify results of opcodes for accuracy, I've only checked them
@@ -187,6 +187,10 @@ static const struct {
 #define SLOT_VECTOR	(1<<0)
 #define SLOT_SCALAR	(1<<3)
 #define SLOT_BOTH	(SLOT_VECTOR | SLOT_SCALAR)
+
+/* mapping from SWIZZLE_* to r300 native values for scalar insns */
+#define SWIZZLE_HALF 6
+
 #define MAKE_SWZ3(x, y, z) (MAKE_SWIZZLE4(SWIZZLE_##x, \
 	  SWIZZLE_##y, \
 	  SWIZZLE_##z, \
@@ -208,7 +212,7 @@ static const struct r300_pfs_swizzle {
 	{ MAKE_SWZ3(W, Z, Y), R300_FPI0_ARGC_SRC0CA_WZY, 1, SLOT_BOTH },
 	{ MAKE_SWZ3(ONE, ONE, ONE), R300_FPI0_ARGC_ONE, 0, 0},
 	{ MAKE_SWZ3(ZERO, ZERO, ZERO), R300_FPI0_ARGC_ZERO, 0, 0},
-	{ PFS_INVAL, R300_FPI0_ARGC_HALF, 0, 0},
+	{ MAKE_SWZ3(HALF, HALF, HALF), R300_FPI0_ARGC_HALF, 0, 0},
 	{ PFS_INVAL, 0, 0, 0},
 };
 
@@ -232,8 +236,6 @@ static const struct {
 	{ PFS_INVAL, PFS_INVAL, PFS_INVAL}
 };
 
-/* mapping from SWIZZLE_* to r300 native values for scalar insns */
-#define SWIZZLE_HALF 6
 static const struct {
 	int base;	/* hw value of swizzle */
 	int stride;	/* difference between SRC0/1/2 */
@@ -590,6 +592,7 @@ static GLuint do_swizzle(struct r300_fragment_program *rp,
 	/* 

Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-11 Thread Rune Petersen
.

Rune Petersen wrote:
> Jerome Glisse wrote:
>> On 2/11/07, Jerome Glisse <[EMAIL PROTECTED]> wrote:
>>> On 2/10/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
 Hi,

 Getting proper SIN and COS wasn't as easy as it appeared. I had to make
 make some changes to the fragment program code.

 general FP changes:
 - support HHH swizzle for vector instructions.
 - don't copy a source to a temp when it is not XYZW swizzled, but
   combine the two and have the swizzle resolve any issues.
   (saves temps/instructions with more elaborate shader code)
 - Disable refcounting of temps.
   The temp R0 in progs/fp/tri-cos.c is freed prematurely.
   This should be resolved properly.
 - fix overflow in cnstv[].


 SIN & COS:
 they are based on:
 http://www.devmaster.net/forums/showthread.php?t=5784

 There is an fast and a slow(high precision) version of SIN & COS.

 For SIN:
 fast = 2 vector instructions
 slow = 5 vector instructions

 For COS:
 fast = 5 vector instructions + 2 scaler instructions
 slow = 8 vector instructions + 2 scaler instructions

 The fast version appears to do a fine enough job, at least with the
 simple test I have made.


 Rune Petersen
>>> Nice to tackle this :) few comment, maybe we could make an driconf
>>> option to switch btw fast and slow version (or a more general conf
>>> option to enable or disable fragprog optimization in case we come
>>> with more optimization like that in the future).
>>>
>>> For the refcounting i am wondering if i didn't bump into that in
>>> the past, i did use gdb to trace fragprog construction at that
>>> time and found some strange interaction (which lead me to
>>> the rework i did on fragprog).
>>>
>>> Anyway here from limited testing your patch seems good,
>>> you should commit it.
>>>
>>> best,
>>> Jerome Glisse
>>>
>> Attached a patch to fix refcounting. Basicly whenever a temporary
>> source was used multiple time inside an instruction that lead to
>> multiple call to t_hw_src which is correct but as we also decrement
>> use counter in that function we over decremented the refcount.
>>
>> The patch decrement refcount after instruction decoding and avoid
>> over decrementing refcount.
>>
>> (The patch apply over yours)
>>
>> best,
>> Jerome
> 
> I have found that the main reason for my problem was I forgot to use the
> keep() on the source.
> 
> I think your patch is too intrusive. As long as keep() is used at the
> right places, you could move the refcount inside emit_arith() making the
> change more contained.
> 
> Update patch attached.
> 
> Could I get you to commit this, since I will not be able to find the
> time to figure GIT out any time (and lost my sig, don't ask).
> 
> 
> Rune Petersen
> 
> 

diff --git a/src/mesa/drivers/dri/r300/r300_context.c b/src/mesa/drivers/dri/r300/r300_context.c
diff --git a/src/mesa/drivers/dri/r300/r300_context.h b/src/mesa/drivers/dri/r300/r300_context.h
index 02f8e91..b140235 100644
--- a/src/mesa/drivers/dri/r300/r300_context.h
+++ b/src/mesa/drivers/dri/r300/r300_context.h
@@ -729,6 +729,11 @@ struct r300_fragment_program {
 	GLboolean params_uptodate;
 
 	int max_temp_idx;
+
+	/* the index of the sin constant is stored here */
+	GLint const_sin;
+	
+	GLuint optimization;
 };
 
 #define R300_MAX_AOS_ARRAYS		16
diff --git a/src/mesa/drivers/dri/r300/r300_fragprog.c b/src/mesa/drivers/dri/r300/r300_fragprog.c
index 6e85f0b..cb250ca 100644
--- a/src/mesa/drivers/dri/r300/r300_fragprog.c
+++ b/src/mesa/drivers/dri/r300/r300_fragprog.c
@@ -33,7 +33,7 @@
 
 /*TODO'S
  *
- * - COS/SIN/SCS instructions
+ * - SCS instructions
  * - Depth write, WPOS/FOGC inputs
  * - FogOption
  * - Verify results of opcodes for accuracy, I've only checked them
@@ -51,6 +51,8 @@
 #include "r300_fragprog.h"
 #include "r300_reg.h"
 
+#define FAST_SIN
+
 /*
  * Usefull macros and values
  */
@@ -187,6 +189,10 @@ static const struct {
 #define SLOT_VECTOR	(1<<0)
 #define SLOT_SCALAR	(1<<3)
 #define SLOT_BOTH	(SLOT_VECTOR | SLOT_SCALAR)
+
+/* mapping from SWIZZLE_* to r300 native values for scalar insns */
+#define SWIZZLE_HALF 6
+
 #define MAKE_SWZ3(x, y, z) (MAKE_SWIZZLE4(SWIZZLE_##x, \
 	  SWIZZLE_##y, \
 	  SWIZZLE_##z, \
@@ -208,7 +214,7 @@ static const struct r300_pfs_swizzle {
 	{ MAKE_SWZ3(W, Z, Y), R300_FPI0_ARGC_SRC0CA_WZY, 1, SLOT_BOTH },
 	{ MAKE_SWZ3(ONE, ONE, ONE), R300_FPI0_ARGC_ONE, 0, 0},
 	{ MAKE_SWZ3(ZERO, ZERO, ZERO), R300_FPI0_ARGC_ZERO, 0, 0},
-	{ PFS_INVAL, R300_FPI0_ARGC_HALF, 0, 0},
+	{ MAKE_SWZ3(HALF, HALF, HALF), R300_FPI0_ARGC_HALF, 0, 0},
 	{ PFS_INVAL, 0, 0, 0},
 };
 
@@ -232,8 +238,6 @@ static const struct {
 	{ PFS_INVAL, PFS_INVAL, PFS_INVAL}
 };
 
-/* mapping from SWIZZLE_* to r300 native values for scalar insns */
-#define SWIZZLE_HALF 6
 static const struct {
 	int base;	/* hw value of swizzle */
 	int stride;	/* difference between SRC0/1/2 */
@@ -590,6 +5

Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-11 Thread Rune Petersen
Jerome Glisse wrote:
> On 2/11/07, Jerome Glisse <[EMAIL PROTECTED]> wrote:
>> On 2/10/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > Getting proper SIN and COS wasn't as easy as it appeared. I had to make
>> > make some changes to the fragment program code.
>> >
>> > general FP changes:
>> > - support HHH swizzle for vector instructions.
>> > - don't copy a source to a temp when it is not XYZW swizzled, but
>> >   combine the two and have the swizzle resolve any issues.
>> >   (saves temps/instructions with more elaborate shader code)
>> > - Disable refcounting of temps.
>> >   The temp R0 in progs/fp/tri-cos.c is freed prematurely.
>> >   This should be resolved properly.
>> > - fix overflow in cnstv[].
>> >
>> >
>> > SIN & COS:
>> > they are based on:
>> > http://www.devmaster.net/forums/showthread.php?t=5784
>> >
>> > There is an fast and a slow(high precision) version of SIN & COS.
>> >
>> > For SIN:
>> > fast = 2 vector instructions
>> > slow = 5 vector instructions
>> >
>> > For COS:
>> > fast = 5 vector instructions + 2 scaler instructions
>> > slow = 8 vector instructions + 2 scaler instructions
>> >
>> > The fast version appears to do a fine enough job, at least with the
>> > simple test I have made.
>> >
>> >
>> > Rune Petersen
>>
>> Nice to tackle this :) few comment, maybe we could make an driconf
>> option to switch btw fast and slow version (or a more general conf
>> option to enable or disable fragprog optimization in case we come
>> with more optimization like that in the future).
>>
>> For the refcounting i am wondering if i didn't bump into that in
>> the past, i did use gdb to trace fragprog construction at that
>> time and found some strange interaction (which lead me to
>> the rework i did on fragprog).
>>
>> Anyway here from limited testing your patch seems good,
>> you should commit it.
>>
>> best,
>> Jerome Glisse
>>
> 
> Attached a patch to fix refcounting. Basicly whenever a temporary
> source was used multiple time inside an instruction that lead to
> multiple call to t_hw_src which is correct but as we also decrement
> use counter in that function we over decremented the refcount.
> 
> The patch decrement refcount after instruction decoding and avoid
> over decrementing refcount.
> 
> (The patch apply over yours)
> 
> best,
> Jerome

I have found that the main reason for my problem was I forgot to use the
keep() on the source.

I think your patch is too intrusive. As long as keep() is used at the
right places, you could move the refcount inside emit_arith() making the
change more contained.

Update patch attached.

Could I get you to commit this, since I will not be able to find the
time to figure GIT out any time (and lost my sig, don't ask).


Rune Petersen


-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-11 Thread Jerome Glisse

On 2/11/07, Jerome Glisse <[EMAIL PROTECTED]> wrote:

On 2/10/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Getting proper SIN and COS wasn't as easy as it appeared. I had to make
> make some changes to the fragment program code.
>
> general FP changes:
> - support HHH swizzle for vector instructions.
> - don't copy a source to a temp when it is not XYZW swizzled, but
>   combine the two and have the swizzle resolve any issues.
>   (saves temps/instructions with more elaborate shader code)
> - Disable refcounting of temps.
>   The temp R0 in progs/fp/tri-cos.c is freed prematurely.
>   This should be resolved properly.
> - fix overflow in cnstv[].
>
>
> SIN & COS:
> they are based on:
> http://www.devmaster.net/forums/showthread.php?t=5784
>
> There is an fast and a slow(high precision) version of SIN & COS.
>
> For SIN:
> fast = 2 vector instructions
> slow = 5 vector instructions
>
> For COS:
> fast = 5 vector instructions + 2 scaler instructions
> slow = 8 vector instructions + 2 scaler instructions
>
> The fast version appears to do a fine enough job, at least with the
> simple test I have made.
>
>
> Rune Petersen

Nice to tackle this :) few comment, maybe we could make an driconf
option to switch btw fast and slow version (or a more general conf
option to enable or disable fragprog optimization in case we come
with more optimization like that in the future).

For the refcounting i am wondering if i didn't bump into that in
the past, i did use gdb to trace fragprog construction at that
time and found some strange interaction (which lead me to
the rework i did on fragprog).

Anyway here from limited testing your patch seems good,
you should commit it.

best,
Jerome Glisse



Attached a patch to fix refcounting. Basicly whenever a temporary
source was used multiple time inside an instruction that lead to
multiple call to t_hw_src which is correct but as we also decrement
use counter in that function we over decremented the refcount.

The patch decrement refcount after instruction decoding and avoid
over decrementing refcount.

(The patch apply over yours)

best,
Jerome
--- r300_fragprog.c	2007-02-11 14:26:42.0 +0100
+++ /home/glisse/code/r300/mesa/src/mesa/drivers/dri/r300/r300_fragprog.c	2007-02-11 14:25:15.0 +0100
@@ -773,12 +773,6 @@
 			cs->temps[index].reg = get_hw_temp(rp);
 
 		idx = cs->temps[index].reg;
-
-/*
-		if (!REG_GET_NO_USE(src) &&
-		(--cs->temps[index].refcount == 0))
-			free_temp(rp, src);
-*/
 		break;
 	case REG_TYPE_INPUT:
 		idx = cs->inputs[index].reg;
@@ -819,13 +813,6 @@
 			}
 		}
 		idx = cs->temps[index].reg;
-
-/*
-		if (!REG_GET_NO_USE(dest) &&
-		(--cs->temps[index].refcount == 0))
-			free_temp(rp, dest);
-*/
-
 		cs->dest_in_node |= (1 << idx);
 		cs->used_in_node |= (1 << idx);
 		break;
@@ -1215,10 +1202,11 @@
 
 static GLboolean parse_program(struct r300_fragment_program *rp)
 {	
+	COMPILE_STATE;
 	struct gl_fragment_program *mp = &rp->mesa_program;
 	const struct prog_instruction *inst = mp->Base.Instructions;
 	struct prog_instruction *fpi;
-	GLuint src[3], dest, temp;
+	GLuint src[3], dest, temp, srccount;
 	GLuint cnst;
 	int flags, mask = 0;
 	GLfloat cnstv[4] = {0.0, 0.0, 0.0, 0.0};
@@ -1228,6 +1216,7 @@
 		return GL_FALSE;
 	}
 
+
 	for (fpi=mp->Base.Instructions; fpi->Opcode != OPCODE_END; fpi++) {
 		if (fpi->SaturateMode == SATURATE_ZERO_ONE)
 			flags = PFS_FLAG_SAT;
@@ -1238,14 +1227,17 @@
 			mask = fpi->DstReg.WriteMask;
 		}
 
+		srccount = 0;
 		switch (fpi->Opcode) {
 		case OPCODE_ABS:
+			srccount = 1;
 			src[0] = t_src(rp, fpi->SrcReg[0]);
 			emit_arith(rp, PFS_OP_MAD, dest, mask,
    absolute(src[0]), pfs_one, pfs_zero,
    flags);
 			break;
 		case OPCODE_ADD:
+			srccount = 2;
 			src[0] = t_src(rp, fpi->SrcReg[0]);
 			src[1] = t_src(rp, fpi->SrcReg[1]);
 			emit_arith(rp, PFS_OP_MAD, dest, mask,
@@ -1253,6 +1245,7 @@
    flags);
 			break;
 		case OPCODE_CMP:
+			srccount = 3;
 			src[0] = t_src(rp, fpi->SrcReg[0]);
 			src[1] = t_src(rp, fpi->SrcReg[1]);
 			src[2] = t_src(rp, fpi->SrcReg[2]);
@@ -1271,6 +1264,7 @@
 			 *   x = (x < PI)?x : x-2*PI
 			 *   result = sin(x)
 			 */
+			srccount = 1;
 			temp = get_temp_reg(rp);
 			if(rp->const_sin == -1){
 			cnstv[0] = 1.273239545;
@@ -1350,6 +1344,7 @@
 			free_temp(rp, temp);
 			break;
 		case OPCODE_DP3:
+			srccount = 2;
 			src[0] = t_src(rp, fpi->SrcReg[0]);
 			src[1] = t_src(rp, fpi->SrcReg[1]);
 			emit_arith(rp, PFS_OP_DP3, dest, mask,
@@ -1357,6 +1352,7 @@
    flags);
 			break;
 		case OPCODE_DP4:
+			srccount = 2;
 			src[0] = t_src(rp, fpi->SrcReg[0]);
 			src[1] = t_src(rp, fpi->SrcReg[1]);
 			emit_arith(rp, PFS_OP_DP4, dest, mask,
@@ -1364,6 +1360,7 @@
    flags);
 			break;
 		case OPCODE_DPH:
+			srccount = 2;
 			src[0] = t_src(rp, fpi->SrcReg[0]);
 			src[1] = t_src(rp, fpi->SrcReg[1]);
 			/* src0.xyz1 -> temp
@@ -1386,6 +1383,7 @@
 #endif
 			break;
 		case OPCODE_DST:
+			srccount = 2;
 			src[0] 

Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-10 Thread Jerome Glisse
On 2/10/07, Rune Petersen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Getting proper SIN and COS wasn't as easy as it appeared. I had to make
> make some changes to the fragment program code.
>
> general FP changes:
> - support HHH swizzle for vector instructions.
> - don't copy a source to a temp when it is not XYZW swizzled, but
>   combine the two and have the swizzle resolve any issues.
>   (saves temps/instructions with more elaborate shader code)
> - Disable refcounting of temps.
>   The temp R0 in progs/fp/tri-cos.c is freed prematurely.
>   This should be resolved properly.
> - fix overflow in cnstv[].
>
>
> SIN & COS:
> they are based on:
> http://www.devmaster.net/forums/showthread.php?t=5784
>
> There is an fast and a slow(high precision) version of SIN & COS.
>
> For SIN:
> fast = 2 vector instructions
> slow = 5 vector instructions
>
> For COS:
> fast = 5 vector instructions + 2 scaler instructions
> slow = 8 vector instructions + 2 scaler instructions
>
> The fast version appears to do a fine enough job, at least with the
> simple test I have made.
>
>
> Rune Petersen

Nice to tackle this :) few comment, maybe we could make an driconf
option to switch btw fast and slow version (or a more general conf
option to enable or disable fragprog optimization in case we come
with more optimization like that in the future).

For the refcounting i am wondering if i didn't bump into that in
the past, i did use gdb to trace fragprog construction at that
time and found some strange interaction (which lead me to
the rework i did on fragprog).

Anyway here from limited testing your patch seems good,
you should commit it.

best,
Jerome Glisse

-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: [R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-10 Thread Rune Petersen
note to self: don't post this late...

Rune Petersen wrote:
> Hi,
> 
> Getting proper SIN and COS wasn't as easy as it appeared. I had to make
> make some changes to the fragment program code.
> 
> general FP changes:
> - support HHH swizzle for vector instructions.
> - don't copy a source to a temp when it is not XYZW swizzled, but
>   combine the two and have the swizzle resolve any issues.
>   (saves temps/instructions with more elaborate shader code)
> - Disable refcounting of temps.
>   The temp R0 in progs/fp/tri-cos.c is freed prematurely.
>   This should be resolved properly.
> - fix overflow in cnstv[].
> 
> 
> SIN & COS:
> they are based on:
>   http://www.devmaster.net/forums/showthread.php?t=5784
> 
> There is an fast and a slow(high precision) version of SIN & COS.
> 
> For SIN:
> fast = 2 vector instructions
> slow = 5 vector instructions
> 
> For COS:
> fast = 5 vector instructions + 2 scaler instructions
> slow = 8 vector instructions + 2 scaler instructions
> 
> The fast version appears to do a fine enough job, at least with the
> simple test I have made.
> 
> 
> Rune Petersen

diff --git a/src/mesa/drivers/dri/r300/r300_context.c b/src/mesa/drivers/dri/r300/r300_context.c
diff --git a/src/mesa/drivers/dri/r300/r300_context.h b/src/mesa/drivers/dri/r300/r300_context.h
index 02f8e91..fa636d8 100644
--- a/src/mesa/drivers/dri/r300/r300_context.h
+++ b/src/mesa/drivers/dri/r300/r300_context.h
@@ -729,6 +729,9 @@ struct r300_fragment_program {
 	GLboolean params_uptodate;
 
 	int max_temp_idx;
+
+	/* the index of the sin constant is stored here */
+	GLint const_sin;
 };
 
 #define R300_MAX_AOS_ARRAYS		16
diff --git a/src/mesa/drivers/dri/r300/r300_fragprog.c b/src/mesa/drivers/dri/r300/r300_fragprog.c
index 6e85f0b..eff86e9 100644
--- a/src/mesa/drivers/dri/r300/r300_fragprog.c
+++ b/src/mesa/drivers/dri/r300/r300_fragprog.c
@@ -51,6 +51,8 @@
 #include "r300_fragprog.h"
 #include "r300_reg.h"
 
+#define FAST_SIN
+
 /*
  * Usefull macros and values
  */
@@ -187,6 +189,10 @@ static const struct {
 #define SLOT_VECTOR	(1<<0)
 #define SLOT_SCALAR	(1<<3)
 #define SLOT_BOTH	(SLOT_VECTOR | SLOT_SCALAR)
+
+/* mapping from SWIZZLE_* to r300 native values for scalar insns */
+#define SWIZZLE_HALF 6
+
 #define MAKE_SWZ3(x, y, z) (MAKE_SWIZZLE4(SWIZZLE_##x, \
 	  SWIZZLE_##y, \
 	  SWIZZLE_##z, \
@@ -208,7 +214,7 @@ static const struct r300_pfs_swizzle {
 	{ MAKE_SWZ3(W, Z, Y), R300_FPI0_ARGC_SRC0CA_WZY, 1, SLOT_BOTH },
 	{ MAKE_SWZ3(ONE, ONE, ONE), R300_FPI0_ARGC_ONE, 0, 0},
 	{ MAKE_SWZ3(ZERO, ZERO, ZERO), R300_FPI0_ARGC_ZERO, 0, 0},
-	{ PFS_INVAL, R300_FPI0_ARGC_HALF, 0, 0},
+	{ MAKE_SWZ3(HALF, HALF, HALF), R300_FPI0_ARGC_HALF, 0, 0},
 	{ PFS_INVAL, 0, 0, 0},
 };
 
@@ -232,8 +238,6 @@ static const struct {
 	{ PFS_INVAL, PFS_INVAL, PFS_INVAL}
 };
 
-/* mapping from SWIZZLE_* to r300 native values for scalar insns */
-#define SWIZZLE_HALF 6
 static const struct {
 	int base;	/* hw value of swizzle */
 	int stride;	/* difference between SRC0/1/2 */
@@ -590,6 +594,7 @@ static GLuint do_swizzle(struct r300_fragment_program *rp,
 	/* If swizzling from something without an XYZW native swizzle,
 	 * emit result to a temp, and do new swizzle from the temp.
 	 */
+#if 0
 	if (REG_GET_VSWZ(src) != SWIZZLE_XYZ ||
 	REG_GET_SSWZ(src) != SWIZZLE_W) {
 		GLuint temp = get_temp_reg(rp);
@@ -603,10 +608,33 @@ static GLuint do_swizzle(struct r300_fragment_program *rp,
 			   0);
 		src = temp;
 	}
+#endif
 
 	/* set scalar swizzling */
 	REG_SET_SSWZ(src, GET_SWZ(arbswz, 3));
 
+	if (REG_GET_VSWZ(src) != SWIZZLE_XYZ ||
+	REG_GET_SSWZ(src) != SWIZZLE_W) {
+	GLuint vsrcswz = (v_swiz[REG_GET_VSWZ(src)].hash & (SWZ_X_MASK|SWZ_Y_MASK|SWZ_Z_MASK)) | REG_GET_SSWZ(src) << 9;
+	GLint i;
+
+	GLuint newswz = 0;
+	GLuint offset;
+	for(i=0; i < 4; ++i){
+		offset = GET_SWZ(arbswz, i);
+		
+		newswz |= (offset <= 3)?GET_SWZ(vsrcswz, offset) << i*3:offset << i*3;
+	}
+
+	arbswz = newswz & (SWZ_X_MASK|SWZ_Y_MASK|SWZ_Z_MASK);
+	REG_SET_SSWZ(src, GET_SWZ(newswz, 3));
+	}
+	else
+	{
+	/* set scalar swizzling */
+	REG_SET_SSWZ(src, GET_SWZ(arbswz, 3));
+
+	}
 	do {
 		vswz = REG_GET_VSWZ(src);
 		do {
@@ -746,9 +774,11 @@ static int t_hw_src(struct r300_fragment_program *rp,
 
 		idx = cs->temps[index].reg;
 
+/*
 		if (!REG_GET_NO_USE(src) &&
 		(--cs->temps[index].refcount == 0))
 			free_temp(rp, src);
+*/
 		break;
 	case REG_TYPE_INPUT:
 		idx = cs->inputs[index].reg;
@@ -790,9 +820,11 @@ static int t_hw_dst(struct r300_fragment_program *rp,
 		}
 		idx = cs->temps[index].reg;
 
+/*
 		if (!REG_GET_NO_USE(dest) &&
 		(--cs->temps[index].refcount == 0))
 			free_temp(rp, dest);
+*/
 
 		cs->dest_in_node |= (1 << idx);
 		cs->used_in_node |= (1 << idx);
@@ -1201,7 +1233,6 @@ static GLboolean parse_program(struct r300_fragment_program *rp)
 			flags = PFS_FLAG_SAT;
 		else
 			flags = 0;
-
 		if (fpi->Opcode != OPCODE_KIL) {
 			dest = t_dst(rp, fpi->DstReg);
 			mask =

[R300][PATCH] Add/fix COS & SIN + FP fixes

2007-02-10 Thread Rune Petersen
Hi,

Getting proper SIN and COS wasn't as easy as it appeared. I had to make
make some changes to the fragment program code.

general FP changes:
- support HHH swizzle for vector instructions.
- don't copy a source to a temp when it is not XYZW swizzled, but
  combine the two and have the swizzle resolve any issues.
  (saves temps/instructions with more elaborate shader code)
- Disable refcounting of temps.
  The temp R0 in progs/fp/tri-cos.c is freed prematurely.
  This should be resolved properly.
- fix overflow in cnstv[].


SIN & COS:
they are based on:
http://www.devmaster.net/forums/showthread.php?t=5784

There is an fast and a slow(high precision) version of SIN & COS.

For SIN:
fast = 2 vector instructions
slow = 5 vector instructions

For COS:
fast = 5 vector instructions + 2 scaler instructions
slow = 8 vector instructions + 2 scaler instructions

The fast version appears to do a fine enough job, at least with the
simple test I have made.


Rune Petersen

-
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier.
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel