Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread José Fonseca

On 2002.04.10 19:40 José Fonseca wrote:
 
 ... since the specs give some tolerance it would be nice to run the 
 conformance tests with different settings in mmx_blend.S, especially the 
 single multiply w/o rounding ...

I've started to play with glean and I tried to check this myself, but it 
seems there is no effect in the results no matter what changes I make in 
mmx_blend.S. The command line I use to run glean is:

LD_LIBRARY_PATH=/home/jfonseca/projects/dri/mesa3d/Mesa/lib ./glean -r mesa

Am I doing something wrong here?

Also, the whole test takes a bunch of time. Which of the tests should I 
look for? Is it just blendFunc?

José Fonseca

___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel



Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Brian Paul

José Fonseca wrote:
 
 On 2002.04.10 19:40 José Fonseca wrote:
 
  ... since the specs give some tolerance it would be nice to run the
  conformance tests with different settings in mmx_blend.S, especially the
  single multiply w/o rounding ...
 
 I've started to play with glean and I tried to check this myself, but it
 seems there is no effect in the results no matter what changes I make in
 mmx_blend.S. The command line I use to run glean is:
 
 LD_LIBRARY_PATH=/home/jfonseca/projects/dri/mesa3d/Mesa/lib ./glean -r mesa
 
 Am I doing something wrong here?

Hmmm, I had put a printf in the _swrast_choose_blend_func() function to
be sure the mmx routine was being chosen, and it was.

Are you sure you're compiling Mesa with -DUSE_MMX_ASM?  There's also a
runtime check for MMX support.  When compiled with the DEBUG token
defined, Mesa will print a message to stdout to indicate if MMX, SSE
and 3DNow are being used.


 Also, the whole test takes a bunch of time. Which of the tests should I
 look for? Is it just blendFunc?

Yes, just blendFunc is sufficient.  I usually run glean like this:

glean -r res --visuals id==35 -t blendFunc

Where '35' is a typical GLX visual.  Otherwise all the visuals are tested -
which is overkill if you're focused on one particular test case.

-Brian




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread José Fonseca

On 2002.04.12 14:43 José Fonseca wrote:
 On 2002.04.10 19:40 José Fonseca wrote:
 
 ... since the specs give some tolerance it would be nice to run the 
 conformance tests with different settings in mmx_blend.S, especially the 
 single multiply w/o rounding ...
 
 I've started to play with glean and I tried to check this myself, but it 
 seems there is no effect in the results no matter what changes I make in 
 mmx_blend.S. ...
 

I've come to the conclusion that glean's requirements have been lowered quite 
a bit! It even passes with

s = (p*a + q*(255 - a)) >> 8

without a warning, though it does report the lower accuracy when comparing 
with the default implementation.

The only way I managed to make it fail was by setting a &= 0xf0, which 
heavily reduces the precision of the results!!

It seems that there is a bug in glean. I'm using the latest CVS.

I'll try to run the glperf tests now and see if I can get to the bottom of 
this issue.

José Fonseca







Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Michael

On Fri, Apr 12, 2002 at 08:33:17AM -0600, Brian Paul wrote:
 Yes, just blendFunc is sufficient.  I usually run glean like this:
 
   glean -r res --visuals id==35 -t blendFunc
 
 Where '35' is a typical GLX visual.  Otherwise all the visuals are tested -
 which is overkill if you're focused on one particular test case.

iirc, there's a bug in tblend.cpp: when it does the check, it doesn't
increment ePix and aPix, so some pixels aren't checked.

Unless I haven't got the latest version, which is possible.
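
To illustrate the kind of bug described here, the following is a minimal sketch of a comparison loop that forgets to advance its pixel pointers. It is not glean's tblend.cpp code: the function names, signatures and the tolerance parameter are assumptions made for illustration; only the pointer names ePix and aPix come from the message above.

```c
#include <stddef.h>

/* Hypothetical comparison loops; NOT glean's actual tblend.cpp code.
 * Each returns the number of pixels whose expected (ePix) and actual
 * (aPix) values differ by more than tol. */

/* Buggy: the pointers are never advanced, so pixel 0 is compared n times. */
size_t count_bad_buggy(const unsigned char *ePix, const unsigned char *aPix,
                       size_t n, int tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        int err = (int)*ePix - (int)*aPix;
        if (err < -tol || err > tol)
            ++bad;
        /* missing: ++ePix; ++aPix; */
    }
    return bad;
}

/* Fixed: both pointers step to the next pixel on every iteration. */
size_t count_bad_fixed(const unsigned char *ePix, const unsigned char *aPix,
                       size_t n, int tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        int err = (int)*ePix - (int)*aPix;
        if (err < -tol || err > tol)
            ++bad;
        ++ePix;
        ++aPix;
    }
    return bad;
}
```

With a difference hidden in the third pixel, the buggy loop reports nothing while the fixed loop reports one bad pixel, which is why such a test can pass whatever the blend code does.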
-- 
Michael.




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Allen Akin

On Fri, Apr 12, 2002 at 02:43:00PM +0100, José Fonseca wrote:
| 
| I've started to play with glean and I tried to check this myself, but it 
| seems there is no effect in the results no matter what changes I make in 
| mmx_blend.S. ...

The blend test only fails if an error is greater than one
least-significant bit in the framebuffer color channels.  If I
understood your earlier messages correctly, none of the methods you're
investigating has more than 1 LSB error; they differ mainly in rounding
behavior, which I'd expect to introduce errors on the order of 0.5 LSB.
So it's possible that glean is too lenient to tell the difference
between the methods you're testing.

|   ...The command line I use to run glean is:
| 
|   LD_LIBRARY_PATH=/home/jfonseca/projects/dri/mesa3d/Mesa/lib ./glean -r mesa
| 
| Am I doing something wrong here?
| 
| Also, the whole test takes a bunch of time. Which of the tests should I 
| look for? Is it just blendFunc?

Sure, just run the test you need, on the Visuals you need.  If there are
dependencies between tests, glean will automatically run the
prerequisite tests first.

You can select the Visual number like Brian suggested, or you can
specify a Visual filter string on the glean command line.  For casual
testing I normally use something like this:

glean -r mesa --visuals max rgb, z, s, db -t blendFunc

Allen




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Allen Akin

On Fri, Apr 12, 2002 at 05:12:33PM +0100, Michael wrote:
| 
| iirc, there's a bug in tblend.cpp, when it does the
| check, it doesn't increment ePix, aPix so some pixels aren't checked.

Yep, that's definitely a bug.  Why haven't I heard a bug report before
now? :-)

Fix is checked in.

Allen




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread José Fonseca

On 2002.04.12 18:26 Allen Akin wrote:
 On Fri, Apr 12, 2002 at 02:43:00PM +0100, José Fonseca wrote:
 |
 | I've started to play with glean and I tried to check this myself, but it
 | seems there is no effect in the results no matter what changes I make in
 | mmx_blend.S. ...
 
 The blend test only fails if an error is greater than one
 least-significant bit in the framebuffer color channels.  If I
 understood your earlier messages correctly, none of the methods you're
 investigating has more than 1 LSB error; they differ mainly in rounding
 behavior, which I'd expect to introduce errors on the order of 0.5 LSB.

Yes, in general that's true. Although doing (p*a+q*(1-a)) >> 8 can 
introduce up to 1 LSB error and, worse, it doesn't obey the rule 
255*255 = 255, as 255*255/256 = 254. I know that in Mesa's C blending code 
this special case of a=255 is always checked, but in the MMX code it 
isn't, and glean doesn't complain about that.
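
The arithmetic behind the 255*255 = 254 case can be sketched in C as follows. The function names are hypothetical, and the a == 255 shortcut is only an assumed form of the special case mentioned above, not a copy of Mesa's code.

```c
/* Blend one 8-bit channel: p = source, q = destination, a = source alpha. */

/* Divide by 256 with a shift: cheap, but 255*255 >> 8 == 254. */
unsigned blend_shift8(unsigned p, unsigned q, unsigned a)
{
    return (p * a + q * (255u - a)) >> 8;
}

/* Same shift, with a == 255 special-cased so an opaque source is copied
 * unchanged (assumed form of the special case, for illustration only). */
unsigned blend_shift8_checked(unsigned p, unsigned q, unsigned a)
{
    if (a == 255u)
        return p;
    return (p * a + q * (255u - a)) >> 8;
}
```

Note that the plain shift also darkens white blended over white at any alpha, since p*a + q*(255 - a) is 65025 whenever p = q = 255, and 65025 >> 8 is 254.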

 So it's possible that glean is too lenient to tell the difference
 between the methods you're testing.
 

So how come the Mesa blending code in s_blend.c has comments such as "This 
is pretty close, but Glean complains", "This is slower but satisfies 
Glean", and "This satisfies Glean and should be reasonably fast"...?

I can only understand these statements if Mesa was being compared to some 
reference implementation.

 ...
 
 You can select the Visual number like Brian suggested, or you can
 specify a Visual filter string on the glean command line.  For casual
 testing I normally use something like this:
 
   glean -r mesa --visuals max rgb, z, s, db -t blendFunc
 

Ok. Thanks!

 Allen
 

José Fonseca




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Allen Akin

On Fri, Apr 12, 2002 at 07:01:08PM +0100, José Fonseca wrote:
|  ... Although doing (p*a+q*(1-a)) >> 8 can 
| introduce up to 1 LSB error and, worse, it doesn't obey the rule 
| 255*255 = 255, as 255*255/256 = 254. I know that in Mesa's C blending code 
| this special case of a=255 is always checked, but in the MMX code it 
| isn't, and glean doesn't complain about that.

If the expected value is 255 and the OpenGL implementation yields 254,
that's only one LSB of error, so glean probably won't complain about it.

We could make the test more stringent, but then some reasonable
implementations (especially some hardware implementations) would fail.
Also, maintaining enough accuracy to yield results correct to 1 LSB is
already pretty challenging when color channels are deeper than 8 bits.
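
As a rough illustration of the tolerance described above, a one-LSB acceptance rule might look like this. The function name and its max_lsb parameter are hypothetical; this is not glean's actual comparison code.

```c
#include <stdlib.h>

/* Hypothetical pass/fail rule: accept a channel value if it lies within
 * max_lsb least-significant bits of the expected value. Both values are
 * already quantized to the framebuffer's channel depth. */
int channel_ok(int expected, int actual, int max_lsb)
{
    return abs(expected - actual) <= max_lsb;
}
```

With max_lsb set to 1, an expected 255 against an actual 254 is accepted, while 253 is rejected. A stricter setting of 0 would reject the 254 result as well.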

| So how come the Mesa blending code in s_blend.c has comments such as "This 
| is pretty close, but Glean complains", "This is slower but satisfies 
| Glean", and "This satisfies Glean and should be reasonably fast"...?

I don't know.  It might be related to deep color channels, or possibly
to glean tests other than the basic blending tests.  Brian might
remember.

Thanks for all your good work, by the way!

Allen




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Brian Paul

Allen Akin wrote:
 
 On Fri, Apr 12, 2002 at 07:01:08PM +0100, José Fonseca wrote:
 |  ... Although doing (p*a+q*(1-a)) >> 8 can
 | introduce up to 1 LSB error and, worse, it doesn't obey the rule
 | 255*255 = 255, as 255*255/256 = 254. I know that in Mesa's C blending code
 | this special case of a=255 is always checked, but in the MMX code it
 | isn't, and glean doesn't complain about that.
 
 If the expected value is 255 and the OpenGL implementation yields 254,
 that's only one LSB of error, so glean probably won't complain about it.
 
 We could make the test more stringent, but then some reasonable
 implementations (especially some hardware implementations) would fail.
 Also, maintaining enough accuracy to yield results correct to 1 LSB is
 already pretty challenging when color channels are deeper than 8 bits.

This brings up some interesting questions about blending that aren't
directly addressed in the OpenGL specification.

One might expect that the following identities be true for blending terms:

1.0 * 1.0 == 1.0
1.0 * x == x * 1.0 == x
0.0 * x == x * 0.0 == 0.0

So for 8-bit channels, in fixed point:

255 * 255 == 255

I can easily imagine cases in which applications would depend on these
identities being true (in blending and elsewhere).  In fact, I have a
vague recollection of someone bringing up this issue a few years ago.

I'd like to see Mesa satisfy the 255*255=255 identity.  Is it hard to
implement that in the MMX code?  If it is, we could let it go for now
and see if anyone complains.
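
A possible starting point for checking these identities on 8-bit channels is sketched below. The function names are made up for this example; mul_round255 uses the add-128, shift-and-add rounding that this thread attributes to Blinn, while mul_shift8 is the plain truncating shift.

```c
/* Multiply two 8-bit fixed-point values, where 255 represents 1.0. */

/* Truncating shift: fast, but 255 * 255 comes out as 254. */
unsigned mul_shift8(unsigned x, unsigned y)
{
    return (x * y) >> 8;
}

/* Rounded division by 255 using only adds and shifts. */
unsigned mul_round255(unsigned x, unsigned y)
{
    unsigned t = x * y + 128u;
    return (t + (t >> 8)) >> 8;
}

/* Returns 1 if mul satisfies 1.0*x == x*1.0 == x and 0.0*x == x*0.0 == 0
 * for every 8-bit x, and 0 otherwise. */
int identities_hold(unsigned (*mul)(unsigned, unsigned))
{
    for (unsigned x = 0; x <= 255u; ++x) {
        if (mul(255u, x) != x || mul(x, 255u) != x)
            return 0;
        if (mul(0u, x) != 0u || mul(x, 0u) != 0u)
            return 0;
    }
    return 1;
}
```

Calling identities_hold(mul_round255) returns 1, whereas identities_hold(mul_shift8) returns 0 because the truncating shift loses one count whenever it multiplies a non-zero value by 255.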


 | So how come the Mesa blending code in s_blend.c has comments such as "This
 | is pretty close, but Glean complains", "This is slower but satisfies
 | Glean", and "This satisfies Glean and should be reasonably fast"...?
 
 I don't know.  It might be related to deep color channels, or possibly
 to glean tests other than the basic blending tests.  Brian might
 remember.

It's been at least a year since I touched that code.  As far as I can
remember the comments are correct.  Though I don't remember if it was
an issue at 5/6/5 or 8/8/8 color depth, or both.  I don't know what
else might have changed since then to cause different results with
Glean.


 Thanks for all your good work, by the way!

Yes!

-Brian




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Gareth Hughes

Allen Akin wrote:
 
 If the expected value is 255 and the OpenGL implementation yields 254,
 that's only one LSB of error, so glean probably won't complain about it.
 
 We could make the test more stringent, but then some reasonable
 implementations (especially some hardware implementations) would fail.
 Also, maintaining enough accuracy to yield results correct to 1 LSB is
 already pretty challenging when color channels are deeper than 8 bits.

Perhaps we need a -pedantic-like command-line option to force more 
stringent tests?  It could be used when testing software implementations 
and (perhaps) newer hardware implementations, but not used on older 
cards (I seem to recall the 3dfx cards were a major source of problems 
here...).

-- Gareth




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-12 Thread Allen Akin

On Fri, Apr 12, 2002 at 01:14:30PM -0600, Brian Paul wrote:
| 
| One might expect that the following identities be true for blending terms:
| 
|   1.0 * 1.0 == 1.0
|   1.0 * x == x * 1.0 == x
|   0.0 * x == x * 0.0 == 0.0
| 
| So for 8-bit channels, in fixed point:
| 
|   255 * 255 == 255
| 
| I can easily imagine cases in which applications would depend on these
| identities being true (in blending and elsewhere).

Sounds like a good candidate for a glean test.  Anyone out there want to
give it a try?

Allen




Re: [Dri-devel] Mesa MMX blend code finished

2002-04-10 Thread Sergey V. Udaltsov

 I've finally ( hopefully) finished the rewrite of Mesa's MMX blend code.
Is it already in binary snapshots?

Cheers,

Sergey




Re: [Dri-devel] Mesa MMX blend code finished

2002-04-10 Thread José Fonseca

On 2002.04.10 09:03 Sergey V. Udaltsov wrote:
  I've finally ( hopefully) finished the rewrite of Mesa's MMX blend
 code.
 Is it already in binary snapshots?
 
 Cheers,
 
 Sergey
 

Nope. It's really a small drop in the ocean so there is no need to rush. I 
hope Brian will integrate it into the Mesa trunk soon. That way there are 
fewer places to fix any bugs.

Regards,

José Fonseca




Re: [Dri-devel] Mesa MMX blend code finished

2002-04-10 Thread Brian Paul

José Fonseca wrote:
 
 On 2002.04.10 09:03 Sergey V. Udaltsov wrote:
   I've finally ( hopefully) finished the rewrite of Mesa's MMX blend
  code.
  Is it already in binary snapshots?
 
  Cheers,
 
  Sergey
 
 
 Nope. It's really a small drop in the ocean so there is no need to rush. I
 hope Brian will integrate on the mesa trunk soon. This way there are less
 places to fix eventual bugs.

I've checked it into Mesa CVS, both the trunk and the mesa_4_0_branch
in case there's a 4.0.3 release.

-Brian




Re: [Dri-devel] Mesa MMX blend code finished

2002-04-10 Thread Brian Paul


José,

I've checked in the code after testing with Glean and the OpenGL conformance
tests.

Was I supposed to change something in the C code?  It passes the conformance
tests as-is.

Thanks for your work!

-Brian




Re: [Mesa3d-dev] Re: [Dri-devel] Mesa MMX blend code finished

2002-04-10 Thread José Fonseca

On 2002.04.10 17:42 Brian Paul wrote:
 
 José,
 
 I've checked in the code after testing with Glean and the OpenGL
 conformance tests.
 

Great.

 Was I supposed to change something in the C code?  It passes the
 conformance tests as-is.
 

I was surprised that the C code passed the conformance tests because, with 
the signed arithmetic, it doesn't give the same results as before. So I've 
made a small comparison of the several methods (test program attached):

// Nathan's method - unsigned 24bit arithmetic
// NOTE: this was the original Mesa code
t1 = p*a + q*(255 - a);
s1 = (t1 + (t1 << 8) + 256) >> 16;
 
// Nathan's method - signed 24bit arithmetic (less one multiply)
// NOTE: this is how I changed it and how it is now
t2 = (p - q)*a;
s2 = (t2 + (t2 << 8) + 256) >> 16;
s2 += q;
s2 &= 0xff;
 
// Blinn's method - unsigned 16bit arithmetic
// NOTE: is exact
t3 = p*a + q*(255-a) + 128;
s3 = (t3 + (t3 >> 8)) >> 8;
 
// Blinn's method - signed 16bit arithmetic (less one multiply)
// NOTE: is exact because the negative sign is considered
t4 = ((p - q)*a + (p > q ? 128 : -128)) & 0xffff;
s4 = (t4 + (t4 >> 8)) >> 8;
s4 += q;
s4 &= 0xff;

When one compares with the exact result

// exact result - rounded
s = (unsigned) (((double)p)*(((double)a)/255.0) + 
((double)q)*(1.0-((double)a)/255.0) + 0.5);

one gets:

1: 8164890 differences in 16777216
2: 8148697 differences in 16777216
3: 0 differences in 16777216
4: 0 differences in 16777216

So in spite of the different results between 1 and 2, 2 gives better results 
overall!!

What happens is that method 1 is aimed at matching the truncated results, 
not the rounded ones. If one compares with the truncated result

// truncated result
s = (unsigned) (((double)p)*(((double)a)/255.0) + 
((double)q)*(1.0-((double)a)/255.0));

one gets:

1: 15467 differences in 16777216
2: 31660 differences in 16777216
3: 8180357 differences in 16777216
4: 8180357 differences in 16777216

Notice that, from this point of view, method 2 is indeed worse, but this 
really doesn't matter because it is the wrong point of view.

This explains why the current C code passes the conformance tests.
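
The claim that methods 3 and 4 match the rounded result everywhere can be re-checked with a short sweep. This is a sketch, not the attached test program: the function names are hypothetical, and the reference is computed in integers as (2x + 255) / 510 rather than in floating point, which is a choice made for this example.

```c
/* Rounded reference: floor(x/255 + 1/2), computed in integers so that
 * floating-point rounding cannot affect the comparison. */
unsigned div255_round_ref(unsigned x)
{
    return (2u * x + 255u) / 510u;
}

/* Method 3: unsigned 16-bit arithmetic with roundoff. */
unsigned blend_method3(unsigned p, unsigned q, unsigned a)
{
    unsigned t = p * a + q * (255u - a) + 128u;
    return (t + (t >> 8)) >> 8;
}

/* Method 4: signed form with a single multiply, masked to 16 bits. */
unsigned blend_method4(unsigned p, unsigned q, unsigned a)
{
    int d = ((int)p - (int)q) * (int)a + (p > q ? 128 : -128);
    unsigned t = (unsigned)d & 0xffffu;   /* keep the low 16 bits only */
    unsigned s = (t + (t >> 8)) >> 8;
    return (s + q) & 0xffu;
}

/* Count the (p, q, a) triples, out of 2^24, where blend() disagrees
 * with the rounded reference. */
unsigned long count_mismatches(unsigned (*blend)(unsigned, unsigned, unsigned))
{
    unsigned long bad = 0;
    for (unsigned p = 0; p <= 255u; ++p)
        for (unsigned q = 0; q <= 255u; ++q)
            for (unsigned a = 0; a <= 255u; ++a) {
                unsigned x = p * a + q * (255u - a);
                if (blend(p, q, a) != div255_round_ref(x))
                    ++bad;
            }
    return bad;
}
```

Both count_mismatches(blend_method3) and count_mismatches(blend_method4) return 0, in line with the "0 differences" rows reported above.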

At this moment the MMX code implements method 4, which is very fast. There 
is no point in implementing method 2, despite it being a little faster than 
method 4 (because of the simpler rounding), because it would require 24bit 
arithmetic instead of 16bit, so fewer numbers could be multiplied at the 
same time.

So, contrary to what I thought, there is no need to switch to method 1. 
When I implement the double blend trick I will have to use method 4, 
again for the same reasons as above.

But since the specs give some tolerance it would be nice to run the 
conformance tests with different settings in mmx_blend.S, especially the 
single multiply w/o rounding, which would give at least a 5% improvement (it 
would be a little more because it would free some registers, allowing 
some necessary constants to be kept there).

For that it is only necessary to change

#define GMBT_ROUNDOFF   0

leaving the rest as before

#define GMBT_ALPHA_PLUS_ONE 0
#define GMBT_GEOMETRIC_SERIES   1
#define GMBT_SIGNED_ARITHMETIC  1

Using the alpha+1 method and not using the geometric series would be 
even faster, but it is already marked in the C code as rejected by glean...
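
For reference, the alpha+1 approximation mentioned above replaces the division by 255 with a multiply by (a + 1) and a shift by 8. A minimal sketch, with hypothetical function names, assuming 8-bit channels:

```c
/* x * a / 255 approximated as (x * (a + 1)) >> 8, for x and a in 0..255. */
unsigned mul_alpha_plus_one(unsigned x, unsigned a)
{
    return (x * (a + 1u)) >> 8;
}

/* Rounded reference in integer arithmetic: floor(x*a/255 + 1/2). */
unsigned mul_round_ref(unsigned x, unsigned a)
{
    return (2u * x * a + 255u) / 510u;
}

/* Number of (x, a) pairs, out of 65536, where the approximation
 * differs from the rounded reference. */
unsigned count_alpha_plus_one_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned x = 0; x <= 255u; ++x)
        for (unsigned a = 0; a <= 255u; ++a)
            if (mul_alpha_plus_one(x, a) != mul_round_ref(x, a))
                ++bad;
    return bad;
}
```

The end points come out right (0*0 = 0 and 255*255 = 255), but intermediate values can differ from the rounded result: x = 1 with a = 254 gives 0 where rounding gives 1.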

 Thanks for you work!
 
 -Brian
 

Regards,

José Fonseca


#include <stdio.h>
#include <stdlib.h>

int main()
{
	unsigned short p, q, a;
	unsigned c1 = 0, c2 = 0, c3 = 0, c4 = 0;
	
	for (p = 0; p <= 255; ++p)
	for (q = 0; q <= 255; ++q)
	for (a = 0; a <= 255; ++a)
	{
		unsigned s;
		unsigned s1, s2, s3, s4;
		unsigned t1, t2, t3, t4;

#if 1
		// exact result - rounded
		s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0) + 0.5);
#else
		// truncated result
		s = (unsigned) (((double)p)*(((double)a)/255.0) + ((double)q)*(1.0-((double)a)/255.0));
#endif

		// Nathan's method - unsigned 24bit arithmetic
		t1 = p*a + q*(255 - a);
		s1 = (t1 + (t1 << 8) + 256) >> 16;
		
		// Nathan's method - signed 24bit arithmetic
		t2 = (p - q)*a;
		s2 = (t2 + (t2 << 8) + 256) >> 16;
		s2 += q;
		s2 &= 0xff;
		
		// Blinn's method - unsigned 16bit arithmetic
		// NOTE: is exact
		t3 = p*a + q*(255-a) + 128;
		s3 = (t3 + (t3 >> 8)) >> 8;
		
		// Blinn's method - signed 16bit arithmetic
		// NOTE: is exact because the negative sign is considered
		t4 = ((p - q)*a + (p > q ? 128 : -128)) & 0xffff;
		s4 = (t4 + (t4 >> 8)) >> 8;
		s4 += q;
		s4 &= 0xff;
		
		if(s1 != s) ++c1;
		if(s2 != s) ++c2;
		if(s3 != s) ++c3;
		if(s4 != s) ++c4;
		if (s1 != s || s2 != s || s3 != s || s4 != s)
		{
//			printf("%3ux%3ux%3u:\t(%3u)\t%3u\t%3u\t%3u\t%3u\n", p, a, q, s, s1, s2, 

Re: [Dri-devel] Mesa MMX blend code finished

2002-04-10 Thread José Fonseca

On 2002.04.10 11:41 Sergey V. Udaltsov wrote:
  Nope. It's really a small drop in the ocean so there is no need to rush. I
  hope Brian will integrate on the mesa trunk soon. This way there are less
  places to fix eventual bugs.
 I see. Actually, AFAIU it would be not exactly a mach64 snapshot but
 rather a libGL snapshot (since it is about indirect rendering speedup). Am
 I right?
 

Not really. It would speed up indirect rendering, and also the mach64 driver 
when it falls back to software, which doesn't happen so often yet because 
we're not really striving for OpenGL conformance *yet*.

 Looking forward to hear some news from mach64 front.
 
 Regards,
 
 Sergey
 

José Fonseca




[Dri-devel] Mesa MMX blend code finished

2002-04-09 Thread José Fonseca

Ufh!! 8-O

I've finally (hopefully) finished the rewrite of Mesa's MMX blend code.

The code is configurable, allowing one to choose between several methods 
for the blending equation.


Here are the benchmarks that I made (on a Pentium III 700 MHz):

C code:                                  8.142382 sec
Old MMX code:                            4.363946 sec

exact (single multiply w/ rounding):     3.637088 sec  <= fastest that will satisfy glean
exact (two multiplies w/ rounding):      3.817152 sec

approx (single multiply w/o rounding):   3.476336 sec
approx (two multiplies w/o rounding):    3.629378 sec

approx (single multiply with alpha+1):   3.250325 sec


Attached are the file and the test suite I used (note that to use the 
test suite you need to remove the static qualifier and the ASSERTs from 
_mesa_mmx_blend_transparency, since this function is called directly).


I'll eventually make something very similar to this along the lines of the 
double blend trick in the C code, but it will stay on hold because I 
feel I have to dedicate myself to mach64 again.


Regards,

José Fonseca


PS: I've CC'd dri-devel because this subject was first raised at the 
DRI meeting, but I'm not sure if there is real interest since most of 
its subscribers are probably subscribed to mesa3d-dev as well... or am I wrong?


PPS: I didn't name this thread "Mesa software blending" for the obvious 
reasons... :-)


/*
 * Written by José Fonseca [EMAIL PROTECTED]
 */

#include "matypes.h"

/*
 * make the following approximation to the division (Sree)
 *
 *   rgb*a/255 ~= (rgb*(a+1)) >> 8
 *
 * which is the fastest method that satisfies the following OpenGL criteria
 *
 *   0*0 = 0 and 255*255 = 255
 *
 * note this one should be used alone
 */
#define GMBT_ALPHA_PLUS_ONE	0

/*
 * take the geometric series approximation to the division
 *
 *   t/255 = (t >> 8) + (t >> 16) + (t >> 24) ..
 *
 * in this case just the first two terms to fit in 16bit arithmetic
 *
 *   t/255 ~= (t + (t >> 8)) >> 8
 *
 * note that just by itself it doesn't satisfy the OpenGL criteria, as 255*255 = 254, 
 * so the special case a = 255 must be accounted for or roundoff must be used
 */
#define GMBT_GEOMETRIC_SERIES	1

/*
 * when using a geometric series division instead of truncating the result 
 * use roundoff in the approximation (Jim Blinn)
 *
 *   t = rgb*a + 0x80
 *
 * achieving the exact results
 */
#define GMBT_ROUNDOFF		1

/*
 * do
 *
 *   s = (p - q)*a + q
 *
 * instead of
 *
 *   s = p*a + q*(1-a)
 *
 * this eliminates a multiply at the expense of
 * complicating the roundoff but is generally worth it
 */
#define GMBT_SIGNED_ARITHMETIC	1

#if GMBT_ROUNDOFF
SEG_DATA

ALIGNDATA8
const_80:
	D_LONG 0x00800080, 0x00800080
#endif 

   SEG_TEXT

ALIGNTEXT16
GLOBL GLNAME(_mesa_mmx_blend_transparency)

/*
 * void blend_transparency( GLcontext *ctx,
 *  GLuint n, 
 *  const GLubyte mask[],
 *  GLchan rgba[][4], 
 *  CONST GLchan dest[][4] )
 * 
 * Common transparency blending mode.
 */
GLNAME( _mesa_mmx_blend_transparency ):

PUSH_L ( EBP )
MOV_L  ( ESP, EBP )
PUSH_L ( ESI )
PUSH_L ( EDI )
PUSH_L ( EBX )

MOV_L  ( REGOFF(12, EBP), ECX )		/* n */
CMP_L  ( CONST(0), ECX)
JE ( LLBL (GMBT_return) )

MOV_L  ( REGOFF(16, EBP), EBX )		/* mask */
MOV_L  ( REGOFF(20, EBP), EDI ) /* rgba */
MOV_L  ( REGOFF(24, EBP), ESI ) /* dest */

TEST_L ( CONST(4), EDI )		/* align rgba on an 8-byte boundary */
JZ ( LLBL (GMBT_align_end) )

CMP_B  ( CONST(0), REGIND(EBX) )	/* *mask == 0 */
JE ( LLBL (GMBT_align_continue) )

PXOR   ( MM0, MM0 )			/*   0x0000  |   0x0000  |   0x0000  |   0x0000  */

MOVD   ( REGIND(ESI), MM1 )		/* | | | | qa1 | qb1 | qg1 | qr1 */
MOVD   ( REGIND(EDI), MM2 )		/* | | | | pa1 | pb1 | pg1 | pr1 */

PUNPCKLBW  ( MM0, MM1 )			/*qa1|qb1|qg1|qr1*/
PUNPCKLBW  ( MM0, MM2 )			/*pa1|pb1|pg1|pr1*/

MOVQ   ( MM2, MM3 )

PUNPCKHWD  ( MM3, MM3 )			/*pa1|pa1|   |   */
PUNPCKHDQ  ( MM3, MM3 ) /*pa1|pa1|pa1|pa1*/

#if GMBT_ALPHA_PLUS_ONE
PCMPEQW( MM4, MM4 )			/*   0xffff  |   0xffff  |   0xffff  |   0xffff  */

PSUBW  ( MM4, MM3 ) /*   pa1 + 1 |   pa1 + 1 |   pa1 + 1 |   pa1 + 1 */
#endif

#if GMBT_SIGNED_ARITHMETIC
PSUBW  ( MM1, MM2 ) /* pa1 - qa1 | pb1 - qb1 | pg1 - qg1 | pr1 - qr1 */

PSLLW  ( CONST(8), MM1 )		/* q1 << 8 */

#if GMBT_ROUNDOFF
MOVQ   ( MM2, MM4 )
#endif

PMULLW ( MM3, MM2 )			/*  t1 = (p1 - q1)*pa1   */

#if GMBT_ROUNDOFF
PSRLW  ( CONST(15), MM4 )		/* q1 > p1 ? 1 : 0   */

PSLLW