On Mon, 23 Dec 2002, Rodolphe Ortalo wrote:
>  0) What should a KGI-accel-oriented developer do next? :-)

I'd like to see the get/put issues worked out personally, and of
course, whatever can be done to bring the accelerator command queues
up to max speed (by using optimal data transfer modes) deserves attention.

Also, do remember that we are supposed to be providing a *secure* graphics
system, so putting some thought into how to prevent an application
from using the accelerator or graphics registers in a hostile fashion
is important, rather than reworking the whole design later because
it wasn't up to that task.  I don't know what your code's status is on
these two points.

>  1) Are {Get,Put} functions required for an accel sublib? Currently they
> are not implemented and I am reluctant to implement them the straight
> way - though I may try it for experiments. (If the userspace buffer is not
> in DMA-capable memory, or not big enough, I guess I may end up locking the
> machine.)

They are not required, but IMO we should be testing out ways to
do this.

I have been looking at the Radeon code in this respect as well, and 
perhaps a description of what I'd like to do there would apply to MGA
as well... unfortunately Filip isn't around to comment on this until 
after the holidays.

I'd like to see the put/get functions and the pixel functions all
use the accelerator, because it gets rid of all the issues surrounding
idling the accelerator when mixing direct FB accesses with accelerator
primitives (except of course for Directbuffer, but it's not like
Directbuffer doesn't come with caveats anyway.)

In the case of pixels, this may be slower than using the FB directly
with applications that put a lot of pixels sequentially.  However, most 
apps (other than stars and demos like that) won't be doing this, and in 
an environment where pixels are mixed with other primitives, freeing the 
CPU from busy-polling until the accelerator is idle every time it wants to put a pixel
will likely result in an overall speed gain.

In the case of put/get box it may also be slower than direct access,
as far as raw throughput is concerned, but in addition to freeing 
the CPU during the data transfer, accelerating these operations can
make the target more easily capable of implementing hardware Z and 
alpha support (see below).

There are basically 3 options for "accelerating" put/get functions.

1) Package the data in the command stream.
2) Have KGI allocate a "swatch" area in the framebuffer that is invisible
   to the user unless/until the user uses LibGAlloc.  Data for
   put* primitives is copied to VRAM as a solid chunk, and then the
   accelerator does the clipping and stride conversion as it rasterizes it.
   /* note to Christoph -- this does not mean that display-kgi would
    * require libGAlloc, rather that a special "swatch" resource would
    * automagically appear in the reslist like the base frame resource does.
    */
3) As with 2), but the "swatch" is in a non-swappable RAM buffer which is 
   allocated by KGI.

While researching an implementation of LibBuf for Radeon, it became apparent from the 
ATI docs that only 3D primitives can use the Z buffer, and 3D primitives 
can not take "inline" data from the command queue, so I have not thought
too much about 1).

With 2) and 3) the accelerator only needs to be idled when the "swatch"
fills up.  The swatch can be treated as a simple ring-buffer
and is thus really easy to manage.  The only intricacy is breaking up 
putbox requests into chunks when they are too big for the remaining
free "swatch" to contain.  (getbox requires idling the accelerator
anyway and so can be done the "old" way.)
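
Roughly, the ring-buffer bookkeeping I have in mind looks like the sketch
below.  All the names, and the accel_idle() helper, are made up for
illustration -- this is not actual KGI code.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    extern void accel_idle(void);        /* hypothetical: wait for the GPU */

    struct swatch {
        uint8_t *base;   /* start of the swatch (VRAM or locked RAM) */
        size_t   size;   /* total swatch size in bytes */
        size_t   head;   /* next free byte in the ring */
    };

    /* Stage one chunk of user pixels into the swatch.  The accelerator
     * only has to be idled when the ring wraps around. */
    static size_t swatch_stage(struct swatch *sw, const uint8_t *src, size_t len)
    {
        size_t room = sw->size - sw->head;

        if (room == 0) {
            accel_idle();
            sw->head = 0;
            room = sw->size;
        }
        if (len > room)
            len = room;                  /* caller loops to stage the rest */

        memcpy(sw->base + sw->head, src, len);
        sw->head += len;
        return len;                      /* bytes actually staged */
    }

A putbox then becomes: stage as many rows as fit, emit one accelerated
blit from the swatch to the (clipped) destination, and repeat until the
whole box has gone through.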

2) is easy to implement without any modifications to the KGI driver
(at least in the Radeon's case.)  The overhead of the extra copy is 
absorbed entirely by the GPU and the graphics memory bus.  However, the 
CPU still spins on the slow(er than main memory) graphics bus while 
initially transferring the data.

3) requires that KGI know how to allocate and mmap nonswappable RAM buffers.
I've done this before, and though the Linux kernel has been in flux in the
past on this matter, what I've gleaned from changelogs is that the
developers have finally been convinced that yes, it *is* important for
some hardware to be able to allocate and mmap nonswappable RAM, so if anything
this should be getting easier.  (Note the RAM should also be MTRR'd
as non-cached if possible.)
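
For reference, on a reasonably current kernel the allocate-and-mmap side
boils down to something like the sketch below (the exact calls have moved
around over the years -- remap_pfn_range used to be remap_page_range, for
instance).  Illustration only, not kgi code, and the swatch size is
arbitrary:

    #include <linux/errno.h>
    #include <linux/fs.h>
    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define SWATCH_ORDER 4                  /* 16 pages, picked arbitrarily */

    static struct page *swatch_pages;

    static int swatch_alloc(void)
    {
        int i;

        swatch_pages = alloc_pages(GFP_KERNEL, SWATCH_ORDER);
        if (!swatch_pages)
            return -ENOMEM;

        /* Mark the pages reserved so the VM never swaps them out. */
        for (i = 0; i < (1 << SWATCH_ORDER); i++)
            SetPageReserved(swatch_pages + i);
        return 0;
    }

    static int swatch_mmap(struct file *file, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;

        /* Map it uncached so the GPU always sees current data. */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        return remap_pfn_range(vma, vma->vm_start,
                               page_to_pfn(swatch_pages),
                               size, vma->vm_page_prot);
    }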

In 3) the CPU spends less time transferring the data as it can throw it
at full memory bus speed.  The GPU then bus-masters or AGP's it out of
the swatch.  This is faster for the CPU but burns more main memory bus
bandwidth, and so might slow down applications that are bus-hungry --
however, math-intensive applications that try to stay in cache will 
cruise along entirely untouched.

I favor 3) for reasons listed below, but 3) and 2) can coexist, at least
on Radeon, because the only difference to the command stream is the base 
address used for the swatch data.  An argument could be provided to 
display-kgi to change the size of the swatch and whether it is located
in system RAM or VRAM.
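
Syntax to be decided, but I'm picturing something along the lines of
(option names purely hypothetical):

    display-kgi:-swatchsize=262144:-swatchmem=sysram

i.e. one knob for the swatch size and one for system-RAM vs. VRAM
placement.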

Now, on to the future:

LibGAlloc has a resource called a "swatch" which is a buffer, likely
a buffer of some sort of "special" RAM, which contains pixels in the
same format as the main framebuffer.  When such a resource is attached
to a visual using LibBuf, addresses inside the "swatch" buffer can be 
used in normal LibGGI put/get operations, with the following considerations:

1) If the RAM is "special," a get/put operation using a swatch may return 
before LibGGI is done with the data in the swatch.  So, before altering or 
retrieving the data, one should perform a ggiFlush manually to idle the
accel (see the sketch after these two points).

2) There may be some alignment restrictions on what valid start
addresses are if not referencing the first pixel location in the swatch.
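
To make point 1) concrete, the usage discipline would look roughly like
this.  The LibGAlloc/LibBuf calls that attach the swatch and hand back
the pointer are omitted, and fill_pixels() is just a stand-in:

    #include <stddef.h>
    #include <stdint.h>
    #include <ggi/ggi.h>

    extern void fill_pixels(uint8_t *dst, size_t len);   /* stand-in */

    void blit_from_swatch(ggi_visual_t vis, uint8_t *swatch,
                          int x, int y, int w, int h, int bpp)
    {
        fill_pixels(swatch, (size_t)w * h * bpp);

        /* May return before the accelerator has consumed the data... */
        ggiPutBox(vis, x, y, w, h, swatch);

        /* ...so idle the accel before touching the swatch again. */
        ggiFlush(vis);
    }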

Swatch buffers are more convenient than a virtual area as a scratch
space because a) the swatch area doesn't have a stride like the main
fb does, b) the swatches can be in RAM and thus the data in them
is more efficiently manipulated and c) the swatch buffers
are not subject to ggiSetReadFrame/ggiSetWriteFrame.

Implementing 2) and 3) for the base LibGGI above would make implementing 
user swatch buffers as easy as pie -- just DMA/AGP from the user's 
swatch instead of from LibGGI's "system swatch".

In the case of the Radeon, additionally, using 3D primitives and
implementing 2) and 3) would allow me to simply skip writing a separate
rendering sublib for LibBuf/Radeon entirely.  All LibBuf would need to do
is alter some internal data in the kgi-Radeon renderer, telling it that
the registers that control alpha and Z need to be updated before the
next primitive is dispatched, plus a quick test added to the
top of each of the kgi-Radeon renderer's primitives.
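
In other words, something as small as this -- register names and the
emit_reg() helper are placeholders, not the real Radeon register set:

    #include <stdint.h>

    extern void emit_reg(uint32_t reg, uint32_t val);   /* placeholder */

    #define REG_Z_CTL      0x01   /* placeholder register offsets */
    #define REG_ALPHA_CTL  0x02

    struct radeon_render_state {
        int      zalpha_dirty;    /* LibBuf sets this when Z/alpha change */
        uint32_t z_ctl;
        uint32_t alpha_ctl;
    };

    static void prim_begin(struct radeon_render_state *st)
    {
        if (st->zalpha_dirty) {
            emit_reg(REG_Z_CTL, st->z_ctl);
            emit_reg(REG_ALPHA_CTL, st->alpha_ctl);
            st->zalpha_dirty = 0;
        }
        /* ...then dispatch the 3D primitive as before. */
    }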

>  4) Do you have a {Mystique,Millenium,G400,G550}?

I have a Mystique, as soon as I fix the poor system it's installed in.

--
Brian
