Re: GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Keith Packard wrote:
| On Mon, 2008-05-19 at 17:27 -0700, Ian Romanick wrote:
|
|> Apps are using and will increasingly use the glMapBuffer path.  With the
|> information currently at hand, doing the alloc/copy/upload/free in the
|> driver might be the win.  Great.  It's way too soon to box ourselves
|> into that route.  If we're going to be stuck with an unchangeable
|> interface for another 5 years, it had better be flexible enough to
|> support more than one way to do things under the sheets.
|
| No-one is forcing anyone to do anything -- certainly gem supports mmap
| as well as pwrite, so you can do whichever you prefer. The reason pwrite
| was added was to support the SubData path directly in the kernel,
| instead of providing only a map/copy/unmap path. That way the kernel
| gets to choose how it implements both paths, and so does the
| application.

I know...that's one of the things I liked about it. :)
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMkPSX1gOwKyEAw8RAsXwAKCXMRPm/c8SYBXRM3q2PQnOKU7DJgCfRGHJ
jBqKf0km6HHDJulmCGFfqec=
=j+l5
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 17:27 -0700, Ian Romanick wrote:

> Apps are using and will increasingly use the glMapBuffer path.  With the
> information currently at hand, doing the alloc/copy/upload/free in the
> driver might be the win.  Great.  It's way too soon to box ourselves
> into that route.  If we're going to be stuck with an unchangeable
> interface for another 5 years, it had better be flexible enough to
> support more than one way to do things under the sheets.

No-one is forcing anyone to do anything -- certainly gem supports mmap
as well as pwrite, so you can do whichever you prefer. The reason pwrite
was added was to support the SubData path directly in the kernel,
instead of providing only a map/copy/unmap path. That way the kernel
gets to choose how it implements both paths, and so does the
application.

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Keith Packard wrote:
| On Mon, 2008-05-19 at 12:13 -0700, Ian Romanick wrote:
|
|> It depends on the hardware.  In the second approach the driver has no
|> opportunity to do something "smart" if the copy path isn't the fast
|> path.  Applications are being tuned more for the hardware where the copy
|> path isn't the fast path.
|
| It really only depends on the CPU and bus architecture; the GPU
| architecture is not relevant here. The cost is getting data from the CPU
| into the GPU cache coherence domain, currently that involves actually
| writing the data from the CPU over some kind of bus to physical memory.
|
|> The obvious overhead I was referring to is the extra malloc / free.
|> That's why I went on to say "So, now I have to go back and spend time
|> caching the buffer allocations and doing other things to make it fast."
|> ~ In that context, "I" is idr as an app developer. :)
|
| You'd be wrong then -- the cost of the malloc/write/copy/free is cheaper
| than the cost of map/write/unmap.

Using glMapBuffer does not necessarily mean that the driver is doing
map/write/unmap.  In fact, based on measurements I took back in 2006,
fglrx doesn't (or didn't at the time, anyway).  See section 4.3 of
http://web.cecs.pdx.edu/~idr/publications/ddc2006-opengl_immediate_mode.pdf

It means that the driver *CAN DO THAT IF IT WANTS TO.*  Some drivers are
some platforms are clearly doing that, and they're running really fast.
~ Using glBufferSubData *FORCES* the driver to do the copy and *FORCES*
the app to do extra buffer management.

Apps are using and will increasingly use the glMapBuffer path.  With the
information currently at hand, doing the alloc/copy/upload/free in the
driver might be the win.  Great.  It's way too soon to box ourselves
into that route.  If we're going to be stuck with an unchangeable
interface for another 5 years, it had better be flexible enough to
support more than one way to do things under the sheets.

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMhrlX1gOwKyEAw8RArwCAJ9fHW/TXdwpiXro6LIjk6twgaT36ACfVywo
iwIy4DMhiybnmOo1Myk4Hps=
=j+Uc
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


[Bug 10744] REGRESSION: video driver stuck after screen blank

2008-05-19 Thread bugme-daemon
http://bugzilla.kernel.org/show_bug.cgi?id=10744


[EMAIL PROTECTED] changed:

   What|Removed |Added

 CC||[EMAIL PROTECTED],
   ||[EMAIL PROTECTED]
   ||foundation.org
 Status|NEW |CLOSED
 Resolution||CODE_FIX




--- Comment #1 from [EMAIL PROTECTED]  2008-05-19 14:13 ---
fixed by commit af6061af0d9f84a4665f88186dc1ff9e4fb78330


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Daniel Stone
On Mon, May 19, 2008 at 08:52:49PM +0100, Dave Airlie wrote:
> Now I would be willing to provide a drm tuneable sorta like memory 
> overcommit that could be used on embedded systems and basically says I've 
> designed my system so I never need suspend/resume and I really udnerstand 
> what I'm doing, so don't ensure I have backing store for VRAM allocations. 
> This would never be the default.

Right.  'I don't have swap, and I don't lose memory content across deep
sleep anyway.'

Cheers,
Daniel


signature.asc
Description: Digital signature
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Thomas Hellström
Dave Airlie wrote:
>> 1) The ideal thing would be for the card contents to be quickly copied 
>> to backing-store and suspend is done.
>> However, this requires pinning as much physical pages as there is VRAM.
>>
>> 2) The other approach is to have a backing object of some sort, either a 
>> list of swap-entries or perhaps a gem object.
>> The gem object would, at the point of suspend, either be paged out or 
>> unpopulated which means (provided that the swap sub-system is up at the 
>> suspend point) there will be heavy disk-access and the operation might 
>> fail due to a shortage of either swap space or physical memory for the 
>> swap system bookkeeping.
>>
>> Just want to know what's the general opinion here. Are the VRAM card 
>> developers planning to back all VRAM objects with pinned physical pages, 
>> or are we looking at approach 2) which might fail?
>>
>> 
>
> By default for a laptop system you *have* to have swappable backing store 
> for everything in VRAM preallocated, not pinned, but available.
>   
The thing is, to guarantee success we need to pin.

The second best approach is to have swap-entries pre-allocated,
but if the swapping system fails to allocate pages for internal data 
structures att swapout time, it might still fail.
At that time, the current swapping code actually frees the swap-entry 
and tries to realloc it. I'm not sure why it's doing that, but it 
probably has a good reason.

The third best approach is to have gem objects allocated and ready. 
However, at the time of swapout, these are probably either unpopulated 
with pages, or paged out. Unpopulated is basically the same as not 
having alloced anything at all.

Last but not least, we may have systems without swap, as Jakob pointed 
out. In that case we will probably fail a lot.
> Failing suspend is not the answer, as user closes laptop lid, and sticks 
> it in bag, he doesn't expect it not to suspend because he has blender and 
> compiz running or whatever. So we have to fail on object allocation if we 
> have no backing store available. 
>   
I think unless we pin, we have no _reliable_ backing store.
However, allocating swap entries is a way to have fairly reliable 
backing store, but that is a bit rude to other kernel resources that 
might also need swap space to be able to suspend.
> Now I would be willing to provide a drm tuneable sorta like memory 
> overcommit that could be used on embedded systems and basically says I've 
> designed my system so I never need suspend/resume and I really udnerstand 
> what I'm doing, so don't ensure I have backing store for VRAM allocations. 
> This would never be the default.
>   
Note this is not a crusade against backing store :).
Although I must say I think _pinned_ backing store should not really be 
considered for VRAM.

Perhaps pre-allocating swap entries is as good as it gets, but I think 
we should have the option to defer that pre-allocation to 
prepare-for-suspend for the typical desktop user. We also need to have 
some swapping code written up and exported from the main kernel.

/Thomas




-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 12:13 -0700, Ian Romanick wrote:

> It depends on the hardware.  In the second approach the driver has no
> opportunity to do something "smart" if the copy path isn't the fast
> path.  Applications are being tuned more for the hardware where the copy
> path isn't the fast path.

It really only depends on the CPU and bus architecture; the GPU
architecture is not relevant here. The cost is getting data from the CPU
into the GPU cache coherence domain, currently that involves actually
writing the data from the CPU over some kind of bus to physical memory.

> The obvious overhead I was referring to is the extra malloc / free.
> That's why I went on to say "So, now I have to go back and spend time
> caching the buffer allocations and doing other things to make it fast."
> ~ In that context, "I" is idr as an app developer. :)

You'd be wrong then -- the cost of the malloc/write/copy/free is cheaper
than the cost of map/write/unmap.

> One problem that we have here is that none of the benchmarks currently
> being used hit any of these paths.  OpenArena, Enemy Territory (I assume
> this is the older Quake 3 engine game), and gears don't use MapBuffer at
> all.  Unfortunately, any apps that would hit these paths are so
> fill-rate bound on i965 that they're useless for measuring CPU overhead.

The only place we see significant map/write/unmap vs
malloc/write/copy/free is with batch buffers, and so far the
measurements that I've taken which appear to show a benefit haven't been
reproduced by others...

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: i915 performance, master, i915tex & gem

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 20:11 +0100, Keith Whitwell wrote:

> I'm still confused by your test setup...  Stepping back from cache
> metaphysics, why doesn't classic pin the hardware, if it's still got
> 60% cpu to burn?

glxgears under classic is definitely not pinning the hardware -- the
'intel_idle' tool shows that it's only using about 70% of the GPU. GEM
is pinning the hardware. Usually this means there's some synchronization
between the CPU and GPU causing each to wait part of the time while the
other executes. I haven't really looked at the non-gem case though; the
numbers seem similar enough to what I've seen in the past.

> I think getting reproducible results makes a lot of sense.  What
> hardware are you actually using -- ie. what is this laptop?

This is a Panasonic CF-R4.

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Dave Airlie
> 
> Spliting the cmd before they get submited is the way to go, likely we can
> ask the kernel for estimate of available memory and so userspace can stop
> building cmd stream but this isn't easy. Well anyway this would be a
> userspace problem. Anyway we still will have to fail in superioctl if
> for instance memory fragmentation get in the way.

It is easy, I did it for i915/i965 already, you just need a prepare/commit 
stage for state atoms and all buffers referenced.

We should never fail for memory fragmentation unless we are really 
screwed, for the userspace memory manager fixes I ended up destroying all 
memory objects are relaying them out to fit the buffers in. With pinned 
buffers this may not be so easy, but really we need to have pinned buffers 
on one end of the aperture and others on the other end, also ideally I'd 
like to have in-kernel support (modesetting) for moving all pinned buffers 
if needs be. This means no userspace pinning if we can avoid it.

Dave.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Dave Airlie
> 
> 1) The ideal thing would be for the card contents to be quickly copied 
> to backing-store and suspend is done.
> However, this requires pinning as much physical pages as there is VRAM.
> 
> 2) The other approach is to have a backing object of some sort, either a 
> list of swap-entries or perhaps a gem object.
> The gem object would, at the point of suspend, either be paged out or 
> unpopulated which means (provided that the swap sub-system is up at the 
> suspend point) there will be heavy disk-access and the operation might 
> fail due to a shortage of either swap space or physical memory for the 
> swap system bookkeeping.
> 
> Just want to know what's the general opinion here. Are the VRAM card 
> developers planning to back all VRAM objects with pinned physical pages, 
> or are we looking at approach 2) which might fail?
> 

By default for a laptop system you *have* to have swappable backing store 
for everything in VRAM preallocated, not pinned, but available.

Failing suspend is not the answer, as user closes laptop lid, and sticks 
it in bag, he doesn't expect it not to suspend because he has blender and 
compiz running or whatever. So we have to fail on object allocation if we 
have no backing store available. 

Now I would be willing to provide a drm tuneable sorta like memory 
overcommit that could be used on embedded systems and basically says I've 
designed my system so I never need suspend/resume and I really udnerstand 
what I'm doing, so don't ensure I have backing store for VRAM allocations. 
This would never be the default.

Dave.



-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Keith Packard wrote:
| On Mon, 2008-05-19 at 10:25 -0700, Ian Romanick wrote:
|
|>  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
|>  GLfloat *data = glMapBufferData(GL_ARRAY_BUFFER, GL_READ_WRITE);
|>  if (data == NULL) {
|>  /* fail */
|>  }
|>
|>  /* Fill in buffer data */
|>
|>  glUnmapBuffer(GL_ARRAY_BUFFER);
|>
|> Over:
|>
|>  GLfloat *data = malloc(buffer_size);
|>  if (data == NULL) {
|>  /* fail */
|>  }
|>
|>  /* Fill in buffer data */
|>
|>  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
|>  glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
|>  free(data);
|
| In terms of system performance, that 'extra copy' is not a problem
| though; the only cost is the traffic to the graphics chip, and these
| both do precisely the same amount of work. The benefit to the latter
| approach is that we get to use cache-aware copy code. The former can't
| do this as easily when updating only a portion of the data.

It depends on the hardware.  In the second approach the driver has no
opportunity to do something "smart" if the copy path isn't the fast
path.  Applications are being tuned more for the hardware where the copy
path isn't the fast path.

|> The second version obviously has extra overhead and takes a performance
|> hit.
|
| My measurements show that doing a cache-aware copy is a net performance
| win over using cache-ignorant word-at-a-time writes.

The obvious overhead I was referring to is the extra malloc / free.
That's why I went on to say "So, now I have to go back and spend time
caching the buffer allocations and doing other things to make it fast."
~ In that context, "I" is idr as an app developer. :)

One problem that we have here is that none of the benchmarks currently
being used hit any of these paths.  OpenArena, Enemy Territory (I assume
this is the older Quake 3 engine game), and gears don't use MapBuffer at
all.  Unfortunately, any apps that would hit these paths are so
fill-rate bound on i965 that they're useless for measuring CPU overhead.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMdFgX1gOwKyEAw8RAvTBAJ4vEFUkCalQuEadOdh99BFIcz4WAwCfQ8e+
omh7z+g5Ja6AABvs5zrsR4k=
=HXx2
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: i915 performance, master, i915tex & gem

2008-05-19 Thread Keith Whitwell
>
> glxgears uses 40% of the CPU in both classic and gem. Note that the gem
> version takes about 20 seconds to reach a steady state -- the gem driver
> isn't clearing the gtt actively and so glxgears gets far ahead of the
> gpu.
>
> My theory is that this shows that using cache-aware copies from a single
> static batch buffer (as gem does now) improves cache performance and
> write bandwidth.

I'm still confused by your test setup...  Stepping back from cache
metaphysics, why doesn't classic pin the hardware, if it's still got
60% cpu to burn?

I think getting reproducible results makes a lot of sense.  What
hardware are you actually using -- ie. what is this laptop?

Keith

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jerome Glisse wrote:
| On Mon, 19 May 2008 10:25:16 -0700
| Ian Romanick <[EMAIL PROTECTED]> wrote:
|
|> | Does in GL3 object must be unmapped before being use ? IIRC this
what is
|> | required in current GL 1.x and GL 2.x. If so i think i can still
use VRAM
|> | as cache ie i put their object which are barely never mapped (like a
|> | constant texture, or constant vertex table). This avoid me to think to
|> | complexe solution for cleanly handling unmappable vram.
|>
|> Be careful here.  An object must be unmapped in the context where it is
|> used for drawing.  However, buffer objects can be shared between
|> contexts.  This means that even today in OpenGL 1.5 context A can be
|> drawing with a buffer object while context B has it mapped.  Of course,
|> context A doesn't have to see the changes caused by context B until the
|> next time it binds the buffer.  This means that copying data for the map
|> will "just work."
|
| Is the result defined by GL specification ? ie does B need to access an
| old copy of the object or if A is rendering to this object can we let B
| see the ongoing rendering.
|
| In latter case this likely lead to broken rendering if there is no
| synchronization btw A & B.

The GLX spec says, basically, that the results of changes to a shared
object in context A are guaranteed to be visible to context B when
context B binds the object.  It leaves a lot of slack for changes to
show up earlier.  This is part of the reason that app developers want
NV_fence-like functionality.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMc8rX1gOwKyEAw8RAposAKCWQ8fTtIPHuvNXmj36eq+P7qeNIACfYHuB
564mXsChzRey46q8RXv15bI=
=0GZC
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: i915 performance, master, i915tex & gem

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 20:32 +0200, Thomas Hellström wrote:
> Keith Packard wrote:
> > On Mon, 2008-05-19 at 05:09 -0700, Keith Whitwell wrote:
> >
> >   
> >> I
> >> think the latter is the significant result -- none of these experiments
> >> in memory management significantly change the command stream the
> >> hardware has to operate on, so what we're varying essentially is the
> >> CPU behaviour to acheive that command stream.  And it is in CPU usage
> >> where GEM (and Keith/Eric's now-abandoned TTM driver) do significantly
> >> dissapoint.
> >> 
> >
> > Your GEM results do not match mine; perhaps we're running different
> > kernels? Anything older than 2.6.24 won't be using clflush and will
> > instead use wbinvd, a significant performance impact.  Profiling would
> > show whether this is the case.
> >
> > I did some fairly simple measurements using openarena and enemy
> > territory. Kernel version 2.6.25, CPU 1.3GHz Pentium M, 915GMS with the
> > slowest possible memory. I'm afraid I don't have a working TTM
> > environment at present; I will try to get that working so I can do more
> > complete comparisons.
> > 
> > fps realuserkernel
> > glxgears classic:   665
> > glxgears GEM:   889
> > openareana classic: 17.1 59.19   37.13   1.80
> > openarena GEM:  24.6 44.06   25.52   5.29
> > enemy territory classic: 9.0382.13  226.38  11.51   
> > enemy territory GEM:15.7212.80  121.72  40.50
> >
> >   
> Keith,
> 
> The GEM timings were done with 2.6.25, except on the i915 system texdown 
> timings which used 2.6.24.
> Indeed, Michel reported much worse GEM figures with 2.6.23.

We clearly need to find a way to generate reproducible benchmark data.

Here's what I'm running:

kernel: 

commit 4b119e21d0c66c22e8ca03df05d9de623d0eb50f
Author: Linus Torvalds <[EMAIL PROTECTED]>
Date:   Wed Apr 16 19:49:44 2008 -0700

Linux 2.6.25

(there's a patch to export shmem_file_setup on top of this)

mesa (from git://people.freedesktop.org/~keithp/mesa):

commit 8b49cc104dd556218fc769178b96f4a8a428d057
Author: Keith Packard <[EMAIL PROTECTED]>
Date:   Sat May 17 23:34:47 2008 -0700

[intel-gem] Don't calloc reloc buffers

Only a few relocations are typically used, so don't clear
the
whole thing.

drm (from git://people.freedesktop.org/~keithp/drm):

commit 6e46a3c762919af05fcc6a08542faa7d185487a1
Author: Eric Anholt <[EMAIL PROTECTED]>
Date:   Mon May 12 15:42:20 2008 -0700

[GEM] Update testcases for new API.

xf86-video-intel (from git://people.freedesktop.org/~keithp/xf86-video-intel):

commit c81050c0058e32098259b5078515807038beb7d6
Merge: 9c9a5d0... e9532f3...
Author: Keith Packard <[EMAIL PROTECTED]>
Date:   Sat May 17 23:26:14 2008 -0700

Merge commit 'origin/master' into drm-gem

> Your figures look a bit odd. Is glxgears classic CPU-bound? If not, why 
> does it give a significantly slower framerate than
> glxgears GEM?

glxgears uses 40% of the CPU in both classic and gem. Note that the gem
version takes about 20 seconds to reach a steady state -- the gem driver
isn't clearing the gtt actively and so glxgears gets far ahead of the
gpu.

My theory is that this shows that using cache-aware copies from a single
static batch buffer (as gem does now) improves cache performance and
write bandwidth.

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 20:22 +0200, Thomas Hellström wrote:

> I think the point here is when the buffer in 1) is mapped write-combined 
> which IMHO is the obvious approach,
> the caches aren't affected at all.

write-combining only wins if you manage to get writes to the same cache
line to line up appropriately. Doing significant computation between
writes to the WC region means failing to meet the necessary conditions,
so the WC writes end up trickling out slowly.

> In 2) you have two opportunities to completely fill the cache with data 
> that shouldn't need to be reused. With cache-aware copy code you can 
> reduce the impact of one of those opportunities.

The allocator should be re-using recently freed pages for other
activity, so your cache-loaded pages will not go to waste, even if all
you did was fill them with data and copy them to the graphics object.

So, it turns out the 'malloc, fill, copy, free' cycle is actually fairly
good from a cache perspective. And, the gem benchmarks bear this out
with better-than-classic bandwidth from CPU to GPU for raw vertices. We
might do better by using WC pages on the backend, rather than using
clflush, but not a lot.

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: i915 performance, master, i915tex & gem

2008-05-19 Thread Thomas Hellström
Keith Packard wrote:
> On Mon, 2008-05-19 at 05:09 -0700, Keith Whitwell wrote:
>
>   
>> I
>> think the latter is the significant result -- none of these experiments
>> in memory management significantly change the command stream the
>> hardware has to operate on, so what we're varying essentially is the
>> CPU behaviour to acheive that command stream.  And it is in CPU usage
>> where GEM (and Keith/Eric's now-abandoned TTM driver) do significantly
>> dissapoint.
>> 
>
> Your GEM results do not match mine; perhaps we're running different
> kernels? Anything older than 2.6.24 won't be using clflush and will
> instead use wbinvd, a significant performance impact.  Profiling would
> show whether this is the case.
>
> I did some fairly simple measurements using openarena and enemy
> territory. Kernel version 2.6.25, CPU 1.3GHz Pentium M, 915GMS with the
> slowest possible memory. I'm afraid I don't have a working TTM
> environment at present; I will try to get that working so I can do more
> complete comparisons.
>   
>   fps realuserkernel
> glxgears classic: 665
> glxgears GEM: 889
> openareana classic:   17.1 59.19   37.13   1.80
> openarena GEM:24.6 44.06   25.52   5.29
> enemy territory classic:   9.0382.13  226.38  11.51   
> enemy territory GEM:  15.7212.80  121.72  40.50
>
>   
Keith,

The GEM timings were done with 2.6.25, except on the i915 system texdown 
timings which used 2.6.24.
Indeed, Michel reported much worse GEM figures with 2.6.23.

Your figures look a bit odd. Is glxgears classic CPU-bound? If not, why 
does it give a significantly slower framerate than
glxgears GEM?

The other apps are obviously GPU bound judging from the timings. They 
shouldn't really differ in frame-rate?

/Thomas





-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Thomas Hellström
Keith Packard wrote:
> On Mon, 2008-05-19 at 10:25 -0700, Ian Romanick wrote:
>
>   
>>  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
>>  GLfloat *data = glMapBufferData(GL_ARRAY_BUFFER, GL_READ_WRITE);
>>  if (data == NULL) {
>>  /* fail */
>>  }
>>
>>  /* Fill in buffer data */
>>
>>  glUnmapBuffer(GL_ARRAY_BUFFER);
>>
>> Over:
>>
>>  GLfloat *data = malloc(buffer_size);
>>  if (data == NULL) {
>>  /* fail */
>>  }
>>
>>  /* Fill in buffer data */
>>
>>  glBindBuffer(GL_ARRAY_BUFFER, my_buf);
>>  glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
>>  free(data);
>> 
>
> In terms of system performance, that 'extra copy' is not a problem
> though; the only cost is the traffic to the graphics chip, and these
> both do precisely the same amount of work. The benefit to the latter
> approach is that we get to use cache-aware copy code. The former can't
> do this as easily when updating only a portion of the data.
>
>   
>> The second version obviously has extra overhead and takes a performance
>> hit. 
>> 
>
> My measurements show that doing a cache-aware copy is a net performance
> win over using cache-ignorant word-at-a-time writes.
>   
I think the point here is when the buffer in 1) is mapped write-combined 
which IMHO is the obvious approach,
the caches aren't affected at all.

In 2) you have two opportunities to completely fill the cache with data 
that shouldn't need to be reused. With cache-aware copy code you can 
reduce the impact of one of those opportunities.

/Thomas

>   
> 
>
> -
> This SF.net email is sponsored by: Microsoft 
> Defy all challenges. Microsoft(R) Visual Studio 2008. 
> http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
> 
>
> --
> ___
> Dri-devel mailing list
> Dri-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dri-devel
>   




-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 10:25 -0700, Ian Romanick wrote:

>   glBindBuffer(GL_ARRAY_BUFFER, my_buf);
>   GLfloat *data = glMapBufferData(GL_ARRAY_BUFFER, GL_READ_WRITE);
>   if (data == NULL) {
>   /* fail */
>   }
> 
>   /* Fill in buffer data */
> 
>   glUnmapBuffer(GL_ARRAY_BUFFER);
> 
> Over:
> 
>   GLfloat *data = malloc(buffer_size);
>   if (data == NULL) {
>   /* fail */
>   }
> 
>   /* Fill in buffer data */
> 
>   glBindBuffer(GL_ARRAY_BUFFER, my_buf);
>   glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
>   free(data);

In terms of system performance, that 'extra copy' is not a problem
though; the only cost is the traffic to the graphics chip, and these
both do precisely the same amount of work. The benefit to the latter
approach is that we get to use cache-aware copy code. The former can't
do this as easily when updating only a portion of the data.

> The second version obviously has extra overhead and takes a performance
> hit. 

My measurements show that doing a cache-aware copy is a net performance
win over using cache-ignorant word-at-a-time writes.

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Thomas Hellström
Ian Romanick wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> Keith Whitwell wrote:
> |> Ian Romanick wrote:
> |>
> |> | I've read the GEM documentation several times, and I think I have a
> good
> |> | grasp of it.  I don't have any non-trivial complaints about GEM, but I
> |> | do have a couple comments / observations:
> |> |
> |> | - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
> |> | needs to be handled.  I know of at least one piece of hardware with a
> |> | kooky command buffer that wants to be used that way.
> |> |
> |> | - I suspect that in the (near) future we may want multiple
> read_domains.
> |> | ~ I can envision cases where applications using, for example, vertex
> |> | feedback mode would want to read from a buffer while the GPU is also
> |> | reading from the buffer.
> |> |
> |> | - I think drm_i915_gem_relocation_entry should have a "size" field.
> |> | There are a lot of cases in the current GL API (and more to come) where
> |> | the entire object will trivially not be used.  Clamped LOD on textures
> |> | is a trivial example, but others exist as well.
> |>
> |> Another question occurred to me.  What happens on over-commit?  Meaning,
> |> in order to draw 1 polygon more memory must be accessible to the GPU
> |> than exists.  This was a problem that I never solved in my 2004
> |> proposal.  At the time on R200 it was possible to have 6 maximum size
> |> textures active which would require more than the possible on-card + AGP
> |> memory.
> |
> | I don't actually think the problem is solvable for buffer-based memory
> managers -- the best we can do is spot the failure and recover, either
> early as the commands are submitted by the API, or some point later, and
> for some meaning of 'recover' (eg - fail cleanly, fallback,
> use-smaller-mipmaps, disable texturing, etc).
>
> For OpenGL, the only valid choice there is to fallback to software.
> That was my question, actually.  At what point does GEM detect the
> over-commit and how is it communicated back to user mode?
>
>
>   
Actually I think neither GEM nor TTM deals with this problem. GEM leaves 
everything up to the driver writer.
TTM utility routines returns an -ENOMEM to the device driver, in which 
case the device driver should evict everything possible to get rid of 
fragmentation and retry.

At that point we can't even fall back to software, so user-space really 
needs to be able to assume a certain always-working space for texture.


/Thomas

> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.7 (GNU/Linux)
>
> iD8DBQFIMbhsX1gOwKyEAw8RApmPAKCIjLLH/RRHgCWOlwxCNj8Cug4NfQCbBeZQ
> vEjqToEq75bAS/BYNh02OBs=
> =UhzR
> -END PGP SIGNATURE-
>
> -
> This SF.net email is sponsored by: Microsoft 
> Defy all challenges. Microsoft(R) Visual Studio 2008. 
> http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
> --
> ___
> Dri-devel mailing list
> Dri-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dri-devel
>   




-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: i915 performance, master, i915tex & gem

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 05:09 -0700, Keith Whitwell wrote:

> I
> think the latter is the significant result -- none of these experiments
> in memory management significantly change the command stream the
> hardware has to operate on, so what we're varying essentially is the
> CPU behaviour to acheive that command stream.  And it is in CPU usage
> where GEM (and Keith/Eric's now-abandoned TTM driver) do significantly
> dissapoint.

Your GEM results do not match mine; perhaps we're running different
kernels? Anything older than 2.6.24 won't be using clflush and will
instead use wbinvd, a significant performance impact.  Profiling would
show whether this is the case.

I did some fairly simple measurements using openarena and enemy
territory. Kernel version 2.6.25, CPU 1.3GHz Pentium M, 915GMS with the
slowest possible memory. I'm afraid I don't have a working TTM
environment at present; I will try to get that working so I can do more
complete comparisons.

fps realuserkernel
glxgears classic:   665
glxgears GEM:   889
openareana classic: 17.1 59.19   37.13   1.80
openarena GEM:  24.6 44.06   25.52   5.29
enemy territory classic: 9.0382.13  226.38  11.51   
enemy territory GEM:15.7212.80  121.72  40.50

> Or to put it another way, GEM & master/TTM seem to burn huge
> amounts
> of CPU just running the memory manager.

I'm not seeing that in these demos; actual allocation is costing about
3% of the CPU time. Of course, for this hardware, the obvious solution
of re-using batch buffers would eliminate that cost entirely.

It would be nice to see the kernel time reduced further, but it's not
terrible so far.

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: REGRESSION: video driver stuck after screen blank

2008-05-19 Thread Jesse Barnes
On Friday, May 16, 2008 2:26 pm Stephen Hemminger wrote:
> After the screensaver does it's idle shut off of the monitor, it never
> comes back. This is a new problem and only shows up post 2.6.25.
>
> Console log messages:
> Note: this message should be WARN_ON_ONCE() since it fills the disk!
>
> May 16 14:12:16 extreme kernel: [17192412.800268] trying to get vblank
> count for disabled pipe 0 May 16 14:12:16 extreme kernel: [17192412.800268]
> trying to get vblank count for disabled pipe 0 May 16 14:12:16 extreme
> kernel: [17192412.822388] trying to get vblank count for disabled pipe 0
> May 16 14:12:16 extreme kernel: [17192412.838770] trying to get vblank
> count for disabled pipe 0 May 16 14:12:16 extreme kernel: [17192412.858769]
> trying to get vblank count for disabled pipe 0 May 16 14:12:16 extreme
> kernel: [17192412.878769] trying to get vblank count for disabled pipe 0
> May 16 14:12:16 extreme kernel: [17192412.898769] trying to get vblank
> count for disabled pipe 0 May 16 14:12:16 extreme kernel: [17192412.918769]
> trying to get vblank count for disabled pipe 0 May 16 14:12:16 extreme
> kernel: [17192412.938769] trying to get vblank count for disabled pipe 0
> May 16 14:12:16 extreme kernel: [17192412.962768] trying to get vblank
> count for disabled pipe 0 May 16 14:12:16 extreme kernel: [17192412.980269]
> trying to get vblank count for disabled pipe 0 May 16 14:12:16 extreme
> kernel: [17192412.998769] trying to get vblank count for disabled pipe 0
> May 16 14:12:16 extreme kernel: [17192413.018769] trying to get vblank
> count for disabled pipe 0 May 16 14:12:16 extreme kernel: [17192413.038769]
> trying to get vblank count for disabled pipe 0 May 16 14:12:16 extreme
> kernel: [17192413.058769] trying to get vblank count for disabled pipe 0

Dave, I thought you put together a tree to revert this stuff...  I guess Linus 
hasn't pulled yet?  Did he miss the pull request?

Jesse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 10:25:16 -0700
Ian Romanick <[EMAIL PROTECTED]> wrote:

> 
> | Does in GL3 object must be unmapped before being use ? IIRC this what is
> | required in current GL 1.x and GL 2.x. If so i think i can still use VRAM
> | as cache ie i put their object which are barely never mapped (like a
> | constant texture, or constant vertex table). This avoid me to think to
> | complexe solution for cleanly handling unmappable vram.
> 
> Be careful here.  An object must be unmapped in the context where it is
> used for drawing.  However, buffer objects can be shared between
> contexts.  This means that even today in OpenGL 1.5 context A can be
> drawing with a buffer object while context B has it mapped.  Of course,
> context A doesn't have to see the changes caused by context B until the
> next time it binds the buffer.  This means that copying data for the map
> will "just work."
> 

Is the result defined by GL specification ? ie does B need to access an
old copy of the object or if A is rendering to this object can we let B
see the ongoing rendering.

In latter case this likely lead to broken rendering if there is no
synchronization btw A & B.

Cheers,
Jerome Glisse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Dave Airlie wrote:

| Nothing can solve Ians
| problems where the app gives you a single working set that is too
large at
| least with current GL.

Eh?  It's called fallback to software.  It's the only thing the GL spec
allows you to do.  We've been doing it for years, and we had damn well
better be able to keep doing it.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMbn3X1gOwKyEAw8RAkZnAJ0fLkuM5hK5S0J59b2KIwd7WL8vqgCfawxu
zoJwg3K8rbuB9X5oVo05650=
=I0eF
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Keith Whitwell wrote:
|> Ian Romanick wrote:
|>
|> | I've read the GEM documentation several times, and I think I have a
good
|> | grasp of it.  I don't have any non-trivial complaints about GEM, but I
|> | do have a couple comments / observations:
|> |
|> | - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
|> | needs to be handled.  I know of at least one piece of hardware with a
|> | kooky command buffer that wants to be used that way.
|> |
|> | - I suspect that in the (near) future we may want multiple
read_domains.
|> | ~ I can envision cases where applications using, for example, vertex
|> | feedback mode would want to read from a buffer while the GPU is also
|> | reading from the buffer.
|> |
|> | - I think drm_i915_gem_relocation_entry should have a "size" field.
|> | There are a lot of cases in the current GL API (and more to come) where
|> | the entire object will trivially not be used.  Clamped LOD on textures
|> | is a trivial example, but others exist as well.
|>
|> Another question occurred to me.  What happens on over-commit?  Meaning,
|> in order to draw 1 polygon more memory must be accessible to the GPU
|> than exists.  This was a problem that I never solved in my 2004
|> proposal.  At the time on R200 it was possible to have 6 maximum size
|> textures active which would require more than the possible on-card + AGP
|> memory.
|
| I don't actually think the problem is solvable for buffer-based memory
managers -- the best we can do is spot the failure and recover, either
early as the commands are submitted by the API, or some point later, and
for some meaning of 'recover' (eg - fail cleanly, fallback,
use-smaller-mipmaps, disable texturing, etc).

For OpenGL, the only valid choice there is to fallback to software.
That was my question, actually.  At what point does GEM detect the
over-commit and how is it communicated back to user mode?


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMbhsX1gOwKyEAw8RApmPAKCIjLLH/RRHgCWOlwxCNj8Cug4NfQCbBeZQ
vEjqToEq75bAS/BYNh02OBs=
=UhzR
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Jerome Glisse wrote:

| Thanks Ian to stress current and future usage, i was really hopping that
| with GL3 buffer object mapping would have vanish but i guess as you said
| that the fire-and-forget path never get optimized.

I think various drivers have tried to optimize it.  I think it's just a
case where an application managed suballocator will just always be faster.

| Does in GL3 object must be unmapped before being use ? IIRC this what is
| required in current GL 1.x and GL 2.x. If so i think i can still use VRAM
| as cache ie i put their object which are barely never mapped (like a
| constant texture, or constant vertex table). This avoid me to think to
| complexe solution for cleanly handling unmappable vram.

Be careful here.  An object must be unmapped in the context where it is
used for drawing.  However, buffer objects can be shared between
contexts.  This means that even today in OpenGL 1.5 context A can be
drawing with a buffer object while context B has it mapped.  Of course,
context A doesn't have to see the changes caused by context B until the
next time it binds the buffer.  This means that copying data for the map
will "just work."

But to actually answer the original question, a buffer that will be used
as a source or destination by the GL must be unmapped at Begin time.

| A side question is there any data on tlb flush ? ie how much map/unmap,
| from client vma, cycle cost.
|
| In the meantime i think we can promote use of pread/pwrite or
BufferSubData
| to take advantage of offset & size information in software we do
(mesa, EXA,
| ...).
|
| Ian do you know why dev hate BufferSubData ? Is there any place i can read
| about it ? I have been focusing on driver dev and i am little bit out
dated
| on today typical GL usage, i assumed that hw manufacturer did promote
use of
| BufferSubData to software dev.

Because it forces them to make extra copies of their data and do extra
copy operations.  As an app developer, I *much* prefer:

glBindBuffer(GL_ARRAY_BUFFER, my_buf);
GLfloat *data = glMapBufferData(GL_ARRAY_BUFFER, GL_READ_WRITE);
if (data == NULL) {
/* fail */
}

/* Fill in buffer data */

glUnmapBuffer(GL_ARRAY_BUFFER);

Over:

GLfloat *data = malloc(buffer_size);
if (data == NULL) {
/* fail */
}

/* Fill in buffer data */

glBindBuffer(GL_ARRAY_BUFFER, my_buf);
glBufferSubData(GL_ARRAY_BUFFER, 0, buffer_size, data);
free(data);

The second version obviously has extra overhead and takes a performance
hit.  So, now I have to go back and spend time caching the buffer
allocations and doing other things to make it fast.  In the MapBuffer
version, I can leverage the work done by the smart guys that write drivers.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMbf5X1gOwKyEAw8RAvvRAJ4gEjByIWndSs4NWmVFTAOgBQHqAgCaA3pK
2ShJGYatMlCxHR57CSYbuTk=
=FOBu
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 18:55:46 +0200
Thomas Hellström <[EMAIL PROTECTED]> wrote:

> Yes this is a way to do the actual implementation.
> But we will always have situations where writing to swap may fail. 
> Systems without swap, systems low on swap, and systems without enough 
> physical memory to manage the swap area.
> 
> We can try to reserve all the swap we need for graphics and be rude to 
> other system resources, but I think the best approach is simply to 
> allocate what we need at suspend - prepare time. If we're out of 
> swap-space we need to cancel suspend.
> 

I think there is a lack in kernel interface we should have a way
to tell the kernel okay we will need 1Go (of whatever storage and
we can use temporary small buffer to evict memory to this storage)
to suspend then it's up to the kernel to either find this Go or to
fail and abord suspend. We likely are the only device writter
needing so much memory so obviously we should talk to kernel
side as they should have ask us what is needed for proper video
suspend.

Cheers,
Jerome Glisse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Thomas Hellström
Jerome Glisse wrote:
> On Mon, 19 May 2008 16:25:13 +0200
> "Jakob Bornecrantz" <[EMAIL PROTECTED]> wrote:
>
>   
>> On Mon, May 19, 2008 at 3:13 PM, Jerome Glisse <[EMAIL PROTECTED]> wrote:
>> 
>>> On Mon, 19 May 2008 15:03:50 +0200
>>> Thomas Hellström <[EMAIL PROTECTED]> wrote:
>>>
>>>   
 Hi!

 Parallell to the memory manager discussion, I think we need to revisit
 the case of what happens when a
 VRAM driver is suspending to memory.

 1) The ideal thing would be for the card contents to be quickly copied
 to backing-store and suspend is done.
 However, this requires pinning as much physical pages as there is VRAM.

 2) The other approach is to have a backing object of some sort, either a
 list of swap-entries or perhaps a gem object.
 The gem object would, at the point of suspend, either be paged out or
 unpopulated which means (provided that the swap sub-system is up at the
 suspend point) there will be heavy disk-access and the operation might
 fail due to a shortage of either swap space or physical memory for the
 swap system bookkeeping.

 Just want to know what's the general opinion here. Are the VRAM card
 developers planning to back all VRAM objects with pinned physical pages,
 or are we looking at approach 2) which might fail?

 /Thomas
 
>>> The ideal things, from my POV, is to ask to kernel to reserve swap space
>>> each time we allocate a buffer, if we ran out of swap space or whatever
>>> backing store the kernel can propose then we should fail buffer allocation
>>> even if we still have vram available. Idea being that once we migrate out
>>> stuff we know that there won't be failure to give us space to save things.
>>> Anothers things is to add a hint from userspace which tell if we can
>>> drop buffer content, i think we can use such hint for cmd buffer or
>>> others temporary buffer we know of, won't save much but still might
>>> help to save somebandwidth.
>>>
>>> Of course this need that kernel provide us an interface for such things.
>>> If i am right shmem is already talking with kernel about swap space.
>>>
>>> Cheers,
>>> Jerome Glisse
>>>   
>> The biggest question is where we can write or read pages to swap at
>> suspend to RAM and resume from RAM under all occasions.
>>
>> If not we have no other option then to have pages as backing store if
>> we want to support suspend to ram for cards with VRAM that turn the
>> ram of at suspend to RAM.
>>
>> The problem is that no solution is good.
>>
>> Cheers Jakob.
>> 
>
> The idea i did have in mind was a small scratch area where i dma from
> vram. Then i use a kernel interface to write to swap area (could be
> disk or flash or whatever other storage we can think of). So basicly
> you loop using the scratch area until you have moved all vram.
>
> Cheers,
> Jerome Glisse
>   
Yes this is a way to do the actual implementation.
But we will always have situations where writing to swap may fail. 
Systems without swap, systems low on swap, and systems without enough 
physical memory to manage the swap area.

We can try to reserve all the swap we need for graphics and be rude to 
other system resources, but I think the best approach is simply to 
allocate what we need at suspend - prepare time. If we're out of 
swap-space we need to cancel suspend.

/Thomas







-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Jakob Bornecrantz
On Mon, May 19, 2008 at 5:03 PM, Jakob Bornecrantz <[EMAIL PROTECTED]> wrote:
> On Mon, May 19, 2008 at 4:35 PM, Keith Whitwell
> <[EMAIL PROTECTED]> wrote:
>>> The biggest question is where we can write or read pages to swap at
>>> suspend to RAM and resume from RAM under all occasions.
>>>
>>> If not we have no other option then to have pages as backing store if
>>> we want to support suspend to ram for cards with VRAM that turn the
>>> ram of at suspend to RAM.
>>
>> It should be possible.  The two-phased approach that seems to be
>> ascendent would give all the opportunity you need in the "prepare"
>> phase, while the full system is still running.
>>
>> http://lwn.net/Articles/274008/
>>
>> Keith
>
> Yes a two step program can work. At prepare we turn on backing storage
> mode for VRAM, that is all buffers in VRAM gain backing storage and
> all buffers that are moved to VRAM keep their pages. Then at suspend
> we copy everything down from VRAM and turn the card off.
>
> The problem is that after prepare things must still work as before so
> clients can still render so we can't copy pages down then. I also
> don't think we can count on swap being around at the call to suspend.
> Heck a system might not even have swap.

So the whole idea is that we have a backing store mode that is only
used between prepare and suspend to make sure that buffers in VRAM can
be saved.

Some updates, allocating 1GB plus of memory might not work at prepare,
for that matter allocating 64-256mb might not go that well over with
the kernel developers for that matter.

So the idea is that at prepare we evict all buffers that we can from
VRAM and if we feel like it AGP too. If the driver is using a
superioctl we can also temporary stop all submissions through it and
wait on the last fence to be sure we can move all buffers out. That
should only leave the cmdbuffer(s) and scanout buffer(s) left. For the
buffers that are in VRAM we now allocate backing stores, this should
not fail since we have just unpinned a lot of memory. Should this fail
we are out of luck and can not suspend.

After that we turn on backing store mode and unlock the superioctl so
that clients can use it. Since the card should still work after
prepare from what I can understand.

Cheers Jakob.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Jakob Bornecrantz
On Mon, May 19, 2008 at 4:35 PM, Keith Whitwell
<[EMAIL PROTECTED]> wrote:
>> The biggest question is where we can write or read pages to swap at
>> suspend to RAM and resume from RAM under all occasions.
>>
>> If not we have no other option then to have pages as backing store if
>> we want to support suspend to ram for cards with VRAM that turn the
>> ram of at suspend to RAM.
>
> It should be possible.  The two-phased approach that seems to be
> ascendent would give all the opportunity you need in the "prepare"
> phase, while the full system is still running.
>
> http://lwn.net/Articles/274008/
>
> Keith

Yes a two step program can work. At prepare we turn on backing storage
mode for VRAM, that is all buffers in VRAM gain backing storage and
all buffers that are moved to VRAM keep their pages. Then at suspend
we copy everything down from VRAM and turn the card off.

The problem is that after prepare things must still work as before so
clients can still render so we can't copy pages down then. I also
don't think we can count on swap being around at the call to suspend.
Heck a system might not even have swap.

Cheers Jakob.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


[Bug 15881] [i915 i965] glean case api2/fragProg1/texCombine/ vertProg1 failed with assertion failure

2008-05-19 Thread bugzilla-daemon
http://bugs.freedesktop.org/show_bug.cgi?id=15881





--- Comment #5 from Brian Paul <[EMAIL PROTECTED]>  2008-05-19 07:46:51 PST ---
Fix committed to git.  Please re-test and update.


-- 
Configure bugmail: http://bugs.freedesktop.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 16:25:13 +0200
"Jakob Bornecrantz" <[EMAIL PROTECTED]> wrote:

> On Mon, May 19, 2008 at 3:13 PM, Jerome Glisse <[EMAIL PROTECTED]> wrote:
> > On Mon, 19 May 2008 15:03:50 +0200
> > Thomas Hellström <[EMAIL PROTECTED]> wrote:
> >
> >> Hi!
> >>
> >> Parallell to the memory manager discussion, I think we need to revisit
> >> the case of what happens when a
> >> VRAM driver is suspending to memory.
> >>
> >> 1) The ideal thing would be for the card contents to be quickly copied
> >> to backing-store and suspend is done.
> >> However, this requires pinning as much physical pages as there is VRAM.
> >>
> >> 2) The other approach is to have a backing object of some sort, either a
> >> list of swap-entries or perhaps a gem object.
> >> The gem object would, at the point of suspend, either be paged out or
> >> unpopulated which means (provided that the swap sub-system is up at the
> >> suspend point) there will be heavy disk-access and the operation might
> >> fail due to a shortage of either swap space or physical memory for the
> >> swap system bookkeeping.
> >>
> >> Just want to know what's the general opinion here. Are the VRAM card
> >> developers planning to back all VRAM objects with pinned physical pages,
> >> or are we looking at approach 2) which might fail?
> >>
> >> /Thomas
> >
> > The ideal things, from my POV, is to ask to kernel to reserve swap space
> > each time we allocate a buffer, if we ran out of swap space or whatever
> > backing store the kernel can propose then we should fail buffer allocation
> > even if we still have vram available. Idea being that once we migrate out
> > stuff we know that there won't be failure to give us space to save things.
> > Anothers things is to add a hint from userspace which tell if we can
> > drop buffer content, i think we can use such hint for cmd buffer or
> > others temporary buffer we know of, won't save much but still might
> > help to save somebandwidth.
> >
> > Of course this need that kernel provide us an interface for such things.
> > If i am right shmem is already talking with kernel about swap space.
> >
> > Cheers,
> > Jerome Glisse
> 
> The biggest question is where we can write or read pages to swap at
> suspend to RAM and resume from RAM under all occasions.
> 
> If not we have no other option then to have pages as backing store if
> we want to support suspend to ram for cards with VRAM that turn the
> ram of at suspend to RAM.
> 
> The problem is that no solution is good.
> 
> Cheers Jakob.

The idea i did have in mind was a small scratch area where i dma from
vram. Then i use a kernel interface to write to swap area (could be
disk or flash or whatever other storage we can think of). So basicly
you loop using the scratch area until you have moved all vram.

Cheers,
Jerome Glisse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Jakob Bornecrantz
On Mon, May 19, 2008 at 3:13 PM, Jerome Glisse <[EMAIL PROTECTED]> wrote:
> On Mon, 19 May 2008 15:03:50 +0200
> Thomas Hellström <[EMAIL PROTECTED]> wrote:
>
>> Hi!
>>
>> Parallell to the memory manager discussion, I think we need to revisit
>> the case of what happens when a
>> VRAM driver is suspending to memory.
>>
>> 1) The ideal thing would be for the card contents to be quickly copied
>> to backing-store and suspend is done.
>> However, this requires pinning as much physical pages as there is VRAM.
>>
>> 2) The other approach is to have a backing object of some sort, either a
>> list of swap-entries or perhaps a gem object.
>> The gem object would, at the point of suspend, either be paged out or
>> unpopulated which means (provided that the swap sub-system is up at the
>> suspend point) there will be heavy disk-access and the operation might
>> fail due to a shortage of either swap space or physical memory for the
>> swap system bookkeeping.
>>
>> Just want to know what's the general opinion here. Are the VRAM card
>> developers planning to back all VRAM objects with pinned physical pages,
>> or are we looking at approach 2) which might fail?
>>
>> /Thomas
>
> The ideal things, from my POV, is to ask to kernel to reserve swap space
> each time we allocate a buffer, if we ran out of swap space or whatever
> backing store the kernel can propose then we should fail buffer allocation
> even if we still have vram available. Idea being that once we migrate out
> stuff we know that there won't be failure to give us space to save things.
> Anothers things is to add a hint from userspace which tell if we can
> drop buffer content, i think we can use such hint for cmd buffer or
> others temporary buffer we know of, won't save much but still might
> help to save somebandwidth.
>
> Of course this need that kernel provide us an interface for such things.
> If i am right shmem is already talking with kernel about swap space.
>
> Cheers,
> Jerome Glisse

The biggest question is where we can write or read pages to swap at
suspend to RAM and resume from RAM under all occasions.

If not we have no other option then to have pages as backing store if
we want to support suspend to ram for cards with VRAM that turn the
ram of at suspend to RAM.

The problem is that no solution is good.

Cheers Jakob.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: i915 performance, master, i915tex & gem

2008-05-19 Thread Thomas Hellström
Keith Whitwell wrote:
> Texdown
> 1327MB/s (i915tex)
> 551MB/s (master, ttm)
> 572MB/s (master, no-ttm)
> Texdown, subimage
> 1014MB/s (i915tex)
> 134MB/s (master, ttm)
> 148MB/s (master, no-ttm)
>   

Gem on this machine (kernel 2.6.24) is hitting
Texdown 342MB/s
Texdown, subimage 76MB/s

...
>
>- a separate regression seems to have killed texture upload performance on 
> master/ttm relative to it's ancestor i915tex.
>   

Actually I think these are mostly issues stemming from not using 
write-combined mappings and instead using write-back mappings with 
clflush and chipset flush before binding to the GTT.

Note that, from what I can tell, the i915 gem driver is still using mmap 
for these operations.
/Thomas

> Keith
>
>   




-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


[Bug 10709] 2.6.26-rc2 hosed X?

2008-05-19 Thread bugme-daemon
http://bugzilla.kernel.org/show_bug.cgi?id=10709





--- Comment #2 from [EMAIL PROTECTED]  2008-05-19 07:09 ---
I confirm that for proper operation I still need the drm-revert patch mentioned
above under patch. Fortunately the other one mentioned in Comment #1 has been
included.


-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Keith Whitwell
On Mon, May 19, 2008 at 2:06 PM, Jerome Glisse <[EMAIL PROTECTED]> wrote:
> On Mon, 19 May 2008 12:16:57 +0100 (IST)
> Dave Airlie <[EMAIL PROTECTED]> wrote:
>> >
>> > For radeon the plan was to return error from superioctl as during
>> > superioctl and validation i do know if there is enough gart/vram to do
>> > the things. Then i think it's up to upper level to properly handle such
>> > failure from superioctl
>>
>> You really want to work this out in advance, at superioctl stage it is too
>> late, have a look at the changes I made to the dri_bufmgr.c classic memory
>> manager case to deal with this for Intel hw, if you got to superioctl and
>> failed unwrapping would be a real pain in the ass, you might have a number
>> of pieces of app state you can't reconstruct. I think DirectX handled this
>> with cut-points where with the buffer you passed the kernel a set of
>> places it could break the batch without too much effort. I think we are
>> better just giving the mesa driver a limit and when it hits that limit it
>> submits the buffer. The kernel can give it a new optimal limit at any
>> point and it should use that as soon as possible. Nothing can solve Ians
>> problems where the app gives you a single working set that is too large at
>> least with current GL. However you have to deal with the fact that
>> batchbuffer has many operations and the total working set needs to fit in
>> RAM to be relocated. I've added all the hooks in dri_bufmgr.c for non-TTM
>> case, TTM shouldn't be a major effort to add.
>>
>
> Spliting the cmd before they get submited is the way to go, likely we can
> ask the kernel for estimate of available memory and so userspace can stop
> building cmd stream but this isn't easy. Well anyway this would be a
> userspace problem. Anyway we still will have to fail in superioctl if
> for instance memory fragmentation get in the way.

It's as good as we can do...  We need more than an estimate, of course
-- if the estimate is optimistic, then you're back in the same
situation - trying to deal with it after the fact..

For userspace splitting to work, there needs to be a memory number
given by the kernel which is a *guarantee* that this amount of vram
(or equivalent) is available, and that as long as userspace sticks
within that, the kernel must guarantee that commands submitted will
run...

Unfortunately, this doesn't interact well with the pinning of buffers,
eg for scanout, which may be happening asynchronously in other
processes/threads.  There are some possibilities to have pinning
co-exist with these guarantees, eg by partitioning VRAM into a
pinnable pool and a 'normal' pool, and using the size of the 'normal'
pool as the guaranteed vram number.

Keith

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: VRAM vs suspend.

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 15:03:50 +0200
Thomas Hellström <[EMAIL PROTECTED]> wrote:

> Hi!
> 
> Parallell to the memory manager discussion, I think we need to revisit 
> the case of what happens when a
> VRAM driver is suspending to memory.
> 
> 1) The ideal thing would be for the card contents to be quickly copied 
> to backing-store and suspend is done.
> However, this requires pinning as much physical pages as there is VRAM.
> 
> 2) The other approach is to have a backing object of some sort, either a 
> list of swap-entries or perhaps a gem object.
> The gem object would, at the point of suspend, either be paged out or 
> unpopulated which means (provided that the swap sub-system is up at the 
> suspend point) there will be heavy disk-access and the operation might 
> fail due to a shortage of either swap space or physical memory for the 
> swap system bookkeeping.
> 
> Just want to know what's the general opinion here. Are the VRAM card 
> developers planning to back all VRAM objects with pinned physical pages, 
> or are we looking at approach 2) which might fail?
> 
> /Thomas

The ideal things, from my POV, is to ask to kernel to reserve swap space
each time we allocate a buffer, if we ran out of swap space or whatever
backing store the kernel can propose then we should fail buffer allocation
even if we still have vram available. Idea being that once we migrate out
stuff we know that there won't be failure to give us space to save things.
Anothers things is to add a hint from userspace which tell if we can
drop buffer content, i think we can use such hint for cmd buffer or
others temporary buffer we know of, won't save much but still might
help to save somebandwidth.

Of course this need that kernel provide us an interface for such things.
If i am right shmem is already talking with kernel about swap space.

Cheers,
Jerome Glisse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 12:16:57 +0100 (IST)
Dave Airlie <[EMAIL PROTECTED]> wrote:
> > 
> > For radeon the plan was to return error from superioctl as during 
> > superioctl and validation i do know if there is enough gart/vram to do 
> > the things. Then i think it's up to upper level to properly handle such 
> > failure from superioctl
> 
> You really want to work this out in advance, at superioctl stage it is too 
> late, have a look at the changes I made to the dri_bufmgr.c classic memory 
> manager case to deal with this for Intel hw, if you got to superioctl and 
> failed unwrapping would be a real pain in the ass, you might have a number 
> of pieces of app state you can't reconstruct. I think DirectX handled this 
> with cut-points where with the buffer you passed the kernel a set of 
> places it could break the batch without too much effort. I think we are 
> better just giving the mesa driver a limit and when it hits that limit it 
> submits the buffer. The kernel can give it a new optimal limit at any 
> point and it should use that as soon as possible. Nothing can solve Ians 
> problems where the app gives you a single working set that is too large at 
> least with current GL. However you have to deal with the fact that 
> batchbuffer has many operations and the total working set needs to fit in 
> RAM to be relocated. I've added all the hooks in dri_bufmgr.c for non-TTM 
> case, TTM shouldn't be a major effort to add.
> 

Spliting the cmd before they get submited is the way to go, likely we can
ask the kernel for estimate of available memory and so userspace can stop
building cmd stream but this isn't easy. Well anyway this would be a
userspace problem. Anyway we still will have to fail in superioctl if
for instance memory fragmentation get in the way.

Cheers,
Jerome Glisse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


VRAM vs suspend.

2008-05-19 Thread Thomas Hellström
Hi!

Parallell to the memory manager discussion, I think we need to revisit 
the case of what happens when a
VRAM driver is suspending to memory.

1) The ideal thing would be for the card contents to be quickly copied 
to backing-store and suspend is done.
However, this requires pinning as much physical pages as there is VRAM.

2) The other approach is to have a backing object of some sort, either a 
list of swap-entries or perhaps a gem object.
The gem object would, at the point of suspend, either be paged out or 
unpopulated which means (provided that the swap sub-system is up at the 
suspend point) there will be heavy disk-access and the operation might 
fail due to a shortage of either swap space or physical memory for the 
swap system bookkeeping.

Just want to know what's the general opinion here. Are the VRAM card 
developers planning to back all VRAM objects with pinned physical pages, 
or are we looking at approach 2) which might fail?

/Thomas




-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


i915 performance, master, i915tex & gem

2008-05-19 Thread Keith Whitwell
Just reposting this with a new subject line and less preamble.



- Original Message 

> 
> Well the thing is I can't believe we don't know enough to do this in some 
> way generically, but maybe the TTM vs GEM thing proves its not possible. 


I don't think there's anything particularly wrong with the GEM
interface -- I just need to know that the implementation can be fixed
so that performance doesn't suck as hard as it does in the current one,
and that people's political views on basic operations like mapping
buffers don't get in the way of writing a decent driver.

We've run a few benchmarks against i915 drivers in all their permutations, and 
to summarize the results look like:
- for GPU-bound apps, there are small differences, perhaps up to 10%.  I'm 
really not concerned about these (yet).
- for CPU-bound apps, the overheads introduced by Intel's approach to 
buffer handling impose a significant penalty in the region of 50-100%.

I
think the latter is the significant result -- none of these experiments
in memory management significantly change the command stream the
hardware has to operate on, so what we're varying essentially is the
CPU behaviour to acheive that command stream.  And it is in CPU usage
where GEM (and Keith/Eric's now-abandoned TTM driver) do significantly
dissapoint.

Or to put it another way, GEM & master/TTM seem to burn huge
amounts
of CPU just running the memory manager.  This isn't true for
master/no-ttm or for i915tex using userspace sub-allocation, where the
CPU penalty for getting decent memory management seems to be minimal
relative to the non-ttm baseline.  

If there's a political
desire to not use userspace sub-allocation, then whatever kernel-based
approach you want to investigate should nonetheless make some effort to
hit reasonable performance goals -- and neither of the current two
kernel-allocation-based approaches currently are at all impressive.

Keith


==
And on an i945G, dual core Pentium D 3Ghz 2MB cache, FSB 800 Mhz, 
single-channel ram:


Openarena timedemo at 640x480:

master w/o TTM:  840 frames, 17.1 seconds: 49.0 fps, 12.24s user 1.02s system 
63% cpu 20.880 total
master with TTM: 840 frames, 15.8 seconds: 53.1 fps, 13.51s user 5.15s system 
95% cpu 19.571 total
i915tex_branch:  840 frames, 13.8 seconds: 61.0 fps, 12.54s user 2.34s system 
85% cpu 17.506 total
gem: 840 frames, 15.9 seconds: 52.8 fps, 11.96s user 4.44s system 
83% cpu 19.695 total

KW: 
It's less obvious here than some of the tests below, but the pattern is
still clear -- compared to master/no-ttm i915tex is getting about the
same ratio of fps to CPU usage, whereas both master/ttm and gem are
significantly worse, burning much more CPU per fps, with a large chunk
of the extra CPU being spent in the kernel.  

The particularly
worrying thing about GEM is that it isn't hitting *either* 100% cpu
*or* maximum framerates from the hardware -- that's really not very
good, as it implies hardware is being left idle unecessarily.


glxgears:

A: ~1029 fps, 20.63user 2.88system 1:00.00elapsed 39%CPU  (master, no ttm) 
B: ~1072 fps, 23.97user 18.06system 1:00.00elapsed 70%CPU  (master, ttm)
C: ~1128 fps, 22.38user 5.21system 1:00.00elapsed 45%CPU  (i915tex, new)
D: ~1167 fps, 23.14user 9.07system 1:00.00elapsed 53%CPU  (i915tex, old)
F: ~1112 fps, 24.70user 21.95system 1:00.00elapsed 77%CPU  (gem)

KW:
The high CPU overhead imposed by GEM and (non-suballocating) master/TTM
should be pretty clear here.  master/TTM burns 30% of CPU just running
the memory manager!!  GEM gets slightly higher framerates but uses even
more CPU than master/TTM.  

fgl_glxgears -fbo:

A: n/a
B: ~244 fps, 7.03user 5.30system 1:00.01elapsed 20%CPU  (master, ttm)
C: ~255 fps, 6.24user 1.71system 1:00.00elapsed 13%CPU  (i915tex, new)
D: ~260 fps, 6.60user 2.44system 1:00.00elapsed 15%CPU  (i915tex, old)
F: ~258 fps, 7.56user 6.44system 1:00.00elapsed 23%CPU  (gem)

KW: GEM & master/ttm burn more cpu to build/submit the same command streams.

openarena 1280x1024:

A: 840 frames, 44.5 seconds: 18.9 fps  (master, no ttm)
B: 840 frames, 40.8 seconds: 20.6 fps  (master, ttm)
C: 840 frames, 40.4 seconds: 20.8 fps  (i915tex, new)
D: 840 frames, 37.9 seconds: 22.2 fps  (i915tex, old)
F: 840 frames, 40.3 seconds: 20.8 fps  (gem)

KW: 
no cpu measurements taken here, but almost certainly GPU bound.  A lot
of similar numbers, I don't believe the deltas have anything in
particular to do with memory management interface choices...

ipers:

A: ~285000 Poly/sec (master, no ttm)
B: ~217000 Poly/sec (master, ttm)
C: ~298000 Poly/sec (i915tex, new)
D: ~227000 Poly/sec (i915tex, old)
F: ~125000 Poly/sec (gen, GPU lockup on first attempt)

KW: no cpu measurements in this run, but all are almost certainly 100% pinned 
on CPU.  
  - i915tex (in particular i915tex, new) show similar performance to classic - 
ie low cpu 

Re: TTM vs GEM discussion questions

2008-05-19 Thread Keith Whitwell

> > It's not clear to me which of the above the r300 & nv people are aiming at, 
> but in my opinion the latter is such a significant departure from what we 
> have 
> been thinking about that I have always believed it should be addressed by a 
> new 
> set of interfaces.
> > 
> 
> My understanding of future hw is that we are heading to virtualized GPU 
> memory 
> (IRQ assistance for
> page fault).

Yes, of course.  This is the vista advanced scheduler and I guess it will be 
enforced by whql or some other mandatory scheme.  Here's a post from 2006 that 
lays out the concepts:

http://blogs.msdn.com/greg_schechter/archive/2006/04/02/566767.aspx

The graphics rumour sites suggest that one or more of the IHVs failed to 
achieve this for the vista deadlines, so it might be a bit of a tough technical 
problem...

My belief is that there are two different problems - buffer based memory 
managent and page-based virtualized GPU memory, and they should be solved with 
different implementations and probably different interfaces.  Moreover, we 
should try and get a workable buffer-based scheme for current hardware and then 
commence navel-gazing to support future cards... delaying an adequate 
buffer-based memory manager (ttm+cleaner-interface or gem+performance-fixes) to 
wait for a page-based one doesn't make any sense as the page-based one won't 
ever work on current cards.

The opposite is true, however -- a decent set of buffer-based interfaces will 
keep working for a long time, giving breathing room to create a page-baed 
manager later.

Keith


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Keith Whitwell


- Original Message 
> From: Dave Airlie <[EMAIL PROTECTED]>
> To: Jerome Glisse <[EMAIL PROTECTED]>
> Cc: Keith Whitwell <[EMAIL PROTECTED]>; Ian Romanick <[EMAIL PROTECTED]>; DRI 
> 
> Sent: Monday, May 19, 2008 12:16:57 PM
> Subject: Re: TTM vs GEM discussion questions
> 
> > 
> > For radeon the plan was to return error from superioctl as during 
> > superioctl and validation i do know if there is enough gart/vram to do 
> > the things. Then i think it's up to upper level to properly handle such 
> > failure from superioctl
> 
> You really want to work this out in advance, at superioctl stage it is too 
> late, have a look at the changes I made to the dri_bufmgr.c classic memory 
> manager case to deal with this for Intel hw, if you got to superioctl and 
> failed unwrapping would be a real pain in the ass, you might have a number 
> of pieces of app state you can't reconstruct. I think DirectX handled this 
> with cut-points where with the buffer you passed the kernel a set of 
> places it could break the batch without too much effort. I think we are 
> better just giving the mesa driver a limit and when it hits that limit it 
> submits the buffer. The kernel can give it a new optimal limit at any 
> point and it should use that as soon as possible. Nothing can solve Ians 
> problems where the app gives you a single working set that is too large at 
> least with current GL. However you have to deal with the fact that 
> batchbuffer has many operations and the total working set needs to fit in 
> RAM to be relocated. I've added all the hooks in dri_bufmgr.c for non-TTM 
> case, TTM shouldn't be a major effort to add.
> 
> > 
> > My understanding of future hw is that we are heading to virtualized GPU 
> > memory (IRQ assistance for page fault).
> > 
> 
> I think we'll have this for r700, not sure i965 does this, r500 has I 
> think per-process GART.

I don't think you can restart i9xx after a pagefault, may be wrong...

Note per-process GART != support for virtualized memory, though it gets you one 
step of the way.  You also need the support so the kernel can figure out what 
page needs to be swapped in, be able to restart the GPU after the pagefault, 
etc, and probably some way to have the hardware go off and do something useful 
on another context in the meantime.  

I'd like to just try and get buffer based memory management working well first, 
then draw a line under that and work on these more "advanced" concepts...

Keith


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Keith Whitwell


- Original Message 
> From: Dave Airlie <[EMAIL PROTECTED]>
> To: Ian Romanick <[EMAIL PROTECTED]>
> Cc: DRI 
> Sent: Monday, May 19, 2008 4:38:02 AM
> Subject: Re: TTM vs GEM discussion questions
> 
> 
> > 
> > All the good that's done us and our users.  After more than *5 years* of
> > various memory manager efforts we can't support basic OpenGL 1.0 (yes,
> > 1.0) functionality in a performant manner (i.e., glCopyTexImage and
> > friends).  We have to get over this "it has to be perfect or it will
> > never get in" crap.  Our 3D drivers are entirely irrelevant at this point.
> 
> Except on Intel hardware, who's relevance may or may not be relevant.

These can't do copyteximage with the in-kernel drm.  


> > To say that "userspace APIs cannot die once released" is not a relevant
> > counterpoint.  We're not talking about a userspace API for general
> > application use.  This isn't futexs, sysfs, or anything that
> > applications will directly depend upon.  This is an interface between a
> > kernel portion of a driver and a usermode portion of a driver.  If we
> > can't be allowed to change or deprecate those interfaces, we have no hope.
> > 
> > Note that the closed source guys don't have this artificial handicap.
> 
> Ian, fine you can take this up with Linus and Andrew Morton, I'm not 
> making this up just to stop you from putting 50 unsupportable memory 
> managers in the kernel. If you define any interface to userspace from the 
> kernel (ioctls, syscalls), you cannot just make it go away. The rule is 
> simple and is that if you install a distro with a kernel 2.6.x.distro, and 
> it has Mesa 7.0 drivers on it, upgrading the kernel to kernel 2.6.x+n 
> without touching userspace shouldn't break userspace ever. If we can't 
> follow this rule we can't put out code into Linus's kernel. So don't argue 
> about it, deal with it, this isn't going to change.
> 
> and yes I've heard this crap about closed source guys, but we can't follow 
> their route and be distributed by vendors. How many vendors ship the 
> closed drivers?
> 
> > This is also a completely orthogonal issue to maintaining any particular
> > driver.  Drivers are removed from the kernel just the same as they are
> > removed from X.org.  Assume we upstreamed either TTM or GEM "today."
> > Clearly that memory manager would continue to exist as long as some
> > other driver continued to depend on it.  I don't see how this is
> > different from cfb or any of the other interfaces within the X server
> > that we've gutted recently.
> 
> Drivers and pieces of the kernel aren't removed like you think. I think we 
> nuked gamma (didn't have a working userspcae anymore) and ffb (it sucked 
> and couldn't  be fixed). Someone is bound to bring up OSS->ALSA, but that 
> doesn't count as ALSA had OSS emulation layer so userspace apps didn't 
> just stop working. Removing chunks of X is vastly different to removing an 
> exposed kernel userspace interface. Please talk to any IBM kernel person 
> and clarify how this stuff works. (Maybe benh could chime in...??)
> 
> > If you want to remove a piece of infrastructure, you have three choices.
> > ~ If nothing uses it, you gut it.  If something uses it, you either fix
> > that "something" to use different infrastructure (which puts you in the
> > "nothing uses it" state) or you leave things as they are.  In spite of
> > all the fussing the kernel guys do in this respect, the kernel isn't
> > different in this respect from any other large, complex piece of
> > infrastructure.
> 
> So you are going to go around and fix the userspaces on machines that are 
> already deployed? how? e.g. Andrew Morton has a Fedora Core 1 install on a 
> laptop booting 2.6.x-mm kernels, when 3D stops working on that laptop we 
> get to hear about it. So yes you can redesign and move around the kernel 
> internals as much as you like, but you damn well better expose the old 
> interface and keep it working.
> 
> > managers or that we may want to have N memory managers now that will be
> > gutted later.  It seems that the real problem is that the memory
> > managers have been exposed as a generic, directly usable, device
> > independent piece of infrastructure.  Maybe the right answer is to punt
> > on the entire concept of a general memory manager.  At best we'll have
> > some shared, optional use infrastructure, and all of the interfaces that
> > anything in userspace can ever see are driver dependent.  That limits
> > the exposure of the interfaces and lets us solve todays problems today.
> > 
> > As is trivially apparent, we don't know what the "best" (for whatever
> > definition of best we choose) answer is for a memory manager interface.
> > ~ We're probably not going to know that answer in the near future.  To
> > not let our users have anything until we can give them the best thing is
> > an incredible disservice to them, and it makes us look silly (at best).
> > 
> 
> Well the thing is I can't believe we don't kno

Re: TTM vs GEM discussion questions

2008-05-19 Thread Dave Airlie
> 
> For radeon the plan was to return error from superioctl as during 
> superioctl and validation i do know if there is enough gart/vram to do 
> the things. Then i think it's up to upper level to properly handle such 
> failure from superioctl

You really want to work this out in advance, at superioctl stage it is too 
late, have a look at the changes I made to the dri_bufmgr.c classic memory 
manager case to deal with this for Intel hw, if you got to superioctl and 
failed unwrapping would be a real pain in the ass, you might have a number 
of pieces of app state you can't reconstruct. I think DirectX handled this 
with cut-points where with the buffer you passed the kernel a set of 
places it could break the batch without too much effort. I think we are 
better just giving the mesa driver a limit and when it hits that limit it 
submits the buffer. The kernel can give it a new optimal limit at any 
point and it should use that as soon as possible. Nothing can solve Ians 
problems where the app gives you a single working set that is too large at 
least with current GL. However you have to deal with the fact that 
batchbuffer has many operations and the total working set needs to fit in 
RAM to be relocated. I've added all the hooks in dri_bufmgr.c for non-TTM 
case, TTM shouldn't be a major effort to add.

> 
> My understanding of future hw is that we are heading to virtualized GPU 
> memory (IRQ assistance for page fault).
> 

I think we'll have this for r700, not sure i965 does this, r500 has I 
think per-process GART.

Dave.

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 03:49:04 -0700 (PDT)
Keith Whitwell <[EMAIL PROTECTED]> wrote:
> 
> 
> I don't actually think the problem is solvable for buffer-based memory 
> managers -- the best we can do is spot the failure and recover, either early 
> as the commands are submitted by the API, or some point later, and for some 
> meaning of 'recover' (eg - fail cleanly, fallback, use-smaller-mipmaps, 
> disable texturing, etc).
> 

For radeon the plan was to return error from superioctl as during superioctl 
and validation i do know
if there is enough gart/vram to do the things. Then i think it's up to upper 
level to properly handle
such failure from superioctl

> The only real way to solve it is to move to a page-based virtualizaization of 
> GPU memory, which requires hardware support and isn't possible on most cards. 
>  Note that this is different from per-process GPU address spaces, and is a 
> signficantly tougher problem even on supporting hardware.
> 
> Note there are two concepts with similar common names:
> 
>- virtual GPU memory -- ie per-context page tables, but still a 
> buffer-based memory manager, textures pre-loaded into GPU memory prior to 
> command execution
>   
>- virtualized GPU memory -- as above, but with page faulting, typically 
> IRQ-driven with kernel assistance.  Parts of textures may be paged in/out as 
> required, according to the "memory access" patterns of active shaders.
> 
> It's not clear to me which of the above the r300 & nv people are aiming at, 
> but in my opinion the latter is such a significant departure from what we 
> have been thinking about that I have always believed it should be addressed 
> by a new set of interfaces.
> 

My understanding of future hw is that we are heading to virtualized GPU memory 
(IRQ assistance for
page fault).

Cheers,
Jerome Glisse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Keith Whitwell


- Original Message 
> From: Thomas Hellström <[EMAIL PROTECTED]>
> To: Stephane Marchesin <[EMAIL PROTECTED]>
> Cc: DRI 
> Sent: Monday, May 19, 2008 9:49:21 AM
> Subject: Re: TTM vs GEM discussion questions
> 
> Stephane Marchesin wrote:
> > On 5/18/08, Thomas Hellström wrote:
> >
> >  
> >>  > Yes, that was really my point. If the memory manager we use (whatever
> >>  > it is) does not allow this kind of behaviour, that'll force all cards
> >>  > to use a kernel-validated command submission model, which might not be
> >>  > too fast, and more difficult to implement on such hardware.
> >>  >
> >>  > I'm not in favor of having multiple memory managers, but if the chosen
> >>  > one is both slower and more complex to support in the future, that'll
> >>  > be a loss for everyone. Unless we want to have another memory manager
> >>  > implementation in 2 years from now...
> >>  >
> >>  > Stephane
> >>  >
> >>
> >> First, TTM does not enforce kernel command submission, but it forces you
> >>  to tell the kernel about command completion status in order for the
> >>  kernel to be able to move and delete buffers.
> >>
> >
> > Yes, emitting the moves from the kernel is not a necessity. If your
> > card can do memory protection, you can setup the protection bits in
> > the kernel and ask user space to do the moves. Doing so means in-order
> > execution in the current context, which means that in the normal case
> > rendering does not need to synchronize with fences at all.
> >
> >  
> >>  I'm not sure how you could avoid that with ANY kernel based memory
> >>  manager, but I would be interested to know how you expect to solve that
> >>  problem.
> >>
> >
> > See above, if the kernel controls the memory protection bits, it can
> > pretty much enforce things on use space anyway.
> >
> >  
> Well, the primary reason for the kernel to sync and move a buffer object 
> would be to evict it from VRAM, in which case I don't think the 
> user-space approach would be a valid solution, unless, of course, you 
> plan to use VRAM as a cache and back it all with system memory.
> 
> Just out of interest (I think this is a valid thing to know, and I'm not 
> being TTM / GEM specific here):
> 1) I've never seen a kernel round-trip per batchbuffer as a huge 
> performance-problem, and it surely simplifies things for an in-kernel 
> memory manger. Do you have any data to back this?
> 2) What do the Nvidia propriety drivers do w r t this?


What I understand is that each hardware context (and there are lots of hardware 
contexts) has a ringbuffer which is mapped into the address space of the driver 
assigned that context.  The driver just inserts commands into that ringbuffer 
and the hardware itself schedules & context-switches between rings.

Then the question is how does this interact with a memory manager.  There still 
has to be some entity managing the global view of memory -- just as the kernel 
does for the regular vm system on the CPU.  A context/driver shouldn't be able 
to rewrite its own page tables, for instance.

Keith


-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Keith Whitwell


- Original Message 
> From: Ian Romanick <[EMAIL PROTECTED]>
> To: DRI 
> Sent: Monday, May 19, 2008 10:04:09 AM
> Subject: Re: TTM vs GEM discussion questions
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Ian Romanick wrote:
> 
> | I've read the GEM documentation several times, and I think I have a good
> | grasp of it.  I don't have any non-trivial complaints about GEM, but I
> | do have a couple comments / observations:
> |
> | - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
> | needs to be handled.  I know of at least one piece of hardware with a
> | kooky command buffer that wants to be used that way.
> |
> | - I suspect that in the (near) future we may want multiple read_domains.
> | ~ I can envision cases where applications using, for example, vertex
> | feedback mode would want to read from a buffer while the GPU is also
> | reading from the buffer.
> |
> | - I think drm_i915_gem_relocation_entry should have a "size" field.
> | There are a lot of cases in the current GL API (and more to come) where
> | the entire object will trivially not be used.  Clamped LOD on textures
> | is a trivial example, but others exist as well.
> 
> Another question occurred to me.  What happens on over-commit?  Meaning,
> in order to draw 1 polygon more memory must be accessible to the GPU
> than exists.  This was a problem that I never solved in my 2004
> proposal.  At the time on R200 it was possible to have 6 maximum size
> textures active which would require more than the possible on-card + AGP
> memory.

I don't actually think the problem is solvable for buffer-based memory managers 
-- the best we can do is spot the failure and recover, either early as the 
commands are submitted by the API, or some point later, and for some meaning of 
'recover' (eg - fail cleanly, fallback, use-smaller-mipmaps, disable texturing, 
etc).

The only real way to solve it is to move to a page-based virtualizaization of 
GPU memory, which requires hardware support and isn't possible on most cards.  
Note that this is different from per-process GPU address spaces, and is a 
signficantly tougher problem even on supporting hardware.

Note there are two concepts with similar common names:

   - virtual GPU memory -- ie per-context page tables, but still a buffer-based 
memory manager, textures pre-loaded into GPU memory prior to command execution
  
   - virtualized GPU memory -- as above, but with page faulting, typically 
IRQ-driven with kernel assistance.  Parts of textures may be paged in/out as 
required, according to the "memory access" patterns of active shaders.

It's not clear to me which of the above the r300 & nv people are aiming at, but 
in my opinion the latter is such a significant departure from what we have been 
thinking about that I have always believed it should be addressed by a new set 
of interfaces.


Ketih

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Jerome Glisse
On Mon, 19 May 2008 02:22:00 -0700
Ian Romanick <[EMAIL PROTECTED]> wrote:
> 
> There is also a bunch of up-and-coming GL functionality that you may not
> be aware of that changes this picture a *LOT*.
> 
> - - GL_EXT_texture_buffer_object allows a portion of a buffer object to be
> used to back a texture.
> 
> - - GL_EXT_bindable_uniform allows a portion of a buffer object to be used
> to back a block of uniforms.
> 
> - - GL_EXT_transform_feedback allows the output of the vertex shader or
> geometry shader to be stored to buffer objects.
> 
> - - Long's Peak has functionality where a buffer object can be mapped
> *without* waiting for all previous GL commands to complete.
> GL_APPLE_flush_buffer_range has similar functionality.
> 
> - - Long's Peak has NV_fence-like synchronization objects.
> 
> The usage scenario that ISVs (and that other vendors are going to make
> fast) is one where transform feedback is used to "render" a bunch of
> objects to a single buffer object.  There is a fair amount of overhead
> in changing all the output buffer object bindings, so ISVs don't want to
> be forced to take that performance hit.  If a fence is set after each
> object is sent down the pipe, the app can wait for one object to finish
> rendering, map the buffer object, and operate on the data.
> 
> Similar situations can occur even without transform feedback.  Imagine
> an app that is streaming data into a buffer object.  It streams in one
> object (via MapBuffer), does a draw command, sets a fence, streams in
> the next, etc.  When the buffer is full, it waits on the first fence,
> and starts back at the beginning.  Apparently, a lot of ISVs are wanting
> to do this.  I'm not a big fan of this usage.  It seems that nobody ever
> got fire-and-forget buffer objects (repeated cycle of allocate, fill,
> use, delete) to be fast, so this is what ISVs are wanting instead.
> 
> In other news, app developers *hate* BufferSubData.  They much prefer to
> just map the buffer and write to it or read from it.
> 
> All of this points to mapping buffers to the CPU in, on, and around GPU
> usage being a very common operation.  It's also an operation that app
> developers expect to be fast.
> 

Thanks Ian to stress current and future usage, i was really hopping that
with GL3 buffer object mapping would have vanish but i guess as you said
that the fire-and-forget path never get optimized.

Does in GL3 object must be unmapped before being use ? IIRC this what is
required in current GL 1.x and GL 2.x. If so i think i can still use VRAM
as cache ie i put their object which are barely never mapped (like a
constant texture, or constant vertex table). This avoid me to think to
complexe solution for cleanly handling unmappable vram.

A side question is there any data on tlb flush ? ie how much map/unmap,
from client vma, cycle cost.

In the meantime i think we can promote use of pread/pwrite or BufferSubData
to take advantage of offset & size information in software we do (mesa, EXA,
...).

Ian do you know why dev hate BufferSubData ? Is there any place i can read
about it ? I have been focusing on driver dev and i am little bit out dated
on today typical GL usage, i assumed that hw manufacturer did promote use of
BufferSubData to software dev.

Cheers,
Jerome Glisse

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Keith Packard wrote:
| On Mon, 2008-05-19 at 00:14 -0700, Ian Romanick wrote:
|
|> - - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
|> needs to be handled.  I know of at least one piece of hardware with a
|> kooky command buffer that wants to be used that way.
|
| Oh, so mapping the same command buffer for both activities.
|
| For Intel, we use batch buffers written with the CPU and queued to the
| GPU by the kernel, using suitable flushing to get data written to memory
| before the GPU is asked to read it.
|
| It could be that this 'command domain' just needs to be separate, and
| mapped coherent between GPU and CPU so that this works.
|
| However, instead of messing with the API on some theoretical hardware,
| I'm really only interested in seeing how the API fits to actual
| hardware. Having someone look at how a gem-like API would work on Radeon
| or nVidia hardware would go a long ways to exploring what pieces are
| general purpose and which are UMA- (or even Intel-) specific.

Sorry for being subtle.  It isn't theoretical hardware.  It's XP10.  It
uses a weird linked-list mechanism for commands.  Each command has a
header that contains a pointer to the next command and a flag.  The flag
says whether the command is valid or an end-of-list sentinel.  The
driver can then keep linking in new commands and changing sentinels to
commands while the hardware is going.

I'd have to go back and look, but I think MGA would work well with a
similar domain setting.

|> - - I suspect that in the (near) future we may want multiple
read_domains.
|
| That's why the argument is called 'read_domains' and not 'read_domain'.
| We already have operations that read objects to both the sampler and
| render caches.

Ah, so it is.  It wasn't clear in the document that the domain values
were bits.  It looks more like they're enums.

|> - - I think drm_i915_gem_relocation_entry should have a "size" field.
|> There are a lot of cases in the current GL API (and more to come) where
|> the entire object will trivially not be used.  Clamped LOD on textures
|> is a trivial example, but others exist as well.

The specific situation I was thinking of above is where a 2048x2048
mipmapped texture has been evicted from texturable memory.  The LOD
range of the card is clamped so that only the 512x512 mipmap will be
used (imagine doing render-to-texture to generate the 256x256 mipmap
from the 512x512).  Having both an offset and a size allows the memory
manager to only bring back in the required subset of the texture.

| There are a couple of places where this might be useful (presumably both
| offset and length); the 'set_domain' operation seems like one of them,
| and if we place it there, then other places where domain information is
| passed across the API might be good places to include that as well.
|
| The most obvious benefit here is reducing clflush action as we flip
| buffers from GPU to CPU for fallbacks; however, flipping objects back
| and forth should be avoided anyway, eliminating this kind of fallback
| would provide more performance benefit than making the fallback a bit
| faster.

There is also a bunch of up-and-coming GL functionality that you may not
be aware of that changes this picture a *LOT*.

- - GL_EXT_texture_buffer_object allows a portion of a buffer object to be
used to back a texture.

- - GL_EXT_bindable_uniform allows a portion of a buffer object to be used
to back a block of uniforms.

- - GL_EXT_transform_feedback allows the output of the vertex shader or
geometry shader to be stored to buffer objects.

- - Long's Peak has functionality where a buffer object can be mapped
*without* waiting for all previous GL commands to complete.
GL_APPLE_flush_buffer_range has similar functionality.

- - Long's Peak has NV_fence-like synchronization objects.

The usage scenario that ISVs (and that other vendors are going to make
fast) is one where transform feedback is used to "render" a bunch of
objects to a single buffer object.  There is a fair amount of overhead
in changing all the output buffer object bindings, so ISVs don't want to
be forced to take that performance hit.  If a fence is set after each
object is sent down the pipe, the app can wait for one object to finish
rendering, map the buffer object, and operate on the data.

Similar situations can occur even without transform feedback.  Imagine
an app that is streaming data into a buffer object.  It streams in one
object (via MapBuffer), does a draw command, sets a fence, streams in
the next, etc.  When the buffer is full, it waits on the first fence,
and starts back at the beginning.  Apparently, a lot of ISVs are wanting
to do this.  I'm not a big fan of this usage.  It seems that nobody ever
got fire-and-forget buffer objects (repeated cycle of allocate, fill,
use, delete) to be fast, so this is what ISVs are wanting instead.

In other news, app developers *hate* BufferSubData.  They much pr

Re: TTM vs GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ian Romanick wrote:

| I've read the GEM documentation several times, and I think I have a good
| grasp of it.  I don't have any non-trivial complaints about GEM, but I
| do have a couple comments / observations:
|
| - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
| needs to be handled.  I know of at least one piece of hardware with a
| kooky command buffer that wants to be used that way.
|
| - I suspect that in the (near) future we may want multiple read_domains.
| ~ I can envision cases where applications using, for example, vertex
| feedback mode would want to read from a buffer while the GPU is also
| reading from the buffer.
|
| - I think drm_i915_gem_relocation_entry should have a "size" field.
| There are a lot of cases in the current GL API (and more to come) where
| the entire object will trivially not be used.  Clamped LOD on textures
| is a trivial example, but others exist as well.

Another question occurred to me.  What happens on over-commit?  Meaning,
in order to draw 1 polygon more memory must be accessible to the GPU
than exists.  This was a problem that I never solved in my 2004
proposal.  At the time on R200 it was possible to have 6 maximum size
textures active which would require more than the possible on-card + AGP
memory.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMUKGX1gOwKyEAw8RAqGGAJ95wX42j7vQPPsOggwWDiR9L+mPrACfZ9GV
EZClpUDfr035nlt+s78bKY8=
=zfbH
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Thomas Hellström
Stephane Marchesin wrote:
> On 5/18/08, Thomas Hellström <[EMAIL PROTECTED]> wrote:
>
>   
>>  > Yes, that was really my point. If the memory manager we use (whatever
>>  > it is) does not allow this kind of behaviour, that'll force all cards
>>  > to use a kernel-validated command submission model, which might not be
>>  > too fast, and more difficult to implement on such hardware.
>>  >
>>  > I'm not in favor of having multiple memory managers, but if the chosen
>>  > one is both slower and more complex to support in the future, that'll
>>  > be a loss for everyone. Unless we want to have another memory manager
>>  > implementation in 2 years from now...
>>  >
>>  > Stephane
>>  >
>>
>> First, TTM does not enforce kernel command submission, but it forces you
>>  to tell the kernel about command completion status in order for the
>>  kernel to be able to move and delete buffers.
>> 
>
> Yes, emitting the moves from the kernel is not a necessity. If your
> card can do memory protection, you can setup the protection bits in
> the kernel and ask user space to do the moves. Doing so means in-order
> execution in the current context, which means that in the normal case
> rendering does not need to synchronize with fences at all.
>
>   
>>  I'm not sure how you could avoid that with ANY kernel based memory
>>  manager, but I would be interested to know how you expect to solve that
>>  problem.
>> 
>
> See above, if the kernel controls the memory protection bits, it can
> pretty much enforce things on use space anyway.
>
>   
Well, the primary reason for the kernel to sync and move a buffer object 
would be to evict it from VRAM, in which case I don't think the 
user-space approach would be a valid solution, unless, of course, you 
plan to use VRAM as a cache and back it all with system memory.

Just out of interest (I think this is a valid thing to know, and I'm not 
being TTM / GEM specific here):
1) I've never seen a kernel round-trip per batchbuffer as a huge 
performance-problem, and it surely simplifies things for an in-kernel 
memory manger. Do you have any data to back this?
2) What do the Nvidia propriety drivers do w r t this?

/Thomas












-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: GEM discussion questions

2008-05-19 Thread Keith Packard
On Mon, 2008-05-19 at 00:14 -0700, Ian Romanick wrote:

> - - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
> needs to be handled.  I know of at least one piece of hardware with a
> kooky command buffer that wants to be used that way.

Oh, so mapping the same command buffer for both activities.

For Intel, we use batch buffers written with the CPU and queued to the
GPU by the kernel, using suitable flushing to get data written to memory
before the GPU is asked to read it.

It could be that this 'command domain' just needs to be separate, and
mapped coherent between GPU and CPU so that this works.

However, instead of messing with the API on some theoretical hardware,
I'm really only interested in seeing how the API fits to actual
hardware. Having someone look at how a gem-like API would work on Radeon
or nVidia hardware would go a long ways to exploring what pieces are
general purpose and which are UMA- (or even Intel-) specific.

> - - I suspect that in the (near) future we may want multiple read_domains.

That's why the argument is called 'read_domains' and not 'read_domain'.
We already have operations that read objects to both the sampler and
render caches.

> - - I think drm_i915_gem_relocation_entry should have a "size" field.
> There are a lot of cases in the current GL API (and more to come) where
> the entire object will trivially not be used.  Clamped LOD on textures
> is a trivial example, but others exist as well.

There are a couple of places where this might be useful (presumably both
offset and length); the 'set_domain' operation seems like one of them,
and if we place it there, then other places where domain information is
passed across the API might be good places to include that as well.

The most obvious benefit here is reducing clflush action as we flip
buffers from GPU to CPU for fallbacks; however, flipping objects back
and forth should be avoided anyway, eliminating this kind of fallback
would provide more performance benefit than making the fallback a bit
faster.

-- 
[EMAIL PROTECTED]


signature.asc
Description: This is a digitally signed message part
-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel


Re: TTM vs GEM discussion questions

2008-05-19 Thread Ian Romanick
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Dave Airlie wrote:

I had a whole bunch of other stuff written, but I deleted it.  I started
~ having Jon Smirl deja vu.  Life is hard for us because King Solomon cut
our drivers in half.   He gave half to usermode and half to the kernel.
~ Wah!  Wh!  Wah!

Okay.  I'm done.

|> managers or that we may want to have N memory managers now that will be
|> gutted later.  It seems that the real problem is that the memory
|> managers have been exposed as a generic, directly usable, device
|> independent piece of infrastructure.  Maybe the right answer is to punt
|> on the entire concept of a general memory manager.  At best we'll have
|> some shared, optional use infrastructure, and all of the interfaces that
|> anything in userspace can ever see are driver dependent.  That limits
|> the exposure of the interfaces and lets us solve todays problems today.
|>
|> As is trivially apparent, we don't know what the "best" (for whatever
|> definition of best we choose) answer is for a memory manager interface.
|> ~ We're probably not going to know that answer in the near future.  To
|> not let our users have anything until we can give them the best thing is
|> an incredible disservice to them, and it makes us look silly (at best).
|
| Well the thing is I can't believe we don't know enough to do this in some
| way generically, but maybe the TTM vs GEM thing proves its not possible.

It's not even TTM vs GEM that proves it.  I passed around my first
memory manager proposal, which was wisely NAKed, at DDC 2004.  That we
still have nothing four years later is the proof.

TTM vs GEM underscores some issues that make it difficult.  On one hand
we want the kernel/user interface to directly map to the operations we
want to do: allocate memory, free memory, access the memory from the
CPU, and access the memory from the GPU.  On the other hand, access
arbitration between CPU/GPU, cache issues, TLB issues, and whole bunch
of other ugly hardware reality makes the performance of that interface lose.

If we flip from that interface to something that has decent performance
we end up exposing an unreasonable amount of the mechanism in the API.
Realistically, as soon as any mechanism is exposed in the API,
maintenance loses.  If we find a better way in 6 months or 5 years we're
still stuck with the old way because the old way was been exposed.

TTM and GEM each expose the two obvious mechanisms for handling this.
One uses locks (basically) over a usermode critical region when command
buffers are created.  The other performs relocations in a critical
region in the kernel.

It's not clear to me which overall strategy is better.

I've spent quite a bit of time looking at both interfaces.  I've written
a *basic* implementation of TTM-style fences.  I've tried (and failed)
to use TTM for command buffers and the framebuffer in the XGI XP10
driver.  Some observations / comments about TTM:

- - The barrier of entry is really high.  The driver writer as to do a lot
of work, it seems, to get anything working.

- - I have often found myself wishing for more TTM documentation.

- - The fencing mechanism is much too tightly coupled with the buffer
mechanism.

I've read the GEM documentation several times, and I think I have a good
grasp of it.  I don't have any non-trivial complaints about GEM, but I
do have a couple comments / observations:

- - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
needs to be handled.  I know of at least one piece of hardware with a
kooky command buffer that wants to be used that way.

- - I suspect that in the (near) future we may want multiple read_domains.
~ I can envision cases where applications using, for example, vertex
feedback mode would want to read from a buffer while the GPU is also
reading from the buffer.

- - I think drm_i915_gem_relocation_entry should have a "size" field.
There are a lot of cases in the current GL API (and more to come) where
the entire object will trivially not be used.  Clamped LOD on textures
is a trivial example, but others exist as well.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIMSi9X1gOwKyEAw8RAtB1AJwJUvUCwS1FcdL8LjP8ahVpru2/iwCfR9eA
rfAmdoMsmZDHVmf7lRcN62Y=
=yfRk
-END PGP SIGNATURE-

-
This SF.net email is sponsored by: Microsoft 
Defy all challenges. Microsoft(R) Visual Studio 2008. 
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel