On Wed, 14 May 2008 16:36:54 +0200
Thomas Hellström <[EMAIL PROTECTED]> wrote:

> Jerome Glisse wrote:
> I don't agree with you here. EXA is much faster for small composite 
> operations and even small fill blits if fallbacks are used. Even to 
> write-combined memory, but that of course depends on the hardware. This 
> is going to be even more pronounced with acceleration architectures like 
> Glucose and similar, that don't have an optimized path for small 
> hardware composite operations.
> 
> My personal feeling is that pwrites are a workaround for a workaround 
> for a very bad decision:
> 
> To avoid user-space allocators on device-mapped memory. This led to a 
> hack to avoid caching-policy changes, which led to cache-thrashing 
> problems, which put us in the current situation. How far are we going to 
> follow this path before people wake up? What's wrong with the 
> performance of good old i915tex, which even beats "classic" i915 in many 
> cases?
> 
> Having to go through potentially (and even probably) paged-out memory to 
> access buffers that are present in VRAM sounds like a very odd 
> approach (to say the least) to me. Even if it's a single page, and 
> implementing per-page dirty checks for domain flushing isn't very 
> appealing either.

I don't have numbers or benchmarks to show how fast the pread/pwrite path
would be for this use case, so I am just expressing my feeling, which is
that we should avoid VMA TLB flushes as much as we can. My impression is
that the kernel goes through numerous tricks to avoid TLB flushing for a
good reason, and I am pretty sure that, with core counts growing, anything
that needs CPU-wide synchronization is to be avoided.
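
To make the contrast concrete, here is a rough, untested sketch of the two
access paths. The pread structure and ioctl number below are only my
assumptions about what a GEM-style interface could look like, not an
existing API; the point is that the copy path never creates a user-space
mapping, so nothing has to be torn down with a TLB shootdown afterwards.

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Hypothetical GEM-style pread; field layout and ioctl number are
     * placeholders, not a real interface. */
    struct gem_pread {
            uint32_t handle;    /* GEM object handle */
            uint32_t pad;
            uint64_t offset;    /* byte offset into the object */
            uint64_t size;      /* bytes to copy */
            uint64_t data_ptr;  /* user-space destination pointer */
    };
    #define DRM_IOCTL_GEM_PREAD 0x6440  /* made-up placeholder number */

    static int read_via_pread(int fd, uint32_t handle, void *dst, size_t size)
    {
            struct gem_pread args = {
                    .handle   = handle,
                    .offset   = 0,
                    .size     = size,
                    .data_ptr = (uint64_t)(uintptr_t)dst,
            };
            /* Kernel copies into dst; no VMA is created or destroyed. */
            return ioctl(fd, DRM_IOCTL_GEM_PREAD, &args);
    }

    static int read_via_mmap(int fd, off_t map_offset, void *dst, size_t size)
    {
            void *src = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, map_offset);
            if (src == MAP_FAILED)
                    return -1;
            memcpy(dst, src, size);
            /* Tearing down the mapping is where the TLB flush happens,
             * and it has to be broadcast to every core. */
            return munmap(src, size);
    }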

Hopefully, once I get a decent amount of time to benchmark GEM, I will
check my theory. I think a simple benchmark can be done on Intel hardware:
just return FALSE in the EXA PrepareAccess hook to force use of
DownloadFromScreen, and have DownloadFromScreen use pread. Comparing this
hacked Intel DDX against a stock one should already give some numbers;
a sketch of the hack is below.
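
Roughly like this (untested fragment meant to sit inside the intel DDX,
where the EXA and pixmap types are already available; the PrepareAccess
and DownloadFromScreen signatures are the standard EXA driver hooks, but
gem_pread_pixmap() is a made-up helper standing in for whatever pread
path the driver would grow):

    /* Hypothetical helper; would issue pread(s) on the pixmap's buffer
     * object for the requested rectangle. */
    static Bool gem_pread_pixmap(PixmapPtr pSrc, int x, int y, int w, int h,
                                 char *dst, int dst_pitch);

    /* Refuse direct CPU access so every software fallback is forced to go
     * through DownloadFromScreen instead of mapping the pixmap. */
    static Bool hack_prepare_access(PixmapPtr pPix, int index)
    {
            (void)pPix;
            (void)index;
            return FALSE;
    }

    /* Copy the pixmap back with a pread-style ioctl rather than by reading
     * through a CPU mapping of video memory. */
    static Bool hack_download_from_screen(PixmapPtr pSrc, int x, int y,
                                          int w, int h,
                                          char *dst, int dst_pitch)
    {
            return gem_pread_pixmap(pSrc, x, y, w, h, dst, dst_pitch);
    }

Comparing, for instance, x11perf runs of the hacked and the stock DDX
would then show what the pread path costs or saves.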

> Why should we have to when we can do it right?

Well, my point was that mapping VRAM is not the right approach; I am not
saying that I know the truth. It's just a feeling based on my experiments
with TTM, on the BAR restriction issues, and on other considerations of
the same kind.

> No. GEM can't cope with it. Let's say you have a 512M system with two 1G 
> video cards, 4G swap space, and you want to fill both cards' videoram 
> with render-and-forget textures for whatever purpose.
> 
> What happens? After you've generated the first, say, 300M, the system 
> mysteriously starts to page, and when, after a couple of minutes of 
> crawling texture upload speeds, you're done, the system has used and 
> written almost 2G of swap. Now, you want to update the textures and 
> expect fast texsubimage...
> 
> So having a backing object that you have to access to get things into 
> VRAM is not the way to go.
> The correct way to do this is to reserve, but not use, swap space. Then 
> you can start using it on suspend, provided that the swapping system is 
> still up (which it has to be with the current GEM approach anyway). If 
> pwrite is used in this case, it must not dirty any backing-object pages.
> 

For a normal desktop I don't expect the amount of VRAM to exceed the
amount of RAM; people with 1GB of VRAM are usually hardcore gamers with
4GB of RAM :). Also, most objects in a 3D world are stored in memory: if
programs are not stupid and trust GL to keep their textures, then you just
have the usual RAM copy and possibly a VRAM copy, so I don't see any waste
in the normal use case. Of course we can always come up with crazy, weird
setups, but I am more interested in dealing well with the average Joe than
dealing mostly well with every use case.

That said, I do see GPGPU as a possible user of big, temporary VRAM
buffers, i.e. buffers whose contents you can throw away. For that kind of
use it does make sense not to have a backing RAM/swap area. But I would
rather add something to GEM for it, like intercepting the allocation of
such buffers and not creating a backing object, or adding a driver-specific
ioctl for that case; a rough sketch of what that could look like is below.
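
Nothing here is an existing interface (the flag, structure and field names
are all made up), but the driver-specific path could be as small as a
create ioctl taking a "no backing store" flag, so the object lives only in
VRAM and is simply discarded instead of being swapped out:

    #include <stdint.h>

    /* Hypothetical driver-specific "scratch VRAM" allocation request. */
    #define FOO_GEM_CREATE_NO_BACKING   (1 << 0)

    struct foo_gem_create_scratch {
            uint64_t size;      /* in:  requested size in bytes */
            uint32_t flags;     /* in:  FOO_GEM_CREATE_NO_BACKING */
            uint32_t handle;    /* out: GEM handle; contents may be trashed
                                 *      whenever the object has to leave
                                 *      VRAM, since there is no RAM/swap
                                 *      backing object to spill into */
    };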

Anyway, I think we need benchmarks to know which option really is the best
in the end. I don't have code to support my general feeling, so I might be
wrong. Sadly we don't have 2^32 monkeys writing code day and night for the
DRM to test all the solutions :)

Cheers,
Jerome Glisse <[EMAIL PROTECTED]>
