Re: TTM merging?
On 5/18/08, Thomas Hellström <[EMAIL PROTECTED]> wrote:
> > What you fail to notice here is that I think most people intend to
> > have only one memory manager in the kernel.
>
> How on earth can you draw that conclusion from the above statement?

Well, Dave has been saying this to me all along... otherwise I'd probably have my own memory manager too. I also think most people agree that a single memory manager would make things simpler for everyone (especially since there is a need for some glue with things like EXA).

> > So making the wrong
> > decisions here will pretty much enforce those decisions on all
> > drivers. And therefore, we will not be "able to do what you want to"
>
> What GEM protagonists have been arguing and propagating for is not a
> single memory manager, but a single small common simple memory
> management interface that would allow any driver writer to do pretty
> much what they want with their driver. As you might have noticed we're
> not really arguing against that.

Yeah, again, I'm not taking sides, I'm just concerned that we'll have to revisit the memory manager issue in 2 years when all cards implement full memory protection and paging (and that'll be the case, since the Windows driver model pretty much requires that). But maybe it's the right thing to do to move forward now.

Stephane

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
--
___
Dri-devel mailing list
Dri-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: TTM merging?
Stephane Marchesin wrote:
>> Jerome, Dave, Keith
>>
>> It's hard to argue against people trying things out and finding it's not
>> really what they want, so I'm not going to do that.
>>
>> The biggest argument (apart from the fencing) seems to be that people
>> think TTM stops them from doing what they want with the hardware,
>> although it seems like the Nouveau needs and Intel UMA needs are quite
>> opposite. In an open-source community where people work on things
>> because they want to, not being able to do what you want to is a bad thing,
>
> What you fail to notice here is that I think most people intend to
> have only one memory manager in the kernel.

How on earth can you draw that conclusion from the above statement?

> So making the wrong
> decisions here will pretty much enforce those decisions on all
> drivers. And therefore, we will not be "able to do what you want to"

What GEM protagonists have been arguing for and advocating is not a single memory manager, but a single small, common, simple memory management interface that would allow any driver writer to do pretty much what they want with their driver. As you might have noticed, we're not really arguing against that.

/Thomas
Re: TTM merging?
> Jerome, Dave, Keith
>
> It's hard to argue against people trying things out and finding it's not
> really what they want, so I'm not going to do that.
>
> The biggest argument (apart from the fencing) seems to be that people
> thinks TTM stops them from doing what they want with the hardware,
> although it seems like the Nouveau needs and Intel UMA needs are quite
> opposite. In an open-source community where people work on things
> because they want to, not being able to do what you want to is a bad thing,

What you fail to notice here is that I think most people intend to have only one memory manager in the kernel. So making the wrong decisions here will pretty much enforce those decisions on all drivers. And therefore, we will not be "able to do what you want to".

> OTOH a stall and disagreement about what's the best thing to use is
> even worse. It confuses the users and it's particularly bad for people
> trying to write drivers on a commercial basis.

I don't see how the needs are opposed. A memory manager is just handling pieces of memory, and you should get some kind of flexibility from it, especially if it's going to be the de-facto memory manager for all DRI/X.Org.

Stephane
Re: TTM merging?
Eric Anholt wrote:
>> No. Gem can't cope with it. Let's say you have a 512M system with two 1G
>> video cards, 4G swap space, and you want to fill both cards' videoram
>> with render-and-forget textures for whatever purpose.
>
> Who's selling that system? Who's building that system at home?

Video game consoles? According to Wikipedia, the PS3 has 256 MB of RAM vs 256 MB of VRAM.

Philipp

P.S.: Even my ColecoVision has 1 KB of RAM vs 16 KB of VRAM.
Re: TTM merging?
On Thu, 2008-05-15 at 10:03 -0700, Ian Romanick wrote:
> Er...what about glMapBuffer? Are we now going to force drivers to
> implement that via copies?

No, we'll support it, and make it as fast as possible. The goal is to stop using it inside the driver, not to break GL apps.
Re: TTM merging?
Keith Packard wrote:
| On Wed, 2008-05-14 at 21:41 +0200, Thomas Hellström wrote:
|
|> As you've previously mentioned, this requires caching policy changes and
|> it needs to be used with some care.
|
| I didn't need that in my drivers as GEM handles the WB -> GPU object
| transfer already.
|
| Object mapping is really the least important part of the system; it
| should only be necessary when your GPU is deficient, or your API so
| broken as to require this inefficient mechanism. I suspect we'll be
| tracking 965 performance as we work to eliminate mapping, we should see
| a steady increase until we're no longer mapping anything that the GPU
| uses into the application's address space.

Er...what about glMapBuffer? Are we now going to force drivers to implement that via copies?
Re: TTM merging?
Keith Packard wrote:
> On Thu, 2008-05-15 at 07:30 +0200, Thomas Hellström wrote:
>
>> Static wc-d maps into fairly static objects like scanout buffers or
>> buffer pools are not inefficient. They provide the by far highest
>> throughput for writing (even beats cache-coherent). But they may take
>> some time to set up or tear down, which means you should avoid that as
>> much as possible.
>
> Yeah, the 2D driver uses the GTT mapping for drawing to the front
> buffer. Otherwise, tiling is way too much work as the code isn't neatly
> wrapped up like the Mesa swrast. For that, I just added detiling code.
>
> I'm not sure how I'd manage objects that can move or be evicted though;
> perhaps some page table tricks could be used to avoid locking objects to
> the GTT in a way visible to the user application.

Yes, TTM does this by killing user-space mappings when an object is evicted. When the app accesses them again, they are simply faulted back in, mapping the new location wherever that is and changing caching policy if necessary.

That approach is in principle reusable for a GEM implementation, but I think not with the shmem backing objects, since we cannot overload the SHMEM fault() method. That's only a limitation of the current GEM implementation, though, not of the API design.

One way to implement this in the GEM context: I'd take the (small) part of drmBOs that deals with this and use that as the base GEM objects instead of SHMEMFS objects. Each GEM object would also always hold a page list. When the object has real pages attached to it (like UMA objects), the list would be populated with pages. When the object points to a place in VRAM, it holds swap_entry references (to reserve swap space used at suspend). If a GEM object, at creation, fails to allocate either pages or swap entries, object creation should fail, to avoid failures later at unpredictable points.

With this approach the GEM driver needs to decide when to push a GEM object's pages to the swap cache and when to reclaim them. It doesn't happen automatically as with the SHMEMFS objects, but OTOH that puts the GEM driver in control of how much memory it will pin. That will avoid bad SHMEMFS decisions.

>> The current implementation of GEM that doesn't allow overloading of the
>> core GEM functions blocked the possibility to set up such mappings. This
>> is about to change, and I'm happy with that.
>
> The requirement is that GEM provide the interfaces used by drivers; if a
> driver needs some new functionality, we'd naturally work out how to
> incorporate that. Right now, our 3D drivers don't care about GTT maps,
> but I suspect our 2D ones will as I don't really want to try to deal
> with tiling from all of the X server drawing code.

Sounds good to me. Modesetting people will also be happy.

/Thomas.
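The page-list scheme described in the message above can be sketched roughly as follows. This is a userspace Python model, not kernel code: `Backing`, `ReservationError` and `create_object` are hypothetical names invented for illustration, and the counters merely stand in for real page and swap-entry allocation. The point it shows is the last one in that description: reserve everything the object might need when it is created, and fail right there if the reservation cannot be made.

```python
from dataclasses import dataclass


class ReservationError(Exception):
    """Raised at creation time when backing store cannot be reserved."""


@dataclass
class Backing:
    """Hypothetical stand-in for the system's free pages and swap entries."""
    free_pages: int
    free_swap_entries: int


def create_object(backing: Backing, npages: int, in_vram: bool) -> dict:
    """Create an object whose page list is fully reserved up front.

    A VRAM-placed object reserves swap entries (so it can be saved at
    suspend); any other object is populated with real pages. Either way,
    a shortage is reported now instead of at some later, unpredictable point.
    """
    if in_vram:
        if backing.free_swap_entries < npages:
            raise ReservationError("not enough swap entries for VRAM object")
        backing.free_swap_entries -= npages
        page_list = [("swap", i) for i in range(npages)]
    else:
        if backing.free_pages < npages:
            raise ReservationError("not enough pages for object")
        backing.free_pages -= npages
        page_list = [("page", i) for i in range(npages)]
    return {"npages": npages, "in_vram": in_vram, "page_list": page_list}
```

Calling `create_object(Backing(free_pages=4, free_swap_entries=2), 2, in_vram=True)` succeeds and leaves no swap entries free; a second VRAM object of any size then raises `ReservationError` immediately.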
Re: TTM merging?
On Thu, 2008-05-15 at 07:30 +0200, Thomas Hellström wrote:
> Static wc-d maps into fairly static objects like scanout buffers or
> buffer pools are not inefficient. They provide the by far highest
> throughput for writing (even beats cache-coherent). But they may take
> some time to set up or tear down, which means you should avoid that as
> much as possible.

Yeah, the 2D driver uses the GTT mapping for drawing to the front buffer. Otherwise, tiling is way too much work as the code isn't neatly wrapped up like the Mesa swrast. For that, I just added detiling code.

I'm not sure how I'd manage objects that can move or be evicted though; perhaps some page table tricks could be used to avoid locking objects to the GTT in a way visible to the user application.

> The current implementation of GEM that doesn't allow overloading of the
> core GEM functions blocked the possibility to set up such mappings. This
> is about to change, and I'm happy with that.

The requirement is that GEM provide the interfaces used by drivers; if a driver needs some new functionality, we'd naturally work out how to incorporate that. Right now, our 3D drivers don't care about GTT maps, but I suspect our 2D ones will as I don't really want to try to deal with tiling from all of the X server drawing code.
Re: TTM merging?
Keith Packard wrote:
> On Wed, 2008-05-14 at 21:41 +0200, Thomas Hellström wrote:
>
>> As you've previously mentioned, this requires caching policy changes and
>> it needs to be used with some care.
>
> I didn't need that in my drivers as GEM handles the WB -> GPU object
> transfer already.
>
> Object mapping is really the least important part of the system; it
> should only be necessary when your GPU is deficient, or your API so
> broken as to require this inefficient mechanism. I suspect we'll be
> tracking 965 performance as we work to eliminate mapping, we should see
> a steady increase until we're no longer mapping anything that the GPU
> uses into the application's address space.

Static wc-d maps into fairly static objects like scanout buffers or buffer pools are not inefficient. They provide by far the highest throughput for writing (even beating cache-coherent). But they may take some time to set up or tear down, which means you should avoid doing that as much as possible. For things like scanout buffers or video buffers you should really use such mappings, otherwise you lose big.

The current implementation of GEM, which doesn't allow overloading of the core GEM functions, blocked the possibility of setting up such mappings. This is about to change, and I'm happy with that.

/Thomas
Re: TTM merging?
On Wed, May 14, 2008 at 05:22:06PM -0700, Keith Packard wrote:
| On Wed, 2008-05-14 at 16:34 -0700, Allen Akin wrote:
| > In the OpenGL case, object mapping wasn't originally a part of the API.
| > It was added because people building hardware and apps for Intel-based
| > PCs determined that it was worthwhile, and demanded it.
|
| In a UMA environment, it seems so obvious to map objects into the
| application and just bypass the whole kernel API issue. That, however,
| ignores caching effects, which appear to dominate performance effects
| these days.

I think the confusion arises because the mechanism is used for several purposes, some of which are likely to be dominated by cache effects on some implementations, and others that aren't. I'm thinking about the differences between piecemeal updating of the elements of a vertex array, versus grabbing an image from a video capture card or a direct read() from a file into a texture buffer.

The API is intended to allow apps and drivers to make intelligent choices between cases like those. Check out BufferData() and MapBuffer() in section 2.9 of the OpenGL 2.1 spec for a discussion which specifically mentions cache effects.

| > This wasn't on my watch, so I can't give you the history in detail, but
| > my recollection is that the primary uses were texture loading for games
| > and video apps, and incremental changes to vertex arrays for games and
| > rendering apps.
|
| Most of which can be efficiently performed with a pwrite-like system
| where the application explicitly tells the system which portions of the
| object to modify. ...

Interfaces of that style are present in OpenGL, and predate the mapping interfaces. I know they were regarded as too slow for some apps, so the mapping interfaces were added. The early extensions were driven by vendors who didn't support UMA, so that couldn't have been the only model they were concerned about. Beyond that I'm not sure.

| > So maybe the hardware has changed sufficiently that the old reasoning
| > and performance measurements are no longer valid. It would still be
| > good to know for sure that eliminating low-level support for the
| > mechanism won't be drastically bad for the classes of apps that use it.
|
| I'm not sure we can (or want to) eliminate it entirely, all that I
| discovered was that it should be avoided as it has negative performance
| consequences. Not dire, but certainly not positive either.
|
| I don't know how old these measurements were, but certainly the gap
| between CPU and memory speed has been rapidly increasing for years,
| along with cache sizes, both of which have a fairly dramatic effect on
| how best to access actual memory.

The first reference I can find to an object-mapping API in OpenGL is from 2001. I'm sure the vendors had implementations internally before then, but that's when things were mature enough to start standardizing. Since the functionality is present in OpenGL 2.0 (vintage 2006?), apparently someone thought it was still useful enough to carry over from OpenGL 1.X.

Again, sorry I don't know the entire history on this one.

Allen
Re: TTM merging?
On Wed, 2008-05-14 at 16:34 -0700, Allen Akin wrote:
> On Wed, May 14, 2008 at 03:48:47PM -0700, Keith Packard wrote:
> | Object mapping is really the least important part of the system; it
> | should only be necessary when your GPU is deficient, or your API so
> | broken as to require this inefficient mechanism.
>
> In the OpenGL case, object mapping wasn't originally a part of the API.
> It was added because people building hardware and apps for Intel-based
> PCs determined that it was worthwhile, and demanded it.

In a UMA environment, it seems so obvious to map objects into the application and just bypass the whole kernel API issue. That, however, ignores caching effects, which appear to dominate performance effects these days.

> This wasn't on my watch, so I can't give you the history in detail, but
> my recollection is that the primary uses were texture loading for games
> and video apps, and incremental changes to vertex arrays for games and
> rendering apps.

Most of which can be efficiently performed with a pwrite-like system where the application explicitly tells the system which portions of the object to modify. Again, it seems insane when everything is a uniform mass of pages, except for the subtle differences in cache behaviour.

> So maybe the hardware has changed sufficiently that the old reasoning
> and performance measurements are no longer valid. It would still be
> good to know for sure that eliminating low-level support for the
> mechanism won't be drastically bad for the classes of apps that use it.

I'm not sure we can (or want to) eliminate it entirely; all that I discovered was that it should be avoided as it has negative performance consequences. Not dire, but certainly not positive either.

I don't know how old these measurements were, but certainly the gap between CPU and memory speed has been rapidly increasing for years, along with cache sizes, both of which have a fairly dramatic effect on how best to access actual memory.
Re: TTM merging?
On Wed, May 14, 2008 at 03:48:47PM -0700, Keith Packard wrote:
| Object mapping is really the least important part of the system; it
| should only be necessary when your GPU is deficient, or your API so
| broken as to require this inefficient mechanism.

In the OpenGL case, object mapping wasn't originally a part of the API. It was added because people building hardware and apps for Intel-based PCs determined that it was worthwhile, and demanded it.

This wasn't on my watch, so I can't give you the history in detail, but my recollection is that the primary uses were texture loading for games and video apps, and incremental changes to vertex arrays for games and rendering apps.

So maybe the hardware has changed sufficiently that the old reasoning and performance measurements are no longer valid. It would still be good to know for sure that eliminating low-level support for the mechanism won't be drastically bad for the classes of apps that use it.

Allen
Re: TTM merging?
On Wed, 2008-05-14 at 21:51 +0200, Thomas Hellström wrote:
> Eric Anholt wrote:
> >
> > If the implementation of those ioctls in generic code doesn't work for
> > some drivers (say, early shmfs object creation turns out to be a bad
> > idea for VRAM drivers), I'll happily push it out to the driver.
>
> Or perhaps use generic ioctls, but provide hooks in the driver to
> overload the core GEM functions with other implementations.

Yeah, that's what I was thinking.

--
Eric Anholt
Re: TTM merging?
On Wed, 2008-05-14 at 21:41 +0200, Thomas Hellström wrote:
> As you've previously mentioned, this requires caching policy changes and
> it needs to be used with some care.

I didn't need that in my drivers as GEM handles the WB -> GPU object transfer already.

Object mapping is really the least important part of the system; it should only be necessary when your GPU is deficient, or your API so broken as to require this inefficient mechanism. I suspect we'll be tracking 965 performance as we work to eliminate mapping; we should see a steady increase until we're no longer mapping anything that the GPU uses into the application's address space.
Re: TTM merging?
Eric Anholt wrote:
>
> If the implementation of those ioctls in generic code doesn't work for
> some drivers (say, early shmfs object creation turns out to be a bad
> idea for VRAM drivers), I'll happily push it out to the driver.

Or perhaps use generic ioctls, but provide hooks in the driver to overload the core GEM functions with other implementations.

/Thomas
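The "generic ioctls plus driver hooks" suggestion above could look something like the sketch below. It is written in userspace Python purely to show the dispatch pattern. All names (`DriverHooks`, `core_create`, `core_pwrite`, `vram_create`, `ioctl_create`) are hypothetical and do not correspond to real DRM or GEM symbols: a table of function hooks defaults to the core implementations, and a driver replaces only the entries it needs.

```python
from dataclasses import dataclass
from typing import Callable


def core_create(size: int) -> dict:
    """Generic default: back the object with ordinary shmem-style pages."""
    return {"size": size, "backing": "shmem", "data": bytearray(size)}


def core_pwrite(obj: dict, offset: int, data: bytes) -> int:
    """Generic default: copy the caller's bytes into the object."""
    if offset < 0 or offset + len(data) > obj["size"]:
        raise ValueError("write outside object bounds")
    obj["data"][offset:offset + len(data)] = data
    return len(data)


@dataclass
class DriverHooks:
    """Per-driver table of hooks; unset entries fall back to the core code."""
    create: Callable[[int], dict] = core_create
    pwrite: Callable[[dict, int, bytes], int] = core_pwrite


def vram_create(size: int) -> dict:
    """A VRAM driver's replacement for early shmem object creation."""
    obj = core_create(size)
    obj["backing"] = "vram"
    return obj


def ioctl_create(hooks: DriverHooks, size: int) -> dict:
    """Generic entry point: same interface for everyone, driver-chosen body."""
    return hooks.create(size)
```

With `DriverHooks()` the generic path is used unchanged; with `DriverHooks(create=vram_create)` only object creation is overloaded and `pwrite` still goes through the shared core code.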
Re: TTM merging?
Keith Packard wrote:
> On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote:
>
>> My personal feeling is that pwrites are a workaround for a workaround
>> for a very bad decision
>
> Feel free to map VRAM then if you can; I didn't need to on Intel as
> there isn't any difference.

By mapping device memory on UMA devices I mean mapping through the GTT aperture: either as stolen memory, pre-bound GTT pools, or simply buffer object memory temporarily bound to the GTT. As you've previously mentioned, this requires caching policy changes and it needs to be used with some care.

/Thomas
Re: TTM merging?
Keith Packard wrote:
> On Wed, 2008-05-14 at 10:21 -0700, Keith Whitwell wrote:
>
>> Nobody can force you to take one path or the other, but it's certainly
>> my intention when considering drivers for VRAM hardware to support
>> single-copy-number textures, and for that reason, I'd be unhappy to
>> see a system adopted that prevented that.
>
> And, GEM on UMA does single-copy texture updates, just as TTM does.
> From an object management perspective, GEM isn't very different from
> TTM, it's just that the current code is written for UMA, and no-one has
> shown code for either of these running on non-UMA hardware.

That's not exactly true. Stolen memory behaves just like VRAM from a driver writer's perspective. It is implemented as VRAM on i915 modesetting-101 and Psb.

/Thomas
Re: TTM merging?
On Wed, May 14, 2008 at 2:30 PM, Eric Anholt <[EMAIL PROTECTED]> wrote:
> On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
> > > The real question is whether TTM suits the driver writers for use in Linux
> > > desktop and embedded environments, and I think so far I'm not seeing
> > > enough positive feedback from the desktop side.
> >
> > I actually haven't seen much feedback at all. At least not on the
> > mailing lists.
> > Anyway we need to look at the alternatives which currently is GEM.
> >
> > GEM, while still in development basically brings us back to the
> > functionality of TTM 0.1, with added paging support but without
> > fine-grained locking and caching policy support.
> >
> > I might have misunderstood things but quickly browsing the code raises
> > some obvious questions:
> >
> > 1) Some AGP chipsets don't support page addresses > 32bits. GEM objects
> > use GFP_HIGHUSER, and it's hardcoded into the linux swap code.
>
> The obvious solution here is what many DMA APIs do for IOMMUs that can't
> address all of memory -- keep a pool of pages within the addressable
> range and bounce data through them. I think the Linux kernel even has
> interfaces to support us in this. Since it's not going to be a very
> common case, we may not care about the performance. If we do find that
> we care about the performance, we should first attempt to get what we
> need into the linux kernel so we don't have to duplicate code, and only
> if that fails do the duplication.
>
> I'm pretty sure the AGP chipsets versus >32-bits pages danger has been
> overstated, though. Besides the fact that you needed to load one of
> these older supposed machines with a full 4GB of memory (well,
> theoretically 3.5GB but how often can you even boot a system with a 2,
> 1, .5gb combo?), you also need a chipset that does >32-bit addressing.
>
> At least all AMD and Intel chipsets don't appear to have this problem in
> the survey I did last night, as they've either got >32-bit chipset and >32-bit
> gart, or 32-bit chipset and 32-bit gart. Basically all I'm
> worried about is ATI PCI[E]GART at this point.

AMD PCIE and IGP GART support 40 bits (Dave just committed support this morning) so we should be fine on r3xx and newer PCIE cards.

Alex
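The bounce-pool idea quoted above (keep a pool of pages inside the range the device can address and route data through them) can be illustrated with a small model. This is a hedged userspace Python sketch, not the kernel's actual bounce-buffer machinery; `BouncePool`, `device_can_address` and the example addresses are assumptions made up for the illustration.

```python
PAGE_SIZE = 4096


def device_can_address(phys_addr: int, addr_bits: int = 32) -> bool:
    """True if the whole page at phys_addr lies below the device's limit."""
    return phys_addr + PAGE_SIZE <= (1 << addr_bits)


class BouncePool:
    """Hypothetical pool of low pages reserved for devices with narrow GARTs."""

    def __init__(self, low_page_addrs):
        self.free = list(low_page_addrs)

    def map_for_device(self, phys_addr: int, addr_bits: int = 32):
        """Return (address to program into the GART, whether a copy is needed).

        Pages the device can reach are used directly. Pages above the limit
        are replaced by a page from the pool; the caller would then copy the
        data through that page, which is the slow but rare path.
        """
        if device_can_address(phys_addr, addr_bits):
            return phys_addr, False
        if not self.free:
            raise MemoryError("bounce pool exhausted")
        return self.free.pop(), True
```

A page at `0x1_0000_0000` (just above 4 GB) needs a bounce page for a 32-bit GART, but is used directly when `addr_bits=40`, which matches the 40-bit case mentioned in the reply.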
Re: TTM merging?
Eric Anholt wrote:
> On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
>>> The real question is whether TTM suits the driver writers for use in Linux
>>> desktop and embedded environments, and I think so far I'm not seeing
>>> enough positive feedback from the desktop side.
>>
>> I actually haven't seen much feedback at all. At least not on the
>> mailing lists.
>> Anyway we need to look at the alternatives which currently is GEM.
>>
>> GEM, while still in development basically brings us back to the
>> functionality of TTM 0.1, with added paging support but without
>> fine-grained locking and caching policy support.
>>
>> I might have misunderstood things but quickly browsing the code raises
>> some obvious questions:
>>
>> 1) Some AGP chipsets don't support page addresses > 32bits. GEM objects
>> use GFP_HIGHUSER, and it's hardcoded into the linux swap code.
>
> The obvious solution here is what many DMA APIs do for IOMMUs that can't
> address all of memory -- keep a pool of pages within the addressable
> range and bounce data through them. I think the Linux kernel even has
> interfaces to support us in this. Since it's not going to be a very
> common case, we may not care about the performance. If we do find that
> we care about the performance, we should first attempt to get what we
> need into the linux kernel so we don't have to duplicate code, and only
> if that fails do the duplication.
>
> I'm pretty sure the AGP chipsets versus >32-bits pages danger has been
> overstated, though. Besides the fact that you needed to load one of
> these older supposed machines with a full 4GB of memory (well,
> theoretically 3.5GB but how often can you even boot a system with a 2,
> 1, .5gb combo?), you also need a chipset that does >32-bit addressing.
>
> At least all AMD and Intel chipsets don't appear to have this problem in
> the survey I did last night, as they've either got >32-bit chipset and >32-bit
> gart, or 32-bit chipset and 32-bit gart. Basically all I'm
> worried about is ATI PCI[E]GART at this point.
>
> http://dri.freedesktop.org/wiki/GARTAddressingLimits

A couple more devices or incomplete drivers will probably turn up, but in the long run this is a fixable problem.

>> 5) What's protecting i915 GEM object privates and lists in a
>> multi-threaded environment?
>
> Nothing at the moment. That's my current project. dev->struct_mutex is
> the plan -- I don't want to see finer-grained locking until we show that
> contention on that locking is an issue. Fine-grained locking takes
> significant care, and there's a lot more important performance
> improvements to work on before then.
>
>> 6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing
>> some severe ranting about it on the lkml?
>
> We've talked it over with Arjan, and until we can use real fds as our
> handles to objects, he thought it sounded OK. But apparently Al Viro's
> working on making it so that allocating a thousand fds would be feasible
> for us. At that point mmap/pread/pwrite/close ioctls could be replaced
> with the syscalls they were named for, and the kernel guys love us.
>
>> TTM is designed to cope with most hardware quirks I've come across with
>> different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and
>> some other ones. GEM basically leaves it up to the driver writer to
>> reinvent the wheel..
>
> The problem with TTM is that it's designed to expose one general API for
> all hardware, when that's not what our drivers want. The GPU-GPU cache
> handling for intel, for example, mapped the hardware so poorly that
> every batch just flushed everything. Bolting on the clflush-based
> cpu-gpu caching management for our platform recovered a lot of
> performance, but we're still having to reuse buffers in userland at a
> memory cost because allocating buffers is overly expensive for the
> general supporting-everybody (but oops, it's not swappable!) object
> allocator.
> > Swapping drmBOs is a couple of days implementation and some core kernel exports. It's just that someone needs find the time and the right person to talk to in the right way to get certain swapping functions exported. > We're trying to come at it from the other direction: Implement one > driver well. When someone else implements another driver and finds that > there's code that should be common, make it into a support library and > share it. > > I actually would have liked the whole interface to userland to be > driver-specific with a support library for the parts we think other > people would want, but DRI2 wants to use buffer objects for its shared > memory transport and I didn't want to rock its boat too hard, so the > ioctls that should be supportable for everyone got moved to generic. > > If the implementation of those ioctls in generic code doesn't work for > some drivers (say, early shmfs object c
Re: TTM merging?
On Wed, 2008-05-14 at 10:21 -0700, Keith Whitwell wrote: > Nobody can force you to take one path or the other, but it's certainly > my intention when considering drivers for VRAM hardware to support > single-copy-number textures, and for that reason, I'd be unhappy to > see a system adopted that prevented that. And, GEM on UMA does single-copy texture updates, just as TTM does. From an object management perspective, GEM isn't very different from TTM, it's just that the current code is written for UMA, and no-one has shown code for either of these running on non-UMA hardware. -- [EMAIL PROTECTED]
Re: TTM merging?
On Wed, 2008-05-14 at 19:08 +0200, Jerome Glisse wrote: > I don't have number or benchmark to check how fast pread/pwrite path might > be in this use so i am just expressing my feeling which happen to just be > to avoid vma tlb flush as most as we can. For batch buffers, pwrite is 3X faster than map/write/unmap, at least as measured by that most estimable benchmark 'glxgears'. Take that with as much skepticism as it deserves. -- [EMAIL PROTECTED]
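For anyone who wants to see the shape of the two paths being compared above, here is a minimal sketch in Python. It assumes an ordinary temporary file as a stand-in for a buffer object (the GEM pwrite and mmap ioctls are not used), and the helper names are made up for illustration. It only shows the two access patterns; it makes no attempt to reproduce the 3X figure, which came from real hardware and glxgears.

```python
import mmap
import os
import tempfile

BATCH_SIZE = 4096  # assumption: one page stands in for a small batch buffer


def make_object(size=BATCH_SIZE):
    """Create a zero-filled temporary file that plays the role of a buffer object."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, size)
    return fd, path


def upload_with_pwrite(fd, data, offset=0):
    """Copy data into the object with one pwrite() call; no mapping is created."""
    return os.pwrite(fd, data, offset)


def upload_with_mmap(fd, data, offset=0):
    """Map the object, store through the mapping, then unmap again."""
    size = os.fstat(fd).st_size
    with mmap.mmap(fd, size) as view:            # map
        view[offset:offset + len(data)] = data   # write
    return len(data)                             # unmapped on leaving the with-block


if __name__ == "__main__":
    fd, path = make_object()
    try:
        upload_with_pwrite(fd, b"\x01" * 64)
        upload_with_mmap(fd, b"\x02" * 64, offset=64)
        # Both paths should leave the same kind of result in the object.
        print(os.pread(fd, 128, 0) == b"\x01" * 64 + b"\x02" * 64)
    finally:
        os.close(fd)
        os.unlink(path)
```

Wrapping each helper in a timing loop would give a rough user-space comparison, though on a plain file it says nothing about write-combined mappings or cache flushing.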
Re: TTM merging?
On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote: > My personal feeling is that pwrites are a workaround for a workaround > for a very bad decision Feel free to map VRAM then if you can; I didn't need to on Intel as there isn't any difference. -- [EMAIL PROTECTED]
Re: TTM merging?
On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote: > > The real question is whether TTM suits the driver writers for use in Linux > > desktop and embedded environments, and I think so far I'm not seeing > > enough positive feedback from the desktop side. > > > I actually haven't seen much feedback at all. At least not on the > mailing lists. > Anyway we need to look at the alternatives which currently is GEM. > > GEM, while still in development basically brings us back to the > functionality of TTM 0.1, with added paging support but without > fine-grained locking and caching policy support. > > I might have misunderstood things but quickly browsing the code raises > some obvious questions: > > 1) Some AGP chipsets don't support page addresses > 32bits. GEM objects > use GFP_HIGHUSER, and it's hardcoded into the linux swap code. The obvious solution here is what many DMA APIs do for IOMMUs that can't address all of memory -- keep a pool of pages within the addressable range and bounce data through them. I think the Linux kernel even has interfaces to support us in this. Since it's not going to be a very common case, we may not care about the performance. If we do find that we care about the performance, we should first attempt to get what we need into the linux kernel so we don't have to duplicate code, and only if that fails do the duplication. I'm pretty sure the AGP chipsets versus >32-bits pages danger has been overstated, though. Besides the fact that you needed to load one of these older supposed machines with a full 4GB of memory (well, theoretically 3.5GB but how often can you even boot a system with a 2, 1, .5gb combo?), you also need a chipset that does >32-bit addressing. At least all AMD and Intel chipsets don't appear to have this problem in the survey I did last night, as they've either got >32-bit chipset and >32-bit gart, or 32-bit chipset and 32-bit gart. Basically all I'm worried about is ATI PCI[E]GART at this point. 
http://dri.freedesktop.org/wiki/GARTAddressingLimits > 5) What's protecting i915 GEM object privates and lists in a > multi-threaded environment? Nothing at the moment. That's my current project. dev->struct_mutex is the plan -- I don't want to see finer-grained locking until we show that contention on that locking is an issue. Fine-grained locking takes significant care, and there's a lot more important performance improvements to work on before then. > 6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing > some severe ranting about it on the lkml? We've talked it over with Arjan, and until we can use real fds as our handles to objects, he thought it sounded OK. But apparently Al Viro's working on making it so that allocating a thousand fds would be feasible for us. At that point mmap/pread/pwrite/close ioctls could be replaced with the syscalls they were named for, and the kernel guys love us. > TTM is designed to cope with most hardware quirks I've come across with > different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and > some other ones. GEM basically leaves it up to the driver writer to > reinvent the wheel.. The problem with TTM is that it's designed to expose one general API for all hardware, when that's not what our drivers want. The GPU-GPU cache handling for intel, for example, mapped the hardware so poorly that every batch just flushed everything. Bolting on the clflush-based cpu-gpu caching management for our platform recovered a lot of performance, but we're still having to reuse buffers in userland at a memory cost because allocating buffers is overly expensive for the general supporting-everybody (but oops, it's not swappable!) object allocator. We're trying to come at it from the other direction: Implement one driver well. When someone else implements another driver and finds that there's code that should be common, make it into a support library and share it. 
I actually would have liked the whole interface to userland to be driver-specific with a support library for the parts we think other people would want, but DRI2 wants to use buffer objects for its shared memory transport and I didn't want to rock its boat too hard, so the ioctls that should be supportable for everyone got moved to generic. If the implementation of those ioctls in generic code doesn't work for some drivers (say, early shmfs object creation turns out to be a bad idea for VRAM drivers), I'll happily push it out to the driver. -- Eric Anholt [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
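The bounce-pool idea from the message above (keep a pool of pages inside the addressable range and bounce data through them) can be sketched roughly as follows. This is a toy model in Python, not kernel code: the class, its methods, and the page addresses are all hypothetical, and the actual data copy a driver would perform is only marked by a comment.

```python
ADDR_LIMIT_32BIT = 1 << 32  # first address a 32-bit GART cannot reach


class BouncePool:
    """Toy model of a pool of low pages used to bounce data for a 32-bit device."""

    def __init__(self, low_pages, limit=ADDR_LIMIT_32BIT):
        # low_pages: addresses of spare pages below the limit (hypothetical values)
        if any(addr >= limit for addr in low_pages):
            raise ValueError("pool pages must be device-addressable")
        self.free = list(low_pages)
        self.limit = limit
        self.bounced = {}  # original page address -> bounce page address

    def map_for_device(self, page_addr):
        """Return an address the device can use for this page."""
        if page_addr < self.limit:
            return page_addr               # directly addressable, no copy needed
        if page_addr in self.bounced:
            return self.bounced[page_addr]
        if not self.free:
            raise MemoryError("bounce pool exhausted")
        low = self.free.pop()
        self.bounced[page_addr] = low      # a real driver would copy the data here
        return low

    def unmap(self, page_addr):
        """Give the bounce page back once the device is done with it."""
        low = self.bounced.pop(page_addr, None)
        if low is not None:
            self.free.append(low)


if __name__ == "__main__":
    pool = BouncePool([0x1000, 0x2000])
    print(hex(pool.map_for_device(0x8000)))             # low page, used directly
    print(hex(pool.map_for_device((1 << 32) + 0x3000)))  # high page, bounced
```

The extra copy only happens for pages above the limit, which matches the expectation that this is an uncommon case where performance may not matter much.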
Re: TTM merging?
On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote: > >> 2) Reserving pages when allocating VRAM buffers is also a very bad > >> solution particularly on systems with a lot of VRAM and little system > >> RAM. (Multiple card machines?). GEM basically needs to reserve > >> swap-space when buffers are created, and put a limit on the pinned > >> physical pages. We basically should not be able to fail memory > >> allocation during execbuf, because we cannot recover from that. > >> > > > > Well this solve the suspend problem we were discussing at xds ie what > > to do on buffer. If we know that we have room to put buffer then we > > don't to worry about which buffer we are ready to loose. Given that > > opengl don't give any clue on that this sounds like a good approach. > > > > For embedded device where every piece of ram still matter i guess > > you also have to deal with suspend case so you have a way to either > > save vram content or to preserve it. I don't see any problem with > > gem to cop with this case too. > > > No. Gem can't coop with it. Let's say you have a 512M system with two 1G > video cards, 4G swap space, and you want to fill both card's videoram > with render-and-forget textures for whatever purpose. Who's selling that system? Who's building that system at home? -- Eric Anholt [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
Re: TTM merging?
On Wed, 2008-05-14 at 10:21 -0700, Keith Whitwell wrote: > > - Original Message > > From: Jerome Glisse <[EMAIL PROTECTED]> > > To: Thomas Hellström <[EMAIL PROTECTED]> > > Cc: Dave Airlie <[EMAIL PROTECTED]>; Keith Packard <[EMAIL PROTECTED]>; DRI > > ; Dave Airlie <[EMAIL PROTECTED]> > > Sent: Wednesday, May 14, 2008 6:08:55 PM > > Subject: Re: TTM merging? > > > > On Wed, 14 May 2008 16:36:54 +0200 > > Thomas Hellström wrote: > > > > > Jerome Glisse wrote: > > > I don't agree with you here. EXA is much faster for small composite > > > operations and even small fill blits if fallbacks are used. Even to > > > write-combined memory, but that of course depends on the hardware. This > > > is going to be even more pronounced with acceleration architectures like > > > Glucose and similar, that don't have an optimized path for small > > > hardware composite operations. > > > > > > My personal feeling is that pwrites are a workaround for a workaround > > > for a very bad decision: > > > > > > To avoid user-space allocators on device-mapped memory. This lead to a > > > hack to avoid cahing-policy changes which lead to cache trashing > > > problems which put us in the current situation. How far are we going to > > > follow this path before people wake up? What's wrong with the > > > performance of good old i915tex which even beats "classic" i915 in many > > > cases. > > > > > > Having to go through potentially (and even probably) paged-out memory to > > > access buffers to make that are present in VRAM sounds like a very odd > > > approach (to say the least) to me. Even if it's a single page and > > > implementing per-page dirty checks for domain flushing isn't very > > > appealing either. > > > > I don't have number or benchmark to check how fast pread/pwrite path might > > be in this use so i am just expressing my feeling which happen to just be > > to avoid vma tlb flush as most as we can. 
I got the feeling that kernel > > goes through numerous trick to avoid tlb flushing for a good reason and > > also i am pretty sure that with number of core keeping growing anythings > > that need cpu broad synchronization is to be avoided. > > > > Hopefully once i got decent amount of time to do benchmark with gem i will > > check out my theory. I think simple benchmark can be done on intel hw just > > return false in EXA prepare access to force use of download from screen, > > and in download from screen use pread then comparing benchmark of this > > hacked intel ddx with a normal one should already give some numbers. > > > > > Why should we have to when we can do it right? > > > > Well my point was that mapping vram is not right, i am not saying that > > i know the truth. It's just a feeling based on my experiment with ttm > > and on the bar restriction stuff and others consideration of same kind. > > > > > No. Gem can't coop with it. Let's say you have a 512M system with two 1G > > > video cards, 4G swap space, and you want to fill both card's videoram > > > with render-and-forget textures for whatever purpose. > > > > > > What happens? After you've generated the first say 300M, The system > > > mysteriously starts to page, and when, after a a couple of minutes of > > > crawling texture upload speeds, you're done, The system is using and > > > have written almost 2G of swap. Now, you want to update the textures and > > > expect fast texsubimage... > > > > > > So having a backing object that you have to access to get things into > > > VRAM is not the way to go. > > > The correct way to do this is to reserve, but not use swap space. Then > > > you can start using it on suspend, provided that the swapping system is > > > still up (which is has to be with the current GEM approach anyway). If > > > pwrite is used in this case, it must not dirty any backing object pages. 
> > > > > > > For normal desktop i don't expect VRAM amount > RAM amount, people with > > 1Go VRAM are usually hard gamer with 4G of ram :). Also most object in > > 3d world are stored in memory, if program are not stupid and trust gl > > to keep their texture then you just have the usual ram copy and possibly > > a vram copy, so i don't see any waste in the normal use case. Of course > > we can always come up with cra
Re: TTM merging?
On Wed, 14 May 2008 10:21:15 -0700 (PDT) Keith Whitwell <[EMAIL PROTECTED]> wrote: > > On Wed, 14 May 2008 16:36:54 +0200 > > Thomas Hellström wrote: > > > > > Jerome Glisse wrote: > > > I don't agree with you here. EXA is much faster for small composite > > > operations and even small fill blits if fallbacks are used. Even to > > > write-combined memory, but that of course depends on the hardware. This > > > is going to be even more pronounced with acceleration architectures like > > > Glucose and similar, that don't have an optimized path for small > > > hardware composite operations. > > > > > > My personal feeling is that pwrites are a workaround for a workaround > > > for a very bad decision: > > > > > > To avoid user-space allocators on device-mapped memory. This lead to a > > > hack to avoid cahing-policy changes which lead to cache trashing > > > problems which put us in the current situation. How far are we going to > > > follow this path before people wake up? What's wrong with the > > > performance of good old i915tex which even beats "classic" i915 in many > > > cases. > > > > > > Having to go through potentially (and even probably) paged-out memory to > > > access buffers to make that are present in VRAM sounds like a very odd > > > approach (to say the least) to me. Even if it's a single page and > > > implementing per-page dirty checks for domain flushing isn't very > > > appealing either. > > > > I don't have number or benchmark to check how fast pread/pwrite path might > > be in this use so i am just expressing my feeling which happen to just be > > to avoid vma tlb flush as most as we can. I got the feeling that kernel > > goes through numerous trick to avoid tlb flushing for a good reason and > > also i am pretty sure that with number of core keeping growing anythings > > that need cpu broad synchronization is to be avoided. > > > > Hopefully once i got decent amount of time to do benchmark with gem i will > > check out my theory. 
I think simple benchmark can be done on intel hw just > > return false in EXA prepare access to force use of download from screen, > > and in download from screen use pread then comparing benchmark of this > > hacked intel ddx with a normal one should already give some numbers. > > > > > Why should we have to when we can do it right? > > > > Well my point was that mapping vram is not right, i am not saying that > > i know the truth. It's just a feeling based on my experiment with ttm > > and on the bar restriction stuff and others consideration of same kind. > > > > > No. Gem can't coop with it. Let's say you have a 512M system with two 1G > > > video cards, 4G swap space, and you want to fill both card's videoram > > > with render-and-forget textures for whatever purpose. > > > > > > What happens? After you've generated the first say 300M, The system > > > mysteriously starts to page, and when, after a a couple of minutes of > > > crawling texture upload speeds, you're done, The system is using and > > > have written almost 2G of swap. Now, you want to update the textures and > > > expect fast texsubimage... > > > > > > So having a backing object that you have to access to get things into > > > VRAM is not the way to go. > > > The correct way to do this is to reserve, but not use swap space. Then > > > you can start using it on suspend, provided that the swapping system is > > > still up (which is has to be with the current GEM approach anyway). If > > > pwrite is used in this case, it must not dirty any backing object pages. > > > > > > > For normal desktop i don't expect VRAM amount > RAM amount, people with > > 1Go VRAM are usually hard gamer with 4G of ram :). Also most object in > > 3d world are stored in memory, if program are not stupid and trust gl > > to keep their texture then you just have the usual ram copy and possibly > > a vram copy, so i don't see any waste in the normal use case. 
Of course > > we can always come up with crazy weird setup, but i am more interested > > in dealing well with average Joe than dealing mostly well with every > > use case. > > It's always been a big win to go to single-copy texturing. Textures tend to > be large and nobody has so much memory that doubling up on textures has ever > been appealing... And there are obvious use-cases like textured video where > only having a single copy is a big performance. > > It certainly makes things easier for the driver to duplicate textures -- > which is why all the old DRI drivers did it -- but it doesn't make it > right... And the old DRI drivers also copped out on things like > render-to-texture, etc, so whatever gains you make in simplicity by treating > VRAM as a cache, some of those will be lost because you'll have to keep track > of which one of the two copies of a texture is up-to-date, and you'll still > have to preserve (modified) texture contents on eviction, which old DRI never > had to. > > Ultimately it boils do
Re: TTM merging?
- Original Message > From: Jerome Glisse <[EMAIL PROTECTED]> > To: Thomas Hellström <[EMAIL PROTECTED]> > Cc: Dave Airlie <[EMAIL PROTECTED]>; Keith Packard <[EMAIL PROTECTED]>; DRI > ; Dave Airlie <[EMAIL PROTECTED]> > Sent: Wednesday, May 14, 2008 6:08:55 PM > Subject: Re: TTM merging? > > On Wed, 14 May 2008 16:36:54 +0200 > Thomas Hellström wrote: > > > Jerome Glisse wrote: > > I don't agree with you here. EXA is much faster for small composite > > operations and even small fill blits if fallbacks are used. Even to > > write-combined memory, but that of course depends on the hardware. This > > is going to be even more pronounced with acceleration architectures like > > Glucose and similar, that don't have an optimized path for small > > hardware composite operations. > > > > My personal feeling is that pwrites are a workaround for a workaround > > for a very bad decision: > > > > To avoid user-space allocators on device-mapped memory. This lead to a > > hack to avoid cahing-policy changes which lead to cache trashing > > problems which put us in the current situation. How far are we going to > > follow this path before people wake up? What's wrong with the > > performance of good old i915tex which even beats "classic" i915 in many > > cases. > > > > Having to go through potentially (and even probably) paged-out memory to > > access buffers to make that are present in VRAM sounds like a very odd > > approach (to say the least) to me. Even if it's a single page and > > implementing per-page dirty checks for domain flushing isn't very > > appealing either. > > I don't have number or benchmark to check how fast pread/pwrite path might > be in this use so i am just expressing my feeling which happen to just be > to avoid vma tlb flush as most as we can. 
I got the feeling that kernel > goes through numerous trick to avoid tlb flushing for a good reason and > also i am pretty sure that with number of core keeping growing anythings > that need cpu broad synchronization is to be avoided. > > Hopefully once i got decent amount of time to do benchmark with gem i will > check out my theory. I think simple benchmark can be done on intel hw just > return false in EXA prepare access to force use of download from screen, > and in download from screen use pread then comparing benchmark of this > hacked intel ddx with a normal one should already give some numbers. > > > Why should we have to when we can do it right? > > Well my point was that mapping vram is not right, i am not saying that > i know the truth. It's just a feeling based on my experiment with ttm > and on the bar restriction stuff and others consideration of same kind. > > > No. Gem can't coop with it. Let's say you have a 512M system with two 1G > > video cards, 4G swap space, and you want to fill both card's videoram > > with render-and-forget textures for whatever purpose. > > > > What happens? After you've generated the first say 300M, The system > > mysteriously starts to page, and when, after a a couple of minutes of > > crawling texture upload speeds, you're done, The system is using and > > have written almost 2G of swap. Now, you want to update the textures and > > expect fast texsubimage... > > > > So having a backing object that you have to access to get things into > > VRAM is not the way to go. > > The correct way to do this is to reserve, but not use swap space. Then > > you can start using it on suspend, provided that the swapping system is > > still up (which is has to be with the current GEM approach anyway). If > > pwrite is used in this case, it must not dirty any backing object pages. > > > > For normal desktop i don't expect VRAM amount > RAM amount, people with > 1Go VRAM are usually hard gamer with 4G of ram :). 
Also most object in > 3d world are stored in memory, if program are not stupid and trust gl > to keep their texture then you just have the usual ram copy and possibly > a vram copy, so i don't see any waste in the normal use case. Of course > we can always come up with crazy weird setup, but i am more interested > in dealing well with average Joe than dealing mostly well with every > use case. It's always been a big win to go to single-copy texturing. Textures tend to be large and nobody has so much memory that doubling up on textures has ever been appealing... And there are obvious use-cases like textured video where only having a single copy is a big performance. It certainly makes things easi
Re: TTM merging?
On Wed, 14 May 2008 16:36:54 +0200 Thomas Hellström <[EMAIL PROTECTED]> wrote: > Jerome Glisse wrote: > I don't agree with you here. EXA is much faster for small composite > operations and even small fill blits if fallbacks are used. Even to > write-combined memory, but that of course depends on the hardware. This > is going to be even more pronounced with acceleration architectures like > Glucose and similar, that don't have an optimized path for small > hardware composite operations. > > My personal feeling is that pwrites are a workaround for a workaround > for a very bad decision: > > To avoid user-space allocators on device-mapped memory. This lead to a > hack to avoid cahing-policy changes which lead to cache trashing > problems which put us in the current situation. How far are we going to > follow this path before people wake up? What's wrong with the > performance of good old i915tex which even beats "classic" i915 in many > cases. > > Having to go through potentially (and even probably) paged-out memory to > access buffers to make that are present in VRAM sounds like a very odd > approach (to say the least) to me. Even if it's a single page and > implementing per-page dirty checks for domain flushing isn't very > appealing either. I don't have numbers or benchmarks to check how fast the pread/pwrite path might be in this use, so I am just expressing my feeling, which is simply to avoid VMA TLB flushes as much as we can. I get the feeling that the kernel goes through numerous tricks to avoid TLB flushing for a good reason, and I am also pretty sure that, with the number of cores continuing to grow, anything that needs CPU-wide synchronization is to be avoided. Hopefully, once I get a decent amount of time to benchmark GEM, I will check my theory.
I think a simple benchmark can be done on Intel hardware: just return false in EXA prepare access to force the use of download from screen, and use pread in download from screen. Comparing benchmarks of this hacked Intel DDX against a normal one should already give some numbers. > Why should we have to when we can do it right? Well, my point was that mapping VRAM is not right; I am not saying that I know the truth. It's just a feeling based on my experiments with TTM, on the BAR restriction stuff, and on other considerations of the same kind. > No. Gem can't coop with it. Let's say you have a 512M system with two 1G > video cards, 4G swap space, and you want to fill both card's videoram > with render-and-forget textures for whatever purpose. > > What happens? After you've generated the first say 300M, The system > mysteriously starts to page, and when, after a a couple of minutes of > crawling texture upload speeds, you're done, The system is using and > have written almost 2G of swap. Now, you want to update the textures and > expect fast texsubimage... > > So having a backing object that you have to access to get things into > VRAM is not the way to go. > The correct way to do this is to reserve, but not use swap space. Then > you can start using it on suspend, provided that the swapping system is > still up (which is has to be with the current GEM approach anyway). If > pwrite is used in this case, it must not dirty any backing object pages. > For a normal desktop I don't expect the amount of VRAM to exceed the amount of RAM; people with 1 GB of VRAM are usually hardcore gamers with 4 GB of RAM :). Also, most objects in the 3D world are stored in memory: if programs are not stupid and trust GL to keep their textures, then you just have the usual RAM copy and possibly a VRAM copy, so I don't see any waste in the normal use case. Of course we can always come up with a crazy, weird setup, but I am more interested in dealing well with the average Joe than in dealing mostly well with every use case.
That said, I do see GPGPU as a possible user of big temporary VRAM buffers, i.e. buffers you can throw away. For that kind of use it does make sense not to have a backing RAM/swap area. But I would rather add something to GEM, such as intercepting the allocation of such a buffer and not creating a backing buffer, or adding a driver-specific ioctl for that case. Anyway, I think we need benchmarks to know which option is really best in the end. I don't have code to support my general feeling, so I might be wrong. Sadly we don't have 2^32 monkeys coding day and night for drm to test all solutions :) Cheers, Jerome Glisse <[EMAIL PROTECTED]>
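The arithmetic behind the quoted 512M scenario, and behind the normal-desktop counterexample, can be written down as a small worked example. This is only a back-of-the-envelope sketch: the function names are made up, all sizes are in megabytes, and the amount of RAM usable before paging starts is an assumption (the quoted scenario says "say 300M").

```python
def backing_store_spill(vram_per_card, cards, ram_usable):
    """Return (backing_total, spilled) when every VRAM object also has a RAM/swap copy.

    backing_total: size of the backing store needed to shadow all VRAM.
    spilled: the part of it that cannot stay in RAM and ends up in swap.
    """
    backing_total = vram_per_card * cards
    spilled = max(0, backing_total - ram_usable)
    return backing_total, spilled


def fits_in_swap(spilled, swap_size):
    """True if the spilled backing store still fits in the configured swap."""
    return spilled <= swap_size


if __name__ == "__main__":
    # Quoted scenario: two 1G cards, about 300M of RAM usable, 4G of swap.
    total, spilled = backing_store_spill(1024, 2, 300)
    print(total, spilled, fits_in_swap(spilled, 4096))
    # Normal desktop case: one 1G card, 4G of RAM assumed fully usable.
    print(backing_store_spill(1024, 1, 4096))
```

With the quoted numbers this gives 2048M of backing store and 1748M pushed to swap, in line with the "almost 2G" in the scenario; with one 1G card and 4G of RAM nothing spills, which is the normal case described above.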
Re: TTM merging?
On Wed, 2008-05-14 at 12:09 +0200, Thomas Hellström wrote: > 1) The inability to map device memory. The design arguments and proposed > solution for VRAM are not really valid. Think of this, probably not too > uncommon, scenario of a single pixel fallback composite to a scanout > buffer in vram. Or a texture or video frame upload: Nothing prevents you from mapping device memory; it's just that on a UMA device, there's no difference, and some significant advantages to using the direct mapping. I wrote the API I needed for my device; I think it's simple enough that other devices can add the APIs they need. But, what we've learned in the last few months is that mapping *any* pages into user space is a last-resort mechanism. Mapping pages WC or UC requires inter-processor interrupts, and using normal WB pages means invoking clflush on regions written from user space. The glxgears "benchmark" demonstrates this with some clarity -- using pwrite to send batch buffers is nearly three times faster (888 fps using pwrite vs 300 fps using mmap) than mapping pages to user space and then clflush'ing them in the kernel. > A) Page in all GEM pages, because they've been paged out. > B) Copy the complete scanout buffer to GEM because it's dirty. Untile. > C) Write the pixel. > D) Copy the complete buffer back while tiling. First off, I don't care about fallbacks; any driver using fallbacks is broken. Second, if you had to care about fallbacks on non-UMA hardware, you'd compute the pages necessary for the fallback and only map/copy those anyway. > 2) Reserving pages when allocating VRAM buffers is also a very bad > solution particularly on systems with a lot of VRAM and little system > RAM. (Multiple card machines?). GEM basically needs to reserve > swap-space when buffers are created, and put a limit on the pinned > physical pages. We basically should not be able to fail memory > allocation during execbuf, because we cannot recover from that. 
As far as I know, any device using VRAM will not save it across suspend/resume. From my perspective, this means you don't get a choice about allocating backing store for that data. Because GEM has backing store, we can limit pinned memory to only those pages needed for the current operation, waiting to pin pages until the device is ready to execute the operation. As I said in my earlier email, that part of the kernel driver is not written yet. I was hoping to get that finished before launching into this discussion as it is always better to argue with running code. > This means that the dependency on SHMEMFS propably needs to be dropped > and replaced with some sort of DRMFS that allows overloading of mmap and > a correct swap handling, address the caching issue and also avoids the > driver do_mmap(). Because GEM doesn't expose the use of shmfs to the user, there's no requirement that all objects use this abstraction. You could even have multiple object creation functions if that made sense in your driver. -- [EMAIL PROTECTED]
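The point above about computing only the pages a fallback needs can be illustrated with a short sketch. It assumes a linear (untiled) surface and 4096-byte pages; a tiled layout would need a different address calculation. The function name and parameters are hypothetical, not part of any driver API.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_for_rect(x, y, width, height, pitch, cpp, page_size=PAGE_SIZE):
    """Return the sorted page indices touched by a rectangle in a linear surface.

    x, y, width, height are in pixels; pitch is the row stride in bytes;
    cpp is the number of bytes per pixel.
    """
    pages = set()
    for row in range(y, y + height):
        first = row * pitch + x * cpp        # first byte touched on this row
        last = first + width * cpp - 1       # last byte touched on this row
        pages.update(range(first // page_size, last // page_size + 1))
    return sorted(pages)


if __name__ == "__main__":
    # Single-pixel fallback on a 1024-pixel-wide, 32-bit surface.
    touched = pages_for_rect(10, 20, 1, 1, pitch=4096, cpp=4)
    print(touched, len(touched) * PAGE_SIZE)
```

For the single-pixel case this touches one page, so only 4096 bytes would need to be mapped or copied rather than the whole buffer (3 MiB for a hypothetical 1024x768 surface at 4 bytes per pixel).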
Re: TTM merging?
Ben Skeggs wrote: 1) I feel there hasn't been enough open driver coverage to prove it. So far we have done an Intel IGD, we have a lot of code that isn't required for these devices, so the question of how much code exists purely to support poulsbo closed source userspace there is and why we need to live with it. Both radeon and nouveau developers have expressed frustration about the fencing internals being really hard to work with which doesn't bode well for maintainability in the future. >>> OK. So basically what I'm asking is that when we have full-feathered open >>> source drivers available that >>> utilize TTM, either as part of DRM core, or, if needed, as part of >>> driver-specific code, do you see anything >>> else that prevents that from being pushed? That would be very valuable to >>> know >>> for anyone starting porting work. ? >>> >> I was hoping that by now, one of the radeon or nouveau drivers would have >> adopted TTM, or at least demoed something working using it, this hasn't >> happened which worries me, perhaps glisse or darktama could fill in on >> what limited them from doing it. The fencing internals are very very scary >> and seem to be a major stumbling block. >> > The fencing internals do seem overly complicated indeed, but that's > something that I'm personally OK with taking the time to figure out how > to get right. Is there any good documentation around that describes it > in detail? > Yes, there is a wiki page. http://dri.freedesktop.org/wiki/TTMFencing > I actually started working on nouveau/ttm again a month or so back, with > the intention of actually having the work land this time. Overall, I > don't have much problem with TTM and would be willing to work with it. > Supporting G8x/G9x chips was the reason the work's stalled again, I > wasn't sure at the time what requirements we'd have from a memory > manager. 
>
> The issue on G8x is that the 3D engine will refuse to render to linear
> surfaces, and in order to setup tiling we need to make use of a
> channel's page tables. The driver doesn't get any control when VRAM is
> allocated so that it can setup the page tables appropriately etc. I
> just had a thought that the driver-specific validation ioctl could
> probably handle that at the last minute, so perhaps that's also not an
> issue. I'll look more into G8x/ttm after I finish my current G8x work.
>
> Another minor issue (probably doesn't effect merging?): Nouveau makes
> extensive use fence classes, we assign 1 fence class to each GPU channel
> (read: context + command submission mechanism). We have 128 of these on
> G80 cards, the current _DRM_FENCE_CLASSES is 8 which is insufficient
> even for NV1x hardware.

Ouch. Yes, it should be OK to bump that as long as kmalloc doesn't complain.

> So overall, I'm basically fine with TTM now that I've actually made a
> proper attempt at using it.. GEM does seem interesting, I'll also
> follow its development while I continue with other non-mm G80 work.
>
> Cheers,
> Ben.

Nice to know, Ben. Anyway, whatever happens, the fencing code will remain for some drivers, either device-specific or common, so if you find ways to simplify it or things that don't look right, please let me know.

/Thomas
Re: TTM merging?
Jerome Glisse wrote: > On Wed, 14 May 2008 12:09:06 +0200 > Thomas Hellström <[EMAIL PROTECTED]> wrote: > > >> Jerome Glisse wrote: >> Jerome, Dave, Keith >> >> >> 1) The inability to map device memory. The design arguments and proposed >> solution for VRAM are not really valid. Think of this, probably not too >> uncommon, scenario of a single pixel fallback composite to a scanout >> buffer in vram. Or a texture or video frame upload: >> >> A) Page in all GEM pages, because they've been paged out. >> B) Copy the complete scanout buffer to GEM because it's dirty. Untile. >> C) Write the pixel. >> D) Copy the complete buffer back while tiling. >> > > With pwrite/pread you give offset and size of things you are interested in. > So for single pixel case it will pread a page and pwrite it once fallback > finished. I totaly agree that dowloading whole object on fallback is to be > avoided. But as long as we don't have a fallback which draws the whole > screen then we are fine, and as anyway such fallback will be desastrous > wether we map vram or not lead me to discard this drawback and just > accept pain for such fallback. > > I don't agree with you here. EXA is much faster for small composite operations and even small fill blits if fallbacks are used. Even to write-combined memory, but that of course depends on the hardware. This is going to be even more pronounced with acceleration architectures like Glucose and similar, that don't have an optimized path for small hardware composite operations. My personal feeling is that pwrites are a workaround for a workaround for a very bad decision: To avoid user-space allocators on device-mapped memory. This lead to a hack to avoid cahing-policy changes which lead to cache trashing problems which put us in the current situation. How far are we going to follow this path before people wake up? What's wrong with the performance of good old i915tex which even beats "classic" i915 in many cases. 
Having to go through potentially (and even probably) paged-out memory to access buffers that are present in VRAM sounds like a very odd approach (to say the least) to me. Even if it's a single page, and implementing per-page dirty checks for domain flushing isn't very appealing either.

> Also i am confident that we can find a more clever way in such case.
> Like doing the whole rendering in ram and updating the final result
> so assuming that the up to date copy is in ram and that vram might
> be out of sync.

Why should we have to when we can do it right?

>> 2) Reserving pages when allocating VRAM buffers is also a very bad
>> solution particularly on systems with a lot of VRAM and little system
>> RAM. (Multiple card machines?). GEM basically needs to reserve
>> swap-space when buffers are created, and put a limit on the pinned
>> physical pages. We basically should not be able to fail memory
>> allocation during execbuf, because we cannot recover from that.
>
> Well this solve the suspend problem we were discussing at xds ie what
> to do on buffer. If we know that we have room to put buffer then we
> don't to worry about which buffer we are ready to loose. Given that
> opengl don't give any clue on that this sounds like a good approach.
>
> For embedded device where every piece of ram still matter i guess
> you also have to deal with suspend case so you have a way to either
> save vram content or to preserve it. I don't see any problem with
> gem to cop with this case too.

No. GEM can't cope with it. Let's say you have a 512M system with two 1G video cards, 4G swap space, and you want to fill both cards' video RAM with render-and-forget textures for whatever purpose.

What happens? After you've generated the first, say, 300M, the system mysteriously starts to page, and when, after a couple of minutes of crawling texture upload speeds, you're done, the system is using and has written almost 2G of swap.
Now, you want to update the textures and expect fast texsubimage... So having a backing object that you have to access to get things into VRAM is not the way to go. The correct way to do this is to reserve, but not use swap space. Then you can start using it on suspend, provided that the swapping system is still up (which is has to be with the current GEM approach anyway). If pwrite is used in this case, it must not dirty any backing object pages. /Thomas > >> Other things like GFP_HIGHUSER etc are probably fixable if there is a >> will to do it. >> >> So if GEM is the future, these shortcomings must IMHO be addressed. In >> particular GEM should not stop people from mapping device memory >> directly. Particularly not in the view of the arguments against TTM >> previously outlined. >> > > As i said i have come to the opinion that not mapping vram in userspace > vma sounds like a good plan. I am even thinking that avoiding all mapping > and encourage pread/pwrite is a better solution
Re: TTM merging?
On Wed, 14 May 2008 12:09:06 +0200 Thomas Hellström <[EMAIL PROTECTED]> wrote:

> Jerome Glisse wrote:
> Jerome, Dave, Keith
>
> 1) The inability to map device memory. The design arguments and proposed
> solution for VRAM are not really valid. Think of this, probably not too
> uncommon, scenario of a single pixel fallback composite to a scanout
> buffer in vram. Or a texture or video frame upload:
>
> A) Page in all GEM pages, because they've been paged out.
> B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
> C) Write the pixel.
> D) Copy the complete buffer back while tiling.

With pwrite/pread you give the offset and size of the things you are interested in. So for the single pixel case it will pread a page and pwrite it once the fallback has finished. I totally agree that downloading the whole object on fallback is to be avoided. But as long as we don't have a fallback which draws the whole screen we are fine, and since such a fallback will be disastrous whether we map vram or not, I am led to discard this drawback and just accept the pain for such fallbacks.

Also I am confident that we can find a more clever way in such cases. Like doing the whole rendering in ram and updating the final result, so assuming that the up-to-date copy is in ram and that vram might be out of sync.

> 2) Reserving pages when allocating VRAM buffers is also a very bad
> solution particularly on systems with a lot of VRAM and little system
> RAM. (Multiple card machines?). GEM basically needs to reserve
> swap-space when buffers are created, and put a limit on the pinned
> physical pages. We basically should not be able to fail memory
> allocation during execbuf, because we cannot recover from that.

Well, this solves the suspend problem we were discussing at XDS, i.e. what to do with the buffers. If we know that we have room to put the buffers, then we don't have to worry about which buffers we are ready to lose. Given that OpenGL doesn't give any clue on that, this sounds like a good approach.

For embedded devices, where every piece of ram still matters, I guess you also have to deal with the suspend case, so you have a way to either save the vram content or to preserve it. I don't see any problem for GEM to cope with this case too.

> Other things like GFP_HIGHUSER etc are probably fixable if there is a
> will to do it.
>
> So if GEM is the future, these shortcomings must IMHO be addressed. In
> particular GEM should not stop people from mapping device memory
> directly. Particularly not in the view of the arguments against TTM
> previously outlined.

As I said, I have come to the opinion that not mapping vram in a userspace vma sounds like a good plan. I am even thinking that avoiding all mapping and encouraging pread/pwrite is a better solution. For me, vram is temporary storage that card makers use to speed up their hw, and so it should not be directly used by userspace. Note that this does not go against having user space choose the policy for vram usage, i.e. which object to put where.

Cheers,
Jerome Glisse
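The single-pixel fallback described above (pread one page, patch it, pwrite it back) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the POSIX pread()/pwrite() calls on an ordinary scratch file as a stand-in for a buffer object, whereas the GEM interface under discussion would go through the DRM driver. The helper names `open_scratch_object` and `write_pixel_via_page` and the fixed 4096-byte page size are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define FALLBACK_PAGE_SIZE 4096  /* assumed page size for this sketch */

/* Hypothetical helper: create an unlinked scratch file of `size` bytes that
 * stands in for a buffer object. Returns a file descriptor, or -1 on error. */
int open_scratch_object(off_t size)
{
    char path[] = "/tmp/bo-sketch-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                      /* file lives only as long as the fd */
    if (ftruncate(fd, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: update one 32-bit pixel at byte offset pixel_off by
 * reading only the enclosing page, patching it, and writing that page back.
 * Returns 0 on success, -1 on a bad offset or a failed/short transfer. */
int write_pixel_via_page(int fd, off_t pixel_off, uint32_t value)
{
    unsigned char page[FALLBACK_PAGE_SIZE];
    off_t page_off;
    size_t in_page;

    /* Require pixel alignment so the pixel never straddles two pages. */
    if (pixel_off < 0 || pixel_off % (off_t)sizeof(value) != 0)
        return -1;

    page_off = pixel_off - (pixel_off % FALLBACK_PAGE_SIZE);
    in_page = (size_t)(pixel_off - page_off);
    assert(in_page + sizeof(value) <= sizeof(page));

    /* "pread a page": fetch only the page that contains the pixel. */
    if (pread(fd, page, sizeof(page), page_off) != (ssize_t)sizeof(page))
        return -1;

    memcpy(page + in_page, &value, sizeof(value));

    /* "pwrite it once the fallback has finished": push the same page back. */
    if (pwrite(fd, page, sizeof(page), page_off) != (ssize_t)sizeof(page))
        return -1;
    return 0;
}
```

A caller would create a three-page object with `open_scratch_object(3 * FALLBACK_PAGE_SIZE)` and then call `write_pixel_via_page(fd, FALLBACK_PAGE_SIZE + 16, 0xdeadbeef)`; only the second page is transferred in either direction, which is the point being argued. Whether that transfer is cheap when the backing pages are themselves paged out is exactly what the replies dispute.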
Re: TTM merging?
> > > 1) I feel there hasn't been enough open driver coverage to prove it. So > > > far > > > we have done an Intel IGD, we have a lot of code that isn't required for > > > these devices, so the question of how much code exists purely to support > > > poulsbo closed source userspace there is and why we need to live with it. > > > Both radeon and nouveau developers have expressed frustration about the > > > fencing internals being really hard to work with which doesn't bode well > > > for > > > maintainability in the future. > > > > > OK. So basically what I'm asking is that when we have full-feathered open > > source drivers available that > > utilize TTM, either as part of DRM core, or, if needed, as part of > > driver-specific code, do you see anything > > else that prevents that from being pushed? That would be very valuable to > > know > > for anyone starting porting work. ? > > I was hoping that by now, one of the radeon or nouveau drivers would have > adopted TTM, or at least demoed something working using it, this hasn't > happened which worries me, perhaps glisse or darktama could fill in on > what limited them from doing it. The fencing internals are very very scary > and seem to be a major stumbling block. The fencing internals do seem overly complicated indeed, but that's something that I'm personally OK with taking the time to figure out how to get right. Is there any good documentation around that describes it in detail? I actually started working on nouveau/ttm again a month or so back, with the intention of actually having the work land this time. Overall, I don't have much problem with TTM and would be willing to work with it. Supporting G8x/G9x chips was the reason the work's stalled again, I wasn't sure at the time what requirements we'd have from a memory manager. The issue on G8x is that the 3D engine will refuse to render to linear surfaces, and in order to setup tiling we need to make use of a channel's page tables. 
The driver doesn't get any control when VRAM is allocated so that it can setup the page tables appropriately etc. I just had a thought that the driver-specific validation ioctl could probably handle that at the last minute, so perhaps that's also not an issue. I'll look more into G8x/ttm after I finish my current G8x work.

Another minor issue (probably doesn't affect merging?): Nouveau makes extensive use of fence classes, we assign 1 fence class to each GPU channel (read: context + command submission mechanism). We have 128 of these on G80 cards, the current _DRM_FENCE_CLASSES is 8 which is insufficient even for NV1x hardware.

So overall, I'm basically fine with TTM now that I've actually made a proper attempt at using it. GEM does seem interesting, I'll also follow its development while I continue with other non-mm G80 work.

Cheers,
Ben.

> I do worry that TTM is not Linux enough, it seems you have decided that we
> can never do in-kernel allocations at any useable speed and punted the
> work into userspace, which makes life easier for Gallium as its more like
> what Windows does, but I'm not sure this is a good solution for Linux.
>
> The real question is whether TTM suits the driver writers for use in Linux
> desktop and embedded environments, and I think so far I'm not seeing
> enough positive feedback from the desktop side.
>
> Also wrt the i915 driver it has too many experiments in it, the i915 users
> need to group together and remove the codepaths that make no sense and
> come up with a suitable userspace driver for it, remove all unused
> fencing mechanisms etc..
>
> Dave.
>
> > /Thomas
Re: TTM merging?
Jerome Glisse wrote: > On Tue, 13 May 2008 21:35:16 +0100 (IST) > Dave Airlie <[EMAIL PROTECTED]> wrote: > > >> 1) I feel there hasn't been enough open driver coverage to prove it. So >> far we have done an Intel IGD, we have a lot of code that isn't required >> for these devices, so the question of how much code exists purely to >> support poulsbo closed source userspace there is and why we need to live >> with it. Both radeon and nouveau developers have expressed frustration >> about the fencing internals being really hard to work with which doesn't >> bode well for maintainability in the future. >> > > Well my ttm experiment bring me up to EXA with radeon, i also done several > small 3d test to see how i want to send command. So from my experiments here > are the things that are becoming painfull for me. > > On some radeon hw (most of newer card with big amount of ram) you can't > map vram beyond aperture, well you can be you need to reprogram card > aperture and it's not somethings you want to do. TTM assumption is that > memory access are done through map of the buffer and so in this situation > this become cumberstone. We already discussed this and the idea was to > split vram but i don't like this solution. So in the end i am more and > more convinced that we should avoid object mapping in vma of client i see > 2 advantages to this : no tlb flush on vma, no hard to solve page maping > aliasing. > > On fence side i hoped that i could have reasonable code using IRQ working > reliably but after discussion with AMD what i was doing was obviously not > recommanded and prone to hard GPU lockup which is no go for me. The last > solution i have in mind about synchronization ie knowing when gpu is done > with a buffer could not use IRQ at least not on all hw i am interesed in > (r3xx/r4xx). Of course i don't want to busy wait for knowing when GPU is > done. 
Also fence code put too much assumption on what we should provide, > while fencing might prove usefull, i think it can be more well served by > driver specific ioctl than by a common infrastructure where hw obviously > doesn't fit well in the scheme due to their differences. > > And like Stephane, i think virtual memory from GPU stuff can't be used > at its best in this scheme. > > That said, i share also some concern on GEM like the high memory page but > i think this one is workable with help of kernel people. For vram the > solution discussed so far and which i like is to have driver choose > based on client request on which object to put their and to see vram as > a cache. So we will have all object backed by a ram copy (which can be > swapped) then it's all a matter on syncing vram copy & ram copy when > necessary. Domain & pread/pwrite access let you easily do this sync > only on the necessary area. Also for suspend becomes easier just sync > object where write domain is GPU. So all in all i agree that GEM might > ask each driver to redo some stuff but i think a large set of helper > function can leverage this, but more importantly i see this as freedom > for each driver and the only way to cope with hw differences. > > Cheers, > Jerome Glisse <[EMAIL PROTECTED]> > Jerome, Dave, Keith It's hard to argue against people trying things out and finding it's not really what they want, so I'm not going to do that. The biggest argument (apart from the fencing) seems to be that people thinks TTM stops them from doing what they want with the hardware, although it seems like the Nouveau needs and Intel UMA needs are quite opposite. In an open-source community where people work on things because they want to, not being able to do what you want to is a bad thing, OTOH a stall and disagreement about what's the best thing to use is even worse. It confuses the users and it's particularly bad for people trying to write drivers on a commercial basis. 
I've looked through KeithP's mail to look for a way to use GEM for future development. Since many things will be device-dependent I think it's possible for us to work around some issues I see, but a couple of big things remain.

1) The inability to map device memory. The design arguments and proposed solution for VRAM are not really valid. Think of this, probably not too uncommon, scenario of a single pixel fallback composite to a scanout buffer in vram. Or a texture or video frame upload:

A) Page in all GEM pages, because they've been paged out.
B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
C) Write the pixel.
D) Copy the complete buffer back while tiling.

2) Reserving pages when allocating VRAM buffers is also a very bad solution particularly on systems with a lot of VRAM and little system RAM. (Multiple card machines?). GEM basically needs to reserve swap-space when buffers are created, and put a limit on the pinned physical pages. We basically should not be able to fail memory allocation during execbuf, because we cannot recover from that.

Other
Re: TTM merging?
> I do worry that TTM is not Linux enough, it seems you have decided that we
> can never do in-kernel allocations at any useable speed and punted the
> work into userspace, which makes life easier for Gallium as its more like
> what Windows does, but I'm not sure this is a good solution for Linux.

I have no idea where this set of ideas comes from, and it's a little disturbing to me.

On a couple of levels, it's clearly bogus.

Firstly, TTM and its libdrm interfaces predate gallium by years.

Secondly, the windows work we've done with gallium to date has been on XP and _entirely_ in kernel space, so the whole issue of user/kernel allocation strategies never came up.

Thirdly, Gallium's backend interfaces are all about abstracting away from the OS, so that drivers can be picked up and dumped down in multiple places. It's ludicrous to suggest that the act of abstracting away from TTM has in itself skewed TTM -- the point is that the driver has been made independent of TTM. The point of Gallium is that it should work on top of *anything* -- if we had had to skew TTM in some way to achieve that, then we would have already failed right at the starting point...

Lastly, and most importantly, I believe that using TTM kernel allocations to back a user space sub-allocator *is the right strategy*.

This has nothing to do with Gallium. No matter how fast you make a kernel allocator (and I applaud efforts to make it fast), it is always going to be quicker to do allocations locally. This is the reason we have malloc() and not just mmap() or brk/sbrk.

Also, sub-allocation doesn't imply massive preallocation. That bug is well fixed by Thomas' user-space slab allocator code.

Keith
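The sub-allocation argument above can be illustrated with a small fixed-size slab: one large backing allocation is obtained once, and every later allocation and free is a local stack operation with no further trip to the underlying allocator. This is a sketch only, not Thomas' user-space slab allocator; malloc() stands in for the kernel buffer allocation, and all names (`struct slab`, `slab_init`, `slab_alloc`, `slab_free`, `slab_destroy`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One slab: a single large allocation (standing in for a kernel buffer
 * object) carved into equally sized slots that are handed out locally. */
struct slab {
    unsigned char *base;  /* start of the backing allocation */
    size_t slot_size;     /* size of each slot in bytes */
    size_t nslots;        /* total number of slots */
    size_t *free_stack;   /* stack of free slot indices */
    size_t nfree;         /* number of entries on free_stack */
};

/* Set up the slab. This is the only place the expensive allocator is used.
 * Returns 0 on success, -1 on bad arguments or allocation failure. */
int slab_init(struct slab *s, size_t slot_size, size_t nslots)
{
    size_t i;

    if (slot_size == 0 || nslots == 0 ||
        slot_size > (size_t)-1 / nslots ||
        nslots > (size_t)-1 / sizeof(*s->free_stack))
        return -1;

    s->base = malloc(slot_size * nslots);
    s->free_stack = malloc(nslots * sizeof(*s->free_stack));
    if (!s->base || !s->free_stack) {
        free(s->base);
        free(s->free_stack);
        return -1;
    }
    s->slot_size = slot_size;
    s->nslots = nslots;
    s->nfree = nslots;
    /* Store indices in reverse so that slot 0 is popped first. */
    for (i = 0; i < nslots; i++)
        s->free_stack[i] = nslots - 1 - i;
    return 0;
}

/* Pop a free slot; returns NULL when the slab is exhausted. */
void *slab_alloc(struct slab *s)
{
    if (s->nfree == 0)
        return NULL;
    return s->base + s->free_stack[--s->nfree] * s->slot_size;
}

/* Push a slot back; p must have come from slab_alloc() on the same slab. */
void slab_free(struct slab *s, void *p)
{
    size_t off = (size_t)((unsigned char *)p - s->base);

    assert(off % s->slot_size == 0 && off / s->slot_size < s->nslots);
    assert(s->nfree < s->nslots);
    s->free_stack[s->nfree++] = off / s->slot_size;
}

/* Release the backing allocation and the free-index stack. */
void slab_destroy(struct slab *s)
{
    free(s->base);
    free(s->free_stack);
}
```

The slab here is sized by the caller, so a small `nslots` keeps the up-front cost low, in line with the remark that sub-allocation need not mean massive preallocation. A real allocator would add further slabs on demand instead of returning NULL.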
Re: TTM merging?
On Tue, 13 May 2008 21:35:16 +0100 (IST) Dave Airlie <[EMAIL PROTECTED]> wrote:

> 1) I feel there hasn't been enough open driver coverage to prove it. So
> far we have done an Intel IGD, we have a lot of code that isn't required
> for these devices, so the question of how much code exists purely to
> support poulsbo closed source userspace there is and why we need to live
> with it. Both radeon and nouveau developers have expressed frustration
> about the fencing internals being really hard to work with which doesn't
> bode well for maintainability in the future.

Well, my TTM experiment brought me up to EXA with radeon; I also did several small 3D tests to see how I want to send commands. So from my experiments, here are the things that are becoming painful for me.

On some radeon hw (most of the newer cards with a big amount of ram) you can't map vram beyond the aperture; well, you can, but you need to reprogram the card aperture and it's not something you want to do. TTM's assumption is that memory accesses are done through a map of the buffer, and so in this situation this becomes cumbersome. We already discussed this and the idea was to split vram, but I don't like this solution. So in the end I am more and more convinced that we should avoid object mapping in the vma of the client. I see 2 advantages to this: no tlb flush on vma, and no hard-to-solve page mapping aliasing.

On the fence side I hoped that I could have reasonable code using IRQ working reliably, but after discussion with AMD, what I was doing was obviously not recommended and prone to hard GPU lockups, which is a no-go for me. The last solution I have in mind for synchronization, i.e. knowing when the gpu is done with a buffer, could not use IRQ, at least not on all the hw I am interested in (r3xx/r4xx). Of course I don't want to busy wait to know when the GPU is done. Also the fence code puts too many assumptions on what we should provide. While fencing might prove useful, I think it can be better served by driver-specific ioctls than by a common infrastructure where hw obviously doesn't fit well in the scheme due to their differences.

And like Stephane, I think the GPU virtual memory stuff can't be used at its best in this scheme.

That said, I also share some concerns about GEM, like the high memory page, but I think this one is workable with the help of kernel people. For vram, the solution discussed so far, and which I like, is to have the driver choose, based on client requests, which objects to put there, and to see vram as a cache. So we will have all objects backed by a ram copy (which can be swapped); then it's all a matter of syncing the vram copy and the ram copy when necessary. Domains and pread/pwrite access let you easily do this sync only on the necessary area. Suspend also becomes easier: just sync the objects whose write domain is GPU. So all in all I agree that GEM might ask each driver to redo some stuff, but I think a large set of helper functions can help with this; more importantly, I see this as freedom for each driver and the only way to cope with hw differences.

Cheers,
Jerome Glisse <[EMAIL PROTECTED]>
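The "vram as a cache over a ram copy" scheme described above can be sketched as follows. This is a toy model under stated assumptions, not GEM code: plain memory buffers stand in for the ram and vram copies, and the enum values, struct fields and function names are hypothetical. It shows the two points made in the message: syncing only a requested range, and on suspend syncing only objects whose write domain is the GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical domain names: who last wrote the object. */
enum domain { DOMAIN_NONE = 0, DOMAIN_CPU, DOMAIN_GPU };

struct object {
    unsigned char *ram_copy;   /* always-present (swappable) backing copy */
    unsigned char *vram_copy;  /* cached copy in vram, NULL if not resident */
    size_t size;               /* object size in bytes */
    enum domain write_domain;  /* last writer; GPU means vram is newer */
};

/* Copy only [off, off + len) from the vram copy to the ram copy, the way a
 * ranged pread would. Returns 0 on success, -1 if the object is not resident
 * in vram or the range is out of bounds. */
int sync_range_to_ram(struct object *o, size_t off, size_t len)
{
    assert(o->ram_copy != NULL);
    if (!o->vram_copy || off > o->size || len > o->size - off)
        return -1;
    memcpy(o->ram_copy + off, o->vram_copy + off, len);
    return 0;
}

/* On suspend, only objects last written by the GPU hold data that the ram
 * copy lacks, so only those are copied back. Returns how many were synced. */
size_t suspend_sync(struct object *objs, size_t n)
{
    size_t i, synced = 0;

    for (i = 0; i < n; i++) {
        if (objs[i].write_domain != DOMAIN_GPU)
            continue;
        if (sync_range_to_ram(&objs[i], 0, objs[i].size) == 0) {
            objs[i].write_domain = DOMAIN_NONE;  /* copies now agree */
            synced++;
        }
    }
    return synced;
}
```

Note that every object in this model carries a full ram copy, which is the property the replies object to on machines with much more vram than system ram.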
Re: TTM merging?
On 5/14/08, Dave Airlie <[EMAIL PROTECTED]> wrote:

> I was hoping that by now, one of the radeon or nouveau drivers would have
> adopted TTM, or at least demoed something working using it, this hasn't
> happened which worries me, perhaps glisse or darktama could fill in on
> what limited them from doing it. The fencing internals are very very scary
> and seem to be a major stumbling block.

Aside from the fencing code, I have some other, more general, concerns with respect to using TTM on recent hardware. Although I've raised them before, it was on IRC, not really on the list.

The main issue, in my opinion, is that TTM enforces most things to be done from the kernel, and how those things should be done: command checking with relocations, fence emission, memory moves... Depending on the hardware functionality available, this might be useless or even counter-productive.

Also, I'm concerned about handling chips that can do page faults in video memory. It is interesting to be able to use this feature (which was asked for by the windows guys). For example we could have the ability to have huge textures paged in progressively at the memory manager level.

So to me the current TTM design lacks enough flexibility for recent chip features. I'm not saying all of this has to be implemented now, but it should not be prevented by the design. After all, if the memory manager is here to stay, I'd say it needs to be future-proof.

Stephane
Re: TTM merging?
Dave Airlie wrote: >>> 1) I feel there hasn't been enough open driver coverage to prove it. So far >>> we have done an Intel IGD, we have a lot of code that isn't required for >>> these devices, so the question of how much code exists purely to support >>> poulsbo closed source userspace there is and why we need to live with it. >>> Both radeon and nouveau developers have expressed frustration about the >>> fencing internals being really hard to work with which doesn't bode well for >>> maintainability in the future. >>> >>> >> OK. So basically what I'm asking is that when we have full-feathered open >> source drivers available that >> utilize TTM, either as part of DRM core, or, if needed, as part of >> driver-specific code, do you see anything >> else that prevents that from being pushed? That would be very valuable to >> know >> for anyone starting porting work. ? >> > > I was hoping that by now, one of the radeon or nouveau drivers would have > adopted TTM, or at least demoed something working using it, this hasn't > happened which worries me, perhaps glisse or darktama could fill in on > what limited them from doing it. The fencing internals are very very scary > and seem to be a major stumbling block. > Yes, it would be good to get some details here. Exactly what parts are scary? It seems Ian Romanick has made it work fine with xgi. 122 locs including license headers. I915 fencing can be made equally short if all sample (flushing) code is removed. > I do worry that TTM is not Linux enough, it seems you have decided that we > can never do in-kernel allocations at any useable speed and punted the > work into userspace, which makes life easier for Gallium as its more like > what Windows does, but I'm not sure this is a good solution for Linux. > > In-kernel allocations should be really fast unless they involve changing caching policy. If they are not, it's not a design issue but an implementation one which should be fixable. 
Trying to make mmap(anonymous) lightning fast when there is malloc() doesn't really make sense to me.

> The real question is whether TTM suits the driver writers for use in Linux
> desktop and embedded environments, and I think so far I'm not seeing
> enough positive feedback from the desktop side.

I actually haven't seen much feedback at all. At least not on the mailing lists.

Anyway we need to look at the alternatives, which currently is GEM.

GEM, while still in development, basically brings us back to the functionality of TTM 0.1, with added paging support but without fine-grained locking and caching policy support.

I might have misunderstood things but quickly browsing the code raises some obvious questions:

1) Some AGP chipsets don't support page addresses > 32bits. GEM objects use GFP_HIGHUSER, and it's hardcoded into the linux swap code.
2) How will user-space mapping of IO memory (AGP apertures) work? Eviction and associated killing / refaulting of IO memory mappings?
3) How do we avoid illegal physical page aliasing with non-Intel hardware? And how are we going to get the kernel purists to accept it when they already complain about WC - UC aliasing?
4) How is VRAM incorporated in the GEM design? How do we map it and keep the mapping during eviction?
5) What's protecting i915 GEM object privates and lists in a multi-threaded environment?
6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing some severe ranting about it on the lkml?

TTM is designed to cope with most hardware quirks I've come across with different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and some other ones. GEM basically leaves it up to the driver writer to reinvent the wheel.

> Also wrt the i915 driver it has too many experiments in it, the i915 users
> need to group together and remove the codepaths that make no sense and
> come up with a suitable userspace driver for it, remove all unused
> fencing mechanisms etc..
> Agreed, but back to the real and to me very important question: If I embark on a new OS driver today and want to use advanced memory manager stuff. Have VRAM and multiple advanced syncing mechanisms. What's my best option to get it into the kernel? Can I hook up driver specific TTM and get it in? /Thomas > Dave. > > > > >> /Thomas >> >> >> >> >> >> >> - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: TTM merging?
> > 1) I feel there hasn't been enough open driver coverage to prove it. So far
> > we have done an Intel IGD, we have a lot of code that isn't required for
> > these devices, so the question of how much code exists purely to support
> > poulsbo closed source userspace there is and why we need to live with it.
> > Both radeon and nouveau developers have expressed frustration about the
> > fencing internals being really hard to work with which doesn't bode well for
> > maintainability in the future.
>
> OK. So basically what I'm asking is that when we have full-feathered open
> source drivers available that
> utilize TTM, either as part of DRM core, or, if needed, as part of
> driver-specific code, do you see anything
> else that prevents that from being pushed? That would be very valuable to know
> for anyone starting porting work. ?

I was hoping that by now, one of the radeon or nouveau drivers would have adopted TTM, or at least demoed something working using it. This hasn't happened, which worries me; perhaps glisse or darktama could fill in on what limited them from doing it. The fencing internals are very, very scary and seem to be a major stumbling block.

I do worry that TTM is not Linux enough. It seems you have decided that we can never do in-kernel allocations at any useable speed and punted the work into userspace, which makes life easier for Gallium as it's more like what Windows does, but I'm not sure this is a good solution for Linux.

The real question is whether TTM suits the driver writers for use in Linux desktop and embedded environments, and I think so far I'm not seeing enough positive feedback from the desktop side.

Also, wrt the i915 driver, it has too many experiments in it. The i915 users need to group together and remove the codepaths that make no sense and come up with a suitable userspace driver for it, remove all unused fencing mechanisms etc.

Dave.

> /Thomas
Re: TTM merging?
Dave Airlie wrote:
>> Dave,
>>
>> Could you list what fixes / changes you think are needed to get TTM
>> into the mainline kernel?
>
> 2 main reasons:
>
> 1) I feel there hasn't been enough open driver coverage to prove it.
> So far we have done an Intel IGD, and we have a lot of code that isn't
> required for these devices, so the question is how much code exists
> purely to support the poulsbo closed source userspace and why we need
> to live with it. Both radeon and nouveau developers have expressed
> frustration about the fencing internals being really hard to work
> with, which doesn't bode well for maintainability in the future.

OK. So basically what I'm asking is that when we have full-feathered
open source drivers available that utilize TTM, either as part of DRM
core, or, if needed, as part of driver-specific code, do you see
anything else that prevents that from being pushed? That would be very
valuable to know for anyone starting porting work.

/Thomas
Re: TTM merging?
> Dave,
>
> Could you list what fixes / changes you think are needed to get TTM
> into the mainline kernel?

2 main reasons:

1) I feel there hasn't been enough open driver coverage to prove it. So
far we have done an Intel IGD, and we have a lot of code that isn't
required for these devices, so the question is how much code exists
purely to support the poulsbo closed source userspace and why we need to
live with it. Both radeon and nouveau developers have expressed
frustration about the fencing internals being really hard to work with,
which doesn't bode well for maintainability in the future.

2) Intel have asked that we don't push i915 support upstream, as they
believe it isn't ready, and since they end up supporting the kernel
module in the longer term I cannot go against that without a good
reason. I have no other driver to push, hence the stall.

I'll leave keithp to comment on this further.

Dave.