Re: [Dri-devel] Mach64 dma fixes
Linus Torvalds wrote: A hot system call takes about 0.2 us on an Athlon (it takes significantly longer on a P4, which I'm beating up Intel over all the time). The ioctl stuff goes through slightly more layers, but we're not talking huge numbers here. The system calls are fast enough that you're better off trying to keep stuff in the cache than trying to minimize system calls.

This is an education for me, too. Thanks for the info. Any idea how heavy ioctls are on a P4?

NOTE NOTE NOTE! The tradeoffs are seldom all that clear. Sometimes big buffers and few system calls are better. Sometimes they aren't. It just depends on a lot of things.

You bet--and the real issue we're constantly swimming upstream against is security in open source. Most hardware vendors design the hardware for closed source drivers and don't put much (or sometimes any) time into making sure their hardware is optimized for performance *and* security. Consequently most modern graphics chips are optimized for user space DMA and they rely on security through the obscurity of their closed source drivers. Then the DRI team comes along and has to figure out how to kludge together a secure path that doesn't sacrifice *all* the performance.

Linus, if you have any ideas on how we can uphold the security strengths of Linux without leaving all this performance on the table simply because we embrace open source, then I'd love to hear it. It really hurts to be competing tooth and nail against closed source drivers (on Linux, even) and have to leave potentially large performance gains on the table.

The other paradox here is that security is paramount for the server market where Linux is strong. But we're trying to help Linux into the domain of the graphics workstation and game machine markets, where users already have full access to the machine (even physically). So how is all this security really helping us address those markets? Sorry, I'm venting.
This has been a difficult issue since the beginning of the DRI project--but I'm glad I got it off my chest :-)

Regards,
Jens

-- Jens Owen  [EMAIL PROTECTED]  Steamboat Springs, Colorado

___ Don't miss the 2002 Sprint PCS Application Developer's Conference August 25-28 in Las Vegas -- http://devcon.sprintpcs.com/adp/index.cfm ___ Dri-devel mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [Dri-devel] Mach64 dma fixes
On Mon, 27 May 2002, Jens Owen wrote: This is an education for me, too. Thanks for the info. Any idea how heavy ioctls are on a P4?

Much heavier. For some yet unexplained reason, a P4 takes about 1us to do a simple system call. That's on a 1.8GHz system, so it basically implies that a P4 takes 1800 cycles to do an int 0x80 + iret, which is just ludicrous. A 1.2GHz Athlon does the same in 0.2us, ie around 250 cycles (the 200+ cycles also matches a Pentium reasonably well, so it's really the P4 that stands out here). The rest of the ioctl overhead is not really noticeable compared to those 1800 cycles spent on entering/exiting kernel mode.

Even so, those memcpy vs pipe throughput numbers I quoted were off my P4 machine: _despite_ the fact that a P4 is inexplicably bad at system calls, those 1800 CPU cycles are just a whole lot less than a lot of cache misses with modern hardware. It doesn't take many cache misses to make 1800 cycles just noise. And if the 1800 cycles are less than cache misses on normal non-IO benchmarks, they are going to be _completely_ swamped by any PCI/AGP overhead.

You bet--and the real issue we're constantly swimming upstream against is security in open source. Most hardware vendors design the hardware for closed source drivers and don't put much (or sometimes any) time into making sure their hardware is optimized for performance *and* security.

I realize this, and I feel for you. It's nasty. I don't know what the answer is. It _might_ even be something like a bi-modal system:

- apps by default get the traditional GLX behaviour: the X server does all the 3D for them. No DRI.
- there is some mechanism to tell which apps are trusted, and trusted apps get direct hw access and just aren't secure.

I actually think that if the abstraction level is just high enough, DRI shouldn't matter in theory. Shared memory areas with X for the high-level data (to avoid the copies for things like the obviously huge texture data).
From a game standpoint, think quake engine. The actual game doesn't need to tell the GL engine everything over and over again all the time. It tells it the basic stuff once, and then it just says "render me". You don't need DRI for sending the "render me" command, you need DRI because you send each vertex separately.

In that kind of high-level abstraction, the X client-server model should still work fine. In fact, it should work especially well on small-scale SMP (which seems inevitable). Are people thinking about the next stage, when 2D just doesn't exist any more except as a high-level abstraction on top of a 3D model? Where the X server actually gets to render the world view, and the application doesn't need to (or want to) know about things like level-of-detail?

Linus
Re: [Dri-devel] Mach64 dma fixes
Linus Torvalds wrote: On Mon, 27 May 2002, Jens Owen wrote: This is an education for me, too. Thanks for the info. Any idea how heavy ioctls are on a P4? Much heavier. For some yet unexplained reason, a P4 takes about 1us to do a simple system call. That's on a 1.8GHz system, so it basically implies that a P4 takes 1800 cycles to do an int 0x80 + iret, which is just ludicrous. A 1.2GHz Athlon does the same in 0.2us, ie around 250 cycles (the 200+ cycles also matches a Pentium reasonably well, so it's really the P4 that stands out here).

This is remarkable. I thought things were getting better, not worse.

... You bet--and the real issue we're constantly swimming upstream against is security in open source. Most hardware vendors design the hardware for closed source drivers and don't put much (or sometimes any) time into making sure their hardware is optimized for performance *and* security. I realize this, and I feel for you. It's nasty. I don't know what the answer is. It _might_ even be something like a bi-modal system: - apps by default get the traditional GLX behaviour: the X server does all the 3D for them. No DRI. - there is some mechanism to tell which apps are trusted, and trusted apps get direct hw access and just aren't secure. I actually think that if the abstraction level is just high enough, DRI shouldn't matter in theory. Shared memory areas with X for the high-level data (to avoid the copies for things like the obviously huge texture data).

I like this because it offers a way out, although I would keep the direct, secure approach to 3D we currently have for the other clients. Indirect rendering is pretty painful... However: the applications that most people would want to 'trust' are things like quake or other closed source games, which makes the situation a little murkier.

From a game standpoint, think quake engine. The actual game doesn't need to tell the GL engine everything over and over again all the time.
It tells it the basic stuff once, and then it just says "render me". You don't need DRI for sending the "render me" command, you need DRI because you send each vertex separately.

You could view the static geometry of quake levels as a single display list and ask for the whole thing to be rendered each frame. However, the reality of the quake type games is anything but - huge amounts of effort have gone into the process of figuring out (as quickly as possible) what minimal amount of work can be done to render the visible portion of the level at each frame. Quake generates very dynamic data from quite a static environment in the name of performance...

In that kind of high-level abstraction, the X client-server model should still work fine. In fact, it should work especially well on small-scale SMP (which seems inevitable).

Games are free to partition themselves in other ways that help SMP but keep their ability for a tight binding with the display system -- for example the physics (rigid body simulation) subsystem is a big and growing consumer of CPU and is quite easily separated out from the graphics engine. AI is also a target for its own thread.

Are people thinking about the next stage, when 2D just doesn't exist any more except as a high-level abstraction on top of a 3D model? Where the X server actually gets to render the world view, and the application doesn't need to (or want to) know about things like level-of-detail?

Yes, but there are a few steps between here and there, and there have been a few differences of opinion along the way. It would have been possible to get a lot of the X render extension via a client library emitting GL calls, for example.

Keith
Re: [Dri-devel] Mach64 dma fixes
Keith Whitwell wrote: Linus Torvalds wrote: On Mon, 27 May 2002, Jens Owen wrote: This is an education for me, too. Thanks for the info. Any idea how heavy ioctls are on a P4? Much heavier. For some yet unexplained reason, a P4 takes about 1us to do a simple system call. That's on a 1.8GHz system, so it basically implies that a P4 takes 1800 cycles to do an int 0x80 + iret, which is just ludicrous. A 1.2GHz Athlon does the same in 0.2us, ie around 250 cycles (the 200+ cycles also matches a Pentium reasonably well, so it's really the P4 that stands out here). This is remarkable. I thought things were getting better, not worse. ... You bet--and the real issue we're constantly swimming upstream against is security in open source. Most hardware vendors design the hardware for closed source drivers and don't put much (or sometimes any) time into making sure their hardware is optimized for performance *and* security. I realize this, and I feel for you. It's nasty. I don't know what the answer is. It _might_ even be something like a bi-modal system: - apps by default get the traditional GLX behaviour: the X server does all the 3D for them. No DRI. - there is some mechanism to tell which apps are trusted, and trusted apps get direct hw access and just aren't secure. I actually think that if the abstraction level is just high enough, DRI shouldn't matter in theory. Shared memory areas with X for the high-level data (to avoid the copies for things like the obviously huge texture data). I like this because it offers a way out, although I would keep the direct, secure approach to 3D we currently have for the other clients. Indirect rendering is pretty painful...

A bi-modal system could be very possible from an implementation perspective in the short term. We have a security mechanism in place now for validating which processes are allowed to access the direct rendering mechanism.
It is based on user IDs, and no process is allowed access to these resources unless they have: 1) Access to the X server as an X client. 2) Permissions acceptable under the DRI permissions defined in the XF86Config file. Most distributions have picked up on this and now have a typical usage model that allows the DRI to work for all desktop users.

If we do get some type of indirect rendering path working quicker, then perhaps we could tighten up these defaults so that the usage model required explicit administrative permission for a user before being allowed access to direct rendering. However, after going to all this trouble of providing a decent level of fallback performance, I would then want to push the performance envelope for those processes that did meet the criteria for access to direct rendering resources, and soften the security requirements for just those processes. These could possibly be users that have been given explicit permission, and the X server itself (doing HW accelerated indirect rendering).

There would really be three prongs of attack for this approach:

1) Audit the current DRI security model and confirm that it is strong enough to prevent unauthorized users from gaining access to the DRI mechanisms. Work with distros to tighten up the usage model (and possibly the DRI security mechanism itself) so only explicit desktop users are allowed access to the DRI.

2) Develop a device independent indirect rendering module that plugs into the X server to utilize our 3D drivers. After getting some HW accel working, look at speeding up this path by utilizing Chromium-like technologies and/or shared memory for high level data.

3) Transition the direct rendering drivers to take full advantage of their user space DMA capabilities. This is a large amount of work, but something we should consider if step 1 can be achieved to the kernel team's satisfaction.
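For reference, the XF86Config-based permission model Jens describes is expressed in the DRI section of the config file; a typical stanza looks something like the following (the group name is illustrative):

```
Section "DRI"
    Group "video"   # only members of this group get direct rendering
    Mode  0660      # permissions on the /dev/dri device nodes
EndSection
```

Distributions that "make this just happen" generally ship a permissive Mode (e.g. 0666); the tightened usage model discussed here would amount to a restrictive Group/Mode pair plus explicit group membership for trusted users.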
It is even possible the direct path could be obsoleted over the long term as step 2 becomes more and more streamlined.

However: the applications that most people would want to 'trust' are things like quake or other closed source games, which makes the situation a little murkier.

Yes, but is this really any worse than a typical install for these apps that requires root level access?

From a game standpoint, think quake engine. The actual game doesn't need to tell the GL engine everything over and over again all the time. It tells it the basic stuff once, and then it just says "render me". You don't need DRI for sending the "render me" command, you need DRI because you send each vertex separately. You could view the static geometry of quake levels as a single display list and ask for the whole thing to be rendered each frame. However, the reality of the quake type games is anything but - huge amounts of effort have gone into the process of figuring out (as quickly as possible) what minimal amount of work can be done to render the
Re: [Dri-devel] Mach64 dma fixes
On 2002.05.27 16:28 Jens Owen wrote: ... If we do get some type of indirect rendering path working quicker, then perhaps we could tighten up these defaults so that the usage model required explicit administrative permission for a user before being allowed access to direct rendering. However, after going to all this trouble of providing a decent level of fallback performance, I would then want to push the performance envelope for those processes that did meet the criteria for access to direct rendering resources, and soften the security requirements for just those processes. These could possibly be users that have been given explicit permission, and the X server itself (doing HW accelerated indirect rendering). There would really be three prongs of attack for this approach: 1) Audit the current DRI security model and confirm that it is strong enough to prevent unauthorized users from gaining access to the DRI mechanisms. Work with distros to tighten up the usage model (and possibly the DRI security mechanism itself) so only explicit desktop users are allowed access to the DRI. 2) Develop a device independent indirect rendering module that plugs into the X server to utilize our 3D drivers. After getting some HW accel working, look at speeding up this path by utilizing Chromium-like technologies and/or shared memory for high level data. 3) Transition the direct rendering drivers to take full advantage of their user space DMA capabilities. This is a large amount of work, but something we should consider if step 1 can be achieved to the kernel team's satisfaction. It is even possible the direct path could be obsoleted over the long term as step 2 becomes more and more streamlined. ...

Jens, if I understood correctly, basically you're suggesting having the OpenGL state machine in the X server process context, and therefore the GL drivers too, and most of the data (textures, display lists).
So there would be no layering between the DMA buffer construction and its submission - as both things would be carried out by the GL drivers. This means that we would have a single driver model instead of 3. But the GLX protocol isn't good for this, is it? Hence the need for shared memory for big data. Am I getting the right picture, or am I way off..?

José Fonseca

PS: It would be nice to discuss these issues in tonight's meeting.
Re: [Dri-devel] Mach64 dma fixes
On Mon, 27 May 2002, Keith Whitwell wrote: Linus Torvalds wrote: Much heavier. For some yet unexplained reason, a P4 takes about 1us to do a simple system call. That's on a 1.8GHz system, so it basically implies that a P4 takes 1800 cycles to do an int 0x80 + iret, which is just ludicrous. A 1.2GHz Athlon does the same in 0.2us, ie around 250 cycles (the 200+ cycles also matches a Pentium reasonably well, so it's really the P4 that stands out here). This is remarkable. I thought things were getting better, not worse.

In general, they are. I suspect the P4 system call slowness is just another artifact of some first-generation issues - the same way the P4 tends to be limited when it comes to shifts etc. It will get fixed eventually. And running at 3GHz+ makes some CPU cycles seem cheap if you can make up for them elsewhere.

However, you should put all of this into perspective: those 1800 cycles are just about the same time it takes to do one _single_ read from an ISA device. It's roughly the time it takes for one cacheline to be DMA'd over PCI.

Linus
Re: [Dri-devel] Mach64 dma fixes
From a game standpoint, think quake engine. The actual game doesn't need to tell the GL engine everything over and over again all the time. It tells it the basic stuff once, and then it just says "render me". You don't need DRI for sending the "render me" command, you need DRI because you send each vertex separately. You could view the static geometry of quake levels as a single display list and ask for the whole thing to be rendered each frame. However, the reality of the quake type games is anything but - huge amounts of effort have gone into the process of figuring out (as quickly as possible) what minimal amount of work can be done to render the visible portion of the level at each frame. Quake generates very dynamic data from quite a static environment in the name of performance...

I think I understand...even though Linus is referring to Quake's wire protocol here, you are pointing out that the real challenge is the underlying game engine, which is highly optimized for that specific application. Am I correct?

I think the multiplayer aspects of the game are a separate issue. I'm talking about the difference between a big display list with the whole quake level in it and the visibility/bsp-tree/whatever-new-technique coding that quake and other games use to squeeze as much as possible out of the hardware. It may be that simple visibility issues are pretty well understood now, and that the competition between game engines is moving to the shading engines (and physics engines, if the reports about doom 3 are right).

Keith
Re: [Dri-devel] Mach64 dma fixes
Around 18 o'clock on May 27, Keith Whitwell wrote: I think the multiplayer aspects of the game are a separate issue. Talking about the difference between a big display list with the whole quake level in it and the visibility/bsp-tree/whatever-new-technique coding that quake and other games use to squeeze as much as possible out of the hardware.

We had a big display-list vs immediate-mode war around 1990 and immediate mode won. It's just a lot easier to send the whole frame's worth of polygons each time than to try and edit display lists. Of course, this particular battle was framed by the scientific visualization trend of that era, where each frame was generated from a completely new set of data. In that context, stored mode graphics lose pretty badly.

However, given our experience with shared memory transport for images, and given the tremendous differential between CPU and bus speeds these days, it might make some sense to revisit the current 3D architecture. A system where the shared memory commands are validated by a user-level X server and passed to the graphics engine with only a small kernel level helper for DMA would allow for a greater possible level of security than the current DRI model does today. This would also provide for accelerated 3D graphics for remote applications, something that DRI doesn't support today, and which would take some significant work to enable. I would hope that it could also provide a significantly easier configuration environment; getting 3D running with the DRI is still a significant feat for the average Linux user.

The question is whether this would impact performance at all; we're talking a process-process context switch instead of process-kernel for each chunk of data. However, we'd eliminate the current DRI overhead when running multiple 3D applications, and we'd be able to take better advantage of SMP systems.
One trick would be to have the X server avoid reading much of the command buffer; much of that would make SMP performance significantly worse.

Keith Packard, XFree86 Core Team, HP Cambridge Research Lab
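The validated shared-memory transport Keith sketches can be pictured as a command ring that the client fills and a trusted consumer (the X server, in his model) drains, checking each command before it goes anywhere near the hardware. A minimal single-process sketch, with an invented register whitelist standing in for real per-chip knowledge:

```c
/* Sketch of a validated shared-memory command ring: the producer
 * (client) appends (register, value) pairs; the consumer (trusted
 * server) whitelists registers before "submitting" them. The register
 * range 0x1000-0x1fff is purely illustrative. */
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 256
struct cmd { uint32_t reg; uint32_t val; };

struct ring {
    struct cmd buf[RING_SIZE];
    unsigned head, tail;            /* head: consumer, tail: producer */
};

int ring_push(struct ring *r, uint32_t reg, uint32_t val)
{
    unsigned next = (r->tail + 1) % RING_SIZE;
    if (next == r->head)
        return -1;                  /* ring full */
    r->buf[r->tail].reg = reg;
    r->buf[r->tail].val = val;
    r->tail = next;
    return 0;
}

/* Hypothetical "safe" register window, e.g. vertex/state setup only. */
static int reg_allowed(uint32_t reg)
{
    return reg >= 0x1000 && reg < 0x2000;
}

/* Consumer: validate pending commands; returns how many were accepted.
 * Rejected commands (e.g. a DMA-table pointer write) are dropped. */
int ring_drain(struct ring *r)
{
    int accepted = 0;
    while (r->head != r->tail) {
        struct cmd c = r->buf[r->head];
        r->head = (r->head + 1) % RING_SIZE;
        if (reg_allowed(c.reg))
            accepted++;             /* here: hand off to the DMA helper */
    }
    return accepted;
}
```

Keith's "avoid reading much of the command buffer" point is exactly the cost of `ring_drain` touching every entry: each read pulls the client's cachelines across to the server's CPU on SMP.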
Re: [Dri-devel] Mach64 dma fixes
Keith Packard wrote: We had a big display-list vs immediate-mode war around 1990 and immediate mode won. It's just a lot easier to send the whole frame's worth of polygons each time than to try and edit display lists. Of course, this particular battle was framed by the scientific visualization trend of that era, where each frame was generated from a completely new set of data. In that context, stored mode graphics lose pretty badly.

If you're referring to the OpenGL vs PEX war, there was more than technical issues weighing in...there was the reality that Microsoft *was* willing to support OpenGL. That made OpenGL a better cross platform choice. Kind of ironic, but predictable, that Microsoft is now trying to sink OpenGL...but that's a thread for another group.

However, given our experience with shared memory transport for images, and given the tremendous differential between CPU and bus speeds these days, it might make some sense to revisit the current 3D architecture. A system where the shared memory commands are validated by a user-level X server and passed to the graphics engine with only a small kernel level helper for DMA would allow for a greater possible level of security than the current DRI model does today.

I wouldn't say we're lacking in security today; we're in good shape now.

This would also provide for accelerated 3D graphics for remote applications, something that DRI doesn't support today, and which would take some significant work to enable.

In relative scale, getting HW acceleration for indirect rendering is *much* smaller than the more aggressive architectural changes we're discussing. Let's just keep that in perspective. It might be a less aggressive first step to get the missing module(s) for HW accelerated indirect rendering going, then move to these types of more aggressive indirect methods.
I would hope that it could also provide a significantly easier configuration environment; getting 3D running with the DRI is still a significant feat for the average Linux user.

Hmm. I would have agreed a year ago, but most of the distributions appear to have a good handle on making this just happen...when there is driver support.

The question is whether this would impact performance at all; we're talking a process-process context switch instead of process-kernel for each chunk of data. However, we'd eliminate the current DRI overhead when running multiple 3D applications, and we'd be able to take better advantage of SMP systems. One trick would be to have the X server avoid reading much of the command buffer; much of that would make SMP performance significantly worse.

The performance path I'd like to push hardest in the short term is direct rendering completely within the user space context. Your suggestions for optimizing an indirect path are great. That path would become much more critical than today, as general purpose processes would no longer have access to the *faster* direct path.

-- Jens Owen  [EMAIL PROTECTED]  Steamboat Springs, Colorado
Re: [Dri-devel] Mach64 dma fixes
José Fonseca wrote: On 2002.05.27 16:28 Jens Owen wrote: ... If we do get some type of indirect rendering path working quicker, then perhaps we could tighten up these defaults so that the usage model required explicit administrative permission for a user before being allowed access to direct rendering. However, after going to all this trouble of providing a decent level of fallback performance, I would then want to push the performance envelope for those processes that did meet the criteria for access to direct rendering resources, and soften the security requirements for just those processes. These could possibly be users that have been given explicit permission, and the X server itself (doing HW accelerated indirect rendering). There would really be three prongs of attack for this approach: 1) Audit the current DRI security model and confirm that it is strong enough to prevent unauthorized users from gaining access to the DRI mechanisms. Work with distros to tighten up the usage model (and possibly the DRI security mechanism itself) so only explicit desktop users are allowed access to the DRI. 2) Develop a device independent indirect rendering module that plugs into the X server to utilize our 3D drivers. After getting some HW accel working, look at speeding up this path by utilizing Chromium-like technologies and/or shared memory for high level data. 3) Transition the direct rendering drivers to take full advantage of their user space DMA capabilities. This is a large amount of work, but something we should consider if step 1 can be achieved to the kernel team's satisfaction. It is even possible the direct path could be obsoleted over the long term as step 2 becomes more and more streamlined. ...

Jens, if I understood correctly, basically you're suggesting having the OpenGL state machine in the X server process context, and therefore the GL drivers too, and most of the data (textures, display lists).
So there would be no layering between the DMA buffer construction and its submission - as both things would be carried out by the GL drivers. This means that we would have a single driver model instead of 3. But the GLX protocol isn't good for this, is it? Hence the need for shared memory for big data. Am I getting the right picture, or am I way off..?

Sorry, we covered a lot of things at once. Let me simplify...

1) We loosen security requirements for 3D drivers. This will allow far less data copying, memory mapping/unmapping and fewer system calls. Many modern graphics chips can have their data managed completely in a user space AGP ring buffer, removing the need to call the kernel module at all. The primary limitation that has kept us from pursuing these implementations so far has been security holes with AGP blits.

2) We implement HW accelerated indirect rendering for those processes that don't have the permissions to use the new optimized drivers. Most of the fancy architecture discussions we had here are related to making indirect rendering faster...and could be done as a follow-on to basic HW accelerated indirect rendering. The first and easiest way to implement this is to make the X server use our direct rendering drivers.

I'm not really advocating going to a different model at all. Rather, I'm just advocating moving more of the kernel side validation we're currently doing back into the 3D driver.

PS: It would be nice to discuss these issues in tonight's meeting.

I guess that's starting now. It's at irc.openproject.net #dri-devel for those interested in joining in...

-- Jens Owen  [EMAIL PROTECTED]  Steamboat Springs, Colorado
Re: [Dri-devel] Mach64 dma fixes
On Mon, 27 May 2002 15:01:47 -0600 Jens Owen [EMAIL PROTECTED] wrote: 1) We loosen security requirements for 3D drivers. This will allow far less data copying, memory mapping/unmapping and fewer system calls. Many modern graphics chips can have their data managed completely in a user space AGP ring buffer, removing the need to call the kernel module at all. The primary limitation that has kept us from pursuing these implementations so far has been security holes with AGP blits.

I don't pretend to understand everything here, but wouldn't it be more secure, and STILL blindingly fast, to set up the data in userspace, and trigger the AGP DMA / blits from kernel space with some bounds checking? Surely 1 system call per DMA isn't that bad?
Re: [Dri-devel] Mach64 dma fixes
Ian Molton wrote: On Mon, 27 May 2002 15:01:47 -0600 Jens Owen [EMAIL PROTECTED] wrote: 1) We loosen security requirements for 3D drivers. This will allow far less data copying, memory mapping/unmapping and fewer system calls. Many modern graphics chips can have their data managed completely in a user space AGP ring buffer, removing the need to call the kernel module at all. The primary limitation that has kept us from pursuing these implementations so far has been security holes with AGP blits. I don't pretend to understand everything here, but wouldn't it be more secure, and STILL blindingly fast, to set up the data in userspace, and trigger the AGP DMA / blits from kernel space with some bounds checking? Surely 1 system call per DMA isn't that bad?

That's what we do for the cases where we can do so securely. All the vertex data on most cards takes this route. Some data can't go this way because the buffers are subject to attack after the checking has been performed but before they reach the hardware. Whether specific operations are vulnerable or not depends on the details of the card's DMA engine.

Keith
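The attack window Keith describes (the client rewriting a buffer after the check but before the hardware reads it) is a classic time-of-check/time-of-use race, and the standard defence is to snapshot the buffer into memory the client cannot touch *before* validating it. A minimal sketch of that ordering; the names and the single "forbidden register" are illustrative, not the real DRM interface:

```c
/* Sketch of copy-then-check: copy the client's (reg, value) command
 * pairs into kernel-private storage, then validate the private copy,
 * so a second client thread cannot alter a command between the check
 * and the DMA submission. REG_TABLE_PTR is a made-up stand-in for a
 * dangerous register like a bus-master table pointer. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_WORDS 64
#define REG_TABLE_PTR 0x0147u

/* Returns 1 and fills 'priv' if the buffer is safe to submit. */
int copy_and_verify(uint32_t *priv, const uint32_t *user, int n)
{
    if (n < 0 || n > BUF_WORDS || (n & 1))
        return 0;                            /* bad size/shape */
    memcpy(priv, user, n * sizeof(*priv));   /* snapshot first... */
    for (int i = 0; i < n; i += 2)           /* ...then check the copy */
        if (priv[i] == REG_TABLE_PTR)
            return 0;                        /* forbidden register write */
    return 1;   /* priv[] may now be handed to the DMA engine */
}
```

Checking the client-mapped buffer in place, by contrast, would leave exactly the race Keith warns about.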
Re: [Dri-devel] Mach64 dma fixes
Linus and Keith P., thank you very much for your valuable insights - they cleared up a misconception I had about memory transfers. Of course, to get to the bottom of this, we will have to test several buffer sizes - I'm sure it will be an interesting study.

Regards,
José Fonseca
Re: [Dri-devel] Mach64 dma fixes
On 2002.05.25 06:10 Leif Delgass wrote: On Fri, 24 May 2002, Frank C. Earl wrote: On Thursday 23 May 2002 04:37 pm, Leif Delgass wrote: I've committed code to read BM_GUI_TABLE to reclaim processed buffers and disabled frame and buffer aging with the pattern registers. I've disabled saving/restoring the pattern registers in the DDX and moved the wait for idle to the XAA Sync. This fixes the slowdown on mouse moves. I also fixed a bug in getting the ring head. One bug that remains is that when starting tuxracer or quake (and possibly other apps) from a fresh restart of X, there is a problem where old bits of the back- or frame-buffer show through. With tuxracer (windowed), if I move the window, the problem goes away. It seems that some initial state is not being set correctly or clears aren't working. If I run glxgears (or even tunnel, which uses textures) first, after starting X and before starting another app, the problem isn't there. If someone has a cvs build or binary from before Monday the 20th but after Saturday the 18th, could you test to see if this happens? I'm not sure if this is new behavior or not. I tried removing the flush on swaps in my tree and things seem to still work fine (the bug mentioned above is still there, however). We may need to think of an alternate way to do frame aging and throttling, without using a pattern register. I've been pondering the code you've done (not the latest committed, but what was described to me a couple of weeks back...) how do you account for securing the BM_GUI_TABLE check and the pattern register aging in light of the engine being able to write to almost all registers? It occurred to me that there's a potential security risk (allowing malicious clients to possibly confuse/hang the engine) with the design described to me a little while back. Well, I just went back and looked at Jose's test for writing BM_GUI_TABLE_CMD from within a buffer and realized that it had a bug.
The register addresses weren't converted to MM offsets. So I fixed that... Indeed. The other registers were being specified by their value and not their macro, so I forgot about that detail... ...and ran the test. With two descriptors, writing BM_GUI_TABLE_CMD does not cause the second descriptor to be read from the new address, but BM_GUI_TABLE reads back with the new address written in the first buffer at the end of the test. Then I tried setting up three descriptors, and lo and behold, after processing the first two descriptors, the engine switches to the new table address written in the first buffer! I think it's because of the pre-incrementing (prefetching?) of BM_GUI_TABLE that there's a delay of one descriptor, but it IS possible to derail a bus master in progress and set it processing from a different table in mid-stream. Plus, if the address is bogus or the table is misconstructed, this will cause an engine lockup and take out DMA until the machine is cold restarted. The code for the test I used is attached. Wow! Bummer... I already had convinced myself that the card was secure! So it would appear that allowing clients to add register commands to a buffer without verifying them is _not_ secure. This is going to make things harder, especially for vertex buffers. This is going to require copying buffers and adding commands, or unmapping and verifying client-submitted buffers in the DRM. I'd like to continue on the path we're on until we can get DMA running smoothly and then we'll have to come back and fix this problem. Yep. It's not the end of the world, but it's gonna mean that the CPU will be a little more stressed, and that we have much more code to do... Good catch, Leif! José Fonseca
Fwd: Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 12:10 am, Leif Delgass wrote: So it would appear that allowing clients to add register commands to a buffer without verifying them is _not_ secure. This is going to make things harder, especially for vertex buffers. This is going to require copying buffers and adding commands or unmapping and verifying client-submitted buffers in the DRM. I'd like to continue on the path we're on until we can get DMA running smoothly and then we'll have to come back and fix this problem. Check back to what I'd said for my work- how I'd envisioned things working. It doesn't really rely on a register being set. Yes, it uses a register to verify completion of a pass, but the main way to do that is to see if the chip's idle- it's more of an extra, redundant check. If the chip is idle, we know it's done with the pass. If it's not idle after about 3-5 interrupts, you know it's probably locked and needs a reset. Now, with the DRI locks, etc. we can ensure that we know there's going to be nobody that isn't a client pushing stuff out until we're done and flagged as such. We also know that nobody's going to be allowed register access, so they can't keep the engine on the chip busy except by DMA requests. DMA requests are not initiated by the callers, they're handled by a block of code tied to an interrupt handler in the module. This code, if it gets the lock, submits a combined DMA pass of as much of the submitter's data as is reasonable. We then check to see if that pass is completed every time the interrupt gets called. Upon completion, you unlink the buffers in the pass and hand them to the free list. With that, you're already as secure as you're likely going to get with the Mach64- the DRM is in the driver's seat the whole time for any submitted data. Otherwise, you're going to be doing copying of some sort, which pretty much burns up any speed advantages the optimal way of doing this gives you.
-- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 03:01 am, José Fonseca wrote: Wow! Bummer... I already had convinced myself that the card was secure! It is, if you don't rely on a register being set by something for your control of things. You may get peak performance with the design in question, but it's not secure. I'd almost bet we could get as good a performance with what I'd started, if we re-worked the interrupts so that we doubled them up, using the scanline interrupt in addition to the VBLANK one. Yep. It's not the end of the world, but it's gonna mean that the CPU will be a little more stressed, and that we have much more code to do... If you guys don't mind, I'd like to revisit the work by modernizing my branch and finalizing what I'd started. I think it'd do well and make it secure. -- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On 2002.05.25 17:16 Frank C. Earl wrote: On Saturday 25 May 2002 03:01 am, José Fonseca wrote: Wow! Bummer... I already had convinced myself that the card was secure! It is, if you don't rely on a register being set by something for your control of things. ... Frank, Leif was pretty clear and I quote: it IS possible to derail a bus master in progress and set it processing from a different table in mid-stream. Plus, if the address is bogus or the table is misconstructed, this will cause an engine lockup and take out DMA until the machine is cold restarted. And this can happen regardless of whether a specific register is to be read or not. (In fact, if you look at the test case you'll see that no register is being read except for debugging purposes.) Yep. It's not the end of the world, but it's gonna mean that the CPU will be a little more stressed, and that we have much more code to do... If you guys don't mind, I'd like to revisit the work by modernizing my branch and finalizing what I'd started. I think it'd do well and make it secure. Sure, Frank. I hope you can prove us wrong, but before you dedicate too much time to it, don't forget that it's now pretty straightforward to come up with a test case to break the transfer. So if you can't secure it in the end, your extra effort will be in vain. José Fonseca
Re: [Dri-devel] Mach64 dma fixes (fwd)
Forgot to cc the list... -- Leif Delgass http://www.retinalburn.net -- Forwarded message -- Date: Sat, 25 May 2002 12:56:08 -0400 (EDT) From: Leif Delgass [EMAIL PROTECTED] To: Frank C. Earl [EMAIL PROTECTED] Subject: Re: [Dri-devel] Mach64 dma fixes On Sat, 25 May 2002, Frank C. Earl wrote: On Saturday 25 May 2002 12:10 am, you wrote: So it would appear that allowing clients to add register commands to a buffer without verifying them is _not_ secure. This is going to make things harder, especially for vertex buffers. This is going to require copying buffers and adding commands or unmapping and verifying client-submitted buffers in the DRM. I'd like to continue on the path we're on until we can get DMA running smoothly and then we'll have to come back and fix this problem. Check back to what I'd said for my work- how I'd envisioned things working. It doesn't really rely on a register being set. Yes, it uses a register to verify completion of a pass, but the main way to do that is to see if the chip's idle- it's more of an extra, redundant check. If the chip is idle, we know it's done with the pass. If it's not idle after about 3-5 interrupts, you know it's probably locked and needs a reset. Now, with the DRI locks, etc. we can ensure that we know there's going to be nobody that isn't a client pushing stuff out until we're done and flagged as such. What prevents a client from modifying the contents of a buffer after it's been submitted? Sure, you can't send new buffers without the lock, but the client can still write to a buffer that's already been submitted and dispatched without holding the lock. We also know that nobody's going to be allowed register access, so they can't keep the engine on the chip busy except by DMA requests. The registers are already being mapped read only in client space now. DMA requests are not initiated by the callers, they're handled by a block of code tied to an interrupt handler in the module.
This code, if it gets the lock, submits a combined DMA pass of as much of the submitter's data as is reasonable. We then check to see if that pass is completed every time the interrupt gets called. Upon completion, you unlink the buffers in the pass and hand them to the free list. With that, you're already as secure as you're likely going to get with the Mach64- the DRM is in the driver's seat the whole time for any submitted data. Otherwise, you're going to be doing copying of some sort, which pretty much burns up any speed advantages the optimal way of doing this gives you. I don't see the interrupt method being that different from a security perspective. The DRM is in the driver's seat in either case, the method without interrupts is essentially the same, but with the trigger for starting a new pass in a different place. The problem isn't just relying on reading registers that can be modified by the client, but ensuring that the client doesn't add commands to derail the DMA pass or lock the engine. The only way to make sure this doesn't happen is by copying or unmapping and verifying the buffers. I think the i830 driver does this. Yes, it will impact performance, but I don't see a way to get around it and still make the driver secure. At least this extra work can be done while the card is busy with a DMA operation. -- Leif Delgass http://www.retinalburn.net
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 11:48 am, José Fonseca wrote: On 2002.05.25 17:16 Frank C. Earl wrote: Frank, Leif was pretty clear and I quote: it IS possible to derail a bus master in progress and set it processing from a different table in mid-stream. Plus, if the address is bogus or the table is misconstructed, this will cause an engine lockup and take out DMA until the machine is cold restarted. And this can happen regardless of whether a specific register is to be read or not. (In fact, if you look at the test case you'll see that no register is being read except for debugging purposes.) Yep. I looked at his example again and it just didn't initially click when I looked at it the first time- I really do need to NOT post things just after waking up... This is extremely disappointing to say the least. Doing the copying is going to eat at least part, if not all, of the advantage of doing either route. -- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 11:56 am, you wrote: What prevents a client from modifying the contents of a buffer after it's been submitted? Sure, you can't send new buffers without the lock, but the client can still write to a buffer that's already been submitted and dispatched without holding the lock. Nothing. If the chip had been as secure as we'd initially thought, it wouldn't have mattered because all they'd do is scribble all over the screen at the worst. If you're unmapping on submission, you don't have to lock things on the client end because they can't alter after the fact. Then you only have to worry about bad data. In this case, what you're going to want to do is to unmap, build the real structure by filling in the commands for the vertex entries, and submit to the processing queue. Multiple callers could then still submit what they wanted to be DMAed without waiting (in the current model, don't each of the clients have to wait if one's got the lock?) because there's a piece of code multiplexing the DMA resource instead of a lock managing it. I don't see the interrupt method being that different from a security perspective. The DRM is in the driver's seat in either case, the method without interrupts is essentially the same, but with the trigger for starting a new pass in a different place. The problem isn't just relying on reading registers that can be modified by the client, but ensuring that the client doesn't add commands to derail the DMA pass or lock the engine. The only way to make sure this doesn't happen is by copying or unmapping and verifying the buffers. I think the i830 driver does this. Yes, it will impact performance, but I don't see a way to get around it and still make the driver secure. At least this extra work can be done while the card is busy with a DMA operation. If it had been secure and you couldn't derail DMA, it wouldn't have pieces that could be confused by malicious clients, meaning you didn't need to do copying, etc.
to secure the pathway, ensuring peak overall performance. With your latest test case, it's a moot point. We're going to have to secure the stream proper in the form of code that has inner loops, etc. (The i830 does an unmap and a single append only- we've got a lot more to do with the Mach64. I've been thinking of ways around that on the i830 and i810 that I'm going to be trying at some point.) Your way would be as secure in this environment. Now, as to which is more efficient, that's still up for debate. I can't say which is going to be faster overall. There's the aging in your design that allows for buffers being released sooner than in mine. There's the need for serialization in your design that is not required in mine. Which causes the worst bottlenecks in performance? -- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On Sat, 25 May 2002, Frank C. Earl wrote: On Saturday 25 May 2002 11:56 am, you wrote: What prevents a client from modifying the contents of a buffer after it's been submitted? Sure, you can't send new buffers without the lock, but the client can still write to a buffer that's already been submitted and dispatched without holding the lock. Nothing. If the chip had been as secure as we'd initially thought, it wouldn't have mattered because all they'd do is scribble all over the screen at the worst. If you're unmapping on submission, you don't have to lock things on the client end because they can't alter after the fact. Then you only have to worry about bad data. In this case, what you're going to want to do is to unmap, build the real structure by filling in the commands for the vertex entries, and submit to the processing queue. Multiple callers could then still submit what they wanted to be DMAed without waiting (in the current model, don't each of the clients have to wait if one's got the lock?) because there's a piece of code multiplexing the DMA resource instead of a lock managing it. I'm using the same model you had set up. When a client submits a buffer, it's added to the queue (but not dispatched) and there's no blocking. The DRM batch-submits buffers when the high water mark is reached or the flush ioctl is called (needed before reading/writing to the framebuffer, e.g.). Clients have to wait for the lock to submit the buffer, but the ioctl quickly returns. The only place where a client has to wait is in freelist_get if the freelist is empty. That's where buffer aging or reading the ring head allows the call to return as soon as a single buffer is available, rather than waiting for the whole DMA pass to complete. I don't see the interrupt method being that different from a security perspective.
The DRM is in the driver's seat in either case, the method without interrupts is essentially the same, but with the trigger for starting a new pass in a different place. The problem isn't just relying on reading registers that can be modified by the client, but ensuring that the client doesn't add commands to derail the DMA pass or lock the engine. The only way to make sure this doesn't happen is by copying or unmapping and verifying the buffers. I think the i830 driver does this. Yes, it will impact performance, but I don't see a way to get around it and still make the driver secure. At least this extra work can be done while the card is busy with a DMA operation. If it had been secure and you couldn't derail DMA, it wouldn't have pieces that could be confused by malicious clients, meaning you didn't need to do copying, etc. to secure the pathway, ensuring peak overall performance. With your latest test case, it's a moot point. We're going to have to secure the stream proper in the form of code that has inner loops, etc. (The i830 does an unmap and a single append only- we've got a lot more to do with the Mach64. I've been thinking of ways around that on the i830 and i810 that I'm going to be trying at some point.) Your way would be as secure in this environment. For vertex data, we can add the register commands based on the primitive type and buffer size. By placing the commands, we can ensure that any commands in the buffer would just be seen as data. This would require an unmap and loop through the buffer, but we wouldn't have to copy all the data. I'm going to try doing gui-master blits using BM_HOSTDATA rather than BM_ADDR and HOST_DATA[0-15] and see if we can eliminate the register commands in the buffer. We could also use system bus masters for blits, but that would require ending the current DMA op and setting up a new one for each blit, since blits done this way use BM_SYSTEM_TABLE instead of BM_GUI_TABLE.
With BM_HOSTDATA it would be a matter of changing the descriptors for blits, but they could co-exist in the same stream as vertex and state gui-master ops. Now, as to which is more efficient, that's still up for debate. I can't say which is going to be faster overall. There's the aging in your design that allows for buffers being released sooner than in mine. There's the need for serialization in your design that is not required in mine. Which causes the worst bottlenecks in performance? As I explained above, serialization isn't needed. It's really a question of which method of checking completion and dispatching buffers leaves the least amount of idle time. Buffer aging could still be used in the interrupt-driven model, so that's not really a constraint of one approach versus the other. I don't think it would be too difficult to test both methods without too much change in the basic code infrastructure. -- Leif Delgass http://www.retinalburn.net
Re: [Dri-devel] Mach64 dma fixes
Frank, On 2002.05.25 18:24 Frank C. Earl wrote: ... This is extremely disappointing to say the least. Doing the copying is going to eat at least part, if not all, of the advantage of doing either route. Yes, it's something we have to deal with regardless of how we flush the DMA buffers. Of course it will always be slower, but I think it can be reduced to a barely noticeable difference. It really depends on where the bottleneck will be in a regular OpenGL application. On older CPUs with mach64 perhaps not, but the laptops where the mach64 chip is common have fairly good CPUs compared with the Mach64's abilities, so I believe that the bottleneck will be on the card. This means that if we do this right, i.e., do it the least CPU-intensive way and use fairly large buffers (since the Mach64 allows using scatter-gather memory), then the only difference will be a slightly increased latency, but not really a lower number of fps. In other words, the bandwidth to the card should be unaffected. Anyway, until then we still have to optimize the vertex buffer construction, and after that we should be able to compare the performance with and without this security enforcement. Regards, José Fonseca
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 11:48 am, José Fonseca wrote: So if you can't secure it in the end, your extra effort will be in vain. I just thought of something to try to change the nature of the test case problem. What happens if you have a second descriptor of commands that merely resets the DMA engine settings to what they should be for the third descriptor? I'd say it'd depend on what the chip was actually doing- what do you guys think? -- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 01:14 pm, Leif Delgass wrote: I'm using the same model you had set up. When a client submits a buffer, it's added to the queue (but not dispatched) and there's no blocking. The DRM batch-submits buffers when the high water mark is reached or the flush ioctl is called (needed before reading/writing to the framebuffer, e.g.). Clients have to wait for the lock to submit the buffer, but the ioctl quickly returns. The only place where a client has to wait is in freelist_get if the freelist is empty. That's where buffer aging or reading the ring head allows the call to return as soon as a single buffer is available, rather than waiting for the whole DMA pass to complete. I guess I misunderstood somewhere. Then the only real question is, can we safely/stably manage aging or do we need to do it the way I had planned? Got it. For vertex data, we can add the register commands based on the primitive type and buffer size. By placing the commands, we can ensure that any commands in the buffer would just be seen as data. This would require an unmap and loop through the buffer, but we wouldn't have to copy all the data. I'm going to try doing gui-master blits using BM_HOSTDATA rather than BM_ADDR and HOST_DATA[0-15] and see if we can eliminate the register commands in the buffer. We could also use system bus masters for blits, but that would require ending the current DMA op and setting up a new one for each blit, since blits done this way use BM_SYSTEM_TABLE instead of BM_GUI_TABLE. With BM_HOSTDATA it would be a matter of changing the descriptors for blits, but they could co-exist in the same stream as vertex and state gui-master ops. It's still something eating cycles that we wouldn't have to be doing if we could have secured the chip better. Unmapping's not a good thing to be doing with something you're trying to do quickly, and it's rough on the kernel memory system. Now, as to which is more efficient, ...
As I explained above, serialization isn't needed. It's really a question of which method of checking completion and dispatching buffers leaves the least amount of idle time. Buffer aging could still be used in the interrupt-driven model, so that's not really a constraint of one approach versus the other. I don't think it would be too difficult to test both methods without too much change in the basic code infrastructure. Works for me. -- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On 2002.05.25 20:36 Frank C. Earl wrote: On Saturday 25 May 2002 11:48 am, José Fonseca wrote: So if you can't secure it in the end, your extra effort will be in vain. I just thought of something to try to change the nature of the test case problem. What happens if you have a second descriptor of commands that merely resets the DMA engine settings to what they should be for the third descriptor? I'd say it'd depend on what the chip was actually doing- what do you guys think? You mean, setting the descriptor to the right value in between? Hmm... I doubt that doing it in the middle works, because as Leif noticed, the changes that we make to BM_GUI_TABLE only affect the descriptor that is two entries ahead, so it would be too late... ..but your idea in principle is quite ingenious! What if we just fill the last 8 bytes of each 4K block with a command to reset the value of BM_GUI_TABLE? So even if the client tries to mess things up, we would put it right in the end. Of course it would be a pain [to code for] to reserve 8 bytes at the end of each 4K block, but it's doable. [It would also be a pain to code the kernel unmap/verification routine.] This means that not only can the client not mess up the descriptor table, they also can't tamper with BM_GUI_TABLE, so we can still use it as a progression meter [in both implementations]. José Fonseca
Re: [Dri-devel] Mach64 dma fixes
On Sat, 25 May 2002, José Fonseca wrote: On 2002.05.25 20:36 Frank C. Earl wrote: On Saturday 25 May 2002 11:48 am, José Fonseca wrote: So if you can't secure it in the end, your extra effort will be in vain. I just thought of something to try to change the nature of the test case problem. What happens if you have a second descriptor of commands that merely resets the DMA engine settings to what they should be for the third descriptor? I'd say it'd depend on what the chip was actually doing- what do you guys think? You mean, setting the descriptor to the right value in between? Hmm... I doubt that doing it in the middle works, because as Leif noticed, the changes that we make to BM_GUI_TABLE only affect the descriptor that is two entries ahead, so it would be too late... ..but your idea in principle is quite ingenious! What if we just fill the last 8 bytes of each 4K block with a command to reset the value of BM_GUI_TABLE? So even if the client tries to mess things up, we would put it right in the end. Of course it would be a pain [to code for] to reserve 8 bytes at the end of each 4K block, but it's doable. [It would also be a pain to code the kernel unmap/verification routine.] This means that not only can the client not mess up the descriptor table, they also can't tamper with BM_GUI_TABLE, so we can still use it as a progression meter [in both implementations]. This had crossed my mind too. The only problem is that there could still be a short period of time where BM_GUI_TABLE isn't accurate, so it still leaves the problem of being able to trust the contents of BM_GUI_TABLE for buffer aging and adding descriptors to the ring. -- Leif Delgass http://www.retinalburn.net
Re: [Dri-devel] Mach64 dma fixes
On Sat, 25 May 2002, Leif Delgass wrote: On Sat, 25 May 2002, José Fonseca wrote: On 2002.05.25 20:36 Frank C. Earl wrote: On Saturday 25 May 2002 11:48 am, José Fonseca wrote: So if you can't secure it in the end, your extra effort will be in vain. I just thought of something to try to change the nature of the test case problem. What happens if you have a second descriptor of commands that merely resets the DMA engine settings to what they should be for the third descriptor? I'd say it'd depend on what the chip was actually doing- what do you guys think? You mean, setting the descriptor to the right value in between? Hmm... I doubt that doing it in the middle works, because as Leif noticed, the changes that we make to BM_GUI_TABLE only affect the descriptor that is two entries ahead, so it would be too late... ..but your idea in principle is quite ingenious! What if we just fill the last 8 bytes of each 4K block with a command to reset the value of BM_GUI_TABLE? So even if the client tries to mess things up, we would put it right in the end. Of course it would be a pain [to code for] to reserve 8 bytes at the end of each 4K block, but it's doable. [It would also be a pain to code the kernel unmap/verification routine.] This means that not only can the client not mess up the descriptor table, they also can't tamper with BM_GUI_TABLE, so we can still use it as a progression meter [in both implementations]. This had crossed my mind too. The only problem is that there could still be a short period of time where BM_GUI_TABLE isn't accurate, so it still leaves the problem of being able to trust the contents of BM_GUI_TABLE for buffer aging and adding descriptors to the ring. It just occurred to me that resetting BM_GUI_TABLE could put the card into a loop. You might have to put in an address two descriptors away. We'd need to test this.
-- Leif Delgass http://www.retinalburn.net
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 03:44 pm, Leif Delgass wrote: This had crossed my mind too. The only problem is that there could still be a short period of time where BM_GUI_TABLE isn't accurate, so it still leaves the problem of being able to trust the contents of BM_GUI_TABLE for buffer aging and adding descriptors to the ring.

Yeah, but that's only a problem if you're aging them. This is more food-for-thought type stuff at this point -- it all boils down to the optimal secure way of doing things. The fewer things we do per submission, the better. We may still end up un-mapping things, but pushing out 8 bytes for each 4K is less work than pushing out a command for every vertex in the buffer. I'd love to come up with a way to not have to un-map things at all if possible, so that we're not doing that either. Like I said, it's not really something you want to do often (just like you don't want to do ioctls all that often either... :-)

-- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 03:50 pm, Leif Delgass wrote: It just occurred to me that resetting BM_GUI_TABLE could put the card into a loop. You might have to put in an address two descriptors away. We'd need to test this.

Hmm... Didn't think about that possibility. I had this last line of thought while I was mowing the yard (still doing yardwork...). I was going to plug your test case, the one José just came up with, and my proposed one into my test driver code and see what comes out of all of it.

-- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On 2002.05.25 21:50 Leif Delgass wrote: On Sat, 25 May 2002, Leif Delgass wrote: On Sat, 25 May 2002, José Fonseca wrote: ... This had crossed my mind too. The only problem is that there could still be a short period of time where BM_GUI_TABLE isn't accurate, so it still leaves the problem of being able to trust the contents of BM_GUI_TABLE for buffer aging and adding descriptors to the ring.

I see... I had the [wrong] impression that the value wasn't actually changed in the process since it was preincremented... have you tested whether this actually happens? Anyway, this doesn't prevent us from using buffer aging. At the end of processing a buffer we still end up with the correct value of BM_GUI_TABLE, and we can use the last bit of BM_COMMAND to know whether it's processing the last entry or not. The only drawback is that we aren't able to reclaim the buffers as soon, so we would need a scratch register to know which buffers were free or not.

But then, what damage can a client do by tampering with BM_COMMAND? Even if it can mess with BM_COMMAND, we can still work without it. We just let the card go on, but when it stops we just need to see which was the last processed buffer and go on. But probably there are more registers that the client can mess with and damage... it's probably not just BM_GUI_TABLE and BM_COMMAND but a bunch more of them, and we're effectively reducing the card's communication bandwidth by asking it to reset so many things at the end of _each_ 4KB buffer.

It just occurred to me that resetting BM_GUI_TABLE could put the card into a loop. You might have to put in an address two descriptors away. We'd need to test this.

Yes. Unfortunately I don't have the time to do it myself, so it will have to wait as far as I'm concerned. In any event, I think these issues really need to be addressed only when the DMA is complete.
Our knowledge of the card's behavior is constantly changing, and it would be a waste of time to make such an effort to make things work out only to discover later on that it was hopeless. In fact, although I try to stay impartial and keep an open mind about this, I can't stop thinking that we aren't going to accomplish anything by circumventing the card's security drawbacks like this. It's like covering the sun with that-thing-full-of-holes-whose-name-I-don't-recall [a sieve]... I'm getting more and more inclined towards a robust implementation, like unmapping buffers, which probably won't have a significant performance impact, and which will give us much more freedom to control the card properly. Well, time will answer this...

José Fonseca
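The buffer-aging scheme José sketches above (a scratch register holding the age of the last buffer the card finished) could look something like this, with the hardware side simulated; the structure and names here are illustrative, not the actual mach64 DRM code:

```c
#define NBUFS 8

/* Each submitted buffer carries a monotonically increasing age stamp; the
 * card is assumed to write the age of the last completed buffer into a
 * scratch register. Any in-use buffer whose age is <= that value is done
 * and can be reclaimed. */
struct dma_buf {
    unsigned age;   /* 0 = free / never submitted */
    int in_use;
};

/* Reclaim every buffer the card has finished with; returns how many. */
static int reclaim_buffers(struct dma_buf bufs[NBUFS], unsigned scratch_age)
{
    int reclaimed = 0;
    for (int i = 0; i < NBUFS; i++) {
        if (bufs[i].in_use && bufs[i].age <= scratch_age) {
            bufs[i].in_use = 0;
            bufs[i].age = 0;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

The drawback José mentions shows up here directly: buffers are only reclaimed when the scratch age advances, not the instant each one completes.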
Re: [Dri-devel] Mach64 dma fixes
On Saturday 25 May 2002 04:27 pm, you wrote: Anyway, this doesn't prevent us from using buffer aging. At the end of processing a buffer we still end up with the correct value of BM_GUI_TABLE, and we can use the last bit of BM_COMMAND to know whether it's processing the last entry or not. The only drawback is that we aren't able to reclaim the buffers as soon, so we would need a scratch register to know which buffers were free or not. But then, what damage can a client do by tampering with BM_COMMAND? Even if it can mess with BM_COMMAND, we can still work without it. We just let the card go on, but when it stops we just need to see which was the last processed buffer and go on. But probably there are more registers that the client can mess with and damage... it's probably not just BM_GUI_TABLE and BM_COMMAND but a bunch more of them, and we're effectively reducing the card's communication bandwidth by asking it to reset so many things at the end of _each_ 4KB buffer.

We're just going to have to play with it. As far as I'm concerned, we're going forward with things as they are at this point. I'm just spending a little of what little time I do have exhausting all possible ways of avoiding unneeded operations -- as long as my branch is mostly in lock-step with the functionality that you're providing, I'll be happy. I won't spend too much more time trying this stuff, but I want to know. If something comes of all of it, great. If not, well, it was my time to waste, and it's not really wasted -- we'll have answers when someone else comes along and asks if that's the best we can do with this chip.

I'm just not fond of having to secure the path the way it's currently being thought of. Not because I'm against doing the work -- there is a bottleneck in doing this stuff that way. Some of the things we could do to secure things don't consume a lot of CPU resources. Some of the things we would have to do, if we can't avoid that path, would consume a lot of resources.
If the books covering the design of the Linux kernel are to be believed, there's a decent amount of work involved in mapping or unmapping memory to/from userspace -- it really wasn't designed with the kind of use we're asking of it (namely doing a LOT of it, quickly and often...) in mind.

Yes. Unfortunately I don't have the time to do it myself, so it will have to wait as far as I'm concerned.

I'm planning on setting things up this evening to see if any of this is worth bothering with as a continuing conversation.

In any event, I think these issues really need to be addressed only when the DMA is complete. Our knowledge of the card's behavior is constantly changing, and it would be a waste of time to make such an effort to make things work out only to discover later on that it was hopeless.

Indeed. That's why I intend to tinker a little while longer with this, but only so much. As for knowledge of the card's behavior changing and whatnot: I'd believe that not everything is known about any of the other cards out there, either.

In fact, although I try to stay impartial and keep an open mind about this, I can't stop thinking that we aren't going to accomplish anything by circumventing the card's security drawbacks like this. It's like covering the sun with that-thing-full-of-holes-whose-name-I-don't-recall... I'm getting more and more inclined towards a robust implementation, like unmapping buffers, which probably won't have a significant performance impact, and which will give us much more freedom to control the card properly. Well, time will answer this...

That's my take on things...

-- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On Sat, 25 May 2002, Frank C. Earl wrote: Linus, if you're still listening in, can you spare us a moment to tell us what consequences quickly mapping and unmapping memory regions into userspace has on the system?

It's reasonably fine on UP, and it often _really_ sucks on SMP.

On UP, the real cost is not so much the actual TLB invalidate (which works at a page granularity anyway on any recent CPU), but the fact that you need to walk the page tables (cache miss heaven), and you will eventually need to fault another page back in (page fault, cache miss, whatever).

On SMP, especially if the program is threaded (which games often are: even if the actual graphics engine is single-threaded, you end up having another thread for sound, one possibly for AI or input, etc.), the cost goes up noticeably thanks to a (synchronous) CPU cross-call for a proper TLB invalidate.

We've got a couple of the DRM modules that do that to ensure the driver is secure. I'm thinking it's a source of some performance degradation in the drivers, and it may not be good for the memory subsystem.

My gut feel is that especially under SMP, you're actually better off copying stuff, especially if we're talking about buffers that are mostly less than a few kB. Basically, if the data can fit in the cache (ie the app has just generated it, and the data is already in the CPU cache and not big enough to blow that cache to kingdom come), copying is almost guaranteed to be a win, even on UP.

(And please do note the cache issues: while a big buffer can often improve performance, it can equally easily _decrease_ performance by putting more cache pressure on the system. You're often better off re-using a smaller 8kB buffer many times -- and doing most everything out of the cache -- than trying to use a 1MB buffer and aiming for perfect scaling).

DMA'ing directly from user space is most likely advantageous for doing things like textures, which are bound to be fairly big anyway.
I'd _hope_ that those don't have security issues (ie they'd be DMA'able as just data, no command interface), but I don't have any information about the card details.

Linus
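Linus's point that a small, cache-hot buffer can beat a big one is exactly the mechanism behind pipe throughput: data moves in small chunks through a bounce buffer that stays in L1. A minimal sketch of that chunked-copy pattern (the function name and sizes are illustrative, not from any driver):

```c
#include <string.h>

#define CHUNK 4096  /* pipe-sized chunk, small enough to stay hot in L1 */

/* Copy n bytes from src to dst through a small bounce buffer, CHUNK bytes
 * at a time. This is roughly what a pipe does internally: the 4 KB bounce
 * buffer is reused on every iteration and stays in cache, which is how
 * chunked copying can rival or beat one huge memcpy on a big payload. */
static void chunked_copy(char *dst, const char *src, size_t n)
{
    char bounce[CHUNK];
    size_t off = 0;

    while (off < n) {
        size_t len = (n - off < CHUNK) ? (n - off) : CHUNK;
        memcpy(bounce, src + off, len);   /* producer side: fill the chunk */
        memcpy(dst + off, bounce, len);   /* consumer side: drain the chunk */
        off += len;
    }
}
```

Despite doing twice the byte-moves of a single memcpy, the working set per step is only 3 × 4 KB, which is the cache behaviour the lmbench numbers below reward.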
Re: [Dri-devel] Mach64 dma fixes
On 2002.05.26 00:49 Linus Torvalds wrote: On Sat, 25 May 2002, Frank C. Earl wrote: Linus, if you're still listening in, can you spare us a moment to tell us what consequences quickly mapping and unmapping memory regions into userspace has on the system?

It's reasonably fine on UP, and it often _really_ sucks on SMP. On UP, the real cost is not so much the actual TLB invalidate (which works at a page granularity anyway on any recent CPU), but the fact that you need to walk the page tables (cache miss heaven), and you will eventually need to fault another page back in (page fault, cache miss, whatever). On SMP, especially if the program is threaded (which games often are: even if the actual graphics engine is single-threaded, you end up having another thread for sound, one possibly for AI or input, etc.), the cost goes up noticeably thanks to a (synchronous) CPU cross-call for a proper TLB invalidate.

We've got a couple of the DRM modules that do that to ensure the driver is secure. I'm thinking it's a source of some performance degradation in the drivers, and it may not be good for the memory subsystem.

My gut feel is that especially under SMP, you're actually better off copying stuff, especially if we're talking about buffers that are mostly less than a few kB.

The vertex data alone (no textures here) can be several MBs per frame, and the number of frames per second can be as high as the card can handle, so the total buffer memory must also be big. I don't know whether having lots of small buffers would create overhead from the ioctls and buffer submission (well, mostly the ioctls, since buffers can be queued by the kernel). Throwing some numbers just to get a rough idea: 2[MB/frame] x 25[frames/second] / 4[KB/buffer] = 12800 buffers/second. I'm not very familiar with these issues, but won't this number of ioctls per second create a significant overhead here? Or would the benefits of having each buffer fit in the cache (facilitating the copy) prevail?
At the other extreme we would have, e.g., a 2MB buffer costing a single ioctl plus unmapping/mapping to user space. (I know that the most likely outcome is that we will need to benchmark this anyway...)

Basically, if the data can fit in the cache (ie the app has just generated it, and the data is already in the CPU cache and not big enough to blow that cache to kingdom come), copying is almost guaranteed to be a win, even on UP. (And please do note the cache issues: while a big buffer can often improve performance, it can equally easily _decrease_ performance by putting more cache pressure on the system. You're often better off re-using a smaller 8kB buffer many times -- and doing most everything out of the cache -- than trying to use a 1MB buffer and aiming for perfect scaling). DMA'ing directly from user space is most likely advantageous for doing things like textures, which are bound to be fairly big anyway. I'd _hope_ that those don't have security issues (ie they'd be DMA'able as just data, no command interface), but I don't have any information about the card details.

Yes. There are several ways to accomplish blits with this card, and at least one of them is secure and efficient.

José Fonseca
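José's back-of-the-envelope figure checks out; computed directly (2 MB of vertex data per frame at 25 frames/second, split into 4 KB buffers):

```c
/* José's estimate as straight arithmetic: how many DMA buffer submissions
 * per second a given vertex throughput implies. */
static long buffers_per_second(long mb_per_frame, long fps, long buf_bytes)
{
    return mb_per_frame * 1024 * 1024 * fps / buf_bytes;
}
```

With (2, 25, 4096) this yields 12800, matching the number in the message.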
Re: [Dri-devel] Mach64 dma fixes
On Sun, 26 May 2002, José Fonseca wrote: The vertex data alone (no textures here) can be several MBs per frame

Yes, yes, I realize that the cumulative sizes are big. The question is not the absolute size, but the size of one bunch.

Throwing some numbers just to get a rough idea: 2[MB/frame] x 25[frames/second] / 4[KB/buffer] = 12800 buffers/second.

The thing is, if you do processing of vertexes, I wouldn't be surprised if you're better off using an 8kB buffer over and over, just doing 6400 system call entries, than to actually try to buffer up 2MB and then do only 25 system call entries. Sure, in one case you do 6400 system calls and in the other case you do only 25, so people who are afraid of system calls think that obviously the 25 system calls must be faster. But that obviously is just wrong. Pretty much all modern CPU's handle big working sets badly, and handle tight nice loops very well.

Just to take a non-graphics-related example: on my machine, lmbench reports that I get pipe bandwidths that sometimes exceed 1GB/s. At the same time, a normal memcpy() goes along at 625MB/s. In short: according to that benchmark it is _faster_ to copy data from one process to another through a pipe than it is to use memcpy() within one process. That's obviously a load of bull, and yet lmbench isn't really lying. The reason the pipe throughput is higher than the memory copy throughput is simply that the pipe data is chunked up in 4kB chunks, and because the source and the destinations are re-used in the pipe benchmarks in 64kB chunks, you get a lot better cache behaviour. (In fact, even TCP beats a plain memcpy() occasionally, which also says that the Linux TCP layer is an impressive piece of work _despite_ the same cache advantage).

I'm not very familiar with these issues, but won't this number of ioctls per second create a significant overhead here? Or would the benefits of having each buffer fit in the cache (facilitating the copy) prevail?
A hot system call takes about 0.2 us on an Athlon (it takes significantly longer on a P4, which I'm beating up Intel over all the time). The ioctl stuff goes through slightly more layers, but we're not talking huge numbers here. The system calls are fast enough that you're better off trying to keep stuff in the cache than trying to minimize system calls.

(The memcpy() example is a perfect example of something where _zero_ system calls is slower than two system calls and a task switch, simply because the zero-system-call example ends up being all noncached).

Note that the cache issues show up on the instruction side too, especially on the P4 which has a fairly small trace cache. You can often make things go faster by simplifying and streamlining the code rather than trying to be clever and having a big footprint. Ask Keith Packard about the X frame buffer code and this very issue some day.

NOTE NOTE NOTE! The tradeoffs are seldom all that clear. Sometimes big buffers and few system calls are better. Sometimes they aren't. It just depends on a lot of things.

Linus
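The "hot system call" cost is easy to measure for yourself. A rough sketch, assuming a Linux host: it times a tight loop of getpid() issued via syscall() so glibc's PID caching can't short-circuit it. Absolute numbers vary widely by CPU and kernel; this is a measurement harness, not a claim about any particular figure.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average cost, in nanoseconds, of one hot getpid() system call. */
static double ns_per_syscall(long iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);            /* direct syscall, no libc caching */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

On the 2002-era Athlon Linus describes this would come out near 200 ns; modern hardware differs.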
Re: [Dri-devel] Mach64 dma fixes
Around 18 o'clock on May 25, Linus Torvalds wrote: You can often make things go faster by simplifying and streamlining the code rather than trying to be clever and having a big footprint. Ask Keith Packard about the X frame buffer code and this very issue some day.

The frame buffer code has very different tradeoffs; all of the memory references are across the PCI/AGP bus, so even issues like instruction caches are pretty much lost in the noise. That means you can completely ignore instruction count issues when estimating algorithm performance and look only at bus cycles. The result is similar; code gets rolled up into the smallest space, but not entirely for efficiency -- rather to make it easier to understand and count memory cycles. Of course, it's also nice to avoid trashing the i-cache so that when the frame buffer access is done there isn't a huge penalty in getting back to the rest of the X server.

Reading data from the frame buffer takes nearly forever -- uncached PCI/AGP reads are completely synchronous. The frame buffer code stands on its head to avoid that, even at the cost of some significant code expansion in places. For example, when filling rectangles, the edges are often not aligned on 32-bit boundaries. It's much more efficient to do a sequence of byte/short writes than the read-mask-write cycle that the older frame buffer code used.

Writes are a bit better, but the lame Intel CPUs can't saturate an AGP bus in write-combining mode -- that mode doesn't go through the regular cache logic and instead uses a separate buffer which isn't deep enough to cover the bus latency. Hence the performance difference between DMA and PIO for simple 2D graphics operations.
The code also takes advantage of dynamic branch prediction; tests which resolve the same direction each pass through a loop are left inside the loop instead of duplicating the code to avoid the test; there isn't a pipeline branch penalty while running through the loop because the predictor will guess right every time.

The result is code which handles all of the X data formats (1,4,8,16,24,32) in about half the space the older code used to handle only a single format. The old code was optimized for 60ns CPUs with 300ns memory systems; new machines have much faster CPUs but only marginally faster memory. Getting a chance to implement the same spec in two radically different performance environments has been a lot of fun.

Keith Packard, XFree86 Core Team, HP Cambridge Research Lab
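The write-only unaligned-edge technique Keith describes can be sketched for the 8bpp case. This is an illustration of the idea, not XFree86 code: byte stores cover the unaligned head and tail of the span, 32-bit stores cover the aligned middle, and nothing ever performs the slow synchronous read that a read-mask-write cycle would need.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a horizontal span of an 8bpp frame buffer using only writes. */
static void span_fill8(uint8_t *fb, size_t start, size_t len, uint8_t pix)
{
    uint8_t *p = fb + start;
    uint32_t wide = pix * 0x01010101u;   /* pixel replicated into a dword */

    while (len && ((uintptr_t)p & 3)) {  /* head: bytes up to 4-byte alignment */
        *p++ = pix;
        len--;
    }
    while (len >= 4) {                   /* middle: aligned 32-bit stores */
        *(uint32_t *)(void *)p = wide;
        p += 4;
        len -= 4;
    }
    while (len--)                        /* tail: remaining bytes */
        *p++ = pix;
}
```

The dynamic-branch-prediction point applies here too: the head and tail loops run at most three iterations, and the middle loop's test resolves the same way every pass.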
Re: [Dri-devel] Mach64 dma fixes
On Thursday 23 May 2002 04:37 pm, Leif Delgass wrote: I've committed code to read BM_GUI_TABLE to reclaim processed buffers and disabled frame and buffer aging with the pattern registers. I've disabled saving/restoring the pattern registers in the DDX and moved the wait for idle to the XAA Sync. This fixes the slowdown on mouse moves. I also fixed a bug in getting the ring head. One bug that remains is that when starting tuxracer or quake (and possibly other apps) from a fresh restart of X, there is a problem where old bits of the back- or frame-buffer show through. With tuxracer (windowed), if I move the window, the problem goes away. It seems that some initial state is not being set correctly or clears aren't working. If I run glxgears (or even tunnel, which uses textures) first, after starting X and before starting another app, the problem isn't there. If someone has a cvs build or binary from before Monday the 20th but after Saturday the 18th, could you test to see if this happens? I'm not sure if this is new behavior or not. I tried removing the flush on swaps in my tree and things seem to still work fine (the bug mentioned above is still there, however). We may need to think of an alternate way to do frame aging and throttling, without using a pattern register.

I've been pondering the code you've done (not the latest committed, but what was described to me a couple of weeks back...) -- how do you account for securing the BM_GUI_TABLE check and the pattern register aging in light of the engine being able to write to almost all registers? It occurred to me that there's a potential security risk (allowing malicious clients to possibly confuse/hang the engine) with the design described to me back a little while ago.

-- Frank Earl
Re: [Dri-devel] Mach64 dma fixes
On Fri, 24 May 2002, Frank C. Earl wrote: On Thursday 23 May 2002 04:37 pm, Leif Delgass wrote: I've committed code to read BM_GUI_TABLE to reclaim processed buffers and disabled frame and buffer aging with the pattern registers. I've disabled saving/restoring the pattern registers in the DDX and moved the wait for idle to the XAA Sync. This fixes the slowdown on mouse moves. I also fixed a bug in getting the ring head. One bug that remains is that when starting tuxracer or quake (and possibly other apps) from a fresh restart of X, there is a problem where old bits of the back- or frame-buffer show through. With tuxracer (windowed), if I move the window, the problem goes away. It seems that some initial state is not being set correctly or clears aren't working. If I run glxgears (or even tunnel, which uses textures) first, after starting X and before starting another app, the problem isn't there. If someone has a cvs build or binary from before Monday the 20th but after Saturday the 18th, could you test to see if this happens? I'm not sure if this is new behavior or not. I tried removing the flush on swaps in my tree and things seem to still work fine (the bug mentioned above is still there, however). We may need to think of an alternate way to do frame aging and throttling, without using a pattern register.

I've been pondering the code you've done (not the latest committed, but what was described to me a couple of weeks back...) -- how do you account for securing the BM_GUI_TABLE check and the pattern register aging in light of the engine being able to write to almost all registers? It occurred to me that there's a potential security risk (allowing malicious clients to possibly confuse/hang the engine) with the design described to me back a little while ago.

Well, I just went back and looked at José's test for writing BM_GUI_TABLE_CMD from within a buffer and realized that it had a bug: the register addresses weren't converted to MM offsets.
So I fixed that and ran the test. With two descriptors, writing BM_GUI_TABLE_CMD does not cause the second descriptor to be read from the new address, but BM_GUI_TABLE reads back with the new address written in the first buffer at the end of the test. Then I tried setting up three descriptors, and lo and behold, after processing the first two descriptors, the engine switches to the new table address written in the first buffer! I think it's because of the pre-incrementing (prefetching?) of BM_GUI_TABLE that there's a delay of one descriptor, but it IS possible to derail a bus master in progress and set it processing from a different table in mid-stream. Plus, if the address is bogus or the table is misconstructed, this will cause an engine lockup and take out DMA until the machine is cold-restarted. The code for the test I used is attached.

So it would appear that allowing clients to add register commands to a buffer without verifying them is _not_ secure. This is going to make things harder, especially for vertex buffers, since it will require copying buffers and adding commands, or unmapping and verifying client-submitted buffers, in the DRM. I'd like to continue on the path we're on until we can get DMA running smoothly, and then we'll have to come back and fix this problem.
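The kernel-side verification Leif concludes is necessary might be sketched as follows. The MM offsets and the (register, value) pair layout here are hypothetical stand-ins; the real Mach64 command stream is richer, so this only illustrates the allow/deny walk, not the actual mach64 DRM code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MM offsets for the bus-master control registers. */
#define MM_BM_GUI_TABLE 0x0140u
#define MM_BM_COMMAND   0x0144u

/* Walk a client-submitted buffer of (register offset, value) pairs and
 * refuse any write to a DMA control register before the buffer is queued.
 * Returns 0 if the buffer is safe, -1 if it must be rejected. */
static int verify_buffer(const uint32_t *cmds, size_t ndwords)
{
    for (size_t i = 0; i + 1 < ndwords; i += 2) {
        uint32_t reg = cmds[i];
        if (reg == MM_BM_GUI_TABLE || reg == MM_BM_COMMAND)
            return -1;  /* could redirect or stall the bus master */
    }
    return 0;
}
```

In practice the check would run over a kernel-owned copy (or an unmapped buffer), since verifying memory the client can still write is useless.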
-- Leif Delgass http://www.retinalburn.net

static int mach64_bm_dma_test2( drm_device_t *dev )
{
	drm_mach64_private_t *dev_priv = dev->dev_private;
	dma_addr_t data_handle, data2_handle, table2_handle;
	void *cpu_addr_data, *cpu_addr_data2, *cpu_addr_table2;
	u32 data_addr, data2_addr, table2_addr;
	u32 *table, *data, *table2, *data2;
	u32 regs[3], expected[3];
	int i;

	DRM_DEBUG( "%s\n", __FUNCTION__ );

	table = (u32 *) dev_priv->cpu_addr_table;

	/* FIXME: get a dma buffer from the freelist here rather than using the pool */
	DRM_DEBUG( "Allocating data memory ...\n" );
	cpu_addr_data = pci_pool_alloc( dev_priv->pool, SLAB_ATOMIC, &data_handle );
	cpu_addr_data2 = pci_pool_alloc( dev_priv->pool, SLAB_ATOMIC, &data2_handle );
	cpu_addr_table2 = pci_pool_alloc( dev_priv->pool, SLAB_ATOMIC, &table2_handle );
	if (!cpu_addr_data || !data_handle || !cpu_addr_data2 || !data2_handle ||
	    !cpu_addr_table2 || !table2_handle) {
		DRM_INFO( "data-memory allocation failed!\n" );
		return -ENOMEM;
	} else {
		data = (u32 *) cpu_addr_data;
		data_addr = (u32) data_handle;
		data2 = (u32 *) cpu_addr_data2;
		data2_addr = (u32) data2_handle;
		table2 = (u32 *) cpu_addr_table2;
		table2_addr = (u32) table2_handle;
	}

	DRM_INFO( "data1: 0x%08x data2: 0x%08x\n", data_addr, data2_addr );
	DRM_INFO( "table2: 0x%08x\n", table2_addr );

	MACH64_WRITE(
Re: [Dri-devel] Mach64 DMA, blits, AGP textures
On Sat, 18 May 2002, Felix Kühling wrote: On Sat, 18 May 2002 11:30:28 -0400 (EDT) Leif Delgass [EMAIL PROTECTED] wrote: Did you have a 2D accelerated server running on another vt? The DDX saves and restores its register state on mode switches, so it could be a problem with the FIFO depth or pattern registers being changed. Try testing without another X server running if you haven't already. Also, does anything show up in the system log?

I did have a 2D accelerated X server running. But I started the DRI server from a text console and didn't switch between the servers during the tests, so it shouldn't matter. As to the syslog, my kern.log file is actually the syslog. My distro configures syslog so that it sends all syslog messages from the kernel to kern.log. In the other syslog files there are no further messages.

Sorry, I was in a hurry before and didn't read your message very carefully. I think I've fixed the segfault; can you do a cvs update in xc/programs/Xserver/hw/xfree86/drivers/ati and then do a 'make install' in that directory? You should only need to rebuild the DDX.

-- Leif Delgass http://www.retinalburn.net

___ Hundreds of nodes, one monster rendering program. Now that's a super model! Visit http://clustering.foundries.sf.net/ ___ Dri-devel mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [Dri-devel] Mach64 DMA, blits, AGP textures
On 2002.05.18 15:40 Felix Kühling wrote: On Sat, 18 May 2002 15:01:51 +0200 Felix Kühling [EMAIL PROTECTED] wrote: For this test I compiled everything with gcc-2.95.4. I had a different problem after compiling with gcc-3.0. I have to try that again and check for compile errors. The problem was that the X server segfaulted on startup. I'll report more details later.

Ok, I recompiled with gcc-3.0 again. There are no errors in world.log. The X server segfaults on startup. Note that I had a working Xserver+DRI compiled with gcc-3.0 before Leif's last changes. These are the relevant parts of my logfiles:

kern.log: [...]
May 18 16:18:28 viking kernel: [drm] AGP 0.99 on VIA Apollo KT133 @ 0xd000 64MB
May 18 16:18:28 viking kernel: [drm] Initialized mach64 1.0.0 20020417 on minor 0
May 18 16:18:29 viking kernel: [drm] Setting FIFO size to 128 entries
May 18 16:18:29 viking kernel: [drm] Creating pci pool
May 18 16:18:29 viking kernel: [drm] Allocating descriptor table memory
May 18 16:18:29 viking kernel: [drm] descriptor table: cpu addr: 0xc0268000, bus addr: 0x00268000
May 18 16:18:29 viking kernel: [drm] Starting DMA test...
May 18 16:18:29 viking kernel: [drm] starting DMA transfer...
May 18 16:18:29 viking kernel: [drm] waiting for idle...
May 18 16:18:29 viking kernel: [drm] waiting for idle...done
May 18 16:18:29 viking kernel: [drm] DMA test succeeded, using asynchronous DMA mode

XFree86.1.log: [...]
(==) ATI(0): Write-combining range (0xd400,0x80)
(II) ATI(0): [drm] SAREA 2200+1212: 3412
drmOpenDevice: minor is 0
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is -1, (No such device)
drmOpenDevice: Open failed
drmOpenDevice: minor is 0
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is -1, (No such device)
drmOpenDevice: Open failed
drmOpenDevice: minor is 0
drmOpenDevice: node name is /dev/dri/card0
drmOpenDevice: open result is 11, (OK)
drmGetBusid returned ''
(II) ATI(0): [drm] loaded kernel module for mach64 driver
(II) ATI(0): [drm] created mach64 driver at busid PCI:1:0:0
(II) ATI(0): [drm] added 8192 byte SAREA at 0xd08bf000
(II) ATI(0): [drm] mapped SAREA 0xd08bf000 to 0x40015000
(II) ATI(0): [drm] framebuffer handle = 0xd400
(II) ATI(0): [drm] added 1 reserved context for kernel
(II) ATI(0): [agp] Using AGP 1x Mode
(II) ATI(0): [agp] Using 8 MB AGP aperture
(II) ATI(0): [agp] Mode 0x1f000201 [AGP 0x1106/0x0305; Card 0x1002/0x474d]
(II) ATI(0): [agp] 8192 kB allocated with handle 0xd10c3000
(II) ATI(0): [agp] Using 2 MB DMA buffer size
(II) ATI(0): [agp] vertex buffers handle = 0xd000
(II) ATI(0): [agp] Vertex buffers mapped at 0x40a38000
(II) ATI(0): [agp] AGP texture region handle = 0xd020
(II) ATI(0): [agp] AGP Texture region mapped at 0x40c38000
(II) ATI(0): [drm] register handle = 0xd600
(II) ATI(0): [dri] Visual configs initialized
(II) ATI(0): [dri] Block 0 base at 0xd6000400
(II) ATI(0): Memory manager initialized to (0,0) (640,1637)
(II) ATI(0): Reserved back buffer from (0,480) to (640,960)
(II) ATI(0): Reserved depth buffer from (0,960) to (640,1440)
(II) ATI(0): Reserved 6144 kb for textures at offset 0x1ff900
(II) ATI(0): Largest offscreen areas (with overlaps):
(II) ATI(0): 640 x 6072 rectangle at 0,480
(II) ATI(0): 512 x 6073 rectangle at 0,480
(**) ATI(0): Option XaaNoScreenToScreenCopy
(II) ATI(0): Using XFree86 Acceleration Architecture (XAA)
Setting up tile and stipple cache: 32 128x128
slots, 18 256x256 slots, 7 512x512 slots
(==) ATI(0): Backing store disabled
(==) ATI(0): Silken mouse enabled
(**) Option dpms
(**) ATI(0): DPMS enabled
(II) ATI(0): X context handle = 0x0001
(II) ATI(0): [drm] installed DRM signal handler
(II) ATI(0): [DRI] installation complete
(II) ATI(0): [drm] Added 128 16384 byte DMA buffers

Fatal server error:
Caught signal 11.  Server aborting

I also got a debugger backtrace after the segfault:

Program received signal SIGSEGV, Segmentation fault.
0x086c0c3c in ?? ()
#0  0x086c0c3c in ?? ()
#1  0x086bef1b in ?? ()
#2  0x080bfe18 in AddScreen (pfnInit=0x86be944, argc=5, argv=0xba04) at main.c:768
#3  0x0806c425 in InitOutput (pScreenInfo=0x81cda00, argc=5, argv=0xba04) at xf86Init.c:819
#4  0x080bf378 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380

I know, this backtrace is not very helpful. Is there a way to get the ?? resolved? Regards, Felix Nice report ;-) Try with xfree-gdb (http://www.dawa.demon.co.uk/xfree-gdb/) to see if you have better luck. José Fonseca ___ Hundreds of nodes, one monster rendering program. Now that's a super model! Visit http://clustering.foundries.sf.net/ ___ Dri-devel mailing list [EMAIL PROTECTED]
Re: [Dri-devel] Mach64 DMA, blits, AGP textures
On Sat, 18 May 2002 11:30:28 -0400 (EDT) Leif Delgass [EMAIL PROTECTED] wrote: Did you have a 2D accelerated server running on another vt? The DDX saves and restores its register state on mode switches, so it could be a problem with the FIFO depth or pattern registers being changed. Try testing without another X server running if you haven't already. Also, does anything show up in the system log? I did have a 2D accelerated X-server running. But I started the DRI server from a text console and didn't switch between the servers during the tests, so it shouldn't matter. As to the syslog, my kern.log file is actually the syslog. My distro configures syslog so that it sends all syslog messages from the kernel to kern.log. In the other syslog files there are no further messages. Regards, Felix __\|/_____ ___ ___ __Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___ _Felix___\Ä/\ \_\ \_\ \__U___just not everything [EMAIL PROTECTED]o__/ \___/ \___/at the same time! ___ Hundreds of nodes, one monster rendering program. Now that's a super model! Visit http://clustering.foundries.sf.net/ ___ Dri-devel mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [Dri-devel] Mach64 DMA, blits, AGP textures
On Sat, 18 May 2002 15:56:00 +0100 José Fonseca [EMAIL PROTECTED] wrote: Nice report ;-) Thanks :) Try with xfree-gdb (http://www.dawa.demon.co.uk/xfree-gdb/) to see if you have better luck. Yep, that gave better results. Since I have only one computer here and the display turns black, I had to do this with a gdb command script. This is the script:

run :1 vt8 -xf86config XF86Config-mach64004
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt

Here is the log:

Program received signal SIGUSR1, User defined signal 1.
_loader_debug_state () at loader.c:1331
1331 {
#0  _loader_debug_state () at loader.c:1331
#1  0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, ppLookup=0xb848) at loader.c:1036
#2  0x0809cd56 in LoaderOpen ( module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, wasLoaded=0xb898) at loader.c:1183
#3  0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924
#4  0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790) at xf86Init.c:1716
#5  0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04) at xf86Init.c:358
#6  0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380
#7  0x4006e14f in __libc_start_main () from /lib/libc.so.6

Program received signal SIGUSR1, User defined signal 1.
_loader_debug_state () at loader.c:1331 1331{ #0 _loader_debug_state () at loader.c:1331 #1 0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, ppLookup=0xb848) at loader.c:1036 #2 0x0809cd56 in LoaderOpen ( module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, wasLoaded=0xb898) at loader.c:1183 #3 0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924 #4 0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790) at xf86Init.c:1716 #5 0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04) at xf86Init.c:358 #6 0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380 #7 0x4006e14f in __libc_start_main () from /lib/libc.so.6 Program received signal SIGUSR1, User defined signal 1. _loader_debug_state () at loader.c:1331 1331{ #0 _loader_debug_state () at loader.c:1331 #1 0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, ppLookup=0xb848) at loader.c:1036 #2 0x0809cd56 in LoaderOpen ( module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, wasLoaded=0xb898) at loader.c:1183 #3 0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924 #4 0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790) at xf86Init.c:1716 #5 0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04) at xf86Init.c:358 #6 0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380 #7 0x4006e14f in __libc_start_main () from /lib/libc.so.6 Program received signal SIGUSR1, User defined signal 1. 
_loader_debug_state () at loader.c:1331 1331{ #0 _loader_debug_state () at loader.c:1331 #1 0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, ppLookup=0xb848) at loader.c:1036 #2 0x0809cd56 in LoaderOpen ( module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, wasLoaded=0xb898) at loader.c:1183 #3 0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924 #4 0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790) at xf86Init.c:1716 #5 0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04) at xf86Init.c:358 #6 0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380 #7 0x4006e14f in __libc_start_main () from /lib/libc.so.6 Program received signal SIGUSR1, User defined signal 1. _loader_debug_state () at loader.c:1331 1331{ #0 _loader_debug_state () at loader.c:1331 #1 0x0809c71a in ARCHIVELoadModule (modrec=0x85d7188, arfd=8, ppLookup=0xb848) at loader.c:1036 #2 0x0809cd56 in LoaderOpen ( module=0x862ed08 /usr/X11R6-mach64004/lib/modules/fonts/libspeedo.a, cname=0x8667760 speedo, handle=0,
Re: [Dri-devel] Mach64 DMA, blits, AGP textures
On Sat, 18 May 2002 18:26:52 +0100 José Fonseca [EMAIL PROTECTED] wrote: I also have to start using another X server in a separate window, because having to log out every time I want to test is a PITA. I'm not sure whether I get this correctly. Anyway, I have my 2D X server running on vt7 and start the 3D X server from a text console on vt8. bt continue ... Here is the log: ...

Program received signal SIGSEGV, Segmentation fault.
0x082385e0 in DRILock (pScreen=0x0, flags=0) at dri.c:1759
1759 DRIScreenPrivPtr pDRIPriv = DRI_SCREEN_PRIV(pScreen);
#0  0x082385e0 in DRILock (pScreen=0x0, flags=0) at dri.c:1759

The problem is that pScreen is NULL here and DRILock is trying to dereference it. This is the second sigsegv. I found it strange that the debugger could continue after the first one. I assume that this one actually happens while the first one is being handled.

#1  0x086d9ffe in intE6_handler ()
#2  0x086ff93d in VBEGetVBEpmi () at atipreinit.c:548
#3  0x08706fa9 in fbBlt (srcLine=0x0, srcStride=0, srcX=0, dstLine=0x8706fa9, dstStride=-1073744732, dstX=0, width=0, height=141643160, alu=1, pm=141643240, bpp=137618992, reverse=0, upsidedown=142025672) at fbblt.c:295
#4  0x080a8ca8 in xf86XVLeaveVT (index=0, flags=0) at xf86xv.c:1241
#5  0x0806d5de in AbortDDX () at xf86Init.c:1135
#6  0x080dbf20 in AbortServer () at utils.c:436
#7  0x080dd62f in FatalError () at utils.c:1399
#8  0x08080d0b in xf86SigHandler (signo=11) at xf86Events.c:1085

See ...

#9  0x4007e6b8 in sigaction () from /lib/libc.so.6
#10 0x086ea8d8 in intE6_handler ()
#11 0x080c60f0 in AddScreen (pfnInit=0x86ea268 intE6_handler+79240, argc=5, argv=0xba04) at main.c:768
#12 0x0806c383 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04) at xf86Init.c:819
#13 0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380
#14 0x4006e14f in __libc_start_main () from /lib/libc.so.6

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
I grepped for intE6_handler and found it in programs/Xserver/hw/xfree86/int10/xf86int10.c. I think it's not mach64-specific and it hasn't changed since January. So the actual problem must be somewhere else. Don't forget that the problem occurred in DRILock and not intE6_handler. The first sigsegv occurred in intE6_handler. Let's try to eliminate the simplest options first. I noticed on the CVS update log that Leif changed quite a few places. You mentioned in your first post that you'd recompiled everything. Did you also re-install everything? I've often had problems because I forgot to recompile/reinstall parts that had been changed. I used make world to recompile everything and make install to install. I also copied the kernel module. Did I forget anything? Regards, Felix
Re: [Dri-devel] Mach64 DMA
On Monday 11 March 2002 01:55, Frank C. Earl wrote: On Sunday 10 March 2002 11:44 am, José Fonseca wrote: I really don't know much about that, since it must have happened before I subscribed to this mailing list, but perhaps you'll want to take a look at the Utah-GLX and this list's archives. You can get these archives in mbox format, and also filtered with just the messages regarding mach64, at http://mefriss1.swan.ac.uk/~jfonseca/dri/mailing-lists/ The problem was that the XAA driver for mach64 was setting the FIFO size up for some reason and it was leaving the chip in a state that wouldn't work for the DMA mode. If we set the size back to the default setting before we do the first DMA pass, everybody's happy. I suspect if we got with the developer of the XAA driver we can sell him on leaving that setting alone in the driver's setup. Sorry for being silent for so long, gang. Been, yet again, crushed under with lovely personal business. I have started a new branch (mach64-0-0-3-dma-branch), and I'm actually putting the hacks I've been playing with into a unified DMA framework. I should be putting the first updates to the branch in over the next couple of days. Of note, when I did find some spare time, I ran tests on what we needed to do to secure the chip's DMA path. I found out some interesting facts. It will accept any values written to the registers. It will not act on any of those settings during the DMA pass unless they're a GUI-specific operation when it's doing a command-type DMA. It will not act on many of the settings after a DMA pass is complete. It will not let you set up any sort of DMA pass during the operation. Junk commands, by themselves, do not seem to hose up the engine in operation. Mapping and unmapping a memory space is somewhat compute-intensive. Thanks Frank, this was just what I was after...
Re: [Dri-devel] Mach64 DMA
On 2002.03.10 11:30 Robert Lunnon wrote: A while back there was a problem with the Mach64 initialisation such that it locked up after executing DMA. Can someone point at what the resolution to that problem was and where things were patched so I can have a look at it? Thanks Bob I really don't know much about that, since it must have happened before I subscribed to this mailing list, but perhaps you'll want to take a look at the Utah-GLX and this list's archives. You can get these archives in mbox format, and also filtered with just the messages regarding mach64, at http://mefriss1.swan.ac.uk/~jfonseca/dri/mailing-lists/ I hope this helps. Regards, José Fonseca
Re: [Dri-devel] Mach64 DMA
On Sunday 10 March 2002 11:44 am, José Fonseca wrote: I really don't know much about that, since it must have happened before I subscribed to this mailing list, but perhaps you'll want to take a look at the Utah-GLX and this list's archives. You can get these archives in mbox format, and also filtered with just the messages regarding mach64, at http://mefriss1.swan.ac.uk/~jfonseca/dri/mailing-lists/ The problem was that the XAA driver for mach64 was setting the FIFO size up for some reason and it was leaving the chip in a state that wouldn't work for the DMA mode. If we set the size back to the default setting before we do the first DMA pass, everybody's happy. I suspect if we got with the developer of the XAA driver we can sell him on leaving that setting alone in the driver's setup. Sorry for being silent for so long, gang. Been, yet again, crushed under with lovely personal business. I have started a new branch (mach64-0-0-3-dma-branch), and I'm actually putting the hacks I've been playing with into a unified DMA framework. I should be putting the first updates to the branch in over the next couple of days. Of note, when I did find some spare time, I ran tests on what we needed to do to secure the chip's DMA path. I found out some interesting facts. It will accept any values written to the registers. It will not act on any of those settings during the DMA pass unless they're a GUI-specific operation when it's doing a command-type DMA. It will not act on many of the settings after a DMA pass is complete. It will not let you set up any sort of DMA pass during the operation. Junk commands, by themselves, do not seem to hose up the engine in operation. Mapping and unmapping a memory space is somewhat compute-intensive. -- Frank Earl
Re: [Dri-devel] Mach64 DMA
On 2002.03.10 15:55 Frank C. Earl wrote: On Sunday 10 March 2002 11:44 am, José Fonseca wrote: ... Sorry for being silent for so long, gang. Been, yet again, crushed under with lovely personal business. I have started a new branch (mach64-0-0-3-dma-branch), and I'm actually putting the hacks I've been playing with into a unified DMA framework. I should be putting the first updates to the branch in over the next couple of days. I look forward to checking it out. Of note, when I did find some spare time, I ran tests on what we needed to do to secure the chip's DMA path. I found out some interesting facts. It will accept any values written to the registers. It will not act on any of those settings during the DMA pass unless they're a GUI-specific operation when it's doing a command-type DMA. It will not act on many of the settings after a DMA pass is complete. It will not let you set up any sort of DMA pass during the operation. Junk commands, by themselves, do not seem to hose up the engine in operation. I didn't fully understand the implications of the above, but shouldn't direct access to the chip registers still be denied to clients? Mapping and unmapping a memory space is somewhat compute-intensive. This one has to be compared to the time it takes to copy a buffer, unless there is a way to do it in a secure manner without copying or unmapping. -- Frank Earl José Fonseca
Re: [Dri-devel] Mach64 DMA
On Sunday 10 March 2002 04:36 pm, José Fonseca wrote: I didn't fully understand the implications of the above, but shouldn't direct access to the chip registers still be denied to clients? Depends. Looking at the gamma source code (I could be wrong, mind...) it appears that the DRM is taking in direct commands from userspace in DMA buffers and sending them on to the chip as time goes along. If you can make it so that it doesn't lock up the card or the machine and doesn't compromise system security in any way (i.e. issuing a DMA pass from a part of memory to the framebuffer and then from the framebuffer to another part of memory so as to clobber things or to pilfer info from other areas), it's pretty safe to do direct commands. From observations, it appears that you can't get the engine to do anything except GUI operations during a properly set up GUI-mastering pass. The only risks that I can see right at the moment with sending direct commands over indirect commands are one of not resetting certain select registers to what they were before the pass, and one of the engine not handling certain GUI operations well. The first is easily taken care of by having the driver have a 4k block submitted in the descriptor chain as the last entry that updates those registers accordingly- the list of commands should only need to be built once and reused often, since these registers won't be changed by the DRM engine, Mesa driver, or the XAA driver after the XAA driver does its setup for the chip operation. The second case is a tough one, and one that copying/mapping won't protect you from- you have to process commands to prevent them from occurring (compute-intensive, and there might be other cases; each time you'd have to come up with yet another workaround) or find something to detect a hang on the engine and reset it properly. I seriously doubt that we'd encounter one of those, but we might all the same.
I've still got one or two more tests to run (I've yet to deliberately hang the engine, detect the same, and then do a reset- but then I've yet to be able to hang the engine with it _properly_ set up...) but most of the innards for copying commands or whatever would be largely the same (some of the interfaces might change, but that's less of an issue than the heart of the DMA engine itself, which is the same no matter what...) so I'm going to get _SOMETHING_ in place to see what our performance actually would be with some DMA operation going on. Mapping and unmapping a memory space is somewhat compute-intensive. This one has to be compared to the time it takes to copy a buffer, unless there is a way to do it in a secure manner without copying or unmapping. If you don't have issues with sending commands directly, you don't need to copy or map/unmap. You don't need special clear commands or swap commands, you only need to issue DMAable buffers of commands to the DRM engine for eventual submission to the chip. Right now, I'm not 100% sure that the mach64 is one of those sorts of chips, but it's shaping up to be a good prospect for that. -- Frank Earl
Re: [Dri-devel] Mach64 DMA Was: Pseudo DMA?
On 2002.02.10 09:31 Gareth Hughes wrote: ... These chips can read and write arbitrary locations in system memory. For all chips that have this feature, the only safe way to program them is from within a DRM kernel module. Only clients that have been authenticated via the usual (X auth) means are able to talk to such modules. There is simply no other way to do it. You can trust the X server and the kernel module. You CANNOT trust anything else -- a client-side 3D driver, something masquerading as one, whatever... There is a reason why all the DRI drivers for commodity cards are designed like this. It's a pain, but that's the price you pay for a secure system. -- Gareth Which of the chips currently supported by DRI is most similar in this [DMA programming] sense to mach64, and could be looked at as a reference implementation? Regards, Jose Fonseca
Re: [Dri-devel] Mach64 DMA
On Tuesday 08 January 2002 04:12 pm, Leif Delgass and Manuel Teira wrote: Happy New Year! Hopefully for all, it will be a better one than the last... Well, after the holidays, I would like to recover the development in the mach64 branch. I started today to investigate the DMA stuff because I think that perhaps Frank is very busy and he has no time to do this work. The DMA problem was discovered approx. Oct. 21st and we have no news about any progress in DMA. I'm sure that Frank would do it better than me, but I can try. I've had starts and stops. However, I am still working on things and have been with what time I've actually had that I could think straight on, and I'm pretty close to having something- sorry about the delays on my part, I KNOW you're all chomping at the bit, I am wanting this to happen as much as you all are. It sounded like Frank had written some code already (he mentioned being halfway done in early December). Frank, is your work in a state that could be committed to the branch so others could help finish it? If so, this might be a good place to start a new branch tag since we currently have a working driver. Before long we'll also need to merge in changes from the trunk, since 4.2.0 seems close to release. I'm about ready to actually make that branch- I nearly have the code in place (You'd not believe the fun stuff that has conspired against me to finish the code... Job hunting, honey-do projects, my father being ill- I couldn't get focused long enough to sit down and plug the code in... But enough of that, it's in the past.) and I'm planning on getting it completed sometime this upcoming week and starting to verify that I've not broken anything. I'll go ahead and make a branch at that point in time. I've been looking at the r128 freelist implementation, so I've derived that the register called R128_LAST_DISPATCH_REG (actually R128_GUI_SCRATCH_REG1) is used to store the age of the last discarded buffer.
So, the r128_freelist_get is able to wait for a discarded buffer if there's no free buffer available. Could this be done in the mach64 driver, say with MACH64_SCRATCH_REG1? In my register reference it says that these registers can be used for exchanging information between the BIOS and card drivers, so is it sane to use them for this task? I'm not sure that that would be safe to use. According to r128_reg.h, the r128 has BIOS scratch registers and GUI scratch registers, whereas the mach64 has only the scratch registers used by the BIOS. The mach64 Programmer's Guide says that SCRATCH_REG1 is used to store the ROM segment location and installed mode information, and should only be used by the BIOS or a driver to get the ROM segment location for calling ROM service routines. Hm... I've been wondering why they used a scratch register when the private area's available and could hold the data as well as anything else. Anybody care to comment? As it stands, I've got the info being placed in the private data area as a variable. I've also seen that there's no r128_freelist_put (it's present in the mga driver, for example). Isn't it necessary? Depends. I'm not sure how I'm going to code things. I've got to account for clients holding onto or discarding their buffers upon submission (as well as burning them off because they're shutting down) in the DRM, and I'm working on this part (the actual DMA submission part's fairly easy, but the trappings within the DRM are a different story) right now. My thinking is that if it's a discard, we push it back into the freelist and view it as unavailable unless the age is past the timeout point. And when is a buffer supposed to be discarded? Is this situation produced in the client side? It appears to be a client-side behavior.
The way that most of the cards seem to do their DMA is they queue up the buffer in question with an ioctl such as DRM_IOCTL_R128_VERTEX, using a struct not unlike this for the ioctl parameters:

typedef struct drm_r128_vertex {
        int prim;
        int idx;        /* Index of vertex buffer */
        int count;      /* Number of vertices in buffer */
        int discard;    /* Client finished with buffer? */
} drm_r128_vertex_t;

idx is the value pulled from drm_buf_t. I'll admit that prim's still a little foggy to me, but count is obvious, as is discard. Basically the client tells the DRM to put it back into the freelist because it's done with it. I would think that there is a tradeoff between holding onto versus releasing buffers- holding onto them would be a speed boost, but at the expense of limiting how many clients had buffers. 128k's not a lot of buffer space- it doesn't allow for many vertices, etc., so I'd wonder what kind of benefit a client would derive from holding onto the buffer for things like vertices. Textures might be an advantage, but again, you're stealing things
RE: [Dri-devel] Mach64 DMA
Frank C. Earl wrote: While we're discussing things here, can anyone tell me why things like the emit state code are in the DRM instead of in the Mesa drivers? It looks like it could just as easily be in the Mesa driver, at least in the case of the RagePRO code- is there a good reason why it's in the DRM? Security. -- Gareth
Re: [Dri-devel] Mach64 DMA
On Sunday 13 January 2002 11:50 pm, Gareth Hughes wrote: While we're discussing things here, can anyone tell me why things like the emit state code are in the DRM instead of in the Mesa drivers? It looks like it could just as easily be in the Mesa driver, at least in the case of the RagePRO code- is there a good reason why it's in the DRM? Security. Okay. I can buy that as a reason. Unfortunately, the reason why it's more secure doesn't appear to be well documented anywhere, and there are several cards that do not appear to have this as a feature of their DRM module- would you (or anyone else for that matter) enlighten us as to why it's better/more secure? (I'm coding it right now, it just strikes me as adding a bunch of extra kernel-space call overhead without much benefit- I'd love to understand why I'm doing it this way while I'm coding it into the DRM... :-) -- Frank Earl
Re: [Dri-devel] Mach64 DMA
On Mon, 7 Jan 2002, Manuel Teira wrote: Hello again. First of all, happy new year to everybody. Happy New Year! Well, after the holidays, I would like to recover the development in the mach64 branch. I started today to investigate the DMA stuff because I think that perhaps Frank is very busy and he has no time to do this work. The DMA problem was discovered approx. Oct. 21st and we have no news about any progress in DMA. I'm sure that Frank would do it better than me, but I can try. It sounded like Frank had written some code already (he mentioned being halfway done in early December). Frank, is your work in a state that could be committed to the branch so others could help finish it? If so, this might be a good place to start a new branch tag since we currently have a working driver. Before long we'll also need to merge in changes from the trunk, since 4.2.0 seems close to release. And now, the questions: I've been looking at the r128 freelist implementation, so I've derived that the register called R128_LAST_DISPATCH_REG (actually R128_GUI_SCRATCH_REG1) is used to store the age of the last discarded buffer. So, the r128_freelist_get is able to wait for a discarded buffer if there's no free buffer available. Could this be done in the mach64 driver, say with MACH64_SCRATCH_REG1? In my register reference it says that these registers can be used for exchanging information between the BIOS and card drivers, so is it sane to use them for this task? I'm not sure that that would be safe to use. According to r128_reg.h, the r128 has BIOS scratch registers and GUI scratch registers, whereas the mach64 has only the scratch registers used by the BIOS. The mach64 Programmer's Guide says that SCRATCH_REG1 is used to store the ROM segment location and installed mode information, and should only be used by the BIOS or a driver to get the ROM segment location for calling ROM service routines. I've also seen that there's no r128_freelist_put (it's present in the mga driver, for example).
Isn't it necessary? And when is a buffer supposed to be discarded? Is this situation produced in the client side? Best regards. -- Manuel Teira -- Leif Delgass http://www.retinalburn.net
Re: [Dri-devel] mach64 dma question
On Wednesday 24 October 2001 01:54 am, [EMAIL PROTECTED] wrote:

    a = readl(kms->reg_aperture + MACH64_BUS_CNTL);
    writel((a | (3<<1)) & (~(1<<6)), kms->reg_aperture + MACH64_BUS_CNTL);

same other code works fine. Now why would this be? This could be caused by the same thing that was giving us fits up until recently. The 4.X XFree86 driver does some things in its init function that override the chip's default settings (ostensibly to improve performance...) that cause at least busmastering for GUI operations to not work at all. Do you have to set bits 1 and 2 every time you want a DMA pass, or is it just the first time you do it? -- Frank Earl