Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Jens Owen

Linus Torvalds wrote:

 A hot system call takes about 0.2 us on an athlon (it takes significantly
 longer on a P4, which I'm beating up Intel over all the time). The ioctl
 stuff goes through slightly more layers, but we're not talking huge
 numbers here. The system calls are fast enough that you're better off
 trying to keep stuff in the cache, than trying to minimize system calls.

This is an education for me, too.  Thanks for the info.  Any idea how
heavy IOCTL's are on a P4?

 NOTE NOTE NOTE! The tradeoffs are seldom all that clear. Sometimes big
 buffers and few system calls are better. Sometimes they aren't. It just
 depends on a lot of things.

You bet--and the real issue we're constantly swimming upstream against
is security in open source.  Most hardware vendors design their hardware
for closed source drivers and don't put much (or sometimes any) time
into making sure their hardware is optimized for performance *and*
security.  Consequently, most modern graphics chips are optimized for
user space DMA, and they rely on the security through obscurity of their
closed source drivers.  Then the DRI team comes along and has to figure
out how to kludge together a secure path that doesn't sacrifice *all*
the performance.

Linus, if you have any ideas on how we can uphold the security strengths
of Linux without leaving all this performance on the table simply
because we embrace open source, I'd love to hear them.  It really
hurts to compete tooth and nail against closed source drivers (even on
Linux) and have to leave potentially large performance gains on the
table.

The other paradox here is that security is paramount for the server
market where Linux is strong.  But we're trying to help Linux into the
domain of the graphics workstation and game machine markets where users
already have full access to the machine (even physically).  So how is
all this security really helping us address those markets?

Sorry, I'm venting.  This has been a difficult issue since the beginning
of the DRI project--but I'm glad I got it off my chest :-)

Regards,
Jens

-- /\
 Jens Owen/  \/\ _
  [EMAIL PROTECTED]  /\ \ \   Steamboat Springs, Colorado

___

Don't miss the 2002 Sprint PCS Application Developer's Conference
August 25-28 in Las Vegas -- http://devcon.sprintpcs.com/adp/index.cfm

___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel



Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Linus Torvalds



On Mon, 27 May 2002, Jens Owen wrote:

 This is an education for me, too.  Thanks for the info.  Any idea how
 heavy IOCTL's are on a P4?

Much heavier. For some yet unexplained reason, a P4 takes about 1us to do
a simple system call. That's on a 1.8GHz system, so it basically implies
that a P4 takes 1800 cycles to do an int 0x80 + iret, which is just
ludicrous. A 1.2GHz Athlon does the same in 0.2us, i.e. around 250 cycles
(the 200+ cycles also matches a Pentium reasonably well, so it's really
the P4 that stands out here).

The rest of the ioctl overhead is not really noticeable compared to those
1800 cycles spent on the enter/exit kernel mode.

Even so, those memcpy vs pipe throughput numbers I quoted were off my P4
machine:  _despite_ the fact that a P4 is inexplicably bad at system
calls, those 1800 CPU cycles are just a whole lot less than a lot of cache
misses with modern hardware. It doesn't take many cache misses to make
1800 cycles just noise.

And if the 1800 cycles are less than cache misses on normal non-IO
benchmarks, they are going to be _completely_ swamped by any PCI/AGP
overhead.
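(For the curious, this per-call cost is easy to measure yourself. A minimal sketch, assuming Linux and glibc; the iteration count and the choice of SYS_getpid are arbitrary, picked only because getpid does almost no work in the kernel:)

```c
#define _GNU_SOURCE
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Average the cost of a trivial syscall over a large batch; the
 * result approximates pure kernel entry/exit overhead. */
static double syscall_cost_ns(void)
{
    enum { N = 100000 };
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        syscall(SYS_getpid);   /* bypass glibc's cached getpid() */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9
              + (t1.tv_nsec - t0.tv_nsec);
    return ns / N;             /* average cost per call */
}
```

On the hardware discussed above you'd expect roughly 200ns (Athlon) vs 1000ns (P4); actual numbers will vary with CPU and kernel.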

 You bet--and the real issue we're constantly swimming up stream against
 is security in open source.  Most hardware vendors design the hardware
 for closed source drivers and don't put many (or sometimes any) time
 into making sure their hardware is optimized for performance *and*
 security.

I realize this, and I feel for you. It's nasty.

I don't know what the answer is. It _might_ even be something like a
bi-modal system:

 - apps by default get the traditional GLX behaviour: the X server does
   all the 3D for them. No DRI.

 - there is some mechanism to tell which apps are trusted, and trusted
   apps get direct hw access and just aren't secure.

I actually think that if the abstraction level is just high enough, DRI
shouldn't matter in theory. Shared memory areas with X for the high-level
data (to avoid the copies for things like the obviously huge texture
data).

From a game standpoint, think quake engine. The actual game doesn't need
to tell the GL engine everything over and over again all the time. It
tells it the basic stuff once, and then it just says "render me". You
don't need DRI for sending the "render me" command, you need DRI because
you send each vertex separately.
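The shape of that split can be sketched as a toy model (this is not GL or DRI code, just an illustration of recording geometry once and replaying it cheaply each frame):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct vec3 { float x, y, z; };

/* "Tell it the basic stuff once": the expensive, vertex-by-vertex
 * transfer happens a single time, at record time. */
struct display_list { struct vec3 *verts; size_t n; };

static struct display_list dl_record(const struct vec3 *v, size_t n)
{
    struct display_list dl = { malloc(n * sizeof *v), 0 };
    if (dl.verts) {
        memcpy(dl.verts, v, n * sizeof *v);
        dl.n = n;
    }
    return dl;
}

/* "Render me": one cheap call per frame, no per-vertex traffic.
 * Returns the number of vertices it would draw. */
static size_t dl_render(const struct display_list *dl)
{
    return dl->n;   /* stand-in for the actual draw */
}
```

With this shape, the per-frame cost is one call regardless of scene size, which is why a client-server transport stops being the bottleneck.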

In that kind of high-level abstraction, the X client-server model should
still work fine. In fact, it should work especially well on small-scale
SMP (which seems inevitable).

Are people thinking about the next stage, when 2D just doesn't exist any
more except as a high-level abstraction on top of a 3D model? Where the X
server actually gets to render the world view, and the application doesn't
need to (or want to) know about things like level-of-detail?

Linus





Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Keith Whitwell

Linus Torvalds wrote:
 
 On Mon, 27 May 2002, Jens Owen wrote:
 
This is an education for me, too.  Thanks for the info.  Any idea how
heavy IOCTL's are on a P4?

 
 Much heavier. For some yet unexplained reason, a P4 takes about 1us to do
 a simple system call. That's on a 1.8GHz system, so it basically implies
 that a P4 takes 1800 cycles to do a int 0x80 + iret, which is just
 ludicrous. A 1.2Gz athlon does the same in 0.2us, ie around 250 cycles
 (the 200+ cycles also matches a pentium reasonably well, so it's really
 the P4 that stands out here).

This is remarkable.  I thought things were getting better, not worse.

...
 
You bet--and the real issue we're constantly swimming up stream against
is security in open source.  Most hardware vendors design the hardware
for closed source drivers and don't put many (or sometimes any) time
into making sure their hardware is optimized for performance *and*
security.

 
 I realize this, and I feel for you. It's nasty.
 
 I don't know what the answer is. It _might_ even be something like a
 bi-modal system:
 
  - apps by default get the traditional GLX behaviour: the X server does
all the 3D for them. No DRI.
 
  - there is some mechanism to tell which apps are trusted, and trusted
apps get direct hw access and just aren't secure.
 
 I actually think that if the abstraction level is just high enough, DRI
 shouldn't matter in theory. Shared memory areas with X for the high-level
 data (to avoid the copies for things like the obviously huge texture
 data).

I like this because it offers a way out, although I would keep the direct, 
secure approach to 3d we currently have for the other clients.  Indirect 
rendering is pretty painful...

However:  The applications that most people would want to 'trust' are things 
like quake or other closed source games, which makes the situation a little 
murkier.


From a game standpoint, think quake engine. The actual game doesn't need
 to tell the GX engine everything over and over again all the time. It
 tells it the basic stuff once, and then it just says render me. You
 don't need DRI for sending the render me command, you need DRI because
 you send each vertex separately.

You could view the static geometry of quake levels as a single display list 
and ask for the whole thing to be rendered each frame.

However, the reality of the quake type games is anything but - huge amounts of 
effort have gone into the process of figuring out (as quickly as possible) 
what minimal amount of work can be done to render the visible portion of the 
level at each frame.

Quake generates very dynamic data from quite a static environment in the name 
of performance...

 In that kind of high-level abstraction, the X client-server model should
 still work fine. In fact, it should work especially well on small-scale
 SMP (which seems inevitable).

Games are free to partition themselves in other ways that help SMP but keep 
their ability for a tight binding with the display system -- for example, the 
physics (rigid body simulation) subsystem is a big and growing consumer of CPU 
and is quite easily separated out from the graphics engine.  AI is also a 
target for its own thread.

 Are people thinkin gabout the next stage, when 2D just doesn't exist any
 more except as a high-level abstraction on top of a 3D model? Where the X
 server actually gets to render the world view, and the application doesn't
 need to (or want to) know about things like level-of-detail?

Yes, but there are a few steps between here and there, and there have been a 
few differences of opinion along the way.  It would have been possible to get 
a lot of the X render extension via a client library emitting GL calls, for 
example.

Keith








Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Jens Owen

Keith Whitwell wrote:
 
 Linus Torvalds wrote:
 
  On Mon, 27 May 2002, Jens Owen wrote:
 
 This is an education for me, too.  Thanks for the info.  Any idea how
 heavy IOCTL's are on a P4?
 
 
  Much heavier. For some yet unexplained reason, a P4 takes about 1us to do
  a simple system call. That's on a 1.8GHz system, so it basically implies
  that a P4 takes 1800 cycles to do a int 0x80 + iret, which is just
  ludicrous. A 1.2Gz athlon does the same in 0.2us, ie around 250 cycles
  (the 200+ cycles also matches a pentium reasonably well, so it's really
  the P4 that stands out here).
 
 This is remarkable.  I thought things were getting better, not worse.
 
 ...
 
 You bet--and the real issue we're constantly swimming up stream against
 is security in open source.  Most hardware vendors design the hardware
 for closed source drivers and don't put many (or sometimes any) time
 into making sure their hardware is optimized for performance *and*
 security.
 
 
  I realize this, and I feel for you. It's nasty.
 
  I don't know what the answer is. It _might_ even be something like a
  bi-modal system:
 
   - apps by default get the traditional GLX behaviour: the X server does
 all the 3D for them. No DRI.
 
   - there is some mechanism to tell which apps are trusted, and trusted
 apps get direct hw access and just aren't secure.
 
  I actually think that if the abstraction level is just high enough, DRI
  shouldn't matter in theory. Shared memory areas with X for the high-level
  data (to avoid the copies for things like the obviously huge texture
  data).
 
 I like this because it offers a way out, although I would keep the direct,
 secure approach to 3d we currently have for the other clients.  Indirect
 rendering is pretty painful...

A bi-modal system could be very possible from an implementation
perspective in the short term.

We have a security mechanism in place now for validating which processes
are allowed to access the direct rendering mechanism.  It is based on
user IDs, and no process is allowed access to these resources unless
they have:

1) Access to the X Server as an X client.

2) Permissions acceptable according to how the DRI permissions are
defined in the XF86Config file.

Most distributions have picked up on this and now have a typical usage
model that allows the DRI to work for all desktop users.
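For reference, the knob involved is the DRI section of XF86Config; something like the following (the group name and mode here are only examples, not a recommendation):

```
Section "DRI"
    Group "video"   # only members of this group get direct rendering
    Mode  0666      # the world-accessible default many distros ship
EndSection
```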

If we do get some type of indirect rendering path working sooner, then
perhaps we could tighten up these defaults so that the usage model
required explicit administrative permission for a user before being
allowed access to direct rendering.

However, after going to all this trouble to get a decent level of
fallback performance, I would then want to push the performance envelope
for those processes that did meet the criteria for access to direct
rendering resources, and soften the security requirements for just those
processes.  These could possibly be users that have been given explicit
permission, and the X server itself (doing HW accelerated indirect
rendering).

There would really be three prongs of attack for this approach:

1) Audit the current DRI security model and confirm that it is strong
enough to prevent unauthorized users from gaining access to the DRI
mechanisms.  Work with distros to tighten up the usage model
(and possibly the DRI security mechanism itself) so only explicit
desktop users are allowed access to the DRI.

2) Develop a device independent indirect rendering module that plugs
into the X server to utilize our 3D drivers.  After getting some HW
accel working, look at speeding up this path by utilizing Chromium-like
technologies and/or shared memory for high level data.

3) Transition the direct rendering drivers to take full advantage of
their user space DMA capabilities.

This is a large amount of work, but something we should consider if step
1 can be achieved to the kernel team's satisfaction.  It is even possible
the direct path could be obsoleted over the long term as step 2 becomes
more and more streamlined.

 However:  The applications that most people would want to 'trust' are things
 like quake or other closed source games, which makes the situation a little
 murkier.

Yes, but is this really any worse than a typical install for these apps
that requires root level access?
 
 From a game standpoint, think quake engine. The actual game doesn't need
  to tell the GX engine everything over and over again all the time. It
  tells it the basic stuff once, and then it just says render me. You
  don't need DRI for sending the render me command, you need DRI because
  you send each vertex separately.
 
 You could view the static geometry of quake levels as a single display list
 and ask for the whole thing to be rendered each frame.
 
 However, the reality of the quake type games is anything but - huge amounts of
 effort have gone into the process of figuring out (as quickly as possible)
 what minimal amount of work can be done to render the 

Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread José Fonseca

On 2002.05.27 16:28 Jens Owen wrote:
 ...
 
 If we do get some type of indirect rendering path working quicker, then
 perhaps we could tighten up these defaults so that the usage model
 required explicit administrative permision to a user before being
 allowed access to direct rendering.
 
 However, after going to all this trouble of making a decent level of
 fall back performance, I would then want to push the performance envelop
 for those processes that did meet the criteria for access to direct
 rendering resources, and soften the security requirements for just those
 processes.  This could possible be users that have been given explicit
 permission and the X server itself (doing HW accellerated indirect
 rendering).
 
 There would really be three prongs of attach for this approach:
 
 1) Audit the current DRI security model and confirm that it is strong
 enough to be used to prevent non authorized users from gaining access to
 the DRI mechanisms.  Work with distros to tighten up the usage model
 (and possible the DRI security mechanism itself) so only explicit
 desktop users are allowed access to the DRI.
 
 2) Develop a device independent indirect rendering module that plugs
 into the X server to utilize our 3D drivers.  After getting some HW
 accel working, look at speeding up this path by utilizing Chormium-like
 technologies and/or shared memory for high level data.
 
 3) Transition the direct rendering drivers to take full advantage of
 their user space DMA capabilities.
 
 The is a large amount of work, but something we should consider if step
 1 can be achieved to the kernel teams satisfaction.  It is even possible
 the direct path could be obsoleted over the long term as step 2 becomes
 more and more streamlined.
 
 ...

Jens, if I understood correctly, basically you're suggesting having the 
OpenGL state machine in the X server process context, and therefore the GL 
drivers too, and most of the data (textures, display lists). So there 
would be no layering between the DMA buffer construction and its submission 
- as both things would be carried out by the GL drivers. This means that we 
would have a single driver model instead of 3.

But the GLX protocol isn't good for this, is it? Hence the need for shared 
memory for big data.

Am I getting the right picture, or am I way off..?

José Fonseca

PS: It would be nice to discuss these issues in tonight's meeting.




Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Linus Torvalds



On Mon, 27 May 2002, Keith Whitwell wrote:

 Linus Torvalds wrote:
 
  Much heavier. For some yet unexplained reason, a P4 takes about 1us to do
  a simple system call. That's on a 1.8GHz system, so it basically implies
  that a P4 takes 1800 cycles to do a int 0x80 + iret, which is just
  ludicrous. A 1.2Gz athlon does the same in 0.2us, ie around 250 cycles
  (the 200+ cycles also matches a pentium reasonably well, so it's really
  the P4 that stands out here).

 This is remarkable.  I thought things were getting better, not worse.

In general, they are. I suspect the P4 system call slowness is just
another artifact of some first-generation issues - the same way the P4
tends to be limited when it comes to shifts etc. It will get fixed
eventually. And running at 3GHz+ makes some CPU cycles seem cheap if you
can make up for them elsewhere.

However, you should put all of this into perspective: those 1800 cycles
are just about the same time it takes to do one _single_ read from an ISA
device. It's roughly the time it takes for one cacheline to be DMA'd over
PCI.

Linus





Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Keith Whitwell


 
From a game standpoint, think quake engine. The actual game doesn't need
to tell the GX engine everything over and over again all the time. It
tells it the basic stuff once, and then it just says render me. You
don't need DRI for sending the render me command, you need DRI because
you send each vertex separately.

You could view the static geometry of quake levels as a single display list
and ask for the whole thing to be rendered each frame.

However, the reality of the quake type games is anything but - huge amounts of
effort have gone into the process of figuring out (as quickly as possible)
what minimal amount of work can be done to render the visible portion of the
level at each frame.

Quake generates very dynamic data from quite a static environment in the name
of performance...

 
 I think I understand...even though Linus is referring to Quake's wire
 protocol here, you are pointing out that the real challenge is the
 underlying game engine, which is highly optimized for that specific
 application.  Am I correct?

I think the multiplayer aspects of the game are a separate issue.  I'm talking 
about the difference between a big display list with the whole quake level in 
it and the visibility/bsp-tree/whatever-new-technique coding that quake and 
other games use to squeeze as much as possible out of the hardware.

It may be that simple visibility issues are pretty well understood now, and 
that the competition between game engines is moving to the shading engines 
(and physics engines if the reports about doom 3 are right).

Keith





Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Keith Packard


Around 18 o'clock on May 27, Keith Whitwell wrote:

 I think the multiplayer aspects of the game are a separate issue.  Talking 
 about the difference between a big display list with the whole quake level in 
 it and the visibility/bsp-tree/whatever-new-technique coding that quake  
 other games use to squeeze as much as possible out of the hardware.

We had a big display-list vs immediate-mode war around 1990, and immediate
mode won.  It's just a lot easier to send a whole frame's worth of 
polygons each time than to try and edit display lists.  Of course, this 
particular battle was framed by the scientific visualization trend of 
that era, where each frame was generated from a completely new set of data.
In that context, stored mode graphics lose pretty badly.

However, given our experience with shared memory transport for images, and
given the tremendous differential between CPU and bus speeds these days, it
might make some sense to revisit the current 3D architecture.  A system
where the shared memory commands are validated by a user-level X server and
passed to the graphics engine with only a small kernel level helper for DMA
would allow for a greater possible level of security than the current DRI
model does today.

This would also provide for accelerated 3D graphics for remote
applications, something that DRI doesn't support today, and which would
take some significant work to enable.  I would hope that it could also 
provide a significantly easier configuration environment; getting 3D 
running with the DRI is still a significant feat for the average Linux 
user.

The question is whether this would impact performance at all; we're 
talking a process-process context switch instead of process-kernel
for each chunk of data.  However, we'd eliminate the current DRI overhead 
when running multiple 3D applications, and we'd be able to take better
advantage of SMP systems.  One trick would be to have the X server avoid 
reading much of the command buffer; much of that would make SMP 
performance significantly worse.
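A rough sketch of what such a shared-memory path might look like (all names, the layout, and the register policy here are hypothetical, not actual DRI interfaces). The important detail is that the server copies each command out of the shared page before checking it, so the client cannot rewrite a command between check and submit:

```c
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 256

struct cmd { uint32_t reg; uint32_t value; };

struct ring {
    volatile uint32_t head;        /* advanced by the client */
    volatile uint32_t tail;        /* advanced by the server */
    struct cmd slot[RING_SLOTS];   /* lives in the shared segment */
};

/* Hypothetical policy: clients may only touch this register window. */
static int reg_allowed(uint32_t reg)
{
    return reg >= 0x1000 && reg < 0x2000;
}

/* Server side: drain pending commands into a private buffer, checking
 * each one.  Returns the number accepted, or -1 if any command is bad.
 * The memcpy matters: validation runs against the private copy, not
 * against memory the client can still scribble on. */
static int server_drain(struct ring *r, struct cmd *out, int max)
{
    int n = 0;
    while (r->tail != r->head && n < max) {
        struct cmd c;
        memcpy(&c, &r->slot[r->tail % RING_SLOTS], sizeof c);
        if (!reg_allowed(c.reg))
            return -1;             /* reject the whole batch */
        out[n++] = c;              /* this copy goes to the kernel helper */
        r->tail++;
    }
    return n;
}
```

The kernel helper would then only ever see server-validated private copies, which is what keeps the kernel-side code small.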

Keith PackardXFree86 Core TeamHP Cambridge Research Lab






Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Jens Owen

Keith Packard wrote:

 We had a big display-list vs immediate-mode war around 1990 and immediate
 mode won.  It's just a lot easier to send the whole frame worth of
 polygons each time than to try an edit display lists.  Of course, this
 particular battle was framed by the scientific visualization trend of
 that era where each frame was generated from a completely new set of data.
 In that context, stored mode graphics lose pretty badly.

If you're referring to the OpenGL vs PEX war, there were more than
technical issues weighing in...there was the reality that Microsoft
*was* willing to support OpenGL.  That made OpenGL a better cross
platform choice.  Kind of ironic, but predictable, that Microsoft is now
trying to sink OpenGL...but that's a thread for another group.
 
 However, given our experience with shared memory transport for images, and
 given the tremendous differential between CPU and bus speeds these days, it
 might make some sense to revisit the current 3D architecture.  A system
 where the shared memory commands are validated by a user-level X server and
 passed to the graphics engine with only a small kernel level helper for DMA
 would allow for a greater possible level of security than the current DRI
 model does today.

I wouldn't say we're lacking in security today; we're in good shape now.

 This would also provide for accelerated 3D graphics for remote
 applications, something that DRI doesn't support today, and which would
 take some significant work to enable.

In relative scale, getting HW acceleration for indirect rendering is
*much* smaller than the more aggressive architectural changes we're
discussing.  Let's just keep that in perspective.  It might be a less
aggressive first step to get the missing module(s) for HW accelerated
indirect rendering going, then move to these types of more aggressive
indirect methods.

 I would hope that it could also
 provide a significantly easier configuration environment; getting 3D
 running with the DRI is still a significant feat for the average Linux
 user.

Hmm.  I would have agreed a year ago, but most of the distributions
appear to have a good handle on making this just happen...when there is
driver support.
 
 The question is whether this would impact performance at all; we're
 talking a process-process context switch instead of process-kernel
 for each chunk of data.  However, we'd eliminate the current DRI overhead
 when running multiple 3D applications, and we'd be able to take better
 advantage of SMP systems.

 One trick would be to have the X server avoid
 reading much of the command buffer; much of that would make SMP
 performance significantly worse.

The performance path I'd like to push hardest in the short term is
direct rendering completely within the user space context.  Your
suggestions for optimizing an indirect path are great.  That path would
become much more critical than today, as general purpose processes
would no longer have access to the *faster* direct path.

-- /\
 Jens Owen/  \/\ _
  [EMAIL PROTECTED]  /\ \ \   Steamboat Springs, Colorado




Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Jens Owen

José Fonseca wrote:
 
 On 2002.05.27 16:28 Jens Owen wrote:
  ...
 
  If we do get some type of indirect rendering path working quicker, then
  perhaps we could tighten up these defaults so that the usage model
  required explicit administrative permision to a user before being
  allowed access to direct rendering.
 
  However, after going to all this trouble of making a decent level of
  fall back performance, I would then want to push the performance envelop
  for those processes that did meet the criteria for access to direct
  rendering resources, and soften the security requirements for just those
  processes.  This could possible be users that have been given explicit
  permission and the X server itself (doing HW accellerated indirect
  rendering).
 
  There would really be three prongs of attach for this approach:
 
  1) Audit the current DRI security model and confirm that it is strong
  enough to be used to prevent non authorized users from gaining access to
  the DRI mechanisms.  Work with distros to tighten up the usage model
  (and possible the DRI security mechanism itself) so only explicit
  desktop users are allowed access to the DRI.
 
  2) Develop a device independent indirect rendering module that plugs
  into the X server to utilize our 3D drivers.  After getting some HW
  accel working, look at speeding up this path by utilizing Chormium-like
  technologies and/or shared memory for high level data.
 
  3) Transition the direct rendering drivers to take full advantage of
  their user space DMA capabilities.
 
  The is a large amount of work, but something we should consider if step
  1 can be achieved to the kernel teams satisfaction.  It is even possible
  the direct path could be obsoleted over the long term as step 2 becomes
  more and more streamlined.
 
  ...
 
 Jens, if I understood correctly, basically you're suggesting having the
 OpenGL state machine on the X server process context, and therefore the GL
 drivers too, and most of the data (textures, display lists). So there
 would be no layering between the DMA buffer construction and its submition
 - as boths things would be carried by the GL drivers. This means that we
 would have a single driver model instead of 3.
 
 But the GLX protocol isn't good for this, is it? Hence the need for shared
 memory for big data.
 
 Am I getting the right picture, or am I way off..?

Sorry, we covered a lot of things at once.  Let me simplify...

1) We loosen security requirements for 3D drivers.  This will allow far
less data copying, memory mapping/unmapping and system calls.  Many
modern graphics chips can have their data managed completely in a user
space AGP ring buffer, removing the need to call the kernel module at
all.  The primary limitation that has kept us from pursuing these
implementations so far has been security holes with AGP blits.

2) We implement HW accelerated indirect rendering for those processes
that don't have the permissions to use the new optimized drivers.  Most
of the fancy architecture discussions we had here are related to making
indirect rendering faster...and could be done as a follow-on to basic HW
accelerated indirect rendering.  The first and easiest way to implement
this is to make the X server use our direct rendering drivers.

I'm not really advocating going to a different model at all.  Rather,
I'm just advocating moving more of the kernel side validation we're
currently doing, back into the 3D driver.

 PS: It would be nice to discuss these issues in tonight's meeting.

I guess that's starting now.

It's at irc.openprojects.net #dri-devel for those interested in joining
in...

-- /\
 Jens Owen/  \/\ _
  [EMAIL PROTECTED]  /\ \ \   Steamboat Springs, Colorado




Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Ian Molton

On Mon, 27 May 2002 15:01:47 -0600
Jens Owen [EMAIL PROTECTED] wrote:

 1) We loosen security requirements for 3D drivers.  This will allow
 far less data copying, memory mapping/unmapping and system calls. 
 Many modern graphics chips can have their data managed completely in a
 user space AGP ring buffer removing the need to call the kernel module
 at all.  The primary limitation that has kept us from persuing these
 implementations so far have been security holes with AGP blits.

I don't pretend to understand everything here, but wouldn't it be more
secure, and STILL blindingly fast, to set up the data in userspace, and
trigger the AGP DMA / blits from kernel space with some bounds checking?

Surely 1 system call per DMA isn't that bad?




Re: [Dri-devel] Mach64 dma fixes

2002-05-27 Thread Keith Whitwell

Ian Molton wrote:
 On Mon, 27 May 2002 15:01:47 -0600
 Jens Owen [EMAIL PROTECTED] wrote:
 
 
1) We loosen security requirements for 3D drivers.  This will allow
far less data copying, memory mapping/unmapping and system calls.
Many modern graphics chips can have their data managed completely in a
user space AGP ring buffer, removing the need to call the kernel module
at all.  The primary limitation that has kept us from pursuing these
implementations so far has been security holes with AGP blits.

 
 I don't pretend to understand everything here, but wouldn't it be more
 secure, and STILL blindingly fast, to set up the data in userspace, and
 trigger the AGP DMA / blits from kernel space with some bounds checking?
 
 Surely 1 system call per DMA isn't that bad?

That's what we do for the cases where we can do so securely.  All the vertex 
data on most cards takes this route.

Some data can't go this way because the buffers are subject to attack after 
the checking has been performed but before they reach the hardware.  Whether 
specific operations are vulnerable or not depends on the details of the card's 
DMA engine.
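
The race described here is a classic time-of-check-to-time-of-use problem, and the usual fix is to verify a private copy the client can no longer touch. A minimal sketch, with made-up names (`FORBIDDEN_REG`, `submit_secure`) standing in for the real register offsets and DRM entry points:

```c
#include <stdint.h>
#include <string.h>

#define BUF_DWORDS    256
#define FORBIDDEN_REG 0x0640  /* stand-in for a dangerous register offset */

/* Returns 0 if the buffer contains no writes to the forbidden register.
 * Buffers are treated as (register, value) dword pairs for illustration. */
static int verify_buffer(const uint32_t *buf, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        if ((buf[i] & 0xffff) == FORBIDDEN_REG)
            return -1;
    }
    return 0;
}

/* Secure path: snapshot the client's shared buffer into kernel-owned
 * memory, then verify the snapshot.  Any write the client makes to
 * 'shared' after this point cannot affect what the hardware sees. */
static int submit_secure(const uint32_t *shared, uint32_t *priv, size_t n)
{
    memcpy(priv, shared, n * sizeof(uint32_t));
    return verify_buffer(priv, n);
}
```

Verifying the shared mapping in place, by contrast, leaves a window between the check and the hardware fetch in which the client can rewrite a dword.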

Keith





Re: [Dri-devel] Mach64 dma fixes

2002-05-26 Thread José Fonseca

Linus and Keith P.,

Thank you very much for your valuable insights - they cleared a 
misconception I had about memory transfers.

Of course, to get to the bottom of this, we will have to test several 
buffer sizes - I'm sure it will be an interesting study.


Regards,

José Fonseca





Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread José Fonseca

On 2002.05.25 06:10 Leif Delgass wrote:
 On Fri, 24 May 2002, Frank C. Earl wrote:
 
  On Thursday 23 May 2002 04:37 pm, Leif Delgass wrote:
 
   I've committed code to read BM_GUI_TABLE to reclaim processed buffers
   and disabled frame and buffer aging with the pattern registers.  I've
   disabled saving/restoring the pattern registers in the DDX and moved
   the wait for idle to the XAA Sync.  This fixes the slowdown on mouse
   moves.  I also fixed a bug in getting the ring head.  One bug that
   remains is that when starting tuxracer or quake (and possibly other
   apps) from a fresh restart of X, there is a problem where old bits of
   the back- or frame-buffer show through.  With tuxracer (windowed), if
   I move the window, the problem goes away.  It seems that some initial
   state is not being set correctly or clears aren't working.  If I run
   glxgears (or even tunnel, which uses textures) first, after starting X
   and before starting another app, the problem isn't there.  If someone
   has a cvs build or binary from before Monday the 20th but after
   Saturday the 18th, could you test to see if this happens?  I'm not
   sure if this is new behavior or not.
  
   I tried removing the flush on swaps in my tree and things seem to
   still work fine (the bug mentioned above is still there, however).  We
   may need to think of an alternate way to do frame aging and
   throttling, without using a pattern register.
 
  I've been pondering the code you've done (not the latest committed, but
  what was described to me a couple of weeks back...)  How do you account
  for securing the BM_GUI_TABLE check and the pattern register aging in
  light of the engine being able to write to most all registers?  It
  occurred to me that there's a potential security risk (allowing
  malicious clients to possibly confuse/hang the engine) with the design
  described to me back a little while ago.
 
 Well, I just went back and looked at Jose's test for writing
 BM_GUI_TABLE_CMD from within a buffer and realized that it had a bug.
 The register addresses weren't converted to MM offsets.  So I fixed that

Indeed. The other registers were being specified by their value and not 
their macro, so I forgot about that detail...

 and ran the test.  With two descriptors, writing BM_GUI_TABLE_CMD does
 not cause the second descriptor to be read from the new address, but
 BM_GUI_TABLE reads back with the new address written in the first buffer
 at the end of the test.  Then I tried setting up three descriptors, and
 lo and behold, after processing the first two descriptors, the engine
 switches to the new table address written in the first buffer!  I think
 it's because of the pre-incrementing (prefetching?) of BM_GUI_TABLE that
 there's a delay of one descriptor, but it IS possible to derail a bus
 master in progress and set it processing from a different table in
 mid-stream.  Plus, if the address is bogus or the table is
 misconstructed, this will cause an engine lockup and take out DMA until
 the machine is cold restarted.  The code for the test I used is
 attached.
 

Wow! Bummer... I already had convinced myself that the card was secure!

 So it would appear that allowing clients to add register commands to a
 buffer without verifying them is _not_ secure.  This is going to make
 things harder, especially for vertex buffers.  This is going to require
 copying buffers and adding commands or unmapping and verifying
 client-submitted buffers in the DRM.  I'd like to continue on the path
 we're on until we can get DMA running smoothly and then we'll have to
 come back and fix this problem.

Yep. It's not the end of the world, but it's gonna mean that the CPU will 
be a little more stressed, and that we have much more code to do...

Good catch, Leif!

José Fonseca




Fwd: Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 12:10 am, Leif Delgass wrote:

 So it would appear that allowing clients to add register commands to a
 buffer without verifying them is _not_ secure.  This is going to make
 things harder, especially for vertex buffers.  This is going to require
 copying buffers and adding commands or unmapping and verifying
 client-submitted buffers in the DRM.  I'd like to continue on the path
 we're on until we can get DMA running smoothly and then we'll have to come
 back and fix this problem.

Check back to what I'd said for my work- how I'd envisioned things working.

It doesn't really rely on a register being set.  Yes, it uses a register to
verify a completion of a pass, but the main way to do that is to see if the
chip's idle- it's more of an extra, redundant check.  If the chip is idle, we
know it's done with the pass.  If it's not idle after about 3-5 interrupts,
you know it's probably locked and needs a reset.  Now, with the DRI locks,
etc. we can ensure that we know there's going to be nobody that isn't a
client pushing stuff out until we're done and flagged as such.  We also know
that nobody's going to be allowed register access, so they can't keep the
engine on the chip busy except by DMA requests.  DMA requests are not
initiated by the callers; they're handled by a block of code tied to an
interrupt handler in the module.  This code, if it gets the lock, submits a
combined DMA pass of as much of the submitter's data as is reasonable.
We then check to see if that pass is completed every time the interrupt gets
called.  Upon completion, you unlink the buffers in the pass and hand them to
the free list.

With that, you're already as secure as you're likely going to get with the
Mach64- the DRM is in the driver's seat the whole time for any submitted
data.  Otherwise, you're going to be doing copying of some sort, which pretty
much burns up any speed advantages the optimal way of doing this gives you.



--
Frank Earl





Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 03:01 am, José Fonseca wrote:

 Wow! Bummer... I already had convinced myself that the card was secure!

It is, if you don't rely on a register being set by something for your 
control of things.  You may get peak performance with the design in question, 
but it's not secure.  I'd almost bet we could get as good a performance doing 
it with what I'd started, if we re-worked the interrupts so that we doubled 
them up, using the scanline one in addition to the VBLANK one. 

 Yep. It's not the end of the world, but it's gonna mean that the CPU will
 be a little more stressed, and that we have much more code to do...

If you guys don't mind, I'd like to revisit the work by modernizing my branch 
and finalizing what I'd started.  I think it'd do well and make it secure. 


-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread José Fonseca

On 2002.05.25 17:16 Frank C. Earl wrote:
 On Saturday 25 May 2002 03:01 am, José Fonseca wrote:
 
  Wow! Bummer... I already had convinced myself that the card was secure!
 
 It is, if you don't rely on a register being set by something for your
 control of things.  ...

Frank, Leif was pretty clear and I quote:

it IS possible to derail a bus master in progress and set it 
processing from a different table in mid-stream.  Plus, if the address is 
bogus or the table is misconstructed, this will cause an engine lockup and 
take out DMA until the machine is cold restarted.

And this can happen regardless of whether a specific register is read or 
not.  (In fact, if you look at the test case you'll see that no register is 
being read except for debugging purposes.)

  Yep. It's not the end of the world, but it's gonna mean that the CPU will
  be a little more stressed, and that we have much more code to do...
 
 If you guys don't mind, I'd like to revisit the work by modernizing my
 branch
 and finalizing what I'd started.  I think it'd do well and make it secure
 

Sure, Frank.  I hope you can prove us wrong, but before you dedicate too 
much time to it, don't forget that it's now pretty straightforward to come 
up with a test case to break the transfer.  So if you can't secure it in 
the end, your extra effort will be in vain.

José Fonseca




Re: [Dri-devel] Mach64 dma fixes (fwd)

2002-05-25 Thread Leif Delgass

Forgot to cc the list...

-- 
Leif Delgass 
http://www.retinalburn.net

-- Forwarded message --
Date: Sat, 25 May 2002 12:56:08 -0400 (EDT)
From: Leif Delgass [EMAIL PROTECTED]
To: Frank C. Earl [EMAIL PROTECTED]
Subject: Re: [Dri-devel] Mach64 dma fixes

On Sat, 25 May 2002, Frank C. Earl wrote:

 On Saturday 25 May 2002 12:10 am, you wrote:
 
  So it would appear that allowing clients to add register commands to a
  buffer without verifying them is _not_ secure.  This is going to make
  things harder, especially for vertex buffers.  This is going to require
  copying buffers and adding commands or unmapping and verifying
  client-submitted buffers in the DRM.  I'd like to continue on the path
  we're on until we can get DMA running smoothly and then we'll have to come
  back and fix this problem.
 
 Check back to what I'd said for my work- how I'd envisioned things working.  
 
 It doesn't really rely on a register being set.  Yes, it uses a register to 
 verify a completion of a pass, but the main way to do that is to see if the 
 chip's idle- it's more of an extra, redundant check.  If the chip is idle, we 
 know it's done with the pass.  If it's not idle after about 3-5 interrupts, 
 you know it's probably locked and needs a reset.  Now, with the DRI locks, 
 etc. we can ensure that we know there's going to be nobody that isn't a 
 client pushing stuff out until we're done and flagged as such.  

What prevents a client from modifying the contents of a buffer after it's 
been submitted?  Sure, you can't send new buffers without the lock, but 
the client can still write to a buffer that's already been submitted and 
dispatched without holding the lock.

 We also know that nobody's going to be allowed register access, so they
 can't keep the engine on the chip busy except by DMA requests.

The registers are already being mapped read only in client space now.

 DMA requests are not initiated by the callers; they're handled by a
 block of code tied to an interrupt handler in the module.  This code, if
 it gets the lock, submits a combined DMA pass of as much of the
 submitter's data as is reasonable.  We then check to see if that pass is
 completed every time the interrupt gets called.  Upon completion, you
 unlink the buffers in the pass and hand them to the free list.

 With that, you're already as secure as you're likely going to get with the 
 Mach64- the DRM is in the driver's seat the whole time for any submitted 
 data.  Otherwise, you're going to be doing copying of some sort, which pretty 
 much burns up any speed advantages the optimal way of doing this gives you. 

I don't see the interrupt method being that different from a security 
perspective.  The DRM is in the driver's seat in either case, the method 
without interrupts is essentially the same, but with the trigger for 
starting a new pass in a different place.  The problem isn't just relying 
on reading registers that can be modified by the client, but ensuring that 
the client doesn't add commands to derail the DMA pass or lock the engine.
The only way to make sure this doesn't happen is by copying or 
unmapping and verifying the buffers.  I think the i830 driver does this.  
Yes, it will impact performance, but I don't see a way to get around it 
and still make the driver secure.  At least this extra work can be done 
while the card is busy with a DMA operation.

-- 
Leif Delgass 
http://www.retinalburn.net








Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 11:48 am, José Fonseca wrote:
 On 2002.05.25 17:16 Frank C. Earl wrote:

 Frank, Leif was pretty clear and I quote:

   it IS possible to derail a bus master in progress and set it
 processing from a different table in mid-stream.  Plus, if the address is
 bogus or the table is misconstructed, this will cause an engine lockup and
 take out DMA until the machine is cold restarted.

 And this can happen regardless if a specific register is to be read or
 not. (In fact, if you look at the test case you'll see that no register is
 being read except for debugging purposes.)

Yep.  I looked at his example again and it just didn't click when I looked 
at it the first time.  I really do need to NOT post things just after 
waking up...  This is extremely disappointing, to say the least.  Doing the 
copying is going to eat at least part, if not all, of the advantage of 
either route.


-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 11:56 am, you wrote:

 What prevents a client from modifying the contents of a buffer after it's
 been submitted?  Sure, you can't send new buffers without the lock, but
 the client can still write to a buffer that's already been submitted and
 dispatched without holding the lock.

Nothing.  If the chip had been as secure as we'd initially thought, it 
wouldn't have mattered, because the worst they could do is scribble all 
over the screen.  

If you're unmapping on submission, you don't have to lock things on the 
client end, because they can't alter the buffer after the fact.  Then you 
only have to worry about bad data.  In this case, what you're going to want 
to do is unmap, build the real structure by filling in the commands for the 
vertex entries, and submit to the processing queue.  Multiple callers could 
then still submit what they wanted to be DMAed without waiting (in the 
current model, don't each of the clients have to wait if one's got the 
lock?) because there's a piece of code multiplexing the DMA resource 
instead of a lock managing it.

 I don't see the interrupt method being that different from a security
 perspective.  The DRM is in the driver's seat in either case, the method
 without interrupts is essentially the same, but with the trigger for
 starting a new pass in a different place.  The problem isn't just relying
 on reading registers that can be modified by the client, but ensuring that
 the client doesn't add commands to derail the DMA pass or lock the engine.
 The only way to make sure this doesn't happen is by copying or
 unmapping and verifying the buffers.  I think the i830 driver does this.
 Yes, it will impact performance, but I don't see a way to get around it
 and still make the driver secure.  At least this extra work can be done
 while the card is busy with a DMA operation.

If it had been secure and you couldn't derail DMA, it wouldn't have pieces 
that could be confused by malicious clients, meaning you wouldn't need to do 
copying, etc. to secure the pathway, ensuring peak overall performance.  With 
your latest test case, it's a moot point.
  
We're going to have to secure the stream proper in the form of code that has 
inner loops, etc.  (The i830 does an unmap and a single append only; we've 
got a lot more to do with the Mach64.  I've been thinking of ways around that 
on the i830 and i810 that I'm going to be trying at some point.)  Your way 
would be as secure in this environment.

Now, as to which is more efficient, that's still up for debate.  I can't say 
which is going to be faster overall.  There's the aging in your design that 
allows for buffers being released sooner than in mine.  There's the need for 
serialization in your design that is unrequired in mine.  Which causes the 
worst bottlenecks in performance?


-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Leif Delgass

On Sat, 25 May 2002, Frank C. Earl wrote:

 On Saturday 25 May 2002 11:56 am, you wrote:
 
  What prevents a client from modifying the contents of a buffer after it's
  been submitted?  Sure, you can't send new buffers without the lock, but
  the client can still write to a buffer that's already been submitted and
  dispatched without holding the lock.
 
 Nothing.  If the chip had been as secure as we'd initially thought, it would 
 have not mattered because all they'd do is scribble all over the screen at 
 the worst.  
 
 If you're unmapping on submission, you don't have to lock things on the 
 client end because they can't alter after the fact.  Then you only have to 
 worry about bad data.  In this case, what you're going to want to do is to 
 unmap, build the real structure by filling in the commands for the vertex 
 entries, and submit to the processing queue.   Multiple callers could then 
 still submit what they wanted to be DMAed without waiting (in the current 
 model, don't each of the clients have to wait if one's got the lock?) because 
 there's a peice of code multiplexing the DMA resource instead of a lock 
 managing it.

I'm using the same model you had set up.  When a client submits a buffer,
it's added to the queue (but not dispatched) and there's no blocking.  
The DRM batch submits buffers when the high water mark is reached or the
flush ioctl is called (needed before reading/writing to the framebuffer,
e.g.).  Clients have to wait for the lock to submit the buffer, but the
ioctl quickly returns.  The only place where a client has to wait is in
freelist_get if the freelist is empty.  That's where buffer aging or
reading the ring head allows the call to return as soon as a single buffer
is available, rather than waiting for the whole DMA pass to complete.
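
The buffer-aging idea described here can be sketched roughly as follows: each submitted buffer is stamped with the ring position at which the hardware will have consumed it, and `freelist_get` can hand a buffer back as soon as the ring head has passed that stamp. The names and layout (`dma_buf`, `age`, `in_use`) are illustrative, not the actual mach64 DRM structures.

```c
#include <stdint.h>
#include <stddef.h>

struct dma_buf {
    uint32_t age;      /* ring-head value at which the hardware is done */
    int      in_use;
};

#define NBUFS 4
static struct dma_buf bufs[NBUFS];

/* Returns a free or aged-out buffer, or NULL if every buffer is still
 * pending (the caller then sleeps and re-reads the ring head).  The
 * caller stamps 'age' when it dispatches the buffer. */
static struct dma_buf *freelist_get(uint32_t ring_head)
{
    for (int i = 0; i < NBUFS; i++) {
        if (!bufs[i].in_use || ring_head >= bufs[i].age) {
            bufs[i].in_use = 1;
            return &bufs[i];
        }
    }
    return NULL;
}
```

The point of the design is visible in the reclaim condition: a single buffer becomes available as soon as the head passes its age, without waiting for the whole DMA pass.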
 
  I don't see the interrupt method being that different from a security
  perspective.  The DRM is in the driver's seat in either case, the method
  without interrupts is essentially the same, but with the trigger for
  starting a new pass in a different place.  The problem isn't just relying
  on reading registers that can be modified by the client, but ensuring that
  the client doesn't add commands to derail the DMA pass or lock the engine.
  The only way to make sure this doesn't happen is by copying or
  unmapping and verifying the buffers.  I think the i830 driver does this.
  Yes, it will impact performance, but I don't see a way to get around it
  and still make the driver secure.  At least this extra work can be done
  while the card is busy with a DMA operation.
 
 If it had been secure and you couldn't derail DMA, it wouldn't have pieces 
 that could be confused by malicious clients, meaning you wouldn't need to do 
 copying, etc. to secure the pathway, ensuring peak overall performance.  With 
 your latest test case, it's a moot point.
   
 We're going to have to secure the stream proper in the form of code that has 
 inner loops, etc.  (The i830 does an unmap and a single append only; we've 
 got a lot more to do with the Mach64.  I've been thinking of ways around that 
 on the i830 and i810 that I'm going to be trying at some point.)  Your way 
 would be as secure in this environment.

For vertex data, we can add the register commands based on the primitive 
type and buffer size.  By placing the commands, we can ensure that any 
commands in the buffer would just be seen as data.  This would require an 
unmap and loop through the buffer, but we wouldn't have to copy all the 
data.  I'm going to try doing gui-master blits using BM_HOSTDATA rather 
than BM_ADDR and HOST_DATA[0-15] and see if we can eliminate the register 
commands in the buffer.  We could also use system bus masters for blits, 
but that would require ending the current DMA op and setting up a new one 
for each blit, since blits done this way use BM_SYSTEM_TABLE instead of 
BM_GUI_TABLE.  With BM_HOSTDATA it would be a matter of changing the 
descriptors for blits, but they could co-exist in the same stream as 
vertex and state gui-master ops.
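
The command-placement scheme above can be sketched as: the kernel, not the client, writes the register-command dwords into the dispatch stream, sized from the primitive type and payload length, so client dwords can only ever be consumed as vertex data. `VERTEX_REG` and the command encoding below are made up for illustration; they are not the real mach64 command format.

```c
#include <stdint.h>
#include <stddef.h>

#define VERTEX_REG      0x0200u            /* illustrative register offset */
#define CMD(reg, count) (((uint32_t)(count) << 16) | (reg))

/* Emits a kernel-chosen "write n dwords to VERTEX_REG" command followed
 * by the client payload.  Returns the number of dwords written to 'out',
 * or 0 if the output buffer is too small. */
static size_t build_vertex_cmds(uint32_t *out, size_t cap,
                                const uint32_t *payload, size_t n)
{
    if (n + 1 > cap)
        return 0;
    out[0] = CMD(VERTEX_REG, n);           /* command dword: kernel-owned */
    for (size_t i = 0; i < n; i++)
        out[i + 1] = payload[i];           /* client dwords: data only */
    return n + 1;
}
```

Because the count in the command dword exactly covers the payload, a register write the client smuggles into the payload is swallowed as vertex data rather than decoded as a command.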
 
 Now, as to which is more efficient, that's still up for debate.  I can't say 
 which is going to be faster overall.  There's the aging in your design that 
 allows for buffers being released sooner than in mine.  There's the need for 
 serialization in your design that is unrequired in mine.  Which causes the 
 worst bottlenecks in performance?

As I explained above, serialization isn't needed.  It's really a question 
of which method of checking completion and dispatching buffers leaves the 
least amount of idle time.  Buffer aging could still be used in the 
interrupt driven model, so that's not really a constraint of one approach 
versus the other.  I don't think it would be too difficult to test both 
methods without too much change in the basic code infrastructure.

-- 
Leif Delgass 
http://www.retinalburn.net



Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread José Fonseca

Frank,

On 2002.05.25 18:24 Frank C. Earl wrote:
 ...  This is extremely disappointing to say the least.  Doing the
 copying is going to eat at least part if not all the advantage of doing
 either route.

Yes, it's something we have to deal with regardless of how we flush the DMA 
buffers.

Of course it will always be slower, but I think the cost can be reduced to 
a barely noticeable difference.  It really depends on where the bottleneck 
will be in a regular OpenGL application.  On older CPUs with a mach64, 
perhaps not, but the laptops where the mach64 chip is common have fairly 
good CPUs compared with the Mach64's abilities, so I believe the bottleneck 
will be on the card.  This means that if we do this right, i.e., do it the 
least CPU-intensive way and use fairly large buffers (since the Mach64 
allows the use of scatter-gather memory), then the only difference will be 
a slightly increased latency, but not really a lower number of fps.  In 
other words, the bandwidth to the card should be unaffected.

Anyway, until then we still have to optimize the vertex buffer 
construction, and after that we should be able to compare the performance 
with and without this security enforcement.

Regards,

José Fonseca




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 11:48 am, José Fonseca wrote:

 So if you can't secure it in the end, your extra effort will be in vain.

I just thought of something to try to change the nature of the test case 
problem.  What happens if you have a second descriptor of commands that 
merely resets the DMA engine settings to what they should be for the third 
descriptor?  I'd say it'd depend on what the chip was actually doing- what do 
you guys think?

-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 01:14 pm, Leif Delgass wrote:

 I'm using the same model you had set up.  When a client submits a buffer,
 it's added to the queue (but not dispatched) and there's no blocking.
 The DRM batch submits buffers when the high water mark is reached or the
 flush ioctl is called (needed before reading/writing to the framebuffer,
 e.g.).  Clients have to wait for the lock to submit the buffer, but the
 ioctl quickly returns.  The only place where a client has to wait is in
 freelist_get if the freelist is empty.  That's where buffer aging or
 reading the ring head allows the call to return as soon as a single buffer
 is available, rather than waiting for the whole DMA pass to complete.

I guess I misunderstood somewhere.  Then the only real question is, can we 
safely/stably manage aging or do we need to do it the way I had planned? 
Got it.

 For vertex data, we can add the register commands based on the primitive
 type and buffer size.  By placing the commands, we can ensure that any
 commands in the buffer would just be seen as data.  This would require an
 unmap and loop through the buffer, but we wouldn't have to copy all the
 data.  I'm going to try doing gui-master blits using BM_HOSTDATA rather
 than BM_ADDR and HOST_DATA[0-15] and see if we can elimintate the register
 commands in the buffer.  We could also use system bus masters for blits,
 but that would require ending the current DMA op and setting up a new one
 for each blit, since blits done this way use BM_SYSTEM_TABLE instead of
 BM_GUI_TABLE.  With BM_HOSTDATA it would be a matter of changing the
 descriptors for blits, but they could co-exist in the same stream as
 vertex and state gui-master ops.

It's still something that eats cycles, which we wouldn't have to do if we 
could have secured the chip better.  Unmapping's not a good thing to be 
doing with something you're trying to do quickly, and it's rough on the 
kernel memory system.

  Now, as to which is more efficient, that's still up for debate.  I can't
 As I explained above, serialization isn't needed.  It's really a question
 of which method of checking completion and dispatching buffers leaves the
 least amount of idle time.  Buffer aging could still be used in the
 interrupt driven model, so that's not really a constraint of one approach
 versus the other.  I don't think it would be too difficult to test both
 methods without too much change in the basic code infrastructure.

Works for me.

-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread José Fonseca

On 2002.05.25 20:36 Frank C. Earl wrote:
 On Saturday 25 May 2002 11:48 am, José Fonseca wrote:
 
  So if you can't secure it in the end, your extra effort will be in vain.
 
 I just thought of something to try to change the nature of the test case
 problem.  What happens if you have a second descriptor of commands that
 merely resets the DMA engine settings to what they should be for the
 third descriptor?  I'd say it'd depend on what the chip was actually
 doing - what do you guys think?

You mean, setting the descriptor to the right value in between?

Hmm... I doubt that doing it in the middle works, because as Leif noticed, 
the changes that we make to BM_GUI_TABLE only affect the descriptor that is 
two entries ahead, so it would be too late...

..but your idea in principle is quite ingenious!  What if we just fill the 
last 8 bytes of each 4K block with a command to reset the value of 
BM_GUI_TABLE?  So even if the client tries to mess things up, we would put 
it right in the end.  Of course it would be a pain [to code for] to 
reserve 8 bytes at the end of each 4K block, but it's doable.  [It would 
also be a pain to code the kernel unmap/verification routine.]

This means that not only can't the client mess up the descriptor table, 
they also can't tamper with BM_GUI_TABLE, so we can still use it as a 
progression meter [in both implementations].
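
The reserved-tail idea could be sketched like this: the kernel keeps the final 8 bytes (one register/value pair) of every 4K block for itself and writes a command there that restores BM_GUI_TABLE to the correct next-descriptor address, so a smuggled BM_GUI_TABLE write in the client payload is undone before the next descriptor is fetched. The command encoding and `BM_GUI_TABLE_CMD` offset below are stand-ins, not the real mach64 format.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_DWORDS     1024u   /* 4 KiB / 4 bytes per dword */
#define RESERVED_DWORDS  2u      /* trailing (reg, value) pair */
#define BM_GUI_TABLE_CMD 0x0588u /* illustrative register offset */

/* Copies at most BLOCK_DWORDS - RESERVED_DWORDS client dwords into the
 * block, then seals it with the kernel's BM_GUI_TABLE reset command.
 * Returns the number of payload dwords actually copied. */
static uint32_t seal_block(uint32_t *block, const uint32_t *payload,
                           uint32_t n, uint32_t next_table_addr)
{
    uint32_t limit = BLOCK_DWORDS - RESERVED_DWORDS;
    if (n > limit)
        n = limit;               /* client data can never reach the tail */
    memcpy(block, payload, n * sizeof(uint32_t));
    block[BLOCK_DWORDS - 2] = BM_GUI_TABLE_CMD;
    block[BLOCK_DWORDS - 1] = next_table_addr;
    return n;
}
```

The clamp to `limit` is what makes the scheme hold: no client-supplied length can overwrite the kernel-owned trailing pair.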

José Fonseca




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Leif Delgass

On Sat, 25 May 2002, José Fonseca wrote:

 On 2002.05.25 20:36 Frank C. Earl wrote:
  On Saturday 25 May 2002 11:48 am, José Fonseca wrote:
  
    So if you can't secure it in the end, your extra effort will be in vain.
   
   I just thought of something to try to change the nature of the test case
   problem.  What happens if you have a second descriptor of commands that
   merely resets the DMA engine settings to what they should be for the
   third descriptor?  I'd say it'd depend on what the chip was actually
   doing - what do you guys think?
 
 You mean, setting the descriptor to the right value in between?
 
 Hmm... I doubt that doing it in the middle works, because as Leif noticed, 
 the changes that we make to BM_GUI_TABLE only affect the descriptor that is 
 two entries ahead, so it would be too late...
 
 ..but your idea in principle is quite ingenious!  What if we just fill the 
 last 8 bytes of each 4K block with a command to reset the value of 
 BM_GUI_TABLE?  So even if the client tries to mess things up, we would put 
 it right in the end.  Of course it would be a pain [to code for] to 
 reserve 8 bytes at the end of each 4K block, but it's doable.  [It would 
 also be a pain to code the kernel unmap/verification routine.]
 
 This means that not only can the client not mess up the descriptor table, 
 it also can't tamper with BM_GUI_TABLE, so we can still use it as a 
 progression meter [in both implementations].

This had crossed my mind too.  The only problem is that there could still 
be a short period of time where BM_GUI_TABLE isn't accurate, so it still 
leaves the problem of being able to trust the contents of BM_GUI_TABLE for 
buffer aging and adding descriptors to the ring.

-- 
Leif Delgass 
http://www.retinalburn.net





Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Leif Delgass

On Sat, 25 May 2002, Leif Delgass wrote:

 On Sat, 25 May 2002, José Fonseca wrote:
 
  On 2002.05.25 20:36 Frank C. Earl wrote:
   On Saturday 25 May 2002 11:48 am, José Fonseca wrote:
   
So if you can't secure it in the end, your extra effort will be in
   vain.
   
   I just thought of something to try to change the nature of the test case
   problem.  What happens if you have a second descriptor of commands that
   merely resets the DMA engine settings to what they should be for the
   third
   descriptor?  I'd say it'd depend on what the chip was actually doing-
   what do you guys think?
  
  You mean, setting the descriptor to the right value in between?
  
  hmm... I doubt that doing it in the middle works because, as Leif noticed, 
  the changes that we make to BM_GUI_TABLE only affect the descriptor that 
  is two entries ahead, so it would be too late...
  
  ...but your idea in principle is quite ingenious! What if we just fill the 
  last 8 bytes of each 4K block with that command to reset the value of 
  BM_GUI_TABLE? Then even if the client tries to mess things up, we would put 
  it right in the end. Of course it would be a pain [to code for] to 
  reserve 8 bytes at the end of each 4K block, but it's doable [It would 
  also be a pain to code the kernel unmap/verification routine]
  
  This means that not only can the client not mess up the descriptor table, 
  it also can't tamper with BM_GUI_TABLE, so we can still use it as a 
  progression meter [in both implementations].
 
 This had crossed my mind too.  The only problem is that there could still 
 be a short period of time where BM_GUI_TABLE isn't accurate, so it still 
 leaves the problem of being able to trust the contents of BM_GUI_TABLE for 
 buffer aging and adding descriptors to the ring.

It just occurred to me that resetting BM_GUI_TABLE could put the card into 
a loop.  You might have to put in an address two descriptors away.  We'd 
need to test this.

-- 
Leif Delgass 
http://www.retinalburn.net





Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 03:44 pm, Leif Delgass wrote:

 This had crossed my mind too.  The only problem is that there could still
 be a short period of time where BM_GUI_TABLE isn't accurate, so it still
 leaves the problem of being able to trust the contents of BM_GUI_TABLE for
 buffer aging and adding descriptors to the ring.

Yeah, but that's only a problem if you're aging them.  This is more 
food-for-thought type stuff at this point- it all boils down to what the 
optimal secure way of doing things is.  The fewer things we do per 
submission, the better.  We may still end up unmapping things, but pushing 
out 8 bytes for each 4k is less work than pushing out a command for every 
vertex in the buffer.  I'd love to come up with a way to not have to unmap 
things at all, if possible, so that we're not doing that either.  Like I 
said, it's not really something you want to do often (just like you don't 
want to do ioctls all that often either...  :-)

-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 03:50 pm, Leif Delgass wrote:

 It just occurred to me that resetting BM_GUI_TABLE could put the card into
 a loop.  You might have to put in an address two descriptors away.  We'd
 need to test this.

Hmm...  Didn't think about that possibility.  I had this last line of 
thought while I was mowing the yard (still doing yardwork...).  I was going 
to plug your test case, the one José just came up with, and my proposed one 
into my test driver code and see what comes out of all of it.

-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread José Fonseca

On 2002.05.25 21:50 Leif Delgass wrote:
 On Sat, 25 May 2002, Leif Delgass wrote:
 
  On Sat, 25 May 2002, José Fonseca wrote:
 
  ...
 
  This had crossed my mind too.  The only problem is that there could
 still
  be a short period of time where BM_GUI_TABLE isn't accurate, so it
 still
  leaves the problem of being able to trust the contents of BM_GUI_TABLE
 for
  buffer aging and adding descriptors to the ring.
 

I see... I had the [wrong] impression that the value wasn't actually 
changed in the process, since it is preincremented... have you tested 
whether this actually happens?

Anyway, this doesn't prevent us from using buffer aging. At the end of 
processing a buffer we still end up with the correct value of BM_GUI_TABLE, 
and we can use the last bit of BM_COMMAND to know whether it's processing 
the last entry or not. The only drawback is that we aren't able to reclaim 
the buffers as soon, so we would need a scratch register to know which 
buffers were free or not.

But then, what damage can a client do by tampering with BM_COMMAND?

Even if it can mess with BM_COMMAND, we can still work without it. We just 
let the card go on, but when it stops we just need to see which was the 
last processed buffer and go on from there.

But probably there are more registers that the client can mess with and 
damage... it's probably not just BM_GUI_TABLE and BM_COMMAND but a bunch 
more of them, and we're effectively reducing the card's communication 
bandwidth by asking it to reset so many things at the end of _each_ 4KB 
buffer.


 It just occurred to me that resetting BM_GUI_TABLE could put the card
 into
 a loop.  You might have to put in an address two descriptors away.  We'd
 need to test this.

Yes. Unfortunately I don't have the time to do it myself, so as far as I'm 
concerned it will have to wait.


In any event, I think these issues should really be addressed only when the 
DMA work is complete. Our knowledge of the card's behavior is constantly 
changing, and it would be a waste of time to make such an effort to get 
things working only to discover later that it was hopeless.

In fact, although I try to stay impartial and keep an open mind about 
this, I can't stop thinking that we aren't going to accomplish anything by 
circumventing the card's security drawbacks like this. It's like covering 
the sun with that-thing-full-of-holes-which-I-don't-recall-the-name... I'm 
getting more and more inclined towards a robust implementation, like 
unmapping buffers, which probably won't have a significant performance 
impact and which will give us much more freedom to control the card 
properly. Well, time will tell...

José Fonseca




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Frank C. Earl

On Saturday 25 May 2002 04:27 pm, you wrote:

 Anyway, this doesn't prevent us from using buffer aging. At the end of
 processing a buffer we still end up with the correct value of BM_GUI_TABLE,
 and we can use the last bit of BM_COMMAND to know whether it's processing
 the last entry or not. The only drawback is that we aren't able to reclaim
 the buffers as soon, so we would need a scratch register to know which
 buffers were free or not.

 But then, what damage can a client do by tampering with BM_COMMAND?

 Even if it can mess with BM_COMMAND, we can still work without it. We just
 let the card go on, but when it stops we just need to see which was the
 last processed buffer and go on from there.

 But probably there are more registers that the client can mess with and
 damage... it's probably not just BM_GUI_TABLE and BM_COMMAND but a bunch
 more of them, and we're effectively reducing the card's communication
 bandwidth by asking it to reset so many things at the end of _each_ 4KB
 buffer.

We're just going to have to play with it.  As far as I'm concerned, we're 
going forward with things as they are at this point.  I'm just spending a 
little of what little time I have exhausting all possible ways of avoiding 
unneeded operations- as long as my branch stays mostly in lock-step with 
the functionality you're providing, I'll be happy.  I won't spend too much 
more time trying this stuff, but I want to know.  If something comes of it, 
great.  If not, well, it was my time to waste and it's not really wasted- 
we'll have answers when someone else comes along and asks if that's the 
best we can do with this chip.

I'm just not liking having to secure the path the way it's currently being 
thought of.  Not because I'm against doing the work- it's that there is a 
bottleneck in doing things that way.

Some of the things we would do to secure things don't consume a lot of 
resources CPU-wise.  Some of the things we would do, if we can't simply 
ignore the path, would consume a lot of resources.  If the books covering 
the design of the Linux kernel are to be believed, there's a fair amount of 
work involved in mapping or unmapping memory to/from userspace- it really 
wasn't designed with the kind of usage we're asking of it (namely doing a 
LOT of it, and doing it quickly and often...) in mind.

 Yes. Unfortunately I don't have the time to do it myself, so as far as I'm
 concerned it will have to wait.

I'm planning on setting things up this evening to see whether any of this 
is worth pursuing as a continuing conversation.

 In any event, I think these issues should really be addressed only when
 the DMA work is complete. Our knowledge of the card's behavior is
 constantly changing, and it would be a waste of time to make such an
 effort to get things working only to discover later that it was hopeless.

Indeed.  That's why I intend to tinker a little while longer with this, but 
only so much.  As for knowledge of the card's behavior changing and 
whatnot: I'd believe that not everything is known about any of the other 
cards out there, either.

 In fact, although I try to stay impartial and keep an open mind about
 this, I can't stop thinking that we aren't going to accomplish anything by
 circumventing the card's security drawbacks like this. It's like covering
 the sun with that-thing-full-of-holes-which-I-don't-recall-the-name... I'm
 getting more and more inclined towards a robust implementation, like
 unmapping buffers, which probably won't have a significant performance
 impact and which will give us much more freedom to control the card
 properly. Well, time will tell...

That's my take on things...  

-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Linus Torvalds



On Sat, 25 May 2002, Frank C. Earl wrote:

 Linus, if you're still listening in, can you spare us a moment to tell us
 what consequences quickly mapping and unmapping memory regions into userspace
 has on the system?

It's reasonably fine on UP, and it often _really_ sucks on SMP.

On UP, the real cost is not so much the actual TLB invalidate (which works
at a page granularity anyway on any recent CPU), but the fact that you
need to walk the page tables (cache miss heaven), and you will eventually
need to fault another page back in (page fault, cache miss, whatever).

On SMP, especially if the program is threaded (which games often are:
even if the actual graphics engine is single-threaded, you end up having
another thread for sound, one possibly for AI or input etc), the cost goes
up noticeably thanks to a (synchronous) CPU cross-call for a proper TLB
invalidate.

 We've got a couple of the DRM modules that do that to
 ensure the driver is secure.  I'm thinking it's a source of some performance
 degradation in the drivers and it may not be good on the memory subsystem.

My gut feel is that especially under SMP, you're actually better off
copying stuff, especially if we're talking about buffers that are mostly
less than few kB.

Basically, if the data can fit in the cache (ie the app has just generated
them, and the data is already in the CPU cache and not big enough to blow
that cache to kingdom come), copying is almost guaranteed to be a win,
even on UP.

(And please do note the cache issues: while a big buffer can often improve
performance, it can equally easily _decrease_ performance by putting more
cache pressure on the system. You're often better off re-using a smaller
8kB buffer many times - and doing most everything out of the cache - than
trying to use a 1MB buffer and aiming for perfect scaling).
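The reuse-a-small-buffer point can be sketched like this (illustrative only; `chunked_copy` and `CHUNK` are made-up names, and the real win would have to be measured):

```c
#include <stddef.h>
#include <string.h>

#define CHUNK (8 * 1024)  /* small enough to stay cache-resident */

/* Stream a large payload through one small, reused bounce buffer so
 * the hot working set stays cache-sized, instead of touching a huge
 * buffer exactly once and missing the cache on every line. */
static void chunked_copy(char *dst, const char *src, size_t len)
{
    static char bounce[CHUNK];  /* reused every iteration, cache-hot */
    size_t off;

    for (off = 0; off < len; off += CHUNK) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;
        memcpy(bounce, src + off, n);  /* "producer" fills the hot buffer */
        memcpy(dst + off, bounce, n);  /* "consumer" drains it while hot */
    }
}
```

Despite doing twice the copies, the per-iteration working set is ~16 KB, which is the effect the pipe benchmark below benefits from.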

DMA'ing directly from user space is most likely advantageous for doing
things like textures, which are bound to be fairly big anyway. I'd _hope_
that those don't have security issues (ie they'd be DMA'able as just data,
no command interface), but I don't have any information about the card
details.

Linus





Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread José Fonseca

On 2002.05.26 00:49 Linus Torvalds wrote:
 
 
 On Sat, 25 May 2002, Frank C. Earl wrote:
 
  Linus, if you're still listening in, can you spare us a moment to tell us
  what consequences quickly mapping and unmapping memory regions into
  userspace has on the system?
 
 It's reasonably fine on UP, and it often _really_ sucks on SMP.
 
 On UP, the real cost is not so much the actual TLB invalidate (which
 works
 at a page granularity anyway on any recent CPU), but the fact that you
 need to walk the page tables (cache miss heaven), and you will eventually
 need to fault another page back in (page fault, cache miss, whatever).
 
 On SMP, especially if the program is threaded (which games often are:
 even if the actual graphics engine is single-threaded, you end up having
 another thread for sound, one possibly for AI or input etc), the cost
 goes
 up noticeably thanks to a (synchronous) CPU cross-call for a proper TLB
 invalidate.
 
  We've got a couple of the DRM modules that do that to
  ensure the driver is secure.  I'm thinking it's a source of some
  performance degradation in the drivers and it may not be good on the
  memory subsystem.
 
 My gut feel is that especially under SMP, you're actually better off
 copying stuff, especially if we're talking about buffers that are
 mostly less than few kB.

The vertex data alone (no textures here) can be several MBs per frame, and 
the number of frames per second can be as high as the card can handle, so 
the total buffer memory must also be big. I don't know whether having lots 
of small buffers would create overhead from the ioctls and buffer 
submission (well, mostly the ioctls, since buffers can be queued by the 
kernel).

Throwing out some numbers just to get a rough idea: 2 [MB/frame] x 
25 [frames/second] / 4 [KB/buffer] = 12800 buffers/second.

I'm not very familiar with these issues, but won't this number of ioctls 
per second create a significant overhead? Or would the benefit of having 
each buffer fit in the cache (facilitating the copy) prevail?

At the other extreme we would have, e.g., a 2MB buffer costing a single 
ioctl + unmapping & mapping to user space.

(I know that most likely we will need to benchmark this anyway...)
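Restating the back-of-envelope figure above so it can be checked mechanically (a hypothetical helper, not driver code):

```c
/* Back-of-envelope check: vertex data per frame (in MB) times frame
 * rate, chopped into fixed-size DMA buffers, gives buffers per second.
 * 2 MB/frame * 25 frames/s / 4 KB/buffer = 12800 buffers/s. */
static long buffers_per_second(long mb_per_frame, long frames_per_sec,
                               long buffer_bytes)
{
    return mb_per_frame * 1024 * 1024 * frames_per_sec / buffer_bytes;
}
```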

 Basically, if the data can fit in the cache (ie the app has just
 generated
 them, and the data is already in the CPU cache and not big enough to blow
 that cache to kingdom come), copying is almost guaranteed to be a win,
 even on UP.
 
 (And please do note the cache issues: while a big buffer can often
 improve
 performance, it can equally easily _decrease_ performance by putting more
 cache pressure on the system. You're often better off re-using a smaller
 8kB buffer many times - and doing most everything out of the cache - than
 trying to use a 1MB buffer and aiming for perfect scaling).
 
 DMA'ing directly from user space is most likely advantageous for doing
 things like textures, which are bound to be fairly big anyway. I'd _hope_
 that those don't have security issues (ie they'd be DMA'able as just
 data,
 no command interface), but I don't have any information about the card
 details.

Yes. There are several ways to accomplish blits with this card, and at 
least one of them is secure and efficient.

José Fonseca




Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Linus Torvalds



On Sun, 26 May 2002, José Fonseca wrote:

 The vertex data alone (no textures here) can be several MBs per frame

Yes, yes, I realize that the cumulative sizes are big. The question is not
the absolute size, but the size of one bunch.

 Throwing out some numbers just to get a rough idea: 2 [MB/frame] x
 25 [frames/second] / 4 [KB/buffer] = 12800 buffers/second.

The thing is, if you do processing of vertices, I wouldn't be surprised if 
you're better off using an 8kB buffer over and over, doing 6400 system 
call entries, than actually trying to buffer up 2MB and then doing just 25 
system call entries.

Sure, in one case you do 6400 system calls, and in the other case you do
only 25, so people who are afraid of system calls think that obviously
the 25 system calls must be faster.

But that obviously is just wrong. Pretty much all modern CPUs handle 
big working sets badly, and handle tight, nice loops very well.

Just to take a non-graphics-related example: on my machine, lmbench
reports that I get pipe bandwidths that sometimes exceed 1GB/s.

At the same time, a normal memcpy() goes along at 625MB/s.

In short: according to that benchmark it is _faster_ to copy data from one 
process to another through a pipe than it is to use memcpy() within one 
process.

That's obviously a load of bull, and yet lmbench isn't really lying. The
reason the pipe throughput is higher than the memory copy throughput is
simply that the pipe data is chunked up in 4kB chunks, and because the
source and the destinations are re-used in the pipe benchmarks in 64kB
chunks, you get a lot better cache behaviour.

(In fact, even TCP beats a plain memcpy() occasionally, which also says
that the Linux TCP layer is an impressive piece of work _despite_ the same
cache advantage).

 I'm not very familiar with these issues, but won't this number of ioctls
 per second create a significant overhead? Or would the benefit of having
 each buffer fit in the cache (facilitating the copy) prevail?

A hot system call takes about 0.2 us on an athlon (it takes significantly
longer on a P4, which I'm beating up Intel over all the time). The ioctl
stuff goes through slightly more layers, but we're not talking huge
numbers here. The system calls are fast enough that you're better off
trying to keep stuff in the cache, than trying to minimize system calls.

(The memcpy() example is a perfect example of something where _zero_
system calls is slower than two system calls and a task switch, simply
because the zero system call example ends up being all noncached).

Note that the cache issues show up on the instruction side too, especially
on the P4 which has a fairly small trace cache. You can often make things
go faster by simplifying and streamlining the code rather than trying to
be clever and having a big footprint. Ask Keith Packard about the X frame
buffer code and this very issue some day.

NOTE NOTE NOTE! The tradeoffs are seldom all that clear. Sometimes big
buffers and few system calls are better. Sometimes they aren't. It just
depends on a lot of things.

Linus





Re: [Dri-devel] Mach64 dma fixes

2002-05-25 Thread Keith Packard


Around 18 o'clock on May 25, Linus Torvalds wrote:

 You can often make things go faster by simplifying and streamlining the
 code rather than trying to be clever and having a big footprint. Ask Keith
 Packard about the X frame buffer code and this very issue some day.

The frame buffer code has very different tradeoffs; all of the memory 
references are across the PCI/AGP bus, so even issues like instruction 
caches are pretty much lost in the noise.  That means you can completely 
ignore instruction count issues when estimating algorithm performance and 
look only at bus cycles.  

The result is similar; code gets rolled up into the smallest space, not 
entirely for efficiency but rather to make it easier to understand and
count memory cycles.  Of course, it's also nice to avoid trashing the
i-cache so that when the frame buffer access is done there isn't a huge
penalty in getting back to the rest of the X server.

Reading data from the frame buffer takes nearly forever -- uncached PCI/AGP
reads are completely synchronous. The frame buffer code stands on its head
to avoid that, even at the cost of some significant code expansion in
places.  For example, when filling rectangles, the edges are often not
aligned on 32-bit boundaries.  It's much more efficient to do a sequence of
byte/short writes than the read-mask-write cycle that the older frame
buffer code used.  Writes are a bit better, but the lame Intel CPUs can't 
saturate an AGP bus in write combining mode -- that mode doesn't go through
the regular cache logic and instead uses a separate buffer which isn't 
deep enough to cover the bus latency.  Hence the performance difference 
between DMA and PIO for simple 2D graphics operations.

The code also takes advantage of dynamic branch prediction; tests which 
resolve the same direction each pass through a loop are left inside the 
loop instead of duplicating the code to avoid the test; there isn't a
pipeline branch penalty while running through the loop because the 
predictor will guess right every time.
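That pattern can be sketched as follows (illustrative only, not the actual fb code; `fill_span` is a made-up name). The format test is loop-invariant, so the predictor resolves it correctly on every pass after the first, and one loop body serves multiple formats:

```c
#include <stdint.h>

/* One loop handles two pixel formats.  The bytes_per_pixel test
 * resolves the same way on every iteration, so dynamic branch
 * prediction makes it nearly free -- no need to duplicate the loop
 * per format. */
static void fill_span(void *row, int width, int bytes_per_pixel, uint32_t v)
{
    int x;

    for (x = 0; x < width; x++) {
        if (bytes_per_pixel == 4)            /* predicted correctly each pass */
            ((uint32_t *)row)[x] = v;
        else
            ((uint8_t *)row)[x] = (uint8_t)v;
    }
}
```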

The result is code which handles all of the X data formats (1,4,8,16,24,32)
in about half the space the older code used to handle only a single format. 
The old code was optimized for 60ns CPUs with 300ns memory systems; new 
machines have much faster CPUs but only marginally faster memory.

Getting a chance to implement the same spec in two radically different 
performance environments has been a lot of fun.

Keith PackardXFree86 Core TeamHP Cambridge Research Lab






Re: [Dri-devel] Mach64 dma fixes

2002-05-24 Thread Frank C. Earl

On Thursday 23 May 2002 04:37 pm, Leif Delgass wrote:

 I've committed code to read BM_GUI_TABLE to reclaim processed buffers and
 disabled frame and buffer aging with the pattern registers.  I've disabled
 saving/restoring the pattern registers in the DDX and moved the wait for
 idle to the XAA Sync.  This fixes the slowdown on mouse moves.  I also
 fixed a bug in getting the ring head.  One bug that remains is that when
 starting tuxracer or quake (and possibly other apps) from a fresh restart
 of X, there is a problem where old bits of the back- or frame-buffer show
 through.  With tuxracer (windowed), if I move the window, the problem goes
 away.  It seems that some initial state is not being set correctly or
 clears aren't working.  If I run glxgears (or even tunnel, which uses
 textures) first, after starting X and before starting another app, the
 problem isn't there.  If someone has a cvs build or binary from before
 Monday the 20th but after Saturday the 18th, could you test to see if this
 happens? I'm not sure if this is new behavior or not.

 I tried removing the flush on swaps in my tree and things seem to still
 work fine (the bug mentioned above is still there, however).  We may need
 to think of an alternate way to do frame aging and throttling, without
 using a pattern register.

I've been pondering the code you've done (not the latest committed, but what 
was described to me a couple of weeks back...): how do you account for 
securing the BM_GUI_TABLE check and the pattern register aging, in light of 
the engine being able to write to almost all registers?  It occurred to me 
that there's a potential security risk (allowing malicious clients to 
possibly confuse/hang the engine) with the design described to me a little 
while ago.

-- 
Frank Earl




Re: [Dri-devel] Mach64 dma fixes

2002-05-24 Thread Leif Delgass

On Fri, 24 May 2002, Frank C. Earl wrote:

 On Thursday 23 May 2002 04:37 pm, Leif Delgass wrote:
 
  I've committed code to read BM_GUI_TABLE to reclaim processed buffers and
  disabled frame and buffer aging with the pattern registers.  I've disabled
  saving/restoring the pattern registers in the DDX and moved the wait for
  idle to the XAA Sync.  This fixes the slowdown on mouse moves.  I also
  fixed a bug in getting the ring head.  One bug that remains is that when
  starting tuxracer or quake (and possibly other apps) from a fresh restart
  of X, there is a problem where old bits of the back- or frame-buffer show
  through.  With tuxracer (windowed), if I move the window, the problem goes
  away.  It seems that some initial state is not being set correctly or
  clears aren't working.  If I run glxgears (or even tunnel, which uses
  textures) first, after starting X and before starting another app, the
  problem isn't there.  If someone has a cvs build or binary from before
  Monday the 20th but after Saturday the 18th, could you test to see if this
  happens? I'm not sure if this is new behavior or not.
 
  I tried removing the flush on swaps in my tree and things seem to still
  work fine (the bug mentioned above is still there, however).  We may need
  to think of an alternate way to do frame aging and throttling, without
  using a pattern register.
 
 I've been pondering the code you've done (not the latest committed, but what 
 was described to me a couple of weeks back...): how do you account for 
 securing the BM_GUI_TABLE check and the pattern register aging, in light of 
 the engine being able to write to almost all registers?  It occurred to me 
 that there's a potential security risk (allowing malicious clients to 
 possibly confuse/hang the engine) with the design described to me a little 
 while ago.

Well, I just went back and looked at José's test for writing
BM_GUI_TABLE_CMD from within a buffer and realized that it had a bug.  
The register addresses weren't converted to MM offsets.  So I fixed that
and ran the test.  With two descriptors, writing BM_GUI_TABLE_CMD does not
cause the second descriptor to be read from the new address, but
BM_GUI_TABLE reads back with the new address written in the first buffer
at the end of the test.  Then I tried setting up three descriptors, and lo
and behold, after processing the first two descriptors, the engine
switches to the new table address written in the first buffer!  I think
it's because of the pre-incrementing (prefetching?) of BM_GUI_TABLE that
there's a delay of one descriptor, but it IS possible to derail a bus
master in progress and set it processing from a different table in
mid-stream.  Plus, if the address is bogus or the table is malformed, 
this will cause an engine lockup and take out DMA until the machine is 
cold-restarted.  The code for the test I used is attached.

So it would appear that allowing clients to add register commands to a
buffer without verifying them is _not_ secure.  This is going to make
things harder, especially for vertex buffers.  This is going to require
copying buffers and adding commands or unmapping and verifying
client-submitted buffers in the DRM.  I'd like to continue on the path 
we're on until we can get DMA running smoothly and then we'll have to come 
back and fix this problem.

-- 
Leif Delgass 
http://www.retinalburn.net



static int mach64_bm_dma_test2( drm_device_t *dev )
{
	drm_mach64_private_t *dev_priv = dev->dev_private;
	dma_addr_t data_handle, data2_handle, table2_handle;
	void *cpu_addr_data, *cpu_addr_data2, *cpu_addr_table2;
	u32 data_addr, data2_addr, table2_addr;
	u32 *table, *data, *table2, *data2;
	u32 regs[3], expected[3];
	int i;

	DRM_DEBUG( "%s\n", __FUNCTION__ );

	table = (u32 *) dev_priv->cpu_addr_table;

	/* FIXME: get a dma buffer from the freelist here rather than using the pool */
	DRM_DEBUG( "Allocating data memory ...\n" );
	cpu_addr_data = pci_pool_alloc( dev_priv->pool, SLAB_ATOMIC, &data_handle );
	cpu_addr_data2 = pci_pool_alloc( dev_priv->pool, SLAB_ATOMIC, &data2_handle );
	cpu_addr_table2 = pci_pool_alloc( dev_priv->pool, SLAB_ATOMIC, &table2_handle );
	if (!cpu_addr_data || !data_handle || !cpu_addr_data2 || !data2_handle ||
	    !cpu_addr_table2 || !table2_handle) {
		DRM_INFO( "data-memory allocation failed!\n" );
		return -ENOMEM;
	} else {
		data = (u32 *) cpu_addr_data;
		data_addr = (u32) data_handle;
		data2 = (u32 *) cpu_addr_data2;
		data2_addr = (u32) data2_handle;
		table2 = (u32 *) cpu_addr_table2;
		table2_addr = (u32) table2_handle;
	}

	DRM_INFO( "data1: 0x%08x  data2: 0x%08x\n", data_addr, data2_addr );
	DRM_INFO( "table2: 0x%08x\n", table2_addr );

	MACH64_WRITE( 

Re: [Dri-devel] Mach64 DMA, blits, AGP textures

2002-05-20 Thread Leif Delgass

On Sat, 18 May 2002, Felix Kühling wrote:

 On Sat, 18 May 2002 11:30:28 -0400 (EDT)
 Leif Delgass [EMAIL PROTECTED] wrote:
 
  Did you have a 2D accelerated server running on another vt?  The DDX saves
  and restores its register state on mode switches, so it could be a problem
  with the FIFO depth or pattern registers being changed.  Try testing
  without another X server running if you haven't already.  Also, does
  anything show up in the system log?
 
 I did have a 2D accelerated X-server running. But I started the DRI
 server from a text console and didn't switch between the servers during
 the tests, so it shouldn't matter. As to the syslog, my kern.log file is
 actually the syslog. My distro configures syslog so that it sends all
 syslog messages from the kernel to kern.log. In the other syslog files
 there are no further messages.

Sorry, I was in a hurry before and didn't read your message very 
carefully.  I think I've fixed the segfault, can you do a cvs update in 
xc/programs/Xserver/hw/xfree86/drivers/ati and then do a 'make install' in 
that directory.  You should only need to rebuild the DDX. 

-- 
Leif Delgass 
http://www.retinalburn.net


___
Hundreds of nodes, one monster rendering program.
Now that's a super model! Visit http://clustering.foundries.sf.net/

___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel



Re: [Dri-devel] Mach64 DMA, blits, AGP textures

2002-05-18 Thread José Fonseca

On 2002.05.18 15:40 Felix Kühling wrote:
 On Sat, 18 May 2002 15:01:51 +0200
 Felix Kühling [EMAIL PROTECTED] wrote:
 
  For this test I compiled everything with gcc-2.95.4. I had a different
  problem after compiling with gcc-3.0. I have to try that again and
 check
  for compile errors. The problem was that the X server segfaulted on
  startup. I'll report more details later.
 
 Ok, I recompiled with gcc-3.0 again. There are no errors in world.log.
 The X-server segfaults on startup. Note that I had a working Xserver+DRI
 compiled with gcc-3.0 before Leif's last changes.
 
 These are the relevant parts of my logfiles:
 
 kern.log:
 [...]
 May 18 16:18:28 viking kernel: [drm] AGP 0.99 on VIA Apollo KT133 @
 0xd000 64MB
 May 18 16:18:28 viking kernel: [drm] Initialized mach64 1.0.0 20020417 on
 minor 0
 May 18 16:18:29 viking kernel: [drm] Setting FIFO size to 128 entries
 May 18 16:18:29 viking kernel: [drm] Creating pci pool
 May 18 16:18:29 viking kernel: [drm] Allocating descriptor table memory
 May 18 16:18:29 viking kernel: [drm] descriptor table: cpu addr:
 0xc0268000, bus addr: 0x00268000
 May 18 16:18:29 viking kernel: [drm] Starting DMA test...
 May 18 16:18:29 viking kernel: [drm] starting DMA transfer...
 May 18 16:18:29 viking kernel: [drm] waiting for idle...
 May 18 16:18:29 viking kernel: [drm] waiting for idle...done
 May 18 16:18:29 viking kernel: [drm] DMA test succeeded, using
 asynchronous DMA mode
 
 XFree86.1.log:
 [...]
 (==) ATI(0): Write-combining range (0xd400,0x80)
 (II) ATI(0): [drm] SAREA 2200+1212: 3412
 drmOpenDevice: minor is 0
 drmOpenDevice: node name is /dev/dri/card0
 drmOpenDevice: open result is -1, (No such device)
 drmOpenDevice: Open failed
 drmOpenDevice: minor is 0
 drmOpenDevice: node name is /dev/dri/card0
 drmOpenDevice: open result is -1, (No such device)
 drmOpenDevice: Open failed
 drmOpenDevice: minor is 0
 drmOpenDevice: node name is /dev/dri/card0
 drmOpenDevice: open result is 11, (OK)
 drmGetBusid returned ''
 (II) ATI(0): [drm] loaded kernel module for mach64 driver
 (II) ATI(0): [drm] created mach64 driver at busid PCI:1:0:0
 (II) ATI(0): [drm] added 8192 byte SAREA at 0xd08bf000
 (II) ATI(0): [drm] mapped SAREA 0xd08bf000 to 0x40015000
 (II) ATI(0): [drm] framebuffer handle = 0xd400
 (II) ATI(0): [drm] added 1 reserved context for kernel
 (II) ATI(0): [agp] Using AGP 1x Mode
 (II) ATI(0): [agp] Using 8 MB AGP aperture
 (II) ATI(0): [agp] Mode 0x1f000201 [AGP 0x1106/0x0305; Card
 0x1002/0x474d]
 (II) ATI(0): [agp] 8192 kB allocated with handle 0xd10c3000
 (II) ATI(0): [agp] Using 2 MB DMA buffer size
 (II) ATI(0): [agp] vertex buffers handle = 0xd000
 (II) ATI(0): [agp] Vertex buffers mapped at 0x40a38000
 (II) ATI(0): [agp] AGP texture region handle = 0xd020
 (II) ATI(0): [agp] AGP Texture region mapped at 0x40c38000
 (II) ATI(0): [drm] register handle = 0xd600
 (II) ATI(0): [dri] Visual configs initialized
 (II) ATI(0): [dri] Block 0 base at 0xd6000400
 (II) ATI(0): Memory manager initialized to (0,0) (640,1637)
 (II) ATI(0): Reserved back buffer from (0,480) to (640,960)
 (II) ATI(0): Reserved depth buffer from (0,960) to (640,1440)
 (II) ATI(0): Reserved 6144 kb for textures at offset 0x1ff900
 (II) ATI(0): Largest offscreen areas (with overlaps):
 (II) ATI(0):  640 x 6072 rectangle at 0,480
 (II) ATI(0):  512 x 6073 rectangle at 0,480
 (**) ATI(0): Option XaaNoScreenToScreenCopy
 (II) ATI(0): Using XFree86 Acceleration Architecture (XAA)
   Setting up tile and stipple cache:
   32 128x128 slots
   18 256x256 slots
   7 512x512 slots
 (==) ATI(0): Backing store disabled
 (==) ATI(0): Silken mouse enabled
 (**) Option dpms
 (**) ATI(0): DPMS enabled
 (II) ATI(0): X context handle = 0x0001
 (II) ATI(0): [drm] installed DRM signal handler
 (II) ATI(0): [DRI] installation complete
 (II) ATI(0): [drm] Added 128 16384 byte DMA buffers
 
 Fatal server error:
 Caught signal 11.  Server aborting
 
 I also got a debugger backtrace after the segfault:
 
 Program received signal SIGSEGV, Segmentation fault.
 0x086c0c3c in ?? ()
 #0  0x086c0c3c in ?? ()
 #1  0x086bef1b in ?? ()
 #2  0x080bfe18 in AddScreen (pfnInit=0x86be944, argc=5, argv=0xba04)
 at main.c:768
 #3  0x0806c425 in InitOutput (pScreenInfo=0x81cda00, argc=5,
 argv=0xba04)
 at xf86Init.c:819
 #4  0x080bf378 in main (argc=5, argv=0xba04, envp=0xba1c) at
 main.c:380
 
 I know, this backtrace is not very helpful. Is there a way to get the ??
 resolved?
 
 Regards,
  Felix

Nice report ;-)

Try with xfree-gdb (http://www.dawa.demon.co.uk/xfree-gdb/) to see if you
have better luck.

José Fonseca



Re: [Dri-devel] Mach64 DMA, blits, AGP textures

2002-05-18 Thread Felix Kühling

On Sat, 18 May 2002 11:30:28 -0400 (EDT)
Leif Delgass [EMAIL PROTECTED] wrote:

 Did you have a 2D accelerated server running on another vt?  The DDX saves
 and restores its register state on mode switches, so it could be a problem
 with the FIFO depth or pattern registers being changed.  Try testing
 without another X server running if you haven't already.  Also, does
 anything show up in the system log?

I did have a 2D accelerated X-server running. But I started the DRI server from a text 
console and didn't switch between the servers during the tests, so it shouldn't 
matter. As to the syslog, my kern.log file is actually the syslog. My distro 
configures syslog so that it sends all syslog messages from the kernel to kern.log. In 
the other syslog files there are no further messages.

Regards,
Felix

   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!




Re: [Dri-devel] Mach64 DMA, blits, AGP textures

2002-05-18 Thread Felix Kühling

On Sat, 18 May 2002 15:56:00 +0100
José Fonseca [EMAIL PROTECTED] wrote:

 Nice report ;-)

Thanks :)

 Try with xfree-gdb (http://www.dawa.demon.co.uk/xfree-gdb/) to see if you
 have better luck.

Yep, that gave better results. Since I have only one computer here and
the display turns black I had to do this with a gdb command script. This
is the script:

run :1 vt8 -xf86config XF86Config-mach64004
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt
continue
bt

Here is the log:

Program received signal SIGUSR1, User defined signal 1.
_loader_debug_state () at loader.c:1331
1331{
#0  _loader_debug_state () at loader.c:1331
#1  0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, 
ppLookup=0xb848) at loader.c:1036
#2  0x0809cd56 in LoaderOpen (
module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, 
cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, 
wasLoaded=0xb898) at loader.c:1183
#3  0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, 
subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, 
errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924
#4  0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790)
at xf86Init.c:1716
#5  0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04)
at xf86Init.c:358
#6  0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380
#7  0x4006e14f in __libc_start_main () from /lib/libc.so.6

Program received signal SIGUSR1, User defined signal 1.
_loader_debug_state () at loader.c:1331
1331{
#0  _loader_debug_state () at loader.c:1331
#1  0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, 
ppLookup=0xb848) at loader.c:1036
#2  0x0809cd56 in LoaderOpen (
module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, 
cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, 
wasLoaded=0xb898) at loader.c:1183
#3  0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, 
subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, 
errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924
#4  0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790)
at xf86Init.c:1716
#5  0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04)
at xf86Init.c:358
#6  0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380
#7  0x4006e14f in __libc_start_main () from /lib/libc.so.6

Program received signal SIGUSR1, User defined signal 1.
_loader_debug_state () at loader.c:1331
1331{
#0  _loader_debug_state () at loader.c:1331
#1  0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, 
ppLookup=0xb848) at loader.c:1036
#2  0x0809cd56 in LoaderOpen (
module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, 
cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, 
wasLoaded=0xb898) at loader.c:1183
#3  0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, 
subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, 
errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924
#4  0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790)
at xf86Init.c:1716
#5  0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04)
at xf86Init.c:358
#6  0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380
#7  0x4006e14f in __libc_start_main () from /lib/libc.so.6

Program received signal SIGUSR1, User defined signal 1.
_loader_debug_state () at loader.c:1331
1331{
#0  _loader_debug_state () at loader.c:1331
#1  0x0809c71a in ARCHIVELoadModule (modrec=0x831cbd0, arfd=8, 
ppLookup=0xb848) at loader.c:1036
#2  0x0809cd56 in LoaderOpen (
module=0x833eb28 /usr/X11R6-mach64004/lib/modules/extensions/libxie.a, 
cname=0x8276b68 xie, handle=0, errmaj=0xb8d4, errmin=0xb8d8, 
wasLoaded=0xb898) at loader.c:1183
#3  0x0809e739 in LoadModule (module=0x8276b08 xie, path=0x0, 
subdirlist=0x0, patternlist=0x0, options=0x0, modreq=0x0, 
errmaj=0xb8d4, errmin=0xb8d8) at loadmod.c:924
#4  0x0806e630 in xf86LoadModules (list=0x820f760, optlist=0x820f790)
at xf86Init.c:1716
#5  0x0806c5f7 in InitOutput (pScreenInfo=0x81e09e0, argc=5, argv=0xba04)
at xf86Init.c:358
#6  0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at main.c:380
#7  0x4006e14f in __libc_start_main () from /lib/libc.so.6

Program received signal SIGUSR1, User defined signal 1.
_loader_debug_state () at loader.c:1331
1331{
#0  _loader_debug_state () at loader.c:1331
#1  0x0809c71a in ARCHIVELoadModule (modrec=0x85d7188, arfd=8, 
ppLookup=0xb848) at loader.c:1036
#2  0x0809cd56 in LoaderOpen (
module=0x862ed08 /usr/X11R6-mach64004/lib/modules/fonts/libspeedo.a, 
cname=0x8667760 speedo, handle=0, 

Re: [Dri-devel] Mach64 DMA, blits, AGP textures

2002-05-18 Thread Felix Kühling

On Sat, 18 May 2002 18:26:52 +0100
José Fonseca [EMAIL PROTECTED] wrote:

 I also have to start using another X server in a separate window because having 
 to log out every time I want to test is a PITA.

I'm not sure whether I get this correctly. Anyway, I have my 2D Xserver
running on vt7 and start the 3D Xserver from a text console on vt8.

 
  bt
  continue
  ...
 
  Here is the log:
  
  ...
  
  Program received signal SIGSEGV, Segmentation fault.
  0x082385e0 in DRILock (pScreen=0x0, flags=0) at dri.c:1759
  1759DRIScreenPrivPtr pDRIPriv = DRI_SCREEN_PRIV(pScreen);
  #0  0x082385e0 in DRILock (pScreen=0x0, flags=0) at dri.c:1759
 
 The problem is that pScreen is NULL here and DRILock is trying to 
 dereference it.

This is the second sigsegv. I found it strange that the debugger could
continue after the first one. I assume that this one actually happens
while the first one is handled.
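For what it's worth, a guard of the kind this backtrace points at could be sketched as below (the types and the private-data access are illustrative stand-ins, not the real XFree86 declarations from dri.c):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the XFree86 types -- not the real headers. */
typedef struct { void *driPriv; } ScreenRec, *ScreenPtr;
typedef struct { int locked; } DRIScreenPrivRec, *DRIScreenPrivPtr;

/* Sketch: bail out early when called with a NULL screen, as happens when
 * DRILock is reached from the fatal-error teardown path in the backtrace. */
static int dri_lock_guarded(ScreenPtr pScreen)
{
    DRIScreenPrivPtr pDRIPriv;

    if (pScreen == NULL)
        return -1;              /* nothing to lock during teardown */
    pDRIPriv = (DRIScreenPrivPtr) pScreen->driPriv;
    if (pDRIPriv == NULL)
        return -1;
    pDRIPriv->locked = 1;
    return 0;
}
```

The real fix may well belong in the caller, but a NULL check of this sort would at least turn the nested segfault into a clean abort.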

 
  #1  0x086d9ffe in intE6_handler ()
  #2  0x086ff93d in VBEGetVBEpmi () at atipreinit.c:548
  #3  0x08706fa9 in fbBlt (srcLine=0x0, srcStride=0, srcX=0,
  dstLine=0x8706fa9,
  dstStride=-1073744732, dstX=0, width=0, height=141643160, alu=1,
  pm=141643240, bpp=137618992, reverse=0, upsidedown=142025672)
  at fbblt.c:295
  #4  0x080a8ca8 in xf86XVLeaveVT (index=0, flags=0) at xf86xv.c:1241
  #5  0x0806d5de in AbortDDX () at xf86Init.c:1135
  #6  0x080dbf20 in AbortServer () at utils.c:436
  #7  0x080dd62f in FatalError () at utils.c:1399
  #8  0x08080d0b in xf86SigHandler (signo=11) at xf86Events.c:1085

See ...

  #9  0x4007e6b8 in sigaction () from /lib/libc.so.6
  #10 0x086ea8d8 in intE6_handler ()
  #11 0x080c60f0 in AddScreen (pfnInit=0x86ea268 intE6_handler+79240,
  argc=5,
  argv=0xba04) at main.c:768
  #12 0x0806c383 in InitOutput (pScreenInfo=0x81e09e0, argc=5,
  argv=0xba04)
  at xf86Init.c:819
  #13 0x080c55e6 in main (argc=5, argv=0xba04, envp=0xba1c) at
  main.c:380
  #14 0x4006e14f in __libc_start_main () from /lib/libc.so.6
  
  Program terminated with signal SIGSEGV, Segmentation fault.
  The program no longer exists.
  
  I grepped for intE6_handler and found it in
  programs/Xserver/hw/xfree86/int10/xf86int10.c. I think, it's not mach64
  specific and it hasn't changed since January. So the actual problem must
  be somewhere else.
 
 Don't forget that the problem ocurred in DRILock and not intE6_handler.

The first sigsegv occurred in intE6_handler.

 First, let's try to eliminate the simplest options. I noticed on 
 the CVS update log that Leif changed quite a few places. You mentioned 
 in your first post that you'd recompiled everything. Did you also 
 re-install everything? I've quite often had problems because I 
 forgot to recompile/reinstall parts that had been changed.

I used make world to recompile everything and make install to install. I
also copied the kernel module. Did I forget anything?

Regards,
Felix

   __\|/_____ ___ ___
__Tschüß___\_6 6_/___/__ \___/__ \___/___\___You can do anything,___
_Felix___\Ä/\ \_\ \_\ \__U___just not everything
  [EMAIL PROTECTED]o__/   \___/   \___/at the same time!




Re: [Dri-devel] Mach64 DMA

2002-03-11 Thread Robert Lunnon

On Monday 11 March 2002 01:55, Frank C. Earl wrote:
 On Sunday 10 March 2002 11:44 am, José Fonseca wrote:
  I really don't know much about that, since it must have happened before I
  subscribed to this mailing list, but perhaps you'll want to take a look
  to the Utah-GLX and this list archives. You can get these archives in
  mbox format and also filtered with just the messages regarding mach64 at
  http://mefriss1.swan.ac.uk/~jfonseca/dri/mailing-lists/

 The problem was that the XAA driver for mach64 was setting the FIFO size up
 for some reason and it was leaving the chip in a state that wouldn't work
 for the DMA mode.  If we set the size back to the default setting before we
 do the first DMA pass, everybody's happy.  I suspect if we got with the
 developer of the XAA driver we can sell him on leaving that setting alone
 in the driver's setup.

 Sorry for being silent for so long gang.  Been, yet again, crushed under
 with lovely personal business.  I have started a new branch
 (mach64-0-0-3-dma-branch), and I'm actually putting the hacks I've been
 playing with into a unified DMA framework.  I should be putting the first
 updates to the branch in over the next couple of days.

 Of note, when I did find some spare time, I ran tests on what we needed to
 do to secure the chip's DMA path.  I found out some interesting facts.

 It will accept any values written to the registers.
 It will not act on any of those settings during the DMA pass unless they're
 a GUI specific operation when it's doing a command-type DMA.
 It will not act on many of the settings after a DMA pass is complete.
 It will not let you set up any sort of DMA pass during the operation.
 Junk commands, by themselves, do not seem to hose up the engine in
 operation. Mapping and unmapping a memory space is somewhat compute
 intensive.

Thanks Frank, this was just what I was after...




Re: [Dri-devel] Mach64 DMA

2002-03-10 Thread José Fonseca

On 2002.03.10 11:30 Robert Lunnon wrote:
 A while back there was a problem with the Mach64 initialisation such that
 it
 locked up after executing dma, can someone point at what the resolution
 to
 that problem was and where things were patched so I can have a look at it
 ?
 
 Thanks
 
 Bob

I really don't know much about that, since it must have happened before I 
subscribed to this mailing list, but perhaps you'll want to take a look to 
the Utah-GLX and this list archives. You can get these archives in mbox 
format and also filtered with just the messages regarding mach64 at 
http://mefriss1.swan.ac.uk/~jfonseca/dri/mailing-lists/

I hope this helps.

Regards,

José Fonseca




Re: [Dri-devel] Mach64 DMA

2002-03-10 Thread Frank C. Earl

On Sunday 10 March 2002 11:44 am, José Fonseca wrote:

 I really don't know much about that, since it must have happened before I
 subscribed to this mailing list, but perhaps you'll want to take a look to
 the Utah-GLX and this list archives. You can get these archives in mbox
 format and also filtered with just the messages regarding mach64 at
 http://mefriss1.swan.ac.uk/~jfonseca/dri/mailing-lists/

The problem was that the XAA driver for mach64 was setting the FIFO size up 
for some reason and it was leaving the chip in a state that wouldn't work for 
the DMA mode.  If we set the size back to the default setting before we do 
the first DMA pass, everybody's happy.  I suspect if we got with the 
developer of the XAA driver we can sell him on leaving that setting alone in 
the driver's setup.

Sorry for being silent for so long gang.  Been, yet again, crushed under with 
lovely personal business.  I have started a new branch 
(mach64-0-0-3-dma-branch), and I'm actually putting the hacks I've been 
playing with into a unified DMA framework.  I should be putting the first 
updates to the branch in over the next couple of days.

Of note, when I did find some spare time, I ran tests on what we needed to do 
to secure the chip's DMA path.  I found out some interesting facts.

It will accept any values written to the registers.
It will not act on any of those settings during the DMA pass unless they're a 
GUI specific operation when it's doing a command-type DMA.
It will not act on many of the settings after a DMA pass is complete. 
It will not let you set up any sort of DMA pass during the operation.
Junk commands, by themselves, do not seem to hose up the engine in operation.
Mapping and unmapping a memory space is somewhat compute intensive.

-- 
Frank Earl




Re: [Dri-devel] Mach64 DMA

2002-03-10 Thread José Fonseca

On 2002.03.10 15:55 Frank C. Earl wrote:
 On Sunday 10 March 2002 11:44 am, José Fonseca wrote:
 
 ...
 
 Sorry for being silent for so long gang.  Been, yet again, crushed under
 with lovely personal business.  I have started a new branch
 (mach64-0-0-3-dma-branch), and I'm actually putting the hacks I've been
 playing with into a unified DMA framework.  I should be putting the first
 updates to the branch in over the next couple of days.
 

I look forward to checking it out.

 Of note, when I did find some spare time, I ran tests on what we needed
 to do
 to secure the chip's DMA path.  I found out some interesting facts.
 
 It will accept any values written to the registers.
 It will not act on any of those settings during the DMA pass unless
 they're a
 GUI specific operation when it's doing a command-type DMA.
 It will not act on many of the settings after a DMA pass is complete.
 It will not let you set up any sort of DMA pass during the operation.
 Junk commands, by themselves, do not seem to hose up the engine in
 operation.

I didn't fully understand the implications of the above, but shouldn't the 
direct access to the chip registers still be denied to clients?

 Mapping and unmapping a memory space is somewhat compute intensive.

This one has to be compared to the time that takes to copy a buffer, 
unless there is a way to do it in a secure manner without copying or 
unmapping.

 
 --
 Frank Earl
 

José Fonseca




Re: [Dri-devel] Mach64 DMA

2002-03-10 Thread Frank C. Earl

On Sunday 10 March 2002 04:36 pm, José Fonseca wrote:

 I didn't fully understand the implications of the above, but shouldn't the
 direct access to the chip registers still be denied to clients?

Depends.  

Looking at the gamma source code (I could be wrong, mind...) it appears that 
the DRM is taking in direct commands from userspace in DMA buffers and 
sending them on to the chip as time goes along.  

If you can make it so that it doesn't lock up the card or the machine and 
doesn't compromise system security in any way (i.e. issuing a DMA pass from a 
part of memory to the framebuffer and then from the framebuffer to another 
part of memory so as to clobber things or to pilfer info from other areas), 
it's pretty safe to do direct commands.  From observations, it appears that 
you can't get the engine to do anything except GUI operations during a 
properly set up GUI-mastering pass.  

The only risks that I can see right at the moment with sending direct 
commands over indirect commands is one of not resetting certain select 
registers to what they were before the pass and one of the engine not 
handling certain GUI operations well.  The first is easily taken care of by 
having the driver have a 4k block submitted in the descriptor chain as the 
last entry that updates those registers accordingly- the list of commands 
should only need to be built once and reused often since these registers 
won't be changed by the DRM engine, Mesa driver, or the XAA driver after the 
XAA driver does its setup for the chip operation.  The second case is a 
tough one, and one that copying/mapping won't protect you from- you have to 
process commands to prevent them from occurring (compute intensive, and there 
might be other cases, each time you'd have to come up with yet another 
workaround) or find something to detect a hang on the engine and reset it 
proper.  I seriously doubt that we'd encounter one of those, but we might all 
the same.  I've still got one or two more tests to run (I've yet to 
deliberately hang the engine, detect the same, and then do a reset- but then 
I've yet to be able to hang the engine with it _properly_ set up...) but most 
of the innards for copying commands or whatever would be largely the same 
(some of the interfaces might change, but that's less of an issue than the 
heart of the DMA engine itself which is the same no matter what...) so I'm 
going to get _SOMETHING_ in place to see what our performance actually would 
be with some DMA operation going on.
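The "restore block as the last descriptor entry" idea sketched above might look something like this (the descriptor layout and all names here are assumptions for illustration, not the actual mach64 DRM structures):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical GUI-master descriptor entry: a bus address, a byte count,
 * and an end-of-list flag. */
struct desc_entry {
    uint32_t bus_addr;
    uint32_t byte_count;
    int      eol;
};

#define MAX_DESC 8
#define BUF_SIZE 4096   /* fixed-size demo buffers */

/* Chain the client buffers, then terminate with a driver-owned "restore"
 * block that rewrites the registers a pass may have left modified.  The
 * restore block is built once and appended to every chain unchanged. */
static int build_chain(struct desc_entry *chain,
                       const uint32_t *client_addrs, int nbufs,
                       uint32_t restore_addr, uint32_t restore_len)
{
    int i;

    if (nbufs + 1 > MAX_DESC)
        return -1;
    for (i = 0; i < nbufs; i++) {
        chain[i].bus_addr   = client_addrs[i];
        chain[i].byte_count = BUF_SIZE;
        chain[i].eol        = 0;
    }
    chain[nbufs].bus_addr   = restore_addr;   /* always last */
    chain[nbufs].byte_count = restore_len;
    chain[nbufs].eol        = 1;
    return nbufs + 1;
}
```

Since the restore entry is driver-owned and always terminal, clients can't skip it, which is what makes the per-pass register cleanup cheap.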

  Mapping and unmapping a memory space is somewhat compute intensive.

 This one has to be compared to the time that takes to copy a buffer,
 unless there is a way to do it in a secure manner without copying or
 unmapping.

If you don't have issues with sending commands directly, you don't need to 
copy or map/unmap.  You don't need special clear commands or swap commands, 
you only need to issue DMAable buffers of commands to the DRM engine for 
eventual submission to the chip.  Right now, I'm not 100% sure that the 
mach64 is one of those sorts of chips, but it's shaping up to be a good 
prospect for that.


-- 
Frank Earl

___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel



Re: [Dri-devel] Mach64 DMA Was: Pseudo DMA?

2002-02-10 Thread José Fonseca

On 2002.02.10 09:31 Gareth Hughes wrote:
 ...
 
 These chips can read
 and write arbitrary locations in system memory.  For all chips that
 have this feature, the only safe way to program them is from within

Which of the chips currently supported by DRI is most similar in this [DMA
programming] sense to mach64 and could be looked at as a reference
implementation?

 a DRM kernel module.  Only clients that have been authenticated via
 the usual (X auth) means are able to talk to such modules.  There is
 simply no other way to do it.  You can trust the X server and the
 kernel module.  You CANNOT trust anything else -- a client-side 3D
 driver, something masquerading as one, whatever...
 
 There is a reason why all the DRI drivers for commodity cards are
 designed like this.  It's a pain, but that's the price you pay for
 a secure system.
 
 -- Gareth
 

Regards,

Jose Fonseca


_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com


___
Dri-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/dri-devel



Re: [Dri-devel] Mach64 DMA

2002-01-13 Thread Frank C. Earl

On Tuesday 08 January 2002 04:12 pm, Leif Delgass and Manuel Teira wrote:

 Happy New Year!

Hopefully for all, it will be a better one than the last...

  Well, after the holidays, I would like to recover the development in the
  mach64 branch. I started today to investigate the DMA stuff because I
  think that perhaps Frank is very busy and he has no time to do this work.
  The DMA problem was discovered approx. Oct. 21st and we have no news about
  any progress in DMA. I'm sure that Frank would do it better than me,
  but I can try.

I've had starts and stops.  However, I am still working on things and have 
been with what time I've actually had that I could think straight on and I'm 
pretty close to having something- sorry about the delays on my part, I KNOW 
you're all chomping at the bit, I am wanting this to happen as much as you 
all are.

 It sounded like Frank had written some code already (he mentioned being
 halfway done in early December).  Frank, is your work in a state that
 could be commited to the branch so others could help finish it?  If so,
 this might be a good place to start a new branch tag since we currently
 have a working driver.  Before long we'll also need to merge in changes
 from the trunk, since 4.2.0 seems close to release.

I'm about ready to actually make that branch- I nearly have the code in place 
(You'd not believe the fun stuff that has conspired against me to finish the 
code...  Job hunting, "honey-do" projects, my father being ill- I couldn't 
get focused long enough to sit down and plug the code in...  But enough of 
that, it's in the past.) and I'm planning on getting it completed sometime 
this upcoming week and start verifying that I've not broken anything.  I'll 
go ahead and make a branch at that point in time.

  I've been looking at the r128 freelist implementation, so I've derived
  that the register called R128_LAST_DISPATCH_REG (actually
  R128_GUI_SCRATCH_REG1) is used to store the age of the last discarded
  buffer. So, the
  r128_freelist_get is able to wait for a discarded buffer if there's no
  free buffer available.
 
  Could this be made in the mach64 driver, say with MACH64_SCRATCH_REG1 ?
  In my register reference it says that these registers can be for
  exchanging information between the BIOS and card drivers, so, is sane to
  use them for this task?

 I'm not sure that that would be safe to use.  According to r128_reg.h, the
 r128 has BIOS scratch registers and GUI scratch registers, where the
 mach64 has only the scratch registers used by the BIOS.  The mach64
 Programmer's Guide says that SCRATCH_REG1 is used to store the ROM
 segment location and installed mode information and should only be used
 by the BIOS or a driver to get the ROM segment location for calling ROM
 service routines.

Hm...  I've been wondering why they used a scratch register when the private 
area's available and could hold the data as well as anything else.  Anybody 
care to comment?  As it stands, I've got the info being placed in the private 
data area as a variable.

  I've also seen that there's no r128_freelist_put (it's present in mga
  driver, for example). Isn't it necessary?

Depends.  I'm not sure how I'm going to code things.  I've got to account for 
clients holding onto or discarding their buffers upon submission (As well as 
burning them off because they're shutting down) in the DRM and I'm working on 
this part (The actual DMA submission part's fairly easy, but the trappings 
within the DRM are a different story.) right now.  My thinking is that if 
it's a discard, we push it back into the freelist and view it as unavailable 
unless the age is past the timeout point.
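In outline, the discard-and-age policy described here might look like the following (all names are assumed; a real freelist_get would sleep or poll on the engine instead of returning NULL):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical freelist entry.  A discarded buffer carries the age (pass
 * number) at which it was submitted; it only becomes reusable once the
 * engine reports it has completed passes up to that age. */
struct dma_buf {
    uint32_t age;       /* pass number of the last submission */
    int      pending;   /* discarded, but possibly still read by the engine */
};

/* Stand-in for the "last completed pass" counter, which the driver might
 * keep in a scratch register or (as discussed above) in the private area. */
static uint32_t last_completed;

/* Return the first buffer whose pending submission has aged out, or NULL. */
static struct dma_buf *freelist_get(struct dma_buf *bufs, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (!bufs[i].pending || bufs[i].age <= last_completed) {
            bufs[i].pending = 0;
            return &bufs[i];
        }
    }
    return NULL;   /* caller would wait for the engine to advance */
}
```

The point of the age test is that "back on the freelist" and "safe to reuse" are different events: the first happens at discard, the second only once the engine has moved past the buffer's pass.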

  And, when is a buffer supposed to be discarded. Is this situation
  produced in the client side?

It appears to be a client side behavior.  The way that most of the cards seem 
to do their DMA is they queue up the buffer in question with an ioctl such as 
DRM_IOCTL_R128_VERTEX, using a struct not unlike this for parameters for the 
ioctl:

typedef struct drm_r128_vertex {
int prim;
int idx;/* Index of vertex buffer */
int count;  /* Number of vertices in buffer */
int discard;/* Client finished with buffer? */
} drm_r128_vertex_t;

idx is the index pulled from drm_buf_t.  I'll admit that prim's 
still a little foggy to me, but count is obvious as is discard.  Basically 
the client tells the DRM to put it back into the freelist because it's done 
with it.  I would think that there is a tradeoff for holding onto versus 
releasing buffers- holding onto them would be a speed boost, but at the 
expense of limiting how many clients had buffers.  128k's not a lot of buffer 
space- it doesn't allow for many vertices, etc., so I'd wonder what kind of 
benefit a client would derive from holding onto the buffer for things like 
vertices.  Textures might be an advantage, but again, you're stealing things 
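As a hedged illustration of the discard handshake, here is a toy model of what the DRM side might do with the struct above (the real ioctl queues a descriptor and ages the buffer rather than freeing it immediately):

```c
#include <assert.h>

/* The vertex-submit struct quoted above (r128's version). */
typedef struct drm_r128_vertex {
    int prim;
    int idx;        /* Index of vertex buffer */
    int count;      /* Number of vertices in buffer */
    int discard;    /* Client finished with buffer? */
} drm_r128_vertex_t;

/* Toy stand-in for the DRM side: "dispatch" the buffer and, when the client
 * set discard, mark its freelist slot available again. */
static int buf_free[4] = { 0, 0, 0, 0 };   /* 1 = slot back on the freelist */

static void fake_vertex_ioctl(const drm_r128_vertex_t *v)
{
    /* ... descriptor setup for v->idx / v->count would go here ... */
    if (v->discard)
        buf_free[v->idx] = 1;
}
```

So "discard" is purely a client decision communicated through the ioctl; the kernel never has to guess whether a client is done with a buffer.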

RE: [Dri-devel] Mach64 DMA

2002-01-13 Thread Gareth Hughes

Frank C. Earl wrote:
 
 While we're discussing things here, can anyone tell me why 
 things like the emit state code is in the DRM instead of in
 the Mesa drivers?  It looks like it could just as easily be
 in the Mesa driver at least in the case of the RagePRO code-
 is there a good reason why it's in the DRM?

Security.

-- Gareth




Re: [Dri-devel] Mach64 DMA

2002-01-13 Thread Frank C. Earl

On Sunday 13 January 2002 11:50 pm, Gareth Hughes wrote:

  While we're discussing things here, can anyone tell me why
  things like the emit state code is in the DRM instead of in
  the Mesa drivers?  It looks like it could just as easily be
  in the Mesa driver at least in the case of the RagePRO code-
  is there a good reason why it's in the DRM?

 Security.

Okay.  I can buy that as a reason.  Unfortunately, the reason why it's more 
secure doesn't appear to be well documented anywhere, and there are several 
cards that don't appear to have this as a feature of their DRM module.  Would 
you (or anyone else, for that matter) enlighten us as to why it's 
better/more secure?  (I'm coding it right now; it just strikes me as adding a 
bunch of extra kernel-space call overhead without much benefit.  I'd love to 
understand why I'm doing it this way while I'm coding it into the DRM...  :-)
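For what it's worth, the usual security argument goes like this: if the client emitted state itself, a hostile client could aim register writes anywhere on the chip (including the DMA engine's own setup registers, and through them arbitrary physical memory). Keeping the emit in the DRM means the kernel sees, and can reject, every write. A toy sketch of that kind of check — the register range here is invented purely for illustration, not taken from any real DRM:

```c
#include <stdbool.h>

/* Invented whitelist range of client-writable registers, for
 * illustration only; a real DRM would check against the chip's
 * actual safe register set. */
#define SAFE_REG_FIRST 0x0200u
#define SAFE_REG_LAST  0x02FFu

/* Kernel-side check: refuse any state write targeting a register
 * outside the whitelisted range (e.g. DMA/busmaster setup regs). */
static bool reg_write_allowed(unsigned int reg)
{
	return reg >= SAFE_REG_FIRST && reg <= SAFE_REG_LAST;
}
```

The cost Frank is asking about is exactly this: every state emit crosses into the kernel so that a check like the above can run on a path the client can't bypass.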

-- 
Frank Earl




Re: [Dri-devel] Mach64 DMA

2002-01-08 Thread Leif Delgass

On Mon, 7 Jan 2002, Manuel Teira wrote:

 Hello again. First of all, happy new year to everybody.

Happy New Year!
 
 Well, after the holidays, I'd like to resume development on the 
 mach64 branch. I started today to investigate the DMA stuff because I think 
 that perhaps Frank is very busy and has no time to do this work. The DMA 
 problem was discovered approx. Oct 21st and we've had no news about progress 
 on DMA since. I'm sure Frank would do it better than me, but I can try.

It sounded like Frank had written some code already (he mentioned being
halfway done in early December).  Frank, is your work in a state that
could be committed to the branch so others could help finish it?  If so, 
this might be a good place to start a new branch tag since we currently 
have a working driver.  Before long we'll also need to merge in changes 
from the trunk, since 4.2.0 seems close to release.

 And now, the questions:
 
 I've been looking at the r128 freelist implementation, so I've derived that 
 the register called R128_LAST_DISPATCH_REG (actually R128_GUI_SCRATCH_REG1) 
 is used to store the age of the last discarded buffer. So, the 
 r128_freelist_get is able to wait for a discarded buffer if there's no free 
 buffer available.
 
 Could this be done in the mach64 driver, say with MACH64_SCRATCH_REG1? In my 
 register reference it says that these registers can be used for exchanging 
 information between the BIOS and card drivers, so is it sane to use them for 
 this task?

I'm not sure that would be safe to use.  According to r128_reg.h, the
r128 has BIOS scratch registers and GUI scratch registers, where the
mach64 has only the scratch registers used by the BIOS.  The mach64
Programmer's Guide says that SCRATCH_REG1 is used to store the ROM
segment location and installed mode information and should only be used
by the BIOS or a driver to get the ROM segment location for calling ROM
service routines.
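
The aging scheme Manuel describes can be sketched in a few lines: each submitted buffer is stamped with an increasing age, the card writes the age of the last dispatched buffer into the scratch register, and freelist_get treats any buffer whose age is at or below that value as reusable. This is a simulation of the idea, not the r128 code — the scratch register is just a variable here and the names are mine:

```c
#define NBUF 4

static unsigned int buf_age[NBUF];	/* age stamped at submit time */
static unsigned int next_age = 1;
static unsigned int scratch_reg;	/* stands in for LAST_DISPATCH_REG */

/* Client submits buffer i: stamp it with a fresh age. */
static void submit(int i)
{
	buf_age[i] = next_age++;
}

/* Hardware side (simulated): dispatch has completed up to this age. */
static void hw_complete(unsigned int age)
{
	scratch_reg = age;
}

/* freelist_get: a buffer is free once its age has been dispatched. */
static int freelist_get(void)
{
	for (int i = 0; i < NBUF; i++)
		if (buf_age[i] <= scratch_reg)
			return i;
	return -1;	/* real code would wait on the scratch register */
}
```

The nice property is that the freelist needs no explicit "put" from the client for discarded buffers: the hardware advancing the age in the scratch register is what frees them, which may be why r128 has no r128_freelist_put.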

 I've also seen that there's no r128_freelist_put (it's present in the mga 
 driver, for example). Isn't it necessary? 
 
 And, when is a buffer supposed to be discarded. Is this situation produced in 
 the client side?
 
 
 Best regards.
 
 --
 Manuel Teira
 
 
 

-- 
Leif Delgass 
http://www.retinalburn.net









Re: [Dri-devel] mach64 dma question

2001-10-24 Thread Frank C. Earl

On Wednesday 24 October 2001 01:54 am, [EMAIL PROTECTED] wrote:


   a = readl(kms->reg_aperture + MACH64_BUS_CNTL);
   writel((a | (3 << 1)) & ~(1 << 6), kms->reg_aperture + MACH64_BUS_CNTL);
   
   same other code

 works fine. Now why would this be ?

This could be caused by the same thing that was giving us fits until 
recently.  The 4.x XFree86 driver does some things in its init function that 
override the chip's default settings (ostensibly to improve performance...) 
and cause, at the least, busmastering for GUI operations to not work at all.  
Do you have to set bits 1 and 2 every time you want a DMA pass, or just the 
first time you do it?
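
Concretely, treating BUS_CNTL as a plain integer, the read-modify-write quoted above sets bits 1 and 2 and clears bit 6. That reading of the (HTML-mangled) snippet is mine, and the bit meanings should be checked against the mach64 register reference before trusting it:

```c
/* Simulated BUS_CNTL read-modify-write: set bits 1 and 2, clear bit 6.
 * Bit positions follow my reading of the snippet above; verify against
 * the mach64 register reference before relying on them. */
static unsigned int bus_cntl_fixup(unsigned int a)
{
	return (a | (3u << 1)) & ~(1u << 6);
}
```

If the bits only need setting once, this belongs in driver init; needing it before every DMA pass would suggest something else is clearing them in between.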


-- 
Frank Earl
