On Thu, 2007-04-26 at 16:55 +1000, Dave Airlie wrote: > Hi, > > The patch is too big to fit on the list and I've no idea how we could > break it down further, it just happens to be a lot of new code.. > > http://people.freedesktop.org/~airlied/ttm/0001-drm-Implement-TTM-Memory-manager-core-functionality.txt > > The patch header and diffstat are below, > > This isn't for integration yet but we'd like an initial review by > anyone with the spare time and inclination, there is a lot stuff > relying on getting this code bet into shape and into the kernel but > any cleanups people can suggest now especially to the user interfaces > would be appreciated as once we set that stuff in stone it'll be a > pain to change... also it doesn't have any driver side code, this is > just the generic pieces. I'll post the intel 915 side code later but > there isn't that much to it.. > > It applies on top of my drm-2.6 git tree drm-mm branch.... > > ----------------------------------------------------------------------------------------------------- > > This patch brings in the TTM (Translation Table Maps) memory > management > system from Thomas Hellstrom at Tungsten Graphics. > > This patch only covers the core functionality and changes to the drm > core. > > The TTM memory manager enables dynamic mapping of memory objects in > and > out of the graphic card accessible memory (e.g. AGP), this implements > the AGP backend for TTM to be used by the i915 driver.
I've been slow responding, but we've been talking a lot on IRC and at Intel about the TTM interface recently, and trying to come up with a concensus between us as to what we'd like to see. 1) Multiplexed ioctls hurt The first issue I have with this version is the userland interface. You've got two ioctls for buffer management and once for fence management, yet these 3 ioctls are actually just attempting to be generic interfaces for around 25 actual functions you want to call (except for the unimplemented ones, drm_bo_fence and drm_bo_ref_fence). So there are quasi-generic arguments to these ioctls, where most of the members are ignored by any given function, but it's not obvious to the caller which ones. There's no comments or anything as to what the arguments to these functions are or what exactly they do. We've got 100 generic ioctl numbers allocated and unused still, so I don't think we should be shy about having separate ioctls for separate functions, if this is the interface we expect to use going forward. 2) Early microoptimizations There's also apparently an unused way to chain these function calls in a single ioctl call for the buffer object ioctl. This is one of a couple of microoptimizations at the expense of code clarity which have bothered me while reviewing the TTM code, when I'm guessing no testing was done to see if it was actually a bottleneck. 3) Fencing and flushing troubles I'm definitely concerned by the fencing interface. Right now, the driver is flushing caches and fencing every buffer with EXE and its driver-specific FLUSHED flag in dispatching command buffers. We almost surely don't want to be flushing for every batch buffer just in case someone wants to do CPU reads from something. However, with the current mechanism, if I fence my operation with just EXE and no flush, then somebody goes to map that buffer, they'll wait for the fence, but no flush will be emitted. The interface we've been imagining wouldn't have driver-specific fence flags, but instead be only a marker of when command execution has passed a certain point (which is what fencing is about, anyway). In validating buffers, you would pass whether they're for READ or WRITE as we currently do, and you'd put it on the unfenced read/write lists as appropriate. Add one buffer object function for emitting the flush, which would then determine whether the next fence-all-unfenced call would cover just the list of unfenced reads or the list of both unfenced reads and unfenced writes. Then, in mapping, check if it's on the unfenced-writes list and emit the flush and fence, and then wait for a fence on the buffer before continuing with the mapping. 4) Locking and deadlocking Right now, the main rendering path I'm seeing is something like this: lock_hardware() # I'm not really sure why this lock is here map(texture) write texture data unmap(texture) unlock_hardware() map(some_buffer) # GL lets you have buffers mapped for a long time write to some_buffer map(some_buffer_2) write to some_buffer_2 map(batchbuffer) # start some actual rendering map(vertices) write vertices write commands unmap(vertices) unmap(batchbuffer) unmap(some_buffer) # the GL client called this unmap(some_buffer_2) lock_hardware() validate(batchbuffer) validate(vertices) validate(texture) validate(some_buffer) validate(some_buffer_2) validate(backbuffer) map_while_validated(batchbuffer) write relocations in batchbuffer unmap(batchbuffer) emit(batchbuffer) fence_unfenced() unlock_hardware() There's also a fallback path: lock_hardware() map(backbuffer) map(frontbuffer) map(vertices) map(some_buffer) rendering with cpu unmaps(...) unlock_hardware() Note: map blocks on fencing, and if you don't pass the magic "while-validated" flag, then validate blocks on unmap. If we've got GL buffer objects shared between clients, we've got easy deadlocks available: client A map buffer 1 client B locks client B validates buffer 2 client A map buffer 2 (block) client B validates buffer 1 (block) I'm not sure to what extent GL allows buffer object sharing, but with EXT_tfp we're going to need to be dealing with this for the X Server versus GL interactions. 5) Fixing locking by not doing it So, it looks like the existing hardware lock isn't cutting it for avoiding deadlocks. Additionally, we'd like to get to a point where I can have some random app running things on my graphics card with my GUI server clueless that it's happening (more access control would be needed, but allowing lock pushdown is the main design issue here, I think). That pretty much means having a hardware lock that an app can hold for an arbitrary amount of time has to go away. Since we've got the lock_hardware() around any series of validates for rendering, we've been thinking about pushing that whole series of operations into one device-specific (probably?) ioctl: submit_buffers(buffer_list, relocation_list, bool full_state) There's a comment suggesting that this is a good idea on intel_batchbuffers already. This submit_buffers would do: choose offsets to validate into perform relocations validate the buffers with the given flags emit flush if indicated emit the fence For multi-context hardware, the kernel then gets to choose which rendering context this batchbuffer will be run on. If it can't give you a context with the same state, it returns EBUSY or whatever and you resubmit with full state upload and the flag saying so. Additionally, the kernel can then do reasonable simulation of multi-context hardware on hardware with context save/restore, which we haven't been doing before. Also, since the kernel has all the buffers needed for validate at once, it can be smarter about placing the allocations, as right now you could possibly get aperture fragmentation from early validates which prevent later buffer validations from succeeding. With this and removing what appears to be an unnecessary hardware lock from the texture write, we've got userland locking out of the hardware rendering path, we've got a nice system for multi-context hardware and the ability for the kernel to simulate multiple contexts if not, and we can remove most of the userland interface, getting us down to: bo_create bo_map bo_unmap bo_ref bo_unref submit_buffers fence_reference fence_unreference fence_wait That's a lot less than 25 entrypoints, and I like that. Plus, with this and the DRM modesetting work, adding access control and batchbuffer validation is "all" that's needed to get the kernel scheduling arbitrary programs doing random things on the card with your gui unaware of it. But this still leaves the map versus validate deadlocks. The closest we've come up with to a solution is: "Maps never block on maps, validate blocks on unmap, and map blocks on (write-flushed if necessary) fence completion. It is up to threads performing submit_buffers and maps on shared objects to arrange external synchronization to avoid deadlock." This has the advantage that those threads likely have some sort of protocol requirements for rendering consistency between those objects, and are going to be doing some sort of synchronization anyway. I'm just not sure yet on how feasible it is to bend GL into this model. -- Eric Anholt [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]
signature.asc
Description: This is a digitally signed message part