Attached is the initial rough-draft of the design document for the next generation memory manager. It is currently plain-text. When I polled people on #dri-devel the consensus was that plain-text would be the most useful format. I suspect at some point I may change to HTML or something else that can link within a document. In any case, I have left in the marks for Jed's (and Emacs') folding mode. My apologies to all who do not fold. :)
The document is not 100% complete. A few sections, such as the replacement policy, need more discussion before they can be completed. I have also included an "issues" section in the spirit of "issues" sections in OpenGL extension documents. I think the most significant issue is 3.13, but I don't think any of them are trivial. I fully expect this section to grow. :)
I would *really* like to discuss the document and anything else related in Monday's #dri-devel meeting. Hopefully people can make it & will have a chance to digest the document by then.
-*- mode: text; mode: fold -*- DRI Memory Manager texmem-0-0-2 Design Ian Romanick <[EMAIL PROTECTED]>
{{{ 1. Introduction

The current texture management system used by the open-source DRI drivers has reached its limits. The current implementation works, and it performs the functionality that it does implement well. However, there is a large class of functionality that the current system simply cannot easily support. Additionally, there are a number of cases where the current system is clearly suboptimal.

The current texture memory manager only supports "throw away" data. It is assumed that, with proper synchronization with the rendering hardware, any data that is in either AGP or on-card memory can be thrown away and re-loaded at any time. For texture data loaded from main memory, this assumption holds true. However, there are several classes of data that need to be locked for longer than the duration of a single rendering command.

Anytime a texture is a rendering target, be it as the destination of glCopyTexImage or as the rendering target of a pbuffer, that texture data cannot be thrown away until after it is copied to some sort of backing store. This restriction prevents the current memory manager from hardware accelerating glCopyTexImage. This restriction has also prevented implementation of pbuffers in open-source drivers and has forced binary-only drivers, such as ATI's FireGL drivers, to use a static, fixed-size memory area for pbuffers. Significant changes need to be made to the memory management system to support render-target buffers.

In addition to supporting render-to-texture, pbuffers, and glCopyTexImage, the DRI should be able to support dynamic allocation of back-buffers and depth-buffers. While this is a secondary consideration, this will allow different depth-buffer depths on the same display (i.e., one window can use a 16-bit depth-buffer while another uses a 24-bit depth-buffer) and full-screen anti-aliasing (FSAA).

As has recently been discussed on the dri-devel mailing list, texture uploads in the DRI are not fully pipelined [1]. For applications that need to use dynamic textures, such as streaming video, this presents a performance bottleneck. In order to support fully pipelined texture uploads, a mechanism is needed to determine when rendering using a particular texture buffer is complete. The current memory manager lacks such a mechanism.

Each texture image in the DRI can be replicated in memory in as many as three places. The texture can exist in on-card memory, in backing-store in main memory, and in storage in the application. As the texture requirements of games and other applications continue to increase, duplicating texture data becomes more and more painful. In non-extended OpenGL it should be possible to reduce the number of copies of a texture to no more than two. With the use of extensions such as APPLE_client_storage[2] or ARB_pixel_buffer_object[3], it should be possible to achieve single-copy textures.

The remainder of this document is divided into two major sections. The first section details the design of the new memory manager. This section is further divided into a section detailing the data structures used in the implementation and a section detailing the division of work between the kernel driver and the user-space driver. As this is intended to be a living document, the remaining section describes the open issues addressed during the design process. Issues are marked as either open or closed. Each issue will have some description of the problem and possible solutions. Each closed issue will have a description of the resolution.
It is expected that the resolution of closed issues will be incorporated into the other sections of the document.

}}}

{{{ 2. Design of the memory manager

{{{ 2.1 Data structures

Each of the main data structures used by the memory manager lives in a shared memory space and is accessed by all processes on the system. As such, these shared memory areas are protected by a mutex. In the initial implementation all memory areas will share the mutex currently used to serialize access to the existing SAREA. In the future additional mutexes may be created.

{{{ 2.1.1 Memory block

The regions of graphics memory (i.e., AGP aperture, on-card, etc.) are divided into a fixed number of logical blocks. For each region of graphics memory, there is an array of 'struct memory_block' in main memory that represents these logical blocks. The ith element in the array represents the ith logical block in the memory region.

The granularity of the blocks can vary depending on a number of factors, and may be driver dependent. This is similar to the way the current memory manager works. However, the current memory manager divides memory into 64KB blocks, with a maximum of 64 blocks. The current intention is to divide each region into much smaller blocks and allow up to 65536 blocks. For texture data, 64KB blocks are fine, but this is not fine enough granularity for vertex data.

The most important field in the memory_block structure is fence. Each driver maintains a 32-bit counter of rendering operations. This counter is maintained at whatever granularity the driver sees fit. This may be per rendering operation, per DMA buffer, or some other granularity. Each object (texture, vertex buffer, etc.) stored in a memory_block has an individual fence value. When the driver fence value is greater than the fence value for an object, all in-flight rendering operations that use that object have completed. The fence value stored in the memory_block is the maximum fence value over the set of objects in the memory_block.

Since individual objects can span multiple memory blocks, each memory_block tracks a sequence number and an ID. A group of blocks with the same ID number is considered to be a linear memory range ordered by sequence numbers. This information is used to allow the various parts of the memory manager to page blocks in and out and maintain proper ordering of data in memory.

There is also a set of flags associated with each block. If BLOCK_PIN is set for a block, then the block is pinned in memory and cannot be moved or removed regardless of the fence value. This would be used for back-buffers that are queued to be swapped to the front-buffer by an asynchronous buffer-swap call. However, it could be used in any case where the block depends on some activity other than rendering. For example, an implementation of ATI_map_object_buffer or ARB_vertex_buffer_object would need to use this bit while the buffer is mapped.

If all the objects in a block are backed in system memory, BLOCK_CLOBBER can be set. Rather than paging the block out when it needs to be reclaimed, the contents will simply be thrown away. This is similar to the way that the system memory manager handles read-only memory pages that are backed by a file (i.e., the TEXT section of an executing program).

On some architectures memory in the AGP aperture can be cached by the CPU. There may be some other cases where memory that is readable by the graphics chip (PCI-GART memory?) can be cached. In these cases BLOCK_CACHABLE is set. This can allow direct reading of the memory block in cases where a software fallback is performed by the GL. If the block is not cachable, it may be better to page the memory back to system memory or modify the AGP mapping for optimal performance.

    enum {
        BLOCK_PIN      = (1U << 0),  /** Block is pinned & cannot move. */
        BLOCK_CLOBBER  = (1U << 1),  /** Clobber block on reclaim. */
        BLOCK_CACHABLE = (1U << 2),  /** Is this memory area cachable by
                                      * the CPU? */
    };

    struct memory_block {
        u32 fence;     /** Fence age of the last-to-expire object in the
                        * block. */
        u16 status;    /** Status flag bits. */
        u16 id;        /** ID number of the block. */
        u16 sequence;  /** Sequence number of block in group of blocks. */
    };
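
The fence rules above can be stated directly in terms of 'struct memory_block'. The following is only an illustrative sketch, not part of the proposed interface: how the driver's 32-bit counter is read is driver dependent, the per-object fence bookkeeping is assumed, and wrap-around of the counter is ignored for clarity.

    /* Sketch only: illustrates the fence rules described above.
     * 'driver_fence' is the driver's 32-bit rendering counter; wrap-around
     * is ignored for clarity.
     */

    /* An object (texture, vertex buffer, etc.) is idle when the driver
     * counter has passed the object's own fence value.
     */
    static int object_is_idle( u32 driver_fence, u32 object_fence )
    {
        return driver_fence > object_fence;
    }

    /* When an object in a block is given a new fence, the block's fence
     * must remain the maximum fence over all objects in the block.
     */
    static void block_update_fence( struct memory_block * block, u32 object_fence )
    {
        if ( object_fence > block->fence )
            block->fence = object_fence;
    }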
}}}

{{{ 2.1.2 Available block ID list

Each sequence of memory blocks used by a process must have a system-unique block ID. Block IDs are in the range [0, 0xffff]. Available block IDs are stored in a bit vector. If a bit in the vector is cleared, that block ID is available. If a bit in the vector is set, that block ID is in use by some process.

To allocate an ID, a process must obtain a lock and search the table for a cleared bit. After finding a clear bit, the process sets the bit and releases the lock. This should not happen frequently and should be very fast. It should also be possible to implement the search using a cmpxchg instruction on CPUs that support it.

Either implementation suffers from having to perform a linear search each time an ID is to be allocated. One way to reduce this cost is to allocate multiple IDs at a time (perhaps all of the available IDs represented in a single byte or 32-bit word) and keep them cached when they are not in use. The driver would have to be careful not to keep too many unused IDs or the system might run out of available IDs.

The bit vector is stored in an 8KB shared memory segment. It is never accessed by the kernel. The bit vector was chosen because it represents the 65536 possible IDs in a very compact way.

}}}

{{{ 2.1.3 Modified block list

The list of modified blocks is tracked using a simple bit vector. When a process reclaims a block that is owned by a different process, the bit for that block ID is marked in the bit vector. When a process wants to use a particular object, it must check the bit in the vector for the memory_block associated with the object. If the bit is set, then some other process reclaimed a portion of the memory associated with that memory_block ID.

There is a subtle detail that needs to be kept clear here. The bit field tracks modified memory_blocks by block ID. Multiple memory_blocks can have the same block ID. If the bit for a block ID is set in the vector, that only means that at least one memory_block with that ID has been modified. Depending on where individual objects are located in the sequence of blocks and which blocks in the sequence were reclaimed, the reclaimed blocks may not need to be brought back in.

To determine which logical block was modified, the process must look at the memory_block structures for the object. It is up to the driver to maintain a mapping of which elements in the array of memory_block structures match a particular ID. If the block ID on one of the array elements has changed, then the block was modified. It is up to the driver to determine if the block needs to be paged back in or restored from some other backing store. In practice, storing the mapping and determining how to restore a block is handled in device-independent user-space code.
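
As an illustration of the check described above, the following sketch shows how a process might decide whether any of an object's blocks were reclaimed by another process. The 'indices' array stands in for the driver-maintained mapping from an object to elements of the memory_block array, and the 32-bit word layout of the bit vector is an assumption; both are used only for this example.

    /* Sketch only: decide whether an object's blocks were reclaimed by
     * another process and therefore need to be restored.
     */
    static int object_needs_restore( const u32 * modified_bits,
                                     const struct memory_block * blocks,
                                     u16 object_id,
                                     const u16 * indices, unsigned count )
    {
        unsigned i;

        /* Fast path: no block with this ID was touched by any process. */
        if ( !( modified_bits[ object_id / 32 ] & (1U << (object_id % 32)) ) )
            return 0;

        /* Slow path: at least one block with this ID was reclaimed.  Only
         * the array elements whose ID no longer matches were actually taken.
         */
        for ( i = 0 ; i < count ; i++ ) {
            if ( blocks[ indices[i] ].id != object_id )
                return 1;
        }

        return 0;
    }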
}}}

{{{ 2.1.4 Free Memory / Reclaim List

The method used to store the free-list can make or break the performance of any memory management algorithm. Much research has been done on paging strategies for virtual memory operating systems. The architecture of a processor's memory management unit is similar to what is done here, but different in a subtle way. A CPU's MMU can map out individual pages of memory, and we can map out individual blocks. However, the CPU's MMU can replace individual pages at any arbitrary physical address. We cannot. Blocks with the same ID must remain ordered by their sequence numbers. This allows us to take memory out like paged virtual memory, but we have to bring the memory back in like segmented virtual memory. Some tricks can be done with the AGP aperture to work around this, but those tricks do nothing for on-card memory. On-card memory is several orders of magnitude faster than AGP memory, so optimizing the use of on-card memory is a top priority.

The total lack of available papers on this subject and the need for performance have led to the design of several candidate data structures. The intention is to implement and profile each.

There are two things that need to be tracked. The first is the set of free blocks. The other is an ordered set of allocated blocks. The set of allocated blocks is ordered by reclaim priority. The allocated block with the highest reclaim priority is the block that will be reclaimed first when there are no free blocks available.

{{{ 2.1.4.1 Linked List
}}}

{{{ 2.1.4.2 Linked with Bit-Vector
}}}

}}}
}}}

{{{ 2.2 Division of labor

One of the design goals of the new memory manager, and the DRI in general, is to facilitate user upgrades and bi-directional compatibility (i.e., forward and backward compatibility). This design tenet has led to placing very little decision-making logic in the kernel. Basically, the desire is to put code in the kernel that can only go in the kernel. This trend continues with the new memory manager. The vast majority of the new functionality is implemented in the user-space portion of the driver.

{{{ 2.2.1 What happens in the kernel

There are two difficulties with adding new functionality to the kernel. Adding extra code to the kernel presents portability problems. It is much more difficult to port kernel code from Linux to BSD or Solaris or anything else than it is to port user-space code. Additionally, changing functionality in kernel code can present compatibility problems. These are both issues that can only present headaches to developers and users alike. As such, the functionality that is added to the kernel in the new memory manager is limited to glorified data movement.

{{{ 2.2.1.1 Page out

The primary API exposed to user-space from the kernel is for paging memory blocks in and out. When the user-space driver detects that a block of memory needs to be reclaimed, there are two types of blocks that can be reclaimed: either a throw-away block or a non throw-away block. If a throw-away block is selected, no kernel intervention is required. If a non throw-away block is selected, the user-space driver must ask the kernel to page the block out.

    int drmPageOutMemoryBlock( int fd, unsigned heap_id, unsigned index,
                               unsigned block_id, unsigned sequence );

The 'index' is the index of the memory block to page out in the array of memory blocks. If the fence for the specified memory has not completed, the call will wait for the fence to complete. When a block is paged out, the bit for its block ID in the modified-block bit vector is set. The 'block_id' and the 'sequence' are the block ID number and the sequence number for the data that will be placed in the block.
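
The following sketch shows how the user-space driver might use this entry point when reclaiming a single block. The 'is_throw_away' test and the choice of heap and index are placeholders for the (still open) replacement policy; only the kernel call itself is taken from the interface above.

    /* Sketch only: reclaim one block so it can hold new data. */
    static int reclaim_block( int fd, unsigned heap_id, unsigned index,
                              int is_throw_away,
                              unsigned new_block_id, unsigned new_sequence )
    {
        if ( is_throw_away ) {
            /* Throw-away data: no kernel intervention is required; the
             * block can simply be reused for the new data.
             */
            return 0;
        }

        /* Non throw-away data must be paged out first.  This may block
         * until the fence on the block has completed, and it marks the
         * previous owner's block ID in the modified-block bit vector.
         */
        return drmPageOutMemoryBlock( fd, heap_id, index,
                                      new_block_id, new_sequence );
    }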
In addition, the user-space driver can ask the kernel to page a memory block back in.

    int drmPageInMemoryBlock( int fd, unsigned heap_id, unsigned index,
                              unsigned block_id, unsigned sequence );

The 'index' is the index of the destination memory block, and 'block_id' and 'sequence' specify the memory block to page back in.

The third entry point is a short-cut that combines the first two. This allows a memory block to be paged out and another block paged back in at the same location.

    int drmPageOutPageInMemoryBlock( int fd, unsigned heap_id, unsigned index,
                                     unsigned block_id, unsigned sequence );

}}}

{{{ 2.2.1.2 AGP Re-mapping

The kernel is also capable of doing some AGP mapping tricks that allow the implementation of true single-copy textures. With the AGPGART 2.0 interface, which should be available in the 2.5.60 or 2.5.61 Linux kernel, the pages backing portions of the AGP aperture can be changed dynamically. Using this functionality, the DRM could use the copy of the texture in the driver as the backing for a portion of the AGP aperture. If APPLE_client_storage[2] is used, pages in the application could be used to back the AGP aperture. This functionality will likely not be available in the first release of the new memory manager.

}}}
}}}

{{{ 2.2.2 What happens in user-space

Just like in the current memory manager, the decision-making logic happens in the user-space portion of the driver. In user-space, the driver handles requests to allocate and free memory for textures and other data. The user-space portion also makes policy decisions about which data blocks are to be paged out of memory to make room for other blocks. The code that implements the user-space portion of the memory manager is device independent and lives in lib/GL/mesa/src/drv/common/.

{{{ 2.2.2.1 Internal driver interface

Since the memory manager is implemented in a device independent way, there is a single API presented by the memory manager to the driver. This interface allows the driver to allocate and free buffers and to control if and when the buffers can be reclaimed by other processes. The API intentionally does not expose any of the data structures from section 2.1 to the driver.

Each memory pool (i.e., on-card memory, AGP memory, etc.) has an associated set of attributes.

    enum driMemoryBufferFlags {
        BUFFLAG_COLOR_BUFFER   = (1U << 0),   /** Buffer can be used as a color
                                               * buffer destination for rendering. */
        BUFFLAG_DEPTH_BUFFER   = (1U << 1),   /** Buffer can be used as a depth
                                               * buffer destination for rendering. */
        BUFFLAG_STENCIL_BUFFER = (1U << 2),   /** Buffer can be used as a stencil
                                               * buffer destination for rendering. */
        BUFFLAG_TEXTURE        = (1U << 3),   /** Buffer can be used to source
                                               * texture data. */
        BUFFLAG_VERTEX         = (1U << 4),   /** Buffer can be used to source
                                               * vertex data. */
        BUFFLAG_COMMAND        = (1U << 5),   /** Buffer can be used to source
                                               * rendering commands. */
        BUFFLAG_CACHABLE       = (1U << 6),   /** Buffer is cachable by the CPU. */
        BUFFLAG_RESERVED0      = (1U << 7),   /** Reserved for future use. */
        BUFFLAG_DRIVER0        = (1U << 8),   /** Private flag usable by the
                                               * driver. */
        BUFFLAG_DRIVER1        = (1U << 9),   /** Private flag usable by the
                                               * driver. */
        BUFFLAG_DRIVER2        = (1U << 10),  /** Private flag usable by the
                                               * driver. */
        BUFFLAG_DRIVER3        = (1U << 11),  /** Private flag usable by the
                                               * driver. */
    };
The driBufferAllocate function is used to allocate a buffer in memory. When the buffer is no longer needed, driBufferRelease is used to free the memory. If a buffer is released that is pinned, it is unpinned. If a buffer is released that has a fence set that has not completed, the buffer will be queued for deletion. As soon as the fence is complete, the buffer will be released. The idea is that both of these functions are cheap to call. In most cases it will be better for a driver to throw away a buffer and allocate a new one than to wait for the fence on the existing buffer.

    int driBufferAllocate( struct driMemoryBuffer ** buf, unsigned size,
                           unsigned required_flags, unsigned optional_flags,
                           unsigned * actual_flags );
    int driBufferRelease( struct driMemoryBuffer * buf );

Simply allocating a buffer may or may not reserve space for the buffer. When driBufferGetCPUAddress or driBufferGetCardAddress is called, the buffer will be committed to memory and will be pinned. It is very important that the memory be pinned at this point. If it were not pinned, another process could reclaim the memory before the allocating process would have a chance to use the memory.

    int driBufferGetCPUAddress( struct driMemoryBuffer * buf, void ** addr );
    int driBufferGetCardAddress( struct driMemoryBuffer * buf, void ** addr );

Once the address of the buffer is fixed, the driver can use an appropriate mechanism to load data into the buffer or read data from the buffer.

Most buffers do not need to be saved when they are reclaimed by another process. Some buffers, such as textures that have been modified by glCopyTexImage, must be saved. The driBufferCanClobber function is used to set this state. By default all buffers are marked as clobberable. If a buffer cannot be clobbered, it should be marked as such by calling driBufferCanClobber with can_clobber set to GL_FALSE.

    int driBufferCanClobber( struct driMemoryBuffer * buf, GLboolean can_clobber );

When a buffer is submitted to the hardware for rendering, a fence needs to be set on it. By examining the fence on a buffer, the driver can know if the buffer is being used by the hardware. Along with driBufferSetFence, two functions exist to test a fence for completion or wait for a fence to complete.

    int driBufferSetFence( struct driMemoryBuffer * buf, unsigned fence );
    int driBufferWaitFence( struct driMemoryBuffer * buf );
    GLboolean driBufferTestFence( struct driMemoryBuffer * buf );

Properly setting fences is very important. Usage statistics for a buffer are updated each time a fence is set. These statistics may be used to determine which buffers are reclaimed when additional memory is needed. For example, if buffers are reclaimed on an LRU or MRU basis, setting a fence is the only way to update the buffer's usage timestamp. After a fence has been set for a buffer, it can typically be unpinned. The fence is generally enough of a synchronization method.

    int driBufferPin( struct driMemoryBuffer * buf );
    int driBufferUnpin( struct driMemoryBuffer * buf );

The function driBufferWasLost is used to determine if the data associated with a buffer has been lost. At any point in time a buffer can be in one of four states. It may never have been committed (i.e., neither driBufferGetCPUAddress nor driBufferGetCardAddress has been called), it may have been clobbered when another process reclaimed the memory, it may have been paged out (but be restorable by the kernel), or it may still be in memory (and unmodified). In the first two cases the driver needs to commit the buffer and load data, and driBufferWasLost will return GL_TRUE. In the other two cases the driver need only commit the buffer, and driBufferWasLost will return GL_FALSE.

    GLboolean driBufferWasLost( struct driMemoryBuffer * buf );
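
To tie these calls together, here is a minimal sketch of the expected life-cycle of a buffer holding texture data. It is illustrative only: the ordering of the driBufferWasLost check relative to committing the buffer, the memcpy-based upload, and the 'next_fence' and command-emission placeholders are assumptions, not part of the proposal.

    /* Sketch only.  Assumes <string.h> for memcpy and the declarations
     * above; error handling is reduced to returning the first failure.
     */
    static int use_texture_buffer( struct driMemoryBuffer * buf,
                                   const void * texels, unsigned size,
                                   unsigned next_fence )
    {
        void * addr;
        int err;

        /* Was the buffer never committed, or clobbered by another
         * process?  In either case the data must be reloaded.
         */
        const GLboolean lost = driBufferWasLost( buf );

        /* Committing the buffer pins it, so no other process can reclaim
         * it between now and the time the hardware uses it.  If the
         * buffer was merely paged out, this is all that is needed.
         */
        err = driBufferGetCardAddress( buf, & addr );
        if ( err ) return err;

        if ( lost ) {
            err = driBufferGetCPUAddress( buf, & addr );
            if ( err ) return err;
            memcpy( addr, texels, size );   /* reload the texture data */
        }

        /* ... emit the rendering commands that source this texture ... */

        /* Fence the buffer so the hardware's progress (and the buffer's
         * usage statistics) can be tracked, then drop the pin; the fence
         * is generally enough synchronization after this point.
         */
        driBufferSetFence( buf, next_fence );
        driBufferUnpin( buf );

        return 0;
    }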
Each driver will need to have a wrapper around _tnl_run_pipeline. At a minimum, the wrapper needs to call driBufferWasLost on each buffer that will be used. All buffers that will be used need to be pinned. After calling _tnl_run_pipeline, a fence needs to be set on each buffer, and buffers that do not need to remain pinned should be unpinned.

At the end of each frame (i.e., when glXSwapBuffers is called), driBufferPoolEndFrame should be called. Depending on the policies used to reclaim allocated memory, driBufferPoolEndFrame can help track per-frame buffer usage statistics.

    int driBufferPoolEndFrame( unsigned pool_id );

}}}

{{{ 2.2.2.2 Creating memory pools

The actual work to create the memory pools happens in the kernel. There is a certain amount of work done in user-space to determine what pools are available. Most AGP graphics cards will have three memory pools.

1. On-card memory. This pool can typically hold any type of buffer.

2. AGP aperture. This pool can usually only hold texture data, vertex data, and command buffers.

3. PCI DMA memory. This pool can typically only hold command buffers, but may be able to hold other types of buffers in emergency, low-memory situations.

[Most of the rest of this section depends on the resolution to issue #3.13.]

}}}

{{{ 2.2.2.3 Reclaiming used blocks

Further discussion is required before this section can be written.

}}}
}}}
}}}
}}}

{{{ 3. Open Issues

3.1 How do we determine which blocks to reclaim? (open)

3.2 Are the sizes of the status, id, and sequence fields for 'struct memory_block' adequate? (open)

3.3 How does a process map from a memory address to a block ID? (open)

3.4 How are block IDs reclaimed if a process crashes? (open)

3.5 What happens when all the block IDs are exhausted? (open)

3.6 Should the kernel calls to page blocks in / out work on a single block at a time, or should they take a list of blocks? If they take a list of blocks, should it be required to be a linear list (i.e., page out N blocks, starting with block M)? (open)

3.7 There seem to be race conditions in assigning fences, paging blocks in / out, and copying in data. (open)

The race I envision is that paging out blocks, copying in data, and assigning a fence number is a multi-step process. The first part of the process happens without any locks held. Moreover, the first step can block. What happens if another process becomes active between the return from drmPageOutMemoryBlock and the call to copy data into the memory?

3.8 What interfaces are needed to create memory heaps? (open)

3.9 What interfaces are needed to copy data around (i.e., from main memory to on-card, AGP to on-card, on-card to AGP, etc.)? (open)

3.10 Fencing (open)

There is quite a bit of discussion of fences to protect memory blocks. What additional support would be required for NV_fence or APPLE_fence? Is it worth the effort to design in that extra support now?

3.11 Where should the code live? (open)

If the memory manager is truly device independent, should it live somewhere other than in the lib/GL/mesa/ subtree? If the code might be used by other driver vendors (i.e., binary-only drivers based on DRI), should it live in lib/GL/dri/?

3.12 Can buffer attributes change? (open)
Should it be possible to change the attributes of a buffer? It might be useful to allocate a pbuffer with (BUFFLAG_COLOR_BUFFER | BUFFLAG_TEXTURE) initially, but change it to (BUFFLAG_TEXTURE) when it is bound to a texture. If this is allowed, what sort of API should be used to control it?

3.13 Maximum texture size (open)

The current memory manager calculates the maximum texture size (as returned from GL_MAX_TEXTURE_SIZE) based on the amount of available texture memory and the number of texture units. The size is the maximum size that could be bound to all texture units and still fit into available texture memory. With the new memory manager, this system will not work.

1. Non-texture data can be put in "texture" memory and some buffers can be pinned, making it impossible to calculate, in advance, the amount of available texture memory.

2. Calculating the maximum texture size in advance fails to consider the space savings from using compressed textures.

3. It is unlikely that cards that can use many textures per pass (the R200 can use 6) will have the maximum number of maximum-size textures bound at once. This can artificially limit the maximum size of textures that can be used on a texture unit. On a 64MB PCI R200, I believe the maximum-sized cube map, with full mipmaps, would be 256x256.

There are three possible solutions that I can see.

1. Continue using the existing technique for calculating the maximum texture size. There are still cases where we could fail to bind textures to all available units.

2. Advertise the maximum hardware supported limit, but fall back to software rendering when all textures (and vertex buffers) cannot fit in memory.

3. Advertise the maximum hardware supported limit, but silently reduce the LOD on larger textures that will not fit. This will only help with mipmapped textures. We could also fall back to software TCL to remove vertex data from memory.

}}}

{{{ 4. Revision History

0.1 (2003/02/06) Initial version for internal pre-review.

0.2 (2003/02/07) Added BUFFLAG_RESERVED0 to remove the gap between BUFFLAG_CACHABLE and BUFFLAG_DRIVER0. Fixed numerous simple errors. Expanded the introduction to '2.1.1 Memory block'. Expanded '2.1.2 Available block ID list'. Expanded '2.1.3 Modified block list'.

0.3 (2003/02/07) Updated the description of the return codes for driBufferWasLost. Initial posting to dri-devel.

}}}

{{{ 5. References

[1] "Streaming video through glTexSubImage2D,"
    http://www.mail-archive.com/dri-devel%40lists.sourceforge.net/msg08875.html

[2] "APPLE_client_storage,"
    http://oss.sgi.com/projects/ogl-sample/registry/APPLE/client_storage.txt

[3] "ARB_pixel_buffer_object," This specification is not yet available.

}}}