Jerome Glisse wrote: > On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote: > >> Jerome Glisse wrote: >> >>> On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote: >>> >>>>> We had to do a similar thing in the >>>>> Poulsbo driver and it turned out that we could save a significant amount >>>>> of >>>>> CPU by using a delayed workqueue, collecting objects and destroying them >>>>> periodically. >>>>> >>>> Yes, indeed, we don't really care about a fence expiring unless we >>>> want to use that buffer or we are out of memory. >>>> Processing each fence, especially if there is an interrupt per fence, >>>> seems unnecessary, since you can just do that work in the cases >>>> mentioned above. >>>> >>>> So, here is a new proposal that should have the best features of all >>>> previously considered methods. >>>> Assume first that there is a single FIFO channel. >>>> >>>> We introduce a list of fence nodes, each containing: >>>> - A list of buffers NOT to be destroyed having the fence as their sync >>>> object (or possibly one for each memory type) >>>> - A delayed-destroy list of buffers having the fence as their sync object >>>> >>>> Fence nodes are added at the end upon emission. >>>> They are removed and freed when the buffer count drops to 0 (meaning >>>> all buffers have been moved to later fences). >>>> Thus, the number of fence nodes does not grow without bounds, but is >>>> bounded by the number of buffers. >>>> . >>>> The LRU lists are changed to only contain unfenced buffers and buffers >>>> are initially put on it. >>>> When a buffer is fenced, it is removed from the LRU list or its >>>> current fence list, and added to the new fence (possibly destroying >>>> the old fence node in the process). >>>> The reference count of buffer objects is only increased when they are >>>> put in a delayed destruction list. >>>> >>>> Upon deletion, they are destroyed immediately if they are on the LRU >>>> list and moved to the corresponding delayed-delete list if they are in >>>> the fenced list. >>>> >>>> ttm_bo_delayed_delete is no longer called by any workqueue, but only >>>> on when we run out of memory, before we try eviction. >>>> It processes the list until the first unsignalled fence, destroying >>>> all buffers it finds and freeing all the fence nodes. >>>> All the buffers in the alive lists are put in the LRU list, in the >>>> correct order. >>>> We may either keep an alive list per memory type, or sort the buffers >>>> by memory type (so they are put in the memory type LRU list) at this >>>> point >>>> >>>> Thus, after ttm_bo_delayed_delete, there is the following scenario: >>>> 1. Inactive buffers with no references are destroyed >>>> 2. Inactive buffers with references are in the LRU list >>>> 3. Fenced buffers with no references are in the delayed-destroy list >>>> attached to their fence >>>> 4. Fenced buffers with references are in the alive list attached to their >>>> fence >>>> >>>> This should have about the same CPU and memory usage of the current >>>> code, but the following advantages: >>>> 1. Delayed deletion does not run periodically, but exactly when needed >>>> (at allocation time, before eviction) >>>> 2. Delayed deletion always deletes all buffers that can be deleted, >>>> since it processes them in fence order >>>> 3. The LRU list at eviction time contains exactly all evictable buffers >>>> 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so >>>> only one node is necessary >>>> 5. Once buffer deletion determines a fence is signalled, it can >>>> destroyed all its to-be-destroyed buffers without any more checking >>>> 6. Some fence logic (such as signalling of all fences on channel >>>> forced closing) can be moved from drivers to TTM >>>> 7. Some channel logic can be moved from drivers to TTM >>>> 8. The code should be simpler and more elegant >>>> >>>> Note that using a buffer with the CPU doesn't change its position in >>>> the LRU list. >>>> This is good, because evicting a buffer used by the CPU is actually a >>>> good thing, since it will move to a region of memory slower for the >>>> GPU but possibly faster for the CPU. >>>> However, for buffers in system memory, if the LRU list is used for >>>> swapping, CPU access should move the buffer to the front of list >>>> (using the LRU list for this is probably a bad idea though, since >>>> system memory swapping should be at page granularity). >>>> >>>> For multiple channels, things are slightly more complex. >>>> First, we need to built the fence data structure for each channel. >>>> Also, we need the fence/buffer nodes to be separate >>>> Delayed destruction continues to work, but the reference count needs >>>> to be set to the number of channels fencing the buffer on destruction. >>>> Putting buffers in the LRU list can be done by doing n-way merging >>>> between the channel fence lists, assigning each fence a global >>>> sequence number, and using that to merge and put back buffers in the >>>> LRU. >>>> n-way merging can be done with a small heap on the stack on current >>>> hardware where channels are limited. >>>> Furthermore, we need to keep a reference count, so that buffers are >>>> put in the LRU (or destroyed) only after they are off all the >>>> channels. >>>> >>>> The LRU list semantics are relaxed. If a buffer has both its fence >>>> emitted before another buffer, and also signaled before it, then it >>>> will be considered as "less recently" used, and the opposite thing >>>> happens if both are after. Otherwise, the order is undetermined. >>>> >>>> This should still guarantee good eviction behavior, and allows to have >>>> the LRU list only contain evictable buffers, while not requiring to >>>> use a dynamic priority queue. >>>> Otherwise, a red-black tree (or an indirect heap with back references) >>>> indexed by the global fence sequence number can guarantee strict >>>> emission-time LRU semantics. This seems unnecessary though. >>>> >>>> Thoughts? Opinions? >>>> Any possible improvement? >>>> >>>> >>> I love the first solution but solution to multi fifo side >>> looks overkilling. So i am wondering if we aren't tackling >>> the issue from the wrong side. Below is slightly different >>> approach to the issue. >>> >>> Idea is that memory management is about having bo in place >>> for the GPU. So instead of looking at the problem from the >>> bo side, look at it from the GPU address space side. >>> >>> For simplicity here i am using only one space VRAM (but GTT >>> or others can be deal the sameway). So now we split the >>> problem to look at it as a transaction any time the driver >>> want to do somethings it ask to ttm to create new transaction >>> each transaction is associated to a fence and list of buffer >>> with their possible placement (i will explain how it can >>> fit multi fifo gpu well too and how it also allow easy support >>> to GPU with virtual address space which is somethings slowly >>> becoming a reality i think). >>> >>> TTM receive a transaction it takes a transaction lock and give >>> transaction number to this new transaction. Then it goes over >>> the BO list and try to come up with the best placement for >>> each of them. >>> >> Jerome, This seems to have a fundamental flaw of not allowing >> simultaneous command submission. >> >> With the current code, A process blocked trying to validate its >> buffers will not block other processes from validating, and we >> should keep it that way. >> >> /Thomas >> >> > > Idea is who comes first is first to be serve :) and it can deal > with multiple command submission if the hw can deal with it > (sync stuff i am briefly describing in multi fifo part). But > yes if process A is first in the kernel and process B could > have been executed right away, A will win and have right to > evict others, i don't think it will hurt perf, i think most > of the usecase we will have enough memory and better userspace > driver that the eviction from vram should become rare enough > to be noticeable. > > Cheers, > Jerome > No, IMHO this is a big step back to "let's serialise everything with a single so we don't have to bother". Imagine a process blocking on VRAM space, and another process waiting that requires only AGP + pinned VRAM memory. That process shouldn't be blocked. A design rule of TTM is to *never* keep a global lock when waiting for the GPU, and I want to keep it that way.
/Thomas _______________________________________________ Nouveau mailing list Nouveau@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/nouveau