On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote:
> Jerome Glisse wrote:
> >On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
> >>>We had to do a similar thing in the Poulsbo driver and it turned
> >>>out that we could save a significant amount of CPU by using a
> >>>delayed workqueue, collecting objects and destroying them
> >>>periodically.
> >>Yes, indeed, we don't really care about a fence expiring unless we
> >>want to use that buffer or we are out of memory.
> >>Processing each fence, especially if there is an interrupt per
> >>fence, seems unnecessary, since you can just do that work in the
> >>cases mentioned above.
> >>
> >>So, here is a new proposal that should have the best features of
> >>all previously considered methods.
> >>Assume first that there is a single FIFO channel.
> >>
> >>We introduce a list of fence nodes, each containing:
> >>- A list of buffers NOT to be destroyed having the fence as their
> >>  sync object (or possibly one for each memory type)
> >>- A delayed-destroy list of buffers having the fence as their sync
> >>  object
> >>
> >>Fence nodes are added at the end upon emission.
> >>They are removed and freed when the buffer count drops to 0
> >>(meaning all buffers have been moved to later fences).
> >>Thus, the number of fence nodes does not grow without bounds, but
> >>is bounded by the number of buffers.
> >>
> >>The LRU lists are changed to only contain unfenced buffers, and
> >>buffers are initially put on them.
> >>When a buffer is fenced, it is removed from the LRU list or its
> >>current fence list, and added to the new fence (possibly destroying
> >>the old fence node in the process).
> >>The reference count of buffer objects is only increased when they
> >>are put in a delayed destruction list.
> >>
> >>Upon deletion, they are destroyed immediately if they are on the
> >>LRU list, and moved to the corresponding delayed-delete list if
> >>they are in a fence list.
> >>
> >>ttm_bo_delayed_delete is no longer called by any workqueue, but
> >>only when we run out of memory, before we try eviction.
> >>It processes the list until the first unsignalled fence, destroying
> >>all buffers it finds and freeing all the fence nodes.
> >>All the buffers in the alive lists are put in the LRU list, in the
> >>correct order.
> >>We may either keep an alive list per memory type, or sort the
> >>buffers by memory type (so they are put in the memory type LRU
> >>list) at this point.
> >>
> >>Thus, after ttm_bo_delayed_delete, there is the following scenario:
> >>1. Inactive buffers with no references are destroyed
> >>2. Inactive buffers with references are in the LRU list
> >>3. Fenced buffers with no references are in the delayed-destroy
> >>   list attached to their fence
> >>4. Fenced buffers with references are in the alive list attached
> >>   to their fence
> >>
> >>This should have about the same CPU and memory usage as the current
> >>code, but the following advantages:
> >>1. Delayed deletion does not run periodically, but exactly when
> >>   needed (at allocation time, before eviction)
> >>2. Delayed deletion always deletes all buffers that can be deleted,
> >>   since it processes them in fence order
> >>3. The LRU list at eviction time contains exactly all evictable
> >>   buffers
> >>4. Each buffer is always on an LRU _or_ on a delayed-destroy list,
> >>   so only one node is necessary
> >>5. Once buffer deletion determines a fence is signalled, it can
> >>   destroy all its to-be-destroyed buffers without any more checking
> >>6. Some fence logic (such as signalling of all fences on forced
> >>   channel closing) can be moved from drivers to TTM
> >>7. Some channel logic can be moved from drivers to TTM
> >>8. The code should be simpler and more elegant
> >>
> >>Note that using a buffer with the CPU doesn't change its position
> >>in the LRU list.
> >>This is good, because evicting a buffer used by the CPU is actually
> >>a good thing, since it will move to a region of memory slower for
> >>the GPU but possibly faster for the CPU.
> >>However, for buffers in system memory, if the LRU list is used for
> >>swapping, CPU access should move the buffer to the front of the
> >>list (using the LRU list for this is probably a bad idea though,
> >>since system memory swapping should be at page granularity).
> >>
> >>For multiple channels, things are slightly more complex.
> >>First, we need to build the fence data structure for each channel.
> >>Also, we need the fence/buffer nodes to be separate.
> >>Delayed destruction continues to work, but the reference count
> >>needs to be set to the number of channels fencing the buffer on
> >>destruction.
> >>Putting buffers in the LRU list can be done by doing an n-way merge
> >>between the channel fence lists, assigning each fence a global
> >>sequence number, and using that to merge and put back buffers in
> >>the LRU.
> >>The n-way merge can be done with a small heap on the stack on
> >>current hardware, where channels are limited.
> >>Furthermore, we need to keep a reference count, so that buffers are
> >>put in the LRU (or destroyed) only after they are off all the
> >>channels.
> >>
> >>The LRU list semantics are relaxed. If a buffer both has its fence
> >>emitted before another buffer's and signalled before it, then it
> >>will be considered "less recently" used, and the opposite happens
> >>if both are after. Otherwise, the order is undetermined.
> >>
> >>This should still guarantee good eviction behavior, and allows the
> >>LRU list to contain only evictable buffers without requiring a
> >>dynamic priority queue.
> >>Otherwise, a red-black tree (or an indirect heap with back
> >>references) indexed by the global fence sequence number can
> >>guarantee strict emission-time LRU semantics. This seems
> >>unnecessary though.
> >>
> >>Thoughts? Opinions?
> >>Any possible improvement?
> >>
> >
> >I love the first solution, but the solution for the multi-FIFO side
> >looks like overkill. So I am wondering if we aren't tackling the
> >issue from the wrong side. Below is a slightly different approach
> >to the issue.
> >
> >The idea is that memory management is about having a bo in place
> >for the GPU. So instead of looking at the problem from the bo side,
> >look at it from the GPU address space side.
> >
> >For simplicity here I am using only one space, VRAM (but GTT or
> >others can be dealt with the same way). So now we look at the
> >problem as a transaction: any time the driver wants to do something,
> >it asks TTM to create a new transaction. Each transaction is
> >associated with a fence and a list of buffers with their possible
> >placements (I will explain how it can fit multi-FIFO GPUs well too,
> >and how it also allows easy support for GPUs with a virtual address
> >space, which is something slowly becoming a reality I think).
> >
> >When TTM receives a transaction, it takes a transaction lock and
> >gives a transaction number to this new transaction. Then it goes
> >over the BO list and tries to come up with the best placement for
> >each of them.
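
For concreteness, here is a rough sketch, in kernel-style C, of the
per-fence bookkeeping Luca describes in the quoted proposal above. All
of the names (fence_node, my_buffer, fence_signalled, destroy_buffer,
...) are made up for illustration and are not existing TTM API, and
locking and the per-memory-type variants are left out.

#include <linux/list.h>
#include <linux/slab.h>

struct my_fence;                        /* driver fence object, hypothetical */

struct fence_node {
        struct list_head ring;          /* on the channel's emission-ordered list */
        struct list_head alive;         /* referenced buffers fenced by this fence */
        struct list_head ddestroy;      /* unreferenced buffers, freed once it signals */
        unsigned int num_bufs;          /* alive + ddestroy entries; node freed at 0 */
        struct my_fence *fence;
};

struct my_buffer {
        struct list_head fence_link;    /* on an LRU list or on one fence node list */
        struct fence_node *fence_node;  /* NULL while unfenced (i.e. on the LRU) */
        /* ... rest of the buffer object ... */
};

/* hypothetical helpers; destroy_buffer unlinks fence_link and frees the bo */
bool fence_signalled(struct my_fence *fence);
void destroy_buffer(struct my_buffer *buf);

/* When a buffer gets fenced: leave the LRU or its old fence, join the new one. */
static void fence_buffer(struct my_buffer *buf, struct fence_node *node)
{
        struct fence_node *old = buf->fence_node;

        list_del(&buf->fence_link);             /* off the LRU or off old->alive */
        if (old && --old->num_bufs == 0) {      /* old fence tracks nothing anymore */
                list_del(&old->ring);
                kfree(old);
        }
        list_add_tail(&buf->fence_link, &node->alive);
        node->num_bufs++;
        buf->fence_node = node;
}

/*
 * Called only when we run out of memory, before trying eviction: walk the
 * fences in emission order, free what is already dead, and append surviving
 * buffers to the LRU so the most recently fenced end up most recently used.
 */
static void delayed_delete(struct list_head *fence_ring, struct list_head *lru)
{
        struct fence_node *node, *next;

        list_for_each_entry_safe(node, next, fence_ring, ring) {
                struct my_buffer *buf, *tmp;

                if (!fence_signalled(node->fence))
                        break;                  /* later fences are still busy */

                /* fence known signalled: destroy without any further checks */
                list_for_each_entry_safe(buf, tmp, &node->ddestroy, fence_link)
                        destroy_buffer(buf);

                /* still-referenced buffers become evictable again, in order */
                list_for_each_entry(buf, &node->alive, fence_link)
                        buf->fence_node = NULL;
                list_splice_tail_init(&node->alive, lru);

                list_del(&node->ring);
                kfree(node);
        }
}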
> Jerome,
>
> This seems to have a fundamental flaw of not allowing simultaneous
> command submission.
>
> With the current code, a process blocked trying to validate its
> buffers will not block other processes from validating, and we
> should keep it that way.
>
> /Thomas
>
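
For reference, a rough sketch of what the transaction Jerome describes
might look like. Again the names (ttm_transaction, place_buffers,
emit_commands_and_fence, ...) are made up; the single transaction_lock
below is exactly the serialization point Thomas is worried about, and
whether command emission happens inside or outside that lock is left
open here.

#include <linux/list.h>
#include <linux/mutex.h>

struct my_buffer;                       /* TTM buffer object, hypothetical name */
struct my_fence;

struct ttm_transaction_buffer {
        struct list_head link;          /* entry on the transaction's BO list */
        struct my_buffer *bo;
        unsigned int allowed_placements;        /* e.g. a VRAM/GTT bitmask */
        unsigned int chosen_placement;          /* filled in by TTM */
};

struct ttm_transaction {
        unsigned long seq;              /* transaction number, taken under the lock */
        struct my_fence *fence;         /* fence emitted for this transaction */
        struct list_head buffers;       /* ttm_transaction_buffer entries */
};

static DEFINE_MUTEX(transaction_lock);
static unsigned long last_transaction_seq;

/* hypothetical hooks: choose/evict placements, then submit and fence */
int place_buffers(struct ttm_transaction *trans);
int emit_commands_and_fence(struct ttm_transaction *trans);

/*
 * The driver builds the BO list and hands the whole thing to TTM, which
 * numbers the transaction and places every buffer before the commands that
 * reference them are submitted. Whoever takes the lock first gets to evict.
 */
static int submit_transaction(struct ttm_transaction *trans)
{
        int ret;

        mutex_lock(&transaction_lock);  /* the serialization Thomas points at */
        trans->seq = ++last_transaction_seq;
        ret = place_buffers(trans);
        mutex_unlock(&transaction_lock);
        if (ret)
                return ret;
        return emit_commands_and_fence(trans);
}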
The idea is that whoever comes first is first to be served :) and it
can deal with multiple command submissions if the hw can deal with it
(the sync stuff I briefly describe in the multi-FIFO part). But yes,
if process A is first in the kernel and process B could have been
executed right away, A will win and have the right to evict others. I
don't think it will hurt perf; I think in most use cases we will have
enough memory, and good enough userspace drivers, that eviction from
VRAM should become rare enough not to be noticeable.

Cheers,
Jerome
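
As a closing aside on the multi-channel part of Luca's proposal quoted
above, the n-way merge that rebuilds the LRU could look roughly like
the sketch below. Names are again made up; for brevity it does a linear
scan per step instead of the small on-stack heap Luca suggests (which
is equivalent in spirit for the handful of channels current hardware
has), and it assumes each channel list has already been trimmed to its
signalled fences as in the single-channel pass.

#include <linux/list.h>
#include <linux/slab.h>

struct fence_node {                     /* as in the earlier sketch, plus a */
        struct list_head ring;          /* global sequence number */
        struct list_head alive;
        unsigned long global_seq;
};

/*
 * Merge the per-channel fence lists back into the LRU in global-seqno
 * order, so a buffer whose fence was both emitted and signalled earlier
 * counts as less recently used.
 */
static void rebuild_lru(struct list_head **chan_fences, int nchan,
                        struct list_head *lru)
{
        struct fence_node *pick, *head;
        int i;

        for (;;) {
                pick = NULL;
                /* take the channel whose oldest fence has the lowest global
                 * seqno; a small heap would replace this scan for larger n */
                for (i = 0; i < nchan; i++) {
                        if (list_empty(chan_fences[i]))
                                continue;
                        head = list_first_entry(chan_fences[i],
                                                struct fence_node, ring);
                        if (!pick || head->global_seq < pick->global_seq)
                                pick = head;
                }
                if (!pick)
                        break;

                /* its surviving buffers become the next LRU entries; the
                 * per-buffer channel refcount and the delayed-destroy list
                 * are handled as in the single-channel case (omitted) */
                list_splice_tail_init(&pick->alive, lru);
                list_del(&pick->ring);
                kfree(pick);
        }
}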