Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-25 Thread Jerome Glisse
On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote:
 Jerome Glisse wrote:
 On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
 We had to do a similar thing in the
 Poulsbo driver and it turned out that we could save a significant amount of
 CPU by using a delayed workqueue, collecting objects and destroying them
 periodically.
 Yes, indeed, we don't really care about a fence expiring unless we
 want to use that buffer or we are out of memory.
 Processing each fence, especially if there is an interrupt per fence,
 seems unnecessary, since you can just do that work in the cases
 mentioned above.
 
 So, here is a new proposal that should have the best features of all
 previously considered methods.
 Assume first that there is a single FIFO channel.
 
 We introduce a list of fence nodes, each containing:
 - A list of buffers NOT to be destroyed having the fence as their sync
 object (or possibly one for each memory type)
 - A delayed-destroy list of buffers having the fence as their sync object
 
 Fence nodes are added at the end upon emission.
 They are removed and freed when the buffer count drops to 0 (meaning
 all buffers have been moved to later fences).
 Thus, the number of fence nodes does not grow without bounds, but is
 bounded by the number of buffers.
 .
 The LRU lists are changed to only contain unfenced buffers and buffers
 are initially put on it.
 When a buffer is fenced, it is removed from the LRU list or its
 current fence list, and added to the new fence (possibly destroying
 the old fence node in the process).
 The reference count of buffer objects is only increased when they are
 put in a delayed destruction list.
 
 Upon deletion, they are destroyed immediately if they are on the LRU
 list and moved to the corresponding delayed-delete list if they are in
 the fenced list.
 
 ttm_bo_delayed_delete is no longer called by any workqueue, but only
 on when we run out of memory, before we try eviction.
 It processes the list until the first unsignalled fence, destroying
 all buffers it finds and freeing all the fence nodes.
 All the buffers in the alive lists are put in the LRU list, in the
 correct order.
 We may either keep an alive list per memory type, or sort the buffers
 by memory type (so they are put in the memory type LRU list) at this
 point
 
 Thus, after ttm_bo_delayed_delete, there is the following scenario:
 1. Inactive buffers with no references are destroyed
 2. Inactive buffers with references are in the LRU list
 3. Fenced buffers with no references are in the delayed-destroy list
 attached to their fence
 4. Fenced buffers with references are in the alive list attached to their 
 fence
 
 This should have about the same CPU and memory usage of the current
 code, but the following advantages:
 1. Delayed deletion does not run periodically, but exactly when needed
 (at allocation time, before eviction)
 2. Delayed deletion always deletes all buffers that can be deleted,
 since it processes them in fence order
 3. The LRU list at eviction time contains exactly all evictable buffers
 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
 only one node is necessary
 5. Once buffer deletion determines a fence is signalled, it can
 destroyed all its to-be-destroyed buffers without any more checking
 6. Some fence logic (such as signalling of all fences on channel
 forced closing) can be moved  from drivers to TTM
 7. Some channel logic can be moved from drivers to TTM
 8. The code should be simpler and more elegant
 
 Note that using a buffer with the CPU doesn't change its position in
 the LRU list.
 This is good, because evicting a buffer used by the CPU is actually a
 good thing, since it will move to a region of memory slower for the
 GPU but possibly faster for the CPU.
 However, for buffers in system memory, if the LRU list is used for
 swapping, CPU access should move the buffer to the front of list
 (using the LRU list for this is probably a bad idea though, since
 system memory swapping should be at page granularity).
 
 For multiple channels, things are slightly more complex.
 First, we need to built the fence data structure for each channel.
 Also, we need the fence/buffer nodes to be separate
 Delayed destruction continues to work, but the reference count needs
 to be set to the number of channels fencing the buffer on destruction.
 Putting buffers in the LRU list can be done by doing n-way merging
 between the channel fence lists, assigning each fence a global
 sequence number, and using that to merge and put back buffers in the
 LRU.
 n-way merging can be done with a small heap on the stack on current
 hardware where channels are limited.
 Furthermore, we need to keep a reference count, so that buffers are
 put in the LRU (or destroyed) only after they are off all the
 channels.
 
 The LRU list semantics are relaxed. If a buffer has both its fence
 emitted before another buffer, and also signaled before it, then it
 will be 

Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-25 Thread Jerome Glisse
On Thu, Jan 21, 2010 at 04:14:39PM +0100, Luca Barbieri wrote:
 I'm not sure I understand your proposal correctly.
 It seems your proposoal is similar to mine, replacing the term fence
 nodes with ttm transactions, but I'm not sure if I understand it
 correctly
 
 Here is some pseudocode for a improved, simplified version of my proposal.
 It is modified so that there are no longer distinct alive/destroy
 lists, but buffers are destroyed if their ref count is zero.
 
 list_head ram_lru_list; /* list of bos */
 list_head vram_unfenced_lru_list; /* list of bos */
 list_head gart_unfenced_lru_list; /* list of bos */
 
 atomic uint64_t next_seq_num;
 
 // if read_list and write_list are empty, the buffer is unfenced and
 MUST be in an unfenced lru list
 // otherwise, it is fenced and MUST be, if not zombie, on some
 read_list/write_list, or if zombie, on some unfenced_list
 struct ttm_buffer_object
 {
kref_t ref;
list_head read_list; /* list of bo_fence_nodes */
list_head write_list; /* list of bo_fence_nodes */
list_head unfenced_list; /* list entry in
 [ram|[gart|vram]_unfenced]_lru_list */
[...]
 };
 
 // TODO: we could embed just the first two members in the
 ttm_buffer_object, and steal the lowest bit on the fence pointer to
 signify that
 // this would optimize for the very common single-fence case
 struct ttm_bo_fence_node
 {
 list_head bo_list; /* list entry in bo.[read|write]_list */
 struct ttm_fence_node* fence;
 
 struct ttm_buffer_object* bo;
 list_head fence_list; /* list entry in fence.[vram|gart|destroy]_list */
 };
 
 struct ttm_fence
 {
 void* sync_object; /* may also turned into an object containing a
 ttm_fence at the start */
 uint64_t seq_num; /* unique value in order of kmalloc of this ttm_fence */
 list_head bo_list; /* list of bo_fence_nodes */
 };
 
 struct ttm_channel
 {
 list_head fence_list; /* list of ttm_fence_node */
 };
 
 ttm_flush_fences()
 {
 list_head vram_list[MAX_CHANNELS];
 list_head gart_list[MAX_CHANNELS];
 foreach channel
 {
 foreach fence in channel
 {
  if(not driver-fence_expired(fence-sync_object))
  break;
  foreach bo_fence_node in fence.bo_list
  {
  remove bo_fence_node
  if bo.read_list and bo.write_list are both empty
  {
  if bo.refcount is zero
  destroy
  else
  {
  append to [vram|gart]_list[channel]
  }
  }
  }
 }
 }
 
 // this is the n-way merge of vram_list[]s into the lru list
 while(vram_list[]s are not all empty)
 {
 // this can use either a simple scan, or an heap
 find channel such that
 first_entry(vram_list[channel]).fence.seq_num is smallest
 
 remove first_entry(vram_list[channel]) and put the bo at the
 recent end of vram_unfenced_lru_list
 }
 
 same thing for gart;
 }
 
 // assume buffer is reserved, use current mechanism or mutex for that
 // channel may be null for CPU waits
 ttm_bo_wait(bo, channel, wait_for_write)
 {
  foreach fence_node in bo.write_list
  {
  if(fence_node.channel != channel)
  driver-wait_fence(fence_node.fence.sync_object);
  }
 
  if(!wait_for_write)
   return;
 
  foreach fence_node in bo.read_list
  {
  if(fence_node.channel != channel)
   driver-wait_fence(fence_node.fence.sync_object);
  }
 }
 
 ttm_out_of_memory() takes memory_alloc_mutex
 {
 retry:
 ttm_flush_fences();
 if(we have enough space now)
 return;
 foreach in [vram|gart]_unfenced_lru_list
 {
 evict that buffer if it's big enough, no need to check fences
 this uses the current ghost mechanism for accelerated moves
 emit evicted_buffer_fence for after emission
 }
 if we didn't find a big enough buffer, evict the biggest buffer
 (also consider empty space around it in size)
 if(we have enough space now)
 return;
 if(burn_cpu_time_but_have_lowest_latencies)
 {
 while(!driver-fence_expired(evicted_bo-sync_object) and we
 don't have enough space)
 {
 driver-wait_for_any_fence_to_expire();
 ttm_flush_fences();
 }
 }
 else
 ttm_bo_wait(evicted_bo 0)
 goto retry;
 }
 
 // assume the buffer has already been put in the desired memory space
 // also assume we already waited for the buffer fences
 ttm_add_buffer_to_fence(fence, bo)
 {
 remove bo from unfenced lru list if it's on it
 for the none or single bo_fence_node in bo.read_list or
 bo.write_list with bo_fence_node.fence.channel == fence.channel
 {
 remove bo_fence_node from bo_fence_node.fence.[gart|vram]_list
 if(bo_fence_node.fence has all lists empty)
 remove 

Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-25 Thread Thomas Hellstrom
Jerome Glisse wrote:
 On Thu, Jan 21, 2010 at 01:59:26PM +0100, Thomas Hellstrom wrote:
   
 Jerome Glisse wrote:
 
 On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
   
 We had to do a similar thing in the
 Poulsbo driver and it turned out that we could save a significant amount 
 of
 CPU by using a delayed workqueue, collecting objects and destroying them
 periodically.
   
 Yes, indeed, we don't really care about a fence expiring unless we
 want to use that buffer or we are out of memory.
 Processing each fence, especially if there is an interrupt per fence,
 seems unnecessary, since you can just do that work in the cases
 mentioned above.

 So, here is a new proposal that should have the best features of all
 previously considered methods.
 Assume first that there is a single FIFO channel.

 We introduce a list of fence nodes, each containing:
 - A list of buffers NOT to be destroyed having the fence as their sync
 object (or possibly one for each memory type)
 - A delayed-destroy list of buffers having the fence as their sync object

 Fence nodes are added at the end upon emission.
 They are removed and freed when the buffer count drops to 0 (meaning
 all buffers have been moved to later fences).
 Thus, the number of fence nodes does not grow without bounds, but is
 bounded by the number of buffers.
 .
 The LRU lists are changed to only contain unfenced buffers and buffers
 are initially put on it.
 When a buffer is fenced, it is removed from the LRU list or its
 current fence list, and added to the new fence (possibly destroying
 the old fence node in the process).
 The reference count of buffer objects is only increased when they are
 put in a delayed destruction list.

 Upon deletion, they are destroyed immediately if they are on the LRU
 list and moved to the corresponding delayed-delete list if they are in
 the fenced list.

 ttm_bo_delayed_delete is no longer called by any workqueue, but only
 on when we run out of memory, before we try eviction.
 It processes the list until the first unsignalled fence, destroying
 all buffers it finds and freeing all the fence nodes.
 All the buffers in the alive lists are put in the LRU list, in the
 correct order.
 We may either keep an alive list per memory type, or sort the buffers
 by memory type (so they are put in the memory type LRU list) at this
 point

 Thus, after ttm_bo_delayed_delete, there is the following scenario:
 1. Inactive buffers with no references are destroyed
 2. Inactive buffers with references are in the LRU list
 3. Fenced buffers with no references are in the delayed-destroy list
 attached to their fence
 4. Fenced buffers with references are in the alive list attached to their 
 fence

 This should have about the same CPU and memory usage of the current
 code, but the following advantages:
 1. Delayed deletion does not run periodically, but exactly when needed
 (at allocation time, before eviction)
 2. Delayed deletion always deletes all buffers that can be deleted,
 since it processes them in fence order
 3. The LRU list at eviction time contains exactly all evictable buffers
 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
 only one node is necessary
 5. Once buffer deletion determines a fence is signalled, it can
 destroyed all its to-be-destroyed buffers without any more checking
 6. Some fence logic (such as signalling of all fences on channel
 forced closing) can be moved  from drivers to TTM
 7. Some channel logic can be moved from drivers to TTM
 8. The code should be simpler and more elegant

 Note that using a buffer with the CPU doesn't change its position in
 the LRU list.
 This is good, because evicting a buffer used by the CPU is actually a
 good thing, since it will move to a region of memory slower for the
 GPU but possibly faster for the CPU.
 However, for buffers in system memory, if the LRU list is used for
 swapping, CPU access should move the buffer to the front of list
 (using the LRU list for this is probably a bad idea though, since
 system memory swapping should be at page granularity).

 For multiple channels, things are slightly more complex.
 First, we need to built the fence data structure for each channel.
 Also, we need the fence/buffer nodes to be separate
 Delayed destruction continues to work, but the reference count needs
 to be set to the number of channels fencing the buffer on destruction.
 Putting buffers in the LRU list can be done by doing n-way merging
 between the channel fence lists, assigning each fence a global
 sequence number, and using that to merge and put back buffers in the
 LRU.
 n-way merging can be done with a small heap on the stack on current
 hardware where channels are limited.
 Furthermore, we need to keep a reference count, so that buffers are
 put in the LRU (or destroyed) only after they are off all the
 channels.

 The LRU list semantics are relaxed. If a buffer has both its fence
 emitted before another buffer, and 

Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Jerome Glisse
On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
  We had to do a similar thing in the
  Poulsbo driver and it turned out that we could save a significant amount of
  CPU by using a delayed workqueue, collecting objects and destroying them
  periodically.
 Yes, indeed, we don't really care about a fence expiring unless we
 want to use that buffer or we are out of memory.
 Processing each fence, especially if there is an interrupt per fence,
 seems unnecessary, since you can just do that work in the cases
 mentioned above.
 
 So, here is a new proposal that should have the best features of all
 previously considered methods.
 Assume first that there is a single FIFO channel.
 
 We introduce a list of fence nodes, each containing:
 - A list of buffers NOT to be destroyed having the fence as their sync
 object (or possibly one for each memory type)
 - A delayed-destroy list of buffers having the fence as their sync object
 
 Fence nodes are added at the end upon emission.
 They are removed and freed when the buffer count drops to 0 (meaning
 all buffers have been moved to later fences).
 Thus, the number of fence nodes does not grow without bounds, but is
 bounded by the number of buffers.
 .
 The LRU lists are changed to only contain unfenced buffers and buffers
 are initially put on it.
 When a buffer is fenced, it is removed from the LRU list or its
 current fence list, and added to the new fence (possibly destroying
 the old fence node in the process).
 The reference count of buffer objects is only increased when they are
 put in a delayed destruction list.
 
 Upon deletion, they are destroyed immediately if they are on the LRU
 list and moved to the corresponding delayed-delete list if they are in
 the fenced list.
 
 ttm_bo_delayed_delete is no longer called by any workqueue, but only
 on when we run out of memory, before we try eviction.
 It processes the list until the first unsignalled fence, destroying
 all buffers it finds and freeing all the fence nodes.
 All the buffers in the alive lists are put in the LRU list, in the
 correct order.
 We may either keep an alive list per memory type, or sort the buffers
 by memory type (so they are put in the memory type LRU list) at this
 point
 
 Thus, after ttm_bo_delayed_delete, there is the following scenario:
 1. Inactive buffers with no references are destroyed
 2. Inactive buffers with references are in the LRU list
 3. Fenced buffers with no references are in the delayed-destroy list
 attached to their fence
 4. Fenced buffers with references are in the alive list attached to their 
 fence
 
 This should have about the same CPU and memory usage of the current
 code, but the following advantages:
 1. Delayed deletion does not run periodically, but exactly when needed
 (at allocation time, before eviction)
 2. Delayed deletion always deletes all buffers that can be deleted,
 since it processes them in fence order
 3. The LRU list at eviction time contains exactly all evictable buffers
 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
 only one node is necessary
 5. Once buffer deletion determines a fence is signalled, it can
 destroyed all its to-be-destroyed buffers without any more checking
 6. Some fence logic (such as signalling of all fences on channel
 forced closing) can be moved  from drivers to TTM
 7. Some channel logic can be moved from drivers to TTM
 8. The code should be simpler and more elegant
 
 Note that using a buffer with the CPU doesn't change its position in
 the LRU list.
 This is good, because evicting a buffer used by the CPU is actually a
 good thing, since it will move to a region of memory slower for the
 GPU but possibly faster for the CPU.
 However, for buffers in system memory, if the LRU list is used for
 swapping, CPU access should move the buffer to the front of list
 (using the LRU list for this is probably a bad idea though, since
 system memory swapping should be at page granularity).
 
 For multiple channels, things are slightly more complex.
 First, we need to built the fence data structure for each channel.
 Also, we need the fence/buffer nodes to be separate
 Delayed destruction continues to work, but the reference count needs
 to be set to the number of channels fencing the buffer on destruction.
 Putting buffers in the LRU list can be done by doing n-way merging
 between the channel fence lists, assigning each fence a global
 sequence number, and using that to merge and put back buffers in the
 LRU.
 n-way merging can be done with a small heap on the stack on current
 hardware where channels are limited.
 Furthermore, we need to keep a reference count, so that buffers are
 put in the LRU (or destroyed) only after they are off all the
 channels.
 
 The LRU list semantics are relaxed. If a buffer has both its fence
 emitted before another buffer, and also signaled before it, then it
 will be considered as less recently used, and the opposite thing
 happens if both are after. 

Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Thomas Hellstrom
Luca Barbieri wrote:
 We had to do a similar thing in the
 Poulsbo driver and it turned out that we could save a significant amount of
 CPU by using a delayed workqueue, collecting objects and destroying them
 periodically.
 
 Yes, indeed, we don't really care about a fence expiring unless we
 want to use that buffer or we are out of memory.
 Processing each fence, especially if there is an interrupt per fence,
 seems unnecessary, since you can just do that work in the cases
 mentioned above.

 So, here is a new proposal that should have the best features of all
 previously considered methods.
 Assume first that there is a single FIFO channel.

 We introduce a list of fence nodes, each containing:
 - A list of buffers NOT to be destroyed having the fence as their sync
 object (or possibly one for each memory type)
 - A delayed-destroy list of buffers having the fence as their sync object

 Fence nodes are added at the end upon emission.
 They are removed and freed when the buffer count drops to 0 (meaning
 all buffers have been moved to later fences).
 Thus, the number of fence nodes does not grow without bounds, but is
 bounded by the number of buffers.
 .
 The LRU lists are changed to only contain unfenced buffers and buffers
 are initially put on it.
 When a buffer is fenced, it is removed from the LRU list or its
 current fence list, and added to the new fence (possibly destroying
 the old fence node in the process).
 The reference count of buffer objects is only increased when they are
 put in a delayed destruction list.

 Upon deletion, they are destroyed immediately if they are on the LRU
 list and moved to the corresponding delayed-delete list if they are in
 the fenced list.

 ttm_bo_delayed_delete is no longer called by any workqueue, but only
 on when we run out of memory, before we try eviction.
 It processes the list until the first unsignalled fence, destroying
 all buffers it finds and freeing all the fence nodes.
 All the buffers in the alive lists are put in the LRU list, in the
 correct order.
 We may either keep an alive list per memory type, or sort the buffers
 by memory type (so they are put in the memory type LRU list) at this
 point

 Thus, after ttm_bo_delayed_delete, there is the following scenario:
 1. Inactive buffers with no references are destroyed
 2. Inactive buffers with references are in the LRU list
 3. Fenced buffers with no references are in the delayed-destroy list
 attached to their fence
 4. Fenced buffers with references are in the alive list attached to their 
 fence

 This should have about the same CPU and memory usage of the current
 code, but the following advantages:
 1. Delayed deletion does not run periodically, but exactly when needed
 (at allocation time, before eviction)
 2. Delayed deletion always deletes all buffers that can be deleted,
 since it processes them in fence order
 3. The LRU list at eviction time contains exactly all evictable buffers
 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
 only one node is necessary
 5. Once buffer deletion determines a fence is signalled, it can
 destroyed all its to-be-destroyed buffers without any more checking
 6. Some fence logic (such as signalling of all fences on channel
 forced closing) can be moved  from drivers to TTM
 7. Some channel logic can be moved from drivers to TTM
 8. The code should be simpler and more elegant

 Note that using a buffer with the CPU doesn't change its position in
 the LRU list.
 This is good, because evicting a buffer used by the CPU is actually a
 good thing, since it will move to a region of memory slower for the
 GPU but possibly faster for the CPU.
 However, for buffers in system memory, if the LRU list is used for
 swapping, CPU access should move the buffer to the front of list
 (using the LRU list for this is probably a bad idea though, since
 system memory swapping should be at page granularity).

 For multiple channels, things are slightly more complex.
 First, we need to built the fence data structure for each channel.
 Also, we need the fence/buffer nodes to be separate
 Delayed destruction continues to work, but the reference count needs
 to be set to the number of channels fencing the buffer on destruction.
 Putting buffers in the LRU list can be done by doing n-way merging
 between the channel fence lists, assigning each fence a global
 sequence number, and using that to merge and put back buffers in the
 LRU.
 n-way merging can be done with a small heap on the stack on current
 hardware where channels are limited.
 Furthermore, we need to keep a reference count, so that buffers are
 put in the LRU (or destroyed) only after they are off all the
 channels.

 The LRU list semantics are relaxed. If a buffer has both its fence
 emitted before another buffer, and also signaled before it, then it
 will be considered as less recently used, and the opposite thing
 happens if both are after. Otherwise, the order is undetermined.

 This should 

Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Thomas Hellstrom
Jerome Glisse wrote:
 On Thu, Jan 21, 2010 at 04:49:39AM +0100, Luca Barbieri wrote:
   
 We had to do a similar thing in the
 Poulsbo driver and it turned out that we could save a significant amount of
 CPU by using a delayed workqueue, collecting objects and destroying them
 periodically.
   
 Yes, indeed, we don't really care about a fence expiring unless we
 want to use that buffer or we are out of memory.
 Processing each fence, especially if there is an interrupt per fence,
 seems unnecessary, since you can just do that work in the cases
 mentioned above.

 So, here is a new proposal that should have the best features of all
 previously considered methods.
 Assume first that there is a single FIFO channel.

 We introduce a list of fence nodes, each containing:
 - A list of buffers NOT to be destroyed having the fence as their sync
 object (or possibly one for each memory type)
 - A delayed-destroy list of buffers having the fence as their sync object

 Fence nodes are added at the end upon emission.
 They are removed and freed when the buffer count drops to 0 (meaning
 all buffers have been moved to later fences).
 Thus, the number of fence nodes does not grow without bounds, but is
 bounded by the number of buffers.
 .
 The LRU lists are changed to only contain unfenced buffers and buffers
 are initially put on it.
 When a buffer is fenced, it is removed from the LRU list or its
 current fence list, and added to the new fence (possibly destroying
 the old fence node in the process).
 The reference count of buffer objects is only increased when they are
 put in a delayed destruction list.

 Upon deletion, they are destroyed immediately if they are on the LRU
 list and moved to the corresponding delayed-delete list if they are in
 the fenced list.

 ttm_bo_delayed_delete is no longer called by any workqueue, but only
 on when we run out of memory, before we try eviction.
 It processes the list until the first unsignalled fence, destroying
 all buffers it finds and freeing all the fence nodes.
 All the buffers in the alive lists are put in the LRU list, in the
 correct order.
 We may either keep an alive list per memory type, or sort the buffers
 by memory type (so they are put in the memory type LRU list) at this
 point

 Thus, after ttm_bo_delayed_delete, there is the following scenario:
 1. Inactive buffers with no references are destroyed
 2. Inactive buffers with references are in the LRU list
 3. Fenced buffers with no references are in the delayed-destroy list
 attached to their fence
 4. Fenced buffers with references are in the alive list attached to their 
 fence

 This should have about the same CPU and memory usage of the current
 code, but the following advantages:
 1. Delayed deletion does not run periodically, but exactly when needed
 (at allocation time, before eviction)
 2. Delayed deletion always deletes all buffers that can be deleted,
 since it processes them in fence order
 3. The LRU list at eviction time contains exactly all evictable buffers
 4. Each buffer is always on an LRU _or_ on a delayed destroy list, so
 only one node is necessary
 5. Once buffer deletion determines a fence is signalled, it can
 destroyed all its to-be-destroyed buffers without any more checking
 6. Some fence logic (such as signalling of all fences on channel
 forced closing) can be moved  from drivers to TTM
 7. Some channel logic can be moved from drivers to TTM
 8. The code should be simpler and more elegant

 Note that using a buffer with the CPU doesn't change its position in
 the LRU list.
 This is good, because evicting a buffer used by the CPU is actually a
 good thing, since it will move to a region of memory slower for the
 GPU but possibly faster for the CPU.
 However, for buffers in system memory, if the LRU list is used for
 swapping, CPU access should move the buffer to the front of list
 (using the LRU list for this is probably a bad idea though, since
 system memory swapping should be at page granularity).

 For multiple channels, things are slightly more complex.
 First, we need to built the fence data structure for each channel.
 Also, we need the fence/buffer nodes to be separate
 Delayed destruction continues to work, but the reference count needs
 to be set to the number of channels fencing the buffer on destruction.
 Putting buffers in the LRU list can be done by doing n-way merging
 between the channel fence lists, assigning each fence a global
 sequence number, and using that to merge and put back buffers in the
 LRU.
 n-way merging can be done with a small heap on the stack on current
 hardware where channels are limited.
 Furthermore, we need to keep a reference count, so that buffers are
 put in the LRU (or destroyed) only after they are off all the
 channels.

 The LRU list semantics are relaxed. If a buffer has both its fence
 emitted before another buffer, and also signaled before it, then it
 will be considered as less recently used, and the opposite thing
 happens if 

Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Luca Barbieri
 At a first glance:

 1) We probably *will* need a delayed destroyed workqueue to avoid wasting
 memory that otherwise should be freed to the system. At the very least, the
 delayed delete process should optionally be run by a system shrinker.
You are right. For VRAM we don't care since we are the only user,
while for system backed memory some delayed destruction will be
needed.
The logical extension of the scheme would be for the Linux page
allocator/swapper to check for TTM buffers to destroy when it would
otherwise shrink caches, try to swap and/or wait on swap to happen.
Not sure whether there are existing hooks for this or where exactly to
hook this code.

 2) Fences in TTM are currently not necessarily strictly ordered, and
 sequence numbers are hidden from the bo code. This means, for a given FIFO,
 fence sequence 3 may expire before fence sequence 2, depending on the usage
 of the buffer.

My definition of channel (I sometimes used FIFO incorrectly as a
synonym of that) is exactly a set of fences that are strictly ordered.
If the card has multiple HW engines, each is considered a different
channel (so that a channel becomes a (fifo, engine) pair).

We may need however to add the concept of a sync domain that would
be a set of channels that support on-GPU synchronization against each
other.
This would model hardware where channels with the same FIFO can be
synchronized together but those with different FIFOs don't, and also
multi-core GPUs where synchronization might be available only inside
each core and not across cores.

To sum it up, a GPU consists of a set of sync domains, each consisting
of a set of channels, each consisting of a sequence of fences, with
the following rules:
1. Fences within the same channel expire in order
2. If channels A and B belong to the same sync domain, it's possible
to emit a fence on A that is guaranteed to expire after an arbitrary
fence of B

Whether channels have the same FIFO or not is essentially a driver
implementation detail, and what TTM cares about is if they are in the
same sync domain.

[I just made up sync domain here: is there a standard term?]

This assumes that the synchronizability graph is a disjoint union of
complete graphs. Is there any example where it is not so?
Also, does this actually model correctly Poulsbo, or am I wrong?

Note that we could use CPU mediation more than we currently do.
For instance now Nouveau, to do inter-channel synchronization, simply
waits on the fence with the CPU immediately synchronously, while it
could instead queue the commands in software, and with an
interrupt/delayed mechanism submit them to hardware once the fence to
be waited for is expired.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Francisco Jerez
Luca Barbieri l...@luca-barbieri.com writes:

 At a first glance:

 1) We probably *will* need a delayed destroyed workqueue to avoid wasting
 memory that otherwise should be freed to the system. At the very least, the
 delayed delete process should optionally be run by a system shrinker.
 You are right. For VRAM we don't care since we are the only user,
 while for system backed memory some delayed destruction will be
 needed.
 The logical extension of the scheme would be for the Linux page
 allocator/swapper to check for TTM buffers to destroy when it would
 otherwise shrink caches, try to swap and/or wait on swap to happen.
 Not sure whether there are existing hooks for this or where exactly to
 hook this code.

 2) Fences in TTM are currently not necessarily strictly ordered, and
 sequence numbers are hidden from the bo code. This means, for a given FIFO,
 fence sequence 3 may expire before fence sequence 2, depending on the usage
 of the buffer.

 My definition of channel (I sometimes used FIFO incorrectly as a
 synonym of that) is exactly a set of fences that are strictly ordered.
 If the card has multiple HW engines, each is considered a different
 channel (so that a channel becomes a (fifo, engine) pair).

 We may need however to add the concept of a sync domain that would
 be a set of channels that support on-GPU synchronization against each
 other.
 This would model hardware where channels with the same FIFO can be
 synchronized together but those with different FIFOs don't, and also
 multi-core GPUs where synchronization might be available only inside
 each core and not across cores.

 To sum it up, a GPU consists of a set of sync domains, each consisting
 of a set of channels, each consisting of a sequence of fences, with
 the following rules:
 1. Fences within the same channel expire in order
 2. If channels A and B belong to the same sync domain, it's possible
 to emit a fence on A that is guaranteed to expire after an arbitrary
 fence of B

 Whether channels have the same FIFO or not is essentially a driver
 implementation detail, and what TTM cares about is if they are in the
 same sync domain.

 [I just made up sync domain here: is there a standard term?]

 This assumes that the synchronizability graph is a disjoint union of
 complete graphs. Is there any example where it is not so?
 Also, does this actually model correctly Poulsbo, or am I wrong?

 Note that we could use CPU mediation more than we currently do.
 For instance now Nouveau, to do inter-channel synchronization, simply
 waits on the fence with the CPU immediately synchronously, while it
 could instead queue the commands in software, and with an
 interrupt/delayed mechanism submit them to hardware once the fence to
 be waited for is expired.

Nvidia cards have a synchronization primitive that could be used to
synchronize several FIFOs in hardware (AKA semaphores, see [1] for an
example).

 ___
 Nouveau mailing list
 Nouveau@lists.freedesktop.org
 http://lists.freedesktop.org/mailman/listinfo/nouveau

[1] http://lists.freedesktop.org/archives/nouveau/2009-December/004514.html


pgpPDUkLn0FBH.pgp
Description: PGP signature
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Luca Barbieri
 Nvidia cards have a synchronization primitive that could be used to
 synchronize several FIFOs in hardware (AKA semaphores, see [1] for an
 example).

Does this operate wholly on the GPU on all nVidia cards?

It seems that at least on some GPUs this will trigger software
methods that are basically a way for the GPU to trigger an interrupt
and stop the FIFO until the CPU handles the interrupt and restarts it.

Also, is there a way on nVidia cards to get interrupts on fences, but
only where the fence sequence number is higher than a dynamically set
value? (so that one could sleep for fence X without getting an
interrupt for every single fence before that)

If not, it could possibly be hacked around by reading from a DMA
object at the address of the fence sequence number and then resizing
the DMA object so that addresses from a certain point on would trigger
a protection fault interrupt.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Thomas Hellstrom
Luca Barbieri wrote:
 At a first glance:

 1) We probably *will* need a delayed destroyed workqueue to avoid wasting
 memory that otherwise should be freed to the system. At the very least, the
 delayed delete process should optionally be run by a system shrinker.
 
 You are right. For VRAM we don't care since we are the only user,
 while for system backed memory some delayed destruction will be
 needed.
 The logical extension of the scheme would be for the Linux page
 allocator/swapper to check for TTM buffers to destroy when it would
 otherwise shrink caches, try to swap and/or wait on swap to happen.
 Not sure whether there are existing hooks for this or where exactly to
 hook this code.

   

I think there are existing hooks for this, but I haven't yet figured out 
how they work.

 2) Fences in TTM are currently not necessarily strictly ordered, and
 sequence numbers are hidden from the bo code. This means, for a given FIFO,
 fence sequence 3 may expire before fence sequence 2, depending on the usage
 of the buffer.
 

 My definition of channel (I sometimes used FIFO incorrectly as a
 synonym of that) is exactly a set of fences that are strictly ordered.
 If the card has multiple HW engines, each is considered a different
 channel (so that a channel becomes a (fifo, engine) pair).

 We may need however to add the concept of a sync domain that would
 be a set of channels that support on-GPU synchronization against each
 other.
 This would model hardware where channels with the same FIFO can be
 synchronized together but those with different FIFOs don't, and also
 multi-core GPUs where synchronization might be available only inside
 each core and not across cores.

 To sum it up, a GPU consists of a set of sync domains, each consisting
 of a set of channels, each consisting of a sequence of fences, with
 the following rules:
 1. Fences within the same channel expire in order
 2. If channels A and B belong to the same sync domain, it's possible
 to emit a fence on A that is guaranteed to expire after an arbitrary
 fence of B

 Whether channels have the same FIFO or not is essentially a driver
 implementation detail, and what TTM cares about is if they are in the
 same sync domain.

 [I just made up sync domain here: is there a standard term?]

 This assumes that the synchronizability graph is a disjoint union of
 complete graphs. Is there any example where it is not so?
 Also, does this actually model correctly Poulsbo, or am I wrong?
   
Let me give some usage examples for intel, and Poulsbo. The 
synchronization for these are / were modeled with fences with stages, 
(called fence_types). The signaling of one stage set a bit in a bit 
mask. A bo was considered idle when (bo-required_stages  
fence-signaled_stages == bo-required_stages).

Intel would have had the stages command_submitted, read_flushed, 
write_flushed. A command buffer would have been idle if the fence had 
signaled command_submitted, whereas a render buffer would've been idle 
on  command_submitted | write_flushed. A write flush would've had to be 
issued separately. Typically when the buffer was put on the delayed 
delete queue or when the bo_idle function was called, which really 
minimized the amount of needed flushing. An executed write flush would 
signal write_flushed on all fences whose sequence had passed. An 
explicit ordering of fences here would've meant queueing a 
read-write-flush  in between them.

For Poulsbo, the GPU was modeled with a programming stage, a binner 
stage, a rasterizer stage and a feedback stage. Each stage of completion 
set a signaled_stage bit in the fence.

Now, a vertex buffer would be idle when the fence signaled binner_done, 
(which could happen well before rasterization complete of the previous 
command sequence). It's true that one could use feedback done for all 
buffers but a low-memory-footprint system quickly ran out of vertex 
buffer space, so quick reuse of those were essential.

So to summarize, the usage of the buffer together with the signaled 
state of the fence object really determines whether the buffer is idle. 
I think in your model,  the Poulsbo GPU would've been a sync domain and 
the programmer, binner, rasterizer and  feedback engine would've 
been separate channels. The Intel case would perhaps have been a bit 
more tricky.

/Thomas
















 Note that we could use CPU mediation more than we currently do.
 For instance now Nouveau, to do inter-channel synchronization, simply
 waits on the fence with the CPU immediately synchronously, while it
 could instead queue the commands in software, and with an
 interrupt/delayed mechanism submit them to hardware once the fence to
 be waited for is expired.
   

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Francisco Jerez
Luca Barbieri l...@luca-barbieri.com writes:

 Nvidia cards have a synchronization primitive that could be used to
 synchronize several FIFOs in hardware (AKA semaphores, see [1] for an
 example).

 Does this operate wholly on the GPU on all nVidia cards?

 It seems that at least on some GPUs this will trigger software
 methods that are basically a way for the GPU to trigger an interrupt
 and stop the FIFO until the CPU handles the interrupt and restarts it.

Yes, it could be handled without software methods, Maarten (CC'ed) was
looking into this, IIRC.

 Also, is there a way on nVidia cards to get interrupts on fences, but
 only where the fence sequence number is higher than a dynamically set
 value? (so that one could sleep for fence X without getting an
 interrupt for every single fence before that)

 If not, it could possibly be hacked around by reading from a DMA
 object at the address of the fence sequence number and then resizing
 the DMA object so that addresses from a certain point on would trigger
 a protection fault interrupt.

I don't think you can safely modify a DMA object without stalling the
card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts
and disable them by flipping a bit in PGRAPH when you stop caring about
them.


pgpJlRQwUx57K.pgp
Description: PGP signature
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Luca Barbieri
I'm not sure I understand your proposal correctly.
It seems your proposoal is similar to mine, replacing the term fence
nodes with ttm transactions, but I'm not sure if I understand it
correctly

Here is some pseudocode for a improved, simplified version of my proposal.
It is modified so that there are no longer distinct alive/destroy
lists, but buffers are destroyed if their ref count is zero.

list_head ram_lru_list; /* list of bos */
list_head vram_unfenced_lru_list; /* list of bos */
list_head gart_unfenced_lru_list; /* list of bos */

atomic uint64_t next_seq_num;

// if read_list and write_list are empty, the buffer is unfenced and
MUST be in an unfenced lru list
// otherwise, it is fenced and MUST be, if not zombie, on some
read_list/write_list, or if zombie, on some unfenced_list
struct ttm_buffer_object
{
   kref_t ref;
   list_head read_list; /* list of bo_fence_nodes */
   list_head write_list; /* list of bo_fence_nodes */
   list_head unfenced_list; /* list entry in
[ram|[gart|vram]_unfenced]_lru_list */
   [...]
};

// TODO: we could embed just the first two members in the
ttm_buffer_object, and steal the lowest bit on the fence pointer to
signify that
// this would optimize for the very common single-fence case
struct ttm_bo_fence_node
{
list_head bo_list; /* list entry in bo.[read|write]_list */
struct ttm_fence_node* fence;

struct ttm_buffer_object* bo;
list_head fence_list; /* list entry in fence.[vram|gart|destroy]_list */
};

struct ttm_fence
{
void* sync_object; /* may also turned into an object containing a
ttm_fence at the start */
uint64_t seq_num; /* unique value in order of kmalloc of this ttm_fence */
list_head bo_list; /* list of bo_fence_nodes */
};

struct ttm_channel
{
list_head fence_list; /* list of ttm_fence_node */
};

ttm_flush_fences()
{
list_head vram_list[MAX_CHANNELS];
list_head gart_list[MAX_CHANNELS];
foreach channel
{
foreach fence in channel
{
 if(not driver-fence_expired(fence-sync_object))
 break;
 foreach bo_fence_node in fence.bo_list
 {
 remove bo_fence_node
 if bo.read_list and bo.write_list are both empty
 {
 if bo.refcount is zero
 destroy
 else
 {
 append to [vram|gart]_list[channel]
 }
 }
 }
}
}

// this is the n-way merge of vram_list[]s into the lru list
while(vram_list[]s are not all empty)
{
// this can use either a simple scan, or an heap
find channel such that
first_entry(vram_list[channel]).fence.seq_num is smallest

remove first_entry(vram_list[channel]) and put the bo at the
recent end of vram_unfenced_lru_list
}

same thing for gart;
}

// assume buffer is reserved, use current mechanism or mutex for that
// channel may be null for CPU waits
ttm_bo_wait(bo, channel, wait_for_write)
{
 foreach fence_node in bo.write_list
 {
 if(fence_node.channel != channel)
 driver-wait_fence(fence_node.fence.sync_object);
 }

 if(!wait_for_write)
  return;

 foreach fence_node in bo.read_list
 {
 if(fence_node.channel != channel)
  driver-wait_fence(fence_node.fence.sync_object);
 }
}

ttm_out_of_memory() takes memory_alloc_mutex
{
retry:
ttm_flush_fences();
if(we have enough space now)
return;
foreach in [vram|gart]_unfenced_lru_list
{
evict that buffer if it's big enough, no need to check fences
this uses the current ghost mechanism for accelerated moves
emit evicted_buffer_fence for after emission
}
if we didn't find a big enough buffer, evict the biggest buffer
(also consider empty space around it in size)
if(we have enough space now)
return;
if(burn_cpu_time_but_have_lowest_latencies)
{
while(!driver-fence_expired(evicted_bo-sync_object) and we
don't have enough space)
{
driver-wait_for_any_fence_to_expire();
ttm_flush_fences();
}
}
else
ttm_bo_wait(evicted_bo 0)
goto retry;
}

// assume the buffer has already been put in the desired memory space
// also assume we already waited for the buffer fences
ttm_add_buffer_to_fence(fence, bo)
{
remove bo from unfenced lru list if it's on it
for the none or single bo_fence_node in bo.read_list or
bo.write_list with bo_fence_node.fence.channel == fence.channel
{
remove bo_fence_node from bo_fence_node.fence.[gart|vram]_list
if(bo_fence_node.fence has all lists empty)
remove from channel and free the fence bo_fence_node.fence
remove bo_fence_node from bo.[read|list]_list
}

create a new bo_fence node and use it to add the bo to the fence
}

ttm_bo_refcount_drops_to_zero(bo)

Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Luca Barbieri
 If not, it could possibly be hacked around by reading from a DMA
 object at the address of the fence sequence number and then resizing
 the DMA object so that addresses from a certain point on would trigger
 a protection fault interrupt.

 I don't think you can safely modify a DMA object without stalling the
 card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts
 and disable them by flipping a bit in PGRAPH when you stop caring about
 them.

The problem is that one needs to disable them *before* the one he cares about.

Suppose the card is at fence 0 and we are interested in fence 1000 expiring.

If we just enable interrupts, then we are going to be interrupted
uselessly 1000 times.
Instead, we would like to tell the card send me interrupts for
fences, but only for fence number 1000 (or higher).

This way one could for instance render a whole scene, and then
desiring to read it into the CPU, just ask for an interrupt once
rendering is done (i.e. wait for the framebuffer fence) and get a
single interrupt, while we cleanly sleep undisturbed in the meantime.

Basically, it would just need some way of *conditionally* causing interrupts.
If there is none, then maybe we could insert a call to a fence
mini-pushbuffer filled with NOPs that could be overwritten with an
interrupt request on demand?
Or alternatively, construct such a pushbuffer with the 2D or 3D
engines, starting from our 1000 input and the fence value generated
by the 3D engine? (this is likely to be slow though).
Or some hack like the DMA object resizing? (would it crash the GPU? or
just not work due to it caching the previous size?)
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Maarten Maathuis
On Thu, Jan 21, 2010 at 3:44 PM, Francisco Jerez curroje...@riseup.net wrote:
 Luca Barbieri l...@luca-barbieri.com writes:

 Nvidia cards have a synchronization primitive that could be used to
 synchronize several FIFOs in hardware (AKA semaphores, see [1] for an
 example).

 Does this operate wholly on the GPU on all nVidia cards?

 It seems that at least on some GPUs this will trigger software
 methods that are basically a way for the GPU to trigger an interrupt
 and stop the FIFO until the CPU handles the interrupt and restarts it.


Yes this triggers a few software methods, nvidia hardware is a bit
lacking in this area.

 Yes, it could be handled without software methods, Maarten (CC'ed) was
 looking into this, IIRC.

Doing it without software methods means you need to have a semaphore
that exists in the cpu domain and therefore cannot be used again
until the cpu is sure it's done. So that will probably require a
rotating queue of several semaphores and a fallback to cpu waiting. It
should not be that hard to prototype in kernel, although we currently
assume notifiers (little chunks of memory that amongst other things
are used by semaphore) to be in channel private memory, so while in
kernel it's probably doable, there will need to be api changes for the
userspace side of it. Since the respective chunk of memory needs to
have a dma object on more than one channel. This should probably be
dealt with in combination with calims desire to be able to have dma
objects for random buffer objects for use with gpu queries. It doesn't
seem impossible, but it won't exactly be trivial either. The sanest
model would probably involve an ioctl that takes two channels and does
whatever is needed to make it synchronised. But given the recent
discussion/noise about doubts regarding our mixed userspace/kernel
model, we should consider this carefully. That said, i doubt it would
be sanely possible with a full userspace model like the one nvidia
uses.


 Also, is there a way on nVidia cards to get interrupts on fences, but
 only where the fence sequence number is higher than a dynamically set
 value? (so that one could sleep for fence X without getting an
 interrupt for every single fence before that)

 If not, it could possibly be hacked around by reading from a DMA
 object at the address of the fence sequence number and then resizing
 the DMA object so that addresses from a certain point on would trigger
 a protection fault interrupt.

 I don't think you can safely modify a DMA object without stalling the
 card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts
 and disable them by flipping a bit in PGRAPH when you stop caring about
 them.

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-21 Thread Francisco Jerez
Luca Barbieri l...@luca-barbieri.com writes:

 If not, it could possibly be hacked around by reading from a DMA
 object at the address of the fence sequence number and then resizing
 the DMA object so that addresses from a certain point on would trigger
 a protection fault interrupt.

 I don't think you can safely modify a DMA object without stalling the
 card completely, but i think you could use e.g. PGRAPH NOTIFY interrupts
 and disable them by flipping a bit in PGRAPH when you stop caring about
 them.

 The problem is that one needs to disable them *before* the one he cares about.

 Suppose the card is at fence 0 and we are interested in fence 1000 expiring.

That isn't the most common case. You typically don't expect the card to
get further than 3-4 pushbufs behind the CPU.


 If we just enable interrupts, then we are going to be interrupted
 uselessly 1000 times.
 Instead, we would like to tell the card send me interrupts for
 fences, but only for fence number 1000 (or higher).

 This way one could for instance render a whole scene, and then
 desiring to read it into the CPU, just ask for an interrupt once
 rendering is done (i.e. wait for the framebuffer fence) and get a
 single interrupt, while we cleanly sleep undisturbed in the meantime.

 Basically, it would just need some way of *conditionally* causing interrupts.
 If there is none, then maybe we could insert a call to a fence
 mini-pushbuffer filled with NOPs that could be overwritten with an
 interrupt request on demand?

The card also keeps some internal FIFO caches, so it seems to me that
getting that safe of races and efficient at the same time could be a bit
non-trivial.

 Or alternatively, construct such a pushbuffer with the 2D or 3D
 engines, starting from our 1000 input and the fence value generated
 by the 3D engine? (this is likely to be slow though).
 Or some hack like the DMA object resizing? (would it crash the GPU? or
 just not work due to it caching the previous size?)


pgpxgRnpPl0JA.pgp
Description: PGP signature
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-20 Thread Thomas Hellstrom

Thomas Hellstrom wrote:


Yes, it looks correct. Although it seems a little unintuitive to enter 
the loop with the spinlock held, and exit it with the spinlock not held. 
I've attached yet another patch to have that fixed. Could you take a 
look at whether it seems OK with you and, in that case, repost it for Dave?


  

... And the patch.

/Thomas

From 68972c220abe3ce19eb046d72fa218646168adc7 Mon Sep 17 00:00:00 2001
From: Luca Barbieri l...@luca-barbieri.com
Date: Mon, 18 Jan 2010 19:34:53 +0100
Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2)

ttm_bo_delayed_delete has a race condition, because after we do:
kref_put(nentry-list_kref, ttm_bo_release_list);

we are not holding the list lock and not holding any reference to
objects, and thus every bo in the list can be removed and freed at
this point.

However, we then use the next pointer we stored, which is not guaranteed
to be valid.

This was apparently the cause of some Nouveau oopses I experienced.

This patch rewrites the function so that it keeps the reference to nentry
until nentry itself is freed and we already got a reference to nentry-next.

It should now be correct and free of races, but please double check this.

Updated according to Thomas Hellstrom's feedback.

Signed-off-by: Luca Barbieri l...@luca-barbieri.com
---
 drivers/gpu/drm/ttm/ttm_bo.c |   58 ++---
 1 files changed, 26 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 2920f9a..5dfa41f 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -523,52 +523,46 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all)
 static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all)
 {
 	struct ttm_bo_global *glob = bdev-glob;
-	struct ttm_buffer_object *entry, *nentry;
-	struct list_head *list, *next;
-	int ret;
+	struct ttm_buffer_object *entry;
+	int ret = 0;
 
 	spin_lock(glob-lru_lock);
-	list_for_each_safe(list, next, bdev-ddestroy) {
-		entry = list_entry(list, struct ttm_buffer_object, ddestroy);
-		nentry = NULL;
+	if (list_empty(bdev-ddestroy)) {
+		spin_unlock(glob-lru_lock);
+		return 0;
+	}
 
-		/*
-		 * Protect the next list entry from destruction while we
-		 * unlock the lru_lock.
-		 */
+	entry = list_first_entry(bdev-ddestroy,
+		struct ttm_buffer_object, ddestroy);
+	kref_get(entry-list_kref);
+
+	for (;;) {
+		struct ttm_buffer_object *nentry = NULL;
 
-		if (next != bdev-ddestroy) {
-			nentry = list_entry(next, struct ttm_buffer_object,
-	ddestroy);
+		if (entry-ddestroy.next != bdev-ddestroy) {
+			nentry = list_first_entry(entry-ddestroy,
+struct ttm_buffer_object, ddestroy);
 			kref_get(nentry-list_kref);
 		}
-		kref_get(entry-list_kref);
 
 		spin_unlock(glob-lru_lock);
 		ret = ttm_bo_cleanup_refs(entry, remove_all);
 		kref_put(entry-list_kref, ttm_bo_release_list);
+		entry = nentry;
+
+		if (ret || !entry)
+			break;
 
 		spin_lock(glob-lru_lock);
-		if (nentry) {
-			bool next_onlist = !list_empty(next);
+		
+		if (list_empty(entry-ddestroy)) {
 			spin_unlock(glob-lru_lock);
-			kref_put(nentry-list_kref, ttm_bo_release_list);
-			spin_lock(glob-lru_lock);
-			/*
-			 * Someone might have raced us and removed the
-			 * next entry from the list. We don't bother restarting
-			 * list traversal.
-			 */
-
-			if (!next_onlist)
-break;
-		}
-		if (ret)
 			break;
+		}
 	}
-	ret = !list_empty(bdev-ddestroy);
-	spin_unlock(glob-lru_lock);
 
+	if (entry)
+		kref_put(entry-list_kref, ttm_bo_release_list);
 	return ret;
 }
 
-- 
1.6.3.3

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-20 Thread Thomas Hellstrom

Thomas Hellstrom wrote:

Thomas Hellstrom wrote:
  
Yes, it looks correct. Although it seems a little unintuitive to enter 
the loop with the spinlock held, and exit it with the spinlock not held. 
I've attached yet another patch to have that fixed. Could you take a 
look at whether it seems OK with you and, in that case, repost it for Dave?


  


... And the patch.

/Thomas

  
Hmm, It seems I should've stayed in bed today. Hopefully this is the 
right one...


/Thomas


From b3dc72cf74d1b8e0eb5f5423fbb0ac975f147f4c Mon Sep 17 00:00:00 2001
From: Luca Barbieri l...@luca-barbieri.com
Date: Mon, 18 Jan 2010 19:34:53 +0100
Subject: [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2)

ttm_bo_delayed_delete has a race condition, because after we do:
kref_put(nentry-list_kref, ttm_bo_release_list);

we are not holding the list lock and not holding any reference to
objects, and thus every bo in the list can be removed and freed at
this point.

However, we then use the next pointer we stored, which is not guaranteed
to be valid.

This was apparently the cause of some Nouveau oopses I experienced.

This patch rewrites the function so that it keeps the reference to nentry
until nentry itself is freed and we already got a reference to nentry-next.

It should now be correct and free of races, but please double check this.

Updated according to Thomas Hellstrom's feedback.

Signed-off-by: Luca Barbieri l...@luca-barbieri.com
Signed-off-by: Thomas Hellstrom thellst...@vmware.com
---
 drivers/gpu/drm/ttm/ttm_bo.c |   58 ++
 1 files changed, 25 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index c7733c3..1a3e909 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -524,52 +524,44 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all)
 static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all)
 {
 	struct ttm_bo_global *glob = bdev-glob;
-	struct ttm_buffer_object *entry, *nentry;
-	struct list_head *list, *next;
-	int ret;
+	struct ttm_buffer_object *entry = NULL;
+	int ret = 0;
 
 	spin_lock(glob-lru_lock);
-	list_for_each_safe(list, next, bdev-ddestroy) {
-		entry = list_entry(list, struct ttm_buffer_object, ddestroy);
-		nentry = NULL;
+	if (list_empty(bdev-ddestroy))
+		goto out_unlock;
 
-		/*
-		 * Protect the next list entry from destruction while we
-		 * unlock the lru_lock.
-		 */
+	entry = list_first_entry(bdev-ddestroy,
+		struct ttm_buffer_object, ddestroy);
+	kref_get(entry-list_kref);
+
+	for (;;) {
+		struct ttm_buffer_object *nentry = NULL;
 
-		if (next != bdev-ddestroy) {
-			nentry = list_entry(next, struct ttm_buffer_object,
-	ddestroy);
+		if (entry-ddestroy.next != bdev-ddestroy) {
+			nentry = list_first_entry(entry-ddestroy,
+struct ttm_buffer_object, ddestroy);
 			kref_get(nentry-list_kref);
 		}
-		kref_get(entry-list_kref);
 
 		spin_unlock(glob-lru_lock);
 		ret = ttm_bo_cleanup_refs(entry, remove_all);
 		kref_put(entry-list_kref, ttm_bo_release_list);
+		entry = nentry;
+
+		if (ret || !entry)
+			goto out;
 
 		spin_lock(glob-lru_lock);
-		if (nentry) {
-			bool next_onlist = !list_empty(next);
-			spin_unlock(glob-lru_lock);
-			kref_put(nentry-list_kref, ttm_bo_release_list);
-			spin_lock(glob-lru_lock);
-			/*
-			 * Someone might have raced us and removed the
-			 * next entry from the list. We don't bother restarting
-			 * list traversal.
-			 */
-
-			if (!next_onlist)
-break;
-		}
-		if (ret)
+		if (list_empty(entry-ddestroy))
 			break;
 	}
-	ret = !list_empty(bdev-ddestroy);
-	spin_unlock(glob-lru_lock);
 
+out_unlock:
+	spin_unlock(glob-lru_lock);
+out:
+	if (entry)
+		kref_put(entry-list_kref, ttm_bo_release_list);
 	return ret;
 }
 
-- 
1.6.2.5

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-20 Thread Luca Barbieri
Yes it's fine. I sent your patch to Dave with an expanded commit
comment for merging.

Here is a possible redesign of the mechanism inspired by this issue.
It seems that what we are racing against is buffer eviction, due to
delayed deletion buffers being still kept on the LRU list.

I'm wondering if the delayed deletion design could be changed as follows:
1. Remove to-be-deleted buffers from the LRU list before putting them
on delayed delete
2. Change buffer eviction to first do a delayed deletion pass. This
should be cheap (and cheaper than the current code) because delayed
deletion stops at the first unsignaled fence.
3. Add a new delayed deletion lock/semaphore. Then, have
ttm_bo_delayed_delete take it for reading and keep it across the
function.
4. Inside the delayed deletion lock, grab the LRU lock, copy the
delayed delete list head to a local variable, set it to empty and
release the lock.
5. Iterate on the privately held list with list_for_each_safe
6. At the end of ttm_bo_delayed_delete, retake the LRU lock and readd
the remaining part of our private list at the head of the global list

This would prevent uselessly trying to evict delayed-delete buffers
(which should be processed in fence order and not LRU order), and also
prevent concurrent delayed deletions, which should be more harmful
than useful.

Furthermore, it should be possible to get rid of list locking in the
following way:
1. BOs to be delayed-deleted are added to the head of the initial
delayed deletion single linked list, using atomic cmpxchg
2. ttm_bo_delayed_delete takes the delayed deletion lock and grabs the
list with an atomic xchg of the head with zero
3. It reverses the list in place, processes the entries and puts them
at the end of a second single linked list, the recurring delayed
deletion list
4. It processes the recurring delayed deletion list, cleaning up the BOs
5. Finally, the delayed deletion lock is released

This makes adding to the delayed deletion list lockless.

The LRU list instead inherently needs to be doubly linked, so only RCU
could make it lockless, and it seems that may require using an
external list node structure (so readers don't suddenly jump to the
most recent node), and that would not be a win (except with *lots* of
CPUs).
Besides, most DRM drivers (except vmware) are taking the BKL around
all ioctls and (except nouveau) use a single pushbuffer, so this is a
bit moot anyway.

What do you think?

Anyway, this, if done, would be for the next merge window, or later,
while the current fix ought to be merged now.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-20 Thread Luca Barbieri
 Also note that the delayed delete list is not in fence order but in
 deletion-time order, which perhaps gives room for more optimizations.
You are right.
I think then that ttm_bo_delayed_delete may still need to be changed,
because it stops when ttm_bo_cleanup_refs returns -EBUSY, which
happens when a fence has not been reached.
This means that a buffer will need to wait for all previously deleted
buffers to become unused, even if it is unused itself.
Is this acceptable?

What if we get rid of the delayed destroy list, and instead append
buffers to be deleted to their fence object, and delete them when the
fence is signaled?

This also allows to do it more naturally, since the fence object can
just keep a normal reference to the buffers it fences, and unreference
them on expiration.

Then there needs to be no special delayed destruction logic, and it
would work as if the GPU were keeping a reference to the buffer
itself, using fences as a proxy to have the CPU do that work for the
GPU.

Then the delayed work is no longer periodically destroy buffers but
rather periodically check if fences are expired, naturally stopping
at the first unexpired one.
Drivers that support IRQs on fences could also do the work in the
interrupt handler/tasklet instead, avoid the delay jiffies magic
number. This may need a NAPI-like interrupt mitigation middle layer
for optimal results though.

 But isn't an atomic cmpxchg about as costly as a spinlock?
I think it's cheaper on all architectures, as otherwise it would be
mostly pointless to have it, since you can emulate it with a spinlock.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-20 Thread Luca Barbieri
When designing this, we should also keep in mind that some drivers
(e.g. nouveau) have multiple FIFO channels, and thus we would like a
buffer to be referenced for reading by multiple channels at once (and
be destroyed only when all fences are expired, obviously).
Also, hardware may support on-GPU inter-channel synchronization, and
then multiple references may be for writing too.

If we use an external dynamically allocated channel/buffer list node,
we can support this (if the kernel allocators aren't fast enough,
which they should be, we can just keep released ones linked to the bo
to speed allocations).

Note that in nouveau there is a small hardware limit to channels (up
to 128 on nv50), but future hardware may possibly support unlimited
channels.
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-20 Thread Thomas Hellstrom
Luca Barbieri wrote:
 Also note that the delayed delete list is not in fence order but in
 deletion-time order, which perhaps gives room for more optimizations.
 
 You are right.
 I think then that ttm_bo_delayed_delete may still need to be changed,
 because it stops when ttm_bo_cleanup_refs returns -EBUSY, which
 happens when a fence has not been reached.
 This means that a buffer will need to wait for all previously deleted
 buffers to become unused, even if it is unused itself.
 Is this acceptable?
   

Yes, I think it's acceptable if you view it in the context that the most 
important buffer resources (GPU memory space and physical system memory) 
are immediately reclaimable through the eviction- and swapping mechanisms.

 What if we get rid of the delayed destroy list, and instead append
 buffers to be deleted to their fence object, and delete them when the
 fence is signaled?

 This also allows to do it more naturally, since the fence object can
 just keep a normal reference to the buffers it fences, and unreference
 them on expiration.

 Then there needs to be no special delayed destruction logic, and it
 would work as if the GPU were keeping a reference to the buffer
 itself, using fences as a proxy to have the CPU do that work for the
 GPU.

 Then the delayed work is no longer periodically destroy buffers but
 rather periodically check if fences are expired, naturally stopping
 at the first unexpired one.
 Drivers that support IRQs on fences could also do the work in the
 interrupt handler/tasklet instead, avoid the delay jiffies magic
 number. This may need a NAPI-like interrupt mitigation middle layer
 for optimal results though.

   

Yes, I think that this way, it should definitely be possible to find a 
more optimal solution. One should keep in mind, however, that we'll 
probably not able to destroy buffers from within an atomic context, 
which means we have to schedule a workqueue to do that task. We had to 
do a similar thing in the Poulsbo driver and it turned out that we could 
save a significant amount of CPU by using a delayed workqueue, 
collecting objects and destroying them periodically.

/Thomas

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-20 Thread Thomas Hellstrom
Luca Barbieri wrote:
 When designing this, we should also keep in mind that some drivers
 (e.g. nouveau) have multiple FIFO channels, and thus we would like a
 buffer to be referenced for reading by multiple channels at once (and
 be destroyed only when all fences are expired, obviously).
 Also, hardware may support on-GPU inter-channel synchronization, and
 then multiple references may be for writing too.
   

In the context of the current code, I've been thinking of having a list 
of fences on each bo to support multiple readers, and also to try to 
deal with the problem of simultaneous GPU- and CPU readers.

But if the hardware supports on-GPU inter-channel synchronization, I 
figure the code should be smart enough to only wait on the last write 
fence?

/Thomas

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete

2010-01-19 Thread Thomas Hellstrom
Luca,

Good catch.
Some comments inline:



Luca Barbieri wrote:
 + entry = list_first_entry(bdev-ddestroy,
 + struct ttm_buffer_object, ddestroy);
 + kref_get(entry-list_kref);
  
 - if (next != bdev-ddestroy) {
 - nentry = list_entry(next, struct ttm_buffer_object,
 - ddestroy);
 + for (;;) {
 + struct ttm_buffer_object *nentry = NULL;
 +
 + if (!list_empty(entry-ddestroy)
 +  entry-ddestroy.next != bdev-ddestroy) {
 + nentry = list_entry(entry-ddestroy.next,
 + struct ttm_buffer_object, ddestroy);
   

I'm not sure it's considered ok to access the list_head members directly 
in new code.
Would  nentry=list_first_entry(entry-ddestroy, ) work?

   kref_get(nentry-list_kref);
   }
   

else unlock and break? That would save an unnecessary call to 
ttm_bo_cleanup_refs.

 - kref_get(entry-list_kref);
  
   spin_unlock(glob-lru_lock);
   ret = ttm_bo_cleanup_refs(entry, remove_all);
   kref_put(entry-list_kref, ttm_bo_release_list);
 + entry = nentry;
   

Here nentry may have been removed from the list by another process, 
which would trigger the unnecessary call, mentioned above.

/Thomas

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/nouveau