
Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

2012-11-30 Thread Thomas Hellstrom

On 11/30/2012 07:07 PM, Jerome Glisse wrote:

On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:

On 11/30/2012 06:18 PM, Jerome Glisse wrote:

On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom  wrote:

On 11/30/2012 05:30 PM, Jerome Glisse wrote:

On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom 
wrote:

On 11/29/2012 10:58 PM, Marek Olšák wrote:

What I tried to point out was that the synchronization shouldn't be
needed, because the CPU shouldn't do anything with the contents of
evicted buffers. The GPU moves the buffers, not the CPU. What does the
CPU do besides updating some kernel structures?

Also, buffer deletion is something where you don't need to wait for
the buffer to become idle if you know the memory area won't be
mapped by the CPU, ever. The memory can be reclaimed right away. It
would be the GPU to move new data in and once that happens, the old
buffer will be trivially idle, because single-ring GPUs execute
commands in order.

Marek

Actually asynchronous eviction / deletion is something I have been
prototyping for a while but never gotten around to implement in TTM:

There are a few minor caveats:

With buffer deletion, what you say is true for fixed memory, but not for
TT
memory where pages are reclaimed by the system after buffer destruction.
That means that we don't have to wait for idle to free GPU space, but we
need to wait before pages are handed back to the system.
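To make that concrete, here is a minimal userspace sketch of the deferred page release: the GPU range of a deleted TT buffer is freed immediately, while the backing pages sit on a list until their protecting fence signals. All types and names (toy_fence, toy_tt_bo_destroy, ...) are invented for illustration; this is not TTM code.

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_fence { bool (*signaled)(struct toy_fence *); };

struct toy_deferred_pages {
    struct toy_fence *fence;            /* last fence protecting the pages */
    void **pages;                       /* backing pages of the deleted bo */
    size_t npages;
    struct toy_deferred_pages *next;
};

static struct toy_deferred_pages *toy_deferred_list;

/* Called at buffer destruction: GPU space can go back right away. */
void toy_tt_bo_destroy(void **pages, size_t npages,
                       struct toy_fence *last_fence,
                       void (*release_gpu_range)(void))
{
    release_gpu_range();                /* no wait needed to free GPU space */

    struct toy_deferred_pages *d = malloc(sizeof(*d));
    if (!d)
        return;                         /* sketch: leak rather than crash */
    d->fence = last_fence;
    d->pages = pages;
    d->npages = npages;
    d->next = toy_deferred_list;        /* defer the page release */
    toy_deferred_list = d;
}

/* Reap idle entries, e.g. from a fence callback or a periodic worker. */
void toy_reap_deferred_pages(void (*free_page)(void *))
{
    struct toy_deferred_pages **p = &toy_deferred_list;

    while (*p) {
        struct toy_deferred_pages *d = *p;

        if (d->fence->signaled(d->fence)) {
            for (size_t i = 0; i < d->npages; i++)
                free_page(d->pages[i]); /* now safe to hand pages back */
            *p = d->next;
            free(d);
        } else {
            p = &d->next;
        }
    }
}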

Swapout needs to access the contents of evicted buffers, but
synchronizing
doesn't need to happen until just before swapout.
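A tiny sketch of the same point for swapout, again with made-up types: the wait on the eviction fence is postponed until the buffer contents are actually about to be read for swapout.

#include <stddef.h>

struct toy_fence { int (*wait)(struct toy_fence *); };

struct toy_bo {
    struct toy_fence *move_fence;       /* fence of the asynchronous eviction */
    void *sys_pages;                    /* contents after the eviction copy */
};

int toy_swapout(struct toy_bo *bo, int (*write_to_swap)(void *))
{
    /* Only here does the eviction copy have to be finished ... */
    if (bo->move_fence) {
        int ret = bo->move_fence->wait(bo->move_fence);
        if (ret)
            return ret;
        bo->move_fence = NULL;
    }
    /* ... because we are about to read the buffer contents. */
    return write_to_swap(bo->sys_pages);
}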

Multi-ring - CPU support: If another ring / engine or the CPU is about to
move in buffer contents to VRAM or a GPU aperture that was previously
evicted by another ring, it needs to sync with that eviction, but doesn't
know what buffer or even which buffers occupied the space previously.
Trivially one can attach a sync object to the memory type manager that
represents the last eviction from that memory type, and *any* engine (CPU
or
GPU) that moves buffer contents in needs to order that movement with
respect
to that fence. As you say, with a single ring and no CPU fallbacks, that
ordering is a no-op, but any common (non-driver based) implementation
needs
to support this.
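Here is a rough model of the per-memory-type sync object described above, with hypothetical types (toy_mem_type_manager and friends): the manager remembers the fence of the last eviction, a GPU ring orders against it with a semaphore or barrier, and a CPU fallback does a real wait. With a single ring and no CPU fallbacks the ordering degenerates to a no-op, as noted.

#include <stddef.h>

struct toy_fence {
    unsigned int ring;
    void (*cpu_wait)(struct toy_fence *);
};

struct toy_mem_type_manager {
    struct toy_fence *last_eviction;    /* fence of the last eviction */
};

/* Every eviction from this memory type updates the manager's fence. */
void toy_note_eviction(struct toy_mem_type_manager *man,
                       struct toy_fence *eviction_fence)
{
    man->last_eviction = eviction_fence;
}

/* A GPU ring about to move buffer contents *into* this memory type. */
void toy_gpu_move_in(struct toy_mem_type_manager *man, unsigned int ring,
                     void (*emit_wait)(unsigned int ring, struct toy_fence *))
{
    struct toy_fence *f = man->last_eviction;

    if (f && f->ring != ring)
        emit_wait(ring, f);     /* e.g. a semaphore; same ring needs nothing */
    /* ... emit the actual move on @ring ... */
}

/* A CPU fallback moving contents in has to do a real wait. */
void toy_cpu_move_in(struct toy_mem_type_manager *man)
{
    if (man->last_eviction)
        man->last_eviction->cpu_wait(man->last_eviction);
    /* ... memcpy the contents in ... */
}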

A single fence attached to the memory type manager is the simplest
solution,
but a solution with a fence for each free region in the free list is also
possible. Then TTM needs a driver callback to be able to order fences
w.r.t. each other.

/Thomas


Radeon already handles multi-ring and TTM interaction with what we call
semaphores. Semaphores are created to synchronize with fences across
different rings. I think the easiest solution is to just remove the bo
wait in TTM and let the driver handle this.

The wait can be removed, but only conditioned on a driver flag that says it
supports unsynchronous buffer moves.
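As a sketch, the gating could look like the following; the flag name async_buffer_moves is made up for illustration, not a real TTM or driver field.

#include <stdbool.h>

struct toy_driver { bool async_buffer_moves; };   /* made-up flag name */

int toy_evict(struct toy_driver *drv,
              int (*wait_bo_idle)(void), int (*do_move)(void))
{
    if (!drv->async_buffer_moves) {
        int ret = wait_bo_idle();       /* legacy synchronous path */
        if (ret)
            return ret;
    }
    return do_move();                   /* driver orders the move itself */
}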

The multi-ring case I'm talking about is:

Ring 1 evicts buffer A, emits fence 0
Ring 2 evicts buffer B, emits fence 1
... Other evictions take place by various rings, perhaps including ring 1 and
ring 2.
Ring 3 moves buffer C into the space which happens to be the union of the
space previously occupied by buffer A and buffer B.

Question is: which fence do you want to order this move with?
The answer is whichever of fence 0 and 1 signals last.
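A small model of this example, assuming each free region remembers the fence of the eviction that produced it: when ring 3 reuses space spanning several old regions, it orders against every such fence, which in effect means against whichever of them signals last. Types are illustrative only.

#include <stddef.h>

struct toy_fence { unsigned int ring; unsigned int seqno; };

struct toy_free_region {
    unsigned long start, size;
    struct toy_fence *evict_fence;      /* fence 0, fence 1, ... */
};

/* Ring @dst_ring is about to place buffer C over @regions[0..n-1]. */
void toy_place_buffer(unsigned int dst_ring,
                      struct toy_free_region *regions, size_t n,
                      void (*emit_sema_wait)(unsigned int ring,
                                             struct toy_fence *))
{
    for (size_t i = 0; i < n; i++) {
        struct toy_fence *f = regions[i].evict_fence;

        if (!f)
            continue;                   /* region was never GPU-touched */
        if (f->ring == dst_ring)
            continue;                   /* same ring: ordered by submission */
        emit_sema_wait(dst_ring, f);    /* cross-ring: barrier / semaphore */
    }
    /* ... now emit the move of buffer C on @dst_ring ... */
}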

I think it's a reasonable thing for TTM to keep track of this, but in order
to do so it needs a driver callback that
can order two fences, and can order a job in the current ring w r t a fence.
In radeon's case that driver callback
would probably insert a barrier / semaphore. In the case of simpler hardware
it would wait on one of the fences.

/Thomas


I don't think we can order fences easily with a clean API; I would
rather see TTM provide a list of fences to the driver and tell the
driver that, before moving this object, all the fences on this list need
to be completed. I think it's as easy as associating fences with drm_mm
(well, nouveau has its own mm stuff), but the idea would basically be
that fences are associated both with the bo and with the mm object, so
you know when a segment of memory is idle/available for use.

Cheers,
Jerome
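A rough sketch of that interface shape, with invented types standing in for drm_mm and the driver hook: TTM supplies the fence list for the destination range, and the driver chooses how to satisfy it (semaphores on capable hardware, a plain wait on simpler hardware).

#include <stdbool.h>
#include <stddef.h>

struct toy_fence { bool (*signaled)(struct toy_fence *); };

struct toy_mm_node {
    unsigned long start, size;
    struct toy_fence *fences[8];        /* fences covering this range */
    size_t nfences;
};

/* Driver-supplied hook: make @dst_ring wait for (or order after) @f. */
typedef void (*toy_sync_fn)(unsigned int dst_ring, struct toy_fence *f);

void toy_move_into(struct toy_mm_node *dst, unsigned int dst_ring,
                   toy_sync_fn sync_one)
{
    /* TTM's only job here is to hand over the list ... */
    for (size_t i = 0; i < dst->nfences; i++) {
        struct toy_fence *f = dst->fences[i];

        if (f->signaled(f))
            continue;                   /* that part of the range is idle */
        sync_one(dst_ring, f);          /* ... the driver picks the mechanism */
    }
    dst->nfences = 0;
    /* ... now perform or emit the move into @dst ... */
}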


Hmm. Agreed that would save a lot of barriers.

Even if TTM tracks fences by free mm regions or a single fence for
the whole memory type, it's a simple fact that fences from the same
ring are trivially ordered, which means such a list should contain at
most as many fences as there are rings.
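For illustration, such a set could simply keep the latest fence per ring, assuming per-ring monotonic seqnos (which is exactly what makes same-ring fences trivially ordered). Types and the ring limit are made up for the sketch.

#include <stddef.h>

#define TOY_MAX_RINGS 16

struct toy_fence { unsigned int ring; unsigned int seqno; };

struct toy_fence_set {
    struct toy_fence *per_ring[TOY_MAX_RINGS];  /* latest fence per ring */
};

/* Assumes f->ring < TOY_MAX_RINGS and per-ring monotonic seqnos. */
void toy_fence_set_add(struct toy_fence_set *set, struct toy_fence *f)
{
    struct toy_fence *old = set->per_ring[f->ring];

    /* Same ring: the later seqno implies the earlier one has signaled. */
    if (!old || f->seqno > old->seqno)
        set->per_ring[f->ring] = f;
}

size_t toy_fence_set_count(const struct toy_fence_set *set)
{
    size_t n = 0;

    for (size_t i = 0; i < TOY_MAX_RINGS; i++)
        if (set->per_ring[i])
            n++;
    return n;                           /* never more than the ring count */
}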

Yes, one function callback is needed to know which fence is necessary;
TTM also needs to know the number of rings (note that I think newer
hw will have something like 1024 rings or even more, and even today's
hw might have that many; I consider an nvidia channel to be pretty much
what I define as a ring).

But I think most cases will involve only a few fences across a few rings,
e.g. one ring is the DMA ring, then you have a ring for one of the GL
contexts that is using the memory and another ring for the new context
that wants to use the memory.


So, whatever approach is chosen, TTM needs to be able to determine

Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

2012-11-30 Thread Jerome Glisse
On Fri, Nov 30, 2012 at 06:43:36PM +0100, Thomas Hellstrom wrote:
> On 11/30/2012 06:18 PM, Jerome Glisse wrote:
> >On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom  
> >wrote:
> >>On 11/30/2012 05:30 PM, Jerome Glisse wrote:
> >>>On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom 
> >>>wrote:
> On 11/29/2012 10:58 PM, Marek Olšák wrote:
> >
> >What I tried to point out was that the synchronization shouldn't be
> >needed, because the CPU shouldn't do anything with the contents of
> >evicted buffers. The GPU moves the buffers, not the CPU. What does the
> >CPU do besides updating some kernel structures?
> >
> >Also, buffer deletion is something where you don't need to wait for
> >the buffer to become idle if you know the memory area won't be
> >mapped by the CPU, ever. The memory can be reclaimed right away. It
> >would be the GPU to move new data in and once that happens, the old
> >buffer will be trivially idle, because single-ring GPUs execute
> >commands in order.
> >
> >Marek
> 
> Actually asynchronous eviction / deletion is something I have been
> prototyping for a while but never gotten around to implement in TTM:
> 
> There are a few minor caveats:
> 
> With buffer deletion, what you say is true for fixed memory, but not for
> TT
> memory where pages are reclaimed by the system after buffer destruction.
> That means that we don't have to wait for idle to free GPU space, but we
> need to wait before pages are handed back to the system.
> 
> Swapout needs to access the contents of evicted buffers, but
> synchronizing
> doesn't need to happen until just before swapout.
> 
> Multi-ring - CPU support: If another ring / engine or the CPU is about to
> move in buffer contents to VRAM or a GPU aperture that was previously
> evicted by another ring, it needs to sync with that eviction, but doesn't
> know what buffer or even which buffers occupied the space previously.
> Trivially one can attach a sync object to the memory type manager that
> represents the last eviction from that memory type, and *any* engine (CPU
> or
> GPU) that moves buffer contents in needs to order that movement with
> respect
> to that fence. As you say, with a single ring and no CPU fallbacks, that
> ordering is a no-op, but any common (non-driver based) implementation
> needs
> to support this.
> 
> A single fence attached to the memory type manager is the simplest
> solution,
> but a solution with a fence for each free region in the free list is also
> possible. Then TTM needs a driver callback to be able to order fences
> w.r.t. each other.
> 
> /Thomas
> 
> >>>Radeon already handles multi-ring and TTM interaction with what we call
> >>>semaphores. Semaphores are created to synchronize with fences across
> >>>different rings. I think the easiest solution is to just remove the bo
> >>>wait in TTM and let the driver handle this.
> >>
> >>The wait can be removed, but only conditioned on a driver flag that says it
> >>supports unsynchronous buffer moves.
> >>
> >>The multi-ring case I'm talking about is:
> >>
> >>Ring 1 evicts buffer A, emits fence 0
> >>Ring 2 evicts buffer B, emits fence 1
> >>... Other evictions take place by various rings, perhaps including ring 1 and
> >>ring 2.
> >>Ring 3 moves buffer C into the space which happens to be the union of the
> >>space previously occupied by buffer A and buffer B.
> >>
> >>Question is: which fence do you want to order this move with?
> >>The answer is whichever of fence 0 and 1 signals last.
> >>
> >>I think it's a reasonable thing for TTM to keep track of this, but in order
> >>to do so it needs a driver callback that
> >>can order two fences, and can order a job in the current ring w r t a fence.
> >>In radeon's case that driver callback
> >>would probably insert a barrier / semaphore. In the case of simpler hardware
> >>it would wait on one of the fences.
> >>
> >>/Thomas
> >>
> >I don't think we can order fences easily with a clean API; I would
> >rather see TTM provide a list of fences to the driver and tell the
> >driver that, before moving this object, all the fences on this list
> >need to be completed. I think it's as easy as associating fences with
> >drm_mm (well, nouveau has its own mm stuff), but the idea would
> >basically be that fences are associated both with the bo and with the
> >mm object, so you know when a segment of memory is idle/available for use.
> >
> >Cheers,
> >Jerome
> 
> 
> Hmm. Agreed that would save a lot of barriers.
> 
> Even if TTM tracks fences by free mm regions or a single fence for
> the whole memory type, it's a simple fact that fences from the same
> ring are trivially ordered, which means such a list should contain at
> most as many fences as there are rings.

Yes, one function callback is needed to know which fence is necessary;
TTM also needs to know the number of rings (note that 

Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

2012-11-30 Thread Thomas Hellstrom

On 11/30/2012 06:18 PM, Jerome Glisse wrote:

On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom  wrote:

On 11/30/2012 05:30 PM, Jerome Glisse wrote:

On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom 
wrote:

On 11/29/2012 10:58 PM, Marek Olšák wrote:


What I tried to point out was that the synchronization shouldn't be
needed, because the CPU shouldn't do anything with the contents of
evicted buffers. The GPU moves the buffers, not the CPU. What does the
CPU do besides updating some kernel structures?

Also, buffer deletion is something where you don't need to wait for
the buffer to become idle if you know the memory area won't be
mapped by the CPU, ever. The memory can be reclaimed right away. It
would be the GPU to move new data in and once that happens, the old
buffer will be trivially idle, because single-ring GPUs execute
commands in order.

Marek


Actually asynchronous eviction / deletion is something I have been
prototyping for a while but never gotten around to implement in TTM:

There are a few minor caveats:

With buffer deletion, what you say is true for fixed memory, but not for
TT
memory where pages are reclaimed by the system after buffer destruction.
That means that we don't have to wait for idle to free GPU space, but we
need to wait before pages are handed back to the system.

Swapout needs to access the contents of evicted buffers, but
synchronizing
doesn't need to happen until just before swapout.

Multi-ring - CPU support: If another ring / engine or the CPU is about to
move in buffer contents to VRAM or a GPU aperture that was previously
evicted by another ring, it needs to sync with that eviction, but doesn't
know what buffer or even which buffers occupied the space previously.
Trivially one can attach a sync object to the memory type manager that
represents the last eviction from that memory type, and *any* engine (CPU
or
GPU) that moves buffer contents in needs to order that movement with
respect
to that fence. As you say, with a single ring and no CPU fallbacks, that
ordering is a no-op, but any common (non-driver based) implementation
needs
to support this.

A single fence attached to the memory type manager is the simplest
solution,
but a solution with a fence for each free region in the free list is also
possible. Then TTM needs a driver callback to be able to order fences
w.r.t. each other.

/Thomas


Radeon already handles multi-ring and TTM interaction with what we call
semaphores. Semaphores are created to synchronize with fences across
different rings. I think the easiest solution is to just remove the bo
wait in TTM and let the driver handle this.


The wait can be removed, but only conditioned on a driver flag that says it
supports unsynchronous buffer moves.

The multi-ring case I'm talking about is:

Ring 1 evicts buffer A, emits fence 0
Ring 2 evicts buffer B, emits fence 1
... Other evictions take place by various rings, perhaps including ring 1 and
ring 2.
Ring 3 moves buffer C into the space which happens to be the union of the
space previously occupied by buffer A and buffer B.

Question is: which fence do you want to order this move with?
The answer is whichever of fence 0 and 1 signals last.

I think it's a reasonable thing for TTM to keep track of this, but in order
to do so it needs a driver callback that
can order two fences, and can order a job in the current ring w r t a fence.
In radeon's case that driver callback
would probably insert a barrier / semaphore. In the case of simpler hardware
it would wait on one of the fences.

/Thomas


I don't think we can order fences easily with a clean API; I would
rather see TTM provide a list of fences to the driver and tell the
driver that, before moving this object, all the fences on this list need
to be completed. I think it's as easy as associating fences with drm_mm
(well, nouveau has its own mm stuff), but the idea would basically be
that fences are associated both with the bo and with the mm object, so
you know when a segment of memory is idle/available for use.

Cheers,
Jerome



Hmm. Agreed that would save a lot of barriers.

Even if TTM tracks fences by free mm regions or a single fence for the
whole memory type, it's a simple fact that fences from the same ring are
trivially ordered, which means such a list should contain at most as many
fences as there are rings.

So, whatever approach is chosen, TTM needs to be able to determine that 
trivial ordering,
and I think the upcoming cross-device fencing work will face the exact 
same problem.


My proposed ordering API would look something like

struct fence *order_fences(struct fence *fence_a, struct fence *fence_b, 
bool trivial_order, bool interruptible, bool no_wait_gpu)


Returns the one of @fence_a and @fence_b that, when signaled, guarantees
that the other fence has also signaled. If @trivial_order is true and the
driver cannot trivially order the fences, it may return ERR_PTR(-EAGAIN);
if @interruptible is true, any wait should be performed interruptibly;
and if @no_wait_gpu is true, the 
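A userspace model of how a driver might back such a callback, under the assumption that fences carry a (ring, seqno) pair and that same-ring fences signal in seqno order; the error handling mimics the ERR_PTR convention with an out-parameter, and everything here is illustrative rather than a proposed patch.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence {
    unsigned int ring;
    unsigned int seqno;
    bool (*signaled)(struct toy_fence *);
    int  (*wait)(struct toy_fence *, bool interruptible);
};

/*
 * Return the fence that, once signaled, guarantees the other one has
 * signaled too; return NULL and set *err when that cannot be decided
 * under the given constraints.
 */
struct toy_fence *toy_order_fences(struct toy_fence *a, struct toy_fence *b,
                                   bool trivial_order, bool interruptible,
                                   bool no_wait_gpu, int *err)
{
    *err = 0;

    /* An already-signaled fence is ordered before anything. */
    if (a->signaled(a))
        return b;
    if (b->signaled(b))
        return a;

    /* Same ring: the higher seqno signals last. */
    if (a->ring == b->ring)
        return (a->seqno > b->seqno) ? a : b;

    /* Different rings and only trivial ordering is allowed. */
    if (trivial_order) {
        *err = -EAGAIN;
        return NULL;
    }

    /* Simple-hardware fallback: wait one of the fences out. */
    if (no_wait_gpu) {
        *err = -EBUSY;
        return NULL;
    }
    *err = a->wait(a, interruptible);
    if (*err)
        return NULL;
    return b;                   /* a has now signaled, so b orders both */
}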

Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

2012-11-30 Thread Jerome Glisse
On Fri, Nov 30, 2012 at 12:08 PM, Thomas Hellstrom  wrote:
> On 11/30/2012 05:30 PM, Jerome Glisse wrote:
>>
>> On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom 
>> wrote:
>>>
>>> On 11/29/2012 10:58 PM, Marek Olšák wrote:


 What I tried to point out was that the synchronization shouldn't be
 needed, because the CPU shouldn't do anything with the contents of
 evicted buffers. The GPU moves the buffers, not the CPU. What does the
 CPU do besides updating some kernel structures?

 Also, buffer deletion is something where you don't need to wait for
 the buffer to become idle if you know the memory area won't be
 mapped by the CPU, ever. The memory can be reclaimed right away. It
 would be the GPU to move new data in and once that happens, the old
 buffer will be trivially idle, because single-ring GPUs execute
 commands in order.

 Marek
>>>
>>>
>>> Actually asynchronous eviction / deletion is something I have been
>>> prototyping for a while but never gotten around to implement in TTM:
>>>
>>> There are a few minor caveats:
>>>
>>> With buffer deletion, what you say is true for fixed memory, but not for
>>> TT
>>> memory where pages are reclaimed by the system after buffer destruction.
>>> That means that we don't have to wait for idle to free GPU space, but we
>>> need to wait before pages are handed back to the system.
>>>
>>> Swapout needs to access the contents of evicted buffers, but
>>> synchronizing
>>> doesn't need to happen until just before swapout.
>>>
>>> Multi-ring - CPU support: If another ring / engine or the CPU is about to
>>> move in buffer contents to VRAM or a GPU aperture that was previously
>>> evicted by another ring, it needs to sync with that eviction, but doesn't
>>> know what buffer or even which buffers occupied the space previously.
>>> Trivially one can attach a sync object to the memory type manager that
>>> represents the last eviction from that memory type, and *any* engine (CPU
>>> or
>>> GPU) that moves buffer contents in needs to order that movement with
>>> respect
>>> to that fence. As you say, with a single ring and no CPU fallbacks, that
>>> ordering is a no-op, but any common (non-driver based) implementation
>>> needs
>>> to support this.
>>>
>>> A single fence attached to the memory type manager is the simplest
>>> solution,
>>> but a solution with a fence for each free region in the free list is also
>>> possible. Then TTM needs a driver callback to be able to order fences
>>> w.r.t. each other.
>>>
>>> /Thomas
>>>
>> Radeon already handles multi-ring and TTM interaction with what we call
>> semaphores. Semaphores are created to synchronize with fences across
>> different rings. I think the easiest solution is to just remove the bo
>> wait in TTM and let the driver handle this.
>
>
> The wait can be removed, but only conditioned on a driver flag that says it
> supports unsynchronous buffer moves.
>
> The multi-ring case I'm talking about is:
>
> Ring 1 evicts buffer A, emits fence 0
> Ring 2 evicts buffer B, emits fence 1
> ... Other evictions take place by various rings, perhaps including ring 1 and
> ring 2.
> Ring 3 moves buffer C into the space which happens to be the union of the
> space previously occupied by buffer A and buffer B.
>
> Question is: which fence do you want to order this move with?
> The answer is whichever of fence 0 and 1 signals last.
>
> I think it's a reasonable thing for TTM to keep track of this, but in order
> to do so it needs a driver callback that
> can order two fences, and can order a job in the current ring w r t a fence.
> In radeon's case that driver callback
> would probably insert a barrier / semaphore. In the case of simpler hardware
> it would wait on one of the fences.
>
> /Thomas
>

I don't think we can order fences easily with a clean API; I would
rather see TTM provide a list of fences to the driver and tell the
driver that, before moving this object, all the fences on this list need
to be completed. I think it's as easy as associating fences with drm_mm
(well, nouveau has its own mm stuff), but the idea would basically be
that fences are associated both with the bo and with the mm object, so
you know when a segment of memory is idle/available for use.

Cheers,
Jerome


Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

2012-11-30 Thread Thomas Hellstrom

On 11/30/2012 05:30 PM, Jerome Glisse wrote:

On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom  wrote:

On 11/29/2012 10:58 PM, Marek Olšák wrote:


What I tried to point out was that the synchronization shouldn't be
needed, because the CPU shouldn't do anything with the contents of
evicted buffers. The GPU moves the buffers, not the CPU. What does the
CPU do besides updating some kernel structures?

Also, buffer deletion is something where you don't need to wait for
the buffer to become idle if you know the memory area won't be
mapped by the CPU, ever. The memory can be reclaimed right away. It
would be the GPU to move new data in and once that happens, the old
buffer will be trivially idle, because single-ring GPUs execute
commands in order.

Marek


Actually asynchronous eviction / deletion is something I have been
prototyping for a while but never gotten around to implement in TTM:

There are a few minor caveats:

With buffer deletion, what you say is true for fixed memory, but not for TT
memory where pages are reclaimed by the system after buffer destruction.
That means that we don't have to wait for idle to free GPU space, but we
need to wait before pages are handed back to the system.

Swapout needs to access the contents of evicted buffers, but synchronizing
doesn't need to happen until just before swapout.

Multi-ring - CPU support: If another ring / engine or the CPU is about to
move in buffer contents to VRAM or a GPU aperture that was previously
evicted by another ring, it needs to sync with that eviction, but doesn't
know what buffer or even which buffers occupied the space previously.
Trivially one can attach a sync object to the memory type manager that
represents the last eviction from that memory type, and *any* engine (CPU or
GPU) that moves buffer contents in needs to order that movement with respect
to that fence. As you say, with a single ring and no CPU fallbacks, that
ordering is a no-op, but any common (non-driver based) implementation needs
to support this.

A single fence attached to the memory type manager is the simplest solution,
but a solution with a fence for each free region in the free list is also
possible. Then TTM needs a driver callback to be able to order fences
w.r.t. each other.

/Thomas


Radeon already handles multi-ring and TTM interaction with what we call
semaphores. Semaphores are created to synchronize with fences across
different rings. I think the easiest solution is to just remove the bo
wait in TTM and let the driver handle this.


The wait can be removed, but only conditioned on a driver flag that says 
it supports unsynchronous buffer moves.


The multi-ring case I'm talking about is:

Ring 1 evicts buffer A, emits fence 0
Ring 2 evicts buffer B, emits fence 1
... Other evictions take place by various rings, perhaps including ring 1
and ring 2.
Ring 3 moves buffer C into the space which happens to be the union of
the space previously occupied by buffer A and buffer B.


Question is: which fence do you want to order this move with?
The answer is whichever of fence 0 and 1 signals last.

I think it's a reasonable thing for TTM to keep track of this, but in 
order to do so it needs a driver callback that
can order two fences, and can order a job in the current ring w r t a 
fence. In radeon's case that driver callback
would probably insert a barrier / semaphore. In the case of simpler 
hardware it would wait on one of the fences.


/Thomas



Cheers,
Jerome






Re: Asynchronous eviction [WAS Re: [PATCH] drm/ttm: add minimum residency constraint for bo eviction]

2012-11-30 Thread Jerome Glisse
On Fri, Nov 30, 2012 at 4:39 AM, Thomas Hellstrom  wrote:
> On 11/29/2012 10:58 PM, Marek Olšák wrote:
>>
>>
>> What I tried to point out was that the synchronization shouldn't be
>> needed, because the CPU shouldn't do anything with the contents of
>> evicted buffers. The GPU moves the buffers, not the CPU. What does the
>> CPU do besides updating some kernel structures?
>>
>> Also, buffer deletion is something where you don't need to wait for
>> the buffer to become idle if you know the memory area won't be
>> mapped by the CPU, ever. The memory can be reclaimed right away. It
>> would be the GPU to move new data in and once that happens, the old
>> buffer will be trivially idle, because single-ring GPUs execute
>> commands in order.
>>
>> Marek
>
>
> Actually asynchronous eviction / deletion is something I have been
> prototyping for a while but never gotten around to implement in TTM:
>
> There are a few minor caveats:
>
> With buffer deletion, what you say is true for fixed memory, but not for TT
> memory where pages are reclaimed by the system after buffer destruction.
> That means that we don't have to wait for idle to free GPU space, but we
> need to wait before pages are handed back to the system.
>
> Swapout needs to access the contents of evicted buffers, but synchronizing
> doesn't need to happen until just before swapout.
>
> Multi-ring - CPU support: If another ring / engine or the CPU is about to
> move in buffer contents to VRAM or a GPU aperture that was previously
> evicted by another ring, it needs to sync with that eviction, but doesn't
> know what buffer or even which buffers occupied the space previously.
> Trivially one can attach a sync object to the memory type manager that
> represents the last eviction from that memory type, and *any* engine (CPU or
> GPU) that moves buffer contents in needs to order that movement with respect
> to that fence. As you say, with a single ring and no CPU fallbacks, that
> ordering is a no-op, but any common (non-driver based) implementation needs
> to support this.
>
> A single fence attached to the memory type manager is the simplest solution,
> but a solution with a fence for each free region in the free list is also
> possible. Then TTM needs a driver callback to be able to order fences
> w.r.t. each other.
>
> /Thomas
>

Radeon already handles multi-ring and TTM interaction with what we call
semaphores. Semaphores are created to synchronize with fences across
different rings. I think the easiest solution is to just remove the bo
wait in TTM and let the driver handle this.

Cheers,
Jerome
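For readers unfamiliar with the scheme, a generic model of such a cross-ring semaphore (not actual radeon code) looks roughly like this: the ring that owns the fence signals a semaphore after the fenced work, and the ring about to reuse the memory waits on that semaphore before its move, so no CPU wait is involved.

struct toy_semaphore { unsigned int value; };

struct toy_ring {
    unsigned int id;
    void (*emit_sem_signal)(struct toy_ring *, struct toy_semaphore *);
    void (*emit_sem_wait)(struct toy_ring *, struct toy_semaphore *);
};

/*
 * Order work on @waiter after the fenced work on @signaler without
 * blocking the CPU: the dependency lives entirely on the rings.
 */
void toy_cross_ring_sync(struct toy_ring *signaler, struct toy_ring *waiter,
                         struct toy_semaphore *sem)
{
    if (signaler->id == waiter->id)
        return;                         /* same ring executes in order */

    signaler->emit_sem_signal(signaler, sem);   /* after the fenced work */
    waiter->emit_sem_wait(waiter, sem);         /* before the new move */
}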