Re: RFC: Radeon multi ring support branch

2011-11-20 Thread Jerome Glisse
On Fri, Nov 18, 2011 at 11:34 AM, Jerome Glisse  wrote:
> On Fri, Nov 18, 2011 at 07:44:19AM -0500, Alex Deucher wrote:
>> 2011/11/17 Alex Deucher :
>> > 2011/11/17 Christian König :
>> >> On 16.11.2011 01:24, Jerome Glisse wrote:
>> >>>
>> >>> Well as we don't specify on which value semaphore should wait on, i am
>> >>> prety sure the first ring to increment the semaphore will unblock all
>> >>> waiter. So if you have ring1 that want to wait on ring2 and ring3 as 
>> >>> soon as
>> >>> ring2 or ring3 is done ring1 will go one while either ring2 or ring3 
>> >>> might
>> >>> not be done. I will test that tomorrow but from doc i have it seems so. 
>> >>> Thus
>> >>> it will be broken with more than one ring, that would mean you have to
>> >>> allocate one semaphore for each ring couple you want to synchronize. Note
>> >>> that the usual case will likely be sync btw 2 ring.
>> >>
>> >> Good point, but I played with it a bit more today and it is just behaving 
>> >> as
>> >> I thought it would be. A single signal command will just unblock a single
>> >> waiter, even if there are multiple waiters currently for this semaphore, 
>> >> the
>> >> only thing you can't tell is which waiter will come first.
>> >>
>> >> I should also note that the current algorithm will just emit multiple wait
>> >> operations to a single ring, and spread the signal operations to all other
>> >> rings we are interested in. That isn't very efficient, but should indeed
>> >> work quite fine.
>> >>
>> >>> After retesting the first patch  drm/radeon: fix debugfs handling is NAK
>> >>> a complete no go.
>> >>>
>> >>> Issue is that radeon_debugfs_cleanup is call after rdev is free. This
>> >>> is why i used a static array. I forgot about that, i should have put a
>> >>> comment. I guess you built your kernel without debugfs or that you
>> >>> didn't tested to reload the module.
>> >>
>> >> Mhm, I have tested it, seen the crash, and didn't thought that this is a
>> >> problem. Don't ask me why I can't understand it myself right now.
>> >>
>> >> Anyway, I moved the unregistering of the files into a separate function,
>> >> which is now called from radeon_device_fini instead of
>> >> radeon_debugfs_cleanup. That seems to work fine, at least if I haven't
>> >> missed something else.
>> >>
>> >> I also merged your indention fixes and the fix for the never allocated
>> >> semaphores and pushed the result into my public repository
>> >> (http://cgit.freedesktop.org/~deathsimple/linux), so please take another
>> >> look at it.
>> >
>> > I've got a few other patches to enable further functionality in the
>> > mring patches.
>> > - per ring fence interrupts
>> > - add some additional ring fields to better handle different ring types
>> >
>> > http://people.freedesktop.org/~agd5f/mrings/
>> >
>>
>> FYI, I updated these later last night.
>>
>> Alex
>>
>
> Ok so reviewed the patch serie, please Christian keep v2, v3, ...
> informations, i find this usefull. I put updated patch at
> http://people.freedesktop.org/~glisse/mrings/
>
> Couple of fixes there, indentation, and also i changed the testing
> parameter to be a bit flag which make our life easier when we want
> to only test the semaphore stuff and not the bo move.
>
> Cheers,
> Jerome
>

I found a major bug introduced by patch 15 and rewrote it. I re-uploaded
the whole series at
http://people.freedesktop.org/~glisse/mrings/

The issue is that all the fence lists should be initialized only once, from
the asic init callback. The commit message has a longer explanation.
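
To illustrate the kind of fix described above, here is a minimal sketch (the
field and function names are assumptions, not the actual patch): the per-ring
fence lists get set up exactly once, so a later re-init cannot clobber lists
that may still hold emitted fences.

/* Sketch only -- names are assumptions, not the actual patch. */
void radeon_fence_driver_init_once(struct radeon_device *rdev)
{
        int ring;

        if (rdev->fence_lists_initialized)
                return;

        for (ring = 0; ring < RADEON_NUM_RINGS; ++ring) {
                INIT_LIST_HEAD(&rdev->fence_drv[ring].created);
                INIT_LIST_HEAD(&rdev->fence_drv[ring].emitted);
                INIT_LIST_HEAD(&rdev->fence_drv[ring].signaled);
        }
        rdev->fence_lists_initialized = true;
}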

Cheers,
Jerome


Re: RFC: Radeon multi ring support branch

2011-11-18 Thread Jerome Glisse
On Fri, Nov 18, 2011 at 07:44:19AM -0500, Alex Deucher wrote:
> 2011/11/17 Alex Deucher :
> > 2011/11/17 Christian König :
> >> On 16.11.2011 01:24, Jerome Glisse wrote:
> >>>
> >>> Well as we don't specify on which value semaphore should wait on, i am
> >>> prety sure the first ring to increment the semaphore will unblock all
> >>> waiter. So if you have ring1 that want to wait on ring2 and ring3 as soon 
> >>> as
> >>> ring2 or ring3 is done ring1 will go one while either ring2 or ring3 might
> >>> not be done. I will test that tomorrow but from doc i have it seems so. 
> >>> Thus
> >>> it will be broken with more than one ring, that would mean you have to
> >>> allocate one semaphore for each ring couple you want to synchronize. Note
> >>> that the usual case will likely be sync btw 2 ring.
> >>
> >> Good point, but I played with it a bit more today and it is just behaving 
> >> as
> >> I thought it would be. A single signal command will just unblock a single
> >> waiter, even if there are multiple waiters currently for this semaphore, 
> >> the
> >> only thing you can't tell is which waiter will come first.
> >>
> >> I should also note that the current algorithm will just emit multiple wait
> >> operations to a single ring, and spread the signal operations to all other
> >> rings we are interested in. That isn't very efficient, but should indeed
> >> work quite fine.
> >>
> >>> After retesting the first patch  drm/radeon: fix debugfs handling is NAK
> >>> a complete no go.
> >>>
> >>> Issue is that radeon_debugfs_cleanup is call after rdev is free. This
> >>> is why i used a static array. I forgot about that, i should have put a
> >>> comment. I guess you built your kernel without debugfs or that you
> >>> didn't tested to reload the module.
> >>
> >> Mhm, I have tested it, seen the crash, and didn't thought that this is a
> >> problem. Don't ask me why I can't understand it myself right now.
> >>
> >> Anyway, I moved the unregistering of the files into a separate function,
> >> which is now called from radeon_device_fini instead of
> >> radeon_debugfs_cleanup. That seems to work fine, at least if I haven't
> >> missed something else.
> >>
> >> I also merged your indention fixes and the fix for the never allocated
> >> semaphores and pushed the result into my public repository
> >> (http://cgit.freedesktop.org/~deathsimple/linux), so please take another
> >> look at it.
> >
> > I've got a few other patches to enable further functionality in the
> > mring patches.
> > - per ring fence interrupts
> > - add some additional ring fields to better handle different ring types
> >
> > http://people.freedesktop.org/~agd5f/mrings/
> >
> 
> FYI, I updated these later last night.
> 
> Alex
> 

OK, so I reviewed the patch series. Christian, please keep the v2, v3, ...
information; I find it useful. I put the updated patches at
http://people.freedesktop.org/~glisse/mrings/

There are a couple of fixes there (indentation), and I also changed the testing
parameter into a bit flag, which makes our life easier when we want
to test only the semaphore stuff and not the bo moves.
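
For illustration, a minimal sketch of what such a bit-flag testing parameter
could look like (the flag values and the semaphore-test helper name are
assumptions, not necessarily what the patch does):

/* Sketch only -- flag values and the semaphore test helper are assumptions. */
#define RADEON_TEST_BO_MOVES    (1 << 0)
#define RADEON_TEST_SEMAPHORES  (1 << 1)

static void radeon_run_tests(struct radeon_device *rdev)
{
        if (radeon_testing & RADEON_TEST_BO_MOVES)
                radeon_test_moves(rdev);        /* existing bo move test */
        if (radeon_testing & RADEON_TEST_SEMAPHORES)
                radeon_test_ring_sync(rdev);    /* semaphore sync test (name assumed) */
}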

Cheers,
Jerome


Re: RFC: Radeon multi ring support branch

2011-11-18 Thread Jerome Glisse
On Fri, Nov 18, 2011 at 09:21:50AM -0500, Jerome Glisse wrote:
> 2011/11/18 Christian König :
> > On 17.11.2011 17:58, Jerome Glisse wrote:
> >>
> >> 2011/11/17 Christian König:
> >>>
> >>> On 16.11.2011 01:24, Jerome Glisse wrote:
> 
>  Well as we don't specify on which value semaphore should wait on, i am
>  prety sure the first ring to increment the semaphore will unblock all
>  waiter. So if you have ring1 that want to wait on ring2 and ring3 as
>  soon as
>  ring2 or ring3 is done ring1 will go one while either ring2 or ring3
>  might
>  not be done. I will test that tomorrow but from doc i have it seems so.
>  Thus
>  it will be broken with more than one ring, that would mean you have to
>  allocate one semaphore for each ring couple you want to synchronize.
>  Note
>  that the usual case will likely be sync btw 2 ring.
> >>>
> >>> Good point, but I played with it a bit more today and it is just behaving
> >>> as
> >>> I thought it would be. A single signal command will just unblock a single
> >>> waiter, even if there are multiple waiters currently for this semaphore,
> >>> the
> >>> only thing you can't tell is which waiter will come first.
> >>
> >> I am not talking about multiple waiter but about single waiter on
> >> multilple
> >> signaler. Current implementation allocate one and only one semaphore
> >> not matter how much ring you want to synchronize with. Which means
> >> the single waiter will be unblock once only a single ring signal, which is
> >> broken if you want to wait for several rings.
> >
> > Wait a moment, having a single semaphore doesn't means that there is only a
> > single waiter, just take a look at the code again:
> >
> >        for (i = 0; i < RADEON_NUM_RINGS; ++i) {
> >                /* no need to sync to our own or unused rings */
> >                if (i == p->ring || !p->sync_to_ring[i])
> >                        continue;
> >
> >                if (!p->ib->fence->semaphore) {
> >                        r = radeon_semaphore_create(p->rdev,
> > &p->ib->fence->semaphore);
> >                        if (r)
> >                                return r;
> >                }
> >
> >                radeon_semaphore_emit_signal(p->rdev, i,
> > p->ib->fence->semaphore);
> >                radeon_semaphore_emit_wait(p->rdev, p->ring,
> > p->ib->fence->semaphore); <
> >        }
> >
> > It is allocating a single semaphore, thats correct, but at the same time it
> > emits multiple wait commands. So if we want to synchronize with just one
> > ring, we only wait once on the semaphore, but if we want to synchronize with
> > N rings we also wait N times on the same semaphore. Since we don't really
> > care about the order in which the signal operations arrive that is pretty
> > much ok, because when execution continues after the last wait command it has
> > been made sure that every signal operation has been fired. The extended
> > semaphore test in my repository is also checking for this condition, and it
> > really seems to work fine.
> 
> So how does the wait logic works, that's what i am asking.
> If semaphore block wait for the addr to be non zero than it doesn't work
> If semaphore block wait block until it get a signal than it should work 
> thought
> there is a possible race issue in hw if the signal ring wakeup the semaphore
> block at the same time do the semaphore block consider this as a single
> signal or as 2 signal.
> 
> radeon_semaphore_create(&p->ib->fence->semaphore)
> radeon_semaphore_emit_signal(ringA, &p->ib->fence->semaphore)
> radeon_semaphore_emit_signal(ringB, &p->ib->fence->semaphore)
> radeon_semaphore_emit_wait(ringC, &p->ib->fence->semaphore)
> radeon_semaphore_emit_wait(ringC, &p->ib->fence->semaphore)
> 
> ringC will stall until either ringA or ringB signal if it only wait on non 
> zero
> 
> Code you put in you repo yesterday tested following case:
> radeon_semaphore_emit_wait(ringA, &p->ib->fence->semaphore)
> radeon_semaphore_emit_wait(ringB, &p->ib->fence->semaphore)
> radeon_semaphore_emit_signal(ringC, &p->ib->fence->semaphore)
> 
> Which is fine, i don't argue on such use case. I argue on :
> radeon_semaphore_emit_signal(ringA, &p->ib->fence->semaphore)
> radeon_semaphore_emit_signal(ringB, &p->ib->fence->semaphore)
> radeon_semaphore_emit_wait(ringC, &p->ib->fence->semaphore)
> 
> What i think would be safer is :
> radeon_semaphore_emit_signal(ringA, semaphore_ringa_ringc)
> radeon_semaphore_emit_signal(ringB, semaphore_ringb_ringc)
> radeon_semaphore_emit_wait(ringC,semaphore_ringa_ringc)
> radeon_semaphore_emit_wait(ringC,semaphore_ringb_ringc)
> 
> Which wouldn't be affected by any possible race.
> 
> >>> I should also note that the current algorithm will just emit multiple
> >>> wait
> >>> operations to a single ring, and spread the signal operations to all
> >>> other
> >>> rings we are interested in. That isn't very efficient, but should indeed
> >>> work quite fine.

Ok so i have

Re: RFC: Radeon multi ring support branch

2011-11-18 Thread Jerome Glisse
2011/11/18 Christian König :
> On 17.11.2011 17:58, Jerome Glisse wrote:
>>
>> 2011/11/17 Christian König:
>>>
>>> On 16.11.2011 01:24, Jerome Glisse wrote:

 Well as we don't specify on which value semaphore should wait on, i am
 prety sure the first ring to increment the semaphore will unblock all
 waiter. So if you have ring1 that want to wait on ring2 and ring3 as
 soon as
 ring2 or ring3 is done ring1 will go one while either ring2 or ring3
 might
 not be done. I will test that tomorrow but from doc i have it seems so.
 Thus
 it will be broken with more than one ring, that would mean you have to
 allocate one semaphore for each ring couple you want to synchronize.
 Note
 that the usual case will likely be sync btw 2 ring.
>>>
>>> Good point, but I played with it a bit more today and it is just behaving
>>> as
>>> I thought it would be. A single signal command will just unblock a single
>>> waiter, even if there are multiple waiters currently for this semaphore,
>>> the
>>> only thing you can't tell is which waiter will come first.
>>
>> I am not talking about multiple waiter but about single waiter on
>> multilple
>> signaler. Current implementation allocate one and only one semaphore
>> not matter how much ring you want to synchronize with. Which means
>> the single waiter will be unblock once only a single ring signal, which is
>> broken if you want to wait for several rings.
>
> Wait a moment, having a single semaphore doesn't means that there is only a
> single waiter, just take a look at the code again:
>
>        for (i = 0; i < RADEON_NUM_RINGS; ++i) {
>                /* no need to sync to our own or unused rings */
>                if (i == p->ring || !p->sync_to_ring[i])
>                        continue;
>
>                if (!p->ib->fence->semaphore) {
>                        r = radeon_semaphore_create(p->rdev,
> &p->ib->fence->semaphore);
>                        if (r)
>                                return r;
>                }
>
>                radeon_semaphore_emit_signal(p->rdev, i,
> p->ib->fence->semaphore);
>                radeon_semaphore_emit_wait(p->rdev, p->ring,
> p->ib->fence->semaphore); <
>        }
>
> It is allocating a single semaphore, thats correct, but at the same time it
> emits multiple wait commands. So if we want to synchronize with just one
> ring, we only wait once on the semaphore, but if we want to synchronize with
> N rings we also wait N times on the same semaphore. Since we don't really
> care about the order in which the signal operations arrive that is pretty
> much ok, because when execution continues after the last wait command it has
> been made sure that every signal operation has been fired. The extended
> semaphore test in my repository is also checking for this condition, and it
> really seems to work fine.

So how does the wait logic work? That's what I am asking.
If the semaphore block waits for the address to be non-zero, then it doesn't work.
If the semaphore block blocks until it gets a signal, then it should work, though
there is a possible race in hardware: if two signaling rings wake up the semaphore
block at the same time, does the semaphore block consider this a single
signal or two signals?

radeon_semaphore_create(&p->ib->fence->semaphore)
radeon_semaphore_emit_signal(ringA, &p->ib->fence->semaphore)
radeon_semaphore_emit_signal(ringB, &p->ib->fence->semaphore)
radeon_semaphore_emit_wait(ringC, &p->ib->fence->semaphore)
radeon_semaphore_emit_wait(ringC, &p->ib->fence->semaphore)

ringC will stall only until either ringA or ringB signals, if the wait only checks for a non-zero value.

The code you put in your repo yesterday tested the following case:
radeon_semaphore_emit_wait(ringA, &p->ib->fence->semaphore)
radeon_semaphore_emit_wait(ringB, &p->ib->fence->semaphore)
radeon_semaphore_emit_signal(ringC, &p->ib->fence->semaphore)

Which is fine; I don't argue about that use case. What I argue about is:
radeon_semaphore_emit_signal(ringA, &p->ib->fence->semaphore)
radeon_semaphore_emit_signal(ringB, &p->ib->fence->semaphore)
radeon_semaphore_emit_wait(ringC, &p->ib->fence->semaphore)

What I think would be safer is:
radeon_semaphore_emit_signal(ringA, semaphore_ringa_ringc)
radeon_semaphore_emit_signal(ringB, semaphore_ringb_ringc)
radeon_semaphore_emit_wait(ringC, semaphore_ringa_ringc)
radeon_semaphore_emit_wait(ringC, semaphore_ringb_ringc)

Which wouldn't be affected by any possible race.
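
For illustration, a minimal sketch of that per-ring-pair variant inside the cs
sync loop, reusing the helpers from the code quoted above; the sem[] array on
the fence is an assumption, not the current code:

/* One dedicated semaphore per (other ring, p->ring) pair -- sketch only. */
for (i = 0; i < RADEON_NUM_RINGS; ++i) {
        /* no need to sync to our own or unused rings */
        if (i == p->ring || !p->sync_to_ring[i])
                continue;

        r = radeon_semaphore_create(p->rdev, &p->ib->fence->sem[i]);
        if (r)
                return r;

        radeon_semaphore_emit_signal(p->rdev, i, p->ib->fence->sem[i]);
        radeon_semaphore_emit_wait(p->rdev, p->ring, p->ib->fence->sem[i]);
}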

>>> I should also note that the current algorithm will just emit multiple
>>> wait
>>> operations to a single ring, and spread the signal operations to all
>>> other
>>> rings we are interested in. That isn't very efficient, but should indeed
>>> work quite fine.
>>
>> Well the cs patch you posted don't do that. It create one semaphore
>> and emit a wait with that semaphore and emit signal to all ring. That
>> was my point. But i agree that creating a semaphore for each ring
>> couple you want to wait for will work.
>>
 After retesting the firs

Re: RFC: Radeon multi ring support branch

2011-11-18 Thread Alex Deucher
2011/11/17 Alex Deucher :
> 2011/11/17 Christian König :
>> On 16.11.2011 01:24, Jerome Glisse wrote:
>>>
>>> Well as we don't specify on which value semaphore should wait on, i am
>>> prety sure the first ring to increment the semaphore will unblock all
>>> waiter. So if you have ring1 that want to wait on ring2 and ring3 as soon as
>>> ring2 or ring3 is done ring1 will go one while either ring2 or ring3 might
>>> not be done. I will test that tomorrow but from doc i have it seems so. Thus
>>> it will be broken with more than one ring, that would mean you have to
>>> allocate one semaphore for each ring couple you want to synchronize. Note
>>> that the usual case will likely be sync btw 2 ring.
>>
>> Good point, but I played with it a bit more today and it is just behaving as
>> I thought it would be. A single signal command will just unblock a single
>> waiter, even if there are multiple waiters currently for this semaphore, the
>> only thing you can't tell is which waiter will come first.
>>
>> I should also note that the current algorithm will just emit multiple wait
>> operations to a single ring, and spread the signal operations to all other
>> rings we are interested in. That isn't very efficient, but should indeed
>> work quite fine.
>>
>>> After retesting the first patch  drm/radeon: fix debugfs handling is NAK
>>> a complete no go.
>>>
>>> Issue is that radeon_debugfs_cleanup is call after rdev is free. This
>>> is why i used a static array. I forgot about that, i should have put a
>>> comment. I guess you built your kernel without debugfs or that you
>>> didn't tested to reload the module.
>>
>> Mhm, I have tested it, seen the crash, and didn't thought that this is a
>> problem. Don't ask me why I can't understand it myself right now.
>>
>> Anyway, I moved the unregistering of the files into a separate function,
>> which is now called from radeon_device_fini instead of
>> radeon_debugfs_cleanup. That seems to work fine, at least if I haven't
>> missed something else.
>>
>> I also merged your indention fixes and the fix for the never allocated
>> semaphores and pushed the result into my public repository
>> (http://cgit.freedesktop.org/~deathsimple/linux), so please take another
>> look at it.
>
> I've got a few other patches to enable further functionality in the
> mring patches.
> - per ring fence interrupts
> - add some additional ring fields to better handle different ring types
>
> http://people.freedesktop.org/~agd5f/mrings/
>

FYI, I updated these later last night.

Alex

> Alex
>


Re: RFC: Radeon multi ring support branch

2011-11-18 Thread Christian König

On 17.11.2011 17:58, Jerome Glisse wrote:

2011/11/17 Christian König:

On 16.11.2011 01:24, Jerome Glisse wrote:

Well as we don't specify on which value semaphore should wait on, i am
prety sure the first ring to increment the semaphore will unblock all
waiter. So if you have ring1 that want to wait on ring2 and ring3 as soon as
ring2 or ring3 is done ring1 will go one while either ring2 or ring3 might
not be done. I will test that tomorrow but from doc i have it seems so. Thus
it will be broken with more than one ring, that would mean you have to
allocate one semaphore for each ring couple you want to synchronize. Note
that the usual case will likely be sync btw 2 ring.

Good point, but I played with it a bit more today and it is just behaving as
I thought it would be. A single signal command will just unblock a single
waiter, even if there are multiple waiters currently for this semaphore, the
only thing you can't tell is which waiter will come first.

I am not talking about multiple waiter but about single waiter on multilple
signaler. Current implementation allocate one and only one semaphore
not matter how much ring you want to synchronize with. Which means
the single waiter will be unblock once only a single ring signal, which is
broken if you want to wait for several rings.
Wait a moment, having a single semaphore doesn't mean that there is
only a single waiter; just take a look at the code again:


        for (i = 0; i < RADEON_NUM_RINGS; ++i) {
                /* no need to sync to our own or unused rings */
                if (i == p->ring || !p->sync_to_ring[i])
                        continue;

                if (!p->ib->fence->semaphore) {
                        r = radeon_semaphore_create(p->rdev,
                                                    &p->ib->fence->semaphore);
                        if (r)
                                return r;
                }

                radeon_semaphore_emit_signal(p->rdev, i,
                                             p->ib->fence->semaphore);
                radeon_semaphore_emit_wait(p->rdev, p->ring,
                                           p->ib->fence->semaphore); <
        }

It is allocating a single semaphore, that's correct, but at the same time
it emits multiple wait commands. So if we want to synchronize with just
one ring, we only wait once on the semaphore, but if we want to
synchronize with N rings we also wait N times on the same semaphore.
Since we don't really care about the order in which the signal
operations arrive, that is pretty much OK, because when execution
continues after the last wait command it is guaranteed that every
signal operation has fired. The extended semaphore test in my
repository also checks for this condition, and it really seems to
work fine.
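
For example, when the GFX ring needs to sync to two compute rings, the loop
above ends up emitting the following (with a single shared semaphore "sem"
and the counter semantics described elsewhere in this thread):

    compute ring 1:  signal(sem)   -> counter++
    compute ring 2:  signal(sem)   -> counter++
    gfx ring:        wait(sem)     -> blocks until counter > 0, then counter--
    gfx ring:        wait(sem)     -> consumes the second signal

The GFX ring only passes the second wait after both compute rings have
signaled, regardless of the order in which they do so.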



I should also note that the current algorithm will just emit multiple wait
operations to a single ring, and spread the signal operations to all other
rings we are interested in. That isn't very efficient, but should indeed
work quite fine.

Well the cs patch you posted don't do that. It create one semaphore
and emit a wait with that semaphore and emit signal to all ring. That
was my point. But i agree that creating a semaphore for each ring
couple you want to wait for will work.


After retesting the first patch  drm/radeon: fix debugfs handling is NAK
a complete no go.

Issue is that radeon_debugfs_cleanup is call after rdev is free. This
is why i used a static array. I forgot about that, i should have put a
comment. I guess you built your kernel without debugfs or that you
didn't tested to reload the module.

Mhm, I have tested it, seen the crash, and didn't thought that this is a
problem. Don't ask me why I can't understand it myself right now.

Anyway, I moved the unregistering of the files into a separate function,
which is now called from radeon_device_fini instead of
radeon_debugfs_cleanup. That seems to work fine, at least if I haven't
missed something else.

Please don't touch the debugfs stuff, it's fine the way it is. No need to
change anything there.
At least for me there is a need to change something, because using a
static variable causes the debugfs files to only appear for the first
card in the system. And in my configuration the first card is always the
onboard APU, so I wondered for half a day why a locked-up IB didn't
contain any data, until I finally realized that I was looking at the
wrong GPU in the system.



I also merged your indention fixes and the fix for the never allocated
semaphores and pushed the result into my public repository
(http://cgit.freedesktop.org/~deathsimple/linux), so please take another
look at it.


I will take a look

Thanks, but please be aware that I'm also going to integrate Alex's
patches on top of it today and give it another test round, just to
make sure that I'm not doing anything stupid with the debugfs code. So
if you haven't started already, please wait a bit more.


Thanks,
Christian.

Re: RFC: Radeon multi ring support branch

2011-11-17 Thread Alex Deucher
2011/11/17 Christian König :
> On 16.11.2011 01:24, Jerome Glisse wrote:
>>
>> Well as we don't specify on which value semaphore should wait on, i am
>> prety sure the first ring to increment the semaphore will unblock all
>> waiter. So if you have ring1 that want to wait on ring2 and ring3 as soon as
>> ring2 or ring3 is done ring1 will go one while either ring2 or ring3 might
>> not be done. I will test that tomorrow but from doc i have it seems so. Thus
>> it will be broken with more than one ring, that would mean you have to
>> allocate one semaphore for each ring couple you want to synchronize. Note
>> that the usual case will likely be sync btw 2 ring.
>
> Good point, but I played with it a bit more today and it is just behaving as
> I thought it would be. A single signal command will just unblock a single
> waiter, even if there are multiple waiters currently for this semaphore, the
> only thing you can't tell is which waiter will come first.
>
> I should also note that the current algorithm will just emit multiple wait
> operations to a single ring, and spread the signal operations to all other
> rings we are interested in. That isn't very efficient, but should indeed
> work quite fine.
>
>> After retesting the first patch  drm/radeon: fix debugfs handling is NAK
>> a complete no go.
>>
>> Issue is that radeon_debugfs_cleanup is call after rdev is free. This
>> is why i used a static array. I forgot about that, i should have put a
>> comment. I guess you built your kernel without debugfs or that you
>> didn't tested to reload the module.
>
> Mhm, I have tested it, seen the crash, and didn't thought that this is a
> problem. Don't ask me why I can't understand it myself right now.
>
> Anyway, I moved the unregistering of the files into a separate function,
> which is now called from radeon_device_fini instead of
> radeon_debugfs_cleanup. That seems to work fine, at least if I haven't
> missed something else.
>
> I also merged your indention fixes and the fix for the never allocated
> semaphores and pushed the result into my public repository
> (http://cgit.freedesktop.org/~deathsimple/linux), so please take another
> look at it.

I've got a few other patches to enable further functionality in the
mring patches.
- per ring fence interrupts
- add some additional ring fields to better handle different ring types

http://people.freedesktop.org/~agd5f/mrings/

Alex


Re: RFC: Radeon multi ring support branch

2011-11-17 Thread Jerome Glisse
2011/11/17 Christian König :
> On 16.11.2011 01:24, Jerome Glisse wrote:
>>
>> Well as we don't specify on which value semaphore should wait on, i am
>> prety sure the first ring to increment the semaphore will unblock all
>> waiter. So if you have ring1 that want to wait on ring2 and ring3 as soon as
>> ring2 or ring3 is done ring1 will go one while either ring2 or ring3 might
>> not be done. I will test that tomorrow but from doc i have it seems so. Thus
>> it will be broken with more than one ring, that would mean you have to
>> allocate one semaphore for each ring couple you want to synchronize. Note
>> that the usual case will likely be sync btw 2 ring.
>
> Good point, but I played with it a bit more today and it is just behaving as
> I thought it would be. A single signal command will just unblock a single
> waiter, even if there are multiple waiters currently for this semaphore, the
> only thing you can't tell is which waiter will come first.

I am not talking about multiple waiters but about a single waiter with multiple
signalers. The current implementation allocates one and only one semaphore,
no matter how many rings you want to synchronize with. Which means
the single waiter will be unblocked as soon as only a single ring signals, which is
broken if you want to wait for several rings.

> I should also note that the current algorithm will just emit multiple wait
> operations to a single ring, and spread the signal operations to all other
> rings we are interested in. That isn't very efficient, but should indeed
> work quite fine.

Well, the cs patch you posted doesn't do that. It creates one semaphore,
emits one wait on that semaphore, and emits a signal to every ring. That
was my point. But I agree that creating a semaphore for each ring
pair you want to wait on will work.

>> After retesting the first patch  drm/radeon: fix debugfs handling is NAK
>> a complete no go.
>>
>> Issue is that radeon_debugfs_cleanup is call after rdev is free. This
>> is why i used a static array. I forgot about that, i should have put a
>> comment. I guess you built your kernel without debugfs or that you
>> didn't tested to reload the module.
>
> Mhm, I have tested it, seen the crash, and didn't thought that this is a
> problem. Don't ask me why I can't understand it myself right now.
>
> Anyway, I moved the unregistering of the files into a separate function,
> which is now called from radeon_device_fini instead of
> radeon_debugfs_cleanup. That seems to work fine, at least if I haven't
> missed something else.

Please don't touch the debugfs stuff, it's fine the way it is. No need to
change anything there.

> I also merged your indention fixes and the fix for the never allocated
> semaphores and pushed the result into my public repository
> (http://cgit.freedesktop.org/~deathsimple/linux), so please take another
> look at it.
>

I will take a look

Cheers,
Jerome


Re: RFC: Radeon multi ring support branch

2011-11-17 Thread Christian König

On 16.11.2011 01:24, Jerome Glisse wrote:
Well as we don't specify on which value semaphore should wait on, i am 
prety sure the first ring to increment the semaphore will unblock all 
waiter. So if you have ring1 that want to wait on ring2 and ring3 as 
soon as ring2 or ring3 is done ring1 will go one while either ring2 or 
ring3 might not be done. I will test that tomorrow but from doc i have 
it seems so. Thus it will be broken with more than one ring, that 
would mean you have to allocate one semaphore for each ring couple you 
want to synchronize. Note that the usual case will likely be sync btw 
2 ring.


Good point, but I played with it a bit more today and it is behaving
just as I thought it would. A single signal command will only
unblock a single waiter, even if there are currently multiple waiters
on this semaphore; the only thing you can't tell is which waiter will
come first.


I should also note that the current algorithm will just emit multiple 
wait operations to a single ring, and spread the signal operations to 
all other rings we are interested in. That isn't very efficient, but 
should indeed work quite fine.



After retesting the first patch  drm/radeon: fix debugfs handling is NAK
a complete no go.

Issue is that radeon_debugfs_cleanup is call after rdev is free. This
is why i used a static array. I forgot about that, i should have put a
comment. I guess you built your kernel without debugfs or that you
didn't tested to reload the module.


Mhm, I did test it, saw the crash, and didn't think that it was a
problem. Don't ask me why; I can't understand it myself right now.


Anyway, I moved the unregistering of the files into a separate function, 
which is now called from radeon_device_fini instead of 
radeon_debugfs_cleanup. That seems to work fine, at least if I haven't 
missed something else.


I also merged your indentation fixes and the fix for the never-allocated
semaphores and pushed the result into my public repository
(http://cgit.freedesktop.org/~deathsimple/linux), so please take another
look at it.


Thanks,
Christian.



Re: RFC: Radeon multi ring support branch

2011-11-15 Thread Jerome Glisse
2011/11/15 Christian König :
> On 15.11.2011 20:32, Jerome Glisse wrote:
>>
>> On Sat, Oct 29, 2011 at 03:00:28PM +0200, Christian König wrote:
>>>
>>> Hello everybody,
>>>
>>> to support multiple compute rings, async DMA engines and UVD we need
>>> to teach the radeon kernel module how to sync buffers between
>>> different rings and make some changes to the command submission
>>> ioctls.
>>>
>>> Since we can't release any documentation about async DMA or UVD
>>> (yet), my current branch concentrates on getting the additional
>>> compute rings on cayman running. Unfortunately those rings have
>>> hardware bugs that can't be worked around, so they are actually not
>>> very useful in a production environment, but they should do quite
>>> well for this testing purpose.
>>>
>>> The branch can be found here:
>>> http://cgit.freedesktop.org/~deathsimple/linux/log/
>>>
>>> Since some of the patches are quite intrusive, constantly rebaseing
>>> them could get a bit painful. So I would like to see most of the
>>> stuff included into drm-next, even if we don't make use of the new
>>> functionality right now.
>>>
>>> Comments welcome,
>>> Christian.
>>
>> So i have been looking back at all this and now there is somethings
>> puzzling me. So if semaphore wait for a non null value at gpu address
>> provided in the packet than the current implementation for the cs
>> ioctl doesn't work when there is more than 2 rings to sync.
>
> Semaphores are counters, so each signal operation is atomically incrementing
> the counter, while each wait operation is (atomically) checking if the
> counter is above zero and if it is decrement it. So you can signal/wait on
> the same address multiple times.

Yup, I did understand that right.

>>
>> As it will use only one semaphore so first ring to finish will
>> mark the semaphore as done even if there is still other ring not
>> done.
>
> Nope, each wait operation will wait for a separate signal operation (at
> least I think so).

Well, as we don't specify which value the semaphore should wait on, I am
pretty sure the first ring to increment the semaphore will unblock all
waiters. So if you have ring1 that wants to wait on ring2 and ring3, as
soon as ring2 or ring3 is done ring1 will go on, while the other of ring2 and
ring3 might not be done. I will test that tomorrow, but from the doc I have
it seems so. Thus it will be broken with more than one ring, which
would mean you have to allocate one semaphore for each ring pair you
want to synchronize. Note that the usual case will likely be syncing between
2 rings.

>>
>> This all make me wonder if some change to cs ioctl would make
>> all this better. So idea of semaphore is to wait for some other
>> ring to finish something. So let say we have following scenario:
>> Application submit following to ring1: csA, csB
>> Application now submit to ring2: cs1, cs2
>> And application want csA to be done for cs1 and csB to be done
>> for cs2.
>>
>> To achieve such usage pattern we would need to return fence seq
>> or similar from the cs ioctl. So user application would know
>> ringid+fence_seq for csA&  csB and provide this when scheduling
>> cs1&  cs2. Here i am assuming MEM_WRITE/WAIT_REG_MEM packet
>> are as good as MEM_SEMAPHORE packet. Ie the semaphore packet
>> doesn't give us much more than MEM_WRITE/WAIT_REG_MEM would.
>>
>> To achieve that each ring got it's fence scratch addr where to
>> write seq number. And then we use WAIT_REG_MEM on this addr
>> and with the specific seq for the other ring that needs
>> synchronization. This would simplify the semaphore code as
>> we wouldn't need somethings new beside helper function and
>> maybe extending the fence structure.
>
> I played around with the same Idea before implementing the whole semaphore
> stuff, but the killer argument against it is that not all rings support the
> WAIT_REG_MEM command.
>
> Also the semaphores are much more efficient than the WAIT_REG_MEM command,
> because all semaphore commands from the different rings are send to a
> central semaphore block, so that constant polling, and with it constant
> memory access can be avoided. Additional to that the WAIT_REG_MEM command
> has a minimum latency of Wait_Interval*16 clock cycles, while semaphore need
> 4 clock cycles for the command and 4 clock cycles for the result, so it
> definitely has a much lower latency.

Yup, that was my guess too, that the semaphore block optimizes things.

> We should also keep in mind that the semaphore block is not only capable to
> sync between different rings inside a single GPU, but can also communicate
> with another semaphore block in a second GPU. So it is a key part in a multi
> GPU environment.
>
>> Anyway i put updating ring patch at :
>> http://people.freedesktop.org/~glisse/mrings/
>> It rebased on top of linus tree and it has several space
>> indentation fixes and also a fix for no semaphore allocated
>> issue (patch 5)
>>
> Thanks, I will try to integrate the changes tomorrow.
>

After retesting the first patch  drm/r

Re: RFC: Radeon multi ring support branch

2011-11-15 Thread Christian König

On 15.11.2011 20:32, Jerome Glisse wrote:

On Sat, Oct 29, 2011 at 03:00:28PM +0200, Christian König wrote:

Hello everybody,

to support multiple compute rings, async DMA engines and UVD we need
to teach the radeon kernel module how to sync buffers between
different rings and make some changes to the command submission
ioctls.

Since we can't release any documentation about async DMA or UVD
(yet), my current branch concentrates on getting the additional
compute rings on cayman running. Unfortunately those rings have
hardware bugs that can't be worked around, so they are actually not
very useful in a production environment, but they should do quite
well for this testing purpose.

The branch can be found here:
http://cgit.freedesktop.org/~deathsimple/linux/log/

Since some of the patches are quite intrusive, constantly rebaseing
them could get a bit painful. So I would like to see most of the
stuff included into drm-next, even if we don't make use of the new
functionality right now.

Comments welcome,
Christian.

So i have been looking back at all this and now there is somethings
puzzling me. So if semaphore wait for a non null value at gpu address
provided in the packet than the current implementation for the cs
ioctl doesn't work when there is more than 2 rings to sync.
Semaphores are counters, so each signal operation atomically
increments the counter, while each wait operation (atomically)
checks whether the counter is above zero and, if it is, decrements it. So you
can signal/wait on the same address multiple times.
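
As a purely illustrative software model of that behaviour (not the hardware or
the driver code):

/* Toy model of the semaphore-block semantics described above: signal
 * atomically increments the counter, wait blocks until the counter is
 * above zero and then atomically decrements it. */
#include <stdatomic.h>

struct hw_semaphore {
        atomic_int counter;
};

static void semaphore_signal(struct hw_semaphore *sem)
{
        atomic_fetch_add(&sem->counter, 1);
}

static void semaphore_wait(struct hw_semaphore *sem)
{
        for (;;) {
                int v = atomic_load(&sem->counter);
                if (v > 0 &&
                    atomic_compare_exchange_weak(&sem->counter, &v, v - 1))
                        return; /* consumed one signal */
        }
}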




As it will use only one semaphore so first ring to finish will
mark the semaphore as done even if there is still other ring not
done.
Nope, each wait operation will wait for a separate signal operation (at 
least I think so).


This all make me wonder if some change to cs ioctl would make
all this better. So idea of semaphore is to wait for some other
ring to finish something. So let say we have following scenario:
Application submit following to ring1: csA, csB
Application now submit to ring2: cs1, cs2
And application want csA to be done for cs1 and csB to be done
for cs2.

To achieve such usage pattern we would need to return fence seq
or similar from the cs ioctl. So user application would know
ringid+fence_seq for csA&  csB and provide this when scheduling
cs1&  cs2. Here i am assuming MEM_WRITE/WAIT_REG_MEM packet
are as good as MEM_SEMAPHORE packet. Ie the semaphore packet
doesn't give us much more than MEM_WRITE/WAIT_REG_MEM would.

To achieve that each ring got it's fence scratch addr where to
write seq number. And then we use WAIT_REG_MEM on this addr
and with the specific seq for the other ring that needs
synchronization. This would simplify the semaphore code as
we wouldn't need somethings new beside helper function and
maybe extending the fence structure.
I played around with the same idea before implementing the whole
semaphore stuff, but the killer argument against it is that not all
rings support the WAIT_REG_MEM command.


Also, the semaphores are much more efficient than the WAIT_REG_MEM
command, because all semaphore commands from the different rings are
sent to a central semaphore block, so constant polling, and with it
constant memory access, can be avoided. In addition, the
WAIT_REG_MEM command has a minimum latency of Wait_Interval*16 clock
cycles, while a semaphore needs 4 clock cycles for the command and 4 clock
cycles for the result, so it definitely has a much lower latency.


We should also keep in mind that the semaphore block is not only capable
of syncing between different rings inside a single GPU, but can also
communicate with another semaphore block in a second GPU. So it is a key
part of a multi-GPU environment.



Anyway i put updating ring patch at :
http://people.freedesktop.org/~glisse/mrings/
It rebased on top of linus tree and it has several space
indentation fixes and also a fix for no semaphore allocated
issue (patch 5)


Thanks, I will try to integrate the changes tomorrow.

Christian.


Re: RFC: Radeon multi ring support branch

2011-11-15 Thread Jerome Glisse
On Sat, Oct 29, 2011 at 03:00:28PM +0200, Christian König wrote:
> Hello everybody,
> 
> to support multiple compute rings, async DMA engines and UVD we need
> to teach the radeon kernel module how to sync buffers between
> different rings and make some changes to the command submission
> ioctls.
> 
> Since we can't release any documentation about async DMA or UVD
> (yet), my current branch concentrates on getting the additional
> compute rings on cayman running. Unfortunately those rings have
> hardware bugs that can't be worked around, so they are actually not
> very useful in a production environment, but they should do quite
> well for this testing purpose.
> 
> The branch can be found here:
> http://cgit.freedesktop.org/~deathsimple/linux/log/
> 
> Since some of the patches are quite intrusive, constantly rebaseing
> them could get a bit painful. So I would like to see most of the
> stuff included into drm-next, even if we don't make use of the new
> functionality right now.
> 
> Comments welcome,
> Christian.

So I have been looking back at all this and now there is something
puzzling me. If the semaphore waits for a non-null value at the GPU address
provided in the packet, then the current implementation for the cs
ioctl doesn't work when there are more than 2 rings to sync.

http://cgit.freedesktop.org/~deathsimple/linux/commit/?h=multi-ring-testing&id=bae372811c697a889ff0cf9128f52fe914d0fe1b

As it will use only one semaphore, the first ring to finish will
mark the semaphore as done even if another ring is still not
done.

This all makes me wonder if some change to the cs ioctl would make
all this better. The idea of a semaphore is to wait for some other
ring to finish something. So let's say we have the following scenario:
the application submits the following to ring1: csA, csB;
the application then submits to ring2: cs1, cs2;
and the application wants csA to be done before cs1 and csB to be done
before cs2.

To achieve such a usage pattern we would need to return a fence seq
or similar from the cs ioctl. The user application would then know
ringid+fence_seq for csA & csB and provide this when scheduling
cs1 & cs2. Here I am assuming the MEM_WRITE/WAIT_REG_MEM packets
are as good as the MEM_SEMAPHORE packet, i.e. the semaphore packet
doesn't give us much more than MEM_WRITE/WAIT_REG_MEM would.

To achieve that, each ring gets its own fence scratch address where it
writes its seq number. Then we use WAIT_REG_MEM on this address
with the specific seq of the other ring that needs
synchronization. This would simplify the semaphore code as
we wouldn't need anything new besides helper functions and
maybe extending the fence structure.
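
A rough sketch of that scheme, with hypothetical helper and field names (this
is not the actual driver interface):

/* Sketch only -- helper and field names are assumptions.  Make 'ring' wait
 * until 'other_ring' has written at least 'seq' to its fence scratch
 * address, using a WAIT_REG_MEM packet with a >= comparison. */
static void sync_to_fence(struct radeon_device *rdev, int ring,
                          int other_ring, uint32_t seq)
{
        uint64_t addr = rdev->fence_drv[other_ring].gpu_addr; /* scratch addr (assumed) */

        emit_wait_reg_mem(rdev, ring, addr, seq); /* hypothetical helper */
}

The cs ioctl would additionally copy the (ring id, fence seq) of each submitted
cs back to userspace, so it can be passed in again when a later cs on another
ring needs to wait for it.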

Anyway, I put updated ring patches at:
http://people.freedesktop.org/~glisse/mrings/
They are rebased on top of Linus' tree and have several space
indentation fixes and also a fix for the never-allocated-semaphore
issue (patch 5).

Cheers,
Jerome


Re: RFC: Radeon multi ring support branch

2011-11-02 Thread Jerome Glisse
On Wed, Nov 02, 2011 at 11:12:42AM +0100, Christian König wrote:
> On 31.10.2011 16:05, Jerome Glisse wrote:
> >On Sat, Oct 29, 2011 at 03:00:28PM +0200, Christian König wrote:
> >>Hello everybody,
> >>
> >>to support multiple compute rings, async DMA engines and UVD we need
> >>to teach the radeon kernel module how to sync buffers between
> >>different rings and make some changes to the command submission
> >>ioctls.
> >>
> >>Since we can't release any documentation about async DMA or UVD
> >>(yet), my current branch concentrates on getting the additional
> >>compute rings on cayman running. Unfortunately those rings have
> >>hardware bugs that can't be worked around, so they are actually not
> >>very useful in a production environment, but they should do quite
> >>well for this testing purpose.
> >>
> >>The branch can be found here:
> >>http://cgit.freedesktop.org/~deathsimple/linux/log/
> >>
> >>Since some of the patches are quite intrusive, constantly rebaseing
> >>them could get a bit painful. So I would like to see most of the
> >>stuff included into drm-next, even if we don't make use of the new
> >>functionality right now.
> >>
> >>Comments welcome,
> >>Christian.
> >So for all patches except the interface change see below
> >Reviewed-by: Jerome Glisse
> >
> >For the interface change, as discussed previously, i believe prio
> >should be a userspace argument, kernel could override it.
> >
> >
> Yeah, I'm still not sure what we should do about the priority.
> 
> Say for example we have 2 processes. Process A is sending compute
> jobs both with high and low priority, while process B is sending
> jobs with only high priority. Unfortunately the jobs send by B
> doesn't utilizes the hardware to its limits, so that even jobs on a
> lower priority rings get their share of the compute resources.
> 
> The effect is that it reverses the priority A wants for its jobs.
> The high priority jobs of A get executed much slower than the low
> priority jobs of A, because B is spamming the hight priority ring
> with its under utilizing jobs.
> 
> In such a situation it would be better to adjust the job scheduling
> a bit so that jobs of process A gets on ring 1 on jobs from B get on
> ring 2, but I have now idea how to detect such a situation. Anyway,
> the primary goal of the different compute rings is to separate
> compute from GFX so that even with big compute jobs running the
> system still stays responsible to user input, so I think adding a
> better scheduling for compute jobs can be done much later.
> 
> Christian.

I think this was already pointed out: my idea was to have the prio
argument in the ioctl and have the kernel able to override it. For instance, I was
thinking the drm master could set, for each process, the
top priority that process can get, so whichever process is drm
master would choose. Or we could have some kind of userspace daemon
that would have special rights from the kernel/drm point of view and would be
able to set the max priority each process gets on the GPU.

But as a first step, just taking the prio as an argument sounds
like what we should do, as right now we won't be doing any kind of
real GPU scheduling.
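
A minimal sketch of that first step (all names here are assumptions, not the
proposed interface): userspace passes a priority with the cs ioctl, and the
kernel clamps it to a per-process maximum that the drm master or a privileged
daemon has configured.

/* Sketch only -- names are assumptions. */
static u32 radeon_cs_effective_prio(struct radeon_fpriv *fpriv, u32 requested_prio)
{
        /* kernel override: never exceed what the drm master allowed */
        return min(requested_prio, fpriv->max_prio);
}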

Cheers,
Jerome


Re: RFC: Radeon multi ring support branch

2011-11-02 Thread Christian König

On 31.10.2011 16:05, Jerome Glisse wrote:

On Sat, Oct 29, 2011 at 03:00:28PM +0200, Christian König wrote:

Hello everybody,

to support multiple compute rings, async DMA engines and UVD we need
to teach the radeon kernel module how to sync buffers between
different rings and make some changes to the command submission
ioctls.

Since we can't release any documentation about async DMA or UVD
(yet), my current branch concentrates on getting the additional
compute rings on cayman running. Unfortunately those rings have
hardware bugs that can't be worked around, so they are actually not
very useful in a production environment, but they should do quite
well for this testing purpose.

The branch can be found here:
http://cgit.freedesktop.org/~deathsimple/linux/log/

Since some of the patches are quite intrusive, constantly rebaseing
them could get a bit painful. So I would like to see most of the
stuff included into drm-next, even if we don't make use of the new
functionality right now.

Comments welcome,
Christian.

So for all patches except the interface change see below
Reviewed-by: Jerome Glisse

For the interface change, as discussed previously, i believe prio
should be a userspace argument, kernel could override it.



Yeah, I'm still not sure what we should do about the priority.

Say for example we have 2 processes. Process A is sending compute jobs
with both high and low priority, while process B is sending jobs with
only high priority. Unfortunately the jobs sent by B don't utilize
the hardware to its limits, so that even jobs on lower priority rings
get their share of the compute resources.

The effect is that it reverses the priority A wants for its jobs. The
high priority jobs of A get executed much more slowly than the low priority
jobs of A, because B is spamming the high priority ring with its
under-utilizing jobs.

In such a situation it would be better to adjust the job scheduling a
bit so that jobs from process A get on ring 1 and jobs from B get on ring
2, but I have no idea how to detect such a situation. Anyway, the
primary goal of the different compute rings is to separate compute from
GFX so that even with big compute jobs running the system still stays
responsive to user input, so I think adding better scheduling for
compute jobs can be done much later.


Christian.


Re: RFC: Radeon multi ring support branch

2011-10-31 Thread Jerome Glisse
On Sat, Oct 29, 2011 at 03:00:28PM +0200, Christian König wrote:
> Hello everybody,
> 
> to support multiple compute rings, async DMA engines and UVD we need
> to teach the radeon kernel module how to sync buffers between
> different rings and make some changes to the command submission
> ioctls.
> 
> Since we can't release any documentation about async DMA or UVD
> (yet), my current branch concentrates on getting the additional
> compute rings on cayman running. Unfortunately those rings have
> hardware bugs that can't be worked around, so they are actually not
> very useful in a production environment, but they should do quite
> well for this testing purpose.
> 
> The branch can be found here:
> http://cgit.freedesktop.org/~deathsimple/linux/log/
> 
> Since some of the patches are quite intrusive, constantly rebaseing
> them could get a bit painful. So I would like to see most of the
> stuff included into drm-next, even if we don't make use of the new
> functionality right now.
> 
> Comments welcome,
> Christian.

So for all patches except the interface change, see below:
Reviewed-by: Jerome Glisse

For the interface change, as discussed previously, I believe prio
should be a userspace argument, which the kernel could override.

Cheers,
Jerome