Re: [RFC 0/5] Add capacity key to fdinfo

2024-05-08 Thread Tvrtko Ursulin



On 03/05/2024 15:28, Alex Deucher wrote:

On Fri, May 3, 2024 at 7:50 AM Tvrtko Ursulin  wrote:

On 02/05/2024 16:00, Alex Deucher wrote:

On Thu, May 2, 2024 at 10:43 AM Tvrtko Ursulin
 wrote:



On 02/05/2024 14:07, Christian König wrote:

On 01.05.24 at 15:27, Tvrtko Ursulin wrote:


Hi Alex,

On 30/04/2024 19:32, Alex Deucher wrote:

On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin 
wrote:


From: Tvrtko Ursulin 

I have noticed AMD GPUs can have more than one "engine" (ring?) of the
same type, but amdgpu is not reporting that in fdinfo using the capacity
engine tag.

This series is therefore an attempt to improve that, but only an RFC
since it is quite likely I got stuff wrong on the first attempt. Or, if
not wrong, it may not be very beneficial in AMD's case.

So I tried to figure out how to count and store the number of instances
of an "engine" type and spotted that it could perhaps be used in more
than one place in the driver. I was more than a little bit confused by
the ip_instance and uapi rings, and then by how rings are assigned to
context entities internally. Anyway.. hopefully it is a simple enough
series to easily spot any such large misses.

End result should be that, assuming two "engine" instances, a client
fully loading one while the other is idle will only be reported as using
50% of that engine type.
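
For reference, a minimal sketch of how a gputop-like tool could fold a
capacity key into its per-client utilisation calculation, assuming the
drm-engine-<class> (busy nanoseconds) and drm-engine-capacity-<class>
keys described in Documentation/gpu/drm-usage-stats.rst; the helper
itself is illustrative only:

    /*
     * Percentage of an engine class used by one client, computed from
     * two fdinfo samples taken delta_ns of wall time apart.  A missing
     * drm-engine-capacity-<class> key is treated as a capacity of one.
     */
    static double client_engine_pct(unsigned long long busy_prev_ns,
                                    unsigned long long busy_now_ns,
                                    unsigned long long delta_ns,
                                    unsigned int capacity)
    {
            if (!capacity)
                    capacity = 1;

            return 100.0 * (double)(busy_now_ns - busy_prev_ns) /
                   ((double)delta_ns * (double)capacity);
    }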


That would only be true if there are multiple instantiations of the IP
on the chip which in most cases is not true.  In most cases there is
one instance of the IP that can be fed from multiple rings. E.g. for
graphics and compute, all of the rings ultimately feed into the same
compute units on the chip.  So if you have a gfx ring and a compute
ring, you can schedule work to them asynchronously, but ultimately
whether they execute serially or in parallel depends on the actual
shader code in the command buffers and the extent to which it can
utilize the available compute units in the shader cores.


This is the same as with Intel/i915. Fdinfo is not intended to provide
utilisation of EUs and such, just how busy the "entities" the kernel
submits to are. So doing something like in this series would make the
reporting more similar between the two drivers.

I think both the 0-800% and the 0-100% range (taking 8-ring compute as an
example) can be misleading for different workloads. Being below 800% in
the former does not mean one can send more work, and the same goes for
being below 100% in the latter.


Yeah, I think that's what Alex is trying to describe. With 8 compute
rings, an 800% load figure is actually incorrect and quite misleading.

Background is that those 8 compute rings won't be active all at the same
time, but rather waiting on each other for resources.

But this "waiting" is unfortunately considered execution time since the
used approach is actually not really capable of separating waiting and
execution time.


Right, so 800% is what gputop could be suggesting today, by virtue of 8
contexts/clients each being able to show 100% if they only use a subset
of the compute units. I was proposing to expose the capacity in fdinfo so
it can be scaled down, and then discussing how both situations have pros
and cons.
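
As a concrete illustration of the trade-off, with made-up numbers: take a
client which keeps exactly one of the 8 compute rings 100% busy over the
sampling period. Without the capacity key a tool reports that client at
100% of the compute engine class even though seven rings' worth of
submission capacity is still free; with the capacity key the same client
reads 100 / 8 = 12.5%, even though submitting more work may not actually
run any faster if the shader already saturates the shared compute units.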


There is also a parallel with the CPU world here and hyper-threading, if
not wider, where "What does 100% actually mean?" is also wishy-washy.

Also note that the reporting of actual time-based values in fdinfo would
not change with this series.

Or, if you can guide me towards how to distinguish real vs fake
parallelism in HW IP blocks, I could modify the series to only add
capacity tags where there are truly independent blocks. That would be
different from i915 though, where I did not bother with that distinction.
(The reason is that assignment of, for instance, EUs to compute "rings"
(command streamers in i915) was supposed to be re-configurable on the
fly, so it did not make sense to try to be super smart in fdinfo.)


Well, exactly, that's the point: we don't really have truly independent
blocks on AMD hardware.

There are things like independent SDMA instances, but those are meant to
be used with, for example, the first instance for uploads and the second
for downloads. When you use both instances for the same job they will
pretty much limit each other because of a single shared resource.


So _never_ multiple instances of the same IP block? No video decode,
encode, anything?


Some chips have multiple encode/decode IP blocks that are actually
separate instances; however, we load balance between them so userspace
sees just one engine.  Also in some cases they are asymmetric (e.g.,
different sets of supported CODECs on each instance).  The driver
handles this by inspecting the command buffer and scheduling on the
appropriate instance based on the requested CODEC.  SDMA also supports
multiple IP blocks that are independent.
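
A purely hypothetical sketch of that kind of codec-aware instance
selection, with made-up names and structures (the real logic lives in
amdgpu's command submission and VCN code and is not reproduced here):

    /* Hypothetical illustration only -- all names below are made up. */
    enum example_codec { EXAMPLE_CODEC_H264, EXAMPLE_CODEC_HEVC, EXAMPLE_CODEC_AV1 };

    struct example_vcn_inst {
            unsigned int codec_mask;        /* bit per enum example_codec */
            unsigned int queued_jobs;       /* crude load metric */
    };

    /*
     * Pick an instance which advertises the codec parsed out of the
     * command buffer, preferring the least loaded one.
     */
    static int example_pick_decode_instance(struct example_vcn_inst *inst,
                                            int num_inst,
                                            enum example_codec codec)
    {
            int i, best = -1;

            for (i = 0; i < num_inst; i++) {
                    if (!(inst[i].codec_mask & (1u << codec)))
                            continue;       /* codec not supported here */
                    if (best < 0 || inst[i].queued_jobs < inst[best].queued_jobs)
                            best = i;
            }

            return best;    /* -1 if no instance supports the codec */
    }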


Similar to i915, just that we don't inspect buffers but expose the
instance capabilities, and userspace is responsible for setting up the
load balancing engine with the correct physical mask.


How do you handle load balancing across applications?


From the uapi side 

Re: [RFC 0/5] Add capacity key to fdinfo

2024-05-03 Thread Tvrtko Ursulin



Similar to i915, just that we don't inspect buffers but expose the
instance capabilities, and userspace is responsible for setting up the
load balancing engine with the correct physical mask.


Anyway, back to the main point - are you interested at all in me adding
the capacity flags to at least the IP blocks which probe as more than a
single instance?
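
A minimal sketch of what the fdinfo side of that could look like,
assuming hypothetical helpers and the drm-engine-capacity-<keystr> key
from drm-usage-stats.rst (the real change would live in
amdgpu_fdinfo.c; ip_name and num_rings are made-up names):

    #include <drm/drm_print.h>

    /*
     * Hypothetical sketch: only emit the capacity key for IP types which
     * probed with more than one ring; a missing key is typically treated
     * as a capacity of one by readers.
     */
    static void example_show_capacity(struct drm_printer *p,
                                      const char *ip_name,
                                      unsigned int num_rings)
    {
            if (num_rings > 1)
                    drm_printf(p, "drm-engine-capacity-%s:\t%u\n",
                               ip_name, num_rings);
    }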

Re: [RFC 0/5] Add capacity key to fdinfo

2024-05-02 Thread Christian König

On 01.05.24 at 15:27, Tvrtko Ursulin wrote:


Hi Alex,

On 30/04/2024 19:32, Alex Deucher wrote:
On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin  
wrote:


From: Tvrtko Ursulin 

I have noticed AMD GPUs can have more than one "engine" (ring?) of the
same type, but amdgpu is not reporting that in fdinfo using the capacity
engine tag.

This series is therefore an attempt to improve that, but only an RFC
since it is quite likely I got stuff wrong on the first attempt. Or, if
not wrong, it may not be very beneficial in AMD's case.

So I tried to figure out how to count and store the number of instances
of an "engine" type and spotted that it could perhaps be used in more
than one place in the driver. I was more than a little bit confused by
the ip_instance and uapi rings, and then by how rings are assigned to
context entities internally. Anyway.. hopefully it is a simple enough
series to easily spot any such large misses.

End result should be that, assuming two "engine" instances, a client
fully loading one while the other is idle will only be reported as using
50% of that engine type.


That would only be true if there are multiple instantiations of the IP
on the chip which in most cases is not true.  In most cases there is
one instance of the IP that can be fed from multiple rings. E.g. for
graphics and compute, all of the rings ultimately feed into the same
compute units on the chip.  So if you have a gfx ring and a compute
ring, you can schedule work to them asynchronously, but ultimately
whether they execute serially or in parallel depends on the actual
shader code in the command buffers and the extent to which it can
utilize the available compute units in the shader cores.


This is the same as with Intel/i915. Fdinfo is not intended to provide
utilisation of EUs and such, just how busy the "entities" the kernel
submits to are. So doing something like in this series would make the
reporting more similar between the two drivers.

I think both the 0-800% and the 0-100% range (taking 8-ring compute as an
example) can be misleading for different workloads. Being below 800% in
the former does not mean one can send more work, and the same goes for
being below 100% in the latter.


Yeah, I think that's what Alex is trying to describe. With 8 compute
rings, an 800% load figure is actually incorrect and quite misleading.


Background is that those 8 compute rings won't be active all at the same 
time, but rather waiting on each other for resources.


But this "waiting" is unfortunately considered execution time since the 
used approach is actually not really capable of separating waiting and 
execution time.




There is also a parallel with the CPU world here and hyper-threading, if
not wider, where "What does 100% actually mean?" is also wishy-washy.

Also note that the reporting of actual time-based values in fdinfo would
not change with this series.


Or, if you can guide me towards how to distinguish real vs fake
parallelism in HW IP blocks, I could modify the series to only add
capacity tags where there are truly independent blocks. That would be
different from i915 though, where I did not bother with that distinction.
(The reason is that assignment of, for instance, EUs to compute "rings"
(command streamers in i915) was supposed to be re-configurable on the
fly, so it did not make sense to try to be super smart in fdinfo.)


Well, exactly, that's the point: we don't really have truly independent
blocks on AMD hardware.

There are things like independent SDMA instances, but those are meant to
be used with, for example, the first instance for uploads and the second
for downloads. When you use both instances for the same job they will
pretty much limit each other because of a single shared resource.



As for the UAPI portion of this, we generally expose a limited number
of rings to user space and then we use the GPU scheduler to load
balance between all of the available rings of a type to try and
extract as much parallelism as we can.


The part I do not understand is the purpose of the ring argument in, for
instance, drm_amdgpu_cs_chunk_ib. It appears userspace can create up to N
scheduling entities using different ring ids, but internally they can all
map to the same set of N scheduler instances (depending on the IP type,
it can be that each userspace ring maps to the same N hw rings, or, for
rings with no drm sched load balancing, the userspace ring also does not
appear to have any relation to the picked drm sched instance).


So I understand neither how this ring is useful, nor how it does not
create a problem for IP types which use drm_sched_pick_best. It appears
that even if userspace created two scheduling entities with different
ring ids they could randomly map to the same drm sched, aka the same hw
ring, no?


Yeah, that is correct. The multimedia instances have to use "fixed" load
balancing because of a lack of firmware support. That should have been
fixed by now, but we never found time to actually validate it.


Regarding the "ring" parameter in CS, that is basically just for 
backward 

Re: [RFC 0/5] Add capacity key to fdinfo

2024-05-02 Thread Tvrtko Ursulin



Hi Alex,

On 30/04/2024 19:32, Alex Deucher wrote:

On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin  wrote:


From: Tvrtko Ursulin 

I have noticed AMD GPUs can have more than one "engine" (ring?) of the same type,
but amdgpu is not reporting that in fdinfo using the capacity engine tag.

This series is therefore an attempt to improve that, but only an RFC since it is
quite likely I got stuff wrong on the first attempt. Or, if not wrong, it may not
be very beneficial in AMD's case.

So I tried to figure out how to count and store the number of instances of an
"engine" type and spotted that it could perhaps be used in more than one place in
the driver. I was more than a little bit confused by the ip_instance and uapi
rings, and then by how rings are assigned to context entities internally. Anyway..
hopefully it is a simple enough series to easily spot any such large misses.

End result should be that, assuming two "engine" instances, a client fully loading
one while the other is idle will only be reported as using 50% of that engine type.


That would only be true if there are multiple instantiations of the IP
on the chip which in most cases is not true.  In most cases there is
one instance of the IP that can be fed from multiple rings.  E.g. for
graphics and compute, all of the rings ultimately feed into the same
compute units on the chip.  So if you have a gfx ring and a compute
ring, you can schedule work to them asynchronously, but ultimately
whether they execute serially or in parallel depends on the actual
shader code in the command buffers and the extent to which it can
utilize the available compute units in the shader cores.


This is the same as with Intel/i915. Fdinfo is not intended to provide
utilisation of EUs and such, just how busy the "entities" the kernel
submits to are. So doing something like in this series would make the
reporting more similar between the two drivers.

I think both the 0-800% and the 0-100% range (taking 8-ring compute as an
example) can be misleading for different workloads. Being below 800% in
the former does not mean one can send more work, and the same goes for
being below 100% in the latter.


There is also a parallel with the CPU world here and hyper-threading, if
not wider, where "What does 100% actually mean?" is also wishy-washy.

Also note that the reporting of actual time-based values in fdinfo would
not change with this series.


Or, if you can guide me towards how to distinguish real vs fake
parallelism in HW IP blocks, I could modify the series to only add
capacity tags where there are truly independent blocks. That would be
different from i915 though, where I did not bother with that distinction.
(The reason is that assignment of, for instance, EUs to compute "rings"
(command streamers in i915) was supposed to be re-configurable on the
fly, so it did not make sense to try to be super smart in fdinfo.)



As for the UAPI portion of this, we generally expose a limited number
of rings to user space and then we use the GPU scheduler to load
balance between all of the available rings of a type to try and
extract as much parallelism as we can.
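
For context, a hedged sketch of the in-kernel side of that load
balancing, using the drm_sched_entity_init() and drm_sched_pick_best()
API from include/drm/gpu_scheduler.h as I understand it; the wrapper and
how it would be called are simplified and hypothetical:

    #include <drm/gpu_scheduler.h>

    /*
     * Simplified/hypothetical wrapper: hand one scheduler entity the
     * whole list of hardware ring schedulers of an IP type and let the
     * DRM scheduler pick the best (least loaded) one per job, cf.
     * drm_sched_pick_best().
     */
    static int example_init_balanced_entity(struct drm_sched_entity *entity,
                                            struct drm_gpu_scheduler **scheds,
                                            unsigned int num_scheds)
    {
            return drm_sched_entity_init(entity, DRM_SCHED_PRIORITY_NORMAL,
                                         scheds, num_scheds, NULL);
    }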


The part I do not understand is the purpose of the ring argument in, for
instance, drm_amdgpu_cs_chunk_ib. It appears userspace can create up to N
scheduling entities using different ring ids, but internally they can all
map to the same set of N scheduler instances (depending on the IP type,
it can be that each userspace ring maps to the same N hw rings, or, for
rings with no drm sched load balancing, the userspace ring also does not
appear to have any relation to the picked drm sched instance).
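
For reference, the uapi structure in question, as I read it in
include/uapi/drm/amdgpu_drm.h (field comments paraphrased by me):

    struct drm_amdgpu_cs_chunk_ib {
            __u32 _pad;
            __u32 flags;            /* AMDGPU_IB_FLAG_* */
            __u64 va_start;         /* virtual address the IB starts at */
            __u32 ib_bytes;         /* size of the IB */
            __u32 ip_type;          /* HW IP type to submit to */
            __u32 ip_instance;      /* HW IP index of that type */
            __u32 ring;             /* ring index to submit to */
    };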


So I understand neither how this ring is useful, nor how it does not
create a problem for IP types which use drm_sched_pick_best. It appears
that even if userspace created two scheduling entities with different
ring ids they could randomly map to the same drm sched, aka the same hw
ring, no?


Regards,

Tvrtko


Alex




Tvrtko Ursulin (5):
   drm/amdgpu: Cache number of rings per hw ip type
   drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO
 ioctl
   drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
   drm/amdgpu: Show engine capacity in fdinfo
   drm/amdgpu: Only show VRAM in fdinfo if it exists

  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c|  3 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c| 62 +++---
  5 files changed, 49 insertions(+), 70 deletions(-)

--
2.44.0
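
For illustration only, with made-up values, fdinfo output after a series
along these lines might look roughly like this (key names per
Documentation/gpu/drm-usage-stats.rst; the engine names and capacities
shown are assumptions, not amdgpu's actual output):

    drm-driver:     amdgpu
    drm-client-id:  42
    drm-memory-vram:        131072 KiB
    drm-engine-gfx: 1234567890 ns
    drm-engine-capacity-gfx:        1
    drm-engine-compute:     987654321 ns
    drm-engine-capacity-compute:    4
    drm-engine-dec: 0 ns
    drm-engine-capacity-dec:        2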


Re: [RFC 0/5] Add capacity key to fdinfo

2024-04-30 Thread Alex Deucher
On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin  wrote:
>
> From: Tvrtko Ursulin 
>
> I have noticed AMD GPUs can have more than one "engine" (ring?) of the same
> type, but amdgpu is not reporting that in fdinfo using the capacity engine tag.
>
> This series is therefore an attempt to improve that, but only an RFC since it
> is quite likely I got stuff wrong on the first attempt. Or, if not wrong, it
> may not be very beneficial in AMD's case.
>
> So I tried to figure out how to count and store the number of instances of an
> "engine" type and spotted that it could perhaps be used in more than one place
> in the driver. I was more than a little bit confused by the ip_instance and
> uapi rings, and then by how rings are assigned to context entities internally.
> Anyway.. hopefully it is a simple enough series to easily spot any such large
> misses.
>
> End result should be that, assuming two "engine" instances, a client fully
> loading one while the other is idle will only be reported as using 50% of that
> engine type.

That would only be true if there are multiple instantiations of the IP
on the chip which in most cases is not true.  In most cases there is
one instance of the IP that can be fed from multiple rings.  E.g. for
graphics and compute, all of the rings ultimately feed into the same
compute units on the chip.  So if you have a gfx ring and a compute
ring, you can schedule work to them asynchronously, but ultimately
whether they execute serially or in parallel depends on the actual
shader code in the command buffers and the extent to which it can
utilize the available compute units in the shader cores.

As for the UAPI portion of this, we generally expose a limited number
of rings to user space and then we use the GPU scheduler to load
balance between all of the available rings of a type to try and
extract as much parallelism as we can.

Alex


>
> Tvrtko Ursulin (5):
>   drm/amdgpu: Cache number of rings per hw ip type
>   drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO
> ioctl
>   drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>   drm/amdgpu: Show engine capacity in fdinfo
>   drm/amdgpu: Only show VRAM in fdinfo if it exists
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c|  3 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c| 62 +++---
>  5 files changed, 49 insertions(+), 70 deletions(-)
>
> --
> 2.44.0