Hey Joe, I don't have all the answers right now, but one thing I want to mention is that, with cgroup, there is always the possibility of a user configuration that leads to under-resourcing the application. Your comments certainly highlight the need to make under-resource situations obvious to debug. (I also want to write this down so I don't forget... :) I should probably emit some dmesg for situations like this.) Thanks!
Regards,
Kenny

On Mon, Dec 2, 2019 at 5:05 PM Greathouse, Joseph <joseph.greatho...@amd.com> wrote:
>
> > -----Original Message-----
> > From: Kenny Ho <y2ke...@gmail.com>
> > Sent: Friday, November 29, 2019 12:00 AM
> >
> > Reducing audience since this is AMD specific.
> >
> > On Tue, Oct 8, 2019 at 3:11 PM Kuehling, Felix <felix.kuehl...@amd.com> wrote:
> > >
> > > On 2019-08-29 2:05 a.m., Kenny Ho wrote:
> > > > The number of logical gpu (lgpu) is defined to be the number of
> > > > compute unit (CU) for a device. The lgpu allocation limit only
> > > > applies to compute workload for the moment (enforced via kfd queue
> > > > creation.) Any cu_mask update is validated against the availability
> > > > of the compute unit as defined by the drmcg the kfd process belongs to.
> > >
> > > There is something missing here. There is an API for the application
> > > to specify a CU mask. Right now it looks like the
> > > application-specified and CGroup-specified CU masks would clobber each
> > > other. Instead the two should be merged.
> > >
> > > The CGroup-specified mask should specify a subset of CUs available for
> > > application-specified CU masks. When the cgroup CU mask changes, you'd
> > > need to take any application-specified CU masks into account before
> > > updating the hardware.
> >
> > The idea behind the current implementation is to give the sysadmin priority
> > over the user application (as that is the definition of a control
> > group.) A mask specified by the application/user is validated by
> > pqm_drmcg_lgpu_validate and rejected with EACCES if it is not
> > compatible. The alternative is to ignore the difference and have the
> > kernel guess/redistribute the assignment, but I am not sure that this
> > is a good approach since there is not enough information to allow the
> > kernel to guess the user's intention consistently and correctly.
> > (This is based on multiple conversations with you and Joe that led me to
> > believe there are situations where spreading the CU assignment across
> > multiple SEs is a good thing, but not always.)
> >
> > If the cgroup-specified mask is changed after the application has set the
> > mask, the intersection of the two masks will be set instead. It
> > is possible to have no intersection, and in this case no CU is made
> > available to the application (just like the possibility for memcgroup to
> > starve an application of the memory it needs.)
>
> I don't disagree with forcing a user to work within an lgpu's allocation. But
> there are two minor problems here:
>
> 1) We will need a way for the process to query what the lgpu's bitmap looks
> like. You and Felix are somewhat discussing this below, but I don't think the
> KFD's "number of CUs" topology information is sufficient. I can know I have
> 32 CUs, but I don't know which 32 bits in the bitmask are turned on. But your
> code in pqm_drmcg_lgpu_validate() requires a subset when setting a CU mask on
> an lgpu. A user needs to know which bits are on in the lgpu for this to work.
>
> 2) Even if we have a query API, do we have an easy way to prevent a data
> race? Do we care? For instance, if I query the existing lgpu bitmap, then try
> to set a CU mask on a subset of that, it's possible that the lgpu will change
> between the query and the set. That would make the setting fail; maybe that's
> good enough (you can just retry in a loop until it succeeds?)
>
> Do empty CU masks actually work? This seems like something we would want to
> avoid. This could happen not infrequently if someone does something like:
> * lgpu with half the CUs enabled
> * User sets a mask to use half of those CUs
> * lgpu is changed to enable the other half of the CUs --> now the user's mask
>   is fully destroyed and everything dies. :\
>
> > > The KFD topology APIs report the number of available CUs to the
> > > application.
> > > CGroups would change that number at runtime and
> > > applications would not expect that. I think the best way to deal with
> > > that would be to have multiple bits in the application-specified CU
> > > mask map to the same CU. How to do that in a fair way is not obvious.
> > > I guess a more coarse-grained division of the GPU into LGPUs would make
> > > this somewhat easier.
> >
> > Another possibility is to add a namespace to the topology sysfs such that
> > the reported number of CUs changes accordingly, although that
> > wouldn't give the user the available mask that is made available by this
> > implementation via the cgroup sysfs. Another possibility is to
> > modify the Thunk similar to what was done for device cgroup (device
> > re-mapping.)
>
> I'd vote for a set of mask query APIs in the Thunk. One for the process's
> current CU mask, and one for a queue's current CU mask. We have a setter API
> already. Since the KFD topology information is also mirrored in sysfs, I
> would worry that a process would see different KFD topology information if
> it's querying the Thunk (which would show the lgpu's number of CUs) vs. if
> it's reading sysfs (which would show the GPU's number of CUs).
>
> As mentioned above, the KFD "num CUs" is insufficient for knowing how to set
> the CU bitmask, so I don't think we should rely on it in this case. IMO, KFD
> topology should describe the real hardware regardless of how cgroups is
> limiting things. I'm willing to be told this is a bad idea, though.
>
> > > How is this problem handled for CPU cores and the interaction with CPU
> > > pthread_setaffinity_np?
> >
> > Per the documentation of pthread_setaffinity_np, "If the call is
> > successful, and the thread is not currently running on one of the CPUs
> > in cpuset, then it is migrated to one of those CPUs."
> > http://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html > > > > Regards, > > Kenny > > > > > > > > > Regards, > > > Felix > > > > > > > > > > > > > > Change-Id: I69a57452c549173a1cd623c30dc57195b3b6563e > > > > Signed-off-by: Kenny Ho <kenny...@amd.com> > > > > --- > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 4 + > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 21 +++ > > > > drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 6 + > > > > drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 + > > > > .../amd/amdkfd/kfd_process_queue_manager.c | 140 ++++++++++++++++++ > > > > 5 files changed, 174 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > > > > index 55cb1b2094fd..369915337213 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h > > > > @@ -198,6 +198,10 @@ uint8_t amdgpu_amdkfd_get_xgmi_hops_count(struct > > > > kgd_dev *dst, struct kgd_dev *s > > > > valid; \ > > > > }) > > > > > > > > +int amdgpu_amdkfd_update_cu_mask_for_process(struct task_struct *task, > > > > + struct amdgpu_device *adev, unsigned long *lgpu_bitmap, > > > > + unsigned int nbits); > > > > + > > > > /* GPUVM API */ > > > > int amdgpu_amdkfd_gpuvm_create_process_vm(struct kgd_dev *kgd, > > > > unsigned int pasid, > > > > void **vm, void > > > > **process_info, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > index 163a4fbf0611..8abeffdd2e5b 100644 > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > > > > @@ -1398,9 +1398,29 @@ amdgpu_get_crtc_scanout_position(struct > > > > drm_device *dev, unsigned int pipe, > > > > static void amdgpu_drmcg_custom_init(struct drm_device *dev, > > > > struct drmcg_props *props) > > > > { > > > > + struct amdgpu_device *adev = dev->dev_private; > > > > + > > > > + 
props->lgpu_capacity = adev->gfx.cu_info.number; > > > > + > > > > props->limit_enforced = true; > > > > } > > > > > > > > +static void amdgpu_drmcg_limit_updated(struct drm_device *dev, > > > > + struct task_struct *task, struct drmcg_device_resource > > > > *ddr, > > > > + enum drmcg_res_type res_type) { > > > > + struct amdgpu_device *adev = dev->dev_private; > > > > + > > > > + switch (res_type) { > > > > + case DRMCG_TYPE_LGPU: > > > > + amdgpu_amdkfd_update_cu_mask_for_process(task, adev, > > > > + ddr->lgpu_allocated, > > > > dev->drmcg_props.lgpu_capacity); > > > > + break; > > > > + default: > > > > + break; > > > > + } > > > > +} > > > > + > > > > static struct drm_driver kms_driver = { > > > > .driver_features = > > > > DRIVER_USE_AGP | DRIVER_ATOMIC | @@ -1438,6 +1458,7 @@ > > > > static struct drm_driver kms_driver = { > > > > .gem_prime_mmap = amdgpu_gem_prime_mmap, > > > > > > > > .drmcg_custom_init = amdgpu_drmcg_custom_init, > > > > + .drmcg_limit_updated = amdgpu_drmcg_limit_updated, > > > > > > > > .name = DRIVER_NAME, > > > > .desc = DRIVER_DESC, > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > > > > b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > > > > index 138c70454e2b..fa765b803f97 100644 > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c > > > > @@ -450,6 +450,12 @@ static int kfd_ioctl_set_cu_mask(struct file > > > > *filp, struct kfd_process *p, > > > > return -EFAULT; > > > > } > > > > > > > > + if (!pqm_drmcg_lgpu_validate(p, args->queue_id, > > > > properties.cu_mask, cu_mask_size)) { > > > > + pr_debug("CU mask not permitted by DRM Cgroup"); > > > > + kfree(properties.cu_mask); > > > > + return -EACCES; > > > > + } > > > > + > > > > mutex_lock(&p->mutex); > > > > > > > > retval = pqm_set_cu_mask(&p->pqm, args->queue_id, > > > > &properties); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > > > > b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > > > > index 
8b0eee5b3521..88881bec7550 100644 > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > > > > @@ -1038,6 +1038,9 @@ int pqm_get_wave_state(struct > > > > process_queue_manager *pqm, > > > > u32 *ctl_stack_used_size, > > > > u32 *save_area_used_size); > > > > > > > > +bool pqm_drmcg_lgpu_validate(struct kfd_process *p, int qid, u32 > > > > *cu_mask, > > > > + unsigned int cu_mask_size); > > > > + > > > > int amdkfd_fence_wait_timeout(unsigned int *fence_addr, > > > > unsigned int fence_value, > > > > unsigned int timeout_ms); diff --git > > > > a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > index 7e6c3ee82f5b..a896de290307 100644 > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c > > > > @@ -23,9 +23,11 @@ > > > > > > > > #include <linux/slab.h> > > > > #include <linux/list.h> > > > > +#include <linux/cgroup_drm.h> > > > > #include "kfd_device_queue_manager.h" > > > > #include "kfd_priv.h" > > > > #include "kfd_kernel_queue.h" > > > > +#include "amdgpu.h" > > > > #include "amdgpu_amdkfd.h" > > > > > > > > static inline struct process_queue_node *get_queue_by_qid( @@ > > > > -167,6 +169,7 @@ static int create_cp_queue(struct > > > > process_queue_manager *pqm, > > > > struct queue_properties *q_properties, > > > > struct file *f, unsigned int qid) > > > > { > > > > + struct drmcg *drmcg; > > > > int retval; > > > > > > > > /* Doorbell initialized in user space*/ @@ -180,6 +183,36 @@ > > > > static int create_cp_queue(struct process_queue_manager *pqm, > > > > if (retval != 0) > > > > return retval; > > > > > > > > + > > > > + drmcg = drmcg_get(pqm->process->lead_thread); > > > > + if (drmcg) { > > > > + struct amdgpu_device *adev; > > > > + struct drmcg_device_resource *ddr; > > > > + int mask_size; > > > > + u32 *mask; > > > > + > > > > + adev = 
(struct amdgpu_device *) dev->kgd; > > > > + > > > > + mask_size = adev->ddev->drmcg_props.lgpu_capacity; > > > > + mask = kzalloc(sizeof(u32) * round_up(mask_size, 32), > > > > + GFP_KERNEL); > > > > + > > > > + if (!mask) { > > > > + drmcg_put(drmcg); > > > > + uninit_queue(*q); > > > > + return -ENOMEM; > > > > + } > > > > + > > > > + ddr = > > > > + drmcg->dev_resources[adev->ddev->primary->index]; > > > > + > > > > + bitmap_to_arr32(mask, ddr->lgpu_allocated, mask_size); > > > > + > > > > + (*q)->properties.cu_mask_count = mask_size; > > > > + (*q)->properties.cu_mask = mask; > > > > + > > > > + drmcg_put(drmcg); > > > > + } > > > > + > > > > (*q)->device = dev; > > > > (*q)->process = pqm->process; > > > > > > > > @@ -495,6 +528,113 @@ int pqm_get_wave_state(struct > > > > process_queue_manager *pqm, > > > > > > > > save_area_used_size); > > > > } > > > > > > > > +bool pqm_drmcg_lgpu_validate(struct kfd_process *p, int qid, u32 > > > > *cu_mask, > > > > + unsigned int cu_mask_size) { > > > > + DECLARE_BITMAP(curr_mask, MAX_DRMCG_LGPU_CAPACITY); > > > > + struct drmcg_device_resource *ddr; > > > > + struct process_queue_node *pqn; > > > > + struct amdgpu_device *adev; > > > > + struct drmcg *drmcg; > > > > + bool result; > > > > + > > > > + if (cu_mask_size > MAX_DRMCG_LGPU_CAPACITY) > > > > + return false; > > > > + > > > > + bitmap_from_arr32(curr_mask, cu_mask, cu_mask_size); > > > > + > > > > + pqn = get_queue_by_qid(&p->pqm, qid); > > > > + if (!pqn) > > > > + return false; > > > > + > > > > + adev = (struct amdgpu_device *)pqn->q->device->kgd; > > > > + > > > > + drmcg = drmcg_get(p->lead_thread); > > > > + ddr = drmcg->dev_resources[adev->ddev->primary->index]; > > > > + > > > > + if (bitmap_subset(curr_mask, ddr->lgpu_allocated, > > > > + MAX_DRMCG_LGPU_CAPACITY)) > > > > + result = true; > > > > + else > > > > + result = false; > > > > + > > > > + drmcg_put(drmcg); > > > > + > > > > + return result; > > > > +} > > > > + > > > > +int 
amdgpu_amdkfd_update_cu_mask_for_process(struct task_struct *task, > > > > + struct amdgpu_device *adev, unsigned long *lgpu_bm, > > > > + unsigned int lgpu_bm_size) { > > > > + struct kfd_dev *kdev = adev->kfd.dev; > > > > + struct process_queue_node *pqn; > > > > + struct kfd_process *kfdproc; > > > > + size_t size_in_bytes; > > > > + u32 *cu_mask; > > > > + int rc = 0; > > > > + > > > > + if ((lgpu_bm_size % 32) != 0) { > > > > + pr_warn("lgpu_bm_size %d must be a multiple of 32", > > > > + lgpu_bm_size); > > > > + return -EINVAL; > > > > + } > > > > + > > > > + kfdproc = kfd_get_process(task); > > > > + > > > > + if (IS_ERR(kfdproc)) > > > > + return -ESRCH; > > > > + > > > > + size_in_bytes = sizeof(u32) * round_up(lgpu_bm_size, 32); > > > > + > > > > + mutex_lock(&kfdproc->mutex); > > > > + list_for_each_entry(pqn, &kfdproc->pqm.queues, > > > > process_queue_list) { > > > > + if (pqn->q && pqn->q->device == kdev) { > > > > + /* update cu_mask accordingly */ > > > > + cu_mask = kzalloc(size_in_bytes, GFP_KERNEL); > > > > + if (!cu_mask) { > > > > + rc = -ENOMEM; > > > > + break; > > > > + } > > > > + > > > > + if (pqn->q->properties.cu_mask) { > > > > + DECLARE_BITMAP(curr_mask, > > > > + > > > > + MAX_DRMCG_LGPU_CAPACITY); > > > > + > > > > + if (pqn->q->properties.cu_mask_count > > > > > + lgpu_bm_size) { > > > > + rc = -EINVAL; > > > > + kfree(cu_mask); > > > > + break; > > > > + } > > > > + > > > > + bitmap_from_arr32(curr_mask, > > > > + > > > > pqn->q->properties.cu_mask, > > > > + > > > > + pqn->q->properties.cu_mask_count); > > > > + > > > > + bitmap_and(curr_mask, curr_mask, lgpu_bm, > > > > + lgpu_bm_size); > > > > + > > > > + bitmap_to_arr32(cu_mask, curr_mask, > > > > + lgpu_bm_size); > > > > + > > > > + kfree(curr_mask); > > > > + } else > > > > + bitmap_to_arr32(cu_mask, lgpu_bm, > > > > + lgpu_bm_size); > > > > + > > > > + pqn->q->properties.cu_mask = cu_mask; > > > > + pqn->q->properties.cu_mask_count = > > > > + lgpu_bm_size; > > > > + > > > > 
+ rc = pqn->q->device->dqm->ops.update_queue( > > > > + pqn->q->device->dqm, pqn->q); > > > > + } > > > > + } > > > > + mutex_unlock(&kfdproc->mutex); > > > > + > > > > + return rc; > > > > +} > > > > + > > > > #if defined(CONFIG_DEBUG_FS) > > > > > > > > int pqm_debugfs_mqds(struct seq_file *m, void *data) _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx