Re: [PATCH] drm/amdgpu: Fix incorrect return value
Am 03.04.24 um 09:06 schrieb YiPeng Chai: [Why] After calling amdgpu_vram_mgr_reserve_range multiple times with the same address, calling amdgpu_vram_mgr_query_page_status will always return -EBUSY. From the second call to amdgpu_vram_mgr_reserve_range, the same address will be added to the reservations_pending list again and is never moved to the reserved_pages list because the address had been reserved. Well that sounds like a really bad idea to me. Why is the function called multiple times with the same address in the first place ? Apart from that a note on the coding style below. [How] First add the address status check before calling amdgpu_vram_mgr_do_reserve, if the address is already reserved, do nothing; If the address is already in the reservations_pending list, directly reserve memory; only add new nodes for the addresses that are not in the reserved_pages list and reservations_pending list. Signed-off-by: YiPeng Chai --- drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 +--- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c index 1e36c428d254..0bf3f4092900 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct ttm_resource_manager *man) dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n", rsv->start, rsv->size); - vis_usage = amdgpu_vram_mgr_vis_size(adev, block); atomic64_add(vis_usage, >vis_usage); spin_lock(>bdev->lru_lock); @@ -340,19 +339,30 @@ int amdgpu_vram_mgr_reserve_range(struct amdgpu_vram_mgr *mgr, uint64_t start, uint64_t size) { struct amdgpu_vram_reservation *rsv; + int ret = 0; Don't initialize local variables when it isn't necessary. - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); - if (!rsv) - return -ENOMEM; + ret = amdgpu_vram_mgr_query_page_status(mgr, start); + if (!ret) + return 0; + + if (ret == -ENOENT) { + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); + if (!rsv) + return -ENOMEM; - INIT_LIST_HEAD(>allocated); - INIT_LIST_HEAD(>blocks); + INIT_LIST_HEAD(>allocated); + INIT_LIST_HEAD(>blocks); - rsv->start = start; - rsv->size = size; + rsv->start = start; + rsv->size = size; + + mutex_lock(>lock); + list_add_tail(>blocks, >reservations_pending); + mutex_unlock(>lock); + + } You should probably not lock/unlock here. Regards, Christian. mutex_lock(>lock); - list_add_tail(>blocks, >reservations_pending); amdgpu_vram_mgr_do_reserve(>manager); mutex_unlock(>lock);
RE: [PATCH] drm/amdgpu: Fix incorrect return value
[AMD Official Use Only - General] OK - Best Regards, Thomas -Original Message- From: Zhou1, Tao Sent: Tuesday, April 9, 2024 10:52 AM To: Chai, Thomas ; amd-gfx@lists.freedesktop.org Cc: Zhang, Hawking ; Li, Candice ; Wang, Yang(Kevin) ; Yang, Stanley Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value [AMD Official Use Only - General] > -Original Message- > From: Chai, Thomas > Sent: Wednesday, April 3, 2024 3:07 PM > To: amd-gfx@lists.freedesktop.org > Cc: Chai, Thomas ; Zhang, Hawking > ; Zhou1, Tao ; Li, Candice > ; Wang, Yang(Kevin) ; > Yang, Stanley ; Chai, Thomas > > Subject: [PATCH] drm/amdgpu: Fix incorrect return value > > [Why] > After calling amdgpu_vram_mgr_reserve_range multiple times with the > same address, calling amdgpu_vram_mgr_query_page_status will always > return - EBUSY. > From the second call to amdgpu_vram_mgr_reserve_range, the same > address will be added to the reservations_pending list again and is > never moved to the reserved_pages list because the address had been reserved. > > [How] > First add the address status check before calling > amdgpu_vram_mgr_do_reserve, if the address is already reserved, do > nothing; If the address is already in the reservations_pending list, > directly reserve memory; only add new nodes for the addresses that are > not in the reserved_pages list and reservations_pending list. > > Signed-off-by: YiPeng Chai > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 > +--- > 1 file changed, 19 insertions(+), 9 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > index 1e36c428d254..0bf3f4092900 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct > ttm_resource_manager *man) > > dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n", > rsv->start, rsv->size); > - > vis_usage = amdgpu_vram_mgr_vis_size(adev, block); > atomic64_add(vis_usage, >vis_usage); > spin_lock(>bdev->lru_lock); @@ -340,19 +339,30 @@ > int amdgpu_vram_mgr_reserve_range(struct > amdgpu_vram_mgr *mgr, > uint64_t start, uint64_t size) { > struct amdgpu_vram_reservation *rsv; > + int ret = 0; > > - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > - if (!rsv) > - return -ENOMEM; > + ret = amdgpu_vram_mgr_query_page_status(mgr, start); > + if (!ret) > + return 0; > + > + if (ret == -ENOENT) { > + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > + if (!rsv) > + return -ENOMEM; > > - INIT_LIST_HEAD(>allocated); > - INIT_LIST_HEAD(>blocks); > + INIT_LIST_HEAD(>allocated); > + INIT_LIST_HEAD(>blocks); > > - rsv->start = start; > - rsv->size = size; > + rsv->start = start; > + rsv->size = size; > + > + mutex_lock(>lock); > + list_add_tail(>blocks, >reservations_pending); > + mutex_unlock(>lock); [Tao] we can drop the mutex_unlock and add if (ret != -ENOENT) for the second mutex_lock to avoid unlocking/locking repeatedly. > + > + } > > mutex_lock(>lock); > - list_add_tail(>blocks, >reservations_pending); > amdgpu_vram_mgr_do_reserve(>manager); > mutex_unlock(>lock); > > -- > 2.34.1
RE: [PATCH] drm/amdgpu: Fix incorrect return value
[AMD Official Use Only - General] > -Original Message- > From: Chai, Thomas > Sent: Wednesday, April 3, 2024 3:07 PM > To: amd-gfx@lists.freedesktop.org > Cc: Chai, Thomas ; Zhang, Hawking > ; Zhou1, Tao ; Li, Candice > ; Wang, Yang(Kevin) ; Yang, > Stanley ; Chai, Thomas > Subject: [PATCH] drm/amdgpu: Fix incorrect return value > > [Why] > After calling amdgpu_vram_mgr_reserve_range multiple times with the same > address, calling amdgpu_vram_mgr_query_page_status will always return - > EBUSY. > From the second call to amdgpu_vram_mgr_reserve_range, the same address > will be added to the reservations_pending list again and is never moved to the > reserved_pages list because the address had been reserved. > > [How] > First add the address status check before calling > amdgpu_vram_mgr_do_reserve, if the address is already reserved, do nothing; If > the address is already in the reservations_pending list, directly reserve > memory; > only add new nodes for the addresses that are not in the reserved_pages list > and > reservations_pending list. > > Signed-off-by: YiPeng Chai > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 +--- > 1 file changed, 19 insertions(+), 9 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > index 1e36c428d254..0bf3f4092900 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct > ttm_resource_manager *man) > > dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n", > rsv->start, rsv->size); > - > vis_usage = amdgpu_vram_mgr_vis_size(adev, block); > atomic64_add(vis_usage, >vis_usage); > spin_lock(>bdev->lru_lock); > @@ -340,19 +339,30 @@ int amdgpu_vram_mgr_reserve_range(struct > amdgpu_vram_mgr *mgr, > uint64_t start, uint64_t size) > { > struct amdgpu_vram_reservation *rsv; > + int ret = 0; > > - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > - if (!rsv) > - return -ENOMEM; > + ret = amdgpu_vram_mgr_query_page_status(mgr, start); > + if (!ret) > + return 0; > + > + if (ret == -ENOENT) { > + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > + if (!rsv) > + return -ENOMEM; > > - INIT_LIST_HEAD(>allocated); > - INIT_LIST_HEAD(>blocks); > + INIT_LIST_HEAD(>allocated); > + INIT_LIST_HEAD(>blocks); > > - rsv->start = start; > - rsv->size = size; > + rsv->start = start; > + rsv->size = size; > + > + mutex_lock(>lock); > + list_add_tail(>blocks, >reservations_pending); > + mutex_unlock(>lock); [Tao] we can drop the mutex_unlock and add if (ret != -ENOENT) for the second mutex_lock to avoid unlocking/locking repeatedly. > + > + } > > mutex_lock(>lock); > - list_add_tail(>blocks, >reservations_pending); > amdgpu_vram_mgr_do_reserve(>manager); > mutex_unlock(>lock); > > -- > 2.34.1
RE: [PATCH] drm/amdgpu: Fix incorrect return value
[AMD Official Use Only - General] - Best Regards, Thomas -Original Message- From: Zhou1, Tao Sent: Monday, April 8, 2024 4:41 PM To: Chai, Thomas ; amd-gfx@lists.freedesktop.org Cc: Zhang, Hawking ; Li, Candice ; Wang, Yang(Kevin) ; Yang, Stanley Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value [AMD Official Use Only - General] > -Original Message- > From: Chai, Thomas > Sent: Sunday, April 7, 2024 10:21 AM > To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org > Cc: Zhang, Hawking ; Li, Candice > ; Wang, Yang(Kevin) ; > Yang, Stanley > Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value > > [AMD Official Use Only - General] > > - > Best Regards, > Thomas > > -Original Message- > From: Zhou1, Tao > Sent: Wednesday, April 3, 2024 6:36 PM > To: Chai, Thomas ; amd-gfx@lists.freedesktop.org > Cc: Zhang, Hawking ; Li, Candice > ; Wang, Yang(Kevin) ; > Yang, Stanley > Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value > > [AMD Official Use Only - General] > > > -Original Message- > > From: Chai, Thomas > > Sent: Wednesday, April 3, 2024 3:07 PM > > To: amd-gfx@lists.freedesktop.org > > Cc: Chai, Thomas ; Zhang, Hawking > > ; Zhou1, Tao ; Li, Candice > > ; Wang, Yang(Kevin) ; > > Yang, Stanley ; Chai, Thomas > > > > Subject: [PATCH] drm/amdgpu: Fix incorrect return value > > > > [Why] > > After calling amdgpu_vram_mgr_reserve_range multiple times with > > the same address, calling amdgpu_vram_mgr_query_page_status will > > always return - EBUSY. > > >[Tao] could you explain why we call amdgpu_vram_mgr_reserve_range > >multiple > times with the same address? IIRC, we skip duplicate address before > reserve memory. > > [Thomas] >When poison creation interrupt is received, since some poisoning > addresses may have been allocated by some processes, reserving these memories > will fail. > These memory will be tried to reserve again after killing the poisoned > process in the subsequent poisoning consumption interrupt handler. > so amdgpu_vram_mgr_reserve_range needs to be called multiple times > with the same address. > > > From the second call to amdgpu_vram_mgr_reserve_range, the same > > address will be added to the reservations_pending list again and is > > never moved to the reserved_pages list because the address had been > reserved. >[Tao] but if a page is added to reservations_pending list, it should also be >put in data->bps array, and when we call amdgpu_ras_add_bad_pages again, >amdgpu_ras_check_bad_page_unlock could ignore this page. So except for amdgpu_ras_add_bad_pages, would you like to call amdgpu_vram_mgr_reserve_range in other place? [Thomas] Yes, Since after amdgpu_ras_add_bad_pages is called, the bad pages will be saved to eeprom. When a large number of bad pages need to be reserved, this will delay subsequent memory reservation. I want to call amdgpu_vram_mgr_reserve_range to reserve memory immediately when driver receives poison creation interrupt, this can reduce the probability of bad memory pages being allocated and storing the bad pages to eeprom can be done slowly. > > > > [How] > > First add the address status check before calling > > amdgpu_vram_mgr_do_reserve, if the address is already reserved, do > > nothing; If the address is already in the reservations_pending list, > > directly reserve memory; only add new nodes for the addresses that > > are not in the reserved_pages list and reservations_pending list. > > > > Signed-off-by: YiPeng Chai > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 > > +--- > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > index 1e36c428d254..0bf3f4092900 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct > > ttm_resource_manager *man) > > > > dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n", > > rsv->start, rsv->size); > > - > > vis_usage = amdgpu_vram_mgr_vis_size(adev, block); > > atomic64_add(vis_usage, >vis_usage); > > spin_lock(>bdev->lru_lock); @@ -340,19 +339,30 @@ > > int amdgpu_vram_mgr_reserve_range(struct > > amdgpu_vram_mgr *mgr, > > uint64_t start, uint64_
RE: [PATCH] drm/amdgpu: Fix incorrect return value
[AMD Official Use Only - General] > -Original Message- > From: Chai, Thomas > Sent: Sunday, April 7, 2024 10:21 AM > To: Zhou1, Tao ; amd-gfx@lists.freedesktop.org > Cc: Zhang, Hawking ; Li, Candice > ; Wang, Yang(Kevin) ; Yang, > Stanley > Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value > > [AMD Official Use Only - General] > > - > Best Regards, > Thomas > > -Original Message- > From: Zhou1, Tao > Sent: Wednesday, April 3, 2024 6:36 PM > To: Chai, Thomas ; amd-gfx@lists.freedesktop.org > Cc: Zhang, Hawking ; Li, Candice > ; Wang, Yang(Kevin) ; Yang, > Stanley > Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value > > [AMD Official Use Only - General] > > > -Original Message- > > From: Chai, Thomas > > Sent: Wednesday, April 3, 2024 3:07 PM > > To: amd-gfx@lists.freedesktop.org > > Cc: Chai, Thomas ; Zhang, Hawking > > ; Zhou1, Tao ; Li, Candice > > ; Wang, Yang(Kevin) ; > > Yang, Stanley ; Chai, Thomas > > > > Subject: [PATCH] drm/amdgpu: Fix incorrect return value > > > > [Why] > > After calling amdgpu_vram_mgr_reserve_range multiple times with the > > same address, calling amdgpu_vram_mgr_query_page_status will always > > return - EBUSY. > > >[Tao] could you explain why we call amdgpu_vram_mgr_reserve_range multiple > times with the same address? IIRC, we skip duplicate address before reserve > memory. > > [Thomas] >When poison creation interrupt is received, since some poisoning addresses > may > have been allocated by some processes, reserving these memories will fail. > These memory will be tried to reserve again after killing the poisoned > process in > the subsequent poisoning consumption interrupt handler. > so amdgpu_vram_mgr_reserve_range needs to be called multiple times with the > same address. > > > From the second call to amdgpu_vram_mgr_reserve_range, the same > > address will be added to the reservations_pending list again and is > > never moved to the reserved_pages list because the address had been > reserved. [Tao] but if a page is added to reservations_pending list, it should also be put in data->bps array, and when we call amdgpu_ras_add_bad_pages again, amdgpu_ras_check_bad_page_unlock could ignore this page. So except for amdgpu_ras_add_bad_pages, would you like to call amdgpu_vram_mgr_reserve_range in other place? > > > > [How] > > First add the address status check before calling > > amdgpu_vram_mgr_do_reserve, if the address is already reserved, do > > nothing; If the address is already in the reservations_pending list, > > directly reserve memory; only add new nodes for the addresses that are > > not in the reserved_pages list and reservations_pending list. > > > > Signed-off-by: YiPeng Chai > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 > > +--- > > 1 file changed, 19 insertions(+), 9 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > index 1e36c428d254..0bf3f4092900 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > > @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct > > ttm_resource_manager *man) > > > > dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n", > > rsv->start, rsv->size); > > - > > vis_usage = amdgpu_vram_mgr_vis_size(adev, block); > > atomic64_add(vis_usage, >vis_usage); > > spin_lock(>bdev->lru_lock); @@ -340,19 +339,30 @@ > > int amdgpu_vram_mgr_reserve_range(struct > > amdgpu_vram_mgr *mgr, > > uint64_t start, uint64_t size) { > > struct amdgpu_vram_reservation *rsv; > > + int ret = 0; > > > > - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > > - if (!rsv) > > - return -ENOMEM; > > + ret = amdgpu_vram_mgr_query_page_status(mgr, start); > > + if (!ret) > > + return 0; > > + > > + if (ret == -ENOENT) { > > + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > > + if (!rsv) > > + return -ENOMEM; > > > > - INIT_LIST_HEAD(>allocated); > > - INIT_LIST_HEAD(>blocks); > > + INIT_LIST_HEAD(>allocated); > > + INIT_LIST_HEAD(>blocks); > > > > - rsv->start = start; > > - rsv->size = size; > > + rsv->start = start; > > + rsv->size = size; > > + > > + mutex_lock(>lock); > > + list_add_tail(>blocks, >reservations_pending); > > + mutex_unlock(>lock); > > + > > + } > > > > mutex_lock(>lock); > > - list_add_tail(>blocks, >reservations_pending); > > amdgpu_vram_mgr_do_reserve(>manager); > > mutex_unlock(>lock); > > > > -- > > 2.34.1 > >
RE: [PATCH] drm/amdgpu: Fix incorrect return value
[AMD Official Use Only - General] - Best Regards, Thomas -Original Message- From: Zhou1, Tao Sent: Wednesday, April 3, 2024 6:36 PM To: Chai, Thomas ; amd-gfx@lists.freedesktop.org Cc: Zhang, Hawking ; Li, Candice ; Wang, Yang(Kevin) ; Yang, Stanley Subject: RE: [PATCH] drm/amdgpu: Fix incorrect return value [AMD Official Use Only - General] > -Original Message- > From: Chai, Thomas > Sent: Wednesday, April 3, 2024 3:07 PM > To: amd-gfx@lists.freedesktop.org > Cc: Chai, Thomas ; Zhang, Hawking > ; Zhou1, Tao ; Li, Candice > ; Wang, Yang(Kevin) ; > Yang, Stanley ; Chai, Thomas > > Subject: [PATCH] drm/amdgpu: Fix incorrect return value > > [Why] > After calling amdgpu_vram_mgr_reserve_range multiple times with the > same address, calling amdgpu_vram_mgr_query_page_status will always > return - EBUSY. >[Tao] could you explain why we call amdgpu_vram_mgr_reserve_range multiple >times with the same address? IIRC, we skip duplicate address before reserve >memory. [Thomas] When poison creation interrupt is received, since some poisoning addresses may have been allocated by some processes, reserving these memories will fail. These memory will be tried to reserve again after killing the poisoned process in the subsequent poisoning consumption interrupt handler. so amdgpu_vram_mgr_reserve_range needs to be called multiple times with the same address. > From the second call to amdgpu_vram_mgr_reserve_range, the same > address will be added to the reservations_pending list again and is > never moved to the reserved_pages list because the address had been reserved. > > [How] > First add the address status check before calling > amdgpu_vram_mgr_do_reserve, if the address is already reserved, do > nothing; If the address is already in the reservations_pending list, > directly reserve memory; only add new nodes for the addresses that are > not in the reserved_pages list and reservations_pending list. > > Signed-off-by: YiPeng Chai > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 > +--- > 1 file changed, 19 insertions(+), 9 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > index 1e36c428d254..0bf3f4092900 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct > ttm_resource_manager *man) > > dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n", > rsv->start, rsv->size); > - > vis_usage = amdgpu_vram_mgr_vis_size(adev, block); > atomic64_add(vis_usage, >vis_usage); > spin_lock(>bdev->lru_lock); @@ -340,19 +339,30 @@ > int amdgpu_vram_mgr_reserve_range(struct > amdgpu_vram_mgr *mgr, > uint64_t start, uint64_t size) { > struct amdgpu_vram_reservation *rsv; > + int ret = 0; > > - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > - if (!rsv) > - return -ENOMEM; > + ret = amdgpu_vram_mgr_query_page_status(mgr, start); > + if (!ret) > + return 0; > + > + if (ret == -ENOENT) { > + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > + if (!rsv) > + return -ENOMEM; > > - INIT_LIST_HEAD(>allocated); > - INIT_LIST_HEAD(>blocks); > + INIT_LIST_HEAD(>allocated); > + INIT_LIST_HEAD(>blocks); > > - rsv->start = start; > - rsv->size = size; > + rsv->start = start; > + rsv->size = size; > + > + mutex_lock(>lock); > + list_add_tail(>blocks, >reservations_pending); > + mutex_unlock(>lock); > + > + } > > mutex_lock(>lock); > - list_add_tail(>blocks, >reservations_pending); > amdgpu_vram_mgr_do_reserve(>manager); > mutex_unlock(>lock); > > -- > 2.34.1
RE: [PATCH] drm/amdgpu: Fix incorrect return value
[AMD Official Use Only - General] > -Original Message- > From: Chai, Thomas > Sent: Wednesday, April 3, 2024 3:07 PM > To: amd-gfx@lists.freedesktop.org > Cc: Chai, Thomas ; Zhang, Hawking > ; Zhou1, Tao ; Li, Candice > ; Wang, Yang(Kevin) ; Yang, > Stanley ; Chai, Thomas > Subject: [PATCH] drm/amdgpu: Fix incorrect return value > > [Why] > After calling amdgpu_vram_mgr_reserve_range multiple times with the same > address, calling amdgpu_vram_mgr_query_page_status will always return - > EBUSY. [Tao] could you explain why we call amdgpu_vram_mgr_reserve_range multiple times with the same address? IIRC, we skip duplicate address before reserve memory. > From the second call to amdgpu_vram_mgr_reserve_range, the same address > will be added to the reservations_pending list again and is never moved to the > reserved_pages list because the address had been reserved. > > [How] > First add the address status check before calling > amdgpu_vram_mgr_do_reserve, if the address is already reserved, do nothing; If > the address is already in the reservations_pending list, directly reserve > memory; > only add new nodes for the addresses that are not in the reserved_pages list > and > reservations_pending list. > > Signed-off-by: YiPeng Chai > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c | 28 +--- > 1 file changed, 19 insertions(+), 9 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > index 1e36c428d254..0bf3f4092900 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vram_mgr.c > @@ -317,7 +317,6 @@ static void amdgpu_vram_mgr_do_reserve(struct > ttm_resource_manager *man) > > dev_dbg(adev->dev, "Reservation 0x%llx - %lld, Succeeded\n", > rsv->start, rsv->size); > - > vis_usage = amdgpu_vram_mgr_vis_size(adev, block); > atomic64_add(vis_usage, >vis_usage); > spin_lock(>bdev->lru_lock); > @@ -340,19 +339,30 @@ int amdgpu_vram_mgr_reserve_range(struct > amdgpu_vram_mgr *mgr, > uint64_t start, uint64_t size) > { > struct amdgpu_vram_reservation *rsv; > + int ret = 0; > > - rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > - if (!rsv) > - return -ENOMEM; > + ret = amdgpu_vram_mgr_query_page_status(mgr, start); > + if (!ret) > + return 0; > + > + if (ret == -ENOENT) { > + rsv = kzalloc(sizeof(*rsv), GFP_KERNEL); > + if (!rsv) > + return -ENOMEM; > > - INIT_LIST_HEAD(>allocated); > - INIT_LIST_HEAD(>blocks); > + INIT_LIST_HEAD(>allocated); > + INIT_LIST_HEAD(>blocks); > > - rsv->start = start; > - rsv->size = size; > + rsv->start = start; > + rsv->size = size; > + > + mutex_lock(>lock); > + list_add_tail(>blocks, >reservations_pending); > + mutex_unlock(>lock); > + > + } > > mutex_lock(>lock); > - list_add_tail(>blocks, >reservations_pending); > amdgpu_vram_mgr_do_reserve(>manager); > mutex_unlock(>lock); > > -- > 2.34.1
Re: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage
[AMD Public Use] No need to apologize, it was a good catch. Something to double check in the future if something similar ends up getting applied again. Thanks! Alex From: amd-gfx on behalf of Matt Coffin Sent: Friday, August 14, 2020 3:13 PM To: amd-gfx@lists.freedesktop.org Cc: Quan, Evan ; Koenig, Christian ; Li, Dennis Subject: Re: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage Hi all, As of 026acaeac2d205f22c0f682cc1c7b1a85b9ccd00 ("drm/amdgpu: revert "fix system hang issue during GPU reset""), this patch is no longer needed, and won't apply, because the badly-behaving code was removed; I'll withdraw this patch for now. Sorry to those who wasted their time reviewing. Cheers, Matt On 8/13/20 7:15 PM, Matt Coffin wrote: > The changes in edad8312cbbf9a33c86873fc4093664f150dd5c1 introduced an > issue with the sysfs interface for pp_od_clk_voltage. It overwrites the > return value to 0 when it calls another function, then returns 0. The > intended behavior is that a positive return value indicates the number > of bytes from the buffer that you processed in that call. > > With the 0 return value, clients would submit the same value to be > written over and over again, resulting in an infinite loop. > > This is resolved by returning the count of bytes read (in this case the > whole message), when the desired return is 0 (success). > > Fixes: edad8312cbbf ("drm/amdgpu: fix system hang issue during GPU") > Bug: > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1245data=02%7C01%7Calexander.deucher%40amd.com%7C3fb1fd9f95ad441a88a208d840862e80%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637330292979204288sdata=iABLxfzQVQa5zBK6a0JozvRYl%2Fg5eTFnfOS86r0g9rU%3Dreserved=0 > Signed-off-by: Matt Coffin > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > index 1705e328c6fc..f00c7ed361d4 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > @@ -937,7 +937,11 @@ static ssize_t amdgpu_set_pp_od_clk_voltage(struct > device *dev, > > pro_end: >up_read(>reset_sem); > - return ret; > + if (ret) { > + return ret; > + } else { > + return count; > + } > } > > static ssize_t amdgpu_get_pp_od_clk_voltage(struct device *dev, > ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=02%7C01%7Calexander.deucher%40amd.com%7C3fb1fd9f95ad441a88a208d840862e80%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637330292979204288sdata=GLWlbhgo0grFjcOh5ZAJWhXGxSm6Ufw8eDFDHZf95Uk%3Dreserved=0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage
Hi all, As of 026acaeac2d205f22c0f682cc1c7b1a85b9ccd00 ("drm/amdgpu: revert "fix system hang issue during GPU reset""), this patch is no longer needed, and won't apply, because the badly-behaving code was removed; I'll withdraw this patch for now. Sorry to those who wasted their time reviewing. Cheers, Matt On 8/13/20 7:15 PM, Matt Coffin wrote: > The changes in edad8312cbbf9a33c86873fc4093664f150dd5c1 introduced an > issue with the sysfs interface for pp_od_clk_voltage. It overwrites the > return value to 0 when it calls another function, then returns 0. The > intended behavior is that a positive return value indicates the number > of bytes from the buffer that you processed in that call. > > With the 0 return value, clients would submit the same value to be > written over and over again, resulting in an infinite loop. > > This is resolved by returning the count of bytes read (in this case the > whole message), when the desired return is 0 (success). > > Fixes: edad8312cbbf ("drm/amdgpu: fix system hang issue during GPU") > Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1245 > Signed-off-by: Matt Coffin > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > index 1705e328c6fc..f00c7ed361d4 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > @@ -937,7 +937,11 @@ static ssize_t amdgpu_set_pp_od_clk_voltage(struct > device *dev, > > pro_end: > up_read(>reset_sem); > - return ret; > + if (ret) { > + return ret; > + } else { > + return count; > + } > } > > static ssize_t amdgpu_get_pp_od_clk_voltage(struct device *dev, > ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage
[AMD Official Use Only - Internal Distribution Only] Thanks for catching this. The patch is reviewed-by: Evan Quan I have the same copy as Matt. @Li, Dennis maybe you were working on an old copy? BR Evan -Original Message- From: amd-gfx On Behalf Of Matt Coffin Sent: Friday, August 14, 2020 11:14 AM To: Li, Dennis ; amd-gfx@lists.freedesktop.org Cc: Koenig, Christian Subject: Re: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage Hey Dennis, Thanks for the testing. I'm having some issues reproducing, as that command is working fine for me in sh, bash, and zsh. So just to confirm a few things while I look at it... 1. What kind of SMU is on whatever card you're testing on? Looks like smu_v11+ to me? 2. (shouldn't matter if you're right about which -EINVAL return is being hit), but is OverDrive enabled? 3. Is this based off of latest amd-staging-drm-next? This is the code block I'm seeing on the HEAD of alex's branch... which is a bit different from what you pasted. This error also happens **before** the infinite loop I was fixing with this patch, but might as well get both birds with one stone if there's still an issue. while (tmp_str[0]) { sub_str = strsep(_str, delimiter); ret = kstrtol(sub_str, 0, [parameter_size]); if (ret) return -EINVAL; parameter_size++; while (isspace(*tmp_str)) tmp_str++; } On 8/13/20 8:14 PM, Li, Dennis wrote: > [AMD Official Use Only - Internal Distribution Only] > > Hi, Matt, > With your change, I still could reproduce the following issue: > > # echo "s 1 1900" > /sys/class/drm/card0/device/pp_od_clk_voltage > bash: echo: write error: Invalid argument > > I found that it is related the following lines code, could you help > double check it? > > while ((sub_str = strsep(_str, delimiter)) != NULL) { // sub_str will be > empty string > ret = kstrtol(sub_str, 0, [parameter_size]); > if (ret) > return -EINVAL; // return here > parameter_size++; > > while (isspace(*tmp_str)) > tmp_str++; > } > > Best Regards > Dennis Li > -Original Message- > From: Matt Coffin > Sent: Friday, August 14, 2020 9:15 AM > To: amd-gfx@lists.freedesktop.org > Cc: Koenig, Christian ; Li, Dennis > ; Matt Coffin > Subject: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for > pp_od_clk_voltage > > The changes in edad8312cbbf9a33c86873fc4093664f150dd5c1 introduced an issue > with the sysfs interface for pp_od_clk_voltage. It overwrites the return > value to 0 when it calls another function, then returns 0. The intended > behavior is that a positive return value indicates the number of bytes from > the buffer that you processed in that call. > > With the 0 return value, clients would submit the same value to be written > over and over again, resulting in an infinite loop. > > This is resolved by returning the count of bytes read (in this case the whole > message), when the desired return is 0 (success). > > Fixes: edad8312cbbf ("drm/amdgpu: fix system hang issue during GPU") > Bug: > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitl > ab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1245data=02%7C01%7C > evan.quan%40amd.com%7C08991e3858f144a4dd0908d840001324%7C3dd8961fe4884 > e608e11a82d994e183d%7C0%7C0%7C637329716385273288sdata=w%2FBlX8CpG > 0TfTYd1InX8Z%2FMBrRbVu%2FT4zWehWKHClCE%3Dreserved=0 > Signed-off-by: Matt Coffin > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > index 1705e328c6fc..f00c7ed361d4 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > @@ -937,7 +937,11 @@ static ssize_t > amdgpu_set_pp_od_clk_voltage(struct device *dev, > > pro_end: > up_read(>reset_sem); > -return ret; > +if (ret) { > +return ret; > +} else { > +return count; > +} > } > > static ssize_t amdgpu_get_pp_od_clk_voltage(struct device *dev, > -- > 2.28.0 > ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfxdata=02%7C01%7Cevan.quan%40amd.com%7C08991e3858f144a4dd0908d840001324%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637329716385273288sdata=IbsPW7%2Fr2HXLxKK4%2FOb6fqFmZXtyivbTsP9ftd7P2Dw%3Dreserved=0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Re: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage
Hey Dennis, Thanks for the testing. I'm having some issues reproducing, as that command is working fine for me in sh, bash, and zsh. So just to confirm a few things while I look at it... 1. What kind of SMU is on whatever card you're testing on? Looks like smu_v11+ to me? 2. (shouldn't matter if you're right about which -EINVAL return is being hit), but is OverDrive enabled? 3. Is this based off of latest amd-staging-drm-next? This is the code block I'm seeing on the HEAD of alex's branch... which is a bit different from what you pasted. This error also happens **before** the infinite loop I was fixing with this patch, but might as well get both birds with one stone if there's still an issue. while (tmp_str[0]) { sub_str = strsep(_str, delimiter); ret = kstrtol(sub_str, 0, [parameter_size]); if (ret) return -EINVAL; parameter_size++; while (isspace(*tmp_str)) tmp_str++; } On 8/13/20 8:14 PM, Li, Dennis wrote: > [AMD Official Use Only - Internal Distribution Only] > > Hi, Matt, > With your change, I still could reproduce the following issue: > > # echo "s 1 1900" > /sys/class/drm/card0/device/pp_od_clk_voltage > bash: echo: write error: Invalid argument > > I found that it is related the following lines code, could you help > double check it? > > while ((sub_str = strsep(_str, delimiter)) != NULL) { // sub_str > will be empty string > ret = kstrtol(sub_str, 0, [parameter_size]); > if (ret) > return -EINVAL; // return here > parameter_size++; > > while (isspace(*tmp_str)) > tmp_str++; > } > > Best Regards > Dennis Li > -Original Message- > From: Matt Coffin > Sent: Friday, August 14, 2020 9:15 AM > To: amd-gfx@lists.freedesktop.org > Cc: Koenig, Christian ; Li, Dennis > ; Matt Coffin > Subject: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for > pp_od_clk_voltage > > The changes in edad8312cbbf9a33c86873fc4093664f150dd5c1 introduced an issue > with the sysfs interface for pp_od_clk_voltage. It overwrites the return > value to 0 when it calls another function, then returns 0. The intended > behavior is that a positive return value indicates the number of bytes from > the buffer that you processed in that call. > > With the 0 return value, clients would submit the same value to be written > over and over again, resulting in an infinite loop. > > This is resolved by returning the count of bytes read (in this case the whole > message), when the desired return is 0 (success). > > Fixes: edad8312cbbf ("drm/amdgpu: fix system hang issue during GPU") > Bug: > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1245data=02%7C01%7CDennis.Li%40amd.com%7C4de8308bf7974ea9e62308d83fef922b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637329646078379799sdata=N9c6e7cUMCDpvBIYUEzxkadJbJdBryXyfhfhb%2BUEwjg%3Dreserved=0 > Signed-off-by: Matt Coffin > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > index 1705e328c6fc..f00c7ed361d4 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c > @@ -937,7 +937,11 @@ static ssize_t amdgpu_set_pp_od_clk_voltage(struct > device *dev, > > pro_end: > up_read(>reset_sem); > - return ret; > + if (ret) { > + return ret; > + } else { > + return count; > + } > } > > static ssize_t amdgpu_get_pp_od_clk_voltage(struct device *dev, > -- > 2.28.0 > ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
RE: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage
[AMD Official Use Only - Internal Distribution Only] Hi, Matt, With your change, I still could reproduce the following issue: # echo "s 1 1900" > /sys/class/drm/card0/device/pp_od_clk_voltage bash: echo: write error: Invalid argument I found that it is related the following lines code, could you help double check it? while ((sub_str = strsep(_str, delimiter)) != NULL) { // sub_str will be empty string ret = kstrtol(sub_str, 0, [parameter_size]); if (ret) return -EINVAL; // return here parameter_size++; while (isspace(*tmp_str)) tmp_str++; } Best Regards Dennis Li -Original Message- From: Matt Coffin Sent: Friday, August 14, 2020 9:15 AM To: amd-gfx@lists.freedesktop.org Cc: Koenig, Christian ; Li, Dennis ; Matt Coffin Subject: [PATCH] drm/amdgpu: Fix incorrect return value in sysfs for pp_od_clk_voltage The changes in edad8312cbbf9a33c86873fc4093664f150dd5c1 introduced an issue with the sysfs interface for pp_od_clk_voltage. It overwrites the return value to 0 when it calls another function, then returns 0. The intended behavior is that a positive return value indicates the number of bytes from the buffer that you processed in that call. With the 0 return value, clients would submit the same value to be written over and over again, resulting in an infinite loop. This is resolved by returning the count of bytes read (in this case the whole message), when the desired return is 0 (success). Fixes: edad8312cbbf ("drm/amdgpu: fix system hang issue during GPU") Bug: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1245data=02%7C01%7CDennis.Li%40amd.com%7C4de8308bf7974ea9e62308d83fef922b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637329646078379799sdata=N9c6e7cUMCDpvBIYUEzxkadJbJdBryXyfhfhb%2BUEwjg%3Dreserved=0 Signed-off-by: Matt Coffin --- drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c index 1705e328c6fc..f00c7ed361d4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c @@ -937,7 +937,11 @@ static ssize_t amdgpu_set_pp_od_clk_voltage(struct device *dev, pro_end: up_read(>reset_sem); - return ret; + if (ret) { + return ret; + } else { + return count; + } } static ssize_t amdgpu_get_pp_od_clk_voltage(struct device *dev, -- 2.28.0 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx