Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
Am 13.07.19 um 22:24 schrieb Felix Kuehling:
> Am 2019-04-30 um 1:03 p.m. schrieb Koenig, Christian:
>>>> The only real solution I can see is to be able to reliably kill shaders
>>>> in an OOM situation.
>>> Well, we can in fact preempt our compute shaders with low latency.
>>> Killing a KFD process will do exactly that.
>> I've taken a look at that thing as well and to be honest it is not even
>> remotely sufficient.
>>
>> We need something which stops the hardware *immediately* from accessing
>> system memory, and not wait for the SQ to kill all waves, flush caches
>> etc...
>>
>> One possibility I've been playing around with for a while is to replace
>> the root PD for the VMIDs in question on the fly. E.g. we just let it
>> point to some dummy which redirects everything into nirvana.
>>
>> But implementing this is easier said than done...
>
> Warming up this thread, since I just fixed another bug that was enabled
> by artificial memory pressure due to the GTT limit.
>
> I think disabling the PD for the VMIDs is a good idea. A problem is that
> HWS firmware updates PD pointers in the background for its VMIDs. So
> this would require a reliable and fast way to kill the HWS first.

Well, we don't necessarily need to completely kill the HWS. What we need is to suspend it, kill a specific process and resume it later on.

As far as I can see, the concept for the HWS interaction was to use a ring buffer with asynchronous feedback when something is done. That is really convenient for performant and reliable operation, but unfortunately not if you need to kill off some processing immediately.

So what is needed here is something like setting a bit in a register to suspend the HWS, killing the VMIDs, setting a flag in the HWS runlist to stop it from scheduling a specific process once more, and then resuming the HWS.

> An alternative I thought about is disabling bus access at the BIF level,
> if that's possible somehow. Basically we would instantaneously kill all
> GPU system memory access, signal all fences or just remove all fences
> from all BO reservations (reservation_object_add_excl_fence(resv, NULL))
> to allow memory to be freed, let the OOM killer do its thing, and when
> the dust settles, reset the GPU.

Yeah, I thought about that as well. The problem with this approach is that it is rather invasive. E.g. stopping the BIF means stopping it for everybody, not just the process which is currently being killed, and when we reset the GPU it is actually quite likely that we lose the contents of VRAM.

Regards,
Christian.

> Regards,
> Felix

>> Regards,
>> Christian.

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
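[Editor's note: the suspend/kill/resume sequence sketched above could look roughly like the following. This is pure pseudocode reflecting only what the mail describes; every helper name here is a hypothetical placeholder, not an existing amdgpu/KFD interface.]

```c
/* Hypothetical shortcut path for killing one process' GPU memory access.
 * None of these helpers exist as-is; they only name the steps from the
 * mail. The whole path must avoid taking locks or allocating memory,
 * since it may be called from OOM context.
 */
static void kill_process_gpu_access(struct kfd_process *p)
{
	/* 1. Suspend the HWS, e.g. by setting a bit in a register, so it
	 *    stops updating PD pointers behind our back. */
	hws_suspend(p->dev);

	/* 2. Point the root PD of the process' VMIDs at a dummy page
	 *    directory that redirects all accesses into nirvana, then
	 *    busy-wait for the TLB flush; no sleeping, no fence waits. */
	for_each_vmid(p, vmid)
		set_root_pd(p->dev, vmid, dummy_pd_addr);
	busy_wait_tlb_flush(p->dev);

	/* 3. Flag the process in the HWS runlist so it is never scheduled
	 *    again, then let the HWS continue for everyone else. */
	runlist_mark_dead(p);
	hws_resume(p->dev);
}
```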
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
Am 2019-04-30 um 1:03 p.m. schrieb Koenig, Christian:
>>> The only real solution I can see is to be able to reliably kill shaders
>>> in an OOM situation.
>> Well, we can in fact preempt our compute shaders with low latency.
>> Killing a KFD process will do exactly that.
> I've taken a look at that thing as well and to be honest it is not even
> remotely sufficient.
>
> We need something which stops the hardware *immediately* from accessing
> system memory, and not wait for the SQ to kill all waves, flush caches
> etc...
>
> One possibility I've been playing around with for a while is to replace
> the root PD for the VMIDs in question on the fly. E.g. we just let it
> point to some dummy which redirects everything into nirvana.
>
> But implementing this is easier said than done...

Warming up this thread, since I just fixed another bug that was enabled by artificial memory pressure due to the GTT limit.

I think disabling the PD for the VMIDs is a good idea. A problem is that HWS firmware updates PD pointers in the background for its VMIDs. So this would require a reliable and fast way to kill the HWS first.

An alternative I thought about is disabling bus access at the BIF level, if that's possible somehow. Basically we would instantaneously kill all GPU system memory access, signal all fences or just remove all fences from all BO reservations (reservation_object_add_excl_fence(resv, NULL)) to allow memory to be freed, let the OOM killer do its thing, and when the dust settles, reset the GPU.

Regards,
Felix

> Regards,
> Christian.
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
Am 30.04.19 um 19:25 schrieb Kuehling, Felix:
> [SNIP]
>>>> To sum it up, the requirement of using almost all system memory by a
>>>> GPU is simply not possible upstream and even in any production system
>>>> rather questionable.
>>> It should be doable with userptr, which now uses unpinned pages through
>>> HMM. Currently the GTT limit affects the largest possible userptr
>>> allocation, though not the total sum of all userptr allocations. Maybe
>>> making userptr completely independent of GTT size would be an easier
>>> problem to tackle. Then I can leave the GTT limit alone.
>> Well, this way we would only avoid the symptoms, but not the real problem.
> It allocates pages in user mode rather than kernel mode. That means OOM
> situations take a completely different code path. Before running out of
> memory completely, triggering the OOM killer, the kernel would start
> swapping pages, which would trigger the MMU notifier to stop the user
> mode queues or invalidate GPU page table entries, and allow the pages to
> be swapped out.

Well, it at least removes the extra layer we have here in TTM. But what I meant is that it still doesn't fix our underlying problem of stopping the hardware immediately.

>>>> The only real solution I can see is to be able to reliably kill shaders
>>>> in an OOM situation.
>>> Well, we can in fact preempt our compute shaders with low latency.
>>> Killing a KFD process will do exactly that.
>> I've taken a look at that thing as well and to be honest it is not even
>> remotely sufficient.
>>
>> We need something which stops the hardware *immediately* from accessing
>> system memory, and not wait for the SQ to kill all waves, flush caches
>> etc...
> It's apparently sufficient to use in our MMU notifier. There is also a
> way to disable the grace period that allows short waves to complete
> before being preempted, though we're not using that at the moment.
>
>> One possibility I've been playing around with for a while is to replace
>> the root PD for the VMIDs in question on the fly. E.g. we just let it
>> point to some dummy which redirects everything into nirvana.
> Even that's not sufficient. You'll also need to free the pages
> immediately. For KFD processes, cleaning up of memory is done in a
> worker thread that gets kicked off by a release MMU notifier when the
> process' mm_struct is taken down.

Yeah, that worker is what I meant when I said that this whole thing is not sufficient.

BTW: How does that still work with HMM? I mean, HMM doesn't take a reference to the pages any more.

But let me say it differently: when we want the OOM killer to work correctly, we need to have a shortcut path which doesn't take any locks or allocate memory.

What we can do is: write some registers and then maybe busy-wait for the TLB flush to complete.

What we can't do is: wait for the SQ to kill waves, wait for fences etc...

> Then there is still TTM's delayed freeing of BOs that waits for fences.
> So you'd need to signal all the BO fences to allow them to be released.

Actually, that doesn't apply in the critical code path. In this situation TTM tries to free up the things it doesn't need to wait for immediately.

What we are missing here is maybe something like a kill interface for fences...

> TBH, I don't understand why waiting is not an option, if the alternative
> is a kernel panic. If your OOM killer kicks in, your system is basically
> dead. Waiting for a fraction of a second to let a GPU finish its memory
> access should be a small price to pay in that situation.

Oh yeah, that is a really good point as well.

I think that this restriction was created to make sure that the OOM killer always makes progress and doesn't wait for things like network congestion.

Now, the Linux MM is not really made for long-term I/O mappings anyway. And that was also recently a topic on the lists in the context of HMM (there is an LWN summary about it).

Probably worth bringing that discussion up once more.

Christian.

> Regards,
> Felix
>
>> But implementing this is easier said than done...
>>
>> Regards,
>> Christian.
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
On 2019-04-30 1:03 p.m., Koenig, Christian wrote:
> Am 30.04.19 um 17:36 schrieb Kuehling, Felix:
>> On 2019-04-30 5:32 a.m., Christian König wrote:
>>> Am 30.04.19 um 01:16 schrieb Kuehling, Felix:
>>>> On 2019-04-29 8:34 a.m., Christian König wrote:
>>>>> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
>>>>>> From: Kent Russell
>>>>>>
>>>>>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>>>>>> system memory. This severely limits the quantity of system memory
>>>>>> that can be used by ROCm applications.
>>>>>>
>>>>>> Increase GTT size to the maximum of VRAM size or system memory size.
>>>>> Well, NAK.
>>>>>
>>>>> This limit was done on purpose because otherwise the max-texture-size
>>>>> would crash the system, because the OOM killer would cause a system
>>>>> panic.
>>>>>
>>>>> Using more than 75% of system memory by the GPU at the same time makes
>>>>> the system unstable and so we can't allow that by default.
>>>> Like we discussed, the current implementation is too limiting. On a Fiji
>>>> system with 4GB VRAM and 32GB system memory, it limits system memory
>>>> allocations to 4GB. I think this workaround was fixed once before and
>>>> reverted because it broke a CZ system with 1GB system memory. So I
>>>> suspect that this is an issue affecting small memory systems where maybe
>>>> the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
>>>> panics.
>>> Well it not only broke on a 1GB CZ system, this was just where Andrey
>>> reproduced it. We got reports from all kinds of systems.
>> I'd like to see those reports. This patch has been included in Linux Pro
>> releases since 18.20. I'm not aware that anyone complained about it.
> Well, to be honest our Pro driver is actually not used that widely, and
> only on rather homogeneous systems.
>
> Which is not really surprising, since we only advise to use it for
> professional use cases.
>
>> The OOM killer problem is a more general problem that potentially
>> affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
>> inadequate workaround at best. I'd like to look for a better solution,
>> probably some adjustment of the TTM system memory limits on systems with
>> small memory, to avoid OOM panics on such systems.
>>> The core problem here is that the OOM killer explicitly doesn't want to
>>> block for shaders to finish whatever they are doing.
>>>
>>> So currently, as soon as the hardware is using some memory it can't be
>>> reclaimed immediately.
>>>
>>> The original limit in TTM was 2/3 of system memory and that worked
>>> really reliably, and we ran into problems only after raising it to 3/4.
>> The TTM system memory limit is still 3/8 soft and 1/2 hard, 3/4 for
>> emergencies. See ttm_mem_init_kernel_zone. AFAICT, the emergency limit
>> is only available to root.
> Ah! I think I know why those limits don't kick in here!
>
> When GTT space is used by evictions from VRAM then we will use the
> emergency limit as well.
>
>> This GTT limit kicks in before I get anywhere close to the TTM limit.
>> That's why I think it is both broken and redundant.
> That was also the argument when we removed it the last time, but it got
> immediately reverted.
>
>>> To sum it up, the requirement of using almost all system memory by a GPU
>>> is simply not possible upstream and even in any production system rather
>>> questionable.
>> It should be doable with userptr, which now uses unpinned pages through
>> HMM. Currently the GTT limit affects the largest possible userptr
>> allocation, though not the total sum of all userptr allocations. Maybe
>> making userptr completely independent of GTT size would be an easier
>> problem to tackle. Then I can leave the GTT limit alone.
> Well, this way we would only avoid the symptoms, but not the real problem.

It allocates pages in user mode rather than kernel mode. That means OOM situations take a completely different code path. Before running out of memory completely, triggering the OOM killer, the kernel would start swapping pages, which would trigger the MMU notifier to stop the user mode queues or invalidate GPU page table entries, and allow the pages to be swapped out.

>>> The only real solution I can see is to be able to reliably kill shaders
>>> in an OOM situation.
>> Well, we can in fact preempt our compute shaders with low latency.
>> Killing a KFD process will do exactly that.
> I've taken a look at that thing as well and to be honest it is not even
> remotely sufficient.
>
> We need something which stops the hardware *immediately* from accessing
> system memory, and not wait for the SQ to kill all waves, flush caches
> etc...

It's apparently sufficient to use in our MMU notifier. There is also a way to disable the grace period that allows short waves to complete before being preempted, though we're not using that at the moment.

> One possibility I've been playing around with for a while is to replace
> the root PD for the VMIDs in question on the fly. E.g. we just let it
> point to some dummy which redirects everything into nirvana.

Even that's not sufficient. You'll also need to free the pages immediately. For KFD processes, cleaning up of memory is done in a worker thread that gets kicked off by a release MMU notifier when the process' mm_struct is taken down.

Then there is still TTM's delayed freeing of BOs that waits for fences. So you'd need to signal all the BO fences to allow them to be released.

TBH, I don't understand why waiting is not an option, if the alternative is a kernel panic. If your OOM killer kicks in, your system is basically dead. Waiting for a fraction of a second to let a GPU finish its memory access should be a small price to pay in that situation.

Regards,
Felix

> But implementing this is easier said than done...
>
> Regards,
> Christian.
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
Am 30.04.19 um 17:36 schrieb Kuehling, Felix:
> On 2019-04-30 5:32 a.m., Christian König wrote:
>> Am 30.04.19 um 01:16 schrieb Kuehling, Felix:
>>> On 2019-04-29 8:34 a.m., Christian König wrote:
>>>> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
>>>>> From: Kent Russell
>>>>>
>>>>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>>>>> system memory. This severely limits the quantity of system memory
>>>>> that can be used by ROCm applications.
>>>>>
>>>>> Increase GTT size to the maximum of VRAM size or system memory size.
>>>> Well, NAK.
>>>>
>>>> This limit was done on purpose because otherwise the max-texture-size
>>>> would crash the system, because the OOM killer would cause a system
>>>> panic.
>>>>
>>>> Using more than 75% of system memory by the GPU at the same time makes
>>>> the system unstable and so we can't allow that by default.
>>> Like we discussed, the current implementation is too limiting. On a Fiji
>>> system with 4GB VRAM and 32GB system memory, it limits system memory
>>> allocations to 4GB. I think this workaround was fixed once before and
>>> reverted because it broke a CZ system with 1GB system memory. So I
>>> suspect that this is an issue affecting small memory systems where maybe
>>> the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
>>> panics.
>> Well it not only broke on a 1GB CZ system, this was just where Andrey
>> reproduced it. We got reports from all kinds of systems.
> I'd like to see those reports. This patch has been included in Linux Pro
> releases since 18.20. I'm not aware that anyone complained about it.

Well, to be honest our Pro driver is actually not used that widely, and only on rather homogeneous systems.

Which is not really surprising, since we only advise to use it for professional use cases.

>>> The OOM killer problem is a more general problem that potentially
>>> affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
>>> inadequate workaround at best. I'd like to look for a better solution,
>>> probably some adjustment of the TTM system memory limits on systems with
>>> small memory, to avoid OOM panics on such systems.
>> The core problem here is that the OOM killer explicitly doesn't want to
>> block for shaders to finish whatever they are doing.
>>
>> So currently, as soon as the hardware is using some memory it can't be
>> reclaimed immediately.
>>
>> The original limit in TTM was 2/3 of system memory and that worked
>> really reliably, and we ran into problems only after raising it to 3/4.
> The TTM system memory limit is still 3/8 soft and 1/2 hard, 3/4 for
> emergencies. See ttm_mem_init_kernel_zone. AFAICT, the emergency limit
> is only available to root.

Ah! I think I know why those limits don't kick in here!

When GTT space is used by evictions from VRAM then we will use the emergency limit as well.

> This GTT limit kicks in before I get anywhere close to the TTM limit.
> That's why I think it is both broken and redundant.

That was also the argument when we removed it the last time, but it got immediately reverted.

>> To sum it up, the requirement of using almost all system memory by a GPU
>> is simply not possible upstream and even in any production system rather
>> questionable.
> It should be doable with userptr, which now uses unpinned pages through
> HMM. Currently the GTT limit affects the largest possible userptr
> allocation, though not the total sum of all userptr allocations. Maybe
> making userptr completely independent of GTT size would be an easier
> problem to tackle. Then I can leave the GTT limit alone.

Well, this way we would only avoid the symptoms, but not the real problem.

>> The only real solution I can see is to be able to reliably kill shaders
>> in an OOM situation.
> Well, we can in fact preempt our compute shaders with low latency.
> Killing a KFD process will do exactly that.

I've taken a look at that thing as well and to be honest it is not even remotely sufficient.

We need something which stops the hardware *immediately* from accessing system memory, and not wait for the SQ to kill all waves, flush caches etc...

One possibility I've been playing around with for a while is to replace the root PD for the VMIDs in question on the fly. E.g. we just let it point to some dummy which redirects everything into nirvana.

But implementing this is easier said than done...

Regards,
Christian.
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
On 2019-04-30 5:32 a.m., Christian König wrote:
> Am 30.04.19 um 01:16 schrieb Kuehling, Felix:
>> On 2019-04-29 8:34 a.m., Christian König wrote:
>>> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
>>>> From: Kent Russell
>>>>
>>>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>>>> system memory. This severely limits the quantity of system memory
>>>> that can be used by ROCm applications.
>>>>
>>>> Increase GTT size to the maximum of VRAM size or system memory size.
>>> Well, NAK.
>>>
>>> This limit was done on purpose because otherwise the max-texture-size
>>> would crash the system, because the OOM killer would cause a system
>>> panic.
>>>
>>> Using more than 75% of system memory by the GPU at the same time makes
>>> the system unstable and so we can't allow that by default.
>> Like we discussed, the current implementation is too limiting. On a Fiji
>> system with 4GB VRAM and 32GB system memory, it limits system memory
>> allocations to 4GB. I think this workaround was fixed once before and
>> reverted because it broke a CZ system with 1GB system memory. So I
>> suspect that this is an issue affecting small memory systems where maybe
>> the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
>> panics.
> Well it not only broke on a 1GB CZ system, this was just where Andrey
> reproduced it. We got reports from all kinds of systems.

I'd like to see those reports. This patch has been included in Linux Pro releases since 18.20. I'm not aware that anyone complained about it.

>> The OOM killer problem is a more general problem that potentially
>> affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
>> inadequate workaround at best. I'd like to look for a better solution,
>> probably some adjustment of the TTM system memory limits on systems with
>> small memory, to avoid OOM panics on such systems.
> The core problem here is that the OOM killer explicitly doesn't want to
> block for shaders to finish whatever they are doing.
>
> So currently, as soon as the hardware is using some memory it can't be
> reclaimed immediately.
>
> The original limit in TTM was 2/3 of system memory and that worked
> really reliably, and we ran into problems only after raising it to 3/4.

The TTM system memory limit is still 3/8 soft and 1/2 hard, 3/4 for emergencies. See ttm_mem_init_kernel_zone. AFAICT, the emergency limit is only available to root.

This GTT limit kicks in before I get anywhere close to the TTM limit. That's why I think it is both broken and redundant.

> To sum it up, the requirement of using almost all system memory by a GPU
> is simply not possible upstream and even in any production system rather
> questionable.

It should be doable with userptr, which now uses unpinned pages through HMM. Currently the GTT limit affects the largest possible userptr allocation, though not the total sum of all userptr allocations. Maybe making userptr completely independent of GTT size would be an easier problem to tackle. Then I can leave the GTT limit alone.

> The only real solution I can see is to be able to reliably kill shaders
> in an OOM situation.

Well, we can in fact preempt our compute shaders with low latency. Killing a KFD process will do exactly that.

Regards,
Felix

> Regards,
> Christian.

>> Regards,
>> Felix

>>> What could maybe work is to reduce the amount of system memory by a
>>> fixed factor, but off hand I don't see a way of fixing this in general.
>>>
>>> Regards,
>>> Christian.

>>>> Signed-off-by: Kent Russell
>>>> Reviewed-by: Felix Kuehling
>>>> Signed-off-by: Felix Kuehling
>>>> ---
>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 ++++-----
>>>>  1 file changed, 4 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index c14198737dcd..e9ecc3953673 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>>  		struct sysinfo si;
>>>>
>>>>  		si_meminfo(&si);
>>>> -		gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>>>> -				   adev->gmc.mc_vram_size),
>>>> -			       ((uint64_t)si.totalram * si.mem_unit * 3/4));
>>>> -	}
>>>> -	else
>>>> +		gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>>>> +				adev->gmc.mc_vram_size,
>>>> +				((uint64_t)si.totalram * si.mem_unit));
>>>> +	} else
>>>>  		gtt_size = (uint64_t)amdgpu_gtt_size << 20;
>>>>
>>>>  	/* Initialize GTT memory pool */
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
Am 30.04.19 um 01:16 schrieb Kuehling, Felix:
> On 2019-04-29 8:34 a.m., Christian König wrote:
>> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
>>> From: Kent Russell
>>>
>>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>>> system memory. This severely limits the quantity of system memory
>>> that can be used by ROCm applications.
>>>
>>> Increase GTT size to the maximum of VRAM size or system memory size.
>> Well, NAK.
>>
>> This limit was done on purpose because otherwise the max-texture-size
>> would crash the system, because the OOM killer would cause a system
>> panic.
>>
>> Using more than 75% of system memory by the GPU at the same time makes
>> the system unstable and so we can't allow that by default.
> Like we discussed, the current implementation is too limiting. On a Fiji
> system with 4GB VRAM and 32GB system memory, it limits system memory
> allocations to 4GB. I think this workaround was fixed once before and
> reverted because it broke a CZ system with 1GB system memory. So I
> suspect that this is an issue affecting small memory systems where maybe
> the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
> panics.

Well it not only broke on a 1GB CZ system, this was just where Andrey reproduced it. We got reports from all kinds of systems.

> The OOM killer problem is a more general problem that potentially
> affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
> inadequate workaround at best. I'd like to look for a better solution,
> probably some adjustment of the TTM system memory limits on systems with
> small memory, to avoid OOM panics on such systems.

The core problem here is that the OOM killer explicitly doesn't want to block for shaders to finish whatever they are doing.

So currently, as soon as the hardware is using some memory it can't be reclaimed immediately.

The original limit in TTM was 2/3 of system memory and that worked really reliably, and we ran into problems only after raising it to 3/4.

To sum it up, the requirement of using almost all system memory by a GPU is simply not possible upstream, and even in any production system rather questionable.

The only real solution I can see is to be able to reliably kill shaders in an OOM situation.

Regards,
Christian.

> Regards,
> Felix

>> What could maybe work is to reduce the amount of system memory by a
>> fixed factor, but off hand I don't see a way of fixing this in general.
>>
>> Regards,
>> Christian.

>>> Signed-off-by: Kent Russell
>>> Reviewed-by: Felix Kuehling
>>> Signed-off-by: Felix Kuehling
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 ++++-----
>>>  1 file changed, 4 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index c14198737dcd..e9ecc3953673 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>>  		struct sysinfo si;
>>>
>>>  		si_meminfo(&si);
>>> -		gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>>> -				   adev->gmc.mc_vram_size),
>>> -			       ((uint64_t)si.totalram * si.mem_unit * 3/4));
>>> -	}
>>> -	else
>>> +		gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>>> +				adev->gmc.mc_vram_size,
>>> +				((uint64_t)si.totalram * si.mem_unit));
>>> +	} else
>>>  		gtt_size = (uint64_t)amdgpu_gtt_size << 20;
>>>
>>>  	/* Initialize GTT memory pool */
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
On 2019-04-29 8:34 a.m., Christian König wrote:
> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
>> From: Kent Russell
>>
>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>> system memory. This severely limits the quantity of system memory
>> that can be used by ROCm applications.
>>
>> Increase GTT size to the maximum of VRAM size or system memory size.
>
> Well, NAK.
>
> This limit was done on purpose because otherwise the max-texture-size
> would crash the system, because the OOM killer would cause a system
> panic.
>
> Using more than 75% of system memory by the GPU at the same time makes
> the system unstable and so we can't allow that by default.

Like we discussed, the current implementation is too limiting. On a Fiji system with 4GB VRAM and 32GB system memory, it limits system memory allocations to 4GB. I think this workaround was fixed once before and reverted because it broke a CZ system with 1GB system memory. So I suspect that this is an issue affecting small memory systems where maybe the 1/2 system memory limit in TTM isn't sufficient to protect from OOM panics.

The OOM killer problem is a more general problem that potentially affects other drivers too. Keeping this GTT limit broken in AMDGPU is an inadequate workaround at best. I'd like to look for a better solution, probably some adjustment of the TTM system memory limits on systems with small memory, to avoid OOM panics on such systems.

Regards,
Felix

> What could maybe work is to reduce the amount of system memory by a
> fixed factor, but off hand I don't see a way of fixing this in general.
>
> Regards,
> Christian.

>> Signed-off-by: Kent Russell
>> Reviewed-by: Felix Kuehling
>> Signed-off-by: Felix Kuehling
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 ++++-----
>>  1 file changed, 4 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index c14198737dcd..e9ecc3953673 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>  		struct sysinfo si;
>>
>>  		si_meminfo(&si);
>> -		gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>> -				   adev->gmc.mc_vram_size),
>> -			       ((uint64_t)si.totalram * si.mem_unit * 3/4));
>> -	}
>> -	else
>> +		gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>> +				adev->gmc.mc_vram_size,
>> +				((uint64_t)si.totalram * si.mem_unit));
>> +	} else
>>  		gtt_size = (uint64_t)amdgpu_gtt_size << 20;
>>
>>  	/* Initialize GTT memory pool */
Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation
Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
> From: Kent Russell
>
> GTT size is currently limited to the minimum of VRAM size or 3/4 of
> system memory. This severely limits the quantity of system memory
> that can be used by ROCm applications.
>
> Increase GTT size to the maximum of VRAM size or system memory size.

Well, NAK.

This limit was done on purpose because otherwise the max-texture-size would crash the system, because the OOM killer would cause a system panic.

Using more than 75% of system memory by the GPU at the same time makes the system unstable, and so we can't allow that by default.

What could maybe work is to reduce the amount of system memory by a fixed factor, but off hand I don't see a way of fixing this in general.

Regards,
Christian.

> Signed-off-by: Kent Russell
> Reviewed-by: Felix Kuehling
> Signed-off-by: Felix Kuehling
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index c14198737dcd..e9ecc3953673 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>  		struct sysinfo si;
>
>  		si_meminfo(&si);
> -		gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
> -				   adev->gmc.mc_vram_size),
> -			       ((uint64_t)si.totalram * si.mem_unit * 3/4));
> -	}
> -	else
> +		gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
> +				adev->gmc.mc_vram_size,
> +				((uint64_t)si.totalram * si.mem_unit));
> +	} else
>  		gtt_size = (uint64_t)amdgpu_gtt_size << 20;
>
>  	/* Initialize GTT memory pool */
[PATCH 27/27] drm/amdgpu: Fix GTT size calculation
From: Kent Russell

GTT size is currently limited to the minimum of VRAM size or 3/4 of system memory. This severely limits the quantity of system memory that can be used by ROCm applications.

Increase GTT size to the maximum of VRAM size or system memory size.

Signed-off-by: Kent Russell
Reviewed-by: Felix Kuehling
Signed-off-by: Felix Kuehling
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c14198737dcd..e9ecc3953673 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
 		struct sysinfo si;
 
 		si_meminfo(&si);
-		gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
-				   adev->gmc.mc_vram_size),
-			       ((uint64_t)si.totalram * si.mem_unit * 3/4));
-	}
-	else
+		gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
+				adev->gmc.mc_vram_size,
+				((uint64_t)si.totalram * si.mem_unit));
+	} else
 		gtt_size = (uint64_t)amdgpu_gtt_size << 20;
 
 	/* Initialize GTT memory pool */
--
2.17.1