Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-07-14 Thread Koenig, Christian
Am 13.07.19 um 22:24 schrieb Felix Kuehling:
Am 2019-04-30 um 1:03 p.m. schrieb Koenig, Christian:

The only real solution I can see is to be able to reliably kill shaders
in an OOM situation.


Well, we can in fact preempt our compute shaders with low latency.
Killing a KFD process will do exactly that.


I've taken a look at that thing as well and to be honest it is not even
remotely sufficient.

We need something which stops the hardware *immediately* from accessing
system memory, and not wait for the SQ to kill all waves, flush caches
etc...

One possibility I've been playing around with for a while is to replace the
root PD for the VMIDs in question on the fly. E.g. we just let it point
to some dummy which redirects everything into nirvana.

But implementing this is easier said than done...

Warming up this thread, since I just fixed another bug that was exposed by
artificial memory pressure due to the GTT limit.

I think disabling the PD for the VMIDs is a good idea. A problem is that HWS 
firmware updates PD pointers in the background for its VMIDs. So this would 
require a reliable and fast way to kill the HWS first.

Well we don't necessarily need to completely kill the HWS. What we need is to
suspend it, kill a specific process and resume it later on.

As far as I can see the concept with the HWS interaction was to use a ring 
buffer with async feedback when something is done.

That is really convenient for performant and reliable operation, but
unfortunately not if you need to kill off some processing immediately.

So something like setting a bit in a register to suspend the HWS, kill the 
VMIDs, set a flag in the HWS runlist to stop it from scheduling a specific 
process once more and then resume the HWS is what is needed here.


An alternative I thought about is disabling bus access at the BIF level if 
that's possible somehow. Basically we would instantaneously kill all GPU system 
memory access, signal all fences or just remove all fences from all BO 
reservations (reservation_object_add_excl_fence(resv, NULL)) to allow memory to 
be freed, let the OOM killer do its thing, and when the dust settles, reset the 
GPU.

Yeah, thought about that as well. The problem with this approach is that it is 
rather invasive.

E.g. stopping the BIF means stopping it for everybody and not just the process 
which is currently killed and when we reset the GPU it is actually quite likely 
that we lose the content of VRAM.

Regards,
Christian.


Regards,
  Felix

Regards,
Christian.


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-07-13 Thread Felix Kuehling
Am 2019-04-30 um 1:03 p.m. schrieb Koenig, Christian:
>>> The only real solution I can see is to be able to reliably kill shaders
>>> in an OOM situation.
>> Well, we can in fact preempt our compute shaders with low latency.
>> Killing a KFD process will do exactly that.
> I've taken a look at that thing as well and to be honest it is not even 
> remotely sufficient.
>
> We need something which stops the hardware *immediately* from accessing 
> system memory, and not wait for the SQ to kill all waves, flush caches 
> etc...
>
> One possibility I've been playing around with for a while is to replace the 
> root PD for the VMIDs in question on the fly. E.g. we just let it point 
> to some dummy which redirects everything into nirvana.
>
> But implementing this is easier said than done...

Warming up this thread, since I just fixed another bug that was exposed
by artificial memory pressure due to the GTT limit.

I think disabling the PD for the VMIDs is a good idea. A problem is that
HWS firmware updates PD pointers in the background for its VMIDs. So
this would require a reliable and fast way to kill the HWS first.

An alternative I thought about is, disabling bus access at the BIF level
if that's possible somehow. Basically we would instantaneously kill all
GPU system memory access, signal all fences or just remove all fences
from all BO reservations (reservation_object_add_excl_fence(resv, NULL))
to allow memory to be freed, let the OOM killer do its thing, and when
the dust settles, reset the GPU.

Regards,
  Felix

>
> Regards,
> Christian.



Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-05-02 Thread Koenig, Christian
Am 30.04.19 um 19:25 schrieb Kuehling, Felix:
> [SNIP]
 To sum it up, using almost all system memory from a GPU is simply not
 possible upstream and rather questionable even in any production system.
>>> It should be doable with userptr, which now uses unpinned pages through
>>> HMM. Currently the GTT limit affects the largest possible userptr
>>> allocation, though not the total sum of all userptr allocations. Maybe
>>> making userptr completely independent of GTT size would be an easier
>>> problem to tackle. Then I can leave the GTT limit alone.
>> Well this way we would only avoid the symptoms, but not the real problem.
> It allocates pages in user mode rather than kernel mode. That means, OOM
> situations take a completely different code path. Before running out of
> memory completely, triggering the OOM killer, the kernel would start
> swapping pages, which would trigger the MMU notifier to stop the user
> mode queues or invalidate GPU page table entries, and allow the pages to
> be swapped out.

Well it at least removes the extra layer in TTM we have here.

But what I meant is that it still doesn't fix our underlying problem of 
stopping the hardware immediately.

 The only real solution I can see is to be able to reliably kill shaders
 in an OOM situation.
>>> Well, we can in fact preempt our compute shaders with low latency.
>>> Killing a KFD process will do exactly that.
>> I've taken a look at that thing as well and to be honest it is not even
>> remotely sufficient.
>>
>> We need something which stops the hardware *immediately* from accessing
>> system memory, and not wait for the SQ to kill all waves, flush caches
>> etc...
> It's apparently sufficient to use in our MMU notifier. There is also a
> way to disable the grace period that allows short waves to complete
> before being preempted, though we're not using that at the moment.
>
>
>> One possibility I've been playing around with for a while is to replace the
>> root PD for the VMIDs in question on the fly. E.g. we just let it point
>> to some dummy which redirects everything into nirvana.
> Even that's not sufficient. You'll also need to free the pages
> immediately. For KFD processes, cleaning up of memory is done in a
> worker thread that gets kicked off by a release MMU notifier when the
> process' mm_struct is taken down.

Yeah, that worker is what I meant when I said this whole thing is not 
sufficient. BTW: How does that still work with HMM? I mean HMM doesn't 
take a reference to the pages any more.

But let me say it differently: When we want the OOM killer to work 
correctly we need to have a shortcut path which doesn't take any locks 
or allocate memory.

What we can do is: Write some registers and then maybe busy wait for the 
TLB flush to complete.

What we can't do is: Waiting for the SQ to kill waves, waiting for 
fences etc...

> Then there is still TTM's delayed freeing of BOs that waits for fences.
> So you'd need to signal all the BO fences to allow them to be released.

Actually that doesn't apply in the critical code path. In that situation 
TTM tries to free up the things it doesn't need to wait for immediately.

What we are missing here is something like a kill interface for fences 
maybe...

> TBH, I don't understand why waiting is not an option, if the alternative
> is a kernel panic. If your OOM killer kicks in, your system is basically
> dead. Waiting for a fraction of a second to let a GPU finish its memory
> access should be a small price to pay in that situation.

Oh, yeah that is a really good point as well.

I think that this restriction was created to make sure that the OOM 
killer always makes progress and doesn't wait for things like network 
congestion.

Now the Linux MM is not really made for long term I/O mappings anyway. 
And that was also recently a topic on the lists in the context of HMM 
(there is a LWN summary about it). Probably worth bringing that 
discussion up once more.

Christian.

>
> Regards,
>     Felix
>
>
>> But implementing this is easier said than done...
>>
>> Regards,
>> Christian.
>>


Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-04-30 Thread Kuehling, Felix
On 2019-04-30 1:03 p.m., Koenig, Christian wrote:
> Am 30.04.19 um 17:36 schrieb Kuehling, Felix:
>> On 2019-04-30 5:32 a.m., Christian König wrote:
>>>
>>> Am 30.04.19 um 01:16 schrieb Kuehling, Felix:
 On 2019-04-29 8:34 a.m., Christian König wrote:
> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
>> From: Kent Russell 
>>
>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>> system memory. This severely limits the quantity of system memory
>> that can be used by ROCm applications.
>>
>> Increase GTT size to the maximum of VRAM size or system memory size.
> Well, NAK.
>
> This limit was done on purpose because otherwise a max-texture-size
> allocation would crash the system: the OOM killer would cause a
> system panic.
>
> Using more than 75% of system memory by the GPU at the same time makes
> the system unstable and so we can't allow that by default.
 Like we discussed, the current implementation is too limiting. On a Fiji
 system with 4GB VRAM and 32GB system memory, it limits system memory
 allocations to 4GB. I think this workaround was fixed once before and
 reverted because it broke a CZ system with 1GB system memory. So I
 suspect that this is an issue affecting small memory systems where maybe
 the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
 panics.
>>> Well it not only broke on a 1GB CZ system, this was just where Andrey
>>> reproduced it. We got reports from all kinds of systems.
>> I'd like to see those reports. This patch has been included in Linux Pro
>> releases since 18.20. I'm not aware that anyone complained about it.
> Well to be honest our Pro driver is actually not used that widely and
> only used on rather homogeneous systems.
>
> Which is not really surprising since we only advise using it for
> professional use cases.
>
 The OOM killer problem is a more general problem that potentially
 affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
 inadequate workaround at best. I'd like to look for a better solution,
 probably some adjustment of the TTM system memory limits on systems with
 small memory, to avoid OOM panics on such systems.
>>> The core problem here is that the OOM killer explicitly doesn't want to
>>> block for shaders to finish whatever it is doing.
>>>
>>> So currently as soon as the hardware is using some memory it can't be
>>> reclaimed immediately.
>>>
>>> The original limit in TTM was 2/3 of system memory and that worked
>>> really reliably and we ran into problems only after raising it to 3/4.
>> The TTM system memory limit is still 3/8 soft and 1/2 hard, 3/4 for
>> emergencies. See ttm_mem_init_kernel_zone. AFAICT, the emergency limit
>> is only available to root.
> Ah! I think I know why those limits don't kick in here!
>
> When GTT space is used by evictions from VRAM then we will use the
> emergency limit as well.
>
>> This GTT limit kicks in before I get anywhere close to the TTM limit.
>> That's why I think it is both broken and redundant.
> That was also the argument when we removed it the last time, but it got
> immediately reverted.
>
>>> To sum it up, using almost all system memory from a GPU is simply not
>>> possible upstream and rather questionable even in any production system.
>> It should be doable with userptr, which now uses unpinned pages through
>> HMM. Currently the GTT limit affects the largest possible userptr
>> allocation, though not the total sum of all userptr allocations. Maybe
>> making userptr completely independent of GTT size would be an easier
>> problem to tackle. Then I can leave the GTT limit alone.
> Well this way we would only avoid the symptoms, but not the real problem.

It allocates pages in user mode rather than kernel mode. That means, OOM 
situations take a completely different code path. Before running out of 
memory completely, triggering the OOM killer, the kernel would start 
swapping pages, which would trigger the MMU notifier to stop the user 
mode queues or invalidate GPU page table entries, and allow the pages to 
be swapped out.


>
>> The only real solution I can see is to be able to reliably kill shaders
>>> in an OOM situation.
>> Well, we can in fact preempt our compute shaders with low latency.
>> Killing a KFD process will do exactly that.
> I've taken a look at that thing as well and to be honest it is not even
> remotely sufficient.
>
> We need something which stops the hardware *immediately* from accessing
> system memory, and not wait for the SQ to kill all waves, flush caches
> etc...

It's apparently sufficient to use in our MMU notifier. There is also a 
way to disable the grace period that allows short waves to complete 
before being preempted, though we're not using that at the moment.


>
> One possibility I've been playing around with for a while is to replace the
> root PD for the VMIDs in question on the fly. E.g. we just let it point
> to some dummy which redirects everything into nirvana.

Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-04-30 Thread Koenig, Christian
Am 30.04.19 um 17:36 schrieb Kuehling, Felix:
> On 2019-04-30 5:32 a.m., Christian König wrote:
>>
>> Am 30.04.19 um 01:16 schrieb Kuehling, Felix:
>>> On 2019-04-29 8:34 a.m., Christian König wrote:
 Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
> From: Kent Russell 
>
> GTT size is currently limited to the minimum of VRAM size or 3/4 of
> system memory. This severely limits the quantity of system memory
> that can be used by ROCm applications.
>
> Increase GTT size to the maximum of VRAM size or system memory size.
 Well, NAK.

 This limit was done on purpose because otherwise a max-texture-size
 allocation would crash the system: the OOM killer would cause a
 system panic.

 Using more than 75% of system memory by the GPU at the same time makes
 the system unstable and so we can't allow that by default.
>>> Like we discussed, the current implementation is too limiting. On a Fiji
>>> system with 4GB VRAM and 32GB system memory, it limits system memory
>>> allocations to 4GB. I think this workaround was fixed once before and
>>> reverted because it broke a CZ system with 1GB system memory. So I
>>> suspect that this is an issue affecting small memory systems where maybe
>>> the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
>>> panics.
>> Well it not only broke on a 1GB CZ system, this was just where Andrey
>> reproduced it. We got reports from all kinds of systems.
> I'd like to see those reports. This patch has been included in Linux Pro
> releases since 18.20. I'm not aware that anyone complained about it.

Well to be honest our Pro driver is actually not used that widely and 
only used on rather homogeneous systems.

Which is not really surprising since we only advise using it for 
professional use cases.

>>> The OOM killer problem is a more general problem that potentially
>>> affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
>>> inadequate workaround at best. I'd like to look for a better solution,
>>> probably some adjustment of the TTM system memory limits on systems with
>>> small memory, to avoid OOM panics on such systems.
>> The core problem here is that the OOM killer explicitly doesn't want to
>> block for shaders to finish whatever it is doing.
>>
>> So currently as soon as the hardware is using some memory it can't be
>> reclaimed immediately.
>>
>> The original limit in TTM was 2/3 of system memory and that worked
>> really reliably and we ran into problems only after raising it to 3/4.
> The TTM system memory limit is still 3/8 soft and 1/2 hard, 3/4 for
> emergencies. See ttm_mem_init_kernel_zone. AFAICT, the emergency limit
> is only available to root.

Ah! I think I know why those limits don't kick in here!

When GTT space is used by evictions from VRAM then we will use the 
emergency limit as well.

> This GTT limit kicks in before I get anywhere close to the TTM limit.
> That's why I think it is both broken and redundant.

That was also the argument when we removed it the last time, but it got 
immediately reverted.

>> To sum it up, using almost all system memory from a GPU is simply not
>> possible upstream and rather questionable even in any production system.
> It should be doable with userptr, which now uses unpinned pages through
> HMM. Currently the GTT limit affects the largest possible userptr
> allocation, though not the total sum of all userptr allocations. Maybe
> making userptr completely independent of GTT size would be an easier
> problem to tackle. Then I can leave the GTT limit alone.

Well this way we would only avoid the symptoms, but not the real problem.

>> The only real solution I can see is to be able to reliably kill shaders
>> in an OOM situation.
> Well, we can in fact preempt our compute shaders with low latency.
> Killing a KFD process will do exactly that.

I've taken a look at that thing as well and to be honest it is not even 
remotely sufficient.

We need something which stops the hardware *immediately* from accessing 
system memory, and not wait for the SQ to kill all waves, flush caches 
etc...

One possibility I've been playing around with for a while is to replace the 
root PD for the VMIDs in question on the fly. E.g. we just let it point 
to some dummy which redirects everything into nirvana.

But implementing this is easier said than done...

Regards,
Christian.


Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-04-30 Thread Kuehling, Felix

On 2019-04-30 5:32 a.m., Christian König wrote:
>
> Am 30.04.19 um 01:16 schrieb Kuehling, Felix:
>> On 2019-04-29 8:34 a.m., Christian König wrote:
>>> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
 From: Kent Russell 

 GTT size is currently limited to the minimum of VRAM size or 3/4 of
 system memory. This severely limits the quantity of system memory
 that can be used by ROCm applications.

 Increase GTT size to the maximum of VRAM size or system memory size.
>>> Well, NAK.
>>>
>>> This limit was done on purpose because otherwise a max-texture-size
>>> allocation would crash the system: the OOM killer would cause a
>>> system panic.
>>>
>>> Using more than 75% of system memory by the GPU at the same time makes
>>> the system unstable and so we can't allow that by default.
>> Like we discussed, the current implementation is too limiting. On a Fiji
>> system with 4GB VRAM and 32GB system memory, it limits system memory
>> allocations to 4GB. I think this workaround was fixed once before and
>> reverted because it broke a CZ system with 1GB system memory. So I
>> suspect that this is an issue affecting small memory systems where maybe
>> the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
>> panics.
>
> Well it not only broke on a 1GB CZ system, this was just where Andrey
> reproduced it. We got reports from all kinds of systems.

I'd like to see those reports. This patch has been included in Linux Pro 
releases since 18.20. I'm not aware that anyone complained about it.


>
>> The OOM killer problem is a more general problem that potentially
>> affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
>> inadequate workaround at best. I'd like to look for a better solution,
>> probably some adjustment of the TTM system memory limits on systems with
>> small memory, to avoid OOM panics on such systems.
>
> The core problem here is that the OOM killer explicitly doesn't want to
> block for shaders to finish whatever it is doing.
>
> So currently as soon as the hardware is using some memory it can't be
> reclaimed immediately.
>
> The original limit in TTM was 2/3 of system memory and that worked
> really reliably and we ran into problems only after raising it to 3/4.

The TTM system memory limit is still 3/8 soft and 1/2 hard, 3/4 for 
emergencies. See ttm_mem_init_kernel_zone. AFAICT, the emergency limit 
is only available to root.

This GTT limit kicks in before I get anywhere close to the TTM limit. 
That's why I think it is both broken and redundant.


>
> To sum it up, using almost all system memory from a GPU is simply not
> possible upstream and rather questionable even in any production system.

It should be doable with userptr, which now uses unpinned pages through 
HMM. Currently the GTT limit affects the largest possible userptr 
allocation, though not the total sum of all userptr allocations. Maybe 
making userptr completely independent of GTT size would be an easier 
problem to tackle. Then I can leave the GTT limit alone.


>
> The only real solution I can see is to be able to reliably kill shaders
> in an OOM situation.

Well, we can in fact preempt our compute shaders with low latency. 
Killing a KFD process will do exactly that.

Regards,
   Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>     Felix
>>
>>
>>> What could maybe work is to reduce the amount of system memory by a fixed
>>> factor, but I offhand don't see a way of fixing this in general.
>>>
>>> Regards,
>>> Christian.
>>>
 Signed-off-by: Kent Russell 
 Reviewed-by: Felix Kuehling 
 Signed-off-by: Felix Kuehling 
 ---
    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 -
    1 file changed, 4 insertions(+), 5 deletions(-)

 diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
 b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
 index c14198737dcd..e9ecc3953673 100644
 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
 +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
 @@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device 
 *adev)
    struct sysinfo si;
  si_meminfo(&si);
 -    gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
 -   adev->gmc.mc_vram_size),
 -   ((uint64_t)si.totalram * si.mem_unit * 3/4));
 -    }
 -    else
 +    gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
 +    adev->gmc.mc_vram_size,
 +    ((uint64_t)si.totalram * si.mem_unit));
 +    } else
    gtt_size = (uint64_t)amdgpu_gtt_size << 20;
  /* Initialize GTT memory pool */

Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-04-30 Thread Christian König

Am 30.04.19 um 01:16 schrieb Kuehling, Felix:

On 2019-04-29 8:34 a.m., Christian König wrote:

Am 28.04.19 um 09:44 schrieb Kuehling, Felix:

From: Kent Russell 

GTT size is currently limited to the minimum of VRAM size or 3/4 of
system memory. This severely limits the quantity of system memory
that can be used by ROCm applications.

Increase GTT size to the maximum of VRAM size or system memory size.

Well, NAK.

This limit was done on purpose because otherwise a max-texture-size
allocation would crash the system: the OOM killer would cause a
system panic.

Using more than 75% of system memory by the GPU at the same time makes
the system unstable and so we can't allow that by default.

Like we discussed, the current implementation is too limiting. On a Fiji
system with 4GB VRAM and 32GB system memory, it limits system memory
allocations to 4GB. I think this workaround was fixed once before and
reverted because it broke a CZ system with 1GB system memory. So I
suspect that this is an issue affecting small memory systems where maybe
the 1/2 system memory limit in TTM isn't sufficient to protect from OOM
panics.


Well it not only broke on a 1GB CZ system, this was just where Andrey 
reproduced it. We got reports from all kinds of systems.



The OOM killer problem is a more general problem that potentially
affects other drivers too. Keeping this GTT limit broken in AMDGPU is an
inadequate workaround at best. I'd like to look for a better solution,
probably some adjustment of the TTM system memory limits on systems with
small memory, to avoid OOM panics on such systems.


The core problem here is that the OOM killer explicitly doesn't want to 
block for shaders to finish whatever it is doing.


So currently as soon as the hardware is using some memory it can't be 
reclaimed immediately.


The original limit in TTM was 2/3 of system memory and that worked 
really reliably and we ran into problems only after raising it to 3/4.


To sum it up, using almost all system memory from a GPU is simply not 
possible upstream and rather questionable even in any production system.


The only real solution I can see is to be able to reliably kill shaders 
in an OOM situation.


Regards,
Christian.



Regards,
    Felix



What could maybe work is to reduce the amount of system memory by a fixed
factor, but I offhand don't see a way of fixing this in general.

Regards,
Christian.


Signed-off-by: Kent Russell 
Reviewed-by: Felix Kuehling 
Signed-off-by: Felix Kuehling 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 -
   1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c14198737dcd..e9ecc3953673 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
   struct sysinfo si;
     si_meminfo(&si);
-    gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
-   adev->gmc.mc_vram_size),
-   ((uint64_t)si.totalram * si.mem_unit * 3/4));
-    }
-    else
+    gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
+    adev->gmc.mc_vram_size,
+    ((uint64_t)si.totalram * si.mem_unit));
+    } else
   gtt_size = (uint64_t)amdgpu_gtt_size << 20;
     /* Initialize GTT memory pool */


Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-04-29 Thread Kuehling, Felix
On 2019-04-29 8:34 a.m., Christian König wrote:
> Am 28.04.19 um 09:44 schrieb Kuehling, Felix:
>> From: Kent Russell 
>>
>> GTT size is currently limited to the minimum of VRAM size or 3/4 of
>> system memory. This severely limits the quantity of system memory
>> that can be used by ROCm applications.
>>
>> Increase GTT size to the maximum of VRAM size or system memory size.
>
> Well, NAK.
>
> This limit was done on purpose because otherwise a max-texture-size 
> allocation would crash the system: the OOM killer would cause a 
> system panic.
>
> Using more than 75% of system memory by the GPU at the same time makes 
> the system unstable and so we can't allow that by default.

Like we discussed, the current implementation is too limiting. On a Fiji 
system with 4GB VRAM and 32GB system memory, it limits system memory 
allocations to 4GB. I think this workaround was fixed once before and 
reverted because it broke a CZ system with 1GB system memory. So I 
suspect that this is an issue affecting small memory systems where maybe 
the 1/2 system memory limit in TTM isn't sufficient to protect from OOM 
panics.

The OOM killer problem is a more general problem that potentially 
affects other drivers too. Keeping this GTT limit broken in AMDGPU is an 
inadequate workaround at best. I'd like to look for a better solution, 
probably some adjustment of the TTM system memory limits on systems with 
small memory, to avoid OOM panics on such systems.

Regards,
   Felix


>
> What could maybe work is to reduce the amount of system memory by a fixed 
> factor, but I offhand don't see a way of fixing this in general.
>
> Regards,
> Christian.
>
>>
>> Signed-off-by: Kent Russell 
>> Reviewed-by: Felix Kuehling 
>> Signed-off-by: Felix Kuehling 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 -
>>   1 file changed, 4 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index c14198737dcd..e9ecc3953673 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
>>   struct sysinfo si;
>>     si_meminfo(&si);
>> -    gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>> -   adev->gmc.mc_vram_size),
>> -   ((uint64_t)si.totalram * si.mem_unit * 3/4));
>> -    }
>> -    else
>> +    gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
>> +    adev->gmc.mc_vram_size,
>> +    ((uint64_t)si.totalram * si.mem_unit));
>> +    } else
>>   gtt_size = (uint64_t)amdgpu_gtt_size << 20;
>>     /* Initialize GTT memory pool */
>

Re: [PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-04-29 Thread Christian König

Am 28.04.19 um 09:44 schrieb Kuehling, Felix:

From: Kent Russell 

GTT size is currently limited to the minimum of VRAM size or 3/4 of
system memory. This severely limits the quantity of system memory
that can be used by ROCm applications.

Increase GTT size to the maximum of VRAM size or system memory size.


Well, NAK.

This limit was done on purpose because otherwise a max-texture-size 
allocation would crash the system: the OOM killer would cause a 
system panic.


Using more than 75% of system memory by the GPU at the same time makes 
the system unstable and so we can't allow that by default.


What could maybe work is to reduce the amount of system memory by a fixed 
factor, but I offhand don't see a way of fixing this in general.


Regards,
Christian.



Signed-off-by: Kent Russell 
Reviewed-by: Felix Kuehling 
Signed-off-by: Felix Kuehling 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 -
  1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c14198737dcd..e9ecc3953673 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
struct sysinfo si;
  
  		si_meminfo(&si);

-   gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
-  adev->gmc.mc_vram_size),
-  ((uint64_t)si.totalram * si.mem_unit * 3/4));
-   }
-   else
+   gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
+   adev->gmc.mc_vram_size,
+   ((uint64_t)si.totalram * si.mem_unit));
+   } else
gtt_size = (uint64_t)amdgpu_gtt_size << 20;
  
  	/* Initialize GTT memory pool */



[PATCH 27/27] drm/amdgpu: Fix GTT size calculation

2019-04-28 Thread Kuehling, Felix
From: Kent Russell 

GTT size is currently limited to the minimum of VRAM size or 3/4 of
system memory. This severely limits the quantity of system memory
that can be used by ROCm applications.

Increase GTT size to the maximum of VRAM size or system memory size.

Signed-off-by: Kent Russell 
Reviewed-by: Felix Kuehling 
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c14198737dcd..e9ecc3953673 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1740,11 +1740,10 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
struct sysinfo si;
 
si_meminfo(&si);
-   gtt_size = min(max((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
-  adev->gmc.mc_vram_size),
-  ((uint64_t)si.totalram * si.mem_unit * 3/4));
-   }
-   else
+   gtt_size = max3((AMDGPU_DEFAULT_GTT_SIZE_MB << 20),
+   adev->gmc.mc_vram_size,
+   ((uint64_t)si.totalram * si.mem_unit));
+   } else
gtt_size = (uint64_t)amdgpu_gtt_size << 20;
 
/* Initialize GTT memory pool */
-- 
2.17.1
