Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-26 Thread Christian König
NAK, why the heck should we do this? It would just block all new 
processes from using the device.


Christian.

Am 25.10.2017 um 11:22 schrieb Monk Liu:

Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
Signed-off-by: Monk Liu 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 4a9f749..c155ce4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct 
drm_file *file_priv)
if (r < 0)
return r;
  
+	if (adev->in_gpu_reset)
+		return -ENODEV;
+
fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
if (unlikely(!fpriv)) {
r = -ENOMEM;



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-26 Thread Liu, Monk
When the amdgpu_gpu_recover() routine is in flight, we shouldn't let UMD open our 
device; otherwise the VM init would be ruined by gpu_recover().

e.g. VM init needs to create page tables, but keep in mind that gpu_recover() 
calls ASIC RESET.

If we don't block device open while the GPU is doing recovery, the VM init (SDMA 
working on page table creation) would be ruined by the ASIC RESET.

Do you have any good solution? The key point is to avoid/delay/push back HW 
activities from the UMD side while we are running in the gpu_recover() function.

BR Monk

-Original Message-
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com] 
Sent: October 26, 2017 15:18
To: Liu, Monk ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

NAK, why the heck should we do this? It would just block all new processes from 
using the device.

Christian.





Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-26 Thread Christian König

> If we don't block device open while the GPU is doing recovery, the VM init (SDMA
> working on page table creation) would be ruined by the ASIC RESET.
That is not a problem at all. SDMA just does some clear operation on the 
page tables and those are either recovered from the shadow or run after 
the reset.


Regards,
Christian.







RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-26 Thread Liu, Monk
"Clear operation on the page table " is some kind of SDMA activity right? What 
if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???

BR Monk

-Original Message-
From: Koenig, Christian 
Sent: October 26, 2017 18:54
To: Liu, Monk ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

> if we don't block device open while gpu doing recover, the vm init 
> (SDMA working on page table creating) would be ruined by ASIC RESET
That is not a problem at all. SDMA just does some clear operation on the page 
tables and those are either recovered from the shadow or run after the reset.

Regards,
Christian.




Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-26 Thread Christian König

Am 26.10.2017 um 13:08 schrieb Liu, Monk:

"Clear operation on the page table " is some kind of SDMA activity right? What 
if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???
I can't see any difference between the handling of existing VMs and new 
created ones.


Either we have correct handling and can redo the activity or we have 
corrupted VM page tables and crash again immediately.


So we need to handle this gracefully anyway,
Christian.





RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-29 Thread Liu, Monk
> I can't see any difference between the handling of existing VMs and new 
> created ones.
I know; for existing VMs we still have similar problems. I'm not saying this 
patch can solve the existing-VM problem ...

My earlier patch series actually used an approach that can 100% avoid such 
problems: use an RW lock on drm_ioctl() and gpu_recover(); drm_ioctl() takes the 
read lock, and gpu_recover() takes the write lock.
But you gave a NAK on that approach, so I want to hear your idea.

>Either we have correct handling and can redo the activity or we have corrupted 
>VM page tables and crash again immediately.
The thing is, some VM activity does not go through the GPU scheduler (it is 
submitted directly); if it is interrupted by gpu_recover(), it's not going to be 
re-scheduled again ...


> So we need to handle this gracefully anyway, Christian.
Yeah, I'd like to hear how.


BR Monk



-Original Message-
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com] 
Sent: October 26, 2017 23:15
To: Liu, Monk ; Koenig, Christian ; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

Am 26.10.2017 um 13:08 schrieb Liu, Monk:
> "Clear operation on the page table " is some kind of SDMA activity right? 
> What if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???
I can't see any difference between the handling of existing VMs and new created 
ones.

Either we have correct handling and can redo the activity or we have corrupted 
VM page tables and crash again immediately.

So we need to handle this gracefully anyway, Christian.





Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-30 Thread Christian König

> I can't see any difference between the handling of existing VMs and new 
> created ones.
>
> I know; for existing VMs we still have similar problems. I'm not saying this 
> patch can solve the existing-VM problem ...
>
> My earlier patch series actually used an approach that can 100% avoid such 
> problems: use an RW lock on drm_ioctl() and gpu_recover(); drm_ioctl() takes 
> the read lock, and gpu_recover() takes the write lock.
> But you gave a NAK on that approach, so I want to hear your idea.
Well, what I wanted to say is we simply don't have a problem here. The 
handling for new VMs is exactly the same as for existing VMs.


Regarding the lock that is still a big NAK, we just fixed the 
(hopefully) last deadlock between IOCTLs and GPU reset. Introducing any 
locks would cause problems with that once more.


What we can do is use a struct completion to prevent new changes from 
piling up while we are in the GPU reset code.



> The thing is, some VM activity does not go through the GPU scheduler (it is 
> submitted directly); if it is interrupted by gpu_recover(), it's not going to 
> be re-scheduled again ...
That is not correct. All VM activity goes through the GPU scheduler as 
well. So we are certainly going to re-schedule everything.



> So we need to handle this gracefully anyway, Christian.
>
> Yeah, I'd like to hear how.
As I said we don't need to do anything, the handling is correct as it is 
right now.


E.g. we have a shadow copy of the page tables in system memory to 
recover after VRAM lost and reschedule any pending operation after GPU 
reset.


That should be sufficient to handle all existing VM cases and even when 
it's not we need to work on the general handling and not add any checks 
to the IOCTL code which prevents applications from loading the driver.


Regards,
Christian.


RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-30 Thread Liu, Monk
> That is not correct. All VM activity goes through the GPU scheduler as well. 
> So we are certainly going to re-schedule everything.

I'll check on that. If everything goes through the scheduler, that will be fine; 
in that case, simply not increasing the karma of jobs from the kernel RQ is 
enough. To prevent VM jobs from being corrupted by the GPU reset, we can 
re-schedule those jobs if they are not yet signaled (but that requires a 
non-zero timeout).

BR  Monk

-Original Message-
From: Koenig, Christian 
Sent: October 30, 2017 15:37
To: Liu, Monk ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

>> I can't see any difference between the handling of existing VMs and new 
>> created ones.
> I know, for existing VMs we still have similar problems, I'm not saying this 
> patch can save existing VM problem ...
>
> My eldest patch series actually use a way can 100% avoid such problem: 
> use RW mlock on drm_ioctl and gpu_recover(), drm_ioctl() take the Read lock, 
> and gpu_recover() take the write lock.
> But you gave NAK on this approach, so I want to hear your idea.
Well, what I wanted to say is we simply don't have a problem here. The handling 
for new VMs is exactly the same as for existing VMs.

Regarding the lock that is still a big NAK, we just fixed the
(hopefully) last deadlock between IOCTLs and GPU reset. Introducing any locks 
would cause problems with that once more.

What we can do is use a struct completion to prevent new changes from piling up 
while we are in the GPU reset code.

> The thing is some VM activity is not go through GPU scheduler (direct), if it 
> is interrupted by gpu_recover() it's not going to be re-scheduled again ...
That is not correct. All VM activity goes through the GPU scheduler as well. So 
we are certainly going to re-schedule everything.

>> So we need to handle this gracefully anyway, Christian.
> Yeah I'd like to hear
As I said we don't need to do anything, the handling is correct as it is right 
now.

E.g. we have a shadow copy of the page tables in system memory to recover after 
VRAM lost and reschedule any pending operation after GPU reset.

That should be sufficient to handle all existing VM cases and even when it's 
not we need to work on the general handling and not add any checks to the IOCTL 
code which prevents applications from loading the driver.

Regards,
Christian.


Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

2017-10-30 Thread Christian König
The only things that are not pushed through the scheduler are the ring and 
IB tests, and those only because we want to use them directly in the GPU 
reset code to check whether the hardware works properly again.


The VM updates have a separate queue for each VM, but the initial clear 
goes through the kernel queue which is also used for buffer moves during 
eviction. But both always go through the scheduler.


Originally I used the VM queue for the initial clear as well, but I 
changed that because I thought we could save some complexity. It might be a 
good idea to revert that, because it is not as obvious and it caused trouble 
with the ATC support on VG10 as well.


Regards,
Christian.
