[Public]

> > Here is taking a different lock than the reset_domain->sem. It is a 
> > seperate reset_domain->gpu_sem that is only locked when we will actuall do 
> > reset, it is not taken in the skip_hw_reset path.
>
> Exactly that is what you should *not* do. Please don't add any new lock to 
> the code. This is already complicated enough.
>
> If you think that adding wrappers for reset lock makes sense then we can 
> probably do that, bot don't add any lock for hw access.

The two lock protects very different things though. The first case is we need 
to block two resets running in parallel, this does not only cover GPU reset 
itself but also any teardown that happens before GPU reset. The second case is 
we need to ensure exclusive access to the GPU between GPU reset and GPU init, 
concurrent access is fine before GPU is reset.

Theoretically, the second case happens within the first case, so locking the 
first case would protect against both. But with the current implementation this 
is infeasible, all the generic functions called between 
amdgpu_device_lock/unlock_reset_domain would need to be swapped out for special 
versions so the reset thread does not dead lock itself. It is much simpler to 
have a second, much narrower lock that only covers GPU reset<->GPU init because 
all the accesses there are very low level anyway.

Teddy

Reply via email to