Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Christian König

Am 17.01.22 um 15:50 schrieb Marek Olšák:
I don't think fork() would work with userspace where all buffers are 
shared. It certainly doesn't work now. The driver needs to be notified 
that a buffer or texture is shared to ensure data coherency between 
processes, and the driver must execute decompression and other render 
passes when a buffer or texture is being shared for the first time. 
Those aren't called when fork() is called.


Yeah, that's why you can install handlers which run before/after fork() 
is executed. But to summarize, it is illegal for OpenGL, so we don't 
really need to worry about it.


For compute there are a couple of use cases, but even those are not 
real-world ones as far as I know.


But see below.



Marek

On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling wrote:


Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> Am 17.01.22 um 15:17 schrieb Felix Kuehling:
>> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
>>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
 Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>> Top post because I tried to catch up on the entire discussion here.
>>
>> So fundamentally I'm not opposed to just close this fork() hole
>> once and
>> for all. The thing that worries me from an upstream/platform pov is
>> really
>> only if we don't do it consistently across all drivers.
>>
>> So maybe as an idea:
>> - Do the original patch, but not just for ttm but all gem rendernode
>>  drivers at least (or maybe even all gem drivers, no idea), with
>> the
>>  below discussion cleaned up as justification.
> I know of at least one use case which this will break.
>
> A couple of years back we had a discussion on the Mesa mailing list
> because (IIRC) Marek introduced a background thread to push command
> submissions to the kernel.
>
> That broke because some compositor used to initialize OpenGL and then
> do a fork(). This indeed worked previously (no GPUVM at that time),
> but with the addition of the background thread obviously broke.
>
> The conclusion back then was that the compositor is broken and needs
> fixing, but it still essentially means that there could be people out
> there with really old userspace where this setting would just break
> the desktop.
>
> I'm not really against that change either, but at least in theory we
> could make fork() work perfectly fine even with VMs and background
> threads.
 You may regret this if you ever try to build a shared virtual address
 space between GPU and CPU. Then you have two processes (parent and
 child) sharing the same render context and GPU VM address space.
 But the
 CPU address spaces are different. You can't maintain consistent shared
 virtual address spaces for both processes when the GPU address
 space is
 shared between them.
>>> That's actually not much of a problem.
>>>
>>> All you need to do is to use pthread_atfork() and do the appropriate
>>> action in parent/child to clean up your context:
>>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html


>> Thunk already does that. However, it's not foolproof. pthread_atfork
>> handlers aren't called when the process is forked with a clone call.
>
> Yeah, but that's perfectly intentional. clone() is usually used to
> create threads.

Clone can be used to create new processes. Maybe not the common use today.


>
>>> The rest is just to make sure that all shared and all private data are
>>> kept separate all the time. Sharing virtual memory has been done this
>>> way for decades, it's just that nobody ever did it with a stateful
>>> device like GPUs.
>> My concern is not with sharing or not sharing data. It's with sharing
>> the address space itself. If you share the render node, you share GPU
>> virtual address space. However CPU address space is not shared between
>> parent and child. That's a fundamental mismatch between the CPU world
>> and current GPU driver implementation.
>
> Correct, but even that is 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Marek Olšák
I don't think fork() would work with userspace where all buffers are
shared. It certainly doesn't work now. The driver needs to be notified that
a buffer or texture is shared to ensure data coherency between processes,
and the driver must execute decompression and other render passes when a
buffer or texture is being shared for the first time. Those aren't called
when fork() is called.

Marek

On Mon, Jan 17, 2022 at 9:34 AM Felix Kuehling wrote:

> Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> > Am 17.01.22 um 15:17 schrieb Felix Kuehling:
> >> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
> >>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>  Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> > Am 14.01.22 um 17:44 schrieb Daniel Vetter:
> >> Top post because I tried to catch up on the entire discussion here.
> >>
> >> So fundamentally I'm not opposed to just close this fork() hole
> >> once and
> >> for all. The thing that worries me from an upstream/platform pov is
> >> really
> >> only if we don't do it consistently across all drivers.
> >>
> >> So maybe as an idea:
> >> - Do the original patch, but not just for ttm but all gem rendernode
> >>  drivers at least (or maybe even all gem drivers, no idea), with
> >> the
> >>  below discussion cleaned up as justification.
> > I know of at least one use case which this will break.
> >
> > A couple of years back we had a discussion on the Mesa mailing list
> > because (IIRC) Marek introduced a background thread to push command
> > submissions to the kernel.
> >
> > That broke because some compositor used to initialize OpenGL and then
> > do a fork(). This indeed worked previously (no GPUVM at that time),
> > but with the addition of the background thread obviously broke.
> >
> > The conclusion back then was that the compositor is broken and needs
> > fixing, but it still essentially means that there could be people out
> > there with really old userspace where this setting would just break
> > the desktop.
> >
> > I'm not really against that change either, but at least in theory we
> > could make fork() work perfectly fine even with VMs and background
> > threads.
>  You may regret this if you ever try to build a shared virtual address
>  space between GPU and CPU. Then you have two processes (parent and
>  child) sharing the same render context and GPU VM address space.
>  But the
>  CPU address spaces are different. You can't maintain consistent shared
>  virtual address spaces for both processes when the GPU address
>  space is
>  shared between them.
> >>> That's actually not much of a problem.
> >>>
> >>> All you need to do is to use pthread_atfork() and do the appropriate
> >>> action in parent/child to clean up your context:
> >>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
> >> Thunk already does that. However, it's not foolproof. pthread_atfork
> >> handlers aren't called when the process is forked with a clone call.
> >
> > Yeah, but that's perfectly intentional. clone() is usually used to
> > create threads.
>
> Clone can be used to create new processes. Maybe not the common use today.
>
>
> >
> >>> The rest is just to make sure that all shared and all private data are
> >>> kept separate all the time. Sharing virtual memory has been done this
> >>> way for decades, it's just that nobody ever did it with a stateful
> >>> device like GPUs.
> >> My concern is not with sharing or not sharing data. It's with sharing
> >> the address space itself. If you share the render node, you share GPU
> >> virtual address space. However CPU address space is not shared between
> >> parent and child. That's a fundamental mismatch between the CPU world
> >> and current GPU driver implementation.
> >
> > Correct, but even that is easily solvable. As I said before you can
> > hang this state on a VMA and let it be cloned together with the CPU
> > address space.
>
> I'm not following. The address space I'm talking about is struct
> amdgpu_vm. It's associated with the render node file descriptor.
> Inheriting and using that file descriptor in the child inherits the
> amdgpu_vm. I don't see how you can hang that state on any one VMA.
>
> To be consistent with the CPU, you'd need to clone the GPU address space
> (struct amdgpu_vm) in the child process. That means you need a new
> render node file descriptor that imports all the BOs from the parent
> address space. It's a bunch of extra work to fork a process that you're
> proposing to immediately undo with an atfork handler. So I really don't
> see the point.
>
> Regards,
>   Felix
>
>
> >
> > Since VMAs are informed about their cloning (as opposed to file
> > descriptors) it's trivial to even just clone kernel data on first access.
> >
> > Regards,
> > Christian.
> >
> >>
> >> Regards,
> >>Felix
> >>
> >>
> >>> 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Felix Kuehling
Am 2022-01-17 um 9:21 a.m. schrieb Christian König:
> Am 17.01.22 um 15:17 schrieb Felix Kuehling:
>> Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
>>> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
 Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>> Top post because I tried to catch up on the entire discussion here.
>>
>> So fundamentally I'm not opposed to just close this fork() hole
>> once and
>> for all. The thing that worries me from an upstream/platform pov is
>> really
>> only if we don't do it consistently across all drivers.
>>
>> So maybe as an idea:
>> - Do the original patch, but not just for ttm but all gem rendernode
>>  drivers at least (or maybe even all gem drivers, no idea), with
>> the
>>  below discussion cleaned up as justification.
> I know of at least one use case which this will break.
>
> A couple of years back we had a discussion on the Mesa mailing list
> because (IIRC) Marek introduced a background thread to push command
> submissions to the kernel.
>
> That broke because some compositor used to initialize OpenGL and then
> do a fork(). This indeed worked previously (no GPUVM at that time),
> but with the addition of the background thread obviously broke.
>
> The conclusion back then was that the compositor is broken and needs
> fixing, but it still essentially means that there could be people out
> there with really old userspace where this setting would just break
> the desktop.
>
> I'm not really against that change either, but at least in theory we
> could make fork() work perfectly fine even with VMs and background
> threads.
 You may regret this if you ever try to build a shared virtual address
 space between GPU and CPU. Then you have two processes (parent and
 child) sharing the same render context and GPU VM address space.
 But the
 CPU address spaces are different. You can't maintain consistent shared
 virtual address spaces for both processes when the GPU address
 space is
 shared between them.
>>> That's actually not much of a problem.
>>>
>>> All you need to do is to use pthread_atfork() and do the appropriate
>>> action in parent/child to clean up your context:
>>> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
>> Thunk already does that. However, it's not foolproof. pthread_atfork
>> handlers aren't called when the process is forked with a clone call.
>
> Yeah, but that's perfectly intentional. clone() is usually used to
> create threads.

Clone can be used to create new processes. Maybe not the common use today.


>
>>> The rest is just to make sure that all shared and all private data are
>>> kept separate all the time. Sharing virtual memory has been done this
>>> way for decades, it's just that nobody ever did it with a stateful
>>> device like GPUs.
>> My concern is not with sharing or not sharing data. It's with sharing
>> the address space itself. If you share the render node, you share GPU
>> virtual address space. However CPU address space is not shared between
>> parent and child. That's a fundamental mismatch between the CPU world
>> and current GPU driver implementation.
>
> Correct, but even that is easily solvable. As I said before you can
> hang this state on a VMA and let it be cloned together with the CPU
> address space.

I'm not following. The address space I'm talking about is struct
amdgpu_vm. It's associated with the render node file descriptor.
Inheriting and using that file descriptor in the child inherits the
amdgpu_vm. I don't see how you can hang that state on any one VMA.

To be consistent with the CPU, you'd need to clone the GPU address space
(struct amdgpu_vm) in the child process. That means you need a new
render node file descriptor that imports all the BOs from the parent
address space. It's a bunch of extra work to fork a process that you're
proposing to immediately undo with an atfork handler. So I really don't
see the point.
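To make the amount of work concrete, re-establishing even a single BO in a
child's own render node would look roughly like the dma-buf/PRIME round trip
below (an illustrative sketch using libdrm; the render node path, handles and
error handling are assumptions, not code from the thread), and on top of that
every GPU virtual address mapping would still have to be recreated:

#include <fcntl.h>
#include <stdint.h>
#include <xf86drm.h>

int reimport_bo(int parent_fd, uint32_t parent_handle, uint32_t *child_handle)
{
        int prime_fd, child_fd, ret;

        /* export the BO from the parent's render node as a dma-buf FD */
        ret = drmPrimeHandleToFD(parent_fd, parent_handle, DRM_CLOEXEC, &prime_fd);
        if (ret)
                return ret;

        /* the child opens its own render node ... */
        child_fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
        if (child_fd < 0)
                return -1;

        /* ... and imports the dma-buf, getting a handle in its own GPU VM;
         * the GPU virtual address mappings still have to be redone on top */
        return drmPrimeFDToHandle(child_fd, prime_fd, child_handle);
}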

Regards,
  Felix


>
> Since VMAs are informed about their cloning (as opposed to file
> descriptors) it's trivial to even just clone kernel data on first access.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
 Regards,
     Felix

>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Christian König

Am 17.01.22 um 15:17 schrieb Felix Kuehling:

Am 2022-01-17 um 6:44 a.m. schrieb Christian König:

Am 14.01.22 um 18:40 schrieb Felix Kuehling:

Am 2022-01-14 um 12:26 p.m. schrieb Christian König:

Am 14.01.22 um 17:44 schrieb Daniel Vetter:

Top post because I tried to catch up on the entire discussion here.

So fundamentally I'm not opposed to just close this fork() hole
once and
for all. The thing that worries me from an upstream/platform pov is
really
only if we don't do it consistently across all drivers.

So maybe as an idea:
- Do the original patch, but not just for ttm but all gem rendernode
     drivers at least (or maybe even all gem drivers, no idea), with
the
     below discussion cleaned up as justification.

I know of at least one use case which this will break.

A couple of years back we had a discussion on the Mesa mailing list
because (IIRC) Marek introduced a background thread to push command
submissions to the kernel.

That broke because some compositor used to initialize OpenGL and then
do a fork(). This indeed worked previously (no GPUVM at that time),
but with the addition of the background thread obviously broke.

The conclusion back then was that the compositor is broken and needs
fixing, but it still essentially means that there could be people out
there with really old userspace where this setting would just break
the desktop.

I'm not really against that change either, but at least in theory we
could make fork() work perfectly fine even with VMs and background
threads.

You may regret this if you ever try to build a shared virtual address
space between GPU and CPU. Then you have two processes (parent and
child) sharing the same render context and GPU VM address space. But the
CPU address spaces are different. You can't maintain consistent shared
virtual address spaces for both processes when the GPU address space is
shared between them.

That's actually not much of a problem.

All you need to do is to use pthread_atfork() and do the appropriate
action in parent/child to clean up your context:
https://man7.org/linux/man-pages/man3/pthread_atfork.3.html

Thunk already does that. However, it's not foolproof. pthread_atfork
handlers aren't called when the process is forked with a clone call.


Yeah, but that's perfectly intentional. clone() is usually used to 
create threads.



The rest is just to make sure that all shared and all private data are
kept separate all the time. Sharing virtual memory has been done this
way for decades, it's just that nobody ever did it with a stateful
device like GPUs.

My concern is not with sharing or not sharing data. It's with sharing
the address space itself. If you share the render node, you share GPU
virtual address space. However CPU address space is not shared between
parent and child. That's a fundamental mismatch between the CPU world
and current GPU driver implementation.


Correct, but even that is easily solvable. As I said before you can hang 
this state on a VMA and let it be cloned together with the CPU address 
space.


Since VMAs are informed about their cloning (as opposed to file 
descriptors) it's trivial to even just clone kernel data on first access.
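The mechanism being referred to is the vm_operations_struct open/close
callbacks, which the kernel invokes when a VMA is duplicated (for example in
fork()'s dup_mmap()) or torn down. A minimal sketch, with my_clone_state(),
my_put_state() and the state stored in vm_private_data as hypothetical
placeholders rather than existing driver code:

#include <linux/mm.h>

static void my_vm_open(struct vm_area_struct *vma)
{
        /* called with the freshly duplicated VMA, e.g. the copy made for
         * the child during fork; clone or lazily re-create the per-VMA
         * driver state here */
        my_clone_state(vma->vm_private_data);
}

static void my_vm_close(struct vm_area_struct *vma)
{
        my_put_state(vma->vm_private_data);
}

static const struct vm_operations_struct my_vm_ops = {
        .open  = my_vm_open,
        .close = my_vm_close,
};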


Regards,
Christian.



Regards,
   Felix



Regards,
Christian.


Regards,
    Felix





Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Felix Kuehling


Am 2022-01-17 um 6:44 a.m. schrieb Christian König:
> Am 14.01.22 um 18:40 schrieb Felix Kuehling:
>> Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
>>> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
 Top post because I tried to catch up on the entire discussion here.

 So fundamentally I'm not opposed to just close this fork() hole
 once and
 for all. The thing that worries me from an upstream/platform pov is
 really
 only if we don't do it consistently across all drivers.

 So maybe as an idea:
 - Do the original patch, but not just for ttm but all gem rendernode
     drivers at least (or maybe even all gem drivers, no idea), with
 the
     below discussion cleaned up as justification.
>>> I know of at least one use case which this will break.
>>>
>>> A couple of years back we had a discussion on the Mesa mailing list
>>> because (IIRC) Marek introduced a background thread to push command
>>> submissions to the kernel.
>>>
>>> That broke because some compositor used to initialize OpenGL and then
>>> do a fork(). This indeed worked previously (no GPUVM at that time),
>>> but with the addition of the background thread obviously broke.
>>>
>>> The conclusion back then was that the compositor is broken and needs
>>> fixing, but it still essentially means that there could be people out
>>> there with really old userspace where this setting would just break
>>> the desktop.
>>>
>>> I'm not really against that change either, but at least in theory we
>>> could make fork() work perfectly fine even with VMs and background
>>> threads.
>> You may regret this if you ever try to build a shared virtual address
>> space between GPU and CPU. Then you have two processes (parent and
>> child) sharing the same render context and GPU VM address space. But the
>> CPU address spaces are different. You can't maintain consistent shared
>> virtual address spaces for both processes when the GPU address space is
>> shared between them.
>
> That's actually not much of a problem.
>
> All you need to do is to use pthread_atfork() and do the appropriate
> action in parent/child to clean up your context:
> https://man7.org/linux/man-pages/man3/pthread_atfork.3.html

Thunk already does that. However, it's not foolproof. pthread_atfork
handlers aren't called when the process is forked with a clone call.
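For illustration, this is the kind of path that bypasses those handlers (a
sketch, not Thunk code): glibc only runs pthread_atfork() handlers from
fork(), so a child created directly with clone() never sees them.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stddef.h>

static char stack[1024 * 1024];

static int child_fn(void *arg)
{
        /* a new process, but no atfork child handler has been run */
        return 0;
}

int spawn_without_atfork(void)
{
        /* SIGCHLD and no CLONE_VM/CLONE_FILES: behaves much like fork(),
         * yet skips the pthread_atfork() machinery entirely */
        return clone(child_fn, stack + sizeof(stack), SIGCHLD, NULL);
}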


>
> The rest is just to make sure that all shared and all private data are
> kept separate all the time. Sharing virtual memory has been done this
> way for decades, it's just that nobody ever did it with a stateful
> device like GPUs.
My concern is not with sharing or not sharing data. It's with sharing
the address space itself. If you share the render node, you share GPU
virtual address space. However CPU address space is not shared between
parent and child. That's a fundamental mismatch between the CPU world
and current GPU driver implementation.

Regards,
  Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-17 Thread Christian König

Am 14.01.22 um 18:40 schrieb Felix Kuehling:

Am 2022-01-14 um 12:26 p.m. schrieb Christian König:

Am 14.01.22 um 17:44 schrieb Daniel Vetter:

Top post because I tried to catch up on the entire discussion here.

So fundamentally I'm not opposed to just close this fork() hole once and
for all. The thing that worries me from an upstream/platform pov is
really
only if we don't do it consistently across all drivers.

So maybe as an idea:
- Do the original patch, but not just for ttm but all gem rendernode
    drivers at least (or maybe even all gem drivers, no idea), with the
    below discussion cleaned up as justification.

I know of at least one use case which this will break.

A couple of years back we had a discussion on the Mesa mailing list
because (IIRC) Marek introduced a background thread to push command
submissions to the kernel.

That broke because some compositor used to initialize OpenGL and then
do a fork(). This indeed worked previously (no GPUVM at that time),
but with the addition of the background thread obviously broke.

The conclusion back then was that the compositor is broken and needs
fixing, but it still essentially means that there could be people out
there with really old userspace where this setting would just break
the desktop.

I'm not really against that change either, but at least in theory we
could make fork() work perfectly fine even with VMs and background
threads.

You may regret this if you ever try to build a shared virtual address
space between GPU and CPU. Then you have two processes (parent and
child) sharing the same render context and GPU VM address space. But the
CPU address spaces are different. You can't maintain consistent shared
virtual address spaces for both processes when the GPU address space is
shared between them.


That's actually not much of a problem.

All you need to do is to use pthread_atfork() and do the appropriate 
action in parent/child to clean up your context: 
https://man7.org/linux/man-pages/man3/pthread_atfork.3.html
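As a rough sketch of what such handlers could look like (illustrative only;
gpu_quiesce(), gpu_resume() and gpu_reset_in_child() are hypothetical
stand-ins for whatever cleanup the runtime actually performs, not Thunk or
Mesa code):

#include <pthread.h>
#include <stdlib.h>

static void gpu_quiesce(void)        { /* stop submission thread, flush pending work */ }
static void gpu_resume(void)         { /* parent: restart the submission thread */ }
static void gpu_reset_in_child(void) { /* child: drop inherited GPU state, close FDs */ }

static void runtime_init(void)
{
        /* the prepare handler runs before fork(), the other two run after
         * it returns in the parent and the child respectively */
        if (pthread_atfork(gpu_quiesce, gpu_resume, gpu_reset_in_child))
                abort();
}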


The rest is just to make sure that all shared and all private data are 
kept separate all the time. Sharing virtual memory has been done this 
way for decades, it's just that nobody ever did it with a stateful 
device like GPUs.


Regards,
Christian.



Regards,
   Felix





Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-14 Thread Felix Kuehling
Am 2022-01-14 um 12:26 p.m. schrieb Christian König:
> Am 14.01.22 um 17:44 schrieb Daniel Vetter:
>> Top post because I tried to catch up on the entire discussion here.
>>
>> So fundamentally I'm not opposed to just close this fork() hole once and
>> for all. The thing that worries me from an upstream/platform pov is
>> really
>> only if we don't do it consistently across all drivers.
>>
>> So maybe as an idea:
>> - Do the original patch, but not just for ttm but all gem rendernode
>>    drivers at least (or maybe even all gem drivers, no idea), with the
>>    below discussion cleaned up as justification.
>
> I know of at least one use case which this will break.
>
> A couple of years back we had a discussion on the Mesa mailing list
> because (IIRC) Marek introduced a background thread to push command
> submissions to the kernel.
>
> That broke because some compositor used to initialize OpenGL and then
> do a fork(). This indeed worked previously (no GPUVM at that time),
> but with the addition of the background thread obviously broke.
>
> The conclusion back then was that the compositor is broken and needs
> fixing, but it still essentially means that there could be people out
> there with really old userspace where this setting would just break
> the desktop.
>
> I'm not really against that change either, but at least in theory we
> could make fork() work perfectly fine even with VMs and background
> threads.

You may regret this if you ever try to build a shared virtual address
space between GPU and CPU. Then you have two processes (parent and
child) sharing the same render context and GPU VM address space. But the
CPU address spaces are different. You can't maintain consistent shared
virtual address spaces for both processes when the GPU address space is
shared between them.

Regards,
  Felix


>
> Regards,
> Christian.
>
>> - Get acks from userspace driver folks who know real-world gl/vk
>> usage and
>>    khr specs in-depth enough to be able to tell us how much we'll regret
>>    this.
>>
>> - Merge it. Or come up with something new. Or maybe stick to the
>> Nack, but
>>    then maybe it would be good to document that somewhere in-tree?
>>
>> This entire can of worms just feels way too tricky to have it be handled
>> inconsistently across drivers. And trying to fix these kinds of low-level
>> things across drivers once divergences exist is just really painful
>> (e.g.
>> trying to make all dma-buf mmap VM_SPECIAL or the herculean task
>> Christian is doing to get our dma_resv rules consistent across drivers).
>>
>> Cheers, Daniel
>>
>> On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote:
>>> Am 2022-01-07 um 3:56 a.m. schrieb Christian König:
 Am 06.01.22 um 17:51 schrieb Felix Kuehling:
> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>>> [SNIP]
>>> Also, why does your ACK or NAK depend on this at all. If it's the
>>> right
>>> thing to do, it's the right thing to do regardless of who benefits
>>> from
>>> it. In addition, how can a child process that doesn't even use
>>> the GPU
>>> be in violation of any GPU-driver related specifications.
>> The argument is that the application is broken and needs to be fixed
>> instead of worked around inside the kernel.
> I still don't get how the application is broken. Like I said,
> the
> child process is not using the GPU. How can the application be
> fixed in
> this case?
 Sounds like I'm still missing some important puzzle pieces for the
 full picture to figure out why this doesn't work.

> Are you saying, any application that forks and doesn't immediately
> call
> exec is broken?
 More or less yes. We essentially have three possible cases here:

 1. Application is already using (for example) OpenGL or OpenCL to do
 some rendering on the GPU and then calls fork() and expects to use
 OpenGL both in the parent and the child at the same time.
  As far as I know that's illegal from the Khronos specification
 point of view and working around inside the kernel for something not
 allowed in the first place is seen as a futile effort.

 2. Application opened the file descriptor, forks and then initializes
 OpenGL/Vulkan/OpenCL.
  That's what some compositors are doing to drop into the background
 and is explicitly legal.

 3. Application calls fork() and then doesn't use the GPU in the child.
 Either by not using it or calling exec.
  That should be legal and not cause any problems in the first
 place.

 But from your description I still don't get why we are running into
 problems here.

 I was assuming that you have case #1 because we previously had some
 examples of this with this python library, but from your description
 it seems 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-14 Thread Christian König

Am 14.01.22 um 17:44 schrieb Daniel Vetter:

Top post because I tried to catch up on the entire discussion here.

So fundamentally I'm not opposed to just close this fork() hole once and
for all. The thing that worries me from an upstream/platform pov is really
only if we don't do it consistently across all drivers.

So maybe as an idea:
- Do the original patch, but not just for ttm but all gem rendernode
   drivers at least (or maybe even all gem drivers, no idea), with the
   below discussion cleaned up as justification.


I know of at least one use case which this will break.

A couple of years back we had a discussion on the Mesa mailing list 
because (IIRC) Marek introduced a background thread to push command 
submissions to the kernel.


That broke because some compositor used to initialize OpenGL and then do 
a fork(). This indeed worked previously (no GPUVM at that time), but 
with the addition of the background thread obviously broke.


The conclusion back then was that the compositor is broken and needs 
fixing, but it still essentially means that there could be people out 
there with really old userspace where this setting would just break the 
desktop.


I'm not really against that change either, but at least in theory we 
could make fork() work perfectly fine even with VMs and background threads.
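The interaction described here comes from fork() duplicating only the calling
thread. A sketch of the failure mode (submit_loop() is a hypothetical
stand-in for Mesa's command-submission thread, not actual Mesa code):

#include <pthread.h>
#include <unistd.h>

static void *submit_loop(void *arg)
{
        for (;;) {
                /* pop queued command submissions and push them to the kernel */
        }
        return NULL;
}

int main(void)
{
        pthread_t submitter;

        pthread_create(&submitter, NULL, submit_loop, NULL);

        if (fork() == 0) {
                /* child: only the forking thread was duplicated, so the
                 * submitter thread does not exist here; GL work queued for
                 * it goes nowhere unless the context is torn down and
                 * re-created (e.g. via pthread_atfork() handlers) */
        }
        return 0;
}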


Regards,
Christian.


- Get acks from userspace driver folks who know real-world gl/vk usage and
   khr specs in-depth enough to be able to tell us how much we'll regret
   this.

- Merge it. Or come up with something new. Or maybe stick to the Nack, but
   then maybe it would be good to document that somewhere in-tree?

This entire can of worms just feels way too tricky to have it be handled
inconsistently across drivers. And trying to fix these kinds of low-level
things across drivers once divergences exist is just really painful (e.g.
trying to make all dma-buf mmap VM_SPECIAL or the herculean task
Christian is doing to get our dma_resv rules consistent across drivers).

Cheers, Daniel

On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote:

Am 2022-01-07 um 3:56 a.m. schrieb Christian König:

Am 06.01.22 um 17:51 schrieb Felix Kuehling:

Am 2022-01-06 um 11:48 a.m. schrieb Christian König:

Am 06.01.22 um 17:45 schrieb Felix Kuehling:

Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
[SNIP]
Also, why does your ACK or NAK depend on this at all. If it's the
right
thing to do, it's the right thing to do regardless of who benefits
from
it. In addition, how can a child process that doesn't even use the GPU
be in violation of any GPU-driver related specifications.

The argument is that the application is broken and needs to be fixed
instead of worked around inside the kernel.

I still don't get how the application is broken. Like I said, the
child process is not using the GPU. How can the application be fixed in
this case?

Sounds like I'm still missing some important puzzle pieces for the
full picture to figure out why this doesn't work.


Are you saying, any application that forks and doesn't immediately call
exec is broken?

More or less yes. We essentially have three possible cases here:

1. Application is already using (for example) OpenGL or OpenCL to do
some rendering on the GPU and then calls fork() and expects to use
OpenGL both in the parent and the child at the same time.
     As far as I know that's illegal from the Khronos specification
point of view and working around inside the kernel for something not
allowed in the first place is seen as a futile effort.

2. Application opened the file descriptor, forks and then initializes
OpenGL/Vulkan/OpenCL.
     That's what some compositors are doing to drop into the background
and is explicitly legal.

3. Application calls fork() and then doesn't use the GPU in the child.
Either by not using it or calling exec.
     That should be legal and not cause any problems in the first place.

But from your description I still don't get why we are running into
problems here.

I was assuming that you have case #1 because we previously had some
examples of this with this python library, but from your description
it seems to be case #3.

Correct. #3 has at least one issue we previously worked around in the
Thunk: The inherited VMAs prevent BOs from being freed in the parent
process. This manifests as an apparent memory leak. Therefore the Thunk
calls madvise(..., MADV_DONTFORK) on all its VRAM allocations. The BOs
that are causing problems with CRIU are GTT BOs that weren't covered by
this previous workaround.

The new issue with CRIU is that CRIU saves and restores all the VMAs.
When trying to restore the inherited VMAs in the child process, the mmap
call fails because the child process' render node FD is no longer
inherited from the parent, but is instead created by its own "open"
system call. The mmap call in the child fails for at least two reasons:

   * The child process' render node FD doesn't have permission to map the
 parent 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-14 Thread Daniel Vetter
Top post because I tried to catch up on the entire discussion here.

So fundamentally I'm not opposed to just close this fork() hole once and
for all. The thing that worries me from an upstream/platform pov is really
only if we don't do it consistently across all drivers.

So maybe as an idea:
- Do the original patch, but not just for ttm but all gem rendernode
  drivers at least (or maybe even all gem drivers, no idea), with the
  below discussion cleaned up as justification.

- Get acks from userspace driver folks who know real-world gl/vk usage and
  khr specs in-depth enough to be able to tell us how much we'll regret
  this.

- Merge it. Or come up with something new. Or maybe stick to the Nack, but
  then maybe it would be good to document that somewhere in-tree?

This entire can of worms just feels way too tricky to have it be handled
inconsistently across drivers. And trying to fix these kinds of low-level
things across drivers once divergences exist is just really painful (e.g.
trying to make all dma-buf mmap VM_SPECIAL or the herculean task
Christian is doing to get our dma_resv rules consistent across drivers).

Cheers, Daniel

On Fri, Jan 07, 2022 at 12:47:45PM -0500, Felix Kuehling wrote:
> Am 2022-01-07 um 3:56 a.m. schrieb Christian König:
> > Am 06.01.22 um 17:51 schrieb Felix Kuehling:
> >> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
> >>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>  Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>  [SNIP]
>  Also, why does your ACK or NAK depend on this at all. If it's the
>  right
>  thing to do, it's the right thing to do regardless of who benefits
>  from
>  it. In addition, how can a child process that doesn't even use the GPU
>  be in violation of any GPU-driver related specifications.
> >>> The argument is that the application is broken and needs to be fixed
> >>> instead of worked around inside the kernel.
> >> I still don't get how the application is broken. Like I said, the
> >> child process is not using the GPU. How can the application be fixed in
> >> this case?
> >
> > Sounds like I'm still missing some important puzzle pieces for the
> > full picture to figure out why this doesn't work.
> >
> >> Are you saying, any application that forks and doesn't immediately call
> >> exec is broken?
> >
> > More or less yes. We essentially have three possible cases here:
> >
> > 1. Application is already using (for example) OpenGL or OpenCL to do
> > some rendering on the GPU and then calls fork() and expects to use
> > OpenGL both in the parent and the child at the same time.
> >     As far as I know that's illegal from the Khronos specification
> > point of view and working around inside the kernel for something not
> > allowed in the first place is seen as a futile effort.
> >
> > 2. Application opened the file descriptor, forks and then initializes
> > OpenGL/Vulkan/OpenCL.
> >     That's what some compositors are doing to drop into the background
> > and is explicitly legal.
> >
> > 3. Application calls fork() and then doesn't use the GPU in the child.
> > Either by not using it or calling exec.
> >     That should be legal and not cause any problems in the first place. 
> >
> > But from your description I still don't get why we are running into
> > problems here.
> >
> > I was assuming that you have case #1 because we previously had some
> > examples of this with this python library, but from your description
> > it seems to be case #3.
> 
> Correct. #3 has at least one issue we previously worked around in the
> Thunk: The inherited VMAs prevent BOs from being freed in the parent
> process. This manifests as an apparent memory leak. Therefore the Thunk
> calls madvise(..., MADV_DONTFORK) on all its VRAM allocations. The BOs
> that are causing problems with CRIU are GTT BOs that weren't covered by
> this previous workaround.
> 
> The new issue with CRIU is that CRIU saves and restores all the VMAs.
> When trying to restore the inherited VMAs in the child process, the mmap
> call fails because the child process' render node FD is no longer
> inherited from the parent, but is instead created by its own "open"
> system call. The mmap call in the child fails for at least two reasons:
> 
>   * The child process' render node FD doesn't have permission to map the
> parent process' BO any more
>   * The parent BO doesn't get restored in the child process, so its mmap
> offset doesn't get updated to the new mmap offset of the restored
> parent BO by the amdgpu CRIU plugin
> 
> We could maybe add a whole bunch of complexity in CRIU and our CRIU
> plugin to fix this. But it's pointless because like you said, actually
> doing anything with the BO in the child process is illegal anyway
> (scenario #1 above). The easiest solution seems to be, to just not
> inherit the VMA in the first place.
> 
> Regards,
>   Felix
> 
> 
> >
> >> Or does an application that forks need to be aware that some other part

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-10 Thread Bhardwaj, Rajneesh

Hi Christian


I have reverted the change from the amd-staging-drm-next as per the 
discussion.  Thank you.



Regards

Rajneesh

On 1/4/2022 1:08 PM, Felix Kuehling wrote:

[+Adrian]

Am 2021-12-23 um 2:05 a.m. schrieb Christian König:


Am 22.12.21 um 21:53 schrieb Daniel Vetter:

On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:

[SNIP]
Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this
problem. I
really don't want to have random one-off hacks that don't work across
the
board, for a problem where we (drm subsystem) really shouldn't be the
only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please
grab an
ack from them.

Unfortunately it's a KFD design problem. AMD used a single device
node, then mmaped different objects from the same offset to different
processes and expected it to work with the rest of the fs subsystem without
churn.

This may be true for mmaps in the KFD device, but not for mmaps in the
DRM render nodes.



So yes, this is indeed because the mmap space is per file descriptor
for the use case here.

No. This is a different problem.

The problem has to do with the way that DRM manages mmap permissions. In
order to be able to mmap an offset in the render node, there needs to be
a BO that was created in the same render node. If you fork a process, it
inherits the VMA. But KFD doesn't know anything about the inherited BOs
from the parent process. Therefore those BOs don't get checkpointed and
restored in the child process. When the CRIU checkpoint is restored, our
CRIU plugin never creates a BO corresponding to the VMA in the child
process' render node FD. We've also lost the relationship between the
parent and child-process' render node FDs. After "fork" the render node
FD points to the same struct file in parent and child. After restoring
the CRIU checkpoint, they are separate struct files, created by separate
"open" system calls. Therefore the mmap call that restores the VMA fails
in the child process.

At least for KFD, there is no point inheriting BOs in a child process,
because the GPU has no way of accessing the BOs in the child process.
The child process has no GPU address space, no user mode queues, no way
to do anything with the GPU before it completely reinitializes its KFD
context.

We can work around this issue in user mode with madvise(...,
MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
memory leak in the parent process while a child process exists. But it's
slightly racy because there is a short time window where the VMA exists
without the VM_DONTCOPY flag. A fork during that time window could still
create a child process with an inherited VMA.
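A sketch of that user-mode workaround (illustrative only; map_bo_nofork() and
the way the mmap offset is obtained are assumptions, not actual Thunk code):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

void *map_bo_nofork(int render_fd, off_t mmap_offset, size_t size)
{
        void *cpu_ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_SHARED, render_fd, mmap_offset);
        if (cpu_ptr == MAP_FAILED)
                return NULL;

        /* a fork() landing exactly here would still inherit the VMA;
         * this is the race window mentioned above */
        madvise(cpu_ptr, size, MADV_DONTFORK);
        return cpu_ptr;
}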

Therefore a safer solution is to set the vm_flags in the VMA in the
driver when the VMA is first created.
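For contrast, a minimal sketch of the driver-side approach (not the actual
TTM patch; my_driver_mmap() and my_bo_mmap() are hypothetical placeholders
for the driver's existing mmap path):

#include <linux/fs.h>
#include <linux/mm.h>

static int my_driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
        int ret = my_bo_mmap(filp, vma);    /* hypothetical BO mmap helper */

        if (ret)
                return ret;

        /* same effect as MADV_DONTFORK, but applied when the VMA is
         * created, so there is no window in which fork() can copy it */
        vma->vm_flags |= VM_DONTCOPY;
        return 0;
}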

Regards,
   Felix



And thanks for pointing this out, this indeed makes the whole change
extremely questionable.

Regards,
Christian.


Cheers, Daniel



Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-07 Thread Felix Kuehling
Am 2022-01-07 um 3:56 a.m. schrieb Christian König:
> Am 06.01.22 um 17:51 schrieb Felix Kuehling:
>> Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
>>> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
 Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
 [SNIP]
 Also, why does your ACK or NAK depend on this at all. If it's the
 right
 thing to do, it's the right thing to do regardless of who benefits
 from
 it. In addition, how can a child process that doesn't even use the GPU
 be in violation of any GPU-driver related specifications.
>>> The argument is that the application is broken and needs to be fixed
>>> instead of worked around inside the kernel.
>> I still don't get how the application is broken. Like I said, the
>> child process is not using the GPU. How can the application be fixed in
>> this case?
>
> Sounds like I'm still missing some important puzzle pieces for the
> full picture to figure out why this doesn't work.
>
>> Are you saying, any application that forks and doesn't immediately call
>> exec is broken?
>
> More or less yes. We essentially have three possible cases here:
>
> 1. Application is already using (for example) OpenGL or OpenCL to do
> some rendering on the GPU and then calls fork() and expects to use
> OpenGL both in the parent and the child at the same time.
>     As far as I know that's illegal from the Khronos specification
> point of view and working around inside the kernel for something not
> allowed in the first place is seen as a futile effort.
>
> 2. Application opened the file descriptor, forks and then initializes
> OpenGL/Vulkan/OpenCL.
>     That's what some compositors are doing to drop into the background
> and is explicitly legal.
>
> 3. Application calls fork() and then doesn't use the GPU in the child.
> Either by not using it or calling exec.
>     That should be legal and not cause any problems in the first place. 
>
> But from your description I still don't get why we are running into
> problems here.
>
> I was assuming that you have case #1 because we previously had some
> examples of this with this python library, but from your description
> it seems to be case #3.

Correct. #3 has at least one issue we previously worked around in the
Thunk: The inherited VMAs prevent BOs from being freed in the parent
process. This manifests as an apparent memory leak. Therefore the Thunk
calls madvise(..., MADV_DONTFORK) on all its VRAM allocations. The BOs
that are causing problems with CRIU are GTT BOs that weren't covered by
this previous workaround.

The new issue with CRIU is that CRIU saves and restores all the VMAs.
When trying to restore the inherited VMAs in the child process, the mmap
call fails because the child process' render node FD is no longer
inherited from the parent, but is instead created by its own "open"
system call. The mmap call in the child fails for at least two reasons:

  * The child process' render node FD doesn't have permission to map the
parent process' BO any more
  * The parent BO doesn't get restored in the child process, so its mmap
offset doesn't get updated to the new mmap offset of the restored
parent BO by the amdgpu CRIU plugin

We could maybe add a whole bunch of complexity in CRIU and our CRIU
plugin to fix this. But it's pointless because like you said, actually
doing anything with the BO in the child process is illegal anyway
(scenario #1 above). The easiest solution seems to be, to just not
inherit the VMA in the first place.

Regards,
  Felix


>
>> Or does an application that forks need to be aware that some other part
>> of the application used the GPU and explicitly free any GPU resources?
>
> Others might fill that information in, but I think that was the plan
> for newer APIs like Vulkan.
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>    Felix
>>
>>
>>> Regards,
>>> Christian.
>>>
 Regards,
     Felix


> Let's talk about this on Monday's call. Thanks for giving the whole
> context.
>
> Regards,
> Christian.
>
>> Regards,
>>  Felix
>>
>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-07 Thread Christian König

Am 06.01.22 um 17:51 schrieb Felix Kuehling:

Am 2022-01-06 um 11:48 a.m. schrieb Christian König:

Am 06.01.22 um 17:45 schrieb Felix Kuehling:

Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
[SNIP]
Also, why does your ACK or NAK depend on this at all. If it's the right
thing to do, it's the right thing to do regardless of who benefits from
it. In addition, how can a child process that doesn't even use the GPU
be in violation of any GPU-driver related specifications.

The argument is that the application is broken and needs to be fixed
instead of worked around inside the kernel.

I still don't get how the application is broken. Like I said, the
child process is not using the GPU. How can the application be fixed in
this case?


Sounds like I'm still missing some important puzzle pieces for the full 
picture to figure out why this doesn't work.



Are you saying, any application that forks and doesn't immediately call
exec is broken?


More or less yes. We essentially have three possible cases here:

1. Application is already using (for example) OpenGL or OpenCL to do 
some rendering on the GPU and then calls fork() and expects to use 
OpenGL both in the parent and the child at the same time.
    As far as I know that's illegal from the Khronos specification 
point of view and working around inside the kernel for something not 
allowed in the first place is seen as a futile effort.


2. Application opened the file descriptor, forks and then initializes 
OpenGL/Vulkan/OpenCL.
    That's what some compositors are doing to drop into the background 
and is explicitly legal.


3. Application calls fork() and then doesn't use the GPU in the child. 
Either by not using it or calling exec.

    That should be legal and not cause any problems in the first place.

But from your description I still don't get why we are running into 
problems here.


I was assuming that you have case #1 because we previously had some 
examples of this with this python library, but from your description it 
seems to be case #3.



Or does an application that forks need to be aware that some other part
of the application used the GPU and explicitly free any GPU resources?


Others might fill that information in, but I think that was the plan for 
newer APIs like Vulkan.


Regards,
Christian.



Thanks,
   Felix



Regards,
Christian.


Regards,
    Felix



Let's talk about this on Monday's call. Thanks for giving the whole
context.

Regards,
Christian.


Regards,
     Felix





Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-06 Thread Felix Kuehling
Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
> Am 05.01.22 um 17:16 schrieb Felix Kuehling:
>> [SNIP]
 But KFD doesn't know anything about the inherited BOs
 from the parent process.
>>> Ok, why is that? When the KFD is reinitializing its context, why
>>> shouldn't it clean up those VMAs?
>> That cleanup has to be initiated by user mode. Basically closing the old
>> KFD and DRM file descriptors, cleaning up all the user mode VM state,
>> unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
>> and starts from scratch.
>>
>> User mode will do this automatically when it tries to reinitialize ROCm.
>> However, in this case the child process doesn't do that (e.g. a python
>> application using the multi-processing package). The child process does
>> not use ROCm. But you're left with all the dangling VMAs in the child
>> process indefinitely.
>
> Oh, not that one again. I'm unfortunately pretty sure that this is a
> clear NAK then.
>
> This python multi-processing package is violating various
> specifications by doing this fork() and we already had multiple
> discussions about that.

Well, it's in widespread use. We can't just throw up our hands and say
they're buggy and not supported.

Also, why does your ACK or NAK depend on this at all. If it's the right
thing to do, it's the right thing to do regardless of who benefits from
it. In addition, how can a child process that doesn't even use the GPU
be in violation of any GPU-driver related specifications.

Regards,
  Felix


>
> Let's talk about this on Monday's call. Thanks for giving the whole
> context.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-06 Thread Felix Kuehling
Am 2022-01-06 um 11:48 a.m. schrieb Christian König:
> Am 06.01.22 um 17:45 schrieb Felix Kuehling:
>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König:
>>> Am 05.01.22 um 17:16 schrieb Felix Kuehling:
 [SNIP]
>> But KFD doesn't know anything about the inherited BOs
>> from the parent process.
> Ok, why is that? When the KFD is reinitializing its context, why
> shouldn't it clean up those VMAs?
 That cleanup has to be initiated by user mode. Basically closing
 the old
 KFD and DRM file descriptors, cleaning up all the user mode VM state,
 unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
 and starts from scratch.

 User mode will do this automatically when it tries to reinitialize
 ROCm.
 However, in this case the child process doesn't do that (e.g. a python
 application using the multi-processing package). The child process
 does
 not use ROCm. But you're left with all the dangling VMAs in the child
 process indefinitely.
>>> Oh, not that one again. I'm unfortunately pretty sure that this is a
>>> clear NAK then.
>>>
>>> This python multi-processing package is violating various
>>> specifications by doing this fork() and we already had multiple
>>> discussions about that.
>> Well, it's in widespread use. We can't just throw up our hands and say
>> they're buggy and not supported.
>
> Because that's not my NAK, but rather from upstream.
>
>> Also, why does your ACK or NAK depend on this at all. If it's the right
>> thing to do, it's the right thing to do regardless of who benefits from
>> it. In addition, how can a child process that doesn't even use the GPU
>> be in violation of any GPU-driver related specifications.
>
> The argument is that the application is broken and needs to be fixed
> instead of worked around inside the kernel.

I still don't get how the application is broken. Like I said, the
child process is not using the GPU. How can the application be fixed in
this case?

Are you saying, any application that forks and doesn't immediately call
exec is broken?

Or does an application that forks need to be aware that some other part
of the application used the GPU and explicitly free any GPU resources?

Thanks,
  Felix


>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> Let's talk about this on Monday's call. Thanks for giving the whole
>>> context.
>>>
>>> Regards,
>>> Christian.
>>>
 Regards,
     Felix

>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-06 Thread Christian König

Am 06.01.22 um 17:45 schrieb Felix Kuehling:

Am 2022-01-06 um 4:05 a.m. schrieb Christian König:

Am 05.01.22 um 17:16 schrieb Felix Kuehling:

[SNIP]

But KFD doesn't know anything about the inherited BOs
from the parent process.

Ok, why is that? When the KFD is reinitializing its context, why
shouldn't it clean up those VMAs?

That cleanup has to be initiated by user mode. Basically closing the old
KFD and DRM file descriptors, cleaning up all the user mode VM state,
unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
and starts from scratch.

User mode will do this automatically when it tries to reinitialize ROCm.
However, in this case the child process doesn't do that (e.g. a python
application using the multi-processing package). The child process does
not use ROCm. But you're left with all the dangling VMAs in the child
process indefinitely.

Oh, not that one again. I'm unfortunately pretty sure that this is a
clear NAK then.

This python multi-processing package is violating various
specifications by doing this fork() and we already had multiple
discussions about that.

Well, it's in widespread use. We can't just throw up our hands and say
they're buggy and not supported.


Because that's not my NAK, but rather from upstream.


Also, why does your ACK or NAK depend on this at all. If it's the right
thing to do, it's the right thing to do regardless of who benefits from
it. In addition, how can a child process that doesn't even use the GPU
be in violation of any GPU-driver related specifications.


The argument is that the application is broken and needs to be fixed 
instead of worked around inside the kernel.


Regards,
Christian.



Regards,
   Felix



Let's talk about this on Monday's call. Thanks for giving the whole
context.

Regards,
Christian.


Regards,
    Felix





Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-06 Thread Christian König

Am 05.01.22 um 17:16 schrieb Felix Kuehling:

[SNIP]

But KFD doesn't know anything about the inherited BOs
from the parent process.

Ok, why is that? When the KFD is reinitializing its context, why
shouldn't it clean up those VMAs?

That cleanup has to be initiated by user mode. Basically closing the old
KFD and DRM file descriptors, cleaning up all the user mode VM state,
unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
and starts from scratch.

User mode will do this automatically when it tries to reinitialize ROCm.
However, in this case the child process doesn't do that (e.g. a python
application using the multi-processing package). The child process does
not use ROCm. But you're left with all the dangling VMAs in the child
process indefinitely.


Oh, not that one again. I'm unfortunately pretty sure that this is a 
clear NAK then.


This python multi-processing package is violating various specifications 
by doing this fork() and we already had multiple discussions about that.


Let's talk about this on Monday's call. Thanks for giving the whole context.

Regards,
Christian.



Regards,
   Felix





Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-05 Thread Felix Kuehling
Am 2022-01-05 um 11:16 a.m. schrieb Felix Kuehling:
>> I was already wondering which mmaps through the KFD node we have left
>> which cause problems here.
> We still use the KFD FD for mapping doorbells and HDP flushing. These
> are both SG BOs, so they cannot be CPU-mapped through render nodes. The
> KFD FD is also used for mapping signal pages and CWSR trap handlers on
> old APUs.
>
> Those VMAs aren't causing the problem. They still map successfully on
> restore.
>
>
Small correction: KFD already sets the VM_DONTCOPY flag for all KFD FD
mappings.

The patch under discussion here does the same for DRM FD mappings (at
least for KFD BOs).

Regards,
  Felix



Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-05 Thread Felix Kuehling
Am 2022-01-05 um 3:08 a.m. schrieb Christian König:
> Am 04.01.22 um 19:08 schrieb Felix Kuehling:
>> [+Adrian]
>>
>> Am 2021-12-23 um 2:05 a.m. schrieb Christian König:
>>
>>> Am 22.12.21 um 21:53 schrieb Daniel Vetter:
 On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:

 [SNIP]
 Still sounds funky. I think minimally we should have an ack from CRIU
 developers that this is officially the right way to solve this
 problem. I
 really don't want to have random one-off hacks that don't work across
 the
 board, for a problem where we (drm subsystem) really shouldn't be the
 only
 one with this problem. Where "this problem" means that the mmap
 space is
 per file description, and not per underlying inode or real device or
 whatever. That part sounds like a CRIU problem, and I expect CRIU
 folks
 want a consistent solution across the board for this. Hence please
 grab an
 ack from them.
>>> Unfortunately it's a KFD design problem. AMD used a single device
>>> node, then mmaped different objects from the same offset to different
>>> processes and expected it to work with the rest of the fs subsystem without
>>> churn.
>> This may be true for mmaps in the KFD device, but not for mmaps in the
>> DRM render nodes.
>
> Correct, yes.
>
>>> So yes, this is indeed because the mmap space is per file descriptor
>>> for the use case here.
>> No. This is a different problem.
>
> I was already wondering which mmaps through the KFD node we have left
> which cause problems here.

We still use the KFD FD for mapping doorbells and HDP flushing. These
are both SG BOs, so they cannot be CPU-mapped through render nodes. The
KFD FD is also used for mapping signal pages and CWSR trap handlers on
old APUs.

Those VMAs aren't causing the problem. They still map successfully on
restore.


>
>> The problem has to do with the way that DRM manages mmap permissions. In
>> order to be able to mmap an offset in the render node, there needs to be
>> a BO that was created in the same render node. If you fork a process, it
>> inherits the VMA.
>
> Yeah, so far it works like designed.
>
>> But KFD doesn't know anything about the inherited BOs
>> from the parent process.
>
> Ok, why is that? When the KFD is reinitializing its context, why
> shouldn't it clean up those VMAs?

That cleanup has to be initiated by user mode. Basically closing the old
KFD and DRM file descriptors, cleaning up all the user mode VM state,
unmapping all the VMAs, etc. Then it reopens KFD and the render nodes
and starts from scratch.

User mode will do this automatically when it tries to reinitialize ROCm.
However, in this case the child process doesn't do that (e.g. a python
application using the multi-processing package). The child process does
not use ROCm. But you're left with all the dangling VMAs in the child
process indefinitely.
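
For illustration, a minimal sketch of the "start from scratch" teardown
described above, under the assumption that the runtime tracks its file
descriptors and CPU mappings; the device paths and the helper are hypothetical,
not the actual Thunk code:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct tracked_map { void *addr; size_t len; };

    /* Hypothetical sketch of the user-mode reinitialization described above. */
    static void rocm_reinit_from_scratch(struct tracked_map *maps, int n,
                                         int *kfd_fd, int *render_fd)
    {
        /* Unmap all VMAs belonging to the old context. */
        for (int i = 0; i < n; i++)
            munmap(maps[i].addr, maps[i].len);

        /* Close the old KFD and DRM file descriptors. */
        close(*kfd_fd);
        close(*render_fd);

        /* Reopen and rebuild all user mode VM state from scratch. */
        *kfd_fd = open("/dev/kfd", O_RDWR | O_CLOEXEC);
        *render_fd = open("/dev/dri/renderD128", O_RDWR | O_CLOEXEC);
    }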

Regards,
  Felix


>
>> Therefore those BOs don't get checkpointed and
>> restored in the child process. When the CRIU checkpoint is restored, our
>> CRIU plugin never creates a BO corresponding to the VMA in the child
>> process' render node FD. We've also lost the relationship between the
>> parent and child-process' render node FDs. After "fork" the render node
>> FD points to the same struct file in parent and child. After restoring
>> the CRIU checkpoint, they are separate struct files, created by separate
>> "open" system calls. Therefore the mmap call that restores the VMA fails
>> in the child process.
>>
>> At least for KFD, there is no point inheriting BOs from a child process,
>> because the GPU has no way of accessing the BOs in the child process.
>> The child process has no GPU address space, no user mode queues, no way
>> to do anything with the GPU before it completely reinitializes its KFD
>> context.
>>
>> We can work around this issue in user mode with madvise(...,
>> MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
>> memory leak in the parent process while a child process exists. But it's
>> slightly racy because there is a short time window where VMA exists
>> without the VM_DONTCOPY flag. A fork during that time window could still
>> create a child process with an inherited VMA.
>>
>> Therefore a safer solution is to set the vm_flags in the VMA in the
>> driver when the VMA is first created.
>
> Thanks for the full explanation, it makes much more sense now.
>
> Regards,
> Christian.
>
>>
>> Regards,
>>    Felix
>>
>>
>>> And thanks for pointing this out, this indeed makes the whole change
>>> extremely questionable.
>>>
>>> Regards,
>>> Christian.
>>>
 Cheers, Daniel

>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-05 Thread Christian König

Am 04.01.22 um 19:08 schrieb Felix Kuehling:

[+Adrian]

Am 2021-12-23 um 2:05 a.m. schrieb Christian König:


Am 22.12.21 um 21:53 schrieb Daniel Vetter:

On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:

[SNIP]
Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random one-off hacks that don't work across the
board, for a problem where we (drm subsystem) really shouldn't be the only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please grab an
ack from them.

Unfortunately it's a KFD design problem. AMD used a single device
node, then mmaped different objects from the same offset to different
processes, and expected it to work without churn in the rest of the fs
subsystem.

This may be true for mmaps in the KFD device, but not for mmaps in the
DRM render nodes.


Correct, yes.


So yes, this is indeed because the mmap space is per file descriptor
for the use case here.

No. This is a different problem.


I was already wondering which mmaps through the KFD node we have left 
which cause problems here.



The problem has to do with the way that DRM manages mmap permissions. In
order to be able to mmap an offset in the render node, there needs to be
a BO that was created in the same render node. If you fork a process, it
inherits the VMA.


Yeah, so far it works like designed.


But KFD doesn't know anything about the inherited BOs
from the parent process.


Ok, why is that? When the KFD is reinitializing its context, why shouldn't 
it clean up those VMAs?



Therefore those BOs don't get checkpointed and
restored in the child process. When the CRIU checkpoint is restored, our
CRIU plugin never creates a BO corresponding to the VMA in the child
process' render node FD. We've also lost the relationship between the
parent and child-process' render node FDs. After "fork" the render node
FD points to the same struct file in parent and child. After restoring
the CRIU checkpoint, they are separate struct files, created by separate
"open" system calls. Therefore the mmap call that restores the VMA fails
in the child process.

At least for KFD, there is no point inheriting BOs from a child process,
because the GPU has no way of accessing the BOs in the child process.
The child process has no GPU address space, no user mode queues, no way
to do anything with the GPU before it completely reinitializes its KFD
context.

We can work around this issue in user mode with madvise(...,
MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
memory leak in the parent process while a child process exists. But it's
slightly racy because there is a short time window where VMA exists
without the VM_DONTCOPY flag. A fork during that time window could still
create a child process with an inherited VMA.

Therefore a safer solution is to set the vm_flags in the VMA in the
driver when the VMA is first created.


Thanks for the full explanation, it makes much more sense now.

Regards,
Christian.



Regards,
   Felix



And thanks for pointing this out, this indeed makes the whole change
extremely questionable.

Regards,
Christian.


Cheers, Daniel





Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2022-01-04 Thread Felix Kuehling
[+Adrian]

Am 2021-12-23 um 2:05 a.m. schrieb Christian König:

> Am 22.12.21 um 21:53 schrieb Daniel Vetter:
>> On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
>>
>> [SNIP]
>> Still sounds funky. I think minimally we should have an ack from CRIU
>> developers that this is officially the right way to solve this problem. I
>> really don't want to have random one-off hacks that don't work across the
>> board, for a problem where we (drm subsystem) really shouldn't be the only
>> one with this problem. Where "this problem" means that the mmap space is
>> per file description, and not per underlying inode or real device or
>> whatever. That part sounds like a CRIU problem, and I expect CRIU folks
>> want a consistent solution across the board for this. Hence please grab an
>> ack from them.
>
> Unfortunately it's a KFD design problem. AMD used a single device
> node, then mmaped different objects from the same offset to different
> processes, and expected it to work without churn in the rest of the fs
> subsystem.

This may be true for mmaps in the KFD device, but not for mmaps in the
DRM render nodes.


>
> So yes, this is indeed because the mmap space is per file descriptor
> for the use case here.

No. This is a different problem.

The problem has to do with the way that DRM manages mmap permissions. In
order to be able to mmap an offset in the render node, there needs to be
a BO that was created in the same render node. If you fork a process, it
inherits the VMA. But KFD doesn't know anything about the inherited BOs
from the parent process. Therefore those BOs don't get checkpointed and
restored in the child process. When the CRIU checkpoint is restored, our
CRIU plugin never creates a BO corresponding to the VMA in the child
process' render node FD. We've also lost the relationship between the
parent and child-process' render node FDs. After "fork" the render node
FD points to the same struct file in parent and child. After restoring
the CRIU checkpoint, they are separate struct files, created by separate
"open" system calls. Therefore the mmap call that restores the VMA fails
in the child process.
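
To make the failure mode concrete: a render node only admits an mmap offset if
the underlying GEM object is on the allow-list of the drm_file doing the mmap,
and that list is populated when a handle to the BO is created on that file
(drm_vma_node_allow()). The sketch below paraphrases that check (roughly the
drm_gem_mmap() logic); it is an illustration, not verbatim kernel code:

    /* Rough paraphrase of the permission check a render-node mmap goes
     * through; not a verbatim copy of drm_gem_mmap(). */
    static int example_gem_mmap_check(struct drm_device *dev,
                                      struct drm_file *priv,
                                      struct vm_area_struct *vma)
    {
        struct drm_vma_offset_node *node;

        node = drm_vma_offset_exact_lookup_locked(dev->vma_offset_manager,
                                                  vma->vm_pgoff,
                                                  vma_pages(vma));
        if (!node)
            return -EINVAL;

        /* Only files that created/opened the BO are on the allow-list.
         * After CRIU restore the child's drm_file never did, so this fails. */
        if (!drm_vma_node_is_allowed(node, priv))
            return -EACCES;

        return 0; /* ...continue with the driver's mmap of the object... */
    }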

At least for KFD, there is no point inheriting BOs from a child process,
because the GPU has no way of accessing the BOs in the child process.
The child process has no GPU address space, no user mode queues, no way
to do anything with the GPU before it completely reinitializes its KFD
context.

We can work around this issue in user mode with madvise(...,
MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
memory leak in the parent process while a child process exists. But it's
slightly racy because there is a short time window where VMA exists
without the VM_DONTCOPY flag. A fork during that time window could still
create a child process with an inherited VMA.
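
A sketch of that user-mode workaround and its race window follows; the helper
is hypothetical, not the actual Thunk code:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical helper: mark a mapping MADV_DONTFORK right after mmap.
     * A fork() that lands between the two calls still inherits the VMA,
     * which is the race described above. */
    void *map_bo_nofork(int render_fd, size_t size, off_t mmap_offset)
    {
        void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, render_fd, mmap_offset);
        if (addr == MAP_FAILED)
            return NULL;

        /* <-- race window: a fork() here still copies the VMA */

        madvise(addr, size, MADV_DONTFORK);
        return addr;
    }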

Therefore a safer solution is to set the vm_flags in the VMA in the
driver when the VMA is first created.

Regards,
  Felix


>
> And thanks for pointing this out, this indeed makes the whole change
> extremely questionable.
>
> Regards,
> Christian.
>
>>
>> Cheers, Daniel
>>
>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-22 Thread Christian König

Am 22.12.21 um 21:53 schrieb Daniel Vetter:

On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:

[SNIP]
Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random one-off hacks that don't work across the
board, for a problem where we (drm subsystem) really shouldn't be the only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please grab an
ack from them.


Unfortunately it's a KFD design problem. AMD used a single device node, 
then mmaped different objects from the same offset to different 
processes, and expected it to work without churn in the rest of the fs 
subsystem.


So yes, this is indeed because the mmap space is per file descriptor for 
the use case here.


And thanks for pointing this out, this indeed makes the whole change 
extremely questionable.


Regards,
Christian.



Cheers, Daniel





Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-22 Thread Bhardwaj, Rajneesh

Sorry for the typo in my previous email. Please read Adrian Reber*

On 12/22/2021 8:49 PM, Bhardwaj, Rajneesh wrote:


Adding Adrian Rebel who is the CRIU maintainer and CRIU list

On 12/22/2021 3:53 PM, Daniel Vetter wrote:

On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:

On 12/20/2021 4:29 AM, Daniel Vetter wrote:

On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:

Am 09.12.21 um 19:28 schrieb Felix Kuehling:

Am 2021-12-09 um 10:30 a.m. schrieb Christian König:

That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.

I think it's fine for the whole amdgpu stack, my concern is more about
radeon, nouveau and the ARM stacks which are using this as well.

That blew up so nicely the last time we tried to change it and I know of at
least one case where radeon was/is used with BOs in a child process.

I'm way late and buried again, but I think it'd be good to be consistent



I had committed this change into our amd-staging-drm-next branch last 
week after I got the ACK and RB from Felix and Christian.




here across drivers. Or at least across drm drivers. And we've had the vma
open/close refcounting to make fork work since forever.

I think if we do this we should really only do this for mmap() where this
applies, but reading through the thread here I'm honestly confused why
this is a problem. If CRIU can't handle forked mmaps it needs to be
taught that, not hacked around. Or at least I'm not understanding why
this shouldn't work ...
-Daniel


Hi Daniel

In the v2
https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/
I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
CRIU, I think it's not a CRIU problem: on restore, CRIU only tries to
recreate all the child processes and then mmaps all the VMAs it sees (as per
the checkpoint snapshot) in the new process address space, after the VMA
placements are finalized in the position-independent code phase. Since the
inherited VMAs don't have access rights, the CRIU mmap fails.

Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random one-off hacks that don't work across the
board, for a problem where we (drm subsystem) really shouldn't be the only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please grab an
ack from them.

Cheers, Daniel



Maybe Adrian can share his views on this.

Hi Adrian - For context, on CRIU restore we see mmap failures (in the 
PIE restore phase) due to permission issues on the (render node) VMAs 
that were inherited because the application that was checkpointed had 
forked. The VMAs ideally should not be in the child process, but the 
smaps file shows these VMAs in the child address space. We didn't want 
to use madvise to avoid this copy, and would rather make the change in 
kernel mode to limit the impact to our user space library (thunk). Based 
on my understanding, during the PIE restore phase, after the VMA 
placements are finalized, CRIU does a sys_mmap on every VMA it sees in 
the VmaEntry list. I think this is not an issue per CRIU's design, but do 
you think we could handle this corner case better inside CRIU?
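
As a deliberately simplified illustration of that restore step (not CRIU's
actual restorer code; the struct below is a made-up stand-in for a VmaEntry,
and sys/mman.h is assumed):

    #include <sys/mman.h>

    /* Made-up stand-in for a checkpointed VMA record; not CRIU's VmaEntry. */
    struct saved_vma { unsigned long start, len, pgoff; int prot, flags, fd; };

    static int restore_one_vma(const struct saved_vma *v)
    {
        void *addr = mmap((void *)v->start, v->len, v->prot,
                          v->flags | MAP_FIXED, v->fd, (off_t)v->pgoff);

        /* For an inherited render node VMA the child's freshly opened fd has
         * no matching BO, so this mmap is refused and the restore fails. */
        return addr == MAP_FAILED ? -1 : 0;
    }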




Regards,

Rajneesh


Regards,
Christian.


Regards,
     Felix



Regards,
Christian.

Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:

Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
you!

On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly
a good idea.


The child cannot access the BOs mapped by the parent anyway with
access restrictions applied

exactly that is not correct. That behavior is actively used by some
userspace stacks as far as I know.

Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:

Thanks Christian. Would it make it less intrusive if I just use the
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
this patch? For our use case, just the ttm_bo_mmap_obj change
should suffice and we don't want to put any more work arounds in
the user space (thunk, in our case).

The child cannot access the BOs mapped by the parent anyway with
access 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-22 Thread Bhardwaj, Rajneesh

Adding Adrian Rebel who is the CRIU maintainer and CRIU list

On 12/22/2021 3:53 PM, Daniel Vetter wrote:

On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:

On 12/20/2021 4:29 AM, Daniel Vetter wrote:

On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:

Am 09.12.21 um 19:28 schrieb Felix Kuehling:

Am 2021-12-09 um 10:30 a.m. schrieb Christian König:

That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.

I think it's fine for the whole amdgpu stack, my concern is more about
radeon, nouveau and the ARM stacks which are using this as well.

That blew up so nicely the last time we tried to change it and I know of at
least one case where radeon was/is used with BOs in a child process.

I'm way late and buried again, but I think it'd be good to be consistent



I had committed this change into our amd-staging-drm-next branch last 
week after I got the ACK and RB from Felix and Christian.




here across drivers. Or at least across drm drivers. And we've had the vma
open/close refcounting to make fork work since forever.

I think if we do this we should really only do this for mmap() where this
applies, but reading through the thread here I'm honestly confused why
this is a problem. If CRIU can't handle forked mmaps it needs to be
taught that, not hacked around. Or at least I'm not understanding why
this shouldn't work ...
-Daniel


Hi Daniel

In the v2
https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53eceb6@amd.com/T/
I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
CRIU, I think it's not a CRIU problem: on restore, CRIU only tries to
recreate all the child processes and then mmaps all the VMAs it sees (as per
the checkpoint snapshot) in the new process address space, after the VMA
placements are finalized in the position-independent code phase. Since the
inherited VMAs don't have access rights, the CRIU mmap fails.

Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random one-off hacks that don't work across the
board, for a problem where we (drm subsystem) really shouldn't be the only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please grab an
ack from them.

Cheers, Daniel



Maybe Adrian can share his views on this.

Hi Adrian - For context, on CRIU restore we see mmap failures (in the 
PIE restore phase) due to permission issues on the (render node) VMAs 
that were inherited because the application that was checkpointed had 
forked. The VMAs ideally should not be in the child process, but the 
smaps file shows these VMAs in the child address space. We didn't want 
to use madvise to avoid this copy, and would rather make the change in 
kernel mode to limit the impact to our user space library (thunk). Based 
on my understanding, during the PIE restore phase, after the VMA 
placements are finalized, CRIU does a sys_mmap on every VMA it sees in 
the VmaEntry list. I think this is not an issue per CRIU's design, but do 
you think we could handle this corner case better inside CRIU?






Regards,

Rajneesh


Regards,
Christian.


Regards,
     Felix



Regards,
Christian.

Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:

Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
you!

On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly
a good idea.


The child cannot access the BOs mapped by the parent anyway with
access restrictions applied

exactly that is not correct. That behavior is actively used by some
userspace stacks as far as I know.

Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:

Thanks Christian. Would it make it less intrusive if I just use the
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
this patch? For our use case, just the ttm_bo_mmap_obj change
should suffice and we don't want to put any more work arounds in
the user space (thunk, in our case).

The child cannot access the BOs mapped by the parent anyway with
access restrictions applied so I wonder why even inherit the vma?

On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-22 Thread Daniel Vetter
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:
> 
> On 12/20/2021 4:29 AM, Daniel Vetter wrote:
> > On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
> > > Am 09.12.21 um 19:28 schrieb Felix Kuehling:
> > > > Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
> > > > > That still won't work.
> > > > > 
> > > > > But I think we could do this change for the amdgpu mmap callback only.
> > > > If graphics user mode has problems with it, we could even make this
> > > > specific to KFD BOs in the amdgpu_gem_object_mmap callback.
> > > I think it's fine for the whole amdgpu stack, my concern is more about
> > > radeon, nouveau and the ARM stacks which are using this as well.
> > > 
> > > That blew up so nicely the last time we tried to change it and I know of 
> > > at
> > > least one case where radeon was/is used with BOs in a child process.
> > I'm way late and buried again, but I think it'd be good to be consistent
> > here across drivers. Or at least across drm drivers. And we've had the vma
> > open/close refcounting to make fork work since forever.
> > 
> > I think if we do this we should really only do this for mmap() where this
> > applies, but reading through the thread here I'm honestly confused why
> > this is a problem. If CRIU can't handle forked mmaps it needs to be
> > taught that, not hacked around. Or at least I'm not understanding why
> > this shouldn't work ...
> > -Daniel
> > 
> 
> Hi Daniel
> 
> In the v2
> https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53ec...@amd.com/T/
> I pretty much limited the scope of the change to KFD BOs on mmap. Regarding
> CRIU, I think it's not a CRIU problem: on restore, CRIU only tries to
> recreate all the child processes and then mmaps all the VMAs it sees (as per
> the checkpoint snapshot) in the new process address space, after the VMA
> placements are finalized in the position-independent code phase. Since the
> inherited VMAs don't have access rights, the CRIU mmap fails.

Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this problem. I
really don't want to have random one-off hacks that don't work across the
board, for a problem where we (drm subsystem) really shouldn't be the only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please grab an
ack from them.

Cheers, Daniel

> 
> Regards,
> 
> Rajneesh
> 
> > > Regards,
> > > Christian.
> > > 
> > > > Regards,
> > > >     Felix
> > > > 
> > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
> > > > > > Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. 
> > > > > > Thank
> > > > > > you!
> > > > > > 
> > > > > > On 12/9/2021 10:27 AM, Christian König wrote:
> > > > > > > Hi Rajneesh,
> > > > > > > 
> > > > > > > yes, separating this from the drm_gem_mmap_obj() change is 
> > > > > > > certainly
> > > > > > > a good idea.
> > > > > > > 
> > > > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > > > access restrictions applied
> > > > > > > exactly that is not correct. That behavior is actively used by 
> > > > > > > some
> > > > > > > userspace stacks as far as I know.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
> > > > > > > > Thanks Christian. Would it make it less intrusive if I just use 
> > > > > > > > the
> > > > > > > > flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
> > > > > > > > this patch? For our use case, just the ttm_bo_mmap_obj change
> > > > > > > > should suffice and we don't want to put any more work arounds in
> > > > > > > > the user space (thunk, in our case).
> > > > > > > > 
> > > > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > > > access restrictions applied so I wonder why even inherit the 
> > > > > > > > vma?
> > > > > > > > 
> > > > > > > > On 12/9/2021 2:54 AM, Christian König wrote:
> > > > > > > > > Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
> > > > > > > > > > When an application having open file access to a node 
> > > > > > > > > > forks, its
> > > > > > > > > > shared
> > > > > > > > > > mappings also get reflected in the address space of child 
> > > > > > > > > > process
> > > > > > > > > > even
> > > > > > > > > > though it cannot access them with the object permissions 
> > > > > > > > > > applied.
> > > > > > > > > > With the
> > > > > > > > > > existing permission checks on the gem objects, it might be
> > > > > > > > > > reasonable to
> > > > > > > > > > also create the VMAs with VM_DONTCOPY flag so a user space
> > > > > > > > > > application
> > > > > > 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-20 Thread Bhardwaj, Rajneesh



On 12/20/2021 4:29 AM, Daniel Vetter wrote:

On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:

Am 09.12.21 um 19:28 schrieb Felix Kuehling:

Am 2021-12-09 um 10:30 a.m. schrieb Christian König:

That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.

I think it's fine for the whole amdgpu stack, my concern is more about
radeon, nouveau and the ARM stacks which are using this as well.

That blew up so nicely the last time we tried to change it and I know of at
least one case where radeon was/is used with BOs in a child process.

I'm way late and buried again, but I think it'd be good to be consistent
here across drivers. Or at least across drm drivers. And we've had the vma
open/close refcounting to make fork work since forever.

I think if we do this we should really only do this for mmap() where this
applies, but reading through the thread here I'm honestly confused why
this is a problem. If CRIU can't handle forked mmaps it needs to be
taught that, not hacked around. Or at least I'm not understanding why
this shouldn't work ...
-Daniel



Hi Daniel

In the v2 
https://lore.kernel.org/all/a1a865f5-ad2c-29c8-cbe4-2635d53ec...@amd.com/T/ 
I pretty much limited the scope of the change to KFD BOs on mmap. 
Regarding CRIU, I think it's not a CRIU problem: on restore, CRIU only 
tries to recreate all the child processes and then mmaps all the VMAs it 
sees (as per the checkpoint snapshot) in the new process address space, 
after the VMA placements are finalized in the position-independent code 
phase. Since the inherited VMAs don't have access rights, the CRIU mmap fails.


Regards,

Rajneesh


Regards,
Christian.


Regards,
    Felix



Regards,
Christian.

Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:

Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
you!

On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly
a good idea.


The child cannot access the BOs mapped by the parent anyway with
access restrictions applied

exactly that is not correct. That behavior is actively used by some
userspace stacks as far as I know.

Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:

Thanks Christian. Would it make it less intrusive if I just use the
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
this patch? For our use case, just the ttm_bo_mmap_obj change
should suffice and we don't want to put any more work arounds in
the user space (thunk, in our case).

The child cannot access the BOs mapped by the parent anyway with
access restrictions applied so I wonder why even inherit the vma?

On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:

When an application having open file access to a node forks, its
shared
mappings also get reflected in the address space of child process
even
though it cannot access them with the object permissions applied.
With the
existing permission checks on the gem objects, it might be
reasonable to
also create the VMAs with VM_DONTCOPY flag so a user space
application
doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK)
system call to prevent the pages in the mapped range to appear in
the
address space of the child process. It also prevents the memory
leaks
due to additional reference counts on the mapped BOs in the child
process that prevented freeing the memory in the parent for which
we had
worked around earlier in the user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint
restore
an application that had such inherited mappings in the child which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of
this so
this is needed only for the render nodes.

Unfortunately that is most likely a NAK. We already tried
something similar.

While it is illegal by the OpenGL specification and doesn't work
for most userspace stacks, we do have some implementations which
call fork() with a GL context open and expect it to work.

Regards,
Christian.


Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
    drivers/gpu/drm/drm_gem.c   | 3 ++-
    drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
    2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object
*obj, unsigned long obj_size,
    goto err_drm_gem_object_put;
    }
    -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND |
VM_DONTDUMP;
+    vma->vm_flags |= VM_IO | 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-20 Thread Daniel Vetter
On Fri, Dec 10, 2021 at 07:58:50AM +0100, Christian König wrote:
> Am 09.12.21 um 19:28 schrieb Felix Kuehling:
> > Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
> > > That still won't work.
> > > 
> > > But I think we could do this change for the amdgpu mmap callback only.
> > If graphics user mode has problems with it, we could even make this
> > specific to KFD BOs in the amdgpu_gem_object_mmap callback.
> 
> I think it's fine for the whole amdgpu stack, my concern is more about
> radeon, nouveau and the ARM stacks which are using this as well.
> 
> That blew up so nicely the last time we tried to change it and I know of at
> least one case where radeon was/is used with BOs in a child process.

I'm way late and buried again, but I think it'd be good to be consistent
here across drivers. Or at least across drm drivers. And we've had the vma
open/close refcounting to make fork work since forever.
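
The refcounting Daniel refers to is the GEM vm_ops open/close pair: when
fork() duplicates a VMA, .open takes an extra reference on the object, and
.close drops it when either copy goes away. Paraphrased below, not the
verbatim helpers:

    /* Paraphrase of the GEM vm_ops refcounting, not the verbatim helpers. */
    static void example_vm_open(struct vm_area_struct *vma)
    {
        struct drm_gem_object *obj = vma->vm_private_data;

        drm_gem_object_get(obj);   /* the child's copy of the VMA pins the BO */
    }

    static void example_vm_close(struct vm_area_struct *vma)
    {
        struct drm_gem_object *obj = vma->vm_private_data;

        drm_gem_object_put(obj);
    }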

I think if we do this we should really only do this for mmap() where this
applies, but reading through the thread here I'm honestly confused why
this is a problem. If CRIU can't handle forked mmaps it needs to be
taught that, not hacked around. Or at least I'm not understanding why
this shouldn't work ...
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > Regards,
> >    Felix
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
> > > > Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
> > > > you!
> > > > 
> > > > On 12/9/2021 10:27 AM, Christian König wrote:
> > > > > Hi Rajneesh,
> > > > > 
> > > > > yes, separating this from the drm_gem_mmap_obj() change is certainly
> > > > > a good idea.
> > > > > 
> > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > access restrictions applied
> > > > > exactly that is not correct. That behavior is actively used by some
> > > > > userspace stacks as far as I know.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
> > > > > > Thanks Christian. Would it make it less intrusive if I just use the
> > > > > > flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
> > > > > > this patch? For our use case, just the ttm_bo_mmap_obj change
> > > > > > should suffice and we don't want to put any more work arounds in
> > > > > > the user space (thunk, in our case).
> > > > > > 
> > > > > > The child cannot access the BOs mapped by the parent anyway with
> > > > > > access restrictions applied so I wonder why even inherit the vma?
> > > > > > 
> > > > > > On 12/9/2021 2:54 AM, Christian König wrote:
> > > > > > > Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
> > > > > > > > When an application having open file access to a node forks, its
> > > > > > > > shared
> > > > > > > > mappings also get reflected in the address space of child 
> > > > > > > > process
> > > > > > > > even
> > > > > > > > though it cannot access them with the object permissions 
> > > > > > > > applied.
> > > > > > > > With the
> > > > > > > > existing permission checks on the gem objects, it might be
> > > > > > > > reasonable to
> > > > > > > > also create the VMAs with VM_DONTCOPY flag so a user space
> > > > > > > > application
> > > > > > > > doesn't need to explicitly call the madvise(addr, len,
> > > > > > > > MADV_DONTFORK)
> > > > > > > > system call to prevent the pages in the mapped range to appear 
> > > > > > > > in
> > > > > > > > the
> > > > > > > > address space of the child process. It also prevents the memory
> > > > > > > > leaks
> > > > > > > > due to additional reference counts on the mapped BOs in the 
> > > > > > > > child
> > > > > > > > process that prevented freeing the memory in the parent for 
> > > > > > > > which
> > > > > > > > we had
> > > > > > > > worked around earlier in the user space inside the thunk 
> > > > > > > > library.
> > > > > > > > 
> > > > > > > > Additionally, we faced this issue when using CRIU to checkpoint
> > > > > > > > restore
> > > > > > > > an application that had such inherited mappings in the child 
> > > > > > > > which
> > > > > > > > confuse CRIU when it mmaps on restore. Having this flag set for 
> > > > > > > > the
> > > > > > > > render node VMAs helps. VMAs mapped via KFD already take care of
> > > > > > > > this so
> > > > > > > > this is needed only for the render nodes.
> > > > > > > Unfortunately that is most likely a NAK. We already tried
> > > > > > > something similar.
> > > > > > > 
> > > > > > > While it is illegal by the OpenGL specification and doesn't work
> > > > > > > for most userspace stacks, we do have some implementations which
> > > > > > > call fork() with a GL context open and expect it to work.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > Cc: Felix Kuehling 
> > > > > > > > 
> > > > > > > > Signed-off-by: David Yat Sin 
> > > > > > > > Signed-off-by: Rajneesh Bhardwaj 
> > > > > > > > 

Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Christian König

Am 09.12.21 um 19:28 schrieb Felix Kuehling:

Am 2021-12-09 um 10:30 a.m. schrieb Christian König:

That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.


I think it's fine for the whole amdgpu stack, my concern is more about 
radeon, nouveau and the ARM stacks which are using this as well.


That blew up so nicely the last time we tried to change it and I know of 
at least one case where radeon was/is used with BOs in a child process.


Regards,
Christian.



Regards,
   Felix



Regards,
Christian.

Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:

Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
you!

On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly
a good idea.


The child cannot access the BOs mapped by the parent anyway with
access restrictions applied

exactly that is not correct. That behavior is actively used by some
userspace stacks as far as I know.

Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:

Thanks Christian. Would it make it less intrusive if I just use the
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
this patch? For our use case, just the ttm_bo_mmap_obj change
should suffice and we don't want to put any more work arounds in
the user space (thunk, in our case).

The child cannot access the BOs mapped by the parent anyway with
access restrictions applied so I wonder why even inherit the vma?

On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:

When an application having open file access to a node forks, its
shared
mappings also get reflected in the address space of child process
even
though it cannot access them with the object permissions applied.
With the
existing permission checks on the gem objects, it might be
reasonable to
also create the VMAs with VM_DONTCOPY flag so a user space
application
doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK)
system call to prevent the pages in the mapped range to appear in
the
address space of the child process. It also prevents the memory
leaks
due to additional reference counts on the mapped BOs in the child
process that prevented freeing the memory in the parent for which
we had
worked around earlier in the user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint
restore
an application that had such inherited mappings in the child which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of
this so
this is needed only for the render nodes.

Unfortunately that is most likely a NAK. We already tried
something similar.

While it is illegal by the OpenGL specification and doesn't work
for most userspace stacks, we do have some implementations which
call fork() with a GL context open and expect it to work.

Regards,
Christian.


Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
   drivers/gpu/drm/drm_gem.c   | 3 ++-
   drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
   2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object
*obj, unsigned long obj_size,
   goto err_drm_gem_object_put;
   }
   -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND |
VM_DONTDUMP;
+    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+    | VM_DONTDUMP | VM_DONTCOPY;
   vma->vm_page_prot =
pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
   vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
   }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct
*vma, struct ttm_buffer_object *bo)
     vma->vm_private_data = bo;
   -    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
   vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
   return 0;
   }




Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Felix Kuehling
Am 2021-12-09 um 10:30 a.m. schrieb Christian König:
> That still won't work.
>
> But I think we could do this change for the amdgpu mmap callback only.

If graphics user mode has problems with it, we could even make this
specific to KFD BOs in the amdgpu_gem_object_mmap callback.
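
A rough sketch of what such a KFD-only variant could look like in the amdgpu
GEM mmap callback; the bo->kfd_bo test and the exact callback shape are
assumptions, not the final code:

    /* Hypothetical sketch: set VM_DONTCOPY only for KFD-owned BOs in the
     * amdgpu GEM mmap callback. The bo->kfd_bo test is an assumption. */
    static int example_amdgpu_gem_object_mmap(struct drm_gem_object *obj,
                                              struct vm_area_struct *vma)
    {
        struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);

        if (bo->kfd_bo)
            vma->vm_flags |= VM_DONTCOPY;

        return drm_gem_ttm_mmap(obj, vma);
    }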

Regards,
  Felix


>
> Regards,
> Christian.
>
> Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
>> Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank
>> you!
>>
>> On 12/9/2021 10:27 AM, Christian König wrote:
>>> Hi Rajneesh,
>>>
>>> yes, separating this from the drm_gem_mmap_obj() change is certainly
>>> a good idea.
>>>
 The child cannot access the BOs mapped by the parent anyway with
 access restrictions applied
>>>
>>> exactly that is not correct. That behavior is actively used by some
>>> userspace stacks as far as I know.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
 Thanks Christian. Would it make it less intrusive if I just use the
 flag for ttm bo mmap and remove the drm_gem_mmap_obj change from
 this patch? For our use case, just the ttm_bo_mmap_obj change
 should suffice and we don't want to put any more work arounds in
 the user space (thunk, in our case).

 The child cannot access the BOs mapped by the parent anyway with
 access restrictions applied so I wonder why even inherit the vma?

 On 12/9/2021 2:54 AM, Christian König wrote:
> Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
>> When an application having open file access to a node forks, its
>> shared
>> mappings also get reflected in the address space of child process
>> even
>> though it cannot access them with the object permissions applied.
>> With the
>> existing permission checks on the gem objects, it might be
>> reasonable to
>> also create the VMAs with VM_DONTCOPY flag so a user space
>> application
>> doesn't need to explicitly call the madvise(addr, len,
>> MADV_DONTFORK)
>> system call to prevent the pages in the mapped range to appear in
>> the
>> address space of the child process. It also prevents the memory
>> leaks
>> due to additional reference counts on the mapped BOs in the child
>> process that prevented freeing the memory in the parent for which
>> we had
>> worked around earlier in the user space inside the thunk library.
>>
>> Additionally, we faced this issue when using CRIU to checkpoint
>> restore
>> an application that had such inherited mappings in the child which
>> confuse CRIU when it mmaps on restore. Having this flag set for the
>> render node VMAs helps. VMAs mapped via KFD already take care of
>> this so
>> this is needed only for the render nodes.
>
> Unfortunately that is most likely a NAK. We already tried
> something similar.
>
> While it is illegal by the OpenGL specification and doesn't work
> for most userspace stacks, we do have some implementations which
> call fork() with a GL context open and expect it to work.
>
> Regards,
> Christian.
>
>>
>> Cc: Felix Kuehling 
>>
>> Signed-off-by: David Yat Sin 
>> Signed-off-by: Rajneesh Bhardwaj 
>> ---
>>   drivers/gpu/drm/drm_gem.c   | 3 ++-
>>   drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
>>   2 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
>> index 09c820045859..d9c4149f36dd 100644
>> --- a/drivers/gpu/drm/drm_gem.c
>> +++ b/drivers/gpu/drm/drm_gem.c
>> @@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object
>> *obj, unsigned long obj_size,
>>   goto err_drm_gem_object_put;
>>   }
>>   -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND |
>> VM_DONTDUMP;
>> +    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
>> +    | VM_DONTDUMP | VM_DONTCOPY;
>>   vma->vm_page_prot =
>> pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
>>   vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>>   }
>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> index 33680c94127c..420a4898fdd2 100644
>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>> @@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct
>> *vma, struct ttm_buffer_object *bo)
>>     vma->vm_private_data = bo;
>>   -    vma->vm_flags |= VM_PFNMAP;
>> +    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
>>   vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>>   return 0;
>>   }
>
>>>
>


Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Bhardwaj, Rajneesh

Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank you!

On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly a 
good idea.


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied


exactly that is not correct. That behavior is actively used by some 
userspace stacks as far as I know.


Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
Thanks Christian. Would it make it less intrusive if I just use the 
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from this 
patch? For our use case, just the ttm_bo_mmap_obj change should 
suffice and we don't want to put any more work arounds in the user 
space (thunk, in our case).


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied so I wonder why even inherit the vma?


On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
When an application having open file access to a node forks, its 
shared

mappings also get reflected in the address space of child process even
though it cannot access them with the object permissions applied. 
With the
existing permission checks on the gem objects, it might be 
reasonable to

also create the VMAs with VM_DONTCOPY flag so a user space application
doesn't need to explicitly call the madvise(addr, len, MADV_DONTFORK)
system call to prevent the pages in the mapped range to appear in the
address space of the child process. It also prevents the memory leaks
due to additional reference counts on the mapped BOs in the child
process that prevented freeing the memory in the parent for which 
we had

worked around earlier in the user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint 
restore

an application that had such inherited mappings in the child which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of 
this so

this is needed only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something 
similar.


While it is illegal by the OpenGL specification and doesn't work for 
most userspace stacks, we do have some implementations which call 
fork() with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object 
*obj, unsigned long obj_size,

  goto err_drm_gem_object_put;
  }
  -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | 
VM_DONTDUMP;

+    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+    | VM_DONTDUMP | VM_DONTCOPY;
  vma->vm_page_prot = 
pgprot_writecombine(vm_get_page_prot(vma->vm_flags));

  vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
  }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
b/drivers/gpu/drm/ttm/ttm_bo_vm.c

index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, 
struct ttm_buffer_object *bo)

    vma->vm_private_data = bo;
  -    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
  vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
  return 0;
  }






Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Christian König

That still won't work.

But I think we could do this change for the amdgpu mmap callback only.

Regards,
Christian.

Am 09.12.21 um 16:29 schrieb Bhardwaj, Rajneesh:
Sounds good. I will send a v2 with only ttm_bo_mmap_obj change. Thank 
you!


On 12/9/2021 10:27 AM, Christian König wrote:

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly 
a good idea.


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied


exactly that is not correct. That behavior is actively used by some 
userspace stacks as far as I know.


Regards,
Christian.

Am 09.12.21 um 16:23 schrieb Bhardwaj, Rajneesh:
Thanks Christian. Would it make it less intrusive if I just use the 
flag for ttm bo mmap and remove the drm_gem_mmap_obj change from 
this patch? For our use case, just the ttm_bo_mmap_obj change should 
suffice and we don't want to put any more work arounds in the user 
space (thunk, in our case).


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied so I wonder why even inherit the vma?


On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:
When an application having open file access to a node forks, its 
shared
mappings also get reflected in the address space of child process 
even
though it cannot access them with the object permissions applied. 
With the
existing permission checks on the gem objects, it might be 
reasonable to
also create the VMAs with VM_DONTCOPY flag so a user space 
application

doesn't need to explicitly call the madvise(addr, len, MADV_DONTFORK)
system call to prevent the pages in the mapped range to appear in the
address space of the child process. It also prevents the memory leaks
due to additional reference counts on the mapped BOs in the child
process that prevented freeing the memory in the parent for which 
we had

worked around earlier in the user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint 
restore

an application that had such inherited mappings in the child which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of 
this so

this is needed only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something 
similar.


While it is illegal by the OpenGL specification and doesn't work 
for most userspace stacks, we do have some implementations which 
call fork() with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object 
*obj, unsigned long obj_size,

  goto err_drm_gem_object_put;
  }
  -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | 
VM_DONTDUMP;

+    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+    | VM_DONTDUMP | VM_DONTCOPY;
  vma->vm_page_prot = 
pgprot_writecombine(vm_get_page_prot(vma->vm_flags));

  vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
  }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
b/drivers/gpu/drm/ttm/ttm_bo_vm.c

index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct 
*vma, struct ttm_buffer_object *bo)

    vma->vm_private_data = bo;
  -    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
  vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
  return 0;
  }








Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Bhardwaj, Rajneesh
Thanks Christian. Would it make it less intrusive if I just use the flag 
for ttm bo mmap and remove the drm_gem_mmap_obj change from this patch? 
For our use case, just the ttm_bo_mmap_obj change should suffice and we 
don't want to put any more work arounds in the user space (thunk, in our 
case).


The child cannot access the BOs mapped by the parent anyway with access 
restrictions applied so I wonder why even inherit the vma?


On 12/9/2021 2:54 AM, Christian König wrote:

Am 08.12.21 um 21:53 schrieb Rajneesh Bhardwaj:

When an application having open file access to a node forks, its shared
mappings also get reflected in the address space of child process even
though it cannot access them with the object permissions applied. 
With the

existing permission checks on the gem objects, it might be reasonable to
also create the VMAs with VM_DONTCOPY flag so a user space application
doesn't need to explicitly call the madvise(addr, len, MADV_DONTFORK)
system call to prevent the pages in the mapped range to appear in the
address space of the child process. It also prevents the memory leaks
due to additional reference counts on the mapped BOs in the child
process that prevented freeing the memory in the parent for which we had
worked around earlier in the user space inside the thunk library.

Additionally, we faced this issue when using CRIU to checkpoint restore
an application that had such inherited mappings in the child which
confuse CRIU when it mmaps on restore. Having this flag set for the
render node VMAs helps. VMAs mapped via KFD already take care of this so
this is needed only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something 
similar.


While it is illegal by the OpenGL specification and doesn't work for 
most userspace stacks, we do have some implementations which call 
fork() with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object 
*obj, unsigned long obj_size,

  goto err_drm_gem_object_put;
  }
  -    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | 
VM_DONTDUMP;

+    vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+    | VM_DONTDUMP | VM_DONTCOPY;
  vma->vm_page_prot = 
pgprot_writecombine(vm_get_page_prot(vma->vm_flags));

  vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
  }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
     vma->vm_private_data = bo;
 
-    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
     vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
     return 0;
 }




Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Christian König

Hi Rajneesh,

yes, separating this from the drm_gem_mmap_obj() change is certainly a 
good idea.


The child cannot access the BOs mapped by the parent anyway with 
access restrictions applied


exactly that is not correct. That behavior is actively used by some 
userspace stacks as far as I know.


Regards,
Christian.
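Whether or not the child can make meaningful use of the inherited mapping,
the proposed flag removes it from the child entirely. The semantics can be
demonstrated with plain anonymous memory, since MADV_DONTFORK gives a VMA the
same not-copied-on-fork behaviour that VM_DONTCOPY would give the render node
VMAs. This is a self-contained emulation, not driver code:

/*
 * Emulation of the fork() semantics under discussion: a mapping marked
 * MADV_DONTFORK is not copied into the child, so the child has no VMA
 * at that address and touching it raises SIGSEGV -- the same behaviour
 * a render node mapping would have with VM_DONTCOPY set by the driver.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t size = 4096;
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Same effect on fork() as VM_DONTCOPY set by the kernel. */
    if (madvise(p, size, MADV_DONTFORK)) {
        perror("madvise");
        return EXIT_FAILURE;
    }
    p[0] = 1;                /* the parent can use the mapping normally */

    pid_t pid = fork();
    if (pid == 0) {
        p[0] = 2;            /* no VMA here in the child -> SIGSEGV */
        _exit(0);            /* not reached */
    }

    int status;
    waitpid(pid, &status, 0);
    printf("child %s\n",
           WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV ?
           "faulted as expected" : "did not fault");
    munmap(p, size);
    return 0;
}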

On 09.12.21 at 16:23, Bhardwaj, Rajneesh wrote:
Thanks Christian. Would it make it less intrusive if I just use the flag
for the TTM BO mmap and remove the drm_gem_mmap_obj change from this
patch? For our use case, just the ttm_bo_mmap_obj change should suffice,
and we don't want to put any more workarounds in the user space (thunk,
in our case).


The child cannot access the BOs mapped by the parent anyway with the
access restrictions applied, so I wonder why even inherit the VMA?


On 12/9/2021 2:54 AM, Christian König wrote:

On 08.12.21 at 21:53, Rajneesh Bhardwaj wrote:

When an application that has open file access to a node forks, its shared
mappings also get reflected in the address space of the child process, even
though the child cannot access them with the object permissions applied. With
the existing permission checks on the GEM objects, it might be reasonable to
also create the VMAs with the VM_DONTCOPY flag, so that a user space
application doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK) system call to keep the pages in the mapped range from
appearing in the address space of the child process. It also prevents memory
leaks: the additional references taken on the mapped BOs in the child process
kept the memory from being freed in the parent, something we had previously
worked around in user space inside the thunk library.

Additionally, we hit this issue when using CRIU to checkpoint and restore an
application that had such inherited mappings in the child, which confuses
CRIU when it mmaps them on restore. Having this flag set for the render node
VMAs helps. VMAs mapped via KFD already take care of this, so this is needed
only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something similar.


While it is illegal according to the OpenGL specification and doesn't work
for most userspace stacks, we do have some implementations that call fork()
with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
             goto err_drm_gem_object_put;
         }
 
-        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+                | VM_DONTDUMP | VM_DONTCOPY;
         vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
         vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
     }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
     vma->vm_private_data = bo;
 
-    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
     vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
     return 0;
 }






Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-09 Thread Christian König

On 08.12.21 at 21:53, Rajneesh Bhardwaj wrote:

When an application that has open file access to a node forks, its shared
mappings also get reflected in the address space of the child process, even
though the child cannot access them with the object permissions applied. With
the existing permission checks on the GEM objects, it might be reasonable to
also create the VMAs with the VM_DONTCOPY flag, so that a user space
application doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK) system call to keep the pages in the mapped range from
appearing in the address space of the child process. It also prevents memory
leaks: the additional references taken on the mapped BOs in the child process
kept the memory from being freed in the parent, something we had previously
worked around in user space inside the thunk library.

Additionally, we hit this issue when using CRIU to checkpoint and restore an
application that had such inherited mappings in the child, which confuses
CRIU when it mmaps them on restore. Having this flag set for the render node
VMAs helps. VMAs mapped via KFD already take care of this, so this is needed
only for the render nodes.


Unfortunately that is most likely a NAK. We already tried something similar.

While it is illegal according to the OpenGL specification and doesn't work
for most userspace stacks, we do have some implementations that call fork()
with a GL context open and expect it to work.


Regards,
Christian.



Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
  drivers/gpu/drm/drm_gem.c   | 3 ++-
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
  2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
             goto err_drm_gem_object_put;
         }
 
-        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+                | VM_DONTDUMP | VM_DONTCOPY;
         vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
         vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
     }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
     vma->vm_private_data = bo;
 
-    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
     vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
     return 0;
 }




[PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

2021-12-08 Thread Rajneesh Bhardwaj
When an application that has open file access to a node forks, its shared
mappings also get reflected in the address space of the child process, even
though the child cannot access them with the object permissions applied. With
the existing permission checks on the GEM objects, it might be reasonable to
also create the VMAs with the VM_DONTCOPY flag, so that a user space
application doesn't need to explicitly call the madvise(addr, len,
MADV_DONTFORK) system call to keep the pages in the mapped range from
appearing in the address space of the child process. It also prevents memory
leaks: the additional references taken on the mapped BOs in the child process
kept the memory from being freed in the parent, something we had previously
worked around in user space inside the thunk library.

Additionally, we hit this issue when using CRIU to checkpoint and restore an
application that had such inherited mappings in the child, which confuses
CRIU when it mmaps them on restore. Having this flag set for the render node
VMAs helps. VMAs mapped via KFD already take care of this, so this is needed
only for the render nodes.

Cc: Felix Kuehling 

Signed-off-by: David Yat Sin 
Signed-off-by: Rajneesh Bhardwaj 
---
 drivers/gpu/drm/drm_gem.c   | 3 ++-
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index 09c820045859..d9c4149f36dd 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -1058,7 +1058,8 @@ int drm_gem_mmap_obj(struct drm_gem_object *obj, unsigned long obj_size,
             goto err_drm_gem_object_put;
         }
 
-        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
+        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND
+                | VM_DONTDUMP | VM_DONTCOPY;
         vma->vm_page_prot = pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
         vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
     }
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 33680c94127c..420a4898fdd2 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -566,7 +566,7 @@ int ttm_bo_mmap_obj(struct vm_area_struct *vma, struct ttm_buffer_object *bo)
 
     vma->vm_private_data = bo;
 
-    vma->vm_flags |= VM_PFNMAP;
+    vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY;
     vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
     return 0;
 }
-- 
2.17.1