Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-15 Thread Josh Triplett
On Sun, Mar 15, 2015 at 10:18:05AM +, David Drysdale wrote:
> On Sat, Mar 14, 2015 at 7:29 PM, Josh Triplett  wrote:
> > On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
> >> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> >> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> >> > > In any event, we should find out what FreeBSD does in response to
> >> > > read(2) on the fd.
> >> >
> >> > I've just successfully installed FreeBSD and compiled qtbase (main
> >> > package of Qt 5) on it.
> >> >
> >> > I'll test pdfork during the weekend and report its behaviour.
> >>
> >> Here are my findings about pdfork.
> >>
> >> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
> >> Qt adaptations: https://codereview.qt-project.org/108561
> >>
> >> Processes created with pdfork() are normal processes that still send
> >> SIGCHLD to their parents. The only difference is that you get the extra
> >> file descriptor that can be passed to the pdgetpid() system call and
> >> works on select()/poll(). Trying to read from that file descriptor will
> >> result in EOPNOTSUPP.
> >
> > OK, since read() doesn't work on a pdfork() file descriptor, we don't
> > have to worry about compatibility with pdfork()'s read result.
> >
> > However, if the expectation is that pdfork()ed child processes still
> > send SIGCHLD, then I don't see how we can be compatible there, nor do I
> > think we want to; as you mention below, that breaks the ability to
> > encapsulate management of the created process entirely within a library.
> 
> I didn't think that was the case -- my understanding was that pdfork()ed
> children would not generate SIGCHLD (and that does seem to be the
> case with a quick test program).

Well, either way, v2 of this series is capable of producing either
behavior.  You can have a clonefd and still receive SIGCHLD or any other
signal, or none at all, and you can decide independently from that if
you want autoreaping or waiting.
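
For comparison, Linux already has a process-wide form of autoreaping,
SA_NOCLDWAIT on SIGCHLD.  The sketch below shows that existing mechanism
only; it is not the per-child control offered by this series, and
demo_autoreap() is a made-up helper name.  With the flag set, the child
never becomes a zombie, so waitpid() has nothing to collect:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Enable process-wide autoreaping via SA_NOCLDWAIT, fork a child, and
 * confirm that there is nothing left to wait for once it has exited.
 * Returns 1 if the child was autoreaped, 0 if not, -1 on setup failure.
 */
static int demo_autoreap(void)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = SIG_DFL;
    sa.sa_flags = SA_NOCLDWAIT;      /* children never become zombies */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old) < 0)
        return -1;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                    /* child: exit right away */
    if (pid > 0) {
        /* Blocks until the child is gone, then fails with ECHILD. */
        pid_t r = waitpid(pid, NULL, 0);
        result = (r == -1 && errno == ECHILD) ? 1 : 0;
    }

    sigaction(SIGCHLD, &old, NULL);  /* restore the caller's disposition */
    return result;
}
```

The drawback is the one this thread keeps returning to: the setting applies
to every child of the process, so a library cannot turn it on without
affecting its caller.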

> As an aside, I do think there are some aspects of FreeBSD's process
> descriptors that aren't quite right yet, particularly their interaction with
> waitpid(-1, ...) -- IIRC pdfork()ed children are visible to it, but I'd expect
> them not to be (to allow libraries to use sub-processes invisibly to the
> programs using them). There's a thread at:
> https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2014-March/thread.html
> but I'm not sure that anything came of that discussion.

As long as you don't use the Linux-specific flags __WALL or __WCLONE, a
process created with clone will be invisible to wait if it has an exit
signal other than SIGCHLD.  That's true independent of this patch
series.  So you can decide if you want processes visible to wait or not.
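
A minimal sketch of that rule, assuming Linux and glibc's clone() wrapper.
The helper names are made up for illustration.  The child is created with
exit signal 0 (the low byte of the clone flags), so a waitpid() without
__WCLONE or __WALL should fail with ECHILD, and only the __WCLONE call
reaps it:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Child entry point: exit immediately with a recognizable status. */
static int child_main(void *arg)
{
    (void)arg;
    return 7;   /* glibc's clone() wrapper exits the child with this value */
}

/*
 * Create a child whose exit signal is 0 (no signal at all), then check how
 * it appears to waitpid().  Returns the child's exit status if the
 * behaviour matches the description above, or -1 otherwise.
 */
static int demo_clone_wait_visibility(void)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (!stack)
        return -1;

    /* No CLONE_* flags and exit signal 0: fork-like, but silent on exit. */
    pid_t pid = clone(child_main, stack + CHILD_STACK_SIZE, 0, NULL);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status = 0;
    int result = -1;

    /* Without __WCLONE/__WALL the child is not eligible: expect ECHILD. */
    pid_t r = waitpid(pid, &status, 0);
    int invisible = (r == -1 && errno == ECHILD);

    /* With __WCLONE the same child can be reaped normally. */
    r = waitpid(pid, &status, __WCLONE);
    if (invisible && r == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    free(stack);
    return result;
}
```

On a kernel that behaves as described, demo_clone_wait_visibility() returns
7, the child's exit status.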

> As it happens, I'm meeting Robert Watson (one of the progenitors
> of Capsicum/process descriptors) tomorrow, so I'll chase further.

Sounds good.

> >> Since they've never implemented pdwait4() (it's not even declared in the
> >> headers), the only way to reap a child if you only have the file
> >> descriptor is to first pdgetpid() and then call wait4() or wait6().
> >
> > Which suggests that we shouldn't try to implement pdwait4() in glibc
> > until FreeBSD implements it in their kernel, since we won't know the
> > exact semantics they expect.
> 
> By the way, I should point out one part of the FreeBSD design
> which might help explain some of the semantics.
> 
> Process descriptors are particularly designed to be used with
> Capsicum, which is a security framework where file descriptors
> get extra rights associated with them, and the kernel polices
> the use of those rights (e.g. you need CAP_READ for read(2)
> operations; normal file descriptors implicitly have all of the
> rights for back-compatibility).
>   https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4
> 
> Capsicum also includes 'capability mode', where system calls
> that access global namespaces are disabled -- including the
> pid namespace.
> 
> So process descriptors are the only way to manipulate child
> processes when a program is in capability mode -- and this
> means that pdkill() is then genuinely needed over and above
> kill(pdgetpid(),...).

Thanks for the explanation.  I've seen some details about Capsicum, and
I found it quite interesting.  I'm particularly interested in the notion
of getting rid of global namespaces in favor of descriptors or similar
mechanisms that you need specific rights to.

Does Capsicum do anything to eliminate the global namespace of UIDs and
GIDs?

> >> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL
> >> when the file closes.
> >
> > OK, that makes sense.  We could certainly implement a
> > CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
> > future.
> >
> >> Conclusion:
> >> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD
> >> mess.
> >> As long as all child process act

Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-15 Thread David Drysdale
On Sat, Mar 14, 2015 at 7:29 PM, Josh Triplett  wrote:
> On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
>> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
>> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
>> > > In any event, we should find out what FreeBSD does in response to
>> > > read(2) on the fd.
>> >
>> > I've just successfully installed FreeBSD and compiled qtbase (main package
>> > of Qt 5) on it.
>> >
>> > I'll test pdfork during the weekend and report its behaviour.
>>
>> Here are my findings about pdfork.
>>
>> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
>> Qt adaptations: https://codereview.qt-project.org/108561
>>
>> Processes created with pdfork() are normal processes that still send SIGCHLD
>> to their parents. The only difference is that you get the extra file
>> descriptor that can be passed to the pdgetpid() system call and works on
>> select()/poll(). Trying to read from that file descriptor will result in
>> EOPNOTSUPP.
>
> OK, since read() doesn't work on a pdfork() file descriptor, we don't
> have to worry about compatibility with pdfork()'s read result.
>
> However, if the expectation is that pdfork()ed child processes still
> send SIGCHLD, then I don't see how we can be compatible there, nor do I
> think we want to; as you mention below, that breaks the ability to
> encapsulate management of the created process entirely within a library.

I didn't think that was the case -- my understanding was that pdfork()ed
children would not generate SIGCHLD (and that does seem to be the
case with a quick test program).

As an aside, I do think there are some aspects of FreeBSD's process
descriptors that aren't quite right yet, particularly their interaction with
waitpid(-1, ...) -- IIRC pdfork()ed children are visible to it, but I'd expect
them not to be (to allow libraries to use sub-processes invisibly to the
programs using them). There's a thread at:
https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2014-March/thread.html
but I'm not sure that anything came of that discussion.

As it happens, I'm meeting Robert Watson (one of the progenitors
of Capsicum/process descriptors) tomorrow, so I'll chase further.

>> Since they've never implemented pdwait4() (it's not even declared in the
>> headers), the only way to reap a child if you only have the file descriptor
>> is to first pdgetpid() and then call wait4() or wait6().
>
> Which suggests that we shouldn't try to implement pdwait4() in glibc
> until FreeBSD implements it in their kernel, since we won't know the
> exact semantics they expect.

By the way, I should point out one part of the FreeBSD design
which might help explain some of the semantics.

Process descriptors are particularly designed to be used with
Capsicum, which is a security framework where file descriptors
get extra rights associated with them, and the kernel polices
the use of those rights (e.g. you need CAP_READ for read(2)
operations; normal file descriptors implicitly have all of the
rights for back-compatibility).
  https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4

Capsicum also includes 'capability mode', where system calls
that access global namespaces are disabled -- including the
pid namespace.

So process descriptors are the only way to manipulate child
processes when a program is in capability mode -- and this
means that pdkill() is then genuinely needed over and above
kill(pdgetpid(),...).

>> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when
>> the file closes.
>
> OK, that makes sense.  We could certainly implement a
> CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
> future.
>
>> Conclusion:
>> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD
>> mess. As long as all child process activations use this feature, the
>> problem is solved.
>>
>> Cons: it requires cooperation from all child starters. If some other library
>> or the application installs a global SIGCHLD handler that waits on all child
>> processes, like libvlc used to do and Glib and Ecore still do, you won't be
>> able to get the child exit status.
>>
>> I have not tested what happens if you try to pass the file descriptor to
>> other processes (can you even do that on FreeBSD?). But even if you could
>> and got notifications, you couldn't wait on the child to get its exit
>> status -- unless they implement pdwait4.
>
> Even if they do implement pdwait4, they might not bypass the "must be
> the parent process" restriction.  Let's wait to see what semantics they
> go with.

Hmm, interesting point.  FreeBSD certainly allows FD passing, but
I'm not sure what the interactions are when it's a process descriptor
that's passed.

Given the object-capability background to Capsicum, I'd assume that a
holder of the process descriptor should be able to do whatever operations
are allowed by the rights associated with the descriptor (CAP_

Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-15 Thread David Drysdale
On Fri, Mar 13, 2015 at 7:42 PM, Josh Triplett  wrote:
> On Fri, Mar 13, 2015 at 04:05:29PM +, David Drysdale wrote:
>> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett  wrote:
>> > This patch series introduces a new clone flag, CLONE_FD, which lets the
>> > caller handle child process exit notification via a file descriptor
>> > rather than SIGCHLD.  CLONE_FD makes it possible for libraries to safely
>> > launch and manage child processes on behalf of their caller, *without*
>> > taking over process-wide SIGCHLD handling (either via signal handler or
>> > signalfd).
>>
>> Hi Josh,
>>
>> From the overall description (i.e. I haven't looked at the code yet)
>> this looks very interesting.  However, it seems to cover a lot of the
>> same ground as the process descriptor feature that was added to FreeBSD
>> in 9.x/10.x:
>>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>
> Interesting.
>
>> I think it would ideally be nice for a userspace library developer to be
>> able to do subprocess management (without SIGCHLD) in a similar way
>> across both platforms, without lots of complicated autoconf shenanigans.
>>
>> So could we look at the overlap and see if we can come up with
>> something that covers your requirements and also allows for something
>> that looks like FreeBSD's process descriptors?
>
> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those.  I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.

Agreed -- if it's possible to implement equivalent process descriptor
functionality with a wrapper library, but the kernel interface is more
comprehensive and consistent with the rest of the Linux kernel, then
that's a big win.  So thanks for your work and for being willing to look
at the overlap!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-14 Thread Josh Triplett
On Sat, Mar 14, 2015 at 12:03:12PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> > On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> > > In any event, we should find out what FreeBSD does in response to
> > > read(2) on the fd.
> > 
> > I've just successfully installed FreeBSD and compiled qtbase (main package
> > of Qt 5) on it.
> > 
> > I'll test pdfork during the weekend and report its behaviour.
> 
> Here are my findings about pdfork.
> 
> Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
> Qt adaptations: https://codereview.qt-project.org/108561
> 
> Processes created with pdfork() are normal processes that still send SIGCHLD
> to their parents. The only difference is that you get the extra file
> descriptor that can be passed to the pdgetpid() system call and works on
> select()/poll(). Trying to read from that file descriptor will result in
> EOPNOTSUPP.

OK, since read() doesn't work on a pdfork() file descriptor, we don't
have to worry about compatibility with pdfork()'s read result.

However, if the expectation is that pdfork()ed child processes still
send SIGCHLD, then I don't see how we can be compatible there, nor do I
think we want to; as you mention below, that breaks the ability to
encapsulate management of the created process entirely within a library.

> Since they've never implemented pdwait4() (it's not even declared in the
> headers), the only way to reap a child if you only have the file descriptor
> is to first pdgetpid() and then call wait4() or wait6().

Which suggests that we shouldn't try to implement pdwait4() in glibc
until FreeBSD implements it in their kernel, since we won't know the
exact semantics they expect.

> If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when 
> the file closes.

OK, that makes sense.  We could certainly implement a
CLONE_FD_KILL_ON_CLOSE flag with those semantics, if we want one in the
future.

> Conclusion: 
> Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD
> mess. As long as all child process activations use this feature, the
> problem is solved.
> 
> Cons: it requires cooperation from all child starters. If some other library 
> or the application installs a global SIGCHLD handler that waits on all child 
> processes, like libvlc used to do and Glib and Ecore still do, you won't be 
> able to get the child exit status.
> 
> I have not tested what happens if you try to pass the file descriptor to
> other processes (can you even do that on FreeBSD?). But even if you could
> and got notifications, you couldn't wait on the child to get its exit
> status -- unless they implement pdwait4.

Even if they do implement pdwait4, they might not bypass the "must be
the parent process" restriction.  Let's wait to see what semantics they
go with.

>  - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
>  - pdwait4: can be emulated with read()
>  - pdgetpid: needs an ioctl
>  - pdkill: needs an ioctl [or just write()]

I think that should be a dedicated syscall, not an ioctl.

It's unfortunate that rt_sigqueueinfo doesn't take a flags argument.
However, I just realized that it takes a 32-bit "int" for the signal
number, yet signal numbers fit in 8 bits.  So we could just add flags in
the high 24 bits of that argument, and in particular add a flag
indicating that the first argument is a file descriptor rather than a
PID.
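
A rough sketch of that encoding, purely illustrative: SIGQ_FLAG_FD and the
sigq_*() helpers are hypothetical names, not part of any existing kernel
interface.  The low 8 bits carry the signal number and the high 24 bits
carry flags.  An unsigned 32-bit type is used here to keep the shifts well
defined:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 8 bits = signal number, high 24 bits = flags. */
#define SIGQ_SIGNAL_MASK 0x000000ffu
#define SIGQ_FLAGS_SHIFT 8

/* Hypothetical flag: "the first argument is a file descriptor, not a PID". */
#define SIGQ_FLAG_FD     0x1u

/* Combine a signal number and a flag set into one 32-bit argument. */
static uint32_t sigq_pack(unsigned sig, uint32_t flags)
{
    return (flags << SIGQ_FLAGS_SHIFT) | (sig & SIGQ_SIGNAL_MASK);
}

/* Extract the signal number from the low 8 bits. */
static unsigned sigq_signal(uint32_t packed)
{
    return packed & SIGQ_SIGNAL_MASK;
}

/* Extract the flag set from the high 24 bits. */
static uint32_t sigq_flags(uint32_t packed)
{
    return packed >> SIGQ_FLAGS_SHIFT;
}

/* Nonzero if the caller asked for fd-based targeting. */
static int sigq_targets_fd(uint32_t packed)
{
    return (sigq_flags(packed) & SIGQ_FLAG_FD) != 0;
}
```

Existing callers that pass a plain signal number end up with an empty flag
set, so under this layout they would keep the current PID-based meaning.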

- Josh Triplett


Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-14 Thread Thiago Macieira
On Friday 13 March 2015 18:11:32 Thiago Macieira wrote:
> On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> > In any event, we should find out what FreeBSD does in response to
> > read(2) on the fd.
> 
> I've just successfully installed FreeBSD and compiled qtbase (main package
> of Qt 5) on it.
> 
> I'll test pdfork during the weekend and report its behaviour.

Here are my findings about pdfork.

Source: http://fxr.watson.org/fxr/source/kern/sys_procdesc.c?v=FREEBSD10
Qt adaptations: https://codereview.qt-project.org/108561

Processes created with pdfork() are normal processes that still send SIGCHLD 
to their parents. The only difference is that you get the extra file descriptor 
that can be passed to the pdgetpid() system call and works on select()/poll(). 
Trying to read from that file descriptor will result in EOPNOTSUPP.

Since they've never implemented pdwait4() (it's not even declared in the 
headers), the only way to reap a child if you only have the file descriptor is 
to first pdgetpid() and then call wait4() or wait6().

If you don't pass PD_DAEMON, the child process gets killed with SIGKILL when 
the file closes.

Conclusion: 
Pros: this is the bare minimum that we'd need to disentangle the SIGCHLD mess. 
As long as all child process activations use this feature, the problem is 
solved.

Cons: it requires cooperation from all child starters. If some other library 
or the application installs a global SIGCHLD handler that waits on all child 
processes, like libvlc used to do and Glib and Ecore still do, you won't be 
able to get the child exit status.

I have not tested what happens if you try to pass the file descriptor to other 
processes (can you even do that on FreeBSD?). But even if you could and got 
notifications, you couldn't wait on the child to get its exit status -- unless 
they implement pdwait4.

 - pdfork: can be emulated with clone4 + CLONE_FD (+ CLONEFD_KILL_ON_CLOSE)
 - pdwait4: can be emulated with read()
 - pdgetpid: needs an ioctl
 - pdkill: needs an ioctl [or just write()]

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread Thiago Macieira
On Friday 13 March 2015 14:51:47 Andy Lutomirski wrote:
> In any event, we should find out what FreeBSD does in response to
> read(2) on the fd.

I've just successfully installed FreeBSD and compiled qtbase (main package of 
Qt 5) on it.

I'll test pdfork during the weekend and report its behaviour.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread Andy Lutomirski
On Fri, Mar 13, 2015 at 2:45 PM,   wrote:
> On Fri, Mar 13, 2015 at 02:33:44PM -0700, Andy Lutomirski wrote:
>> On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett wrote:
>> > On Fri, Mar 13, 2015 at 04:05:29PM +, David Drysdale wrote:
>> >> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett wrote:
>> >> > This patch series introduces a new clone flag, CLONE_FD, which lets
>> >> > the caller handle child process exit notification via a file
>> >> > descriptor rather than SIGCHLD.  CLONE_FD makes it possible for
>> >> > libraries to safely launch and manage child processes on behalf of
>> >> > their caller, *without* taking over process-wide SIGCHLD handling
>> >> > (either via signal handler or signalfd).
>> >>
>> >> Hi Josh,
>> >>
>> >> From the overall description (i.e. I haven't looked at the code yet)
>> >> this looks very interesting.  However, it seems to cover a lot of the
>> >> same ground as the process descriptor feature that was added to FreeBSD
>> >> in 9.x/10.x:
>> >>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>> >
>> > Interesting.
>> >
>> >> I think it would ideally be nice for a userspace library developer to be
>> >> able to do subprocess management (without SIGCHLD) in a similar way
>> >> across both platforms, without lots of complicated autoconf shenanigans.
>> >>
>> >> So could we look at the overlap and see if we can come up with
>> >> something that covers your requirements and also allows for something
>> >> that looks like FreeBSD's process descriptors?
>> >
>> > Agreed; however, I think it's reasonable to provide appropriate Linux
>> > system calls, and then let glibc or libbsd or similar provide the
>> > BSD-compatible calls on top of those.  I don't think the kernel
>> > interface needs to exactly match FreeBSD's, as long as it's a superset
>> > of the functionality.
>>
>> We need to be careful with things like read(2), though.  It's hard to
>> write a glibc function that makes read(2) do something other than what
>> the kernel thinks.  Similarly, poll(2) is defined by the kernel.  It
>> would be really nice to be consistent here.
>
> It doesn't sound like FreeBSD implements read(2) on the pdfork file
> descriptor at all.  If it does, yes, we're not going to be able to be
> compatible with that.

There's an argument that using read(2) for stuff like this is a bad
idea.  If anyone tried to do this in C++ (or any other OO language):

#include <cstddef>

class GenericInterface
{
public:
  virtual ~GenericInterface() = default;
  virtual void DoAction(const char *value, size_t len) = 0;
};

class Process : public GenericInterface
{
public:
  // Every operation is funneled through one stringly-typed entry point.
  void DoAction(const char *value, size_t len) override;
};

void Kill(Process *p)
{
  p->DoAction("kill", 4);
}

They'd be re-educated very quickly.  This is like duck typing, but
taken to a whole new level: *everything* is a duck, and ducks have a
grand total of three operations.

On the other hand, this seems to be UNIX tradition.  It's not as if
using echo on pidfds is going to be a common idiom, though.

In any event, we should find out what FreeBSD does in response to
read(2) on the fd.

--Andy


Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread josh
On Fri, Mar 13, 2015 at 02:33:44PM -0700, Andy Lutomirski wrote:
> On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett  wrote:
> > On Fri, Mar 13, 2015 at 04:05:29PM +, David Drysdale wrote:
> >> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett wrote:
> >> > This patch series introduces a new clone flag, CLONE_FD, which lets
> >> > the caller handle child process exit notification via a file
> >> > descriptor rather than SIGCHLD.  CLONE_FD makes it possible for
> >> > libraries to safely launch and manage child processes on behalf of
> >> > their caller, *without* taking over process-wide SIGCHLD handling
> >> > (either via signal handler or signalfd).
> >>
> >> Hi Josh,
> >>
> >> From the overall description (i.e. I haven't looked at the code yet)
> >> this looks very interesting.  However, it seems to cover a lot of the
> >> same ground as the process descriptor feature that was added to FreeBSD
> >> in 9.x/10.x:
> >>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> >
> > Interesting.
> >
> >> I think it would ideally be nice for a userspace library developer to be
> >> able to do subprocess management (without SIGCHLD) in a similar way
> >> across both platforms, without lots of complicated autoconf shenanigans.
> >>
> >> So could we look at the overlap and see if we can come up with
> >> something that covers your requirements and also allows for something
> >> that looks like FreeBSD's process descriptors?
> >
> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those.  I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> 
> We need to be careful with things like read(2), though.  It's hard to
> write a glibc function that makes read(2) do something other than what
> the kernel thinks.  Similarly, poll(2) is defined by the kernel.  It
> would be really nice to be consistent here.

It doesn't sound like FreeBSD implements read(2) on the pdfork file
descriptor at all.  If it does, yes, we're not going to be able to be
compatible with that.

- Josh Triplett


Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread josh
On Fri, Mar 13, 2015 at 02:16:07PM -0700, Thiago Macieira wrote:
> On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > > Hi Josh,
> > > 
> > > From the overall description (i.e. I haven't looked at the code yet)
> > > this looks very interesting.  However, it seems to cover a lot of the
> > > same ground as the process descriptor feature that was added to FreeBSD
> > > in 9.x/10.x:
> > >   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> > 
> > Interesting.
> 
> I wasn't aware of the FreeBSD implementation of pdfork(). It is actually 
> exactly what I need in userspace.

Right; libqt should be able to use pdfork on FreeBSD and CLONE_FD on
Linux.

> The only difference between pdfork() and my proposed forkfd() is where the
> PID and where the file descriptor are returned (meaning, which is optional
> and which isn't).
> 
> Josh and I opted to return the file descriptor in the regular return value
> in forkfd and in clone4 because getting the file descriptor is the whole
> objective of using forkfd or clone4-with-CLONE_FD in the first place: the
> file descriptor is not optional, but the PID is.

And as long as you can get the fd, where it's returned really doesn't
matter.
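
To make the "fd as return value, PID optional" convention concrete, here is
a userspace approximation.  It is not CLONE_FD, forkfd() or pdfork(): it
fakes the exit notification with a pipe whose write end only the child
holds, so the read end reports POLLHUP once the child (and anything that
inherited the pipe) has exited.  The names forkfd_like(), wait_exit_fd()
and demo_exit_fd() are hypothetical:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns a descriptor (always >= 3) in the parent that signals the
 * child's exit, 0 in the child, or -1 on error.  If ppid is non-NULL the
 * parent also receives the PID: the fd is mandatory, the PID optional.
 */
static int forkfd_like(pid_t *ppid)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    /* Keep the notification fd out of 0-2 so 0 can only mean "child". */
    int fd = fcntl(pipefd[0], F_DUPFD_CLOEXEC, 3);
    close(pipefd[0]);
    if (fd < 0) {
        close(pipefd[1]);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        close(pipefd[1]);
        return -1;
    }
    if (pid == 0) {
        close(fd);          /* child keeps only the write end, until exit */
        return 0;
    }
    close(pipefd[1]);       /* parent keeps only the read end */
    if (ppid)
        *ppid = pid;
    return fd;
}

/* Wait up to timeout_ms for the exit notification; 0 on success. */
static int wait_exit_fd(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    if (poll(&p, 1, timeout_ms) <= 0)
        return -1;
    return (p.revents & (POLLIN | POLLHUP)) ? 0 : -1;
}

/* Fork a child that exits with status 3 and return the status we observe. */
static int demo_exit_fd(void)
{
    pid_t pid = -1;
    int fd = forkfd_like(&pid);
    if (fd < 0)
        return -1;
    if (fd == 0)
        _exit(3);           /* child */

    int status = 0;
    int ok = wait_exit_fd(fd, 5000) == 0 && waitpid(pid, &status, 0) == pid;
    close(fd);
    return (ok && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Unlike the proposed interface, this sketch still raises SIGCHLD in the
parent and still needs waitpid() for the exit status, which is exactly the
coupling CLONE_FD is meant to remove.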

> > Agreed; however, I think it's reasonable to provide appropriate Linux
> > system calls, and then let glibc or libbsd or similar provide the
> > BSD-compatible calls on top of those.  I don't think the kernel
> > interface needs to exactly match FreeBSD's, as long as it's a superset
> > of the functionality.
> > 
> > For example, pdfork can just call clone4 with CLONE_FD and return the
> > resulting file descriptor.
> 
> Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().
> 
> I'm not too attached to the forkfd() interface, but I find it slightly
> superior for the reasons above.

Agreed.

> If we want the PD_DAEMON flag, it will have to translate to a clone flag,
> like CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.

I think the inverted version makes more sense, so that the default
behavior just changes exit notification without adding the kill-on-close
behavior.  And that kill-on-close behavior can come in a later patch. :)

> > In the future, I plan to add an fd-based equivalent of
> > rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> > whether to kill a process or thread) which is a superset of pdkill.
> > pdkill could then call that and just not pass the extra info.
> > 
> > A fair bit of pdwait4 could be implemented on top of read(), other than
> > the full rusage information (see below), and the ability to wait for
> > STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> > but it'd have to be set via a flag at clone time).
> > 
> > I think it's a feature to use read() rather than an additional magic
> > system call.
> 
> Indeed, even if the libc provides a wrapper for you, like glibc does for 
> eventfd (eventfd_read, eventfd_write).
> 
> Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in 
> the initial patch set, but it was part of the plans.
> 
> > > >   clone4() will never return a file descriptor in the range
> > > >   0-2 to
> > > >   the caller, to avoid ambiguity with the return of 0 in the
> > > >   child
> > > >   process.  Only the  calling  process  will  have  the  new
> > > >file
> > > >   descriptor open; the child process will not.
> > > 
> > > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > > return the file descriptor separately, which avoids the need for special
> > > case processing for low FD values (and means that POSIX's "lowest file
> > > descriptor not currently open" behaviour can be preserved if desired).
> > 
> > That'd be easy to implement if desired, by adding an outbound pointer to
> > clone4_args.
> >
> > The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> > syscalls that use the fd as an identifier, PIDs can hopefully become
> > mostly unnecessary.  However, I'm not that attached to changing the
> > return value; it'd be trivial to switch to an outbound parameter
> > instead, and then drop the "not 0-2".
> 
> See above for more motivation on making the PID optional.
> 
> As for the file descriptor range, if we need to be able to return 0, we can 
> implement a magic constant to mean the child process, like the userspace 
> forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on 
> Linux, since that is neither a valid file descriptor nor a valid errno value.

I don't think that logic is worth implementing, though, since it would
require changing all the architecture-specific copy_thread
implementations.  If we really want to go this path, we should just
return the fd via an out parameter in the clone4_args structure.
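
For reference, the sentinel idea quoted above would look roughly like this
from the caller's side.  This is a sketch only: CLONEFD_CHILD_MARKER and
classify_clone_return() are hypothetical names, and it assumes the usual
Linux convention that a raw syscall reports failure as -errno in the range
-4095..-1:

```c
#include <assert.h>

/* Largest errno value the kernel encodes in a raw syscall return. */
#define MAX_ERRNO_VALUE 4095

/* Hypothetical sentinel: neither a valid fd nor a valid -errno. */
#define CLONEFD_CHILD_MARKER (-4096L)

enum clone_result {
    CLONE_IN_CHILD,     /* we are the newly created child */
    CLONE_ERROR,        /* ret is -errno */
    CLONE_GOT_FD,       /* ret is a descriptor, possibly 0, 1 or 2 */
    CLONE_UNEXPECTED    /* outside every defined range */
};

/* Decide what a raw return value means under the sentinel scheme. */
static enum clone_result classify_clone_return(long ret)
{
    if (ret == CLONEFD_CHILD_MARKER)
        return CLONE_IN_CHILD;
    if (ret < 0 && ret >= -MAX_ERRNO_VALUE)
        return CLONE_ERROR;
    if (ret >= 0)
        return CLONE_GOT_FD;
    return CLONE_UNEXPECTED;
}
```

With the sentinel, a return of 0 is an ordinary descriptor in the parent, so
the "not 0-2" restriction goes away, at the cost of the per-architecture
changes mentioned above.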

> > > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> 

Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread Andy Lutomirski
On Fri, Mar 13, 2015 at 12:42 PM, Josh Triplett  wrote:
> On Fri, Mar 13, 2015 at 04:05:29PM +, David Drysdale wrote:
>> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett  wrote:
>> > This patch series introduces a new clone flag, CLONE_FD, which lets the
>> > caller handle child process exit notification via a file descriptor
>> > rather than SIGCHLD.  CLONE_FD makes it possible for libraries to safely
>> > launch and manage child processes on behalf of their caller, *without*
>> > taking over process-wide SIGCHLD handling (either via signal handler or
>> > signalfd).
>>
>> Hi Josh,
>>
>> From the overall description (i.e. I haven't looked at the code yet)
>> this looks very interesting.  However, it seems to cover a lot of the
>> same ground as the process descriptor feature that was added to FreeBSD
>> in 9.x/10.x:
>>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
>
> Interesting.
>
>> I think it would ideally be nice for a userspace library developer to be
>> able to do subprocess management (without SIGCHLD) in a similar way
>> across both platforms, without lots of complicated autoconf shenanigans.
>>
>> So could we look at the overlap and see if we can come up with
>> something that covers your requirements and also allows for something
>> that looks like FreeBSD's process descriptors?
>
> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those.  I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.

We need to be careful with things like read(2), though.  It's hard to
write a glibc function that makes read(2) do something other than what
the kernel thinks.  Similarly, poll(2) is defined by the kernel.  It
would be really nice to be consistent here.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread Thiago Macieira
On Friday 13 March 2015 12:42:52 Josh Triplett wrote:
> > Hi Josh,
> > 
> > From the overall description (i.e. I haven't looked at the code yet)
> > this looks very interesting.  However, it seems to cover a lot of the
> > same ground as the process descriptor feature that was added to FreeBSD
> > in 9.x/10.x:
> >   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2
> 
> Interesting.

Hi Josh, David

I wasn't aware of the FreeBSD implementation of pdfork(). It is actually 
exactly what I need in userspace. The only difference between pdfork() and 
my proposed forkfd() is where the PID and the file descriptor are 
returned (meaning, which is optional and which isn't).

Josh and I opted to return the file descriptor in the regular return value in 
forkfd and in clone4 because getting the file descriptor is the whole objective 
of using forkfd or clone4-with-CLONE_FD in the first place: the file descriptor 
is not optional, but the PID is.

> Agreed; however, I think it's reasonable to provide appropriate Linux
> system calls, and then let glibc or libbsd or similar provide the
> BSD-compatible calls on top of those.  I don't think the kernel
> interface needs to exactly match FreeBSD's, as long as it's a superset
> of the functionality.
> 
> For example, pdfork can just call clone4 with CLONE_FD and return the
> resulting file descriptor.

Agreed, we should recommend libc implement pdfork(), pdkill() and pdwait4().

I'm not too attached to the forkfd() interface, but I find it slightly superior 
for the reasons above.

If we want the PD_DAEMON flag, it will have to translate to a clone flag, like 
CLONEFD_DAEMON or inverted like CLONEFD_KILL_ON_CLOSE.

> In the future, I plan to add an fd-based equivalent of
> rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
> whether to kill a process or thread) which is a superset of pdkill.
> pdkill could then call that and just not pass the extra info.
> 
> A fair bit of pdwait4 could be implemented on top of read(), other than
> the full rusage information (see below), and the ability to wait for
> STOP/CONT (which the CLONE_FD file descriptor could support if desired,
> but it'd have to be set via a flag at clone time).
> 
> I think it's a feature to use read() rather than an additional magic
> system call.

Indeed, even if the libc provides a wrapper for you, like glibc does for 
eventfd (eventfd_read, eventfd_write).

Josh and I didn't want to submit "killfd" (or pdkill in the FreeBSD name) in 
the initial patch set, but it was part of the plans.

> > >   clone4() will never return a file descriptor in the range 0-2 to
> > >   the caller, to avoid ambiguity with the return of 0 in the child
> > >   process.  Only the calling process will have the new file
> > >   descriptor open; the child process will not.
> > 
> > FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> > return the file descriptor separately, which avoids the need for special
> > case processing for low FD values (and means that POSIX's "lowest file
> > descriptor not currently open" behaviour can be preserved if desired).
> 
> That'd be easy to implement if desired, by adding an outbound pointer to
> clone4_args.
>
> The (very mild) reason I'd dropped the PID: with CLONE_FD and future
> syscalls that use the fd as an identifier, PIDs can hopefully become
> mostly unnecessary.  However, I'm not that attached to changing the
> return value; it'd be trivial to switch to an outbound parameter
> instead, and then drop the "not 0-2".

See above for more motivation on making the PID optional.

As for the file descriptor range, if we need to be able to return 0, we can 
implement a magic constant to mean the child process, like the userspace 
forkfd() does (FFD_CHILD_PROCESS). We'd probably choose the value -4096 on 
Linux, since that is neither a valid file descriptor nor a valid errno value.
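
To make the return-value argument concrete, here is a minimal sketch of how a
caller could tell the three cases apart if a sentinel such as -4096 were used.
It relies on the fact that Linux syscalls report errors as values in the range
-4095..-1; the constant and function names below are hypothetical and are not
taken from the patch series or from forkfd.c.

```c
/* Hypothetical sentinel meaning "you are the child"; -4096 lies just
 * outside the -4095..-1 range Linux uses for error returns. */
#define CHILD_SENTINEL_SKETCH (-4096L)

enum ret_kind_sketch {
    RET_FD_SKETCH,      /* parent: value is a file descriptor (0 allowed) */
    RET_CHILD_SKETCH,   /* child process */
    RET_ERROR_SKETCH    /* negative errno value */
};

/* Classify a raw return value under the sentinel convention. */
static enum ret_kind_sketch classify_ret_sketch(long raw)
{
    if (raw >= 0)
        return RET_FD_SKETCH;
    if (raw == CHILD_SENTINEL_SKETCH)
        return RET_CHILD_SKETCH;
    return RET_ERROR_SKETCH;
}
```

With this convention, file descriptor 0 no longer collides with the child's
return value, so the "not 0-2" restriction would not be needed.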

> > [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> > process descriptor, including rusage retrieval.  However, I don't think
> > they actually implemented it:
> >   http://fxr.watson.org/fxr/source/kern/syscalls.master#L928]
> 
> That's a pretty good argument that we don't need to either, at least not
> yet.

pdwait4() can be implemented on top of read(), with the WNOHANG flag amounting 
to just toggling the O_NONBLOCK bit. The problem is with the rest of the flags. 
We could handle them via more ioctls issued prior to read() if we don't want 
to add a syscall...
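
As a rough illustration of the WNOHANG part only, the sketch below toggles
O_NONBLOCK with fcntl() before calling read(). The helper name is
hypothetical, and it works on any readable descriptor; nothing here depends
on CLONE_FD itself. Note that O_NONBLOCK is a property of the open file
description, so the change is visible through every duplicate of the
descriptor.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Read from fd, emulating WNOHANG by setting O_NONBLOCK when nohang is
 * nonzero and clearing it otherwise.  Returns what read() returns; in
 * the nohang case, -1 with errno == EAGAIN means "nothing to report yet".
 */
static ssize_t read_status_sketch(int fd, void *buf, size_t len, int nohang)
{
    int fl = fcntl(fd, F_GETFL);
    if (fl < 0)
        return -1;

    int want = nohang ? (fl | O_NONBLOCK) : (fl & ~O_NONBLOCK);
    if (want != fl && fcntl(fd, F_SETFL, want) < 0)
        return -1;

    return read(fd, buf, len);
}
```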

Another alternative is to add a P_PD flag that can be passed as the first 
argument to waitid(), making the second argument a file descriptor instead of a 
PID or pgrp.

> > FreeBSD also implements fstat(2) for its process descriptors, although
> > only a few of the fields get filled in.
> 
> I looked at what they provide, and that seems like more o

Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread Josh Triplett
On Fri, Mar 13, 2015 at 04:05:29PM +, David Drysdale wrote:
> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett  wrote:
> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> > handle child process exit notification via a file descriptor rather than
> > SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> > child processes on behalf of their caller, *without* taking over process-wide
> > SIGCHLD handling (either via signal handler or signalfd).
> 
> Hi Josh,
> 
> From the overall description (i.e. I haven't looked at the code yet)
> this looks very interesting.  However, it seems to cover a lot of the
> same ground as the process descriptor feature that was added to FreeBSD
> in 9.x/10.x:
>   https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2

Interesting.

> I think it would ideally be nice for a userspace library developer to be
> able to do subprocess management (without SIGCHLD) in a similar way
> across both platforms, without lots of complicated autoconf shenanigans.
>
> So could we look at the overlap and see if we can come up with
> something that covers your requirements and also allows for something
> that looks like FreeBSD's process descriptors?

Agreed; however, I think it's reasonable to provide appropriate Linux
system calls, and then let glibc or libbsd or similar provide the
BSD-compatible calls on top of those.  I don't think the kernel
interface needs to exactly match FreeBSD's, as long as it's a superset
of the functionality.

For example, pdfork can just call clone4 with CLONE_FD and return the
resulting file descriptor.

In my further comments below, I'll suggest ways that the FreeBSD library
calls could be implemented on top of Linux system calls.

> (I've actually got some rough patches to add process descriptor
> functionality on Linux, so I can look at how the two approaches compare
> and contrast.)
> 
> > Note that signalfd for SIGCHLD does not suffice here, because that still
> > receives notification for all child processes, and interferes with process-wide
> > signal handling.
> >
> > The CLONE_FD file descriptor uniquely identifies a process on the system in a
> > race-free way, by holding a reference to the task_struct.  In the future, we
> > may introduce APIs that support using process file descriptors instead of PIDs.
> 
> FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
> I suspect we either need pdkill(2) or a way to retrieve a PID from
> a process file descriptor, so that there's a way to send signals to the
> child.

The original caller of clone4 with CLONE_FD can pass CLONE_PARENT_SETTID
to get the PID.

In the future, I plan to add an fd-based equivalent of
rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
whether to kill a process or thread) which is a superset of pdkill.
pdkill could then call that and just not pass the extra info.

A fair bit of pdwait4 could be implemented on top of read(), other than
the full rusage information (see below), and the ability to wait for
STOP/CONT (which the CLONE_FD file descriptor could support if desired,
but it'd have to be set via a flag at clone time).

I think it's a feature to use read() rather than an additional magic
system call.

> > Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> > has no more usable flags (with the three currently unused flags unusable
> > because old kernels ignore them without EINVAL), also introduce a new clone4
> > system call with more flag bits and an extensible argument structure.  And
> > since the magic pt_regs-based syscall argument processing for clone's tls
> > argument would otherwise prevent introducing a sane clone4 system call, fix
> > that too.
> >
> > I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> > threads independently reading and writing a __thread variable), on both 32-bit
> > and 64-bit, and I observed no issues there.
> 
> Worth preserving in tools/testing/selftests/ ?

Not really; it's just the following trivial program, which was faster to
write than to attempt to find somewhere:

#include <stdio.h>
#include <pthread.h>

/* Each thread gets its own copy of x. */
__thread unsigned x = 0;

void *thread_func(void *unused)
{
    unsigned *tx = &x;
    for (; *tx < 10; (*tx)++)
        printf("child: tx=%p *tx=%u\n", (void *)tx, *tx);
    return NULL;
}

int main(void)
{
    unsigned *tx = &x;
    pthread_t thread;
    pthread_create(&thread, NULL, thread_func, NULL);
    /* The addresses printed here should differ from the child's. */
    for (; *tx < 10; (*tx)++)
        printf("main: tx=%p *tx=%u\n", (void *)tx, *tx);
    pthread_join(thread, NULL);
    return 0;
}

(I didn't bother with error handling, because I ran it under strace.)

> > I tested clone4 and the new CLONE_FD call with several additional test
> > programs, launching either a process or thread (in the former case using
> > syscall(), in the latter case by calling clone4 via assembly and returning to
> > C), sleeping in paren

Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-13 Thread David Drysdale
On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett  wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> handle child process exit notification via a file descriptor rather than
> SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
> child processes on behalf of their caller, *without* taking over process-wide
> SIGCHLD handling (either via signal handler or signalfd).

Hi Josh,

From the overall description (i.e. I haven't looked at the code yet)
this looks very interesting.  However, it seems to cover a lot of the
same ground as the process descriptor feature that was added to FreeBSD
in 9.x/10.x:
  https://www.freebsd.org/cgi/man.cgi?query=pdfork&sektion=2

I think it would ideally be nice for a userspace library developer to be
able to do subprocess management (without SIGCHLD) in a similar way
across both platforms, without lots of complicated autoconf shenanigans.

So could we look at the overlap and see if we can come up with
something that covers your requirements and also allows for something
that looks like FreeBSD's process descriptors?

(I've actually got some rough patches to add process descriptor
functionality on Linux, so I can look at how the two approaches compare
and contrast.)

> Note that signalfd for SIGCHLD does not suffice here, because that still
> receives notification for all child processes, and interferes with process-wide
> signal handling.
>
> The CLONE_FD file descriptor uniquely identifies a process on the system in a
> race-free way, by holding a reference to the task_struct.  In the future, we
> may introduce APIs that support using process file descriptors instead of PIDs.

FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
I suspect we either need pdkill(2) or a way to retrieve a PID from
a process file descriptor, so that there's a way to send signals to the
child.

> Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> has no more usable flags (with the three currently unused flags unusable
> because old kernels ignore them without EINVAL), also introduce a new clone4
> system call with more flag bits and an extensible argument structure.  And
> since the magic pt_regs-based syscall argument processing for clone's tls
> argument would otherwise prevent introducing a sane clone4 system call, fix
> that too.
>
> I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> threads independently reading and writing a __thread variable), on both 32-bit
> and 64-bit, and I observed no issues there.

Worth preserving in tools/testing/selftests/ ?

> I tested clone4 and the new CLONE_FD call with several additional test
> programs, launching either a process or thread (in the former case using
> syscall(), in the latter case by calling clone4 via assembly and returning to
> C), sleeping in parent and child to test the case of either exiting first, and
> then printing the received clone4_info structure.  Thiago also tested clone4
> with CLONE_FD with a modified version of libqt's process handling, which
> includes a test suite.
>
> I've also included the manpages patch at the end of this series.  (Note that
> the manpage documents the behavior of the future glibc wrapper as well as the
> raw syscall.)  Here's a formatted plain-text version of the manpage for
> reference:

FYI, I've added some comparisons with the FreeBSD equivalents below.

>
> CLONE4(2)  Linux Programmer's Manual CLONE4(2)
>
>
>
> NAME
>    clone4 - create a child process
>
> SYNOPSIS
>    /* Prototype for the glibc wrapper function */
>
>    #define _GNU_SOURCE
>    #include <sched.h>
>
>    int clone4(uint64_t flags,
>               size_t args_size,
>               struct clone4_args *args,
>               int (*fn)(void *), void *arg);
>
>    /* Prototype for the raw system call */
>
>    int clone4(unsigned flags_high, unsigned flags_low,
>               unsigned long args_size,
>               struct clone4_args *args);
>
>    struct clone4_args {
>        pid_t *ptid;
>        pid_t *ctid;
>        unsigned long stack_start;
>        unsigned long stack_size;
>        unsigned long tls;
>    };
>
>
> DESCRIPTION
>    clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
>    clone4() supports additional flags that clone(2) does not, and  accepts
>    arguments via an extensible structure.
>
>    args  points to a clone4_args structure, and args_size must contain the
>    size of that structure, as understood by the  caller.   If  the  caller
>    passes  a  shorter  structure  than  the  kernel expects, the remaining
>    fields will default to 0.  If the caller passes a larger structure than
>    the  kernel  expects  (such  as one from a newer kernel), clone4() will
>    return EINVAL.  The clone4_args str

Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-12 Thread Thiago Macieira
On Thursday 12 March 2015 18:40:03 Josh Triplett wrote:
> This patch series introduces a new clone flag, CLONE_FD, which lets the
> caller handle child process exit notification via a file descriptor rather
> than SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch
> and manage child processes on behalf of their caller, *without* taking over
> process-wide SIGCHLD handling (either via signal handler or signalfd).

FYI, the matching use of this feature in Qt can be found at:

https://codereview.qt-project.org/108455
https://codereview.qt-project.org/108456

The forkfd.c file this modifies aims at implementing the semantics of CLONE_FD 
for the fork case when support for CLONE_FD is missing in the kernel.
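
For readers who want a feel for fd-based exit notification on kernels
without CLONE_FD, here is a minimal sketch using a different and much cruder
technique, the pipe-EOF trick. It is not taken from forkfd.c and all names
in it are hypothetical. The parent keeps the read end of a pipe and the
child keeps the write end, so poll() on the read end wakes up once the child
(and anything that inherited the write end) has exited. Unlike CLONE_FD, the
child still raises SIGCHLD and still has to be reaped with waitpid().

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Trivial child body used for illustration. */
static void child_noop_sketch(void)
{
}

/*
 * Fork a child that runs fn() and exits.  Returns a descriptor that
 * becomes readable (EOF/POLLHUP) when the child exits, or -1 on error.
 * The child's PID is stored in *pid_out if pid_out is not NULL.
 */
static int fork_with_exit_fd_sketch(void (*fn)(void), pid_t *pid_out)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);    /* child keeps only the write end */
        fn();
        _exit(0);       /* exiting closes the write end */
    }

    close(p[1]);        /* parent keeps only the read end */
    if (pid_out)
        *pid_out = pid;
    return p[0];
}

/* Wait up to timeout_ms for the exit notification; returns poll()'s result. */
static int wait_exit_fd_sketch(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, timeout_ms);
}
```

A caller would add the returned descriptor to its own poll set and call
waitpid() on the stored PID once it fires.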

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



[PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

2015-03-12 Thread Josh Triplett
This patch series introduces a new clone flag, CLONE_FD, which lets the caller
handle child process exit notification via a file descriptor rather than
SIGCHLD.  CLONE_FD makes it possible for libraries to safely launch and manage
child processes on behalf of their caller, *without* taking over process-wide
SIGCHLD handling (either via signal handler or signalfd).

Note that signalfd for SIGCHLD does not suffice here, because that still
receives notification for all child processes, and interferes with process-wide
signal handling.

The CLONE_FD file descriptor uniquely identifies a process on the system in a
race-free way, by holding a reference to the task_struct.  In the future, we
may introduce APIs that support using process file descriptors instead of PIDs.

Introducing CLONE_FD required two additional bits of yak shaving: Since clone
has no more usable flags (with the three currently unused flags unusable
because old kernels ignore them without EINVAL), also introduce a new clone4
system call with more flag bits and an extensible argument structure.  And
since the magic pt_regs-based syscall argument processing for clone's tls
argument would otherwise prevent introducing a sane clone4 system call, fix
that too.

I tested the CLONE_SETTLS changes with a thread-local storage test program (two
threads independently reading and writing a __thread variable), on both 32-bit
and 64-bit, and I observed no issues there.

I tested clone4 and the new CLONE_FD call with several additional test
programs, launching either a process or thread (in the former case using
syscall(), in the latter case by calling clone4 via assembly and returning to
C), sleeping in parent and child to test the case of either exiting first, and
then printing the received clone4_info structure.  Thiago also tested clone4
with CLONE_FD with a modified version of libqt's process handling, which
includes a test suite.

I've also included the manpages patch at the end of this series.  (Note that
the manpage documents the behavior of the future glibc wrapper as well as the
raw syscall.)  Here's a formatted plain-text version of the manpage for
reference:

CLONE4(2)  Linux Programmer's Manual CLONE4(2)



NAME
   clone4 - create a child process

SYNOPSIS
   /* Prototype for the glibc wrapper function */

   #define _GNU_SOURCE
   #include <sched.h>

   int clone4(uint64_t flags,
              size_t args_size,
              struct clone4_args *args,
              int (*fn)(void *), void *arg);

   /* Prototype for the raw system call */

   int clone4(unsigned flags_high, unsigned flags_low,
              unsigned long args_size,
              struct clone4_args *args);

   struct clone4_args {
       pid_t *ptid;
       pid_t *ctid;
       unsigned long stack_start;
       unsigned long stack_size;
       unsigned long tls;
   };
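
The two prototypes differ in how flags is passed: the wrapper takes one
64-bit value, while the raw system call takes two 32-bit halves. A minimal
sketch of that conversion follows; the helper names are hypothetical and
only illustrate the arithmetic, not the actual glibc or kernel code.

```c
#include <stdint.h>

/* Split a 64-bit flags word into the two 32-bit syscall arguments. */
static void split_flags_sketch(uint64_t flags, unsigned *high, unsigned *low)
{
    *high = (unsigned)(flags >> 32);
    *low = (unsigned)(flags & 0xffffffffu);
}

/* Recombine the two halves, as the receiving side would. */
static uint64_t join_flags_sketch(unsigned high, unsigned low)
{
    return ((uint64_t)high << 32) | low;
}
```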


DESCRIPTION
   clone4()  creates  a  new  process,  similar  to  clone(2) and fork(2).
   clone4() supports additional flags that clone(2) does not, and  accepts
   arguments via an extensible structure.

   args  points to a clone4_args structure, and args_size must contain the
   size of that structure, as understood by the  caller.   If  the  caller
   passes  a  shorter  structure  than  the  kernel expects, the remaining
   fields will default to 0.  If the caller passes a larger structure than
   the  kernel  expects  (such  as one from a newer kernel), clone4() will
   return EINVAL.  The clone4_args structure may gain additional fields at
   the  end  in  the future, and callers must only pass a size that encom‐
   passes the number of fields they understand.  If the  caller  passes  0
   for args_size, args is ignored and may be NULL.
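
The size rules above can be sketched in a few lines of userspace C. This is
only an illustration of the described behaviour (zero-fill a shorter
structure, reject a larger one with EINVAL, ignore args when the size is 0);
the structure and function names are hypothetical, the pointer fields are
simplified to unsigned long, and the real kernel code would additionally
have to copy from user memory.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the structure as this "kernel" knows it. */
struct kargs_sketch {
    unsigned long ptid;
    unsigned long ctid;
    unsigned long stack_start;
    unsigned long stack_size;
    unsigned long tls;
};

/*
 * Copy args_size bytes of caller data into *out.
 * Returns 0 on success, or -EINVAL if the caller's structure is larger
 * than the one understood here.  Missing trailing fields default to 0.
 */
static int copy_args_sketch(struct kargs_sketch *out,
                            const void *args, size_t args_size)
{
    if (args_size > sizeof(*out))
        return -EINVAL;

    memset(out, 0, sizeof(*out));
    if (args_size > 0)          /* with size 0, args may be NULL */
        memcpy(out, args, args_size);
    return 0;
}
```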

   In  the clone4_args structure, ptid, ctid, stack_start, stack_size, and
   tls have the same semantics as they do with clone(2) and clone2(2).

   In the glibc wrapper, fn and arg have the same  semantics  as  they  do
   with clone(2).  As with clone(2), the underlying system call works more
   like fork(2), returning 0 in the child process; the glibc wrapper  sim‐
   plifies  thread execution by calling fn(arg) and exiting the child when
   that function exits.

   The 64-bit  flags  argument  (split  into  the  32-bit  flags_high  and
   flags_low arguments in the kernel interface) accepts all the same flags
   as  clone(2),  with  the   exception   of   the   obsolete   CLONE_PID,
   CLONE_DETACHED, and CLONE_STOPPED.  In addition, flags accepts the fol‐
   lowing flags:


   CLONE_FD
  Instead of returning a process ID, clone4()  with  the  CLONE_FD
  flag  returns a file descriptor associated with the new process.
  When the new process exits, the kernel will not send a signal to
  the  parent process, and will not keep the