Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-17 Thread Kevin Easton
On Tue, Dec 02, 2014 at 03:35:17PM +1100, Alex Dubov wrote:
> Unfortunately, using facilities like Unix domain sockets to merely pass file
> descriptors between "worker" processes is unnecessarily difficult, due to
> the following common consideration:
> 
> 1. Domain sockets and named pipes are persistent objects. Applications must
> manage their lifetime and devise unambiguous access schemes in case multiple
> application instances are to be run within the same OS instance. Usually, they
> would also require a writable file system to be mounted.

I believe this particular issue has long been addressed in Linux, with
the "abstract namespace" domain sockets.

These aren't persistent - they go away when the bound socket is closed -
and they don't need a writable filesystem.

If you derived the name in the abstract namespace from your PID (or better,
application identifier and PID) then you would have exactly the same
"ambiguous access" scheme as your proposal.

> int sendfd(pid_t pid, int sig, int fd)

PIDs tend to be regarded as a bit of an iffy way to refer to another
process, because they tend to be racy.  If the process you think you're
talking to dies, and has its PID reused by another unrelated sendfd()-aware
process, you've just sent your open file to somewhere unexpected.

You can avoid that if the process is a child of yours, but in that case
you could have set up a no-fuss domain socket connection with socketpair()
too.

- Kevin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-17 Thread Kevin Easton
On Tue, Dec 02, 2014 at 03:35:17PM +1100, Alex Dubov wrote:
 Unfortunately, using facilities like Unix domain sockets to merely pass file
 descriptors between worker processes is unnecessarily difficult, due to
 the following common consideration:
 
 1. Domain sockets and named pipes are persistent objects. Applications must
 manage their lifetime and devise unambiguous access schemes in case multiple
 application instances are to be run within the same OS instance. Usually, they
 would also require a writable file system to be mounted.

I believe this particular issue has long been addressed in Linux, with
the abstract namespace domain sockets.

These aren't persistent - they go away when the bound socket is closed -
and they don't need a writable filesystem.

If you derived the name in the abstract namespace from your PID (or better,
application identifier and PID) then you would have exactly the same
ambiguous access scheme as your proposal.

 int sendfd(pid_t pid, int sig, int fd)

PIDs tend to be regarded as a bit of an iffy way to refer to another
process, because they tend to be racy.  If the process you think you're
talking to dies, and has its PID reused by another unrelated sendfd()-aware
process, you've just sent your open file to somewhere unexpected.

You can avoid that if the process is a child of yours, but in that case
you could have set up a no-fuss domain socket connection with socketpair()
too.

- Kevin

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-02 Thread Alex Dubov
On Wed, Dec 3, 2014 at 2:26 AM, Jonathan Corbet  wrote:
> On Tue,  2 Dec 2014 15:35:17 +1100
> Alex Dubov  wrote:
>
>
>  - Messing with another process's file descriptor table without its
>knowledge looks like a possible source of all kinds problems.  Might
>there be race conditions with close()/dup() code, for example?  And
>remember that users can be root in a user namespace; maybe there's no
>potential for mischief there, but it needs to be considered.

If process A has sufficient permissions to signal process B, it can
already do arbitrary mischief, no news there (SIGKILL and SIGSTOP will
definitely cause more havoc :-).

I don't believe there can be any race conditions as this is not
different to what happens when dup() is invoked from one of the
threads in multi-threaded application, whereupon other threads go on
with their usual file operations. Descriptor duplication happens prior
to any signal handling activities.

>  - Forcing the use of realtime signals seems strange; this isn't a
>realtime operation by any stretch.

"Real time signals" are merely a misleading name for Posix.1b
micro-messaging facility. To the best of my knowledge they do not
affect scheduling any more then SIGIO or SIGALRM would.

As Posix.1b signals are best handled by signalfd() facility anyway, no
impact on scheduling compared to any other approach (including the
existing domain socket approach) is expected at all.

>
>  - How might the sending process communicate to the recipient what the fd
>is for?  Even if a process only expects one type of file descriptor,
>the ability to communicate information other than its number seems
>like it would often be useful.

There are 32 "real time" signals defined by default in kernel; this
range can be increased at will with kernel recompilation and glibc
will pick up the correct range automatically (this is Posix mandated
behavior and it actually works like that).

I have not seen an app yet that relied on more than half a dozen of
distinct signal numbers. Thus any application can conveniently define
more than 2 dozens of different fd varieties out of the box, delivered
to it with dedicated signal ids, whereupon in most practical
applications only 1 or 2 varieties of file descriptors are ever passed
around.

>
> Some of these concerns might be addressable by requiring the recipient to
> call acceptfd() (or some such) with the ability to use poll().  As an
> alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
> would you still need this feature?

Any process willing to handle Posix.1b signals must explicitly
manipulate the signal masks - otherwise it will be killed the moment
signal is received. Thus, no special "acceptfd()" call is necessary on
the receiver side - applications usually don't  modify their signal
masks unless they expect some particular signal to arrive.

kdbus has something like it and binder on android has it as well. The
problem with both of them are the same as with unix domain sockets
(which implement a whole, rather convoluted, cmsg facility to be ever
used for that single purpose): they try to solve big problems with
fancy functionality, whereupon fd passing is a nice side feature
(which then gets used the most).

To my understanding, commonly used functionality deserves to have its
own quick, low overhead path:

1. We've got eventfd() which is neat and all, but to use it we need an
easy way to pass its fd around.

2. We've got memfd() which is also neat, but to use it..

3. We've got fairly complex (and consequently buggy) functionality
like SO_REUSEPORT, but I can't avoid a feeling that if there was a low
overhead transport available to path fds around (like the one
proposed), the old school approach of having one process running
tightly around accept() and sending sockets to workers may still rival
it (pity I don't have google's setup around to test it).

4. Most importantly, when network appliances are concerned (and those
represent a huge percentage of linux install base), it is desirable to
have the leanest possible code paths both in kernel and in the user
space (no functionality - no vulnerabilities to fish for) and still be
able to rely on multi-process applications (as multi-process
applications are considerably more reliable then multi-threaded ones,
for all the obvious reasons). A compact, easily traceable facility
comprising few hundred LOCs in the kernel, end to end, and very simple
application code (sigqueue() -> signalfd()) pose a distinct advantage
in this regard over largish subsystems which may provide similar
feature (invariable at the expense of unnecessary costs, like
persistent file system objects, specialized user-space libraries, etc)
.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-02 Thread Jonathan Corbet
On Tue,  2 Dec 2014 15:35:17 +1100
Alex Dubov  wrote:

> int sendfd(pid_t pid, int sig, int fd)
> 
> Given a target process pid, the sendfd() syscall will create a duplicate
> file descriptor in a target task's (referred by pid) file table pointing to
> the file references by descriptor fd. Then, it will attempt to notify the
> target task by issuing a Posix.1b real-time signal (sig), carrying the new
> file descriptor as integer payload. If real-time signal can not be enqueued
> at the destination signal queue, the newly created file descriptor will be
> promptly closed.

[ CC += linux-api ]

So I'm not a syscall API design expert, but this one raises a few
questions with me.

 - Messing with another process's file descriptor table without its
   knowledge looks like a possible source of all kinds problems.  Might
   there be race conditions with close()/dup() code, for example?  And
   remember that users can be root in a user namespace; maybe there's no
   potential for mischief there, but it needs to be considered.

 - Forcing the use of realtime signals seems strange; this isn't a
   realtime operation by any stretch.

 - How might the sending process communicate to the recipient what the fd
   is for?  Even if a process only expects one type of file descriptor,
   the ability to communicate information other than its number seems
   like it would often be useful.

Some of these concerns might be addressable by requiring the recipient to
call acceptfd() (or some such) with the ability to use poll().  As an
alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
would you still need this feature?

Thanks,

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-02 Thread Jonathan Corbet
On Tue,  2 Dec 2014 15:35:17 +1100
Alex Dubov alex.du...@gmail.com wrote:

 int sendfd(pid_t pid, int sig, int fd)
 
 Given a target process pid, the sendfd() syscall will create a duplicate
 file descriptor in a target task's (referred by pid) file table pointing to
 the file references by descriptor fd. Then, it will attempt to notify the
 target task by issuing a Posix.1b real-time signal (sig), carrying the new
 file descriptor as integer payload. If real-time signal can not be enqueued
 at the destination signal queue, the newly created file descriptor will be
 promptly closed.

[ CC += linux-api ]

So I'm not a syscall API design expert, but this one raises a few
questions with me.

 - Messing with another process's file descriptor table without its
   knowledge looks like a possible source of all kinds problems.  Might
   there be race conditions with close()/dup() code, for example?  And
   remember that users can be root in a user namespace; maybe there's no
   potential for mischief there, but it needs to be considered.

 - Forcing the use of realtime signals seems strange; this isn't a
   realtime operation by any stretch.

 - How might the sending process communicate to the recipient what the fd
   is for?  Even if a process only expects one type of file descriptor,
   the ability to communicate information other than its number seems
   like it would often be useful.

Some of these concerns might be addressable by requiring the recipient to
call acceptfd() (or some such) with the ability to use poll().  As an
alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
would you still need this feature?

Thanks,

jon
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-02 Thread Alex Dubov
On Wed, Dec 3, 2014 at 2:26 AM, Jonathan Corbet cor...@lwn.net wrote:
 On Tue,  2 Dec 2014 15:35:17 +1100
 Alex Dubov alex.du...@gmail.com wrote:


  - Messing with another process's file descriptor table without its
knowledge looks like a possible source of all kinds problems.  Might
there be race conditions with close()/dup() code, for example?  And
remember that users can be root in a user namespace; maybe there's no
potential for mischief there, but it needs to be considered.

If process A has sufficient permissions to signal process B, it can
already do arbitrary mischief, no news there (SIGKILL and SIGSTOP will
definitely cause more havoc :-).

I don't believe there can be any race conditions as this is not
different to what happens when dup() is invoked from one of the
threads in multi-threaded application, whereupon other threads go on
with their usual file operations. Descriptor duplication happens prior
to any signal handling activities.

  - Forcing the use of realtime signals seems strange; this isn't a
realtime operation by any stretch.

Real time signals are merely a misleading name for Posix.1b
micro-messaging facility. To the best of my knowledge they do not
affect scheduling any more then SIGIO or SIGALRM would.

As Posix.1b signals are best handled by signalfd() facility anyway, no
impact on scheduling compared to any other approach (including the
existing domain socket approach) is expected at all.


  - How might the sending process communicate to the recipient what the fd
is for?  Even if a process only expects one type of file descriptor,
the ability to communicate information other than its number seems
like it would often be useful.

There are 32 real time signals defined by default in kernel; this
range can be increased at will with kernel recompilation and glibc
will pick up the correct range automatically (this is Posix mandated
behavior and it actually works like that).

I have not seen an app yet that relied on more than half a dozen of
distinct signal numbers. Thus any application can conveniently define
more than 2 dozens of different fd varieties out of the box, delivered
to it with dedicated signal ids, whereupon in most practical
applications only 1 or 2 varieties of file descriptors are ever passed
around.


 Some of these concerns might be addressable by requiring the recipient to
 call acceptfd() (or some such) with the ability to use poll().  As an
 alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
 would you still need this feature?

Any process willing to handle Posix.1b signals must explicitly
manipulate the signal masks - otherwise it will be killed the moment
signal is received. Thus, no special acceptfd() call is necessary on
the receiver side - applications usually don't  modify their signal
masks unless they expect some particular signal to arrive.

kdbus has something like it and binder on android has it as well. The
problem with both of them are the same as with unix domain sockets
(which implement a whole, rather convoluted, cmsg facility to be ever
used for that single purpose): they try to solve big problems with
fancy functionality, whereupon fd passing is a nice side feature
(which then gets used the most).

To my understanding, commonly used functionality deserves to have its
own quick, low overhead path:

1. We've got eventfd() which is neat and all, but to use it we need an
easy way to pass its fd around.

2. We've got memfd() which is also neat, but to use it..

3. We've got fairly complex (and consequently buggy) functionality
like SO_REUSEPORT, but I can't avoid a feeling that if there was a low
overhead transport available to path fds around (like the one
proposed), the old school approach of having one process running
tightly around accept() and sending sockets to workers may still rival
it (pity I don't have google's setup around to test it).

4. Most importantly, when network appliances are concerned (and those
represent a huge percentage of linux install base), it is desirable to
have the leanest possible code paths both in kernel and in the user
space (no functionality - no vulnerabilities to fish for) and still be
able to rely on multi-process applications (as multi-process
applications are considerably more reliable then multi-threaded ones,
for all the obvious reasons). A compact, easily traceable facility
comprising few hundred LOCs in the kernel, end to end, and very simple
application code (sigqueue() - signalfd()) pose a distinct advantage
in this regard over largish subsystems which may provide similar
feature (invariable at the expense of unnecessary costs, like
persistent file system objects, specialized user-space libraries, etc)
.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-01 Thread Alex Dubov
A common requirement in parallel processing applications (relied upon by
popular network servers, databases and various other applications) is to
pass open file descriptors between processes. Historically, several mechanisms
existed to support this requirement, such as those provided by "cmsg" facility
of unix domain sockets or special operations on named pipes (on Android this
can also be achieved using "binder" facility).

Unfortunately, using facilities like Unix domain sockets to merely pass file
descriptors between "worker" processes is unnecessarily difficult, due to
the following common consideration:

1. Domain sockets and named pipes are persistent objects. Applications must
manage their lifetime and devise unambiguous access schemes in case multiple
application instances are to be run within the same OS instance. Usually, they
would also require a writable file system to be mounted.

2. Interaction with domain sockets and named pipes requires a sizable,
non-trivial and error-prone code on the application side, especially in
cases where multiple worker types started by multiple application instances
must coexist within the same OS instance.

3. Domain sockets and pipes require creation of complex kernel-side set-ups,
whereupon, in many cases, the only information ever passed by the application
over those channels are file descriptors (it is usual for the major part of the
application's shared state to be established through other mechanisms,
like shared memory). In some cases, applications are forced to send meaningless
rubbish over the domain socket merely to "push" the associated "cmsg" carrying
the file descriptor through.

Present patch introduces exceptionally easy to use, low latency and low
overhead mechanism for transferring file descriptors between cooperating
processes:

int sendfd(pid_t pid, int sig, int fd)

Given a target process pid, the sendfd() syscall will create a duplicate
file descriptor in a target task's (referred by pid) file table pointing to
the file references by descriptor fd. Then, it will attempt to notify the
target task by issuing a Posix.1b real-time signal (sig), carrying the new
file descriptor as integer payload. If real-time signal can not be enqueued
at the destination signal queue, the newly created file descriptor will be
promptly closed.

It is believed, that proposed sendfd() syscall, together with recently
accepted "memfd" facility may greatly simplify development of parallel
processing applications, by eliminating the need to rely on tricky and
possibly insecure approaches involving domain sockets and such.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Minimal effort/low overhead file descriptor duplication over Posix.1b s

2014-12-01 Thread Alex Dubov
A common requirement in parallel processing applications (relied upon by
popular network servers, databases and various other applications) is to
pass open file descriptors between processes. Historically, several mechanisms
existed to support this requirement, such as those provided by cmsg facility
of unix domain sockets or special operations on named pipes (on Android this
can also be achieved using binder facility).

Unfortunately, using facilities like Unix domain sockets to merely pass file
descriptors between worker processes is unnecessarily difficult, due to
the following common consideration:

1. Domain sockets and named pipes are persistent objects. Applications must
manage their lifetime and devise unambiguous access schemes in case multiple
application instances are to be run within the same OS instance. Usually, they
would also require a writable file system to be mounted.

2. Interaction with domain sockets and named pipes requires a sizable,
non-trivial and error-prone code on the application side, especially in
cases where multiple worker types started by multiple application instances
must coexist within the same OS instance.

3. Domain sockets and pipes require creation of complex kernel-side set-ups,
whereupon, in many cases, the only information ever passed by the application
over those channels are file descriptors (it is usual for the major part of the
application's shared state to be established through other mechanisms,
like shared memory). In some cases, applications are forced to send meaningless
rubbish over the domain socket merely to push the associated cmsg carrying
the file descriptor through.

Present patch introduces exceptionally easy to use, low latency and low
overhead mechanism for transferring file descriptors between cooperating
processes:

int sendfd(pid_t pid, int sig, int fd)

Given a target process pid, the sendfd() syscall will create a duplicate
file descriptor in a target task's (referred by pid) file table pointing to
the file references by descriptor fd. Then, it will attempt to notify the
target task by issuing a Posix.1b real-time signal (sig), carrying the new
file descriptor as integer payload. If real-time signal can not be enqueued
at the destination signal queue, the newly created file descriptor will be
promptly closed.

It is believed, that proposed sendfd() syscall, together with recently
accepted memfd facility may greatly simplify development of parallel
processing applications, by eliminating the need to rely on tricky and
possibly insecure approaches involving domain sockets and such.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/