Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s
On Tue, Dec 02, 2014 at 03:35:17PM +1100, Alex Dubov wrote:
> Unfortunately, using facilities like Unix domain sockets to merely pass file
> descriptors between "worker" processes is unnecessarily difficult, due to
> the following common considerations:
>
> 1. Domain sockets and named pipes are persistent objects. Applications must
> manage their lifetime and devise unambiguous access schemes in case multiple
> application instances are to be run within the same OS instance. Usually, they
> would also require a writable file system to be mounted.

I believe this particular issue has long been addressed in Linux, with the
"abstract namespace" domain sockets. These aren't persistent - they go away
when the bound socket is closed - and they don't need a writable filesystem.

If you derived the name in the abstract namespace from your PID (or better,
application identifier and PID) then you would have exactly the same
"ambiguous access" scheme as your proposal.

> int sendfd(pid_t pid, int sig, int fd)

PIDs tend to be regarded as a bit of an iffy way to refer to another
process, because they tend to be racy. If the process you think you're
talking to dies, and has its PID reused by another unrelated sendfd()-aware
process, you've just sent your open file to somewhere unexpected.

You can avoid that if the process is a child of yours, but in that case you
could have set up a no-fuss domain socket connection with socketpair() too.

    - Kevin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
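For readers unfamiliar with the abstract namespace mentioned in the reply above, here is a minimal sketch of binding such a socket under a name derived from an application identifier and a PID. The helper name bind_abstract() and the "<app>.<pid>" naming scheme are illustrative assumptions, not something taken from the thread.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: bind a datagram socket to the abstract name
 * "\0<app>.<pid>". Returns the bound socket, or -1 on error. */
int bind_abstract(const char *app, pid_t pid)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;

    /* sun_path[0] stays '\0'; that leading NUL selects the abstract
     * namespace, so no file system object is created. */
    int n = snprintf(addr.sun_path + 1, sizeof(addr.sun_path) - 1,
                     "%s.%ld", app, (long)pid);
    if (n < 0 || (size_t)n >= sizeof(addr.sun_path) - 1)
        return -1;

    int s = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
    if (s < 0)
        return -1;

    /* The address length covers the leading NUL plus the name bytes only. */
    socklen_t len = (socklen_t)(offsetof(struct sockaddr_un, sun_path)
                                + 1 + (size_t)n);
    if (bind(s, (struct sockaddr *)&addr, len) < 0) {
        close(s);
        return -1;
    }
    return s;
}
```

Closing the returned socket releases the name, so there is nothing to unlink afterwards. The PID reuse caveat raised in the reply applies to a PID-derived name just as it does to a raw PID.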
Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s
On Wed, Dec 3, 2014 at 2:26 AM, Jonathan Corbet wrote:
> On Tue, 2 Dec 2014 15:35:17 +1100
> Alex Dubov wrote:
>
>  - Messing with another process's file descriptor table without its
>    knowledge looks like a possible source of all kinds of problems. Might
>    there be race conditions with close()/dup() code, for example? And
>    remember that users can be root in a user namespace; maybe there's no
>    potential for mischief there, but it needs to be considered.

If process A has sufficient permissions to signal process B, it can already
do arbitrary mischief, no news there (SIGKILL and SIGSTOP will definitely
cause more havoc :-).

I don't believe there can be any race conditions, as this is no different
from what happens when dup() is invoked from one of the threads in a
multi-threaded application while the other threads go on with their usual
file operations. Descriptor duplication happens prior to any signal handling
activities.

>  - Forcing the use of realtime signals seems strange; this isn't a
>    realtime operation by any stretch.

"Real time signals" are merely a misleading name for the Posix.1b
micro-messaging facility. To the best of my knowledge they do not affect
scheduling any more than SIGIO or SIGALRM would. As Posix.1b signals are best
handled by the signalfd() facility anyway, no impact on scheduling compared
to any other approach (including the existing domain socket approach) is
expected at all.

>  - How might the sending process communicate to the recipient what the fd
>    is for? Even if a process only expects one type of file descriptor,
>    the ability to communicate information other than its number seems
>    like it would often be useful.

There are 32 "real time" signals defined by default in the kernel; this range
can be increased at will with kernel recompilation and glibc will pick up
the correct range automatically (this is Posix mandated behavior and it
actually works like that).

I have not seen an app yet that relied on more than half a dozen distinct
signal numbers. Thus any application can conveniently define more than two
dozen different fd varieties out of the box, each delivered with a dedicated
signal id, whereas in most practical applications only 1 or 2 varieties of
file descriptors are ever passed around.

> Some of these concerns might be addressable by requiring the recipient to
> call acceptfd() (or some such) with the ability to use poll(). As an
> alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
> would you still need this feature?

Any process willing to handle Posix.1b signals must explicitly manipulate
its signal mask - otherwise it will be killed the moment the signal is
received. Thus, no special "acceptfd()" call is necessary on the receiver
side - applications usually don't modify their signal masks unless they
expect some particular signal to arrive.

kdbus has something like it and binder on Android has it as well. The
problem with both of them is the same as with unix domain sockets (which
implement a whole, rather convoluted, cmsg facility that is only ever used
for that single purpose): they try to solve big problems with fancy
functionality, and fd passing is a nice side feature (which then gets used
the most).

To my understanding, commonly used functionality deserves to have its own
quick, low overhead path:

1. We've got eventfd() which is neat and all, but to use it we need an easy
way to pass its fd around.

2. We've got memfd_create() which is also neat, but to use it..

3. We've got fairly complex (and consequently buggy) functionality like
SO_REUSEPORT, but I can't avoid a feeling that if there was a low overhead
transport available to pass fds around (like the one proposed), the old
school approach of having one process running tightly around accept() and
sending sockets to workers may still rival it (pity I don't have google's
setup around to test it).

4. Most importantly, when network appliances are concerned (and those
represent a huge percentage of the linux install base), it is desirable to
have the leanest possible code paths both in the kernel and in user space
(no functionality - no vulnerabilities to fish for) and still be able to
rely on multi-process applications (as multi-process applications are
considerably more reliable than multi-threaded ones, for all the obvious
reasons).

A compact, easily traceable facility comprising a few hundred LOCs in the
kernel, end to end, and very simple application code (sigqueue() ->
signalfd()) poses a distinct advantage in this regard over largish
subsystems which may provide a similar feature (invariably at the expense of
unnecessary costs, like persistent file system objects, specialized
user-space libraries, etc).
Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s
On Tue, 2 Dec 2014 15:35:17 +1100
Alex Dubov wrote:

> int sendfd(pid_t pid, int sig, int fd)
>
> Given a target process pid, the sendfd() syscall will create a duplicate
> file descriptor in a target task's (referred by pid) file table pointing to
> the file referenced by descriptor fd. Then, it will attempt to notify the
> target task by issuing a Posix.1b real-time signal (sig), carrying the new
> file descriptor as integer payload. If real-time signal can not be enqueued
> at the destination signal queue, the newly created file descriptor will be
> promptly closed.

[ CC += linux-api ]

So I'm not a syscall API design expert, but this one raises a few questions
with me.

 - Messing with another process's file descriptor table without its
   knowledge looks like a possible source of all kinds of problems. Might
   there be race conditions with close()/dup() code, for example? And
   remember that users can be root in a user namespace; maybe there's no
   potential for mischief there, but it needs to be considered.

 - Forcing the use of realtime signals seems strange; this isn't a
   realtime operation by any stretch.

 - How might the sending process communicate to the recipient what the fd
   is for? Even if a process only expects one type of file descriptor,
   the ability to communicate information other than its number seems
   like it would often be useful.

Some of these concerns might be addressable by requiring the recipient to
call acceptfd() (or some such) with the ability to use poll(). As an
alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
would you still need this feature?

Thanks,

jon
Minimal effort/low overhead file descriptor duplication over Posix.1b s
A common requirement in parallel processing applications (relied upon by
popular network servers, databases and various other applications) is to
pass open file descriptors between processes. Historically, several
mechanisms existed to support this requirement, such as those provided by
the "cmsg" facility of unix domain sockets or special operations on named
pipes (on Android this can also be achieved using the "binder" facility).

Unfortunately, using facilities like Unix domain sockets to merely pass file
descriptors between "worker" processes is unnecessarily difficult, due to
the following common considerations:

1. Domain sockets and named pipes are persistent objects. Applications must
manage their lifetime and devise unambiguous access schemes in case multiple
application instances are to be run within the same OS instance. Usually, they
would also require a writable file system to be mounted.

2. Interaction with domain sockets and named pipes requires sizable,
non-trivial and error-prone code on the application side, especially in
cases where multiple worker types started by multiple application instances
must coexist within the same OS instance.

3. Domain sockets and pipes require creation of complex kernel-side set-ups,
whereas, in many cases, the only information ever passed by the application
over those channels is file descriptors (it is usual for the major part of
the application's shared state to be established through other mechanisms,
like shared memory). In some cases, applications are forced to send
meaningless rubbish over the domain socket merely to "push" the associated
"cmsg" carrying the file descriptor through.

The present patch introduces an exceptionally easy to use, low latency and
low overhead mechanism for transferring file descriptors between cooperating
processes:

    int sendfd(pid_t pid, int sig, int fd)

Given a target process pid, the sendfd() syscall will create a duplicate
file descriptor in a target task's (referred by pid) file table pointing to
the file referenced by descriptor fd. Then, it will attempt to notify the
target task by issuing a Posix.1b real-time signal (sig), carrying the new
file descriptor as integer payload. If the real-time signal can not be
enqueued at the destination signal queue, the newly created file descriptor
will be promptly closed.

It is believed that the proposed sendfd() syscall, together with the
recently accepted "memfd" facility, may greatly simplify development of
parallel processing applications, by eliminating the need to rely on tricky
and possibly insecure approaches involving domain sockets and such.
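For comparison, the existing "cmsg" route that the post argues against looks roughly like the sketch below, which passes one descriptor over an already connected Unix domain socket using SCM_RIGHTS. The helper names send_fd() and recv_fd() are hypothetical, and the single dummy byte is the kind of filler the post calls "meaningless rubbish" (stream sockets need some real data to carry the ancillary message).

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over the connected Unix socket sock; returns 0 on success. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';            /* filler byte that carries the cmsg */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {                      /* union keeps the buffer aligned */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns the new fd, or -1. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL || cm->cmsg_level != SOL_SOCKET ||
        cm->cmsg_type != SCM_RIGHTS || cm->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cm), sizeof(int));
    return fd;
}
```

With a socketpair() created before fork(), these two helpers are all a parent and child need, and no file system object or global name is involved.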