On 2018-11-01, Daniel Colascione <dan...@google.com> wrote:
> On Thu, Nov 1, 2018 at 7:00 AM, Aleksa Sarai <cyp...@cyphar.com> wrote:
> > On 2018-10-29, Daniel Colascione <dan...@google.com> wrote:
> >> This patch adds a new file under /proc/pid, /proc/pid/exithand.
> >> Attempting to read from an exithand file will block until the
> >> corresponding process exits, at which point the read will successfully
> >> complete with EOF. The file descriptor supports both blocking
> >> operations and poll(2). It's intended to be a minimal interface for
> >> allowing a program to wait for the exit of a process that is not one
> >> of its children.
> >>
> >> Why might we want this interface? Android's lmkd kills processes in
> >> order to free memory in response to various memory pressure
> >> signals. It's desirable to wait until a killed process actually exits
> >> before moving on (if needed) to killing the next process. Since the
> >> processes that lmkd kills are not lmkd's children, lmkd currently
> >> lacks a way to wait for a process to actually die after being sent
> >> SIGKILL; today, lmkd resorts to polling the proc filesystem pid
> >> entry. This interface allows lmkd to give up polling and instead block
> >> and wait for process death.
> >
> > I agree with the need for this interface (with a few caveats), but there
> > are a few points I'd like to make:
> >
> > * I don't think that making a new procfile is necessary. When you open
> >   /proc/$pid you already have a handle for the underlying process, and
> >   you can already poll to check whether the process has died (fstatat
> >   fails for instance). What if we just used an inotify event to tell
> >   userspace that the process has died -- to avoid userspace doing a
> >   poll loop?
>
> I'm trying to make a simple interface. The basic unix data access
> model is that a userspace application wants information (e.g., next
> bunch of bytes in a file, next packet from a socket, next signal from
> a signal FD, etc.), and tells the kernel so by making a system call on
> a file descriptor. Ordinarily, the kernel returns to userspace with
> the requested information when it's available, potentially after
> blocking until the information is available. Sometimes userspace
> doesn't want to block, so it adds O_NONBLOCK to the open file mode,
> and in this mode, the kernel can tell the userspace requestor "try
> again later", but the source of truth is still that
> ordinarily-blocking system call. How does userspace know when to try
> again in the "try again later" case? By using
> select/poll/epoll/whatever, which suggests a good time for that "try
> again later" retry, but is not dispositive about it, since that
> ordinarily-blocking system call is still the sole source of truth, and
> that poll is allowed to report spurious readability.
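
For readers following along, the access model described in the quoted
paragraph can be sketched in C roughly as below. This is only an
illustration under stated assumptions: the helper names read_when_ready()
and demo_wait_for_exit() are made up for this sketch, and a pipe whose
write end is held by a child stands in for the proposed exithand file
(which is not used here). read(2) stays the source of truth; poll(2) only
hints at when to retry.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: read from a non-blocking fd until data or EOF
 * arrives. Returns bytes read (>0), 0 on EOF, or -1 on a real error.
 */
static ssize_t read_when_ready(int fd, void *buf, size_t len)
{
    /* A zero-length read would be indistinguishable from EOF. */
    assert(len > 0);

    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                   /* data, or EOF when n == 0 */
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                  /* genuine failure */

        /* "Try again later": sleep until the fd looks readable. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
            return -1;
        /* A spurious wakeup is harmless: read() reports EAGAIN again. */
    }
}

/*
 * Stand-in demo: the child holds the pipe's write end, so the parent
 * sees EOF on the read end only once the child has exited.
 * Returns 0 if EOF was observed, -1 otherwise.
 */
static int demo_wait_for_exit(void)
{
    int p[2];
    char buf[1];

    if (pipe(p) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);
        poll(NULL, 0, 50);              /* pretend to work for 50 ms */
        _exit(0);                       /* exiting closes the write end */
    }

    close(p[1]);
    int flags = fcntl(p[0], F_GETFL);
    if (flags < 0 || fcntl(p[0], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(p[0]);
        waitpid(pid, NULL, 0);
        return -1;
    }

    ssize_t n = read_when_ready(p[0], buf, sizeof buf);
    close(p[0]);
    waitpid(pid, NULL, 0);
    return n == 0 ? 0 : -1;
}
```

Calling demo_wait_for_exit() from a main() exercises the EAGAIN, poll,
retry path. Note that the pipe trick only works here because the waiter
created the process, which is exactly the limitation the patch is trying
to lift.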

inotify gives you an event if a file or directory is deleted. A pid
dying is semantically similar to the idea of a /proc/$pid being deleted.
I don't see how a blocking read on a new procfile is simpler than using
the existing notification-on-file-events infrastructure -- not to mention
that the idea of "this file blocks until the thing we are indirectly
referencing by this file is gone" seems to me to be a really strange
interface.

Sure, it uses read(2) -- but is that the only constraint on designing
simple interfaces?

> The event file I'm proposing is so ordinary, in fact, that it works
> from the shell. Without some specific technical reason to do something
> different, we shouldn't do something unusual.

inotify-tools are available on effectively every distribution.

> Given that we *can*, cheaply, provide a clean and consistent API to
> userspace, why would we instead want to inflict some exotic and
> hard-to-use interface on userspace instead? Asking that userspace poll
> on a directory file descriptor and, when poll returns, check by
> looking for certain errors (we'd have to spec which ones) from fstatat
> is awkward. /proc/pid is a directory. In what other context does the
> kernel ask userspace to use a directory this way?

I'm not sure you understood my proposal. I said that we need an
interface to do this, and I was trying to explain (by noting what the
current way of doing it would be) what I think the interface should be.

To reiterate, I believe that having an inotify event (IN_DELETE_SELF on
/proc/$pid) would be in keeping with the current way of doing things
while allowing userspace to avoid all of the annoyances you just
mentioned and I was alluding to.

I *don't* think that the current scheme of looping on fstatat is the way
it should be left.
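
To make the IN_DELETE_SELF idea concrete, here is a rough C sketch of
what the userspace side could look like. It is an illustration only:
IN_DELETE_SELF on /proc/$pid is a proposal in this thread, so the sketch
watches a temporary directory as a stand-in, and the helper names
wait_for_delete_self() and demo_delete_self() are made up for the
example.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Hypothetical helper: block on an inotify fd until an IN_DELETE_SELF
 * event shows up. Returns 0 once seen, -1 on error.
 */
static int wait_for_delete_self(int ifd)
{
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));

    for (;;) {
        ssize_t len = read(ifd, buf, sizeof buf);
        if (len <= 0)
            return -1;
        /* inotify hands back whole events, never a partial header. */
        assert((size_t)len >= sizeof(struct inotify_event));

        for (char *p = buf; p < buf + len; ) {
            const struct inotify_event *ev =
                (const struct inotify_event *)p;
            if (ev->mask & IN_DELETE_SELF)
                return 0;
            p += sizeof(*ev) + ev->len;
        }
    }
}

/*
 * Stand-in demo: watch a temporary directory (not /proc/$pid), remove
 * it, and wait for the resulting event. Returns 0 on success.
 */
static int demo_delete_self(void)
{
    char dir[] = "/tmp/delself-XXXXXX";
    if (!mkdtemp(dir))
        return -1;

    int ifd = inotify_init1(IN_CLOEXEC);
    if (ifd < 0) {
        rmdir(dir);
        return -1;
    }
    if (inotify_add_watch(ifd, dir, IN_DELETE_SELF) < 0) {
        close(ifd);
        rmdir(dir);
        return -1;
    }
    if (rmdir(dir) < 0) {
        close(ifd);
        return -1;
    }

    int rc = wait_for_delete_self(ifd);
    close(ifd);
    return rc;
}
```

The inotify fd is itself pollable, so the same watch could sit in an
existing epoll loop rather than in a dedicated blocking read.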
And there is an argument the inotify is not sufficient to

> > I'm really not a huge fan of the "blocking read" semantic (though if we
> > have to have it, can we at least provide as much information as you get
> > from proc_connector -- such as the exit status?).
> [...]
> The exit status in /proc/pid/stat is zeroed out for readers that fail
> do_task_stat's ptrace_may_access call. (Falsifying the exit status in
> stat when a privilege check fails seems like a bad idea from a
> correctness POV.)

It's not clear to me what the purpose of that field is within procfs for
*dead* processes -- which is what we're discussing here. As far as I can
tell, you will get an ESRCH when you try to read it. When testing this
it also looked like you didn't even get the exit_status as a zombie but
I might be mistaken.

So while it is masked for !ptrace_may_access, it's also zero (or
unreadable) for almost every case outside of stopped processes (AFAICS).
Am I missing something?

> Should open() on exithand perform the same ptrace_may_access privilege
> check? What if the process *becomes* untraceable during its lifetime
> (e.g., with setuid). Should that read() on the exithand FD still yield
> a siginfo_t? Just having exithand yield EOF all the time punts the
> privilege problem to a later discussion because this approach doesn't
> leak information. We can always add an "exithand_full" or something
> that actually yields a siginfo_t.

I agree that read(2) makes this hard. I don't think we should use it.
But if we have to use it, I would like us to have feature parity with
features that FreeBSD had 18 years ago.

> Another option would be to make exithand's read() always yield a
> siginfo_t, but have the open() just fail if the caller couldn't
> ptrace_may_access it. But why shouldn't you be able to wait on other
> processes? If you can see it in /proc, you should be able to wait on
> it exiting.

I would suggest looking at FreeBSD's kevent semantics for inspiration
(or at least to see an alternative way of doing things). In particular,
EVFILT_PROC+NOTE_EXIT -- which is attached to a particular process. I
wonder what their view is on these sorts of questions.

> > Also maybe we should
> > integrate this into the exit machinery instead of this loop...
>
> I don't know what you mean. It's already integrated into the exit
> machinery: it's what runs the waitqueue.

My mistake, I missed the last hunk of the patch.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>