On Fri, Jul 10, 2020 at 08:41:40PM +0700, Robert Elz wrote: > I have spent a little time looking at this now, and I think > it is just all a mess.
Indeed... > As best I can work out, and someone correct me if I'm wrong, > the whole purpose of pg_jobc is so that orphanpg() can be called > when a process group is orphaned (no longer has a session leader). Yes. A process group is orphaned if there's no longer a way to do job control on it directly -- the most common case is when you get hung up on and your login shell/session leader goes away, but it can also happen if you make chains of forks and some of them exit. For example, if you start an interactive subshell, run some stuff in the background, then exit it, the background jobs are now orphaned. Detecting orphaned process groups is vexing because it's a graph connectivity problem and you don't want to/can't (for locking reasons) go chugging around the graph explicitly. > I see 3 ways forward... simply drop the KASSERT the way that FreeBSD > have done, and let things return to the semi-broken but rarely bothersome > code that was there before. That's not really a good idea, as the > sanitizers apparently find problems with the code the way it was (not > too surprising, deleting the KASSERT won't fix the bugs, it just stops > noticing them explicitly). Indeed. > Or, we could properly define what pg_jobc is counting, and then make sure > that it counts whatever that is properly - is incremented in all the > appropriate places, and decremented properly as well. Currently > the comment about it in proc.h is: > /* > * Number of processes qualifying > * pgrp for job control > */ > which makes it clear that it is a reference counter (not necessarily > counting the number of something which exists, so that something can be > deleted, but it is counting references to processes). Unfortunately > I have no idea what "qualifying pgrp for job control" is supposed to mean. A processes is job-controlled by its parent process, and a process group is job-controlled by any parent of processes in it. So a process group can be job-controlled if some process in it has a parent in a different process group that can also be job-controlled, going back to the session leader. (Provided that the session has a tty, anyway.) This becomes complicated because although normal process group usage is completely hierarchical, it's possible to place processes in process groups such that the graph of parent relationships ceases to be a tree. The code is trying to count how many processes in the process group have a parent in a different process group that is not itself orphaned. If this count reaches zero, the process group has become orphaned. I can't remember if it's possible to set up process groups where the parent graph is cyclic. I think it might be; or at least, I don't remember any reason why not that wouldn't also force the graph to be a tree. But it's been a while with all this stuff and I may be forgetting something. Anyway, the method should work as long as the graph remains acyclic. The implementation is probably missing cases where it should be adjusting the count. > That could be done, but it seems like a lot of work to me, and not easy. Unfortunately, I think it's the only viable way forard. > Another (more radical) approach would be to simply drop orphanpg() > completely, and thus no longer need pg_jobc at all. The system > wouldn't be bothered by this at all - all orphanpg() does is locate > stopped members of the process group, and send then SIGCONT (to restart) > and SIGHUP (to probably kill them - or at least inform them they their > environment has changed). Unfortunately, removing it violates POSIX, which prescribes all this stuff. > Third, and the option I'd suggest, is to revert to more basic principles, > remove the pg_jobc attempt to detect when a session leader has exited, > or changed to a different process group, and instead at candidate events > (any time a process leaves a process group, for any reason) check if that > process was the session leader, and if it is, clean up any now orphaned > stopped processes. This is likely to be slower than the current attempt, > but is much easier to make correct (and much less likely to cause problems, > other than dangling orphaned stopped processes, if incorrect). I don't think that's sufficient, because it doesn't take into account the cases where jobs become disconnected from the session leader without the session leader exiting. Which is easy to arrange: su, then vi, ^Z, exit. :-/ -- David A. Holland dholl...@netbsd.org