Re: PATCH [PPC64]: dead processes never reaped
On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: > > Hi, > > The patch below appears to fix a problem where a number of dead processes > linger on the system. On a highly loaded system, dozens of processes > were found stuck in do_exit(), calling thier very last schedule(), and > then being lost forever. Ok, we spent some time with Paul decrypting what _switch_to is supposed to do. Our understanding at this point is that the current code is correct on both ppc32 and ppc64, that is: The "prev" passed in is always "current" and we don't see how it can be anything else. We use a local variable instead of current in the common code because accessing current can be slow on some architectures. I don't see any codepath where prev != current before switch_to. If we didn't do some black magic that I explain below, _switch_to would switch the entire context, including stack, and thus including the value of "prev". Which means that we would always come back with prev beeing current, which is useless for reaping the old task. What we want is that this "prev" that was passed to _switch_to() is returned so that we can rip that previous task despite the change of context, that is basically prev has to be an invariant vs. the change of context in switch_to. On ppc & ppc64, we implement that by passing that prev (or it's thread counterpart) to the assembly context switch code in r3. This code will preserve it and return it as-is (or re-transformed from thread to task). So your problem must be somewhere else. I've looked at the need_resched code path and we always reload prev = current from a non-preemptible region, so it can't be wrong. This was verified on 2.6.12-rc2, there might be something else wrong in an older kernel. Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Tue, Apr 19, 2005 at 11:01:25AM +1000, Benjamin Herrenschmidt was heard to remark: > On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: > > > > Hi, > > > > The patch below appears to fix a problem where a number of dead processes > > linger on the system. On a highly loaded system, dozens of processes > > were found stuck in do_exit(), calling thier very last schedule(), and > > then being lost forever. > > > > Processes that are PF_DEAD are cleaned up *after* the context switch, > > in a routine called finish_task_switch(task_t *prev). The "prev" gets > > the value returned by _switch() in entry.S, but this value comes from > > > > __switch_to (struct task_struct *prev, > > struct task_struct *new) > > { > >old_thread = >thread; ///XXX shouldn't this be prev, not > > current? > >last = _switch(old_thread, new_thread); > >return last; > > } > > > > The way I see it, "prev" and "current" are almost always going to be > > pointing at the same thing; however, if a "need resched" happens, > > or there's a pre-emept or some-such, then prev and current won't be > > the same; in which case, finish_task_switch() will end up cleaning > > up the old current, instead of prev. This will result in dead processes > > hanging around, which will never be scheduled again, and will never > > get a chance to have put_task_struct() called on them. > > I wonder why we bother doing all that at all... we could just return > "prev" from __switch_to() no ? Like x86 does... Probably. I assume this funny two-step is left-over from a 2.4 kernel design point. Naively, we could rturn "prev", this would save a few cycles. Cut the "addi r3,r3,-THREAD" from entry.S as well. I was being conservative with the patch, making the smallest change possible. Do you want this larger patch? --linas - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Tue, Apr 19, 2005 at 11:07:01AM +1000, Benjamin Herrenschmidt was heard to remark: > On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: > > > > Hi, > > > > The patch below appears to fix a problem where a number of dead processes > > linger on the system. On a highly loaded system, dozens of processes > > were found stuck in do_exit(), calling thier very last schedule(), and > > then being lost forever. > > > > Processes that are PF_DEAD are cleaned up *after* the context switch, > > in a routine called finish_task_switch(task_t *prev). The "prev" gets > > the value returned by _switch() in entry.S, but this value comes from > > > > __switch_to (struct task_struct *prev, > > struct task_struct *new) > > { > >old_thread = >thread; ///XXX shouldn't this be prev, not > > current? > >last = _switch(old_thread, new_thread); > >return last; > > } > > > > The way I see it, "prev" and "current" are almost always going to be > > pointing at the same thing; however, if a "need resched" happens, > > or there's a pre-emept or some-such, then prev and current won't be > > the same; in which case, finish_task_switch() will end up cleaning > > up the old current, instead of prev. This will result in dead processes > > hanging around, which will never be scheduled again, and will never > > get a chance to have put_task_struct() called on them. > > Ok, thinking moer about this ... that will need maybe some help from > Ingo so I fully understand where schedule's are allowed ... We are > basically in the middle of the scheduler here, so I wonder how much of > the scheduler itself can be preempted or so ... > > Basically, under which circumstances can prev and current be different ? I remember finding a path through void __sched schedule(void) that took a branch through the goto need_resched; that would result in this. I takes a bit of mental gymnastics to see how this might happen. FWIW, I can send you a debug session showing all cpu's idle and 44 dead processes sitting in do_exit(). All but two of these were Java threads, so this seems to be some sort of thread-scheduling subtlty. (the two that weren't java threads were a find|grep pair that must have gotten tangled in.) Given that the patch seems to fix the problem, I didn't dig much deeper. --linas - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Tue, Apr 19, 2005 at 11:07:01AM +1000, Benjamin Herrenschmidt was heard to remark: On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Processes that are PF_DEAD are cleaned up *after* the context switch, in a routine called finish_task_switch(task_t *prev). The prev gets the value returned by _switch() in entry.S, but this value comes from __switch_to (struct task_struct *prev, struct task_struct *new) { old_thread = current-thread; ///XXX shouldn't this be prev, not current? last = _switch(old_thread, new_thread); return last; } The way I see it, prev and current are almost always going to be pointing at the same thing; however, if a need resched happens, or there's a pre-emept or some-such, then prev and current won't be the same; in which case, finish_task_switch() will end up cleaning up the old current, instead of prev. This will result in dead processes hanging around, which will never be scheduled again, and will never get a chance to have put_task_struct() called on them. Ok, thinking moer about this ... that will need maybe some help from Ingo so I fully understand where schedule's are allowed ... We are basically in the middle of the scheduler here, so I wonder how much of the scheduler itself can be preempted or so ... Basically, under which circumstances can prev and current be different ? I remember finding a path through void __sched schedule(void) that took a branch through the goto need_resched; that would result in this. I takes a bit of mental gymnastics to see how this might happen. FWIW, I can send you a debug session showing all cpu's idle and 44 dead processes sitting in do_exit(). All but two of these were Java threads, so this seems to be some sort of thread-scheduling subtlty. (the two that weren't java threads were a find|grep pair that must have gotten tangled in.) Given that the patch seems to fix the problem, I didn't dig much deeper. --linas - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Tue, Apr 19, 2005 at 11:01:25AM +1000, Benjamin Herrenschmidt was heard to remark: On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Processes that are PF_DEAD are cleaned up *after* the context switch, in a routine called finish_task_switch(task_t *prev). The prev gets the value returned by _switch() in entry.S, but this value comes from __switch_to (struct task_struct *prev, struct task_struct *new) { old_thread = current-thread; ///XXX shouldn't this be prev, not current? last = _switch(old_thread, new_thread); return last; } The way I see it, prev and current are almost always going to be pointing at the same thing; however, if a need resched happens, or there's a pre-emept or some-such, then prev and current won't be the same; in which case, finish_task_switch() will end up cleaning up the old current, instead of prev. This will result in dead processes hanging around, which will never be scheduled again, and will never get a chance to have put_task_struct() called on them. I wonder why we bother doing all that at all... we could just return prev from __switch_to() no ? Like x86 does... Probably. I assume this funny two-step is left-over from a 2.4 kernel design point. Naively, we could rturn prev, this would save a few cycles. Cut the addi r3,r3,-THREAD from entry.S as well. I was being conservative with the patch, making the smallest change possible. Do you want this larger patch? --linas - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Ok, we spent some time with Paul decrypting what _switch_to is supposed to do. Our understanding at this point is that the current code is correct on both ppc32 and ppc64, that is: The prev passed in is always current and we don't see how it can be anything else. We use a local variable instead of current in the common code because accessing current can be slow on some architectures. I don't see any codepath where prev != current before switch_to. If we didn't do some black magic that I explain below, _switch_to would switch the entire context, including stack, and thus including the value of prev. Which means that we would always come back with prev beeing current, which is useless for reaping the old task. What we want is that this prev that was passed to _switch_to() is returned so that we can rip that previous task despite the change of context, that is basically prev has to be an invariant vs. the change of context in switch_to. On ppc ppc64, we implement that by passing that prev (or it's thread counterpart) to the assembly context switch code in r3. This code will preserve it and return it as-is (or re-transformed from thread to task). So your problem must be somewhere else. I've looked at the need_resched code path and we always reload prev = current from a non-preemptible region, so it can't be wrong. This was verified on 2.6.12-rc2, there might be something else wrong in an older kernel. Ben. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Tue, 2005-04-19 at 11:07 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: > > > > Hi, > > > > The patch below appears to fix a problem where a number of dead processes > > linger on the system. On a highly loaded system, dozens of processes > > were found stuck in do_exit(), calling thier very last schedule(), and > > then being lost forever. > > > > Processes that are PF_DEAD are cleaned up *after* the context switch, > > in a routine called finish_task_switch(task_t *prev). The "prev" gets > > the value returned by _switch() in entry.S, but this value comes from > > > > __switch_to (struct task_struct *prev, > > struct task_struct *new) > > { > >old_thread = >thread; ///XXX shouldn't this be prev, not > > current? > >last = _switch(old_thread, new_thread); > >return last; > > } > > > > The way I see it, "prev" and "current" are almost always going to be > > pointing at the same thing; however, if a "need resched" happens, > > or there's a pre-emept or some-such, then prev and current won't be > > the same; in which case, finish_task_switch() will end up cleaning > > up the old current, instead of prev. This will result in dead processes > > hanging around, which will never be scheduled again, and will never > > get a chance to have put_task_struct() called on them. > > Ok, thinking moer about this ... that will need maybe some help from > Ingo so I fully understand where schedule's are allowed ... We are > basically in the middle of the scheduler here, so I wonder how much of > the scheduler itself can be preempted or so ... > Not much. schedule() has a small preempt window at the beginning and end of the function. The context switch is of course run with preempt disabled. Ie. your switch_to should never get preempted. > Basically, under which circumstances can prev and current be different ? > Depends on your context switch, really. prev == current before you switch, and when you switch to 'new' it is different. However, I think the 'new' current has *its* old prev on the stack (which == new current). You just have to preserve the old 'prev' somehow (ie. the thread you switched away from). -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: > > Hi, > > The patch below appears to fix a problem where a number of dead processes > linger on the system. On a highly loaded system, dozens of processes > were found stuck in do_exit(), calling thier very last schedule(), and > then being lost forever. > > Processes that are PF_DEAD are cleaned up *after* the context switch, > in a routine called finish_task_switch(task_t *prev). The "prev" gets > the value returned by _switch() in entry.S, but this value comes from > > __switch_to (struct task_struct *prev, > struct task_struct *new) > { >old_thread = >thread; ///XXX shouldn't this be prev, not current? >last = _switch(old_thread, new_thread); >return last; > } > > The way I see it, "prev" and "current" are almost always going to be > pointing at the same thing; however, if a "need resched" happens, > or there's a pre-emept or some-such, then prev and current won't be > the same; in which case, finish_task_switch() will end up cleaning > up the old current, instead of prev. This will result in dead processes > hanging around, which will never be scheduled again, and will never > get a chance to have put_task_struct() called on them. Ok, thinking moer about this ... that will need maybe some help from Ingo so I fully understand where schedule's are allowed ... We are basically in the middle of the scheduler here, so I wonder how much of the scheduler itself can be preempted or so ... Basically, under which circumstances can prev and current be different ? Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: > > Hi, > > The patch below appears to fix a problem where a number of dead processes > linger on the system. On a highly loaded system, dozens of processes > were found stuck in do_exit(), calling thier very last schedule(), and > then being lost forever. > > Processes that are PF_DEAD are cleaned up *after* the context switch, > in a routine called finish_task_switch(task_t *prev). The "prev" gets > the value returned by _switch() in entry.S, but this value comes from > > __switch_to (struct task_struct *prev, > struct task_struct *new) > { >old_thread = >thread; ///XXX shouldn't this be prev, not current? >last = _switch(old_thread, new_thread); >return last; > } > > The way I see it, "prev" and "current" are almost always going to be > pointing at the same thing; however, if a "need resched" happens, > or there's a pre-emept or some-such, then prev and current won't be > the same; in which case, finish_task_switch() will end up cleaning > up the old current, instead of prev. This will result in dead processes > hanging around, which will never be scheduled again, and will never > get a chance to have put_task_struct() called on them. I wonder why we bother doing all that at all... we could just return "prev" from __switch_to() no ? Like x86 does... Ben. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
PATCH [PPC64]: dead processes never reaped
Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Processes that are PF_DEAD are cleaned up *after* the context switch, in a routine called finish_task_switch(task_t *prev). The "prev" gets the value returned by _switch() in entry.S, but this value comes from __switch_to (struct task_struct *prev, struct task_struct *new) { old_thread = >thread; ///XXX shouldn't this be prev, not current? last = _switch(old_thread, new_thread); return last; } The way I see it, "prev" and "current" are almost always going to be pointing at the same thing; however, if a "need resched" happens, or there's a pre-emept or some-such, then prev and current won't be the same; in which case, finish_task_switch() will end up cleaning up the old current, instead of prev. This will result in dead processes hanging around, which will never be scheduled again, and will never get a chance to have put_task_struct() called on them. This patch fixes this. Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> --- arch/ppc64/kernel/process.c.orig2005-04-18 14:26:42.0 -0500 +++ arch/ppc64/kernel/process.c 2005-04-18 14:27:54.0 -0500 @@ -204,7 +204,7 @@ struct task_struct *__switch_to(struct t flush_tlb_pending(); new_thread = >thread; - old_thread = >thread; + old_thread = >thread; local_irq_save(flags); last = _switch(old_thread, new_thread); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
PATCH [PPC64]: dead processes never reaped
Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Processes that are PF_DEAD are cleaned up *after* the context switch, in a routine called finish_task_switch(task_t *prev). The prev gets the value returned by _switch() in entry.S, but this value comes from __switch_to (struct task_struct *prev, struct task_struct *new) { old_thread = current-thread; ///XXX shouldn't this be prev, not current? last = _switch(old_thread, new_thread); return last; } The way I see it, prev and current are almost always going to be pointing at the same thing; however, if a need resched happens, or there's a pre-emept or some-such, then prev and current won't be the same; in which case, finish_task_switch() will end up cleaning up the old current, instead of prev. This will result in dead processes hanging around, which will never be scheduled again, and will never get a chance to have put_task_struct() called on them. This patch fixes this. Signed-off-by: Linas Vepstas [EMAIL PROTECTED] --- arch/ppc64/kernel/process.c.orig2005-04-18 14:26:42.0 -0500 +++ arch/ppc64/kernel/process.c 2005-04-18 14:27:54.0 -0500 @@ -204,7 +204,7 @@ struct task_struct *__switch_to(struct t flush_tlb_pending(); new_thread = new-thread; - old_thread = current-thread; + old_thread = prev-thread; local_irq_save(flags); last = _switch(old_thread, new_thread); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Processes that are PF_DEAD are cleaned up *after* the context switch, in a routine called finish_task_switch(task_t *prev). The prev gets the value returned by _switch() in entry.S, but this value comes from __switch_to (struct task_struct *prev, struct task_struct *new) { old_thread = current-thread; ///XXX shouldn't this be prev, not current? last = _switch(old_thread, new_thread); return last; } The way I see it, prev and current are almost always going to be pointing at the same thing; however, if a need resched happens, or there's a pre-emept or some-such, then prev and current won't be the same; in which case, finish_task_switch() will end up cleaning up the old current, instead of prev. This will result in dead processes hanging around, which will never be scheduled again, and will never get a chance to have put_task_struct() called on them. I wonder why we bother doing all that at all... we could just return prev from __switch_to() no ? Like x86 does... Ben. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Processes that are PF_DEAD are cleaned up *after* the context switch, in a routine called finish_task_switch(task_t *prev). The prev gets the value returned by _switch() in entry.S, but this value comes from __switch_to (struct task_struct *prev, struct task_struct *new) { old_thread = current-thread; ///XXX shouldn't this be prev, not current? last = _switch(old_thread, new_thread); return last; } The way I see it, prev and current are almost always going to be pointing at the same thing; however, if a need resched happens, or there's a pre-emept or some-such, then prev and current won't be the same; in which case, finish_task_switch() will end up cleaning up the old current, instead of prev. This will result in dead processes hanging around, which will never be scheduled again, and will never get a chance to have put_task_struct() called on them. Ok, thinking moer about this ... that will need maybe some help from Ingo so I fully understand where schedule's are allowed ... We are basically in the middle of the scheduler here, so I wonder how much of the scheduler itself can be preempted or so ... Basically, under which circumstances can prev and current be different ? Ben. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PATCH [PPC64]: dead processes never reaped
On Tue, 2005-04-19 at 11:07 +1000, Benjamin Herrenschmidt wrote: On Mon, 2005-04-18 at 14:38 -0500, Linas Vepstas wrote: Hi, The patch below appears to fix a problem where a number of dead processes linger on the system. On a highly loaded system, dozens of processes were found stuck in do_exit(), calling thier very last schedule(), and then being lost forever. Processes that are PF_DEAD are cleaned up *after* the context switch, in a routine called finish_task_switch(task_t *prev). The prev gets the value returned by _switch() in entry.S, but this value comes from __switch_to (struct task_struct *prev, struct task_struct *new) { old_thread = current-thread; ///XXX shouldn't this be prev, not current? last = _switch(old_thread, new_thread); return last; } The way I see it, prev and current are almost always going to be pointing at the same thing; however, if a need resched happens, or there's a pre-emept or some-such, then prev and current won't be the same; in which case, finish_task_switch() will end up cleaning up the old current, instead of prev. This will result in dead processes hanging around, which will never be scheduled again, and will never get a chance to have put_task_struct() called on them. Ok, thinking moer about this ... that will need maybe some help from Ingo so I fully understand where schedule's are allowed ... We are basically in the middle of the scheduler here, so I wonder how much of the scheduler itself can be preempted or so ... Not much. schedule() has a small preempt window at the beginning and end of the function. The context switch is of course run with preempt disabled. Ie. your switch_to should never get preempted. Basically, under which circumstances can prev and current be different ? Depends on your context switch, really. prev == current before you switch, and when you switch to 'new' it is different. However, I think the 'new' current has *its* old prev on the stack (which == new current). You just have to preserve the old 'prev' somehow (ie. the thread you switched away from). -- SUSE Labs, Novell Inc. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/