On Thu, Jul 25, 2024 at 11:09:44AM +0200, Martin Pieuchot wrote:
> Thanks a lot for figuring that out.  This is awesome!
> 
> On 24/07/24(Wed) 16:19, Claudio Jeker wrote:
> > On Fri, Jun 21, 2024 at 01:24:27PM +0200, Martin Pieuchot wrote:
> > > So I'm trying to see where the remaining sched_yield() are coming from
> > > ld(1):
> > > 
> > > $ cd /sys/arch/arm64/compile/GENERIC.MP
> > > $ LD="egdb --args ld" make -j32
> > > 
> > > Then I add a breakpoint on sched_yield & hit run.
> > > 
> > > As soon as the first thread is stopped, I can see the trace as usual,
> > > however the process is now in a "stopped" state, impossible to kill or
> > > continue.  Even ddb's kill command doesn't help.
> > 
> > So this is caused by multiple issues in ptrace, the single thread API
> > and the sleep machinery. The fundamental issue is that the ps_singlecnt
> > gets off and egdb hangs in single_thread_wait().
> > 
> > The issue at hand is that single_thread_check() for suspends while in deep
> > is just not right. It will decrement ps_singlecnt but if the proc was
> > already in SSLEEP then that decrement was already done in
> > single_thread_set().
> 
> I don't fully understand.  Are you saying this is a race because two
> threads are in the middle of sleep_setup()?   One is flagged as SSLEEP
> but it currently running and that is why the decrement in single_thread_set()
> is wrong an causes an off-by-one?

No the race is between single_thread_set() and a thread going to a PCATCH
sleep.  At the moment single_thread_set() runs the other thread can be SRUN,
SONPROC or SSLEEP (with and without P_WSLEEP aka before or after the
mi_switch()). The problem is that in suspend mode sleep_signal_check()
does not let the sleep machinery finish but instead does the SSTOP
mi_switch itself. I tried to fix this differently but the result is not
pretty. This was the least nasty fix.

For all cases but SSLEEP the ps_singlecnt decrement needs to
happen in sleep_signal_check() but for SSLEEP the situation depends on the
fact if sleep_signal_check() was called or not.

On top of this there can be concurrent wakeup calls which cause
sleep_finish to not mi_switch(). Which is another can of worms.

The suspend case in single_thread_check() while deep is just horrible. I
would like to move that out but that is again harder then expected. Since
the right place to do this has locks that wont allow us to call wakeup.

> > Here is a possible solution to this problem by adding yet another proc
> > flag to track the time between sleep_setup and single_thread_check (via
> > sleep_signal_check). Using this information single_thread_set() can be
> > adjusted to no decrement ps_singlecnt in that case.
> 
> I wonder if the road to simplify sleep_setup() doesn't start with setting
> `p_stat' to something other than SSLEEP when the thread is placed on the
> sleeping queue.  It seems to me that most of the complexity of the various
> `p_stat' machinery are due to the fact that a thread in SSLEEP might not be
> sleeping.

Using a different p_stat will probably fix some bits but will fail in
other places. It is something to consider but introducing new scheduler
states is yet another can of worms better left closed.
See below on the reason why both P_WSLEEP and P_SINGLESLEEP are needed.
The same problem will happen if we introduce a new state e.g. SGOSLEEP.

Now we can go down this road but it will require some adjustments to
mi_switch() so we can mop up a proc after we switched off from it but
before we released the mutex protecting the proc. Doing this is very
invasive.

> That's more or less what the existing P_WSLEEP flag is for.  Can't we
> use that instead of adding P_SINGLESLEEP?

Not really since that will open a new window where the two can race.
P_WSLEEP is cleared in sleep_finish() and protects wakeups from putting a
process onto the runqueue that is still executing.
P_WSLEEP is protected by the sleep mutex (right now SCHED_LOCK).

P_SINGLESLEEP (horrible name but I ran out of ideas) protects us from a
slightly different window. It is from sleep_setup() to
single_thread_check(). To clear P_SINGLESLEEP we need to hold a lock that
prevents single_thread_check() from missing the transition. For simplicity
I used ps_mtx for that. You could use a different lock but the problem is
that single_thread_check() must be called without SCHED_LOCK (or sleep
queue mtx) since it calls wakeup() itself. This is why we need something
special to ensure consistent view for single_thread_set().
 
> > On top of this I had to fix ptrace to better respect gdb's request to
> > progress a single thread or all threads. Without that I just constantly
> > hit egdb asserts because of unexpected progress of unnrelated threads.
> 
> This is another can of worms worth a complete different thread...
> Thanks for digging it!
> 
> That means single_thread_set() can now be called twice in a row with all
> sibling already parked right?

Yup. It always kind of allowed that e.g. exec uses this to unwind then
exit threads.
 
> I doubt the single_thread API is the right tool for PS_TRACED. Stopped
> & traced threads are not parked.  The tracing API already know which
> thread(s) it wants to resume.  Is it possible with the single_thread API
> to resume a single thread which is not the one that is currently ps_single?

>From my very limited understanding of gdb there is no way to continue a
different thread then the one which caused the break.
OpenBSD only supports the all-stop model for gdb. We could probably support
some extra option for 'scheduler-locking' but this is for someone else to fix.

The problem is that gdb needs to clear the breakpoint the thread hit by
single stepping that thread over that instruction. After that it can
reinstall the breakpoint and let the world run again. (Else an other
thread my run pass the breakpoint without hitting it).

If the system does not support PT_STEP then this is done by injecting one
or two new breakpoints do a PT_CONTINUE on the thread and then mopping all
up (removing temp breakpoints, reinsert the real breakpoint) and finally
PT_CONTINUE the process. With PT_STEP the extra breakpoints are not needed
but gdb still does a PT_STEP for the thread and then a PT_CONTINUE.

Now in our ld.lld case with sched_yield as a breakpoint so many threads
hit the breakpoint concurrently that the internal state of gdb gets
confused. This diff prevents that.

> Shouldn't we let the tracing process decided if it unstop on or all
> thread directly?

This is what I try to do. With the caveat that you can only continue the
thread that caused the stop. We only need to support the steps required
by the all-stop model. At least unless someone steps up and does the work.
 
> However if we need to communicate more information between the tracing &
> traced processes, then I'd suggest extending `ps_xsig' rather than add
> a P_TRACESINGLE flag.

ps_xsig is already bad since it is in the wrong spot. So extending it will
require to fix this first. Again if someone wants to do this, sure but
this is a simple solution that works. We could add a check in ptrace that
ensures that only the thread in ps_single can be advanced individually.
Doing that should be easy.
 
> > With this I can add the sched_yield break point and continue through the
> > full run. Does this break other stuff?
> > Most probably :) there is a lot of bits that are not quite right when it
> > comes to signals and multiple threads.
> 
> Comments below.

... see below
 
> > Index: kern/kern_sig.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_sig.c,v
> > diff -u -p -r1.333 kern_sig.c
> > --- kern/kern_sig.c 22 Jul 2024 09:43:47 -0000      1.333
> > +++ kern/kern_sig.c 24 Jul 2024 14:14:06 -0000
> > @@ -2077,6 +2081,9 @@ single_thread_check_locked(struct proc *
> >  
> >     MUTEX_ASSERT_LOCKED(&pr->ps_mtx);
> >  
> > +   if (deep)
> > +           atomic_clearbits_int(&p->p_flag, P_SINGLESLEEP);
> 
> I don't understand why this is needed and how that fits in the
> description of the problem.
> 

So P_SINGLESLEEP is there to mark the process between sleep_setup() and
single_thread_check_locked(). This flag MUST be cleared with pr->ps_mtx
held or we have a race between the clearing of the flag and
single_thread_set(). This is why it is needed, it is the fundamental fix.
The if (deep) check could be extended to also look for P_SINTR or
P_SINGLESLEEP.

> >     if (pr->ps_single == NULL || pr->ps_single == p)
> >             return (0);
> >  

> > @@ -2263,6 +2277,8 @@ single_thread_clear(struct proc *p, int 
> >              */
> >             SCHED_LOCK();
> >             if (q->p_stat == SSTOP && (q->p_flag & flag) == 0) {
> > +                   if (flag == 0)
> > +                           atomic_clearbits_int(&q->p_flag, P_SUSPSIG);
> 
> OMG that's scary.  I dislike the fact that the single_thread API now
> grows signal logic.  I don't understand why this is needed.  If a thread
> has been stopped by signal, we should clear that somewhere else.

Welcome to SIGSTOP hell. This is here mainly for my own sanity and is not
really needed (since nothing checks P_SUSPSIG). It just confused me that
traced thread had P_SUSPSIG set kind of randomly.

The problem here is that SIGSTOP is able to hit multiple threads before
single_thread_set() is able to step in. Especially threads in SSLEEP will
see this since the transition to SSTOP is done by ptsignal().
Now ptrace() does not proper delivery of SIGCONT, it just does a
unsleep(); setrunnable() for the thread. So kettenis@ abused
single_thread_clear() to also simulate a SIGCONT. IMO this is fair but he
forgot to clear P_SUSPSIG.

Now I'm working on fixing SIGSTOP and only deliver stop and continue
signals to a single thread. I think that will clean up this mess for good
(I hope).
 
> >                     if (q->p_wchan == NULL)
> >                             setrunnable(q);
> >                     else {
> > Index: kern/kern_synch.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_synch.c,v
> > diff -u -p -r1.206 kern_synch.c
> > --- kern/kern_synch.c       23 Jul 2024 08:38:02 -0000      1.206
> > +++ kern/kern_synch.c       24 Jul 2024 14:14:06 -0000
> > @@ -399,15 +399,18 @@ sleep_finish(int timo, int do_sleep)
> >      */
> >     if (p->p_wchan == NULL)
> >             do_sleep = 0;
> > +   KASSERT((p->p_flag & P_SINGLESLEEP) == 0);
> >     atomic_clearbits_int(&p->p_flag, P_WSLEEP);
> >  
> > +   /* If requested to stop always force a stop even if do_sleep == 0 */
> > +   if (p->p_stat == SSTOP)
> > +           do_sleep = 1;
> 
> This is also scary.  The problem with the current scheme is that we don't
> know who changed `p_stat' and if we already did our context switch or
> not. 

It is scary indeed. I would prefer if sleep_signal_check() would not
randomly mi_switch() away behind our back. I first tried that but it is
harder then you think.
 
> > @@ -420,10 +423,7 @@ sleep_finish(int timo, int do_sleep)
> >     p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
> >     SCHED_UNLOCK();
> >  
> > -   /*
> > -    * Even though this belongs to the signal handling part of sleep,
> > -    * we need to clear it before the ktrace.
> > -    */
> > +   /* Must clear this before hitting another sleep point. */
> >     atomic_clearbits_int(&p->p_flag, P_SINTR);
> 
> Nothing related but I believe the P_SINTR comment above cursig()
> regarding locking is no longer true, right?

No that comment is outdated. I noticed it as well (diff in some other
tree).
 
I will circle back into this at some point and mop up some of the bits we
talked about but right now this is a minimal fix (actually two independent
fixes).
-- 
:wq Claudio

Reply via email to