On 5/16/11 6:05 PM, Max Laier wrote:
On Monday 16 May 2011 17:54:54 Attilio Rao wrote:
2011/5/16 Max Laier <m...@love2party.net>:
On Monday 16 May 2011 16:46:03 John Baldwin wrote:
On Monday, May 16, 2011 4:30:44 pm Max Laier wrote:
On Monday 16 May 2011 14:21:27 John Baldwin wrote:
Yes, we need to fix that.  Humm, it doesn't preempt when you do a
critical_exit() though?  Or do you use a hand-rolled critical exit
that doesn't do a deferred preemption?

Right now I just did a manual td_critnest++/--, but I guess ...

Ah, ok, so you would "lose" a preemption.  That's not really ideal.

Actually, I'm curious how the spin unlock inside the IPI could yield
the CPU.  Oh, is rmlock doing a wakeup inside the IPI handler?  I
guess that is ok as long as the critical_exit() just defers the
preemption to the end of the IPI handler.

... the earliest point where it is safe to preempt is after doing the

    atomic_add_int(&smp_rv_waiters[2], 1);

so that we can start other IPIs again.  However, since we don't accept
new IPIs until we signal EOI in the MD code (on amd64), this might
still not be a good place to do the yield?!?
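For reference, that spot in smp_rendezvous_action() looks roughly like this
(heavily abridged sketch of sys/kern/subr_smp.c; the entry barriers, setup
and teardown handling are elided, and the function name here is made up):

        static void
        smp_rendezvous_action_sketch(void)
        {
                /* ... entry barrier, run the setup function ... */

                /* Run the action function, e.g. rm_cleanIPI() for rmlocks. */
                if (smp_rv_action_func != NULL)
                        smp_rv_action_func(smp_rv_func_arg);

                /*
                 * Signal the exit rendezvous.  Only after this increment
                 * can the initiating CPU consider the rendezvous done and
                 * a new one be started, so this is the earliest point at
                 * which yielding this CPU would not stall the others ...
                 */
                atomic_add_int(&smp_rv_waiters[2], 1);

                /*
                 * ... but the local APIC has not been EOI'd yet (that
                 * happens in the MD code after this returns), hence the
                 * doubt above about yielding even here.
                 */

                /* ... optionally wait for and run the teardown function ... */
        }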

Hmm, yeah, you would want to do the EOI before you yield.  However, we
could actually move the EOI up before calling the MI code so long as we
leave interrupts disabled for the duration of the handler (which we do).
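Something along these lines, that is (just a C sketch of the ordering; the
real handler is assembly in sys/amd64/amd64/apic_vector.S and the function
name below is made up):

        /*
         * Interrupts stay disabled for the whole handler, so sending the
         * EOI early cannot cause the IPI to nest.
         */
        static void
        rendezvous_ipi_handler_sketch(void)
        {
                lapic_eoi();                    /* ack the IPI first ...   */
                smp_rendezvous_action();        /* ... then do the MI work */
                /* A deferred preemption would be safe to take here. */
        }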

The spin unlock boils down to a critical_exit() and unless we did a
critical_enter() at some point during the rendezvous setup, we will
yield() if we owepreempt.  I'm not quite sure how that can happen, but
it seems like there is a path that allows the scheduler to set it from
a foreign CPU.
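(For context, the spin-unlock path in question is roughly the following;
simplified from sys/amd64/amd64/machdep.c, with the exact ordering and MD
details glossed over:)

        void
        spinlock_exit(void)                     /* simplified sketch */
        {
                struct thread *td = curthread;

                /*
                 * This critical_exit() is where the deferred preemption
                 * fires: it will mi_switch() right here if td_owepreempt
                 * is set.
                 */
                critical_exit();
                /* Restore interrupts once the last spin lock is dropped. */
                td->td_md.md_spinlock_count--;
                if (td->td_md.md_spinlock_count == 0)
                        intr_restore(td->td_md.md_saved_flags);
        }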

No, it is only set on curthread by curthread.  This is actually my main
question.  I've no idea how this could happen unless the rmlock code is
actually triggering a wakeup or sched_add() in its rendezvous handler.

I don't see anything in rm_cleanIPI() that would do that however.
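(For reference, rm_cleanIPI() is roughly the following, abridged from
sys/kern/kern_rmlock.c; it only moves matching trackers onto the lock's
active-readers list under rm_spinlock, no wakeups:)

        static void
        rm_cleanIPI(void *arg)                  /* abridged */
        {
                struct rmlock *rm = arg;
                struct rm_priotracker *tracker;
                struct rm_queue *queue;
                struct pcpu *pc = pcpu_find(curcpu);

                /* Walk this CPU's tracker list looking for readers of rm. */
                for (queue = pc->pc_rm_queue.rmq_next;
                    queue != &pc->pc_rm_queue; queue = queue->rmq_next) {
                        tracker = (struct rm_priotracker *)queue;
                        if (tracker->rmp_rmlock == rm &&
                            tracker->rmp_flags == 0) {
                                tracker->rmp_flags = RMPF_ONQUEUE;
                                mtx_lock_spin(&rm_spinlock);
                                LIST_INSERT_HEAD(&rm->rm_activeReaders,
                                    tracker, rmp_qentry);
                                mtx_unlock_spin(&rm_spinlock);
                        }
                }
        }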

I wonder if your original issue was really fixed just by the first
patch you had which fixed the race in smp_rendezvous()?

I found the stack that led me to this patch in the first place:

#0  sched_switch (td=0xffffff011a970000, newtd=0xffffff006e6784b0, flags=4)
    at src/sys/kern/sched_ule.c:1939
#1  0xffffffff80285c7f in mi_switch (flags=6, newtd=0x0)
    at src/sys/kern/kern_synch.c:475
#2  0xffffffff802a2cb3 in critical_exit ()
    at src/sys/kern/kern_switch.c:185
#3  0xffffffff80465807 in spinlock_exit ()
    at src/sys/amd64/amd64/machdep.c:1458
#4  0xffffffff8027adea in rm_cleanIPI (arg=<value optimized out>)
    at src/sys/kern/kern_rmlock.c:180
#5  0xffffffff802b9887 in smp_rendezvous_action ()
    at src/sys/kern/subr_smp.c:402
#6  0xffffffff8045e2a4 in Xrendezvous ()
    at src/sys/amd64/amd64/apic_vector.S:235
#7  0xffffffff802a2c6e in critical_exit ()
    at src/sys/kern/kern_switch.c:179
#8  0xffffffff804365ba in uma_zfree_arg (zone=0xffffff009ff4b5a0,
    item=0xffffff000f34cd40, udata=0xffffff000f34ce08)
    at src/sys/vm/uma_core.c:2370
.
.
.

and now that I look at it again, it is clear that critical_exit() just
isn't interrupt safe.  I'm not sure how to fix that, yet ... but this:


        if (td->td_critnest == 1) {
                td->td_critnest = 0;
                if (td->td_owepreempt) {
                        td->td_critnest = 1;

clearly doesn't work.

I'm sorry if I didn't reply to the whole rendezvous thread, but as
long as there are so many people taking care of it, I'll stay hidden in
my corner.

I just wanted to say that I think you are misunderstanding what a
critical section is supposed to do.

When an interrupt fires, it runs on the old "interrupted/kernel context",
which means it has no context of its own.  That is the reason why we
disable interrupts for spinlocks (or use a similar workaround on !x86
architectures), and this is why spinlocks are the only protection usable
in code that runs in interrupt context.

Preempting just means another thread will be scheduled in the middle
of another thread's execution path.

This code is perfectly fine if you consider that curthread won't be
descheduled while it is executing.

Well, no - it is not.  With this you can end up with a curthread that has
td_critnest=0 and td_owepreempt=1 in interrupt context.  If you use a spinlock
on such a thread, it will do the preemption at the point where you drop the
spinlock, which is bad in some circumstances.  One example is the smp_rendezvous
case we are discussing here.
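To spell out the window against the snippet above (the comments are mine,
and the interleaving is my reading of the stack trace earlier in this mail):

        if (td->td_critnest == 1) {
                /* td_owepreempt may already be 1 at this point. */
                td->td_critnest = 0;
                /*
                 * Window: the rendezvous IPI can fire right here, before
                 * the owepreempt check below runs.  Inside the handler,
                 * rm_cleanIPI() takes and drops rm_spinlock; that
                 * mtx_unlock_spin() -> spinlock_exit() -> critical_exit()
                 * sees td_critnest go back to 0 with td_owepreempt still
                 * set and does the mi_switch() inside the IPI handler,
                 * before the exit rendezvous is signalled -- which is
                 * exactly the stack above.
                 */
                if (td->td_owepreempt) {
                        td->td_critnest = 1;
                        ...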

Yes, the 'owepreempt' flag can leak because we allow it to be set while td_critnest is 0. That is the problem, and I think we added this as an optimization in the last round of changes to this routine to avoid excessive context switches. However, I think we will just have to revert that optimization and prevent that state from occurring.

Hmm, the log message for the change said it was to avoid races.

http://svnweb.freebsd.org/base?view=revision&revision=144777

That is what introduced the bug of mixing those two states. Hmm, according to my old e-mails, the reason for moving the 'td_critnest = 0' assignment up was to avoid a problem with losing preemptions. I think the best way to fix this is to clear owepreempt earlier, while td_critnest is still 1, and cache it locally in critical_exit(). So something like this:

Index: kern/sched_ule.c
===================================================================
--- kern/sched_ule.c    (revision 221536)
+++ kern/sched_ule.c    (working copy)
@@ -1785,7 +1785,7 @@
        td->td_oncpu = NOCPU;
        if (!(flags & SW_PREEMPT))
                td->td_flags &= ~TDF_NEEDRESCHED;
-       td->td_owepreempt = 0;
+       MPASS(td->td_owepreempt == 0);
        tdq->tdq_switchcnt++;
        /*
         * The lock pointer in an idle thread should never change.  Reset it
Index: kern/kern_switch.c
===================================================================
--- kern/kern_switch.c  (revision 221536)
+++ kern/kern_switch.c  (working copy)
@@ -192,15 +192,22 @@
 critical_exit(void)
 {
        struct thread *td;
-       int flags;
+       int flags, owepreempt;

        td = curthread;
        KASSERT(td->td_critnest != 0,
            ("critical_exit: td_critnest == 0"));

        if (td->td_critnest == 1) {
+               owepreempt = td->td_owepreempt;
+               td->td_owepreempt = 0;
+               /*
+                * XXX: Should move compiler_memory_barrier() from
+                * rmlock to a header.
+                */
+               __asm __volatile("":::"memory");
                td->td_critnest = 0;
-               if (td->td_owepreempt) {
+               if (owepreempt) {
                        td->td_critnest = 1;
                        thread_lock(td);
                        td->td_critnest--;
Index: kern/kern_synch.c
===================================================================
--- kern/kern_synch.c   (revision 221536)
+++ kern/kern_synch.c   (working copy)
@@ -400,9 +400,7 @@
        if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
                mtx_assert(&Giant, MA_NOTOWNED);
 #endif
-       KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
-           (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
-           newtd == NULL) || panicstr,
+       KASSERT(td->td_critnest == 1 || panicstr,
            ("mi_switch: switch in a critical section"));
        KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
            ("mi_switch: switch must be voluntary or involuntary"));
Index: kern/sched_4bsd.c
===================================================================
--- kern/sched_4bsd.c   (revision 221536)
+++ kern/sched_4bsd.c   (working copy)
@@ -947,7 +947,7 @@
        td->td_lastcpu = td->td_oncpu;
        if (!(flags & SW_PREEMPT))
                td->td_flags &= ~TDF_NEEDRESCHED;
-       td->td_owepreempt = 0;
+       MPASS(td->td_owepreempt == 0);
        td->td_oncpu = NOCPU;

        /*

Hmm, I'm also not sure if I really need the compiler barrier since td_owepreempt and td_critnest are marked volatile?
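(The barrier in question is only a compiler-level fence; if it does move to
a header, it would presumably look something like this, name per the existing
one in kern_rmlock.c:)

        /*
         * Keep the compiler from reordering memory accesses across this
         * point; no CPU fence instruction is emitted.
         */
        static __inline void
        compiler_memory_barrier(void)
        {

                __asm __volatile("" : : : "memory");
        }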

--
John Baldwin
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"

Reply via email to