On Thu, Sep 09, 2021 at 04:10:31PM +0200, Mark Kettenis wrote:
> > Date: Mon, 6 Sep 2021 21:43:29 +0200
> > From: Patrick Wildt <patr...@blueri.se>
> > 
> > On Fri, Jul 30, 2021 at 07:55:29PM +0200, Alexander Bluhm wrote:
> > > On Mon, Jul 26, 2021 at 08:12:39AM -0500, Scott Cheloha wrote:
> > > > On Fri, Jun 25, 2021 at 06:09:27PM -0500, Scott Cheloha wrote:
> > > > 1 month bump.  I really appreciate the tests I've gotten so far, thank
> > > > you.
> > > 
> > > On my Xeon machine it works and all regress tests pass.
> > > 
> > > But it fails on my old Opteron machine.  It hangs after attaching
> > > cpu1.
> > 
> > This seems to be caused by contention on the mutex in i8254's gettick().
> > 
> > With Scott's diff, delay_func is i8254_delay() on that old AMD machine.
> > Its gettick() implementation uses a mutex to protect I/O access to the
> > i8254.
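> > 
> > The pattern is roughly this (a paraphrased sketch; the port constants
> > here are illustrative, not the exact code):
> > 
> >     u_int
> >     gettick(void)
> >     {
> >             u_char lo, hi;
> > 
> >             /* Serialize I/O access to the i8254 across CPUs. */
> >             mtx_enter(&timer_mutex);
> >             /* Latch counter 0, then read low and high byte. */
> >             outb(IO_TIMER1 + TIMER_MODE, TIMER_SEL0 | TIMER_LATCH);
> >             lo = inb(IO_TIMER1 + TIMER_CNTR0);
> >             hi = inb(IO_TIMER1 + TIMER_CNTR0);
> >             mtx_leave(&timer_mutex);
> > 
> >             return (hi << 8) | lo;
> >     }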
> > 
> > When secondary CPUs come up, they will wait for CPU0 to let them boot up
> > further by checking for a flag:
> > 
> >     /*
> >      * We need to wait until we can identify, otherwise dmesg
> >      * output will be messy.
> >      */
> >     while ((ci->ci_flags & CPUF_IDENTIFY) == 0)
> >             delay(10);
> > 
> > Now that machine has 3 secondary cores that are spinning like that.  At
> > the same time CPU0 waits for the core to come up:
> > 
> >     /* wait for it to identify */
> >     for (i = 2000000; (ci->ci_flags & CPUF_IDENTIFY) && i > 0; i--)
> >             delay(10);
> > 
> > That means we have 3-4 cores spinning just to be able to delay().  Our
> > mutex implementation isn't fair, which means whoever manages to claim
> > the free mutex wins.  So if CPU2 and CPU3 are spinning all the time,
> > while CPU1 is identifying itself (and thus calling delay()) and CPU0
> > is waiting for CPU1, the CPU that actually needs to make progress may
> > never get the mutex.
> > 
> > I changed those delay(10) calls in cpu_hatch() to CPU_BUSY_CYCLE() and
> > the boot got a bit further instead of hanging forever.
> > 
> > Then I remembered an idea from years ago: fair kernel mutexes, so
> > basically mutexes implemented as a ticket lock, like our kernel lock.
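> > 
> > The rough shape of the idea is something like this (just a sketch of
> > the concept, not the actual diff; the names are made up and real
> > mutexes also deal with the IPL, which is omitted here):
> > 
> >     struct ticket_mtx {
> >             volatile u_int  tm_ticket;      /* next ticket to hand out */
> >             volatile u_int  tm_serving;     /* ticket currently served */
> >     };
> > 
> >     void
> >     ticket_mtx_enter(struct ticket_mtx *m)
> >     {
> >             u_int t;
> > 
> >             /* Take a ticket; acquisition order becomes FIFO. */
> >             t = atomic_inc_int_nv(&m->tm_ticket) - 1;
> > 
> >             /* Spin until it is our turn. */
> >             while (m->tm_serving != t)
> >                     CPU_BUSY_CYCLE();
> >             membar_enter();
> >     }
> > 
> >     void
> >     ticket_mtx_leave(struct ticket_mtx *m)
> >     {
> >             membar_exit();
> >             /* Hand the lock to the next waiter in line. */
> >             m->tm_serving++;
> >     }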
> > 
> > I did a quick diff, which probably contains a million bugs, but with
> > it, bluhm's machine boots up fine.
> > 
> > I'm not saying this is the solution, but it might be.
> 
> So the idea really is that the kernel mutexes are cheap and simple
> spin locks.  The assumption has always been that there shouldn't be a
> lot of contention on them.  If you have contention, your locking
> probably isn't fine-grained enough, or you're using the wrong lock
> type.  Note that our mpsafe pmaps use a per-page mutex.  So increasing
> the size of struct mutex is going to have a significant impact.
> 
> Maybe we need another lock type, although we already have one that
> tries to be "fair": struct __mp_lock, which is what we use for the
> kernel lock and the scheduler lock.  A non-recursive version of that
> might make sense.

After further testing I have come to the conclusion that changing
our mutexes into ticket locks is a flawed concept.  There are lock
order issues with the kernel lock, because the spinning path can be
interrupted by code that takes the kernel lock.

CPU0: has the mutex, will soon release it
CPU1: gets the next ticket, spins on the mutex, gets an IRQ that tries
      to take the kernel lock
CPU2: has the kernel lock, tries to get the mutex, gets the ticket after
      CPU1, spins on the mutex

So now CPU2 is waiting for CPU1 to be done with the mutex, but since
CPU1 is waiting for CPU2 to be done with the kernel lock, we're stuck.

As long as a mutex can be interrupted by stuff that takes the kernel
lock, this concept won't work.

Back to the original problem: contention on the i8254_delay lock.

Changing it to an mp lock could work, because then it would be fair.
Does it need to be fair?  If we reduce the contention, there's no need
to change the lock.
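
For reference, making it fair would mean taking something like a
struct __mp_lock around the i8254 I/O instead of the mutex (just a
sketch, the variable name is made up):

	struct __mp_lock timer_lock;

	__mp_lock_init(&timer_lock);

	/* in gettick(): */
	__mp_lock(&timer_lock);
	/* ... latch and read the i8254 counter ... */
	__mp_unlock(&timer_lock);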

So there are two things one could look at.  First, make use of
CPU_BUSY_CYCLE() instead of delay(10) when spinning up cores, as
sketched below.  Second, try not to use i8254_delay as delay_func.
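
For the first point, the hatch loop would simply spin on the flag
directly, roughly like this (untested sketch):

	/*
	 * We need to wait until we can identify, otherwise dmesg
	 * output will be messy.
	 */
	while ((ci->ci_flags & CPUF_IDENTIFY) == 0)
		CPU_BUSY_CYCLE();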

Patrick
