On Wed, Aug 13, 2014 at 06:00:49AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 13, 2014 at 11:14:39AM +0530, Amit Shah wrote:
> > On (Tue) 12 Aug 2014 [14:41:51], Paul E. McKenney wrote:
> > > On Tue, Aug 12, 2014 at 02:39:36PM -0700, Paul E. McKenney wrote:
> > > > On Tue, Aug 12, 2014 at 09:06:21AM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Aug 12, 2014 at 11:03:21AM +0530, Amit Shah wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > I know of only virtio-console doing this (via userspace only,
> > > > > > though).
> > > > > 
> > > > > As in userspace within the guest?  That would not work.  The userspace
> > > > > that the qemu is running in might.  There is a way to extract ftrace
> > > > > info from crash dumps, so one approach would be "sendkey alt-sysrq-c",
> > > > > then pull the buffer from the resulting dump.  For all I know, there
> > > > > might also be some script that uses the qemu "x" command to get at the
> > > > > ftrace buffer.
> > > > > 
> > > > > Again, I cannot reproduce this, and I have been through the code several
> > > > > times over the past few days, and am not seeing it.  I could start
> > > > > sending you random diagnostic patches, but it would be much better if
> > > > > we could get the trace data from the failure.
> > 
> > I think the only recourse I now have is to dump the guest state from
> > qemu, and attempt to find the ftrace buffers by poking pages and
> > finding some ftrace-like struct... and then dumping the buffers.
> 
> The data exists in the qemu guest state, so it would be good to have
> it one way or another.  My current (perhaps self-serving) guess is that
> you have come up with a way to trick qemu into dropping IPIs.

Oh, and I wrote up my last inspection of the nocb code.  Please see below.

                                                        Thanx, Paul

------------------------------------------------------------------------

Given that specifying rcu_nocb_poll avoids the hang, the natural focus
is on checks for rcu_nocb_poll:

o       __call_rcu_nocb_enqueue() skips the wakeup if rcu_nocb_poll,
        which is legitimate because in polling mode the rcuo kthreads
        wake up periodically on their own rather than relying on wakeups.
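
        A rough sketch of the shape of that path (paraphrased from
        memory rather than quoted from tree_plugin.h; the argument list
        is trimmed, and the wakeup heuristics are collapsed into a
        single hypothetical helper, nocb_needs_wakeup()):

        /* Sketch only, not verbatim kernel code. */
        static void __call_rcu_nocb_enqueue(struct rcu_data *rdp,
                                            struct rcu_head *rhp)
        {
                /* ...append rhp to rdp's no-CBs queue, update counts... */

                if (rcu_nocb_poll || !rdp->nocb_kthread)
                        return;  /* Polling (or no kthread yet): no wakeup. */
                if (nocb_needs_wakeup(rdp))  /* hypothetical stand-in for
                                                the real wakeup heuristics */
                        wake_nocb_leader(rdp, false);
        }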

o       nocb_leader_wait() does wait_event_interruptible() on
        my_rdp->nocb_leader_wake if !rcu_nocb_poll.  So one further
        question is the handling of ->nocb_leader_wake.

        The thing to check for is whether ->nocb_leader_wake can get set
        to true without an accompanying wakeup while the leader sleeps,
        as this would clearly lead to a hang.  Checking each use:

        o       wake_nocb_leader() tests ->nocb_leader_wake, and
                if false, sets it and does a wakeup.  The set is
                ordered after the test due to control dependencies.
                Multiple followers might concurrently attempt to
                wake their leader, and this can result in multiple
                wakeups, which should be OK -- we only need one
                wakeup, so more won't hurt.

                Here, every time ->nocb_leader_wake is set, we
                follow up with a wakeup, so this particular use
                avoids the sleep-while-set problem.

        o       nocb_leader_wait() waits for ->nocb_leader_wake, as
                noted above.

        o       nocb_leader_wait() checks for spurious wakeups, but
                before sleeping again it clears ->nocb_leader_wake,
                does a memory barrier, and rescans the callback
                queues, setting ->nocb_leader_wake if any of them
                have callbacks.  Either way, it then goes to wait
                again; if it set ->nocb_leader_wake, the wait returns
                immediately, as required.

                This rescan also moves any callbacks it finds to an
                intermediate list for the grace-period-wait operation.
                This ensures that we don't prematurely invoke callbacks
                that arrive while the grace period is in progress.

        o       If the wakeup was real, nocb_leader_wait() clears
                ->nocb_leader_wake, does a memory barrier, and moves
                callbacks from the intermediate lists to the followers'
                lists (including itself, as a leader is its own first
                follower).  During this move, the leader checks for
                new callbacks having arrived during the grace period,
                and sets ->nocb_leader_wake if there are any, again
                short-circuiting the following wait.

        o       Note that nocb_leader_wait() never sets ->nocb_leader_wake
                unless it has found callbacks waiting for it, and that
                setting ->nocb_leader_wake short-circuits the next wait,
                so that a wakeup is not required.

                Note also that every time nocb_leader_wait() clears
                ->nocb_leader_wake, it does a memory barrier and
                scans all the lists before sleeping again.  This means
                that if a wakeup arrives just before nocb_leader_wait()
                clears ->nocb_leader_wake, then nocb_leader_wait() will
                be guaranteed to see the newly arrived callback, and
                thus not need a wakeup.
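
        To make the lost-wakeup reasoning above concrete, here is a
        minimal userspace model of the clear-flag/barrier/rescan
        discipline (plain C11 with pthreads; this illustrates the
        pattern only, it is not kernel code, and every name in it is
        invented for the example):

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdbool.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
        static atomic_bool leader_wake;  /* models ->nocb_leader_wake */
        static atomic_long n_queued;     /* models the callback queues */

        /* Follower/caller side: enqueue work, then wake the leader
         * if nobody else already did. */
        static void enqueue_and_wake(void)
        {
                atomic_fetch_add(&n_queued, 1);    /* queue update first... */
                if (!atomic_load(&leader_wake)) {  /* ...then test the flag */
                        atomic_store(&leader_wake, true);
                        pthread_mutex_lock(&lock);
                        pthread_cond_signal(&cv);  /* every set gets a wakeup */
                        pthread_mutex_unlock(&lock);
                }
        }

        /* Leader side: clear the flag, then rescan before sleeping, so
         * that an enqueue racing with the clear is still seen and the
         * sleep is skipped. */
        static void leader_wait(void)
        {
                atomic_store(&leader_wake, false);
                atomic_thread_fence(memory_order_seq_cst); /* smp_mb() */
                if (atomic_load(&n_queued) > 0)
                        return;  /* work found: don't sleep */
                pthread_mutex_lock(&lock);
                while (!atomic_load(&leader_wake))
                        pthread_cond_wait(&cv, &lock);
                pthread_mutex_unlock(&lock);
        }

        A caller of enqueue_and_wake() either finds leader_wake clear
        and does the wakeup itself, or finds it already set by someone
        who did; either way the leader cannot be left sleeping on a
        non-empty queue, which is the property the kernel code needs.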

o       Because nocb_leader_wait() always checks the queue head, it is
        necessary that the callers of wake_nocb_leader() ensure that
        the head is updated before the call to wake_nocb_leader().
        The current code does not guarantee this ordering, so an
        smp_mb__after_atomic() is added after the atomic_long_add()
        in __call_rcu_nocb_enqueue() to fix this.  Note that
        atomic_long_add() does not include a barrier() directive, so
        this misordering could in theory happen even on
        strongly-ordered systems, courtesy of the compiler.
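
        The intended pairing then looks roughly like this (a sketch of
        the ordering argument, not a literal diff):

        /* Enqueue side: make the queue update visible before deciding
         * whether to wake the leader. */
        atomic_long_add(rhcount, &rdp->nocb_q_count);
        smp_mb__after_atomic();        /* order update before flag test */
        wake_nocb_leader(rdp, false);  /* only when a wakeup is needed */

        /* Leader side, in nocb_leader_wait(): clear the flag, then
         * rescan, so that any enqueue ordered by the barrier above is
         * seen before the decision to sleep. */
        rdp->nocb_leader_wake = false;
        smp_mb();                      /* pairs with the barrier above */
        /* ...rescan the ->nocb_head queues before sleeping... */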

o       The leader-follower wakeup path also needs attention, as it
        references rcu_nocb_poll.

        nocb_follower_wait() does a wait on rdp->nocb_follower_head, and
        ensures ordering with the later smp_load_acquire() from that same
        variable.

        The wakeup happens in nocb_leader_wait().  This has the same
        sequence as __call_rcu_nocb_enqueue(), but there is no _wake
        flag.  This means that the only way that confusion could result
        is if the assignment to *tail were moved to after the wake_up().
        Now, wake_up() unconditionally acquires a lock, but in theory
        that only guarantees that stuff before the lock acquisition
        happens before the lock release.  So we might as well add an
        smp_mb__after_atomic() here as well.  (This will not affect
        x86, which has strong ordering on lock acquisition.)
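
        For reference, the handoff sequence in question looks roughly
        like this (again a from-memory sketch, not the literal code,
        with the count updates reduced to a comment):

        /* Leader, in nocb_leader_wait(): hand post-grace-period
         * callbacks to a follower, then wake it if its list was
         * previously empty. */
        tail = xchg(&rdp->nocb_follower_tail, rdp->nocb_gp_tail);
        *tail = rdp->nocb_gp_head;   /* splice callbacks onto the list */
        /* ...atomic_long_add() of the callback counts goes here... */
        smp_mb__after_atomic();      /* proposed addition, as above */
        if (rdp != my_rdp && tail == &rdp->nocb_follower_head)
                wake_up(&rdp->nocb_wq);  /* list was empty: wake it */

        /* Follower, in nocb_follower_wait(): sleep until callbacks
         * appear; the list itself is picked up later with
         * smp_load_acquire() of the same variable. */
        wait_event_interruptible(rdp->nocb_wq,
                                 ACCESS_ONCE(rdp->nocb_follower_head));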

o       Check the uses of do_nocb_deferred_wakeup().  If any required
        invocation has been omitted, then a wakeup could be lost.

        This is invoked on the next entry to an extended quiescent state
        and on exit from __rcu_process_callbacks().  It is checked on
        every scheduling-clock interrupt via
        rcu_nocb_need_deferred_wakeup(), and if needed, RCU_SOFTIRQ is
        raised, which will result in a later invocation of
        __rcu_process_callbacks().

        So the only vulnerability is a call_rcu() with irqs disabled
        from idle.  This needs a check.  Note that there is a similar
        check in __call_rcu_core(), though not one covering irqs
        disabled.  Check added.
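
        A sketch of the deferred-wakeup arrangement described above
        (paraphrased, with the exact conditions simplified; the field
        and helper names follow the mainline code from memory):

        /* Enqueue path, when it is not safe to wake a kthread directly
         * (for example, irqs disabled): record that a wakeup is owed. */
        if (irqs_disabled_flags(flags)) {
                rdp->nocb_defer_wakeup = true;
                return;
        }
        wake_nocb_leader(rdp, false);

        /* Paid back later, from a context where a wakeup is safe; the
         * scheduling-clock interrupt checks the flag and raises
         * RCU_SOFTIRQ, whose handler ends up calling this. */
        static void do_nocb_deferred_wakeup(struct rcu_data *rdp)
        {
                if (!rcu_nocb_need_deferred_wakeup(rdp))
                        return;
                rdp->nocb_defer_wakeup = false;
                wake_nocb_leader(rdp, false);
        }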
