On Tue, Mar 17, 2015 at 09:42:21AM +0100, Sander Eikelenboom wrote:
> 
> Tuesday, March 17, 2015, 9:18:32 AM, you wrote:
> 
> >>>> On 16.03.15 at 18:59, <konrad.w...@oracle.com> wrote:
> >> Hence was wondering if it would just be easier to put
> >> this patch in (see above) - with the benfit that folks have
> >> an faster interrupt passthrough experience and then I work on another
> >> variant of this with tristate cmpxchg and ->mapping atomic counter.
> 
> > Considering how long this issue has been pending I think we really
> > need to get _something_ in (or revert); if this something is the
> > patch in its most recent form, so be it (even if maybe not the
> > simplest of all possible variants). So please submit as a proper non-
> > RFC patch.
> 
> > Jan
> 
> I'm still running with this first simple stopgap patch from Konrad,
> and it has worked fine for me since.

I believe the patch that Sander and Malcom had been running is the best
candidate.

The other ones I had been fiddling with - such as the one attached
here - I cannot make myself comfortable that it will not hit
a dead-lock. On Intel hardware the softirq is called
from the vmx_resume - which means that the whole 'interrupt guest' and
deliever the event code happens during the VMEXIT to VMENTER time. But that
does not preclude another interrupt destined for this same vCPU
to come right in as we are progressing through the softirqs - and 
dead-lock: in the vmx_resume stack we are in hvm_dirq_assist
(called from dpci_softirq) and haven't cleared the STATE_SHED, while in
the IRQ stack we spin in the raise_sofitrq_for for the STATE_SCHED to be 
cleared.

An dead-lock avoidance could be added to save the CPU value of
the softirq that is executing the dpci. And then 'raise_softirq_for'
can check that and bail out if (smp_processor_id == dpci_pirq->cpu).
Naturlly this means being very careful _where_ we initialize
the 'cpu' to -1, etc - which brings back to carefully work
out the corner cases and make sure we do the right thing - which can
take time.

The re-using the 'dpci' on the per-cpu list is doing the same
exact thing that older tasklet code was doing. That is :
If the function assigned to the tasklet was running  - the softirq
that ran said function (hvm_dirq_assist) would be responsible for
putting the tasklet back on the per-cpu list. This would allow
to have an running tasklet and an 'to-be-scheduled' tasklet
at the same time. 

And that is what we need. I will post an proper patch
and also add Tested-by from Malcom and Sander on it - as it did
fix their test-cases and is unmodified (except an updated
comment) from what theytested in 2014.

> 
> I will see if this new one also "works-for-me", somewhere today :-)
> 
> --
> Sander
> 
> 
> 
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index ae050df..d1421b0 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -804,7 +804,18 @@ static void dpci_softirq(void)
>          d = pirq_dpci->dom;
>          smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
>          if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> -            BUG();
> +        {
> +            unsigned long flags;
> +
> +            /* Put back on the list and retry. */
> +            local_irq_save(flags);
> +            list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
> +            local_irq_restore(flags);
> +
> +            raise_softirq(HVM_DPCI_SOFTIRQ);
> +            continue;
> +        }
> +
>          /*
>           * The one who clears STATE_SCHED MUST refcount the domain.
>           */
> 
>From 6b32dccfbe00518d3ca9cd94d19a6e007b2645d9 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
Date: Tue, 17 Mar 2015 09:46:09 -0400
Subject: [PATCH] dpci: when scheduling spin until STATE_RUN or STATE_SCHED has
 been cleared.

There is race when we clear the STATE_SCHED in the softirq
- which allows the 'raise_softirq_for' (on another CPU)
to schedule the dpci.

Specifically this can happen whenthe other CPU receives
an interrupt, calls 'raise_softirq_for', and puts the dpci
on its per-cpu list (same dpci structure).

There would be two 'dpci_softirq' running at the same time
(on different CPUs) where on one CPU it would be executing
hvm_dirq_assist (so had cleared STATE_SCHED and set STATE_RUN)
and on the other CPU it is trying to call:

   if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
        BUG();

Since STATE_RUN is already set it would end badly.

The reason we can get his with this is when an interrupt
affinity is set over multiple CPUs.

Potential solutions:

a) Instead of the BUG() we can put the dpci back on the per-cpu
list to deal with later (when the softirq are activated again).
This putting the 'dpci' back on the per-cpu list is an spin
until the bad condition clears.

b) We could also expand the test-and-set(STATE_SCHED) in raise_softirq_for
to detect for 'STATE_RUN' bit being set and schedule the dpci
in a more safe manner (delay it). The dpci would stil not
be scheduled when STATE_SCHED bit was set.

c) This patch explores a third option - we will only schedule
the dpci when the state is cleared (no STATE_SCHED and no STATE_RUN).

We will spin if STATE_RUN is set (as it is in progress and will
finish). If the STATE_SCHED is set (so hasn't run yet) we won't
try to spin and just exit. This can cause an dead-lock if the interrupt
comes when we are processing the dpci in the softirq.

Interestingly the old ('tasklet') code used an a) mechanism.
If the function assigned to the tasklet was running  - the softirq
that ran said function (hvm_dirq_assist) would be responsible for
putting the tasklet back on the per-cpu list. This would allow
to have an running tasklet and an 'to-be-scheduled' tasklet
at the same time. This solution moves this 'to-be-scheduled'
job to be done in 'raise_softirq_for' (instead of the
'softirq').

Reported-by: Sander Eikelenboom <li...@eikelenboom.it>
Reported-by: Malcolm Crossley <malcolm.cross...@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
---
 xen/drivers/passthrough/io.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index ae050df..9c30ebb 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -63,10 +63,32 @@ enum {
 static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
 {
     unsigned long flags;
+    unsigned long old;
 
-    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
-        return;
-
+    /*
+     * This cmpxchg spins until the state is zero (unused).
+     */
+    for ( ;; )
+    {
+        old = cmpxchg(&pirq_dpci->state, 0, 1 << STATE_SCHED);
+        switch ( old )
+        {
+        case (1 << STATE_SCHED):
+            /*
+             * Whenever STATE_SCHED is set we MUST not schedule it.
+             */
+            return;
+        case (1 << STATE_RUN) | (1 << STATE_SCHED):
+        case (1 << STATE_RUN):
+            /* Getting close to finish. Spin. */
+            continue;
+        }
+        /*
+         * If the 'state' is 0 (not in use) we can schedule it.
+         */
+        if ( old == 0 )
+            break;
+    }
     get_knownalive_domain(pirq_dpci->dom);
 
     local_irq_save(flags);
-- 
2.1.0

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Reply via email to