If we pass in INTx type devices to a guest on an over-subscribed
machine - and in an over-worked guest - we can cause the
pirq_dpci->softirq_list to become corrupted.

The reason for this is that the 'pt_irq_guest_eoi' ends up
setting the 'state' to zero value. However the 'state' value
(STATE_SCHED, STATE_RUN) is used to communicate between
 'raise_softirq_for' and 'dpci_softirq' to determine whether the
'struct hvm_pirq_dpci' can be re-scheduled. We are ignoring the
teardown path for simplicity for right now. The 'pt_irq_guest_eoi' was
not adhering to the proper dialogue and was not using locked cmpxchg or
test_bit operations and ended setting 'state' set to zero. That
meant 'raise_softirq_for' was free to schedule it while the
'struct hvm_pirq_dpci'' was still on an per-cpu list.
The end result was list_del being called twice and the second call
corrupting the per-cpu list.

For this to occur one of the CPUs must be in the idle loop executing
softirqs and the interrupt handler in the guest must not
respond to the pending interrupt within 8ms, and we must receive
another interrupt for this device on another CPU.

CPU0:                                  CPU1:

timer_softirq_action
 \- pt_irq_time_out
     state = 0;                        do_IRQ
 [out of timer code, the                raise_softirq
 pirq_dpci is on the CPU0 dpci_list]      [adds the pirq_dpci to CPU1
                                           dpci_list as state == 0]

softirq_dpci:                            softirq_dpci:
        list_del
        [list entries are poisoned]
                                                list_del <= BOOM

The fix is simple - enroll 'pt_irq_guest_eoi' to use the locked
semantics for 'state'. We piggyback on pt_pirq_softirq_cancel (was
pt_pirq_softirq_reset) to use cmpxchg. We also expand said function
to reset the '->dom' only on the teardown paths - but not on the
timeouts.

Reported-and-Tested-by: Sander Eikelenboom <li...@eikelenboom.it>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.w...@oracle.com>
---
 xen/drivers/passthrough/io.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index efc66dc..2039d31 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -57,7 +57,7 @@ enum {
  * This can be called multiple times, but the softirq is only raised once.
  * That is until the STATE_SCHED state has been cleared. The state can be
  * cleared by: the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'),
- * or by 'pt_pirq_softirq_reset' (which will try to clear the state before
+ * or by 'pt_pirq_softirq_cancel' (which will try to clear the state before
  * the softirq had a chance to run).
  */
 static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
@@ -97,13 +97,15 @@ bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci 
*pirq_dpci)
 }
 
 /*
- * Reset the pirq_dpci->dom parameter to NULL.
+ * Cancels an outstanding pirq_dpci (if scheduled). Also if clear is set,
+ * reset pirq_dpci->dom parameter to NULL (used for teardown).
  *
  * This function checks the different states to make sure it can do it
  * at the right time. If it unschedules the 'hvm_dirq_assist' from running
  * it also refcounts (which is what the softirq would have done) properly.
  */
-static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+static void pt_pirq_softirq_cancel(struct hvm_pirq_dpci *pirq_dpci,
+                                   unsigned int clear)
 {
     struct domain *d = pirq_dpci->dom;
 
@@ -125,8 +127,13 @@ static void pt_pirq_softirq_reset(struct hvm_pirq_dpci 
*pirq_dpci)
          * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom'
          * in local variable before it sets STATE_RUN - and therefore will not
          * dereference '->dom' which would crash.
+         *
+         * However, if this is called from 'pt_irq_time_out' we do not want to
+         * clear the '->dom' as we can re-use the 'pirq_dpci' after that and
+         * need '->dom'.
          */
-        pirq_dpci->dom = NULL;
+        if ( clear )
+            pirq_dpci->dom = NULL;
         break;
     }
 }
@@ -142,7 +149,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct 
hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->state = 0;
+        pt_pirq_softirq_cancel(pirq_dpci, 0 /* keep dom */);
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -285,7 +292,7 @@ int pt_irq_create_bind(
                      * to be scheduled but we must deal with the one that may 
be
                      * in the queue.
                      */
-                    pt_pirq_softirq_reset(pirq_dpci);
+                    pt_pirq_softirq_cancel(pirq_dpci, 1 /* reset dom */);
                 }
             }
             if ( unlikely(rc) )
@@ -536,9 +543,9 @@ int pt_irq_destroy_bind(
         pirq_dpci->flags = 0;
         /*
          * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
-         * call to pt_pirq_softirq_reset.
+         * call to pt_pirq_softirq_cancel.
          */
-        pt_pirq_softirq_reset(pirq_dpci);
+        pt_pirq_softirq_cancel(pirq_dpci, 1 /* reset dom */);
 
         pirq_cleanup_check(pirq, d);
     }
@@ -773,8 +780,8 @@ unlock:
 }
 
 /*
- * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to
- * doing it. If that is the case we let 'pt_pirq_softirq_reset' do 
ref-counting.
+ * Note: 'pt_pirq_softirq_cancel' can clear the STATE_SCHED before we get to
+ * doing it. If that is the case we let 'pt_pirq_softirq_cancel' do 
ref-counting.
  */
 static void dpci_softirq(void)
 {
-- 
1.9.3


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Reply via email to