Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 01:33:48PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 28, 2013 at 01:22:45PM +0200, Gleb Natapov wrote:
> > On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> > > On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> > > >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> > > >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> > > >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> > > On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> > > >On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> > > >>On 28/11/2013 07:27, Zhanghaoyu (A) wrote:
> > > >Without synchronize_rcu you could have
> > > >
> > > >VCPU writes to routing table
> > > >   e = entry from IRQ routing table
> > > >kvm_irq_routing_update(kvm, new);
> > > >VCPU resumes execution
> > > >   kvm_set_msi_irq(e, &irq);
> > > >   kvm_irq_delivery_to_apic_fast();
> > > >
> > > >where the entry is stale but the VCPU has already resumed execution.
> > > >
> > > >>>If we use call_rcu() (setting aside for the moment the problem that
> > > >>>Gleb pointed out) instead of synchronize_rcu(), should we still
> > > >>>ensure this?
> > > >>The problem is that we should ensure this, so using call_rcu is not
> > > >>possible (even not considering the memory allocation problem).
> > > >>
> > > >Not changing current behaviour is certainly safer, but I am still not
> > > >100% convinced we have to ensure this.
> > > >
> > > >Suppose guest does:
> > > >
> > > >1: change msi interrupt by writing to pci register
> > > >2: read the pci register to flush the write
> > > >3: zero idt
> > > >
> > > >I am pretty certain that this code can get an interrupt after step 2
> > > >on real HW, but I cannot tell whether the guest can rely on it being
> > > >delivered exactly after the read instruction or whether it can be
> > > >delayed by a couple of instructions. It seems to me it would be
> > > >fragile for an OS to depend on this behaviour. AFAIK Linux does not.
> > > >
> > > Linux is safe, it does interrupt migration from within the interrupt
> > > handler.  If you do that before the device-specific EOI, you won't
> > > get another interrupt until programming the MSI is complete.
> > > 
> > > Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> > > 
> > > Using call_rcu() is a better solution than srcu IMO.  Less code
> > > changes, consistently faster.
> > > >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> > > >>>
> > > >>>
> > > >>Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> > > >Zhanghaoyu said that the problem mostly hurts in a real-time telecom
> > > >environment, so I propose how he can fix the problem in his specific
> > > >environment.  It will not fix older userspace obviously, but kernel
> > > >fix will also require kernel update and usually updating userspace
> > > >is easier.
> > > >
> > > >
> > > 
> > > Isn't the latency due to interrupt migration causing long
> > > synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?
> > > 
> > If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in
> > the irq routing table, changing the MSI configuration should not cause an
> > update to the irq routing table (not saying this is what happens with
> > current QEMU, but theoretically there is no reason to update the routing
> > table in this case).
> > 
> > --
> > Gleb.
> 
> Unfortunately all high performance users (vhost net,
> vhost scsi, virtio-blk data plane, vfio) switched to using
> eventfd.
> 
Right :(

> KVM_SIGNAL_MSI is used as a simple mechanism to avoid routing
> table hassles e.g. for hotplug MSIs.
> 

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:31 PM, Paolo Bonzini wrote:

On 28/11/2013 12:23, Gleb Natapov wrote:

Unless what ? :) Unless reader is scheduled out?

Yes.  Or unless my brain is scheduled out in the middle of a sentence.


So we will have to disable preemption in a reader to prevent big latencies for
a writer, no?

I don't think that's necessary.  The writer itself could also be
scheduled out, and the reader critical sections are really small.  Let's
wait for Zhang to try SRCU and report back.



I think readers have preemption disabled already in that context (irqfd writes).



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Paolo Bonzini
On 28/11/2013 12:23, Gleb Natapov wrote:
>>> Unless what ? :) Unless reader is scheduled out?
>>
>> Yes.  Or unless my brain is scheduled out in the middle of a sentence.
>>
> So we will have to disable preemption in a reader to prevent big latencies
> for a writer, no?

I don't think that's necessary.  The writer itself could also be
scheduled out, and the reader critical sections are really small.  Let's
wait for Zhang to try SRCU and report back.
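
For reference, a minimal sketch of what the SRCU read side could look like
(the dedicated srcu_struct and the field name irq_srcu are assumptions here,
not an existing field):

    int idx;
    struct kvm_irq_routing_table *irq_rt;

    idx = srcu_read_lock(&kvm->irq_srcu);
    irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
    /* ... resolve the GSI and deliver the MSI; the critical section
     * covers only this short lookup-and-deliver path ... */
    srcu_read_unlock(&kvm->irq_srcu, idx);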

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:22 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, &irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.
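
To make the race concrete, here is a rough sketch of the read side under
plain RCU (simplified from the kvm irq routing code of that era; details
are elided):

    rcu_read_lock();
    irq_rt = rcu_dereference(kvm->irq_routing);
    if (irq < irq_rt->nr_rt_entries)
            hlist_for_each_entry(e, &irq_rt->map[irq], link)
                    /* deliver using e; e may belong to the old table */;
    rcu_read_unlock();

With synchronize_rcu() the updater blocks until all such readers are done;
with call_rcu() only the kfree() is deferred, so the VCPU that changed the
routing can resume while another reader still holds a stale entry.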


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get an interrupt after step 2 on real HW,
but I cannot tell whether the guest can rely on it being delivered exactly after
the read instruction or whether it can be delayed by a couple of instructions.
It seems to me it would be fragile for an OS to depend on this behaviour.
AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in a real-time telecom
environment, so I propose how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but kernel
fix will also require kernel update and usually updating userspace
is easier.



Isn't the latency due to interrupt migration causing long
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in
the irq routing table, changing the MSI configuration should not cause an
update to the irq routing table (not saying this is what happens with
current QEMU, but theoretically there is no reason to update the routing
table in this case).


I see.  That pushes the problem to userspace, which uses traditional 
locking, so the problem disappears until qemu starts using rcu too to 
manage this.


There is also irqfd, however.  We could also do a KVM_UPDATE_IRQFD to 
change the payload it delivers, but that has exactly the same problems.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Michael S. Tsirkin
On Thu, Nov 28, 2013 at 01:22:45PM +0200, Gleb Natapov wrote:
> On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> > On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> > >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> > >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> > >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> > On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> > >On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> > >>On 28/11/2013 07:27, Zhanghaoyu (A) wrote:
> > >Without synchronize_rcu you could have
> > >
> > >VCPU writes to routing table
> > >   e = entry from IRQ routing table
> > >kvm_irq_routing_update(kvm, new);
> > >VCPU resumes execution
> > >   kvm_set_msi_irq(e, &irq);
> > >   kvm_irq_delivery_to_apic_fast();
> > >
> > >where the entry is stale but the VCPU has already resumed execution.
> > >
> > >>>If we use call_rcu() (setting aside for the moment the problem that
> > >>>Gleb pointed out) instead of synchronize_rcu(), should we still
> > >>>ensure this?
> > >>The problem is that we should ensure this, so using call_rcu is not
> > >>possible (even not considering the memory allocation problem).
> > >>
> > >Not changing current behaviour is certainly safer, but I am still not
> > >100% convinced we have to ensure this.
> > >
> > >Suppose guest does:
> > >
> > >1: change msi interrupt by writing to pci register
> > >2: read the pci register to flush the write
> > >3: zero idt
> > >
> > >I am pretty certain that this code can get an interrupt after step 2 on
> > >real HW, but I cannot tell whether the guest can rely on it being
> > >delivered exactly after the read instruction or whether it can be
> > >delayed by a couple of instructions. It seems to me it would be fragile
> > >for an OS to depend on this behaviour. AFAIK Linux does not.
> > >
> > Linux is safe, it does interrupt migration from within the interrupt
> > handler.  If you do that before the device-specific EOI, you won't
> > get another interrupt until programming the MSI is complete.
> > 
> > Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> > 
> > Using call_rcu() is a better solution than srcu IMO.  Less code
> > changes, consistently faster.
> > >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> > >>>
> > >>>
> > >>Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> > >Zhanghaoyu said that the problem mostly hurts in a real-time telecom
> > >environment, so I propose how he can fix the problem in his specific
> > >environment.  It will not fix older userspace obviously, but kernel
> > >fix will also require kernel update and usually updating userspace
> > >is easier.
> > >
> > >
> > 
> > Isn't the latency due to interrupt migration causing long
> > synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?
> > 
> If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in
> the irq routing table, changing the MSI configuration should not cause an
> update to the irq routing table (not saying this is what happens with
> current QEMU, but theoretically there is no reason to update the routing
> table in this case).
> 
> --
>   Gleb.

Unfortunately all high performance users (vhost net,
vhost scsi, virtio-blk data plane, vfio) switched to using
eventfd.

KVM_SIGNAL_MSI is used as a simple mechanism to avoid routing
table hassles e.g. for hotplug MSIs.
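
For context, a rough sketch of how those users are wired up (userspace side;
error handling omitted, and the fd variables are placeholders): the eventfd
is bound to a GSI, and the GSI is resolved through the routing table on
every injection.

    struct kvm_irqfd irqfd = {
            .fd  = eventfd_fd,      /* signalled by vhost/vfio/data plane */
            .gsi = gsi,             /* resolved via the irq routing table */
    };
    ioctl(vm_fd, KVM_IRQFD, &irqfd);

So reprogramming an MSI behind an irqfd still means rewriting the routing
entry for that GSI, which is exactly the synchronize_rcu() path discussed
in this thread.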

-- 
MST


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 12:10:40PM +0100, Paolo Bonzini wrote:
> On 28/11/2013 12:09, Gleb Natapov wrote:
> > > - if there are no callbacks, but there are readers, synchronize_srcu
> > > busy-loops for some time checking if the readers complete.  After a
> > > while (20 us for synchronize_srcu, 120 us for
> > > synchronize_srcu_expedited) it gives up and starts using a workqueue to
> > > poll every millisecond.  This should never happen unless
> > 
> > Unless what ? :) Unless reader is scheduled out?
> 
> Yes.  Or unless my brain is scheduled out in the middle of a sentence.
> 
So we will have to disable preemption in a reader to prevent big latencies for
a writer, no?

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>On 28/11/2013 07:27, Zhanghaoyu (A) wrote:
> >Without synchronize_rcu you could have
> >
> >VCPU writes to routing table
> >   e = entry from IRQ routing table
> >kvm_irq_routing_update(kvm, new);
> >VCPU resumes execution
> >   kvm_set_msi_irq(e, &irq);
> >   kvm_irq_delivery_to_apic_fast();
> >
> >where the entry is stale but the VCPU has already resumed execution.
> >
> >>>If we use call_rcu() (setting aside for the moment the problem that Gleb
> >>>pointed out) instead of synchronize_rcu(), should we still ensure this?
> >>The problem is that we should ensure this, so using call_rcu is not
> >>possible (even not considering the memory allocation problem).
> >>
> >Not changing current behaviour is certainly safer, but I am still not
> >100% convinced we have to ensure this.
> >
> >Suppose guest does:
> >
> >1: change msi interrupt by writing to pci register
> >2: read the pci register to flush the write
> >3: zero idt
> >
> >I am pretty certain that this code can get an interrupt after step 2 on
> >real HW, but I cannot tell whether the guest can rely on it being
> >delivered exactly after the read instruction or whether it can be delayed
> >by a couple of instructions. It seems to me it would be fragile for an OS
> >to depend on this behaviour. AFAIK Linux does not.
> >
> Linux is safe, it does interrupt migration from within the interrupt
> handler.  If you do that before the device-specific EOI, you won't
> get another interrupt until programming the MSI is complete.
> 
> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> 
> Using call_rcu() is a better solution than srcu IMO.  Less code
> changes, consistently faster.
> >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> >>>
> >>>
> >>Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> >Zhanghaoyu said that the problem mostly hurts in a real-time telecom
> >environment, so I propose how he can fix the problem in his specific
> >environment.  It will not fix older userspace obviously, but kernel
> >fix will also require kernel update and usually updating userspace
> >is easier.
> >
> >
> 
> Isn't the latency due to interrupt migration causing long
> synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?
> 
If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in
the irq routing table, changing the MSI configuration should not cause an
update to the irq routing table (not saying this is what happens with
current QEMU, but theoretically there is no reason to update the routing
table in this case).
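
A minimal sketch of that userspace path (the vm_fd variable and the msi_*
values are placeholders; a VMM would fill them from the device's MSI
capability):

    struct kvm_msi msi = {
            .address_lo = msi_addr_lo,
            .address_hi = msi_addr_hi,
            .data       = msi_data,
    };
    /* Injects the MSI directly: no GSI, no routing-table entry,
     * and therefore no routing-table update on reconfiguration. */
    ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);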

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, &irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get an interrupt after step 2 on real HW,
but I cannot tell whether the guest can rely on it being delivered exactly after
the read instruction or whether it can be delayed by a couple of instructions.
It seems to me it would be fragile for an OS to depend on this behaviour.
AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in a real-time telecom
environment, so I propose how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but kernel
fix will also require kernel update and usually updating userspace
is easier.




Isn't the latency due to interrupt migration causing long 
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


The problem occurs with assigned devices too AFAICT.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:10 PM, Paolo Bonzini wrote:

On 28/11/2013 12:09, Gleb Natapov wrote:

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

Unless what ? :) Unless reader is scheduled out?

Yes.  Or unless my brain is scheduled out in the middle of a sentence.



You need a grace period.  Or just sleep().


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Paolo Bonzini
On 28/11/2013 12:09, Gleb Natapov wrote:
> > - if there are no callbacks, but there are readers, synchronize_srcu
> > busy-loops for some time checking if the readers complete.  After a
> > while (20 us for synchronize_srcu, 120 us for
> > synchronize_srcu_expedited) it gives up and starts using a workqueue to
> > poll every millisecond.  This should never happen unless
> 
> Unless what ? :) Unless reader is scheduled out?

Yes.  Or unless my brain is scheduled out in the middle of a sentence.

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 11:40:06AM +0100, Paolo Bonzini wrote:
> On 28/11/2013 11:16, Avi Kivity wrote:
> >> The QRCU I linked would work great latency-wise (it has roughly the same
> >> latency of an rwsem but readers are lock-free).  However, the locked
> >> operations in the read path would hurt because of cache misses, so it's
> >> not good either.
> > 
> > I guess srcu would work.  Do you know what's the typical write-side
> > latency there? (in terms of what it waits for, not nanoseconds).
> 
> If there's no concurrent reader, it's zero if I read the code right.
> Otherwise it depends:
> 
> - if there are many callbacks, only 10 of them are processed per
> millisecond.  But unless there are concurrent synchronize_srcu calls
> there should not be any callback at all.  If all VCPUs were to furiously
> change the MSIs, the latency could go up to #vcpu/10 milliseconds.
> 
> - if there are no callbacks, but there are readers, synchronize_srcu
> busy-loops for some time checking if the readers complete.  After a
> while (20 us for synchronize_srcu, 120 us for
> synchronize_srcu_expedited) it gives up and starts using a workqueue to
> poll every millisecond.  This should never happen unless
> 
Unless what ? :) Unless reader is scheduled out?

> So, given the very restricted usage this SRCU would have, we probably
> can expect synchronize_srcu_expedited to finish its job in the
> busy-looping phase, and 120 us should be the expected maximum
> latency---more likely to be an order of magnitude smaller, and in very
> rare cases higher.
> 
> Paolo

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> >>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> On 28/11/2013 07:27, Zhanghaoyu (A) wrote:
> >>>Without synchronize_rcu you could have
> >>>
> >>>VCPU writes to routing table
> >>>   e = entry from IRQ routing table
> >>>kvm_irq_routing_update(kvm, new);
> >>>VCPU resumes execution
> >>>   kvm_set_msi_irq(e, &irq);
> >>>   kvm_irq_delivery_to_apic_fast();
> >>>
> >>>where the entry is stale but the VCPU has already resumed execution.
> >>>
> >If we use call_rcu() (setting aside for the moment the problem that Gleb
> >pointed out) instead of synchronize_rcu(), should we still ensure this?
> The problem is that we should ensure this, so using call_rcu is not
> possible (even not considering the memory allocation problem).
> 
> >>>Not changing current behaviour is certainly safer, but I am still not 100%
> >>>convinced we have to ensure this.
> >>>
> >>>Suppose guest does:
> >>>
> >>>1: change msi interrupt by writing to pci register
> >>>2: read the pci register to flush the write
> >>>3: zero idt
> >>>
> >>>I am pretty certain that this code can get an interrupt after step 2 on
> >>>real HW, but I cannot tell whether the guest can rely on it being
> >>>delivered exactly after the read instruction or whether it can be delayed
> >>>by a couple of instructions. It seems to me it would be fragile for an OS
> >>>to depend on this behaviour. AFAIK Linux does not.
> >>>
> >>Linux is safe, it does interrupt migration from within the interrupt
> >>handler.  If you do that before the device-specific EOI, you won't
> >>get another interrupt until programming the MSI is complete.
> >>
> >>Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> >>
> >>Using call_rcu() is a better solution than srcu IMO.  Less code
> >>changes, consistently faster.
> >Why not fix userspace to use KVM_SIGNAL_MSI instead?
> >
> >
> 
> Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
Zhanghaoyu said that the problem mostly hurts in a real-time telecom
environment, so I propose how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but kernel
fix will also require kernel update and usually updating userspace
is easier.

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:40 PM, Paolo Bonzini wrote:

On 28/11/2013 11:16, Avi Kivity wrote:

The QRCU I linked would work great latency-wise (it has roughly the same
latency of an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.

I guess srcu would work.  Do you know what's the typical write-side
latency there? (in terms of what it waits for, not nanoseconds).

If there's no concurrent reader, it's zero if I read the code right.
Otherwise it depends:

- if there are many callbacks, only 10 of them are processed per
millisecond.  But unless there are concurrent synchronize_srcu calls
there should not be any callback at all.  If all VCPUs were to furiously
change the MSIs, the latency could go up to #vcpu/10 milliseconds.

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

So, given the very restricted usage this SRCU would have, we probably
can expect synchronize_srcu_expedited to finish its job in the
busy-looping phase, and 120 us should be the expected maximum
latency---more likely to be an order of magnitude smaller, and in very
rare cases higher.



Looks like it's a good fit then.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Paolo Bonzini
On 28/11/2013 11:16, Avi Kivity wrote:
>> The QRCU I linked would work great latency-wise (it has roughly the same
>> latency of an rwsem but readers are lock-free).  However, the locked
>> operations in the read path would hurt because of cache misses, so it's
>> not good either.
> 
> I guess srcu would work.  Do you know what's the typical write-side
> latency there? (in terms of what it waits for, not nanoseconds).

If there's no concurrent reader, it's zero if I read the code right.
Otherwise it depends:

- if there are many callbacks, only 10 of them are processed per
millisecond.  But unless there are concurrent synchronize_srcu calls
there should not be any callback at all.  If all VCPUs were to furiously
change the MSIs, the latency could go up to #vcpu/10 milliseconds.

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

So, given the very restricted usage this SRCU would have, we probably
can expect synchronize_srcu_expedited to finish its job in the
busy-looping phase, and 120 us should be the expected maximum
latency---more likely to be an order of magnitude smaller, and in very
rare cases higher.

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:53 AM, Paolo Bonzini wrote:

On 28/11/2013 10:49, Avi Kivity wrote:

Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't get
another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes,
consistently faster.

call_rcu() has the problem of rate limiting, too.  It wasn't such a
great idea, I think.

The QRCU I linked would work great latency-wise (it has roughly the same
latency of an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.



I guess srcu would work.  Do you know what's the typical write-side 
latency there? (in terms of what it waits for, not nanoseconds).



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, &irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get an interrupt after step 2 on real HW,
but I cannot tell whether the guest can rely on it being delivered exactly after
the read instruction or whether it can be delayed by a couple of instructions.
It seems to me it would be fragile for an OS to depend on this behaviour.
AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?




Shouldn't it work with old userspace too? Maybe I misunderstood your intent.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>On 28/11/2013 07:27, Zhanghaoyu (A) wrote:
> >Without synchronize_rcu you could have
> >
> >VCPU writes to routing table
> >   e = entry from IRQ routing table
> >kvm_irq_routing_update(kvm, new);
> >VCPU resumes execution
> >   kvm_set_msi_irq(e, &irq);
> >   kvm_irq_delivery_to_apic_fast();
> >
> >where the entry is stale but the VCPU has already resumed execution.
> >
> >>>If we use call_rcu() (setting aside for the moment the problem that Gleb
> >>>pointed out) instead of synchronize_rcu(), should we still ensure this?
> >>The problem is that we should ensure this, so using call_rcu is not
> >>possible (even not considering the memory allocation problem).
> >>
> >Not changing current behaviour is certainly safer, but I am still not 100%
> >convinced we have to ensure this.
> >
> >Suppose guest does:
> >
> >1: change msi interrupt by writing to pci register
> >2: read the pci register to flush the write
> >3: zero idt
> >
> >I am pretty certain that this code can get an interrupt after step 2 on real HW,
> >but I cannot tell whether the guest can rely on it being delivered exactly after
> >the read instruction or whether it can be delayed by a couple of instructions.
> >It seems to me it would be fragile for an OS to depend on this behaviour.
> >AFAIK Linux does not.
> >
> 
> Linux is safe, it does interrupt migration from within the interrupt
> handler.  If you do that before the device-specific EOI, you won't
> get another interrupt until programming the MSI is complete.
> 
> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> 
> Using call_rcu() is a better solution than srcu IMO.  Less code
> changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Paolo Bonzini
On 28/11/2013 10:49, Avi Kivity wrote:
> Linux is safe, it does interrupt migration from within the interrupt
> handler.  If you do that before the device-specific EOI, you won't get
> another interrupt until programming the MSI is complete.
> 
> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> 
> Using call_rcu() is a better solution than srcu IMO.  Less code changes,
> consistently faster.

call_rcu() has the problem of rate limiting, too.  It wasn't such a
great idea, I think.

The QRCU I linked would work great latency-wise (it has roughly the same
latency of an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, &irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).

  
Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get an interrupt after step 2 on real HW,
but I cannot tell whether the guest can rely on it being delivered exactly after
the read instruction or whether it can be delayed by a couple of instructions.
It seems to me it would be fragile for an OS to depend on this behaviour.
AFAIK Linux does not.



Linux is safe, it does interrupt migration from within the interrupt 
handler.  If you do that before the device-specific EOI, you won't get 
another interrupt until programming the MSI is complete.
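
Roughly, the pattern described here looks like this (illustrative pseudo-code
only; the helper names are placeholders, not the actual Linux implementation):

    static irqreturn_t handler(int irq, void *dev)
    {
            if (irq_migration_pending(irq))
                    reprogram_msi(irq);       /* done before the device EOI */
            handle_device_work(dev);
            device_specific_eoi(dev);         /* only now can the device
                                                 raise the next interrupt */
            return IRQ_HANDLED;
    }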


Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes, 
consistently faster.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 10:29:36AM +0100, Paolo Bonzini wrote:
> On 28/11/2013 10:19, Gleb Natapov wrote:
> > Not changing current behaviour is certainly safer, but I am still not 100%
> > convinced we have to ensure this.
> > 
> > Suppose guest does:
> > 
> > 1: change msi interrupt by writing to pci register
> > 2: read the pci register to flush the write
> > 3: zero idt
> > 
> > I am pretty certain that this code can get an interrupt after step 2 on
> > real HW, but I cannot tell whether the guest can rely on it being
> > delivered exactly after the read instruction or whether it can be delayed
> > by a couple of instructions.
> 
> I agree it's fragile, but if a dedicated SRCU can meet the requirements
> (possibly with synchronize_srcu_expedited), I prefer not to break it.
> 
Definitely. If we can find a reasonable solution that preserves the current
semantics, that is the preferable path to take.

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Paolo Bonzini
On 28/11/2013 10:19, Gleb Natapov wrote:
> Not changing current behaviour is certainly safer, but I am still not 100%
> convinced we have to ensure this.
> 
> Suppose guest does:
> 
> 1: change msi interrupt by writing to pci register
> 2: read the pci register to flush the write
> 3: zero idt
> 
> I am pretty certain that this code can get an interrupt after step 2 on real HW,
> but I cannot tell whether the guest can rely on it being delivered exactly after
> the read instruction or whether it can be delayed by a couple of instructions.

I agree it's fragile, but if a dedicated SRCU can meet the requirements
(possibly with synchronize_srcu_expedited), I prefer not to break it.

Paolo

> It seems to me it would be fragile for an OS to depend on this behaviour.
> AFAIK Linux does not.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 09:14:22AM +, Zhanghaoyu (A) wrote:
> >  No, this would be exactly the same code that is running now:
> >
> >  mutex_lock(&kvm->irq_lock);
> >  old = kvm->irq_routing;
> >  kvm_irq_routing_update(kvm, new);
> >  mutex_unlock(&kvm->irq_lock);
> >
> >  synchronize_rcu();
> >  kfree(old);
> >  return 0;
> >
> >  Except that the kfree would run in the call_rcu kernel thread instead of
> >  the vcpu thread.  But the vcpus already see the new routing table after
> >  the rcu_assign_pointer that is in kvm_irq_routing_update.
> >
> > I understood the proposal was also to eliminate the
> > synchronize_rcu(), so while new interrupts would see the new
> > routing table, interrupts already in flight could pick up the old one.
>  Isn't that always the case with RCU?  (See my answer above: "the
>  vcpus already see the new routing table after the rcu_assign_pointer
>  that is in kvm_irq_routing_update").
> >>> With synchronize_rcu(), you have the additional guarantee that any
> >>> parallel accesses to the old routing table have completed.  Since we
> >>> also trigger the irq from rcu context, you know that after
> >>> synchronize_rcu() you won't get any interrupts to the old destination
> >>> (see kvm_set_irq_inatomic()).
> >> We do not have this guarantee for other vcpus that do not call
> >> synchronize_rcu(). They may still use an outdated routing table while a
> >> vcpu or iothread that performed the table update sits in synchronize_rcu().
> >>
> >
> >Consider this guest code:
> >
> >   write msi entry, directing the interrupt away from this vcpu
> >   nop
> >   memset(&idt, 0, sizeof(idt));
> >
> >Currently, this code will never trigger a triple fault.  With the change to 
> >call_rcu(), it may.
> >
> >Now it may be that the guest does not expect this to work (PCI writes are 
> >posted; and interrupts can be delayed indefinitely by the pci fabric), 
> >but we don't know if there's a path that guarantees the guest something that 
> >we're taking away with this change.
> 
> In a native environment, if a CPU's LAPIC IRR and ISR have many interrupts
> pending, and the OS zeroes this CPU's IDT before receiving them, will the
> same problem happen?
> 
This is just an example. The OS can ensure that there is no other pending
interrupt by some other means.

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Gleb Natapov
On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >> >Without synchronize_rcu you could have
> >> >
> >> >VCPU writes to routing table
> >> >   e = entry from IRQ routing table
> >> >kvm_irq_routing_update(kvm, new);
> >> >VCPU resumes execution
> >> >   kvm_set_msi_irq(e, &irq);
> >> >   kvm_irq_delivery_to_apic_fast();
> >> >
> >> >where the entry is stale but the VCPU has already resumed execution.
> >> >
> > If we use call_rcu() (setting aside for the moment the problem that Gleb
> > pointed out) instead of synchronize_rcu(), should we still ensure this?
> 
> The problem is that we should ensure this, so using call_rcu is not
> possible (even not considering the memory allocation problem).
>
 
Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get an interrupt after step 2 on real HW,
but I cannot tell whether the guest can rely on it being delivered exactly after
the read instruction or whether it can be delayed by a couple of instructions.
It seems to me it would be fragile for an OS to depend on this behaviour.
AFAIK Linux does not.
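
In code, the guest sequence above would be roughly (illustrative only; the
helpers and register offsets are placeholders):

    pci_write_config(dev, MSI_ADDR, new_target);  /* 1: retarget the MSI */
    (void)pci_read_config(dev, MSI_ADDR);         /* 2: read back, flushing
                                                     the posted write */
    load_idt(&zeroed_idt);                        /* 3: zero the idt */

The open question is only whether, on real hardware, an interrupt aimed at
the old target can still arrive after step 2.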

--
Gleb.


RE: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Zhanghaoyu (A)
>  No, this would be exactly the same code that is running now:
>
>  mutex_lock(&kvm->irq_lock);
>  old = kvm->irq_routing;
>  kvm_irq_routing_update(kvm, new);
>  mutex_unlock(&kvm->irq_lock);
>
>  synchronize_rcu();
>  kfree(old);
>  return 0;
>
>  Except that the kfree would run in the call_rcu kernel thread instead of
>  the vcpu thread.  But the vcpus already see the new routing table after
>  the rcu_assign_pointer that is in kvm_irq_routing_update.
>
> I understood the proposal was also to eliminate the
> synchronize_rcu(), so while new interrupts would see the new
> routing table, interrupts already in flight could pick up the old one.
 Isn't that always the case with RCU?  (See my answer above: "the
 vcpus already see the new routing table after the rcu_assign_pointer
 that is in kvm_irq_routing_update").
>>> With synchronize_rcu(), you have the additional guarantee that any
>>> parallel accesses to the old routing table have completed.  Since we
>>> also trigger the irq from rcu context, you know that after
>>> synchronize_rcu() you won't get any interrupts to the old destination
>>> (see kvm_set_irq_inatomic()).
>> We do not have this guarantee for other vcpus that do not call
>> synchronize_rcu(). They may still use an outdated routing table while a
>> vcpu or iothread that performed the table update sits in synchronize_rcu().
>>
>
>Consider this guest code:
>
>   write msi entry, directing the interrupt away from this vcpu
>   nop
>   memset(&idt, 0, sizeof(idt));
>
>Currently, this code will never trigger a triple fault.  With the change to 
>call_rcu(), it may.
>
>Now it may be that the guest does not expect this to work (PCI writes are 
>posted; and interrupts can be delayed indefinitely by the pci fabric), 
>but we don't know if there's a path that guarantees the guest something that 
>we're taking away with this change.

In a native environment, if a CPU's LAPIC IRR and ISR have many interrupts
pending, and the OS zeroes this CPU's IDT before receiving them, will the
same problem happen?

Zhang Haoyu


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Paolo Bonzini
On 28/11/2013 07:27, Zhanghaoyu (A) wrote:
>> >Without synchronize_rcu you could have
>> >
>> >VCPU writes to routing table
>> >   e = entry from IRQ routing table
>> >kvm_irq_routing_update(kvm, new);
>> >VCPU resumes execution
>> >   kvm_set_msi_irq(e, &irq);
>> >   kvm_irq_delivery_to_apic_fast();
>> >
>> >where the entry is stale but the VCPU has already resumed execution.
>> >
> If we use call_rcu() (setting aside for the moment the problem that Gleb
> pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).

Can you try using SRCU and synchronize_srcu?
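
Concretely, the writer side would become something like this (a sketch
assuming a dedicated srcu_struct for the routing table; the irq_srcu name
is not from an actual patch):

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);     /* rcu_assign_pointer inside */
    mutex_unlock(&kvm->irq_lock);

    /* Waits only for routing-table readers, not for every RCU read-side
     * critical section in the kernel; synchronize_srcu_expedited() is the
     * lower-latency variant. */
    synchronize_srcu(&kvm->irq_srcu);
    kfree(old);

Readers would then bracket their lookups with srcu_read_lock() and
srcu_read_unlock() on the same srcu_struct.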

Paolo


RE: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-27 Thread Zhanghaoyu (A)
> > >>I understood the proposal was also to eliminate the
> > >>synchronize_rcu(), so while new interrupts would see the new
> > >>routing table, interrupts already in flight could pick up the old one.
 > >Isn't that always the case with RCU?  (See my answer above: "the
 > >vcpus already see the new routing table after the
 > >rcu_assign_pointer that is in kvm_irq_routing_update").
>>> >
>>> > With synchronize_rcu(), you have the additional guarantee that any
>>> > parallel accesses to the old routing table have completed.  Since
>>> > we also trigger the irq from rcu context, you know that after
>>> > synchronize_rcu() you won't get any interrupts to the old
>>> > destination (see kvm_set_irq_inatomic()).
>> We do not have this guarantee for other vcpus that do not call
>> synchronize_rcu(). They may still use an outdated routing table while a
>> vcpu or iothread that performed the table update sits in synchronize_rcu().
>
>Avi's point is that, after the VCPU resumes execution, you know that no 
>interrupt will be sent to the old destination because kvm_set_msi_inatomic 
>(and ultimately kvm_irq_delivery_to_apic_fast) is also called within the RCU 
>read-side critical section.
>
>Without synchronize_rcu you could have
>
>VCPU writes to routing table
>   e = entry from IRQ routing table
>kvm_irq_routing_update(kvm, new);
>VCPU resumes execution
>   kvm_set_msi_irq(e, &irq);
>   kvm_irq_delivery_to_apic_fast();
>
>where the entry is stale but the VCPU has already resumed execution.
>
If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

Thanks,
Zhang Haoyu

>If we want to ensure, we need to use a different mechanism for synchronization 
>than the global RCU.  QRCU would work; readers are not wait-free but only if 
>there is a concurrent synchronize_qrcu, which should be rare.
>
>Paolo
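
For clarity, the call_rcu() variant being asked about would look roughly
like this (a sketch only; it assumes an rcu_head embedded in the routing
table structure, which the code of that era did not have):

    static void free_irq_routing(struct rcu_head *head)
    {
            kfree(container_of(head, struct kvm_irq_routing_table, rcu));
    }

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new);       /* rcu_assign_pointer inside */
    mutex_unlock(&kvm->irq_lock);
    call_rcu(&old->rcu, free_irq_routing);  /* defers the kfree, but does not
                                               wait for in-flight readers */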


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Paolo Bonzini
On 26/11/2013 17:21, Avi Kivity wrote:
>>> It's indeed safe, but I think there's a nice win to be had if we
>>> drop the assumption.
>> I'm not arguing with that, but a minor commoent below:
>>
 (BTW, PCI memory writes are posted, but configuration writes are not).
>>> MSIs are configured via PCI memory writes.
>>>
>>> By itself, that doesn't buy us anything, since the guest could flush
>>> the write via a read.  But I think the fact that the interrupt
>>> messages themselves are posted proves that it is safe.
>> FYI, PCI read flushes the interrupt itself in, too.
> 
> I guess that kills the optimization then.  Maybe you can do qrcu,
> whatever that is.

It's "srcu" (a separate SRCU instance specific to the irq routing
table), which I managed to misspell twice.

It turns out that QRCU actually exists
(http://lwn.net/Articles/223752/) and has extremely fast grace periods,
but read_lock/read_unlock are also more expensive.  So it was probably
some kind of Freudian slip.

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Gleb Natapov
On Tue, Nov 26, 2013 at 06:27:47PM +0200, Avi Kivity wrote:
> On 11/26/2013 06:24 PM, Gleb Natapov wrote:
> >On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
> >>On 26/11/2013 16:03, Gleb Natapov wrote:
> >I understood the proposal was also to eliminate the 
> >synchronize_rcu(),
> >so while new interrupts would see the new routing table, interrupts
> >already in flight could pick up the old one.
> >>>Isn't that always the case with RCU?  (See my answer above: "the vcpus
> >>>already see the new routing table after the rcu_assign_pointer that is
> >>>in kvm_irq_routing_update").
> >With synchronize_rcu(), you have the additional guarantee that any
> >parallel accesses to the old routing table have completed.  Since we
> >also trigger the irq from rcu context, you know that after
> >synchronize_rcu() you won't get any interrupts to the old
> >destination (see kvm_set_irq_inatomic()).
> >>>We do not have this guarantee for other vcpus that do not call
> >>>synchronize_rcu(). They may still use an outdated routing table while a vcpu
> >>>or iothread that performed the table update sits in synchronize_rcu().
> >>Avi's point is that, after the VCPU resumes execution, you know that no
> >>interrupt will be sent to the old destination because
> >>kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
> >>also called within the RCU read-side critical section.
> >>
> >>Without synchronize_rcu you could have
> >>
> >> VCPU writes to routing table
> >>e = entry from IRQ routing table
> >> kvm_irq_routing_update(kvm, new);
> >> VCPU resumes execution
> >>kvm_set_msi_irq(e, &irq);
> >>kvm_irq_delivery_to_apic_fast();
> >>
> >>where the entry is stale but the VCPU has already resumed execution.
> >>
> >So how is it different from what we have now:
> >
> >disable_irq()
> >VCPU writes to routing table
> >  e = entry from IRQ routing table
> >  kvm_set_msi_irq(e, &irq);
> >  kvm_irq_delivery_to_apic_fast();
> >kvm_irq_routing_update(kvm, new);
> >synchronize_rcu()
> >VCPU resumes execution
> >enable_irq()
> >receive stale irq
> >
> >
> 
> Suppose the guest did not disable_irq() and enable_irq(), but
> instead had a pci read where you have the enable_irq().  After the
> read you cannot have a stale irq (assuming the read flushes the irq
> all the way to the APIC).
There may still be a race between the PCI read and an MSI already registered
in the IRR. I do not believe such a read can undo IRR changes.

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Gleb Natapov
On Tue, Nov 26, 2013 at 05:28:23PM +0100, Paolo Bonzini wrote:
> On 26/11/2013 17:24, Gleb Natapov wrote:
> >> VCPU writes to routing table
> >>e = entry from IRQ routing table
> >> kvm_irq_routing_update(kvm, new);
> >> VCPU resumes execution
> >>kvm_set_msi_irq(e, &irq);
> >>kvm_irq_delivery_to_apic_fast();
> >> 
> >> where the entry is stale but the VCPU has already resumed execution.
> > 
> > So how is it different from what we have now:
> > 
> > disable_irq()
> > VCPU writes to routing table
> >  e = entry from IRQ routing table
> >  kvm_set_msi_irq(e, &irq);
> >  kvm_irq_delivery_to_apic_fast();
> > kvm_irq_routing_update(kvm, new);
> > synchronize_rcu()
> > VCPU resumes execution
> > enable_irq()
> > receive stale irq
> 
> Adding a "disable/enable IRQs" looks like a relatively big change.  But
> perhaps it's not for some reason I'm missing.
> 
You will receive a stale irq even without disable/enable IRQs, of course. I
put it there so that the guest would have a chance to do stupid things like
zeroing the idt before receiving the interrupt, but on real HW the timing is
different from what we emulate, so the same race may happen even without
disable/enable IRQs.
--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:28 PM, Paolo Bonzini wrote:

On 26/11/2013 17:24, Gleb Natapov wrote:

 VCPU writes to routing table
e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
kvm_set_msi_irq(e, &irq);
kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
  e = entry from IRQ routing table
  kvm_set_msi_irq(e, &irq);
  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

Adding a "disable/enable IRQs" looks like a relatively big change.  But
perhaps it's not for some reason I'm missing.



Those are guest operations, which may not be there at all.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Michael S. Tsirkin
On Tue, Nov 26, 2013 at 04:58:53PM +0100, Paolo Bonzini wrote:
> On 26/11/2013 16:35, Avi Kivity wrote:
> >>>> If we want to ensure, we need to use a different mechanism for
> >>>> synchronization than the global RCU.  QRCU would work; readers are not
> >>>> wait-free but only if there is a concurrent synchronize_qrcu, which
> >>>> should be rare.
> >>> An alternative path is to convince ourselves that the hardware does not
> >>> provide the guarantees that the current code provides, and so we can
> >>> relax them.
> >> No, I think it's a reasonable guarantee to provide.
> > 
> > Why?
> 
> Because IIUC the semantics may depend not just on the interrupt
> controller, but also on the specific PCI device.  It seems safer to
> assume that at least one device/driver pair wants this to work.
> 
> (BTW, PCI memory writes are posted, but configuration writes are not).
> 
> Paolo

You can also do a PCI read and flush out the writes.
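
In code, the flush is just a non-posted read from the same device (a
two-line sketch; 'bar' and 'REG' are placeholder names):

    writel(val, bar + REG); /* posted write: may still sit in the PCI fabric */
    (void)readl(bar + REG); /* the read cannot complete before earlier writes do */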


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Paolo Bonzini
On 26/11/2013 17:24, Gleb Natapov wrote:
>> VCPU writes to routing table
>>    e = entry from IRQ routing table
>> kvm_irq_routing_update(kvm, new);
>> VCPU resumes execution
>>    kvm_set_msi_irq(e, &irq);
>>    kvm_irq_delivery_to_apic_fast();
>> 
>> where the entry is stale but the VCPU has already resumed execution.
> 
> So how is it different from what we have now:
> 
> disable_irq()
> VCPU writes to routing table
>  e = entry from IRQ routing table
>  kvm_set_msi_irq(e, &irq);
>  kvm_irq_delivery_to_apic_fast();
> kvm_irq_routing_update(kvm, new);
> synchronize_rcu()
> VCPU resumes execution
> enable_irq()
> receive stale irq

Adding a "disable/enable IRQs" looks like a relatively big change.  But
perhaps it's not for some reason I'm missing.

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:24 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:

On 26/11/2013 16:03, Gleb Natapov wrote:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: "the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update").

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, &irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
  e = entry from IRQ routing table
  kvm_set_msi_irq(e, &irq);
  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

   



Suppose the guest did not disable_irq() and enable_irq(), but instead 
had a pci read where you have the enable_irq().  After the read you 
cannot have a stale irq (assuming the read flushes the irq all the way 
to the APIC).




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Gleb Natapov
On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
> On 26/11/2013 16:03, Gleb Natapov wrote:
> >>>> I understood the proposal was also to eliminate the synchronize_rcu(),
> >>>> so while new interrupts would see the new routing table, interrupts
> >>>> already in flight could pick up the old one.
> >>> Isn't that always the case with RCU?  (See my answer above: "the vcpus
> >>> already see the new routing table after the rcu_assign_pointer that is
> >>> in kvm_irq_routing_update").
> >> With synchronize_rcu(), you have the additional guarantee that any
> >> parallel accesses to the old routing table have completed.  Since we
> >> also trigger the irq from rcu context, you know that after
> >> synchronize_rcu() you won't get any interrupts to the old
> >> destination (see kvm_set_irq_inatomic()).
> > We do not have this guarantee for other vcpus that do not call
> > synchronize_rcu(). They may still use an outdated routing table while a vcpu
> > or iothread that performed the table update sits in synchronize_rcu().
> 
> Avi's point is that, after the VCPU resumes execution, you know that no
> interrupt will be sent to the old destination because
> kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
> also called within the RCU read-side critical section.
> 
> Without synchronize_rcu you could have
> 
> VCPU writes to routing table
>    e = entry from IRQ routing table
> kvm_irq_routing_update(kvm, new);
> VCPU resumes execution
>    kvm_set_msi_irq(e, &irq);
>    kvm_irq_delivery_to_apic_fast();
> 
> where the entry is stale but the VCPU has already resumed execution.
> 
So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
 e = entry from IRQ routing table
 kvm_set_msi_irq(e, &irq);
 kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

  
--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote:

On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

On 26/11/2013 16:35, Avi Kivity wrote:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.

It's indeed safe, but I think there's a nice win to be had if we
drop the assumption.

I'm not arguing with that, but a minor comment below:


(BTW, PCI memory writes are posted, but configuration writes are not).

MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush
the write via a read.  But I think the fact that the interrupt
messages themselves are posted proves that it is safe.

FYI, a PCI read flushes the in-flight interrupt itself, too.



I guess that kills the optimization then.  Maybe you can do qrcu, 
whatever that is.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Michael S. Tsirkin
On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:
> On 11/26/2013 05:58 PM, Paolo Bonzini wrote:
> >On 26/11/2013 16:35, Avi Kivity wrote:
> >>>>>If we want to ensure, we need to use a different mechanism for
> >>>>>synchronization than the global RCU.  QRCU would work; readers are not
> >>>>>wait-free but only if there is a concurrent synchronize_qrcu, which
> >>>>>should be rare.
> >>>>An alternative path is to convince ourselves that the hardware does not
> >>>>provide the guarantees that the current code provides, and so we can
> >>>>relax them.
> >>>No, I think it's a reasonable guarantee to provide.
> >>Why?
> >Because IIUC the semantics may depend not just on the interrupt
> >controller, but also on the specific PCI device.  It seems safer to
> >assume that at least one device/driver pair wants this to work.
> 
> It's indeed safe, but I think there's a nice win to be had if we
> drop the assumption.

I'm not arguing with that, but a minor comment below:

> >(BTW, PCI memory writes are posted, but configuration writes are not).
> 
> MSIs are configured via PCI memory writes.
> 
> By itself, that doesn't buy us anything, since the guest could flush
> the write via a read.  But I think the fact that the interrupt
> messages themselves are posted proves that it is safe.

FYI, a PCI read flushes the in-flight interrupt itself, too.

> The fact
> that Linux does interrupt migration from within the interrupt
> handler also shows that someone else believes that it is the only
> safe place to do it.
> 


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

On 26/11/2013 16:35, Avi Kivity wrote:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.


It's indeed safe, but I think there's a nice win to be had if we drop 
the assumption.



(BTW, PCI memory writes are posted, but configuration writes are not).


MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush the 
write via a read.  But I think the fact that the interrupt messages 
themselves are posted proves that it is safe.  The fact that Linux does 
interrupt migration from within the interrupt handler also shows that 
someone else believes that it is the only safe place to do it.





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Paolo Bonzini
On 26/11/2013 16:35, Avi Kivity wrote:
>>>> If we want to ensure, we need to use a different mechanism for
>>>> synchronization than the global RCU.  QRCU would work; readers are not
>>>> wait-free but only if there is a concurrent synchronize_qrcu, which
>>>> should be rare.
>>> An alternative path is to convince ourselves that the hardware does not
>>> provide the guarantees that the current code provides, and so we can
>>> relax them.
>> No, I think it's a reasonable guarantee to provide.
> 
> Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.

(BTW, PCI memory writes are posted, but configuration writes are not).

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:28 PM, Paolo Bonzini wrote:

On 26/11/2013 16:25, Avi Kivity wrote:

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.



Why?


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Paolo Bonzini
On 26/11/2013 16:25, Avi Kivity wrote:
>> If we want to ensure, we need to use a different mechanism for
>> synchronization than the global RCU.  QRCU would work; readers are not
>> wait-free but only if there is a concurrent synchronize_qrcu, which
>> should be rare.
> 
> An alternative path is to convince ourselves that the hardware does not
> provide the guarantees that the current code provides, and so we can
> relax them.

No, I think it's a reasonable guarantee to provide.

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:20 PM, Paolo Bonzini wrote:

On 26/11/2013 16:03, Gleb Natapov wrote:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: "the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update").

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, &irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.


An alternative path is to convince ourselves that the hardware does not 
provide the guarantees that the current code provides, and so we can 
relax them.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:03 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

On 26/11/2013 15:36, Avi Kivity wrote:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: "the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update").

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().



Consider this guest code:

  write msi entry, directing the interrupt away from this vcpu
  nop
  memset(&idt, 0, sizeof(idt));

Currently, this code will never trigger a triple fault.  With the change 
to call_rcu(), it may.


Now it may be that the guest does not expect this to work (PCI writes 
are posted; and interrupts can be delayed indefinitely by the pci 
fabric), but we don't know if there's a path that guarantees the guest 
something that we're taking away with this change.






Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Paolo Bonzini
On 26/11/2013 16:03, Gleb Natapov wrote:
>>>> I understood the proposal was also to eliminate the synchronize_rcu(),
>>>> so while new interrupts would see the new routing table, interrupts
>>>> already in flight could pick up the old one.
>>> Isn't that always the case with RCU?  (See my answer above: "the vcpus
>>> already see the new routing table after the rcu_assign_pointer that is
>>> in kvm_irq_routing_update").
>> With synchronize_rcu(), you have the additional guarantee that any
>> parallel accesses to the old routing table have completed.  Since we
>> also trigger the irq from rcu context, you know that after
>> synchronize_rcu() you won't get any interrupts to the old
>> destination (see kvm_set_irq_inatomic()).
> We do not have this guarantee for other vcpus that do not call
> synchronize_rcu(). They may still use an outdated routing table while a vcpu
> or iothread that performed the table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

VCPU writes to routing table
   e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
   kvm_set_msi_irq(e, &irq);
   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

If we want to ensure, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.
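
For illustration: QRCU was proposed on lkml but never merged; the
closest in-tree mechanism is SRCU. A rough sketch with an assumed
per-VM srcu_struct (not an actual patch):

    /* reader: interrupt-delivery path */
    idx = srcu_read_lock(&kvm->irq_srcu);
    e = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
    /* ... deliver the MSI using e ... */
    srcu_read_unlock(&kvm->irq_srcu, idx);

    /* updater: routing-table change */
    rcu_assign_pointer(kvm->irq_routing, new);
    synchronize_srcu(&kvm->irq_srcu); /* waits only for readers of this domain */
    kfree(old);

Readers stay cheap, and the grace period only has to wait for CPUs
actually inside this read-side section, not for global RCU.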

Paolo


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Gleb Natapov
On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:
> On 11/26/2013 04:46 PM, Paolo Bonzini wrote:
> >On 26/11/2013 15:36, Avi Kivity wrote:
> >> No, this would be exactly the same code that is running now:
> >>
> >> mutex_lock(&kvm->irq_lock);
> >> old = kvm->irq_routing;
> >> kvm_irq_routing_update(kvm, new);
> >> mutex_unlock(&kvm->irq_lock);
> >>
> >> synchronize_rcu();
> >> kfree(old);
> >> return 0;
> >>
> >> Except that the kfree would run in the call_rcu kernel thread instead of
> >> the vcpu thread.  But the vcpus already see the new routing table after
> >> the rcu_assign_pointer that is in kvm_irq_routing_update.
> >>
> >>I understood the proposal was also to eliminate the synchronize_rcu(),
> >>so while new interrupts would see the new routing table, interrupts
> >>already in flight could pick up the old one.
> >Isn't that always the case with RCU?  (See my answer above: "the vcpus
> >already see the new routing table after the rcu_assign_pointer that is
> >in kvm_irq_routing_update").
> 
> With synchronize_rcu(), you have the additional guarantee that any
> parallel accesses to the old routing table have completed.  Since we
> also trigger the irq from rcu context, you know that after
> synchronize_rcu() you won't get any interrupts to the old
> destination (see kvm_set_irq_inatomic()).
We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().

--
Gleb.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

On 26/11/2013 15:36, Avi Kivity wrote:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: "the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update").


With synchronize_rcu(), you have the additional guarantee that any 
parallel accesses to the old routing table have completed.  Since we 
also trigger the irq from rcu context, you know that after 
synchronize_rcu() you won't get any interrupts to the old destination 
(see kvm_set_irq_inatomic()).


It's another question whether the hardware provides the same guarantee.


If you eliminate the synchronize_rcu, new interrupts would see the new
routing table, while interrupts already in flight will get a dangling
pointer.


Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
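
Spelled out, that variant would look something like this (a sketch; the
struct layout and callback are assumed for illustration, not taken from
the actual code):

    struct kvm_irq_routing_table {
        /* ... routing entries ... */
        struct rcu_head rcu;  /* assumed: embedded head for call_rcu() */
    };

    static void free_irq_routing(struct rcu_head *head)
    {
        kfree(container_of(head, struct kvm_irq_routing_table, rcu));
    }

    mutex_lock(&kvm->irq_lock);
    old = kvm->irq_routing;
    kvm_irq_routing_update(kvm, new); /* rcu_assign_pointer() inside */
    mutex_unlock(&kvm->irq_lock);
    call_rcu(&old->rcu, free_irq_routing); /* freed after a grace period,
                                              without blocking this thread */
    return 0;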


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Paolo Bonzini
On 26/11/2013 15:36, Avi Kivity wrote:
> 
> No, this would be exactly the same code that is running now:
> 
> mutex_lock(&kvm->irq_lock);
> old = kvm->irq_routing;
> kvm_irq_routing_update(kvm, new);
> mutex_unlock(&kvm->irq_lock);
> 
> synchronize_rcu();
> kfree(old);
> return 0;
> 
> Except that the kfree would run in the call_rcu kernel thread instead of
> the vcpu thread.  But the vcpus already see the new routing table after
> the rcu_assign_pointer that is in kvm_irq_routing_update.
> 
> I understood the proposal was also to eliminate the synchronize_rcu(),
> so while new interrupts would see the new routing table, interrupts
> already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: "the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update").

If you eliminate the synchronize_rcu, new interrupts would see the new
routing table, while interrupts already in flight will get a dangling
pointer.

Paolo