Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 01:33:48PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 28, 2013 at 01:22:45PM +0200, Gleb Natapov wrote:
> > On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> > > On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> > > >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> > > >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> > > >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> > > >>>>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> > > >>>>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> > > >>>>>>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> > > >>>>>>>>>Without synchronize_rcu you could have
> > > >>>>>>>>>
> > > >>>>>>>>>    VCPU writes to routing table
> > > >>>>>>>>>                                   e = entry from IRQ routing table
> > > >>>>>>>>>    kvm_irq_routing_update(kvm, new);
> > > >>>>>>>>>    VCPU resumes execution
> > > >>>>>>>>>                                   kvm_set_msi_irq(e, &irq);
> > > >>>>>>>>>                                   kvm_irq_delivery_to_apic_fast();
> > > >>>>>>>>>
> > > >>>>>>>>>where the entry is stale but the VCPU has already resumed
> > > >>>>>>>>>execution.
> > > >>>>>>>>>
> > > >>>>>>>If we use call_rcu() (not considering, for now, the problem that
> > > >>>>>>>Gleb pointed out) instead of synchronize_rcu(), should we still
> > > >>>>>>>ensure this?
> > > >>>>>>The problem is that we should ensure this, so using call_rcu is not
> > > >>>>>>possible (even not considering the memory allocation problem).
> > > >>>>>>
> > > >>>>>Not changing current behaviour is certainly safer, but I am still
> > > >>>>>not 100% convinced we have to ensure this.
> > > >>>>>
> > > >>>>>Suppose guest does:
> > > >>>>>
> > > >>>>>1: change msi interrupt by writing to pci register
> > > >>>>>2: read the pci register to flush the write
> > > >>>>>3: zero idt
> > > >>>>>
> > > >>>>>I am pretty certain that this code can get interrupt after step 2 on
> > > >>>>>real HW, but I cannot tell if guest can rely on it to be delivered
> > > >>>>>exactly after read instruction or it can be delayed by couple of
> > > >>>>>instructions. Seems to me it would be fragile for an OS to depend
> > > >>>>>on this behaviour. AFAIK Linux does not.
> > > >>>>>
> > > >>>>Linux is safe, it does interrupt migration from within the interrupt
> > > >>>>handler. If you do that before the device-specific EOI, you won't
> > > >>>>get another interrupt until programming the MSI is complete.
> > > >>>>
> > > >>>>Is virtio safe? IIRC it can post multiple interrupts without guest
> > > >>>>acks.
> > > >>>>
> > > >>>>Using call_rcu() is a better solution than srcu IMO. Less code
> > > >>>>changes, consistently faster.
> > > >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> > > >>>
> > > >>>
> > > >>Shouldn't it work with old userspace too? Maybe I misunderstood your
> > > >>intent.
> > > >Zhanghaoyu said that the problem mostly hurts in real-time telecom
> > > >environment, so I propose how he can fix the problem in his specific
> > > >environment. It will not fix older userspace obviously, but kernel
> > > >fix will also require kernel update and usually updating userspace
> > > >is easier.
> > > >
> > > >
> > > Isn't the latency due to interrupt migration causing long
> > > synchronize_rcu()s? How does KVM_SIGNAL_MSI help?
> > >
> > If MSI is delivered using KVM_SIGNAL_MSI as opposed to via an entry in
> > irq routing table, changing MSI configuration should not cause update to
> > irq routing table (not saying this is what happens with current QEMU, but
> > theoretically there is no reason to update routing table in this case).
> >
> > --
> > 			Gleb.
>
> Unfortunately all high performance users (vhost net,
> vhost scsi, virtio-blk data plane, vfio) switched to using
> eventfd.
>
Right :(

> KVM_SIGNAL_MSI is used as a simple mechanism to avoid routing
> table hassles e.g. for hotplug MSIs.
>
--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 12:10:40PM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 12:09, Gleb Natapov ha scritto:
> > > - if there are no callbacks, but there are readers, synchronize_srcu
> > > busy-loops for some time checking if the readers complete. After a
> > > while (20 us for synchronize_srcu, 120 us for
> > > synchronize_srcu_expedited) it gives up and starts using a workqueue to
> > > poll every millisecond. This should never happen unless
> >
> > Unless what ? :) Unless reader is scheduled out?
>
> Yes. Or unless my brain is scheduled out in the middle of a sentence.
>
So we will have to disable preemption in a reader to prevent big latencies
for a writer, no?

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:
> On 11/28/2013 01:02 PM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> >>On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> >>>On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> >>>>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >>>>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>>>>>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >>>>>>>>>Without synchronize_rcu you could have
> >>>>>>>>>
> >>>>>>>>>    VCPU writes to routing table
> >>>>>>>>>                                   e = entry from IRQ routing table
> >>>>>>>>>    kvm_irq_routing_update(kvm, new);
> >>>>>>>>>    VCPU resumes execution
> >>>>>>>>>                                   kvm_set_msi_irq(e, &irq);
> >>>>>>>>>                                   kvm_irq_delivery_to_apic_fast();
> >>>>>>>>>
> >>>>>>>>>where the entry is stale but the VCPU has already resumed execution.
> >>>>>>>>>
> >>>>>>>If we use call_rcu() (not considering, for now, the problem that Gleb
> >>>>>>>pointed out) instead of synchronize_rcu(), should we still ensure this?
> >>>>>>The problem is that we should ensure this, so using call_rcu is not
> >>>>>>possible (even not considering the memory allocation problem).
> >>>>>>
> >>>>>Not changing current behaviour is certainly safer, but I am still not
> >>>>>100% convinced we have to ensure this.
> >>>>>
> >>>>>Suppose guest does:
> >>>>>
> >>>>>1: change msi interrupt by writing to pci register
> >>>>>2: read the pci register to flush the write
> >>>>>3: zero idt
> >>>>>
> >>>>>I am pretty certain that this code can get interrupt after step 2 on
> >>>>>real HW, but I cannot tell if guest can rely on it to be delivered
> >>>>>exactly after read instruction or it can be delayed by couple of
> >>>>>instructions. Seems to me it would be fragile for an OS to depend on
> >>>>>this behaviour. AFAIK Linux does not.
> >>>>>
> >>>>Linux is safe, it does interrupt migration from within the interrupt
> >>>>handler. If you do that before the device-specific EOI, you won't
> >>>>get another interrupt until programming the MSI is complete.
> >>>>
> >>>>Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> >>>>
> >>>>Using call_rcu() is a better solution than srcu IMO. Less code
> >>>>changes, consistently faster.
> >>>Why not fix userspace to use KVM_SIGNAL_MSI instead?
> >>>
> >>>
> >>Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
> >Zhanghaoyu said that the problem mostly hurts in real-time telecom
> >environment, so I propose how he can fix the problem in his specific
> >environment. It will not fix older userspace obviously, but kernel
> >fix will also require kernel update and usually updating userspace
> >is easier.
> >
> >
> Isn't the latency due to interrupt migration causing long
> synchronize_rcu()s? How does KVM_SIGNAL_MSI help?
>
If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in the
irq routing table, changing the MSI configuration should not cause an update
to the irq routing table (not saying this is what happens with current QEMU,
but theoretically there is no reason to update the routing table in this
case).

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 11:40:06AM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 11:16, Avi Kivity ha scritto:
> >> The QRCU I linked would work great latency-wise (it has roughly the same
> >> latency of an rwsem but readers are lock-free). However, the locked
> >> operations in the read path would hurt because of cache misses, so it's
> >> not good either.
> >
> > I guess srcu would work. Do you know what's the typical write-side
> > latency there? (in terms of what it waits for, not nanoseconds).
>
> If there's no concurrent reader, it's zero if I read the code right.
> Otherwise it depends:
>
> - if there are many callbacks, only 10 of them are processed per
> millisecond. But unless there are concurrent synchronize_srcu calls
> there should not be any callback at all. If all VCPUs were to furiously
> change the MSIs, the latency could go up to #vcpu/10 milliseconds.
>
> - if there are no callbacks, but there are readers, synchronize_srcu
> busy-loops for some time checking if the readers complete. After a
> while (20 us for synchronize_srcu, 120 us for
> synchronize_srcu_expedited) it gives up and starts using a workqueue to
> poll every millisecond. This should never happen unless
>
Unless what ? :) Unless reader is scheduled out?

> So, given the very restricted usage this SRCU would have, we probably
> can expect synchronize_srcu_expedited to finish its job in the
> busy-looping phase, and 120 us should be the expected maximum
> latency---more likely to be an order of magnitude smaller, and in very
> rare cases higher.
>
> Paolo

--
			Gleb.
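[Editorial aside] The dedicated-SRCU approach Paolo describes above amounts to giving the irq routing table its own grace-period domain, so the writer only waits for readers of that table rather than for a global RCU grace period. A rough, hedged sketch of what this could look like in kernel-style C — the field and helper names (`irq_srcu`, `deliver_from_table`) are illustrative assumptions, not the code that was eventually merged:

```c
/* Sketch only: a dedicated SRCU domain for the irq routing table.
 * Names are illustrative; this is not the actual upstream patch. */

struct kvm {
	/* ... */
	struct srcu_struct irq_srcu;   /* init_srcu_struct() at VM creation */
	struct kvm_irq_routing_table *irq_routing;
	struct mutex irq_lock;
};

/* Reader side: pin the current table while delivering an MSI. */
static int deliver_msi(struct kvm *kvm, u32 gsi)
{
	struct kvm_irq_routing_table *t;
	int idx, ret;

	idx = srcu_read_lock(&kvm->irq_srcu);
	t = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
	ret = deliver_from_table(t, gsi);   /* assumed helper */
	srcu_read_unlock(&kvm->irq_srcu, idx);
	return ret;
}

/* Writer side: publish the new table, then wait only for readers of
 * this SRCU domain.  synchronize_srcu_expedited() is what bounds the
 * write-side latency to the ~120us busy-loop phase discussed above,
 * instead of a full global synchronize_rcu(). */
static void update_routing(struct kvm *kvm,
			   struct kvm_irq_routing_table *new)
{
	struct kvm_irq_routing_table *old;

	mutex_lock(&kvm->irq_lock);
	old = kvm->irq_routing;
	rcu_assign_pointer(kvm->irq_routing, new);
	mutex_unlock(&kvm->irq_lock);

	synchronize_srcu_expedited(&kvm->irq_srcu);
	kfree(old);
}
```

Because delivery also runs inside the SRCU read-side critical section, this keeps the guarantee debated in the thread: after `update_routing()` returns, no interrupt can still be delivered through the old table.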
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:
> On 11/28/2013 12:11 PM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> >>On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >>>On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>>>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >>>>>>>Without synchronize_rcu you could have
> >>>>>>>
> >>>>>>>    VCPU writes to routing table
> >>>>>>>                                   e = entry from IRQ routing table
> >>>>>>>    kvm_irq_routing_update(kvm, new);
> >>>>>>>    VCPU resumes execution
> >>>>>>>                                   kvm_set_msi_irq(e, &irq);
> >>>>>>>                                   kvm_irq_delivery_to_apic_fast();
> >>>>>>>
> >>>>>>>where the entry is stale but the VCPU has already resumed execution.
> >>>>>>>
> >>>>>If we use call_rcu() (not considering, for now, the problem that Gleb
> >>>>>pointed out) instead of synchronize_rcu(), should we still ensure this?
> >>>>The problem is that we should ensure this, so using call_rcu is not
> >>>>possible (even not considering the memory allocation problem).
> >>>>
> >>>Not changing current behaviour is certainly safer, but I am still not 100%
> >>>convinced we have to ensure this.
> >>>
> >>>Suppose guest does:
> >>>
> >>>1: change msi interrupt by writing to pci register
> >>>2: read the pci register to flush the write
> >>>3: zero idt
> >>>
> >>>I am pretty certain that this code can get interrupt after step 2 on real
> >>>HW, but I cannot tell if guest can rely on it to be delivered exactly
> >>>after read instruction or it can be delayed by couple of instructions.
> >>>Seems to me it would be fragile for an OS to depend on this behaviour.
> >>>AFAIK Linux does not.
> >>>
> >>Linux is safe, it does interrupt migration from within the interrupt
> >>handler. If you do that before the device-specific EOI, you won't
> >>get another interrupt until programming the MSI is complete.
> >>
> >>Is virtio safe? IIRC it can post multiple interrupts without guest acks.
> >>
> >>Using call_rcu() is a better solution than srcu IMO. Less code
> >>changes, consistently faster.
> >Why not fix userspace to use KVM_SIGNAL_MSI instead?
> >
> >
> Shouldn't it work with old userspace too? Maybe I misunderstood your intent.
Zhanghaoyu said that the problem mostly hurts in a real-time telecom
environment, so I propose how he can fix the problem in his specific
environment. It will not fix older userspace obviously, but a kernel
fix will also require a kernel update, and usually updating userspace
is easier.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:
> On 11/28/2013 11:19 AM, Gleb Natapov wrote:
> >On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> >>Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >>>>>Without synchronize_rcu you could have
> >>>>>
> >>>>>    VCPU writes to routing table
> >>>>>                                   e = entry from IRQ routing table
> >>>>>    kvm_irq_routing_update(kvm, new);
> >>>>>    VCPU resumes execution
> >>>>>                                   kvm_set_msi_irq(e, &irq);
> >>>>>                                   kvm_irq_delivery_to_apic_fast();
> >>>>>
> >>>>>where the entry is stale but the VCPU has already resumed execution.
> >>>>>
> >>>If we use call_rcu() (not considering, for now, the problem that Gleb
> >>>pointed out) instead of synchronize_rcu(), should we still ensure this?
> >>The problem is that we should ensure this, so using call_rcu is not
> >>possible (even not considering the memory allocation problem).
> >>
> >Not changing current behaviour is certainly safer, but I am still not 100%
> >convinced we have to ensure this.
> >
> >Suppose guest does:
> >
> >1: change msi interrupt by writing to pci register
> >2: read the pci register to flush the write
> >3: zero idt
> >
> >I am pretty certain that this code can get interrupt after step 2 on real
> >HW, but I cannot tell if guest can rely on it to be delivered exactly after
> >read instruction or it can be delayed by couple of instructions. Seems to
> >me it would be fragile for an OS to depend on this behaviour. AFAIK Linux
> >does not.
> >
>
> Linux is safe, it does interrupt migration from within the interrupt
> handler. If you do that before the device-specific EOI, you won't
> get another interrupt until programming the MSI is complete.
>
> Is virtio safe? IIRC it can post multiple interrupts without guest acks.
>
> Using call_rcu() is a better solution than srcu IMO. Less code
> changes, consistently faster.
Why not fix userspace to use KVM_SIGNAL_MSI instead?

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 10:29:36AM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 10:19, Gleb Natapov ha scritto:
> > Not changing current behaviour is certainly safer, but I am still not 100%
> > convinced we have to ensure this.
> >
> > Suppose guest does:
> >
> > 1: change msi interrupt by writing to pci register
> > 2: read the pci register to flush the write
> > 3: zero idt
> >
> > I am pretty certain that this code can get interrupt after step 2 on real
> > HW, but I cannot tell if guest can rely on it to be delivered exactly
> > after read instruction or it can be delayed by couple of instructions.
>
> I agree it's fragile, but if a dedicated SRCU can meet the requirements
> (possibly with synchronize_srcu_expedited), I prefer not to break it.
>
Definitely. If we can find a reasonable solution that preserves current
semantics, that's the preferable path to take.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 09:14:22AM +, Zhanghaoyu (A) wrote:
> > > > > > No, this would be exactly the same code that is running now:
> > > > > >
> > > > > >     mutex_lock(&kvm->irq_lock);
> > > > > >     old = kvm->irq_routing;
> > > > > >     kvm_irq_routing_update(kvm, new);
> > > > > >     mutex_unlock(&kvm->irq_lock);
> > > > > >
> > > > > >     synchronize_rcu();
> > > > > >     kfree(old);
> > > > > >     return 0;
> > > > > >
> > > > > > Except that the kfree would run in the call_rcu kernel thread
> > > > > > instead of the vcpu thread. But the vcpus already see the new
> > > > > > routing table after the rcu_assign_pointer that is in
> > > > > > kvm_irq_routing_update.
> > > > >
> > > > > I understood the proposal was also to eliminate the
> > > > > synchronize_rcu(), so while new interrupts would see the new
> > > > > routing table, interrupts already in flight could pick up the old one.
> > > > Isn't that always the case with RCU? (See my answer above: "the
> > > > vcpus already see the new routing table after the rcu_assign_pointer
> > > > that is in kvm_irq_routing_update").
> > > With synchronize_rcu(), you have the additional guarantee that any
> > > parallel accesses to the old routing table have completed. Since we
> > > also trigger the irq from rcu context, you know that after
> > > synchronize_rcu() you won't get any interrupts to the old destination
> > > (see kvm_set_irq_inatomic()).
> > We do not have this guaranty for other vcpus that do not call
> > synchronize_rcu(). They may still use outdated routing table while a
> > vcpu or iothread that performed table update sits in synchronize_rcu().
> >
> >
> >Consider this guest code:
> >
> >    write msi entry, directing the interrupt away from this vcpu
> >    nop
> >    memset(&idt, 0, sizeof(idt));
> >
> >Currently, this code will never trigger a triple fault. With the change to
> >call_rcu(), it may.
> >
> >Now it may be that the guest does not expect this to work (PCI writes are
> >posted; and interrupts can be delayed indefinitely by the pci fabric),
> >but we don't know if there's a path that guarantees the guest something that
> >we're taking away with this change.
>
> In native environment, if a CPU's LAPIC's IRR and ISR have been pending many
> interrupts, then OS perform zeroing this CPU's IDT before receiving
> interrupts, will the same problem happen?
>
This is just an example. OS can ensure that there is no other pending
interrupt by some other means.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:
> Il 28/11/2013 07:27, Zhanghaoyu (A) ha scritto:
> >> >Without synchronize_rcu you could have
> >> >
> >> >    VCPU writes to routing table
> >> >                                   e = entry from IRQ routing table
> >> >    kvm_irq_routing_update(kvm, new);
> >> >    VCPU resumes execution
> >> >                                   kvm_set_msi_irq(e, &irq);
> >> >                                   kvm_irq_delivery_to_apic_fast();
> >> >
> >> >where the entry is stale but the VCPU has already resumed execution.
> >> >
> > If we use call_rcu() (not considering, for now, the problem that Gleb
> > pointed out) instead of synchronize_rcu(), should we still ensure this?
>
> The problem is that we should ensure this, so using call_rcu is not
> possible (even not considering the memory allocation problem).
>
Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose guest does:

1: change msi interrupt by writing to pci register
2: read the pci register to flush the write
3: zero idt

I am pretty certain that this code can get an interrupt after step 2 on real
HW, but I cannot tell if a guest can rely on it being delivered exactly after
the read instruction, or whether it can be delayed by a couple of
instructions. Seems to me it would be fragile for an OS to depend on this
behaviour. AFAIK Linux does not.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:05:37PM +0200, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2013 at 02:56:10PM +0200, Gleb Natapov wrote:
> > On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote:
> > > Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
> > > > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread
> > > > will IOCTL return to QEMU from hypervisor, then vcpu thread ask the
> > > > hypervisor to update the irq routing table,
> > > > in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread
> > > > is blocked for so much time to wait RCU grace period, and during this
> > > > period, this vcpu cannot provide service to VM,
> > > > so those interrupts delivered to this vcpu cannot be handled in time,
> > > > and the apps running on this vcpu cannot be serviced too.
> > > > It's unacceptable in some real-time scenario, e.g. telecom.
> > > >
> > > > So, I want to create a single workqueue for each VM, to asynchronously
> > > > perform the RCU synchronization for the irq routing table,
> > > > and let the vcpu thread return and VMENTRY to service VM immediately,
> > > > no more need to block to wait for the RCU grace period.
> > > > And, I have implemented a raw patch, took a test in our telecom
> > > > environment, above problem disappeared.
> > >
> > > I don't think a workqueue is even needed. You just need to use call_rcu
> > > to free "old" after releasing kvm->irq_lock.
> > >
> > > What do you think?
> > >
> > It should be rate limited somehow. Since it is guest triggerable, a guest
> > may cause the host to allocate a lot of memory this way.
>
> The checks in __call_rcu() should handle this, I think. These keep a per-CPU
> counter, which can be adjusted via rcutree.blimit, which defaults
> to taking evasive action if more than 10K callbacks are waiting on a
> given CPU.
>
Documentation/RCU/checklist.txt has:

	An especially important property of the synchronize_rcu()
	primitive is that it automatically self-limits: if grace periods
	are delayed for whatever reason, then the synchronize_rcu()
	primitive will correspondingly delay updates. In contrast,
	code using call_rcu() should explicitly limit update rate in
	cases where grace periods are delayed, as failing to do so can
	result in excessive realtime latencies or even OOM conditions.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 06:27:47PM +0200, Avi Kivity wrote:
> On 11/26/2013 06:24 PM, Gleb Natapov wrote:
> >On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
> >>Il 26/11/2013 16:03, Gleb Natapov ha scritto:
> >>>>>>>>>I understood the proposal was also to eliminate the
> >>>>>>>>>synchronize_rcu(),
> >>>>>>>>>so while new interrupts would see the new routing table, interrupts
> >>>>>>>>>already in flight could pick up the old one.
> >>>>>>>Isn't that always the case with RCU? (See my answer above: "the vcpus
> >>>>>>>already see the new routing table after the rcu_assign_pointer that is
> >>>>>>>in kvm_irq_routing_update").
> >>>>>With synchronize_rcu(), you have the additional guarantee that any
> >>>>>parallel accesses to the old routing table have completed. Since we
> >>>>>also trigger the irq from rcu context, you know that after
> >>>>>synchronize_rcu() you won't get any interrupts to the old
> >>>>>destination (see kvm_set_irq_inatomic()).
> >>>We do not have this guarantee for other vcpus that do not call
> >>>synchronize_rcu(). They may still use outdated routing table while a vcpu
> >>>or iothread that performed table update sits in synchronize_rcu().
> >>Avi's point is that, after the VCPU resumes execution, you know that no
> >>interrupt will be sent to the old destination because
> >>kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
> >>also called within the RCU read-side critical section.
> >>
> >>Without synchronize_rcu you could have
> >>
> >>    VCPU writes to routing table
> >>                                   e = entry from IRQ routing table
> >>    kvm_irq_routing_update(kvm, new);
> >>    VCPU resumes execution
> >>                                   kvm_set_msi_irq(e, &irq);
> >>                                   kvm_irq_delivery_to_apic_fast();
> >>
> >>where the entry is stale but the VCPU has already resumed execution.
> >>
> >So how is it different from what we have now:
> >
> >disable_irq()
> >VCPU writes to routing table
> >                                   e = entry from IRQ routing table
> >                                   kvm_set_msi_irq(e, &irq);
> >                                   kvm_irq_delivery_to_apic_fast();
> >kvm_irq_routing_update(kvm, new);
> >synchronize_rcu()
> >VCPU resumes execution
> >enable_irq()
> >receive stale irq
> >
> >
> Suppose the guest did not disable_irq() and enable_irq(), but
> instead had a pci read where you have the enable_irq(). After the
> read you cannot have a stale irq (assuming the read flushes the irq
> all the way to the APIC).
There may still be a race between the PCI read and an MSI already registered
in the IRR. I do not believe such a read can undo IRR changes.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 05:28:23PM +0100, Paolo Bonzini wrote:
> Il 26/11/2013 17:24, Gleb Natapov ha scritto:
> >>    VCPU writes to routing table
> >>                                   e = entry from IRQ routing table
> >>    kvm_irq_routing_update(kvm, new);
> >>    VCPU resumes execution
> >>                                   kvm_set_msi_irq(e, &irq);
> >>                                   kvm_irq_delivery_to_apic_fast();
> >>
> >> where the entry is stale but the VCPU has already resumed execution.
> >
> > So how is it different from what we have now:
> >
> > disable_irq()
> > VCPU writes to routing table
> >                                   e = entry from IRQ routing table
> >                                   kvm_set_msi_irq(e, &irq);
> >                                   kvm_irq_delivery_to_apic_fast();
> > kvm_irq_routing_update(kvm, new);
> > synchronize_rcu()
> > VCPU resumes execution
> > enable_irq()
> > receive stale irq
>
> Adding a "disable/enable IRQs" looks like a relatively big change. But
> perhaps it's not for some reason I'm missing.
>
You will receive a stale irq even without disable/enable IRQs, of course.
I put it there so that the guest would have a chance to do stupid things
like zeroing the idt before receiving the interrupt, but on real HW the
timing is different from what we emulate, so the same race may happen even
without disable/enable IRQs.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:
> Il 26/11/2013 16:03, Gleb Natapov ha scritto:
> >>>> I understood the proposal was also to eliminate the synchronize_rcu(),
> >>>> so while new interrupts would see the new routing table, interrupts
> >>>> already in flight could pick up the old one.
> >>> Isn't that always the case with RCU? (See my answer above: "the vcpus
> >>> already see the new routing table after the rcu_assign_pointer that is
> >>> in kvm_irq_routing_update").
> >> With synchronize_rcu(), you have the additional guarantee that any
> >> parallel accesses to the old routing table have completed. Since we
> >> also trigger the irq from rcu context, you know that after
> >> synchronize_rcu() you won't get any interrupts to the old
> >> destination (see kvm_set_irq_inatomic()).
> > We do not have this guarantee for other vcpus that do not call
> > synchronize_rcu(). They may still use outdated routing table while a vcpu
> > or iothread that performed table update sits in synchronize_rcu().
>
> Avi's point is that, after the VCPU resumes execution, you know that no
> interrupt will be sent to the old destination because
> kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
> also called within the RCU read-side critical section.
>
> Without synchronize_rcu you could have
>
>     VCPU writes to routing table
>                                    e = entry from IRQ routing table
>     kvm_irq_routing_update(kvm, new);
>     VCPU resumes execution
>                                    kvm_set_msi_irq(e, &irq);
>                                    kvm_irq_delivery_to_apic_fast();
>
> where the entry is stale but the VCPU has already resumed execution.
>
So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
                                   e = entry from IRQ routing table
                                   kvm_set_msi_irq(e, &irq);
                                   kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:
> On 11/26/2013 04:46 PM, Paolo Bonzini wrote:
> >Il 26/11/2013 15:36, Avi Kivity ha scritto:
> >>> No, this would be exactly the same code that is running now:
> >>>
> >>>     mutex_lock(&kvm->irq_lock);
> >>>     old = kvm->irq_routing;
> >>>     kvm_irq_routing_update(kvm, new);
> >>>     mutex_unlock(&kvm->irq_lock);
> >>>
> >>>     synchronize_rcu();
> >>>     kfree(old);
> >>>     return 0;
> >>>
> >>> Except that the kfree would run in the call_rcu kernel thread instead of
> >>> the vcpu thread. But the vcpus already see the new routing table after
> >>> the rcu_assign_pointer that is in kvm_irq_routing_update.
> >>
> >>I understood the proposal was also to eliminate the synchronize_rcu(),
> >>so while new interrupts would see the new routing table, interrupts
> >>already in flight could pick up the old one.
> >Isn't that always the case with RCU? (See my answer above: "the vcpus
> >already see the new routing table after the rcu_assign_pointer that is
> >in kvm_irq_routing_update").
>
> With synchronize_rcu(), you have the additional guarantee that any
> parallel accesses to the old routing table have completed. Since we
> also trigger the irq from rcu context, you know that after
> synchronize_rcu() you won't get any interrupts to the old
> destination (see kvm_set_irq_inatomic()).
We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a
vcpu or iothread that performed the table update sits in synchronize_rcu().

--
			Gleb.
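[Editorial aside] The alternative being debated — dropping the synchronize_rcu() and deferring only the kfree() — would look roughly as follows. This is a hedged sketch, not actual kernel code; the container layout and callback name are assumptions made for illustration:

```c
/* Sketch of the call_rcu() variant under discussion: the vcpu thread
 * publishes the new table and returns immediately; the old table is
 * freed later from RCU callback context.  Names are illustrative. */

struct kvm_irq_routing_table {
	struct rcu_head rcu;
	/* ... routing entries ... */
};

static void free_routing_table(struct rcu_head *head)
{
	kfree(container_of(head, struct kvm_irq_routing_table, rcu));
}

static void update_routing_async(struct kvm *kvm,
				 struct kvm_irq_routing_table *new)
{
	struct kvm_irq_routing_table *old;

	mutex_lock(&kvm->irq_lock);
	old = kvm->irq_routing;
	rcu_assign_pointer(kvm->irq_routing, new);
	mutex_unlock(&kvm->irq_lock);

	/* No grace-period wait here: readers may keep using 'old' until
	 * their read-side critical sections end.  This is exactly the
	 * stale-delivery window Avi and Paolo object to, and -- as Gleb
	 * notes -- it also needs explicit rate limiting because the
	 * update is guest triggerable. */
	call_rcu(&old->rcu, free_routing_table);
}
```

This makes the trade-off concrete: the writer's latency disappears, but the "no interrupts to the old destination after the update returns" guarantee disappears with it.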
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 01:47:03PM +0100, Paolo Bonzini wrote:
> Il 26/11/2013 13:40, Zhanghaoyu (A) ha scritto:
> > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will
> > IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor
> > to update the irq routing table,
> > in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is
> > blocked for so much time to wait RCU grace period, and during this period,
> > this vcpu cannot provide service to VM,
> > so those interrupts delivered to this vcpu cannot be handled in time, and
> > the apps running on this vcpu cannot be serviced too.
> > It's unacceptable in some real-time scenario, e.g. telecom.
> >
> > So, I want to create a single workqueue for each VM, to asynchronously
> > perform the RCU synchronization for the irq routing table,
> > and let the vcpu thread return and VMENTRY to service VM immediately, no
> > more need to block to wait for the RCU grace period.
> > And, I have implemented a raw patch, took a test in our telecom
> > environment, above problem disappeared.
>
> I don't think a workqueue is even needed. You just need to use call_rcu
> to free "old" after releasing kvm->irq_lock.
>
> What do you think?
>
It should be rate limited somehow. Since it is guest triggerable, a guest
may cause the host to allocate a lot of memory this way.

Is this about MSI interrupt affinity? IIRC changing INT interrupt affinity
should not trigger a kvm_set_irq_routing update. If this is about MSI only,
then what about changing userspace to use KVM_SIGNAL_MSI for MSI injection?

--
			Gleb.
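[Editorial aside] KVM_SIGNAL_MSI, suggested here, lets userspace inject an MSI directly from the message address/data pair, with no GSI routing entry involved — so reprogramming the MSI never touches the routing table. A minimal hedged sketch of how a userspace VMM might call it (the surrounding VM fd setup is assumed; the host kernel must advertise KVM_CAP_SIGNAL_MSI):

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Sketch: inject one MSI into a VM without an irq routing table entry.
 * 'vmfd' is a VM file descriptor from KVM_CREATE_VM; msi_addr/msi_data
 * are the MSI message the guest programmed into the device. */
static int inject_msi(int vmfd, unsigned long long msi_addr,
                      unsigned int msi_data)
{
    struct kvm_msi msi = {
        .address_lo = (__u32)msi_addr,
        .address_hi = (__u32)(msi_addr >> 32),
        .data       = msi_data,
        /* .flags and padding left zero */
    };

    /* Per the KVM API docs: returns 1 if the MSI was delivered,
     * 0 if the guest blocked it, negative on error. */
    return ioctl(vmfd, KVM_SIGNAL_MSI, &msi);
}
```

As the thread notes, this helps paths that inject MSIs via ioctl, but not the high-performance users (vhost, vfio) that inject through irqfd/eventfd, which do consume routing table entries.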
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 02:48:10PM +0200, Gleb Natapov wrote:
> On Tue, Nov 26, 2013 at 12:40:36PM +, Zhanghaoyu (A) wrote:
> > Hi all,
> >
> > When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will
> > IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor
> > to update the irq routing table,
> Why vcpu thread ask the hypervisor to update the irq routing table on
> pcpu migration?
>
Ah, I misread. Guest sets irq smp_affinity, not host.

--
			Gleb.
Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table
On Tue, Nov 26, 2013 at 12:40:36PM +, Zhanghaoyu (A) wrote:
> Hi all,
>
> When guest set irq smp_affinity, VMEXIT occurs, then the vcpu thread will
> IOCTL return to QEMU from hypervisor, then vcpu thread ask the hypervisor
> to update the irq routing table,
Why vcpu thread ask the hypervisor to update the irq routing table on
pcpu migration?

> in kvm_set_irq_routing, synchronize_rcu is called, current vcpu thread is
> blocked for so much time to wait RCU grace period, and during this period,
> this vcpu cannot provide service to VM,
> so those interrupts delivered to this vcpu cannot be handled in time, and
> the apps running on this vcpu cannot be serviced too.
> It's unacceptable in some real-time scenario, e.g. telecom.
>
> So, I want to create a single workqueue for each VM, to asynchronously
> perform the RCU synchronization for the irq routing table,
> and let the vcpu thread return and VMENTRY to service VM immediately, no
> more need to block to wait for the RCU grace period.
> And, I have implemented a raw patch, took a test in our telecom
> environment, above problem disappeared.
>
> Any better ideas?
>
> Thanks,
> Zhang Haoyu

--
			Gleb.
[Qemu-devel] [PATCH 2/3] pci-assign: Remove dead code for direct I/O region access from userspace
From: Jan Kiszka This feature was already deprecated back then in qemu-kvm, ie. before pci-assign went upstream. assigned_dev_ioport_rw will never be invoked with resource_fd < 0. Signed-off-by: Jan Kiszka Acked-by: Alex Williamson Signed-off-by: Gleb Natapov --- hw/i386/kvm/pci-assign.c | 56 +--- 1 file changed, 10 insertions(+), 46 deletions(-) diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c index 011764f..4e65110 100644 --- a/hw/i386/kvm/pci-assign.c +++ b/hw/i386/kvm/pci-assign.c @@ -154,55 +154,19 @@ static uint64_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region, uint64_t val = 0; int fd = dev_region->region->resource_fd; -if (fd >= 0) { -if (data) { -DEBUG("pwrite data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx - ", addr="TARGET_FMT_plx"\n", *data, size, addr, addr); -if (pwrite(fd, data, size, addr) != size) { -error_report("%s - pwrite failed %s", - __func__, strerror(errno)); -} -} else { -if (pread(fd, &val, size, addr) != size) { -error_report("%s - pread failed %s", - __func__, strerror(errno)); -val = (1UL << (size * 8)) - 1; -} -DEBUG("pread val=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx - ", addr=" TARGET_FMT_plx "\n", val, size, addr, addr); +if (data) { +DEBUG("pwrite data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx + ", addr="TARGET_FMT_plx"\n", *data, size, addr, addr); +if (pwrite(fd, data, size, addr) != size) { +error_report("%s - pwrite failed %s", __func__, strerror(errno)); } } else { -uint32_t port = addr + dev_region->u.r_baseport; - -if (data) { -DEBUG("out data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx - ", host=%x\n", *data, size, addr, port); -switch (size) { -case 1: -outb(*data, port); -break; -case 2: -outw(*data, port); -break; -case 4: -outl(*data, port); -break; -} -} else { -switch (size) { -case 1: -val = inb(port); -break; -case 2: -val = inw(port); -break; -case 4: -val = inl(port); -break; -} -DEBUG("in data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx - ", host=%x\n", val, size, addr, 
port); +if (pread(fd, &val, size, addr) != size) { +error_report("%s - pread failed %s", __func__, strerror(errno)); +val = (1UL << (size * 8)) - 1; } +DEBUG("pread val=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx + ", addr=" TARGET_FMT_plx "\n", val, size, addr, addr); } return val; } -- 1.8.4.rc3
[Qemu-devel] [PATCH 0/3] [PULL] qemu-kvm.git uq/master queue
The following changes since commit fc8ead74674b7129e8f31c2595c76658e5622197: Merge remote-tracking branch 'qemu-kvm/uq/master' into staging (2013-10-18 10:03:24 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master for you to fetch changes up to ef4cbe14342c1f63b3c754e306218f004f4e26c4: kvm: Fix uninitialized cpuid_data (2013-11-07 13:14:56 +0200) Jan Kiszka (1): pci-assign: Remove dead code for direct I/O region access from userspace Paolo Bonzini (1): KVM: x86: fix typo in KVM_GET_XCRS Stefan Weil (1): kvm: Fix uninitialized cpuid_data hw/i386/kvm/pci-assign.c | 56 +--- target-i386/kvm.c| 13 --- 2 files changed, 14 insertions(+), 55 deletions(-)
[Qemu-devel] [PATCH 1/3] KVM: x86: fix typo in KVM_GET_XCRS
From: Paolo Bonzini Only the first item of the array was ever looked at. No practical effect, but still worth fixing. Signed-off-by: Paolo Bonzini Signed-off-by: Gleb Natapov --- target-i386/kvm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 749aa09..27071e3 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -1314,8 +1314,8 @@ static int kvm_get_xcrs(X86CPU *cpu) for (i = 0; i < xcrs.nr_xcrs; i++) { /* Only support xcr0 now */ -if (xcrs.xcrs[0].xcr == 0) { -env->xcr0 = xcrs.xcrs[0].value; +if (xcrs.xcrs[i].xcr == 0) { +env->xcr0 = xcrs.xcrs[i].value; break; } } -- 1.8.4.rc3
[Qemu-devel] [PATCH 3/3] kvm: Fix uninitialized cpuid_data
From: Stefan Weil This error was reported by valgrind when running qemu-system-x86_64 with kvm: ==975== Conditional jump or move depends on uninitialised value(s) ==975==at 0x521C38: cpuid_find_entry (kvm.c:176) ==975==by 0x5235BA: kvm_arch_init_vcpu (kvm.c:686) ==975==by 0x4D5175: kvm_init_vcpu (kvm-all.c:267) ==975==by 0x45035B: qemu_kvm_cpu_thread_fn (cpus.c:858) ==975==by 0xD361E0D: start_thread (pthread_create.c:311) ==975==by 0xD65E9EC: clone (clone.S:113) ==975== Uninitialised value was created by a stack allocation ==975==at 0x5226E4: kvm_arch_init_vcpu (kvm.c:446) Instead of adding more memset calls for parts of cpuid_data, the existing calls were removed and cpuid_data is now initialized completely in one call. Signed-off-by: Stefan Weil Signed-off-by: Gleb Natapov --- target-i386/kvm.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 27071e3..1188482 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -456,11 +456,12 @@ int kvm_arch_init_vcpu(CPUState *cs) uint32_t signature[3]; int r; +memset(&cpuid_data, 0, sizeof(cpuid_data)); + cpuid_i = 0; /* Paravirtualization CPUIDs */ c = &cpuid_data.entries[cpuid_i++]; -memset(c, 0, sizeof(*c)); c->function = KVM_CPUID_SIGNATURE; if (!hyperv_enabled(cpu)) { memcpy(signature, "KVMKVMKVM\0\0\0", 12); @@ -474,7 +475,6 @@ int kvm_arch_init_vcpu(CPUState *cs) c->edx = signature[2]; c = &cpuid_data.entries[cpuid_i++]; -memset(c, 0, sizeof(*c)); c->function = KVM_CPUID_FEATURES; c->eax = env->features[FEAT_KVM]; @@ -483,13 +483,11 @@ int kvm_arch_init_vcpu(CPUState *cs) c->eax = signature[0]; c = &cpuid_data.entries[cpuid_i++]; -memset(c, 0, sizeof(*c)); c->function = HYPERV_CPUID_VERSION; c->eax = 0x1bbc; c->ebx = 0x00060001; c = &cpuid_data.entries[cpuid_i++]; -memset(c, 0, sizeof(*c)); c->function = HYPERV_CPUID_FEATURES; if (cpu->hyperv_relaxed_timing) { c->eax |= HV_X64_MSR_HYPERCALL_AVAILABLE; @@ -500,7 +498,6 @@ int 
kvm_arch_init_vcpu(CPUState *cs) } c = &cpuid_data.entries[cpuid_i++]; -memset(c, 0, sizeof(*c)); c->function = HYPERV_CPUID_ENLIGHTMENT_INFO; if (cpu->hyperv_relaxed_timing) { c->eax |= HV_X64_RELAXED_TIMING_RECOMMENDED; @@ -511,13 +508,11 @@ int kvm_arch_init_vcpu(CPUState *cs) c->ebx = cpu->hyperv_spinlock_attempts; c = &cpuid_data.entries[cpuid_i++]; -memset(c, 0, sizeof(*c)); c->function = HYPERV_CPUID_IMPLEMENT_LIMITS; c->eax = 0x40; c->ebx = 0x40; c = &cpuid_data.entries[cpuid_i++]; -memset(c, 0, sizeof(*c)); c->function = KVM_CPUID_SIGNATURE_NEXT; memcpy(signature, "KVMKVMKVM\0\0\0", 12); c->eax = 0; -- 1.8.4.rc3
Re: [Qemu-devel] [PATCH 0/3] [PULL for 1.7?] qemu-kvm.git uq/master queue
On Thu, Nov 21, 2013 at 07:11:46PM +0100, Paolo Bonzini wrote: > Il 21/11/2013 18:38, Stefan Weil ha scritto: > >> > Jan Kiszka (1): > >> > pci-assign: Remove dead code for direct I/O region access from > >> > userspace > >> > > >> > Paolo Bonzini (1): > >> > KVM: x86: fix typo in KVM_GET_XCRS > >> > > >> > Stefan Weil (1): > >> > kvm: Fix uninitialized cpuid_data > >> > > >> > hw/i386/kvm/pci-assign.c | 56 > >> > +--- > >> > target-i386/kvm.c| 13 --- > >> > 2 files changed, 14 insertions(+), 55 deletions(-) > >> > > > I think these patches should be included in QEMU 1.7, too. > > Yes. > Yeah, forgot to add 1.7 to the subject. -- Gleb.
Re: [Qemu-devel] question about VM kernel parameter idle=
On Thu, Nov 21, 2013 at 09:01:39AM +0200, Michael S. Tsirkin wrote: > On Thu, Nov 21, 2013 at 03:45:28AM +, Zhanghaoyu (A) wrote: > > Hi, all > > > > What's the difference of the linux guest kernel parameter > > idle=, especially in performance? > > > > Taking the performance into account, which one is best? > > > > In my opinion, if the number of all VMs' vcpus is far more than that of > > pcpus, e.g. SPECVirt test, idle=halt is better for server's total > > throughput, > > otherwise, e.g. in some CT scenario, the number of total vcpus is not > > greater than that of pcpus, idle=poll is better for server's total > > throughput, > > because of less latency and VMEXIT. > > Makes sense overall. > > > linux-3.9 and above, idle=mwait is not recommended. > > > > Thanks, > > Zhang Haoyu > > Does it actually have effect? I didn't think it would. The mwait instruction is never exposed to the guest. With idle=mwait it will likely fall back to halt silently. -- Gleb.
Re: [Qemu-devel] [PATCH for-1.7] target-i386: Fix build by providing stub kvm_arch_get_supported_cpuid()
On Wed, Nov 13, 2013 at 12:27:10PM +1000, Richard Henderson wrote: > On 11/13/2013 08:53 AM, Paolo Bonzini wrote: > > Il 12/11/2013 19:54, Richard Henderson ha scritto: > >> For what it's worth, I think BOTH of the patches that have been posted > >> should be applied. That is, the patch that does (X || 1) -> (1 || X), > >> and the patch that adds the stub. > >> > >> Frankly I'd have thought this was obvious > > > > It's not that obvious to me. > > > > If you add the stub, the patch that reorders operands is not necessary. > > If you reorder operands, the stub is not necessary. > > > > The patch that does (X || 1) -> (1 || X) is unnecessary as a > > microoptimization, since this code basically runs once at startup. The > > code is also a little bit less clear with the reordered operands, but > > perhaps that's just me because I wrote the code that way. (Splitting > > the if in two would also make sense, and would not affect clarity). > > > > Why should both be applied? > > It's worth working around the clang missed optimization, if for nothing else > than avoiding the noise of the bugs that would otherwise be filed against the > release. > > I think it's also worthwhile to implement the kvm api in kvm-stub.c, > unnecessary or not. If you really want compile-time feedback on those that > ought to have been removed by optimization, you could elide them from the stub > file depending on ifndef __OPTIMIZE__. > Sounds like a nice compromise. -- Gleb.
Re: [Qemu-devel] [PATCH for-1.7] target-i386: Fix build by providing stub kvm_arch_get_supported_cpuid()
On Tue, Nov 12, 2013 at 02:57:49PM +0100, Paolo Bonzini wrote: > Il 12/11/2013 14:23, Gleb Natapov ha scritto: > >> If -O0 does not do that, let's move debug builds to -O1. > > > > Why not enable dce with -fdce? > > First, because clang doesn't have fine-tuned optimization options (at > least I couldn't find them and -fdce doesn't work). > -O1 then for clang. > Second, because most optimization options are no-ops at -O0 (try "-fdce > -fdump-tree-all" with GCC). > Strange. Is this by design? We can do -O1 and a bunch of "-fno-" options to disable most of the optimizations -O1 enables, but this is ugly. -- Gleb.
Re: [Qemu-devel] [PATCH for-1.7] target-i386: Fix build by providing stub kvm_arch_get_supported_cpuid()
On Tue, Nov 12, 2013 at 01:21:51PM +, Peter Maydell wrote: > (Similarly, > you can put code that's a syntax error inside #if 0, > but that won't work inside an "if (0)". The solution > is not to do that.) > That's the advantage of using "if (0)" instead of #if 0. You do not need to enable an insane number of options to check that your change does not break compilation for any of them. -- Gleb.
Re: [Qemu-devel] [PATCH for-1.7] target-i386: Fix build by providing stub kvm_arch_get_supported_cpuid()
On Tue, Nov 12, 2013 at 02:12:56PM +0100, Paolo Bonzini wrote: > Il 12/11/2013 13:16, Peter Maydell ha scritto: > > On 12 November 2013 12:09, Paolo Bonzini wrote: > >> Il 12/11/2013 12:07, Peter Maydell ha scritto: > >>> For the compiler to eliminate this we are relying on: > >>> * dead-code elimination of code following a 'break' > >>>statement in a case block > >>> * constant-folding of "something || 1" to 1 > >>> * the compiler having done enough reasoning to be > >>>sure that env is not NULL > >> > >> Yes, it's not trivial, but there are simpler ways to do it. > >> > >> For example there is no need to make sure that env is non-NULL, only to > >> see that "something || 1" is never zero and thus "if (x) y;" is just > >> "(void)x; y;". This seems easier to me than DCE after "break" which > >> clang is able to do. > > > > You seem to be trying to reason about what the compiler > > might choose to do or how it might be implemented internally. > > I'm not reasoning about that in general (I was in the context of the > message you quoted). > > I'm saying it's *reasonable* to expect that "-O0" means "reduce compile > time, make debugging produce expected results, and try (not too hard) to > not break what works at -O2". It's a simple QoI argument based on the > fact that people *will* switch back and forth between -O2 and -O0. Of > course not everything can be kept to work, since the compilers do pretty > surprising optimizations (not counting the ones that break your code of > course...). But I think a limited amount of dead code elimination > *should* be expected because most people are now preferring "if" to > "#ifdef" for compiling out code. > > If -O0 does not do that, let's move debug builds to -O1. > Why not enable dce with -fdce? -- Gleb.
Re: [Qemu-devel] [PATCH for 1.7] kvm: Fix uninitialized cpuid_data
On Wed, Nov 06, 2013 at 10:35:27PM +0100, Stefan Weil wrote: > This error was reported by valgrind when running qemu-system-x86_64 > with kvm: > > ==975== Conditional jump or move depends on uninitialised value(s) > ==975==at 0x521C38: cpuid_find_entry (kvm.c:176) > ==975==by 0x5235BA: kvm_arch_init_vcpu (kvm.c:686) > ==975==by 0x4D5175: kvm_init_vcpu (kvm-all.c:267) > ==975==by 0x45035B: qemu_kvm_cpu_thread_fn (cpus.c:858) > ==975==by 0xD361E0D: start_thread (pthread_create.c:311) > ==975==by 0xD65E9EC: clone (clone.S:113) > ==975== Uninitialised value was created by a stack allocation > ==975==at 0x5226E4: kvm_arch_init_vcpu (kvm.c:446) > > Instead of adding more memset calls for parts of cpuid_data, the existing > calls were removed and cpuid_data is now initialized completely in one > call. > > Signed-off-by: Stefan Weil Applied, thanks. > --- > > I did not check whether older versions also need this fix. > > Stefan W. > > target-i386/kvm.c |9 ++--- > 1 file changed, 2 insertions(+), 7 deletions(-) > > diff --git a/target-i386/kvm.c b/target-i386/kvm.c > index 749aa09..3ffd3e0 100644 > --- a/target-i386/kvm.c > +++ b/target-i386/kvm.c > @@ -456,11 +456,12 @@ int kvm_arch_init_vcpu(CPUState *cs) > uint32_t signature[3]; > int r; > > +memset(&cpuid_data, 0, sizeof(cpuid_data)); > + > cpuid_i = 0; > > /* Paravirtualization CPUIDs */ > c = &cpuid_data.entries[cpuid_i++]; > -memset(c, 0, sizeof(*c)); > c->function = KVM_CPUID_SIGNATURE; > if (!hyperv_enabled(cpu)) { > memcpy(signature, "KVMKVMKVM\0\0\0", 12); > @@ -474,7 +475,6 @@ int kvm_arch_init_vcpu(CPUState *cs) > c->edx = signature[2]; > > c = &cpuid_data.entries[cpuid_i++]; > -memset(c, 0, sizeof(*c)); > c->function = KVM_CPUID_FEATURES; > c->eax = env->features[FEAT_KVM]; > > @@ -483,13 +483,11 @@ int kvm_arch_init_vcpu(CPUState *cs) > c->eax = signature[0]; > > c = &cpuid_data.entries[cpuid_i++]; > -memset(c, 0, sizeof(*c)); > c->function = HYPERV_CPUID_VERSION; > c->eax = 0x1bbc; > c->ebx = 0x00060001; > > 
c = &cpuid_data.entries[cpuid_i++]; > -memset(c, 0, sizeof(*c)); > c->function = HYPERV_CPUID_FEATURES; > if (cpu->hyperv_relaxed_timing) { > c->eax |= HV_X64_MSR_HYPERCALL_AVAILABLE; > @@ -500,7 +498,6 @@ int kvm_arch_init_vcpu(CPUState *cs) > } > > c = &cpuid_data.entries[cpuid_i++]; > -memset(c, 0, sizeof(*c)); > c->function = HYPERV_CPUID_ENLIGHTMENT_INFO; > if (cpu->hyperv_relaxed_timing) { > c->eax |= HV_X64_RELAXED_TIMING_RECOMMENDED; > @@ -511,13 +508,11 @@ int kvm_arch_init_vcpu(CPUState *cs) > c->ebx = cpu->hyperv_spinlock_attempts; > > c = &cpuid_data.entries[cpuid_i++]; > -memset(c, 0, sizeof(*c)); > c->function = HYPERV_CPUID_IMPLEMENT_LIMITS; > c->eax = 0x40; > c->ebx = 0x40; > > c = &cpuid_data.entries[cpuid_i++]; > -memset(c, 0, sizeof(*c)); > c->function = KVM_CPUID_SIGNATURE_NEXT; > memcpy(signature, "KVMKVMKVM\0\0\0", 12); > c->eax = 0; > -- > 1.7.10.4 -- Gleb.
Re: [Qemu-devel] [PATCH uq/master] pci-assign: Remove dead code for direct I/O region access from userspace
On Mon, Nov 04, 2013 at 02:42:55PM +0100, Jan Kiszka wrote: > This feature was already deprecated back then in qemu-kvm, ie. before > pci-assign went upstream. assigned_dev_ioport_rw will never be invoked > with resource_fd < 0. > > Signed-off-by: Jan Kiszka Applied, thanks. > --- > hw/i386/kvm/pci-assign.c | 56 > +--- > 1 file changed, 10 insertions(+), 46 deletions(-) > > diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c > index 011764f..4e65110 100644 > --- a/hw/i386/kvm/pci-assign.c > +++ b/hw/i386/kvm/pci-assign.c > @@ -154,55 +154,19 @@ static uint64_t > assigned_dev_ioport_rw(AssignedDevRegion *dev_region, > uint64_t val = 0; > int fd = dev_region->region->resource_fd; > > -if (fd >= 0) { > -if (data) { > -DEBUG("pwrite data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", addr="TARGET_FMT_plx"\n", *data, size, addr, addr); > -if (pwrite(fd, data, size, addr) != size) { > -error_report("%s - pwrite failed %s", > - __func__, strerror(errno)); > -} > -} else { > -if (pread(fd, &val, size, addr) != size) { > -error_report("%s - pread failed %s", > - __func__, strerror(errno)); > -val = (1UL << (size * 8)) - 1; > -} > -DEBUG("pread val=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", addr=" TARGET_FMT_plx "\n", val, size, addr, addr); > +if (data) { > +DEBUG("pwrite data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > + ", addr="TARGET_FMT_plx"\n", *data, size, addr, addr); > +if (pwrite(fd, data, size, addr) != size) { > +error_report("%s - pwrite failed %s", __func__, strerror(errno)); > } > } else { > -uint32_t port = addr + dev_region->u.r_baseport; > - > -if (data) { > -DEBUG("out data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", host=%x\n", *data, size, addr, port); > -switch (size) { > -case 1: > -outb(*data, port); > -break; > -case 2: > -outw(*data, port); > -break; > -case 4: > -outl(*data, port); > -break; > -} > -} else { > -switch (size) { > -case 1: > -val = inb(port); > -break; > -case 2: > -val = inw(port); > 
-break; > -case 4: > -val = inl(port); > -break; > -} > -DEBUG("in data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", host=%x\n", val, size, addr, port); > +if (pread(fd, &val, size, addr) != size) { > +error_report("%s - pread failed %s", __func__, strerror(errno)); > +val = (1UL << (size * 8)) - 1; > } > +DEBUG("pread val=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > + ", addr=" TARGET_FMT_plx "\n", val, size, addr, addr); > } > return val; > } > -- > 1.8.1.1.298.ge7eed54 -- Gleb.
Re: [Qemu-devel] [PATCH uq/master] pci-assign: Remove dead code for direct I/O region access from userspace
Alex can you review please? On Mon, Nov 04, 2013 at 02:42:55PM +0100, Jan Kiszka wrote: > This feature was already deprecated back then in qemu-kvm, ie. before > pci-assign went upstream. assigned_dev_ioport_rw will never be invoked > with resource_fd < 0. > > Signed-off-by: Jan Kiszka > --- > hw/i386/kvm/pci-assign.c | 56 > +--- > 1 file changed, 10 insertions(+), 46 deletions(-) > > diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c > index 011764f..4e65110 100644 > --- a/hw/i386/kvm/pci-assign.c > +++ b/hw/i386/kvm/pci-assign.c > @@ -154,55 +154,19 @@ static uint64_t > assigned_dev_ioport_rw(AssignedDevRegion *dev_region, > uint64_t val = 0; > int fd = dev_region->region->resource_fd; > > -if (fd >= 0) { > -if (data) { > -DEBUG("pwrite data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", addr="TARGET_FMT_plx"\n", *data, size, addr, addr); > -if (pwrite(fd, data, size, addr) != size) { > -error_report("%s - pwrite failed %s", > - __func__, strerror(errno)); > -} > -} else { > -if (pread(fd, &val, size, addr) != size) { > -error_report("%s - pread failed %s", > - __func__, strerror(errno)); > -val = (1UL << (size * 8)) - 1; > -} > -DEBUG("pread val=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", addr=" TARGET_FMT_plx "\n", val, size, addr, addr); > +if (data) { > +DEBUG("pwrite data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > + ", addr="TARGET_FMT_plx"\n", *data, size, addr, addr); > +if (pwrite(fd, data, size, addr) != size) { > +error_report("%s - pwrite failed %s", __func__, strerror(errno)); > } > } else { > -uint32_t port = addr + dev_region->u.r_baseport; > - > -if (data) { > -DEBUG("out data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", host=%x\n", *data, size, addr, port); > -switch (size) { > -case 1: > -outb(*data, port); > -break; > -case 2: > -outw(*data, port); > -break; > -case 4: > -outl(*data, port); > -break; > -} > -} else { > -switch (size) { > -case 1: > -val = inb(port); > -break; > -case 2: > -val = 
inw(port); > -break; > -case 4: > -val = inl(port); > -break; > -} > -DEBUG("in data=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > - ", host=%x\n", val, size, addr, port); > +if (pread(fd, &val, size, addr) != size) { > +error_report("%s - pread failed %s", __func__, strerror(errno)); > +val = (1UL << (size * 8)) - 1; > } > +DEBUG("pread val=%" PRIx64 ", size=%d, e_phys=" TARGET_FMT_plx > + ", addr=" TARGET_FMT_plx "\n", val, size, addr, addr); > } > return val; > } > -- > 1.8.1.1.298.ge7eed54 -- Gleb.
Re: [Qemu-devel] [PATCH uq/master] KVM: x86: fix typo in KVM_GET_XCRS
On Thu, Oct 17, 2013 at 04:47:52PM +0200, Paolo Bonzini wrote: > Only the first item of the array was ever looked at. No > practical effect, but still worth fixing. > > Signed-off-by: Paolo Bonzini Applied, thanks. > --- > target-i386/kvm.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/target-i386/kvm.c b/target-i386/kvm.c > index 749aa09..27071e3 100644 > --- a/target-i386/kvm.c > +++ b/target-i386/kvm.c > @@ -1314,8 +1314,8 @@ static int kvm_get_xcrs(X86CPU *cpu) > > for (i = 0; i < xcrs.nr_xcrs; i++) { > /* Only support xcr0 now */ > -if (xcrs.xcrs[0].xcr == 0) { > -env->xcr0 = xcrs.xcrs[0].value; > +if (xcrs.xcrs[i].xcr == 0) { > +env->xcr0 = xcrs.xcrs[i].value; > break; > } > } > -- > 1.8.3.1 -- Gleb.
Re: [Qemu-devel] [PATCH] import kvm-unittest in QEMU source tree
On Wed, Oct 30, 2013 at 04:06:19PM -0700, Andrew Jones wrote: > On Wed, Oct 16, 2013 at 10:03:37PM +0300, Michael S. Tsirkin wrote: > > This simply imports kvm-unittest git into qemu source tree. > > > > We can next work on making make check run it > > automatically. > > > > Squashed 'kvm-unittest/' content from commit 2bc0e29 > > > > git-subtree-dir: kvm-unittest > > git-subtree-split: 2bc0e29ee4447bebcd3b90053881f59265306adc > > > > Signed-off-by: Michael S. Tsirkin > > > > --- > > > > Gleb, Paolo, any objections to this? I really want a small guest for > > running ACPI tests during make check, and kvm-unittest seems to fit the > > bill. > > > > Ability to test e.g. PCI with this in-tree would be very beneficial. > > > > Sorry for slow reply, I just noticed this mail now. > > So far qemu is a dependency for kvm-unit-tests (both x86 and arm use it), > but at some point that could change (if another arch doesn't want to use > qemu). So if that happens, then it won't make much sense for the tests to > live in qemu. Thus I'm not 100% in favor of moving them. However, if the > consensus is to move them, then I have two comments on this patch. > Do not worry, we are far from such consensus :) > 1) There are a couple pendings patches on the kvm list that tidy up the > kvm-unit-tests repo - removing lots of the files. That should be > committed first to avoid importing a bunch of files right before deleting > them. > > 2) The name shouldn't change from kvm-unit-tests to kvm-unittest. > > drew -- Gleb.
Re: [Qemu-devel] [patch 2/2] i386: pc: align gpa<->hpa on 1GB boundary
On Tue, Oct 29, 2013 at 07:21:59PM -0200, Marcelo Tosatti wrote: > > > Could add a warning to memory API: if memory region is larger than 1GB > > > and RAM is 1GB backed, and not properly aligned, warn. > > Perhaps it would be better do abort and ask user to fix configuration, > > and on hugepage allocation failure not fallback to malloc but abort and > > tell user amount of hugepages needed to run guest with hugepage backend. > > You want to back your guest with 1GB hugepages. You get 1 such page at a > time, worst case. > > You either > > 1) map the guest physical address space region (1GB sized) where > the hole is located with smaller page sizes, which must be 2MB, see *, > which requires the user to specify a different hugetlbfs mount path with > sufficient 2MB huge pages. > Why not rely on THP to do the work? > 2) move the pieces of memory which can't be 1GB mapped backed into > 1GB hugepages, and map the remaining 1GB-aligned regions to individual 1GB > pages. > > I am trying to avoid 1) as it complicates management (and fixes a bug). -- Gleb.
Re: [Qemu-devel] Prohibit Windows from running in QEMU
On Tue, Oct 29, 2013 at 02:17:10PM +0200, Michael S. Tsirkin wrote: > On Tue, Oct 29, 2013 at 01:26:59PM +0200, Gleb Natapov wrote: > > On Tue, Oct 29, 2013 at 01:13:24PM +0200, Michael S. Tsirkin wrote: > > > On Tue, Oct 29, 2013 at 10:48:07AM +0100, Peter Lieven wrote: > > > > Hi all, > > > > > > > > this question might seem a bit weird, but does anyone see a good way to > > > > avoid > > > > that Windows is able to boot inside qemu? > > > > > > > > We have defined several profiles for different operation systems and I > > > > want > > > > to avoid that someone chooses Linux and then installs Windows within > > > > a VM. Reason is licensing. > > > > > > > > Thanks, > > > > Peter > > > > > > - create a device > > > - write a linux driver > > > - if driver is not enabled crash guest > > > > > For how long to wait before a crash? > > Or don't crash, disable some other functionality, for example, you can > keep all network links down until your driver is loaded. > Unless your root is on nfs and the driver is in a module :). Anyway, if you need to write guest code there are easier ways to do it than writing a new device/driver. In an ideal world you could have used the ACPI _OS(?) method, but since most BIOSes are broken for anything but Windows, Linux reports that it is Windows too. -- Gleb.
Re: [Qemu-devel] Prohibit Windows from running in QEMU
On Tue, Oct 29, 2013 at 12:31:18PM +0100, Peter Lieven wrote: > On 29.10.2013 12:26, Gleb Natapov wrote: > >On Tue, Oct 29, 2013 at 01:13:24PM +0200, Michael S. Tsirkin wrote: > >>On Tue, Oct 29, 2013 at 10:48:07AM +0100, Peter Lieven wrote: > >>>Hi all, > >>> > >>>this question might seem a bit weird, but does anyone see a good way to > >>>avoid > >>>that Windows is able to boot inside qemu? > >>> > >>>We have defined several profiles for different operation systems and I want > >>>to avoid that someone chooses Linux and then installs Windows within > >>>a VM. Reason is licensing. > >>> > >>>Thanks, > >>>Peter > >>- create a device > >>- write a linux driver > >>- if driver is not enabled crash guest > >> > >For how long to wait before a crash? > I would not like to alter the software in the guest anyway. If this would be > required > I could force a Linux version that would search for the alternate KVM > signature > in the cpuid leaf. > You can detect certain patterns of RTC usage (Linux does not use it usually), but it is fragile since Linux allows userspace to access RTC and it may create the same usage pattern. -- Gleb.
Re: [Qemu-devel] Prohibit Windows from running in QEMU
On Tue, Oct 29, 2013 at 01:13:24PM +0200, Michael S. Tsirkin wrote: > On Tue, Oct 29, 2013 at 10:48:07AM +0100, Peter Lieven wrote: > > Hi all, > > > > this question might seem a bit weird, but does anyone see a good way to > > avoid > > that Windows is able to boot inside qemu? > > > > We have defined several profiles for different operation systems and I want > > to avoid that someone chooses Linux and then installs Windows within > > a VM. Reason is licensing. > > > > Thanks, > > Peter > > - create a device > - write a linux driver > - if driver is not enabled crash guest > For how long to wait before a crash? -- Gleb.
Re: [Qemu-devel] kvm/hyper-v: obtaining client machine id
On Mon, Oct 28, 2013 at 08:50:31AM +0100, Peter Lieven wrote: > Hi, > > do you know if it is possible to obtain the client machine id of a Windows > vServer via > a Hyper-V hypercall? I would need an information to check vServer activations > against > our KMS. > Have no idea, sorry. -- Gleb.
Re: [Qemu-devel] Prohibit Windows from running in QEMU
On Tue, Oct 29, 2013 at 11:19:54AM +0100, Paolo Bonzini wrote: > Il 29/10/2013 11:11, Peter Lieven ha scritto: > > On 29.10.2013 10:59, Paolo Bonzini wrote: > >> Il 29/10/2013 10:48, Peter Lieven ha scritto: > >>> Hi all, > >>> > >>> this question might seem a bit weird, but does anyone see a good way to > >>> avoid > >>> that Windows is able to boot inside qemu? > >>> > >>> We have defined several profiles for different operation systems and > >>> I want > >>> to avoid that someone chooses Linux and then installs Windows within > >>> a VM. Reason is licensing. > >> Patch QEMU to crash when Hyper-V extensions are enabled... > > > > I was thinking about this, but wouldn't this mean the cpu signature > > would always be "Microsoft Hv" > > and not "KVMKVMKVM\0\0\0"? > > The KVM signature should be at CPUID leaf 0x40000100. > But only recently Linux started to search for it there. -- Gleb.
Re: [Qemu-devel] [PATCH] kvm-unittest: fix build with gcc 4.3.X and older
On Thu, Oct 17, 2013 at 12:55:16PM +0200, Paolo Bonzini wrote: > Il 17/10/2013 08:27, Gleb Natapov ha scritto: > > On Wed, Oct 16, 2013 at 10:46:53PM +0300, Michael S. Tsirkin wrote: > >> Old GCC didn't let you reference variable by > >> number if it is listed with a specific register > >> constraint, on the assumption you can just > >> use the register name explicitly. > >> > >> Build fails with errors like this: > >> a.c:6: error: invalid 'asm': invalid operand code 'd' > >> > > Is it worth to support such ancient compiler? Nobody complained till > > now. BTW with your patch I still cannot compile with 4.2: > > > > x86/s3.c: In function 'main': > > x86/s3.c:145: error: inconsistent operand constraints in an 'asm' > > > >> To fix, let's just use %eax %al etc. > >> > > Only %d0 does not work and dropping "d" fixes it since compiler can > > figure out correct register from variable size. The patch bellow fixes > > compilation for 4.2. > > > > diff --git a/lib/x86/pci.c b/lib/x86/pci.c > > index f95cd88..231668a 100644 > > --- a/lib/x86/pci.c > > +++ b/lib/x86/pci.c > > @@ -3,13 +3,13 @@ > > > > static void outl(unsigned short port, unsigned val) > > { > > -asm volatile("outl %d0, %w1" : : "a"(val), "Nd"(port)); > > +asm volatile("outl %0, %w1" : : "a"(val), "Nd"(port)); > > } > > > > static unsigned inl(unsigned short port) > > { > > unsigned data; > > -asm volatile("inl %w1, %d0" : "=a"(data) : "Nd"(port)); > > +asm volatile("inl %w1, %0" : "=a"(data) : "Nd"(port)); > > return data; > > } > > This is okay. > > > static uint32_t pci_config_read(pcidevaddr_t dev, uint8_t reg) > > diff --git a/x86/s3.c b/x86/s3.c > > index 71d3ff9..d568aa7 100644 > > --- a/x86/s3.c > > +++ b/x86/s3.c > > @@ -143,14 +143,14 @@ static inline int rtc_in(u8 reg) > > { > > u8 x = reg; > > asm volatile("outb %b1, $0x70; inb $0x71, %b0" > > -: "+a"(x) : "0"(x)); > > +: "=a"(x) : "0"(x)); > > return x; > > } > > This should be wrong. 
GCC should complain that the same operand is used > for both input and output but has an "=" constraint. > > > static inline void rtc_out(u8 reg, u8 val) > > { > > asm volatile("outb %b1, $0x70; mov %b2, %b1; outb %b1, $0x71" > > -: "+a"(reg) : "0"(reg), "ri"(val)); > > +: "=a"(reg) : "0"(reg), "ri"(val)); > > } > > Same here. > > But I'm not sure what is the error message for older GCC for s3.c, as I > wrote in reply to Michael. > x86/s3.c: In function 'main': x86/s3.c:145: error: inconsistent operand constraints in an 'asm' And I am puzzled by this too. > > extern char resume_start, resume_end; > > diff --git a/x86/vmexit.c b/x86/vmexit.c > > index 3b945de..7e9af15 100644 > > --- a/x86/vmexit.c > > +++ b/x86/vmexit.c > > @@ -26,7 +26,7 @@ static void outw(unsigned short port, unsigned val) > > > > static void outl(unsigned short port, unsigned val) > > { > > -asm volatile("outl %d0, %w1" : : "a"(val), "Nd"(port)); > > +asm volatile("outl %0, %w1" : : "a"(val), "Nd"(port)); > > } > > Okay. > > Paolo > > > static unsigned int inb(unsigned short port) > > -- > > Gleb. > > -- Gleb.
Re: [Qemu-devel] [PATCH] kvm-unittest: fix build with gcc 4.3.X and older
On Thu, Oct 17, 2013 at 12:44:46PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 17, 2013 at 12:33:39PM +0300, Gleb Natapov wrote: > > > > It just papers over the problem. Compiler should either complain that it > > > > does not know what %w0 or complain that variable length does not match > > > > assembly > > > > > > Of course it can't. Compiler does not parse assembly at all: > > > these are just constant strings as far as it's concerned. > > > > > Compiler does not parse assembly itself, but it parses things like %w0 to > > know what assembly to emit. That is why it complained about %d0 in the > > first place. > > Right. I meant that with this: "outl %0" it has no chance to figure out > that outl needs eax. It has to go by variable type. Yes. -- Gleb.
Re: [Qemu-devel] [PATCH] kvm-unittest: fix build with gcc 4.3.X and older
On Thu, Oct 17, 2013 at 12:28:58PM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 17, 2013 at 11:34:41AM +0300, Gleb Natapov wrote: > > On Thu, Oct 17, 2013 at 11:27:37AM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 17, 2013 at 11:20:27AM +0300, Gleb Natapov wrote: > > > > On Thu, Oct 17, 2013 at 11:12:31AM +0300, Michael S. Tsirkin wrote: > > > > > On Thu, Oct 17, 2013 at 09:27:51AM +0300, Gleb Natapov wrote: > > > > > > On Wed, Oct 16, 2013 at 10:46:53PM +0300, Michael S. Tsirkin wrote: > > > > > > > Old GCC didn't let you reference variable by > > > > > > > number if it is listed with a specific register > > > > > > > constraint, on the assumption you can just > > > > > > > use the register name explicitly. > > > > > > > > > > > > > > Build fails with errors like this: > > > > > > > a.c:6: error: invalid 'asm': invalid operand code 'd' > > > > > > > > > > > > > Is it worth to support such ancient compiler? Nobody complained till > > > > > > now. > > > > > > > > > > Well it's not as widely used as kvm itself yet :) > > > > > The patch seems simple enough though. > > > > > > > > > > > BTW with your patch I still cannot compile with 4.2: > > > > > > > > > > > > x86/s3.c: In function 'main': > > > > > > x86/s3.c:145: error: inconsistent operand constraints in an 'asm' > > > > > > > > > > OK that's easy to fix. > > > > > > > > > > > > To fix, let's just use %eax %al etc. > > > > > > > > > > > > > Only %d0 does not work and dropping "d" fixes it since compiler can > > > > > > figure out correct register from variable size. The patch bellow > > > > > > fixes > > > > > > compilation for 4.2. > > > > > > > > > > It does produce warnings with -Wall though: > > > > > Assembler messages: > > > > > Warning: using `%ax' instead of `%eax' due to `w' suffix > > > > > Warning: using `%al' instead of `%eax' due to `b' suffix > > > > > > > > > Not for me. No warning with 4.7.3 and 4.2. .s file produced by gcc shows > > > > correct assembly with my patch. 
> > > > > > 4.3 doesn't. > > > > > I pretty much doubt that 4.2 and 4.7 produce correct assembly and 4.3 > > does not. Which file is it? Send me the .s file produced by it. > > Tried to get it but problem went away after I reset and re-applied your > patch. I think I had some other change in there that > caused it, and blamed your patch incorrectly. Sorry. > NP, that makes more sense. > > > We have the "a" constraint there anyway - so what's the issue with just > > > writing %eax etc? I think it's safer than relying on compiler to do > > > the right thing. > > > > > It just papers over the problem. Compiler should either complain that it > > does not know what %w0 or complain that variable length does not match > > assembly > > Of course it can't. Compiler does not parse assembly at all: > these are just constant strings as far as it's concerned. > Compiler does not parse assembly itself, but it parses things like %w0 to know what assembly to emit. That is why it complained about %d0 in the first place. > > or use correct register. > > Actually it does: it seems to correctly deduce > size from variable type. > > So your patch looks good to me, please apply. > > Will do. -- Gleb.
Re: [Qemu-devel] [PATCH] kvm-unittest: fix build with gcc 4.3.X and older
On Thu, Oct 17, 2013 at 11:27:37AM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 17, 2013 at 11:20:27AM +0300, Gleb Natapov wrote: > > On Thu, Oct 17, 2013 at 11:12:31AM +0300, Michael S. Tsirkin wrote: > > > On Thu, Oct 17, 2013 at 09:27:51AM +0300, Gleb Natapov wrote: > > > > On Wed, Oct 16, 2013 at 10:46:53PM +0300, Michael S. Tsirkin wrote: > > > > > Old GCC didn't let you reference variable by > > > > > number if it is listed with a specific register > > > > > constraint, on the assumption you can just > > > > > use the register name explicitly. > > > > > > > > > > Build fails with errors like this: > > > > > a.c:6: error: invalid 'asm': invalid operand code 'd' > > > > > > > > > Is it worth to support such ancient compiler? Nobody complained till > > > > now. > > > > > > Well it's not as widely used as kvm itself yet :) > > > The patch seems simple enough though. > > > > > > > BTW with your patch I still cannot compile with 4.2: > > > > > > > > x86/s3.c: In function 'main': > > > > x86/s3.c:145: error: inconsistent operand constraints in an 'asm' > > > > > > OK that's easy to fix. > > > > > > > > To fix, let's just use %eax %al etc. > > > > > > > > > Only %d0 does not work and dropping "d" fixes it since compiler can > > > > figure out correct register from variable size. The patch bellow fixes > > > > compilation for 4.2. > > > > > > It does produce warnings with -Wall though: > > > Assembler messages: > > > Warning: using `%ax' instead of `%eax' due to `w' suffix > > > Warning: using `%al' instead of `%eax' due to `b' suffix > > > > > Not for me. No warning with 4.7.3 and 4.2. .s file produced by gcc shows > > correct assembly with my patch. > > 4.3 doesn't. > I pretty much doubt that 4.2 and 4.7 produce correct assembly and 4.3 does not. Which file is it? Send me .s file produced by it. > We have the "a" constraint there anyway - so what's the issue with just > writing %eax etc? I think it's safer than relying on compiler to do > the right thing. 
> It just papers over the problem. The compiler should either complain that it does not know what %w0 is, or complain that the variable length does not match the assembly, or use the correct register. -- Gleb.
Re: [Qemu-devel] [PATCH] kvm-unittest: fix build with gcc 4.3.X and older
On Thu, Oct 17, 2013 at 11:12:31AM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 17, 2013 at 09:27:51AM +0300, Gleb Natapov wrote: > > On Wed, Oct 16, 2013 at 10:46:53PM +0300, Michael S. Tsirkin wrote: > > > Old GCC didn't let you reference variable by > > > number if it is listed with a specific register > > > constraint, on the assumption you can just > > > use the register name explicitly. > > > > > > Build fails with errors like this: > > > a.c:6: error: invalid 'asm': invalid operand code 'd' > > > > > Is it worth to support such ancient compiler? Nobody complained till > > now. > > Well it's not as widely used as kvm itself yet :) > The patch seems simple enough though. > > > BTW with your patch I still cannot compile with 4.2: > > > > x86/s3.c: In function 'main': > > x86/s3.c:145: error: inconsistent operand constraints in an 'asm' > > OK that's easy to fix. > > > > To fix, let's just use %eax %al etc. > > > > > Only %d0 does not work and dropping "d" fixes it since compiler can > > figure out correct register from variable size. The patch bellow fixes > > compilation for 4.2. > > It does produce warnings with -Wall though: > Assembler messages: > Warning: using `%ax' instead of `%eax' due to `w' suffix > Warning: using `%al' instead of `%eax' due to `b' suffix > Not for me. No warning with 4.7.3 and 4.2. .s file produced by gcc shows correct assembly with my patch. -- Gleb.
Re: [Qemu-devel] [PATCH] kvm-unittest: fix build with gcc 4.3.X and older
On Wed, Oct 16, 2013 at 10:46:53PM +0300, Michael S. Tsirkin wrote: > Old GCC didn't let you reference variable by > number if it is listed with a specific register > constraint, on the assumption you can just > use the register name explicitly. > > Build fails with errors like this: > a.c:6: error: invalid 'asm': invalid operand code 'd' > Is it worth to support such ancient compiler? Nobody complained till now. BTW with your patch I still cannot compile with 4.2: x86/s3.c: In function 'main': x86/s3.c:145: error: inconsistent operand constraints in an 'asm' > To fix, let's just use %eax %al etc. > Only %d0 does not work and dropping "d" fixes it since compiler can figure out correct register from variable size. The patch bellow fixes compilation for 4.2. diff --git a/lib/x86/pci.c b/lib/x86/pci.c index f95cd88..231668a 100644 --- a/lib/x86/pci.c +++ b/lib/x86/pci.c @@ -3,13 +3,13 @@ static void outl(unsigned short port, unsigned val) { -asm volatile("outl %d0, %w1" : : "a"(val), "Nd"(port)); +asm volatile("outl %0, %w1" : : "a"(val), "Nd"(port)); } static unsigned inl(unsigned short port) { unsigned data; -asm volatile("inl %w1, %d0" : "=a"(data) : "Nd"(port)); +asm volatile("inl %w1, %0" : "=a"(data) : "Nd"(port)); return data; } static uint32_t pci_config_read(pcidevaddr_t dev, uint8_t reg) diff --git a/x86/s3.c b/x86/s3.c index 71d3ff9..d568aa7 100644 --- a/x86/s3.c +++ b/x86/s3.c @@ -143,14 +143,14 @@ static inline int rtc_in(u8 reg) { u8 x = reg; asm volatile("outb %b1, $0x70; inb $0x71, %b0" -: "+a"(x) : "0"(x)); +: "=a"(x) : "0"(x)); return x; } static inline void rtc_out(u8 reg, u8 val) { asm volatile("outb %b1, $0x70; mov %b2, %b1; outb %b1, $0x71" -: "+a"(reg) : "0"(reg), "ri"(val)); +: "=a"(reg) : "0"(reg), "ri"(val)); } extern char resume_start, resume_end; diff --git a/x86/vmexit.c b/x86/vmexit.c index 3b945de..7e9af15 100644 --- a/x86/vmexit.c +++ b/x86/vmexit.c @@ -26,7 +26,7 @@ static void outw(unsigned short port, unsigned val) static void 
outl(unsigned short port, unsigned val) { -asm volatile("outl %d0, %w1" : : "a"(val), "Nd"(port)); +asm volatile("outl %0, %w1" : : "a"(val), "Nd"(port)); } static unsigned int inb(unsigned short port) -- Gleb.
Re: [Qemu-devel] [PULL 42/43] piix4: add acpi pci hotplug support
On Thu, Oct 17, 2013 at 08:32:14AM +0300, Michael S. Tsirkin wrote: > On Thu, Oct 17, 2013 at 12:25:32AM +0200, Paolo Bonzini wrote: > > Il 17/10/2013 00:03, Michael S. Tsirkin ha scritto: > > > On Wed, Oct 16, 2013 at 11:26:11PM +0200, Paolo Bonzini wrote: > > >> Il 16/10/2013 20:37, Michael S. Tsirkin ha scritto: > > >>> Gleb, Paolo, what do you think? OK to merge kvm unit test > > >>> into qemu? It depends on qemu anyway, in-tree will make it easier. > > >>> Maybe someone's looking at this already? > > >> > > >> I think merging KVM unit tests doesn't make much sense because, with > > >> some small exceptions, it is mostly a test or a benchmark for KVM. > > > > > > But why keep them separate? They need qemu to work, don't they? > > > > Not necessarily. They need a userspace component of course, but most of > > them do not need something as big as QEMU. Most tests, perhaps all, > > only write to a handful of ports and use no BIOS services. > > Well all of them use the test device to report status, right? > Do they work on e.g. kvmtool? > If anyone uses these tests outside QEMU then maybe it's > worth it to be nice and keep it separate. > If not it's just extra pain and compatibility worries without > any real gain - and if someone starts using it > this way down the line we'll be able to use something > like git subtree to extract it again. > The same logic can be used to argue that unittest should be part of the Linux tree. Unittest was part of qemu-kvm, and it was decided that having a separate repository for it is much better. Time showed that this is true. Look at the unittest commits and see how many of them have something to do with QEMU changes and how many do not; now look at the same commits and see how many of them have anything to do with the kernel. See? The argument for putting unittest into the kernel repo is much stronger. > For example I think it might be nice to switch everyone > to use pci device instead of fixed ports - less magic > numbers this way.
But if I need to keep it working on > old qemu and keep two versions of code around, I'd > probably not bother. > Nothing nice about it. The test device is all about simplicity. If a separate repository stops one from even attempting it, the separation was already a win :) -- Gleb.
Re: [Qemu-devel] RFC: KVM _CREATE_DEVICE considered harmful?
On Wed, Oct 16, 2013 at 02:59:47PM +0200, Christian Borntraeger wrote: > Folks, > > from time to time I update valgrind or qemu to work reasonably well > with KVM. > > Now, newer KVMs have the ability to create subdevices of a KVM guest (e.g. an > in kernel > kvm interrupt controller) with the following ioctl: > > #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct > kvm_create_device) > > qemu can then work on these devices with the ioctls > > /* ioctls for fds returned by KVM_CREATE_DEVICE */ > #define KVM_SET_DEVICE_ATTR _IOW(KVMIO, 0xe1, struct kvm_device_attr) > #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) > #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) > > struct kvm_device_attr { > __u32 flags; /* no flags currently defined */ > __u32 group; /* device-defined */ > __u64 attr; /* group-defined */ > __u64 addr; /* userspace address of attr data */ > }; > > to communicate with the in-kvm devices to exchange data. There is an > interesting > problem here: Properly defined ioctls allow to deduct the written or read area > of a ioctl. This is useful, e.g. for valgrind. For unknown ioctls, valgrind > will > decode the ioctl number according to the _IOxx prefixes and deducts the memory > area that the kernels reads/writes. This is very helpful for definedness > checkins. > And is not very helpful if you need to change structure size or it changes by itself because of 32bit->64bit move. > For specific or more complex ioctls valgrind has special handling. The > problem is > now, that looking at the current implementations of kvm device attr, all use > 0,1,2, > etc for group/attr. So by inspecting the ioctl, valgrind cannot find out what > device > it is talking to without tracking the original create ioctl. > > So it seems that by using a multiplexing ioctls we actually make the interface > less well defined and more error prone than individual ioctls. I do not see why it is more error prone. 
Yes, it is harder for valgrind to track it, but this by itself does not make it error prone. > > Another hint to look at are system calls. Older archs (like i386) have an IPC > system > call that multiplexes several things. But time has shown that having a system > for each > and every subfunctions is better that a multiplexer. (so newer archs like > amd64 dont > have an ipc system call they have semop, semget etc. This also makes tools > like strace > easier to implement. > The difference is in the number of those subfunctions. We can define a separate ioctl for each cpu/device register of each architecture and it will be nicely straceable, but how many new ioctls will that create? It will easily reach the thousands. So at some point you start to multiplex. The ONE_REG interface is a multiplexer, and DEVICE_ATTR is the same. > Whats even more puzzling to me: The main complaint about ioctls is the > possibility to > create arbitrary interfaces into the kernel. Now with the IORW family, we > actually > had some limited form of control. With an additional uncoordinated group/attr > parameter we dropped all of that. How is this control actually used in the kernel? > > > So how to process from here? > a) leave it as is and accept false positives from valgrind and undecoded > straces strace still cannot decode kvm ioctls after all these years. I am not concerned about straces at all; it is less than useless for debugging KVM. > b) We could enhance the device_addr group/attr values with size macros, e.g > the same as _IOWR (do we need direction?) for future users I am fine with that. But even in KVM this is not the first interface that has this drawback. See KVM_GET_DIRTY_LOG for instance. > c) Avoid the device api for new code and go back to individual ioctls in KVM Individual ioctls were a mess. People are adding devices without even knowing beforehand how the final interface will look.
To accommodate such a "design" strategy, devices need to have as fine-grained interfaces as possible, otherwise ioctls will be deprecated faster than they are added (that was the justification for ONE_REG too, BTW). If you want a fine-grained interface you need to multiplex at some point or add hundreds of ioctls; if you need to multiplex anyway, multiplexing at the device level is the logical thing to do. I understand the valgrind sentiment, but it should not be a central point of interface design. > > The reason for KVM_CREATE_DEVICE seems to be that in qemu we want to model > everything > as a device. Having an in-kernel KVM device seemed logical. The problem is, > that we > should really have a 2nd look at the Linux system call ABI. Does it have to > follow the > device model? especially if the interface has obvious draw backs. > Try to add a new Linux system call. It's very hard and for a good reason: you do not want to have 1000s of them where most of them are one-trick ponies and mostly deprecated (but
Re: [Qemu-devel] [PATCH] virtio: Introduce virtio-testdev
On Tue, Oct 15, 2013 at 11:14:12AM +0100, Peter Maydell wrote: > On 15 October 2013 10:47, Anup Patel wrote: > > On Tue, Oct 15, 2013 at 2:06 PM, Andrew Jones wrote: > >> I'm not opposed to it, but at the moment I'm not sure how we would > >> utilize it within kvm-unit-tests. Maybe it would be useful for another > >> application though? So maybe we can add it as an add-on patch at the > >> time we come up with its use case? > > > > I suggested it because we have "machvirt" machine in QEMU for > > KVM ARM/ARM64 which has only VirtIO devices. In "machvirt", we > > don't have mechanism to reset the system because none of the > > VirtIO devices have such a mechanism. Now since you are introducing > > a "testdev", we can have a RESET command in VirtIO and implement > > VirtIO REBOOT driver in Linux kernel to use it. > > > > Currently, due to no RESET support in "machvirt" we are not able > > to reboot Guest Linux from Guest console. > > We shouldn't be abusing a device intended for testing to > provide necessary features. If we need guest reboot support > (which seems like an obvious thing to want) then we need > to implement a sensible mechanism for it. > Definitely. The test device is only for use by unit tests. It is not normally present and may contain hacks not suitable for human consumption. -- Gleb.
Re: [Qemu-devel] problems with 1G hugepages and linux 3.12-rc3
Copying Andrea, On Sun, Oct 06, 2013 at 02:47:41AM +0200, andy123 wrote: > Hi, > > as the subject states, I have some problems with 1G hugepages with > qemu(-vfio-git) on Linux 3.12-rc3. > > I start qemu like this, for example: > "/usr/bin/qemu-system-x86_64 -enable-kvm -m 1024 -mem-path /dev/hugepages > -drive file=/files/vm/arch.img,if=virtio,media=disk -monitor stdio" > where /dev/hugepages is "hugetlbfs on /dev/hugepages type hugetlbfs > (rw,relatime,mode=1770,gid=78,pagesize=1G,pagesize=1G)" > and the kernel is booted with "hugepagesz=1G hugepages=4". > This result in lots of error message in dmesg, as seen here: > https://gist.github.com/ajs124/6842823 (starting at 18:04:28) > > After starting and stopping multiple virtual machines, the hugepages seem to > "fill up" and qemu outputs > "file_ram_alloc: can't mmap RAM pages: Cannot allocate memory", but works > anyways. > With fill up, I mean that I can start qemu 2 time with "-m 2048" and 4 times > with "-m 1024", before it fails to mmap. > I can reproduce huge page leak, but not oops, but they can be related. Can you revert 11feeb498086a3a5907b8148bdf1786a9b18fc55 and retry? > This works without any problems in 3.11.x and also with 2M hugepages (umount > /dev/hugepages && mount -t hugetlbfs -o pagesize=2048k /dev/hugepages). > > I'm running this in arch linux and there has already been some discussion on > the arch forums > (https://bbs.archlinux.org/viewtopic.php?pid=1333469#p1333469) but I tried to > add everything I wrote over there in this mail. > > In case I missed something or you need more information, I'll happily supply > it in order to get this resolved. > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb.
Re: [Qemu-devel] [PATCH uq/master] kvmvapic: Prevent reading beyond the end of guest RAM
On Mon, Sep 30, 2013 at 12:35:13PM +0200, Jan Kiszka wrote: > rom_state_paddr is guest provided (caller address of outw(VAPIC_PORT) + > writen 16-bit value) and can be influenced to point beyond the end of > the host memory backing the guest's RAM. Make sure we do not use this > pointer to actually read beyond the limits. > > Reading arbitrary guest bytes is harmless, the guest kernel has to > manage access to this I/O port anyway. > > Signed-off-by: Jan Kiszka Applied, thanks. > --- > hw/i386/kvmvapic.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c > index 1c2dbf5..2d87600 100644 > --- a/hw/i386/kvmvapic.c > +++ b/hw/i386/kvmvapic.c > @@ -596,6 +596,9 @@ static int vapic_map_rom_writable(VAPICROMState *s) > section = memory_region_find(as, 0, 1); > > /* read ROM size from RAM region */ > +if (rom_paddr + 2 >= memory_region_size(section.mr)) { > +return -1; > +} > ram = memory_region_get_ram_ptr(section.mr); > rom_size = ram[rom_paddr + 2] * ROM_BLOCK_SIZE; > if (rom_size == 0) { > -- > 1.8.1.1.298.ge7eed54 -- Gleb.
Re: [Qemu-devel] [PATCH v3 uq/master 2/2] x86: cpuid: reconstruct leaf 0Dh data
On Thu, Oct 03, 2013 at 11:59:24AM +0200, Igor Mammedov wrote: > On Wed, 2 Oct 2013 17:54:57 +0200 > Paolo Bonzini wrote: > > > The data in leaf 0Dh depends on information from other feature bits. > > Instead of passing it blindly from the host, compute it based on > > whether these feature bits are enabled. > > > > Signed-off-by: Paolo Bonzini > > --- > > target-i386/cpu.c | 65 > > --- > > 1 file changed, 48 insertions(+), 17 deletions(-) > > > > diff --git a/target-i386/cpu.c b/target-i386/cpu.c > > index ac83106..1addb18 100644 > > --- a/target-i386/cpu.c > > +++ b/target-i386/cpu.c > > @@ -328,6 +328,15 @@ X86RegisterInfo32 x86_reg_info_32[CPU_NB_REGS32] = { > > }; > > #undef REGISTER > > > > +typedef struct ExtSaveArea { > > +uint32_t feature, bits; > > +uint32_t offset, size; > > +} ExtSaveArea; > > + > > +static const ExtSaveArea ext_save_areas[] = { > > +[2] = { .feature = FEAT_1_ECX, .bits = CPUID_EXT_AVX, > > +.offset = 0x100, .size = 0x240 }, > > +}; > > > > const char *get_register_name_32(unsigned int reg) > > { > > @@ -2169,29 +2178,51 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t > > index, uint32_t count, > > *edx = 0; > > } > > break; > > -case 0xD: > > +case 0xD: { > > +KVMState *s = cs->kvm_state; > > +uint64_t kvm_mask; > > +int i; > > + > > /* Processor Extended State */ > > -if (!(env->features[FEAT_1_ECX] & CPUID_EXT_XSAVE)) { > > -*eax = 0; > > -*ebx = 0; > > -*ecx = 0; > > -*edx = 0; > > +*eax = 0; > > +*ebx = 0; > > +*ecx = 0; > > +*edx = 0; > > +if (!(env->features[FEAT_1_ECX] & CPUID_EXT_XSAVE) || > > !kvm_enabled()) { > > break; > > } > > -if (kvm_enabled()) { > > -KVMState *s = cs->kvm_state; > > +kvm_mask = > > +kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EAX) | > > +((uint64_t)kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EDX) << > > 32); > calling kvm_arch_get_supported_cpuid() without kvm_enabled() guard > could regress TCG mode on non KVM host: > But there is kvm_enabled() guard above. 
> kvm_arch_get_supported_cpuid -> get_supported_cpuid -> try_get_cpuid -> >r = kvm_ioctl(s, KVM_GET_SUPPORTED_CPUID, cpuid); >... >if (r < 0) { > if (r == -E2BIG) { > g_free(cpuid); > return NULL; > } else { > fprintf(stderr, "KVM_GET_SUPPORTED_CPUID failed: %s\n", > strerror(-r)); > exit(1); > ^ guest suddenly dies > > > > > -*eax = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EAX); > > -*ebx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EBX); > > -*ecx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_ECX); > > -*edx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EDX); > > -} else { > > -*eax = 0; > > -*ebx = 0; > > -*ecx = 0; > > -*edx = 0; > > +if (count == 0) { > > +*ecx = 0x240; > > +for (i = 2; i < ARRAY_SIZE(ext_save_areas); i++) { > > +const ExtSaveArea *esa = &ext_save_areas[i]; > > +if ((env->features[esa->feature] & esa->bits) == esa->bits > > && > > +(kvm_mask & (1 << i)) != 0) { > > +if (i < 32) { > > +*eax |= 1 << i; > > +} else { > > +*edx |= 1 << (i - 32); > > +} > > +*ecx = MAX(*ecx, esa->offset + esa->size); > > +} > > +} > > +*eax |= kvm_mask & (XSTATE_FP | XSTATE_SSE); > > +*ebx = *ecx; > > +} else if (count == 1) { > > +*eax = kvm_arch_get_supported_cpuid(s, 0xd, 1, R_EAX); > > +} else if (count < ARRAY_SIZE(ext_save_areas)) { > > +const ExtSaveArea *esa = &ext_save_areas[count]; > > +if ((env->features[esa->feature] & esa->bits) == esa->bits && > > +(kvm_mask & (1 << count)) != 0) { > > +*eax = esa->offset; > > +*ebx = esa->size; > > +} > > } > > break; > > +} > > case 0x8000: > > *eax = env->cpuid_xlevel; > > *ebx = env->cpuid_vendor1; -- Gleb.
Re: [Qemu-devel] [PATCH v3 uq/master 2/2] x86: cpuid: reconstruct leaf 0Dh data
On Wed, Oct 02, 2013 at 05:54:57PM +0200, Paolo Bonzini wrote: > The data in leaf 0Dh depends on information from other feature bits. > Instead of passing it blindly from the host, compute it based on > whether these feature bits are enabled. > Applied both. Thanks. > Signed-off-by: Paolo Bonzini > --- > target-i386/cpu.c | 65 > --- > 1 file changed, 48 insertions(+), 17 deletions(-) > > diff --git a/target-i386/cpu.c b/target-i386/cpu.c > index ac83106..1addb18 100644 > --- a/target-i386/cpu.c > +++ b/target-i386/cpu.c > @@ -328,6 +328,15 @@ X86RegisterInfo32 x86_reg_info_32[CPU_NB_REGS32] = { > }; > #undef REGISTER > > +typedef struct ExtSaveArea { > +uint32_t feature, bits; > +uint32_t offset, size; > +} ExtSaveArea; > + > +static const ExtSaveArea ext_save_areas[] = { > +[2] = { .feature = FEAT_1_ECX, .bits = CPUID_EXT_AVX, > +.offset = 0x100, .size = 0x240 }, > +}; > > const char *get_register_name_32(unsigned int reg) > { > @@ -2169,29 +2178,51 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, > uint32_t count, > *edx = 0; > } > break; > -case 0xD: > +case 0xD: { > +KVMState *s = cs->kvm_state; > +uint64_t kvm_mask; > +int i; > + > /* Processor Extended State */ > -if (!(env->features[FEAT_1_ECX] & CPUID_EXT_XSAVE)) { > -*eax = 0; > -*ebx = 0; > -*ecx = 0; > -*edx = 0; > +*eax = 0; > +*ebx = 0; > +*ecx = 0; > +*edx = 0; > +if (!(env->features[FEAT_1_ECX] & CPUID_EXT_XSAVE) || > !kvm_enabled()) { > break; > } > -if (kvm_enabled()) { > -KVMState *s = cs->kvm_state; > +kvm_mask = > +kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EAX) | > +((uint64_t)kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EDX) << 32); > > -*eax = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EAX); > -*ebx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EBX); > -*ecx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_ECX); > -*edx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EDX); > -} else { > -*eax = 0; > -*ebx = 0; > -*ecx = 0; > -*edx = 0; > +if (count == 0) { > +*ecx = 0x240; > +for (i = 
2; i < ARRAY_SIZE(ext_save_areas); i++) { > +const ExtSaveArea *esa = &ext_save_areas[i]; > +if ((env->features[esa->feature] & esa->bits) == esa->bits && > +(kvm_mask & (1 << i)) != 0) { > +if (i < 32) { > +*eax |= 1 << i; > +} else { > +*edx |= 1 << (i - 32); > +} > +*ecx = MAX(*ecx, esa->offset + esa->size); > +} > +} > +*eax |= kvm_mask & (XSTATE_FP | XSTATE_SSE); > +*ebx = *ecx; > +} else if (count == 1) { > +*eax = kvm_arch_get_supported_cpuid(s, 0xd, 1, R_EAX); > +} else if (count < ARRAY_SIZE(ext_save_areas)) { > +const ExtSaveArea *esa = &ext_save_areas[count]; > +if ((env->features[esa->feature] & esa->bits) == esa->bits && > +(kvm_mask & (1 << count)) != 0) { > +*eax = esa->offset; > +*ebx = esa->size; > +} > } > break; > +} > case 0x8000: > *eax = env->cpuid_xlevel; > *ebx = env->cpuid_vendor1; > -- > 1.8.3.1 -- Gleb.
Re: [Qemu-devel] [PATCH v2 uq/master 2/2] x86: cpuid: reconstruct leaf 0Dh data
On Wed, Oct 02, 2013 at 05:37:31PM +0200, Paolo Bonzini wrote: > Il 02/10/2013 17:21, Gleb Natapov ha scritto: > >> -if (kvm_enabled()) { > >> -KVMState *s = cs->kvm_state; > >> +kvm_mask = > >> +kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EAX) | > >> +((uint64_t)kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EDX) << > >> 32); > >> > >> -*eax = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EAX); > >> -*ebx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EBX); > >> -*ecx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_ECX); > >> -*edx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EDX); > >> -} else { > >> -*eax = 0; > >> -*ebx = 0; > >> -*ecx = 0; > >> -*edx = 0; > >> +if (count == 0) { > >> +*ecx = 0x240; > >> +for (i = 2; i < ARRAY_SIZE(ext_save_areas); i++) { > >> +const ExtSaveArea *esa = &ext_save_areas[i]; > >> +if ((env->features[esa->feature] & esa->bits) == > >> esa->bits && > >> +(kvm_mask & (1 << i)) != 0) { > >> +if (i < 32) { > >> +*eax |= 1 << i; > >> +} else { > >> +*edx |= 1 << (i - 32); > >> +} > >> +*ecx = MAX(*ecx, esa->offset + esa->size); > >> +} > >> +} > >> +*eax |= kvm_mask & 3; > > Lets use define from previous patch. > > Right. > > >> +*ebx = *ecx; > >> +} else if (count == 1) { > >> +*eax = kvm_arch_get_supported_cpuid(s, 0xd, 1, R_EAX); > >> +} else if (count < ARRAY_SIZE(ext_save_areas)) { > >> +const ExtSaveArea *esa = &ext_save_areas[count]; > >> +if ((env->features[esa->feature] & esa->bits) == esa->bits && > >> +(kvm_mask & (1 << count)) != 0) { > >> +*eax = esa->offset; > >> +*ebx = esa->size; > > Why do you hard code them instead of querying kernel? What if they > > depend on cpu type? (well if this happens we can forget about > > migration, but still...) > > HPA confirmed (on xen-devel) that they will not depend on the CPU type. > All offsets are documented in the SDM and in the additional Skylake > manual except for MPX, and he reported that he'd ask for MPX to be > documented as well. 
As you said, if they changed it would be a total mess. > > I hardcoded them because this is not KVM-specific knowledge. TCG could > in principle reuse the same code, just skipping the part where it masks > away features not supported by KVM. > OK. Can you send a new version with the defines, please? -- Gleb.
Re: [Qemu-devel] [PATCH v2 uq/master 2/2] x86: cpuid: reconstruct leaf 0Dh data
On Fri, Sep 13, 2013 at 03:55:58PM +0200, Paolo Bonzini wrote: > The data in leaf 0Dh depends on information from other feature bits. > Instead of passing it blindly from the host, compute it based on > whether these feature bits are enabled. > > Signed-off-by: Paolo Bonzini > --- > target-i386/cpu.c | 63 +++- > 1 file changed, 47 insertions(+), 16 deletions(-) > > diff --git a/target-i386/cpu.c b/target-i386/cpu.c > index ac83106..e6179f4 100644 > --- a/target-i386/cpu.c > +++ b/target-i386/cpu.c > @@ -328,6 +328,15 @@ X86RegisterInfo32 x86_reg_info_32[CPU_NB_REGS32] = { > }; > #undef REGISTER > > +typedef struct ExtSaveArea { > +uint32_t feature, bits; > +uint32_t offset, size; > +} ExtSaveArea; > + > +static const ExtSaveArea ext_save_areas[] = { > +[2] = { .feature = FEAT_1_ECX, .bits = CPUID_EXT_AVX, > +.offset = 0x100, .size = 0x240 }, > +}; > > const char *get_register_name_32(unsigned int reg) > { > @@ -2169,29 +2178,51 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, > uint32_t count, > *edx = 0; > } > break; > -case 0xD: > +case 0xD: { > +KVMState *s = cs->kvm_state; > +uint64_t kvm_mask; > +int i; > + > /* Processor Extended State */ > -if (!(env->features[FEAT_1_ECX] & CPUID_EXT_XSAVE)) { > -*eax = 0; > -*ebx = 0; > -*ecx = 0; > -*edx = 0; > +*eax = 0; > +*ebx = 0; > +*ecx = 0; > +*edx = 0; > +if (!(env->features[FEAT_1_ECX] & CPUID_EXT_XSAVE) || > !kvm_enabled()) { > break; > } > -if (kvm_enabled()) { > -KVMState *s = cs->kvm_state; > +kvm_mask = > +kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EAX) | > +((uint64_t)kvm_arch_get_supported_cpuid(s, 0xd, 0, R_EDX) << 32); > > -*eax = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EAX); > -*ebx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EBX); > -*ecx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_ECX); > -*edx = kvm_arch_get_supported_cpuid(s, 0xd, count, R_EDX); > -} else { > -*eax = 0; > -*ebx = 0; > -*ecx = 0; > -*edx = 0; > +if (count == 0) { > +*ecx = 0x240; > +for (i = 2; i < 
ARRAY_SIZE(ext_save_areas); i++) { > +const ExtSaveArea *esa = &ext_save_areas[i]; > +if ((env->features[esa->feature] & esa->bits) == esa->bits && > +(kvm_mask & (1 << i)) != 0) { > +if (i < 32) { > +*eax |= 1 << i; > +} else { > +*edx |= 1 << (i - 32); > +} > +*ecx = MAX(*ecx, esa->offset + esa->size); > +} > +} > +*eax |= kvm_mask & 3; Lets use define from previous patch. > +*ebx = *ecx; > +} else if (count == 1) { > +*eax = kvm_arch_get_supported_cpuid(s, 0xd, 1, R_EAX); > +} else if (count < ARRAY_SIZE(ext_save_areas)) { > +const ExtSaveArea *esa = &ext_save_areas[count]; > +if ((env->features[esa->feature] & esa->bits) == esa->bits && > +(kvm_mask & (1 << count)) != 0) { > +*eax = esa->offset; > +*ebx = esa->size; Why do you hard code them instead of querying kernel? What if they depend on cpu type? (well if this happens we can forget about migration, but still...) > +} > } > break; > +} > case 0x8000: > *eax = env->cpuid_xlevel; > *ebx = env->cpuid_vendor1; > -- > 1.8.3.1 > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb.
Re: [Qemu-devel] [Bug 599958] Re: Timedrift problems with Win7: hpet missing time drift fixups
On Tue, Oct 01, 2013 at 11:36:39AM -0500, Ben "Root" Anderson wrote: > Agh, I forgot reply all. > I have to re reply now :) > Seems like something that should be changed, no? It would've saved me > a lot of headache if there was a switch e.g. > -optimize-for=[linux,winxp, > win7,etc] that changed the defaults to be > most accomodating to the specified OS as a guest. > QEMU developers see it as a management job. Management (libvirt/virt-manager) knows what the guest is and what the optimal config is. I am not sure whether they currently use -no-hpet for Windows guests; if they do not, it should be changed there. > On Tue, Oct 1, 2013 at 11:33 AM, Gleb Natapov wrote: > > On Tue, Oct 01, 2013 at 11:23:07AM -0500, Ben "Root" Anderson wrote: > >> Fair enough in itself, but if HPET is known to have problems with > >> arguably the most popular OS family to use as a guest, why is it > >> enabled by default? > >> > > Arguably :) But QEMU defaults are arguably far from been optimal for any > > guest. > > > >> On Tue, Oct 1, 2013 at 10:56 AM, Gleb Natapov wrote: > >> > On Tue, Oct 01, 2013 at 09:34:06AM -, Ben A wrote: > >> >> Apparently this bug's still alive and kicking. > >> >> > >> > And no plans to fix it. Do not use hpet with windows guests this buys > >> > you nothing. > >> > > >> >> There's an obvious clock skew problem on Windows 7; in the Date & Time > >> >> dialog, the clock jumps through seconds visibly too fast. > >> >> > >> >> I also found a case where HPET bugs are causing a real problem: Terraria > >> >> (dedicated server) seems to be relying on (something that relies on) > >> >> HPET, and QEMU doesn't get it right. The result is a goofy and > >> >> aggravating behavior I've nicknamed "Turbo Monsters of Doom" and it > >> >> makes killing anything tougher than a normal zombie basically > >> >> impossible. > >> >> > >> >> -- > >> >> You received this bug notification because you are a member of qemu- > >> >> devel-ml, which is subscribed to QEMU. 
> >> >> https://bugs.launchpad.net/bugs/599958 > >> >> > >> >> Title: > >> >> Timedrift problems with Win7: hpet missing time drift fixups > >> >> > >> >> Status in QEMU: > >> >> Confirmed > >> >> > >> >> Bug description: > >> >> We've been finding timedrift issues witth Win7 under qemu-kvm on our > >> >> daily testing > >> >> > >> >> kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_load FAIL1 > >> >> Time drift too large after rest period: 38.63% > >> >> kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_reboot FAIL1 > >> >> Time drift too large at iteration 1: 17.77 seconds > >> >> kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_migration FAIL1 > >> >> Time drift too large at iteration 2: 3.08 seconds > >> >> > >> >> Steps to reproduce: > >> >> > >> >> timedrift.with_load > >> >> > >> >> 1) Log into a guest. > >> >> 2) Take a time reading from the guest and host. > >> >> 3) Run load on the guest and host. > >> >> 4) Take a second time reading. > >> >> 5) Stop the load and rest for a while. > >> >> 6) Take a third time reading. > >> >> 7) If the drift immediately after load is higher than a user- > >> >> specified value (in %), fail. > >> >> If the drift after the rest period is higher than a > >> >> user-specified value, > >> >> fail. > >> >> > >> >> timedrift.with_migration > >> >> > >> >> 1) Log into a guest. > >> >> 2) Take a time reading from the guest and host. > >> >> 3) Migrate the guest. > >> >> 4) Take a second time reading. > >> >> 5) If the drift (in seconds) is higher than a user specified value, > >> >> fail. > >> >> > >> >> timedrift.with_reboot > >> >> > >> >> 1) Log into a guest. > >> >> 2) Take a time reading from the guest and host. > >> >> 3) Reboot the guest. > >> >> 4) Take a second time reading. > >> >> 5) If the drift (in seconds) is higher than a user specified value, > >> >> fail. > >> >> > >> >> This bug is to register those issues and keep an eye on them. 
> >> >> > >> >> Attached, some logs from the autotest tests executed on the guest > >> >> > >> >> To manage notifications about this bug go to: > >> >> https://bugs.launchpad.net/qemu/+bug/599958/+subscriptions > >> > > >> > -- > >> > Gleb. > > > > -- > > Gleb. -- Gleb.
Re: [Qemu-devel] [Bug 599958] Re: Timedrift problems with Win7: hpet missing time drift fixups
On Tue, Oct 01, 2013 at 11:23:07AM -0500, Ben "Root" Anderson wrote: > Fair enough in itself, but if HPET is known to have problems with > arguably the most popular OS family to use as a guest, why is it > enabled by default? > Arguably :) But QEMU defaults are arguably far from being optimal for any guest. > On Tue, Oct 1, 2013 at 10:56 AM, Gleb Natapov wrote: > > On Tue, Oct 01, 2013 at 09:34:06AM -, Ben A wrote: > >> Apparently this bug's still alive and kicking. > >> > > And no plans to fix it. Do not use hpet with windows guests this buys > > you nothing. > > > >> There's an obvious clock skew problem on Windows 7; in the Date & Time > >> dialog, the clock jumps through seconds visibly too fast. > >> > >> I also found a case where HPET bugs are causing a real problem: Terraria > >> (dedicated server) seems to be relying on (something that relies on) > >> HPET, and QEMU doesn't get it right. The result is a goofy and > >> aggravating behavior I've nicknamed "Turbo Monsters of Doom" and it > >> makes killing anything tougher than a normal zombie basically > >> impossible. > >> > >> -- > >> You received this bug notification because you are a member of qemu- > >> devel-ml, which is subscribed to QEMU. > >> https://bugs.launchpad.net/bugs/599958 > >> > >> Title: > >> Timedrift problems with Win7: hpet missing time drift fixups > >> > >> Status in QEMU: > >> Confirmed > >> > >> Bug description: > >> We've been finding timedrift issues witth Win7 under qemu-kvm on our > >> daily testing > >> > >> kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_load FAIL1 Time > >> drift too large after rest period: 38.63% > >> kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_reboot FAIL1 Time > >> drift too large at iteration 1: 17.77 seconds > >> kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_migration FAIL1 > >>Time drift too large at iteration 2: 3.08 seconds > >> > >> Steps to reproduce: > >> > >> timedrift.with_load > >> > >> 1) Log into a guest. 
> >> 2) Take a time reading from the guest and host. > >> 3) Run load on the guest and host. > >> 4) Take a second time reading. > >> 5) Stop the load and rest for a while. > >> 6) Take a third time reading. > >> 7) If the drift immediately after load is higher than a user- > >> specified value (in %), fail. > >> If the drift after the rest period is higher than a user-specified > >> value, > >> fail. > >> > >> timedrift.with_migration > >> > >> 1) Log into a guest. > >> 2) Take a time reading from the guest and host. > >> 3) Migrate the guest. > >> 4) Take a second time reading. > >> 5) If the drift (in seconds) is higher than a user specified value, fail. > >> > >> timedrift.with_reboot > >> > >> 1) Log into a guest. > >> 2) Take a time reading from the guest and host. > >> 3) Reboot the guest. > >> 4) Take a second time reading. > >> 5) If the drift (in seconds) is higher than a user specified value, fail. > >> > >> This bug is to register those issues and keep an eye on them. > >> > >> Attached, some logs from the autotest tests executed on the guest > >> > >> To manage notifications about this bug go to: > >> https://bugs.launchpad.net/qemu/+bug/599958/+subscriptions > > > > -- > > Gleb. -- Gleb.
Re: [Qemu-devel] [Bug 599958] Re: Timedrift problems with Win7: hpet missing time drift fixups
On Tue, Oct 01, 2013 at 09:34:06AM -, Ben A wrote: > Apparently this bug's still alive and kicking. > And there are no plans to fix it. Do not use hpet with Windows guests; it buys you nothing. > There's an obvious clock skew problem on Windows 7; in the Date & Time > dialog, the clock jumps through seconds visibly too fast. > > I also found a case where HPET bugs are causing a real problem: Terraria > (dedicated server) seems to be relying on (something that relies on) > HPET, and QEMU doesn't get it right. The result is a goofy and > aggravating behavior I've nicknamed "Turbo Monsters of Doom" and it > makes killing anything tougher than a normal zombie basically > impossible. > > -- > You received this bug notification because you are a member of qemu- > devel-ml, which is subscribed to QEMU. > https://bugs.launchpad.net/bugs/599958 > > Title: > Timedrift problems with Win7: hpet missing time drift fixups > > Status in QEMU: > Confirmed > > Bug description: > We've been finding timedrift issues witth Win7 under qemu-kvm on our > daily testing > > kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_load FAIL1 Time > drift too large after rest period: 38.63% > kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_reboot FAIL1 Time > drift too large at iteration 1: 17.77 seconds > kvm.qemu-kvm-git.smp2.Win7.64.timedrift.with_migration FAIL1 > Time drift too large at iteration 2: 3.08 seconds > > Steps to reproduce: > > timedrift.with_load > > 1) Log into a guest. > 2) Take a time reading from the guest and host. > 3) Run load on the guest and host. > 4) Take a second time reading. > 5) Stop the load and rest for a while. > 6) Take a third time reading. > 7) If the drift immediately after load is higher than a user- > specified value (in %), fail. > If the drift after the rest period is higher than a user-specified > value, > fail. > > timedrift.with_migration > > 1) Log into a guest. > 2) Take a time reading from the guest and host. > 3) Migrate the guest. > 4) Take a second time reading. 
> 5) If the drift (in seconds) is higher than a user specified value, fail. > > timedrift.with_reboot > > 1) Log into a guest. > 2) Take a time reading from the guest and host. > 3) Reboot the guest. > 4) Take a second time reading. > 5) If the drift (in seconds) is higher than a user specified value, fail. > > This bug is to register those issues and keep an eye on them. > > Attached, some logs from the autotest tests executed on the guest > > To manage notifications about this bug go to: > https://bugs.launchpad.net/qemu/+bug/599958/+subscriptions -- Gleb.
Re: [Qemu-devel] [PATCH 1/6] kvm: Add KVM_GET_EMULATED_CPUID
On Tue, Sep 24, 2013 at 11:57:00AM +0200, Borislav Petkov wrote: > On Mon, September 23, 2013 6:28 pm, Eduardo Habkost wrote: > > On Sun, Sep 22, 2013 at 04:44:50PM +0200, Borislav Petkov wrote: > >> From: Borislav Petkov > >> > >> Add a kvm ioctl which states which system functionality kvm emulates. > >> The format used is that of CPUID and we return the corresponding CPUID > >> bits set for which we do emulate functionality. > > > > Let me check if I understood the purpose of the new ioctl correctly: the > > only reason for GET_EMULATED_CPUID to exist is to allow userspace to > > differentiate features that are native or that are emulated efficiently > > (GET_SUPPORTED_CPUID) and features that are emulated not very > > efficiently (GET_EMULATED_CPUID)? > > Not only that - emulated features are not reported in CPUID so they > can be enabled only when specifically and explicitly requested, i.e. > "+movbe". Basically, you want to emulate that feature for the guest but > only for this specific guest - the others shouldn't see it. > > > If that's the case, how do we decide how efficient emulation should be, > > to deserve inclusion in GET_SUPPORTED_CPUID? I am guessing that the > > criterion will be: if enabling it doesn't risk making performance worse, > > it can get in GET_SUPPORTED_CPUID. > > Well, in the MOVBE case, supported means, the host can execute this > instruction natively. Now, you guys say you can emulate x2apic very > efficiently and I'm guessing emulating x2apic doesn't bring any > emulation overhead, thus SUPPORTED_CPUID. x2apic emulation has nothing to do with x2apic in a host. It is emulated the same way whether or not the host has it. x2apic is not really a cpu feature but an apic one, and the apic is fully emulated by KVM anyway. > > But for single instructions or group of instructions, the distinction > should be very clear. > > At least this is how I see it but Gleb probably can comment too. > That's how I see it too. 
Basically you want to use movbe emulation (as opposed to virtualization) only if you have a binary kernel that was compiled for a CPU with movbe (Borislav's use case), or if you want to migrate temporarily from a movbe-enabled host to a non-movbe host because downtime is not an option. We should avoid enabling it "by mistake". -- Gleb.
Re: [Qemu-devel] in_asm substitute for accel=kvm:tcg
On Sun, Sep 22, 2013 at 11:05:37AM +0300, Andriy Gapon wrote: > on 22/09/2013 09:31 Gleb Natapov said the following: > > Which kernel version is this? What BSD version? > > $ uname -a > Linux kvm 3.8.0-27-generic #40-Ubuntu SMP Tue Jul 9 00:17:05 UTC 2013 x86_64 > x86_64 x86_64 GNU/Linux > That's a pretty old kernel and there have been a lot of fixes in emulation since. Before we spend more time tracking this down, can you verify that the bug is reproducible with 3.11? -- Gleb.
Re: [Qemu-devel] [PATCH] target-i386: Enable x2apic by default on more recent CPU models
On Fri, Sep 20, 2013 at 04:15:17PM -0300, Eduardo Habkost wrote: > This enables x2apic on the following CPU models: Conroe, Penryn, > Nehalem, Westmere, Opteron_G[12345]. > > Normally we try to keep the CPU model definitions as close as the real > CPUs as possible, but x2apic can be emulated by KVM without host CPU > support for x2apic, and it improves performance by reducing APIC access > overhead. x2apic emulation is available on KVM since 2009 (Linux > 2.6.32-rc1), there's no reason for not enabling x2apic by default when > running KVM. > > About testing: Conroe, Penryn, Nehalem, Westemere and Opteron_G[123] > have x2apic enabled on RHEL-6 since RHEL-6.0, so the presence of x2apic > on those CPU models got lots of testing in the last few years. I want to > eventually enable x2apic on all other CPU models as well, but it will > require some testing to ensure it won't confuse guests. > > This shouldn't affect TCG at all because features not supported by TCG > are automatically and silently disabled by QEMU when initializing the > CPU. 
> > Signed-off-by: Eduardo Habkost Acked-by: Gleb Natapov > --- > hw/i386/pc_piix.c | 9 + > hw/i386/pc_q35.c | 9 + > target-i386/cpu.c | 37 +++-- > 3 files changed, 37 insertions(+), 18 deletions(-) > > diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c > index 907792b..0af00ef 100644 > --- a/hw/i386/pc_piix.c > +++ b/hw/i386/pc_piix.c > @@ -240,6 +240,15 @@ static void pc_compat_1_6(QEMUMachineInitArgs *args) > { > has_pci_info = false; > rom_file_in_ram = false; > +x86_cpu_compat_set_features("Conroe", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Penryn", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Nehalem", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Westmere", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G1", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G2", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G3", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G4", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G5", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > } > > static void pc_compat_1_5(QEMUMachineInitArgs *args) > diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c > index ca84e1c..82e7a23 100644 > --- a/hw/i386/pc_q35.c > +++ b/hw/i386/pc_q35.c > @@ -224,6 +224,15 @@ static void pc_compat_1_6(QEMUMachineInitArgs *args) > { > has_pci_info = false; > rom_file_in_ram = false; > +x86_cpu_compat_set_features("Conroe", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Penryn", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Nehalem", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Westmere", FEAT_1_ECX, 0, CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G1", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G2", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > 
+x86_cpu_compat_set_features("Opteron_G3", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G4", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > +x86_cpu_compat_set_features("Opteron_G5", FEAT_1_ECX, 0, > CPUID_EXT_X2APIC); > } > > static void pc_compat_1_5(QEMUMachineInitArgs *args) > diff --git a/target-i386/cpu.c b/target-i386/cpu.c > index 9abb73f..23a44c3 100644 > --- a/target-i386/cpu.c > +++ b/target-i386/cpu.c > @@ -791,7 +791,7 @@ static x86_def_t builtin_x86_defs[] = { > CPUID_MCE | CPUID_PAE | CPUID_MSR | CPUID_TSC | CPUID_PSE | > CPUID_DE | CPUID_FP87, > .features[FEAT_1_ECX] = > -CPUID_EXT_SSSE3 | CPUID_EXT_SSE3, > +CPUID_EXT_X2APIC | CPUID_EXT_SSSE3 | CPUID_EXT_SSE3, > .features[FEAT_8000_0001_EDX] = > CPUID_EXT2_LM | CPUID_EXT2_NX | CPUID_EXT2_SYSCALL, > .features[FEAT_8000_0001_ECX] = > @@ -813,8 +813,8 @@ static x86_def_t builtin_x86_defs[] = { > CPUID_MCE | CPUID_PAE | CPUID_MSR | CPUID_TSC | CPUID_PSE | > CPUID_DE | CPUID_FP87, > .features[FEAT_1_ECX] = > -CPUID_EXT_SSE41 | CPUID_EXT_CX16 | CPUID_EXT_SSSE3 | > - CPUID_EXT_SSE3, > +CPUID_EXT_X2APIC | CPUID_EXT_SSE41 | CPUID_EXT_CX
Re: [Qemu-devel] in_asm substitute for accel=kvm:tcg
On Thu, Sep 19, 2013 at 08:49:51PM +0300, Andriy Gapon wrote: > on 19/09/2013 19:53 Paolo Bonzini said the following: > > Il 19/09/2013 16:36, Andriy Gapon ha scritto: > >> Not sure how the code ends up at 0x9315 after that. > > > > Events are dropped, probably corresponding to more emulation. > > I've got a trace without dropped events between the last "normal" instruction > and the loop (and also including a snippet where the same code is executed > without a problem): Which kernel version is this? What BSD version? > ... > qemu-system-x86-12024 [003] 278157.048876: kvm_emulate_insn: 0:9366:b1 10 > (prot32) > qemu-system-x86-12024 [003] 278157.048877: kvm_entry:vcpu 0 > qemu-system-x86-12024 [003] 278157.048878: kvm_emulate_insn: 0:9368:8e d1 > (prot32) > qemu-system-x86-12024 [003] 278157.048880: kvm_entry:vcpu 0 > qemu-system-x86-12024 [003] 278157.048882: kvm_exit: reason > CR_ACCESS rip 0x9312 info 0 0 > qemu-system-x86-12024 [003] 278157.048883: kvm_cr: cr_write 0 > = 0x10 > qemu-system-x86-12024 [003] 278157.048885: kvm_entry:vcpu 0 > qemu-system-x86-12024 [003] 278157.048886: kvm_emulate_insn: 0:9315:ea 1a > 93 00 00 (real) > qemu-system-x86-12024 [003] 278157.048887: kvm_entry:vcpu 0 > qemu-system-x86-12024 [003] 278157.04: kvm_emulate_insn: 0:931a:31 c0 > (real) > ... ... 
> qemu-system-x86-12024 [003] 278157.048990: kvm_set_irq: gsi 4 level > 0 > source 0 > qemu-system-x86-12024 [003] 278157.048991: kvm_pic_set_irq: chip 0 pin 4 > (edge|masked) > qemu-system-x86-12024 [003] 278157.048992: kvm_ioapic_set_irq: pin 4 dst 0 > vec=0 (Fixed|physical|edge|masked) > qemu-system-x86-12024 [003] 278157.049001: kvm_entry:vcpu 0 > qemu-system-x86-12024 [003] 278157.049002: kvm_exit: reason > IO_INSTRUCTION rip 0x1e675 info 3fd0008 0 > qemu-system-x86-12024 [003] 278157.049005: kvm_emulate_insn: > a000:1e675:ec > (prot32) > qemu-system-x86-12024 [003] 278157.049005: kvm_pio: pio_read at > 0x3fd size 1 count 1 > qemu-system-x86-12024 [003] 278157.049006: kvm_userspace_exit: reason > KVM_EXIT_IO (2) > qemu-system-x86-12024 [003] 278157.049024: kvm_entry:vcpu 0 > qemu-system-x86-12024 [003] 278157.049027: kvm_exit: reason > CR_ACCESS rip 0x9312 info 0 0 > qemu-system-x86-12024 [003] 278157.049028: kvm_cr: cr_write 0 > = 0x10 > qemu-system-x86-12024 [003] 278157.049030: kvm_entry:vcpu 0 > qemu-system-x86-12024 [003] 278157.049031: kvm_emulate_insn: 0:9315: > (real) > qemu-system-x86-12024 [003] 278157.049033: kvm_emulate_insn: 0:9315: > (real) > qemu-system-x86-12024 [003] 278157.049034: kvm_emulate_insn: 0:9315: > (real) > ... > > It's strange that no instruction gets reported in those repeating "0:9315: > (real)" lines. It's like kvm is somehow losing track of what should be > executed > and just loops over the same ip without actually doing anything. > > -- > Andriy Gapon -- Gleb.
Re: [Qemu-devel] [PATCH] linux-headers: update to 3.11
On Wed, Sep 18, 2013 at 01:04:01PM +1000, Alexey Kardashevskiy wrote: > On 09/05/2013 04:07 PM, Paolo Bonzini wrote: > > Il 05/09/2013 05:16, Alexey Kardashevskiy ha scritto: > >> Sorry for my ignorance, but this is The Kernel, it is already there, > >> broken > >> or not, even if it is broken, qemu cannot stay isolated, no? > >> This is a mechanical change, no more. > > It's a matter of keeping things bisectable. If we can detect a > breakage, we can first work around it, and then apply the header update. > And if we don't detect it, maintainers usually send pull requests when > they have time to work on breakage caused by their patches. > >> > >> I can see the discussion but I do not see if anyone is going to pull this > >> through any tree. Please, somebody, pull. Thanks. > > > > It will go in through the KVM tree, probably sometime next week. > > > It is pretty impressive how such a trivial patch is still on its way to the > upstream :) > It was applied a long time ago: https://git.kernel.org/cgit/virt/kvm/qemu-kvm.git/commit/?h=uq/master&id=8c409e659bc1373fab49a277bb5dc71babb73551 Syncing with qemu.git does not happen after every patch, though. If you are working on kvm parts, using the qemu-kvm.git uq/master branch will make sure that you are working on the most up-to-date kvm bits. If your work depends on something that is not in qemu-kvm.git, then let the kvm maintainers know and we will send a pull request to Anthony earlier. -- Gleb.
Re: [Qemu-devel] in_asm substitute for accel=kvm:tcg
On Tue, Sep 17, 2013 at 05:33:57PM +0300, Andriy Gapon wrote: > on 17/09/2013 15:32 Andreas Färber said the following: > > Hi, > > > > Am 17.09.2013 13:37, schrieb Andriy Gapon: > >> > >> It seems that when qemu is run with accel=kvm:tcg then -d in_asm does not > >> produce anything. At least, with the qemu and kvm that I have access to. > > > > Are you saying that with accel=kvm:tcg when falling back to TCG, -d > > in_asm does not work? > > Yes, exactly. I see the same behavior with accel=kvm:tcg and accel=kvm:tcg. > qemu version is 1.4.0. > > > For accel=kvm it's expected not to produce any output since QEMU does > > not process any Translation Blocks then. > > > >> Is there any way to obtain equivalent logging in such a configuration? > >> A note: a host and a guest are both amd64 (x86_64). > > > > Under Linux, trace options for the kvm kernel module can be enabled via > > the file system. > > I'll try to google for this. > http://www.linux-kvm.org/page/Tracing -- Gleb.
Re: [Qemu-devel] cpufreq and QEMU guests
On Mon, Sep 16, 2013 at 08:42:58PM +0200, Benoît Canet wrote: > Le Monday 16 Sep 2013 à 18:58:40 (+0300), Gleb Natapov a écrit : > > On Mon, Sep 16, 2013 at 05:46:04PM +0200, Benoît Canet wrote: > > > Le Monday 16 Sep 2013 à 18:32:39 (+0300), Gleb Natapov a écrit : > > > > On Mon, Sep 16, 2013 at 05:05:45PM +0200, Benoît Canet wrote: > > > > > Le Monday 16 Sep 2013 à 09:39:10 (-0500), Alexander Graf a écrit : > > > > > > > > > > > > > > > > > > Am 16.09.2013 um 07:15 schrieb Benoît Canet > > > > > > : > > > > > > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > I know a cloud provider worried about the fact that the > > > > > > > /proc/cpuinfo of his > > > > > > > guests give a bogus frequency to his customer. > > > > > > > > > > > > > > QEMU and the guests kernel currently have no way to reflect the > > > > > > > host frequency > > > > > > > changes to the guests. > > > > > > > > > > > > > > The customer compute intensive application then read this > > > > > > > information and take > > > > > > > wrong decisions. > > > > > > > > > > > > Why do they care about the frequency? Is it for scheduling > > > > > > workloads? The only other case I can think of would be the TSC and > > > > > > that should be fixed frequency these days. > > > > > > > > > > > > If it's scheduling, you could maybe expose the unavailable compute > > > > > > time as steal time to the guest. Exposibg frequency in a virtual > > > > > > environment feels backwards. > > > > > > > > > > The final customer have a compute intensive workload. > > > > > At startup the code retrieve the cpu cache topology, the cpu model, > > > > > and various > > > > > informations including the guest cpu frequency before starting the > > > > > compute job. > > > > > The QEMU instance typicaly use -cpu host. > > > > > > > > > > The code inspects the cpu frequency has seen by the guests to choose > > > > > the number > > > > > of vms to instanciate to compute the given task. 
> > > > I am not sure I understand. They look at guest cpu frequency to estimate > > > > guest's performance? > > > > > > Yes they take guest cpu count, model and frequency to estimate the > > > performance > > > of the guest. > > > Next they cluster enough guests to be able to compute the job in a given > > > time by > > > using this estimate. > > > > > They do it wrong. They should take guest cpu count, host cpu model and > > frequency, pcpu/vcpu over commit (if any), guest/host memory overcommit > > (if any) and estimate performance based on this. For pure computational > > performance guest core performance should be close to host core > > performance if there is not cpu/memory overcommit. With a lot of IO > > things become more complicated. > > I ommited to write some details of the use case. > > The cloud is a Amazon compatible one this means there is no guest agent in the > guest to help retrieve the host frequency and model. > > Also the AWS APIs don't provide a way to communicate the host CPU infos to the > program responsible of the vm orchestrations. > > So the only interface to access the host cpu info is QEMU and it's started > with > -cpu host to passthrough the cpu model to the guest. > Why are they sure they are started with "-cpu host"? Do they know whether the host is overcommitted or the guest's vcpu usage is restricted by any other means? > What hurt the final customer badly is that the guest /proc/cpuinfo see the > regular max frequency of the host cpu but won't see the turbo frequency or a > scaled down one. > What he sees is the host tsc frequency of the cpu the guest was booted on [1], which should be adequate to estimate performance if the guest is not migrated. The frequency at which the host cpu is running at any given moment is out of the guest's control and depends on the host frequency governor and load. [1] The value comes from the host; for hosts without constant tsc this is the maximum possible frequency -- Gleb.
Re: [Qemu-devel] cpufreq and QEMU guests
On Mon, Sep 16, 2013 at 05:46:04PM +0200, Benoît Canet wrote: > Le Monday 16 Sep 2013 à 18:32:39 (+0300), Gleb Natapov a écrit : > > On Mon, Sep 16, 2013 at 05:05:45PM +0200, Benoît Canet wrote: > > > Le Monday 16 Sep 2013 à 09:39:10 (-0500), Alexander Graf a écrit : > > > > > > > > > > > > Am 16.09.2013 um 07:15 schrieb Benoît Canet : > > > > > > > > > > > > > > Hello, > > > > > > > > > > I know a cloud provider worried about the fact that the /proc/cpuinfo > > > > > of his > > > > > guests give a bogus frequency to his customer. > > > > > > > > > > QEMU and the guests kernel currently have no way to reflect the host > > > > > frequency > > > > > changes to the guests. > > > > > > > > > > The customer compute intensive application then read this information > > > > > and take > > > > > wrong decisions. > > > > > > > > Why do they care about the frequency? Is it for scheduling workloads? > > > > The only other case I can think of would be the TSC and that should be > > > > fixed frequency these days. > > > > > > > > If it's scheduling, you could maybe expose the unavailable compute time > > > > as steal time to the guest. Exposibg frequency in a virtual environment > > > > feels backwards. > > > > > > The final customer have a compute intensive workload. > > > At startup the code retrieve the cpu cache topology, the cpu model, and > > > various > > > informations including the guest cpu frequency before starting the > > > compute job. > > > The QEMU instance typicaly use -cpu host. > > > > > > The code inspects the cpu frequency has seen by the guests to choose the > > > number > > > of vms to instanciate to compute the given task. > > I am not sure I understand. They look at guest cpu frequency to estimate > > guest's performance? > > Yes they take guest cpu count, model and frequency to estimate the performance > of the guest. > Next they cluster enough guests to be able to compute the job in a given time > by > using this estimate. > They do it wrong. 
They should take the guest cpu count, host cpu model and frequency, pcpu/vcpu overcommit (if any), and guest/host memory overcommit (if any), and estimate performance based on these. For pure computational workloads, guest core performance should be close to host core performance if there is no cpu/memory overcommit. With a lot of IO, things become more complicated. -- Gleb.
Re: [Qemu-devel] cpufreq and QEMU guests
On Mon, Sep 16, 2013 at 05:05:45PM +0200, Benoît Canet wrote: > Le Monday 16 Sep 2013 à 09:39:10 (-0500), Alexander Graf a écrit : > > > > > > Am 16.09.2013 um 07:15 schrieb Benoît Canet : > > > > > > > > Hello, > > > > > > I know a cloud provider worried about the fact that the /proc/cpuinfo of > > > his > > > guests give a bogus frequency to his customer. > > > > > > QEMU and the guests kernel currently have no way to reflect the host > > > frequency > > > changes to the guests. > > > > > > The customer compute intensive application then read this information and > > > take > > > wrong decisions. > > > > Why do they care about the frequency? Is it for scheduling workloads? The > > only other case I can think of would be the TSC and that should be fixed > > frequency these days. > > > > If it's scheduling, you could maybe expose the unavailable compute time as > > steal time to the guest. Exposibg frequency in a virtual environment feels > > backwards. > > The final customer have a compute intensive workload. > At startup the code retrieve the cpu cache topology, the cpu model, and > various > informations including the guest cpu frequency before starting the compute > job. > The QEMU instance typicaly use -cpu host. > > The code inspects the cpu frequency has seen by the guests to choose the > number > of vms to instanciate to compute the given task. I am not sure I understand. They look at guest cpu frequency to estimate guest's performance? > They even destroy and recreate some vms that would be underperforming to > mitigate the high inter vm communication costs. > > Do you think the steal time trick would work for this ? > -- Gleb.
Re: [Qemu-devel] [PATCH] qdev-monitor: Avoid exiting when hot-plugging two devices with the same bootindex value
On Mon, Sep 16, 2013 at 12:54:39PM +0300, Marcel Apfelbaum wrote: > On Thu, 2013-09-12 at 13:04 +0200, Markus Armbruster wrote: > > Marcel Apfelbaum writes: > > > > > On Thu, 2013-09-12 at 11:43 +0200, Markus Armbruster wrote: > > >> Paolo Bonzini writes: > > >> > > >> > Il 11/09/2013 20:26, Marcel Apfelbaum ha scritto: > > >> >> Qemu is expected to quit if the same boot index value is used by > > >> >> two devices. > > >> >> However, hot-plugging a device with a bootindex value already used > > >> >> should > > >> >> fail with a friendly message rather than quitting a running VM. > > >> > > > >> > I think the problem is right where QEMU exits, i.e. in > > >> > add_boot_device_path. This function should return an error instead, > > >> > via > > >> > an Error ** argument. > > >> > > >> Agree. > > I understood that the boot order is passed in fw cfg and updated only once at > "machine done". There is no update of this list after this point. The reason it is done at this point is because when add_boot_device_path() is called dev is not fully instantiated so qdev_get_fw_dev_path() cannot be called on it yet. For hotplug we need to re-create the boot device list when the device is fully ready. > Modifying the boot order from monitor does not work at all. > > So in order to solve this issue we can: > 1. Don't allow use of bootindex at hot-plug I'd rather have a proper fix than a workaround. BTW this will change the qmp interface so a command that worked before (for some definition of "worked") will start to fail. Markus proposed to ignore a bootindex clash, also a simple solution that has no downside described above, but has others that we discussed. > 2. Change the architecture so boot order changing during hot-plug will be > possible. > This is an easy part of the problem though. The hard part is how not to exit when a bootindex clash happens, and this is not easy since nobody knows how well device creation errors are handled by qdev. -- Gleb.
Re: [Qemu-devel] [PATCH v2 0/2] KVM: s390: add floating irq controller
On Fri, Sep 06, 2013 at 03:30:38PM +0200, Christian Borntraeger wrote: > On 06/09/13 14:19, Jens Freimann wrote:> This series adds a kvm_device that > acts as a irq controller for floating > > interrupts. As a first step it implements functionality to retrieve and > > inject > > interrupts for the purpose of migration and for hardening the reset code by > > allowing user space to explicitly remove all pending floating interrupts. > > > > PFAULT patches will also use this device for enabling/disabling pfault, > > therefore > > the pfault patch series will be reworked to use this device. > > > > * Patch 1/2 adds a new data structure to hold interrupt information. The > > current > > one (struct kvm_s390_interrupt) does not allow to inject every kind of > > interrupt, > > e.g. some data for program interrupts and machine check interruptions were > > missing. > > > > * Patch 2/2 adds a kvm_device which supports getting/setting currently > > pending > > floating interrupts as well as deleting all currently pending interrupts > > > > > > Jens Freimann (2): > > KVM: s390: add and extend interrupt information data structs > > KVM: s390: add floating irq controller > > > > Documentation/virtual/kvm/devices/s390_flic.txt | 36 +++ > > arch/s390/include/asm/kvm_host.h| 35 +-- > > arch/s390/include/uapi/asm/kvm.h| 5 + > > arch/s390/kvm/interrupt.c | 304 > > > > arch/s390/kvm/kvm-s390.c| 1 + > > include/linux/kvm_host.h| 1 + > > include/uapi/linux/kvm.h| 65 + > > virt/kvm/kvm_main.c | 5 + > > 8 files changed, 368 insertions(+), 84 deletions(-) > > create mode 100644 Documentation/virtual/kvm/devices/s390_flic.txt > > > > > Gleb, Paolo, > > since the qemu part relies on a kernel header file, it makes sense to not > only let the kernel > part go via the kvm tree, but also the qemu part. I want Alex to Ack the > interface, and if he > agrees then I am fine with applying the whole series. > Still waiting for Alex's ACK. 
> If nothing else comes up, feel free to apply the small change request from > Peter yourself or > ask Jens for a resend. > > --snip > > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -908,7 +908,7 @@ struct kvm_device_attr { > #define KVM_DEV_TYPE_FSL_MPIC_20 1 > #define KVM_DEV_TYPE_FSL_MPIC_42 2 > #define KVM_DEV_TYPE_XICS 3 > -#define KVM_DEV_TYPE_FLIC 4 > +#define KVM_DEV_TYPE_FLIC 5 > > /* > * ioctls for VM fds > > --snip -- Gleb.
Re: [Qemu-devel] [Bug 1180777] Re: Windows 7 VM freeze on Ubuntu 12.04 KVM
On Fri, Sep 13, 2013 at 07:29:49PM -, Vasile Dumitrescu wrote: > I also see these EXACT symptoms, using kvm (VM managed through livirt > virsh) on Debian x64 host, guest is Windows 8, RedHat VirtIo network > driver. > Can you trace KVM [1] when hang happens next time? [1] http://www.linux-kvm.org/page/Tracing > rgds > > ** Also affects: debian >Importance: Undecided >Status: New > > -- > You received this bug notification because you are a member of qemu- > devel-ml, which is subscribed to QEMU. > https://bugs.launchpad.net/bugs/1180777 > > Title: > Windows 7 VM freeze on Ubuntu 12.04 KVM > > Status in QEMU: > New > Status in “qemu-kvm” package in Ubuntu: > Confirmed > Status in Debian GNU/Linux: > New > > Bug description: > Hi, > > I have recently setup a Windows 7 VM on KVM and started using it > through remote desktop. > > What happens is that, after some hours of usage, the remote desktop > connection freezes. I thought it was a remmina bug, as the it was > enough to kill and restart it to successfully connect again to the VM. > > However, today I've switched to a different RDP client (2X Client > chromium app) and the freeze just happened again! > > Some information: > - the host and the VM are completely idle when the freeze occurs > - I've tried sniffing the network packets toward the RDP port during the > freeze and found that the client is sending packets but no packet is sent back > > Could this be a KVM issue? How can I further debug this one (I expect > the freeze to happen again...)? 
> > ProblemType: Bug > DistroRelease: Ubuntu 12.04 > Package: kvm 1:84+dfsg-0ubuntu16+1.0+noroms+0ubuntu14.8 > ProcVersionSignature: Ubuntu 3.2.0-41.66-generic 3.2.42 > Uname: Linux 3.2.0-41-generic x86_64 > ApportVersion: 2.0.1-0ubuntu17.2 > Architecture: amd64 > Date: Thu May 16 14:12:40 2013 > MachineType: Hewlett-Packard HP ProBook 4520s > MarkForUpload: True > ProcEnviron: > TERM=xterm > PATH=(custom, no user) > LANG=en_US.UTF-8 > SHELL=/bin/bash > ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-41-generic > root=UUID=D2E20BC3E20BAAB5 loop=/hostname/disks/root.disk ro quiet splash > vt.handoff=7 > SourcePackage: qemu-kvm > UpgradeStatus: No upgrade log present (probably fresh install) > dmi.bios.date: 08/26/2010 > dmi.bios.vendor: Hewlett-Packard > dmi.bios.version: 68AZZ Ver. F.0A > dmi.board.name: 1411 > dmi.board.vendor: Hewlett-Packard > dmi.board.version: KBC Version 57.30 > dmi.chassis.type: 10 > dmi.chassis.vendor: Hewlett-Packard > dmi.modalias: > dmi:bvnHewlett-Packard:bvr68AZZVer.F.0A:bd08/26/2010:svnHewlett-Packard:pnHPProBook4520s:pvr:rvnHewlett-Packard:rn1411:rvrKBCVersion57.30:cvnHewlett-Packard:ct10:cvr: > dmi.product.name: HP ProBook 4520s > dmi.sys.vendor: Hewlett-Packard > > To manage notifications about this bug go to: > https://bugs.launchpad.net/qemu/+bug/1180777/+subscriptions -- Gleb.
Re: [Qemu-devel] guest kernel 3.10 panic at boot (__mcheck_cpu_init_generic) with kvm64 vcpu + amd host cpu (qemu 1.4)
On Wed, Sep 11, 2013 at 05:39:01AM +0200, Alexandre DERUMIER wrote: > Hi List, > I'm trying to boot a debian squeeze guest with a 3.10 kernel, and I have a > crash at boot > > This only happen with kvm64 vcpu + amd host (opteron 6100 or opteron 6300). > Check host dmesg for unhandled msrs. > host vcpu + amd host works fine > kvm64 vcpu + intel host works fine too. > > > host os is proxmox 3.1 (rhel 6.4 2.6.32 kernel + qemu 64) > > I can boot fine this guest with 3.7 kernel. > > (I'll try to build 3.8 and 3.9 kernel later this week). > > Any idea ? > > Regards, > > Alexandre > > > > > [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-3.10.7 > root=UUID=a8eb9f6e-dbc6-4797-baea-8e7198e8e541 ro console=ttyS0 console=tty0 > [0.00] e820: BIOS-provided physical RAM map: > [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable > [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved > [0.00] BIOS-e820: [mem 0x000f-0x000f] reserved > [0.00] BIOS-e820: [mem 0x0010-0xdfffdfff] usable > [0.00] BIOS-e820: [mem 0xdfffe000-0xdfff] reserved > [0.00] BIOS-e820: [mem 0xfeffc000-0xfeff] reserved > [0.00] BIOS-e820: [mem 0xfffc-0x] reserved > [0.00] BIOS-e820: [mem 0x0001-0x00011fff] usable > [0.00] NX (Execute Disable) protection: active > [0.00] SMBIOS 2.4 present. 
> [0.00] No AGP bridge found > [0.00] e820: last_pfn = 0x12 max_arch_pfn = 0x4 > [0.00] x86 PAT enabled: cpu 0, old 0x70406, new 0x7010600070106 > [0.00] e820: last_pfn = 0xdfffe max_arch_pfn = 0x4 > [0.00] found SMP MP-table at [mem 0x000fdaa0-0x000fdaaf] mapped at > [880fdaa0] > [0.00] init_memory_mapping: [mem 0x-0x000f] > [0.00] init_memory_mapping: [mem 0x11fe0-0x11fff] > [0.00] init_memory_mapping: [mem 0x11c00-0x11fdf] > [0.00] init_memory_mapping: [mem 0x1-0x11bff] > [0.00] init_memory_mapping: [mem 0x0010-0xdfffdfff] > [0.00] RAMDISK: [mem 0x37576000-0x37fe] > [0.00] ACPI: RSDP 000fd890 00014 (v00 BOCHS ) > [0.00] ACPI: RSDT dfffe380 00034 (v01 BOCHS BXPCRSDT > 0001 BXPC 0001) > [0.00] ACPI: FACP df80 00074 (v01 BOCHS BXPCFACP > 0001 BXPC 0001) > [0.00] ACPI: DSDT dfffe3c0 011A9 (v01 BXPC BXDSDT > 0001 INTL 20100528) > [0.00] ACPI: FACS df40 00040 > [0.00] ACPI: SSDT d6e0 00858 (v01 BOCHS BXPCSSDT > 0001 BXPC 0001) > [0.00] ACPI: APIC d5b0 00090 (v01 BOCHS BXPCAPIC > 0001 BXPC 0001) > [0.00] ACPI: HPET d570 00038 (v01 BOCHS BXPCHPET > 0001 BXPC 0001) > [0.00] No NUMA configuration found > [0.00] Faking a node at [mem 0x-0x00011fff] > [0.00] Initmem setup node 0 [mem 0x-0x11fff] > [0.00] NODE_DATA [mem 0x11fff8000-0x11fffbfff] > [0.00] Zone ranges: > [0.00] DMA [mem 0x1000-0x00ff] > [0.00] DMA32[mem 0x0100-0x] > [0.00] Normal [mem 0x1-0x11fff] > [0.00] Movable zone start for each node > [0.00] Early memory node ranges > [0.00] node 0: [mem 0x1000-0x0009efff] > [0.00] node 0: [mem 0x0010-0xdfffdfff] > [0.00] node 0: [mem 0x1-0x11fff] > [0.00] ACPI: PM-Timer IO Port: 0xb008 > [0.00] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled) > [0.00] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled) > [0.00] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled) > [0.00] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled) > [0.00] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1]) > [0.00] ACPI: IOAPIC (id[0x00] address[0xfec0] gsi_base[0]) > [0.00] IOAPIC[0]: apic_id 0, 
version 17, address 0xfec0, GSI 0-23 > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level) > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level) > [0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level) > [0.00] Using ACPI (MADT) for SMP configuration information > [0.00] ACPI: HPET id: 0x8086a201 base: 0xfed0 > [0.00] smpboot: Allowing 4 CPUs, 0 hotplug CPUs > [0.00] PM: Registered nosave memory: 0009f000 - > 000a > [0.00] PM: Registered nosave memory: 000a - > 000f > [0.0
Re: [Qemu-devel] [PATCH uq/master 1/2] x86: fix migration from pre-version 12
On Mon, Sep 09, 2013 at 01:46:49PM +0200, Paolo Bonzini wrote: > >> Yes. QEMU unmarshals information from the XSAVE region and back, so it > >> cannot support MPX or AVX-512 yet (even if KVM were). Separate bug, > >> though. > >> > > IMO this is the main issue here, not separate bug. If we gonna let guest > > use CPU state QEMU does not support we gonna have a bad time. > > We cannot force the guest not to use a feature; all we can do is hide > >>> > >>> Of course we can't, this is correct for other features too, but this is > >>> guest's problem. > >> > >> Ok, then we agree that QEMU doesn't have a problem? The XSAVE data will > > > > Which problem exactly. The problems I see is that 1. We do not support > > MPX and AVX-512 (but this is probably not the problem you meant :)) 2. 0D > > data is not consistent with features. Guest may not expect it and do stupid > > things. > > It is not a problem to unmarshal information out of KVM_GET_XSAVE data > (and back). If the guest does stupid things, it's a bug in an > ill-behaving guest. > You know I am first in line to blame guest for everything :) (who needs guests anyway) but in this case I didn't mean that guest does something illegal. If we advertise support for some XSAVE state in 0D leaf the guest is within its rights to draw conclusions we may not expect from that. It may check the corresponding feature bit and crash if it is not present, for instance. > On the other hand, I agree that passthrough of host 0xD data is bad and > will fix it. > Thanks! -- Gleb.
Re: [Qemu-devel] [PATCH uq/master 1/2] x86: fix migration from pre-version 12
On Mon, Sep 09, 2013 at 01:07:37PM +0200, Paolo Bonzini wrote: > In fact, another bug is that kvm_vcpu_ioctl_x86_set_xsave ignores > xstate_bv when XSAVE is not available. Instead, it should reset the > FXSAVE data to processor-reset values (except for MXCSR which always > comes from XRSTOR data), i.e. to all-zeros except for the x87 control > and tag words. It should also check reserved bits of MXCSR. > >>> > >>> I do not see why. > >> > >> Because otherwise it behaves in a subtly different manner for XSAVE and > >> non-XSAVE hosts. > > > > I do not see how. Can you elaborate? > > Suppose userspace calls KVM_SET_XSAVE with XSTATE_BV=0. > > On an XSAVE host, when the guest FPU state is loaded KVM will do an > XRSTOR. The XRSTOR will restore the FPU state to default values. > > On a non-XSAVE host, when the guest FPU state is loaded KVM will do an > FXRSTR. The FXRSTR will load the FPU state from the first 512 bytes of > the block that was passed to KVM_SET_XSAVE. > > This is not a problem because userspace will usually pass to > KVM_SET_XSAVE only something that it got from KVM_GET_XSAVE, and > KVM_GET_XSAVE will never set XSTATE_BV=0. However, KVM_SET_XSAVE is > supposed to emulate XSAVE/XRSTOR if it is not available, and it is > failing to emulate this detail. > You are trying to be bug to bug compatible :) XSTATE_BV can be zero only if FPU state is reset one, otherwise the guest will not survive. KVM_SET_XSAVE is not suppose to emulate XSAVE/XRSTOR, it is not emulator function. It is better to outlaw zero value for XSTATE_BV at all, but we cannot do it because current QEMU uses it. > Yes. QEMU unmarshals information from the XSAVE region and back, so it > cannot support MPX or AVX-512 yet (even if KVM were). Separate bug, > though. > > >>> IMO this is the main issue here, not separate bug. If we gonna let guest > >>> use CPU state QEMU does not support we gonna have a bad time. 
> >> > >> We cannot force the guest not to use a feature; all we can do is hide > > > > Of course we can't, this is correct for other features too, but this is > > guest's problem. > > Ok, then we agree that QEMU doesn't have a problem? The XSAVE data will Which problem exactly. The problems I see is that 1. We do not support MPX and AVX-512 (but this is probably not the problem you meant :)) 2. 0D data is not consistent with features. Guest may not expect it and do stupid things. > always be "fresh" as long as the guest obeys CPUID bits it receives, and > the CPUID bits that QEMU passes will never enable XSAVE data that QEMU > does not support. > -- Gleb.
Re: [Qemu-devel] [PATCH uq/master 1/2] x86: fix migration from pre-version 12
On Mon, Sep 09, 2013 at 01:54:50PM +0300, Gleb Natapov wrote: > On Mon, Sep 09, 2013 at 11:53:45AM +0200, Paolo Bonzini wrote: > > Il 09/09/2013 11:03, Gleb Natapov ha scritto: > > > On Mon, Sep 09, 2013 at 10:31:15AM +0200, Paolo Bonzini wrote: > > >> Il 08/09/2013 13:40, Gleb Natapov ha scritto: > > >>> On Thu, Sep 05, 2013 at 03:06:21PM +0200, Paolo Bonzini wrote: > > >>>> On KVM, the KVM_SET_XSAVE would be executed with a 0 xstate_bv, > > >>>> and not restore anything. > > >>>> > > >>> XRSTOR restores FP/SSE state to reset state if no bits are set in > > >>> xstate_bv. This is what should happen on reset, no? > > >> > > >> Yes. The problem happens on the migration destination when XSAVE data is > > >> not transmitted. FP/SSE data is transmitted and must be restored, but > > >> xstate_bv is zero and KVM_SET_XSAVE restores FP/SSE state to reset > > >> state. The vcpu then loses the values that were set in the migration > > >> data. > > >> > > >>>> Since FP and SSE data are always valid, set them in xstate_bv at reset > > >>>> time. In fact, that value is the same that KVM_GET_XSAVE returns on > > >>>> pre-XSAVE hosts. > > >>> It is needed for migration between non xsave host to xsave host. > > >> > > >> Yes, and this patch does the same for migration between non-XSAVE QEMU > > >> and XSAVE QEMU. > > >> > > > Can such migration happen? The commit that added xsave support > > > (f1665b21f16c5dc0ac37de60233a4975aff31193) changed vmstate version id. > > > > Yes, old->new migration can happen. New->old of course cannot. > > > I see. I am fine with the patch, but please drop defines that are not > used in the patch itself. > BTW migration question, will xstate_bv not be zeroed by migration code in the old->new case? -- Gleb.
Re: [Qemu-devel] [PATCH uq/master 1/2] x86: fix migration from pre-version 12
On Mon, Sep 09, 2013 at 11:53:45AM +0200, Paolo Bonzini wrote: > Il 09/09/2013 11:03, Gleb Natapov ha scritto: > > On Mon, Sep 09, 2013 at 10:31:15AM +0200, Paolo Bonzini wrote: > >> Il 08/09/2013 13:40, Gleb Natapov ha scritto: > >>> On Thu, Sep 05, 2013 at 03:06:21PM +0200, Paolo Bonzini wrote: > >>>> On KVM, the KVM_SET_XSAVE would be executed with a 0 xstate_bv, > >>>> and not restore anything. > >>>> > >>> XRSTOR restores FP/SSE state to reset state if no bits are set in > >>> xstate_bv. This is what should happen on reset, no? > >> > >> Yes. The problem happens on the migration destination when XSAVE data is > >> not transmitted. FP/SSE data is transmitted and must be restored, but > >> xstate_bv is zero and KVM_SET_XSAVE restores FP/SSE state to reset > >> state. The vcpu then loses the values that were set in the migration data. > >> > >>>> Since FP and SSE data are always valid, set them in xstate_bv at reset > >>>> time. In fact, that value is the same that KVM_GET_XSAVE returns on > >>>> pre-XSAVE hosts. > >>> It is needed for migration between non xsave host to xsave host. > >> > >> Yes, and this patch does the same for migration between non-XSAVE QEMU > >> and XSAVE QEMU. > >> > > Can such migration happen? The commit that added xsave support > > (f1665b21f16c5dc0ac37de60233a4975aff31193) changed vmstate version id. > > Yes, old->new migration can happen. New->old of course cannot. > I see. I am fine with the patch, but please drop defines that are not used in the patch itself. > >> In fact, another bug is that kvm_vcpu_ioctl_x86_set_xsave ignores > >> xstate_bv when XSAVE is not available. Instead, it should reset the > >> FXSAVE data to processor-reset values (except for MXCSR which always > >> comes from XRSTOR data), i.e. to all-zeros except for the x87 control > >> and tag words. It should also check reserved bits of MXCSR. > > > > I do not see why. > > Because otherwise it behaves in a subtly different manner for XSAVE and > non-XSAVE hosts. 
I do not see how. Can you elaborate? > > >> Yes. QEMU unmarshals information from the XSAVE region and back, so it > >> cannot support MPX or AVX-512 yet (even if KVM were). Separate bug, > >> though. > >> > > IMO this is the main issue here, not separate bug. If we gonna let guest > > use CPU state QEMU does not support we gonna have a bad time. > > We cannot force the guest not to use a feature; all we can do is hide Of course we can't, this is correct for other features too, but this is guest's problem. > the CPUID bits so that a well-behaved guest will not use it. QEMU does > hide CPUID bits for non-supported XSAVE states, except for "-cpu host". > So this will not be a problem except with "-cpu host". > -- Gleb.
Re: [Qemu-devel] [PATCH uq/master 2/2] KVM: make XSAVE support more robust
On Mon, Sep 09, 2013 at 11:50:03AM +0200, Paolo Bonzini wrote: > Il 09/09/2013 11:18, Gleb Natapov ha scritto: > > On Mon, Sep 09, 2013 at 10:51:58AM +0200, Paolo Bonzini wrote: > >> Il 08/09/2013 13:52, Gleb Natapov ha scritto: > >>> On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote: > >>>> QEMU moves state from CPUArchState to struct kvm_xsave and back when it > >>>> invokes the KVM_*_XSAVE ioctls. Because it doesn't treat the XSAVE > >>>> region as an opaque blob, it might be impossible to set some state on > >>>> the destination if migrating to an older version. > >>>> > >>>> This patch blocks migration if it finds that unsupported bits are set > >>>> in the XSTATE_BV header field. To make this work robustly, QEMU should > >>>> only report in env->xstate_bv those fields that will actually end up > >>>> in the migration stream. > >>> > >>> We usually handle host cpu differences in cpuid layer, not by trying to > >>> validate migration data. > >> > >> Actually we do both. QEMU for example detects invalid subsections and > >> blocks migration, and CPU differences also result in subsections that > >> the destination does not know. > >> > > That's different from what you do here though. If xstate_bv was in its > > separate subsection things would be easier, but it is not. > > I agree. And also if YMM was in its separate subsections; future XSAVE > states will likely use subsections (whose presence is keyed off bits in > env->xstate_bv). > > >> However, KVM_GET/SET_XSAVE should still return all values supported by > >> the hypervisor, independent of the supported CPUID bits. > > > > Why? > > Because this is not talking to the guest, it is talking to userspace. > > The VCPU state is more than what is visible to the guest, and returning If a state does not affect guest in any way there is not reason to migrate it. > all of it seems more consistent with the rest of the KVM API. 
For > example, KVM_GET_FPU always returns SSE state even if the CPUID lacks > SSE and/or FXSR. There are counter examples too :) If APIC is not created we do not return fake information on GET_IRQCHIP. I think nobody expected FPU state to grow indefinitely, so fixed, inflexible API was introduced, but now, when CPU state has flexible extended state management it make sense to model it in the API too. > > >> A well-behaved guest should not have modified that state anyway, since: > >> > >> * the source and destination machines should have the same CPU > >> > >> * since the destination QEMU does not support the feature, the source > >> should have masked it as well > >> > >> * the guest should always probe CPUID before using a feature > >> > > The I fail to see what is the purpose of the patch. I see two cases: > > 1. Each extended state has separate CPUID bit (is this guarantied?) > > Not guaranteed, but it has always happened so far (AVX, AVX-512, MPX). > OK, So for now no need to make 0D configurable, but we need to provide correct one according to those flags, not to passthrough host values. > > - In this case, as you say, by matching CPUID on src and dst we guaranty > > that migration data is good. > > But we don't match CPUID on src and destination. This is something that Yes, I was saying that management infrastructure already knows how to handle it. > the user should do, but it's better if we can test it too. Subsections > do that for us; I am, in some sense, emulating subsections for the XSAVE > states that are not stored in subsections. We do not do it for other bits. It is possible currently to migrate to a slightly different cpu without failure and it may cause guest to crash, but we are not trying actively to catch those situations. Why XSAVE is different? > > >> In fact, perhaps even XSTATE_SUPPORTED is not restrictive enough here, > >> and we should hide all features that are not visible in CPUID. It is > >> okay, however, to test it in cpu_post_load. 
> > > > The kernel should not even return state that is not visible in CPUID. > > That's an interesting point of view that I hadn't considered. But just > like you asked me why it should return state that is not visible in > CPUID, I'm asking you why it should not... > For a number of reasons. First, because if a state is not used there is no point in migrating it. Second, to make the interface more deterministic for QEMU, i.e. QEMU configures only the features it supports and gets exactly the same state from the kernel no matter what the host cpu is and what the kernel version is. This patch will not be needed since the kernel will do the job. -- Gleb.
Re: [Qemu-devel] [PATCH uq/master 2/2] KVM: make XSAVE support more robust
On Mon, Sep 09, 2013 at 10:51:58AM +0200, Paolo Bonzini wrote: > Il 08/09/2013 13:52, Gleb Natapov ha scritto: > > On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote: > >> QEMU moves state from CPUArchState to struct kvm_xsave and back when it > >> invokes the KVM_*_XSAVE ioctls. Because it doesn't treat the XSAVE > >> region as an opaque blob, it might be impossible to set some state on > >> the destination if migrating to an older version. > >> > >> This patch blocks migration if it finds that unsupported bits are set > >> in the XSTATE_BV header field. To make this work robustly, QEMU should > >> only report in env->xstate_bv those fields that will actually end up > >> in the migration stream. > > > > We usually handle host cpu differences in cpuid layer, not by trying to > > validate migration data. > > Actually we do both. QEMU for example detects invalid subsections and > blocks migration, and CPU differences also result in subsections that > the destination does not know. > That's different from what you do here though. If xstate_bv was in its separate subsection things would be easier, but it is not. > But as far as QEMU is concerned, setting an unknown bit in XSTATE_BV is > not a CPU difference, it is simply invalid migration data. > > > i.e CPUID.0D should be configurable and > > management should be able to query QEMU what is supported and prevent > > migration attempt accordingly. > > Management is already able to query QEMU of what is supported, because > new XSAVE state is always attached to new CPUID bits in leaves other > than 0Dh (e.g. EAX=07h, ECX=0h returns AVX512 and MPX support in EBX). > QEMU should compute 0Dh data based on those bits indeed. If it is computable from other data even better, easier for us. > > However, KVM_GET/SET_XSAVE should still return all values supported by > the hypervisor, independent of the supported CPUID bits. > Why? 
> >> > >> Signed-off-by: Paolo Bonzini > >> --- > >> target-i386/kvm.c | 3 ++- > >> target-i386/machine.c | 4 > >> 2 files changed, 6 insertions(+), 1 deletion(-) > >> > >> diff --git a/target-i386/kvm.c b/target-i386/kvm.c > >> index 749aa09..df08a4b 100644 > >> --- a/target-i386/kvm.c > >> +++ b/target-i386/kvm.c > >> @@ -1291,7 +1291,8 @@ static int kvm_get_xsave(X86CPU *cpu) > >> sizeof env->fpregs); > >> memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE], > >> sizeof env->xmm_regs); > >> -env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV]; > >> +env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] & > >> +XSTATE_SUPPORTED; > > Don't we just drop state here that will not be restored on the > > destination and destination will not be able to tell since we masked > > unsupported bits? > > A well-behaved guest should not have modified that state anyway, since: > > * the source and destination machines should have the same CPU > > * since the destination QEMU does not support the feature, the source > should have masked it as well > > * the guest should always probe CPUID before using a feature > The I fail to see what is the purpose of the patch. I see two cases: 1. Each extended state has separate CPUID bit (is this guarantied?) - In this case, as you say, by matching CPUID on src and dst we guaranty that migration data is good. 2. There is a state that is advertised in CPUID.0D, but does not have any regular "feature" CPUID associated with it. - In this case this patch will drop valid state that needs to be restored. > There will be only one change for well-behaved guests with this patch > (and the change will be invisible to them since they behave well). > After the patch, KVM_SET_XSAVE will set the extended states to the > processor-reset state instead of all-zeros. However, all > currently-defined states have a processor-reset state that is equal to > all-zeroes, so this change is theoretical. 
> > In fact, perhaps even XSTATE_SUPPORTED is not restrictive enough here, > and we should hide all features that are not visible in CPUID. It is > okay, however, to test it in cpu_post_load. The kernel should not even return state that is not visible in CPUID. > > Paolo > > >> memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE], > >> sizeof env->ymmh_regs); > >> return 0; > >> diff --git a/target-i386/machine.c b/target-i386/machine.c > >> inde
Re: [Qemu-devel] [PATCH uq/master 1/2] x86: fix migration from pre-version 12
On Mon, Sep 09, 2013 at 10:31:15AM +0200, Paolo Bonzini wrote: > Il 08/09/2013 13:40, Gleb Natapov ha scritto: > > On Thu, Sep 05, 2013 at 03:06:21PM +0200, Paolo Bonzini wrote: > >> On KVM, the KVM_SET_XSAVE would be executed with a 0 xstate_bv, > >> and not restore anything. > >> > > XRSTOR restores FP/SSE state to reset state if no bits are set in > > xstate_bv. This is what should happen on reset, no? > > Yes. The problem happens on the migration destination when XSAVE data is > not transmitted. FP/SSE data is transmitted and must be restored, but > xstate_bv is zero and KVM_SET_XSAVE restores FP/SSE state to reset > state. The vcpu then loses the values that were set in the migration data. > > >> Since FP and SSE data are always valid, set them in xstate_bv at reset > >> time. In fact, that value is the same that KVM_GET_XSAVE returns on > >> pre-XSAVE hosts. > > It is needed for migration between non xsave host to xsave host. > > Yes, and this patch does the same for migration between non-XSAVE QEMU > and XSAVE QEMU. > Can such migration happen? The commit that added xsave support (f1665b21f16c5dc0ac37de60233a4975aff31193) changed vmstate version id. > In fact, another bug is that kvm_vcpu_ioctl_x86_set_xsave ignores > xstate_bv when XSAVE is not available. Instead, it should reset the > FXSAVE data to processor-reset values (except for MXCSR which always > comes from XRSTOR data), i.e. to all-zeros except for the x87 control > and tag words. It should also check reserved bits of MXCSR. > I do not see why. 
> >> > >> Signed-off-by: Paolo Bonzini > >> --- > >> target-i386/cpu.c | 1 + > >> target-i386/cpu.h | 5 + > >> 2 files changed, 6 insertions(+) > >> > >> diff --git a/target-i386/cpu.c b/target-i386/cpu.c > >> index c36345e..ac83106 100644 > >> --- a/target-i386/cpu.c > >> +++ b/target-i386/cpu.c > >> @@ -2386,6 +2386,7 @@ static void x86_cpu_reset(CPUState *s) > >> env->fpuc = 0x37f; > >> > >> env->mxcsr = 0x1f80; > >> +env->xstate_bv = XSTATE_FP | XSTATE_SSE; > >> > >> env->pat = 0x0007040600070406ULL; > >> env->msr_ia32_misc_enable = MSR_IA32_MISC_ENABLE_DEFAULT; > >> diff --git a/target-i386/cpu.h b/target-i386/cpu.h > >> index 5723eff..a153078 100644 > >> --- a/target-i386/cpu.h > >> +++ b/target-i386/cpu.h > >> @@ -380,6 +380,11 @@ > >> > >> #define MSR_VM_HSAVE_PA 0xc0010117 > >> > >> +#define XSTATE_SUPPORTED (XSTATE_FP|XSTATE_SSE|XSTATE_YMM) > > Supported by whom? By QEMU? We should filer unsupported bits from CPUID.0D > > then too. > > Yes. QEMU unmarshals information from the XSAVE region and back, so it > cannot support MPX or AVX-512 yet (even if KVM were). Separate bug, though. > IMO this is the main issue here, not separate bug. If we gonna let guest use CPU state QEMU does not support we gonna have a bad time. -- Gleb.
Re: [Qemu-devel] [PATCH uq/master 2/2] KVM: make XSAVE support more robust
On Thu, Sep 05, 2013 at 03:06:22PM +0200, Paolo Bonzini wrote: > QEMU moves state from CPUArchState to struct kvm_xsave and back when it > invokes the KVM_*_XSAVE ioctls. Because it doesn't treat the XSAVE > region as an opaque blob, it might be impossible to set some state on > the destination if migrating to an older version. > > This patch blocks migration if it finds that unsupported bits are set > in the XSTATE_BV header field. To make this work robustly, QEMU should > only report in env->xstate_bv those fields that will actually end up > in the migration stream. We usually handle host cpu differences in cpuid layer, not by trying to validate migration data. i.e CPUID.0D should be configurable and management should be able to query QEMU what is supported and prevent migration attempt accordingly. > > Signed-off-by: Paolo Bonzini > --- > target-i386/kvm.c | 3 ++- > target-i386/machine.c | 4 > 2 files changed, 6 insertions(+), 1 deletion(-) > > diff --git a/target-i386/kvm.c b/target-i386/kvm.c > index 749aa09..df08a4b 100644 > --- a/target-i386/kvm.c > +++ b/target-i386/kvm.c > @@ -1291,7 +1291,8 @@ static int kvm_get_xsave(X86CPU *cpu) > sizeof env->fpregs); > memcpy(env->xmm_regs, &xsave->region[XSAVE_XMM_SPACE], > sizeof env->xmm_regs); > -env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV]; > +env->xstate_bv = *(uint64_t *)&xsave->region[XSAVE_XSTATE_BV] & > +XSTATE_SUPPORTED; Don't we just drop state here that will not be restored on the destination and destination will not be able to tell since we masked unsupported bits? 
> memcpy(env->ymmh_regs, &xsave->region[XSAVE_YMMH_SPACE], > sizeof env->ymmh_regs); > return 0; > diff --git a/target-i386/machine.c b/target-i386/machine.c > index dc81cde..9e2cfcf 100644 > --- a/target-i386/machine.c > +++ b/target-i386/machine.c > @@ -278,6 +278,10 @@ static int cpu_post_load(void *opaque, int version_id) > CPUX86State *env = &cpu->env; > int i; > > +if (env->xstate_bv & ~XSTATE_SUPPORTED) { > +return -EINVAL; > +} > + > /* > * Real mode guest segments register DPL should be zero. > * Older KVM version were setting it wrongly. > -- > 1.8.3.1 -- Gleb.
Re: [Qemu-devel] [PATCH uq/master 1/2] x86: fix migration from pre-version 12
On Thu, Sep 05, 2013 at 03:06:21PM +0200, Paolo Bonzini wrote: > On KVM, the KVM_SET_XSAVE would be executed with a 0 xstate_bv, > and not restore anything. > XRSTOR restores FP/SSE state to reset state if no bits are set in xstate_bv. This is what should happen on reset, no? > Since FP and SSE data are always valid, set them in xstate_bv at reset > time. In fact, that value is the same that KVM_GET_XSAVE returns on > pre-XSAVE hosts. It is needed for migration between non xsave host to xsave host. > > Signed-off-by: Paolo Bonzini > --- > target-i386/cpu.c | 1 + > target-i386/cpu.h | 5 + > 2 files changed, 6 insertions(+) > > diff --git a/target-i386/cpu.c b/target-i386/cpu.c > index c36345e..ac83106 100644 > --- a/target-i386/cpu.c > +++ b/target-i386/cpu.c > @@ -2386,6 +2386,7 @@ static void x86_cpu_reset(CPUState *s) > env->fpuc = 0x37f; > > env->mxcsr = 0x1f80; > +env->xstate_bv = XSTATE_FP | XSTATE_SSE; > > env->pat = 0x0007040600070406ULL; > env->msr_ia32_misc_enable = MSR_IA32_MISC_ENABLE_DEFAULT; > diff --git a/target-i386/cpu.h b/target-i386/cpu.h > index 5723eff..a153078 100644 > --- a/target-i386/cpu.h > +++ b/target-i386/cpu.h > @@ -380,6 +380,11 @@ > > #define MSR_VM_HSAVE_PA 0xc0010117 > > +#define XSTATE_SUPPORTED (XSTATE_FP|XSTATE_SSE|XSTATE_YMM) Supported by whom? By QEMU? We should filter unsupported bits from CPUID.0D then too. > +#define XSTATE_FP 1 > +#define XSTATE_SSE 2 > +#define XSTATE_YMM 4 > + > /* CPUID feature words */ > typedef enum FeatureWord { > FEAT_1_EDX, /* CPUID[1].EDX */ > -- > 1.8.3.1 > -- Gleb.
Re: [Qemu-devel] [KVM] segmentation fault happened when reboot VM after hot-unplug virtio NIC
On Tue, Sep 03, 2013 at 12:06:33PM +, Zhanghaoyu (A) wrote: > Hi, all > > Segmentation fault happened when reboot VM after hot-unplug virtio NIC, which > can be reproduced 100%. > See similar bug report to https://bugzilla.redhat.com/show_bug.cgi?id=988256 > > test environment: > host: SLES11SP2 (kernel version: 3.0.58) > qemu: 1.5.1, upstream-qemu (commit 545825d4cda03ea292b7788b3401b99860efe8bc) > libvirt: 1.1.0 > guest os: win2k8 R2 x64bit or sles11sp2 x64 or win2k3 32bit > > You can reproduce this problem by following steps: > 1. start a VM with virtio NIC(s) > 2. hot-unplug a virtio NIC from the VM > 3. reboot the VM, then segmentation fault happened during starting period > > the qemu backtrace shown as below: > #0 0x7ff4be3288d0 in __memcmp_sse4_1 () from /lib64/libc.so.6 > #1 0x7ff4c07f82c0 in patch_hypercalls (s=0x7ff4c15dd610) at > /mnt/zhanghaoyu/qemu/qemu-1.5.1/hw/i386/kvmvapic.c:549 > #2 0x7ff4c07f84f0 in vapic_prepare (s=0x7ff4c15dd610) at > /mnt/zhanghaoyu/qemu/qemu-1.5.1/hw/i386/kvmvapic.c:614 > #3 0x7ff4c07f85e7 in vapic_write (opaque=0x7ff4c15dd610, addr=0, > data=32, size=2) > at /mnt/zhanghaoyu/qemu/qemu-1.5.1/hw/i386/kvmvapic.c:651 > #4 0x7ff4c082a917 in memory_region_write_accessor > (opaque=0x7ff4c15df938, addr=0, value=0x7ff4bbfe3d00, size=2, > shift=0, mask=65535) at /mnt/zhanghaoyu/qemu/qemu-1.5.1/memory.c:334 > #5 0x7ff4c082a9ee in access_with_adjusted_size (addr=0, > value=0x7ff4bbfe3d00, size=2, access_size_min=1, > access_size_max=4, access=0x7ff4c082a89a , > opaque=0x7ff4c15df938) > at /mnt/zhanghaoyu/qemu/qemu-1.5.1/memory.c:364 > #6 0x7ff4c082ae49 in memory_region_iorange_write > (iorange=0x7ff4c15dfca0, offset=0, width=2, data=32) > at /mnt/zhanghaoyu/qemu/qemu-1.5.1/memory.c:439 > #7 0x7ff4c08236f7 in ioport_writew_thunk (opaque=0x7ff4c15dfca0, > addr=126, data=32) > at /mnt/zhanghaoyu/qemu/qemu-1.5.1/ioport.c:219 > #8 0x7ff4c0823078 in ioport_write (index=1, address=126, data=32) at > /mnt/zhanghaoyu/qemu/qemu-1.5.1/ioport.c:83 > 
#9 0x7ff4c0823ca9 in cpu_outw (addr=126, val=32) at > /mnt/zhanghaoyu/qemu/qemu-1.5.1/ioport.c:296 > #10 0x7ff4c0827485 in kvm_handle_io (port=126, data=0x7ff4c051, > direction=1, size=2, count=1) > at /mnt/zhanghaoyu/qemu/qemu-1.5.1/kvm-all.c:1485 > #11 0x7ff4c0827e14 in kvm_cpu_exec (env=0x7ff4c15bf270) at > /mnt/zhanghaoyu/qemu/qemu-1.5.1/kvm-all.c:1634 > #12 0x7ff4c07b6f27 in qemu_kvm_cpu_thread_fn (arg=0x7ff4c15bf270) at > /mnt/zhanghaoyu/qemu/qemu-1.5.1/cpus.c:759 > #13 0x7ff4be58af05 in start_thread () from /lib64/libpthread.so.0 > #14 0x7ff4be2cd53d in clone () from /lib64/libc.so.6 > > If I apply below patch to the upstream qemu, this problem will disappear, > --- > hw/i386/kvmvapic.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c > index 15beb80..6fff299 100644 > --- a/hw/i386/kvmvapic.c > +++ b/hw/i386/kvmvapic.c > @@ -652,11 +652,11 @@ static void vapic_write(void *opaque, hwaddr addr, > uint64_t data, > switch (size) { > case 2: > if (s->state == VAPIC_INACTIVE) { > -rom_paddr = (env->segs[R_CS].base + env->eip) & ROM_BLOCK_MASK; > -s->rom_state_paddr = rom_paddr + data; > - > s->state = VAPIC_STANDBY; > } > +rom_paddr = (env->segs[R_CS].base + env->eip) & ROM_BLOCK_MASK; > +s->rom_state_paddr = rom_paddr + data; > + Jan, does this mean that vapic state does not move to inactive during reset? > if (vapic_prepare(s) < 0) { > s->state = VAPIC_INACTIVE; > break; > -- > 1.8.1.4 > > Thanks, > Daniel > > -- Gleb.
Re: [Qemu-devel] [PATCH v2] kvm: warn if num cpus is greater than num recommended
On Fri, Aug 23, 2013 at 03:24:37PM +0200, Andrew Jones wrote: > The comment in kvm_max_vcpus() states that it's using the recommended > procedure from the kernel API documentation to get the max number > of vcpus that kvm supports. It is, but by always returning the > maximum number supported. The maximum number should only be used > for development purposes. qemu should check KVM_CAP_NR_VCPUS for > the recommended number of vcpus. This patch adds a warning if a user > specifies a number of cpus between the recommended and max. > > v2: > Incorporate tests for max_cpus, which specifies the maximum number > of hotpluggable cpus. An additional note is that the message for > the fail case was slightly changed, 'exceeds max cpus' to > 'exceeds the maximum cpus'. If this is unacceptable change for > users like libvirt, then I'll need to spin a v3. > > Signed-off-by: Andrew Jones Applied, thanks. > --- > kvm-all.c | 69 > --- > 1 file changed, 40 insertions(+), 29 deletions(-) > > diff --git a/kvm-all.c b/kvm-all.c > index a2d49786365e3..021f5f47e53da 100644 > --- a/kvm-all.c > +++ b/kvm-all.c > @@ -1322,24 +1322,20 @@ static int kvm_irqchip_create(KVMState *s) > return 0; > } > > -static int kvm_max_vcpus(KVMState *s) > +/* Find number of supported CPUs using the recommended > + * procedure from the kernel API documentation to cope with > + * older kernels that may be missing capabilities. > + */ > +static int kvm_recommended_vcpus(KVMState *s) > { > -int ret; > - > -/* Find number of supported CPUs using the recommended > - * procedure from the kernel API documentation to cope with > - * older kernels that may be missing capabilities. > - */ > -ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > -if (ret) { > -return ret; > -} > -ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); > -if (ret) { > -return ret; > -} > +int ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); > +return (ret) ? 
ret : 4; > +} > > -return 4; > +static int kvm_max_vcpus(KVMState *s) > +{ > +int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > +return (ret) ? ret : kvm_recommended_vcpus(s); > } > > int kvm_init(void) > @@ -1347,11 +1343,19 @@ int kvm_init(void) > static const char upgrade_note[] = > "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n" > "(see http://sourceforge.net/projects/kvm).\n"; > +struct { > +const char *name; > +int num; > +} num_cpus[] = { > +{ "SMP", smp_cpus }, > +{ "hotpluggable", max_cpus }, > +{ NULL, } > +}, *nc = num_cpus; > +int soft_vcpus_limit, hard_vcpus_limit; > KVMState *s; > const KVMCapabilityInfo *missing_cap; > int ret; > int i; > -int max_vcpus; > > s = g_malloc0(sizeof(KVMState)); > > @@ -1392,19 +1396,26 @@ int kvm_init(void) > goto err; > } > > -max_vcpus = kvm_max_vcpus(s); > -if (smp_cpus > max_vcpus) { > -ret = -EINVAL; > -fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max cpus " > -"supported by KVM (%d)\n", smp_cpus, max_vcpus); > -goto err; > -} > +/* check the vcpu limits */ > +soft_vcpus_limit = kvm_recommended_vcpus(s); > +hard_vcpus_limit = kvm_max_vcpus(s); > > -if (max_cpus > max_vcpus) { > -ret = -EINVAL; > -fprintf(stderr, "Number of hotpluggable cpus requested (%d) exceeds > max cpus " > -"supported by KVM (%d)\n", max_cpus, max_vcpus); > -goto err; > +while (nc->name) { > +if (nc->num > soft_vcpus_limit) { > +fprintf(stderr, > +"Warning: Number of %s cpus requested (%d) exceeds " > +"the recommended cpus supported by KVM (%d)\n", > +nc->name, nc->num, soft_vcpus_limit); > + > +if (nc->num > hard_vcpus_limit) { > +ret = -EINVAL; > +fprintf(stderr, "Number of %s cpus requested (%d) exceeds " > +"the maximum cpus supported by KVM (%d)\n", > +nc->name, nc->num, hard_vcpus_limit); > +goto err; > +} > +} > +nc++; > } > > s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); > -- > 1.8.1.4 -- Gleb.
Re: [Qemu-devel] [PATCH v2] cpu: Move cpu state syncs up into cpu_dump_state()
On Tue, Aug 27, 2013 at 12:19:10PM +0100, James Hogan wrote: > The x86 and ppc targets call cpu_synchronize_state() from their > *_cpu_dump_state() callbacks to ensure that up to date state is dumped > when KVM is enabled (for example when a KVM internal error occurs). > > Move this call up into the generic cpu_dump_state() function so that > other KVM targets (namely MIPS) can take advantage of it. > > This requires kvm_cpu_synchronize_state() and cpu_synchronize_state() to > be moved out of the #ifdef NEED_CPU_H in so that they're > accessible to qom/cpu.c. > Applied, thanks. > Signed-off-by: James Hogan > Cc: Andreas Färber > Cc: Alexander Graf > Cc: Gleb Natapov > Cc: qemu-...@nongnu.org > Cc: k...@vger.kernel.org > --- > Changes in v2 (was kvm: sync cpu state on internal error before dump) > - rewrite to fix in cpu_dump_state() (Gleb Natapov) > --- > include/sysemu/kvm.h | 20 ++-- > qom/cpu.c | 1 + > target-i386/helper.c | 2 -- > target-ppc/translate.c | 2 -- > 4 files changed, 11 insertions(+), 14 deletions(-) > > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h > index de74411..71a0186 100644 > --- a/include/sysemu/kvm.h > +++ b/include/sysemu/kvm.h > @@ -270,16 +270,6 @@ int kvm_check_extension(KVMState *s, unsigned int > extension); > > uint32_t kvm_arch_get_supported_cpuid(KVMState *env, uint32_t function, >uint32_t index, int reg); > -void kvm_cpu_synchronize_state(CPUState *cpu); > - > -/* generic hooks - to be moved/refactored once there are more users */ > - > -static inline void cpu_synchronize_state(CPUState *cpu) > -{ > -if (kvm_enabled()) { > -kvm_cpu_synchronize_state(cpu); > -} > -} > > #if !defined(CONFIG_USER_ONLY) > int kvm_physical_memory_addr_from_host(KVMState *s, void *ram_addr, > @@ -288,9 +278,19 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void > *ram_addr, > > #endif /* NEED_CPU_H */ > > +void kvm_cpu_synchronize_state(CPUState *cpu); > void kvm_cpu_synchronize_post_reset(CPUState *cpu); > void 
kvm_cpu_synchronize_post_init(CPUState *cpu); > > +/* generic hooks - to be moved/refactored once there are more users */ > + > +static inline void cpu_synchronize_state(CPUState *cpu) > +{ > +if (kvm_enabled()) { > +kvm_cpu_synchronize_state(cpu); > +} > +} > + > static inline void cpu_synchronize_post_reset(CPUState *cpu) > { > if (kvm_enabled()) { > diff --git a/qom/cpu.c b/qom/cpu.c > index aa95108..cfe7e24 100644 > --- a/qom/cpu.c > +++ b/qom/cpu.c > @@ -174,6 +174,7 @@ void cpu_dump_state(CPUState *cpu, FILE *f, > fprintf_function cpu_fprintf, > CPUClass *cc = CPU_GET_CLASS(cpu); > > if (cc->dump_state) { > +cpu_synchronize_state(cpu); > cc->dump_state(cpu, f, cpu_fprintf, flags); > } > } > diff --git a/target-i386/helper.c b/target-i386/helper.c > index bf3e2ac..2aecfd0 100644 > --- a/target-i386/helper.c > +++ b/target-i386/helper.c > @@ -188,8 +188,6 @@ void x86_cpu_dump_state(CPUState *cs, FILE *f, > fprintf_function cpu_fprintf, > char cc_op_name[32]; > static const char *seg_name[6] = { "ES", "CS", "SS", "DS", "FS", "GS" }; > > -cpu_synchronize_state(cs); > - > eflags = cpu_compute_eflags(env); > #ifdef TARGET_X86_64 > if (env->hflags & HF_CS64_MASK) { > diff --git a/target-ppc/translate.c b/target-ppc/translate.c > index f07d70d..c6a6ff8 100644 > --- a/target-ppc/translate.c > +++ b/target-ppc/translate.c > @@ -9536,8 +9536,6 @@ void ppc_cpu_dump_state(CPUState *cs, FILE *f, > fprintf_function cpu_fprintf, > CPUPPCState *env = &cpu->env; > int i; > > -cpu_synchronize_state(cs); > - > cpu_fprintf(f, "NIP " TARGET_FMT_lx " LR " TARGET_FMT_lx " CTR " > TARGET_FMT_lx " XER " TARGET_FMT_lx "\n", > env->nip, env->lr, env->ctr, cpu_read_xer(env)); > -- > 1.8.1.2 > -- Gleb.
Re: [Qemu-devel] [PATCH v2] kvm: warn if num cpus is greater than num recommended
On Fri, Aug 23, 2013 at 03:24:37PM +0200, Andrew Jones wrote: > The comment in kvm_max_vcpus() states that it's using the recommended > procedure from the kernel API documentation to get the max number > of vcpus that kvm supports. It is, but by always returning the > maximum number supported. The maximum number should only be used > for development purposes. qemu should check KVM_CAP_NR_VCPUS for > the recommended number of vcpus. This patch adds a warning if a user > specifies a number of cpus between the recommended and max. > > v2: > Incorporate tests for max_cpus, which specifies the maximum number > of hotpluggable cpus. An additional note is that the message for > the fail case was slightly changed, 'exceeds max cpus' to > 'exceeds the maximum cpus'. If this is unacceptable change for > users like libvirt, then I'll need to spin a v3. > Looks good to me. Any ACKs, objections? > Signed-off-by: Andrew Jones > --- > kvm-all.c | 69 > --- > 1 file changed, 40 insertions(+), 29 deletions(-) > > diff --git a/kvm-all.c b/kvm-all.c > index a2d49786365e3..021f5f47e53da 100644 > --- a/kvm-all.c > +++ b/kvm-all.c > @@ -1322,24 +1322,20 @@ static int kvm_irqchip_create(KVMState *s) > return 0; > } > > -static int kvm_max_vcpus(KVMState *s) > +/* Find number of supported CPUs using the recommended > + * procedure from the kernel API documentation to cope with > + * older kernels that may be missing capabilities. > + */ > +static int kvm_recommended_vcpus(KVMState *s) > { > -int ret; > - > -/* Find number of supported CPUs using the recommended > - * procedure from the kernel API documentation to cope with > - * older kernels that may be missing capabilities. > - */ > -ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > -if (ret) { > -return ret; > -} > -ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); > -if (ret) { > -return ret; > -} > +int ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS); > +return (ret) ? 
ret : 4; > +} > > -return 4; > +static int kvm_max_vcpus(KVMState *s) > +{ > +int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS); > +return (ret) ? ret : kvm_recommended_vcpus(s); > } > > int kvm_init(void) > @@ -1347,11 +1343,19 @@ int kvm_init(void) > static const char upgrade_note[] = > "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n" > "(see http://sourceforge.net/projects/kvm).\n"; > +struct { > +const char *name; > +int num; > +} num_cpus[] = { > +{ "SMP", smp_cpus }, > +{ "hotpluggable", max_cpus }, > +{ NULL, } > +}, *nc = num_cpus; > +int soft_vcpus_limit, hard_vcpus_limit; > KVMState *s; > const KVMCapabilityInfo *missing_cap; > int ret; > int i; > -int max_vcpus; > > s = g_malloc0(sizeof(KVMState)); > > @@ -1392,19 +1396,26 @@ int kvm_init(void) > goto err; > } > > -max_vcpus = kvm_max_vcpus(s); > -if (smp_cpus > max_vcpus) { > -ret = -EINVAL; > -fprintf(stderr, "Number of SMP cpus requested (%d) exceeds max cpus " > -"supported by KVM (%d)\n", smp_cpus, max_vcpus); > -goto err; > -} > +/* check the vcpu limits */ > +soft_vcpus_limit = kvm_recommended_vcpus(s); > +hard_vcpus_limit = kvm_max_vcpus(s); > > -if (max_cpus > max_vcpus) { > -ret = -EINVAL; > -fprintf(stderr, "Number of hotpluggable cpus requested (%d) exceeds > max cpus " > -"supported by KVM (%d)\n", max_cpus, max_vcpus); > -goto err; > +while (nc->name) { > +if (nc->num > soft_vcpus_limit) { > +fprintf(stderr, > +"Warning: Number of %s cpus requested (%d) exceeds " > +"the recommended cpus supported by KVM (%d)\n", > +nc->name, nc->num, soft_vcpus_limit); > + > +if (nc->num > hard_vcpus_limit) { > +ret = -EINVAL; > +fprintf(stderr, "Number of %s cpus requested (%d) exceeds " > +"the maximum cpus supported by KVM (%d)\n", > +nc->name, nc->num, hard_vcpus_limit); > +goto err; > +} > +} > +nc++; > } > > s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); > -- > 1.8.1.4 -- Gleb.
Re: [Qemu-devel] [PATCH] KVM: always use MADV_DONTFORK
On Thu, Jul 25, 2013 at 12:11:15PM +0200, Andrea Arcangeli wrote: > MADV_DONTFORK prevents fork to fail with -ENOMEM if the default > overcommit heuristics decides there's too much anonymous virtual > memory allocated. If the KVM secondary MMU is synchronized with MMU > notifiers or not, doesn't make a difference in that regard. > > Secondly it's always more efficient to avoid copying the guest > physical address space in the fork child (so we avoid to mark all the > guest memory readonly in the parent and so we skip the establishment > and teardown of lots of pagetables in the child). > > In the common case we can ignore the error if MADV_DONTFORK is not > available. Leave a second invocation that errors out in the KVM path > if MMU notifiers are missing and KVM is enabled, to abort in such > case. > > Signed-off-by: Andrea Arcangeli Applied to uq/master, thanks. > --- > exec.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/exec.c b/exec.c > index c99a883..d3bb58d 100644 > --- a/exec.c > +++ b/exec.c > @@ -1162,6 +1162,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, > void *host, > > qemu_ram_setup_dump(new_block->host, size); > qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE); > +qemu_madvise(new_block->host, size, QEMU_MADV_DONTFORK); > > if (kvm_enabled()) > kvm_setup_guest_memory(new_block->host, size); -- Gleb.
Re: [Qemu-devel] PING^2 Re: [PATCH] KVM: always use MADV_DONTFORK
On Fri, Aug 30, 2013 at 06:03:42PM +0200, Paolo Bonzini wrote: > Il 30/08/2013 17:52, Andreas Färber ha scritto: > > OK. The patch is mislabelled as "KVM:" though putting this in a common > > code path and touching exec.c only, so please put it on uq/master and > > fix that up before you set the history in stone. :) > > Well, touching exec.c is why I wasn't picking it up through uq/master, > but that's indeed a good observation! > > If Anthony doesn't pick it up, I'll do so as soon as it's my turn to > manage the qemu-kvm.git repository. > Send your acks and I will merge it. -- Gleb.
Re: [Qemu-devel] [edk2] OVMF hung on qemu 1.6.0 with KVM
On Fri, Aug 30, 2013 at 01:58:59PM +0200, Paolo Bonzini wrote: > Il 30/08/2013 11:37, Laszlo Ersek ha scritto: > > Disclaimer: I don't know what I'm talking about. > > No problem. :) > > > So, Jordan's patch for OVMF (SVN r14494) builds the page tables (and > > finally writes the root to CR3) in a phase when paging is not enabled > > yet in the VM. > > > > Again, I have no clue, but if the guest hasn't even enabled paging yet, > > then the hypervisor (without EPT?) might have no idea that what the > > guest is writing to memory are its pagetables-to-be. The first notice > > the hypervisor might take is the store to CR3. At which point (or maybe > > even later, when paging is enabled?) the hypervisor would have to walk > > the guest's tables all at once, and build the shadow tables "in batch". > > The hypervisor builds shadow page tables lazily; as soon as CR0.PG is > set the next instruction will pagefault and shadow page tables will > start to get populated. > > However, surprise! There is another set of "flat" page tables for X64, > built by UefiCpuPkg/ResetVector/Vtf0/Tools/FixupForRawSection.py when > you run UefiCpuPkg/ResetVector/Vtf0/Build.py. These are always in ROM. > > As in Jordan's patches, the problem is that the hypervisor is expecting > to be able to write to the page tables, but this is not the case because > the page tables are in a read-only memory slot. > Only when setting dirty/accessed bits, are they not set in ROM version of page tables? > Making shadow and EPT behave similarly is probably a good thing, so I'm > sending an RFC patch to k...@vger.kernel.org that fixes the bug. > > However, if you guys can figure out a patch for > UefiCpuPkg/ResetVector/Vtf0/Ia32/Flat32ToFlat64.asm so that it builds > the page tables in RAM (and removing the page table Python code), that > would also work. > > Paolo -- Gleb.
[Qemu-devel] [PATCH 06/10] kvm: Simplify kvm_handle_io
From: Jan Kiszka Now that cpu_in/out is just a wrapper around address_space_rw, we can also call the latter directly. As host endianness == guest endianness, there is no need for the memory access helpers st*_p/ld*_p as well. Signed-off-by: Jan Kiszka Signed-off-by: Paolo Bonzini --- kvm-all.c | 28 ++-- 1 file changed, 2 insertions(+), 26 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index bfa4aac..ef52a0f 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1508,32 +1508,8 @@ static void kvm_handle_io(uint16_t port, void *data, int direction, int size, uint8_t *ptr = data; for (i = 0; i < count; i++) { -if (direction == KVM_EXIT_IO_IN) { -switch (size) { -case 1: -stb_p(ptr, cpu_inb(port)); -break; -case 2: -stw_p(ptr, cpu_inw(port)); -break; -case 4: -stl_p(ptr, cpu_inl(port)); -break; -} -} else { -switch (size) { -case 1: -cpu_outb(port, ldub_p(ptr)); -break; -case 2: -cpu_outw(port, lduw_p(ptr)); -break; -case 4: -cpu_outl(port, ldl_p(ptr)); -break; -} -} - +address_space_rw(&address_space_io, port, ptr, size, + direction == KVM_EXIT_IO_OUT); ptr += size; } } -- 1.7.10.4
[Qemu-devel] [PATCH 05/10] kvm: x86: fix setting IA32_FEATURE_CONTROL with nested VMX disabled
From: Liu Jinsong This patch is to fix the bug https://bugs.launchpad.net/qemu-kvm/+bug/1207623 IA32_FEATURE_CONTROL is pointless if not expose VMX or SMX bits to cpuid.1.ecx of vcpu. Current qemu-kvm will error return when kvm_put_msrs or kvm_get_msrs. Signed-off-by: Liu Jinsong Signed-off-by: Paolo Bonzini --- target-i386/kvm.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 513ae52..7bb8455 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -65,6 +65,7 @@ static bool has_msr_star; static bool has_msr_hsave_pa; static bool has_msr_tsc_adjust; static bool has_msr_tsc_deadline; +static bool has_msr_feature_control; static bool has_msr_async_pf_en; static bool has_msr_pv_eoi_en; static bool has_msr_misc_enable; @@ -666,6 +667,12 @@ int kvm_arch_init_vcpu(CPUState *cs) qemu_add_vm_change_state_handler(cpu_update_state, env); +c = cpuid_find_entry(&cpuid_data.cpuid, 1, 0); +if (c) { +has_msr_feature_control = !!(c->ecx & CPUID_EXT_VMX) || + !!(c->ecx & CPUID_EXT_SMX); +} + cpuid_data.cpuid.padding = 0; r = kvm_vcpu_ioctl(cs, KVM_SET_CPUID2, &cpuid_data); if (r) { @@ -1169,7 +1176,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level) if (hyperv_vapic_recommended()) { kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0); } -kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, env->msr_ia32_feature_control); +if (has_msr_feature_control) { +kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, + env->msr_ia32_feature_control); +} } if (env->mcg_cap) { int i; @@ -1394,7 +1404,9 @@ static int kvm_get_msrs(X86CPU *cpu) if (has_msr_misc_enable) { msrs[n++].index = MSR_IA32_MISC_ENABLE; } -msrs[n++].index = MSR_IA32_FEATURE_CONTROL; +if (has_msr_feature_control) { +msrs[n++].index = MSR_IA32_FEATURE_CONTROL; +} if (!env->tsc_valid) { msrs[n++].index = MSR_IA32_TSC; @@ -1509,6 +1521,7 @@ static int kvm_get_msrs(X86CPU *cpu) break; case MSR_IA32_FEATURE_CONTROL: 
env->msr_ia32_feature_control = msrs[i].data; +break; default: if (msrs[i].index >= MSR_MC0_CTL && msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) { -- 1.7.10.4
[Qemu-devel] [PATCH 02/10] target-i386: remove tabs from target-i386/cpu.h
From: Paolo Bonzini Signed-off-by: Paolo Bonzini --- target-i386/cpu.h | 192 ++--- 1 file changed, 96 insertions(+), 96 deletions(-) diff --git a/target-i386/cpu.h b/target-i386/cpu.h index 3a52f94..af4c0f7 100644 --- a/target-i386/cpu.h +++ b/target-i386/cpu.h @@ -37,9 +37,9 @@ #define TARGET_HAS_ICE 1 #ifdef TARGET_X86_64 -#define ELF_MACHINEEM_X86_64 +#define ELF_MACHINE EM_X86_64 #else -#define ELF_MACHINEEM_386 +#define ELF_MACHINE EM_386 #endif #define CPUArchState struct CPUX86State @@ -98,10 +98,10 @@ #define DESC_TSS_BUSY_MASK (1 << 9) /* eflags masks */ -#define CC_C 0x0001 -#define CC_P 0x0004 -#define CC_A 0x0010 -#define CC_Z 0x0040 +#define CC_C0x0001 +#define CC_P0x0004 +#define CC_A0x0010 +#define CC_Z0x0040 #define CC_S0x0080 #define CC_O0x0800 @@ -109,14 +109,14 @@ #define IOPL_SHIFT 12 #define VM_SHIFT 17 -#define TF_MASK0x0100 -#define IF_MASK0x0200 -#define DF_MASK0x0400 -#define IOPL_MASK 0x3000 -#define NT_MASK0x4000 -#define RF_MASK0x0001 -#define VM_MASK0x0002 -#define AC_MASK0x0004 +#define TF_MASK 0x0100 +#define IF_MASK 0x0200 +#define DF_MASK 0x0400 +#define IOPL_MASK 0x3000 +#define NT_MASK 0x4000 +#define RF_MASK 0x0001 +#define VM_MASK 0x0002 +#define AC_MASK 0x0004 #define VIF_MASK0x0008 #define VIP_MASK0x0010 #define ID_MASK 0x0020 @@ -238,28 +238,28 @@ #define DR7_TYPE_IO_RW 0x2 #define DR7_TYPE_DATA_RW 0x3 -#define PG_PRESENT_BIT 0 -#define PG_RW_BIT 1 -#define PG_USER_BIT2 -#define PG_PWT_BIT 3 -#define PG_PCD_BIT 4 -#define PG_ACCESSED_BIT5 -#define PG_DIRTY_BIT 6 -#define PG_PSE_BIT 7 -#define PG_GLOBAL_BIT 8 -#define PG_NX_BIT 63 +#define PG_PRESENT_BIT 0 +#define PG_RW_BIT 1 +#define PG_USER_BIT 2 +#define PG_PWT_BIT 3 +#define PG_PCD_BIT 4 +#define PG_ACCESSED_BIT 5 +#define PG_DIRTY_BIT6 +#define PG_PSE_BIT 7 +#define PG_GLOBAL_BIT 8 +#define PG_NX_BIT 63 #define PG_PRESENT_MASK (1 << PG_PRESENT_BIT) -#define PG_RW_MASK (1 << PG_RW_BIT) -#define PG_USER_MASK(1 << PG_USER_BIT) -#define PG_PWT_MASK (1 << PG_PWT_BIT) -#define 
PG_PCD_MASK (1 << PG_PCD_BIT) +#define PG_RW_MASK (1 << PG_RW_BIT) +#define PG_USER_MASK (1 << PG_USER_BIT) +#define PG_PWT_MASK (1 << PG_PWT_BIT) +#define PG_PCD_MASK (1 << PG_PCD_BIT) #define PG_ACCESSED_MASK (1 << PG_ACCESSED_BIT) -#define PG_DIRTY_MASK (1 << PG_DIRTY_BIT) -#define PG_PSE_MASK (1 << PG_PSE_BIT) -#define PG_GLOBAL_MASK (1 << PG_GLOBAL_BIT) +#define PG_DIRTY_MASK(1 << PG_DIRTY_BIT) +#define PG_PSE_MASK (1 << PG_PSE_BIT) +#define PG_GLOBAL_MASK (1 << PG_GLOBAL_BIT) #define PG_HI_USER_MASK 0x7ff0LL -#define PG_NX_MASK (1LL << PG_NX_BIT) +#define PG_NX_MASK (1LL << PG_NX_BIT) #define PG_ERROR_W_BIT 1 @@ -269,32 +269,32 @@ #define PG_ERROR_RSVD_MASK 0x08 #define PG_ERROR_I_D_MASK 0x10 -#define MCG_CTL_P (1ULL<<8) /* MCG_CAP register available */ -#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */ +#define MCG_CTL_P (1ULL<<8) /* MCG_CAP register available */ +#define MCG_SER_P (1ULL<<24) /* MCA recovery/new status bits */ -#define MCE_CAP_DEF(MCG_CTL_P|MCG_SER_P) -#define MCE_BANKS_DEF 10 +#define MCE_CAP_DEF (MCG_CTL_P|MCG_SER_P) +#define MCE_BANKS_DEF 10 -#define MCG_STATUS_RIPV(1ULL<<0) /* restart ip valid */ -#define MCG_STATUS_EIPV(1ULL<<1) /* ip points to correct instruction */ -#define MCG_STATUS_MCIP(1ULL<<2) /* machine check in progress */ +#define MCG_STATUS_RIPV (1ULL<<0) /* restart ip valid */ +#define MCG_STATUS_EIPV (1ULL<<1) /* ip points to correct instruction */ +#define MCG_STATUS_MCIP (1ULL<<2) /* machine check in progress */ -#define MCI_STATUS_VAL (1ULL<<63) /* valid error */ -#define MCI_STATUS_OVER(1ULL<<62) /* previous errors lost */ -#define MCI_STATUS_UC (1ULL<<61) /* uncorrected error */ -#define MCI_STATUS_EN (1ULL<<60) /* error enabled */ -#define MCI_STATUS_MISCV (1ULL<<59) /* misc error reg. valid */ -#define MCI_STATUS_ADDRV (1ULL<<58) /* addr reg. 
valid */ -#define MCI_STATUS_PCC (1ULL<<57) /* processor context corrupt */ -#define MCI_STATUS_S (1ULL<<56) /* Signaled machine check */ -#define MCI_STATUS_AR (1ULL<<55) /* Action required */ +#define MCI_STATUS_VAL (1ULL<<63) /* valid error */ +#define MCI_STATUS_OVER (1ULL<<62) /* previous errors lost */ +#define MCI_STA
[Qemu-devel] [PATCH 07/10] kvm-all.c: max_cpus should not exceed KVM vcpu limit
From: Marcelo Tosatti maxcpus, which specifies the maximum number of hotpluggable CPUs, should not exceed KVM's vcpu limit. Signed-off-by: Marcelo Tosatti [Reword message. - Paolo] Signed-off-by: Paolo Bonzini --- kvm-all.c |7 +++ 1 file changed, 7 insertions(+) diff --git a/kvm-all.c b/kvm-all.c index ef52a0f..a2d4978 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -1400,6 +1400,13 @@ int kvm_init(void) goto err; } +if (max_cpus > max_vcpus) { +ret = -EINVAL; +fprintf(stderr, "Number of hotpluggable cpus requested (%d) exceeds max cpus " +"supported by KVM (%d)\n", max_cpus, max_vcpus); +goto err; +} + s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); if (s->vmfd < 0) { #ifdef TARGET_S390X -- 1.7.10.4
[Qemu-devel] [PATCH 01/10] Initialize IA32_FEATURE_CONTROL MSR in reset and migration
From: Arthur Chunqi Li

A recent KVM patch added IA32_FEATURE_CONTROL support. QEMU needs to
clear this MSR when the vCPU is reset and preserve its value across
migration. This patch adds that support.

Signed-off-by: Arthur Chunqi Li
Signed-off-by: Gleb Natapov
---
 target-i386/cpu.h     |  2 ++
 target-i386/kvm.c     |  4 ++++
 target-i386/machine.c | 22 ++++++++++++++++++++++
 3 files changed, 28 insertions(+)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index cedefdc..3a52f94 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -301,6 +301,7 @@
 #define MSR_IA32_APICBASE_BSP           (1<<8)
 #define MSR_IA32_APICBASE_ENABLE        (1<<11)
 #define MSR_IA32_APICBASE_BASE          (0xf<<12)
+#define MSR_IA32_FEATURE_CONTROL        0x003a
 #define MSR_TSC_ADJUST                  0x003b
 #define MSR_IA32_TSCDEADLINE            0x6e0
@@ -813,6 +814,7 @@ typedef struct CPUX86State {
     uint64_t mcg_status;
     uint64_t msr_ia32_misc_enable;
+    uint64_t msr_ia32_feature_control;

     /* exception/interrupt handling */
     int error_code;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 3c9d10a..84ac00a 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1121,6 +1121,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         if (hyperv_vapic_recommended()) {
             kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_APIC_ASSIST_PAGE, 0);
         }
+        kvm_msr_entry_set(&msrs[n++], MSR_IA32_FEATURE_CONTROL, env->msr_ia32_feature_control);
     }
     if (env->mcg_cap) {
         int i;
@@ -1345,6 +1346,7 @@ static int kvm_get_msrs(X86CPU *cpu)
     if (has_msr_misc_enable) {
         msrs[n++].index = MSR_IA32_MISC_ENABLE;
     }
+    msrs[n++].index = MSR_IA32_FEATURE_CONTROL;

     if (!env->tsc_valid) {
         msrs[n++].index = MSR_IA32_TSC;
@@ -1443,6 +1445,8 @@ static int kvm_get_msrs(X86CPU *cpu)
         case MSR_IA32_MISC_ENABLE:
             env->msr_ia32_misc_enable = msrs[i].data;
             break;
+        case MSR_IA32_FEATURE_CONTROL:
+            env->msr_ia32_feature_control = msrs[i].data;
         default:
             if (msrs[i].index >= MSR_MC0_CTL &&
                 msrs[i].index < MSR_MC0_CTL + (env->mcg_cap & 0xff) * 4) {
diff --git a/target-i386/machine.c b/target-i386/machine.c
index f9ec581..0d2088e 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -435,6 +435,14 @@ static bool misc_enable_needed(void *opaque)
     return env->msr_ia32_misc_enable != MSR_IA32_MISC_ENABLE_DEFAULT;
 }

+static bool feature_control_needed(void *opaque)
+{
+    X86CPU *cpu = opaque;
+    CPUX86State *env = &cpu->env;
+
+    return env->msr_ia32_feature_control != 0;
+}
+
 static const VMStateDescription vmstate_msr_ia32_misc_enable = {
     .name = "cpu/msr_ia32_misc_enable",
     .version_id = 1,
@@ -446,6 +454,17 @@ static const VMStateDescription vmstate_msr_ia32_misc_enable = {
     }
 };

+static const VMStateDescription vmstate_msr_ia32_feature_control = {
+    .name = "cpu/msr_ia32_feature_control",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .minimum_version_id_old = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT64(env.msr_ia32_feature_control, X86CPU),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
 const VMStateDescription vmstate_x86_cpu = {
     .name = "cpu",
     .version_id = 12,
@@ -571,6 +590,9 @@ const VMStateDescription vmstate_x86_cpu = {
         }, {
             .vmsd = &vmstate_msr_ia32_misc_enable,
             .needed = misc_enable_needed,
+        }, {
+            .vmsd = &vmstate_msr_ia32_feature_control,
+            .needed = feature_control_needed,
         }, {
             /* empty */
         }
--
1.7.10.4
[Qemu-devel] [PATCH 10/10] kvm-stub: fix compilation
From: Paolo Bonzini

Non-KVM targets fail compilation on the uq/master branch. Fix the
prototype of kvm_irqchip_add_irqfd_notifier to match the one in
kvm-all.c.

Signed-off-by: Paolo Bonzini
Signed-off-by: Gleb Natapov
---
 kvm-stub.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kvm-stub.c b/kvm-stub.c
index 7b2233a..806b044 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -135,7 +135,8 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
     return -ENOSYS;
 }

-int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
+int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
+                                   EventNotifier *rn, int virq)
 {
     return -ENOSYS;
 }
--
1.7.10.4
[Qemu-devel] [PATCH 04/10] kvm: add KVM_IRQFD_FLAG_RESAMPLE support
From: Vincenzo Maffione

Added an EventNotifier* parameter to
kvm-all.c:kvm_irqchip_add_irqfd_notifier(), in order to give KVM
another eventfd to be used as "resamplefd". See the documentation in
the linux kernel sources in Documentation/virtual/kvm/api.txt
(section 4.75) for more details.

When the added parameter is passed NULL, the behaviour of the function
is unchanged with respect to the previous versions.

Reviewed-by: Paolo Bonzini
Signed-off-by: Vincenzo Maffione
Signed-off-by: Paolo Bonzini
---
 hw/misc/vfio.c         |  4 ++--
 hw/virtio/virtio-pci.c |  2 +-
 include/sysemu/kvm.h   |  3 ++-
 kvm-all.c              | 17 +++++++++++++----
 4 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index ad8ce77..54af34a 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -646,7 +646,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
     vector->virq = msg ? kvm_irqchip_add_msi_route(kvm_state, *msg) : -1;
     if (vector->virq < 0 ||
         kvm_irqchip_add_irqfd_notifier(kvm_state, &vector->interrupt,
-                                       vector->virq) < 0) {
+                                       NULL, vector->virq) < 0) {
         if (vector->virq >= 0) {
             kvm_irqchip_release_virq(kvm_state, vector->virq);
             vector->virq = -1;
@@ -814,7 +814,7 @@ retry:
         vector->virq = kvm_irqchip_add_msi_route(kvm_state, msg);
         if (vector->virq < 0 ||
             kvm_irqchip_add_irqfd_notifier(kvm_state, &vector->interrupt,
-                                           vector->virq) < 0) {
+                                           NULL, vector->virq) < 0) {
             qemu_set_fd_handler(event_notifier_get_fd(&vector->interrupt),
                                 vfio_msi_interrupt, NULL, vector);
         }
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index c38cfd1..c4db407 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -508,7 +508,7 @@ static int kvm_virtio_pci_irqfd_use(VirtIOPCIProxy *proxy,
     VirtQueue *vq = virtio_get_queue(proxy->vdev, queue_no);
     EventNotifier *n = virtio_queue_get_guest_notifier(vq);
     int ret;
-    ret = kvm_irqchip_add_irqfd_notifier(kvm_state, n, irqfd->virq);
+    ret = kvm_irqchip_add_irqfd_notifier(kvm_state, n, NULL, irqfd->virq);
     return ret;
 }
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index f8ac448..ce3efaf 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -309,7 +309,8 @@ int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg);
 int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg);
 void kvm_irqchip_release_virq(KVMState *s, int virq);

-int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq);
+int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
+                                   EventNotifier *rn, int virq);
 int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int virq);
 void kvm_pc_gsi_handler(void *opaque, int n, int level);
 void kvm_pc_setup_irq_routing(bool pci_enabled);
diff --git a/kvm-all.c b/kvm-all.c
index 4fb4ccb..bfa4aac 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1230,7 +1230,8 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
     return kvm_update_routing_entry(s, &kroute);
 }

-static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
+static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
+                                    bool assign)
 {
     struct kvm_irqfd irqfd = {
         .fd = fd,
@@ -1238,6 +1239,11 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
         .flags = assign ? 0 : KVM_IRQFD_FLAG_DEASSIGN,
     };

+    if (rfd != -1) {
+        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
+        irqfd.resamplefd = rfd;
+    }
+
     if (!kvm_irqfds_enabled()) {
         return -ENOSYS;
     }
@@ -1276,14 +1282,17 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
 }
 #endif /* !KVM_CAP_IRQ_ROUTING */

-int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
+int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
+                                   EventNotifier *rn, int virq)
 {
-    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), virq, true);
+    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n),
+                                    rn ? event_notifier_get_fd(rn) : -1,
+                                    virq, true);
 }

 int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int virq)
 {
-    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), virq, false);
+    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), -1, virq,
+                                    false);
 }

 static int kvm_irqchip_create(KVMState *s)
--
1.7.10.4
[Qemu-devel] [PATCH 08/10] kvm: i386: fix LAPIC TSC deadline timer save/restore
From: Marcelo Tosatti

The configuration of the timer represented by MSR_IA32_TSCDEADLINE depends on:

- APIC LVT Timer register.
- TSC value.

Change the order to respect the dependency.

Signed-off-by: Marcelo Tosatti
Signed-off-by: Paolo Bonzini
---
 target-i386/kvm.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7bb8455..58f7bb7 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1073,6 +1073,26 @@ static void kvm_msr_entry_set(struct kvm_msr_entry *entry,
     entry->data = value;
 }

+static int kvm_put_tscdeadline_msr(X86CPU *cpu)
+{
+    CPUX86State *env = &cpu->env;
+    struct {
+        struct kvm_msrs info;
+        struct kvm_msr_entry entries[1];
+    } msr_data;
+    struct kvm_msr_entry *msrs = msr_data.entries;
+
+    if (!has_msr_tsc_deadline) {
+        return 0;
+    }
+
+    kvm_msr_entry_set(&msrs[0], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
+
+    msr_data.info.nmsrs = 1;
+
+    return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, &msr_data);
+}
+
 static int kvm_put_msrs(X86CPU *cpu, int level)
 {
     CPUX86State *env = &cpu->env;
@@ -1096,9 +1116,6 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
     if (has_msr_tsc_adjust) {
         kvm_msr_entry_set(&msrs[n++], MSR_TSC_ADJUST, env->tsc_adjust);
     }
-    if (has_msr_tsc_deadline) {
-        kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
-    }
     if (has_msr_misc_enable) {
         kvm_msr_entry_set(&msrs[n++], MSR_IA32_MISC_ENABLE,
                           env->msr_ia32_misc_enable);
@@ -1808,6 +1825,12 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
             return ret;
         }
     }
+
+    ret = kvm_put_tscdeadline_msr(x86_cpu);
+    if (ret < 0) {
+        return ret;
+    }
+
     ret = kvm_put_vcpu_events(x86_cpu, level);
     if (ret < 0) {
         return ret;
--
1.7.10.4
[Qemu-devel] [PATCH 03/10] kvm: migrate vPMU state
From: Paolo Bonzini

Reviewed-by: Gleb Natapov
Signed-off-by: Paolo Bonzini
---
 target-i386/cpu.h     | 23 ++++++
 target-i386/kvm.c     | 93 ++++++++++++++++++++---
 target-i386/machine.c | 44 ++++++++++
 3 files changed, 155 insertions(+), 5 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index af4c0f7..31de265 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -305,6 +305,8 @@
 #define MSR_TSC_ADJUST                  0x003b
 #define MSR_IA32_TSCDEADLINE            0x6e0

+#define MSR_P6_PERFCTR0                 0xc1
+
 #define MSR_MTRRcap                     0xfe
 #define MSR_MTRRcap_VCNT                8
 #define MSR_MTRRcap_FIXRANGE_SUPPORT    (1 << 8)
@@ -318,6 +320,8 @@
 #define MSR_MCG_STATUS                  0x17a
 #define MSR_MCG_CTL                     0x17b

+#define MSR_P6_EVNTSEL0                 0x186
+
 #define MSR_IA32_PERF_STATUS            0x198

 #define MSR_IA32_MISC_ENABLE            0x1a0
@@ -343,6 +347,14 @@
 #define MSR_MTRRdefType                 0x2ff

+#define MSR_CORE_PERF_FIXED_CTR0        0x309
+#define MSR_CORE_PERF_FIXED_CTR1        0x30a
+#define MSR_CORE_PERF_FIXED_CTR2        0x30b
+#define MSR_CORE_PERF_FIXED_CTR_CTRL    0x38d
+#define MSR_CORE_PERF_GLOBAL_STATUS     0x38e
+#define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
+#define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
+
 #define MSR_MC0_CTL                     0x400
 #define MSR_MC0_STATUS                  0x401
 #define MSR_MC0_ADDR                    0x402
@@ -721,6 +733,9 @@ typedef struct {
 #define CPU_NB_REGS CPU_NB_REGS32
 #endif

+#define MAX_FIXED_COUNTERS 3
+#define MAX_GP_COUNTERS    (MSR_IA32_PERF_STATUS - MSR_P6_EVNTSEL0)
+
 #define NB_MMU_MODES 3

 typedef enum TPRAccess {
@@ -816,6 +831,14 @@ typedef struct CPUX86State {
     uint64_t msr_ia32_misc_enable;
     uint64_t msr_ia32_feature_control;

+    uint64_t msr_fixed_ctr_ctrl;
+    uint64_t msr_global_ctrl;
+    uint64_t msr_global_status;
+    uint64_t msr_global_ovf_ctrl;
+    uint64_t msr_fixed_counters[MAX_FIXED_COUNTERS];
+    uint64_t msr_gp_counters[MAX_GP_COUNTERS];
+    uint64_t msr_gp_evtsel[MAX_GP_COUNTERS];
+
     /* exception/interrupt handling */
     int error_code;
     int exception_is_int;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 84ac00a..513ae52 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -71,6 +71,9 @@ static bool has_msr_misc_enable;
 static bool has_msr_kvm_steal_time;
 static int lm_capable_kernel;

+static bool has_msr_architectural_pmu;
+static uint32_t num_architectural_pmu_counters;
+
 bool kvm_allows_irq0_override(void)
 {
     return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
@@ -581,6 +584,25 @@ int kvm_arch_init_vcpu(CPUState *cs)
             break;
         }
     }
+
+    if (limit >= 0x0a) {
+        uint32_t ver;
+
+        cpu_x86_cpuid(env, 0x0a, 0, &ver, &unused, &unused, &unused);
+        if ((ver & 0xff) > 0) {
+            has_msr_architectural_pmu = true;
+            num_architectural_pmu_counters = (ver & 0xff00) >> 8;
+
+            /* Shouldn't be more than 32, since that's the number of bits
+             * available in EBX to tell us _which_ counters are available.
+             * Play it safe.
+             */
+            if (num_architectural_pmu_counters > MAX_GP_COUNTERS) {
+                num_architectural_pmu_counters = MAX_GP_COUNTERS;
+            }
+        }
+    }
+
     cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);

     for (i = 0x80000000; i <= limit; i++) {
@@ -1052,7 +1074,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         struct kvm_msr_entry entries[100];
     } msr_data;
     struct kvm_msr_entry *msrs = msr_data.entries;
-    int n = 0;
+    int n = 0, i;

     kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_CS, env->sysenter_cs);
     kvm_msr_entry_set(&msrs[n++], MSR_IA32_SYSENTER_ESP, env->sysenter_esp);
@@ -1094,9 +1116,8 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         }
     }
     /*
-     * The following paravirtual MSRs have side effects on the guest or are
-     * too heavy for normal writeback. Limit them to reset or full state
-     * updates.
+     * The following MSRs have side effects on the guest or are too heavy
+     * for normal writeback. Limit them to reset or full state updates.
      */
     if (level >= KVM_PUT_RESET_STATE) {
         kvm_msr_entry_set(&msrs[n++], MSR_KVM_SYSTEM_TIME,
@@ -1114,6 +1135,33 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             kvm_msr_entry_set(&msrs[n++], MSR_KVM_STEAL_TIME,
                               env->steal_time_msr);
         }
+        if (has_msr_architectural_pmu) {
+            /* Stop the counter.  */
+            kvm_msr_entry_set(&msrs[n++], MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
+            kvm_msr_entry_set(&msrs
[Qemu-devel] [PATCH 09/10] kvm: shorten the parameter list for get_real_device()
From: Wei Yang

get_real_device() has 5 parameters, but the last 4 are already
contained in the first structure. This patch removes the last 4
parameters and takes them directly from the first one.

Acked-by: Alex Williamson
Signed-off-by: Wei Yang
Signed-off-by: Paolo Bonzini
---
 hw/i386/kvm/pci-assign.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index ff33dc8..73941b2 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -568,8 +568,7 @@ static int get_real_device_id(const char *devpath, uint16_t *val)
     return get_real_id(devpath, "device", val);
 }

-static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
-                           uint8_t r_bus, uint8_t r_dev, uint8_t r_func)
+static int get_real_device(AssignedDevice *pci_dev)
 {
     char dir[128], name[128];
     int fd, r = 0, v;
@@ -582,7 +581,8 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg,
     dev->region_number = 0;

     snprintf(dir, sizeof(dir), "/sys/bus/pci/devices/%04x:%02x:%02x.%x/",
-             r_seg, r_bus, r_dev, r_func);
+             pci_dev->host.domain, pci_dev->host.bus,
+             pci_dev->host.slot, pci_dev->host.function);

     snprintf(name, sizeof(name), "%sconfig", dir);

@@ -1769,8 +1769,7 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
     memcpy(dev->emulate_config_write, dev->emulate_config_read,
            sizeof(dev->emulate_config_read));

-    if (get_real_device(dev, dev->host.domain, dev->host.bus,
-                        dev->host.slot, dev->host.function)) {
+    if (get_real_device(dev)) {
         error_report("pci-assign: Error: Couldn't get real device (%s)!",
                      dev->dev.qdev.id);
         goto out;
--
1.7.10.4
[Qemu-devel] [PATCH v3 00/10] [PULL] qemu-kvm.git uq/master queue
Anthony,

This obsoletes "[PULL v2 0/9] KVM changes for 2013-08-23".

The following changes since commit f03d07d4683b2e8325a7cb60b4e14b977b1a869c:

  Merge remote-tracking branch 'quintela/migration.next' into staging (2013-07-23 10:57:23 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

for you to fetch changes up to 821c808bd1863efc2c1e977800ae77db633a185c:

  kvm-stub: fix compilation (2013-08-28 17:07:02 +0300)

----------------------------------------------------------------
Arthur Chunqi Li (1):
      Initialize IA32_FEATURE_CONTROL MSR in reset and migration

Jan Kiszka (1):
      kvm: Simplify kvm_handle_io

Liu Jinsong (1):
      kvm: x86: fix setting IA32_FEATURE_CONTROL with nested VMX disabled

Marcelo Tosatti (2):
      kvm-all.c: max_cpus should not exceed KVM vcpu limit
      kvm: i386: fix LAPIC TSC deadline timer save/restore

Paolo Bonzini (3):
      target-i386: remove tabs from target-i386/cpu.h
      kvm: migrate vPMU state
      kvm-stub: fix compilation

Vincenzo Maffione (1):
      kvm: add KVM_IRQFD_FLAG_RESAMPLE support

Wei Yang (1):
      kvm: shorten the parameter list for get_real_device()

 hw/i386/kvm/pci-assign.c |   9 +-
 hw/misc/vfio.c           |   4 +-
 hw/virtio/virtio-pci.c   |   2 +-
 include/sysemu/kvm.h     |   3 +-
 kvm-all.c                |  52 +++--
 kvm-stub.c               |   3 +-
 target-i386/cpu.h        | 217 ++++++++++----
 target-i386/kvm.c        | 139 +++++++---
 target-i386/machine.c    |  66 ++++++
 9 files changed, 351 insertions(+), 144 deletions(-)
Re: [Qemu-devel] KVM guest cpu L3 cache and cpufreq
On Tue, Aug 27, 2013 at 03:35:33PM +0200, Benoît Canet wrote:
> > > > If I understand correctly, changing the CPUID L3 cache infos in
> > > > QEMU will change the value displayed in the guest /proc/cpuinfo
> > > > but will not change the size of the l3 cache used by the
> > > > hardware. So I am chasing a cosmetic bug. Is that right?
> > > >
> > > Right.
> > >
> > Well, actually not entirely. An application can query cache sizes and
> > adjust its algorithms accordingly, in which case performance can be
> > suboptimal if the information in the guest is not accurate.
>
> So maybe what is missing is a CPUID passthrough mode triggered when
> -cpu host is used?
>
That is missing too, but it should not be the default with other CPU
models.

--
Gleb.
Re: [Qemu-devel] KVM guest cpu L3 cache and cpufreq
On Tue, Aug 27, 2013 at 04:17:56PM +0300, Gleb Natapov wrote:
> On Tue, Aug 27, 2013 at 03:18:16PM +0200, Benoît Canet wrote:
> >
> > Hello Eduardo,
> >
> > I read a bit about caches on wikipedia.
> >
> > If I understand correctly, changing the CPUID L3 cache infos in QEMU
> > will change the value displayed in the guest /proc/cpuinfo but will
> > not change the size of the l3 cache used by the hardware. So I am
> > chasing a cosmetic bug. Is that right?
> >
> Right.
>
Well, actually not entirely. An application can query cache sizes and
adjust its algorithms accordingly, in which case performance can be
suboptimal if the information in the guest is not accurate.

--
Gleb.
Re: [Qemu-devel] KVM guest cpu L3 cache and cpufreq
On Tue, Aug 27, 2013 at 03:18:16PM +0200, Benoît Canet wrote:
>
> Hello Eduardo,
>
> I read a bit about caches on wikipedia.
>
> If I understand correctly, changing the CPUID L3 cache infos in QEMU
> will change the value displayed in the guest /proc/cpuinfo but will
> not change the size of the l3 cache used by the hardware. So I am
> chasing a cosmetic bug. Is that right?
>
Right.

--
Gleb.
Re: [Qemu-devel] KVM guest cpu L3 cache and cpufreq
On Mon, Aug 26, 2013 at 02:49:41PM +0200, Benoît Canet wrote:
>
> Hi,
>
> Thanks for the answer.
>
> > On Tue, Aug 13, 2013 at 08:17:13PM +0200, Benoît Canet wrote:
> > >
> > > Hi,
> > >
> > > I noticed that the l3 cache size in a guest's /proc/cpuinfo is not
> > > the same as the l3 cache size of the host.
> > >
> > > I did not find any references to this in the qemu and KVM code.
> > >
> > > Is the size of the guest L3 cache fixed in hardware?
> > >
> > No, it is hardcoded somewhere in qemu cpuid code.
> >
> > > Can a patch be written to set it?
> > >
> > Yes.
>
> Ok, I'll try to do that.
>
Talk to Eduardo, since this is related to cpuid configuration and he is
the expert.

> > > Similarly, I noticed that the frequency in the guest was not
> > > reflecting the frequency scaling of the host.
> > >
> > Not sure what you mean here. Frequency as seen where?
>
> The frequency seen in the host /proc/cpuinfo and the guest
> /proc/cpuinfo differ when a governor kicks in cpu frequency scaling:
> the guest still shows the maximum frequency of the cpu.
>
KVM does not emulate frequency scaling. It does not make sense for a
guest.

> Could this value be reflected dynamically?
>
Not sure frequency scaling can be implemented only for read.

--
Gleb.