Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Wed, 2010-09-15 at 07:12 +0200, Michael S. Tsirkin wrote:
> Yes, I agree this patch is useful for demo purposes:
> simple, and shows what kind of performance gains
> we can expect for TX. 

Any other issues can you see in this patch besides the vhost descriptor
update? Don't you think that once I address the vhost_add_used_and_signal
update issue, it is a simple and complete patch for macvtap TX zero copy?

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Wed, 2010-09-15 at 07:27 +0200, Michael S. Tsirkin wrote:
> For some of the issues, try following the discussion around
> net: af_packet: don't call tpacket_destruct_skb() until the skb is
> sent
> out.
> 
> Summary: it's difficult to do correctly generally. Limiting ourselves
> to transmit on specific devices might make it possible.

Thanks for the tips.

Shirley



Re: [RFC PATCH 1/4] Add a new API to virtio-pci

2010-09-14 Thread Michael S. Tsirkin
On Mon, Sep 13, 2010 at 12:40:11PM -0500, Anthony Liguori wrote:
> On 09/13/2010 11:30 AM, Michael S. Tsirkin wrote:
> >On Mon, Sep 13, 2010 at 10:59:34AM -0500, Anthony Liguori wrote:
> >>On 09/13/2010 04:04 AM, Michael S. Tsirkin wrote:
> >>>On Mon, Sep 13, 2010 at 09:50:42AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"   wrote on 09/12/2010 05:16:37 PM:
> 
> >"Michael S. Tsirkin"
> >09/12/2010 05:16 PM
> >
> >On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
> >>Unfortunately I need a
> >>constant in vhost for now.
> >Maybe not even that: you create multiple vhost-net
> >devices so vhost-net in kernel does not care about these
> >either, right? So this can be just part of vhost_net.h
> >in qemu.
> Sorry, I didn't understand what you meant.
> 
> I can remove all socks[] arrays/constants by pre-allocating
> sockets in vhost_setup_vqs. Then I can remove all "socks"
> parameters in vhost_net_stop, vhost_net_release and
> vhost_net_reset_owner.
> 
> Does this make sense?
> 
> Thanks,
> 
> - KK
> >>>Here's what I mean: each vhost device includes 1 TX
> >>>and 1 RX VQ. Instead of teaching vhost about multiqueue,
> >>>we could simply open /dev/vhost-net multiple times.
> >>>How many times would be up to qemu.
> >>Trouble is, each vhost-net device is associated with 1 tun/tap
> >>device which means that each vhost-net device is associated with a
> >>transmit and receive queue.
> >>
> >>I don't know if you'll always have an equal number of transmit and
> >>receive queues but there's certainly a challenge in terms of
> >>flexibility with this model.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >Not really, TX and RX can be mapped to different devices,
> 
> It's just a little odd.  Would you bond multiple tun/tap devices to
> achieve multi-queue TX?  For RX, do you somehow limit RX to only one
> of those devices?

Exactly the way the patches we discuss here do it:
we already have a per-queue fd.

> If we were doing this in QEMU (and btw, there need to be userspace
> patches before we implement this in the kernel side),

I agree that feature parity is nice to have, but
I don't see a huge problem with (hopefully temporarily) only
supporting feature X with kernel acceleration, BTW.
This is already the case with checksum offloading features.

> I think it
> would make more sense to just rely on doing a multithreaded write to
> a single tun/tap device and then to hope that it can be made smarter
> at the macvtap layer.

No, an fd serializes access, so you need separate fds for multithreaded
writes to work.  Think about how e.g. select will work.

> Regards,
> 
> Anthony Liguori
> 
> >or you can only map one of these. What is the trouble?
> >What other features would you desire in terms of flexibility?
> >
> 


Re: [RFC PATCH 1/4] Add a new API to virtio-pci

2010-09-14 Thread Michael S. Tsirkin
On Mon, Sep 13, 2010 at 07:00:51PM +0200, Avi Kivity wrote:
>  On 09/13/2010 06:30 PM, Michael S. Tsirkin wrote:
> >Trouble is, each vhost-net device is associated with 1 tun/tap
> >device which means that each vhost-net device is associated with a
> >transmit and receive queue.
> >
> >I don't know if you'll always have an equal number of transmit and
> >receive queues but there's certainly a challenge in terms of
> >flexibility with this model.
> >
> >Regards,
> >
> >Anthony Liguori
> >Not really, TX and RX can be mapped to different devices,
> >or you can only map one of these. What is the trouble?
> 
> Suppose you have one multiqueue-capable ethernet card.  How can you
> connect it to multiple rx/tx queues?
> tx is in principle doable, but what about rx?
> 
> What does "only map one of these" mean?  Connect the device with one
> queue (presumably rx), and terminate the others?
> 
> 
> Will packet classification work (does the current multiqueue
> proposal support it)?
> 

This is a non-trivial problem, but
it needs to be handled in tap, not in vhost-net.
If tap gives you multiple queues, vhost-net will happily
let you connect vqs to these.

> 
> -- 
> error compiling committee.c: too many arguments to function
> 


Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-14 Thread Michael S. Tsirkin
On Mon, Sep 13, 2010 at 09:53:40PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 09/13/2010 05:20:55 PM:
> 
> > > Results with the original kernel:
> > > _
> > > #   BW  SD  RSD
> > > __
> > > 1   20903   1   6
> > > 2   21963   6   25
> > > 4   22042   23  102
> > > 8   21674   97  419
> > > 16  22281   379 1663
> > > 24  22521   857 3748
> > > 32  22976   1528    6594
> > > 40  23197   2390    10239
> > > 48  22973   3542    15074
> > > 64  23809   6486    27244
> > > 80  23564   10169   43118
> > > 96  22977   14954   62948
> > > 128 23649   27067   113892
> > > 
> > >
> > > With higher number of threads running in parallel, SD
> > > increased. In this case most threads run in parallel
> > > only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
> > > higher number of threads run in parallel through
> > > ndo_start_xmit. I *think* the increase in SD has to do
> > > with a higher # of threads running through a larger code path.
> > > From the numbers I posted with the patch (cut-n-paste
> > > only the % parts), BW increased much more than the SD,
> > > sometimes more than twice the increase in SD.
> >
> > Service demand is BW/CPU, right? So if BW goes up by 50%
> > and SD by 40%, this means that CPU more than doubled.
> 
> I think the SD calculation might be more complicated;
> I think it does it based on adding up averages sampled
> and stored during the run. But I still don't see how CPU
> can double, e.g.:
>   BW: 1000 -> 1500 (50%)
>   SD: 100 -> 140 (40%)
>   CPU: 10 -> 10.71 (7.1%)

Hmm. Time to look at the source. Which netperf version did you use?
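The two calculations above disagree on what SD measures. A quick sketch of both interpretations, using the example numbers from the mail (Python purely for illustration; whether netperf's SD is per-unit CPU cost or BW/CPU is exactly what is in question here):

```python
# Example from the thread: BW 1000 -> 1500 (+50%), SD 100 -> 140 (+40%).
bw_ratio = 1500 / 1000
sd_ratio = 140 / 100

# Reading 1: SD is CPU cost per unit of work (usec/KB style),
# so total CPU ~ SD * BW -- CPU more than doubles.
cpu_ratio_per_unit = sd_ratio * bw_ratio       # 2.1

# Reading 2: SD = BW / CPU, so CPU = BW / SD -- CPU rises only ~7%.
cpu_ratio_bw_over_cpu = bw_ratio / sd_ratio    # ~1.071

print(cpu_ratio_per_unit, cpu_ratio_bw_over_cpu)
```

Under the first reading MST's "CPU more than doubled" follows directly; under the second, KK's ~7% figure does.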

> > > N#  BW% SD%  RSD%
> > > 4   54.30   40.00   -1.16
> > > 8   71.79   46.59   -2.68
> > > 16  71.89   50.40   -2.50
> > > 32  72.24   34.26   -14.52
> > > 48  70.10   31.51   -14.35
> > > 64  69.01   38.81   -9.66
> > > 96  70.68   71.26   10.74
> > >
> > > I also think SD calculation gets skewed for guest->local
> > > host testing.
> >
> > If it's broken, let's fix it?
> >
> > > For this test, I ran a guest with numtxqs=16.
> > > The first result below is with my patch, which creates 16
> > > vhosts. The second result is with a modified patch which
> > > creates only 2 vhosts (testing with #netperfs = 64):
> >
> > My guess is it's not a good idea to have more TX VQs than guest CPUs.
> 
> Definitely, I will try to run tomorrow with more reasonable
> values, also will test with my second version of the patch
> that creates restricted number of vhosts and post results.
> 
> > I realize for management it's easier to pass in a single vhost fd, but
> > just for testing it's probably easier to add code in userspace to open
> > /dev/vhost multiple times.
> >
> > >
> > > #vhosts  BW% SD%RSD%
> > > 16   20.79   186.01 149.74
> > > 2        30.89   34.55  18.44
> > >
> > > The remote SD increases with the number of vhost threads,
> > > but that number seems to correlate with guest SD. So though
> > > BW% increased slightly from 20% to 30%, SD fell drastically
> > > from 186% to 34%. I think it could be a calculation skew
> > > with host SD, which also fell from 150% to 18%.
> >
> > I think by default netperf looks in /proc/stat for CPU utilization data:
> > so host CPU utilization will include the guest CPU, I think?
> 
> It appears that way to me too, but the data above seems to
> suggest the opposite...
> 
> > I would go further and claim that for host/guest TCP
> > CPU utilization and SD should always be identical.
> > Makes sense?
> 
> It makes sense to me, but once again I am not sure how SD
> is really done, or whether it is linear to CPU. Cc'ing Rick
> in case he can comment.

Me neither. I should rephrase: I think we should always
use host CPU utilization.

> >
> > >
> > > I am planning to submit 2nd patch rev with restricted
> > > number of vhosts.
> > >
> > > > > Likely cause for the 1 stream degradation with multiple
> > > > > vhost patch:
> > > > >
> > > > > 1. Two vhosts run handling the RX and TX respectively.
> > > > >I think the issue is related to cache ping-pong esp
> > > > >since these run on different cpus/sockets.
> > > >
> > > > Right. With TCP I think we are better off handling
> > > > TX and RX for a socket by the same vhost, so that
> > > > packet and its ack are handled by the same thread.
> > > > Is this what happens with RX multiqueue patch?
> > > > How do we select an RX queue to put the packet on?
> > >
> > > My (unsubmitted) RX patch doesn't do this yet, that is
> > > something I will check.
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > You'll want to work on top of net-next, I think there's
> > RX flow filtering work going on there.
> 
> Thanks Michael, I will follow up on that for the RX patch,
> plus your suggestion on tying RX with TX.
> 
> Th

Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 12:20:29PM -0700, Shirley Ma wrote:
> On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
> > On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote:
> > > On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
> > > > As others said, the harder issues for TX are in determining that
> > it's
> > > > safe
> > > > to unpin the memory, and how much memory is it safe to pin to
> > begin
> > > > with.  For RX we have some more complexity.
> > > 
> > > I think unpinning the memory happens in kfree_skb() whenever the last
> > reference
> > > is gone for TX. What we discussed about here is when/how vhost get
> > > notified to update ring buffer descriptors. Do I misunderstand
> > something
> > > here? 
> > 
> > Right, that's a better way to put it. 
> 
> That's what this macvtap patch does. As for how many pages are pinned, it is
> limited by the sk_wmem_alloc size in this patch.
> 
> thanks
> Shirley

Except that you seem to pin full pages but account sub-page size
in wmem.
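The mismatch MST points out can be made concrete: pinning works at page granularity, while wmem would only be charged the sub-page buffer length. A rough sketch (PAGE_SIZE and the helper are illustrative, not macvtap code):

```python
PAGE_SIZE = 4096

def pinned_bytes(addr, length, page_size=PAGE_SIZE):
    """Bytes actually pinned: the user range rounded out to page boundaries."""
    start = addr - (addr % page_size)
    end = -(-(addr + length) // page_size) * page_size  # round up
    return end - start

# A 100-byte buffer straddling a page boundary pins two full pages,
# while sub-page accounting would charge only 100 bytes to wmem.
assert pinned_bytes(4090, 100) == 2 * PAGE_SIZE
assert pinned_bytes(0, 100) == PAGE_SIZE
```

So accounting sub-page sizes in wmem under-counts the memory a misbehaving guest can keep pinned.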

-- 
MST


Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization

2010-09-14 Thread Jan Kiszka
Am 15.09.2010 01:40, Zachary Amsden wrote:
> On 09/14/2010 12:26 PM, Jan Kiszka wrote:
>> Am 14.09.2010 21:32, Zachary Amsden wrote:
>>   
>>> On 09/14/2010 12:40 AM, Jan Kiszka wrote:
>>> 
 Am 14.09.2010 11:27, Avi Kivity wrote:

   
> On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>
> 
>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>
>>   
>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>> running.  This causes us to require resynchronization.  Since
>>> we can't tell when this may potentially happen, we assume the
>>> worst by forcing re-compensation for it at every point the VCPU
>>> task is descheduled.
>>>
>>> Signed-off-by: Zachary Amsden
>>> ---
>>> arch/x86/kvm/x86.c |2 +-
>>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 7fc4a55..52b6c21 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>>> *vcpu, int cpu)
>>> }
>>>
>>> kvm_x86_ops->vcpu_load(vcpu, cpu);
>>> -if (unlikely(vcpu->cpu != cpu)) {
>>> +if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>> /* Make sure TSC doesn't go backwards */
>>> s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>> native_read_tsc() - vcpu->arch.last_host_tsc;
>>>
>>>  
>> For yet unknown reason, this commit breaks Linux guests here if they
>> are
>> started with only a single VCPU. They hang during boot, obviously no
>> longer receiving interrupts.
>>
>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a
>> side
>> effect of the wrapping, though I cannot imagine how.
>>
>> Anyone any ideas?
>>
>>
>>
>>
> Most likely, time went backwards, and some 'future - past' calculation
> resulted in a negative sleep value which was then interpreted as
> unsigned and resulted in a 2342525634 year sleep.
>
>  
 Looks like that's the case on first glance at the apic state.


>>> This compensation effectively nulls the delta between current and
>>> last TSC:
>>>
>>>  if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>  /* Make sure TSC doesn't go backwards */
>>>  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>  native_read_tsc() -
>>> vcpu->arch.last_host_tsc;
>>>  if (tsc_delta<  0)
>>>  mark_tsc_unstable("KVM discovered backwards
>>> TSC");
>>>  if (check_tsc_unstable())
>>>  kvm_x86_ops->adjust_tsc_offset(vcpu,
>>> -tsc_delta);
>>>  kvm_migrate_timers(vcpu);
>>>  vcpu->cpu = cpu;
>>>
>>> If TSC has advanced quite a bit due to a TSC jump during sleep(*), it
>>> will adjust the offset backwards to compensate; similarly, if it has
>>> gone backwards, it will advance the offset.
>>>
>>> In neither case should the visible TSC go backwards, assuming
>>> last_host_tsc is recorded properly, and so kvmclock should be similarly
>>> unaffected.
>>>
>>> Perhaps the guest is more intelligent than we hope, and is comparing two
>>> different clocks: kvmclock or TSC with the rate of PIT interrupts.  This
> >>> could result in negative arithmetic being interpreted as unsigned.  Are
>>> you using PIT interrupt reinjection on this guest or passing
>>> -no-kvm-pit-reinjection?
>>>
>>> 

   
> Does your guest use kvmclock, tsc, or some other time source?
>
>  
 A kernel that has kvmclock support even hangs in SMP mode. The others
 pick hpet or acpi_pm. TSC is considered unstable.


> >>> SMP mode here has always been and will always be unreliable.  Are you running
>>> on an Intel or AMD CPU?  The origin of this code comes from a workaround
>>> for (*) in vendor-specific code, and perhaps it is inappropriate for
>>> both.
>>>  
>> I'm on a fairly new Intel i7 (M 620). And I accidentally rebooted my box
>> a few hours ago. Well, the issue is gone now...
>>
>> So I looked into the system logs and found this:
>>
>> [18446744053.434939] PM: resume of devices complete after 4379.595 msecs
>> [18446744053.457133] PM: Finishing wakeup.
>> [18446744053.457135] Restarting tasks ...
>> [0.000999] Marking TSC unstable due to KVM discovered backwards TSC
>> [270103.974668] done.
>>
>>  From that point on the box was on hpet, including the time I did the
>> failing tests this morning. The kvm-kmod version loaded at this point
>> was based on kvm.git df549cfc.
>>
>> But my /proc/cpuinfo claims "constant_tsc", and Linux is generally happy
>> with using it as clock source. Does this tell you anything?

Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 07:40:52PM -0700, Shirley Ma wrote:
> On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote:
> > I think what David said is what we have thought about before in the mp
> > device. Since we are not sure of the exact time the tx buffer was written
> > through the DMA operation, the deadline is when the tx buffer is freed. So
> > we only notify the vhost stuff about the write when the tx buffer is
> > freed. But that deadline may be too late for performance.
> 
> Have you tried it? If so what's the performance penalty you have seen by
> notifying vhost when tx buffer freed?
> 
> I am thinking of having a callback in the skb destructor:
> vhost_add_used_and_signal gets called when the skb is actually freed; the
> vhost vq & head need to be passed to the callback. This might require the
> vhost ring size to be at least as big as the lower device driver's.
> 
> Thanks
> Shirley

For some of the issues, try following the discussion around
net: af_packet: don't call tpacket_destruct_skb() until the skb is sent
out.

Summary: it's difficult to do correctly generally. Limiting ourselves
to transmit on specific devices might make it possible.

-- 
MST


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 12:36:23PM -0700, Shirley Ma wrote:
> On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
> > > > I think that you should be able to simply combine
> > > > the two drivers together, add an ioctl to
> > > > enable/disable zero copy mode of operation. 
> > > 
> > > That could work. But what's the purpose to have two drivers if one
> > > driver can handle it?
> > > 
> > > Thanks
> > > Shirley
> > 
> > This was just an idea: I thought it's a good way for people interested
> > in this zero copy thing to combine forces and avoid making
> > the same mistakes, but it's not a must of course. 
> 
> Ok, I will make a simple patch by reusing some of Xiaohui's vhost code for
> handling vhost_add_used_and_signal() to see any performance changes.
> 
> The interesting thing here is that when I ran 32 netperf/netserver
> instances I didn't see any issue with this patch.
> 
> Thanks
> Shirley

Yes, I agree this patch is useful for demo purposes:
simple, and shows what kind of performance gains
we can expect for TX.

-- 
MST


Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization

2010-09-14 Thread Zachary Amsden

On 09/13/2010 11:10 PM, Jan Kiszka wrote:

Am 20.08.2010 10:07, Zachary Amsden wrote:
   

When CPUs with unstable TSCs enter deep C-state, TSC may stop
running.  This causes us to require resynchronization.  Since
we can't tell when this may potentially happen, we assume the
worst by forcing re-compensation for it at every point the VCPU
task is descheduled.

Signed-off-by: Zachary Amsden
---
  arch/x86/kvm/x86.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7fc4a55..52b6c21 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_x86_ops->vcpu_load(vcpu, cpu);
-   if (unlikely(vcpu->cpu != cpu)) {
+   if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
native_read_tsc() - vcpu->arch.last_host_tsc;
 

For yet unknown reason, this commit breaks Linux guests here if they are
started with only a single VCPU. They hang during boot, obviously no
longer receiving interrupts.

I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
effect of the wrapping, though I cannot imagine how.

Anyone any ideas?
   


Question: how did you determine that this is the commit which breaks
things?  I'm assuming you bisected, in which case a transition from
stable -> unstable would have only happened once.  This also means the
PM suspend event which you observed only happened once, so if you
bisected successfully, there is a bug which doesn't involve the PM
transition or the stable -> unstable transition.


Your host TSC must have desynchronized during the PM transition, and 
this change compensates the TSC on an unstable host to effectively show 
run time, not real time.  Perhaps the lack of catchup code (to catch 
back up to real time) is triggering the bug.
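A rough model of that compensation (plain Python for illustration; the real logic is in kvm_arch_vcpu_load): the offset absorbs whatever delta the host TSC accumulated while the VCPU was descheduled, so the guest-visible TSC continues from its last value in either direction.

```python
def vcpu_load(host_tsc, last_host_tsc, tsc_offset, tsc_unstable=True):
    """Null the host TSC delta accumulated while descheduled, so the
    guest-visible TSC (host_tsc + offset) shows run time, not real time."""
    delta = 0 if last_host_tsc is None else host_tsc - last_host_tsc
    if tsc_unstable:
        tsc_offset -= delta
    return tsc_offset

# Host TSC jumps forward by 1,000,000 during deep C-state / suspend:
offset = vcpu_load(host_tsc=5_000_000, last_host_tsc=4_000_000, tsc_offset=0)
assert 5_000_000 + offset == 4_000_000  # guest TSC resumes where it stopped

# Host TSC goes backwards by 500 (desynchronized package):
offset2 = vcpu_load(host_tsc=3_500, last_host_tsc=4_000, tsc_offset=0)
assert 3_500 + offset2 == 4_000  # still monotonic from the guest's view
```

In neither direction does the visible TSC go backwards, which is why the suspicion falls on a guest comparing the TSC/kvmclock against an independent clock such as the PIT.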


In any case, I'll proceed with the forcing of unstable TSC and HPET 
clocksource and see what happens.


Zach


RE: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.

2010-09-14 Thread Xin, Xiaohui
>From: Michael S. Tsirkin [mailto:m...@redhat.com]
>Sent: Sunday, September 12, 2010 9:37 PM
>To: Xin, Xiaohui
>Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
>mi...@elte.hu; da...@davemloft.net; herb...@gondor.hengli.com.au;
>jd...@linux.intel.com
>Subject: Re: [RFC PATCH v9 12/16] Add mp(mediate passthru) device.
>
>On Sat, Sep 11, 2010 at 03:41:14PM +0800, Xin, Xiaohui wrote:
>> >>Playing with rlimit on data path, transparently to the application in this 
>> >>way
>> >>looks strange to me, I suspect this has unexpected security implications.
>> >>Further, applications may have other uses for locked memory
>> >>besides mpassthru - you should not just take it because it's there.
>> >>
>> >>Can we have an ioctl that lets userspace configure how much
>> >>memory to lock? This ioctl will decrement the rlimit and store
>> >>the data in the device structure so we can do accounting
>> >>internally. Put it back on close or on another ioctl.
>> >Yes, we can decrement the rlimit in ioctl in one time to avoid
>> >data path.
>> >
>> >>Need to be careful for when this operation gets called
>> >>again with 0 or another small value while we have locked memory -
>> >>maybe just fail with EBUSY?  or wait until it gets unlocked?
>> >>Maybe 0 can be special-cased and deactivate zero-copy?.
>> >>
>>
>> How about we don't use a new ioctl, but just check the rlimit
>> in the MPASSTHRU_BINDDEV ioctl? If we find the mp device
>> breaks the rlimit, then we fail the bind ioctl, and thus can't do
>> zero copy any more.
>
>Yes, and not just check, but decrement as well.
>I think we should give userspace control over
>how much memory we can lock and subtract from the rlimit.
>It's OK to add this as a parameter to MPASSTHRU_BINDDEV.
>Then increment the rlimit back on unbind and on close?
>
>This opens up an interesting condition: process 1
>calls bind, process 2 calls unbind or close.
>This will increment rlimit for process 2.
>Not sure how to fix this properly.
>
I can't either. Can we do any synchronized operations on the rlimit stuff?
I rather doubt it.
 
>--
>MST
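The cross-process condition MST describes can be sketched as follows (a toy model with hypothetical names, not the real mp-device API): the device charges locked memory against the binder's rlimit and credits it back to whichever process happens to unbind.

```python
class MpDevice:
    """Toy model of the proposal: charge locked memory against a process
    rlimit at bind time, remember it in the device, credit it back on
    unbind/close."""
    def __init__(self):
        self.locked = 0

    def bind(self, proc, to_lock):
        if to_lock > proc["memlock_left"]:
            raise PermissionError("RLIMIT_MEMLOCK exceeded")
        proc["memlock_left"] -= to_lock
        self.locked = to_lock

    def unbind(self, proc):
        # The bug under discussion: if a *different* process calls
        # unbind/close, its rlimit is credited, not the binder's.
        proc["memlock_left"] += self.locked
        self.locked = 0

p1 = {"memlock_left": 64}
p2 = {"memlock_left": 64}
dev = MpDevice()
dev.bind(p1, 48)
dev.unbind(p2)                    # process 2 closes the fd
assert p1["memlock_left"] == 16   # process 1 never gets its quota back
assert p2["memlock_left"] == 112  # process 2 gains quota it never spent
```

Fixing this presumably means tying the accounting to the mm/task that did the bind rather than to the caller of close, but that is exactly the open question.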


RE: [RFC PATCH v10 00/16] Provide a zero-copy method on KVM virtio-net.

2010-09-14 Thread Xin, Xiaohui
Herbert,
Any comments on the modifications of the net core and driver side of this patch?

Thanks
Xiaohui

>-Original Message-
>From: linux-kernel-ow...@vger.kernel.org 
>[mailto:linux-kernel-ow...@vger.kernel.org] On
>Behalf Of xiaohui@intel.com
>Sent: Saturday, September 11, 2010 5:53 PM
>To: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
>m...@redhat.com; mi...@elte.hu; da...@davemloft.net; 
>herb...@gondor.apana.org.au;
>jd...@linux.intel.com
>Subject: [RFC PATCH v10 00/16] Provide a zero-copy method on KVM virtio-net.
>
>We provide a zero-copy method by which the driver side may get external
>buffers to DMA into. Here external means the driver doesn't use kernel space
>to allocate skb buffers. Currently the external buffer can come from the
>guest virtio-net driver.
>
>The idea is simple: just pin the guest VM user space and then
>let the host NIC driver have the chance to DMA directly into it.
>The patches are based on vhost-net backend driver. We add a device
>which provides proto_ops as sendmsg/recvmsg to vhost-net to
>send/recv directly to/from the NIC driver. KVM guest who use the
>vhost-net backend may bind any ethX interface in the host side to
>get copyless data transfer thru guest virtio-net frontend.
>
>patch 01-10:   net core and kernel changes.
>patch 11-13:   new device as interface to manipulate external buffers.
>patch 14:  for vhost-net.
>patch 15:  An example on modifying NIC driver to using napi_gro_frags().
>patch 16:  An example how to get guest buffers based on driver
>   who using napi_gro_frags().
>
>The guest virtio-net driver submits multiple requests thru vhost-net
>backend driver to the kernel. And the requests are queued and then
>completed after corresponding actions in h/w are done.
>
>For read, user space buffers are dispensed to NIC driver for rx when
>a page constructor API is invoked. Means NICs can allocate user buffers
>from a page constructor. We add a hook in netif_receive_skb() function
>to intercept the incoming packets, and notify the zero-copy device.
>
>For write, the zero-copy device may allocate a new host skb, put the
>payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
>The request remains pending until the skb is transmitted by h/w.
>
>We provide multiple submits and asynchronous notification to
>vhost-net too.
>
>Our goal is to improve the bandwidth and reduce the CPU usage.
>Exact performance data will be provided later.
>
>What we have not done yet:
>   Performance tuning
>
>what we have done in v1:
>   polish the RCU usage
>   deal with write logging in asynchroush mode in vhost
>   add notifier block for mp device
>   rename page_ctor to mp_port in netdevice.h to make it looks generic
>   add mp_dev_change_flags() for mp device to change NIC state
>   add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
>   a small fix for missing dev_put when fail
>   using dynamic minor instead of static minor number
>   a __KERNEL__ protect to mp_get_sock()
>
>what we have done in v2:
>
>   remove most of the RCU usage, since the ctor pointer is only
>   changed by BIND/UNBIND ioctl, and during that time, NIC will be
>   stopped to get good cleanup (all outstanding requests are finished),
>   so the ctor pointer cannot be raced into wrong situation.
>
>   Remove the struct vhost_notifier with struct kiocb.
>   Let vhost-net backend to alloc/free the kiocb and transfer them
>   via sendmsg/recvmsg.
>
>   use get_user_pages_fast() and set_page_dirty_lock() when read.
>
>   Add some comments for netdev_mp_port_prep() and handle_mpassthru().
>
>what we have done in v3:
>   the async write logging is rewritten
>   a drafted synchronous write function for qemu live migration
>   a limit for locked pages from get_user_pages_fast() to prevent Dos
>   by using RLIMIT_MEMLOCK
>
>
>what we have done in v4:
>   add iocb completion callback from vhost-net to queue iocb in mp device
>   replace vq->receiver by mp_sock_data_ready()
>   remove stuff in mp device which access structures from vhost-net
>   modify skb_reserve() to ignore host NIC driver reserved space
>   rebase to the latest vhost tree
>   split large patches into small pieces, especially for net core part.
>
>
>what we have done in v5:
>   address Arnd Bergmann's comments
>   -remove IFF_MPASSTHRU_EXCL flag in mp device
>   -Add CONFIG_COMPAT macro
>   -remove mp_release ops
>   move dev_is_mpassthru() as inline func
>   fix a bug in memory relinquish
>   Apply to current git (2.6.34-rc6) tree.
>
>what we have done in v6:
>   move create_iocb() out of page_dtor which may happen in interrupt 
> context
>   -This remove the potential issues which lock called in interrupt context
>   make the cache used by mp, vhost as static, and created/destroyed during
>   

RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
>From: Shirley Ma [mailto:mashi...@us.ibm.com]
>Sent: Wednesday, September 15, 2010 10:41 AM
>To: Xin, Xiaohui
>Cc: Avi Kivity; David Miller; a...@arndb.de; m...@redhat.com; 
>net...@vger.kernel.org;
>kvm@vger.kernel.org; linux-ker...@vger.kernel.org
>Subject: RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
>kernel
>
>On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote:
>> I think what David said is what we have thought about before in the mp
>> device. Since we are not sure of the exact time the tx buffer was written
>> through the DMA operation, the deadline is when the tx buffer is freed. So
>> we only notify the vhost stuff about the write when the tx buffer is
>> freed. But that deadline may be too late for performance.
>
>Have you tried it? If so what's the performance penalty you have seen by
>notifying vhost when tx buffer freed?
>

We did not try it before, as we cared more about the RX side.

>I am thinking of having a callback in the skb destructor:
>vhost_add_used_and_signal gets called when the skb is actually freed; the
>vhost vq & head need to be passed to the callback. This might require the
>vhost ring size to be at least as big as the lower device driver's.
>

That's almost the same as what we have done, except we use destructor_arg and
another callback.

>Thanks
>Shirley



RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
>From: Michael S. Tsirkin [mailto:m...@redhat.com]
>Sent: Wednesday, September 15, 2010 12:30 AM
>To: Shirley Ma
>Cc: Arnd Bergmann; Avi Kivity; Xin, Xiaohui; David Miller; 
>net...@vger.kernel.org;
>kvm@vger.kernel.org; linux-ker...@vger.kernel.org
>Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
>kernel
>
>On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
>> On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
>> > I would expect this to hurt performance significantly.
>> > We could do this for asynchronous requests only to avoid the
>> > slowdown.
>>
>> Is kiocb in sendmsg helpful here? It is not used now.
>>
>> Shirley
>
>Precisely. This is what the patch from Xin Xiaohui does.  That code
>already seems to do most of what you are trying to do, right?
>
>The main thing missing seems to be macvtap integration, so that we can fall 
>back
>on data copy if zero copy is unavailable?
>How hard would it be to basically link the mp and macvtap modules
>together to get us this functionality? Anyone?
>
Michael,
Is supporting macvtap with zero-copy through the mp device the functionality
you mentioned above?
Earlier, Shirley Ma suggested moving the zero-copy functionality into the
tun/tap device or the macvtap device. What do you think about that? I suspect
there will be a lot of duplicated code across those three drivers unless we
extract the zero-copy code into kernel APIs and vhost APIs.
Do you think that's worth doing, and would it help the current process, which
has been blocked longer than I expected?

>
>--
>MST


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Wed, 2010-09-15 at 09:50 +0800, Xin, Xiaohui wrote:
> I think what David said is what we had thought about before in the mp
> device. Since we are not sure of the exact time the tx buffer is written
> by the DMA operation, the deadline is when the tx buffer is freed, so we
> only notify the vhost stuff about the write when the tx buffer is freed.
> But that deadline is maybe too late for performance.

Have you tried it? If so, what performance penalty have you seen from
notifying vhost when the tx buffer is freed?

I am thinking of having a callback in the skb destructor, so that
vhost_add_used_and_signal gets updated when the skb is actually freed; the
vhost vq & head need to be passed to the callback. This might require the
vhost ring size to be at least as big as that of the lower device driver.

Thanks
Shirley



RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
>From: Arnd Bergmann [mailto:a...@arndb.de]
>Sent: Tuesday, September 14, 2010 11:21 PM
>To: Shirley Ma
>Cc: Avi Kivity; David Miller; m...@redhat.com; Xin, Xiaohui; 
>net...@vger.kernel.org;
>kvm@vger.kernel.org; linux-ker...@vger.kernel.org
>Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
>kernel
>
>On Tuesday 14 September 2010, Shirley Ma wrote:
>> On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
>>
>> > That's what io_submit() is for.  Then io_getevents() tells you what
>> > "a
>> > while" actually was.
>>
>> This macvtap zero copy uses iov buffers from vhost ring, which is
>> allocated from guest kernel. In host kernel, vhost calls macvtap
>> sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
>> pages for zero copy.
>>
>> The patch is relying on how vhost handle these buffers. I need to look
>> at vhost code (qemu) first for addressing the questions here.
>
>I guess the best solution would be to make macvtap_aio_write return
>-EIOCBQUEUED when a packet gets passed down to the adapter, and
>call aio_complete when the adapter is done with it.
>
>This would change the regular behavior of macvtap into a model where
>every write on the file blocks until the packet has left the machine,
>which gives us better flow control, but does slow down the traffic
>when we only put one packet at a time into the queue.
>
>It also allows the user to call io_submit instead of write in order
>to do an asynchronous submission as Avi was suggesting.
>

But currently, this patch communicates with vhost-net, which is almost
entirely on the kernel side. If it used the aio stuff, it would have to
communicate with a userspace backend.
 
>   Arnd


RE: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Xin, Xiaohui
>From: Shirley Ma [mailto:mashi...@us.ibm.com]
>Sent: Tuesday, September 14, 2010 11:05 PM
>To: Avi Kivity
>Cc: David Miller; a...@arndb.de; m...@redhat.com; Xin, Xiaohui; 
>net...@vger.kernel.org;
>kvm@vger.kernel.org; linux-ker...@vger.kernel.org
>Subject: Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host 
>kernel
>
>On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
>> >> +base = (unsigned long)from->iov_base + offset1;
>> >> +size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
>> >> +num_pages = get_user_pages_fast(base, size, 0, &page[i]);
>> >> +if ((num_pages != size) ||
>> >> +    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
>> >> +/* put_page is in skb free */
>> >> +return -EFAULT;
>> > What keeps the user from writing to these pages in it's address
>> space
>> > after the write call returns?
>> >
>> > A write() return of success means:
>> >
>> >   "I wrote what you gave to me"
>> >
>> > not
>> >
>> >   "I wrote what you gave to me, oh and BTW don't touch these
>> >   pages for a while."
>> >
>> > In fact "a while" isn't even defined in any way, as there is no way
>> > for the write() invoker to know when the networking card is done
>> with
>> > those pages.
>>
>> That's what io_submit() is for.  Then io_getevents() tells you what
>> "a
>> while" actually was.
>
>This macvtap zero copy uses iov buffers from vhost ring, which is
>allocated from guest kernel. In host kernel, vhost calls macvtap
>sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
>pages for zero copy.
>
>The patch is relying on how vhost handle these buffers. I need to look
>at vhost code (qemu) first for addressing the questions here.
>
>Thanks
>Shirley

I think what David said is what we had thought about before in the mp device.
Since we are not sure of the exact time the tx buffer is written by the DMA
operation, the deadline is when the tx buffer is freed, so we only notify the
vhost stuff about the write when the tx buffer is freed. But that deadline is
maybe too late for performance.

Thanks
Xiaohui 



[PATCH] KVM test: Fix command to install virtio driver msi install package

2010-09-14 Thread Lucas Meneghel Rodrigues
If we are going to use the msi package, we have to tell windows
to install it using msiexec /passive /package, so the msi is
installed without asking questions. Only executing the msi
will prompt questions to the user, which we clearly don't want.

Changing it in the default config file so people won't lose
time trying to figure out why the msi is not being installed.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/tests_base.cfg.sample |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index 22e9494..e9d4b4e 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -1294,7 +1294,7 @@ variants:
 #drive_index_virtiocd = 3
 #virtio_floppy = /usr/share/virtio-win/virtio-drivers.vfd
 # Some flavors of the virtio drivers have an msi installer
-#virtio_network_installer = F:\\RHEV-Network.msi
+#virtio_network_installer = ' msiexec /passive /package F:\RHEV-Network64.msi'
 migrate:
 migration_test_command = ver && vol
 migration_bg_command = start ping -t localhost
-- 
1.7.2.2



Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization

2010-09-14 Thread Zachary Amsden

On 09/14/2010 12:26 PM, Jan Kiszka wrote:
> Am 14.09.2010 21:32, Zachary Amsden wrote:
>> On 09/14/2010 12:40 AM, Jan Kiszka wrote:
>>> Am 14.09.2010 11:27, Avi Kivity wrote:
>>>> On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>>>>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>>>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>>>>> running.  This causes us to require resynchronization.  Since
>>>>>> we can't tell when this may potentially happen, we assume the
>>>>>> worst by forcing re-compensation for it at every point the VCPU
>>>>>> task is descheduled.
>>>>>>
>>>>>> Signed-off-by: Zachary Amsden
>>>>>> ---
>>>>>>  arch/x86/kvm/x86.c |2 +-
>>>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index 7fc4a55..52b6c21 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>>>  }
>>>>>>
>>>>>>  kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>>>> -if (unlikely(vcpu->cpu != cpu)) {
>>>>>> +if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>>>>  /* Make sure TSC doesn't go backwards */
>>>>>>  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>>>>  native_read_tsc() - vcpu->arch.last_host_tsc;
>>>>> For yet unknown reason, this commit breaks Linux guests here if they
>>>>> are started with only a single VCPU. They hang during boot, obviously
>>>>> no longer receiving interrupts.
>>>>>
>>>>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
>>>>> effect of the wrapping, though I cannot imagine how.
>>>>>
>>>>> Anyone any ideas?
>>>> Most likely, time went backwards, and some 'future - past' calculation
>>>> resulted in a negative sleep value which was then interpreted as
>>>> unsigned and resulted in a 2342525634 year sleep.
>>> Looks like that's the case on first glance at the apic state.
>> This compensation effectively nulls the delta between current and last TSC:
>>
>>     if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>         /* Make sure TSC doesn't go backwards */
>>         s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>                 native_read_tsc() - vcpu->arch.last_host_tsc;
>>         if (tsc_delta < 0)
>>             mark_tsc_unstable("KVM discovered backwards TSC");
>>         if (check_tsc_unstable())
>>             kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>>         kvm_migrate_timers(vcpu);
>>         vcpu->cpu = cpu;
>>
>> If TSC has advanced quite a bit due to a TSC jump during sleep(*), it
>> will adjust the offset backwards to compensate; similarly, if it has
>> gone backwards, it will advance the offset.
>>
>> In neither case should the visible TSC go backwards, assuming
>> last_host_tsc is recorded properly, and so kvmclock should be similarly
>> unaffected.
>>
>> Perhaps the guest is more intelligent than we hope, and is comparing two
>> different clocks: kvmclock or TSC with the rate of PIT interrupts.  This
>> could result in negative arithmetic being interpreted as unsigned.  Are
>> you using PIT interrupt reinjection on this guest or passing
>> -no-kvm-pit-reinjection?
>>
>>>> Does your guest use kvmclock, tsc, or some other time source?
>>> A kernel that has kvmclock support even hangs in SMP mode. The others
>>> pick hpet or acpi_pm. TSC is considered unstable.
>> SMP mode here has always and will always be unreliable.  Are you running
>> on an Intel or AMD CPU?  The origin of this code comes from a workaround
>> for (*) in vendor-specific code, and perhaps it is inappropriate for both.
> I'm on a fairly new Intel i7 (M 620). And I accidentally rebooted my box
> a few hours ago. Well, the issue is gone now...
>
> So I looked into the system logs and found this:
>
> [18446744053.434939] PM: resume of devices complete after 4379.595 msecs
> [18446744053.457133] PM: Finishing wakeup.
> [18446744053.457135] Restarting tasks ...
> [0.000999] Marking TSC unstable due to KVM discovered backwards TSC
> [270103.974668] done.
>
> From that point on the box was on hpet, including the time I did the
> failing tests this morning. The kvm-kmod version loaded at this point
> was based on kvm.git df549cfc.
>
> But my /proc/cpuinfo claims "constant_tsc", and Linux is generally happy
> with using it as clock source. Does this tell you anything?

Yes, quite a bit.

It's possible that marking the TSC unstable with an actively running VM 
causes a boundary condition that I had not accounted for.  It's also 
possible that the clocksource switch triggered some bad behavior.


This suggests two debugging techniques: I can manually switch the 
clocksource, and I can also load a module which does nothing other than 
mark the TSC unstable.  Failing that, we can investigate PM suspend / 
resume for possible issues.


I'll try this on my Intel boxes to see what happens.

Re: [PATCH 2/2] KVM: MMU: Use base_role.nxe for mmu.nx

2010-09-14 Thread Marcelo Tosatti
On Tue, Sep 14, 2010 at 05:46:13PM +0200, Joerg Roedel wrote:
> This patch removes the mmu.nx field and uses the equivalent
> field mmu.base_role.nxe instead.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/include/asm/kvm_host.h |2 --
>  arch/x86/kvm/mmu.c  |   27 +--
>  arch/x86/kvm/paging_tmpl.h  |4 ++--
>  arch/x86/kvm/x86.c  |3 ---
>  4 files changed, 15 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8a83177..50506be 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -259,8 +259,6 @@ struct kvm_mmu {
>   u64 *lm_root;
>   u64 rsvd_bits_mask[2][4];
>  
> - bool nx;
> -
>   u64 pdptrs[4]; /* pae */
>  };
>  
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 3ce56bf..21d2983 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -238,7 +238,7 @@ static int is_cpuid_PSE36(void)
>  
>  static int is_nx(struct kvm_vcpu *vcpu)
>  {
> - return vcpu->arch.efer & EFER_NX;
> + return !!(vcpu->arch.efer & EFER_NX);
>  }
>  
>  static int is_shadow_present_pte(u64 pte)
> @@ -2634,7 +2634,7 @@ static int nonpaging_init_context(struct kvm_vcpu *vcpu,
>   context->shadow_root_level = PT32E_ROOT_LEVEL;
>   context->root_hpa = INVALID_PAGE;
>   context->direct_map = true;
> - context->nx = false;
> + context->base_role.nxe = 0;
>   return 0;
>  }
>  
> @@ -2688,7 +2688,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
>   int maxphyaddr = cpuid_maxphyaddr(vcpu);
>   u64 exb_bit_rsvd = 0;
>  
> - if (!context->nx)
> + if (!context->base_role.nxe)
>   exb_bit_rsvd = rsvd_bits(63, 63);
>   switch (level) {
>   case PT32_ROOT_LEVEL:
> @@ -2747,7 +2747,7 @@ static int paging64_init_context_common(struct kvm_vcpu 
> *vcpu,
>   struct kvm_mmu *context,
>   int level)
>  {
> - context->nx = is_nx(vcpu);
> + context->base_role.nxe = is_nx(vcpu);
>  
>   reset_rsvds_bits_mask(vcpu, context, level);
>  
> @@ -2775,7 +2775,7 @@ static int paging64_init_context(struct kvm_vcpu *vcpu,
>  static int paging32_init_context(struct kvm_vcpu *vcpu,
>struct kvm_mmu *context)
>  {
> - context->nx = false;
> + context->base_role.nxe = 0;
>  
>   reset_rsvds_bits_mask(vcpu, context, PT32_ROOT_LEVEL);
>  
> @@ -2815,24 +2815,23 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
>   context->set_cr3 = kvm_x86_ops->set_tdp_cr3;
>   context->get_cr3 = get_cr3;
>   context->inject_page_fault = kvm_inject_page_fault;
> - context->nx = is_nx(vcpu);
>  
>   if (!is_paging(vcpu)) {
> - context->nx = false;
> + context->base_role.nxe = 0;
>   context->gva_to_gpa = nonpaging_gva_to_gpa;
>   context->root_level = 0;
>   } else if (is_long_mode(vcpu)) {
> - context->nx = is_nx(vcpu);
> + context->base_role.nxe = is_nx(vcpu);
>   reset_rsvds_bits_mask(vcpu, context, PT64_ROOT_LEVEL);
>   context->gva_to_gpa = paging64_gva_to_gpa;
>   context->root_level = PT64_ROOT_LEVEL;
>   } else if (is_pae(vcpu)) {
> - context->nx = is_nx(vcpu);
> + context->base_role.nxe = is_nx(vcpu);
>   reset_rsvds_bits_mask(vcpu, context, PT32E_ROOT_LEVEL);
>   context->gva_to_gpa = paging64_gva_to_gpa;
>   context->root_level = PT32E_ROOT_LEVEL;
>   } else {
> - context->nx = false;
> + context->base_role.nxe = 0;
>   reset_rsvds_bits_mask(vcpu, context, PT32_ROOT_LEVEL);
>   context->gva_to_gpa = paging32_gva_to_gpa;
>   context->root_level = PT32_ROOT_LEVEL;

For tdp it is better to set base_role.nxe to zero; otherwise duplicate tdp
pagetables can be created if the guest switches between nx/non-nx.



Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization

2010-09-14 Thread Jan Kiszka
Am 14.09.2010 21:32, Zachary Amsden wrote:
> On 09/14/2010 12:40 AM, Jan Kiszka wrote:
>> Am 14.09.2010 11:27, Avi Kivity wrote:
>>> On 09/14/2010 11:10 AM, Jan Kiszka wrote:
>>>> Am 20.08.2010 10:07, Zachary Amsden wrote:
>>>>> When CPUs with unstable TSCs enter deep C-state, TSC may stop
>>>>> running.  This causes us to require resynchronization.  Since
>>>>> we can't tell when this may potentially happen, we assume the
>>>>> worst by forcing re-compensation for it at every point the VCPU
>>>>> task is descheduled.
>>>>>
>>>>> Signed-off-by: Zachary Amsden
>>>>> ---
>>>>>  arch/x86/kvm/x86.c |2 +-
>>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index 7fc4a55..52b6c21 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>>  }
>>>>>
>>>>>  kvm_x86_ops->vcpu_load(vcpu, cpu);
>>>>> -if (unlikely(vcpu->cpu != cpu)) {
>>>>> +if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>>>>>  /* Make sure TSC doesn't go backwards */
>>>>>  s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>>>>>  native_read_tsc() - vcpu->arch.last_host_tsc;
>>>> For yet unknown reason, this commit breaks Linux guests here if they
>>>> are started with only a single VCPU. They hang during boot, obviously
>>>> no longer receiving interrupts.
>>>>
>>>> I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
>>>> effect of the wrapping, though I cannot imagine how.
>>>>
>>>> Anyone any ideas?
>>> Most likely, time went backwards, and some 'future - past' calculation
>>> resulted in a negative sleep value which was then interpreted as
>>> unsigned and resulted in a 2342525634 year sleep.
>> Looks like that's the case on first glance at the apic state.
> This compensation effectively nulls the delta between current and last TSC:
>
>     if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
>         /* Make sure TSC doesn't go backwards */
>         s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
>                 native_read_tsc() - vcpu->arch.last_host_tsc;
>         if (tsc_delta < 0)
>             mark_tsc_unstable("KVM discovered backwards TSC");
>         if (check_tsc_unstable())
>             kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
>         kvm_migrate_timers(vcpu);
>         vcpu->cpu = cpu;
>
> If TSC has advanced quite a bit due to a TSC jump during sleep(*), it
> will adjust the offset backwards to compensate; similarly, if it has
> gone backwards, it will advance the offset.
>
> In neither case should the visible TSC go backwards, assuming
> last_host_tsc is recorded properly, and so kvmclock should be similarly
> unaffected.
>
> Perhaps the guest is more intelligent than we hope, and is comparing two
> different clocks: kvmclock or TSC with the rate of PIT interrupts.  This
> could result in negative arithmetic being interpreted as unsigned.  Are
> you using PIT interrupt reinjection on this guest or passing
> -no-kvm-pit-reinjection?
>
>>> Does your guest use kvmclock, tsc, or some other time source?
>> A kernel that has kvmclock support even hangs in SMP mode. The others
>> pick hpet or acpi_pm. TSC is considered unstable.
> SMP mode here has always and will always be unreliable.  Are you running
> on an Intel or AMD CPU?  The origin of this code comes from a workaround
> for (*) in vendor-specific code, and perhaps it is inappropriate for both.

I'm on a fairly new Intel i7 (M 620). And I accidentally rebooted my box
a few hours ago. Well, the issue is gone now...

So I looked into the system logs and found this:

[18446744053.434939] PM: resume of devices complete after 4379.595 msecs
[18446744053.457133] PM: Finishing wakeup.
[18446744053.457135] Restarting tasks ...
[0.000999] Marking TSC unstable due to KVM discovered backwards TSC
[270103.974668] done.

From that point on the box was on hpet, including the time I did the
failing tests this morning. The kvm-kmod version loaded at this point
was based on kvm.git df549cfc.

But my /proc/cpuinfo claims "constant_tsc", and Linux is generally happy
with using it as clock source. Does this tell you anything?

Jan





[PATCH 05/18] KVM Test: Add a common ping module for network related tests

2010-09-14 Thread Lucas Meneghel Rodrigues
kvm_net_utils.py is just a place that wraps common network-related
commands used by the network-related tests.
Use -1 as the packet loss ratio when it cannot be determined.
Use quiet mode when doing the flood ping.

Changes from v1:
- Use None to indicate that the session should be local in raw_ping
- Use session.sendline("\003") to send (ctrl+c) signal
- Use None to indicate that the session should be local
- Fix of coding style

Signed-off-by: Jason Wang 
Signed-off-by: Amos Kong 
---
 client/tests/kvm/kvm_test_utils.py |  112 +++-
 1 files changed, 111 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/kvm_test_utils.py 
b/client/tests/kvm/kvm_test_utils.py
index 5412aac..1e4bd9d 100644
--- a/client/tests/kvm/kvm_test_utils.py
+++ b/client/tests/kvm/kvm_test_utils.py
@@ -21,7 +21,7 @@ More specifically:
 @copyright: 2008-2009 Red Hat Inc.
 """
 
-import time, os, logging, re, commands
+import time, os, logging, re, commands, signal
 from autotest_lib.client.common_lib import error
 from autotest_lib.client.bin import utils
 import kvm_utils, kvm_vm, kvm_subprocess, scan_results
@@ -505,3 +505,113 @@ def run_autotest(vm, session, control_path, timeout, 
outputdir):
 e_msg = ("Tests %s failed during control file execution" %
  " ".join(bad_results))
 raise error.TestFail(e_msg)
+
+
+def get_loss_ratio(output):
+"""
+Get the packet loss ratio from the output of ping
+.
+@param output: Ping output.
+"""
+try:
+return int(re.findall('(\d+)% packet loss', output)[0])
+except IndexError:
+logging.debug(output)
+return -1
+
+
+def raw_ping(command, timeout, session, output_func):
+"""
+Low-level ping command execution.
+
+@param command: Ping command.
+@param timeout: Timeout of the ping command.
+@param session: Local execution hint or session to execute the ping command.
+"""
+if session is None:
+process = kvm_subprocess.run_bg(command, output_func=output_func,
+timeout=timeout)
+
+# Send SIGINT signal to notify the timeout of running ping process,
+# Because ping have the ability to catch the SIGINT signal so we can
+# always get the packet loss ratio even if timeout.
+if process.is_alive():
+kvm_utils.kill_process_tree(process.get_pid(), signal.SIGINT)
+
+status = process.get_status()
+output = process.get_output()
+
+process.close()
+return status, output
+else:
+session.sendline(command)
+status, output = session.read_up_to_prompt(timeout=timeout,
+   print_func=output_func)
+if not status:
+# Send ctrl+c (SIGINT) through ssh session
+session.send("\003")
+status, output2 = session.read_up_to_prompt(print_func=output_func)
+output += output2
+if not status:
+# We also need to use this session to query the return value
+session.send("\003")
+
+session.sendline(session.status_test_command)
+s2, o2 = session.read_up_to_prompt()
+if not s2:
+status = -1
+else:
+try:
+status = int(re.findall("\d+", o2)[0])
+except:
+status = -1
+
+return status, output
+
+
+def ping(dest=None, count=None, interval=None, interface=None,
+ packetsize=None, ttl=None, hint=None, adaptive=False,
+ broadcast=False, flood=False, timeout=0,
+ output_func=logging.debug, session=None):
+"""
+Wrapper of ping.
+
+@param dest: Destination address.
+@param count: Count of icmp packet.
+@param interval: Interval of two icmp echo request.
+@param interface: Specified interface of the source address.
+@param packetsize: Packet size of icmp.
+@param ttl: IP time to live.
+@param hint: Path mtu discovery hint.
+@param adaptive: Adaptive ping flag.
+@param broadcast: Broadcast ping flag.
+@param flood: Flood ping flag.
+@param timeout: Timeout for the ping command.
+@param output_func: Function used to log the result of ping.
+@param session: Local execution hint or session to execute the ping command.
+"""
+if dest is not None:
+command = "ping %s " % dest
+else:
+command = "ping localhost "
+if count is not None:
+command += " -c %s" % count
+if interval is not None:
+command += " -i %s" % interval
+if interface is not None:
+command += " -I %s" % interface
+if packetsize is not None:
+command += " -s %s" % packetsize
+if ttl is not None:
+command += " -t %s" % ttl
+if hint is not None:
+command += " -M %s" % hint
+if adaptive:
+command += " -A"
+if broadcast:
+command += " -b"
+if flood:

[PATCH 11/18] KVM test: Add a subtest of multicast

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

Use 'ping' to test sending/receiving multicast packets. A flood ping test is
also added. Limit the guest network to 'bridge' mode, because multicast
packets cannot be transmitted to the guest when using 'user' mode networking.
Add join_mcast.py for joining the machine into multicast groups.

Changes from v1:
- Just flush the firewall rules with iptables -F

Signed-off-by: Amos Kong 
---
 client/tests/kvm/scripts/join_mcast.py |   37 ++
 client/tests/kvm/tests/multicast.py|   83 
 client/tests/kvm/tests_base.cfg.sample |9 +++-
 3 files changed, 128 insertions(+), 1 deletions(-)
 create mode 100755 client/tests/kvm/scripts/join_mcast.py
 create mode 100644 client/tests/kvm/tests/multicast.py

diff --git a/client/tests/kvm/scripts/join_mcast.py 
b/client/tests/kvm/scripts/join_mcast.py
new file mode 100755
index 000..350cd5f
--- /dev/null
+++ b/client/tests/kvm/scripts/join_mcast.py
@@ -0,0 +1,37 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+import socket, struct, os, signal, sys
+
+"""
+Script used to join machine into multicast groups.
+
+...@author Amos Kong 
+"""
+
+if __name__ == "__main__":
+if len(sys.argv) < 4:
+print """%s [mgroup_count] [prefix] [suffix]
+mgroup_count: count of multicast addresses
+prefix: multicast address prefix
+suffix: multicast address suffix""" % sys.argv[0]
+sys.exit()
+
+mgroup_count = int(sys.argv[1])
+prefix = sys.argv[2]
+suffix = int(sys.argv[3])
+
+s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+for i in range(mgroup_count):
+mcast = prefix + "." + str(suffix + i)
+try:
+mreq = struct.pack("4sl", socket.inet_aton(mcast),
+   socket.INADDR_ANY)
+s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
+except:
+s.close()
+print "Could not join multicast: %s" % mcast
+raise
+
+print "join_mcast_pid:%s" % os.getpid()
+os.kill(os.getpid(), signal.SIGSTOP)
+s.close()
diff --git a/client/tests/kvm/tests/multicast.py 
b/client/tests/kvm/tests/multicast.py
new file mode 100644
index 000..00261f7
--- /dev/null
+++ b/client/tests/kvm/tests/multicast.py
@@ -0,0 +1,83 @@
+import logging, os, re
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+import kvm_test_utils
+
+
+def run_multicast(test, params, env):
+"""
+Test multicast function of nic (rtl8139/e1000/virtio)
+
+1) Create a VM.
+2) Join guest into multicast groups.
+3) Ping multicast addresses on host.
+4) Flood ping test with different size of packets.
+5) Final ping test and check for packet loss.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+session = kvm_test_utils.wait_for_login(vm,
+  timeout=int(params.get("login_timeout", 
360)))
+
+# flush the firewall rules
+cmd = "iptables -F; test -e /selinux/enforce && echo 0 >/selinux/enforce"
+session.get_command_status(cmd)
+utils.run(cmd)
+# make sure guest replies to broadcasts
+cmd_broadcast = "echo 0 > /proc/sys/net/ipv4/icmp_echo_ignore"
+cmd_broadcast_2 = "echo 0 > /proc/sys/net/ipv4/icmp_echo_ignore_all"
+s1 = session.get_command_status(cmd_broadcast)
+s2 = session.get_command_status(cmd_broadcast_2)
+if s1 != 0:
+logging.warning("Command %s return exit status %s", cmd_broadcast, s1)
+if s2 != 0:
+logging.warning("Command %s return exit status %s", cmd_broadcast_2, 
s2)
+
+# base multicast address
+mcast = params.get("mcast", "225.0.0.1")
+# count of multicast addresses, less than 20
+mgroup_count = int(params.get("mgroup_count", 5))
+flood_minutes = float(params.get("flood_minutes", 10))
+ifname = vm.get_ifname()
+prefix = re.findall("\d+.\d+.\d+", mcast)[0]
+suffix = int(re.findall("\d+", mcast)[-1])
+# copy python script to guest for joining guest to multicast groups
+mcast_path = os.path.join(test.bindir, "scripts/join_mcast.py")
+if not vm.copy_files_to(mcast_path, "/tmp"):
+raise error.TestError("Fail to copy %s to guest" % mcast_path)
+output = session.get_command_output("python /tmp/join_mcast.py %d %s %d" %
+(mgroup_count, prefix, suffix))
+
+# if success to join multicast, the process will be paused, and return PID.
+try:
+pid = re.findall("join_mcast_pid:(\d+)", output)[0]
+except IndexError:
+raise error.TestFail("Can't join multicast groups,output:%s" % output)
+
+try:
+for i in range(mgroup_count):
+new_suffix = suffix + i
+mcast = "%s.%d" % (prefix, new_suffix)
+
+logging.info("Initial ping test, mc

[PATCH 18/18] KVM test: Add subtest of testing offload by ethtool

2010-09-14 Thread Lucas Meneghel Rodrigues
The latest case contains TX/RX/SG/TSO/GSO/GRO/LRO tests.
The RTL8139 NIC doesn't support TSO or LRO (it's too old),
so drop the offload test from rtl8139. LRO and GRO are only
supported by the latest kernels, and the virtio nic doesn't
support the receive offloading function.

Initialize the callbacks first and execute all the sub
tests one by one; all the results will be checked at the
end. When executing this test, vhost should be enabled,
so that most of the new features can be used. Vhost doesn't
support VIRTIO_NET_F_MRG_RXBUF, so do not check large
packets in the receive offload test.

Transfer files by scp between host and guest, and match the
newly opened TCP port with netstat. Capture the packet
info with tcpdump; it contains the packet length.
Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/ethtool.py  |  215 
 client/tests/kvm/tests_base.cfg.sample |   13 ++-
 2 files changed, 226 insertions(+), 2 deletions(-)
 create mode 100644 client/tests/kvm/tests/ethtool.py

diff --git a/client/tests/kvm/tests/ethtool.py 
b/client/tests/kvm/tests/ethtool.py
new file mode 100644
index 000..c0bab12
--- /dev/null
+++ b/client/tests/kvm/tests/ethtool.py
@@ -0,0 +1,215 @@
+import logging, commands, re
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+import kvm_test_utils, kvm_utils
+
+def run_ethtool(test, params, env):
+"""
+Test offload functions of ethernet device by ethtool
+
+1) Log into a guest.
+2) Initialize the callback of sub functions.
+3) Enable/disable sub function of NIC.
+4) Execute callback function.
+5) Check the return value.
+6) Restore original configuration.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+def ethtool_get(type):
+feature_pattern = {
+'tx':  'tx.*checksumming',
+'rx':  'rx.*checksumming',
+'sg':  'scatter.*gather',
+'tso': 'tcp.*segmentation.*offload',
+'gso': 'generic.*segmentation.*offload',
+'gro': 'generic.*receive.*offload',
+'lro': 'large.*receive.*offload',
+}
+s, o = session.get_command_status_output("ethtool -k %s" % ethname)
+try:
+return re.findall("%s: (.*)" % feature_pattern.get(type), o)[0]
+except IndexError:
+logging.debug("Could not get %s status" % type)
+
+
+def ethtool_set(type, status):
+"""
+Set ethernet device offload status
+
+@param type: Offload type name
+@param status: New status will be changed to
+"""
+logging.info("Try to set %s %s" % (type, status))
+if status not in ["off", "on"]:
+return False
+cmd = "ethtool -K %s %s %s" % (ethname, type, status)
+if ethtool_get(type) == status:
+return True
+# Apply the change, then re-read the status to verify it took effect
+session.get_command_status(cmd)
+if ethtool_get(type) != status:
+logging.error("Fail to set %s %s" % (type, status))
+return False
+return True
+
+
+def ethtool_save_params():
+logging.info("Save ethtool configuration")
+for i in supported_features:
+feature_status[i] = ethtool_get(i)
+
+
+def ethtool_restore_params():
+logging.info("Restore ethtool configuration")
+for i in supported_features:
+ethtool_set(i, feature_status[i])
+
+
+def compare_md5sum(name):
+logging.info("Compare md5sum of the files on guest and host")
+host_result = utils.hash_file(name, method="md5")
+try:
+o = session.get_command_output("md5sum %s" % name)
+guest_result = re.findall("\w+", o)[0]
+except IndexError:
+logging.error("Could not get file md5sum in guest")
+return False
+logging.debug("md5sum: guest(%s), host(%s)" %
+  (guest_result, host_result))
+return guest_result == host_result
+
+
+def transfer_file(src="guest"):
+"""
+Transfer file by scp, use tcpdump to capture packets, then check the
+return string.
+
+@param src: Source host of transfer file
+@return: Tuple (status, error msg/tcpdump result)
+"""
+session2.get_command_status("rm -rf %s" % filename)
+dd_cmd = "dd if=/dev/urandom of=%s bs=1M count=%s" % (filename,
+   params.get("filesize"))
+logging.info("Create file on source host, cmd: %s" % dd_cmd)
+tcpdump_cmd = "tcpdump -lep -s 0 tcp -vv port ssh"
+if src == "guest":
+s = session.get_command_status(dd_cmd, timeout=360)
+tcpdump_cmd += " and src %s" % guest_ip
+copy_files_fun = vm.copy_files_from
+else:
+s, o = commands.getstatusoutput(dd_cmd)
+tcpdump_cmd += " and ds

[PATCH 12/18] KVM test: Add a subtest of pxe

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

This case just snoops TFTP packets through tcpdump. It depends on a public
DHCP server; it would be better to test through dnsmasq.

FIXME: Use dnsmasq for pxe test

Signed-off-by: Jason Wang 
Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/pxe.py  |   31 +++
 client/tests/kvm/tests_base.cfg.sample |   13 +
 2 files changed, 44 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/pxe.py

diff --git a/client/tests/kvm/tests/pxe.py b/client/tests/kvm/tests/pxe.py
new file mode 100644
index 000..ec9a549
--- /dev/null
+++ b/client/tests/kvm/tests/pxe.py
@@ -0,0 +1,31 @@
+import logging
+from autotest_lib.client.common_lib import error
+import kvm_subprocess, kvm_test_utils
+
+
+def run_pxe(test, params, env):
+"""
+PXE test:
+
+1) Snoop the tftp packet in the tap device.
+2) Wait for some seconds.
+3) Check whether we could capture TFTP packets.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+timeout = int(params.get("pxe_timeout", 60))
+
+logging.info("Try to boot from PXE")
+status, output = kvm_subprocess.run_fg("tcpdump -nli %s" % vm.get_ifname(),
+   logging.debug,
+   "(pxe capture) ",
+   timeout)
+
+logging.info("Analyzing the tcpdump result...")
+if not "tftp" in output:
+raise error.TestFail("Couldn't find any TFTP packets after %s seconds" %
+ timeout)
+logging.info("Found TFTP packet")
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index 19750ed..6bbc33e 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -501,6 +501,19 @@ variants:
 mgroup_count = 20
 flood_minutes = 1
 
+- pxe:
+type = pxe
+images = pxe
+image_name_pxe = pxe-test
+image_size_pxe = 1G
+force_create_image_pxe = yes
+remove_image_pxe = yes
+extra_params += ' -boot n'
+kill_vm_on_error = yes
+network = bridge
+restart_vm = yes
+pxe_timeout = 60
+
 - physical_resources_check: install setup unattended_install.cdrom
 type = physical_resources_check
 catch_uuid_cmd = dmidecode | awk -F: '/UUID/ {print $2}'
-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
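The pattern in run_pxe — run a capture command for a fixed time, then grep its output for "tftp" — can be sketched standalone. This is a hedged sketch using only the standard library (the real test uses kvm_subprocess.run_fg, not this hypothetical `snoop` helper):

```python
import subprocess

def snoop(cmd, timeout):
    """Run a capture command (e.g. tcpdump), stop it after `timeout`
    seconds, and return whatever it printed."""
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    try:
        out, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # tcpdump runs until killed; a timeout is the normal way to stop it
        proc.kill()
        out, _ = proc.communicate()
    return out.decode(errors="replace")

# echo stands in for tcpdump; a TFTP read request would appear like this:
output = snoop("echo 'tftp 25 RRQ pxelinux.0'", 5)
print("tftp" in output)  # True
```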


[PATCH 17/18] KVM test: vlan subtest - Replace extra_params '-snapshot' with image_snapshot

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

The framework cannot combine the default extra_params with extra_params_vm1 in
the following condition; this is difficult to implement when parsing the config
file or when calling get_sub_dict*().

extra_params += ' str1'
- case:
extra_params_vm1 += " str2"

Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests_base.cfg.sample |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index f53b3f7..b543606 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -468,8 +468,7 @@ variants:
 send_cmd = "nc %s %s < %s"
 nic_mode = tap
 vms += " vm2"
-extra_params_vm1 += " -snapshot"
-extra_params_vm2 += " -snapshot"
+image_snapshot = yes
 kill_vm_vm2 = yes
 kill_vm_gracefully_vm2 = no
 
-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 16/18] KVM test: Improve vlan subtest

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

This is an enhancement of the existing vlan test. Rename vlan_tag.py to
vlan.py, which is a more fitting name.
. Set arp_ignore through "/proc/sys/net/ipv4/conf/all/arp_ignore"
. Multiple vlans exist simultaneously
. Test ping between same and different vlans
. Test by TCP data transfer, flood ping between same vlan
. Maximal plumb/unplumb vlans

Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/vlan.py |  186 
 client/tests/kvm/tests/vlan_tag.py |   68 
 client/tests/kvm/tests_base.cfg.sample |   16 ++-
 3 files changed, 196 insertions(+), 74 deletions(-)
 create mode 100644 client/tests/kvm/tests/vlan.py
 delete mode 100644 client/tests/kvm/tests/vlan_tag.py

diff --git a/client/tests/kvm/tests/vlan.py b/client/tests/kvm/tests/vlan.py
new file mode 100644
index 000..fb2a8d7
--- /dev/null
+++ b/client/tests/kvm/tests/vlan.py
@@ -0,0 +1,186 @@
+import logging, time, re
+from autotest_lib.client.common_lib import error
+import kvm_test_utils, kvm_utils
+
+def run_vlan(test, params, env):
+"""
+Test 802.1Q vlan of NIC, configure it with the vconfig command.
+
+1) Create two VMs.
+2) Setup guests in 10 different vlans by vconfig and using hard-coded
+   ip address.
+3) Test by ping between same and different vlans of two VMs.
+4) Test by TCP data transfer, flood ping between same vlan of two VMs.
+5) Test maximal plumb/unplumb vlans.
+6) Recover the vlan config.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+
+vm = []
+session = []
+vm_ip = []
+digest_origin = []
+vlan_ip = ['', '']
+ip_unit = ['1', '2']
+subnet = params.get("subnet")
+vlan_num = int(params.get("vlan_num"))
+maximal = int(params.get("maximal"))
+file_size = params.get("file_size")
+
+vm.append(kvm_test_utils.get_living_vm(env, params.get("main_vm")))
+vm.append(kvm_test_utils.get_living_vm(env, "vm2"))
+
+def add_vlan(session, id, iface="eth0"):
+if session.get_command_status("vconfig add %s %s" % (iface, id)) != 0:
+raise error.TestError("Fail to add %s.%s" % (iface, id))
+
+
+def set_ip_vlan(session, id, ip, iface="eth0"):
+iface = "%s.%s" % (iface, id)
+if session.get_command_status("ifconfig %s %s" % (iface, ip)) != 0:
+raise error.TestError("Fail to configure ip for %s" % iface)
+
+
+def set_arp_ignore(session, iface="eth0"):
+ignore_cmd = "echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore"
+if session.get_command_status(ignore_cmd) != 0:
+raise error.TestError("Fail to set arp_ignore of %s" % session)
+
+
+def rem_vlan(session, id, iface="eth0"):
+rem_vlan_cmd = "if [[ -e /proc/net/vlan/%s ]];then vconfig rem %s;fi"
+iface = "%s.%s" % (iface, id)
+s = session.get_command_status(rem_vlan_cmd % (iface, iface))
+return s
+
+
+def nc_transfer(src, dst):
+nc_port = kvm_utils.find_free_port(1025, 5334, vm_ip[dst])
+listen_cmd = params.get("listen_cmd")
+send_cmd = params.get("send_cmd")
+
+#listen in dst
+listen_cmd = listen_cmd % (nc_port, "receive")
+session[dst].sendline(listen_cmd)
+time.sleep(2)
+#send file from src to dst
+send_cmd = send_cmd % (vlan_ip[dst], str(nc_port), "file")
+if session[src].get_command_status(send_cmd, timeout = 60) != 0:
+raise error.TestFail ("Fail to send file"
+" from vm%s to vm%s" % (src+1, dst+1))
+s, o = session[dst].read_up_to_prompt(timeout=60)
+if s != True:
+raise error.TestFail ("Fail to receive file"
+" from vm%s to vm%s" % (src+1, dst+1))
+#check MD5 message digest of receive file in dst
+output = session[dst].get_command_output("md5sum receive").strip()
+digest_receive = re.findall(r'(\w+)', output)[0]
+if digest_receive == digest_origin[src]:
+logging.info("File successfully received in vm %s" % vlan_ip[dst])
+else:
+logging.info("digest_origin is  %s" % digest_origin[src])
+logging.info("digest_receive is %s" % digest_receive)
+raise error.TestFail("Transferred file differs from original")
+session[dst].get_command_status("rm -f receive")
+
+
+for i in range(2):
+session.append(kvm_test_utils.wait_for_login(vm[i],
+   timeout=int(params.get("login_timeout", 360))))
+if not session[i]:
+raise error.TestError("Could not log into guest(vm%d)" % i)
+logging.info("Logged in")
+
+#get guest ip
+vm_ip.append(vm[i].get_address())
+
+#produce sized file in vm
+dd_cmd = "dd if=/dev/urandom of=file bs=1024k count=%s"
+if session[i].get_command_status(dd_cmd % file_si

[PATCH 15/18] KVM test: kvm_utils - Add support of check if remote port free

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

Signed-off-by: Amos Kong 
---
 client/tests/kvm/kvm_utils.py |   23 +++
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
index bb5c868..71ab7d1 100644
--- a/client/tests/kvm/kvm_utils.py
+++ b/client/tests/kvm/kvm_utils.py
@@ -829,7 +829,7 @@ def scp_from_remote(host, port, username, password, 
remote_path, local_path,
 
 # The following are utility functions related to ports.
 
-def is_port_free(port):
+def is_port_free(port, address):
 """
 Return True if the given port is available for use.
 
@@ -838,15 +838,22 @@ def is_port_free(port):
 try:
 s = socket.socket()
 #s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
-s.bind(("localhost", port))
-free = True
+if address == "localhost":
+s.bind(("localhost", port))
+free = True
+else:
+s.connect((address, port))
+free = False
 except socket.error:
-free = False
+if address == "localhost":
+free = False
+else:
+free = True
 s.close()
 return free
 
 
-def find_free_port(start_port, end_port):
+def find_free_port(start_port, end_port, address="localhost"):
 """
 Return a host free port in the range [start_port, end_port].
 
@@ -854,12 +861,12 @@ def find_free_port(start_port, end_port):
 @param end_port: Port immediately after the last one that will be checked.
 """
 for i in range(start_port, end_port):
-if is_port_free(i):
+if is_port_free(i, address):
 return i
 return None
 
 
-def find_free_ports(start_port, end_port, count):
+def find_free_ports(start_port, end_port, count, address="localhost"):
 """
 Return count of host free ports in the range [start_port, end_port].
 
@@ -870,7 +877,7 @@ def find_free_ports(start_port, end_port, count):
 ports = []
 i = start_port
 while i < end_port and count > 0:
-if is_port_free(i):
+if is_port_free(i, address):
 ports.append(i)
 count -= 1
 i += 1
-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
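The patched helper uses two opposite success conditions: locally, a successful bind() means the port is free; remotely, a successful connect() means something is listening, so the port is busy. A self-contained sketch of that logic:

```python
import socket

def is_port_free(port, address="localhost"):
    """Local check: bind() succeeding means the port is free.
    Remote check: connect() succeeding means something is listening,
    so the port is NOT free."""
    s = socket.socket()
    try:
        if address == "localhost":
            s.bind((address, port))
            return True
        s.connect((address, port))
        return False
    except socket.error:
        # bind failure locally -> busy; connect failure remotely -> free
        return address != "localhost"
    finally:
        s.close()

# A port we are actually listening on is reported busy by both checks:
server = socket.socket()
server.bind(("localhost", 0))      # let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]
print(is_port_free(port, "127.0.0.1"))  # False: connect() succeeds
server.close()
```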


[PATCH 14/18] KVM test: Add a netperf subtest

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

Add network load by netperf: the server is launched on the guest, and the
netperf client is executed on the host with different protocols. If all clients
execute successfully, the case passes. Test results are recorded into result.txt.

Now this case only tests "TCP_RR TCP_CRR UDP_RR TCP_STREAM TCP_MAERTS
TCP_SENDFILE UDP_STREAM". DLPI is only supported on Unix, and a Unix domain
socket test is not necessary, so the DLPI and Unix domain tests are dropped.

Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/netperf.py  |   56 
 client/tests/kvm/tests_base.cfg.sample |   10 ++
 2 files changed, 66 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/netperf.py

diff --git a/client/tests/kvm/tests/netperf.py 
b/client/tests/kvm/tests/netperf.py
new file mode 100644
index 000..acdd2f8
--- /dev/null
+++ b/client/tests/kvm/tests/netperf.py
@@ -0,0 +1,56 @@
+import logging, commands, os
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+import kvm_test_utils
+
+def run_netperf(test, params, env):
+"""
+Network stress test with netperf.
+
+1) Boot up a VM.
+2) Launch netserver on guest.
+3) Execute netperf client on host with different protocols.
+4) Output the test result.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+session = kvm_test_utils.wait_for_login(vm,
+  timeout=int(params.get("login_timeout", 360)))
+netperf_dir = os.path.join(os.environ['AUTODIR'], "tests/netperf2")
+setup_cmd = params.get("setup_cmd")
+guest_ip = vm.get_address()
+result_file = os.path.join(test.resultsdir, "output_%s" % test.iteration)
+
+session.get_command_output("iptables -F")
+for i in params.get("netperf_files").split():
+if not vm.copy_files_to(os.path.join(netperf_dir, i), "/tmp"):
+raise error.TestError("Could not copy files to guest")
+if session.get_command_status(setup_cmd % "/tmp", timeout=200) != 0:
+raise error.TestFail("Fail to setup netperf on guest")
+if session.get_command_status(params.get("netserver_cmd") % "/tmp") != 0:
+raise error.TestFail("Fail to start netperf server on guest")
+
+try:
+logging.info("Setup and run netperf client on host")
+utils.run(setup_cmd % netperf_dir)
+success = True
+file(result_file, "w").write("Netperf Test Result\n")
+for i in params.get("protocols").split():
+cmd = params.get("netperf_cmd") % (netperf_dir, i, guest_ip)
+logging.debug("Execute netperf client test: %s", cmd)
+s, o = commands.getstatusoutput(cmd)
+if s != 0:
+logging.error("Fail to execute netperf test, protocol:%s", i)
+success = False
+else:
+logging.info(o)
+file(result_file, "a+").write("%s\n" % o)
+if not success:
+raise error.TestFail("Some of the netperf tests failed")
+
+finally:
+session.get_command_output("killall netserver")
+session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index a710bc0..29fe984 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -518,6 +518,16 @@ variants:
 type = mac_change
 kill_vm = yes
 
+- netperf: install setup unattended_install.cdrom
+type = netperf
+nic_mode = tap
+netperf_files = netperf-2.4.5.tar.bz2 wait_before_data.patch
+setup_cmd = "cd %s && tar xvfj netperf-2.4.5.tar.bz2 && cd netperf-2.4.5 && patch -p0 < ../wait_before_data.patch && ./configure && make"
+netserver_cmd =  %s/netperf-2.4.5/src/netserver
+# test time is 60 seconds, set the buffer size to 1 for more hardware interrupts
+netperf_cmd = %s/netperf-2.4.5/src/netperf -t %s -H %s -l 60 -- -m 1
+protocols = "TCP_STREAM TCP_MAERTS TCP_RR TCP_CRR UDP_RR TCP_SENDFILE UDP_STREAM"
+
 - physical_resources_check: install setup unattended_install.cdrom
 type = physical_resources_check
 catch_uuid_cmd = dmidecode | awk -F: '/UUID/ {print $2}'
-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
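The client loop in run_netperf — one command per protocol, collect output, fail only at the end — is a reusable pattern. A minimal sketch under stated assumptions (echo stands in for the netperf binary; `run_clients` is a hypothetical name, not the autotest API):

```python
import subprocess

def run_clients(protocols, cmd_template):
    """Run one client command per protocol; return per-protocol output
    and the list of protocols whose client exited non-zero."""
    results, failed = {}, []
    for proto in protocols:
        proc = subprocess.run(cmd_template % proto, shell=True,
                              capture_output=True, text=True)
        if proc.returncode != 0:
            failed.append(proto)
        else:
            results[proto] = proc.stdout.strip()
    return results, failed

results, failed = run_clients(["TCP_RR", "UDP_RR"], "echo result-for-%s")
print(failed)             # []
print(results["TCP_RR"])  # result-for-TCP_RR
```

Deferring the failure to the end, as the test does, means one broken protocol does not mask results for the others.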


[PATCH 08/18] KVM test: Add basic file transfer test

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

This is a basic test of transferring a file between host and guest. It
transfers a large file from host to guest, transfers it back to the host,
and then compares the files by calculating their md5 hashes.

The default file size is 4000M and the scp timeout is 1000s, which means the
test fails if the average speed is below 4M/s. We can extend this test later
by using another disk, so larger files can be transferred without being
limited by the first disk's size.

Changes from v1:
- Use md5 to verify the integrity of files
- Try to use autotest API, such as, utils.system()

Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/file_transfer.py |   55 +++
 client/tests/kvm/tests_base.cfg.sample  |7 +++-
 2 files changed, 61 insertions(+), 1 deletions(-)
 create mode 100644 client/tests/kvm/tests/file_transfer.py

diff --git a/client/tests/kvm/tests/file_transfer.py 
b/client/tests/kvm/tests/file_transfer.py
new file mode 100644
index 000..c9a3476
--- /dev/null
+++ b/client/tests/kvm/tests/file_transfer.py
@@ -0,0 +1,55 @@
+import logging, commands, re
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+import kvm_utils, kvm_test_utils
+
+def run_file_transfer(test, params, env):
+"""
+Test basic file transfer between host and guest
+
+1) Boot up a virtual machine
+2) Create a large file by dd on host
+3) Copy this file from host to guest
+4) Copy this file from guest to host
+5) Check if file transfers good
+
+@param test: Kvm test object
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+timeout=int(params.get("login_timeout", 360))
+logging.info("Trying to log into guest '%s' by serial", vm.name)
+session = kvm_utils.wait_for(lambda: vm.serial_login(),
+ timeout, 0, step=2)
+if not session:
+raise error.TestFail("Could not log into guest '%s'" % vm.name)
+
+dir = test.tmpdir
+scp_timeout = int(params.get("scp_timeout"))
+cmd = "dd if=/dev/urandom of=%s/a.out bs=1M count=%d" %  (dir, int(
+ params.get("filesize", 4000)))
+try:
+logging.info("Create file by dd command on host, cmd: %s" % cmd)
+utils.run(cmd)
+
+logging.info("Transfer file from host to guest")
+if not vm.copy_files_to("%s/a.out" % dir, "/tmp/b.out",
+timeout=scp_timeout):
+raise error.TestFail("Fail to transfer file from host to guest")
+
+logging.info("Transfer file from guest to host")
+if not vm.copy_files_from("/tmp/b.out", "%s/c.out" % dir,
+timeout=scp_timeout):
+raise error.TestFail("Fail to transfer file from guest to host")
+
+logging.debug(commands.getoutput("ls -l %s/[ac].out" % dir))
+md5_orig = utils.hash_file("%s/a.out" % dir, method="md5")
+md5_new = utils.hash_file("%s/c.out" % dir, method="md5")
+
+if md5_orig != md5_new:
+raise error.TestFail("File changed after transfer")
+finally:
+session.get_command_status("rm -f /tmp/b.out")
+utils.run("rm -f %s/[ac].out" % dir)
+session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index 70929b0..4b424da 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -477,6 +477,11 @@ variants:
 - jumbo: install setup unattended_install.cdrom
 type = jumbo
 
+- file_transfer: install setup unattended_install.cdrom
+type = file_transfer
+filesize = 4000
+scp_timeout = 1000
+
 - physical_resources_check: install setup unattended_install.cdrom
 type = physical_resources_check
 catch_uuid_cmd = dmidecode | awk -F: '/UUID/ {print $2}'
@@ -1194,7 +1199,7 @@ variants:
 
 # Windows section
 - @Windows:
-no autotest linux_s3 vlan_tag ioquit unattended_install.(url|nfs|remote_ks) jumbo
+no autotest linux_s3 vlan_tag ioquit unattended_install.(url|nfs|remote_ks) jumbo file_transfer
 shutdown_command = shutdown /s /f /t 0
 reboot_command = shutdown /r /f /t 0
 status_test_command = echo %errorlevel%
-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
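The integrity check above (utils.hash_file on the host, md5sum inside the guest) boils down to hashing both copies and comparing. A self-contained equivalent using hashlib, with a local copy standing in for the scp round trip:

```python
import hashlib
import os
import shutil
import tempfile

def md5_file(path, chunk=1 << 20):
    """Hash a file in 1 MiB chunks so large transfers do not need to
    fit in memory (same idea as utils.hash_file(..., method="md5"))."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

src = tempfile.NamedTemporaryFile(delete=False)
src.write(os.urandom(1024 * 1024))
src.close()
dst = src.name + ".copy"
shutil.copy(src.name, dst)          # stand-in for the scp round trip
print(md5_file(src.name) == md5_file(dst))  # True
os.remove(src.name)
os.remove(dst)
```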


[PATCH 13/18] KVM test: Add a subtest of changing MAC address

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

Test steps:

1. Get a new MAC from the pool, and the old MAC address of the guest.
2. Execute mac_change.sh in the guest.
3. Re-log into the guest and query the interface info with `ifconfig`.

Signed-off-by: Cao, Chen 
Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/mac_change.py   |   65 
 client/tests/kvm/tests_base.cfg.sample |6 ++-
 2 files changed, 70 insertions(+), 1 deletions(-)
 create mode 100644 client/tests/kvm/tests/mac_change.py

diff --git a/client/tests/kvm/tests/mac_change.py 
b/client/tests/kvm/tests/mac_change.py
new file mode 100644
index 000..b97c99a
--- /dev/null
+++ b/client/tests/kvm/tests/mac_change.py
@@ -0,0 +1,65 @@
+import logging
+from autotest_lib.client.common_lib import error
+import kvm_utils, kvm_test_utils
+
+
+def run_mac_change(test, params, env):
+"""
+Change MAC address of guest.
+
+1) Get a new mac from pool, and the old mac addr of guest.
+2) Set new mac in guest and regain new IP.
+3) Re-log into guest with new MAC.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+timeout = int(params.get("login_timeout", 360))
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+logging.info("Trying to log into guest '%s' by serial", vm.name)
+session = kvm_utils.wait_for(lambda: vm.serial_login(),
+  timeout, 0, step=2)
+if not session:
+raise error.TestFail("Could not log into guest '%s'" % vm.name)
+
+old_mac = vm.get_macaddr(0)
+kvm_utils.free_mac_address(vm.root_dir, vm, 0)
+new_mac = kvm_utils.generate_mac_address(vm.root_dir,
+ vm.instance,
+ 0, vm.mac_prefix)
+logging.info("The initial MAC address is %s", old_mac)
+interface = kvm_test_utils.get_linux_ifname(session, old_mac)
+
+# Start change MAC address
+logging.info("Changing MAC address to %s", new_mac)
+change_cmd = ("ifconfig %s down && ifconfig %s hw ether %s && "
+  "ifconfig %s up" % (interface, interface, new_mac, interface))
+if session.get_command_status(change_cmd) != 0:
+raise error.TestFail("Fail to send mac_change command")
+
+# Verify whether MAC address was changed to the new one
+logging.info("Verifying the new mac address")
+if session.get_command_status("ifconfig | grep -i %s" % new_mac) != 0:
+raise error.TestFail("Fail to change MAC address")
+
+# Restart `dhclient' to regain IP for new mac address
+logging.info("Restart the network to gain new IP")
+dhclient_cmd = "dhclient -r && dhclient %s" % interface
+session.sendline(dhclient_cmd)
+
+# Re-log into the guest after changing mac address
+if kvm_utils.wait_for(session.is_responsive, 120, 20, 3):
+# Just warning when failed to see the session become dead,
+# because there is a little chance the ip does not change.
+logging.warn("The session is still responsive, settings may fail.")
+session.close()
+
+# Re-log into guest and check if session is responsive
+logging.info("Re-log into the guest")
+session = kvm_test_utils.wait_for_login(vm,
+  timeout=int(params.get("login_timeout", 360)))
+if not session.is_responsive():
+raise error.TestFail("The new session is not responsive.")
+
+session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index 6bbc33e..a710bc0 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -514,6 +514,10 @@ variants:
 restart_vm = yes
 pxe_timeout = 60
 
+- mac_change: install setup unattended_install.cdrom
+type = mac_change
+kill_vm = yes
+
 - physical_resources_check: install setup unattended_install.cdrom
 type = physical_resources_check
 catch_uuid_cmd = dmidecode | awk -F: '/UUID/ {print $2}'
@@ -1231,7 +1235,7 @@ variants:
 
 # Windows section
 - @Windows:
-no autotest linux_s3 vlan_tag ioquit unattended_install.(url|nfs|remote_ks) jumbo file_transfer nicdriver_unload nic_promisc multicast
+no autotest linux_s3 vlan_tag ioquit unattended_install.(url|nfs|remote_ks) jumbo file_transfer nicdriver_unload nic_promisc multicast mac_change
 shutdown_command = shutdown /s /f /t 0
 reboot_command = shutdown /r /f /t 0
 status_test_command = echo %errorlevel%
-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
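The kvm_utils MAC pool is not shown in this patch, but the usual approach is to generate addresses under a fixed prefix. A hedged sketch — the 52:54:00 prefix is QEMU's convention, assumed here rather than taken from this series:

```python
import random
import re

def generate_mac(prefix=(0x52, 0x54, 0x00)):
    """Generate a MAC address under `prefix`. With the 52:54:00 prefix
    the address is unicast (bit 0 of the first octet clear) and belongs
    to QEMU's conventional range, so it cannot collide with real NICs."""
    octets = list(prefix) + [random.randint(0, 255) for _ in range(3)]
    return ":".join("%02x" % o for o in octets)

mac = generate_mac()
print(re.match(r"^52:54:00(:[0-9a-f]{2}){3}$", mac) is not None)  # True
```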


[PATCH 10/18] KVM test: Add a subtest of nic promisc

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

This test mainly covers TCP sent from host to guest and from guest to host
while repeatedly turning the NIC promiscuous mode on/off.

Changes from v1:
- Don't abruptly fail the whole test if we get a failure for a single size

Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/nic_promisc.py  |  103 
 client/tests/kvm/tests_base.cfg.sample |6 ++-
 2 files changed, 108 insertions(+), 1 deletions(-)
 create mode 100644 client/tests/kvm/tests/nic_promisc.py

diff --git a/client/tests/kvm/tests/nic_promisc.py 
b/client/tests/kvm/tests/nic_promisc.py
new file mode 100644
index 000..e13820a
--- /dev/null
+++ b/client/tests/kvm/tests/nic_promisc.py
@@ -0,0 +1,103 @@
+import logging
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+import kvm_utils, kvm_test_utils
+
+def run_nic_promisc(test, params, env):
+"""
+Test nic driver in promisc mode:
+
+1) Boot up a VM.
+2) Repeatedly enable/disable promiscuous mode in guest.
+3) TCP data transmission from host to guest, and from guest to host,
+   with 1/1460/65000/1 bytes payloads.
+4) Clean temporary files.
+5) Stop enable/disable promiscuous mode change.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+timeout = int(params.get("login_timeout", 360))
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
+
+logging.info("Trying to log into guest '%s' by serial", vm.name)
+session2 = kvm_utils.wait_for(lambda: vm.serial_login(),
+  timeout, 0, step=2)
+if not session2:
+raise error.TestFail("Could not log into guest '%s'" % vm.name)
+
+def compare(filename):
+cmd = "md5sum %s" % filename
+md5_host = utils.hash_file(filename, method="md5")
+rc_guest, md5_guest = session.get_command_status_output(cmd)
+if rc_guest:
+logging.debug("Could not get MD5 hash for file %s on guest, "
+  "output: %s", filename, md5_guest)
+return False
+md5_guest = md5_guest.split()[0]
+if md5_host != md5_guest:
+logging.error("MD5 hash mismatch between file %s "
+  "present on guest and on host", filename)
+logging.error("MD5 hash for file on guest: %s, "
+  "MD5 hash for file on host: %s", md5_guest, md5_host)
+return False
+return True
+
+ethname = kvm_test_utils.get_linux_ifname(session, vm.get_macaddr(0))
+set_promisc_cmd = ("ip link set %s promisc on; sleep 0.01;"
+   "ip link set %s promisc off; sleep 0.01" %
+   (ethname, ethname))
+logging.info("Set promisc change repeatedly in guest")
+session2.sendline("while true; do %s; done" % set_promisc_cmd)
+
+dd_cmd = "dd if=/dev/urandom of=%s bs=%d count=1"
+filename = "/tmp/nic_promisc_file"
+file_size = params.get("file_size", "1, 1460, 65000, 1").split(",")
+success_counter = 0
+try:
+for size in file_size:
+logging.info("Create %s bytes file on host" % size)
+utils.run(dd_cmd % (filename, int(size)))
+
+logging.info("Transfer file from host to guest")
+if not vm.copy_files_to(filename, filename):
+logging.error("File transfer failed")
+continue
+if not compare(filename):
+logging.error("Compare file failed")
+continue
+else:
+success_counter += 1
+
+logging.info("Create %s bytes file on guest" % size)
+if session.get_command_status(dd_cmd % (filename, int(size)),
+timeout=100) != 0:
+logging.error("Create file on guest failed")
+continue
+
+logging.info("Transfer file from guest to host")
+if not vm.copy_files_from(filename, filename):
+logging.error("File transfer failed")
+continue
+if not compare(filename):
+logging.error("Compare file failed")
+continue
+else:
+success_counter += 1
+
+logging.info("Clean temporary files")
+cmd = "rm -f %s" % filename
+utils.run(cmd)
+session.get_command_status(cmd)
+
+finally:
+logging.info("Restore %s to non-promiscuous mode", ethname)
+session2.close()
+session.get_command_status("ip link set %s promisc off" % ethname)
+session.close()
+
+if success_counter != 2 * len(file_size):
+raise error.TestFail("Some tests failed, success_ratio: %s/%s" %
+ (

[PATCH 09/18] KVM test: Add a subtest of load/unload nic driver

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

Repeatedly load/unload the NIC driver while transferring files between guest
and host in multiple threads at the same time, and check the md5sums.

Changes from v1:
- Use a new method to get nic driver name
- Use utils.hash_file() to get md5sum

Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/file_transfer.py|   19 +++--
 client/tests/kvm/tests/nicdriver_unload.py |  115 
 client/tests/kvm/tests_base.cfg.sample |   10 ++-
 3 files changed, 135 insertions(+), 9 deletions(-)
 create mode 100644 client/tests/kvm/tests/nicdriver_unload.py

diff --git a/client/tests/kvm/tests/file_transfer.py 
b/client/tests/kvm/tests/file_transfer.py
index c9a3476..a0c6cff 100644
--- a/client/tests/kvm/tests/file_transfer.py
+++ b/client/tests/kvm/tests/file_transfer.py
@@ -7,18 +7,19 @@ def run_file_transfer(test, params, env):
 """
 Test basic file transfer between host and guest
 
-1) Boot up a virtual machine
-2) Create a large file by dd on host
-3) Copy this file from host to guest
-4) Copy this file from guest to host
-5) Check if file transfers good
+1) Boot up a VM.
+2) Create a large file by dd on host.
+3) Copy this file from host to guest.
+4) Copy this file from guest to host.
+5) Check if the file transfer completed successfully.
 
-@param test: Kvm test object
+@param test: KVM test object.
 @param params: Dictionary with the test parameters.
 @param env: Dictionary with test environment.
 """
 vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
 timeout=int(params.get("login_timeout", 360))
+
 logging.info("Trying to log into guest '%s' by serial", vm.name)
 session = kvm_utils.wait_for(lambda: vm.serial_login(),
  timeout, 0, step=2)
@@ -29,18 +30,19 @@ def run_file_transfer(test, params, env):
 scp_timeout = int(params.get("scp_timeout"))
 cmd = "dd if=/dev/urandom of=%s/a.out bs=1M count=%d" %  (dir, int(
  params.get("filesize", 4000)))
+
 try:
 logging.info("Create file by dd command on host, cmd: %s" % cmd)
 utils.run(cmd)
 
 logging.info("Transfer file from host to guest")
 if not vm.copy_files_to("%s/a.out" % dir, "/tmp/b.out",
-timeout=scp_timeout):
+timeout=scp_timeout):
 raise error.TestFail("Fail to transfer file from host to guest")
 
 logging.info("Transfer file from guest to host")
 if not vm.copy_files_from("/tmp/b.out", "%s/c.out" % dir,
-timeout=scp_timeout):
+  timeout=scp_timeout):
 raise error.TestFail("Fail to transfer file from guest to host")
 
 logging.debug(commands.getoutput("ls -l %s/[ac].out" % dir))
@@ -49,6 +51,7 @@ def run_file_transfer(test, params, env):
 
 if md5_orig != md5_new:
 raise error.TestFail("File changed after transfer")
+
 finally:
 session.get_command_status("rm -f /tmp/b.out")
 utils.run("rm -f %s/[ac].out" % dir)
diff --git a/client/tests/kvm/tests/nicdriver_unload.py 
b/client/tests/kvm/tests/nicdriver_unload.py
new file mode 100644
index 000..8630f88
--- /dev/null
+++ b/client/tests/kvm/tests/nicdriver_unload.py
@@ -0,0 +1,115 @@
+import logging, threading, os
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+import kvm_utils, kvm_test_utils
+
+def run_nicdriver_unload(test, params, env):
+"""
+Test nic driver.
+
+1) Boot a VM.
+2) Get the NIC driver name.
+3) Repeatedly unload/load NIC driver.
+4) Multi-session TCP transfer on test interface.
+5) Check whether the test interface still works.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+timeout = int(params.get("login_timeout", 360))
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
+logging.info("Trying to log into guest '%s' by serial", vm.name)
+session2 = kvm_utils.wait_for(lambda: vm.serial_login(),
+  timeout, 0, step=2)
+if not session2:
+raise error.TestFail("Could not log into guest '%s'" % vm.name)
+
+ethname = kvm_test_utils.get_linux_ifname(session, vm.get_macaddr(0))
+temp_path = "/sys/class/net/%s/device/driver" % (ethname)
+if os.path.islink(temp_path):
+driver = os.path.split(os.path.realpath(temp_path))[-1]
+else:
+raise error.TestError("Could not find driver name")
+logging.info("driver is %s", driver)
+
+class ThreadScp(threading.Thread):
+def run(self):
+remote_file = '/tmp/' + self.getName()
+file_list.append(

[PATCH 06/18] KVM test: Add a new subtest ping

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

This test uses ping to check the virtual NICs; it contains two kinds of tests:
1. Packet loss ratio test: ping the guest with different sizes of packets.
2. Stress test: flood ping the guest, then use ordinary ping to test the network.

We do not raise an error when the flood ping itself fails, as that would be
too strict; instead we check the ping results before and after the flood ping.
The interval and packet size can be configured through tests_base.cfg.
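
The strict check relies on a helper that parses the loss percentage out of the
ping summary line. A minimal sketch of what such a parser could look like
(kvm_test_utils.get_loss_ratio() is assumed here; its actual implementation is
not part of this series):

```python
import re

def get_loss_ratio(output):
    """Parse the packet loss percentage from iputils ping output.

    Expects the standard summary line, e.g.:
    '100 packets transmitted, 97 received, 3% packet loss, time 99ms'
    Returns the loss percentage as an int, or -1 if no summary is found.
    """
    match = re.search(r"(\d+)% packet loss", output)
    if match:
        return int(match.group(1))
    return -1
```

With strict_check enabled, any nonzero return value fails the test.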

Changes from v2:
- Coding style fixes

Changes from v1:
- Improve error message

Signed-off-by: Jason Wang 
Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/ping.py |   72 
 client/tests/kvm/tests_base.cfg.sample |5 ++
 2 files changed, 77 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/tests/ping.py

diff --git a/client/tests/kvm/tests/ping.py b/client/tests/kvm/tests/ping.py
new file mode 100644
index 000..9b2308f
--- /dev/null
+++ b/client/tests/kvm/tests/ping.py
@@ -0,0 +1,72 @@
+import logging
+from autotest_lib.client.common_lib import error
+import kvm_test_utils
+
+
+def run_ping(test, params, env):
+"""
+Ping the guest with different sizes of packets.
+
+Packet Loss Test:
+1) Ping the guest with different sizes/intervals of packets.
+
+Stress Test:
+1) Flood ping the guest.
+2) Check if the network is still usable.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+session = kvm_test_utils.wait_for_login(vm)
+
+counts = params.get("ping_counts", 100)
+flood_minutes = float(params.get("flood_minutes", 10))
+nics = params.get("nics").split()
+strict_check = params.get("strict_check", "no") == "yes"
+
+packet_size = [0, 1, 4, 48, 512, 1440, 1500, 1505, 4054, 4055, 4096, 4192,
+   8878, 9000, 32767, 65507]
+
+try:
+for i, nic in enumerate(nics):
+ip = vm.get_address(i)
+if not ip:
+logging.error("Could not get the ip of nic index %d", i)
+continue
+
+for size in packet_size:
+logging.info("Ping with packet size %s", size)
+status, output = kvm_test_utils.ping(ip, 10,
+ packetsize=size,
+ timeout=20)
+if strict_check:
+ratio = kvm_test_utils.get_loss_ratio(output)
+if ratio != 0:
+raise error.TestFail("Loss ratio is %s for packet size"
+ " %s" % (ratio, size))
+else:
+if status != 0:
+raise error.TestFail("Ping failed, status: %s,"
+ " output: %s" % (status, output))
+
+logging.info("Flood ping test")
+kvm_test_utils.ping(ip, None, flood=True, output_func=None,
+timeout=flood_minutes * 60)
+
+logging.info("Final ping test")
+status, output = kvm_test_utils.ping(ip, counts,
+ timeout=float(counts) * 1.5)
+if strict_check:
+ratio = kvm_test_utils.get_loss_ratio(output)
+if ratio != 0:
+raise error.TestFail("Ping failed, status: %s,"
+ " output: %s" % (status, output))
+else:
+if status != 0:
+raise error.TestFail("Ping returns non-zero value %s" %
+ output)
+finally:
+session.close()
diff --git a/client/tests/kvm/tests_base.cfg.sample 
b/client/tests/kvm/tests_base.cfg.sample
index 9739a50..90c1f69 100644
--- a/client/tests/kvm/tests_base.cfg.sample
+++ b/client/tests/kvm/tests_base.cfg.sample
@@ -469,6 +469,11 @@ variants:
 kill_vm_gracefully_vm2 = no
 address_index_vm2 = 1
 
+- ping: install setup unattended_install.cdrom
+type = ping
+counts = 100
+flood_minutes = 10
+
 - physical_resources_check: install setup unattended_install.cdrom
 type = physical_resources_check
 catch_uuid_cmd = dmidecode | awk -F: '/UUID/ {print $2}'
-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 07/18] KVM test: Add a subtest jumbo

2010-09-14 Thread Lucas Meneghel Rodrigues
Set a different MTU on the NIC according to its model, then use ping to check
whether frames of the tested size can be received by the host.
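
The test below computes the largest ICMP payload as the MTU minus 28 bytes
(20-byte IPv4 header + 8-byte ICMP header). As a quick sanity check of that
arithmetic (a sketch, not code from the patch):

```python
def max_icmp_payload(mtu):
    """Largest ICMP data size that fits in one IPv4 packet of the given MTU.

    28 bytes of overhead = 20-byte IPv4 header + 8-byte ICMP header.
    """
    return int(mtu) - 20 - 8

# A standard 1500-byte MTU allows a 1472-byte payload; a 9000-byte jumbo
# MTU allows 8972 bytes.
```

Pinging with the hint="do" argument (which presumably maps to ping's -M do)
forbids fragmentation, so oversized frames fail fast instead of being split.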

Changes from v2:
- Coding style fixes

Changes from v1:
- Make the packet loss ratio threshold configurable

Signed-off-by: Jason Wang 
Signed-off-by: Amos Kong 
---
 client/tests/kvm/tests/jumbo.py|  130 
 client/tests/kvm/tests_base.cfg.sample |   11 +++-
 2 files changed, 140 insertions(+), 1 deletions(-)
 create mode 100644 client/tests/kvm/tests/jumbo.py

diff --git a/client/tests/kvm/tests/jumbo.py b/client/tests/kvm/tests/jumbo.py
new file mode 100644
index 000..9c36951
--- /dev/null
+++ b/client/tests/kvm/tests/jumbo.py
@@ -0,0 +1,130 @@
+import logging, commands, random
+from autotest_lib.client.common_lib import error
+from autotest_lib.client.bin import utils
+import kvm_test_utils, kvm_utils
+
+def run_jumbo(test, params, env):
+"""
+Test the RX jumbo frame function of vnics:
+
+1) Boot the VM.
+2) Change the MTU of guest nics and host taps depending on the NIC model.
+3) Add the static ARP entry for guest NIC.
+4) Wait for the MTU ok.
+5) Verify the path MTU using ping.
+6) Ping the guest with large frames.
+7) Increment size ping.
+8) Flood ping the guest with large frames.
+9) Verify the path MTU.
+10) Recover the MTU.
+
+@param test: KVM test object.
+@param params: Dictionary with the test parameters.
+@param env: Dictionary with test environment.
+"""
+
+vm = kvm_test_utils.get_living_vm(env, params.get("main_vm"))
+session = kvm_test_utils.wait_for_login(vm)
+mtu = params.get("mtu", "1500")
+flood_time = params.get("flood_time", "300")
+max_icmp_pkt_size = int(mtu) - 28
+
+ifname = vm.get_ifname(0)
+ip = vm.get_address(0)
+if ip is None:
+raise error.TestError("Could not get the IP address")
+
+try:
+# Environment preparation
+ethname = kvm_test_utils.get_linux_ifname(session, vm.get_macaddr(0))
+
+logging.info("Changing the MTU of guest ...")
+guest_mtu_cmd = "ifconfig %s mtu %s" % (ethname , mtu)
+s, o = session.get_command_status_output(guest_mtu_cmd)
+if s != 0:
+logging.error(o)
+raise error.TestError("Failed to set the MTU of guest NIC: %s" %
+  ethname)
+
+logging.info("Changing the MTU of host tap ...")
+host_mtu_cmd = "ifconfig %s mtu %s" % (ifname, mtu)
+utils.run(host_mtu_cmd)
+
+logging.info("Add a temporary static ARP entry ...")
+arp_add_cmd = "arp -s %s %s -i %s" % (ip, vm.get_macaddr(0), ifname)
+utils.run(arp_add_cmd)
+
+def is_mtu_ok():
+s, o = kvm_test_utils.ping(ip, 1, interface=ifname,
+   packetsize=max_icmp_pkt_size,
+   hint="do", timeout=2)
+return s == 0
+
+def verify_mtu():
+logging.info("Verify the path MTU")
+s, o = kvm_test_utils.ping(ip, 10, interface=ifname,
+   packetsize=max_icmp_pkt_size,
+   hint="do", timeout=15)
+if s != 0:
+logging.error(o)
+raise error.TestFail("Path MTU is not as expected")
+if kvm_test_utils.get_loss_ratio(o) != 0:
+logging.error(o)
+raise error.TestFail("Packet loss ratio during MTU "
+ "verification is not zero")
+
+def flood_ping():
+logging.info("Flood with large frames")
+kvm_test_utils.ping(ip, interface=ifname,
+packetsize=max_icmp_pkt_size,
+flood=True, timeout=float(flood_time))
+
+def large_frame_ping(count=100):
+logging.info("Large frame ping")
+s, o = kvm_test_utils.ping(ip, count, interface=ifname,
+   packetsize=max_icmp_pkt_size,
+   timeout=float(count) * 2)
+ratio = kvm_test_utils.get_loss_ratio(o)
+if ratio != 0:
+raise error.TestFail("Loss ratio of large frame ping is %s" %
+ ratio)
+
+def size_increase_ping(step=random.randrange(90, 110)):
+logging.info("Size increase ping")
+for size in range(0, max_icmp_pkt_size + 1, step):
+logging.info("Ping %s with size %s" % (ip, size))
+s, o = kvm_test_utils.ping(ip, 1, interface=ifname,
+   packetsize=size,
+   hint="do", timeout=1)
+if s != 0:
+s, o = kvm_test_utils.ping(ip, 10, interface=ifname,
+   packetsize=size,
+  

[PATCH 04/18] KVM test: Add a get_ifname function

2010-09-14 Thread Lucas Meneghel Rodrigues
It's clearer to use 'nic_model + nic_index + vnc_port' than 'tap0',
and it's also unique for each guest.
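
For illustration, the naming scheme produces values like the following
(the model name and VNC port below are made up):

```python
def make_ifname(nic_model, nic_index, vnc_port):
    """Mirror of the tap naming scheme: NIC model + NIC index + VNC port."""
    return "%s_%s_%s" % (nic_model, nic_index, vnc_port)

# First virtio NIC of a guest whose VNC port is 5901 -> "virtio_0_5901"
```

Since the VNC port is unique per VM on a host and the index is unique per NIC,
the resulting ifname cannot collide with another guest's tap device.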

Signed-off-by: Amos Kong 
---
 client/tests/kvm/kvm_vm.py |   21 -
 1 files changed, 20 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/kvm_vm.py b/client/tests/kvm/kvm_vm.py
index 13eaac1..a473bb4 100755
--- a/client/tests/kvm/kvm_vm.py
+++ b/client/tests/kvm/kvm_vm.py
@@ -426,7 +426,7 @@ class VM:
 if tftp:
 tftp = kvm_utils.get_path(root_dir, tftp)
 qemu_cmd += add_net(help, vlan, nic_params.get("nic_mode", "user"),
-nic_params.get("nic_ifname"),
+self.get_ifname(vlan),
 script, downscript, tftp,
 nic_params.get("bootp"), redirs,
 self.netdev_id[vlan])
@@ -959,6 +959,25 @@ class VM:
 return self.redirs.get(port)
 
 
+def get_ifname(self, nic_index=0):
+"""
+Return the ifname of tap device for the guest nic.
+
+The vnc_port is unique for each VM and nic_index is unique for each NIC
+of one VM, so combining them avoids duplicate ifnames.
+
+@param nic_index: Index of the NIC
+"""
+nics = kvm_utils.get_sub_dict_names(self.params, "nics")
+nic_name = nics[nic_index]
+nic_params = kvm_utils.get_sub_dict(self.params, nic_name)
+if nic_params.get("nic_ifname"):
+return nic_params.get("nic_ifname")
+else:
+return "%s_%s_%s" % (nic_params.get("nic_model"),
+ nic_index, self.vnc_port)
+
+
 def get_mac_address(self, nic_index=0):
 """
 Return the macaddr of guest nic.
-- 
1.7.2.2



[PATCH 03/18] KVM test: Remove address_pools.cfg dependency

2010-09-14 Thread Lucas Meneghel Rodrigues
Since the previous patch introduces an automated management
mechanism for MAC addresses, let's simplify things a bit
by removing address_pools.cfg parsing from the control files,
as well as just removing address_pools.cfg.sample.

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/address_pools.cfg.sample |   65 -
 client/tests/kvm/control  |8 
 client/tests/kvm/control.parallel |9 
 client/tests/kvm/get_started.py   |4 +-
 4 files changed, 2 insertions(+), 84 deletions(-)
 delete mode 100644 client/tests/kvm/address_pools.cfg.sample

diff --git a/client/tests/kvm/address_pools.cfg.sample 
b/client/tests/kvm/address_pools.cfg.sample
deleted file mode 100644
index b5967ce..000
--- a/client/tests/kvm/address_pools.cfg.sample
+++ /dev/null
@@ -1,65 +0,0 @@
-# Copy this file to address_pools.cfg and edit it.
-#
-# This file specifies several MAC-IP ranges for each host in the network that
-# may run KVM tests.  A MAC address must not be used twice, so these ranges
-# must not overlap.  The VMs running on each host will only use MAC addresses
-# from the pool of that host.
-# If you wish to use a static MAC-IP mapping, where each MAC address range is
-# mapped to a known corresponding IP address range, specify the bases of the IP
-# address ranges in this file.
-# If you specify a MAC address range without a corresponding IP address range,
-# the IP addresses for that range will be determined at runtime by listening
-# to DHCP traffic using tcpdump.
-# If you wish to determine IP addresses using tcpdump in any case, regardless
-# of any # IP addresses specified in this file, uncomment the following line:
-#always_use_tcpdump = yes
-# You may also specify this parameter for specific hosts by adding it in the
-# appropriate sections below.
-
-variants:
-# Rename host1 to an actual (short) hostname in the network that will be 
running the Autotest client
-- @host1:
-# Add/remove ranges here
-address_ranges = r1 r2
-
-# Modify the following parameters to reflect the DHCP server's 
configuration
-address_range_base_mac_r1 = 52:54:00:12:35:56
-#address_range_base_ip_r1 = 10.0.2.20
-address_range_size_r1 = 16
-
-# Modify the following parameters to reflect the DHCP server's 
configuration
-address_range_base_mac_r2 = 52:54:00:12:35:80
-#address_range_base_ip_r2 = 10.0.2.40
-address_range_size_r2 = 16
-
-# Rename host2 to an actual (short) hostname in the network that will be 
running the Autotest client
-- @host2:
-# Add/remove ranges here
-address_ranges = r1 r2
-
-# Modify the following parameters to reflect the DHCP server's 
configuration
-address_range_base_mac_r1 = 52:54:00:12:36:56
-#address_range_base_ip_r1 = 10.0.3.20
-address_range_size_r1 = 16
-
-# Modify the following parameters to reflect the DHCP server's 
configuration
-address_range_base_mac_r2 = 52:54:00:12:36:80
-#address_range_base_ip_r2 = 10.0.3.40
-address_range_size_r2 = 16
-
-# Add additional hosts here...
-
-# This will be used for hosts that do not appear on the list
-- @default_host:
-# Add/remove ranges here
-address_ranges = r1 r2
-
-# Modify the following parameters to reflect the DHCP server's 
configuration
-address_range_base_mac_r1 = 52:54:00:12:34:56
-#address_range_base_ip_r1 = 10.0.1.20
-address_range_size_r1 = 16
-
-# Modify the following parameters to reflect the DHCP server's 
configuration
-address_range_base_mac_r2 = 52:54:00:12:34:80
-#address_range_base_ip_r2 = 10.0.1.40
-address_range_size_r2 = 16
diff --git a/client/tests/kvm/control b/client/tests/kvm/control
index a69eacf..63bbe5d 100644
--- a/client/tests/kvm/control
+++ b/client/tests/kvm/control
@@ -55,14 +55,6 @@ tests_cfg = kvm_config.config()
 tests_cfg_path = os.path.join(kvm_test_dir, "tests.cfg")
 tests_cfg.fork_and_parse(tests_cfg_path, str)
 
-pools_cfg_path = os.path.join(kvm_test_dir, "address_pools.cfg")
-tests_cfg.parse_file(pools_cfg_path)
-hostname = os.uname()[1].split(".")[0]
-if tests_cfg.count("^" + hostname):
-tests_cfg.parse_string("only ^%s" % hostname)
-else:
-tests_cfg.parse_string("only ^default_host")
-
 # Run the tests
 kvm_utils.run_tests(tests_cfg.get_generator(), job)
 
diff --git a/client/tests/kvm/control.parallel 
b/client/tests/kvm/control.parallel
index 07bc6e5..ac84638 100644
--- a/client/tests/kvm/control.parallel
+++ b/client/tests/kvm/control.parallel
@@ -171,15 +171,6 @@ cfg = kvm_config.config()
 filename = os.path.join(pwd, "tests.cfg")
 cfg.fork_and_parse(filename, str)
 
-filename = os.path.join(pwd, "address_pools.cfg")
-if os.path.exists(filename):
-cfg.parse_file(filename)
-hostname = os.uname()[1].split(".")[0]
-if cfg.count("^" + hostname):
-   

[PATCH 02/18] KVM test: Make physical_resources_check to work with MAC management

2010-09-14 Thread Lucas Meneghel Rodrigues
The previous MAC address management patch breaks the
physical_resources_check test (the test picks up NIC MAC parameters
from the test parameters). Let's fix it by making the test retrieve them
from the method VM.get_mac_address().

Signed-off-by: Lucas Meneghel Rodrigues 
---
 client/tests/kvm/tests/physical_resources_check.py |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/client/tests/kvm/tests/physical_resources_check.py 
b/client/tests/kvm/tests/physical_resources_check.py
index 0f7cab3..6c8e154 100644
--- a/client/tests/kvm/tests/physical_resources_check.py
+++ b/client/tests/kvm/tests/physical_resources_check.py
@@ -123,9 +123,14 @@ def run_physical_resources_check(test, params, env):
 found_mac_addresses = re.findall("macaddr=(\S+)", o)
 logging.debug("Found MAC adresses: %s" % found_mac_addresses)
 
+nic_index = 0
 for nic_name in kvm_utils.get_sub_dict_names(params, "nics"):
 nic_params = kvm_utils.get_sub_dict(params, nic_name)
-mac, ip = kvm_utils.get_mac_ip_pair_from_dict(nic_params)
+if "address_index" in nic_params:
+mac, ip = kvm_utils.get_mac_ip_pair_from_dict(nic_params)
+else:
+mac = vm.get_mac_address(nic_index)
+nic_index += 1
 if not string.lower(mac) in found_mac_addresses:
 n_fail += 1
 logging.error("MAC address mismatch:")
-- 
1.7.2.2



[PATCH 01/18] KVM test: Add a new macaddress pool algorithm

2010-09-14 Thread Lucas Meneghel Rodrigues
From: Amos Kong 

The old method uses the addresses in the config files, which could lead to
serious problems when multiple tests run on different hosts.
This patch adds a new MAC address pool algorithm: it generates the MAC prefix
based on the MAC address of the host and fixes it up to conform to IEEE 802.
When the user has set mac_prefix in the config file, it is used instead of the
dynamically generated prefix.
Add a parameter, 'preserve_mac', to preserve the original MAC address for
things like migration.

MAC addresses are recorded into a dictionary 'address_pool' in the following
format: {'20100310-165222-Wt7l:0' : 'AE:9D:94:6A:9b:f9', ...}
  20100310-165222-Wt7l : instance attribute of the VM
  0: index of the NIC
  AE:9D:94:6A:9b:f9: MAC address
Use 'vm instance' + 'nic index' as the key; the MAC address is the value.
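
The pool lives in a shelve file guarded by an fcntl lock, so concurrent
autotest instances on the same host serialize their accesses. The read side of
that pattern, reduced to its core (file paths as in the patch; the helper name
is illustrative):

```python
import fcntl
import shelve

def pool_lookup(key, default=None,
                lock_path="/tmp/mac_lock", pool_path="/tmp/address_pool"):
    """Read one entry from the shared MAC pool under an exclusive lock."""
    lock_file = open(lock_path, "w")
    fcntl.lockf(lock_file.fileno(), fcntl.LOCK_EX)
    try:
        pool = shelve.open(pool_path, writeback=False)
        try:
            return pool.get(key, default)
        finally:
            pool.close()
    finally:
        fcntl.lockf(lock_file.fileno(), fcntl.LOCK_UN)
        lock_file.close()
```

Writers follow the same lock/open/close sequence, which is why the lock must be
taken before shelve.open() rather than after.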

Changes from v2:
- Instead of basing ourselves in a physical interface address to get
an address prefix, just generate one with the prefix 0x9a (convention)
randomly and add it to the address pool. If there's already one
prefix, keep it there and just return it.
- Made messages more consistent and informative
- Made function names consistent all across the board
- Fixed some single line spacing between functions

Changes from v1:
- Use 'vm instance' + 'nic index' as the key of address_pool, address is value.
- Put 'mac_lock' and 'address_pool' in '/tmp', to share them with other
  autotest instances running on the same host.
- Change function names for less confusion.
- Do not copy 'vm.instance' in vm.clone()
- Split 'adding get_ifname function' to another patch

Signed-off-by: Jason Wang 
Signed-off-by: Feng Yang 
Signed-off-by: Amos Kong 
---
 client/tests/kvm/kvm_utils.py  |  114 
 client/tests/kvm/kvm_vm.py |   83 ++--
 client/tests/kvm/tests_base.cfg.sample |2 +-
 3 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/client/tests/kvm/kvm_utils.py b/client/tests/kvm/kvm_utils.py
index fb2d1c2..bb5c868 100644
--- a/client/tests/kvm/kvm_utils.py
+++ b/client/tests/kvm/kvm_utils.py
@@ -5,6 +5,7 @@ KVM test utility functions.
 """
 
 import time, string, random, socket, os, signal, re, logging, commands, cPickle
+import fcntl, shelve
 from autotest_lib.client.bin import utils
 from autotest_lib.client.common_lib import error, logging_config
 import kvm_subprocess
@@ -82,6 +83,119 @@ def get_sub_dict_names(dict, keyword):
 
 # Functions related to MAC/IP addresses
 
+def _generate_mac_address_prefix():
+"""
+Generate a MAC address prefix. By convention we will set KVM autotest
+MAC addresses to start with 0x9a.
+"""
+l = [0x9a, random.randint(0x00, 0x7f), random.randint(0x00, 0x7f),
+ random.randint(0x00, 0xff)]
+prefix = ':'.join(map(lambda x: "%02x" % x, l)) + ":"
+return prefix
+
+
+def generate_mac_address_prefix():
+"""
+Generate a random MAC address prefix and add it to the MAC pool dictionary.
+If there's a MAC prefix there already, do not update the MAC pool and just
+return what's in there.
+"""
+lock_file = open("/tmp/mac_lock", 'w')
+fcntl.lockf(lock_file.fileno(), fcntl.LOCK_EX)
+mac_pool = shelve.open("/tmp/address_pool", writeback=False)
+
+if mac_pool.get('prefix'):
+prefix = mac_pool.get('prefix')
+logging.debug('Retrieved previously generated MAC prefix for this '
+  'host: %s', prefix)
+else:
+prefix = _generate_mac_address_prefix()
+mac_pool['prefix'] = prefix
+logging.debug('Generated MAC address prefix for this host: %s', prefix)
+
+mac_pool.close()
+fcntl.lockf(lock_file.fileno(), fcntl.LOCK_UN)
+lock_file.close()
+
+return prefix
+
+
+def generate_mac_address(root_dir, instance_vm, nic_index, prefix=None):
+"""
+Randomly generate a MAC address and add it to the MAC pool.
+
+Try to generate macaddress based on the mac address prefix, add it to a
+dictionary 'address_pool'.
+key = VM instance + nic index, value = mac address
+{['20100310-165222-Wt7l:0'] : 'AE:9D:94:6A:9b:f9'}
+
+@param root_dir: Root dir for kvm.
+@param instance_vm: The instance attribute of the VM.
+@param nic_index: The index of the NIC.
+@param prefix: Prefix of MAC address.
+@return: MAC address string.
+"""
+if prefix is None:
+prefix = generate_mac_address_prefix()
+
+lock_file = open("/tmp/mac_lock", 'w')
+fcntl.lockf(lock_file.fileno(), fcntl.LOCK_EX)
+mac_pool = shelve.open("/tmp/address_pool", writeback=False)
+found = False
+key = "%s:%s" % (instance_vm, nic_index)
+
+if mac_pool.get(key):
+found = True
+mac = mac_pool.get(key)
+
+while not found:
+suffix = "%02x:%02x" % (random.randint(0x00,0xfe),
+random.randint(0x00,0xfe))
+mac = prefix + suffix
+mac_list = mac.split(":")
+# Clear multic

[PATCH 00/18] KVM autotest network patchset v3

2010-09-14 Thread Lucas Meneghel Rodrigues
This is Amos's patchset rebased, with some cleanups and additions:

- New method to generate MAC address prefixes
- Fix some tests to use new management
- Remove the dependency on address_pools.cfg
- Coding style fixes

We still have to do some work before the patches can be applied;
the good news is, we're getting closer :)

Amos Kong (12):
  KVM test: Add a new macaddress pool algorithm
  KVM test: Add a new subtest ping
  KVM test: Add basic file transfer test
  KVM test: Add a subtest of load/unload nic driver
  KVM test: Add a subtest of nic promisc
  KVM test: Add a subtest of multicast
  KVM test: Add a subtest of pxe
  KVM test: Add a subtest of changing MAC address
  KVM test: Add a netperf subtest
  KVM test: kvm_utils - Add support of check if remote port free
  KVM test: Improve vlan subtest
  KVM test: vlan subtest - Replace extra_params '-snapshot' with
image_snapshot

Lucas Meneghel Rodrigues (6):
  KVM test: Make physical_resources_check to work with MAC management
  KVM test: Remove address_pools.cfg dependency
  KVM test: Add a get_ifname function
  KVM Test: Add a common ping module for network related tests
  KVM test: Add a subtest jumbo
  KVM test: Add subtest of testing offload by ethtool

 client/tests/kvm/address_pools.cfg.sample  |   65 --
 client/tests/kvm/control   |8 -
 client/tests/kvm/control.parallel  |9 -
 client/tests/kvm/get_started.py|4 +-
 client/tests/kvm/kvm_test_utils.py |  112 ++-
 client/tests/kvm/kvm_utils.py  |  137 -
 client/tests/kvm/kvm_vm.py |  104 +-
 client/tests/kvm/scripts/join_mcast.py |   37 
 client/tests/kvm/tests/ethtool.py  |  215 
 client/tests/kvm/tests/file_transfer.py|   58 ++
 client/tests/kvm/tests/jumbo.py|  130 
 client/tests/kvm/tests/mac_change.py   |   65 ++
 client/tests/kvm/tests/multicast.py|   83 
 client/tests/kvm/tests/netperf.py  |   56 +
 client/tests/kvm/tests/nic_promisc.py  |  103 ++
 client/tests/kvm/tests/nicdriver_unload.py |  115 +++
 client/tests/kvm/tests/physical_resources_check.py |7 +-
 client/tests/kvm/tests/ping.py |   72 +++
 client/tests/kvm/tests/pxe.py  |   31 +++
 client/tests/kvm/tests/vlan.py |  186 +
 client/tests/kvm/tests/vlan_tag.py |   68 --
 client/tests/kvm/tests_base.cfg.sample |   97 -
 22 files changed, 1584 insertions(+), 178 deletions(-)
 delete mode 100644 client/tests/kvm/address_pools.cfg.sample
 create mode 100755 client/tests/kvm/scripts/join_mcast.py
 create mode 100644 client/tests/kvm/tests/ethtool.py
 create mode 100644 client/tests/kvm/tests/file_transfer.py
 create mode 100644 client/tests/kvm/tests/jumbo.py
 create mode 100644 client/tests/kvm/tests/mac_change.py
 create mode 100644 client/tests/kvm/tests/multicast.py
 create mode 100644 client/tests/kvm/tests/netperf.py
 create mode 100644 client/tests/kvm/tests/nic_promisc.py
 create mode 100644 client/tests/kvm/tests/nicdriver_unload.py
 create mode 100644 client/tests/kvm/tests/ping.py
 create mode 100644 client/tests/kvm/tests/pxe.py
 create mode 100644 client/tests/kvm/tests/vlan.py
 delete mode 100644 client/tests/kvm/tests/vlan_tag.py

-- 
1.7.2.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH 0/3] SVM feature support for qemu

2010-09-14 Thread Alexander Graf

On 14.09.2010, at 17:52, Joerg Roedel wrote:

> Hi,
> 
> here is the next round of the svm feature support patches for qemu. Key
> change in this version is that it now makes kvm{64|32} the default cpu
> definition for qemu when kvm is enabled (as requested by Alex).
> Otherwise I removed the NRIP_SAVE feature from the phenom definition and
> set svm_features per default to -1 for -cpu host to support all svm
> features that kvm can emulate. I appreciate your comments.

With the updated 1/3 patch, it looks really good to me.

Acked-by: Alexander Graf 



Re: qemu-kvm and initrd

2010-09-14 Thread David S. Ahern


On 09/14/10 13:38, Nirmal Guhan wrote:
> On Tue, Sep 14, 2010 at 8:38 AM, David S. Ahern  wrote:
>>
>>
>> On 09/14/10 00:35, Nirmal Guhan wrote:
>>> Hi,
>>>
>>> Getting an error while booting my guest with -initrd option as in :
>>>
>>> qemu-kvm -net nic,macaddr=$macaddress -net tap,script=/etc/qemu-ifup
>>> -m 512 -hda /root/kvm/x86/vdisk.img -kernel /root/mvroot/bzImage
>>> -initrd /root/kvm/mv/ramdisk.img -append "root=/dev/ram0"
>>>
>>> No filesystem could mount root, tried : ext3 ext2 ext4 vfat msds iso9660
>>> Kernel panic
>>>
>>> #file ramdisk.img
>>> #ramdisk.img: Linux rev 1.0 ext2 filesystem data (mounted or unclean)
> 
> I tried with both initrd and initramfs. Sizes are 42314699 and
> 71271136 respectively. Sizes do seem larger but I created them from
> the nfsroot created as part of the build (the nfsroot works
> apparently).

See if you can drop the image size as a test. I had to do that recently
to get the kernel/initrd/append option to work. As I recall I was
getting the same error message until I dropped the initrd size.

David

> 
>>
>> What's the size of ramdisk.img?
>>
>> David
>>
>>
>>>
>>> I tried with both above initrd and gzipped initrd but same error.
>>>
>>> If I try to mount the same file and do a -append  "ip=dhcp
>>> root=/dev/nfs rw nfsroot=:/root/kvm/mv/mnt" instead of -initrd
>>> option, it works  fine. So am guessing this is initrd related.
>>>
>>> Any help would be much appreciated.
>>>
>>> Thanks,
>>> Nirmal
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>


Re: qemu-kvm and initrd

2010-09-14 Thread Nirmal Guhan
On Tue, Sep 14, 2010 at 8:38 AM, David S. Ahern  wrote:
>
>
> On 09/14/10 00:35, Nirmal Guhan wrote:
>> Hi,
>>
>> Getting an error while booting my guest with -initrd option as in :
>>
>> qemu-kvm -net nic,macaddr=$macaddress -net tap,script=/etc/qemu-ifup
>> -m 512 -hda /root/kvm/x86/vdisk.img -kernel /root/mvroot/bzImage
>> -initrd /root/kvm/mv/ramdisk.img -append "root=/dev/ram0"
>>
>> No filesystem could mount root, tried : ext3 ext2 ext4 vfat msds iso9660
>> Kernel panic
>>
>> #file ramdisk.img
>> #ramdisk.img: Linux rev 1.0 ext2 filesystem data (mounted or unclean)

I tried with both initrd and initramfs. Sizes are 42314699 and
71271136 respectively. Sizes do seem larger but I created them from
the nfsroot created as part of the build (the nfsroot works
apparently).

>
> What's the size of ramdisk.img?
>
> David
>
>
>>
>> I tried with both above initrd and gzipped initrd but same error.
>>
>> If I try to mount the same file and do a -append  "ip=dhcp
>> root=/dev/nfs rw nfsroot=:/root/kvm/mv/mnt" instead of -initrd
>> option, it works  fine. So am guessing this is initrd related.
>>
>> Any help would be much appreciated.
>>
>> Thanks,
>> Nirmal
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
> > > I think that you should be able to simply combine
> > > the two drivers together, add an ioctl to
> > > enable/disable zero copy mode of operation. 
> > 
> > That could work. But what's the purpose to have two drivers if one
> > driver can handle it?
> > 
> > Thanks
> > Shirley
> 
> This was just an idea: I thought it's a good way for people interested
> in this zero copy thing to combine forces and avoid making
> the same mistakes, but it's not a must of course. 

Ok, I will make a simple patch by reusing some of Xiaohui's vhost code for
handling vhost_add_used_and_signal(), to see if there are any performance
changes.

The interesting thing is that when I ran 32 netperf/netserver instances,
I didn't see any issues with this patch.

Thanks
Shirley



Re: [KVM timekeeping 10/35] Fix deep C-state TSC desynchronization

2010-09-14 Thread Zachary Amsden

On 09/14/2010 12:40 AM, Jan Kiszka wrote:

Am 14.09.2010 11:27, Avi Kivity wrote:

   On 09/14/2010 11:10 AM, Jan Kiszka wrote:

Am 20.08.2010 10:07, Zachary Amsden wrote:

When CPUs with unstable TSCs enter deep C-state, TSC may stop
running.  This causes us to require resynchronization.  Since
we can't tell when this may potentially happen, we assume the
worst by forcing re-compensation for it at every point the VCPU
task is descheduled.

Signed-off-by: Zachary Amsden
---
   arch/x86/kvm/x86.c |2 +-
   1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7fc4a55..52b6c21 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1866,7 +1866,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
}

kvm_x86_ops->vcpu_load(vcpu, cpu);
-   if (unlikely(vcpu->cpu != cpu)) {
+   if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
native_read_tsc() - vcpu->arch.last_host_tsc;

For yet unknown reason, this commit breaks Linux guests here if they are
started with only a single VCPU. They hang during boot, obviously no
longer receiving interrupts.

I'm using kvm-kmod against a 2.6.34 host kernel, so this may be a side
effect of the wrapping, though I cannot imagine how.

Anyone any ideas?


   

Most likely, time went backwards, and some 'future - past' calculation
resulted in a negative sleep value which was then interpreted as
unsigned and resulted in a 2342525634 year sleep.
 

Looks like that's the case on first glance at the apic state.
   


This compensation effectively nulls the delta between current and last TSC:

if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
/* Make sure TSC doesn't go backwards */
s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
native_read_tsc() - 
vcpu->arch.last_host_tsc;

if (tsc_delta < 0)
mark_tsc_unstable("KVM discovered backwards TSC");
if (check_tsc_unstable())
kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;

If TSC has advanced quite a bit due to a TSC jump during sleep(*), it 
will adjust the offset backwards to compensate; similarly, if it has 
gone backwards, it will advance the offset.


In neither case should the visible TSC go backwards, assuming 
last_host_tsc is recorded properly, and so kvmclock should be similarly 
unaffected.


Perhaps the guest is more intelligent than we hope, and is comparing two 
different clocks: kvmclock or TSC with the rate of PIT interrupts.  This 
could result in negative arithmetic being interpreted as unsigned.  Are 
you using PIT interrupt reinjection on this guest or passing 
-no-kvm-pit-reinjection?


   

Does your guest use kvmclock, tsc, or some other time source?
 

A kernel that has kvmclock support even hangs in SMP mode. The others
pick hpet or acpi_pm. TSC is considered unstable.
   


SMP mode here has always and will always be unreliable.  Are you running 
on an Intel or AMD CPU?  The origin of this code comes from a workaround 
for (*) in vendor-specific code, and perhaps it is inappropriate for both.


Zach


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 21:01 +0200, Michael S. Tsirkin wrote:
> On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote:
> > On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
> > > As others said, the harder issues for TX are in determining that it's
> > > safe to unpin the memory, and how much memory is it safe to pin to
> > > begin with.  For RX we have some more complexity.
> > 
> > I think unpinning the memory happens in kfree_skb() whenever the last
> > reference is gone for TX. What we discussed here is when/how vhost gets
> > notified to update the ring buffer descriptors. Do I misunderstand
> > something here? 
> 
> Right, that's a better way to put it. 

That's what this macvtap patch does. As for how many pages are pinned, it is
limited by the sk_wmem_alloc size in this patch.

thanks
Shirley



Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 11:49:03AM -0700, Shirley Ma wrote:
> On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
> > As others said, the harder issues for TX are in determining that it's
> > safe
> > to unpin the memory, and how much memory is it safe to pin to begin
> > with.  For RX we have some more complexity.
> 
> I think unpinning the memory happens in kfree_skb() whenever the last
> reference is gone for TX. What we discussed here is when/how vhost gets
> notified to update the ring buffer descriptors. Do I misunderstand
> something here? 

Right, that's a better way to put it.

> > Well it's up to you of course, but this is not what I would try:
> > if you look at the
> > patchset vhost changes is not the largest part of it,
> > so this sounds a bit like effort duplication.
> > 
> > TX only is also much less interesting than full zero copy.
> 
> It's not true. Performance-wise, TX alone has already gained a big improvement.
> We need to identify how much gain comes from TX zero copy, and how much
> from RX zero copy.

I was speaking from the code point of view:
since we'll want both TX and RX eventually it's nice to
see that some thought was given to RX even if we only merge
TX as a first step.

From the product POV, RX is already slower (more interrupts, etc)
than TX so speeding it up might be more important,
but I agree, every bit helps.

> > I think that you should be able to simply combine
> > the two drivers together, add an ioctl to
> > enable/disable zero copy mode of operation. 
> 
> That could work. But what's the purpose to have two drivers if one
> driver can handle it?
> 
> Thanks
> Shirley

This was just an idea: I thought it's a good way for people interested
in this zero copy thing to combine forces and avoid making
the same mistakes, but it's not a must of course.

-- 
MST


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 20:27 +0200, Michael S. Tsirkin wrote:
> As others said, the harder issues for TX are in determining that it's safe
> to unpin the memory, and how much memory is it safe to pin to begin
> with.  For RX we have some more complexity.

I think unpinning the memory happens in kfree_skb() whenever the last reference
is gone for TX. What we discussed here is when/how vhost gets
notified to update the ring buffer descriptors. Do I misunderstand something
here? 

> Well it's up to you of course, but this is not what I would try:
> if you look at the
> patchset vhost changes is not the largest part of it,
> so this sounds a bit like effort duplication.
> 
> TX only is also much less interesting than full zero copy.

It's not true. Performance-wise, TX alone has already gained a big improvement.
We need to identify how much gain comes from TX zero copy, and how much
from RX zero copy.

> I think that you should be able to simply combine
> the two drivers together, add an ioctl to
> enable/disable zero copy mode of operation. 

That could work. But what's the purpose to have two drivers if one
driver can handle it?

Thanks
Shirley




Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 10:02:25AM -0700, Shirley Ma wrote:
> On Tue, 2010-09-14 at 18:29 +0200, Michael S. Tsirkin wrote:
> > Precisely. This is what the patch from Xin Xiaohui does.  That code
> > already seems to do most of what you are trying to do, right?
> 
> I thought having the host pin guest kernel buffer pages was good enough for TX,
> though I didn't look at Xiaohui's vhost async I/O patch in detail.

As others said, the harder issues for TX are in determining that it's safe
to unpin the memory, and how much memory is it safe to pin to begin
with.  For RX we have some more complexity.

> What's the performance data Xiaohui got from using kiocb? I haven't seen
> any performance number from him yet.
> 
> > The main thing missing seems to be macvtap integration, so that we can
> > fall back
> > on data copy if zero copy is unavailable?
> > How hard would it be to basically link the mp and macvtap modules
> > together to get us this functionality? Anyone? 
> 
> The simple integration is using macvtap + Xiaohui's vhost async io
> patch. I can give it a try for TX only.
> 
> Thanks
> Shirley

Well it's up to you of course, but this is not what I would try:
if you look at the
patchset vhost changes is not the largest part of it,
so this sounds a bit like effort duplication.

TX only is also much less interesting than full zero copy.

I think that you should be able to simply combine
the two drivers together, add an ioctl to
enable/disable zero copy mode of operation.

-- 
MST


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 18:29 +0200, Michael S. Tsirkin wrote:
> Precisely. This is what the patch from Xin Xiaohui does.  That code
> already seems to do most of what you are trying to do, right?

I thought having the host pin guest kernel buffer pages was good enough for TX,
though I didn't look at Xiaohui's vhost async I/O patch in detail.

What's the performance data Xiaohui got from using kiocb? I haven't seen
any performance number from him yet.

> The main thing missing seems to be macvtap integration, so that we can
> fall back
> on data copy if zero copy is unavailable?
> How hard would it be to basically link the mp and macvtap modules
> together to get us this functionality? Anyone? 

The simple integration is using macvtap + Xiaohui's vhost async io
patch. I can give it a try for TX only.

Thanks
Shirley



Re: high load with usb device

2010-09-14 Thread Michael Tokarev
14.09.2010 20:37, Avi Kivity wrote:
>  On 09/14/2010 06:29 PM, Michael Tokarev wrote:
>> 14.09.2010 20:00, Avi Kivity wrote:
>> >>  As I mentioned in other emails in this thread:
>> >>
>> >>  o yes, I do have CONFIG_EVENTFD set, and it is being used
>> >> too (fd#12 in the above strace).
>> >
>> >  I thought that was the signalfd.
>>
>> Uh. Yes, it was, I confused the two.  And yes, CONFIG_EVENTFD is
>> set and used too.
>>
>> $ grep FD= config-host.mak
>> CONFIG_EVENTFD=y
>> CONFIG_SIGNALFD=y
>>
>> >>  o 0.13.0-rc1 behaves the same way (that is, it also shows
>> >> high load when idle -- the same 18% of host CPU), but it
>> >> has no pipe on fd#5.
>> >
>> >  I think it's host_alarm_handler()'s use of qemu_notify_event().  It's
>> >  telling the main loop to rescan pending events, even though it's called
>> >  from the main loop itself.  Please drop it and rerun.
>>
>> Without qemu_notify_event() in host_alarm_handler():
>>
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  98.96   48.184832       13747      3505           ioctl
>>   0.39    0.191613          25      7745        28 futex
>>   0.37    0.181032           1    173192           select
>>   0.09    0.045379           0    980369           clock_gettime
>>   0.05    0.024362           0    351024    173220 read
>>   0.05    0.023247           0    487766           gettimeofday
>>   0.04    0.017996           0    319428           timer_gettime
>>   0.03    0.013837           0    198267           timer_settime
>>   0.02    0.010036           0    177790           rt_sigaction
>>   0.00    0.000000           0         1           writev
>>   0.00    0.000000           0         2           poll
>>   0.00    0.000000           0         1           rt_sigpending
>>   0.00    0.000000           0         1         1 rt_sigtimedwait
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00   48.692334               2699091    173249 total
>>
>> The picture is pretty similar to the one before :)

I mean, the picture of the host CPU load.  There's less extra stuff
going on, but the load is almost the same.

>> (And yes, I'm sure I've run the right binary).
> 
> No more writes, and read() is cut to twice select() (due to the need to
> see a 0, we can probably eliminate it if we know it's a real eventfd),
> somewhat fewer select()s.
> 
> What's the cpu load?

According to top(1):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12039 mjt       20   0 1078m 106m 4892 S   17  1.9  0:15.23 qemu-system-x86

So now it's 16..17%, not 18..19% as before.  Better, but far
from good :)

And I still don't understand why the load's almost zero on winXP.
Lemme try it out again with winXP...

>> It is still spending much more time in the ioctl (apparently in
>> kvm_run).
> 
> That time includes guest sleep time, not just cpu time, so it isn't an
> indicator.

Oh.  I see

/mjt


Re: high load with usb device

2010-09-14 Thread David S. Ahern


On 09/14/10 10:29, Michael Tokarev wrote:

> For comparison, here's the same strace stats without -usbdevice:
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  97.70    0.080237          22      3584           select
>   1.09    0.000895           0      6018      3584 read
>   0.33    0.000271           0      9670           clock_gettime
>   0.31    0.000254           0      6086           gettimeofday
>   0.26    0.000210           0      2432           rt_sigaction
>   0.17    0.000137           0      3653           timer_gettime
>   0.15    0.000122           0      2778           timer_settime
>   0.00    0.000000           0         1           ioctl
>   0.00    0.000000           0         1           rt_sigpending
>   0.00    0.000000           0         1         1 rt_sigtimedwait
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.082126                 34224      3585 total
> 
> Yes, it is still doing lots of unnecessary stuff, but the load
> is <1%.

Without a USB device attached the controller is turned off. See the call
to qemu_del_timer() in uhci_frame_timer(). As soon as you add the tablet
device the polling starts (see qemu_mod_timer in  uhci_ioport_writew)
and the cpu load starts.

David


> 
> (This is without qemu_notify_event() in host_alarm_handler())
> 
> /mjt


Re: high load with usb device

2010-09-14 Thread Avi Kivity

 On 09/14/2010 06:29 PM, Michael Tokarev wrote:

14.09.2010 20:00, Avi Kivity wrote:
>>  As I mentioned in other emails in this thread:
>>
>>  o yes, I do have CONFIG_EVENTFD set, and it is being used
>> too (fd#12 in the above strace).
>
>  I thought that was the signalfd.

Uh. Yes, it was, I confused the two.  And yes, CONFIG_EVENTFD is
set and used too.

$ grep FD= config-host.mak
CONFIG_EVENTFD=y
CONFIG_SIGNALFD=y

>>  o 0.13.0-rc1 behaves the same way (that is, it also shows
>> high load when idle -- the same 18% of host CPU), but it
>> has no pipe on fd#5.
>
>  I think it's host_alarm_handler()'s use of qemu_notify_event().  It's
>  telling the main loop to rescan pending events, even though it's called
>  from the main loop itself.  Please drop it and rerun.

Without qemu_notify_event() in host_alarm_handler():

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.96   48.184832       13747      3505           ioctl
  0.39    0.191613          25      7745        28 futex
  0.37    0.181032           1    173192           select
  0.09    0.045379           0    980369           clock_gettime
  0.05    0.024362           0    351024    173220 read
  0.05    0.023247           0    487766           gettimeofday
  0.04    0.017996           0    319428           timer_gettime
  0.03    0.013837           0    198267           timer_settime
  0.02    0.010036           0    177790           rt_sigaction
  0.00    0.000000           0         1           writev
  0.00    0.000000           0         2           poll
  0.00    0.000000           0         1           rt_sigpending
  0.00    0.000000           0         1         1 rt_sigtimedwait
------ ----------- ----------- --------- --------- ----------------
100.00   48.692334               2699091    173249 total

The picture is pretty similar to the one before :)
(And yes, I'm sure I've run the right binary).


No more writes, and read() is cut to twice select() (due to the need to 
see a 0, we can probably eliminate it if we know it's a real eventfd), 
somewhat fewer select()s.


What's the cpu load?


But it looks like we're barking up the wrong tree.  It's spending
99% time in ioctl.  Yes, there's quite large amount of reads
and gettimes, but they're taking very small time, even combined.

It is still spending much more time in the ioctl (apparently in
kvm_run).


That time includes guest sleep time, not just cpu time, so it isn't an 
indicator.



For comparison, here's the same strace stats without -usbdevice:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.70    0.080237          22      3584           select
  1.09    0.000895           0      6018      3584 read
  0.33    0.000271           0      9670           clock_gettime
  0.31    0.000254           0      6086           gettimeofday
  0.26    0.000210           0      2432           rt_sigaction
  0.17    0.000137           0      3653           timer_gettime
  0.15    0.000122           0      2778           timer_settime
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           rt_sigpending
  0.00    0.000000           0         1         1 rt_sigtimedwait
------ ----------- ----------- --------- --------- ----------------
100.00    0.082126                 34224      3585 total

Yes, it is still doing lots of unnecessary stuff, but the load
is <1%.

(This is without qemu_notify_event() in host_alarm_handler())




I guess it only saw a single ioctl, so it didn't measure it.  Ping the 
guest and you'll see ioctl times surge back.


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 09:00:25AM -0700, Shirley Ma wrote:
> On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
> > I would expect this to hurt performance significantly.
> > We could do this for asynchronous requests only to avoid the
> > slowdown. 
> 
> Is kiocb in sendmsg helpful here? It is not used now.
> 
> Shirley

Precisely. This is what the patch from Xin Xiaohui does.  That code
already seems to do most of what you are trying to do, right?

The main thing missing seems to be macvtap integration, so that we can fall back
on data copy if zero copy is unavailable?
How hard would it be to basically link the mp and macvtap modules
together to get us this functionality? Anyone?


-- 
MST


Re: high load with usb device

2010-09-14 Thread Michael Tokarev
14.09.2010 20:00, Avi Kivity wrote:
>> As I mentioned in other emails in this thread:
>>
>> o yes, I do have CONFIG_EVENTFD set, and it is being used
>>too (fd#12 in the above strace).
> 
> I thought that was the signalfd.

Uh. Yes, it was, I confused the two.  And yes, CONFIG_EVENTFD is
set and used too.

$ grep FD= config-host.mak
CONFIG_EVENTFD=y
CONFIG_SIGNALFD=y

>> o 0.13.0-rc1 behaves the same way (that is, it also shows
>>high load when idle -- the same 18% of host CPU), but it
>>has no pipe on fd#5.
> 
> I think it's host_alarm_handler()'s use of qemu_notify_event().  It's
> telling the main loop to rescan pending events, even though it's called
> from the main loop itself.  Please drop it and rerun.

Without qemu_notify_event() in host_alarm_handler():

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.96   48.184832       13747      3505           ioctl
  0.39    0.191613          25      7745        28 futex
  0.37    0.181032           1    173192           select
  0.09    0.045379           0    980369           clock_gettime
  0.05    0.024362           0    351024    173220 read
  0.05    0.023247           0    487766           gettimeofday
  0.04    0.017996           0    319428           timer_gettime
  0.03    0.013837           0    198267           timer_settime
  0.02    0.010036           0    177790           rt_sigaction
  0.00    0.000000           0         1           writev
  0.00    0.000000           0         2           poll
  0.00    0.000000           0         1           rt_sigpending
  0.00    0.000000           0         1         1 rt_sigtimedwait
------ ----------- ----------- --------- --------- ----------------
100.00   48.692334               2699091    173249 total

The picture is pretty similar to the one before :)
(And yes, I'm sure I've run the right binary).

But it looks like we're barking up the wrong tree.  It's spending
99% time in ioctl.  Yes, there's quite large amount of reads
and gettimes, but they're taking very small time, even combined.

It is still spending much more time in the ioctl (apparently in
kvm_run).

For comparison, here's the same strace stats without -usbdevice:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.70    0.080237          22      3584           select
  1.09    0.000895           0      6018      3584 read
  0.33    0.000271           0      9670           clock_gettime
  0.31    0.000254           0      6086           gettimeofday
  0.26    0.000210           0      2432           rt_sigaction
  0.17    0.000137           0      3653           timer_gettime
  0.15    0.000122           0      2778           timer_settime
  0.00    0.000000           0         1           ioctl
  0.00    0.000000           0         1           rt_sigpending
  0.00    0.000000           0         1         1 rt_sigtimedwait
------ ----------- ----------- --------- --------- ----------------
100.00    0.082126                 34224      3585 total

Yes, it is still doing lots of unnecessary stuff, but the load
is <1%.

(This is without qemu_notify_event() in host_alarm_handler())

/mjt


Re: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()

2010-09-14 Thread Roedel, Joerg
On Tue, Sep 14, 2010 at 11:58:03AM -0400, Alexander Graf wrote:

> > +if (kvm_enabled())
> > +cpu_model = DEFAULT_KVM_CPU_MODEL;
> > +else
> > +cpu_model = DEFAULT_QEMU_CPU_MODEL;
> >   
> 
> Braces :(.

Okay, here is the new patch:

From f49e78edbd4143d05128228d9cc24bd5abc3abf1 Mon Sep 17 00:00:00 2001
From: Joerg Roedel 
Date: Tue, 14 Sep 2010 16:52:11 +0200
Subject: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()

As requested by Alex this patch makes kvm64 the default CPU
model when qemu is started with -enable-kvm.

Signed-off-by: Joerg Roedel 
---
 hw/pc.c |   20 +++-
 1 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 69b13bf..a6355f3 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -40,6 +40,16 @@
 #include "sysbus.h"
 #include "sysemu.h"
 #include "blockdev.h"
+#include "kvm.h"
+
+
+#ifdef TARGET_X86_64
+#define DEFAULT_KVM_CPU_MODEL "kvm64"
+#define DEFAULT_QEMU_CPU_MODEL "qemu64"
+#else
+#define DEFAULT_KVM_CPU_MODEL "kvm32"
+#define DEFAULT_QEMU_CPU_MODEL "qemu32"
+#endif
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -867,11 +877,11 @@ void pc_cpus_init(const char *cpu_model)
 
 /* init CPUs */
 if (cpu_model == NULL) {
-#ifdef TARGET_X86_64
-cpu_model = "qemu64";
-#else
-cpu_model = "qemu32";
-#endif
+if (kvm_enabled()) {
+cpu_model = DEFAULT_KVM_CPU_MODEL;
+} else {
+cpu_model = DEFAULT_QEMU_CPU_MODEL;
+}
 }
 
 for(i = 0; i < smp_cpus; i++) {
-- 
1.7.0.4


-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: Exceed 1GB/s with virtio-net ?

2010-09-14 Thread Thibault VINCENT
On 13/09/2010 19:34, Alex Williamson wrote:
> On Mon, Sep 13, 2010 at 4:32 AM, Thibault VINCENT wrote:
>> Hello
>>
>> I'm trying to achieve higher than gigabit transfers over a virtio NIC
>> with no success, and I can't find a recent bug or discussion about such
>> an issue.
>>
>> The simpler test consist of two VM running on a high-end blade server
>> with 4 cores and 4GB RAM each, and a virtio NIC dedicated to the
>> inter-VM communication. On the host, the two vnet interfaces are
>> enslaved into a bridge. I use a combination of 2.6.35 on the host and
>> 2.6.32 in the VMs.
>> Running iperf or netperf on these VMs, with TCP or UDP, results in
>> ~900Mbit/s transfers. This is what could be expected of a 1G
>> interface, and indeed the e1000 emulation performs similarly.
>>
>> Changing the txqueuelen, MTU, and offloading settings on every interface
>> (bridge/tap/virtio_net) didn't improve the speed, nor did the
>> installation of irqbalance and the increase in CPU and RAM.
>>
>> Is this normal? Is the multiple queue patch intended to address this?
>> It's quite possible I missed something :)
> 
> I'm able to achieve quite a bit more than 1Gbps using virtio-net
> between 2 guests on the same host connected via an internal bridge.
> With the virtio-net TX bottom half handler I can easily hit 7Gbps TCP
> and 10+Gbps UDP using netperf (TCP_STREAM/UDP_STREAM tests).  Even
> without the bottom half patches (not yet in qemu-kvm.git), I can get
> ~5Gbps.  Maybe you could describe your setup further, host details,
> bridge setup, guests, specific tests, etc...  Thanks,

Thanks Alex, I don't use the bottom half patches but anything between
3Gbps and 5Gbps would be fine. Here are some more details:

Host
-
Dell M610 ; 2 x Xeon X5650 ; 6 x 8GB
Debian Squeeze amd64
qemu-kvm 0.12.5+dfsg-1
kernel 2.6.35-1 amd64 (Debian Experimental)

Guests
---
Debian Squeeze amd64
kernel 2.6.35-1 amd64 (Debian Experimental)

To measure the throughput between the guests, I do the following.

On the host:
 * create a bridge
   # brctl addbr br_test
   # ifconfig br_test 1.1.1.1 up
 * start two guests
   # kvm -enable-kvm -m 4096 -smp 4 -drive
file=/dev/vg/deb0,id=0,boot=on,format=raw -device
virtio-blk-pci,drive=0,id=0 -device
virtio-net-pci,vlan=0,id=1,mac=52:54:00:cf:6a:b0 -net
tap,vlan=0,name=hostnet0
   # kvm -enable-kvm -m 4096 -smp 4 -drive
file=/dev/vg/deb1,id=0,boot=on,format=raw -device
virtio-blk-pci,drive=0,id=0 -device
virtio-net-pci,vlan=0,id=1,mac=52:54:00:cf:6a:b1 -net
tap,vlan=0,name=hostnet0
 * add guests to the bridge
   # brctl addif br_test tap0
   # brctl addif br_test tap1

On the first guest:
 # ifconfig eth0 1.1.1.2 up
 # iperf -s -i 1

On the second guest:
 # ifconfig eth0 1.1.1.3 up
 # iperf -c 1.1.1.2 -i 1

Client connecting to 1.1.1.2, TCP port 5001
TCP window size: 16.0 KByte (default)

[  3] local 1.1.1.3 port 43510 connected with 1.1.1.2 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0- 1.0 sec  80.7 MBytes   677 Mbits/sec
[  3]  1.0- 2.0 sec   102 MBytes   855 Mbits/sec
[  3]  2.0- 3.0 sec   101 MBytes   847 Mbits/sec
[  3]  3.0- 4.0 sec   104 MBytes   873 Mbits/sec
[  3]  4.0- 5.0 sec   104 MBytes   874 Mbits/sec
[  3]  5.0- 6.0 sec   105 MBytes   881 Mbits/sec
[  3]  6.0- 7.0 sec   103 MBytes   862 Mbits/sec
[  3]  7.0- 8.0 sec   101 MBytes   848 Mbits/sec
[  3]  8.0- 9.0 sec   105 MBytes   878 Mbits/sec
[  3]  9.0-10.0 sec   105 MBytes   882 Mbits/sec
[  3]  0.0-10.0 sec  1011 MBytes   848 Mbits/sec

On the host again:
 # iperf -c 1.1.1.1 -i 1

Client connecting to 1.1.1.1, TCP port 5001
TCP window size: 16.0 KByte (default)

[  3] local 1.1.1.3 port 60456 connected with 1.1.1.1 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0- 1.0 sec  97.9 MBytes   821 Mbits/sec
[  3]  1.0- 2.0 sec   136 MBytes  1.14 Gbits/sec
[  3]  2.0- 3.0 sec   153 MBytes  1.28 Gbits/sec
[  3]  3.0- 4.0 sec   160 MBytes  1.34 Gbits/sec
[  3]  4.0- 5.0 sec   156 MBytes  1.31 Gbits/sec
[  3]  5.0- 6.0 sec   122 MBytes  1.02 Gbits/sec
[  3]  6.0- 7.0 sec   121 MBytes  1.02 Gbits/sec
[  3]  7.0- 8.0 sec   137 MBytes  1.15 Gbits/sec
[  3]  8.0- 9.0 sec   139 MBytes  1.17 Gbits/sec
[  3]  9.0-10.0 sec   140 MBytes  1.17 Gbits/sec
[  3]  0.0-10.0 sec  1.33 GBytes  1.14 Gbits/sec


You can see it's quite slow compared to your figures, between the guests
and with the host too. And there is no specific load on any of the three
systems; htop in a guest only reports one of the four cores going up to
70% (sys+user+wait) during the test.

The other tests I mentioned are:
 * iperf or netperf over UDP: maybe 10% faster, no more
 * interface settings: very little effect
   # ifconfig [br_test,tap0,tap1,eth0] txqueuelen 2
   # ifcon

Re: high load with usb device

2010-09-14 Thread David S. Ahern


On 09/14/10 10:00, Michael Tokarev wrote:
> 14.09.2010 19:51, David S. Ahern wrote:
>> cat /sys/devices/system/clocksource/clocksource0/current_clocksource
> 
> It's tsc (AthlonII CPU).  Also available are hpet and acpi_pm.
> Switching to hpet or acpi_pm does not have visible effect, at
> least not while the guest is running.


acpi_pm takes more cycles to read. On my laptop switching from hpet to
acpi_pm caused the winxp VM to jump up in CPU usage. For both time
sources 'perf top -p ' shows timer reads as the top
function for qemu-kvm (e.g., read_hpet).

On a Nehalem box the clock source is TSC. 'perf top -p ' for a
winxp VM does not show clock reads at all.

David


> 
> Thanks!
> 
> /mjt


Re: [PATCH] vhost: max s/g to match qemu

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 11:53:05PM +0800, Jason Wang wrote:
> Qemu supports up to UIO_MAXIOV s/g so we have to match that because guest
> drivers may rely on this.
> 
> Allocate indirect and log arrays dynamically to avoid using too much
> contiguous memory, and make the length of the hdr array match the header
> length, since each iovec entry has at least one byte.
> 
> Test with copying large files w/ and w/o migration in both linux and windows
> guests.
> 
> Signed-off-by: Jason Wang 

Looks good, I'll queue this up for 2.6.37.
Thanks!

> ---
>  drivers/vhost/net.c   |2 +-
>  drivers/vhost/vhost.c |   49 
> -
>  drivers/vhost/vhost.h |   18 --
>  3 files changed, 57 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 29e850a..e828ef1 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -243,7 +243,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>   int r, nlogs = 0;
>  
>   while (datalen > 0) {
> - if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
> + if (unlikely(headcount >= UIO_MAXIOV)) {
>   r = -ENOBUFS;
>   goto err;
>   }
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c579dcc..a45270e 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -212,6 +212,45 @@ static int vhost_worker(void *data)
>   }
>  }
>  
> +/* Helper to allocate iovec buffers for all vqs. */
> +static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> +{
> + int i;
> + for (i = 0; i < dev->nvqs; ++i) {
> + dev->vqs[i].indirect = kmalloc(sizeof *dev->vqs[i].indirect *
> +UIO_MAXIOV, GFP_KERNEL);
> + dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
> +   GFP_KERNEL);
> + dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
> + UIO_MAXIOV, GFP_KERNEL);
> +
> + if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> + !dev->vqs[i].heads)
> + goto err_nomem;
> + }
> + return 0;
> +err_nomem:
> + for (; i >= 0; --i) {
> + kfree(dev->vqs[i].indirect);
> + kfree(dev->vqs[i].log);
> + kfree(dev->vqs[i].heads);

We probably want to assign NULL values here, same as below.
I have fixed this up in my tree.

> + }
> + return -ENOMEM;
> +}
> +
> +static void vhost_dev_free_iovecs(struct vhost_dev *dev)
> +{
> + int i;
> + for (i = 0; i < dev->nvqs; ++i) {
> + kfree(dev->vqs[i].indirect);
> + dev->vqs[i].indirect = NULL;
> + kfree(dev->vqs[i].log);
> + dev->vqs[i].log = NULL;
> + kfree(dev->vqs[i].heads);
> + dev->vqs[i].heads = NULL;
> + }
> +}
> +
>  long vhost_dev_init(struct vhost_dev *dev,
>   struct vhost_virtqueue *vqs, int nvqs)
>  {
> @@ -229,6 +268,9 @@ long vhost_dev_init(struct vhost_dev *dev,
>   dev->worker = NULL;
>  
>   for (i = 0; i < dev->nvqs; ++i) {
> + dev->vqs[i].log = NULL;
> + dev->vqs[i].indirect = NULL;
> + dev->vqs[i].heads = NULL;
>   dev->vqs[i].dev = dev;
>   mutex_init(&dev->vqs[i].mutex);
>   vhost_vq_reset(dev, dev->vqs + i);
> @@ -295,6 +337,10 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
>   if (err)
>   goto err_cgroup;
>  
> + err = vhost_dev_alloc_iovecs(dev);
> + if (err)
> + goto err_cgroup;
> +
>   return 0;
>  err_cgroup:
>   kthread_stop(worker);
> @@ -345,6 +391,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>   fput(dev->vqs[i].call);
>   vhost_vq_reset(dev, dev->vqs + i);
>   }
> + vhost_dev_free_iovecs(dev);
>   if (dev->log_ctx)
>   eventfd_ctx_put(dev->log_ctx);
>   dev->log_ctx = NULL;
> @@ -946,7 +993,7 @@ static int get_indirect(struct vhost_dev *dev, struct 
> vhost_virtqueue *vq,
>   }
>  
>   ret = translate_desc(dev, indirect->addr, indirect->len, vq->indirect,
> -  ARRAY_SIZE(vq->indirect));
> +  UIO_MAXIOV);
>   if (unlikely(ret < 0)) {
>   vq_err(vq, "Translation failure %d in indirect.\n", ret);
>   return ret;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index afd7729..edc8929 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -15,11 +15,6 @@
>  
>  struct vhost_device;
>  
> -enum {
> - /* Enough place for all fragments, head, and virtio net header. */
> - VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
> -};
> -
>  struct vhost_work;
>  typedef void (*vhost_work_fn_t)(struct vhost_work *work);
>  
> @@ -93,12 +88,15 @@ struct vh

Re: high load with usb device

2010-09-14 Thread Avi Kivity

 On 09/14/2010 04:53 PM, Michael Tokarev wrote:

14.09.2010 18:45, Avi Kivity wrote:
>>  17:27:23.96 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5
>>  10], left {0, 98})<0.09>
>>  17:27:24.000199 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in
>>  [12], left {0, 998775})<0.001241>
>>  17:27:24.001666 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5
>>  10], left {0, 97})<0.06>
>>  17:27:24.001768 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in
>>  [12], left {0, 32})<0.000103>
>>  17:27:24.001985 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5
>>  10], left {0, 98})<0.05>
>>  17:27:24.002061 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in
>>  [12], left {0, 998407})<0.001617>
>
>  That pipe is doing a lot of damage (I don't have it, and couldn't
>  reproduce your results, another pointer).  Do you have CONFIG_EVENTFD
>  set?  If not, why not?

As I mentioned in other emails in this thread:

o yes, I do have CONFIG_EVENTFD set, and it is being used
   too (fd#12 in the above strace).


I thought that was the signalfd.


o 0.13.0-rc1 behaves the same way (that is, it also shows
   high load when idle -- the same 18% of host CPU), but it
   has no pipe on fd#5.




I think it's host_alarm_handler()'s use of qemu_notify_event().  It's 
telling the main loop to rescan pending events, even though it's called 
from the main loop itself.  Please drop it and rerun.


It booted for me and seems to work.

Marcelo, it's safe to remove it, yes? (except for tcg or upstream 
without iothread).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 17:22 +0200, Michael S. Tsirkin wrote:
> I would expect this to hurt performance significantly.
> We could do this for asynchronous requests only to avoid the
> slowdown. 

Is kiocb in sendmsg helpful here? It is not used now.

Shirley



Re: high load with usb device

2010-09-14 Thread Michael Tokarev
14.09.2010 19:51, David S. Ahern пишет:
> cat /sys/devices/system/clocksource/clocksource0/current_clocksource

It's tsc (AthlonII CPU).  Also available are hpet and acpi_pm.
Switching to hpet or acpi_pm does not have visible effect, at
least not while the guest is running.

Thanks!

/mjt


[PATCH 2/7] svm: Run tests with NPT enabled if available

2010-09-14 Thread Joerg Roedel
This patch adds code to setup a nested page table which is
used for all tests.

Signed-off-by: Joerg Roedel 
---
 x86/svm.c |   60 
 1 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 689880d..7c7909e 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -6,12 +6,67 @@
 #include "smp.h"
 #include "types.h"
 
+/* for the nested page table */
+u64 *pml4e;
+u64 *pdpe;
+u64 *pde[4];
+u64 *pte[2048];
+
+static bool npt_supported(void)
+{
+   return cpuid(0x800A).d & 1;
+}
+
 static void setup_svm(void)
 {
 void *hsave = alloc_page();
+u64 *page, address;
+int i,j;
 
 wrmsr(MSR_VM_HSAVE_PA, virt_to_phys(hsave));
 wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_SVME);
+
+if (!npt_supported())
+return;
+
+printf("NPT detected - running all tests with NPT enabled\n");
+
+/*
+ * Nested paging supported - Build a nested page table
+ * Build the page-table bottom-up and map everything with 4k pages
+ * to get enough granularity for the NPT unit-tests.
+ */
+
+address = 0;
+
+/* PTE level */
+for (i = 0; i < 2048; ++i) {
+page = alloc_page();
+
+for (j = 0; j < 512; ++j, address += 4096)
+page[j] = address | 0x067ULL;
+
+pte[i] = page;
+}
+
+/* PDE level */
+for (i = 0; i < 4; ++i) {
+page = alloc_page();
+
+for (j = 0; j < 512; ++j)
+page[j] = (u64)pte[(i * 512) + j] | 0x027ULL;
+
+pde[i] = page;
+}
+
+/* PDPe level */
+pdpe   = alloc_page();
+for (i = 0; i < 4; ++i)
+   pdpe[i] = ((u64)(pde[i])) | 0x27;
+
+/* PML4e level */
+pml4e= alloc_page();
+pml4e[0] = ((u64)pdpe) | 0x27;
 }
 
 static void vmcb_set_seg(struct vmcb_seg *seg, u16 selector,
@@ -56,6 +111,11 @@ static void vmcb_ident(struct vmcb *vmcb)
 save->g_pat = rdmsr(MSR_IA32_CR_PAT);
 save->dbgctl = rdmsr(MSR_IA32_DEBUGCTLMSR);
 ctrl->intercept = (1ULL << INTERCEPT_VMRUN) | (1ULL << INTERCEPT_VMMCALL);
+
+if (npt_supported()) {
+ctrl->nested_ctl = 1;
+ctrl->nested_cr3 = (u64)pml4e;
+}
 }
 
 struct test {
-- 
1.7.0.4




[PATCH 6/7] svm: Add test for RW bit check in emulated NPT

2010-09-14 Thread Joerg Roedel
This patch adds a test to check if the RW bit is checked in
the NPT emulation of KVM.

Signed-off-by: Joerg Roedel 
---
 x86/svm.c |   30 ++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 03e07e2..3421736 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -532,6 +532,34 @@ static bool npt_rsvd_check(struct test *test)
 && (test->vmcb->control.exit_info_1 == 0x0f);
 }
 
+static void npt_rw_prepare(struct test *test)
+{
+
+u64 *pte;
+
+vmcb_ident(test->vmcb);
+pte = get_pte(0x8);
+
+*pte &= ~(1ULL << 1);
+}
+
+static void npt_rw_test(struct test *test)
+{
+u64 *data = (void*)(0x8);
+
+*data = 0;
+}
+
+static bool npt_rw_check(struct test *test)
+{
+u64 *pte = get_pte(0x8);
+
+*pte |= (1ULL << 1);
+
+return (test->vmcb->control.exit_code == SVM_EXIT_NPF)
+   && (test->vmcb->control.exit_info_1 == 0x07);
+}
+
 static struct test tests[] = {
 { "null", default_supported, default_prepare, null_test,
   default_finished, null_check },
@@ -560,6 +588,8 @@ static struct test tests[] = {
default_finished, npt_us_check },
 { "npt_rsvd", npt_supported, npt_rsvd_prepare, null_test,
default_finished, npt_rsvd_check },
+{ "npt_rw", npt_supported, npt_rw_prepare, npt_rw_test,
+   default_finished, npt_rw_check },
 };
 
 int main(int ac, char **av)
-- 
1.7.0.4




[PATCH 0/7] New Unit-Tests for KVM SVM emulation v2

2010-09-14 Thread Joerg Roedel
Hi Avi,

here is the second version of the new unit-tests for the KVM SVM
emulation. The changes to the previous version are really minor:

* Fixed coding-style
* Fixed comment in the code that builds the nested page table
* Renamed the sel_cr0 test to sel_cr0_bug, to make room for a real sel_cr0
  test later which checks whether the feature itself works

All-in-all, not a lot of changes. I re-ran all tests and they still all
PASS.

Joerg




[PATCH 3/7] svm: Add test for NX bit check in emulated NPT

2010-09-14 Thread Joerg Roedel
This patch adds a test to check if the NX bit is checked in
the NPT emulation of KVM.

Signed-off-by: Joerg Roedel 
---
 x86/svm.c |   37 +
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 7c7909e..05e15b1 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -25,6 +25,7 @@ static void setup_svm(void)
 
 wrmsr(MSR_VM_HSAVE_PA, virt_to_phys(hsave));
 wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_SVME);
+wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX);
 
 if (!npt_supported())
 return;
@@ -69,6 +70,17 @@ static void setup_svm(void)
 pml4e[0] = ((u64)pdpe) | 0x27;
 }
 
+static u64 *get_pte(u64 address)
+{
+int i1, i2;
+
+address >>= 12;
+i1 = (address >> 9) & 0x7ff;
+i2 = address & 0x1ff;
+
+return &pte[i1][i2];
+}
+
 static void vmcb_set_seg(struct vmcb_seg *seg, u16 selector,
  u64 base, u32 limit, u32 attr)
 {
@@ -451,6 +463,29 @@ static bool sel_cr0_bug_check(struct test *test)
 return test->vmcb->control.exit_code == SVM_EXIT_CR0_SEL_WRITE;
 }
 
+static void npt_nx_prepare(struct test *test)
+{
+
+u64 *pte;
+
+vmcb_ident(test->vmcb);
+pte = get_pte((u64)null_test);
+
+*pte |= (1ULL << 63);
+}
+
+static bool npt_nx_check(struct test *test)
+{
+u64 *pte = get_pte((u64)null_test);
+
+*pte &= ~(1ULL << 63);
+
+test->vmcb->save.efer |= (1 << 11);
+
+return (test->vmcb->control.exit_code == SVM_EXIT_NPF)
+   && (test->vmcb->control.exit_info_1 == 0x15);
+}
+
 static struct test tests[] = {
 { "null", default_supported, default_prepare, null_test,
   default_finished, null_check },
@@ -473,6 +508,8 @@ static struct test tests[] = {
default_finished, check_asid_zero },
 { "sel_cr0_bug", default_supported, sel_cr0_bug_prepare, sel_cr0_bug_test,
sel_cr0_bug_finished, sel_cr0_bug_check },
+{ "npt_nx", npt_supported, npt_nx_prepare, null_test,
+   default_finished, npt_nx_check }
 };
 
 int main(int ac, char **av)
-- 
1.7.0.4




[PATCH 4/7] svm: Add test for US bit check in emulated NPT

2010-09-14 Thread Joerg Roedel
This patch adds a test to check if the US bit is checked in
the NPT emulation of KVM.

Signed-off-by: Joerg Roedel 
---
 x86/svm.c |   34 +-
 1 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 05e15b1..04ca028 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -11,6 +11,7 @@ u64 *pml4e;
 u64 *pdpe;
 u64 *pde[4];
 u64 *pte[2048];
+u64 *scratch_page;
 
 static bool npt_supported(void)
 {
@@ -27,6 +28,8 @@ static void setup_svm(void)
 wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_SVME);
 wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_NX);
 
+scratch_page = alloc_page();
+
 if (!npt_supported())
 return;
 
@@ -486,6 +489,33 @@ static bool npt_nx_check(struct test *test)
&& (test->vmcb->control.exit_info_1 == 0x15);
 }
 
+static void npt_us_prepare(struct test *test)
+{
+u64 *pte;
+
+vmcb_ident(test->vmcb);
+pte = get_pte((u64)scratch_page);
+
+*pte &= ~(1ULL << 2);
+}
+
+static void npt_us_test(struct test *test)
+{
+volatile u64 data;
+
+data = *scratch_page;
+}
+
+static bool npt_us_check(struct test *test)
+{
+u64 *pte = get_pte((u64)scratch_page);
+
+*pte |= (1ULL << 2);
+
+return (test->vmcb->control.exit_code == SVM_EXIT_NPF)
+   && (test->vmcb->control.exit_info_1 == 0x05);
+}
+
 static struct test tests[] = {
 { "null", default_supported, default_prepare, null_test,
   default_finished, null_check },
@@ -509,7 +539,9 @@ static struct test tests[] = {
 { "sel_cr0_bug", default_supported, sel_cr0_bug_prepare, sel_cr0_bug_test,
sel_cr0_bug_finished, sel_cr0_bug_check },
 { "npt_nx", npt_supported, npt_nx_prepare, null_test,
-   default_finished, npt_nx_check }
+   default_finished, npt_nx_check },
+{ "npt_us", npt_supported, npt_us_prepare, npt_us_test,
+   default_finished, npt_us_check },
 };
 
 int main(int ac, char **av)
-- 
1.7.0.4




[PATCH 7/7] svm: Add test for the NPT page table walker

2010-09-14 Thread Joerg Roedel
This patch adds a test to check if NPT faults that occur
while walking the guest page table are reported correctly.

Signed-off-by: Joerg Roedel 
---
 x86/svm.c |   24 
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 3421736..dc3098f 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -560,6 +560,28 @@ static bool npt_rw_check(struct test *test)
&& (test->vmcb->control.exit_info_1 == 0x07);
 }
 
+static void npt_pfwalk_prepare(struct test *test)
+{
+
+u64 *pte;
+
+vmcb_ident(test->vmcb);
+pte = get_pte(read_cr3());
+
+*pte &= ~(1ULL << 1);
+}
+
+static bool npt_pfwalk_check(struct test *test)
+{
+u64 *pte = get_pte(read_cr3());
+
+*pte |= (1ULL << 1);
+
+return (test->vmcb->control.exit_code == SVM_EXIT_NPF)
+   && (test->vmcb->control.exit_info_1 == 0x7)
+  && (test->vmcb->control.exit_info_2 == read_cr3());
+}
+
 static struct test tests[] = {
 { "null", default_supported, default_prepare, null_test,
   default_finished, null_check },
@@ -590,6 +612,8 @@ static struct test tests[] = {
default_finished, npt_rsvd_check },
 { "npt_rw", npt_supported, npt_rw_prepare, npt_rw_test,
default_finished, npt_rw_check },
+{ "npt_pfwalk", npt_supported, npt_pfwalk_prepare, null_test,
+   default_finished, npt_pfwalk_check },
 };
 
 int main(int ac, char **av)
-- 
1.7.0.4




[PATCH 1/7] svm: Add test for selective cr0 intercept

2010-09-14 Thread Joerg Roedel
This patch adds a test to check if the selective cr0
intercept emulation of the kvm svm emulation works.

Signed-off-by: Joerg Roedel 
---
 x86/svm.c |   37 -
 1 files changed, 36 insertions(+), 1 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 2f1c900..689880d 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -357,6 +357,40 @@ static bool check_asid_zero(struct test *test)
 return test->vmcb->control.exit_code == SVM_EXIT_ERR;
 }
 
+static void sel_cr0_bug_prepare(struct test *test)
+{
+vmcb_ident(test->vmcb);
+test->vmcb->control.intercept |= (1ULL << INTERCEPT_SELECTIVE_CR0);
+}
+
+static bool sel_cr0_bug_finished(struct test *test)
+{
+   return true;
+}
+
+static void sel_cr0_bug_test(struct test *test)
+{
+unsigned long cr0;
+
+/* read cr0, set CD, and write back */
+cr0  = read_cr0();
+cr0 |= (1UL << 30);
+write_cr0(cr0);
+
+/*
+ * If we are here the test failed, not sure what to do now because we
+ * are not in guest-mode anymore so we can't trigger an intercept.
+ * Trigger a triple-fault for now.
+ */
+printf("sel_cr0 test failed. Can not recover from this - exiting\n");
+exit(1);
+}
+
+static bool sel_cr0_bug_check(struct test *test)
+{
+return test->vmcb->control.exit_code == SVM_EXIT_CR0_SEL_WRITE;
+}
+
 static struct test tests[] = {
 { "null", default_supported, default_prepare, null_test,
   default_finished, null_check },
@@ -377,7 +411,8 @@ static struct test tests[] = {
mode_switch_finished, check_mode_switch },
 { "asid_zero", default_supported, prepare_asid_zero, test_asid_zero,
default_finished, check_asid_zero },
-
+{ "sel_cr0_bug", default_supported, sel_cr0_bug_prepare, sel_cr0_bug_test,
+   sel_cr0_bug_finished, sel_cr0_bug_check },
 };
 
 int main(int ac, char **av)
-- 
1.7.0.4




[PATCH 5/7] svm: Add test for RSVD bit check in emulated NPT

2010-09-14 Thread Joerg Roedel
This patch adds a test to check if the RSVD bits are checked in
the NPT emulation of KVM.

Signed-off-by: Joerg Roedel 
---
 x86/svm.c |   18 ++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/x86/svm.c b/x86/svm.c
index 04ca028..03e07e2 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -516,6 +516,22 @@ static bool npt_us_check(struct test *test)
&& (test->vmcb->control.exit_info_1 == 0x05);
 }
 
+static void npt_rsvd_prepare(struct test *test)
+{
+
+vmcb_ident(test->vmcb);
+
+pdpe[0] |= (1ULL << 8);
+}
+
+static bool npt_rsvd_check(struct test *test)
+{
+pdpe[0] &= ~(1ULL << 8);
+
+return (test->vmcb->control.exit_code == SVM_EXIT_NPF)
+&& (test->vmcb->control.exit_info_1 == 0x0f);
+}
+
 static struct test tests[] = {
 { "null", default_supported, default_prepare, null_test,
   default_finished, null_check },
@@ -542,6 +558,8 @@ static struct test tests[] = {
default_finished, npt_nx_check },
 { "npt_us", npt_supported, npt_us_prepare, npt_us_test,
default_finished, npt_us_check },
+{ "npt_rsvd", npt_supported, npt_rsvd_prepare, null_test,
+   default_finished, npt_rsvd_check },
 };
 
 int main(int ac, char **av)
-- 
1.7.0.4




Re: [PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()

2010-09-14 Thread Alexander Graf
Joerg Roedel wrote:
> As requested by Alex this patch makes kvm64 the default CPU
> model when qemu is started with -enable-kvm.
>
> Signed-off-by: Joerg Roedel 
> ---
>  hw/pc.c |   19 ++-
>  1 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/hw/pc.c b/hw/pc.c
> index 69b13bf..f531d0d 100644
> --- a/hw/pc.c
> +++ b/hw/pc.c
> @@ -40,6 +40,16 @@
>  #include "sysbus.h"
>  #include "sysemu.h"
>  #include "blockdev.h"
> +#include "kvm.h"
> +
> +
> +#ifdef TARGET_X86_64
> +#define DEFAULT_KVM_CPU_MODEL "kvm64"
> +#define DEFAULT_QEMU_CPU_MODEL "qemu64"
> +#else
> +#define DEFAULT_KVM_CPU_MODEL "kvm32"
> +#define DEFAULT_QEMU_CPU_MODEL "qemu32"
> +#endif
>  
>  /* output Bochs bios info messages */
>  //#define DEBUG_BIOS
> @@ -867,11 +877,10 @@ void pc_cpus_init(const char *cpu_model)
>  
>  /* init CPUs */
>  if (cpu_model == NULL) {
> -#ifdef TARGET_X86_64
> -cpu_model = "qemu64";
> -#else
> -cpu_model = "qemu32";
> -#endif
> +if (kvm_enabled())
> +cpu_model = DEFAULT_KVM_CPU_MODEL;
> +else
> +cpu_model = DEFAULT_QEMU_CPU_MODEL;
>   

Braces :(.


Alex



[PATCH] vhost: max s/g to match qemu

2010-09-14 Thread Jason Wang
Qemu supports up to UIO_MAXIOV s/g entries, so we have to match that because
guest drivers may rely on it.

Allocate indirect and log arrays dynamically to avoid using too much contiguous
memory, and make the length of the hdr array match the header length, since each
iovec entry has at least one byte.

Tested by copying large files with and without migration in both Linux and
Windows guests.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   |2 +-
 drivers/vhost/vhost.c |   49 -
 drivers/vhost/vhost.h |   18 --
 3 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 29e850a..e828ef1 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -243,7 +243,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
int r, nlogs = 0;
 
while (datalen > 0) {
-   if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
+   if (unlikely(headcount >= UIO_MAXIOV)) {
r = -ENOBUFS;
goto err;
}
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c579dcc..a45270e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -212,6 +212,45 @@ static int vhost_worker(void *data)
}
 }
 
+/* Helper to allocate iovec buffers for all vqs. */
+static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
+{
+   int i;
+   for (i = 0; i < dev->nvqs; ++i) {
+   dev->vqs[i].indirect = kmalloc(sizeof *dev->vqs[i].indirect *
+  UIO_MAXIOV, GFP_KERNEL);
+   dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
+ GFP_KERNEL);
+   dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
+   UIO_MAXIOV, GFP_KERNEL);
+
+   if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
+   !dev->vqs[i].heads)
+   goto err_nomem;
+   }
+   return 0;
+err_nomem:
+   for (; i >= 0; --i) {
+   kfree(dev->vqs[i].indirect);
+   kfree(dev->vqs[i].log);
+   kfree(dev->vqs[i].heads);
+   }
+   return -ENOMEM;
+}
+
+static void vhost_dev_free_iovecs(struct vhost_dev *dev)
+{
+   int i;
+   for (i = 0; i < dev->nvqs; ++i) {
+   kfree(dev->vqs[i].indirect);
+   dev->vqs[i].indirect = NULL;
+   kfree(dev->vqs[i].log);
+   dev->vqs[i].log = NULL;
+   kfree(dev->vqs[i].heads);
+   dev->vqs[i].heads = NULL;
+   }
+}
+
 long vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue *vqs, int nvqs)
 {
@@ -229,6 +268,9 @@ long vhost_dev_init(struct vhost_dev *dev,
dev->worker = NULL;
 
for (i = 0; i < dev->nvqs; ++i) {
+   dev->vqs[i].log = NULL;
+   dev->vqs[i].indirect = NULL;
+   dev->vqs[i].heads = NULL;
dev->vqs[i].dev = dev;
mutex_init(&dev->vqs[i].mutex);
vhost_vq_reset(dev, dev->vqs + i);
@@ -295,6 +337,10 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
if (err)
goto err_cgroup;
 
+   err = vhost_dev_alloc_iovecs(dev);
+   if (err)
+   goto err_cgroup;
+
return 0;
 err_cgroup:
kthread_stop(worker);
@@ -345,6 +391,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
fput(dev->vqs[i].call);
vhost_vq_reset(dev, dev->vqs + i);
}
+   vhost_dev_free_iovecs(dev);
if (dev->log_ctx)
eventfd_ctx_put(dev->log_ctx);
dev->log_ctx = NULL;
@@ -946,7 +993,7 @@ static int get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
}
 
ret = translate_desc(dev, indirect->addr, indirect->len, vq->indirect,
-ARRAY_SIZE(vq->indirect));
+UIO_MAXIOV);
if (unlikely(ret < 0)) {
vq_err(vq, "Translation failure %d in indirect.\n", ret);
return ret;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index afd7729..edc8929 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -15,11 +15,6 @@
 
 struct vhost_device;
 
-enum {
-   /* Enough place for all fragments, head, and virtio net header. */
-   VHOST_NET_MAX_SG = MAX_SKB_FRAGS + 2,
-};
-
 struct vhost_work;
 typedef void (*vhost_work_fn_t)(struct vhost_work *work);
 
@@ -93,12 +88,15 @@ struct vhost_virtqueue {
bool log_used;
u64 log_addr;
 
-   struct iovec indirect[VHOST_NET_MAX_SG];
-   struct iovec iov[VHOST_NET_MAX_SG];
-   struct iovec hdr[VHOST_NET_MAX_SG];
+   struct iovec iov[UIO_MAXIOV];
+   /* hdr is used to store the virtio header.
+* Since each iovec has >= 1 byte length, we never ne

[PATCH 2/3] Set cpuid definition to 0 before initializing it

2010-09-14 Thread Joerg Roedel
This patch clears the (stack-allocated) cpuid definition to
0 before actually initializing it.

Signed-off-by: Joerg Roedel 
---
 target-i386/cpuid.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index 04ba8d5..3fcf78f 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -788,6 +788,8 @@ int cpu_x86_register (CPUX86State *env, const char *cpu_model)
 {
 x86_def_t def1, *def = &def1;
 
+memset(def, 0, sizeof(*def));
+
 if (cpu_x86_find_by_name(def, cpu_model) < 0)
 return -1;
 if (def->vendor1) {
-- 
1.7.0.4




[PATCH 3/3] Add svm cpuid features

2010-09-14 Thread Joerg Roedel
This patch adds the svm cpuid feature flags to the qemu
initialization path. It also adds the svm features available
on phenom to its cpu-definition and extends the host cpu
type to support all svm features KVM can provide.

Signed-off-by: Joerg Roedel 
---
 target-i386/cpu.h   |   12 
 target-i386/cpuid.c |   77 +++---
 target-i386/kvm.c   |3 ++
 3 files changed, 75 insertions(+), 17 deletions(-)

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 1144d4e..77eeab1 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -405,6 +405,17 @@
 #define CPUID_EXT3_IBS (1 << 10)
 #define CPUID_EXT3_SKINIT  (1 << 12)
 
+#define CPUID_SVM_NPT  (1 << 0)
+#define CPUID_SVM_LBRV (1 << 1)
+#define CPUID_SVM_SVMLOCK  (1 << 2)
+#define CPUID_SVM_NRIPSAVE (1 << 3)
+#define CPUID_SVM_TSCSCALE (1 << 4)
+#define CPUID_SVM_VMCBCLEAN(1 << 5)
+#define CPUID_SVM_FLUSHASID(1 << 6)
+#define CPUID_SVM_DECODEASSIST (1 << 7)
+#define CPUID_SVM_PAUSEFILTER  (1 << 10)
+#define CPUID_SVM_PFTHRESHOLD  (1 << 12)
+
 #define CPUID_VENDOR_INTEL_1 0x756e6547 /* "Genu" */
 #define CPUID_VENDOR_INTEL_2 0x49656e69 /* "ineI" */
 #define CPUID_VENDOR_INTEL_3 0x6c65746e /* "ntel" */
@@ -702,6 +713,7 @@ typedef struct CPUX86State {
 uint8_t has_error_code;
 uint32_t sipi_vector;
 uint32_t cpuid_kvm_features;
+uint32_t cpuid_svm_features;
 
 /* in order to simplify APIC support, we leave this pointer to the
user */
diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index 3fcf78f..8e67af0 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -79,6 +79,17 @@ static const char *kvm_feature_name[] = {
 NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
 };
 
+static const char *svm_feature_name[] = {
+"npt", "lbrv", "svm_lock", "nrip_save",
+"tsc_scale", "vmcb_clean",  "flushbyasid", "decodeassists",
+NULL, NULL, "pause_filter", NULL,
+"pfthreshold", NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 /* collects per-function cpuid data
  */
 typedef struct model_features_t {
@@ -192,13 +203,15 @@ static void add_flagname_to_bitmaps(const char *flagname, 
uint32_t *features,
 uint32_t *ext_features,
 uint32_t *ext2_features,
 uint32_t *ext3_features,
-uint32_t *kvm_features)
+uint32_t *kvm_features,
+uint32_t *svm_features)
 {
 if (!lookup_feature(features, flagname, NULL, feature_name) &&
 !lookup_feature(ext_features, flagname, NULL, ext_feature_name) &&
 !lookup_feature(ext2_features, flagname, NULL, ext2_feature_name) &&
 !lookup_feature(ext3_features, flagname, NULL, ext3_feature_name) &&
-!lookup_feature(kvm_features, flagname, NULL, kvm_feature_name))
+!lookup_feature(kvm_features, flagname, NULL, kvm_feature_name) &&
+!lookup_feature(svm_features, flagname, NULL, svm_feature_name))
 fprintf(stderr, "CPU feature %s not found\n", flagname);
 }
 
@@ -210,7 +223,8 @@ typedef struct x86_def_t {
 int family;
 int model;
 int stepping;
-uint32_t features, ext_features, ext2_features, ext3_features, kvm_features;
+uint32_t features, ext_features, ext2_features, ext3_features;
+uint32_t kvm_features, svm_features;
 uint32_t xlevel;
 char model_id[48];
 int vendor_override;
@@ -253,6 +267,7 @@ typedef struct x86_def_t {
   CPUID_EXT2_PDPE1GB */
 #define TCG_EXT3_FEATURES (CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM | \
   CPUID_EXT3_CR8LEG | CPUID_EXT3_ABM | CPUID_EXT3_SSE4A)
+#define TCG_SVM_FEATURES 0
 
 /* maintains list of cpu model definitions
  */
@@ -305,6 +320,7 @@ static x86_def_t builtin_x86_defs[] = {
 CPUID_EXT3_OSVW, CPUID_EXT3_IBS */
 .ext3_features = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM |
 CPUID_EXT3_ABM | CPUID_EXT3_SSE4A,
+.svm_features = CPUID_SVM_NPT | CPUID_SVM_LBRV | CPUID_SVM_VMCBCLEAN,
 .xlevel = 0x801A,
 .model_id = "AMD Phenom(tm) 9550 Quad-Core Processor"
 },
@@ -505,6 +521,15 @@ static int cpu_x86_fill_host(x86_def_t *x86_cpu_def)
 cpu_x86_fill_model_id(x86_cpu_def->model_id);
 x86_cpu_def->vendor_override = 0;
 
+
+/*
+ * Every SVM feature requires emulation support in KVM - so we can't just
+ * read the host features here. KVM might even support SVM features not
+ * available on the host hardware. Just set all bits and mask out the
+ * unsupported ones later.
+ */
+x86_cpu_def->svm_features = -1;
+
 return 0;
 }
 
@@ -560,8 +585,14 @@ static int cpu_x86_find_by_name(x86_def_t *x86_cpu_def, const char *cpu_model)
 
 char *s = strdup(cpu_model);

[PATCH 1/3] Make kvm64 the default cpu model when kvm_enabled()

2010-09-14 Thread Joerg Roedel
As requested by Alex this patch makes kvm64 the default CPU
model when qemu is started with -enable-kvm.

Signed-off-by: Joerg Roedel 
---
 hw/pc.c |   19 ++-
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 69b13bf..f531d0d 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -40,6 +40,16 @@
 #include "sysbus.h"
 #include "sysemu.h"
 #include "blockdev.h"
+#include "kvm.h"
+
+
+#ifdef TARGET_X86_64
+#define DEFAULT_KVM_CPU_MODEL "kvm64"
+#define DEFAULT_QEMU_CPU_MODEL "qemu64"
+#else
+#define DEFAULT_KVM_CPU_MODEL "kvm32"
+#define DEFAULT_QEMU_CPU_MODEL "qemu32"
+#endif
 
 /* output Bochs bios info messages */
 //#define DEBUG_BIOS
@@ -867,11 +877,10 @@ void pc_cpus_init(const char *cpu_model)
 
 /* init CPUs */
 if (cpu_model == NULL) {
-#ifdef TARGET_X86_64
-cpu_model = "qemu64";
-#else
-cpu_model = "qemu32";
-#endif
+if (kvm_enabled())
+cpu_model = DEFAULT_KVM_CPU_MODEL;
+else
+cpu_model = DEFAULT_QEMU_CPU_MODEL;
 }
 
 for(i = 0; i < smp_cpus; i++) {
-- 
1.7.0.4




[PATCH 0/3] SVM feature support for qemu

2010-09-14 Thread Joerg Roedel
Hi,

here is the next round of the svm feature support patches for qemu. Key
change in this version is that it now makes kvm{64|32} the default cpu
definition for qemu when kvm is enabled (as requested by Alex).
Otherwise I removed the NRIP_SAVE feature from the phenom definition and
set svm_features per default to -1 for -cpu host to support all svm
features that kvm can emulate. I appreciate your comments.

Joerg




Re: high load with usb device

2010-09-14 Thread David S. Ahern
What's your clock source on the host?
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

With the usb tablet device the host clock source is read 2-3 times more
frequently, which for acpi_pm and hpet jacks up the CPU.

David


[PATCH 1/2] KVM: MMU: Don't track nested fault info in error-code

2010-09-14 Thread Joerg Roedel
This patch moves the detection whether a page-fault was
nested or not out of the error code and moves it into a
separate variable in the fault struct.

Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/mmu.h  |1 -
 arch/x86/kvm/x86.c  |   14 --
 3 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3a00741..8a83177 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -322,6 +322,7 @@ struct kvm_vcpu_arch {
struct {
u64  address;
unsigned error_code;
+   bool nested;
} fault;
 
/* only needed in kvm_pv_mmu_op() path, but it's hot so
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 513abbb..7086ca8 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -47,7 +47,6 @@
 #define PFERR_USER_MASK (1U << 2)
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
-#define PFERR_NESTED_MASK (1U << 31)
 
 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3ff0a8f..335519f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -340,18 +340,12 @@ void kvm_inject_page_fault(struct kvm_vcpu *vcpu)
 
 void kvm_propagate_fault(struct kvm_vcpu *vcpu)
 {
-   u32 nested, error;
-
-   error   = vcpu->arch.fault.error_code;
-   nested  = error &  PFERR_NESTED_MASK;
-   error   = error & ~PFERR_NESTED_MASK;
-
-   vcpu->arch.fault.error_code = error;
-
-   if (mmu_is_nested(vcpu) && !nested)
+   if (mmu_is_nested(vcpu) && !vcpu->arch.fault.nested)
vcpu->arch.nested_mmu.inject_page_fault(vcpu);
else
vcpu->arch.mmu.inject_page_fault(vcpu);
+
+   vcpu->arch.fault.nested = false;
 }
 
 void kvm_inject_nmi(struct kvm_vcpu *vcpu)
@@ -3518,7 +3512,7 @@ static gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access)
access |= PFERR_USER_MASK;
t_gpa  = vcpu->arch.mmu.gva_to_gpa(vcpu, gpa, access, &error);
if (t_gpa == UNMAPPED_GVA)
-   vcpu->arch.fault.error_code |= PFERR_NESTED_MASK;
+   vcpu->arch.fault.nested = true;
 
return t_gpa;
 }
-- 
1.7.0.4




[PATCH 0/2] NPT virtualization follow-up

2010-09-14 Thread Joerg Roedel
Hi Avi, Marcelo,

this patch-set includes two follow-on patches to the npt virtualization patch
set merged recently. These are the patches requested by Avi in his review of
the v4 npt virtualization patch-set.

Joerg




[PATCH 2/2] KVM: MMU: Use base_role.nxe for mmu.nx

2010-09-14 Thread Joerg Roedel
This patch removes the mmu.nx field and uses the equivalent
field mmu.base_role.nxe instead.

Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/kvm_host.h |2 --
 arch/x86/kvm/mmu.c  |   27 +--
 arch/x86/kvm/paging_tmpl.h  |4 ++--
 arch/x86/kvm/x86.c  |3 ---
 4 files changed, 15 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8a83177..50506be 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -259,8 +259,6 @@ struct kvm_mmu {
u64 *lm_root;
u64 rsvd_bits_mask[2][4];
 
-   bool nx;
-
u64 pdptrs[4]; /* pae */
 };
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3ce56bf..21d2983 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -238,7 +238,7 @@ static int is_cpuid_PSE36(void)
 
 static int is_nx(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.efer & EFER_NX;
+   return !!(vcpu->arch.efer & EFER_NX);
 }
 
 static int is_shadow_present_pte(u64 pte)
@@ -2634,7 +2634,7 @@ static int nonpaging_init_context(struct kvm_vcpu *vcpu,
context->shadow_root_level = PT32E_ROOT_LEVEL;
context->root_hpa = INVALID_PAGE;
context->direct_map = true;
-   context->nx = false;
+   context->base_role.nxe = 0;
return 0;
 }
 
@@ -2688,7 +2688,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
int maxphyaddr = cpuid_maxphyaddr(vcpu);
u64 exb_bit_rsvd = 0;
 
-   if (!context->nx)
+   if (!context->base_role.nxe)
exb_bit_rsvd = rsvd_bits(63, 63);
switch (level) {
case PT32_ROOT_LEVEL:
@@ -2747,7 +2747,7 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu,
struct kvm_mmu *context,
int level)
 {
-   context->nx = is_nx(vcpu);
+   context->base_role.nxe = is_nx(vcpu);
 
reset_rsvds_bits_mask(vcpu, context, level);
 
@@ -2775,7 +2775,7 @@ static int paging64_init_context(struct kvm_vcpu *vcpu,
 static int paging32_init_context(struct kvm_vcpu *vcpu,
 struct kvm_mmu *context)
 {
-   context->nx = false;
+   context->base_role.nxe = 0;
 
reset_rsvds_bits_mask(vcpu, context, PT32_ROOT_LEVEL);
 
@@ -2815,24 +2815,23 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
context->set_cr3 = kvm_x86_ops->set_tdp_cr3;
context->get_cr3 = get_cr3;
context->inject_page_fault = kvm_inject_page_fault;
-   context->nx = is_nx(vcpu);
 
if (!is_paging(vcpu)) {
-   context->nx = false;
+   context->base_role.nxe = 0;
context->gva_to_gpa = nonpaging_gva_to_gpa;
context->root_level = 0;
} else if (is_long_mode(vcpu)) {
-   context->nx = is_nx(vcpu);
+   context->base_role.nxe = is_nx(vcpu);
reset_rsvds_bits_mask(vcpu, context, PT64_ROOT_LEVEL);
context->gva_to_gpa = paging64_gva_to_gpa;
context->root_level = PT64_ROOT_LEVEL;
} else if (is_pae(vcpu)) {
-   context->nx = is_nx(vcpu);
+   context->base_role.nxe = is_nx(vcpu);
reset_rsvds_bits_mask(vcpu, context, PT32E_ROOT_LEVEL);
context->gva_to_gpa = paging64_gva_to_gpa;
context->root_level = PT32E_ROOT_LEVEL;
} else {
-   context->nx = false;
+   context->base_role.nxe = 0;
reset_rsvds_bits_mask(vcpu, context, PT32_ROOT_LEVEL);
context->gva_to_gpa = paging32_gva_to_gpa;
context->root_level = PT32_ROOT_LEVEL;
@@ -2888,21 +2887,21 @@ static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
 * functions between mmu and nested_mmu are swapped.
 */
if (!is_paging(vcpu)) {
-   g_context->nx = false;
+   g_context->base_role.nxe = 0;
g_context->root_level = 0;
g_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
} else if (is_long_mode(vcpu)) {
-   g_context->nx = is_nx(vcpu);
+   g_context->base_role.nxe = is_nx(vcpu);
reset_rsvds_bits_mask(vcpu, g_context, PT64_ROOT_LEVEL);
g_context->root_level = PT64_ROOT_LEVEL;
g_context->gva_to_gpa = paging64_gva_to_gpa_nested;
} else if (is_pae(vcpu)) {
-   g_context->nx = is_nx(vcpu);
+   g_context->base_role.nxe = is_nx(vcpu);
reset_rsvds_bits_mask(vcpu, g_context, PT32E_ROOT_LEVEL);
g_context->root_level = PT32E_ROOT_LEVEL;
g_context->gva_to_gpa = paging64_gva_to_gpa_nested;
} else {
-   g_context->nx = false;
+   g_context->base_role.nxe = false;
reset_rsvds_bits_mask(vcpu, g_context

Re: qemu-kvm and initrd

2010-09-14 Thread David S. Ahern


On 09/14/10 00:35, Nirmal Guhan wrote:
> Hi,
> 
> Getting an error while booting my guest with -initrd option as in :
> 
> qemu-kvm -net nic,macaddr=$macaddress -net tap,script=/etc/qemu-ifup
> -m 512 -hda /root/kvm/x86/vdisk.img -kernel /root/mvroot/bzImage
> -initrd /root/kvm/mv/ramdisk.img -append "root=/dev/ram0"
> 
> No filesystem could mount root, tried : ext3 ext2 ext4 vfat msdos iso9660
> Kernel panic
> 
> #file ramdisk.img
> #ramdisk.img: Linux rev 1.0 ext2 filesystem data (mounted or unclean)

What's the size of ramdisk.img?

David


>
> I tried with both above initrd and gzipped initrd but same error.
> 
> If I try to mount the same file and do a -append  "ip=dhcp
> root=/dev/nfs rw nfsroot=:/root/kvm/mv/mnt" instead of -initrd
> option, it works  fine. So am guessing this is initrd related.
> 
> Any help would be much appreciated.
> 
> Thanks,
> Nirmal


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Michael S. Tsirkin
On Tue, Sep 14, 2010 at 05:21:13PM +0200, Arnd Bergmann wrote:
> On Tuesday 14 September 2010, Shirley Ma wrote:
> > On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
> >
> > > That's what io_submit() is for.  Then io_getevents() tells you what
> > > "a 
> > > while" actually was.
> > 
> > This macvtap zero copy uses iov buffers from vhost ring, which is
> > allocated from guest kernel. In host kernel, vhost calls macvtap
> > sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
> > pages for zero copy.
> > 
> > The patch is relying on how vhost handle these buffers. I need to look
> > at vhost code (qemu) first for addressing the questions here.
> 
> I guess the best solution would be to make macvtap_aio_write return
> -EIOCBQUEUED when a packet gets passed down to the adapter, and
> call aio_complete when the adapter is done with it.
> 
> This would change the regular behavior of macvtap into a model where
> every write on the file blocks until the packet has left the machine,
> which gives us better flow control, but does slow down the traffic
> when we only put one packet at a time into the queue.
> 
> It also allows the user to call io_submit instead of write in order
> to do an asynchronous submission as Avi was suggesting.
> 
>   Arnd

I would expect this to hurt performance significantly.
We could do this for asynchronous requests only to avoid the
slowdown.

-- 
MST


Re: [PATCH] tun: orphan an skb on tx

2010-09-14 Thread Michael S. Tsirkin
I looked at the macvtap driver and it seems that it should
have the below issue, same as tap.
Arnd?

On Tue, Apr 13, 2010 at 05:59:44PM +0300, Michael S. Tsirkin wrote:
> The following situation was observed in the field:
> tap1 sends packets, tap2 does not consume them, as a result
> tap1 can not be closed. This happens because
> tun/tap devices can hang on to skbs indefinitely.
> 
> As noted by Herbert, possible solutions include a timeout followed by a
> copy/change of ownership of the skb, or always copying/changing
> ownership if we're going into a hostile device.
> 
> This patch implements the second approach.
> 
> Note: one issue still remaining is that since skbs
> keep reference to tun socket and tun socket has a
> reference to tun device, we won't flush backlog,
> instead simply waiting for all skbs to get transmitted.
> At least this is not user-triggerable, and
> this was not reported in practice, my assumption is
> other devices besides tap complete an skb
> within finite time after it has been queued.
> 
> A possible solution for the second issue
> would not to have socket reference the device,
> instead, implement dev->destructor for tun, and
> wait for all skbs to complete there, but this
> needs some thought, probably too risky for 2.6.34.
> 
> Signed-off-by: Michael S. Tsirkin 
> Tested-by: Yan Vugenfirer 
> 
> ---
> 
> Please review the below, and consider for 2.6.34,
> and stable trees.
> 
>  drivers/net/tun.c |4 
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 96c39bd..4326520 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -387,6 +387,10 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>   }
>   }
>  
> + /* Orphan the skb - required as we might hang on to it
> +  * for indefinite time. */
> + skb_orphan(skb);
> +
>   /* Enqueue packet */
>   skb_queue_tail(&tun->socket.sk->sk_receive_queue, skb);
>   dev->trans_start = jiffies;
> -- 
> 1.7.0.2.280.gc6f05


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Arnd Bergmann
On Tuesday 14 September 2010, Shirley Ma wrote:
> On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
>
> > That's what io_submit() is for.  Then io_getevents() tells you what
> > "a 
> > while" actually was.
> 
> This macvtap zero copy uses iov buffers from vhost ring, which is
> allocated from guest kernel. In host kernel, vhost calls macvtap
> sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
> pages for zero copy.
> 
> The patch is relying on how vhost handle these buffers. I need to look
> at vhost code (qemu) first for addressing the questions here.

I guess the best solution would be to make macvtap_aio_write return
-EIOCBQUEUED when a packet gets passed down to the adapter, and
call aio_complete when the adapter is done with it.

This would change the regular behavior of macvtap into a model where
every write on the file blocks until the packet has left the machine,
which gives us better flow control, but does slow down the traffic
when we only put one packet at a time into the queue.

It also allows the user to call io_submit instead of write in order
to do an asynchronous submission as Avi was suggesting.

Arnd


Re: [RFC PATCH 0/1] macvtap TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
Hello Michael,

On Tue, 2010-09-14 at 14:05 +0200, Michael S. Tsirkin wrote:
> While others pointed out correctness issues with the patch,
> I would still like to see the performance numbers, just so we
> understand what's possible.

The performance looks good: it either saves host CPU utilization where the
guest is running (by 8-10% across 8 cpus) or gains higher BW with more guest
CPU utilization while host utilization is similar to or less than before.
And I ran 32 netperf instances and didn't hit any problem.

Here is output from host perf top (I am upgrading my guest to the most
recent kernel now to collect guest perf top data). My guest has 2 vcpus, the
host has 8 cpus.

Please let me know what performance data you would like to see. I will
run more

w/o zero copy patch:

---
   PerfTop:1708 irqs/sec  kernel:63.7%  exact:  0.0% [1000Hz cycles],  
(all, 8 CPUs)
---

 samples  pcnt function DSO
 ___ _  
__

 6842.00 47.4% copy_user_generic_string 
/lib/modules/2.6.36-rc3+/build/vmlinux
  329.00  2.3% get_page_from_freelist   
/lib/modules/2.6.36-rc3+/build/vmlinux
  307.00  2.1% list_del 
/lib/modules/2.6.36-rc3+/build/vmlinux
  289.00  2.0% alloc_pages_current  
/lib/modules/2.6.36-rc3+/build/vmlinux
  283.00  2.0% __alloc_pages_nodemask   
/lib/modules/2.6.36-rc3+/build/vmlinux
  234.00  1.6% ixgbe_xmit_frame 
/lib/modules/2.6.36-rc3+/kernel/drivers/net/ixgbe/ixgbe.ko
  232.00  1.6% vmx_vcpu_run 
/lib/modules/2.6.36-rc3+/kernel/arch/x86/kvm/kvm-intel.ko
  210.00  1.5% schedule 
/lib/modules/2.6.36-rc3+/build/vmlinux
  173.00  1.2% _cond_resched
/lib/modules/2.6.36-rc3+/build/vmlinux


w/i zero copy patch:

---
   PerfTop:1108 irqs/sec  kernel:43.0%  exact:  0.0% [1000Hz cycles],  
(all, 8 CPUs)
---

 samples  pcnt function DSO
 ___ _  ___

  281.00  5.1% copy_user_generic_string [kernel]
  235.00  4.3% vmx_vcpu_run [kvm_intel]
  228.00  4.1% gup_pte_range[kernel]
  211.00  3.8% tg_shares_up [kernel]
  179.00  3.2% schedule [kernel]
  148.00  2.7% _raw_spin_lock_irqsave   [kernel]
  139.00  2.5% iommu_no_mapping [kernel]
  124.00  2.2% ixgbe_xmit_frame [ixgbe]
  123.00  2.2% kvm_arch_vcpu_ioctl_run  [kvm]
  122.00  2.2% _raw_spin_lock   [kernel]
  113.00  2.1% put_page [kernel]
   92.00  1.7% vhost_get_vq_desc[vhost_net]
   81.00  1.5% get_user_pages_fast  [kernel]
   81.00  1.5% memcpy_fromiovec [kernel]
   80.00  1.5% translate_desc   [vhost_net]

w/i zero copy patch, and NIC IRQ cpu affinity (netper/netserver on cpu 0, 
interrupts on cpu1)

[r...@localhost ~]# netperf -H 10.0.4.74 -c -C -l 60 -T0,0 -- -m 65536
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.4.74 (10.0.4.74) 
port 0 AF_INET : cpu bind
Recv   SendSend  Utilization   Service Demand
Socket Socket  Message  Elapsed  Send Recv SendRecv
Size   SizeSize Time Throughput  localremote   local   remote
bytes  bytes   bytessecs.10^6bits/s  % S  % S  us/KB   us/KB

 87380  16384  6553660.00  9384.25   53.9213.620.941   0.951
[r...@localhost ~]#








Re: KVM call minutes for Sept 14

2010-09-14 Thread Anthony Liguori

On 09/14/2010 09:47 AM, Chris Wright wrote:

0.13
- if all goes well...tomorrow
   


We can tag, but the announcement may be Thursday. I need to run a regression
test tonight.



qed/qcow2
- increase concurrency, performance
   


To achieve performance, a block driver must: 1) support concurrent 
request handling 2) not hold the qemu_mutex for prolonged periods of time.


QED never does (2) and supports (1) in all circumstances except cluster 
allocation today.


qcow2 can do (1) for the data read/write portions of an I/O request.  
All metadata read/write is serialized.  It also does (2) for all 
metadata operations and for CoW operations.


These are implementation details though.  The real claim of QED is that 
by having fewer IO ops required to satisfy a request, it achieves better 
performance, especially since it needs zero syncs in the cluster 
allocation path.  qcow2 has two syncs in the cluster allocation path 
today.  One sync is due to the refcount table.  The other is due to 
the fact that it doesn't require fsck support.


We could sync() on cluster allocations in QED and we'd still have better 
performance than qcow2 on paper because we have fewer IO ops and fewer 
sync()s.  That would eliminate fsck.


However, since the design target is to have no sync()s in the fast path, 
we're starting with fsck.



- threading vs state machine
- avi doesn't like qed reliance on fsck
   - undermines value of error checking (errors become normal)
   - prefer preallocation and fsck just checks for leaked blocks
   


We will provide performance data on fsck.  That's the next step.


- just load and validate metadata
- options for correctness are
   - fsync at every data allocation
   - leak data blocks
   


I contend that leaking data blocks is incorrect and potentially guest 
exploitable so it's not an option IMHO.



   - scan
- qed is pure statemachine
   - state on stack, control flow vs function call
- common need to separate handle requests concurrently, issue async i/o
- most disk formats have similar metadata and methods
   - lookup cluster, read/write data
   - qed could be a path to cleaning up other formats (reusing)
- need an incremental way to improve qcow2 performance
   - threading doesn't seem to be the way to achieve this (incrementally)
   


Because qcow2 already implements a state machine and the qemu 
infrastructure is based on events.  We can incrementally split states in 
qcow2.  Once you've got explicit states, it's trivial to compact those 
states into control flow using coroutines.


OTOH, threading would probably require a full rewrite of qcow2 and a lot 
of the block layer.


Regards,

Anthony Liguori


- coroutines vs. traditional threads discussion
   - parallel (and locking) vs few well-defined preemption points
- plan for qed...attempt to merge in 0.14
   - online fsck support is all that's missing
   - add bdrv check callback, look for new patch series over the next week
- back to list with discussion...


Re: [RFC PATCH 2/2] macvtap: TX zero copy between guest and host kernel

2010-09-14 Thread Shirley Ma
On Tue, 2010-09-14 at 11:12 +0200, Avi Kivity wrote:
> >> +base = (unsigned long)from->iov_base + offset1;
> >> +size = ((base&  ~PAGE_MASK) + len + ~PAGE_MASK)>>
> PAGE_SHIFT;
> >> +num_pages = get_user_pages_fast(base, size,
> 0,&page[i]);
> >> +if ((num_pages != size) ||
> >> +(num_pages>  MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags))
> >> +/* put_page is in skb free */
> >> +return -EFAULT;
> > What keeps the user from writing to these pages in it's address
> space
> > after the write call returns?
> >
> > A write() return of success means:
> >
> >   "I wrote what you gave to me"
> >
> > not
> >
> >   "I wrote what you gave to me, oh and BTW don't touch these
> >   pages for a while."
> >
> > In fact "a while" isn't even defined in any way, as there is no way
> > for the write() invoker to know when the networking card is done
> with
> > those pages.
> 
> That's what io_submit() is for.  Then io_getevents() tells you what
> "a 
> while" actually was.

This macvtap zero copy uses iov buffers from vhost ring, which is
allocated from guest kernel. In host kernel, vhost calls macvtap
sendmsg. macvtap sendmsg calls get_user_pages_fast to pin these buffers'
pages for zero copy.

The patch is relying on how vhost handle these buffers. I need to look
at vhost code (qemu) first for addressing the questions here.

Thanks
Shirley



[PATCH] device-assignment: register a reset function

2010-09-14 Thread Bernhard Kohl
This is necessary because during reboot of a VM the assigned devices
continue DMA transfers, which causes memory corruption.

Signed-off-by: Thomas Ostler 
Signed-off-by: Bernhard Kohl 
---
 hw/device-assignment.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 87f7418..001aee8 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1450,6 +1450,17 @@ static void assigned_dev_unregister_msix_mmio(AssignedDevice *dev)
 dev->msix_table_page = NULL;
 }
 
+static void reset_assigned_device(void *opaque)
+{
+PCIDevice *d = (PCIDevice *)opaque;
+uint32_t conf;
+
+/* reset the bus master bit to avoid further DMA transfers */
+conf = assigned_dev_pci_read_config(d, 0x04, 0x02);
+conf &= ~0x04;
+assigned_dev_pci_write_config(d, 0x04, conf, 0x02);
+}
+
 static int assigned_initfn(struct PCIDevice *pci_dev)
 {
 AssignedDevice *dev = DO_UPCAST(AssignedDevice, dev, pci_dev);
@@ -1499,6 +1510,9 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
 if (r < 0)
 goto assigned_out;
 
+/* register reset function for the device */
+qemu_register_reset(reset_assigned_device, pci_dev);
+
 /* intercept MSI-X entry page in the MMIO */
 if (dev->cap.available & ASSIGNED_DEVICE_CAP_MSIX)
 if (assigned_dev_register_msix_mmio(dev))
-- 
1.7.2.2



Re: high load with usb device

2010-09-14 Thread Michael Tokarev
14.09.2010 18:45, Avi Kivity wrote:
>> 17:27:23.96 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5
>> 10], left {0, 98})<0.09>
>> 17:27:24.000199 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in
>> [12], left {0, 998775})<0.001241>
>> 17:27:24.001666 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5
>> 10], left {0, 97})<0.06>
>> 17:27:24.001768 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in
>> [12], left {0, 32})<0.000103>
>> 17:27:24.001985 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5
>> 10], left {0, 98})<0.05>
>> 17:27:24.002061 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in
>> [12], left {0, 998407})<0.001617>
> 
> That pipe is doing a lot of damage (I don't have it, and couldn't
> reproduce your results, another pointer).  Do you have CONFIG_EVENTFD
> set?  If not, why not?

As I mentioned in other emails in this thread:

o yes, I do have CONFIG_EVENTFD set, and it is being used
  too (fd#12 in the above strace).

o 0.13.0-rc1 behaves the same way (that is, it also shows
  high load when idle -- the same 18% of host CPU), but it
  has no pipe on fd#5.

Here's how it looks like for 0.13.0-rc1
(so far i were playing with 0.12.5):

qemu-syst 11397  mjt0u   CHR  136,5  0t08 /dev/pts/5
qemu-syst 11397  mjt1u   CHR  136,5  0t08 /dev/pts/5
qemu-syst 11397  mjt2u   CHR  136,5  0t08 /dev/pts/5
qemu-syst 11397  mjt3u   CHR 10,232  0t0 4402 /dev/kvm
qemu-syst 11397  mjt4u  0,90  607 anon_inode
qemu-syst 11397  mjt5u  0,90  607 anon_inode
qemu-syst 11397  mjt6u  0,90  607 anon_inode
qemu-syst 11397  mjt7u   CHR 10,200  0t0 1228 
/dev/net/tun
qemu-syst 11397  mjt8u  0,90  607 anon_inode
qemu-syst 11397  mjt9u  unix 0x8801950a4c80  0t0 13170832 socket
qemu-syst 11397  mjt   10u  unix 0x8801950a4680  0t0 13170834 socket
qemu-syst 11397  mjt   11u  0,90  607 anon_inode
qemu-syst 11397  mjt   12u  0,90  607 anon_inode
qemu-syst 11397  mjt   13u  0,90  607 anon_inode

18:52:24.871650 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 999867}) <0.000165>
18:52:24.872175 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 93}) <0.15>
18:52:24.872300 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 98}) <0.10>
18:52:24.872548 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 98}) <0.10>
18:52:24.872648 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 999861}) <0.000171>
18:52:24.874516 select(10, [9], NULL, NULL, {0, 0}) = 0 (Timeout) <0.08>
18:52:24.874686 select(14, [0 5 7 11 13], [], [], {1, 0}) = 2 (in [11 13], left 
{0, 93}) <0.12>
18:52:24.875000 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 94}) <0.37>
18:52:24.875305 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 98}) <0.10>
18:52:24.875406 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 999633}) <0.000387>
18:52:24.876084 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 96}) <0.09>
18:52:24.876195 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 75}) <0.31>
18:52:24.876414 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 98}) <0.08>
18:52:24.876507 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 999741}) <0.000276>
18:52:24.877066 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 97}) <0.09>
18:52:24.877173 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 71}) <0.35>
18:52:24.877393 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 98}) <0.07>
18:52:24.877485 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 999720}) <0.000298>
18:52:24.878067 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 97}) <0.09>
18:52:24.878174 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 71}) <0.34>
18:52:24.878394 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 98}) <0.07>
18:52:24.878485 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 997665}) <0.002351>
18:52:24.881150 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 97}) <0.07>
18:52:24.881240 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 98}) <0.06>
18:52:24.881389 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [11], left 
{0, 98}) <0.06>
18:52:24.881460 select(14, [0 5 7 11 13], [], [], {1, 0}) = 1 (in [13], left 
{0, 999703}) <0.000314>
18:52:24.882137 selec

KVM call minutes for Sept 14

2010-09-14 Thread Chris Wright
0.13
- if all goes well...tomorrow

stable tree
- please look at -stable to see what is missing (bugfixes)
  - esp. regressions from 0.12
- looking for dedicated stable maintainer/release manager
  - pick this discussion up next week

qed/qcow2
- increase concurrency, performance
- threading vs state machine
- avi doesn't like qed reliance on fsck
  - undermines value of error checking (errors become normal)
  - prefer preallocation and fsck just checks for leaked blocks
- just load and validate metadata
- options for correctness are
  - fsync at every data allocation
  - leak data blocks
  - scan
- qed is pure statemachine
  - state on stack, control flow vs function call
- common need to separate handle requests concurrently, issue async i/o
- most disk formats have similar metadata and methods
  - lookup cluster, read/write data
  - qed could be a path to cleaning up other formats (reusing)
- need an incremental way to improve qcow2 performance
  - threading doesn't seem to be the way to achieve this (incrementally)
- coroutines vs. traditional threads discussion
  - parallel (and locking) vs few well-defined preemption points
- plan for qed...attempt to merge in 0.14
  - online fsck support is all that's missing
  - add bdrv check callback, look for new patch series over the next week
- back to list with discussion...


Re: high load with usb device

2010-09-14 Thread Avi Kivity

 On 09/14/2010 03:29 PM, Michael Tokarev wrote:

14.09.2010 17:25, Avi Kivity wrote:
>   On 09/14/2010 03:15 PM, Michael Tokarev wrote:
[]
>>  Looking at what hw/usb-uhci.c:uhci_frame_timer() routine
>>  does, it is quite expected to have that many writes and
>>  reads and that many gettimers().  It is polling for events
>>  every 1/1000th of a second, instead of using some form of
>>  select().
>
>  IIUC that's mandated by USB hardware.  The guest may place data in
>  memory, and USB polls it to see if it needs to send some message on
>  the bus.

Well, checking guest memory does not involve so many
reads/writes, i guess ;)

>  Please post an strace again, this time with -e trace=select.  Looks like
>  each timer callback results in>50 syscalls, 4 of which are select()s).

Here we go.

qemu-syst 25728  mjt0u   CHR136,9  0t0   12 /dev/pts/9
qemu-syst 25728  mjt1u   CHR136,9  0t0   12 /dev/pts/9
qemu-syst 25728  mjt2u   CHR136,9  0t0   12 /dev/pts/9
qemu-syst 25728  mjt3u   CHR   10,232  0t0 4402 /dev/kvm
qemu-syst 25728  mjt4u    0,90  607 anon_inode
qemu-syst 25728  mjt5r  FIFO  0,8  0t0 11703862 pipe
qemu-syst 25728  mjt6w  FIFO  0,8  0t0 11703862 pipe
qemu-syst 25728  mjt7u   CHR   10,200  0t0 1228 /dev/net/tun
qemu-syst 25728  mjt8u    0,90  607 anon_inode
qemu-syst 25728  mjt9u  IPv4 11704055  0t0  TCP *:5900 (LISTEN)
qemu-syst 25728  mjt   10u    0,90  607 anon_inode
qemu-syst 25728  mjt   11u    0,90  607 anon_inode
qemu-syst 25728  mjt   12u    0,90  607 anon_inode

17:27:23.995096 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left {0, 
998573})<0.001461>
17:27:23.996994 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], left {0, 
87})<0.42>
17:27:23.997258 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left {0, 
98})<0.11>
17:27:23.997561 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], left {0, 
98})<0.09>
17:27:23.997739 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left {0, 
998771})<0.001256>
17:27:23.999458 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], left {0, 
91})<0.17>
17:27:23.999665 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left {0, 
98})<0.10>
17:27:23.96 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], left {0, 
98})<0.09>
17:27:24.000199 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left {0, 
998775})<0.001241>
17:27:24.001666 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], left {0, 
97})<0.06>
17:27:24.001768 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left {0, 
32})<0.000103>
17:27:24.001985 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], left {0, 
98})<0.05>
17:27:24.002061 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left {0, 
998407})<0.001617>



That pipe is doing a lot of damage (I don't have it, and couldn't 
reproduce your results, another pointer).  Do you have CONFIG_EVENTFD 
set?  If not, why not?


--
error compiling committee.c: too many arguments to function



[PATCH] qemu-kvm-x86: consider the irq0override flag in kvm_arch_init_irq_routing

2010-09-14 Thread Bernhard Kohl
The setting of the irq0override flag must also be passed properly
to the KVM_IRQCHIP_IOAPIC.

Signed-off-by: Bernhard Kohl 
---
 qemu-kvm-x86.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/qemu-kvm-x86.c b/qemu-kvm-x86.c
index fd974b3..e35c234 100644
--- a/qemu-kvm-x86.c
+++ b/qemu-kvm-x86.c
@@ -1388,9 +1388,9 @@ int kvm_arch_init_irq_routing(void)
         }
     }
     for (i = 0; i < 24; ++i) {
-        if (i == 0) {
+        if (i == 0 && irq0override) {
             r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, 2);
-        } else if (i != 2) {
+        } else if (i != 2 || !irq0override) {
             r = kvm_add_irq_route(kvm_context, i, KVM_IRQCHIP_IOAPIC, i);
         }
         if (r < 0) {
-- 
1.7.2.2



Re: KVM call agenda for Sept 14

2010-09-14 Thread Anthony Liguori

On 09/13/2010 10:59 AM, Chris Wright wrote:

Please send in any agenda items you are interested in covering.


- QED and qcow2

Obviously, there have been lots of discussions on the ML.  It would be 
good to use the call to step back and try to discuss a higher level plan 
for moving forward.


Also, if someone is on the phone that is able to talk more about the use 
case around qcow2 on LVM, I'd like to hear more about it.


Regards,

Anthony Liguori


thanks,
-chris




Re: high load with usb device

2010-09-14 Thread Michael Tokarev
14.09.2010 17:29, Michael Tokarev wrote:
[]
>> Please post an strace again, this time with -e trace=select.  Looks like
>> each timer callback results in >50 syscalls, 4 of which are select()s).

I just built 0.13-rc1 to see how that one performs.  It is very similar
to 0.12 in this respect.  The selfpipe is gone, but the high load is
still here and is exactly like was with 0.12, with very similar strace.

/mjt


Re: high load with usb device

2010-09-14 Thread Michael Tokarev
14.09.2010 17:25, Avi Kivity wrote:
>  On 09/14/2010 03:15 PM, Michael Tokarev wrote:
[]
>> Looking at what hw/usb-uhci.c:uhci_frame_timer() routine
>> does, it is quite expected to have that many writes and
>> reads and that many gettimers().  It is polling for events
>> every 1/1000th of a second, instead of using some form of
>> select().
> 
> IIUC that's mandated by USB hardware.  The guest may place data in
> memory, and USB polls it to see if it needs to send a message on
> the bus.

Well, checking guest memory does not involve so many
reads/writes, i guess ;)

> Please post an strace again, this time with -e trace=select.  Looks like
> each timer callback results in >50 syscalls, 4 of which are select()s).

Here we go.

qemu-syst 25728  mjt0u   CHR136,9  0t0   12 /dev/pts/9
qemu-syst 25728  mjt1u   CHR136,9  0t0   12 /dev/pts/9
qemu-syst 25728  mjt2u   CHR136,9  0t0   12 /dev/pts/9
qemu-syst 25728  mjt3u   CHR   10,232  0t0 4402 /dev/kvm
qemu-syst 25728  mjt4u    0,90  607 anon_inode
qemu-syst 25728  mjt5r  FIFO  0,8  0t0 11703862 pipe
qemu-syst 25728  mjt6w  FIFO  0,8  0t0 11703862 pipe
qemu-syst 25728  mjt7u   CHR   10,200  0t0 1228 /dev/net/tun
qemu-syst 25728  mjt8u    0,90  607 anon_inode
qemu-syst 25728  mjt9u  IPv4 11704055  0t0  TCP *:5900 (LISTEN)
qemu-syst 25728  mjt   10u    0,90  607 anon_inode
qemu-syst 25728  mjt   11u    0,90  607 anon_inode
qemu-syst 25728  mjt   12u    0,90  607 anon_inode

17:27:23.995096 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998573}) <0.001461>
17:27:23.996994 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 87}) <0.42>
17:27:23.997258 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 98}) <0.11>
17:27:23.997561 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 98}) <0.09>
17:27:23.997739 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998771}) <0.001256>
17:27:23.999458 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 91}) <0.17>
17:27:23.999665 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 98}) <0.10>
17:27:23.96 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 98}) <0.09>
17:27:24.000199 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998775}) <0.001241>
17:27:24.001666 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 97}) <0.06>
17:27:24.001768 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 32}) <0.000103>
17:27:24.001985 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 98}) <0.05>
17:27:24.002061 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998407}) <0.001617>
17:27:24.004234 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 3 (in [5 10 12], 
left {0, 92}) <0.41>
17:27:24.004827 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998683}) <0.001361>
17:27:24.006746 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 3 (in [5 10 12], 
left {0, 91}) <0.19>
17:27:24.007218 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998545}) <0.001479>
17:27:24.009212 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 92}) <0.14>
17:27:24.009366 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 97}) <0.08>
17:27:24.009667 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 92}) <0.17>
17:27:24.009964 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998755}) <0.001272>
17:27:24.011573 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 96}) <0.09>
17:27:24.011771 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 97}) <0.08>
17:27:24.016422 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 91}) <0.14>
17:27:24.016596 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998705}) <0.001313>
17:27:24.018166 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 96}) <0.08>
17:27:24.018307 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 35}) <0.71>
17:27:24.018643 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 97}) <0.07>
17:27:24.018767 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 92}) <0.12>
17:27:24.018964 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 
left {0, 98}) <0.09>
17:27:24.019097 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 1 (in [12], left 
{0, 998457}) <0.001562>
17:27:24.021019 select(13, [0 5 7 9 10 12], [], [], {1, 0}) = 2 (in [5 10], 

Re: high load with usb device

2010-09-14 Thread Avi Kivity

 On 09/14/2010 03:15 PM, Michael Tokarev wrote:

>
>>  - instrument calls to qemu_mod_timer() in hw/usb-*hci.c.  Looks like
>>  these are all 1kHz, but something else is clearly happening.

Ok. There's nothing interesting going on there either,
apparently.

It is using hw/usb-uhci.c.  I added a few prints() in there,
but they're firing at the defined 1KHz frequency.

Just for test, I lowered the frequency (FRAME_TIMER_FREQ)
from 1000 to 500, and the load dropped to half, from 19%
to 9..10%.

Looking at what hw/usb-uhci.c:uhci_frame_timer() routine
does, it is quite expected to have that many writes and
reads and that many gettimers().  It is polling for events
every 1/1000th of a second, instead of using some form of
select().



IIUC that's mandated by USB hardware.  The guest may place data in
memory, and USB polls it to see if it needs to send a message on
the bus.


Please post an strace again, this time with -e trace=select.  Looks like
each timer callback results in >50 syscalls, 4 of which are select()s.




--
error compiling committee.c: too many arguments to function



[PATCH] vhost-net: fix range checking in mrg bufs case

2010-09-14 Thread Michael S. Tsirkin
In mergeable buffer case, we use headcount, log_num
and seg as indexes in same-size arrays, and
we know that headcount <= seg and
log_num equals either 0 or seg.

Therefore, the right thing to do is range-check seg,
not headcount as we do now: these will be different
if guest chains s/g descriptors (this does not
happen now, but we can not trust the guest).

Long term, we should add BUG_ON checks to verify
two other indexes are what we think they should be.

Reported-by: Jason Wang 
Signed-off-by: Michael S. Tsirkin 
---

Dave, I'll queue this on my tree, no need to bother.

 drivers/vhost/net.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 6400cd5..f095de6 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -245,7 +245,7 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
int r, nlogs = 0;
 
while (datalen > 0) {
-   if (unlikely(headcount >= VHOST_NET_MAX_SG)) {
+   if (unlikely(seg >= VHOST_NET_MAX_SG)) {
r = -ENOBUFS;
goto err;
}
-- 
1.7.3.rc1.5.ge5969


Re: high load with usb device

2010-09-14 Thread Michael Tokarev
[linux-perf-users removed from Cc]

14.09.2010 15:02, Michael Tokarev wrote:
> 14.09.2010 14:39, Avi Kivity wrote:
>>  On 09/14/2010 12:25 PM, Michael Tokarev wrote:
>>> Not that it is much helpful either.  lsof:
>>>
>>> qemu-syst 23203  mjt0u   CHR   136,9  0t0  12 /dev/pts/9
>>> qemu-syst 23203  mjt1u   CHR   136,9  0t0  12 /dev/pts/9
>>> qemu-syst 23203  mjt2u   CHR   136,9  0t0  12 /dev/pts/9
>>> qemu-syst 23203  mjt3u   CHR  10,232  0t04402 /dev/kvm
>>> qemu-syst 23203  mjt4u   0,90 607 anon_inode
>>> qemu-syst 23203  mjt5r  FIFO 0,8  0t0 8172675 pipe
>>> qemu-syst 23203  mjt6w  FIFO 0,8  0t0 8172675 pipe
>>> qemu-syst 23203  mjt7u   CHR  10,200  0t01228 /dev/net/tun
>>> qemu-syst 23203  mjt8u   0,90 607 anon_inode
>>> qemu-syst 23203  mjt9u  IPv4 8173217  0t0 TCP *:5900 (LISTEN)
>>> qemu-syst 23203  mjt   10u   0,90 607 anon_inode
>>> qemu-syst 23203  mjt   11u   0,90 607 anon_inode
>>> qemu-syst 23203  mjt   12u   0,90 607 anon_inode
>>
>>> So it is constantly poking fds# 11, 12, 10, 5&  6.
>>> 5 and 6 are pipe (selfpipe?),
>>
>> signalfd emulation, used to deliver signals efficiently.  Older glibc?
> 
> [e]glibc-2.11.
> 
> $ grep SIGNAL config-host.mak
> CONFIG_SIGNALFD=y
> 
> From strace of another run:
> 24318 signalfd(-1, [BUS ALRM IO], 8)= 12
> (so one of the remaining fds is a signalfd :)
> 
>>> and 10..12 are "anon inode".
>>
>> Those are likely eventfds.
>>
>>> Here's the command line again:
>>>
>>> qemu-system-x86_64 \
>>>-netdev type=tap,ifname=tap-kvm,id=x \
>>>-device virtio-net-pci,netdev=x \
>>>-monitor stdio \
>>>-boot n \
>>>-usbdevice tablet \
>>>-m 1G \
>>>-vnc :0
>>>
>>> Yes, it does quite a lot of timer stuff... ;)
>>
>> So timers internal to usb.
>>
>> Please try (independently):
>>
>> - just -usb, without -usbdevice tablet
> 
> No, that one works as expected - all quiet.
> -usbdevice tablet is also quiet up until
> guest loads usb host controller driver
> (not particular usb device driver).
> 
>> - instrument calls to qemu_mod_timer() in hw/usb-*hci.c.  Looks like
>> these are all 1kHz, but something else is clearly happening.

Ok. There's nothing interesting going on there either,
apparently.

It is using hw/usb-uhci.c.  I added a few prints() in there,
but they're firing at the defined 1KHz frequency.

Just for test, I lowered the frequency (FRAME_TIMER_FREQ)
from 1000 to 500, and the load dropped to half, from 19%
to 9..10%.

Looking at what hw/usb-uhci.c:uhci_frame_timer() routine
does, it is quite expected to have that many writes and
reads and that many gettimers().  It is polling for events
every 1/1000th of a second, instead of using some form of
select().

I'll do some more tests -- after all I'm curious why winXP
does not show this behaviour.

Thanks!

/mjt


Re: [PATCH 18/24] Exiting from L2 to L1

2010-09-14 Thread Nadav Har'El
On Mon, Jun 14, 2010, Avi Kivity wrote about "Re: [PATCH 18/24] Exiting from L2 
to L1":
> >+int switch_back_vmcs(struct kvm_vcpu *vcpu)
> >+{
> IIUC vpids are not exposed to the guest yet?  So the VPID should not 
> change between guest and nested guest.

Right. Removed.

> 
> >+
> >+vmcs_write64(IO_BITMAP_A, src->io_bitmap_a);
> >+vmcs_write64(IO_BITMAP_B, src->io_bitmap_b);
> >   
> 
> Why change the I/O bitmap?
>...
> Or the msr bitmap?  After all, we're switching the entire vmcs?
>...
> Why write all these?  What could have changed them?

You were right - most of these copies were utterly useless, and apparently
remained in our code since prehistory (when the same hardware vmcs was reused
for both L1 and L2). The great thing is that this removes several dozens of
vmwrites from the L2->L1 exit path in one fell swoop. In fact, the whole
function switch_back_vmcs() is now gone. Thanks for spotting this!


> >+vmx_set_cr0(vcpu,
> >+(vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask&
> >+vmx->nested.l1_shadow_vmcs->cr0_read_shadow) |
> >+(~vmx->nested.l1_shadow_vmcs->cr0_guest_host_mask&
> >+vmx->nested.l1_shadow_vmcs->guest_cr0));
> >   
> 
> Helper wanted.

Done. The new helper looks like this:

static inline unsigned long guest_readable_cr0(struct vmcs_fields *fields)
{
return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
(fields->cr0_read_shadow & fields->cr0_guest_host_mask);
}

And is used in two places in the code (the above place, and another one).



> >+vmcs_write64(GUEST_PDPTR3, src->guest_pdptr3);
> >+}
> >   
> 
> A kvm_set_cr3(src->host_cr3) should do all that and more, no?
> >+
> >+vmx_set_cr4(vcpu, vmx->nested.l1_state.cr4);
> >+
> >   
> 
> Again, the kvm_set_crx() versions have more meat.

I have to admit, I still don't understand this part of the code completely.
The fact that kvm_set_cr4 does more than vmx_set_cr4 doesn't always mean
that we want (or need) to do those things. In particular:
> 
> >+if (enable_ept) {
> >+vcpu->arch.cr3 = vmx->nested.l1_shadow_vmcs->guest_cr3;
> >+vmcs_write32(GUEST_CR3, 
> >vmx->nested.l1_shadow_vmcs->guest_cr3);
> >+} else {
> >+kvm_set_cr3(vcpu, vmx->nested.l1_state.cr3);
> >+}
> >   
> 
> kvm_set_cr3() will load the PDPTRs in the EPT case (correctly in case 
> the nested guest was able to corrupted the guest's PDPT).

kvm_set_cr3 calls vmx_set_cr3 which calls ept_load_pdptrs which assumes
that vcpu->arch.pdptrs[] is correct. I am guessing (but am not yet completely
sure) that this code tried to avoid assuming that this cache is up-to-date.
Again, I still need to better understand this part of the code before I
can correct it (because, as the saying goes, "if it ain't broken, don't fix
it" - or at least fix it carefully).

> >+kvm_mmu_reset_context(vcpu);
> >+kvm_mmu_load(vcpu);
> >   
> 
> kvm_mmu_load() unneeded, usually.

Again, I'll need to look into this deeper and report back.
In the meantime, attached below is the current version of this patch.

Thanks,
Nadav.

Subject: [PATCH 19/26] nVMX: Exiting from L2 to L1

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.

Signed-off-by: Nadav Har'El 
---
 arch/x86/kvm/vmx.c |  242 ++-
 1 file changed, 241 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2010-09-14 15:02:37.0 +0200
+++ .after/arch/x86/kvm/vmx.c   2010-09-14 15:02:37.0 +0200
@@ -4970,9 +4970,13 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
int type;
bool idtv_info_valid;
 
+   vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
+
+   if (vmx->nested.nested_mode)
+   return;
+
exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 
-   vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
 
/* Handle machine checks before interrupts are enabled */
if ((vmx->exit_reason == EXIT_REASON_MCE_DURING_VMENTRY)
@@ -5961,6 +5965,242 @@ static int nested_vmx_run(struct kvm_vcp
return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0. This may not be the
+ * guest

Re: Regarding routed networking with KVM

2010-09-14 Thread Rajiv Rajaian
Thanks for your information Daniel

On Tue, Sep 14, 2010 at 5:46 PM, Daniel P. Berrange  wrote:
> On Tue, Sep 14, 2010 at 04:58:46PM +0530, Rajiv Rajaian wrote:
>> Thanks for your kind information Daniel.
>> Consider this scenario
>> VM1(144.68.100.1) and VM2(144.68.100.2) running on Host1(10.2.0.20)
>> and Host2(10.2.0.30) respectively. Is it possible to access the VM1
>> and VM2 from Host3(10.2.0.100). How to add a static route for this
>> scenario?? Here I don't need separate subnets for VMs.
>> All VMs should be in same subnet ie 144.68.100.0/255.255.255.0
>> Is there any way to configure this ??
>
> No, in the setup libvirt does, each host must have a separate subnet.
>
> Regards,
> Daniel
> --
> |: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
> |: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
> |: GnuPG: 7D3B9505  -o-   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
>

