Re: windows 2008 guest causing rcu_sched to emit NMI

2013-01-30 Thread Andrey Korolyov
On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote:
 On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
  On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
  On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti mtosa...@redhat.com 
  wrote:
   On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
   On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti 
   mtosa...@redhat.com wrote:
On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:
Thank you Marcelo,
   
The host node now locks up somewhat later than yesterday, but the
problem is still here; please see the attached dmesg. The stuck process
looks like
root 19251  0.0  0.0 228476 12488 ?        D    14:42   0:00
/usr/bin/kvm -no-user-config -device ? -device pci-assign,? -device
virtio-blk-pci,? -device
   
and it is the fourth VM by count.
   
Should I try an upstream kernel instead of applying the patch to the
latest 3.4, or would that be useless?
   
If you can upgrade to an upstream kernel, please do that.
   
  
   With vanilla 3.7.4 almost nothing changes, and the NMIs started firing
   again. The external symptoms look like this: starting from some
   count, maybe the third or sixth VM, a qemu-kvm process allocates its
   memory very slowly and in jumps, 20M-200M-700M-1.6G over minutes. The
   patch helps, of course: on both patched 3.4 and vanilla 3.7 I'm able to
   kill the stuck kvm processes and the node returns to normal, whereas on
   3.2 sending SIGKILL to the process leaves zombies and a hung ``ps''
   (the problem and a workaround for the case where no scheduler is
   involved are described here:
   http://www.spinics.net/lists/kvm/msg84799.html).
  
   Try disabling pause loop exiting with ple_gap=0 kvm-intel.ko module 
   parameter.
  
 
  Hi Marcelo,
 
  thanks, this parameter increased the number of working VMs by about
  half an order of magnitude, from 3-4 to 10-15. A very high SY load, 10
  to 15 percent, persists at such counts for a long time, whereas Linux
  guests in the same configuration do not go above one percent even under
  a stress benchmark. After I disabled HT, the crash happens only in long
  runs, and now it is a kernel panic :)
  The stair-like memory allocation behaviour disappeared, but another
  symptom leading to the crash, which I had not noted previously,
  persists: if the VM count is ``enough'' for a crash, some qemu processes
  start to eat one core each, and they will panic the system after running
  for tens of minutes in that state, or if I try to attach a debugger to
  one of them. If needed, I can log the entire crash output via
  netconsole; for now I only have the tail, which is almost the same every
  time:
  http://xdel.ru/downloads/btwin.png
 
  Yes, please log entire crash output, thanks.
 
 
  Here please, 3.7.4-vanilla, 16 vms, ple_gap=0:
 
  http://xdel.ru/downloads/oops-default-kvmintel.txt

 Just an update: I was able to reproduce this with pure Linux VMs using
 qemu-1.3.0 and the ``stress'' benchmark running in them: the panic occurs
 when a VM is started (with about ten machines already running at that
 moment). Qemu-1.1.2 generally does not reproduce this, but a host node
 with the older version crashes with fewer Windows VMs (three to six
 instead of ten to fifteen) than with 1.3; please see the trace below:

 http://xdel.ru/downloads/oops-old-qemu.txt

 Single bit memory error, apparently. Try:

 1. memtest86.
 2. Boot with slub_debug=ZFPU kernel parameter.
 3. Reproduce on different machine



Hi Marcelo,

I always follow the rule: if some weird bug shows up, check it on an
ECC-enabled machine and check the IPMI logs too before starting to
complain :) I have finally managed to ``fix'' the problem, but my
solution seems a bit strange:
- I noticed that virtual machines started without any cgroup setting
do not cause this bug under any conditions,
- I had assumed, quite wrongly, that CONFIG_SCHED_AUTOGROUP would only
regroup tasks that are not in any cgroup and would not touch tasks
already inside an existing cpu cgroup. A first look at the 200-line
patch shows that autogrouping always applies to all tasks, so I tried
disabling it,
- wild magic happened: the VMs no longer crash the host, and even with a
count of 30+ they work fine.
I still don't know what exactly triggered the bug or whether I will face
it again under different conditions, so my solution is more likely a
patch of mud in the wall of a dam than a proper fix.

There seem to be two possible origins of such an error: a very, very
hideous race condition involving cgroups and processes like qemu-kvm
that cause frequent context switches, or a simple incompatibility between
NUMA, the logic of CONFIG_SCHED_AUTOGROUP and qemu VMs already doing work
in a cgroup, since I have not observed these errors on a single NUMA
node (i.e., a desktop) under relatively heavier conditions.
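
For anyone who wants to compare hosts, here is a minimal sketch in C that
reads back the two settings that made a difference in this thread. The
paths are assumptions about a typical host (the first exists only while
kvm_intel is loaded, the second only with CONFIG_SCHED_AUTOGROUP=y), and
the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one integer tunable from a sysfs/procfs-style file.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * file cannot be opened or does not start with an integer.
 */
int read_int_tunable(const char *path, long *out)
{
    FILE *f;
    int matched;

    assert(path != NULL && out != NULL);
    f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    matched = fscanf(f, "%ld", out);
    fclose(f);
    return matched == 1 ? 0 : -1;
}

/* Print the two settings discussed above, if the host exposes them. */
void report_host_tunables(void)
{
    /* Assumed locations; adjust if your host lays things out differently. */
    static const char *const paths[] = {
        "/sys/module/kvm_intel/parameters/ple_gap",
        "/proc/sys/kernel/sched_autogroup_enabled",
    };
    size_t i;

    for (i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
        long value;

        if (read_int_tunable(paths[i], &value) == 0) {
            printf("%s = %ld\n", paths[i], value);
        } else {
            printf("%s: not available\n", paths[i]);
        }
    }
}
```

As far as I know, writing 0 to the autogroup file as root (or booting
with noautogroup) turns autogrouping off at runtime, while ple_gap is
normally set when the module is loaded, as Marcelo suggested.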
--
To unsubscribe from this list: send the line unsubscribe kvm in

Re: What to do about non-qdevified devices?

2013-01-30 Thread Andreas Färber
Am 30.01.2013 08:02, schrieb Markus Armbruster:
 Anthony Liguori aligu...@us.ibm.com writes:
 
 [...]
 The problems I ran into were (1) this is a lot of work (2) it basically
 requires that all bus children have been qdev/QOM-ified.  Even with
 something like the ISA bus which is where I started, quite a few devices
 were not qdevified still.
 
 So what's the plan to complete the qdevification job?  Lay really low
 and quietly hope the problem goes away?  We've tried that for about
 three years, doesn't seem to work.

Stating (file) names would make that discussion much easier... ;)

I'd expect non-qdev'ified devices to rather be SysBusDevices (e.g.,
m68k, sh4, ppc). PReP's pc87312 qdev'ification was forgotten for 1.2 and
recently merged.
Would dma.c be a candidate for ISADevice? It uses isa_* API. (The stubs
in sun4m.c/sun4u.c due to use in fdc.c might be a candidate for stubs/
at least, short of an fdc.c rewrite.)

I recently went through all ISADevices and QOM'ified them:
https://lists.gnu.org/archive/html/qemu-devel/2012-11/msg02746.html

It became too late for 1.4 and I'm not quite sure where Anthony wanted
to draw the line between his 1) and 2):
https://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00071.html
Thus I've only been rebasing my queue [1] without sending a v2 so far.

Lack of an official ISA maintainer for reviewing is another issue, any
volunteers? :)

Cheers,
Andreas

[1] https://github.com/afaerber/qemu-cpu/commits/realize-isa

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] s390/kvm fixes

2013-01-30 Thread Christian Borntraeger
On 29/01/13 22:03, Gleb Natapov wrote:

 The question about 1/1. It is CCed to stable, does this mean you want it
 to go to 3.8? kvm-next is for 3.9.

 On the second thought, if it is not a regression 3.9 is the right place.

The store status part is broken, but it only has a severe impact in case of 
a machine check. (The machine check handler revalidates all registers with
the content of the save area).
Since machine checks are part of the virtio-ccw code, this can go into 3.9.
Feel free to remove the CC:stable. 

Christian



Re: [Qemu-devel] QEMU buildbot maintenance state

2013-01-30 Thread Gerd Hoffmann
  Hi,

 Gerd: Are you willing to co-maintain the QEMU buildmaster with Daniel
 and Christian?  It would be awesome if you could do this given your
 experience running and customizing buildbot.

I'll try to set aside some time for that.  Christian's idea to host the
config at GitHub is good; that certainly makes it easier to spread
things across more people.

Another thing which would be helpful:  Any chance we can set up a
maintainer tree mirror @ git.qemu.org?  A single repository where each
maintainer tree shows up as a branch?

This would make the buildbot setup *a lot* easier.  We can go for an
AnyBranchScheduler then with BuildFactory and BuildConfig shared,
instead of needing one BuildFactory and BuildConfig per branch.  It also
makes the buildbot web interface less cluttered, as we don't have an
insane amount of BuildConfigs any more.  And it saves some resources
(bandwidth + diskspace) for the buildslaves.

I think it would also be a nice service for people who want to see what
is coming, or who want to test stuff that is cooking, to have a one-stop
shop where they can get everything.

cheers,
  Gerd


Re: QEMU buildbot maintenance state

2013-01-30 Thread Stefan Hajnoczi
On Tue, Jan 29, 2013 at 04:04:39PM +0100, Christian Berendt wrote:
 On 01/28/2013 03:29 PM, Daniel Gollub wrote:
 JFYI, the main buildbot configuration which controls everything (beside
 buildslave credentials) is accessible to everyone:
 http://people.b1-systems.de/~gollub/buildbot/
 
 If you are familiar with buildbot feel free to incorporate your suggested
 changes directly on a copy and send me or Christian the diff so we just have
 to review and apply it.
 
 I moved the configuration to GitHub
 (https://github.com/b1-systems/buildbot). I'll add a cron job to the
 buildbot system to regularly pull and apply the latest configuration.
 Simply open a pull request to modify the configuration.

Thanks Christian!  I have updated the QEMU wiki page:

http://wiki.qemu.org/ContinuousIntegration

Stefan


Re: [Qemu-devel] [PATCH V2 11/20] tap: support enabling or disabling a queue

2013-01-30 Thread Jason Wang
On 01/30/2013 07:03 AM, Michael S. Tsirkin wrote:
 On Tue, Jan 29, 2013 at 04:55:25PM -0600, Anthony Liguori wrote:
 Michael S. Tsirkin m...@redhat.com writes:

 On Tue, Jan 29, 2013 at 08:10:26PM +0000, Blue Swirl wrote:
 On Tue, Jan 29, 2013 at 1:50 PM, Jason Wang jasow...@redhat.com wrote:
 On 01/26/2013 03:13 AM, Blue Swirl wrote:
 On Fri, Jan 25, 2013 at 10:35 AM, Jason Wang jasow...@redhat.com wrote:
 This patch introduces a new bit, enabled, in TAPState which tracks
 whether a specific queue/fd is enabled. The tap/fd is enabled during
 initialization and can be enabled/disabled by tap_enable() and
 tap_disable(), which call platform-specific helpers to do the real
 work. Polling of a tap fd can only be done while the tap is enabled.

 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  include/net/tap.h |    2 ++
  net/tap-win32.c   |   10 ++++++++++
  net/tap.c         |   43 ++++++++++++++++++++++++++++++++++++++++---
  3 files changed, 52 insertions(+), 3 deletions(-)

 diff --git a/include/net/tap.h b/include/net/tap.h
 index bb7efb5..0caf8c4 100644
 --- a/include/net/tap.h
 +++ b/include/net/tap.h
 @@ -35,6 +35,8 @@ int tap_has_vnet_hdr_len(NetClientState *nc, int len);
  void tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr);
  void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, 
 int ecn, int ufo);
  void tap_set_vnet_hdr_len(NetClientState *nc, int len);
 +int tap_enable(NetClientState *nc);
 +int tap_disable(NetClientState *nc);

  int tap_get_fd(NetClientState *nc);

 diff --git a/net/tap-win32.c b/net/tap-win32.c
 index 265369c..a2cd94b 100644
 --- a/net/tap-win32.c
 +++ b/net/tap-win32.c
 @@ -764,3 +764,13 @@ void tap_set_vnet_hdr_len(NetClientState *nc, int 
 len)
  {
  assert(0);
  }
 +
 +int tap_enable(NetClientState *nc)
 +{
 +assert(0);
 abort()
 This is just to be consistent with the rest of the helpers in this file.
 +}
 +
 +int tap_disable(NetClientState *nc)
 +{
 +assert(0);
 +}
 diff --git a/net/tap.c b/net/tap.c
 index 67080f1..95e557b 100644
 --- a/net/tap.c
 +++ b/net/tap.c
 @@ -59,6 +59,7 @@ typedef struct TAPState {
  unsigned int write_poll : 1;
  unsigned int using_vnet_hdr : 1;
  unsigned int has_ufo: 1;
 +unsigned int enabled : 1;
 bool without bit field?
 Also to be consistent with the other fields. If you wish I can send patches
 to convert all those bit fields to bool on top of this series.
 That would be nice, likewise for the assert(0).
 OK so let's go ahead with this patchset as is,
 and a cleanup patch will be sent after 1.4 then.
 Why?  I'd prefer that we didn't rush things into 1.4 just because.
 There's still ample time to respin a corrected series.

 Regards,

 Anthony Liguori
 Confused.  Do you want the coding style rework of net/tap.c
 switching it from assert(0)/bitfields to abort()/bool for 1.4?

I will send a new series with patches that address Blue's comments
on assert(0) and bitfields.

Thanks

 Thanks
  VHostNetState *vhost_net;
  unsigned host_vnet_hdr_len;
  } TAPState;
 @@ -72,9 +73,9 @@ static void tap_writable(void *opaque);
  static void tap_update_fd_handler(TAPState *s)
  {
  qemu_set_fd_handler2(s->fd,
 - s->read_poll  ? tap_can_send : NULL,
 - s->read_poll  ? tap_send : NULL,
 - s->write_poll ? tap_writable : NULL,
 + s->read_poll && s->enabled ? tap_can_send : NULL,
 + s->read_poll && s->enabled ? tap_send : NULL,
 + s->write_poll && s->enabled ? tap_writable : NULL,
   s);
  }

 @@ -339,6 +340,7 @@ static TAPState *net_tap_fd_init(NetClientState *peer,
  s->host_vnet_hdr_len = vnet_hdr ? sizeof(struct virtio_net_hdr) : 0;
  s->using_vnet_hdr = 0;
  s->has_ufo = tap_probe_has_ufo(s->fd);
 +s->enabled = 1;
  tap_set_offload(&s->nc, 0, 0, 0, 0, 0);
  /*
   * Make sure host header length is set correctly in tap:
 @@ -737,3 +739,38 @@ VHostNetState *tap_get_vhost_net(NetClientState *nc)
  assert(nc->info->type == NET_CLIENT_OPTIONS_KIND_TAP);
  return s->vhost_net;
  }
 +
 +int tap_enable(NetClientState *nc)
 +{
 +TAPState *s = DO_UPCAST(TAPState, nc, nc);
 +int ret;
 +
 +if (s->enabled) {
 +return 0;
 +} else {
 +ret = tap_fd_enable(s->fd);
 +if (ret == 0) {
 +s->enabled = 1;
 +tap_update_fd_handler(s);
 +}
 +return ret;
 +}
 +}
 +
 +int tap_disable(NetClientState *nc)
 +{
 +TAPState *s = DO_UPCAST(TAPState, nc, nc);
 +int ret;
 +
 +if (s->enabled == 0) {
 +return 0;
 +} else {
 +ret = tap_fd_disable(s->fd);
 +if (ret == 0) {
 +qemu_purge_queued_packets(nc);
 +s->enabled = 0;
 +tap_update_fd_handler(s);
 +}
 +return ret;
 +}
 +}
 --
 1.7.1
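
For reference, the idempotent enable/disable pattern from the quoted
patch can be sketched outside QEMU as below. This is only an
illustration under stated assumptions: queue_state and
fake_backend_toggle() are hypothetical stand-ins for TAPState and the
platform helpers, and a plain bool is used instead of a bit field, as
suggested in the review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for TAPState; only the fields needed here. */
struct queue_state {
    int fd;
    bool enabled;
    int backend_calls;   /* how often the platform helper was invoked */
    int backend_result;  /* what the fake helper returns; 0 means success */
};

/* Fake platform helper standing in for tap_fd_enable()/tap_fd_disable(). */
static int fake_backend_toggle(struct queue_state *q)
{
    q->backend_calls++;
    return q->backend_result;
}

/* Enable the queue; a no-op returning 0 if it is already enabled. */
int queue_enable(struct queue_state *q)
{
    int ret;

    assert(q != NULL);
    if (q->enabled) {
        return 0;
    }
    ret = fake_backend_toggle(q);
    if (ret == 0) {
        q->enabled = true;   /* state changes only after the helper succeeds */
    }
    return ret;
}

/* Disable the queue; a no-op returning 0 if it is already disabled. */
int queue_disable(struct queue_state *q)
{
    int ret;

    assert(q != NULL);
    if (!q->enabled) {
        return 0;
    }
    ret = fake_backend_toggle(q);
    if (ret == 0) {
        q->enabled = false;
    }
    return ret;
}
```

The sketch leaves out what the real functions do after a successful
toggle (updating the fd handlers and, on disable, purging queued
packets); it only shows that repeated calls are harmless and that a
failed helper call leaves the state untouched.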



[Bug 53191] hardware error 0x80000021 on a KVM virtual machine with kernel 3.7

2013-01-30 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=53191


Gleb g...@redhat.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |g...@redhat.com




--- Comment #2 from Gleb g...@redhat.com  2013-01-30 09:51:48 ---
Can you try to load kvm-intel module with emulate_invalid_guest_state=0 flag?

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Re: [PATCH 0/3] s390/kvm fixes

2013-01-30 Thread Gleb Natapov
On Wed, Jan 30, 2013 at 09:51:24AM +0100, Christian Borntraeger wrote:
 On 29/01/13 22:03, Gleb Natapov wrote:
 
  The question about 1/1. It is CCed to stable, does this mean you want it
  to go to 3.8? kvm-next is for 3.9.
 
  On the second thought, if it is not a regression 3.9 is the right place.
 
 The store status part is broken, but it only has a severe impact in case of 
 a machine check. (The machine check handler revalidates all registers with
 the content of the save area).
 Since machine checks are part of the virtio-ccw code, this can go into 3.9.
 Feel free to remove the CC:stable. 
 
No reason to drop stable, but 3.8 will have to get the fix through
stable after it is released.

--
Gleb.


Re: [Qemu-devel] What to do about non-qdevified devices? (was: KVM call minutes 2013-01-29)

2013-01-30 Thread Peter Maydell
On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
 Anthony Liguori aligu...@us.ibm.com writes:

 [...]
 The problems I ran into were (1) this is a lot of work (2) it basically
 requires that all bus children have been qdev/QOM-ified.  Even with
 something like the ISA bus which is where I started, quite a few devices
 were not qdevified still.

 So what's the plan to complete the qdevification job?  Lay really low
 and quietly hope the problem goes away?  We've tried that for about
 three years, doesn't seem to work.

Do we have a list of not-yet-qdevified devices? Maybe we need to
start saying "fix X, Y and Z or platform P is dropped from the next
release". (This would of course be easier if we had a way to let users
know that platform P was in danger...)

-- PMM


Re: [PATCH 0/3] s390/kvm fixes

2013-01-30 Thread Gleb Natapov
On Fri, Jan 25, 2013 at 03:34:14PM +0100, Christian Borntraeger wrote:
 Gleb, Marcelo,
 
 here are 3 kvm fixes for kvm-next.
 
 Christian Borntraeger (3):
   s390/kvm: Fix store status for ACRS/FPRS
   s390/virtio-ccw: Fix setup_vq error handling.
   s390/kvm: Fix instruction decoding
 
  arch/s390/kvm/kvm-s390.c  |  8 
  arch/s390/kvm/kvm-s390.h  | 25 ++---
  drivers/s390/kvm/virtio_ccw.c | 20 +++-
  3 files changed, 33 insertions(+), 20 deletions(-)
 
Applied, thanks.

--
Gleb.


[PATCH 0/2] KVM: set_memory_region: Cleanup and new restriction

2013-01-30 Thread Takuya Yoshikawa
Patch 1: just rebased for this series.
Patch 2: an API change, so please let me know if you notice any problems.

Takuya Yoshikawa (2):
  KVM: set_memory_region: Identify the requested change explicitly
  KVM: set_memory_region: Disallow changing read-only attribute later

 Documentation/virtual/kvm/api.txt |   12 ++--
 virt/kvm/kvm_main.c   |   95 +
 2 files changed, 60 insertions(+), 47 deletions(-)

-- 
1.7.5.4



[PATCH 1/2 -v3] KVM: set_memory_region: Identify the requested change explicitly

2013-01-30 Thread Takuya Yoshikawa
KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
what kind of change is being requested by checking the arguments.  The
current code does this checking at various points in code and each
condition being used there is not easy to understand at first glance.

This patch consolidates these checks and introduces an enum to name the
possible changes to clean up the code.

Although this does not introduce any functional changes, there is one
change which optimizes the code a bit: if we have nothing to change, the
new code returns 0 immediately.

Note that the return value for this case cannot be changed since QEMU
relies on it: we noticed this when we changed it to -EINVAL and got a
section mismatch error at the final stage of live migration.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
---
v2: updated iommu related parts
v3: converted !(A == B) to A != B
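
Not part of the patch: a standalone sketch of the decision table that
the new enum encodes, for reviewers who want to poke at the cases in
user space. The names below (slot_view, classify_change, MR_*) are made
up for illustration and only mirror the fields the patch compares; the
return convention (1 = change needed, 0 = nothing to do, -EINVAL =
disallowed) is an assumption of the sketch, not of the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the memslot fields that the patch compares. */
struct slot_view {
    uint64_t npages;
    uint64_t base_gfn;
    uint64_t userspace_addr;
    uint32_t flags;
};

/* Mirrors enum kvm_mr_change from the patch under illustrative names. */
enum mr_change {
    MR_CREATE,
    MR_DELETE,
    MR_MOVE,
    MR_FLAGS_ONLY,
};

/*
 * Classify a request against the existing slot.
 * Returns 1 and sets *change when there is work to do, 0 when nothing
 * changes, and -EINVAL for requests the patch rejects.
 */
int classify_change(const struct slot_view *old,
                    const struct slot_view *req,
                    enum mr_change *change)
{
    assert(old != NULL && req != NULL && change != NULL);

    if (req->npages) {
        if (!old->npages) {
            *change = MR_CREATE;
            return 1;
        }
        /* Modifying an existing slot: size and host address are fixed. */
        if (req->userspace_addr != old->userspace_addr ||
            req->npages != old->npages) {
            return -EINVAL;
        }
        if (req->base_gfn != old->base_gfn) {
            *change = MR_MOVE;
            return 1;
        }
        if (req->flags != old->flags) {
            *change = MR_FLAGS_ONLY;
            return 1;
        }
        return 0; /* nothing to change */
    }
    if (old->npages) {
        *change = MR_DELETE;
        return 1;
    }
    return -EINVAL; /* modifying a non-existent slot */
}
```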

 virt/kvm/kvm_main.c |   64 +++
 1 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a83ca63..64c5dc3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -719,6 +719,24 @@ static struct kvm_memslots *install_new_memslots(struct 
kvm *kvm,
 }
 
 /*
+ * KVM_SET_USER_MEMORY_REGION ioctl allows the following operations:
+ * - create a new memory slot
+ * - delete an existing memory slot
+ * - modify an existing memory slot
+ *   -- move it in the guest physical memory space
+ *   -- just change its flags
+ *
+ * Since flags can be changed by some of these operations, the following
+ * differentiation is the best we can do for __kvm_set_memory_region():
+ */
+enum kvm_mr_change {
+   KVM_MR_CREATE,
+   KVM_MR_DELETE,
+   KVM_MR_MOVE,
+   KVM_MR_FLAGS_ONLY,
+};
+
+/*
  * Allocate some memory and give it an address in the guest physical address
  * space.
  *
@@ -737,6 +755,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_memory_slot old, new;
struct kvm_memslots *slots = NULL, *old_memslots;
bool old_iommu_mapped;
+   enum kvm_mr_change change;
 
r = check_memory_region_flags(mem);
if (r)
@@ -780,17 +799,30 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
old_iommu_mapped = old.npages;
 
-   /*
-* Disallow changing a memory slot's size or changing anything about
-* zero sized slots that doesn't involve making them non-zero.
-*/
r = -EINVAL;
-   if (npages && old.npages && npages != old.npages)
-   goto out;
-   if (!npages && !old.npages)
+   if (npages) {
+   if (!old.npages)
+   change = KVM_MR_CREATE;
+   else { /* Modify an existing slot. */
+   if ((mem->userspace_addr != old.userspace_addr) ||
+   (npages != old.npages))
+   goto out;
+
+   if (base_gfn != old.base_gfn)
+   change = KVM_MR_MOVE;
+   else if (new.flags != old.flags)
+   change = KVM_MR_FLAGS_ONLY;
+   else { /* Nothing to change. */
+   r = 0;
+   goto out;
+   }
+   }
+   } else if (old.npages) {
+   change = KVM_MR_DELETE;
+   } else /* Modify a non-existent slot: disallowed. */
goto out;
 
-   if ((npages && !old.npages) || (base_gfn != old.base_gfn)) {
+   if ((change == KVM_MR_CREATE) || (change == KVM_MR_MOVE)) {
/* Check for overlaps */
r = -EEXIST;
kvm_for_each_memslot(slot, kvm->memslots) {
@@ -808,20 +840,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.dirty_bitmap = NULL;
 
r = -ENOMEM;
-
-   /*
-* Allocate if a slot is being created.  If modifying a slot,
-* the userspace_addr cannot change.
-*/
-   if (!old.npages) {
+   if (change == KVM_MR_CREATE) {
new.user_alloc = user_alloc;
new.userspace_addr = mem->userspace_addr;
 
if (kvm_arch_create_memslot(&new, npages))
goto out_free;
-   } else if (npages && mem->userspace_addr != old.userspace_addr) {
-   r = -EINVAL;
-   goto out_free;
}
 
/* Allocate page dirty bitmap if needed */
@@ -830,7 +854,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
goto out_free;
}
 
-   if (!npages || base_gfn != old.base_gfn) {
+   if ((change == KVM_MR_DELETE) || (change == KVM_MR_MOVE)) {
r = -ENOMEM;
slots = kmemdup(kvm->memslots, sizeof(struct kvm_memslots),
GFP_KERNEL);
@@ -881,7 +905,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 * slots (size changes, userspace 

[PATCH 2/2] KVM: set_memory_region: Disallow changing read-only attribute later

2013-01-30 Thread Takuya Yoshikawa
As Xiao pointed out, there are a few problems with changing the
read-only attribute of an existing slot:
 - kvm_arch_commit_memory_region() write protects the memory slot only
   for GET_DIRTY_LOG when modifying the flags.
 - FNAME(sync_page) uses the old spte value to set a new one without
   checking KVM_MEM_READONLY flag.

Since we flush all shadow pages when creating a new slot, the simplest
fix is to disallow such problematic flag changes: this is safe because
no one is doing such things.

Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
Cc: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
Cc: Alex Williamson alex.william...@redhat.com
---
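
Not part of the patch: the new check relies on XOR to detect a flipped
bit. A minimal user-space sketch, assuming the flag values below match
KVM_MEM_LOG_DIRTY_PAGES and KVM_MEM_READONLY in <linux/kvm.h> (the
SKETCH_* names and the helper functions are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies for illustration; assumed to mirror the uapi flag bits. */
#define SKETCH_MEM_LOG_DIRTY_PAGES (1u << 0)
#define SKETCH_MEM_READONLY        (1u << 1)

/*
 * XOR leaves a 1 in every bit position where the two flag words differ;
 * masking with the read-only bit asks "did exactly this bit flip?".
 */
bool readonly_attr_changed(uint32_t old_flags, uint32_t new_flags)
{
    return ((new_flags ^ old_flags) & SKETCH_MEM_READONLY) != 0;
}

/* With this patch, a flags update is allowed only if read-only is untouched. */
bool flags_update_allowed(uint32_t old_flags, uint32_t new_flags)
{
    return !readonly_attr_changed(old_flags, new_flags);
}
```

Toggling dirty logging on an existing slot stays allowed under this
rule, while setting or clearing the read-only bit on it is rejected.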
 Documentation/virtual/kvm/api.txt |   12 ++--
 virt/kvm/kvm_main.c   |   35 ---
 2 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 09905cb..0e03b19 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -874,12 +874,12 @@ It is recommended that the lower 21 bits of 
guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flag, KVM_MEM_LOG_DIRTY_PAGES, which instructs
-kvm to keep track of writes to memory within the slot.  See KVM_GET_DIRTY_LOG
-ioctl.  The KVM_CAP_READONLY_MEM capability indicates the availability of the
-KVM_MEM_READONLY flag.  When this flag is set for a memory region, KVM only
-allows read accesses.  Writes will be posted to userspace as KVM_EXIT_MMIO
-exits.
+The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
+KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
+use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+to make a new slot read-only.  In this case, writes to this memory will be
+posted to userspace as KVM_EXIT_MMIO exits.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 64c5dc3..2e93630 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -754,7 +754,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_memory_slot *slot;
struct kvm_memory_slot old, new;
struct kvm_memslots *slots = NULL, *old_memslots;
-   bool old_iommu_mapped;
enum kvm_mr_change change;
 
r = check_memory_region_flags(mem);
@@ -797,15 +796,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
new.npages = npages;
new.flags = mem->flags;
 
-   old_iommu_mapped = old.npages;
-
r = -EINVAL;
if (npages) {
if (!old.npages)
change = KVM_MR_CREATE;
else { /* Modify an existing slot. */
if ((mem->userspace_addr != old.userspace_addr) ||
-   (npages != old.npages))
+   (npages != old.npages) ||
+   ((new.flags ^ old.flags) & KVM_MEM_READONLY))
goto out;
 
if (base_gfn != old.base_gfn)
@@ -867,7 +865,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
/* slot was deleted or moved, clear iommu mapping */
kvm_iommu_unmap_pages(kvm, &old);
-   old_iommu_mapped = false;
/* From this point no new shadow pages pointing to a deleted,
 * or moved, memslot will be created.
 *
@@ -898,25 +895,17 @@ int __kvm_set_memory_region(struct kvm *kvm,
 
/*
 * IOMMU mapping:  New slots need to be mapped.  Old slots need to be
-* un-mapped and re-mapped if their base changes or if flags that the
-* iommu cares about change (read-only).  Base change unmapping is
-* handled above with slot deletion, so we only unmap incompatible
-* flags here.  Anything else the iommu might care about for existing
-* slots (size changes, userspace addr changes) is disallowed above,
-* so any other attribute changes getting here can be skipped.
+* un-mapped and re-mapped if their base changes.  Since base change
+* unmapping is handled above with slot deletion, mapping alone is
+* needed here.  Anything else the iommu might care about for existing
+* slots (size changes, userspace addr changes and read-only flag
+* changes) is disallowed above, so any other attribute changes getting
+* here can be skipped.
 */
-   if (change != KVM_MR_DELETE) {
-   if (old_iommu_mapped &&
-   ((new.flags ^ old.flags) & KVM_MEM_READONLY)) {
-   kvm_iommu_unmap_pages(kvm, &old);
-   old_iommu_mapped = false;
- 

vCPU hotplug roadmap (was: Minutes for KVM call 2013-01-15)

2013-01-30 Thread Andreas Färber
Am 15.01.2013 17:16, schrieb Juan Quintela:
 
 * cpu hot plug
   - use qdev properties connected to a set of socket objects (anthony)
   - cpusets are the wrong interface (anthony)
   - make a link between cpu and socket instead of a property?
   - how far are we from being able to describe a cpu with -device?
     (didn't hear the answer, andreas?)
   - perhaps the best approach?
   - After soft-freeze, exceptions depend on the maintainer
   - After hard-freeze, no exceptions
   - -device doesn't require a bus, that is just an implementation detail,
     we can change that
   - use cpuset as an intermediate step until full vision is implemented
   - several approaches from where we are now, to have something before
     we get a full solution
 
 
 At this point, Andreas agreed to write a better summary of the
 discussion and suggestions O:-)

Got buried, here we go:

== vCPU hot-plug user interfaces ==

=== cpu_set ===

Previously available in qemu-kvm.git:
`cpu_set n+1 online` via HMP

Pros:
* Hides QOM/qdev implementation details (afaerber)
* Thus: Doesn't depend on QOM CPUState refactoring (imammedo)
* Opens a fast route to implementing vCPU unplug in KVM (imammedo)
* Unintrusive to add and easy to obsolete/remove in future (imammedo)
* Existing virt-test cases (afaerber)
* Supported by libvirt (imammedo)
* Prevents confusing guests by hot-plugging random mix of CPUs (agraf)

Cons:
* Cannot express topologies (ehabkost)

=== device_add ===

`device_add driver=Haswell-x86_64-cpu id=qdevid`
[You can try this today and see it failing / not working.]

Pros:
* QMP/HMP command available today and known to users (afaerber)
* Unified command for device and CPU hot-plug (imammedo)
* Would allow first doing thread-level vCPU hotplug (imammedo)
* Could be extended to support socket-level hot-plug (aliguori/imammedo)

Cons:
* Operates on raw QOM type name unlike -cpu (afaerber)
* Needs support in libvirt for device_add driver=CPU (imammedo)
* libvirt needs means to enumerate CPU types (imammedo) => QMP? (AF)

Challenges:
* No CPU qbus (afaerber)
  => should work without (aliguori)
* CPU subclasses needed for identifying type name (afaerber/imammedo)
  => Haswell-x86_64-cpu does not exist yet, just x86_64-cpu
* CPU class_init for -cpu host requires KVM init (imammedo)
  [suggestion by ehabkost to use kvm_arch_vcpu_init, WIP by afaerber]
* Conversion of CPU features to static properties needed (imammedo)
  => device_add driver=foo,level=x,xlevel=y,...
* Alternatively conversion to global properties (imammedo)
* Cements type names - rename for 1.4? (afaerber) => permissible (alig.)
  [patches for arm, m68k, openrisc, unicore32 on list]

=== qom-set ===

`qom-set` via QMP w/ link<CPUSocket> property (aliguori)

Topology represented in QOM:
CPUSocket has-a CPUCore has-a CPUThread a.k.a. CPUState, or
CPUSocket links-to CPUCore links-to CPUThread a.k.a. CPUState
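
To make the two alternatives concrete: the has-a variant corresponds to
plain embedding, sketched below in C without any QOM machinery. All type
and field names here are hypothetical and the fixed array sizes are
arbitrary; the links-to variant would replace the embedded arrays with
pointers to separately allocated objects.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CORES_PER_SOCKET 8
#define MAX_THREADS_PER_CORE 2

/* Leaf of the topology; stands in for what the text calls CPUState. */
struct cpu_thread {
    int index;
};

/* has-a: a core embeds its threads. */
struct cpu_core {
    size_t n_threads;
    struct cpu_thread threads[MAX_THREADS_PER_CORE];
};

/* has-a: a socket embeds its cores. */
struct cpu_socket {
    size_t n_cores;
    struct cpu_core cores[MAX_CORES_PER_SOCKET];
};

/* Number of threads (vCPUs) that plugging this one socket would add. */
size_t socket_thread_count(const struct cpu_socket *socket)
{
    size_t total = 0;
    size_t i;

    assert(socket != NULL && socket->n_cores <= MAX_CORES_PER_SOCKET);
    for (i = 0; i < socket->n_cores; i++) {
        total += socket->cores[i].n_threads;
    }
    return total;
}
```

With embedding, a socket is the unit of allocation and hot-plug; with
links, cores and threads could exist and be plugged on their own.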

Challenges (afaerber):
* No CPUSocket/CPUCore objects yet and may take a while to get there...
  topology fields being moved to CPUState for 1.4 [done, more WIP]
* No decisions on canonical paths for CPUs: CPU? machine? unassigned?
* Duality of thread-level device types and socket-level? (afaerber)
  => fine to have, e.g., quad-core Xeon 500 device (aliguori)
* CPUState is no_user (afaerber)
  => need to generally drop no_user for QOM (aliguori)

=== libvirt ===

libvirt's XML topology modelling is closer to today's -smp than to the
desired QOM modelling:
http://www.libvirt.org/formatcaps.html

`virsh setvcpus domain n`
http://libvirt.org/sources/virshcmdref/html/sect-setvcpus.html

== qom-cpu course of action (afaerber) ==

It was requested to have vCPU hot-plug in v1.5.

For device_add we need to move code from cpu_init() into QOM facilities.
=> QOM realize support would help [applied by aliguori]
=> cleanups piggy-backed onto CPU realizefn [applied to qom-cpu-next]

Agreement on goal of X86CPU subclasses, but conflicts how to get there:
* Refactor x86_def_t to X86CPUInfo for X86CPUClass class_init? (AF 2012)
* Refactor x86_def_t to X86CPU instance_init as done for arm?
* Refactor x86_def_t to class_inits? (afaerber)
  - heavy merge conflicts due to bug fixes / cleanups
  Pro: We can get things into a consistent QOM'ish state across targets.
  Con: We will refactor again on top for machine-compat properties.
* Keep x86_def_t within X86CPUClass as done for ppc? (WIP: afaerber)
  = smallest common denominator, separates x86 from cross-target work

APIC ID topology fixes are being reviewed for 1.4. [merged]
X86CPU wave 4 cleanups by Igor are being reviewed for 1.4. [merged]

Rename CPU types according to unified name-arch-cpu scheme for 1.4?
(aliguori: permissible) [patches on list]

VMState series by Juan being rebased - subset for 1.4, rest for 1.5.
[1.4 part on list, WIP for 1.5]

Remainder is considered 1.5 material, qom-cpu-next avail. during Freeze.

== Common issues (imammedo) ==

- back-port CPU hot-plug ACPI notification
- hot-plug is not allowed on SysBus:
  - APIC that 

RE: [PATCH 8/8] KVM:PPC:booke: Allow debug interrupt injection to guest

2013-01-30 Thread Bhushan Bharat-R65777


 -Original Message-
 From: kvm-ppc-ow...@vger.kernel.org [mailto:kvm-ppc-ow...@vger.kernel.org] On
 Behalf Of Alexander Graf
 Sent: Friday, January 25, 2013 5:44 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Bhushan Bharat-R65777
 Subject: Re: [PATCH 8/8] KVM:PPC:booke: Allow debug interrupt injection to 
 guest
 
 
 On 16.01.2013, at 09:24, Bharat Bhushan wrote:
 
  Allow userspace to inject debug interrupt to guest. QEMU can
 
 s/QEMU/user space.
 
  inject the debug interrupt to guest if it is not able to handle the
  debug interrupt.
 
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
  ---
  arch/powerpc/kvm/booke.c  |   32 +++-
  arch/powerpc/kvm/e500mc.c |   10 +-
  2 files changed, 40 insertions(+), 2 deletions(-)
 
  diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index
  faa0a0b..547797f 100644
  --- a/arch/powerpc/kvm/booke.c
  +++ b/arch/powerpc/kvm/booke.c
  @@ -133,6 +133,13 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu
  *vcpu) #endif }
 
  +#ifdef CONFIG_KVM_BOOKE_HV
  +static int kvmppc_core_pending_debug(struct kvm_vcpu *vcpu)
  +{
  +   return test_bit(BOOKE_IRQPRIO_DEBUG,
  +                   &vcpu->arch.pending_exceptions);
  +}
  +#endif
  +
  /*
   * Helper function for full MSR writes.  No need to call this if only
   * EE/CE/ME/DE/RI are changing.
  @@ -144,7 +151,11 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr)
  #ifdef CONFIG_KVM_BOOKE_HV
  new_msr |= MSR_GS;
 
  -   if (vcpu->guest_debug)
  +   /*
  +* Set MSR_DE if the hardware debug resources are owned by user-space
  +* and there is no debug interrupt pending for guest to handle.
 
 Why?

QEMU uses the IAC/DAC registers to set hardware breakpoints/watchpoints via
the debug ioctls. Since debug events are gated by MSR_DE, we need to set MSR_DE
in the hardware MSR while the guest is running in this case.

On bookehv this is how I control MSR_DE in the hardware MSR.

 And why is this whole thing only executed on HV?

On e500v2 we always enable MSR_DE using vcpu->arch.shadow_msr in e500.c:
#ifndef CONFIG_KVM_BOOKE_HV
-   vcpu->arch.shadow_msr = MSR_USER | MSR_IS | MSR_DS;
+   vcpu->arch.shadow_msr = MSR_USER | MSR_DE | MSR_IS | MSR_DS;
vcpu->arch.shadow_pid = 1;
vcpu->arch.shared->msr = 0;
#endif

Thanks
-Bharat

 
 
 Alex
 
  +*/
  +   if (vcpu->guest_debug && !kvmppc_core_pending_debug(vcpu))
  new_msr |= MSR_DE;
  #endif
 
  @@ -234,6 +245,16 @@ static void kvmppc_core_dequeue_watchdog(struct 
  kvm_vcpu
 *vcpu)
  clear_bit(BOOKE_IRQPRIO_WATCHDOG, &vcpu->arch.pending_exceptions);
  }
 
  +static void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu)
  +{
  +   kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_DEBUG);
  +}
  +
  +static void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu)
  +{
  +   clear_bit(BOOKE_IRQPRIO_DEBUG, &vcpu->arch.pending_exceptions);
  +}
  +
  static void set_guest_srr(struct kvm_vcpu *vcpu, unsigned long srr0, u32 
  srr1)
  {
  #ifdef CONFIG_KVM_BOOKE_HV
  @@ -1278,6 +1299,7 @@ static void get_sregs_base(struct kvm_vcpu *vcpu,
  sregs->u.e.dec = kvmppc_get_dec(vcpu, tb);
  sregs->u.e.tb = tb;
  sregs->u.e.vrsave = vcpu->arch.vrsave;
  +   sregs->u.e.dbsr = vcpu->arch.dbsr;
  }
 
  static int set_sregs_base(struct kvm_vcpu *vcpu,
  @@ -1310,6 +1332,14 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
  update_timer_ints(vcpu);
  }
 
  +   if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_DBSR) {
  +   vcpu->arch.dbsr = sregs->u.e.dbsr;
  +   if (vcpu->arch.dbsr)
  +   kvmppc_core_queue_debug(vcpu);
  +   else
  +   kvmppc_core_dequeue_debug(vcpu);
  +   }
  +
  return 0;
  }
 
  diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
  index 81abe92..7d90622 100644
  --- a/arch/powerpc/kvm/e500mc.c
  +++ b/arch/powerpc/kvm/e500mc.c
  @@ -208,7 +208,7 @@ void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct
 kvm_sregs *sregs)
  struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
 
  sregs->u.e.features |= KVM_SREGS_E_ARCH206_MMU | KVM_SREGS_E_PM |
  -  KVM_SREGS_E_PC;
  +  KVM_SREGS_E_PC | KVM_SREGS_E_ED;
  sregs-u.e.impl_id = KVM_SREGS_E_IMPL_FSL;
 
  sregs-u.e.impl.fsl.features = 0;
  @@ -216,6 +216,9 @@ void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct
 kvm_sregs *sregs)
  sregs->u.e.impl.fsl.hid0 = vcpu_e500->hid0;
  sregs->u.e.impl.fsl.mcar = vcpu_e500->mcar;
 
  +   sregs->u.e.dsrr0 = vcpu->arch.dsrr0;
  +   sregs->u.e.dsrr1 = vcpu->arch.dsrr1;
  +
  kvmppc_get_sregs_e500_tlb(vcpu, sregs);
 
  sregs-u.e.ivor_high[3] =
  @@ -256,6 +259,11 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct
 kvm_sregs *sregs)
  sregs-u.e.ivor_high[5];
  }
 
  +   if (sregs->u.e.features & KVM_SREGS_E_ED) {
  +   vcpu->arch.dsrr0 = sregs->u.e.dsrr0;
  + 

[PATCH V4 00/22] Multiqueue virtio-net

2013-01-30 Thread Jason Wang
Hello all:

This series is an update of the previous version of multiqueue virtio-net support.

This series brings multiqueue support to virtio-net through a
multiqueue-capable tap backend and multiple vhost threads.

Patch 1 converts the bitfields in TAPState to bool. Patch 2 replaces assert(0)
with abort() in tap.
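
As background for patch 1, here is a minimal sketch of why the old setters
needed `!!enable`: a 1-bit unsigned bitfield keeps only the low bit of the
value stored into it, while a bool turns any non-zero value into true. The
struct and function names below are hypothetical stand-ins, not the real
TAPState code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two layouts of a poll flag. */
struct flags_bitfield {
    unsigned int read_poll : 1;   /* keeps only the low bit of the value */
};

struct flags_bool {
    bool read_poll;               /* any non-zero value becomes true */
};

/* Bitfield store without normalisation: an even non-zero value is lost. */
static unsigned int set_bitfield_raw(struct flags_bitfield *f, int enable)
{
    f->read_poll = enable;
    return f->read_poll;
}

/* Bitfield store with the `!!` idiom the old code relied on. */
static unsigned int set_bitfield_normalised(struct flags_bitfield *f,
                                            int enable)
{
    f->read_poll = !!enable;
    return f->read_poll;
}

/* With bool the conversion happens at the call site, no idiom needed. */
static bool set_bool(struct flags_bool *f, bool enable)
{
    f->read_poll = enable;
    return f->read_poll;
}
```

Storing 2 through set_bitfield_raw() reads back as 0, while the other two
setters read back as 1/true.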

To support this, multiqueue NIC support was added to qemu. This is done by
introducing an array of NetClientStates in NICState and making each pair of
peers a queue of the NIC. This is done in patches 3-9.

Tap was also converted to be able to create a multiqueue backend. Currently
only Linux supports this, by issuing TUNSETIFF N times with the same device
name to create N queues. Each fd on which TUNSETIFF was issued is a queue
supported by the kernel. Three new command line options were introduced:
queues tells qemu how many queues to create; fds passes multiple pre-created
tap file descriptors to qemu; vhostfds passes multiple pre-created vhost
descriptors to qemu. This is done in patches 10-15.
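
For readers unfamiliar with the kernel side, a rough sketch of "TUNSETIFF N
times with the same device name" could look like the following. It is not the
code from this series; the function names are made up for illustration, and
it assumes kernel headers that define IFF_MULTI_QUEUE.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Fill an ifreq describing one queue of a multiqueue tap device. */
static void tap_mq_ifreq(struct ifreq *ifr, const char *ifname)
{
    memset(ifr, 0, sizeof(*ifr));
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
    strncpy(ifr->ifr_name, ifname, IFNAMSIZ - 1);
}

/*
 * Open `queues` descriptors attached to the same tap device.
 * Returns 0 on success; on failure returns -1 and closes every
 * descriptor opened so far.
 */
int tap_open_queues(const char *ifname, int *fds, int queues)
{
    struct ifreq ifr;
    int i;

    assert(queues > 0);
    tap_mq_ifreq(&ifr, ifname);

    for (i = 0; i < queues; i++) {
        fds[i] = open("/dev/net/tun", O_RDWR);
        if (fds[i] < 0 || ioctl(fds[i], TUNSETIFF, &ifr) < 0) {
            if (fds[i] >= 0) {
                close(fds[i]);
            }
            while (--i >= 0) {
                close(fds[i]);
            }
            return -1;
        }
    }
    return 0;
}
```

Running tap_open_queues() needs a kernel with multiqueue tun support and
sufficient privileges to create the device, so treat it as an illustration
rather than something tested against this series.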

A method of deleting a queue and a queue_index were also introduced for
virtio; this is done in patches 16-17.

Vhost was also changed to support multiqueue by introducing a start vq index,
which tracks the first virtqueue used by vhost, instead of assuming that vhost
always uses virtqueues starting from index 0. This is done in patch 18.
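
To illustrate the start vq index idea, here is a small sketch. It assumes
that each queue pair owns two consecutive virtqueues (rx then tx), which is
my reading of the multiqueue layout and not something taken from patch 18
itself; the helper names are hypothetical.

```c
#include <assert.h>

/* Assumption: one queue pair = two consecutive virtqueues, rx then tx. */
enum { VQS_PER_PAIR = 2 };

/* First device virtqueue index used by queue pair `pair`. */
static int pair_start_vq(int pair)
{
    assert(pair >= 0);
    return pair * VQS_PER_PAIR;
}

/*
 * Translate a device-wide virtqueue index into the index local to the
 * vhost instance whose first virtqueue is `start`.
 */
static int vq_local_index(int vq_index, int start)
{
    assert(vq_index >= start && vq_index < start + VQS_PER_PAIR);
    return vq_index - start;
}
```

Under that assumption, the vhost instance for pair 1 starts at virtqueue 2,
and device virtqueue 3 is its local virtqueue 1.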

The last part is the multiqueue userspace changes; this is done in patches 19-22.

With these changes, a user can start a multiqueue virtio-net device through

./qemu -netdev tap,id=hn0,queues=2,vhost=on -device virtio-net-pci,netdev=hn0

Management tools such as libvirt can pass multiple pre-created fds/vhostfds 
through

./qemu -netdev tap,id=hn0,fds=X:Y,vhostfds=M:N -device virtio-net-pci,netdev=hn0
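
As an illustration of the fds=X:Y syntax, a colon-separated descriptor list
could be parsed roughly as below. This is a standalone sketch with a
hypothetical function name, not the parser used in the series.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a colon-separated list of non-negative descriptors such as "3:4:5".
 * Stores the values in `fds` and returns how many were parsed, or -1 on
 * malformed input or when the list holds more than `max` entries.
 */
int parse_fd_list(const char *str, int *fds, int max)
{
    int n = 0;

    assert(str && fds && max > 0);

    for (;;) {
        char *end;
        long val;

        errno = 0;
        val = strtol(str, &end, 10);
        if (end == str || errno != 0 || val < 0 || val > INT_MAX) {
            return -1;          /* empty field, overflow or negative fd */
        }
        if (n == max) {
            return -1;          /* more entries than the caller allows */
        }
        fds[n++] = (int)val;

        if (*end == '\0') {
            return n;           /* consumed the whole list */
        }
        if (*end != ':') {
            return -1;          /* unexpected separator */
        }
        str = end + 1;
    }
}
```

A caller would typically compare the number of parsed fds with the number of
parsed vhostfds and reject the option if they differ.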

For those who want to try it, a git tree is available at:
git://github.com/jasowang/qemu.git

Changes from V3:
- convert bitfield to bool in TAPState (Blue)
- use abort() instead of assert(0) in tap code (Blue)
- rebase to the latest
- fix a bug that breaks the non-tap network

Changes from V2:
- Don't start/stop vhost threads when changing queues and simplify the interface
  between virtio-net and vhost further.

Changes from V1:
- silence checkpatch (Blue)
- use fds/vhostfds instead of fd/vhostfd (Stefan)
- use fds=X:Y:Z instead of fd=X,fd=Y,fd=Z (Anthony)
- split patches (Stefan)
- typos in commit log (Stefan)
- Warn 'queues=' when fds/vhostfds is used (Stefan)
- rename __net_init_tap to net_init_tap_one (Stefan)
- check the consistency of vnet_hdr of multiple tap fds (Stefan)
- disable multiqueue support for bridge-helper (Stefan)
- rename tap_attach()/tap_detach() to tap_enable()/tap_disable() (Stefan)
- fix booting with legacy guest (WanLong)
- don't bump the version when doing migration (Michael)
- simplify the interface between virtio-net and multiqueue vhost_net (Michael)
- rebase the patches to latest
- re-order the patches so that the net part comes first, to simplify
  reviewing
- simplify the interface between virtio-net and multiqueue vhost_net
- move the guest notifiers setup from vhost to vhost_net
- fix a build issue in hw/mcf_fec.c

Changes from RFC v2:
- rebase the code to latest qemu
- align the multiqueue virtio-net implementation to virtio spec
- split the patches into smaller patches
- set_link and hotplug support

Changes from RFC V1:
- rebase to the latest
- fix memory leak in parse_netdev
- fix guest notifiers assignment/de-assignment
- change the command line to:
   qemu -netdev tap,queues=2 -device virtio-net-pci,queues=2

Reference:
V1: http://lists.nongnu.org/archive/html/qemu-devel/2012-12/msg03558.html
RFC v2: http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg04108.html
RFC v1: http://comments.gmane.org/gmane.comp.emulators.qemu/100481

Perf Numbers:
- norm is short for normalized result
- trans.rate is short for transaction rate

Two Intel Xeon 5620 with direct connected intel 82599EB
Host/Guest kernel: David net tree
vhost enabled

- large improvements in both latency and cpu utilization in the
  request-response test
- a regression when the guest sends small packets, because TCP tends to batch
  less when latency improves

TCP_RR, 1q/2q/4q
size #sessions  1q trans.rate  1q norm  2q trans.rate  2q norm  4q trans.rate  4q norm
1    1          9393.26        595.64   9408.18        597.34   9375.19        584.12
1    20         72162.1        2214.24  129880.22      2456.13  196949.81      2298.13
1    50         107513.38      2653.99  139721.93      2490.58  259713.82      2873.57
1    100        126734.63      2676.54  145553.5       2406.63  265252.68      2943
64   1          9453.42        632.33   9371.37        616.13   9338.19        615.97
64   20         70620.03       2093.68  125155.75      2409.15  191239.91      2253.32
64   50         106966         2448.29  146518.67      2514.47  242134.07      2720.91
64   100        117046.35      2394.56  190153.09      2696.82  238881.29      2704.41
256  1          8733.29        736.36   8701.07        680.83   8608.92        530.1
256  20         69279.89       2274.45  115103.07      2299.76  144555.16      1963.53
256  50         97676.02       2296.09  150719.57      2522.92  254510.5       3028.44
256 100 

[PATCH V4 01/22] net: tap: using bool instead of bitfield

2013-01-30 Thread Jason Wang
Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c   |2 +-
 include/net/tap.h |4 ++--
 net/tap-win32.c   |6 +++---
 net/tap.c |   38 ++
 4 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 3bb01b1..faf4cc9 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -1069,7 +1069,7 @@ VirtIODevice *virtio_net_init(DeviceState *dev, NICConf 
*conf,
 n-nic = qemu_new_nic(net_virtio_info, conf, 
object_get_typename(OBJECT(dev)), dev-id, n);
 peer_test_vnet_hdr(n);
 if (peer_has_vnet_hdr(n)) {
-tap_using_vnet_hdr(n-nic-nc.peer, 1);
+tap_using_vnet_hdr(n-nic-nc.peer, true);
 n-host_hdr_len = sizeof(struct virtio_net_hdr);
 } else {
 n-host_hdr_len = 0;
diff --git a/include/net/tap.h b/include/net/tap.h
index bb7efb5..883cebf 100644
--- a/include/net/tap.h
+++ b/include/net/tap.h
@@ -29,10 +29,10 @@
 #include "qemu-common.h"
 #include "qapi-types.h"
 
-int tap_has_ufo(NetClientState *nc);
+bool tap_has_ufo(NetClientState *nc);
 int tap_has_vnet_hdr(NetClientState *nc);
 int tap_has_vnet_hdr_len(NetClientState *nc, int len);
-void tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr);
+void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr);
 void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, int 
ecn, int ufo);
 void tap_set_vnet_hdr_len(NetClientState *nc, int len);
 
diff --git a/net/tap-win32.c b/net/tap-win32.c
index 265369c..3052bba 100644
--- a/net/tap-win32.c
+++ b/net/tap-win32.c
@@ -722,9 +722,9 @@ int net_init_tap(const NetClientOptions *opts, const char 
*name,
 return 0;
 }
 
-int tap_has_ufo(NetClientState *nc)
+bool tap_has_ufo(NetClientState *nc)
 {
-return 0;
+return false;
 }
 
 int tap_has_vnet_hdr(NetClientState *nc)
@@ -741,7 +741,7 @@ void tap_fd_set_vnet_hdr_len(int fd, int len)
 {
 }
 
-void tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr)
+void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr)
 {
 }
 
diff --git a/net/tap.c b/net/tap.c
index eb40c42..5542c98 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -55,10 +55,10 @@ typedef struct TAPState {
 char down_script[1024];
 char down_script_arg[128];
 uint8_t buf[TAP_BUFSIZE];
-unsigned int read_poll : 1;
-unsigned int write_poll : 1;
-unsigned int using_vnet_hdr : 1;
-unsigned int has_ufo: 1;
+bool read_poll;
+bool write_poll;
+bool using_vnet_hdr;
+bool has_ufo;
 VHostNetState *vhost_net;
 unsigned host_vnet_hdr_len;
 } TAPState;
@@ -78,15 +78,15 @@ static void tap_update_fd_handler(TAPState *s)
  s);
 }
 
-static void tap_read_poll(TAPState *s, int enable)
+static void tap_read_poll(TAPState *s, bool enable)
 {
-s->read_poll = !!enable;
+s->read_poll = enable;
 tap_update_fd_handler(s);
 }
 
-static void tap_write_poll(TAPState *s, int enable)
+static void tap_write_poll(TAPState *s, bool enable)
 {
-s->write_poll = !!enable;
+s->write_poll = enable;
 tap_update_fd_handler(s);
 }
 
@@ -94,7 +94,7 @@ static void tap_writable(void *opaque)
 {
 TAPState *s = opaque;
 
-tap_write_poll(s, 0);
+tap_write_poll(s, false);
 
 qemu_flush_queued_packets(s-nc);
 }
@@ -108,7 +108,7 @@ static ssize_t tap_write_packet(TAPState *s, const struct 
iovec *iov, int iovcnt
 } while (len == -1 && errno == EINTR);
 
 if (len == -1 && errno == EAGAIN) {
-tap_write_poll(s, 1);
+tap_write_poll(s, true);
 return 0;
 }
 
@@ -186,7 +186,7 @@ ssize_t tap_read_packet(int tapfd, uint8_t *buf, int maxlen)
 static void tap_send_completed(NetClientState *nc, ssize_t len)
 {
 TAPState *s = DO_UPCAST(TAPState, nc, nc);
-tap_read_poll(s, 1);
+tap_read_poll(s, true);
 }
 
 static void tap_send(void *opaque)
@@ -209,12 +209,12 @@ static void tap_send(void *opaque)
 
 size = qemu_send_packet_async(s-nc, buf, size, tap_send_completed);
 if (size == 0) {
-tap_read_poll(s, 0);
+tap_read_poll(s, false);
 }
 } while (size  0  qemu_can_send_packet(s-nc));
 }
 
-int tap_has_ufo(NetClientState *nc)
+bool tap_has_ufo(NetClientState *nc)
 {
 TAPState *s = DO_UPCAST(TAPState, nc, nc);
 
@@ -253,12 +253,10 @@ void tap_set_vnet_hdr_len(NetClientState *nc, int len)
 s-host_vnet_hdr_len = len;
 }
 
-void tap_using_vnet_hdr(NetClientState *nc, int using_vnet_hdr)
+void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr)
 {
 TAPState *s = DO_UPCAST(TAPState, nc, nc);
 
-using_vnet_hdr = using_vnet_hdr != 0;
-
 assert(nc-info-type == NET_CLIENT_OPTIONS_KIND_TAP);
 assert(!!s-host_vnet_hdr_len == using_vnet_hdr);
 
@@ -290,8 +288,8 @@ static void tap_cleanup(NetClientState *nc)
 if (s-down_script[0])
 launch_script(s-down_script, s-down_script_arg, s-fd);
 
-tap_read_poll(s, 0);
-tap_write_poll(s, 0);
+

[PATCH V4 02/22] net: tap: use abort() instead of assert(0)

2013-01-30 Thread Jason Wang
Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/tap-linux.c |4 ++--
 net/tap-win32.c |2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/tap-linux.c b/net/tap-linux.c
index 059f5f3..0a6acc7 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -164,7 +164,7 @@ int tap_probe_vnet_hdr_len(int fd, int len)
 if (ioctl(fd, TUNSETVNETHDRSZ, &orig) == -1) {
 fprintf(stderr, "TUNGETVNETHDRSZ ioctl() failed: %s. Exiting.\n",
 strerror(errno));
-assert(0);
+abort();
 return -errno;
 }
 return 1;
@@ -175,7 +175,7 @@ void tap_fd_set_vnet_hdr_len(int fd, int len)
 if (ioctl(fd, TUNSETVNETHDRSZ, &len) == -1) {
 fprintf(stderr, "TUNSETVNETHDRSZ ioctl() failed: %s. Exiting.\n",
 strerror(errno));
-assert(0);
+abort();
 }
 }
 
diff --git a/net/tap-win32.c b/net/tap-win32.c
index 3052bba..601437e 100644
--- a/net/tap-win32.c
+++ b/net/tap-win32.c
@@ -762,5 +762,5 @@ int tap_has_vnet_hdr_len(NetClientState *nc, int len)
 
 void tap_set_vnet_hdr_len(NetClientState *nc, int len)
 {
-assert(0);
+abort();
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V4 03/22] net: introduce qemu_get_queue()

2013-01-30 Thread Jason Wang
To support multiqueue, this patch introduces a helper, qemu_get_queue(),
which is used to get the NetClientState of a device. Following patches will
refactor this helper to support multiqueue.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/cadence_gem.c|9 +++--
 hw/dp8393x.c|9 +++--
 hw/e1000.c  |   24 ---
 hw/eepro100.c   |   12 
 hw/etraxfs_eth.c|5 ++-
 hw/lan9118.c|   10 +++---
 hw/mcf_fec.c|4 +-
 hw/milkymist-minimac2.c |4 +-
 hw/mipsnet.c|4 +-
 hw/musicpal.c   |2 +-
 hw/ne2000-isa.c |2 +-
 hw/ne2000.c |7 ++--
 hw/opencores_eth.c  |6 ++--
 hw/pcnet-pci.c  |2 +-
 hw/pcnet.c  |7 ++--
 hw/rtl8139.c|   14 
 hw/smc91c111.c  |4 +-
 hw/spapr_llan.c |4 +-
 hw/stellaris_enet.c |5 ++-
 hw/usb/dev-network.c|   10 +++---
 hw/virtio-net.c |   76 ++-
 hw/xen_nic.c|   13 +---
 hw/xgmac.c  |4 +-
 hw/xilinx_axienet.c |4 +-
 hw/xilinx_ethlite.c |6 ++--
 include/net/net.h   |1 +
 net/net.c   |5 +++
 savevm.c|2 +-
 28 files changed, 140 insertions(+), 115 deletions(-)

diff --git a/hw/cadence_gem.c b/hw/cadence_gem.c
index 0d83442..9de688f 100644
--- a/hw/cadence_gem.c
+++ b/hw/cadence_gem.c
@@ -389,10 +389,10 @@ static void gem_init_register_masks(GemState *s)
  */
 static void phy_update_link(GemState *s)
 {
-DB_PRINT(down %d\n, s-nic-nc.link_down);
+DB_PRINT(down %d\n, qemu_get_queue(s-nic)-link_down);
 
 /* Autonegotiation status mirrors link status.  */
-if (s-nic-nc.link_down) {
+if (qemu_get_queue(s-nic)-link_down) {
 s-phy_regs[PHY_REG_STATUS] = ~(PHY_REG_STATUS_ANEGCMPL |
  PHY_REG_STATUS_LINK);
 s-phy_regs[PHY_REG_INT_ST] |= PHY_REG_INT_ST_LINKC;
@@ -906,9 +906,10 @@ static void gem_transmit(GemState *s)
 
 /* Send the packet somewhere */
 if (s-phy_loop) {
-gem_receive(s-nic-nc, tx_packet, total_bytes);
+gem_receive(qemu_get_queue(s-nic), tx_packet, total_bytes);
 } else {
-qemu_send_packet(s-nic-nc, tx_packet, total_bytes);
+qemu_send_packet(qemu_get_queue(s-nic), tx_packet,
+ total_bytes);
 }
 
 /* Prepare for next packet */
diff --git a/hw/dp8393x.c b/hw/dp8393x.c
index b501450..c2d0bc8 100644
--- a/hw/dp8393x.c
+++ b/hw/dp8393x.c
@@ -339,6 +339,7 @@ static void do_receiver_disable(dp8393xState *s)
 
 static void do_transmit_packets(dp8393xState *s)
 {
+NetClientState *nc = qemu_get_queue(s-nic);
 uint16_t data[12];
 int width, size;
 int tx_len, len;
@@ -408,13 +409,13 @@ static void do_transmit_packets(dp8393xState *s)
 if (s-regs[SONIC_RCR]  (SONIC_RCR_LB1 | SONIC_RCR_LB0)) {
 /* Loopback */
 s-regs[SONIC_TCR] |= SONIC_TCR_CRSL;
-if (s-nic-nc.info-can_receive(s-nic-nc)) {
+if (nc-info-can_receive(nc)) {
 s-loopback_packet = 1;
-s-nic-nc.info-receive(s-nic-nc, s-tx_buffer, tx_len);
+nc-info-receive(nc, s-tx_buffer, tx_len);
 }
 } else {
 /* Transmit packet */
-qemu_send_packet(s-nic-nc, s-tx_buffer, tx_len);
+qemu_send_packet(nc, s-tx_buffer, tx_len);
 }
 s-regs[SONIC_TCR] |= SONIC_TCR_PTX;
 
@@ -903,7 +904,7 @@ void dp83932_init(NICInfo *nd, hwaddr base, int it_shift,
 
 s-nic = qemu_new_nic(net_dp83932_info, s-conf, nd-model, nd-name, s);
 
-qemu_format_nic_info_str(s-nic-nc, s-conf.macaddr.a);
+qemu_format_nic_info_str(qemu_get_queue(s-nic), s-conf.macaddr.a);
 qemu_register_reset(nic_reset, s);
 nic_reset(s);
 
diff --git a/hw/e1000.c b/hw/e1000.c
index ef06ca1..7b310d7 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -167,11 +167,11 @@ set_phy_ctrl(E1000State *s, int index, uint16_t val)
 {
 if ((val  MII_CR_AUTO_NEG_EN)  (val  MII_CR_RESTART_AUTO_NEG)) {
 /* no need auto-negotiation if link was down */
-if (s-nic-nc.link_down) {
+if (qemu_get_queue(s-nic)-link_down) {
 s-phy_reg[PHY_STATUS] |= MII_SR_AUTONEG_COMPLETE;
 return;
 }
-s-nic-nc.link_down = true;
+qemu_get_queue(s-nic)-link_down = true;
 e1000_link_down(s);
 s-phy_reg[PHY_STATUS] = ~MII_SR_AUTONEG_COMPLETE;
 DBGOUT(PHY, Start link auto negotiation\n);
@@ -183,7 +183,7 @@ static void
 e1000_autoneg_timer(void *opaque)
 {
 E1000State *s = opaque;
-s-nic-nc.link_down = false;
+qemu_get_queue(s-nic)-link_down = false;
 e1000_link_up(s);
 s-phy_reg[PHY_STATUS] |= MII_SR_AUTONEG_COMPLETE;
   

[PATCH V4 04/22] net: introduce qemu_get_nic()

2013-01-30 Thread Jason Wang
To support multiqueue, this patch introduces a helper qemu_get_nic() to get
NICState from a NetClientState. The following patches would refactor this helper
to support multiqueue.
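
To make the relationship between the two helpers concrete, here is a
simplified sketch of the single-queue case, where the NetClientState is
embedded in the NICState. The struct definitions are cut-down stand-ins and
the sketch_ names are hypothetical; the real helpers live in net/net.c.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins; the real QEMU structs carry many more fields. */
typedef struct NetClientState {
    int link_down;
} NetClientState;

typedef struct NICState {
    NetClientState nc;   /* the single embedded queue */
    void *opaque;        /* device model state, e.g. an E1000State */
} NICState;

/* NIC -> queue: what qemu_get_queue() hides from device models. */
static NetClientState *sketch_get_queue(NICState *nic)
{
    return &nic->nc;
}

/* Queue -> NIC: step back from the embedded member to its container. */
static NICState *sketch_get_nic(NetClientState *nc)
{
    return (NICState *)((char *)nc - offsetof(NICState, nc));
}

/* Queue -> device state, replacing the open-coded DO_UPCAST(...)->opaque. */
static void *sketch_get_nic_opaque(NetClientState *nc)
{
    return sketch_get_nic(nc)->opaque;
}
```

Once device models only go through such helpers, the layout of NICState can
change (for example to an array of queues) without touching every driver.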

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/cadence_gem.c|8 
 hw/dp8393x.c|6 +++---
 hw/e1000.c  |8 
 hw/eepro100.c   |6 +++---
 hw/etraxfs_eth.c|6 +++---
 hw/lan9118.c|6 +++---
 hw/lance.c  |2 +-
 hw/mcf_fec.c|6 +++---
 hw/milkymist-minimac2.c |6 +++---
 hw/mipsnet.c|6 +++---
 hw/musicpal.c   |4 ++--
 hw/ne2000-isa.c |2 +-
 hw/ne2000.c |6 +++---
 hw/opencores_eth.c  |6 +++---
 hw/pcnet-pci.c  |2 +-
 hw/pcnet.c  |6 +++---
 hw/rtl8139.c|8 
 hw/smc91c111.c  |6 +++---
 hw/spapr_llan.c |4 ++--
 hw/stellaris_enet.c |6 +++---
 hw/usb/dev-network.c|6 +++---
 hw/virtio-net.c |   10 +-
 hw/xen_nic.c|4 ++--
 hw/xgmac.c  |6 +++---
 hw/xilinx_axienet.c |6 +++---
 hw/xilinx_ethlite.c |6 +++---
 include/net/net.h   |2 ++
 net/net.c   |   20 
 28 files changed, 92 insertions(+), 78 deletions(-)

diff --git a/hw/cadence_gem.c b/hw/cadence_gem.c
index 9de688f..ab35329 100644
--- a/hw/cadence_gem.c
+++ b/hw/cadence_gem.c
@@ -409,7 +409,7 @@ static int gem_can_receive(NetClientState *nc)
 {
 GemState *s;
 
-s = DO_UPCAST(NICState, nc, nc)-opaque;
+s = qemu_get_nic_opaque(nc);
 
 DB_PRINT(\n);
 
@@ -612,7 +612,7 @@ static ssize_t gem_receive(NetClientState *nc, const 
uint8_t *buf, size_t size)
 uint8_trxbuf[2048];
 uint8_t   *rxbuf_ptr;
 
-s = DO_UPCAST(NICState, nc, nc)-opaque;
+s = qemu_get_nic_opaque(nc);
 
 /* Do nothing if receive is not enabled. */
 if (!(s-regs[GEM_NWCTRL]  GEM_NWCTRL_RXENA)) {
@@ -1149,7 +1149,7 @@ static const MemoryRegionOps gem_ops = {
 
 static void gem_cleanup(NetClientState *nc)
 {
-GemState *s = DO_UPCAST(NICState, nc, nc)-opaque;
+GemState *s = qemu_get_nic_opaque(nc);
 
 DB_PRINT(\n);
 s-nic = NULL;
@@ -1158,7 +1158,7 @@ static void gem_cleanup(NetClientState *nc)
 static void gem_set_link(NetClientState *nc)
 {
 DB_PRINT(\n);
-phy_update_link(DO_UPCAST(NICState, nc, nc)-opaque);
+phy_update_link(qemu_get_nic_opaque(nc));
 }
 
 static NetClientInfo net_gem_info = {
diff --git a/hw/dp8393x.c b/hw/dp8393x.c
index c2d0bc8..0273fad 100644
--- a/hw/dp8393x.c
+++ b/hw/dp8393x.c
@@ -676,7 +676,7 @@ static const MemoryRegionOps dp8393x_ops = {
 
 static int nic_can_receive(NetClientState *nc)
 {
-dp8393xState *s = DO_UPCAST(NICState, nc, nc)-opaque;
+dp8393xState *s = qemu_get_nic_opaque(nc);
 
 if (!(s-regs[SONIC_CR]  SONIC_CR_RXEN))
 return 0;
@@ -725,7 +725,7 @@ static int receive_filter(dp8393xState *s, const uint8_t * 
buf, int size)
 
 static ssize_t nic_receive(NetClientState *nc, const uint8_t * buf, size_t 
size)
 {
-dp8393xState *s = DO_UPCAST(NICState, nc, nc)-opaque;
+dp8393xState *s = qemu_get_nic_opaque(nc);
 uint16_t data[10];
 int packet_type;
 uint32_t available, address;
@@ -861,7 +861,7 @@ static void nic_reset(void *opaque)
 
 static void nic_cleanup(NetClientState *nc)
 {
-dp8393xState *s = DO_UPCAST(NICState, nc, nc)-opaque;
+dp8393xState *s = qemu_get_nic_opaque(nc);
 
 memory_region_del_subregion(s-address_space, s-mmio);
 memory_region_destroy(s-mmio);
diff --git a/hw/e1000.c b/hw/e1000.c
index 7b310d7..36f4051 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -743,7 +743,7 @@ receive_filter(E1000State *s, const uint8_t *buf, int size)
 static void
 e1000_set_link_status(NetClientState *nc)
 {
-E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque;
+E1000State *s = qemu_get_nic_opaque(nc);
 uint32_t old_status = s-mac_reg[STATUS];
 
 if (nc-link_down) {
@@ -777,7 +777,7 @@ static bool e1000_has_rxbufs(E1000State *s, size_t 
total_size)
 static int
 e1000_can_receive(NetClientState *nc)
 {
-E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque;
+E1000State *s = qemu_get_nic_opaque(nc);
 
 return (s-mac_reg[RCTL]  E1000_RCTL_EN)  e1000_has_rxbufs(s, 1);
 }
@@ -793,7 +793,7 @@ static uint64_t rx_desc_base(E1000State *s)
 static ssize_t
 e1000_receive(NetClientState *nc, const uint8_t *buf, size_t size)
 {
-E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque;
+E1000State *s = qemu_get_nic_opaque(nc);
 struct e1000_rx_desc desc;
 dma_addr_t base;
 unsigned int n, rdt;
@@ -1230,7 +1230,7 @@ e1000_mmio_setup(E1000State *d)
 static void
 e1000_cleanup(NetClientState *nc)
 {
-E1000State *s = DO_UPCAST(NICState, nc, nc)-opaque;
+E1000State *s = qemu_get_nic_opaque(nc);
 
 s-nic = NULL;
 }
diff --git a/hw/eepro100.c b/hw/eepro100.c
index 

[PATCH V4 05/22] net: introduce qemu_del_nic()

2013-01-30 Thread Jason Wang
To support multiqueue NICs, this patch separates the NIC destructor from
qemu_del_net_client() into a new helper, qemu_del_nic(), since the mapping
between NICState and NetClientState is not 1:1 in multiqueue. Following
patches will refactor this function to support multiqueue NICs.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/e1000.c   |2 +-
 hw/eepro100.c|2 +-
 hw/ne2000.c  |2 +-
 hw/pcnet-pci.c   |2 +-
 hw/rtl8139.c |2 +-
 hw/usb/dev-network.c |2 +-
 hw/virtio-net.c  |2 +-
 hw/xen_nic.c |2 +-
 include/net/net.h|1 +
 net/net.c|   15 ++-
 10 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/hw/e1000.c b/hw/e1000.c
index 36f4051..f3590a9 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -1244,7 +1244,7 @@ pci_e1000_uninit(PCIDevice *dev)
 qemu_free_timer(d-autoneg_timer);
 memory_region_destroy(d-mmio);
 memory_region_destroy(d-io);
-qemu_del_net_client(qemu_get_queue(d-nic));
+qemu_del_nic(d-nic);
 }
 
 static NetClientInfo net_e1000_info = {
diff --git a/hw/eepro100.c b/hw/eepro100.c
index f9856ae..5d23796 100644
--- a/hw/eepro100.c
+++ b/hw/eepro100.c
@@ -1849,7 +1849,7 @@ static void pci_nic_uninit(PCIDevice *pci_dev)
 memory_region_destroy(s-flash_bar);
 vmstate_unregister(pci_dev-qdev, s-vmstate, s);
 eeprom93xx_free(pci_dev-qdev, s-eeprom);
-qemu_del_net_client(qemu_get_queue(s-nic));
+qemu_del_nic(s-nic);
 }
 
 static NetClientInfo net_eepro100_info = {
diff --git a/hw/ne2000.c b/hw/ne2000.c
index c989190..3dd1c84 100644
--- a/hw/ne2000.c
+++ b/hw/ne2000.c
@@ -751,7 +751,7 @@ static void pci_ne2000_exit(PCIDevice *pci_dev)
 NE2000State *s = d-ne2000;
 
 memory_region_destroy(s-io);
-qemu_del_net_client(qemu_get_queue(s-nic));
+qemu_del_nic(s-nic);
 }
 
 static Property ne2000_properties[] = {
diff --git a/hw/pcnet-pci.c b/hw/pcnet-pci.c
index 26c90bf..df63b22 100644
--- a/hw/pcnet-pci.c
+++ b/hw/pcnet-pci.c
@@ -279,7 +279,7 @@ static void pci_pcnet_uninit(PCIDevice *dev)
 memory_region_destroy(d-io_bar);
 qemu_del_timer(d-state.poll_timer);
 qemu_free_timer(d-state.poll_timer);
-qemu_del_net_client(qemu_get_queue(d-state.nic));
+qemu_del_nic(d-state.nic);
 }
 
 static NetClientInfo net_pci_pcnet_info = {
diff --git a/hw/rtl8139.c b/hw/rtl8139.c
index b825e83..d7716be 100644
--- a/hw/rtl8139.c
+++ b/hw/rtl8139.c
@@ -3446,7 +3446,7 @@ static void pci_rtl8139_uninit(PCIDevice *dev)
 }
 qemu_del_timer(s-timer);
 qemu_free_timer(s-timer);
-qemu_del_net_client(qemu_get_queue(s-nic));
+qemu_del_nic(s-nic);
 }
 
 static void rtl8139_set_link_status(NetClientState *nc)
diff --git a/hw/usb/dev-network.c b/hw/usb/dev-network.c
index abc6eac..a01a5e7 100644
--- a/hw/usb/dev-network.c
+++ b/hw/usb/dev-network.c
@@ -1330,7 +1330,7 @@ static void usb_net_handle_destroy(USBDevice *dev)
 
 /* TODO: remove the nd_table[] entry */
 rndis_clear_responsequeue(s);
-qemu_del_net_client(qemu_get_queue(s-nic));
+qemu_del_nic(s-nic);
 }
 
 static NetClientInfo net_usbnet_info = {
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index af9a17b..1a3fc74 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -1124,6 +1124,6 @@ void virtio_net_exit(VirtIODevice *vdev)
 qemu_bh_delete(n-tx_bh);
 }
 
-qemu_del_net_client(qemu_get_queue(n-nic));
+qemu_del_nic(n-nic);
 virtio_cleanup(n-vdev);
 }
diff --git a/hw/xen_nic.c b/hw/xen_nic.c
index 55b7960..4be077d 100644
--- a/hw/xen_nic.c
+++ b/hw/xen_nic.c
@@ -408,7 +408,7 @@ static void net_disconnect(struct XenDevice *xendev)
 netdev-rxs = NULL;
 }
 if (netdev-nic) {
-qemu_del_net_client(qemu_get_queue(netdev-nic));
+qemu_del_nic(netdev-nic);
 netdev-nic = NULL;
 }
 }
diff --git a/include/net/net.h b/include/net/net.h
index 96e05c4..f0d1aa2 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -77,6 +77,7 @@ NICState *qemu_new_nic(NetClientInfo *info,
const char *model,
const char *name,
void *opaque);
+void qemu_del_nic(NICState *nic);
 NetClientState *qemu_get_queue(NICState *nic);
 NICState *qemu_get_nic(NetClientState *nc);
 void *qemu_get_nic_opaque(NetClientState *nc);
diff --git a/net/net.c b/net/net.c
index 41dc12c..8999f8d 100644
--- a/net/net.c
+++ b/net/net.c
@@ -291,6 +291,15 @@ void qemu_del_net_client(NetClientState *nc)
 return;
 }
 
+assert(nc->info->type != NET_CLIENT_OPTIONS_KIND_NIC);
+
+qemu_cleanup_net_client(nc);
+qemu_free_net_client(nc);
+}
+
+void qemu_del_nic(NICState *nic)
+{
+NetClientState *nc = qemu_get_queue(nic);
 /* If this is a peer NIC and peer has already been deleted, free it now. */
 if (nc->peer && nc->info->type == NET_CLIENT_OPTIONS_KIND_NIC) {
 NICState *nic = qemu_get_nic(nc);
@@ -933,7 +942,11 @@ void net_cleanup(void)
 

[PATCH V4 06/22] net: introduce qemu_find_net_clients_except()

2013-01-30 Thread Jason Wang
In multiqueue, all NetClientStates that belong to the same netdev or NIC have
the same id. So this patch introduces a helper, qemu_find_net_clients_except(),
which finds all NetClientStates with the same id, skipping those of a given
type. This will be used by multiqueue networking.
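
A sketch of how such a helper can be used: because the return value counts
every match even when it exceeds max, a caller can first ask for the count
and then collect the pointers. The version below works on a plain array with
hypothetical types and names instead of the net_clients list.

```c
#include <assert.h>
#include <string.h>

typedef enum { KIND_NIC, KIND_TAP } ClientKind;   /* hypothetical kinds */

typedef struct Client {
    const char *name;
    ClientKind kind;
} Client;

/*
 * Collect up to `max` clients named `id` whose kind is not `skip`.
 * Returns the total number of matches, which may be larger than `max`.
 */
int find_clients_except(const Client *all, int count, const char *id,
                        const Client **out, ClientKind skip, int max)
{
    int found = 0;
    int i;

    for (i = 0; i < count; i++) {
        if (all[i].kind == skip) {
            continue;
        }
        if (strcmp(all[i].name, id) == 0) {
            if (found < max) {
                out[found] = &all[i];
            }
            found++;
        }
    }
    return found;
}
```

Calling it with max set to 0 never writes to out, so it doubles as a "how
many queues does this netdev have" query.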

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/net.h |2 ++
 net/net.c |   21 +
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/net/net.h b/include/net/net.h
index f0d1aa2..995df5c 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -68,6 +68,8 @@ typedef struct NICState {
 } NICState;
 
 NetClientState *qemu_find_netdev(const char *id);
+int qemu_find_net_clients_except(const char *id, NetClientState **ncs,
+ NetClientOptionsKind type, int max);
 NetClientState *qemu_new_net_client(NetClientInfo *info,
 NetClientState *peer,
 const char *model,
diff --git a/net/net.c b/net/net.c
index 8999f8d..6457fc0 100644
--- a/net/net.c
+++ b/net/net.c
@@ -508,6 +508,27 @@ NetClientState *qemu_find_netdev(const char *id)
 return NULL;
 }
 
+int qemu_find_net_clients_except(const char *id, NetClientState **ncs,
+                                 NetClientOptionsKind type, int max)
+{
+    NetClientState *nc;
+    int ret = 0;
+
+    QTAILQ_FOREACH(nc, &net_clients, next) {
+        if (nc->info->type == type) {
+            continue;
+        }
+        if (!strcmp(nc->name, id)) {
+            if (ret < max) {
+                ncs[ret] = nc;
+            }
+            ret++;
+        }
+    }
+
+    return ret;
+}
+
 static int nic_get_free_idx(void)
 {
 int index;
-- 
1.7.1



[PATCH V4 07/22] net: introduce qemu_net_client_setup()

2013-01-30 Thread Jason Wang
This patch separates the setup of NetClientState from its allocation. This
allows allocating an array of NetClientStates and initializing them one by
one, which is what multiqueue needs.
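
The pattern can be sketched in isolation as follows. The Client type, its
fields and the function names are hypothetical; the point is only that one
setup routine serves both a single allocation and the elements of an array.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, much smaller than NetClientState. */
typedef struct Client {
    const char *model;
    int queue_index;
} Client;

/* Initialise storage that somebody else has already allocated. */
static void client_setup(Client *c, const char *model, int queue_index)
{
    c->model = model;
    c->queue_index = queue_index;
}

/* Single-client path: allocate, then run the shared setup. */
Client *client_new(const char *model)
{
    Client *c = calloc(1, sizeof(*c));

    if (c) {
        client_setup(c, model, 0);
    }
    return c;
}

/* Multiqueue path: one allocation for the array, setup per element. */
Client *client_new_array(const char *model, int queues)
{
    Client *arr;
    int i;

    assert(queues > 0);
    arr = calloc((size_t)queues, sizeof(*arr));
    if (!arr) {
        return NULL;
    }
    for (i = 0; i < queues; i++) {
        client_setup(&arr[i], model, i);
    }
    return arr;
}
```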

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/net.c |   29 +++--
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/net/net.c b/net/net.c
index 6457fc0..4e84d54 100644
--- a/net/net.c
+++ b/net/net.c
@@ -182,17 +182,12 @@ static char *assign_name(NetClientState *nc1, const char *model)
     return g_strdup(buf);
 }
 
-NetClientState *qemu_new_net_client(NetClientInfo *info,
-                                    NetClientState *peer,
-                                    const char *model,
-                                    const char *name)
+static void qemu_net_client_setup(NetClientState *nc,
+                                  NetClientInfo *info,
+                                  NetClientState *peer,
+                                  const char *model,
+                                  const char *name)
 {
-    NetClientState *nc;
-
-    assert(info->size >= sizeof(NetClientState));
-
-    nc = g_malloc0(info->size);
-
     nc->info = info;
     nc->model = g_strdup(model);
     if (name) {
@@ -210,6 +205,20 @@ NetClientState *qemu_new_net_client(NetClientInfo *info,
 
     nc->send_queue = qemu_new_net_queue(nc);
 
+}
+
+NetClientState *qemu_new_net_client(NetClientInfo *info,
+                                    NetClientState *peer,
+                                    const char *model,
+                                    const char *name)
+{
+    NetClientState *nc;
+
+    assert(info->size >= sizeof(NetClientState));
+
+    nc = g_malloc0(info->size);
+    qemu_net_client_setup(nc, info, peer, model, name);
+
     return nc;
 }
 
-- 
1.7.1


[PATCH V4 08/22] net: introduce NetClientState destructor

2013-01-30 Thread Jason Wang
To allow allocating an array of NetClientStates and freeing it once, this patch
introduces a destructor for NetClientState. The destructor can do a
type-specific free, which multiqueue can use to free the array once.
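
The ownership idea can be sketched as follows. All names here (Client, Nic,
client_release() and so on) are hypothetical, and the policy shown, where
array elements carry no destructor and the owner frees the block, is one
possible arrangement rather than a statement of what later patches in the
series do.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Client Client;
typedef void ClientDestructor(Client *);

struct Client {
    ClientDestructor *destructor;   /* NULL when storage is owned elsewhere */
};

typedef struct Nic {
    int queues;
    Client *clients;                /* one block holding every queue */
} Nic;

/* Destructor for a client that was allocated on its own. */
void client_free_single(Client *c)
{
    free(c);
}

Client *client_new(void)
{
    Client *c = calloc(1, sizeof(*c));

    if (c != NULL) {
        c->destructor = client_free_single;
    }
    return c;
}

/* Generic teardown: only the destructor knows how the memory was obtained. */
void client_release(Client *c)
{
    assert(c != NULL);
    if (c->destructor) {
        c->destructor(c);
    }
}

Nic *nic_new(int queues)
{
    Nic *nic;

    assert(queues > 0);
    nic = calloc(1, sizeof(*nic));
    if (nic == NULL) {
        return NULL;
    }
    /* calloc leaves every destructor NULL: elements are not freed singly. */
    nic->clients = calloc((size_t)queues, sizeof(*nic->clients));
    if (nic->clients == NULL) {
        free(nic);
        return NULL;
    }
    nic->queues = queues;
    return nic;
}

void nic_free(Nic *nic)
{
    for (int i = 0; i < nic->queues; i++) {
        client_release(&nic->clients[i]);   /* no-op for array elements */
    }
    free(nic->clients);                      /* the array is freed once */
    free(nic);
}
```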

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/net.h |2 ++
 net/net.c |   17 +
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/net/net.h b/include/net/net.h
index 995df5c..22adc99 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -35,6 +35,7 @@ typedef ssize_t (NetReceive)(NetClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(NetClientState *, const struct iovec *, int);
 typedef void (NetCleanup) (NetClientState *);
 typedef void (LinkStatusChanged)(NetClientState *);
+typedef void (NetClientDestructor)(NetClientState *);
 
 typedef struct NetClientInfo {
     NetClientOptionsKind type;
@@ -58,6 +59,7 @@ struct NetClientState {
     char *name;
     char info_str[256];
     unsigned receive_disabled : 1;
+    NetClientDestructor *destructor;
 };
 
 typedef struct NICState {
diff --git a/net/net.c b/net/net.c
index 4e84d54..6368896 100644
--- a/net/net.c
+++ b/net/net.c
@@ -182,11 +182,17 @@ static char *assign_name(NetClientState *nc1, const char *model)
     return g_strdup(buf);
 }
 
+static void qemu_net_client_destructor(NetClientState *nc)
+{
+    g_free(nc);
+}
+
 static void qemu_net_client_setup(NetClientState *nc,
                                   NetClientInfo *info,
                                   NetClientState *peer,
                                   const char *model,
-                                  const char *name)
+                                  const char *name,
+                                  NetClientDestructor *destructor)
 {
     nc->info = info;
     nc->model = g_strdup(model);
@@ -204,7 +210,7 @@ static void qemu_net_client_setup(NetClientState *nc,
     QTAILQ_INSERT_TAIL(&net_clients, nc, next);
 
     nc->send_queue = qemu_new_net_queue(nc);
-
+    nc->destructor = destructor;
 }
 
 NetClientState *qemu_new_net_client(NetClientInfo *info,
@@ -217,7 +223,8 @@ NetClientState *qemu_new_net_client(NetClientInfo *info,
     assert(info->size >= sizeof(NetClientState));
 
     nc = g_malloc0(info->size);
-    qemu_net_client_setup(nc, info, peer, model, name);
+    qemu_net_client_setup(nc, info, peer, model, name,
+                          qemu_net_client_destructor);
 
     return nc;
 }
@@ -279,7 +286,9 @@ static void qemu_free_net_client(NetClientState *nc)
     }
     g_free(nc->name);
     g_free(nc->model);
-    g_free(nc);
+    if (nc->destructor) {
+        nc->destructor(nc);
+    }
 }
 
 void qemu_del_net_client(NetClientState *nc)
-- 
1.7.1


[PATCH V4 09/22] net: multiqueue support

2013-01-30 Thread Jason Wang
This patch adds basic multiqueue support to qemu. The idea is simple: an array
of NetClientStates is introduced in NICState, and parse_netdev() is extended to
find all NetClientStates that belong to the backend and place their pointers in
NICConf. qemu_new_nic() can then set up an N:N mapping between the
NetClientStates that belong to a nic and the NetClientStates that belong to the
netdev. A queue_index is introduced in NetClientState to track its index. After
this, each peer of a NICState is abstracted as a queue.

After this change, all NetClientStates that belong to the same backend/nic have
the same id. When the user wants to change the link status, all NetClientStates
that belong to the same backend/nic are changed together. When the user wants
to delete a device or netdev, all NetClientStates that belong to the same
backend/nic are deleted together. Changing or deleting a specific queue is not
allowed.
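
The N:N pairing and the "all queues together" rule can be sketched like this.
The Client type, pair_queues(), set_link_all() and the MAX_QUEUES value are
assumptions made for the illustration; only the ideas (queue i of the nic
peers with queue i of the backend, an already-peered backend is rejected,
link state is applied to every queue) come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_QUEUES 8   /* illustrative bound, not QEMU's MAX_QUEUE_NUM */

/* Hypothetical, reduced per-queue client state. */
typedef struct Client {
    struct Client *peer;
    int queue_index;
    bool link_down;
} Client;

/* Pair nic queue i with backend queue i; fail if any backend is in use. */
int pair_queues(Client *nic, Client *backend, int queues)
{
    assert(queues > 0 && queues <= MAX_QUEUES);
    for (int i = 0; i < queues; i++) {
        if (backend[i].peer != NULL) {
            return -1;
        }
    }
    for (int i = 0; i < queues; i++) {
        nic[i].peer = &backend[i];
        backend[i].peer = &nic[i];
        nic[i].queue_index = i;
        backend[i].queue_index = i;
    }
    return 0;
}

/* Link state belongs to the device: apply it to every queue, never to one. */
void set_link_all(Client *clients, int queues, bool down)
{
    for (int i = 0; i < queues; i++) {
        clients[i].link_down = down;
    }
}
```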

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/dp8393x.c|2 +-
 hw/mcf_fec.c|2 +-
 hw/qdev-properties-system.c |   46 +++---
 hw/qdev-properties.h|6 +-
 include/net/net.h   |   18 +--
 net/net.c   |  113 +++
 6 files changed, 139 insertions(+), 48 deletions(-)

diff --git a/hw/dp8393x.c b/hw/dp8393x.c
index 0273fad..808157b 100644
--- a/hw/dp8393x.c
+++ b/hw/dp8393x.c
@@ -900,7 +900,7 @@ void dp83932_init(NICInfo *nd, hwaddr base, int it_shift,
     s->regs[SONIC_SR] = 0x0004; /* only revision recognized by Linux */
 
     s->conf.macaddr = nd->macaddr;
-    s->conf.peer = nd->netdev;
+    s->conf.peers.ncs[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_dp83932_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/mcf_fec.c b/hw/mcf_fec.c
index 909e32b..8e60f09 100644
--- a/hw/mcf_fec.c
+++ b/hw/mcf_fec.c
@@ -472,7 +472,7 @@ void mcf_fec_init(MemoryRegion *sysmem, NICInfo *nd,
     memory_region_add_subregion(sysmem, base, &s->iomem);
 
     s->conf.macaddr = nd->macaddr;
-    s->conf.peer = nd->netdev;
+    s->conf.peers.ncs[0] = nd->netdev;
 
     s->nic = qemu_new_nic(&net_mcf_fec_info, &s->conf, nd->model, nd->name, s);
 
diff --git a/hw/qdev-properties-system.c b/hw/qdev-properties-system.c
index ce0f793..ce3af22 100644
--- a/hw/qdev-properties-system.c
+++ b/hw/qdev-properties-system.c
@@ -173,16 +173,47 @@ PropertyInfo qdev_prop_chr = {
 
 static int parse_netdev(DeviceState *dev, const char *str, void **ptr)
 {
-    NetClientState *netdev = qemu_find_netdev(str);
+    NICPeers *peers_ptr = (NICPeers *)ptr;
+    NICConf *conf = container_of(peers_ptr, NICConf, peers);
+    NetClientState **ncs = peers_ptr->ncs;
+    NetClientState *peers[MAX_QUEUE_NUM];
+    int queues, i = 0;
+    int ret;
 
-    if (netdev == NULL) {
-        return -ENOENT;
+    queues = qemu_find_net_clients_except(str, peers,
+                                          NET_CLIENT_OPTIONS_KIND_NIC,
+                                          MAX_QUEUE_NUM);
+    if (queues == 0) {
+        ret = -ENOENT;
+        goto err;
     }
-    if (netdev->peer) {
-        return -EEXIST;
+
+    if (queues > MAX_QUEUE_NUM) {
+        ret = -E2BIG;
+        goto err;
+    }
+
+    for (i = 0; i < queues; i++) {
+        if (peers[i] == NULL) {
+            ret = -ENOENT;
+            goto err;
+        }
+
+        if (peers[i]->peer) {
+            ret = -EEXIST;
+            goto err;
+        }
+
+        ncs[i] = peers[i];
+        ncs[i]->queue_index = i;
     }
-    *ptr = netdev;
+
+    conf->queues = queues;
+
     return 0;
+
+err:
+    return ret;
 }
 
 static const char *print_netdev(void *ptr)
@@ -249,7 +280,8 @@ static void set_vlan(Object *obj, Visitor *v, void *opaque,
 {
     DeviceState *dev = DEVICE(obj);
     Property *prop = opaque;
-    NetClientState **ptr = qdev_get_prop_ptr(dev, prop);
+    NICPeers *peers_ptr = qdev_get_prop_ptr(dev, prop);
+    NetClientState **ptr = &peers_ptr->ncs[0];
     Error *local_err = NULL;
     int32_t id;
     NetClientState *hubport;
diff --git a/hw/qdev-properties.h b/hw/qdev-properties.h
index ddcf774..20c67f3 100644
--- a/hw/qdev-properties.h
+++ b/hw/qdev-properties.h
@@ -31,7 +31,7 @@ extern PropertyInfo qdev_prop_pci_host_devaddr;
 .name  = (_name),\
 .info  = (_prop),   \
 .offset= offsetof(_state, _field)\
-+ type_check(_type,typeof_field(_state, _field)),\
++ type_check(_type, typeof_field(_state, _field)),   \
 }
 #define DEFINE_PROP_DEFAULT(_name, _state, _field, _defval, _prop, _type) { \
 .name  = (_name),   \
@@ -77,9 +77,9 @@ extern PropertyInfo qdev_prop_pci_host_devaddr;
 #define DEFINE_PROP_STRING(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_string, char*)
 #define DEFINE_PROP_NETDEV(_n, _s, _f) \
-DEFINE_PROP(_n, _s, _f, 

[PATCH V4 10/22] tap: import linux multiqueue constants

2013-01-30 Thread Jason Wang
Import the multiqueue constants from if_tun.h in Linux 3.8-rc3. A new ifr flag,
IFF_MULTI_QUEUE, is introduced to create a multiqueue backend by calling
TUNSETIFF with this flag and the same interface name many times.

A new ioctl, TUNSETQUEUE, is introduced. Calling it with IFF_DETACH_QUEUE
disables the queue in the Linux kernel; calling it with IFF_ATTACH_QUEUE
enables the queue.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/tap-linux.h |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/tap-linux.h b/net/tap-linux.h
index cb2a6d4..65087e1 100644
--- a/net/tap-linux.h
+++ b/net/tap-linux.h
@@ -29,6 +29,7 @@
 #define TUNSETSNDBUF   _IOW('T', 212, int)
 #define TUNGETVNETHDRSZ _IOR('T', 215, int)
 #define TUNSETVNETHDRSZ _IOW('T', 216, int)
+#define TUNSETQUEUE  _IOW('T', 217, int)
 
 #endif
 
@@ -36,6 +37,9 @@
 #define IFF_TAP0x0002
 #define IFF_NO_PI  0x1000
 #define IFF_VNET_HDR   0x4000
+#define IFF_MULTI_QUEUE 0x0100
+#define IFF_ATTACH_QUEUE 0x0200
+#define IFF_DETACH_QUEUE 0x0400
 
 /* Features for GSO (TUNSETOFFLOAD). */
 #define TUN_F_CSUM 0x01/* You can hand me unchecksummed packets. */
-- 
1.7.1


[PATCH V4 11/22] tap: factor out common tap initialization

2013-01-30 Thread Jason Wang
This patch factors out the common initialization of tap into a new helper
net_init_tap_one(). This will be used by multiqueue tap patches.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/tap.c |  130 ++---
 1 files changed, 73 insertions(+), 57 deletions(-)

diff --git a/net/tap.c b/net/tap.c
index 5542c98..23fb6e0 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -591,6 +591,73 @@ static int net_tap_init(const NetdevTapOptions *tap, int *vnet_hdr,
     return fd;
 }
 
+static int net_init_tap_one(const NetdevTapOptions *tap, NetClientState *peer,
+                            const char *model, const char *name,
+                            const char *ifname, const char *script,
+                            const char *downscript, const char *vhostfdname,
+                            int vnet_hdr, int fd)
+{
+    TAPState *s;
+
+    s = net_tap_fd_init(peer, model, name, fd, vnet_hdr);
+    if (!s) {
+        close(fd);
+        return -1;
+    }
+
+    if (tap_set_sndbuf(s->fd, tap) < 0) {
+        return -1;
+    }
+
+    if (tap->has_fd) {
+        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "fd=%d", fd);
+    } else if (tap->has_helper) {
+        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "helper=%s",
+                 tap->helper);
+    } else {
+        const char *downscript;
+
+        downscript = tap->has_downscript ? tap->downscript :
+            DEFAULT_NETWORK_DOWN_SCRIPT;
+
+        snprintf(s->nc.info_str, sizeof(s->nc.info_str),
+                 "ifname=%s,script=%s,downscript=%s", ifname, script,
+                 downscript);
+
+        if (strcmp(downscript, "no") != 0) {
+            snprintf(s->down_script, sizeof(s->down_script), "%s", downscript);
+            snprintf(s->down_script_arg, sizeof(s->down_script_arg),
+                     "%s", ifname);
+        }
+    }
+
+    if (tap->has_vhost ? tap->vhost :
+        vhostfdname || (tap->has_vhostforce && tap->vhostforce)) {
+        int vhostfd;
+
+        if (tap->has_vhostfd) {
+            vhostfd = monitor_handle_fd_param(cur_mon, vhostfdname);
+            if (vhostfd == -1) {
+                return -1;
+            }
+        } else {
+            vhostfd = -1;
+        }
+
+        s->vhost_net = vhost_net_init(&s->nc, vhostfd,
+                                      tap->has_vhostforce && tap->vhostforce);
+        if (!s->vhost_net) {
+            error_report("vhost-net requested but could not be initialized");
+            return -1;
+        }
+    } else if (tap->has_vhostfd) {
+        error_report("vhostfd= is not valid without vhost");
+        return -1;
+    }
+
+    return 0;
+}
+
 int net_init_tap(const NetClientOptions *opts, const char *name,
                  NetClientState *peer)
 {
@@ -598,10 +665,10 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
 
     int fd, vnet_hdr = 0;
     const char *model;
-    TAPState *s;
 
     /* for the no-fd, no-helper case */
     const char *script = NULL; /* suppress wrong "uninit'd use" gcc warning */
+    const char *downscript = NULL;
     char ifname[128];
 
     assert(opts->kind == NET_CLIENT_OPTIONS_KIND_TAP);
@@ -647,6 +714,8 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
 
     } else {
         script = tap->has_script ? tap->script : DEFAULT_NETWORK_SCRIPT;
+        downscript = tap->has_downscript ? tap->downscript :
+            DEFAULT_NETWORK_DOWN_SCRIPT;
         fd = net_tap_init(tap, &vnet_hdr, script, ifname, sizeof ifname);
         if (fd == -1) {
             return -1;
@@ -655,62 +724,9 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
         model = "tap";
     }
 
-    s = net_tap_fd_init(peer, model, name, fd, vnet_hdr);
-    if (!s) {
-        close(fd);
-        return -1;
-    }
-
-    if (tap_set_sndbuf(s->fd, tap) < 0) {
-        return -1;
-    }
-
-    if (tap->has_fd) {
-        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "fd=%d", fd);
-    } else if (tap->has_helper) {
-        snprintf(s->nc.info_str, sizeof(s->nc.info_str), "helper=%s",
-                 tap->helper);
-    } else {
-        const char *downscript;
-
-        downscript = tap->has_downscript ? tap->downscript :
-                                           DEFAULT_NETWORK_DOWN_SCRIPT;
-
-        snprintf(s->nc.info_str, sizeof(s->nc.info_str),
-                 "ifname=%s,script=%s,downscript=%s", ifname, script,
-                 downscript);
-
-        if (strcmp(downscript, "no") != 0) {
-            snprintf(s->down_script, sizeof(s->down_script), "%s", downscript);
-            snprintf(s->down_script_arg, sizeof(s->down_script_arg), "%s", ifname);
-        }
-    }
-
-    if (tap->has_vhost ? tap->vhost :
-        tap->has_vhostfd || (tap->has_vhostforce && tap->vhostforce)) {
-        int vhostfd;
-
-        if (tap->has_vhostfd) {
-            vhostfd = monitor_handle_fd_param(cur_mon, tap->vhostfd);
-            if (vhostfd == -1) {
-                return -1;
-            }
-        } else {
-            vhostfd = 

[PATCH V4 12/22] tap: add Linux multiqueue support

2013-01-30 Thread Jason Wang
This patch adds basic multiqueue support for Linux. When multiqueue is needed,
we first check whether the kernel supports multiqueue tap before creating more
queues. Two new functions, tap_fd_enable() and tap_fd_disable(), are introduced
to enable and disable a specific queue. Since multiqueue is only supported on
Linux, they return an error on other platforms.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 net/tap-aix.c |   10 ++
 net/tap-bsd.c |   11 +++
 net/tap-haiku.c   |   11 +++
 net/tap-linux.c   |   52 
 net/tap-solaris.c |   11 +++
 net/tap_int.h |2 ++
 6 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/net/tap-aix.c b/net/tap-aix.c
index aff6c52..66e0574 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -59,3 +59,13 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+return -1;
+}
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 01c705b..cfc7a28 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -145,3 +145,14 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+return -1;
+}
+
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index 08cc034..664d40f 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -59,3 +59,14 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+return -1;
+}
+
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 0a6acc7..bdb0a79 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -41,6 +41,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
     struct ifreq ifr;
     int fd, ret;
     int len = sizeof(struct virtio_net_hdr);
+    int mq_required = 0;
 
     TFR(fd = open(PATH_NET_TUN, O_RDWR));
     if (fd < 0) {
@@ -76,6 +77,20 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
         ioctl(fd, TUNSETVNETHDRSZ, &len);
     }
 
+    if (mq_required) {
+        unsigned int features;
+
+        if ((ioctl(fd, TUNGETFEATURES, &features) != 0) ||
+            !(features & IFF_MULTI_QUEUE)) {
+            error_report("multiqueue required, but no kernel "
+                         "support for IFF_MULTI_QUEUE available");
+            close(fd);
+            return -1;
+        } else {
+            ifr.ifr_flags |= IFF_MULTI_QUEUE;
+        }
+    }
+
     if (ifname[0] != '\0')
         pstrcpy(ifr.ifr_name, IFNAMSIZ, ifname);
     else
@@ -209,3 +224,40 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
         }
     }
 }
+
+/* Enable a specific queue of tap. */
+int tap_fd_enable(int fd)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_ATTACH_QUEUE;
+    ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not enable queue");
+    }
+
+    return ret;
+}
+
+/* Disable a specific queue of tap. */
+int tap_fd_disable(int fd)
+{
+    struct ifreq ifr;
+    int ret;
+
+    memset(&ifr, 0, sizeof(ifr));
+
+    ifr.ifr_flags = IFF_DETACH_QUEUE;
+    ret = ioctl(fd, TUNSETQUEUE, (void *) &ifr);
+
+    if (ret != 0) {
+        error_report("could not disable queue");
+    }
+
+    return ret;
+}
+
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index 486a7ea..12cc392 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -225,3 +225,14 @@ void tap_fd_set_offload(int fd, int csum, int tso4,
 int tso6, int ecn, int ufo)
 {
 }
+
+int tap_fd_enable(int fd)
+{
+return -1;
+}
+
+int tap_fd_disable(int fd)
+{
+return -1;
+}
+
diff --git a/net/tap_int.h b/net/tap_int.h
index 1dffe12..ca1c21b 100644
--- a/net/tap_int.h
+++ b/net/tap_int.h
@@ -42,5 +42,7 @@ int tap_probe_vnet_hdr_len(int fd, int len);
 int tap_probe_has_ufo(int fd);
 void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, int ecn, int 
ufo);
 void tap_fd_set_vnet_hdr_len(int fd, int len);
+int tap_fd_enable(int fd);
+int tap_fd_disable(int fd);
 
 #endif /* QEMU_TAP_H */
-- 
1.7.1


[PATCH V4 13/22] tap: support enabling or disabling a queue

2013-01-30 Thread Jason Wang
This patch introduces a new bit, enabled, in TAPState which tracks whether a
specific queue/fd is enabled. The tap/fd is enabled during initialization and
can be enabled/disabled by tap_enable() and tap_disable(), which call
platform-specific helpers to do the real work. A tap fd is only polled while
the tap is enabled.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/tap.h |2 ++
 net/tap-win32.c   |   10 ++
 net/tap.c |   43 ---
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/include/net/tap.h b/include/net/tap.h
index 883cebf..a994f20 100644
--- a/include/net/tap.h
+++ b/include/net/tap.h
@@ -35,6 +35,8 @@ int tap_has_vnet_hdr_len(NetClientState *nc, int len);
 void tap_using_vnet_hdr(NetClientState *nc, bool using_vnet_hdr);
 void tap_set_offload(NetClientState *nc, int csum, int tso4, int tso6, int 
ecn, int ufo);
 void tap_set_vnet_hdr_len(NetClientState *nc, int len);
+int tap_enable(NetClientState *nc);
+int tap_disable(NetClientState *nc);
 
 int tap_get_fd(NetClientState *nc);
 
diff --git a/net/tap-win32.c b/net/tap-win32.c
index 601437e..91e9e84 100644
--- a/net/tap-win32.c
+++ b/net/tap-win32.c
@@ -764,3 +764,13 @@ void tap_set_vnet_hdr_len(NetClientState *nc, int len)
 {
 abort();
 }
+
+int tap_enable(NetClientState *nc)
+{
+abort();
+}
+
+int tap_disable(NetClientState *nc)
+{
+abort();
+}
diff --git a/net/tap.c b/net/tap.c
index 23fb6e0..8610ba2 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -59,6 +59,7 @@ typedef struct TAPState {
     bool write_poll;
     bool using_vnet_hdr;
     bool has_ufo;
+    bool enabled;
     VHostNetState *vhost_net;
     unsigned host_vnet_hdr_len;
 } TAPState;
@@ -72,9 +73,9 @@ static void tap_writable(void *opaque);
 static void tap_update_fd_handler(TAPState *s)
 {
     qemu_set_fd_handler2(s->fd,
-                         s->read_poll  ? tap_can_send : NULL,
-                         s->read_poll  ? tap_send : NULL,
-                         s->write_poll ? tap_writable : NULL,
+                         s->read_poll && s->enabled ? tap_can_send : NULL,
+                         s->read_poll && s->enabled ? tap_send : NULL,
+                         s->write_poll && s->enabled ? tap_writable : NULL,
                          s);
 }
 
@@ -337,6 +338,7 @@ static TAPState *net_tap_fd_init(NetClientState *peer,
     s->host_vnet_hdr_len = vnet_hdr ? sizeof(struct virtio_net_hdr) : 0;
     s->using_vnet_hdr = false;
     s->has_ufo = tap_probe_has_ufo(s->fd);
+    s->enabled = true;
     tap_set_offload(&s->nc, 0, 0, 0, 0, 0);
     /*
      * Make sure host header length is set correctly in tap:
@@ -735,3 +737,38 @@ VHostNetState *tap_get_vhost_net(NetClientState *nc)
     assert(nc->info->type == NET_CLIENT_OPTIONS_KIND_TAP);
     return s->vhost_net;
 }
+
+int tap_enable(NetClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled) {
+        return 0;
+    } else {
+        ret = tap_fd_enable(s->fd);
+        if (ret == 0) {
+            s->enabled = true;
+            tap_update_fd_handler(s);
+        }
+        return ret;
+    }
+}
+
+int tap_disable(NetClientState *nc)
+{
+    TAPState *s = DO_UPCAST(TAPState, nc, nc);
+    int ret;
+
+    if (s->enabled == 0) {
+        return 0;
+    } else {
+        ret = tap_fd_disable(s->fd);
+        if (ret == 0) {
+            qemu_purge_queued_packets(nc);
+            s->enabled = false;
+            tap_update_fd_handler(s);
+        }
+        return ret;
+    }
+}
-- 
1.7.1


[PATCH V4 14/22] tap: introduce a helper to get the name of an interface

2013-01-30 Thread Jason Wang
This patch introduces a helper, tap_get_ifname(), to get the device name of a
tap device. This is needed when ifname is unspecified on the command line and
qemu is asked to create the tap device by itself. In this situation the name is
allocated by the kernel, so if multiqueue is requested, we need to fetch the
name after creating the first queue.

Only Linux has this support since it's the only platform that supports
multiqueue tap.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/tap.h |1 +
 net/tap-aix.c |6 ++
 net/tap-bsd.c |4 
 net/tap-haiku.c   |4 
 net/tap-linux.c   |   13 +
 net/tap-solaris.c |4 
 net/tap_int.h |1 +
 7 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/net/tap.h b/include/net/tap.h
index a994f20..c3eb85a 100644
--- a/include/net/tap.h
+++ b/include/net/tap.h
@@ -37,6 +37,7 @@ void tap_set_offload(NetClientState *nc, int csum, int tso4, 
int tso6, int ecn,
 void tap_set_vnet_hdr_len(NetClientState *nc, int len);
 int tap_enable(NetClientState *nc);
 int tap_disable(NetClientState *nc);
+int tap_get_ifname(NetClientState *nc, char *ifname);
 
 int tap_get_fd(NetClientState *nc);
 
diff --git a/net/tap-aix.c b/net/tap-aix.c
index 66e0574..e760e9a 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -69,3 +69,9 @@ int tap_fd_disable(int fd)
 {
 return -1;
 }
+
+int tap_fd_get_ifname(int fd, char *ifname)
+{
+return -1;
+}
+
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index cfc7a28..4f22109 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -156,3 +156,7 @@ int tap_fd_disable(int fd)
 return -1;
 }
 
+int tap_fd_get_ifname(int fd, char *ifname)
+{
+return -1;
+}
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index 664d40f..b3b5fbb 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -70,3 +70,7 @@ int tap_fd_disable(int fd)
 return -1;
 }
 
+int tap_fd_get_ifname(int fd, char *ifname)
+{
+return -1;
+}
diff --git a/net/tap-linux.c b/net/tap-linux.c
index bdb0a79..3b21662 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -261,3 +261,16 @@ int tap_fd_disable(int fd)
     return ret;
 }
 
+int tap_fd_get_ifname(int fd, char *ifname)
+{
+    struct ifreq ifr;
+
+    if (ioctl(fd, TUNGETIFF, &ifr) != 0) {
+        error_report("TUNGETIFF ioctl() failed: %s",
+                     strerror(errno));
+        return -1;
+    }
+
+    pstrcpy(ifname, sizeof(ifr.ifr_name), ifr.ifr_name);
+    return 0;
+}
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index 12cc392..214d95e 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -236,3 +236,7 @@ int tap_fd_disable(int fd)
 return -1;
 }
 
+int tap_fd_get_ifname(int fd, char *ifname)
+{
+return -1;
+}
diff --git a/net/tap_int.h b/net/tap_int.h
index ca1c21b..125f83d 100644
--- a/net/tap_int.h
+++ b/net/tap_int.h
@@ -44,5 +44,6 @@ void tap_fd_set_offload(int fd, int csum, int tso4, int tso6, 
int ecn, int ufo);
 void tap_fd_set_vnet_hdr_len(int fd, int len);
 int tap_fd_enable(int fd);
 int tap_fd_disable(int fd);
+int tap_fd_get_ifname(int fd, char *ifname);
 
 #endif /* QEMU_TAP_H */
-- 
1.7.1


[PATCH V4 15/22] tap: multiqueue support

2013-01-30 Thread Jason Wang
Recently, Linux gained support for multiqueue tap, which lets userspace call
TUNSETIFF for a single device many times to create multiple file descriptors as
independent queues. The user can also enable/disable a specific queue through
TUNSETQUEUE.

This patch adds the generic infrastructure to create multiqueue taps. To
achieve this, a new parameter, queues, is introduced to specify how many queues
qemu itself is expected to create for the tap. Alternatively, management can
pass multiple pre-created tap file descriptors separated with ':' through a
new parameter, fds, like -netdev tap,id=hn0,fds=X:Y:..:Z. Multiple vhost file
descriptors can also be passed in this way.
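
Splitting such a ':'-separated list can be sketched as below. This is a
simplification with a hypothetical parse_fd_list() helper: it assumes every
entry is a plain non-negative decimal number, whereas the real code may
resolve entries differently (for example through the monitor's fd handling).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a list such as "3:4:5" into fds[]. Returns the number of
 * descriptors parsed, or -1 if the list is malformed or holds more than
 * max entries.
 */
int parse_fd_list(const char *str, int *fds, int max)
{
    const char *p = str;
    int n = 0;

    assert(str != NULL && fds != NULL && max > 0);
    for (;;) {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p || errno != 0 || v < 0 || v > INT_MAX) {
            return -1;          /* empty entry, overflow or negative fd */
        }
        if (n == max) {
            return -1;          /* more entries than the caller can hold */
        }
        fds[n++] = (int)v;
        if (*end == '\0') {
            return n;
        }
        if (*end != ':') {
            return -1;          /* unexpected separator */
        }
        p = end + 1;
    }
}
```

The number of entries returned is what would determine the queue count when
descriptors are passed in rather than created by qemu.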

Each TAPState is still associated with one tap fd, which means multiple
TAPStates are created when the user needs multiqueue taps. Since each TAPState
contains one NetClientState, with the multiqueue nic support, N NetClientState
peers are built up.

A new parameter, mq_required, is introduced in tap_open() to create multiqueue
tap fds.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 include/net/tap.h |1 -
 net/tap-aix.c |3 +-
 net/tap-bsd.c |3 +-
 net/tap-haiku.c   |3 +-
 net/tap-linux.c   |4 +-
 net/tap-solaris.c |3 +-
 net/tap.c |  158 +
 net/tap_int.h |3 +-
 qapi-schema.json  |5 +-
 9 files changed, 139 insertions(+), 44 deletions(-)

diff --git a/include/net/tap.h b/include/net/tap.h
index c3eb85a..a994f20 100644
--- a/include/net/tap.h
+++ b/include/net/tap.h
@@ -37,7 +37,6 @@ void tap_set_offload(NetClientState *nc, int csum, int tso4, 
int tso6, int ecn,
 void tap_set_vnet_hdr_len(NetClientState *nc, int len);
 int tap_enable(NetClientState *nc);
 int tap_disable(NetClientState *nc);
-int tap_get_ifname(NetClientState *nc, char *ifname);
 
 int tap_get_fd(NetClientState *nc);
 
diff --git a/net/tap-aix.c b/net/tap-aix.c
index e760e9a..804d164 100644
--- a/net/tap-aix.c
+++ b/net/tap-aix.c
@@ -25,7 +25,8 @@
 #include "tap_int.h"
 #include <stdio.h>
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     fprintf(stderr, "no tap on AIX\n");
     return -1;
diff --git a/net/tap-bsd.c b/net/tap-bsd.c
index 4f22109..bcdb268 100644
--- a/net/tap-bsd.c
+++ b/net/tap-bsd.c
@@ -33,7 +33,8 @@
 #include <net/if_tap.h>
 #endif
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     int fd;
 #ifdef TAPGIFNAME
diff --git a/net/tap-haiku.c b/net/tap-haiku.c
index b3b5fbb..e5ce436 100644
--- a/net/tap-haiku.c
+++ b/net/tap-haiku.c
@@ -25,7 +25,8 @@
 #include "tap_int.h"
 #include <stdio.h>
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     fprintf(stderr, "no tap on Haiku\n");
     return -1;
diff --git a/net/tap-linux.c b/net/tap-linux.c
index 3b21662..a953189 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -36,12 +36,12 @@
 
 #define PATH_NET_TUN "/dev/net/tun"
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     struct ifreq ifr;
     int fd, ret;
     int len = sizeof(struct virtio_net_hdr);
-    int mq_required = 0;
 
     TFR(fd = open(PATH_NET_TUN, O_RDWR));
     if (fd < 0) {
diff --git a/net/tap-solaris.c b/net/tap-solaris.c
index 214d95e..9c7278f 100644
--- a/net/tap-solaris.c
+++ b/net/tap-solaris.c
@@ -173,7 +173,8 @@ static int tap_alloc(char *dev, size_t dev_size)
     return tap_fd;
 }
 
-int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required)
+int tap_open(char *ifname, int ifname_size, int *vnet_hdr,
+             int vnet_hdr_required, int mq_required)
 {
     char  dev[10]="";
     int fd;
diff --git a/net/tap.c b/net/tap.c
index 8610ba2..1bf7609 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -558,17 +558,10 @@ int net_init_bridge(const NetClientOptions *opts, const char *name,
 
 static int net_tap_init(const NetdevTapOptions *tap, int *vnet_hdr,
                         const char *setup_script, char *ifname,
-                        size_t ifname_sz)
+                        size_t ifname_sz, int mq_required)
 {
     int fd, vnet_hdr_required;
 
-    if (tap->has_ifname) {
-        pstrcpy(ifname, ifname_sz, tap->ifname);
-    } else {
-        assert(ifname_sz > 0);
-        ifname[0] = '\0';
-    }
-
     if (tap->has_vnet_hdr) {
         *vnet_hdr = tap->vnet_hdr;
         vnet_hdr_required = *vnet_hdr;
@@ -577,7 +570,8 @@ static int net_tap_init(const NetdevTapOptions *tap, int *vnet_hdr,
 

[PATCH V4 16/22] vhost: multiqueue support

2013-01-30 Thread Jason Wang
This patch lets vhost support multiqueue. The idea is simple: launch multiple
vhost threads and let each thread process a subset of the device's virtqueues.
After this change, each emulated device can have multiple vhost threads as its
backend.

To do this, a virtqueue index is introduced to record the first virtqueue that
will be handled by this vhost_net device. Based on this and nvqs, vhost can
calculate the relative index it needs to set up the vhost_net device.
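
The index arithmetic is small enough to show on its own. VhostDev,
vhost_relative_index() and vhost_dev_for_queue() are hypothetical names for
this sketch, and the second helper additionally assumes that every vhost
device owns the same number of consecutive virtqueues.

```c
#include <assert.h>

/* Hypothetical, reduced view of one vhost device. */
typedef struct VhostDev {
    int vq_index;   /* first device-wide virtqueue owned by this device */
    int nvqs;       /* number of virtqueues it owns */
} VhostDev;

/* Translate a device-wide virtqueue index into this vhost device's index. */
int vhost_relative_index(const VhostDev *dev, int idx)
{
    assert(idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs);
    return idx - dev->vq_index;
}

/* Which vhost device serves virtqueue idx when each owns nvqs queues. */
int vhost_dev_for_queue(int idx, int nvqs)
{
    assert(idx >= 0 && nvqs > 0);
    return idx / nvqs;
}
```

With two virtqueues per vhost device, device-wide queue 3 belongs to the
second device and is its queue 1.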

Since we may have many vhost/net devices for a virtio-net device, the setting
of guest notifiers is moved out of the starting/stopping of a specific vhost
thread. vhost_net_{start|stop}() are renamed to vhost_net_{start|stop}_one(),
and new vhost_net_{start|stop}() are introduced to configure the guest
notifiers and start/stop all vhost/vhost_net devices.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/vhost.c  |   82 +++-
 hw/vhost.h  |2 +
 hw/vhost_net.c  |   86 +-
 hw/vhost_net.h  |4 +-
 hw/virtio-net.c |4 +-
 5 files changed, 120 insertions(+), 58 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index cee8aad..38257b9 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -619,14 +619,17 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
 {
     hwaddr s, l, a;
     int r;
+    int vhost_vq_index = idx - dev->vq_index;
     struct vhost_vring_file file = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = vhost_vq_index
     };
     struct VirtQueue *vvq = virtio_get_queue(vdev, idx);
 
+    assert(idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs);
+
     vq->num = state.num = virtio_queue_get_num(vdev, idx);
     r = ioctl(dev->control, VHOST_SET_VRING_NUM, &state);
     if (r) {
@@ -669,11 +672,12 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
         goto fail_alloc_ring;
     }
 
-    r = vhost_virtqueue_set_addr(dev, vq, idx, dev->log_enabled);
+    r = vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
     if (r < 0) {
         r = -errno;
         goto fail_alloc;
     }
+
     file.fd = event_notifier_get_fd(virtio_queue_get_host_notifier(vvq));
     r = ioctl(dev->control, VHOST_SET_VRING_KICK, &file);
     if (r) {
@@ -709,9 +713,10 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
                                  unsigned idx)
 {
     struct vhost_vring_state state = {
-        .index = idx,
+        .index = idx - dev->vq_index
     };
     int r;
+    assert(idx >= dev->vq_index && idx < dev->vq_index + dev->nvqs);
     r = ioctl(dev->control, VHOST_GET_VRING_BASE, &state);
     if (r < 0) {
         fprintf(stderr, "vhost VQ %d ring restore failed: %d\n", idx, r);
@@ -867,7 +872,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     }
 
     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, true);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->vq_index + i,
+                                             true);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier binding failed: %d\n", i, -r);
             goto fail_vq;
@@ -877,7 +884,9 @@ int vhost_dev_enable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     return 0;
 fail_vq:
     while (--i >= 0) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->vq_index + i,
+                                             false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup error: %d\n", i, -r);
             fflush(stderr);
@@ -898,7 +907,9 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
     int i, r;
 
     for (i = 0; i < hdev->nvqs; ++i) {
-        r = vdev->binding->set_host_notifier(vdev->binding_opaque, i, false);
+        r = vdev->binding->set_host_notifier(vdev->binding_opaque,
+                                             hdev->vq_index + i,
+                                             false);
         if (r < 0) {
             fprintf(stderr, "vhost VQ %d notifier cleanup failed: %d\n", i, -r);
             fflush(stderr);
@@ -912,8 +923,9 @@ void vhost_dev_disable_notifiers(struct vhost_dev *hdev, VirtIODevice *vdev)
  */
 bool vhost_virtqueue_pending(struct vhost_dev *hdev, int n)
 {
-    struct vhost_virtqueue *vq = hdev->vqs + n;
+    struct vhost_virtqueue *vq = hdev->vqs + n - hdev->vq_index;
     assert(hdev->started);
+    assert(n >= hdev->vq_index && n < hdev->vq_index + hdev->nvqs);
     return event_notifier_test_and_clear(&vq->masked_notifier);
 }
 
@@ -922,15 +934,16 @@ void vhost_virtqueue_mask(struct vhost_dev *hdev, VirtIODevice *vdev, 

[PATCH V4 17/22] virtio: introduce virtio_del_queue()

2013-01-30 Thread Jason Wang
Some devices (such as virtio-net) need the ability to destroy or re-order
virtqueues; this patch adds a helper to do this.

Signed-off-by: Jason Wang jasowang
---
 hw/virtio.c |9 +
 hw/virtio.h |2 ++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index ca170c3..d8c77b0 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -701,6 +701,15 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 return &vdev->vq[i];
 }
 
+void virtio_del_queue(VirtIODevice *vdev, int n)
+{
+if (n < 0 || n >= VIRTIO_PCI_QUEUE_MAX) {
+abort();
+}
+
+vdev->vq[n].vring.num = 0;
+}
+
 void virtio_irq(VirtQueue *vq)
 {
 trace_virtio_irq(vq);
diff --git a/hw/virtio.h b/hw/virtio.h
index 9cc7b85..d3da1d2 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -181,6 +181,8 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int queue_size,
 void (*handle_output)(VirtIODevice *,
   VirtQueue *));
 
+void virtio_del_queue(VirtIODevice *vdev, int n);
+
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
 unsigned int len);
 void virtqueue_flush(VirtQueue *vq, unsigned int count);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V4 18/22] virtio: add a queue_index to VirtQueue

2013-01-30 Thread Jason Wang
Add a queue_index to VirtQueue and a helper to fetch it; this can be used by
multiqueue-capable devices.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio.c |8 
 hw/virtio.h |1 +
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/hw/virtio.c b/hw/virtio.c
index d8c77b0..e259348 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -73,6 +73,8 @@ struct VirtQueue
 /* Notification enabled? */
 bool notification;
 
+uint16_t queue_index;
+
 int inuse;
 
 uint16_t vector;
@@ -931,6 +933,7 @@ void virtio_init(VirtIODevice *vdev, const char *name,
 for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
 vdev->vq[i].vector = VIRTIO_NO_VECTOR;
 vdev->vq[i].vdev = vdev;
+vdev->vq[i].queue_index = i;
 }
 
 vdev->name = name;
@@ -1018,6 +1021,11 @@ VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n)
 return vdev->vq + n;
 }
 
+uint16_t virtio_get_queue_index(VirtQueue *vq)
+{
+return vq->queue_index;
+}
+
 static void virtio_queue_guest_notifier_read(EventNotifier *n)
 {
 VirtQueue *vq = container_of(n, VirtQueue, guest_notifier);
diff --git a/hw/virtio.h b/hw/virtio.h
index d3da1d2..a29a54d 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -280,6 +280,7 @@ hwaddr virtio_queue_get_ring_size(VirtIODevice *vdev, int n);
 uint16_t virtio_queue_get_last_avail_idx(VirtIODevice *vdev, int n);
 void virtio_queue_set_last_avail_idx(VirtIODevice *vdev, int n, uint16_t idx);
 VirtQueue *virtio_get_queue(VirtIODevice *vdev, int n);
+uint16_t virtio_get_queue_index(VirtQueue *vq);
 int virtio_queue_get_id(VirtQueue *vq);
 EventNotifier *virtio_queue_get_guest_notifier(VirtQueue *vq);
 void virtio_queue_set_guest_notifier_fd_handler(VirtQueue *vq, bool assign,
-- 
1.7.1



[PATCH V4 19/22] virtio-net: separate virtqueue from VirtIONet

2013-01-30 Thread Jason Wang
To support multiqueue virtio-net, the first step is to separate the virtqueue
related fields from VirtIONet to a new structure VirtIONetQueue. The following
patches will add an array of VirtIONetQueue to VirtIONet based on this patch.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c |  195 ---
 1 files changed, 114 insertions(+), 81 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index d30cc31..b4d53b3 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -26,28 +26,33 @@
 #define MAC_TABLE_ENTRIES    64
 #define MAX_VLAN    (1 << 12)   /* Per 802.1Q definition */
 
+typedef struct VirtIONetQueue {
+VirtQueue *rx_vq;
+VirtQueue *tx_vq;
+QEMUTimer *tx_timer;
+QEMUBH *tx_bh;
+int tx_waiting;
+struct {
+VirtQueueElement elem;
+ssize_t len;
+} async_tx;
+struct VirtIONet *n;
+} VirtIONetQueue;
+
 typedef struct VirtIONet
 {
 VirtIODevice vdev;
 uint8_t mac[ETH_ALEN];
 uint16_t status;
-VirtQueue *rx_vq;
-VirtQueue *tx_vq;
+VirtIONetQueue vq;
 VirtQueue *ctrl_vq;
 NICState *nic;
-QEMUTimer *tx_timer;
-QEMUBH *tx_bh;
 uint32_t tx_timeout;
 int32_t tx_burst;
-int tx_waiting;
 uint32_t has_vnet_hdr;
 size_t host_hdr_len;
 size_t guest_hdr_len;
 uint8_t has_ufo;
-struct {
-VirtQueueElement elem;
-ssize_t len;
-} async_tx;
 int mergeable_rx_bufs;
 uint8_t promisc;
 uint8_t allmulti;
@@ -67,6 +72,12 @@ typedef struct VirtIONet
 DeviceState *qdev;
 } VirtIONet;
 
+static VirtIONetQueue *virtio_net_get_queue(NetClientState *nc)
+{
+VirtIONet *n = qemu_get_nic_opaque(nc);
+
+return &n->vq;
+}
 /* TODO
  * - we could suppress RX interrupt if we were so inclined.
  */
@@ -134,6 +145,8 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 error_report("unable to start vhost net: %d: "
  "falling back on userspace virtio", -r);
 n->vhost_started = 0;
+} else {
+n->vhost_started = 1;
 }
 } else {
 vhost_net_stop(&n->vdev, nc, 1);
@@ -144,25 +157,26 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
 {
 VirtIONet *n = to_virtio_net(vdev);
+VirtIONetQueue *q = &n->vq;
 
 virtio_net_vhost_status(n, status);
 
-if (!n->tx_waiting) {
+if (!q->tx_waiting) {
 return;
 }
 
 if (virtio_net_started(n, status) && !n->vhost_started) {
-if (n->tx_timer) {
-qemu_mod_timer(n->tx_timer,
+if (q->tx_timer) {
+qemu_mod_timer(q->tx_timer,
 qemu_get_clock_ns(vm_clock) + n->tx_timeout);
 } else {
-qemu_bh_schedule(n->tx_bh);
+qemu_bh_schedule(q->tx_bh);
 }
 } else {
-if (n->tx_timer) {
-qemu_del_timer(n->tx_timer);
+if (q->tx_timer) {
+qemu_del_timer(q->tx_timer);
 } else {
-qemu_bh_cancel(n->tx_bh);
+qemu_bh_cancel(q->tx_bh);
 }
 }
 }
@@ -474,35 +488,40 @@ static void virtio_net_handle_rx(VirtIODevice *vdev, VirtQueue *vq)
 static int virtio_net_can_receive(NetClientState *nc)
 {
 VirtIONet *n = qemu_get_nic_opaque(nc);
+VirtIONetQueue *q = virtio_net_get_queue(nc);
+
 if (!n->vdev.vm_running) {
 return 0;
 }
 
-if (!virtio_queue_ready(n->rx_vq) ||
-!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK))
+if (!virtio_queue_ready(q->rx_vq) ||
+!(n->vdev.status & VIRTIO_CONFIG_S_DRIVER_OK)) {
 return 0;
+}
 
 return 1;
 }
 
-static int virtio_net_has_buffers(VirtIONet *n, int bufsize)
+static int virtio_net_has_buffers(VirtIONetQueue *q, int bufsize)
 {
-if (virtio_queue_empty(n->rx_vq) ||
+VirtIONet *n = q->n;
+if (virtio_queue_empty(q->rx_vq) ||
 (n->mergeable_rx_bufs &&
- !virtqueue_avail_bytes(n->rx_vq, bufsize, 0))) {
-virtio_queue_set_notification(n->rx_vq, 1);
+ !virtqueue_avail_bytes(q->rx_vq, bufsize, 0))) {
+virtio_queue_set_notification(q->rx_vq, 1);
 
 /* To avoid a race condition where the guest has made some buffers
  * available after the above check but before notification was
  * enabled, check for available buffers again.
  */
-if (virtio_queue_empty(n->rx_vq) ||
+if (virtio_queue_empty(q->rx_vq) ||
 (n->mergeable_rx_bufs &&
- !virtqueue_avail_bytes(n->rx_vq, bufsize, 0)))
+ !virtqueue_avail_bytes(q->rx_vq, bufsize, 0))) {
 return 0;
+}
 }
 
-virtio_queue_set_notification(n->rx_vq, 0);
+virtio_queue_set_notification(q->rx_vq, 0);
 return 1;
 }
 
@@ -605,6 +624,7 @@ static int receive_filter(VirtIONet *n, const uint8_t *buf, 
int size)
 static ssize_t virtio_net_receive(NetClientState *nc, const 

[PATCH V4 20/22] virtio-net: multiqueue support

2013-01-30 Thread Jason Wang
This patch implements both userspace and vhost support for multiple queue
virtio-net (VIRTIO_NET_F_MQ). This is done by introducing an array of
VirtIONetQueue to VirtIONet.
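
As a rough illustration of how a virtqueue index relates to an entry in that array: `vq2q()` below is the same mapping the patch introduces, while `vq_is_tx()` is a hypothetical helper that assumes each pair is laid out as receive queue first, transmit queue second (even index rx, odd index tx), with the control queue not covered here.

```c
#include <assert.h>
#include <stdbool.h>

/* Same mapping as vq2q() in the patch: virtqueue index -> queue pair index. */
static int vq2q(int queue_index)
{
    return queue_index / 2;
}

/* Assumption for illustration: within a pair, the even virtqueue is rx
 * and the odd virtqueue is tx. */
static bool vq_is_tx(int queue_index)
{
    return queue_index % 2 == 1;
}
```

Under that assumption, virtqueues 4 and 5 would be the rx and tx queues of pair 2.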

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c |  303 +++
 hw/virtio-net.h |   28 +-
 2 files changed, 264 insertions(+), 67 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index b4d53b3..0e4063f 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -44,7 +44,7 @@ typedef struct VirtIONet
 VirtIODevice vdev;
 uint8_t mac[ETH_ALEN];
 uint16_t status;
-VirtIONetQueue vq;
+VirtIONetQueue vqs[MAX_QUEUE_NUM];
 VirtQueue *ctrl_vq;
 NICState *nic;
 uint32_t tx_timeout;
@@ -70,14 +70,23 @@ typedef struct VirtIONet
 } mac_table;
 uint32_t *vlans;
 DeviceState *qdev;
+int multiqueue;
+uint16_t max_queues;
+uint16_t curr_queues;
 } VirtIONet;
 
-static VirtIONetQueue *virtio_net_get_queue(NetClientState *nc)
+static VirtIONetQueue *virtio_net_get_subqueue(NetClientState *nc)
 {
 VirtIONet *n = qemu_get_nic_opaque(nc);
 
-return &n->vq;
+return &n->vqs[nc->queue_index];
 }
+
+static int vq2q(int queue_index)
+{
+return queue_index / 2;
+}
+
 /* TODO
  * - we could suppress RX interrupt if we were so inclined.
  */
@@ -93,6 +102,7 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 struct virtio_net_config netcfg;
 
 stw_p(&netcfg.status, n->status);
+stw_p(&netcfg.max_virtqueue_pairs, n->max_queues);
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
 memcpy(config, &netcfg, sizeof(netcfg));
 }
@@ -119,6 +129,7 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
 static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 {
 NetClientState *nc = qemu_get_queue(n->nic);
+int queues = n->multiqueue ? n->max_queues : 1;
 
 if (!nc->peer) {
 return;
@@ -130,6 +141,7 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 if (!tap_get_vhost_net(nc->peer)) {
 return;
 }
+
 if (!!n->vhost_started == virtio_net_started(n, status) &&
   !nc->peer->link_down) {
 return;
@@ -140,16 +152,14 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 return;
 }
 n->vhost_started = 1;
-r = vhost_net_start(&n->vdev, nc, 1);
+r = vhost_net_start(&n->vdev, n->nic->ncs, queues);
 if (r < 0) {
 error_report("unable to start vhost net: %d: "
  "falling back on userspace virtio", -r);
 n->vhost_started = 0;
-} else {
-n->vhost_started = 1;
 }
 } else {
-vhost_net_stop(&n->vdev, nc, 1);
+vhost_net_stop(&n->vdev, n->nic->ncs, queues);
 n->vhost_started = 0;
 }
 }
@@ -157,26 +167,38 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
 static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
 {
 VirtIONet *n = to_virtio_net(vdev);
-VirtIONetQueue *q = &n->vq;
+VirtIONetQueue *q;
+int i;
+uint8_t queue_status;
 
 virtio_net_vhost_status(n, status);
 
-if (!q->tx_waiting) {
-return;
-}
+for (i = 0; i < n->max_queues; i++) {
+q = &n->vqs[i];
 
-if (virtio_net_started(n, status) && !n->vhost_started) {
-if (q->tx_timer) {
-qemu_mod_timer(q->tx_timer,
-   qemu_get_clock_ns(vm_clock) + n->tx_timeout);
+if ((!n->multiqueue && i != 0) || i >= n->curr_queues) {
+queue_status = 0;
 } else {
-qemu_bh_schedule(q->tx_bh);
+queue_status = status;
 }
-} else {
-if (q->tx_timer) {
-qemu_del_timer(q->tx_timer);
+
+if (!q->tx_waiting) {
+continue;
+}
+
+if (virtio_net_started(n, queue_status) && !n->vhost_started) {
+if (q->tx_timer) {
+qemu_mod_timer(q->tx_timer,
+   qemu_get_clock_ns(vm_clock) + n->tx_timeout);
+} else {
+qemu_bh_schedule(q->tx_bh);
+}
 } else {
-qemu_bh_cancel(q->tx_bh);
+if (q->tx_timer) {
+qemu_del_timer(q->tx_timer);
+} else {
+qemu_bh_cancel(q->tx_bh);
+}
 }
 }
 }
@@ -208,6 +230,8 @@ static void virtio_net_reset(VirtIODevice *vdev)
 n->nomulti = 0;
 n->nouni = 0;
 n->nobcast = 0;
+/* multiqueue is disabled by default */
+n->curr_queues = 1;
 
 /* Flush any MAC and VLAN filter table state */
 n->mac_table.in_use = 0;
@@ -249,18 +273,70 @@ static int peer_has_ufo(VirtIONet *n)
 
 static void virtio_net_set_mrg_rx_bufs(VirtIONet *n, int mergeable_rx_bufs)
 {
+int i;
+NetClientState *nc;
+
 n->mergeable_rx_bufs = mergeable_rx_bufs;
 
 n->guest_hdr_len = n->mergeable_rx_bufs ?
 

[PATCH V4 21/22] virtio-net: migration support for multiqueue

2013-01-30 Thread Jason Wang
This patch adds migration support for multiqueue virtio-net. Instead of bumping
the version, we send the multiqueue info only when the device supports more
than one queue, which maintains backward compatibility.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/virtio-net.c |   35 +--
 1 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 0e4063f..d57b255 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -1062,8 +1062,8 @@ static void virtio_net_set_multiqueue(VirtIONet *n, int multiqueue, int ctrl)
 
 static void virtio_net_save(QEMUFile *f, void *opaque)
 {
+int i;
 VirtIONet *n = opaque;
-VirtIONetQueue *q = &n->vqs[0];
 
 /* At this point, backend must be stopped, otherwise
  * it might keep writing to memory. */
@@ -1071,7 +1071,7 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
 virtio_save(&n->vdev, f);
 
 qemu_put_buffer(f, n->mac, ETH_ALEN);
-qemu_put_be32(f, q->tx_waiting);
+qemu_put_be32(f, n->vqs[0].tx_waiting);
 qemu_put_be32(f, n->mergeable_rx_bufs);
 qemu_put_be16(f, n->status);
 qemu_put_byte(f, n->promisc);
@@ -1087,13 +1087,19 @@ static void virtio_net_save(QEMUFile *f, void *opaque)
 qemu_put_byte(f, n->nouni);
 qemu_put_byte(f, n->nobcast);
 qemu_put_byte(f, n->has_ufo);
+if (n->max_queues > 1) {
+qemu_put_be16(f, n->max_queues);
+qemu_put_be16(f, n->curr_queues);
+for (i = 1; i < n->curr_queues; i++) {
+qemu_put_be32(f, n->vqs[i].tx_waiting);
+}
+}
 }
 
 static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 {
 VirtIONet *n = opaque;
-VirtIONetQueue *q = &n->vqs[0];
-int ret, i;
+int ret, i, link_down;
 
 if (version_id < 2 || version_id > VIRTIO_NET_VM_VERSION)
 return -EINVAL;
@@ -1104,7 +1110,7 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 }
 
 qemu_get_buffer(f, n->mac, ETH_ALEN);
-q->tx_waiting = qemu_get_be32(f);
+n->vqs[0].tx_waiting = qemu_get_be32(f);
 
 virtio_net_set_mrg_rx_bufs(n, qemu_get_be32(f));
 
@@ -1174,6 +1180,20 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 }
 }
 
+if (n->max_queues > 1) {
+if (n->max_queues != qemu_get_be16(f)) {
+error_report("virtio-net: different max_queues ");
+return -1;
+}
+
+n->curr_queues = qemu_get_be16(f);
+for (i = 1; i < n->curr_queues; i++) {
+n->vqs[i].tx_waiting = qemu_get_be32(f);
+}
+}
+
+virtio_net_set_queues(n);
+
 /* Find the first multicast entry in the saved MAC filter */
 for (i = 0; i < n->mac_table.in_use; i++) {
 if (n->mac_table.macs[i * ETH_ALEN] & 1) {
@@ -1184,7 +1204,10 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
 
 /* nc.link_down can't be migrated, so infer link_down according
  * to link status bit in n->status */
-qemu_get_queue(n->nic)->link_down = (n->status & VIRTIO_NET_S_LINK_UP) == 0;
+link_down = (n->status & VIRTIO_NET_S_LINK_UP) == 0;
+for (i = 0; i < n->max_queues; i++) {
+qemu_get_subqueue(n->nic, i)->link_down = link_down;
+}
 
 return 0;
 }
-- 
1.7.1
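
The compatibility trick in the patch above (append the multiqueue fields only when max_queues > 1, so a single-queue stream keeps its old layout) can be modeled outside QEMU. A minimal sketch, assuming a hypothetical in-memory `struct stream` in place of QEMUFile and a cut-down `struct net_state` in place of VirtIONet; none of these names exist in QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte sink standing in for QEMUFile. */
struct stream {
    uint8_t buf[64];
    size_t len;
};

static void put_be16(struct stream *s, uint16_t v)
{
    s->buf[s->len++] = (uint8_t)(v >> 8);
    s->buf[s->len++] = (uint8_t)(v & 0xff);
}

static void put_be32(struct stream *s, uint32_t v)
{
    put_be16(s, (uint16_t)(v >> 16));
    put_be16(s, (uint16_t)(v & 0xffff));
}

/* Hypothetical cut-down device state. */
struct net_state {
    uint16_t max_queues;
    uint16_t curr_queues;
    uint32_t tx_waiting[8];
};

static void save_queues(struct stream *s, const struct net_state *n)
{
    /* Queue 0 is always written, exactly as in the single-queue format. */
    put_be32(s, n->tx_waiting[0]);

    /* Extra fields appear only for multiqueue devices, so streams from
     * single-queue devices are byte-for-byte unchanged. */
    if (n->max_queues > 1) {
        put_be16(s, n->max_queues);
        put_be16(s, n->curr_queues);
        for (uint16_t i = 1; i < n->curr_queues; i++) {
            put_be32(s, n->tx_waiting[i]);
        }
    }
}
```

The loader has to make the same max_queues > 1 decision from its own configuration, which is why the patch rejects a stream whose max_queues differs from the destination's.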



[PATCH V4 22/22] virtio-net: compat multiqueue support

2013-01-30 Thread Jason Wang
Disable multiqueue support for machine types older than 1.4.

Signed-off-by: Jason Wang jasow...@redhat.com
---
 hw/pc_piix.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/hw/pc_piix.c b/hw/pc_piix.c
index b9a9b2e..84069b1 100644
--- a/hw/pc_piix.c
+++ b/hw/pc_piix.c
@@ -309,6 +309,10 @@ static QEMUMachine pc_i440fx_machine_v1_4 = {
 .driver   = "usb-tablet",\
 .property = "usb_version",\
 .value= stringify(1),\
+},{ \
+.driver   = "virtio-net-pci", \
+.property = "mq", \
+.value= "off", \
 }
 
 static QEMUMachine pc_machine_v1_3 = {
-- 
1.7.1



RE: [PATCH 3/8] KVM: PPC: booke: Added debug handler

2013-01-30 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Friday, January 25, 2013 5:13 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Bhushan Bharat-R65777
 Subject: Re: [PATCH 3/8] KVM: PPC: booke: Added debug handler
 
 
 On 16.01.2013, at 09:24, Bharat Bhushan wrote:
 
  From: Bharat Bhushan bharat.bhus...@freescale.com
 
  Installed debug handler will be used for guest debug support and debug
  facility emulation features (patches for these features will follow
  this patch).
 
  Signed-off-by: Liu Yu yu@freescale.com
  [bharat.bhus...@freescale.com: Substantial changes]
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
  ---
  arch/powerpc/include/asm/kvm_host.h |1 +
  arch/powerpc/kernel/asm-offsets.c   |1 +
  arch/powerpc/kvm/booke_interrupts.S |   49 
  ++-
  3 files changed, 44 insertions(+), 7 deletions(-)
 
  diff --git a/arch/powerpc/include/asm/kvm_host.h
  b/arch/powerpc/include/asm/kvm_host.h
  index 8a72d59..f4ba881 100644
  --- a/arch/powerpc/include/asm/kvm_host.h
  +++ b/arch/powerpc/include/asm/kvm_host.h
  @@ -503,6 +503,7 @@ struct kvm_vcpu_arch {
  u32 tlbcfg[4];
  u32 mmucfg;
  u32 epr;
  +   u32 crit_save;
  struct kvmppc_booke_debug_reg dbg_reg; #endif
  gpa_t paddr_accessed;
  diff --git a/arch/powerpc/kernel/asm-offsets.c
  b/arch/powerpc/kernel/asm-offsets.c
  index 46f6afd..02048f3 100644
  --- a/arch/powerpc/kernel/asm-offsets.c
  +++ b/arch/powerpc/kernel/asm-offsets.c
  @@ -562,6 +562,7 @@ int main(void)
  DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
  DEFINE(VCPU_FAULT_DEAR, offsetof(struct kvm_vcpu, arch.fault_dear));
  DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
  +   DEFINE(VCPU_CRIT_SAVE, offsetof(struct kvm_vcpu, arch.crit_save));
  #endif /* CONFIG_PPC_BOOK3S */
  #endif /* CONFIG_KVM */
 
  diff --git a/arch/powerpc/kvm/booke_interrupts.S
  b/arch/powerpc/kvm/booke_interrupts.S
  index eae8483..dd9c5d4 100644
  --- a/arch/powerpc/kvm/booke_interrupts.S
  +++ b/arch/powerpc/kvm/booke_interrupts.S
  @@ -52,12 +52,7 @@
  (1<<BOOKE_INTERRUPT_PROGRAM) | \
  (1<<BOOKE_INTERRUPT_DTLB_MISS))
 
  -.macro KVM_HANDLER ivor_nr scratch srr0
  -_GLOBAL(kvmppc_handler_\ivor_nr)
  -   /* Get pointer to vcpu and record exit number. */
  -   mtspr   \scratch , r4
  -   mfspr   r4, SPRN_SPRG_THREAD
  -   lwz r4, THREAD_KVM_VCPU(r4)
  +.macro __KVM_HANDLER ivor_nr scratch srr0
  stw r3, VCPU_GPR(R3)(r4)
  stw r5, VCPU_GPR(R5)(r4)
  stw r6, VCPU_GPR(R6)(r4)
  @@ -74,6 +69,46 @@ _GLOBAL(kvmppc_handler_\ivor_nr)
  bctr
  .endm
 
  +.macro KVM_HANDLER ivor_nr scratch srr0
  +_GLOBAL(kvmppc_handler_\ivor_nr)
  +   /* Get pointer to vcpu and record exit number. */
  +   mtspr   \scratch , r4
  +   mfspr   r4, SPRN_SPRG_THREAD
  +   lwz r4, THREAD_KVM_VCPU(r4)
  +   __KVM_HANDLER \ivor_nr \scratch \srr0 .endm
  +
  +.macro KVM_DBG_HANDLER ivor_nr scratch srr0
  +_GLOBAL(kvmppc_handler_\ivor_nr)
  +   mtspr   \scratch, r4
  +   mfspr   r4, SPRN_SPRG_THREAD
  +   lwz r4, THREAD_KVM_VCPU(r4)
  +   stw r3, VCPU_CRIT_SAVE(r4)
  +   mfcrr3
  +   mfspr   r4, SPRN_CSRR1
  +   andi.   r4, r4, MSR_PR
  +   bne 1f
 
 
  +   /* debug interrupt happened in enter/exit path */
  +   mfspr   r4, SPRN_CSRR1
  +   rlwinm  r4, r4, 0, ~MSR_DE
  +   mtspr   SPRN_CSRR1, r4
  +   lis r4, 0x
  +   ori r4, r4, 0x
  +   mtspr   SPRN_DBSR, r4
  +   mfspr   r4, SPRN_SPRG_THREAD
  +   lwz r4, THREAD_KVM_VCPU(r4)
  +   mtcrr3
  +   lwz r3, VCPU_CRIT_SAVE(r4)
  +   mfspr   r4, \scratch
  +   rfci
 
 What is this part doing? Try to ignore the debug exit?

BookE has no hardware support for virtualization, so the hardware never knows
whether the current PC is in the guest or in the host.
So when hardware single-stepping is enabled for the guest, it cannot be
disabled at the moment the guest exits. As a result, a single-step interrupt
fires at the beginning of the guest exit path.

The code above recognizes this kind of single-step interrupt, disables
single-stepping, and returns with rfci.
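
In C terms, the decision that prolog makes looks roughly like the sketch below. It is only a model of the assembly, not kernel code: the function name and the `_SKETCH` constants are made up for illustration, and the bit values are assumed to match the BookE MSR layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants; assumed to match MSR[PR] and MSR[DE] on BookE. */
#define MSR_PR_SKETCH 0x4000u
#define MSR_DE_SKETCH 0x0200u

/* Returns true when the debug interrupt was taken in the guest and must be
 * forwarded to the normal KVM exit handler. Returns false when it hit the
 * host enter/exit path; in that case MSR_DE is cleared in the saved CSRR1 so
 * that rfci resumes without single-stepping. (The real code also clears the
 * pending DBSR events before returning.) */
static bool debug_intr_from_guest(uint32_t *csrr1)
{
    if (*csrr1 & MSR_PR_SKETCH) {
        return true;            /* guest runs in problem state */
    }
    *csrr1 &= ~MSR_DE_SKETCH;   /* stop stepping through host code */
    return false;
}
```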

 Why would we have MSR_DE
 enabled in the first place when we can't handle it?

When QEMU is using the hardware debug resources, we always set MSR_DE while
the guest is running.

 
  +1: /* debug interrupt happened in guest */
  +   mtcrr3
  +   mfspr   r4, SPRN_SPRG_THREAD
  +   lwz r4, THREAD_KVM_VCPU(r4)
  +   lwz r3, VCPU_CRIT_SAVE(r4)
  +   __KVM_HANDLER \ivor_nr \scratch \srr0
 
 I don't think you need the __KVM_HANDLER split. This should be quite easily
 refactorable into a simple DBG prolog.

Can you please elaborate how you are envisioning this?

Thanks
-Bharat

 
 
 Alex
 
  +.endm
  +
  .macro KVM_HANDLER_ADDR ivor_nr
  .long   kvmppc_handler_\ivor_nr
  .endm
  @@ -98,7 +133,7 @@ KVM_HANDLER BOOKE_INTERRUPT_FIT SPRN_SPRG_RSCRATCH0
  SPRN_SRR0 

[Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Andreas Färber
On 29.01.2013 16:41, Juan Quintela wrote:
 * Portio port to new memory regions?
   Andreas, could you fill?

MemoryRegion's .old_portio mechanism requires workarounds for VGA on
ppc, affecting among others the sPAPR PCI host bridge:
http://git.qemu.org/?p=qemu.git;a=commit;h=a3cfa18eb075c7ef78358ca1956fe7b01caa1724

Patches were posted and merged removing all .old_portio users but one:
hw/ioport.c:portio_list_add_1(), used by portio_list_add()

hw/isa-bus.c:portio_list_add(piolist, isabus-address_space_io, start);
hw/qxl.c:portio_list_add(qxl_vga_port_list,
pci_address_space_io(dev), 0x3b0);
hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);
hw/vga.c:portio_list_add(vbe_port_list, address_space_io, 0x1ce);

Proposal by hpoussin was to move _list_add() code to ISADevice:
http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Concerns:
* PCI devices (VGA, QXL) register I/O ports as well
  = above patches add dependency on ISABus to machines
 - benh no mac ever had one
  = PCIDevice shouldn't use ISA API with NULL ISADevice
* Lack of avi: Who decides about memory API these days?

armbru and agraf concluded that moving this into ISA is wrong.

= I will drop the remaining ioport patches from above series.

Suggestions on how to proceed with tackling the issue are welcome.

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Peter Maydell
On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
 Proposal by hpoussin was to move _list_add() code to ISADevice:
 http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

 Concerns:
 * PCI devices (VGA, QXL) register I/O ports as well
   = above patches add dependency on ISABus to machines
  - benh no mac ever had one
   = PCIDevice shouldn't use ISA API with NULL ISADevice
 * Lack of avi: Who decides about memory API these days?

 armbru and agraf concluded that moving this into ISA is wrong.

 = I will drop the remaining ioport patches from above series.

 Suggestions on how to proceed with tackling the issue are welcome.

How does this stuff work on real hardware? I would have
expected that a PCI device registering the fact it has
IO ports would have to do so via the PCI controller it
is plugged into...

My naive don't-know-much-about-portio suggestion is that this
should work the same way as memory regions: each device
provides portio regions, and the controller for the bus
(ISA or PCI) exposes those to the next layer up, and
something at board level maps it all into the right places.

-- PMM


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Michael S. Tsirkin
On Wed, Jan 30, 2013 at 11:48:14AM +, Peter Maydell wrote:
 On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
  Proposal by hpoussin was to move _list_add() code to ISADevice:
  http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
 
  Concerns:
  * PCI devices (VGA, QXL) register I/O ports as well
= above patches add dependency on ISABus to machines
   - benh no mac ever had one
= PCIDevice shouldn't use ISA API with NULL ISADevice
  * Lack of avi: Who decides about memory API these days?
 
  armbru and agraf concluded that moving this into ISA is wrong.
 
  = I will drop the remaining ioport patches from above series.
 
  Suggestions on how to proceed with tackling the issue are welcome.
 
 How does this stuff work on real hardware? I would have
 expected that a PCI device registering the fact it has
 IO ports would have to do so via the PCI controller it
 is plugged into...

All programming is done by the OS, devices do not register
with controller.

Each bridge has two ways to claim an IO transaction:
- transaction is within the window programmed in the bridge
- subtractive decoding enabled and no one else claims the transaction

At the bus level, transaction happens on a bus and an appropriate device
will claim it.
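
A toy model of those two claim rules, purely for illustration; the struct, enum and function names are invented here and do not correspond to any QEMU or kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a bridge's I/O decoding state. */
struct bridge_sketch {
    uint32_t io_base;   /* start of the I/O window programmed by the OS */
    uint32_t io_limit;  /* inclusive end of that window */
    bool subtractive;   /* subtractive decoding enabled */
};

enum claim { CLAIM_NONE, CLAIM_WINDOW, CLAIM_SUBTRACTIVE };

/* Decide whether the bridge claims an I/O transaction at addr.
 * claimed_by_other says whether another agent on the bus already took it. */
static enum claim bridge_claims_io(const struct bridge_sketch *b,
                                   uint32_t addr, bool claimed_by_other)
{
    if (addr >= b->io_base && addr <= b->io_limit) {
        return CLAIM_WINDOW;        /* positive decode: inside the window */
    }
    if (b->subtractive && !claimed_by_other) {
        return CLAIM_SUBTRACTIVE;   /* nobody else wanted it */
    }
    return CLAIM_NONE;
}
```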

 My naive don't-know-much-about-portio suggestion is that this
 should work the same way as memory regions: each device
 provides portio regions, and the controller for the bus
 (ISA or PCI) exposes those to the next layer up, and
 something at board level maps it all into the right places.
 
 -- PMM


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Alexander Graf

On 30.01.2013, at 12:48, Peter Maydell wrote:

 On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
 Proposal by hpoussin was to move _list_add() code to ISADevice:
 http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
 
 Concerns:
 * PCI devices (VGA, QXL) register I/O ports as well
  = above patches add dependency on ISABus to machines
 - benh no mac ever had one
  = PCIDevice shouldn't use ISA API with NULL ISADevice
 * Lack of avi: Who decides about memory API these days?
 
 armbru and agraf concluded that moving this into ISA is wrong.
 
 = I will drop the remaining ioport patches from above series.
 
 Suggestions on how to proceed with tackling the issue are welcome.
 
 How does this stuff work on real hardware? I would have
 expected that a PCI device registering the fact it has
 IO ports would have to do so via the PCI controller it
 is plugged into...

That's pretty much how it works for PCI hardware, yes.

For ISA like hardware, I asked Ben last night:

29-01-2013 23:41:10  agraf: benh: hey ben :)
29-01-2013 23:41:50  agraf: benh: do you remember if g3 beige (grackle) and/or 
U2 based macs had an actual ISA bus exposed through MMIO or whether it was PCI 
only with a PIO compat region mapped by the PCI controller?
29-01-2013 23:59:28  benh!~benh@180.200.150.145: agraf: no ISA
29-01-2013 23:59:48  benh!~benh@180.200.150.145: agraf: no mac ever had one
29-01-2013 23:59:57  agraf: benh: well, MCP750 has one
30-01-2013 00:00:06  agraf: benh: that's why I'm asking :)
30-01-2013 00:00:17  benh!~benh@180.200.150.145: mcp750 ? what is this ?
30-01-2013 00:00:28  agraf: benh: some motorola soc
30-01-2013 00:00:39  benh!~benh@180.200.150.145: ah ok
30-01-2013 00:00:50  benh!~benh@180.200.150.145: mostly ISA is just hooked 
onto PCI anyway
30-01-2013 00:00:59  benh!~benh@180.200.150.145: ie, PCI cycles with low 
addresses land on ISA
30-01-2013 00:01:59  agraf: benh: sounds tricky to model :)
30-01-2013 00:02:44  benh!~benh@180.200.150.145: that's also how it works on 
x86
30-01-2013 00:03:05  benh!~benh@180.200.150.145: dunno how it works on that 
specific SoC tho but that's how it's usually done
30-01-2013 00:04:36  agraf: interesting - didn't know that :)
30-01-2013 00:04:51  agraf: on x86 it's hard to see from a software pov, 
because everything's linear ;)
30-01-2013 00:26:27  benh!~benh@180.200.150.145: yeah, that's why x86 has a 
memory hole to make room for ISA
30-01-2013 00:26:40  benh!~benh@180.200.150.145: while usually on ppc we remap 
things with an offset so we don't have to punch a hole in ram

 My naive don't-know-much-about-portio suggestion is that this
 should work the same way as memory regions: each device
 provides portio regions, and the controller for the bus
 (ISA or PCI) exposes those to the next layer up, and
 something at board level maps it all into the right places.

Right. With the addition that on some boards, the PCI host controller which 
provides a portio map would also expose an ISABus for devices to plug in. At 
least if I understand Ben correctly.


Alex



[PATCH v4] KVM: VMX: enable acknowledge interrupt on vmexit

2013-01-30 Thread Yang Zhang
From: Yang Zhang yang.z.zh...@intel.com

The acknowledge interrupt on exit feature controls processor behavior
for external interrupt acknowledgement. When this control is set, the
processor acknowledges the interrupt controller to acquire the
interrupt vector on VM exit.

With this feature enabled, an interrupt that arrives while the target CPU is
running in VMX non-root mode is handled by the VMX handler instead of the
handler in the IDT. Currently, the VMX handler only fakes an interrupt stack
frame and jumps to the IDT entry to let the real handler process it. In the
future, we will recognize the interrupt and deliver through the IDT only those
interrupts that do not belong to the current vcpu; interrupts that belong to
the current vcpu will be handled inside the VMX handler. This will reduce KVM's
interrupt handling cost.

The interrupt enable logic also changes when this feature is turned on:
before this patch, the hypervisor called local_irq_enable() to enable
interrupts directly. Now the IF bit is set in the interrupt stack frame, so
interrupts are re-enabled on return from the interrupt handler if an external
interrupt exists. If there is no external interrupt, local_irq_enable() is
still called.

Refer to the Intel SDM, volume 3, chapter 33.2.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/svm.c  |6 +++
 arch/x86/kvm/vmx.c  |   70 --
 arch/x86/kvm/x86.c  |4 ++-
 4 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 77d56a4..1f1b2f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -725,6 +725,7 @@ struct kvm_x86_ops {
int (*check_intercept)(struct kvm_vcpu *vcpu,
   struct x86_instruction_info *info,
   enum x86_intercept_stage stage);
+   void (*handle_external_intr)(struct kvm_vcpu *vcpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d29d3cd..c283185 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4227,6 +4227,11 @@ out:
return ret;
 }
 
+static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
+{
+   local_irq_enable();
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
.cpu_has_kvm_support = has_svm,
.disabled_by_bios = is_disabled,
@@ -4318,6 +4323,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.set_tdp_cr3 = set_tdp_cr3,
 
.check_intercept = svm_check_intercept,
+   .handle_external_intr = svm_handle_external_intr,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 02eeba8..eaef185 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -379,6 +379,7 @@ struct vcpu_vmx {
struct shared_msr_entry *guest_msrs;
int   nmsrs;
int   save_nmsrs;
+   unsigned long host_idt_base;
 #ifdef CONFIG_X86_64
u64   msr_host_kernel_gs_base;
u64   msr_guest_kernel_gs_base;
@@ -2565,7 +2566,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
 #ifdef CONFIG_X86_64
min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
-   opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
+   opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
+   VM_EXIT_ACK_INTR_ON_EXIT;
if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
&_vmexit_control) < 0)
return -EIO;
@@ -3742,11 +3744,12 @@ static void vmx_disable_intercept_for_msr(u32 msr, bool 
longmode_only)
  * Note that host-state that does change is set elsewhere. E.g., host-state
  * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
  */
-static void vmx_set_constant_host_state(void)
+static void vmx_set_constant_host_state(struct kvm_vcpu *vcpu)
 {
u32 low32, high32;
unsigned long tmpl;
struct desc_ptr dt;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
 
vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS);  /* 22.2.3 */
vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
@@ -3770,6 +3773,7 @@ static void vmx_set_constant_host_state(void)
 
native_store_idt(&dt);
vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
+   vmx->host_idt_base = dt.address;
 
vmcs_writel(HOST_RIP, vmx_return); /* 22.2.5 */
 
@@ -3884,7 +3888,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
vmcs_write16(HOST_FS_SELECTOR, 0);/* 22.2.4 */
vmcs_write16(HOST_GS_SELECTOR, 0);/* 22.2.4 */
-   vmx_set_constant_host_state();
+   vmx_set_constant_host_state(&vmx->vcpu);
 #ifdef CONFIG_X86_64
rdmsrl(MSR_FS_BASE, a);
vmcs_writel(HOST_FS_BASE, a); /* 22.2.4 */
@@ -6094,6 +6098,63 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx 

Re: [Qemu-devel] What to do about non-qdevified devices?

2013-01-30 Thread Markus Armbruster
Peter Maydell peter.mayd...@linaro.org writes:

 On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
 Anthony Liguori aligu...@us.ibm.com writes:

 [...]
 The problems I ran into were (1) this is a lot of work (2) it basically
 requires that all bus children have been qdev/QOM-ified.  Even with
 something like the ISA bus which is where I started, quite a few devices
 were not qdevified still.

 So what's the plan to complete the qdevification job?  Lay really low
 and quietly hope the problem goes away?  We've tried that for about
 three years, doesn't seem to work.

 Do we have a list of not-yet-qdevified devices? Maybe we need to
 start saying "fix X, Y and Z or platform P is dropped from the next
 release". (This would of course be easier if we had a way to let users
 know that platform P was in danger...)

I think that's a good idea.  The only problem is that identifying pre-qdev
devices requires code inspection (grep won't do, I'm afraid).

If we agree on a "qdevify or else" plan, I'd be prepared to help with
digging up the devices.


Re: [PATCH v4] KVM: VMX: enable acknowledge interupt on vmexit

2013-01-30 Thread Gleb Natapov
On Wed, Jan 30, 2013 at 08:36:12PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 The acknowledge interrupt on exit feature controls processor behavior
 for external interrupt acknowledgement. When this control is set, the
 processor acknowledges the interrupt controller to acquire the
 interrupt vector on VM exit.
 
 After enabling this feature, an interrupt which arrives while the target CPU is
 running in VMX non-root mode is handled by the VMX handler instead of the handler
 in the IDT. Currently, the VMX handler only fakes an interrupt stack frame and
 jumps through the IDT to let the real handler process it. In the future, we will
 recognize the interrupt and deliver through the IDT only those interrupts which
 do not belong to the current vcpu. Interrupts which belong to the current vcpu
 will be handled inside the VMX handler. This will reduce KVM's interrupt
 handling cost.
 
 The interrupt enable logic also changes when this feature is turned on:
 Before this patch, the hypervisor called local_irq_enable() to enable interrupts
 directly. Now the IF bit is set in the interrupt stack frame, so interrupts are
 enabled on return from the interrupt handler if an external interrupt exists.
 If there is no external interrupt, local_irq_enable() is still called.
 
 Refer to Intel SDM volume 3, chapter 33.2.
 
Looks good to me except for one comment below. Send this patch as part of
the posted interrupt series; there is no point in applying it separately.

 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/svm.c  |6 +++
  arch/x86/kvm/vmx.c  |   70 --
  arch/x86/kvm/x86.c  |4 ++-
  4 files changed, 76 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 77d56a4..1f1b2f8 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -725,6 +725,7 @@ struct kvm_x86_ops {
   int (*check_intercept)(struct kvm_vcpu *vcpu,
  struct x86_instruction_info *info,
  enum x86_intercept_stage stage);
 + void (*handle_external_intr)(struct kvm_vcpu *vcpu);
  };
  
  struct kvm_arch_async_pf {
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index d29d3cd..c283185 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -4227,6 +4227,11 @@ out:
   return ret;
  }
  
 +static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
 +{
 + local_irq_enable();
 +}
 +
  static struct kvm_x86_ops svm_x86_ops = {
   .cpu_has_kvm_support = has_svm,
   .disabled_by_bios = is_disabled,
 @@ -4318,6 +4323,7 @@ static struct kvm_x86_ops svm_x86_ops = {
   .set_tdp_cr3 = set_tdp_cr3,
  
   .check_intercept = svm_check_intercept,
 + .handle_external_intr = svm_handle_external_intr,
  };
  
  static int __init svm_init(void)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 02eeba8..eaef185 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -379,6 +379,7 @@ struct vcpu_vmx {
   struct shared_msr_entry *guest_msrs;
   int   nmsrs;
   int   save_nmsrs;
 + unsigned long host_idt_base;
  #ifdef CONFIG_X86_64
   u64   msr_host_kernel_gs_base;
   u64   msr_guest_kernel_gs_base;
 @@ -2565,7 +2566,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
 *vmcs_conf)
  #ifdef CONFIG_X86_64
   min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
  #endif
 - opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
 + opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
 + VM_EXIT_ACK_INTR_ON_EXIT;
   if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
   &_vmexit_control) < 0)
   return -EIO;
 @@ -3742,11 +3744,12 @@ static void vmx_disable_intercept_for_msr(u32 msr, 
 bool longmode_only)
   * Note that host-state that does change is set elsewhere. E.g., host-state
   * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
   */
 -static void vmx_set_constant_host_state(void)
 +static void vmx_set_constant_host_state(struct kvm_vcpu *vcpu)
Pass vmx to the function. No need to convert vmx to vcpu and back.
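
To make the round trip concrete, here is a small stand-alone sketch with
stub types. It assumes, as in vmx.c, that struct kvm_vcpu is embedded in
struct vcpu_vmx and that to_vmx() is a container_of()-style conversion.
The struct bodies, the set_host_state*() names and the idt_base parameter
are simplified stand-ins for illustration, not the kernel definitions.

```c
#include <stddef.h>

/* Stub types: only the fields needed for the illustration. */
struct kvm_vcpu {
	int vcpu_id;
};

struct vcpu_vmx {
	struct kvm_vcpu vcpu;        /* embedded generic vcpu */
	unsigned long host_idt_base;
};

/* Stand-in for to_vmx(): recover the container from the embedded member. */
static struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
{
	return (struct vcpu_vmx *)((char *)vcpu -
				   offsetof(struct vcpu_vmx, vcpu));
}

/* Shape used in v4: the caller has vmx, passes the vcpu, callee converts back. */
static void set_host_state_v4(struct kvm_vcpu *vcpu, unsigned long idt_base)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);

	vmx->host_idt_base = idt_base;
}

/* Suggested shape: take the vmx pointer the caller already holds. */
static void set_host_state(struct vcpu_vmx *vmx, unsigned long idt_base)
{
	vmx->host_idt_base = idt_base;
}
```

Both variants store the same value; the second simply drops the conversion,
since the only caller in this patch already has a struct vcpu_vmx pointer.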

  {
   u32 low32, high32;
   unsigned long tmpl;
   struct desc_ptr dt;
 + struct vcpu_vmx *vmx = to_vmx(vcpu);
  
   vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS);  /* 22.2.3 */
   vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
 @@ -3770,6 +3773,7 @@ static void vmx_set_constant_host_state(void)
  
   native_store_idt(&dt);
   vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
 + vmx->host_idt_base = dt.address;
  
   vmcs_writel(HOST_RIP, vmx_return); /* 22.2.5 */
  
 @@ -3884,7 +3888,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
  
   vmcs_write16(HOST_FS_SELECTOR, 0);/* 22.2.4 */
   

Re: vCPU hotplug roadmap (was: Minutes for KVM call 2013-01-15)

2013-01-30 Thread Eduardo Habkost
On Wed, Jan 30, 2013 at 11:58:56AM +0100, Andreas Färber wrote:
 Am 15.01.2013 17:16, schrieb Juan Quintela:
  
  * cpu hot plug
- use qdev properties connected to a set of socket objects (anthony)
- cpusets are the wrong interface (anthony)
- make a link between cpu - socket instead of a property?
- how far are we from being able to describe a cpu with -device?
  (didn't hear the answer, andreas?)
- perhaps the best approach?
- After soft-freeze, exceptions depend on the maintainer
- After hard-freeze, no exceptions
-device doesn't require a bus, that is just an implementation detail, we can
  change that
- use cpuset as an intermediate step until full vision is implemented
- several approaches from where we are now, to have something before
  we get a full solution
  
  
  At this point, Andreas agreed to write a better summary of the
  discussion and suggestions O:-)
 
 Got buried, here we go:
 
 == vCPU hot-plug user interfaces ==
 
 === cpu_set ===
 
 Previously available in qemu-kvm.git:
 `cpu_set n+1 online` via HMP
 
 Pros:
 * Hides QOM/qdev implementation details (afaerber)
 * Thus: Doesn't depend on QOM CPUState refactoring (imammedo)
 * Opens a fast route to implementing vCPU unplug in KVM (imammedo)
 * Unintrusive to add and easy to obsolete/remove in future (imammedo)
 * Existing virt-test cases (afaerber)
 * Supported by libvirt (imammedo)
 * Prevents confusing guests by hot-plugging random mix of CPUs (agraf)
 
 Cons:
 * Cannot express topologies (ehabkost)

Actually, I believe this is not the main problem (we will have exactly
the same limitation if using thread-level device_add). To me, the main
problem is that we are creating a new QMP command that should be
eventually obsoleted by device_add.


 
 === device_add ===
 
 `device_add driver=Haswell-x86_64-cpu id=qdevid`
 [You can try this today and see it failing / not working.]
 
 Pros:
 * QMP/HMP command available today and known to users (afaerber)
 * Unified command for device and CPU hot-plug (imammedo)
 * Would allow first doing thread-level vCPU hotplug (imammedo)
 * Could be extended to support socket-level hot-plug (aliguori/imammedo)
 
 Cons:
 * Operates on raw QOM type name unlike -cpu (afaerber)
 * Needs support in libvirt for device_add driver=CPU (imammedo)
 * libvirt needs means to enumerate CPU types (imammedo) = QMP? (AF)
 
 Challenges:
 * No CPU qbus (afaerber)
   = should work without (aliguori)
 * CPU subclasses needed for identifying type name (afaerber/imammedo)
   = Haswell-x86_64-cpu does not exist yet, just x86_64-cpu
 * CPU class_init for -cpu host requires KVM init (imammedo)
   [suggestion by ehabkost to use kvm_arch_vcpu_init, WIP by afaerber]

I don't know what you mean by "use kvm_arch_vcpu_init()". I sent an RFC
following somebody's suggestion of simply making kvm_arch_init() call a
function to finish the -cpu host initialization, as we can't initialize
everything inside class_init.

See x86_cpu_finish_host_class_init() at:
 Message-Id: 1357329382-20944-7-git-send-email-ehabk...@redhat.com
 http://article.gmane.org/gmane.comp.emulators.qemu/186778


 * Conversion of CPU features to static properties needed (imammedo)
   = device_add driver=foo,level=x,xlevel=y,...
 * Alternatively conversion to global properties (imammedo)
 * Cements type names - rename for 1.4? (afaerber) = permissible (alig.)
   [patches for arm, m68k, openrisc, unicore32 on list]
 
 === qom-set ===
 
 `qom-set` via QMP w/ link<CPUSocket> property (aliguori)
 
 Topology represented in QOM:
 CPUSocket has-a CPUCore has-a CPUThread a.k.a. CPUState, or
 CPUSocket links-to CPUCore links-to CPUThread a.k.a. CPUState
 
 Challenges (afaerber):
 * No CPUSocket/CPUCore objects yet and may take a while to get there...
   topology fields being moved to CPUState for 1.4 [done, more WIP]
 * No decisions on canonical paths for CPUs: CPU? machine? unassigned?
 * Duality of thread-level device types and socket-level? (afaerber)
   = fine to have, e.g., quad-core Xeon 500 device (aliguori)
 * CPUState is no_user (afaerber)
   = need to generally drop no_user for QOM (aliguori)

I would like to drop no_user in 1.5 even if we don't manage to finish
CPU hotplug, as exposing the CPU objects and classes will be very useful
for letting libvirt probe the available CPU models and features.


 
 === libvirt ===
 
 libvirt's XML topology modelling is closer to today's -smp than to the
 desired QOM modelling:
 http://www.libvirt.org/formatcaps.html
 
 `virsh setvcpus domain n`
 http://libvirt.org/sources/virshcmdref/html/sect-setvcpus.html
 
 == qom-cpu course of action (afaerber) ==
 
 It was requested to have vCPU hot-plug in v1.5.
 
 For device_add we need to move code from cpu_init() into QOM facilities.
 = QOM realize support would help [applied by aliguori]
 = cleanups piggy-backed onto CPU realizefn [applied to qom-cpu-next]
 
 Agreement on goal of X86CPU subclasses, but conflicts how to get there:
 * Refactor 

RE: [PATCH v4] KVM: VMX: enable acknowledge interupt on vmexit

2013-01-30 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-01-30:
 On Wed, Jan 30, 2013 at 08:36:12PM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 The acknowledge interrupt on exit feature controls processor behavior
 for external interrupt acknowledgement. When this control is set, the
 processor acknowledges the interrupt controller to acquire the
 interrupt vector on VM exit.
 
 After enabling this feature, an interrupt which arrives while the target CPU
 is running in VMX non-root mode is handled by the VMX handler instead of
 the handler in the IDT. Currently, the VMX handler only fakes an interrupt
 stack frame and jumps through the IDT to let the real handler process it.
 In the future, we will recognize the interrupt and deliver through the IDT
 only those interrupts which do not belong to the current vcpu. Interrupts
 which belong to the current vcpu will be handled inside the VMX handler.
 This will reduce KVM's interrupt handling cost.
 
 The interrupt enable logic also changes when this feature is turned on:
 Before this patch, the hypervisor called local_irq_enable() to enable interrupts
 directly. Now the IF bit is set in the interrupt stack frame, so interrupts are
 enabled on return from the interrupt handler if an external interrupt exists.
 If there is no external interrupt, local_irq_enable() is still called.
 
 Refer to Intel SDM volume 3, chapter 33.2.
 
 Looks good to me except for one comment below. Send this patch as part of
 the posted interrupt series; there is no point in applying it separately.
Sure. I will send out the PI patch after it passes all testing.

 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/include/asm/kvm_host.h |1 +
  arch/x86/kvm/svm.c  |6 +++
  arch/x86/kvm/vmx.c  |   70 --
  arch/x86/kvm/x86.c  |4 ++-
  4 files changed, 76 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 77d56a4..1f1b2f8 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -725,6 +725,7 @@ struct kvm_x86_ops {
   int (*check_intercept)(struct kvm_vcpu *vcpu,
      struct x86_instruction_info *info,
      enum x86_intercept_stage stage);
 + void (*handle_external_intr)(struct kvm_vcpu *vcpu);
  };
  
  struct kvm_arch_async_pf {
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index d29d3cd..c283185 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -4227,6 +4227,11 @@ out:
  return ret;
  }
 +static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
 +{
 +local_irq_enable();
 +}
 +
  static struct kvm_x86_ops svm_x86_ops = {
   .cpu_has_kvm_support = has_svm,
   .disabled_by_bios = is_disabled,
 @@ -4318,6 +4323,7 @@ static struct kvm_x86_ops svm_x86_ops = {
   .set_tdp_cr3 = set_tdp_cr3,
  
   .check_intercept = svm_check_intercept,
 + .handle_external_intr = svm_handle_external_intr,
  };
  
  static int __init svm_init(void)
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 02eeba8..eaef185 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -379,6 +379,7 @@ struct vcpu_vmx {
   struct shared_msr_entry *guest_msrs;
   int   nmsrs;
   int   save_nmsrs;
 + unsigned long host_idt_base;
  #ifdef CONFIG_X86_64
   u64   msr_host_kernel_gs_base;
   u64   msr_guest_kernel_gs_base;
 @@ -2565,7 +2566,8 @@ static __init int setup_vmcs_config(struct
 vmcs_config *vmcs_conf)
  #ifdef CONFIG_X86_64
  min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
  #endif
 -opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT;
 +opt = VM_EXIT_SAVE_IA32_PAT | VM_EXIT_LOAD_IA32_PAT |
 +VM_EXIT_ACK_INTR_ON_EXIT;
  if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
  &_vmexit_control) < 0)
  return -EIO;
 @@ -3742,11 +3744,12 @@ static void vmx_disable_intercept_for_msr(u32 msr,
 bool longmode_only)
   * Note that host-state that does change is set elsewhere. E.g., host-state
   * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
   */
 -static void vmx_set_constant_host_state(void)
 +static void vmx_set_constant_host_state(struct kvm_vcpu *vcpu)
 Pass vmx to the function. No need to convert vmx to vcpu and back.
 
  {
  u32 low32, high32;
  unsigned long tmpl;
  struct desc_ptr dt;
 +struct vcpu_vmx *vmx = to_vmx(vcpu);
 
  vmcs_writel(HOST_CR0, read_cr0() & ~X86_CR0_TS);  /* 22.2.3 */
  vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
 @@ -3770,6 +3773,7 @@ static void vmx_set_constant_host_state(void)
 
  native_store_idt(&dt);
  vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
 +vmx->host_idt_base = dt.address;
 
  vmcs_writel(HOST_RIP, vmx_return); /* 22.2.5 */
 @@ -3884,7 +3888,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
  vmcs_write16(HOST_FS_SELECTOR, 

Re: vCPU hotplug roadmap

2013-01-30 Thread Andreas Färber
Am 30.01.2013 13:49, schrieb Eduardo Habkost:
 On Wed, Jan 30, 2013 at 11:58:56AM +0100, Andreas Färber wrote:
 * CPU class_init for -cpu host requires KVM init (imammedo)
   [suggestion by ehabkost to use kvm_arch_vcpu_init, WIP by afaerber]
 
 I don't know what you mean by use kvm_arch_vcpu_init().

Sorry, scratch the _vcpu. I.e., the x86-specific KVM init hook.

 I sent a RFC
 following somebody's suggestion of simply make kvm_arch_init() call a
 function to finish the -cpu host initialization, as we can't initialize
 everything inside class_init.
 
 See x86_cpu_finish_host_class_init() at:
  Message-Id: 1357329382-20944-7-git-send-email-ehabk...@redhat.com
  http://article.gmane.org/gmane.comp.emulators.qemu/186778

...and I have been working on making it even simpler for the
still-x86_def_t-based approach. I'm still busy looking at 1.4 issues
currently though.

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] vCPU hotplug roadmap

2013-01-30 Thread Igor Mammedov
On Wed, 30 Jan 2013 14:02:16 +0100
Andreas Färber afaer...@suse.de wrote:

 Am 30.01.2013 13:49, schrieb Eduardo Habkost:
  On Wed, Jan 30, 2013 at 11:58:56AM +0100, Andreas Färber wrote:
[...]
   http://article.gmane.org/gmane.comp.emulators.qemu/186778
 
 ...and I have been working on making it even simpler for the
 still-x86_def_t-based approach. I'm still busy looking at 1.4 issues
 currently though.
 
 Andreas
 

I'll try to cook up a series that does properties and classes in one seamless
approach, without intermediate steps. Perhaps that would work out better.


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Anthony Liguori
Michael S. Tsirkin m...@redhat.com writes:

 On Wed, Jan 30, 2013 at 11:48:14AM +, Peter Maydell wrote:
 On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
  Proposal by hpoussin was to move _list_add() code to ISADevice:
  http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
 
  Concerns:
  * PCI devices (VGA, QXL) register I/O ports as well
= above patches add dependency on ISABus to machines
   - benh no mac ever had one
= PCIDevice shouldn't use ISA API with NULL ISADevice
  * Lack of avi: Who decides about memory API these days?
 
  armbru and agraf concluded that moving this into ISA is wrong.
 
  = I will drop the remaining ioport patches from above series.
 
  Suggestions on how to proceed with tackling the issue are welcome.
 
 How does this stuff work on real hardware? I would have
 expected that a PCI device registering the fact it has
 IO ports would have to do so via the PCI controller it
 is plugged into...

 All programming is done by the OS, devices do not register
 with controller.

 Each bridge has two ways to claim an IO transaction:
 - transaction is within the window programmed in the bridge
 - subtractive decoding enabled and no one else claims the transaction

And there can only be one endpoint that accepts subtractive decoding and
this is usually the ISA bridge.

Also note that there are some really special cases with PCI.  The legacy
VGA ports are always routed to the first device with a DISPLAY class
type.

Likewise, legacy IDE ports are routed to the first device with an
IDE class.  That's the only reason you can have these legacy devices not
behind the ISA bridge.
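
As a rough model of the two claiming rules quoted above (positive decode
through a programmed window, subtractive decode as the fallback), here is a
minimal sketch. It is a toy, not QEMU code: struct bridge and route_io() are
invented for the illustration, it does not model the legacy VGA/IDE special
cases, and it borrows the PCI convention that a window with base > limit is
disabled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one bridge on a bus; not a QEMU type. */
struct bridge {
	const char *name;
	uint16_t io_base;   /* programmed I/O window, inclusive      */
	uint16_t io_limit;  /* base > limit means the window is off  */
	bool subtractive;   /* claims what nobody else decodes       */
};

/*
 * Return the bridge that claims an I/O transaction on this bus:
 * a positive window match wins, otherwise the first subtractive
 * decoder (usually the ISA bridge), otherwise nobody (NULL).
 */
static const struct bridge *route_io(const struct bridge *bridges, size_t n,
				     uint16_t port)
{
	const struct bridge *fallback = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (port >= bridges[i].io_base && port <= bridges[i].io_limit)
			return &bridges[i];
		if (bridges[i].subtractive && !fallback)
			fallback = &bridges[i];
	}
	return fallback;
}
```

With a PCI-PCI bridge programmed for 0xc000-0xcfff and an ISA bridge marked
subtractive, port 0xc010 goes to the former and port 0x60 falls through to
the latter.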

Regards,

Anthony Liguori


 At the bus level, transaction happens on a bus and an appropriate device
 will claim it.

 My naive don't-know-much-about-portio suggestion is that this
 should work the same way as memory regions: each device
 provides portio regions, and the controller for the bus
 (ISA or PCI) exposes those to the next layer up, and
 something at board level maps it all into the right places.
 
 -- PMM


[PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier

2013-01-30 Thread Mihai Caraman
The VCPU's MMUCFG register initialization should not depend on the
KVM_CAP_SW_TLB ioctl call. Move it earlier, into the TLB initialization phase.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500_mmu.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 5c44759..bb1b2b0 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -692,8 +692,6 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
vcpu_e500->gtlb_offset[0] = 0;
vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
 
-   vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
-
vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
if (params.tlb_sizes[0] <= 2048)
vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
@@ -781,6 +779,8 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
if (!vcpu_e500->g2h_tlb1_map)
goto err;
 
+   vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
+
/* Init TLB configuration register */
vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
 ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
-- 
1.7.4.1




[PATCH 2/5] KVM: PPC: e500: Emulate TLBnPS registers

2013-01-30 Thread Mihai Caraman
Emulate TLBnPS registers which are available in MMU Architecture Version
(MAV) 2.0.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/include/asm/kvm_host.h |1 +
 arch/powerpc/kvm/e500.h |5 +
 arch/powerpc/kvm/e500_emulate.c |   10 ++
 arch/powerpc/kvm/e500_mmu.c |5 +
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 8a72d59..88fcfe6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -501,6 +501,7 @@ struct kvm_vcpu_arch {
spinlock_t wdt_lock;
struct timer_list wdt_timer;
u32 tlbcfg[4];
+   u32 tlbps[4];
u32 mmucfg;
u32 epr;
struct kvmppc_booke_debug_reg dbg_reg;
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 41cefd4..b9f76d8 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -303,4 +303,9 @@ static inline unsigned int get_tlbmiss_tid(struct kvm_vcpu 
*vcpu)
 #define get_tlb_sts(gtlbe)  (MAS1_TS)
 #endif /* !BOOKE_HV */
 
+static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
+{
+   return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
+}
+
 #endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index e78f353..5515dc5 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -329,6 +329,16 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int 
sprn, ulong *spr_val)
*spr_val = vcpu->arch.ivor[BOOKE_IRQPRIO_DBELL_CRIT];
break;
 #endif
+   case SPRN_TLB0PS:
+   if (!has_mmu_v2(vcpu))
+   return EMULATE_FAIL;
+   *spr_val = vcpu->arch.tlbps[0];
+   break;
+   case SPRN_TLB1PS:
+   if (!has_mmu_v2(vcpu))
+   return EMULATE_FAIL;
+   *spr_val = vcpu->arch.tlbps[1];
+   break;
default:
emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
}
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index bb1b2b0..129299a 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -794,6 +794,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 
*vcpu_e500)
vcpu->arch.tlbcfg[1] |=
vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
 
+   if (has_mmu_v2(vcpu)) {
+   vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
+   vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
+   }
+
kvmppc_recalc_tlb1map_range(vcpu_e500);
return 0;
 
-- 
1.7.4.1




[PATCH 0/5] KVM: PPC: e500: Enable FSL e6500 core

2013-01-30 Thread Mihai Caraman
Enable the Freescale e6500 core by adding the missing MAV 2.0 support. LRAT and
Page Table are not addressed by this series.

Mihai Caraman (5):
  KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
  KVM: PPC: e500: Emulate TLBnPS registers
  KVM: PPC: e500: Remove E.PT category from VCPUs
  KVM: PPC: e500: Emulate EPTCFG register
  KVM: PPC: e500mc: Enable e6500 cores

 arch/powerpc/include/asm/kvm_host.h |2 ++
 arch/powerpc/kvm/e500.h |   11 +++
 arch/powerpc/kvm/e500_emulate.c |   19 +++
 arch/powerpc/kvm/e500_mmu.c |   24 ++--
 arch/powerpc/kvm/e500mc.c   |2 ++
 5 files changed, 52 insertions(+), 6 deletions(-)

-- 
1.7.4.1



[PATCH 3/5] KVM: PPC: e500: Remove E.PT category from VCPUs

2013-01-30 Thread Mihai Caraman
The Embedded.Page Table (E.PT) category in VMs requires emulation of indirect
TLB entries, which is not supported yet. Configure TLBnCFG to remove the E.PT
category from VCPUs.
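
As a stand-alone illustration of the masking described above, the sketch
below derives a guest TLBnCFG value from a host value: the entry count and
associativity are replaced by the guest's TLB geometry, and the IND bit is
cleared so the guest does not see indirect-entry support. The mask values
are assumed to match arch/powerpc/include/asm/mmu-book3e.h, and
guest_tlbcfg() is a made-up helper, not a function from this series. Unlike
the patch, the sketch also masks the inputs to their fields.

```c
#include <stdint.h>

/* Field masks, assumed to match asm/mmu-book3e.h. */
#define TLBnCFG_N_ENTRY     0x00000fffu
#define TLBnCFG_IND         0x00020000u
#define TLBnCFG_ASSOC       0xff000000u
#define TLBnCFG_ASSOC_SHIFT 24

/* Hypothetical helper: build the guest-visible TLBnCFG from the host's. */
static uint32_t guest_tlbcfg(uint32_t host_cfg, uint32_t entries,
			     uint32_t ways)
{
	/* Drop host geometry and the E.PT indicator, keep everything else. */
	uint32_t cfg = host_cfg &
		~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);

	cfg |= entries & TLBnCFG_N_ENTRY;
	cfg |= (ways << TLBnCFG_ASSOC_SHIFT) & TLBnCFG_ASSOC;
	return cfg;
}
```

For example, a host value of 0x04120200 with a guest TLB of 256 entries and
2 ways yields 0x02100100: the IND bit is gone while the unrelated size
field bit is preserved.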

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500_mmu.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 129299a..9a1f7b7 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -692,12 +692,14 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
vcpu_e500->gtlb_offset[0] = 0;
vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
 
-   vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+   vcpu->arch.tlbcfg[0] &=
+ ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
if (params.tlb_sizes[0] <= 2048)
vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
vcpu->arch.tlbcfg[0] |= params.tlb_ways[0] << TLBnCFG_ASSOC_SHIFT;
 
-   vcpu->arch.tlbcfg[1] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+   vcpu->arch.tlbcfg[1] &=
+ ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
vcpu->arch.tlbcfg[1] |= params.tlb_sizes[1];
vcpu->arch.tlbcfg[1] |= params.tlb_ways[1] << TLBnCFG_ASSOC_SHIFT;
 
@@ -783,13 +785,13 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 
*vcpu_e500)
 
/* Init TLB configuration register */
vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
-~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
vcpu->arch.tlbcfg[0] |= vcpu_e500->gtlb_params[0].entries;
vcpu->arch.tlbcfg[0] |=
vcpu_e500->gtlb_params[0].ways << TLBnCFG_ASSOC_SHIFT;
 
vcpu->arch.tlbcfg[1] = mfspr(SPRN_TLB1CFG) &
-~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
vcpu->arch.tlbcfg[1] |= vcpu_e500->gtlb_params[1].entries;
vcpu->arch.tlbcfg[1] |=
vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
-- 
1.7.4.1




[PATCH 4/5] KVM: PPC: e500: Emulate EPTCFG register

2013-01-30 Thread Mihai Caraman
The EPTCFG register defined by E.PT is accessed unconditionally by Linux guests
in the presence of MAV 2.0. Emulate the EPTCFG register.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/include/asm/kvm_host.h |1 +
 arch/powerpc/kvm/e500.h |6 ++
 arch/powerpc/kvm/e500_emulate.c |9 +
 arch/powerpc/kvm/e500_mmu.c |5 +
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 88fcfe6..f480b20 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -503,6 +503,7 @@ struct kvm_vcpu_arch {
u32 tlbcfg[4];
u32 tlbps[4];
u32 mmucfg;
+   u32 eptcfg;
u32 epr;
struct kvmppc_booke_debug_reg dbg_reg;
 #endif
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index b9f76d8..983eb95 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -308,4 +308,10 @@ static inline unsigned int has_mmu_v2(const struct 
kvm_vcpu *vcpu)
return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
 }
 
+static inline unsigned int supports_page_tables(const struct kvm_vcpu *vcpu)
+{
+   return ((vcpu->arch.tlbcfg[0] & TLBnCFG_IND)
+   || (vcpu->arch.tlbcfg[1] & TLBnCFG_IND));
+}
+
 #endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 5515dc5..493e231 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -339,6 +339,15 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
 			return EMULATE_FAIL;
 		*spr_val = vcpu->arch.tlbps[1];
 		break;
+	case SPRN_EPTCFG:
+		if (!has_mmu_v2(vcpu))
+			return EMULATE_FAIL;
+		/*
+		 * Legacy Linux guests access EPTCFG register even if the E.PT
+		 * category is disabled in the VM. Give them a chance to live.
+		 */
+		*spr_val = vcpu->arch.eptcfg;
+		break;
 	default:
 		emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
 	}
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 9a1f7b7..199c11e 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -799,6 +799,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	if (has_mmu_v2(vcpu)) {
 		vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
 		vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
+
+		if (supports_page_tables(vcpu))
+			vcpu->arch.eptcfg = mfspr(SPRN_EPTCFG);
+		else
+			vcpu->arch.eptcfg = 0;
 	}
 
 	kvmppc_recalc_tlb1map_range(vcpu_e500);
-- 
1.7.4.1




[PATCH 5/5] KVM: PPC: e500mc: Enable e6500 cores

2013-01-30 Thread Mihai Caraman
Extend processor compatibility names to e6500 cores.

Signed-off-by: Mihai Caraman mihai.cara...@freescale.com
---
 arch/powerpc/kvm/e500mc.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index 1f89d26..6c87299 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -172,6 +172,8 @@ int kvmppc_core_check_processor_compat(void)
 		r = 0;
 	else if (strcmp(cur_cpu_spec->cpu_name, "e5500") == 0)
 		r = 0;
+	else if (strcmp(cur_cpu_spec->cpu_name, "e6500") == 0)
+		r = 0;
 	else
 		r = -ENOTSUPP;
 
-- 
1.7.4.1




Re: [Qemu-devel] What to do about non-qdevified devices?

2013-01-30 Thread Andreas Färber
Am 30.01.2013 13:35, schrieb Markus Armbruster:
 Peter Maydell peter.mayd...@linaro.org writes:
 
 On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
 Anthony Liguori aligu...@us.ibm.com writes:

 [...]
 The problems I ran into were (1) this is a lot of work (2) it basically
 requires that all bus children have been qdev/QOM-ified.  Even with
 something like the ISA bus which is where I started, quite a few devices
 were not qdevified still.

 So what's the plan to complete the qdevification job?  Lay really low
 and quietly hope the problem goes away?  We've tried that for about
 three years, doesn't seem to work.

 Do we have a list of not-yet-qdevified devices? Maybe we need to
 start saying fix X Y and Z or platform P is dropped from the next
 release. (This would of course be easier if we had a way to let users
 know that platform P was in danger...)
 
 I think that's a good idea.  Only problem is identifying pre-qdev
 devices in the code requires code inspection (grep won't do, I'm
 afraid).

+1 That would address my request as well.

Having a list of low-hanging fruit on the Wiki might also give new
contributors some ideas of where and how to start poking at the code.

 If we agree on a qdevify or else plan, I'd be prepared to help with
 the digging up of devices.

I disagree on the or else part. I have been qdev'ifying and QOM'ifying
devices in my maintenance area, and progress is slow. It gets even
slower once one leaves clearly maintained areas. I see no reason to
hold a gun to someone's head, as you have done for IDE, unless
there is a good reason to do so. Currently I don't see any.

Just think of my pending ide/mmio.c patch [1] that no one has reviewed
or applied so far. Similarly, Fred's virtio refactoring has pretty long
review cycles, with discussions about very basic QOM and OOD idioms.

If we want to make progress, we need to encourage contributors to send
such patches by making sure they get feedback and find their way into
the tree within a reasonable time frame. It's always easier to rip out
and damage other people's work than to get things right yourself.

To take that thought to the extreme, I could propose to rip out any qdev
device that's not properly QOM'ified and realize'ified yet. That would
include i440fx, fdc and many core x86 devices in the repository...

Technical risks have been raised elsewhere: Making random code
SysBusDevices can lead to PCIDevices instantiating them not being
hot-pluggable any more, simply because SysBus is a crappy fallback,
overused for lack of a clear alternative. I already started reviewing
parent_bus and qdev_get_parent_bus() uses in the tree [2, 3], but
constructive help would be more welcome than constant nagging about code
that's in bad shape. There's a lot of work to be done!

Andreas

[1] http://patchwork.ozlabs.org/patch/215482/
[2] http://patchwork.ozlabs.org/patch/209499/
[3] http://patchwork.ozlabs.org/patch/213971/

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Anthony Liguori
Andreas Färber afaer...@suse.de writes:

 Am 29.01.2013 16:41, schrieb Juan Quintela:
 * Portio port to new memory regions?
   Andreas, could you fill?

 MemoryRegion's .old_portio mechanism requires workarounds for VGA on
 ppc, affecting among others the sPAPR PCI host bridge:
 http://git.qemu.org/?p=qemu.git;a=commit;h=a3cfa18eb075c7ef78358ca1956fe7b01caa1724

 Patches were posted and merged removing all .old_portio users but one:
 hw/ioport.c:portio_list_add_1(), used by portio_list_add()

 hw/isa-bus.c:portio_list_add(piolist, isabus-address_space_io, start);
 hw/qxl.c:portio_list_add(qxl_vga_port_list,
 pci_address_space_io(dev), 0x3b0);
 hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);
 hw/vga.c:portio_list_add(vbe_port_list, address_space_io, 0x1ce);

 Proposal by hpoussin was to move _list_add() code to ISADevice:
 http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

Okay, a couple things here:

There is no such thing as PIO as a general concept.  What leaves the
CPU and what a bus interprets are totally different things.

An x86 CPU has a MMIO capability that's essentially 65 bits.  Whether
the top bit is set determines whether it's a PIO transaction or an
MMIO transaction.  A large chunk of that address space is invalid of
course.

PCI has a 65 bit address space too.  The 65th bit determines whether
it's an IO transaction or an MMIO transaction.

For architectures that only have a 64-bit address space, what the PCI
controller typically does is pick a 16-bit window within that address
space to map to a PCI address with the 65th bit set.
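
A minimal sketch of that windowing, assuming a made-up window base
(PIO_WINDOW_BASE) and hypothetical names throughout; it models no
particular host bridge.  Addresses inside the 64 KiB window go out as
I/O cycles at the window offset; the assumption that everything else
passes through unchanged as memory is a simplification for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placement of the I/O window in the CPU address space. */
#define PIO_WINDOW_BASE 0xfe000000ULL
#define PIO_WINDOW_SIZE 0x10000ULL      /* 16 address bits = 64 KiB */

/* A PCI address: 64 address bits plus one bit selecting I/O vs. memory. */
struct pci_addr {
    bool     is_io;   /* the "65th bit" */
    uint64_t addr;
};

/* Translate a CPU physical address into what the bridge puts on the bus. */
static struct pci_addr bridge_translate(uint64_t cpu_addr)
{
    struct pci_addr out;

    if (cpu_addr >= PIO_WINDOW_BASE &&
        cpu_addr - PIO_WINDOW_BASE < PIO_WINDOW_SIZE) {
        /* Inside the window: I/O cycle at the offset into the window. */
        out.is_io = true;
        out.addr = cpu_addr - PIO_WINDOW_BASE;
    } else {
        /* Outside the window: plain memory cycle (simplifying assumption). */
        out.is_io = false;
        out.addr = cpu_addr;
    }
    return out;
}
```

With these placeholder values, a CPU access to PIO_WINDOW_BASE + 0x3b0
would appear on the bus as an I/O transaction to port 0x3b0.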

Within the PCI bus, transactions are usually routed to devices via
positive decoding.  The device lists what address regions it wants to
handle (via BARs) and the PCI bus uses those to determine who to send
transactions to.

There are some exceptions though.  Specifically:

1) A chipset will route any non-positively decoded IO transaction (65th
   bit set) to a single end point (usually the ISA-bridge).  Which one it
   chooses is up to the chipset.  This is called subtractive decoding
   because the PCI bus will wait multiple cycles for that device to
   claim the transaction before bouncing it.

2) There are special hacks in most PCI chipsets to route very specific
   address ranges to certain devices.  Namely, legacy VGA IO transactions
   go to the first VGA device.  Legacy IDE IO transactions go to the first
   IDE device.  This doesn't need to be programmed in the BARs.  It will
   just happen.

3) As it turns out, all legacy PIIX3 devices are positively decoded and
   sent to the ISA-bridge (because it's faster this way).

Notice the lack of the word ISA in all of this other than describing
the PCI class of an end point.
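
The routing rules above can be sketched roughly as follows.  All types
and names here are hypothetical and only illustrate the idea: every end
point lists the ranges it positively decodes (whether they come from
BARs or from fixed legacy ranges), and at most one end point acts as the
subtractive-decode fallback.

```c
#include <stddef.h>
#include <stdint.h>

struct io_range {
    uint32_t base;
    uint32_t size;
};

struct endpoint {
    const char *name;
    const struct io_range *ranges;   /* positively decoded ranges */
    size_t n_ranges;
};

struct io_bus {
    const struct endpoint *endpoints;
    size_t n_endpoints;
    const struct endpoint *subtractive;  /* fallback end point, may be NULL */
};

/* Does this end point positively decode the given port? */
static int claims(const struct endpoint *ep, uint32_t port)
{
    for (size_t i = 0; i < ep->n_ranges; i++) {
        if (port >= ep->ranges[i].base &&
            port - ep->ranges[i].base < ep->ranges[i].size)
            return 1;
    }
    return 0;
}

/* Positive decoding first; anything unclaimed goes to the fallback. */
static const struct endpoint *route_io(const struct io_bus *bus, uint32_t port)
{
    for (size_t i = 0; i < bus->n_endpoints; i++) {
        if (claims(&bus->endpoints[i], port))
            return &bus->endpoints[i];
    }
    return bus->subtractive;
}
```

Real hardware resolves this in parallel on the bus rather than by
walking a list; the loop only captures the priority of positive over
subtractive decoding.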

So how should this be modeled?

On x86, the CPU has a pio address space.  That can propagate down
through the PCI bus which is what we do today.

On !x86, the PCI controller ought to set up a MemoryRegion for downstream
PIO that devices can register on.

We probably need to do something like change the PCI VGA devices to
export a MemoryRegion and allow the PCI controller to decide how to
register that as a subregion.

Regards,

Anthony Liguori


 Concerns:
 * PCI devices (VGA, QXL) register I/O ports as well
   = above patches add dependency on ISABus to machines
  - benh no mac ever had one
   = PCIDevice shouldn't use ISA API with NULL ISADevice
 * Lack of avi: Who decides about memory API these days?

 armbru and agraf concluded that moving this into ISA is wrong.

 = I will drop the remaining ioport patches from above series.

 Suggestions on how to proceed with tackling the issue are welcome.

 Regards,
 Andreas

 -- 
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Michael S. Tsirkin
On Wed, Jan 30, 2013 at 07:24:57AM -0600, Anthony Liguori wrote:
 Michael S. Tsirkin m...@redhat.com writes:
 
  On Wed, Jan 30, 2013 at 11:48:14AM +, Peter Maydell wrote:
  On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
   Proposal by hpoussin was to move _list_add() code to ISADevice:
   http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
  
   Concerns:
   * PCI devices (VGA, QXL) register I/O ports as well
 = above patches add dependency on ISABus to machines
- benh no mac ever had one
 = PCIDevice shouldn't use ISA API with NULL ISADevice
   * Lack of avi: Who decides about memory API these days?
  
   armbru and agraf concluded that moving this into ISA is wrong.
  
   = I will drop the remaining ioport patches from above series.
  
   Suggestions on how to proceed with tackling the issue are welcome.
  
  How does this stuff work on real hardware? I would have
  expected that a PCI device registering the fact it has
  IO ports would have to do so via the PCI controller it
  is plugged into...
 
  All programming is done by the OS, devices do not register
  with controller.
 
  Each bridge has two ways to claim an IO transaction:
  - transaction is within the window programmed in the bridge
  - subtractive decoding enabled and no one else claims the transaction
 
 And there can only be one endpoint that accepts subtractive decoding and
 this is usually the ISA bridge.
 
 Also note that there are some really special cases with PCI.  The legacy
 VGA ports are always routed to the first device with a DISPLAY class
 type.
 
 Likewise, legacy IDE ports are routed to the first device with an
 IDE class.  That's the only reason you can have these legacy devices not
 behind the ISA bridge.
 
 Regards,
 
 Anthony Liguori

Yes. And to further clarify that: 'routed' in the sense that the spec
specifies the addresses for each class; it's a hard-coded set of
addresses.

The hardware never looks at the class; each device
simply knows which addresses to claim and whether it's enabled.

What happens if you have more than one VGA adapter on a bus?
As long as only one is enabled, you are fine.
If more than one is enabled, bad things will happen including
possibly overheating.

Also, it's not just the class that specifies the addresses,
it's the programming interface too.
For example for display, hardcoded addresses are used for legacy subclass 0x0
and for programming ifc 0x0 - vga compatible adapter and
0x1 - 8514 compatible adapter.
But again - it specifies this to the OS.

 
  At the bus level, transaction happens on a bus and an appropriate device
  will claim it.
 
  My naive don't-know-much-about-portio suggestion is that this
  should work the same way as memory regions: each device
  provides portio regions, and the controller for the bus
  (ISA or PCI) exposes those to the next layer up, and
  something at board level maps it all into the right places.
  
  -- PMM


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Markus Armbruster
Peter Maydell peter.mayd...@linaro.org writes:

 On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
 Proposal by hpoussin was to move _list_add() code to ISADevice:
 http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

 Concerns:
 * PCI devices (VGA, QXL) register I/O ports as well
   = above patches add dependency on ISABus to machines
  - benh no mac ever had one
   = PCIDevice shouldn't use ISA API with NULL ISADevice
 * Lack of avi: Who decides about memory API these days?

 armbru and agraf concluded that moving this into ISA is wrong.

 = I will drop the remaining ioport patches from above series.

 Suggestions on how to proceed with tackling the issue are welcome.

 How does this stuff work on real hardware? I would have
 expected that a PCI device registering the fact it has
 IO ports would have to do so via the PCI controller it
 is plugged into...

 My naive don't-know-much-about-portio suggestion is that this
 should work the same way as memory regions: each device
 provides portio regions, and the controller for the bus
 (ISA or PCI) exposes those to the next layer up, and
 something at board level maps it all into the right places.

Makes sense to me, but I'm naive, too :)

For me, I/O ports are just an alternate address space some devices
have.  For instance, x86 CPUs have an extra pin for selecting I/O
vs. memory address space.  The ISA bus has separate read/write pins for
memory and I/O.

This isn't terribly special.  Mapping address spaces around is what
devices bridging buses do.

I'd expect a system bus for an x86 CPU to have both a memory and an I/O
address space.

I'd expect an ISA PC's sysbus - ISA bridge to map both directly.

I'd expect an ISA bridge for a sysbus without a separate I/O address
space to map the ISA I/O address space into the sysbus's normal address
space somehow.

PCI ISA bridges have their own rules, but I've gotten away with ignoring
the details so far :)


RE: [PATCH 5/8] KVM: PPC: debug stub interface parameter defined

2013-01-30 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Friday, January 25, 2013 5:24 PM
 To: Bhushan Bharat-R65777
 Cc: Paul Mackerras; kvm-...@vger.kernel.org; kvm@vger.kernel.org
 Subject: Re: [PATCH 5/8] KVM: PPC: debug stub interface parameter defined
 
 
 On 17.01.2013, at 12:11, Bhushan Bharat-R65777 wrote:
 
 
 
  -Original Message-
  From: Paul Mackerras [mailto:pau...@samba.org]
  Sent: Thursday, January 17, 2013 12:53 PM
  To: Bhushan Bharat-R65777
  Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de;
  Bhushan Bharat-
  R65777
  Subject: Re: [PATCH 5/8] KVM: PPC: debug stub interface parameter
  defined
 
  On Wed, Jan 16, 2013 at 01:54:42PM +0530, Bharat Bhushan wrote:
  This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
  ioctl support. Follow up patches will use this for setting up
  hardware breakpoints, watchpoints and software breakpoints.
 
  [snip]
 
  diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
  index 453a10f..7d5a51c 100644
  --- a/arch/powerpc/kvm/booke.c
  +++ b/arch/powerpc/kvm/booke.c
  @@ -1483,6 +1483,12 @@ int kvm_vcpu_ioctl_set_one_reg(struct
  kvm_vcpu *vcpu,
  struct kvm_one_reg *reg)
return r;
  }
 
  +int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
  +  struct kvm_guest_debug *dbg)
  +{
  + return -EINVAL;
  +}
  +
  int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct
  kvm_fpu
  *fpu)  {
return -ENOTSUPP;
  diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
  index 934413c..4c94ca9 100644
  --- a/arch/powerpc/kvm/powerpc.c
  +++ b/arch/powerpc/kvm/powerpc.c
  @@ -532,12 +532,6 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
  #endif  }
 
  -int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
  -struct kvm_guest_debug *dbg)
  -{
  - return -EINVAL;
  -}
  -
 
  This will break the build for non-book E machines, since
  kvm_arch_vcpu_ioctl_set_guest_debug() is referenced from generic code.
  You need to add it to arch/powerpc/kvm/book3s.c as well.
 
  right,  I will correct this.
 
 Would the implementation actually be different on booke vs book3s? My feeling 
 is
 that powerpc.c is actually the right place for this.
 

I am not sure there will be anything common between book3s and booke. Should we 
define the cpu specific function something like 
kvm_ppc_vcpu_ioctl_set_guest_debug() for booke and book3s and call this new 
defined function from kvm_arch_vcpu_ioctl_set_guest_debug() in powerpc.c ?

Thanks
-Bharat
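
A minimal sketch of the split proposed above, assuming the generic entry
point stays in powerpc.c and each core file (booke.c or book3s.c, only
one of which is built) supplies kvm_ppc_vcpu_ioctl_set_guest_debug().
The struct declarations are left opaque and the -EINVAL body is just the
existing stub; nothing here is taken from a posted patch.

```c
#include <errno.h>
#include <stddef.h>

struct kvm_vcpu;          /* opaque for the purpose of this sketch */
struct kvm_guest_debug;   /* opaque for the purpose of this sketch */

/*
 * Core-specific hook, as it might appear in booke.c or book3s.c.
 * This stub keeps the current behaviour of rejecting the request.
 */
int kvm_ppc_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
                                       struct kvm_guest_debug *dbg)
{
    (void)vcpu;
    (void)dbg;
    return -EINVAL;
}

/* Generic entry point, as it might remain in powerpc.c. */
int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
                                        struct kvm_guest_debug *dbg)
{
    return kvm_ppc_vcpu_ioctl_set_guest_debug(vcpu, dbg);
}
```

With this shape the symbol referenced from generic code is always
defined in powerpc.c, so the build no longer depends on which core file
is compiled in.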





Re: [Qemu-devel] What to do about non-qdevified devices?

2013-01-30 Thread Anthony Liguori
Markus Armbruster arm...@redhat.com writes:

 Peter Maydell peter.mayd...@linaro.org writes:

 On 30 January 2013 07:02, Markus Armbruster arm...@redhat.com wrote:
 Anthony Liguori aligu...@us.ibm.com writes:

 [...]
 The problems I ran into were (1) this is a lot of work (2) it basically
 requires that all bus children have been qdev/QOM-ified.  Even with
 something like the ISA bus which is where I started, quite a few devices
 were not qdevified still.

 So what's the plan to complete the qdevification job?  Lay really low
 and quietly hope the problem goes away?  We've tried that for about
 three years, doesn't seem to work.

 Do we have a list of not-yet-qdevified devices? Maybe we need to
 start saying fix X Y and Z or platform P is dropped from the next
 release. (This would of course be easier if we had a way to let users
 know that platform P was in danger...)

 I think that's a good idea.  Only problem is identifying pre-qdev
 devices in the code requires code inspection (grep won't do, I'm
 afraid).

 If we agree on a qdevify or else plan, I'd be prepared to help with
 the digging up of devices.

That's a nice thought, but we're not going to rip out dma.c and break
every PC target.

But I will help put together a list of devices that need converting.  I
have patches actually for most of the PC devices.

Regards,

Anthony Liguori



Re: [Qemu-devel] QEMU buildbot maintenance state

2013-01-30 Thread Anthony Liguori
Gerd Hoffmann kra...@redhat.com writes:

   Hi,

 Gerd: Are you willing to co-maintain the QEMU buildmaster with Daniel
 and Christian?  It would be awesome if you could do this given your
 experience running and customizing buildbot.

 I'll try to set aside some time for that.  Christian's idea to host the
 config at github is good; that certainly makes it easier to spread
 things across more people.

 Another thing which would be helpful:  Any chance we can setup a
 maintainer tree mirror @ git.qemu.org?  A single repository where each
 maintainer tree shows up as a branch?

I will set up a tree based on the 'T:' fields in MAINTAINERS.  So if you
want your tree to be part of buildbot, please make sure that you have a
correct entry in MAINTAINERS.

Regards,

Anthony Liguori


 This would make the buildbot setup *a lot* easier.  We can go for an
 AnyBranchScheduler then with BuildFactory and BuildConfig shared,
 instead of needing one BuildFactory and BuildConfig per branch.  It also
 makes the buildbot web interface less cluttered, as we don't have an
 insane amount of BuildConfigs any more.  And it saves some resources
 (bandwidth + disk space) for the buildslaves.

 I think it would also be a nice service for people who want to see what
 is coming, or who want to test stuff that is cooking, to have a one-stop
 shop where they can get everything.

 cheers,
   Gerd


[PATCH 1/6] KVM: MMU: make spte_is_locklessly_modifiable() more clear

2013-01-30 Thread y
From: Gleb Natapov g...@redhat.com

spte_is_locklessly_modifiable() checks that both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are present on spte. Make it more explicit.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9f628f7..2fa82b0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -448,7 +448,8 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
 
 static bool spte_is_locklessly_modifiable(u64 spte)
 {
-	return !(~spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE));
+	return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
+		(SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
 }
 
 static bool spte_has_volatile_bits(u64 spte)
-- 
1.7.10.4
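
The two forms of the check in spte_is_locklessly_modifiable() can be
compared in isolation.  The sketch below uses placeholder bit positions
for SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE (they are not the
kernel's definitions) and hypothetical function names; it only shows
that "no mask bit is missing" and "all mask bits are present" are the
same predicate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions for the example; not the kernel's values. */
#define SPTE_HOST_WRITEABLE (1ULL << 0)
#define SPTE_MMU_WRITEABLE  (1ULL << 1)

/* Old form: true when no bit of the mask is missing from spte. */
static bool modifiable_old(uint64_t spte)
{
    return !(~spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE));
}

/* New form: true when every bit of the mask is set in spte. */
static bool modifiable_new(uint64_t spte)
{
    return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
           (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
}
```

Both return true only when both bits are set, whatever the remaining
bits of the value are.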



[PATCH 2/6] KVM: MMU: drop unneeded checks.

2013-01-30 Thread y
From: Gleb Natapov g...@redhat.com


Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2fa82b0..40737b3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2328,9 +2328,8 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 		if (s->role.level != PT_PAGE_TABLE_LEVEL)
 			return 1;
 
-		if (!need_unsync && !s->unsync) {
+		if (!s->unsync)
 			need_unsync = true;
-		}
 	}
 	if (need_unsync)
 		kvm_unsync_pages(vcpu, gfn);
@@ -4008,7 +4007,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 			      !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
 			      & mask.word) && rmap_can_add(vcpu))
 				mmu_pte_write_new_pte(vcpu, sp, spte, gentry);
-			if (!remote_flush && need_remote_flush(entry, *spte))
+			if (need_remote_flush(entry, *spte))
 				remote_flush = true;
 			++spte;
 		}
-- 
1.7.10.4
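
Both hunks in this patch drop a "flag not yet set" guard in front of an
assignment that only ever sets the flag to true.  A small stand-alone
sketch of why the result is the same, using hypothetical helper names
and a plain array of booleans in place of the kernel's conditions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Old shape: stop testing the condition once the flag is set. */
static bool scan_guarded(const bool *needs, size_t n)
{
    bool flag = false;

    for (size_t i = 0; i < n; i++) {
        if (!flag && needs[i])
            flag = true;
    }
    return flag;
}

/* New shape: set the flag whenever the condition holds. */
static bool scan_plain(const bool *needs, size_t n)
{
    bool flag = false;

    for (size_t i = 0; i < n; i++) {
        if (needs[i])
            flag = true;
    }
    return flag;
}
```

The only observable difference is that the unguarded form keeps
evaluating the condition after the flag is set, so the equivalence
relies on that condition having no side effects.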



[PATCH 0/6] small cleanups in MMU code

2013-01-30 Thread y
From: Gleb Natapov g...@redhat.com

None of these should change functionality.

Gleb Natapov (6):
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert KVM: MMU: split kvm_mmu_free_page

 arch/x86/kvm/mmu.c |   32 +---
 arch/x86/kvm/paging_tmpl.h |3 ---
 arch/x86/kvm/x86.c |2 --
 3 files changed, 9 insertions(+), 28 deletions(-)

-- 
1.7.10.4



[PATCH 6/6] Revert KVM: MMU: split kvm_mmu_free_page

2013-01-30 Thread y
From: Gleb Natapov g...@redhat.com

This reverts commit bd4c86eaa6ff10abc4e00d0f45d2a28b10b09df4.

There is no user of kvm_mmu_isolate_page() any more.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |   21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 42ba85c..0242a8a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1461,28 +1461,14 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, int nr)
 	percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-/*
- * Remove the sp from shadow page cache, after call it,
- * we can not find this sp from the cache, and the shadow
- * page table is still valid.
- * It should be under the protection of mmu lock.
- */
-static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
 	ASSERT(is_empty_shadow_page(sp->spt));
 	hlist_del(&sp->hash_link);
-	if (!sp->role.direct)
-		free_page((unsigned long)sp->gfns);
-}
-
-/*
- * Free the shadow page table and the sp, we can do it
- * out of the protection of mmu lock.
- */
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
-{
 	list_del(&sp->link);
 	free_page((unsigned long)sp->spt);
+	if (!sp->role.direct)
+		free_page((unsigned long)sp->gfns);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2126,7 +2112,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 	do {
 		sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
 		WARN_ON(!sp->role.invalid || sp->root_count);
-		kvm_mmu_isolate_page(sp);
 		kvm_mmu_free_page(sp);
 	} while (!list_empty(invalid_list));
 }
-- 
1.7.10.4



[PATCH 5/6] KVM: MMU: drop superfluous is_present_gpte() check.

2013-01-30 Thread y
From: Gleb Natapov g...@redhat.com

Guest page walker puts only present ptes into ptes[] array. No need to
check it again.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/paging_tmpl.h |3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ca69dcc..34c5c99 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -409,9 +409,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	unsigned direct_access, access = gw->pt_access;
 	int top_level, emulate = 0;
 
-	if (!is_present_gpte(gw->ptes[gw->level - 1]))
-		return 0;
-
 	direct_access = gw->pte_access;
 
 	top_level = vcpu->arch.mmu.root_level;
-- 
1.7.10.4



[PATCH 4/6] KVM: MMU: drop superfluous min() call.

2013-01-30 Thread y
From: Gleb Natapov g...@redhat.com


Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8028ac6..42ba85c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3854,7 +3854,7 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa,
 		/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
 		*gpa &= ~(gpa_t)7;
 		*bytes = 8;
-		r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, min(*bytes, 8));
+		r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, 8);
 		if (r)
 			gentry = 0;
 		new = (const u8 *)&gentry;
-- 
1.7.10.4



[PATCH 3/6] KVM: MMU: set base_role.nxe during mmu initialization.

2013-01-30 Thread y
From: Gleb Natapov g...@redhat.com

Move base_role.nxe initialisation to where all other roles are initialized.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |1 +
 arch/x86/kvm/x86.c |2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40737b3..8028ac6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3687,6 +3687,7 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
 	else
 		r = paging32_init_context(vcpu, context);
 
+	vcpu->arch.mmu.base_role.nxe = is_nx(vcpu);
 	vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
 	vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
 	vcpu->arch.mmu.base_role.smep_andnot_wp
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf512e70..373e17a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -870,8 +870,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 
 	kvm_x86_ops->set_efer(vcpu, efer);
 
-	vcpu->arch.mmu.base_role.nxe = (efer & EFER_NX) && !tdp_enabled;
-
 	/* Update reserved bits */
 	if ((efer ^ old_efer) & EFER_NX)
 		kvm_mmu_reset_context(vcpu);
-- 
1.7.10.4



Re: [PATCH 0/6] small cleanups in MMU code

2013-01-30 Thread Gleb Natapov
Something went wrong with git send-email. Ignore this one please.

On Wed, Jan 30, 2013 at 04:42:27PM +0200, y...@redhat.com wrote:
 From: Gleb Natapov g...@redhat.com
 
 None of these should change functionality.
 
 Gleb Natapov (6):
   KVM: MMU: make spte_is_locklessly_modifiable() more clear
   KVM: MMU: drop unneeded checks.
   KVM: MMU: set base_role.nxe during mmu initialization.
   KVM: MMU: drop superfluous min() call.
   KVM: MMU: drop superfluous is_present_gpte() check.
   Revert KVM: MMU: split kvm_mmu_free_page
 
  arch/x86/kvm/mmu.c |   32 +---
  arch/x86/kvm/paging_tmpl.h |3 ---
  arch/x86/kvm/x86.c |2 --
  3 files changed, 9 insertions(+), 28 deletions(-)
 
 -- 
 1.7.10.4
 

--
Gleb.


[PATCH 2/6] KVM: MMU: drop unneeded checks.

2013-01-30 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2fa82b0..40737b3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2328,9 +2328,8 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 		if (s->role.level != PT_PAGE_TABLE_LEVEL)
 			return 1;
 
-		if (!need_unsync && !s->unsync) {
+		if (!s->unsync)
 			need_unsync = true;
-		}
 	}
 	if (need_unsync)
 		kvm_unsync_pages(vcpu, gfn);
@@ -4008,7 +4007,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 			      !((sp->role.word ^ vcpu->arch.mmu.base_role.word)
 			      & mask.word) && rmap_can_add(vcpu))
 				mmu_pte_write_new_pte(vcpu, sp, spte, gentry);
-			if (!remote_flush && need_remote_flush(entry, *spte))
+			if (need_remote_flush(entry, *spte))
 				remote_flush = true;
 			++spte;
 		}
-- 
1.7.10.4



[PATCH 0/6] small cleanups in MMU code

2013-01-30 Thread Gleb Natapov
None of these should change functionality.

Gleb Natapov (6):
  KVM: MMU: make spte_is_locklessly_modifiable() more clear
  KVM: MMU: drop unneeded checks.
  KVM: MMU: set base_role.nxe during mmu initialization.
  KVM: MMU: drop superfluous min() call.
  KVM: MMU: drop superfluous is_present_gpte() check.
  Revert KVM: MMU: split kvm_mmu_free_page

 arch/x86/kvm/mmu.c |   32 +---
 arch/x86/kvm/paging_tmpl.h |3 ---
 arch/x86/kvm/x86.c |2 --
 3 files changed, 9 insertions(+), 28 deletions(-)

-- 
1.7.10.4



[PATCH 3/6] KVM: MMU: set base_role.nxe during mmu initialization.

2013-01-30 Thread Gleb Natapov
Move base_role.nxe initialisation to where all other roles are initialized.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |1 +
 arch/x86/kvm/x86.c |2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 40737b3..8028ac6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3687,6 +3687,7 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct 
kvm_mmu *context)
else
r = paging32_init_context(vcpu, context);
 
+   vcpu->arch.mmu.base_role.nxe = is_nx(vcpu);
vcpu->arch.mmu.base_role.cr4_pae = !!is_pae(vcpu);
vcpu->arch.mmu.base_role.cr0_wp  = is_write_protection(vcpu);
vcpu->arch.mmu.base_role.smep_andnot_wp
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cf512e70..373e17a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -870,8 +870,6 @@ static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
 
kvm_x86_ops->set_efer(vcpu, efer);
 
-   vcpu->arch.mmu.base_role.nxe = (efer & EFER_NX) && !tdp_enabled;
-
/* Update reserved bits */
if ((efer ^ old_efer) & EFER_NX)
kvm_mmu_reset_context(vcpu);
-- 
1.7.10.4



[PATCH 5/6] KVM: MMU: drop superfluous is_present_gpte() check.

2013-01-30 Thread Gleb Natapov
Guest page walker puts only present ptes into the ptes[] array. No need to
check it again.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/paging_tmpl.h |3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index ca69dcc..34c5c99 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -409,9 +409,6 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
unsigned direct_access, access = gw->pt_access;
int top_level, emulate = 0;
 
-   if (!is_present_gpte(gw->ptes[gw->level - 1]))
-   return 0;
-
direct_access = gw->pte_access;
 
top_level = vcpu->arch.mmu.root_level;
-- 
1.7.10.4



[PATCH 1/6] KVM: MMU: make spte_is_locklessly_modifiable() more clear

2013-01-30 Thread Gleb Natapov
spte_is_locklessly_modifiable() checks that both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are present on spte. Make it more explicit.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9f628f7..2fa82b0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -448,7 +448,8 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
 
 static bool spte_is_locklessly_modifiable(u64 spte)
 {
-   return !(~spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE));
+   return (spte & (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE)) ==
+   (SPTE_HOST_WRITEABLE | SPTE_MMU_WRITEABLE);
 }
 
 static bool spte_has_volatile_bits(u64 spte)
-- 
1.7.10.4
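
A small sketch showing that the old and new forms of the check agree, under the
assumption that the mask is just two distinct bits. The DEMO_* bit positions
and function names are hypothetical and chosen only for illustration; they are
not taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, used only for this demonstration. */
#define DEMO_HOST_WRITEABLE (1ULL << 10)
#define DEMO_MMU_WRITEABLE  (1ULL << 11)
#define DEMO_BOTH           (DEMO_HOST_WRITEABLE | DEMO_MMU_WRITEABLE)

/* Old form: true when no bit of the mask is missing from spte. */
bool modifiable_old(uint64_t spte)
{
    return !(~spte & DEMO_BOTH);
}

/* New form: true when masking spte yields the whole mask. */
bool modifiable_new(uint64_t spte)
{
    return (spte & DEMO_BOTH) == DEMO_BOTH;
}

/* Compare both forms over every combination of the two mask bits. */
bool forms_agree(void)
{
    const uint64_t other = 0x5ULL; /* unrelated bits must not matter */
    const uint64_t cases[] = {
        0, DEMO_HOST_WRITEABLE, DEMO_MMU_WRITEABLE, DEMO_BOTH,
    };

    for (size_t i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
        uint64_t spte = cases[i] | other;

        if (modifiable_old(spte) != modifiable_new(spte))
            return false;
    }
    return true;
}
```

Both forms require every mask bit to be set; the second one simply states that
requirement directly instead of through a double negation.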



[PATCH 6/6] Revert KVM: MMU: split kvm_mmu_free_page

2013-01-30 Thread Gleb Natapov
This reverts commit bd4c86eaa6ff10abc4e00d0f45d2a28b10b09df4.

There is no user for kvm_mmu_isolate_page() any more.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |   21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 42ba85c..0242a8a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1461,28 +1461,14 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
*kvm, int nr)
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
 }
 
-/*
- * Remove the sp from shadow page cache, after call it,
- * we can not find this sp from the cache, and the shadow
- * page table is still valid.
- * It should be under the protection of mmu lock.
- */
-static void kvm_mmu_isolate_page(struct kvm_mmu_page *sp)
+static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
 {
ASSERT(is_empty_shadow_page(sp->spt));
hlist_del(&sp->hash_link);
-   if (!sp->role.direct)
-   free_page((unsigned long)sp->gfns);
-}
-
-/*
- * Free the shadow page table and the sp, we can do it
- * out of the protection of mmu lock.
- */
-static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
-{
list_del(&sp->link);
free_page((unsigned long)sp->spt);
+   if (!sp->role.direct)
+   free_page((unsigned long)sp->gfns);
kmem_cache_free(mmu_page_header_cache, sp);
 }
 
@@ -2126,7 +2112,6 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
do {
sp = list_first_entry(invalid_list, struct kvm_mmu_page, link);
WARN_ON(!sp->role.invalid || sp->root_count);
-   kvm_mmu_isolate_page(sp);
kvm_mmu_free_page(sp);
} while (!list_empty(invalid_list));
 }
-- 
1.7.10.4



[PATCH 4/6] KVM: MMU: drop superfluous min() call.

2013-01-30 Thread Gleb Natapov

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8028ac6..42ba85c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3854,7 +3854,7 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu 
*vcpu, gpa_t *gpa,
/* Handle a 32-bit guest writing two halves of a 64-bit gpte */
*gpa &= ~(gpa_t)7;
*bytes = 8;
-   r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, min(*bytes, 8));
+   r = kvm_read_guest(vcpu->kvm, *gpa, &gentry, 8);
if (r)
gentry = 0;
new = (const u8 *)&gentry;
-- 
1.7.10.4
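
A short sketch of why the min() is redundant, assuming the context lines of the
hunk read as an 8-byte alignment of the address followed by setting the length
to 8. The type and function names below are hypothetical stand-ins, not the
kernel's.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_gpa_t; /* stand-in for the kernel's gpa_t */

/* Widen a partial write to the aligned 8-byte entry that contains it. */
void demo_widen_to_gpte(demo_gpa_t *gpa, int *bytes)
{
    *gpa &= ~(demo_gpa_t)7; /* clear the low three address bits */
    *bytes = 8;             /* from here on the length is always 8 */
}

/* Plain minimum, standing in for the kernel's min() macro. */
int demo_min(int a, int b)
{
    return a < b ? a : b;
}
```

Because the length has just been assigned 8, taking the minimum of it and 8
always gives 8, so passing the constant reads the same number of bytes.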



Re: [PATCH 0/6] small cleanups in MMU code

2013-01-30 Thread Asias He
 On Wed, Jan 30, 2013 at 10:42 PM,  y...@redhat.com wrote:

y...@redhat.com ?

 From: Gleb Natapov g...@redhat.com

 Any of those should not change functionality.

 Gleb Natapov (6):
   KVM: MMU: make spte_is_locklessly_modifiable() more clear
   KVM: MMU: drop unneeded checks.
   KVM: MMU: set base_role.nxe during mmu initialization.
   KVM: MMU: drop superfluous min() call.
   KVM: MMU: drop superfluous is_present_gpte() check.
   Revert KVM: MMU: split kvm_mmu_free_page

  arch/x86/kvm/mmu.c |   32 +---
  arch/x86/kvm/paging_tmpl.h |3 ---
  arch/x86/kvm/x86.c |2 --
  3 files changed, 9 insertions(+), 28 deletions(-)

 --
 1.7.10.4




-- 
Asias He


Re: [PATCH 0/6] small cleanups in MMU code

2013-01-30 Thread Gleb Natapov
On Wed, Jan 30, 2013 at 10:46:56PM +0800, Asias He wrote:
  On Wed, Jan 30, 2013 at 10:42 PM,  y...@redhat.com wrote:
 
 y...@redhat.com ?
 
y not?

  From: Gleb Natapov g...@redhat.com
 
  Any of those should not change functionality.
 
  Gleb Natapov (6):
KVM: MMU: make spte_is_locklessly_modifiable() more clear
KVM: MMU: drop unneeded checks.
KVM: MMU: set base_role.nxe during mmu initialization.
KVM: MMU: drop superfluous min() call.
KVM: MMU: drop superfluous is_present_gpte() check.
Revert KVM: MMU: split kvm_mmu_free_page
 
   arch/x86/kvm/mmu.c |   32 +---
   arch/x86/kvm/paging_tmpl.h |3 ---
   arch/x86/kvm/x86.c |2 --
   3 files changed, 9 insertions(+), 28 deletions(-)
 
  --
  1.7.10.4
 
 
 
 
 -- 
 Asias He

--
Gleb.


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Anthony Liguori
Markus Armbruster arm...@redhat.com writes:

 Peter Maydell peter.mayd...@linaro.org writes:

 On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
 Proposal by hpoussin was to move _list_add() code to ISADevice:
 http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

 Concerns:
 * PCI devices (VGA, QXL) register I/O ports as well
   = above patches add dependency on ISABus to machines
  - benh no mac ever had one
   = PCIDevice shouldn't use ISA API with NULL ISADevice
 * Lack of avi: Who decides about memory API these days?

 armbru and agraf concluded that moving this into ISA is wrong.

 = I will drop the remaining ioport patches from above series.

 Suggestions on how to proceed with tackling the issue are welcome.

 How does this stuff work on real hardware? I would have
 expected that a PCI device registering the fact it has
 IO ports would have to do so via the PCI controller it
 is plugged into...

 My naive don't-know-much-about-portio suggestion is that this
 should work the same way as memory regions: each device
 provides portio regions, and the controller for the bus
 (ISA or PCI) exposes those to the next layer up, and
 something at board level maps it all into the right places.

 Makes sense me, but I'm naive, too :)

 For me, I/O ports are just an alternate address space some devices
 have.  For instance, x86 CPUs have an extra pin for selecting I/O
 vs. memory address space.  The ISA bus has separate read/write pins for
 memory and I/O.

 This isn't terribly special.  Mapping address spaces around is what
 devices bridging buses do.

 I'd expect a system bus for an x86 CPU to have both a memory and an I/O
 address space.

There is no such thing as a system bus.

There is a bus that links the CPUs to each other and to the North
Bridge.  This is QPI on modern systems.

Sometimes there's a bus to link the North Bridge to the South Bridge.
On modern systems, this is QPI.  On the i440fx, the i440fx is both the
South Bridge and North Bridge and the link between the two is internal
to the chip.  The South Bridge may then export one or more downstream
interfaces.  In the i440fx, it only exports PCI.

Behind the PCI bus, there may be bridges.  On the i440fx, there is an ISA
Bridge which also acts as a Super I/O chip.  It exposes a downstream ISA
bus.

sysbus is a relic of poor modeling.  A major milestone in QEMU's
evolution will be when sysbus is completely removed.

Regards,

Anthony Liguori


 I'd expect an ISA PC's sysbus - ISA bridge to map both directly.

 I'd expect an ISA bridge for a sysbus without a separate I/O address
 space to map the ISA I/O address space into the sysbus's normal address
 space somehow.

 PCI ISA bridges have their own rules, but I've gotten away with ignoring
 the details so far :)


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Gerd Hoffmann
  Hi,

 hw/qxl.c:portio_list_add(qxl_vga_port_list,
 pci_address_space_io(dev), 0x3b0);
 hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

That reminds me I should solve this in a more elegant way.

qxl takes over the vga io ports.  The reason it does this is because qxl
switches into vga mode in case the vga ports are accessed while not in
vga mode.  After doing the check (and possibly switching mode) the vga
handler is called to actually handle it.
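
The check-then-forward behaviour described above can be sketched roughly as
follows. This is only an illustration of the pattern, assuming a device that
embeds a VGA core and tracks its current mode; every type and function name
here is hypothetical and none of it is taken from hw/qxl.c or hw/vga.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { DEMO_MODE_NATIVE, DEMO_MODE_VGA } DemoMode;

typedef struct DemoVga {
    uint8_t regs[16];
} DemoVga;

typedef struct DemoQxl {
    DemoMode mode;
    int mode_switches; /* counts switches into VGA mode */
    DemoVga vga;       /* embedded VGA core */
} DemoQxl;

void demo_qxl_init(DemoQxl *qxl)
{
    memset(qxl, 0, sizeof(*qxl));
    qxl->mode = DEMO_MODE_NATIVE;
}

/* The plain VGA handler: stores the value in a register slot. */
void demo_vga_ioport_write(DemoVga *vga, uint32_t addr, uint8_t val)
{
    vga->regs[addr & 0xf] = val;
}

/* The wrapper: switch to VGA mode if needed, then forward the access. */
void demo_qxl_vga_ioport_write(DemoQxl *qxl, uint32_t addr, uint8_t val)
{
    if (qxl->mode != DEMO_MODE_VGA) {
        qxl->mode = DEMO_MODE_VGA;
        qxl->mode_switches++;
    }
    demo_vga_ioport_write(&qxl->vga, addr, val);
}
```

The wrapper owns the port range and calls the VGA handler directly, which is
the part that has no obvious equivalent when two separate regions are
registered with different priorities.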

That twist makes it a bit hard to convert vga ...

Does anyone know how one would do that with the memory API instead? I think
taking over the ports is easy as the memory regions have priorities so I
can simply register a region with higher priority. I have no clue how to
forward the access to the vga code though.

Does anyone have clues / suggestions?

thanks,
  Gerd



Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Anthony Liguori
Gerd Hoffmann kra...@redhat.com writes:

   Hi,

 hw/qxl.c:portio_list_add(qxl_vga_port_list,
 pci_address_space_io(dev), 0x3b0);
 hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

 That reminds me I should solve this in a more elegant way.

 qxl takes over the vga io ports.  The reason it does this is because qxl
 switches into vga mode in case the vga ports are accessed while not in
 vga mode.  After doing the check (and possibly switching mode) the vga
 handler is called to actually handle it.

The best way to handle this would be to remodel how we do VGA.

Make VGACommonState a proper QOM object and use it as the base class for
QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.

The VGA accessors should be exposed as a memory region but the sub class
ought to be responsible for actually adding it to a subregion.


 That twist makes it a bit hard to convert vga ...

 Anyone knows how one would do that with the memory api instead? I think
 taking over the ports is easy as the memory regions have priorities so I
 can simply register a region with higher priority. I have no clue how to
 forward the access to the vga code though.


That should be possible with priorities, but I think it's wrong.  There
aren't two VGA devices.  QXL is-a VGA device and the best way to
override behavior of base VGA device is through polymorphism.

This isn't really a memory API issue, it's a modeling issue.

Regards,

Anthony Liguori

 Anyone has clues / suggestions?

 thanks,
   Gerd


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Andreas Färber
Am 30.01.2013 17:33, schrieb Anthony Liguori:
 Gerd Hoffmann kra...@redhat.com writes:
 
 hw/qxl.c:portio_list_add(qxl_vga_port_list,
 pci_address_space_io(dev), 0x3b0);
 hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

 That reminds me I should solve this in a more elegant way.

 qxl takes over the vga io ports.  The reason it does this is because qxl
 switches into vga mode in case the vga ports are accessed while not in
 vga mode.  After doing the check (and possibly switching mode) the vga
 handler is called to actually handle it.
 
 The best way to handle this would be to remodel how we do VGA.
 
 Make VGACommonState a proper QOM object and use it as the base class for
 QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.

That would require polymorphism since we already need to derive from
PCIDevice or ISADevice respectively for interfacing with the bus...
Modern object-oriented languages have tried to avoid multi-inheritance
due to arising complications, I thought. Wouldn't object if someone
wanted to do the dirty implementation work though. ;)

Another such example is EHCI, with PCIDevice and SysBusDevice frontends,
sharing an EHCIState struct and having helper functions operating on
that core state only. Quite a few devices share such a pattern today
actually (serial, m48t59, ...).

 The VGA accessors should be exposed as a memory region but the sub class
 ought to be responsible for actually adding it to a subregion.
 

 That twist makes it a bit hard to convert vga ...

 Anyone knows how one would do that with the memory api instead? I think
 taking over the ports is easy as the memory regions have priorities so I
 can simply register a region with higher priority. I have no clue how to
 forward the access to the vga code though.

 
 That should be possible with priorities, but I think it's wrong.  There
 aren't two VGA devices.  QXL is-a VGA device and the best way to
 override behavior of base VGA device is through polymorphism.

In this particular case QXL is-a PCI VGA device though, so we can
decouple it from core VGA modeling. Placing the MemoryRegionOps inside
the Class (rather than static const) might be a short-term solution for
overriding read/write handlers of a particular VGA MemoryRegion. :)

Cheers,
Andreas

 This isn't really a memory API issue, it's a modeling issue.
 
 Regards,
 
 Anthony Liguori
 
 Anyone has clues / suggestions?

 thanks,
   Gerd

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] What to do about non-qdevified devices?

2013-01-30 Thread Paolo Bonzini
Il 30/01/2013 14:44, Andreas Färber ha scritto:
 I disagree on the or else part. I have been qdev'ifying and QOM'ifying
 devices in my maintenance area, and progress is slow. It gets even
 slower if one leaves clearly maintained areas. I see no good reason to
 force a pistol on someone's breast, like you have done for IDE, unless
 there is a good reason to do so. Currently I don't see any.

The reason for IDE is that it involved devices that are not
SysBusDevices (the IDE disk devices).  Having the same code work in two
ways, one qdevified and one not, is bad.

For simple SysBusDevice you're changing a crappy default to a less bad
one, but there's really little incentive to qdev/QOM-ification.

Paolo


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Paolo Bonzini
Il 30/01/2013 17:33, Anthony Liguori ha scritto:
 Gerd Hoffmann kra...@redhat.com writes:
 
   Hi,

 hw/qxl.c:portio_list_add(qxl_vga_port_list,
 pci_address_space_io(dev), 0x3b0);
 hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

 That reminds me I should solve this in a more elegant way.

 qxl takes over the vga io ports.  The reason it does this is because qxl
 switches into vga mode in case the vga ports are accessed while not in
 vga mode.  After doing the check (and possibly switching mode) the vga
 handler is called to actually handle it.
 
 The best way to handle this would be to remodel how we do VGA.
 
 Make VGACommonState a proper QOM object and use it as the base class for
 QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.

I think QXL should have-a VGA rather than being one.  It completely
bypasses the VGA infrastructure if not in VGA mode.

 The VGA accessors should be exposed as a memory region but the sub class
 ought to be responsible for actually adding it to a subregion.
 

 That twist makes it a bit hard to convert vga ...

 Anyone knows how one would do that with the memory api instead? I think
 taking over the ports is easy as the memory regions have priorities so I
 can simply register a region with higher priority. I have no clue how to
 forward the access to the vga code though.

Avi had a prototype patch series for IOMMU regions.  You could add one
between the QXL device and the VGA.  It doesn't have to do a
translation, but trying to translate a VGA address already means that
you must go to VGA mode.

Paolo

 
 That should be possible with priorities, but I think it's wrong.  There
 aren't two VGA devices.  QXL is-a VGA device and the best way to
 override behavior of base VGA device is through polymorphism.
 
 This isn't really a memory API issue, it's a modeling issue.
 
 Regards,
 
 Anthony Liguori
 
 Anyone has clues / suggestions?

 thanks,
   Gerd
 



Re: [Qemu-devel] What to do about non-qdevified devices?

2013-01-30 Thread Andreas Färber
Am 30.01.2013 17:58, schrieb Paolo Bonzini:
 Il 30/01/2013 14:44, Andreas Färber ha scritto:
 I disagree on the or else part. I have been qdev'ifying and QOM'ifying
 devices in my maintenance area, and progress is slow. It gets even
 slower if one leaves clearly maintained areas. I see no good reason to
 force a pistol on someone's breast, like you have done for IDE, unless
 there is a good reason to do so. Currently I don't see any.
 
 The reason for IDE is that it involved devices that are not
 SysBusDevices (the IDE disk devices).  Having the same code work in two
 ways, one qdevified and one not, is bad.

Sure, I did help with the QOM'ification there. My "Currently I don't see
any [good reason]", by contrast, referred to removing *all* devices that
are not yet qdev/QOM'ified without such a pressing reason.

 For simple SysBusDevice you're changing a crappy default to a less bad
 one, but there's really little incentive to qdev/QOM-ification.

No disagreement. The benefits don't come from doing a conversion, they
come from basing new work on the result of a conversion. :)

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Anthony Liguori
Andreas Färber afaer...@suse.de writes:

 Am 30.01.2013 17:33, schrieb Anthony Liguori:
 Gerd Hoffmann kra...@redhat.com writes:
 
 hw/qxl.c:portio_list_add(qxl_vga_port_list,
 pci_address_space_io(dev), 0x3b0);
 hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

 That reminds me I should solve this in a more elegant way.

 qxl takes over the vga io ports.  The reason it does this is because qxl
 switches into vga mode in case the vga ports are accessed while not in
 vga mode.  After doing the check (and possibly switching mode) the vga
 handler is called to actually handle it.
 
 The best way to handle this would be to remodel how we do VGA.
 
 Make VGACommonState a proper QOM object and use it as the base class for
 QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.

 That would require polymorphism since we already need to derive from
 PCIDevice or ISADevice respectively for interfacing with the bus...

Nope.  You can use composition:

QXLDevice is-a VGACommonState

QXLPCI is-a PCIDevice
   has-a QXLDevice

 Modern object-oriented languages have tried to avoid multi-inheritence
 due to arising complications, I thought. Wouldn't object if someone
 wanted to do the dirty implementation work though. ;)

There is no need for MI.

 Another such example is EHCI, with PCIDevice and SysBusDevice frontends,
 sharing an EHCIState struct and having helper functions operating on
 that core state only. Quite a few device share such a pattern today
 actually (serial, m48t59, ...).

Yes, this is all about chipset modelling.  Chipsets should derive from
device and then be embedded in the appropriate bus device.

For instance.

SerialState is-a DeviceState

ISASerialState is-a ISADevice, has-a SerialState
MMIOSerialState is-a SysbusDevice, has-a SerialState

This is what we're doing in practice, we just aren't modeling the
chipsets and we're open coding the relationships (often in subtly
different ways).
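
As a rough illustration of the is-a / has-a split above, here is a sketch in
plain C, where "is-a" is expressed by embedding the parent as the first struct
member and "has-a" by embedding the core as an ordinary member. All type and
function names are hypothetical stand-ins and do not match QEMU's real structs
or the QOM API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoDevice {
    const char *id;
} DemoDevice;

/* Chip core: is-a device, knows nothing about any bus. */
typedef struct DemoSerial {
    DemoDevice parent;
    uint8_t last_byte;
} DemoSerial;

typedef struct DemoISADevice {
    DemoDevice parent;
    uint32_t iobase;
} DemoISADevice;

/* Bus frontend: is-a ISA device, has-a serial core. */
typedef struct DemoISASerial {
    DemoISADevice parent;
    DemoSerial serial;
} DemoISASerial;

/* Core logic is written once against the core state only. */
void demo_serial_write(DemoSerial *s, uint8_t val)
{
    s->last_byte = val;
}

void demo_isa_serial_init(DemoISASerial *isa, uint32_t iobase)
{
    isa->parent.parent.id = "isa-serial";
    isa->parent.iobase = iobase;
    isa->serial.parent.id = "serial-core";
    isa->serial.last_byte = 0;
}

/* The frontend handler only forwards to the embedded core. */
void demo_isa_serial_write(DemoISASerial *isa, uint8_t val)
{
    demo_serial_write(&isa->serial, val);
}
```

A memory-mapped frontend would embed the same DemoSerial next to a different
parent type, so no multiple inheritance is needed.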

Regards,

Anthony Liguori

 The VGA accessors should be exposed as a memory region but the sub class
 ought to be responsible for actually adding it to a subregion.
 

 That twist makes it a bit hard to convert vga ...

 Anyone knows how one would do that with the memory api instead? I think
 taking over the ports is easy as the memory regions have priorities so I
 can simply register a region with higher priority. I have no clue how to
 forward the access to the vga code though.

 
 That should be possible with priorities, but I think it's wrong.  There
 aren't two VGA devices.  QXL is-a VGA device and the best way to
 override behavior of base VGA device is through polymorphism.

 In this particular case QXL is-a PCI VGA device though, so we can
 decouple it from core VGA modeling. Placing the MemoryRegionOps inside
 the Class (rather than static const) might be a short-term solution for
 overriding read/write handlers of a particular VGA MemoryRegion. :)

 Cheers,
 Andreas

 This isn't really a memory API issue, it's a modeling issue.
 
 Regards,
 
 Anthony Liguori
 
 Anyone has clues / suggestions?

 thanks,
   Gerd

 -- 
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Andreas Färber
Am 30.01.2013 12:48, schrieb Peter Maydell:
 On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
 Proposal by hpoussin was to move _list_add() code to ISADevice:
 http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

 Concerns:
 * PCI devices (VGA, QXL) register I/O ports as well
   = above patches add dependency on ISABus to machines
  - benh no mac ever had one
   = PCIDevice shouldn't use ISA API with NULL ISADevice
 * Lack of avi: Who decides about memory API these days?

 armbru and agraf concluded that moving this into ISA is wrong.

 = I will drop the remaining ioport patches from above series.

 Suggestions on how to proceed with tackling the issue are welcome.
 
 How does this stuff work on real hardware? I would have
 expected that a PCI device registering the fact it has
 IO ports would have to do so via the PCI controller it
 is plugged into...
 
 My naive don't-know-much-about-portio suggestion is that this
 should work the same way as memory regions: each device
 provides portio regions,

One remark on "same way as memory regions", though I don't know all the gory
hardware details myself.

PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO
device you would have a continuous region from say 0xa000 to
0xa007 inclusive and within that region you have some kind of sparse
registers. With ISA ports you often have dense overlapping ranges, say,
0x3-0x6 byte-reads foo, while 0x4 word-write does bar.
This is handled by having lists of (offset, length, size, handler)
quadruplets and consolidating those into MemoryRegions and aliases (cf.
patches) that then have a validation function to check whether a
particular access is valid and by whom it should be handled - that's
what MemoryRegionPortio[] and similar APIs are good for.

So yes, it might be possible to have a device declare its ports at
PCIDevice or DeviceState level, but it can't be directly passed through
to MemoryRegion API in most cases, or conflicts would arise. At least
that was my experience with PReP.
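
The lookup over (offset, length, size, handler) entries can be sketched like
this. It is a simplified illustration, assuming read handlers only and a
linear search; the struct, table, and function names are hypothetical and are
not the MemoryRegionPortio API itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NO_HANDLER 0xffffffffu

typedef uint32_t (*demo_port_read_fn)(uint32_t port);

typedef struct DemoPortio {
    uint32_t offset;        /* first port covered by this entry */
    uint32_t len;           /* number of consecutive ports covered */
    unsigned size;          /* access width in bytes this entry accepts */
    demo_port_read_fn read; /* handler for matching accesses */
} DemoPortio;

uint32_t demo_read_foo(uint32_t port) { (void)port; return 0xf0; }
uint32_t demo_read_bar(uint32_t port) { (void)port; return 0xba; }

/* Overlapping ranges: bytes at 0x3..0x6 go to foo, a word at 0x4 to bar. */
static const DemoPortio demo_ports[] = {
    { 0x3, 4, 1, demo_read_foo },
    { 0x4, 1, 2, demo_read_bar },
};

/* Pick the handler by both port number and access width. */
uint32_t demo_port_read(uint32_t port, unsigned size)
{
    for (size_t i = 0; i < sizeof(demo_ports) / sizeof(demo_ports[0]); i++) {
        const DemoPortio *p = &demo_ports[i];

        if (size == p->size && port >= p->offset &&
            port < p->offset + p->len)
            return p->read(port);
    }
    return DEMO_NO_HANDLER; /* access is not valid for this device */
}
```

Port 0x4 appears in both entries, so the access width is what decides which
handler runs; a plain address-range mapping cannot express that on its own.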

Andreas

 and the controller for the bus
 (ISA or PCI) exposes those to the next layer up, and
 something at board level maps it all into the right places.
 
 -- PMM

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Michael S. Tsirkin
On Wed, Jan 30, 2013 at 11:29:58AM -0600, Anthony Liguori wrote:
 Andreas Färber afaer...@suse.de writes:
 
  Am 30.01.2013 17:33, schrieb Anthony Liguori:
  Gerd Hoffmann kra...@redhat.com writes:
  
  hw/qxl.c:portio_list_add(qxl_vga_port_list,
  pci_address_space_io(dev), 0x3b0);
  hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);
 
  That reminds me I should solve this in a more elegant way.
 
  qxl takes over the vga io ports.  The reason it does this is because qxl
  switches into vga mode in case the vga ports are accessed while not in
  vga mode.  After doing the check (and possibly switching mode) the vga
  handler is called to actually handle it.
  
  The best way to handle this would be to remodel how we do VGA.
  
  Make VGACommonState a proper QOM object and use it as the base class for
  QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.
 
  That would require polymorphism since we already need to derive from
  PCIDevice or ISADevice respectively for interfacing with the bus...
 
 Nope.  You can use composition:
 
 QXLDevice is-a VGACommonState
 
 QXLPCI is-a PCIDevice
has-a QXLDevice

But why like this?
The distinction is artificial, isn't it?

  Modern object-oriented languages have tried to avoid multi-inheritence
  due to arising complications, I thought. Wouldn't object if someone
  wanted to do the dirty implementation work though. ;)
 
 There is no need for MI.
 
  Another such example is EHCI, with PCIDevice and SysBusDevice frontends,
  sharing an EHCIState struct and having helper functions operating on
  that core state only. Quite a few device share such a pattern today
  actually (serial, m48t59, ...).
 
 Yes, this is all about chipset modelling.  Chipsets should derive from
 device and then be embedded in the appropriate bus device.
 
 For instance.
 
 SerialState is-a DeviceState
 
 ISASerialState is-a ISADevice, has-a SerialState
 MMIOSerialState is-a SysbusDevice, has-a SerialState

ISASerialState is not a SerialState?
Hmm but why?

 This is what we're doing in practice, we just aren't modeling the
 chipsets and we're open coding the relationships (often in subtley
 different ways).
 
 Regards,
 
 Anthony Liguori
 
  The VGA accessors should be exposed as a memory region but the sub class
  ought to be responsible for actually adding it to a subregion.
  
 
  That twist makes it a bit hard to convert vga ...
 
  Anyone knows how one would do that with the memory api instead? I think
  taking over the ports is easy as the memory regions have priorities so I
  can simply register a region with higher priority. I have no clue how to
  forward the access to the vga code though.
 
  
  That should be possible with priorities, but I think it's wrong.  There
  aren't two VGA devices.  QXL is-a VGA device and the best way to
  override behavior of base VGA device is through polymorphism.
 
  In this particular case QXL is-a PCI VGA device though, so we can
  decouple it from core VGA modeling. Placing the MemoryRegionOps inside
  the Class (rather than static const) might be a short-term solution for
  overriding read/write handlers of a particular VGA MemoryRegion. :)
 
  Cheers,
  Andreas
 
  This isn't really a memory API issue, it's a modeling issue.
  
  Regards,
  
  Anthony Liguori
  
  Anyone has clues / suggestions?
 
  thanks,
Gerd
 
  -- 
  SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
  GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Michael S. Tsirkin
On Wed, Jan 30, 2013 at 06:55:47PM +0100, Andreas Färber wrote:
 On 30.01.2013 12:48, Peter Maydell wrote:
  On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
  Proposal by hpoussin was to move _list_add() code to ISADevice:
  http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
 
  Concerns:
  * PCI devices (VGA, QXL) register I/O ports as well
    => above patches add dependency on ISABus to machines
       - benh no mac ever had one
    => PCIDevice shouldn't use ISA API with NULL ISADevice
  * Lack of avi: Who decides about memory API these days?
 
  armbru and agraf concluded that moving this into ISA is wrong.
 
  => I will drop the remaining ioport patches from above series.
 
  Suggestions on how to proceed with tackling the issue are welcome.
  
  How does this stuff work on real hardware? I would have
  expected that a PCI device registering the fact it has
  IO ports would have to do so via the PCI controller it
  is plugged into...
  
  My naive don't-know-much-about-portio suggestion is that this
  should work the same way as memory regions: each device
  provides portio regions,
 
 One remark on same way as memory regions, me not knowing all the gory
 hardware details myself.
 
 PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO
 device you would have a continuous region from say 0xa000 to
 0xa007 inclusive and within that region you have some kind of sparse
 registers. With ISA ports you often have dense overlapping ranges, say,
 0x3-0x6 byte-reads foo, while 0x4 word-write does bar.

Hmm, on x86 this is what happens with the cf8..cfb range registers, for example.
We currently plan to handle this using memory region priorities.
The same would work for PReP, wouldn't it?

 This is handled by having lists of (offset, length, size, handler)
 quadruplets and consolidating those into MemoryRegions and aliases (cf.
 patches) that then have a validation function to check whether a
 particular access is valid and by whom it should be handled - that's
 what MemoryRegionPortio[] and similar APIs are good for.
 
 So yes, it might be possible to have a device declare its ports at
 PCIDevice or DeviceState level, but it can't be directly passed through
 to MemoryRegion API in most cases, or conflicts would arise. At least
 that was my experience with PReP.
 
 Andreas
 
  and the controller for the bus
  (ISA or PCI) exposes those to the next layer up, and
  something at board level maps it all into the right places.
  
  -- PMM
 
 -- 
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Andreas Färber
On 30.01.2013 18:29, Anthony Liguori wrote:
 Andreas Färber afaer...@suse.de writes:
 
 On 30.01.2013 17:33, Anthony Liguori wrote:
 Gerd Hoffmann kra...@redhat.com writes:

 hw/qxl.c:portio_list_add(qxl_vga_port_list,
 pci_address_space_io(dev), 0x3b0);
 hw/vga.c:portio_list_add(vga_port_list, address_space_io, 0x3b0);

 That reminds me I should solve this in a more elegant way.

 qxl takes over the vga io ports.  The reason it does this is because qxl
 switches into vga mode in case the vga ports are accessed while not in
 vga mode.  After doing the check (and possibly switching mode) the vga
 handler is called to actually handle it.

 The best way to handle this would be to remodel how we do VGA.

 Make VGACommonState a proper QOM object and use it as the base class for
 QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.

 That would require polymorphism since we already need to derive from
 PCIDevice or ISADevice respectively for interfacing with the bus...
 
 Nope.  You can use composition:
 
 QXLDevice is-a VGACommonState
 
 QXLPCI is-a PCIDevice
has-a QXLDevice
 
 Modern object-oriented languages have tried to avoid multi-inheritance
 due to arising complications, I thought. Wouldn't object if someone
 wanted to do the dirty implementation work though. ;)
 
 There is no need for MI.
 
 Another such example is EHCI, with PCIDevice and SysBusDevice frontends,
 sharing an EHCIState struct and having helper functions operating on
 that core state only. Quite a few devices share such a pattern today
 actually (serial, m48t59, ...).
 
 Yes, this is all about chipset modelling.  Chipsets should derive from
 device and then be embedded in the appropriate bus device.
 
 For instance.
 
 SerialState is-a DeviceState
 
 ISASerialState is-a ISADevice, has-a SerialState
 MMIOSerialState is-a SysbusDevice, has-a SerialState
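
A minimal C sketch of the is-a / has-a layout listed above. The struct and function names are hypothetical stand-ins chosen for illustration, not QEMU's actual QOM types; "is-a" is modelled by embedding the parent as the first member and "has-a" by embedding the core state as an ordinary member.

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not QEMU's real definitions. */
typedef struct DeviceState  { const char *id; } DeviceState;
typedef struct ISADevice    { DeviceState parent; uint32_t iobase; } ISADevice;
typedef struct SysBusDevice { DeviceState parent; uint64_t mmio_base; } SysBusDevice;

/* SerialState is-a DeviceState: the parent is embedded as the first member. */
typedef struct SerialState {
    DeviceState parent;
    uint8_t regs[8];
} SerialState;

/* Bus front ends: is-a bus device, has-a SerialState. */
typedef struct ISASerialState {
    ISADevice parent;
    SerialState core;
} ISASerialState;

typedef struct MMIOSerialState {
    SysBusDevice parent;
    SerialState core;
} MMIOSerialState;

/* Core helpers know nothing about the bus they are reached through. */
static void serial_write(SerialState *s, unsigned reg, uint8_t val)
{
    s->regs[reg & 7] = val;
}

static uint8_t serial_read(const SerialState *s, unsigned reg)
{
    return s->regs[reg & 7];
}

/* Each front end only translates its bus address into a register index. */
static void isa_serial_write(ISASerialState *isa, uint32_t port, uint8_t val)
{
    serial_write(&isa->core, port - isa->parent.iobase, val);
}

static void mmio_serial_write(MMIOSerialState *m, uint64_t addr, uint8_t val)
{
    serial_write(&m->core, (unsigned)(addr - m->parent.mmio_base), val);
}
```

In this layout each front end carries two DeviceState instances, one through its bus parent and one through the embedded core, which is the property the reply below objects to.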

Okay, but I don't like that both are transitively DeviceState then.
It's much too easy to add / hot-add the wrong device then, especially
when dropping no_user.

Andreas

 This is what we're doing in practice, we just aren't modeling the
 chipsets and we're open coding the relationships (often in subtly
 different ways).
 
 Regards,
 
 Anthony Liguori


-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Peter Maydell
On 30 January 2013 20:08, Michael S. Tsirkin m...@redhat.com wrote:
 Anthony wrote:
 Nope.  You can use composition:

 QXLDevice is-a VGACommonState

 QXLPCI is-a PCIDevice
has-a QXLDevice

 But why like this?
 The distinction is artificial, isn't it?

I think it's the wrong way round. QXLPCI should has-a PCI interface
(the physical card possesses an edge connector which fits a PCI
socket; it is not the case that the physical card is a kind of
edge connector). Having PCI card models inherit from PCIDevice
is just a convenient (but misleading) shortcut, and that is what
we should drop if it turns out that we should be inheriting
from some other class.

Or you could make them both has-a; I don't know enough about
QXLDevice to know if it should be is-a or has-a.

-- PMM


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Andreas Färber
On 30.01.2013 21:20, Michael S. Tsirkin wrote:
 On Wed, Jan 30, 2013 at 06:55:47PM +0100, Andreas Färber wrote:
 On 30.01.2013 12:48, Peter Maydell wrote:
 On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
 Proposal by hpoussin was to move _list_add() code to ISADevice:
 http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html

 Concerns:
 * PCI devices (VGA, QXL) register I/O ports as well
   => above patches add dependency on ISABus to machines
      - benh no mac ever had one
   => PCIDevice shouldn't use ISA API with NULL ISADevice
 * Lack of avi: Who decides about memory API these days?

 armbru and agraf concluded that moving this into ISA is wrong.

 => I will drop the remaining ioport patches from above series.

 Suggestions on how to proceed with tackling the issue are welcome.

 How does this stuff work on real hardware? I would have
 expected that a PCI device registering the fact it has
 IO ports would have to do so via the PCI controller it
 is plugged into...

 My naive don't-know-much-about-portio suggestion is that this
 should work the same way as memory regions: each device
 provides portio regions,

 One remark on same way as memory regions, me not knowing all the gory
 hardware details myself.

 PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO
 device you would have a continuous region from say 0xa000 to
 0xa007 inclusive and within that region you have some kind of sparse
 registers. With ISA ports you often have dense overlapping ranges, say,
 0x3-0x6 byte-reads foo, while 0x4 word-write does bar.
 
 Hmm on x86 this is what happens with cf8..cfb range registers for example.
 We currently plan to handle this using memory region priorities.
 The same would work for PReP, wouldn't it?

Hm, my point was that iiuc a MemoryRegion is per-address-range whereas
for I/O ports we seem to have per-data-width mappings.

Priorities would allow us to say:

0x1-0xff  is one region
0x8-0xab  is a region with higher priority

but fallback for, e.g., word-access at 0xa0 to the lower-priority region
being unsupported today, no? I.e., the region being opaque.

Having said that, for the purposes of this discussion PReP is pretty
much a PC with a PowerPC CPU in it, unlike the modern CHRP machines.

Andreas

 This is handled by having lists of (offset, length, size, handler)
 quadruplets and consolidating those into MemoryRegions and aliases (cf.
 patches) that then have a validation function to check whether a
 particular access is valid and by whom it should be handled - that's
 what MemoryRegionPortio[] and similar APIs are good for.

 So yes, it might be possible to have a device declare its ports at
 PCIDevice or DeviceState level, but it can't be directly passed through
 to MemoryRegion API in most cases, or conflicts would arise. At least
 that was my experience with PReP.
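
A rough sketch of what a list of (offset, length, size, handler) quadruplets and its lookup could look like, using the "0x3-0x6 byte access does foo, 0x4 word access does bar" example from above. All names here are hypothetical and the code does not reproduce QEMU's MemoryRegionPortio API; it only shows why a lookup has to match on access width as well as on address.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*PortHandler)(uint32_t addr);

/* One quadruplet: (offset, length, size, handler). Hypothetical layout. */
typedef struct PortioEntry {
    uint32_t offset;      /* first port, relative to the list's base */
    uint32_t len;         /* number of consecutive port addresses covered */
    unsigned size;        /* access width in bytes: 1, 2 or 4 */
    PortHandler handler;  /* called when address and width both match */
} PortioEntry;

/* Example handlers standing in for "foo" and "bar". */
static uint32_t foo(uint32_t addr) { return 0xf00 + addr; }
static uint32_t bar(uint32_t addr) { return 0xba0 + addr; }

static const PortioEntry example_list[] = {
    { 0x3, 4, 1, foo },   /* ports 0x3..0x6, byte access */
    { 0x4, 1, 2, bar },   /* port 0x4, word access */
};

/* Return the entry that handles this access, or NULL if none does. */
static const PortioEntry *portio_find(const PortioEntry *list, size_t n,
                                      uint32_t base, uint32_t addr,
                                      unsigned size)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t start = base + list[i].offset;

        if (addr >= start && addr - start < list[i].len &&
            size == list[i].size) {
            return &list[i];
        }
    }
    return NULL;
}
```

The same port 0x4 resolves to a different handler depending on the access width, which is the case that a purely address-based mapping cannot express on its own.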

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Michael S. Tsirkin
On Wed, Jan 30, 2013 at 09:33:05PM +0100, Andreas Färber wrote:
 On 30.01.2013 21:20, Michael S. Tsirkin wrote:
  On Wed, Jan 30, 2013 at 06:55:47PM +0100, Andreas Färber wrote:
  On 30.01.2013 12:48, Peter Maydell wrote:
  On 30 January 2013 11:39, Andreas Färber afaer...@suse.de wrote:
  Proposal by hpoussin was to move _list_add() code to ISADevice:
  http://lists.gnu.org/archive/html/qemu-devel/2013-01/msg00508.html
 
  Concerns:
  * PCI devices (VGA, QXL) register I/O ports as well
    => above patches add dependency on ISABus to machines
       - benh no mac ever had one
    => PCIDevice shouldn't use ISA API with NULL ISADevice
  * Lack of avi: Who decides about memory API these days?
 
  armbru and agraf concluded that moving this into ISA is wrong.
 
  => I will drop the remaining ioport patches from above series.
 
  Suggestions on how to proceed with tackling the issue are welcome.
 
  How does this stuff work on real hardware? I would have
  expected that a PCI device registering the fact it has
  IO ports would have to do so via the PCI controller it
  is plugged into...
 
  My naive don't-know-much-about-portio suggestion is that this
  should work the same way as memory regions: each device
  provides portio regions,
 
  One remark on same way as memory regions, me not knowing all the gory
  hardware details myself.
 
  PIO often contradicts the normal MemoryRegion usage. I.e., for an MMIO
  device you would have a continuous region from say 0xa000 to
  0xa007 inclusive and within that region you have some kind of sparse
  registers. With ISA ports you often have dense overlapping ranges, say,
  0x3-0x6 byte-reads foo, while 0x4 word-write does bar.
  
  Hmm on x86 this is what happens with cf8..cfb range registers for example.
  We currently plan to handle this using memory region priorities.
  The same would work for PReP, wouldn't it?
 
 Hm, my point was that iiuc a MemoryRegion is per-address-range whereas
 for I/O ports we seem to have per-data-width mappings.
 Priorities would allow us to say:
 
 0x1-0xff  is one region
 0x8-0xab  is a region with higher priority
 
 but fallback for, e.g., word-access at 0xa0 to the lower-priority region
 being unsupported today, no? I.e., the region being opaque.

No, MemoryRegion takes data width into account too.
See 'PIIX3: reset the VM when the Reset Control Register's RCPU bit gets
set' as one example.
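
To make the point under discussion concrete, here is a toy resolver in which each region advertises the access widths it accepts, so an access of an unsupported width falls through to a lower-priority region. This is only a model with hypothetical names, using the two ranges from the quoted example; it is not QEMU's memory API and does not claim to reproduce its dispatch rules.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy region: an inclusive address range, a priority and accepted widths. */
typedef struct Region {
    const char *name;
    uint32_t first, last;         /* inclusive address range */
    int priority;                 /* higher value wins on overlap */
    unsigned min_size, max_size;  /* accepted access widths in bytes */
} Region;

static const Region example_regions[] = {
    { "low",  0x1, 0xff, 0, 1, 4 },  /* accepts byte, word and dword */
    { "high", 0x8, 0xab, 1, 1, 1 },  /* higher priority, byte access only */
};

/*
 * Pick the highest-priority region that covers addr AND accepts the
 * access width. Regions that reject the width are skipped, so they are
 * not opaque to accesses they cannot handle.
 */
static const Region *resolve(const Region *r, size_t n,
                             uint32_t addr, unsigned size)
{
    const Region *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (addr < r[i].first || addr > r[i].last) {
            continue;
        }
        if (size < r[i].min_size || size > r[i].max_size) {
            continue;
        }
        if (best == NULL || r[i].priority > best->priority) {
            best = &r[i];
        }
    }
    return best;
}
```

With these two regions, a byte access at 0xa0 resolves to "high" while a word access at the same address resolves to "low".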

 
 Having said that, for the purposes of this discussion PReP is pretty
 much a PC with a PowerPC CPU in it, unlike the modern CHRP machines.
 
 Andreas
 
  This is handled by having lists of (offset, length, size, handler)
  quadruplets and consolidating those into MemoryRegions and aliases (cf.
  patches) that then have a validation function to check whether a
  particular access is valid and by whom it should be handled - that's
  what MemoryRegionPortio[] and similar APIs are good for.
 
  So yes, it might be possible to have a device declare its ports at
  PCIDevice or DeviceState level, but it can't be directly passed through
  to MemoryRegion API in most cases, or conflicts would arise. At least
  that was my experience with PReP.
 
 -- 
 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
 GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Benjamin Herrenschmidt
On Wed, 2013-01-30 at 07:59 -0600, Anthony Liguori wrote:
 An x86 CPU has a MMIO capability that's essentially 65 bits.  Whether
 the top bit is set determines whether it's a PIO transaction or an
 MMIO transaction.  A large chunk of that address space is invalid of
 course.
 
 PCI has a 65 bit address space too.  The 65th bit determines whether
 it's an IO transaction or an MMIO transaction.

This is somewhat of an oversimplification, since IO and MMIO differ in
other ways, such as ordering rules :-) But for the sake of memory
region decoding I suppose it will do.

 For architectures that only have a 64-bit address space, what the PCI
 controller typically does is pick a 16-bit window within that address
 space to map to a PCI address with the 65th bit set.

Sort-of yes. The window doesn't have to be 16-bit (we commonly have
larger IO space windows on powerpc) and there's a window per host
bridge, so there's effectively more than one IO space (as there is more
than one PCI MMIO space, with only a window off the CPU space routed to
each bridge).

Making a hard wired assumption that the PCI (MMIO and IO) space relates
directly to the CPU bus space is wrong on pretty much all !x86
architectures.
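
As a rough illustration of the per-host-bridge windows described above, the following sketch translates a CPU address into a (bridge, PCI IO address) pair. The struct, the function names and the window addresses used in any example are assumptions made for illustration; no real platform layout is implied.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one host bridge's IO window. */
typedef struct IOWindow {
    uint64_t cpu_base;     /* start of the window in CPU address space */
    uint64_t size;         /* window size in bytes; need not be 64 KiB */
    uint32_t pci_io_base;  /* PCI IO address the window's first byte maps to */
} IOWindow;

/* Translate cpu_addr through one window; false if it lies outside. */
static bool io_window_translate(const IOWindow *w, uint64_t cpu_addr,
                                uint32_t *pci_io)
{
    if (cpu_addr < w->cpu_base || cpu_addr - w->cpu_base >= w->size) {
        return false;
    }
    *pci_io = w->pci_io_base + (uint32_t)(cpu_addr - w->cpu_base);
    return true;
}

/*
 * One window per host bridge means one IO space per host bridge.
 * Returns the index of the bridge whose window contains cpu_addr,
 * or -1 if the address is not in any IO window.
 */
static int find_bridge(const IOWindow *w, size_t n, uint64_t cpu_addr,
                       uint32_t *pci_io)
{
    for (size_t i = 0; i < n; i++) {
        if (io_window_translate(&w[i], cpu_addr, pci_io)) {
            return (int)i;
        }
    }
    return -1;
}
```

Two different CPU addresses can translate to the same PCI IO port number behind different bridges, which is why a single global port IO space does not describe such machines.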

 .../...

You make it sound like subtractive decode is a chipset hack. It's not,
it's specified in the PCI spec.

 1) A chipset will route any non-positively decoded IO transaction (65th
    bit set) to a single end point (usually the ISA-bridge).  Which one it
    chooses is up to the chipset.  This is called subtractive decoding
    because the PCI bus will wait multiple cycles for that device to
    claim the transaction before bouncing it.

This is not a chipset matter. It's the ISA bridge itself that does
subtractive decoding. There also exist P2P bridges doing such subtractive
decoding; this used to be fairly common with transparent bridges used for
laptop docking.
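
A small model of positive versus subtractive decoding as described here: every agent on the bus gets a chance to claim the address, and the subtractive agent takes whatever nobody claimed. The names and the example address ranges are illustrative assumptions, and the loop stands in for what on real hardware is a matter of bus timing rather than iteration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bus agent: either positively decodes a range or is subtractive. */
typedef struct BusAgent {
    const char *name;
    uint32_t first, last;  /* positively decoded IO range, inclusive */
    bool subtractive;      /* claims what no positive decoder claimed */
} BusAgent;

/* Illustrative legacy ranges only. */
static const BusAgent example_bus[] = {
    { "vga",        0x3b0, 0x3df, false },
    { "ide",        0x1f0, 0x1f7, false },
    { "isa-bridge", 0,     0,     true  },
};

/* Positive decoders win; otherwise the subtractive agent, if any, claims. */
static const BusAgent *decode_io(const BusAgent *a, size_t n, uint32_t addr)
{
    const BusAgent *fallback = NULL;

    for (size_t i = 0; i < n; i++) {
        if (a[i].subtractive) {
            if (fallback == NULL) {
                fallback = &a[i];
            }
            continue;
        }
        if (addr >= a[i].first && addr <= a[i].last) {
            return &a[i];
        }
    }
    return fallback;  /* NULL means nobody claimed the transaction */
}
```

Note that the decision lives in the agents themselves: nothing upstream of the bus routes the transaction to the bridge.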

 2) There are special hacks in most PCI chipsets to route very specific
    address ranges to certain devices.  Namely, legacy VGA IO transactions
    go to the first VGA device.  Legacy IDE IO transactions go to the first
    IDE device.  This doesn't need to be programmed in the BARs.  It will
    just happen.

This is also mostly not a hack in the chipset. It's a well defined behaviour
for legacy devices, sometimes called hard decoding. Of course often those devices
are built into the chipset but they don't have to be. Plug-in VGA devices will
hard decode legacy VGA regions for both IO and MMIO by default (this can be
disabled on most of them nowadays) for example. This has nothing to do with
the chipset.

There's a specific bit in P2P bridges to control the forwarding of legacy
transactions downstream (and VGA palette snoops); this is also fully specified
in the PCI spec.

 3) As it turns out, all legacy PIIX3 devices are positively decoded and
    sent to the ISA-bridge (because it's faster this way).

Chipsets don't send to a bridge. It's the bridge itself that decodes.

 Notice the lack of the word ISA in all of this other than describing
 the PCI class of an end point.

ISA is only relevant to the extent that the legacy regions of IO space
originate from the original ISA addresses of devices (VGA, IDE, etc...)
and to the extent that an ISA bus might still be present which will get
the transactions that nothing else has decoded in that space.
 
 So how should this be modeled?
 
 On x86, the CPU has a pio address space.  That can propagate down
 through the PCI bus which is what we do today.
 
 On !x86, the PCI controller ought to set up a MemoryRegion for downstream
 PIO that devices can use to register on.
 
 We probably need to do something like change the PCI VGA devices to
 export a MemoryRegion and allow the PCI controller to decide how to
 register that as a subregion.

The VGA device should just register fixed address port IOs the same way
it would register an IO BAR. Essentially, hard coded IO addresses (or
memory, VGA does memory too, don't forget that) are equivalent to having
an invisible BAR with a fixed value in it.

There should be no global port IO because that concept is broken on
real multi-domain setups. Those legacy address ranges are just
hard-wired sub regions of the normal PCI space in which the device sits
(unless you start doing real non-PCI ISA x86).

Cheers,
Ben.




Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Benjamin Herrenschmidt
On Wed, 2013-01-30 at 17:54 +0100, Andreas Färber wrote:
 
 That would require polymorphism since we already need to derive from
 PCIDevice or ISADevice respectively for interfacing with the bus...
 Modern object-oriented languages have tried to avoid multi-inheritance
 due to arising complications, I thought. Wouldn't object if someone
 wanted to do the dirty implementation work though. ;)
 
 Another such example is EHCI, with PCIDevice and SysBusDevice frontends,
 sharing an EHCIState struct and having helper functions operating on
 that core state only. Quite a few devices share such a pattern today
 actually (serial, m48t59, ...).

This is a design bug of your model :-) You shouldn't derive from your
bus interface IMHO but from your functional interface, and have an
ownership relation to the PCIDevice (a bit like IOKit does if my memory
serves me well).

Cheers,
Ben.




Re: [Qemu-devel] KVM call minutes 2013-01-29 - Port I/O

2013-01-30 Thread Benjamin Herrenschmidt
On Wed, 2013-01-30 at 18:08 +0100, Paolo Bonzini wrote:
  Make VGACommonState a proper QOM object and use it as the base class for
  QXL, CirrusVGA, QEMUVGA (std-vga), and VMwareVGA.
 
 I think QXL should have-a VGA rather than being one.  It completely
 bypasses the VGA infrastructure if not in VGA mode.

 ... Like any modern video card the minute you turn off the enable
legacy crap bit on them :-)

Cheers,
Ben.




Re: windows 2008 guest causing rcu_shed to emit NMI

2013-01-30 Thread Marcelo Tosatti
On Wed, Jan 30, 2013 at 11:21:08AM +0300, Andrey Korolyov wrote:
 On Wed, Jan 30, 2013 at 3:15 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Tue, Jan 29, 2013 at 02:35:02AM +0300, Andrey Korolyov wrote:
  On Mon, Jan 28, 2013 at 5:56 PM, Andrey Korolyov and...@xdel.ru wrote:
   On Mon, Jan 28, 2013 at 3:14 AM, Marcelo Tosatti mtosa...@redhat.com 
   wrote:
   On Mon, Jan 28, 2013 at 12:04:50AM +0300, Andrey Korolyov wrote:
   On Sat, Jan 26, 2013 at 12:49 AM, Marcelo Tosatti 
   mtosa...@redhat.com wrote:
On Fri, Jan 25, 2013 at 10:45:02AM +0300, Andrey Korolyov wrote:
On Thu, Jan 24, 2013 at 4:20 PM, Marcelo Tosatti 
mtosa...@redhat.com wrote:
 On Thu, Jan 24, 2013 at 01:54:03PM +0300, Andrey Korolyov wrote:
 Thank you Marcelo,

 Host node locking up some time later than yesterday, but the problem is still
 here; please see the attached dmesg. The stuck process looks like
 root 19251  0.0  0.0 228476 12488 ?D14:42   0:00
 /usr/bin/kvm -no-user-config -device ? -device pci-assign,? 
 -device
 virtio-blk-pci,? -device

 on fourth vm by count.

 Should I try upstream kernel instead of applying patch to the 
 latest
 3.4 or it is useless?

 If you can upgrade to an upstream kernel, please do that.

   
With vanilla 3.7.4 there is almost no changes, and NMI started 
firing
again. External symptoms looks like following: starting from some
count, may be third or sixth vm, qemu-kvm process allocating its
memory very slowly and by jumps, 20M-200M-700M-1.6G in minutes. 
Patch
helps, of course - on both patched 3.4 and vanilla 3.7 I`m able to
kill stuck kvm processes and node returned back to the normal, when 
on
3.2 sending SIGKILL to the process causing zombies and hanged ``ps''
output (problem and workaround when no scheduler involved described
here http://www.spinics.net/lists/kvm/msg84799.html).
   
Try disabling pause loop exiting with ple_gap=0 kvm-intel.ko module 
parameter.
   
  
   Hi Marcelo,
  
   thanks, this parameter helped to increase the number of working VMs by
   about half an order of magnitude, from 3-4 to 10-15. Very high SY load, 10
   to 15 percent, persists at such numbers for a long time, whereas Linux
   guests in the same configuration do not go above one percent even under
   a stress benchmark. After I disabled HT, the crash happens only in long
   runs and now it is a kernel panic :)
   The stair-like memory allocation behaviour disappeared, but another symptom
   leading to the crash, which I had not noticed previously, persists: if
   the VM count is ``enough'' for a crash, some qemu processes start to eat
   one core, and they will panic the system after running for tens of minutes
   in that state, or if I try to attach a debugger to one of them. If needed, I
   can log the entire crash output via netconsole; for now I have only the tail,
   almost the same every time:
   http://xdel.ru/downloads/btwin.png
  
   Yes, please log entire crash output, thanks.
  
  
   Here please, 3.7.4-vanilla, 16 vms, ple_gap=0:
  
   http://xdel.ru/downloads/oops-default-kvmintel.txt
 
  Just an update: I was able to reproduce that on pure Linux VMs using
  qemu-1.3.0 and the ``stress'' benchmark running on them - the panic occurs
  at the start of a VM (with ten machines already running at that moment).
  Qemu-1.1.2 generally is not able to reproduce that, but a host node with
  the older version crashes with fewer Windows VMs (three to six instead of
  ten to fifteen) than with 1.3; please see the trace below:
 
  http://xdel.ru/downloads/oops-old-qemu.txt
 
  Single bit memory error, apparently. Try:
 
  1. memtest86.
  2. Boot with slub_debug=ZFPU kernel parameter.
  3. Reproduce on different machine
 
 
 
 Hi Marcelo,
 
 I always follow the rule: if some weird bug exists, check it on an
 ECC-enabled machine and check the IPMI logs too before starting to complain
 :) I have finally managed to ``fix'' the problem, but my solution
 seems a bit strange:
 - I noticed that virtual machines started without any cgroup
 setting do not trigger this bug under any conditions,
 - I had assumed, quite wrongly, that
 CONFIG_SCHED_AUTOGROUP regroups only the tasks outside any cgroup and
 does not touch tasks already inside an existing cpu cgroup. A first
 look at the 200-line patch shows that autogrouping always applies
 to all tasks, so I tried disabling it,
 - wild magic appears: the VMs no longer crash the host; even with
 30+ of them they work fine.
 I still don't know what exactly triggered that or whether I will face it
 again under different conditions, so my solution is more likely
 a patch of mud in the wall of a dam than a proper fix.
 
 There seem to be two possible origins of such an error: a very, very
 hideous race condition involving cgroups and processes like qemu-kvm
 causing frequent context switches, and a simple incompatibility between
 NUMA, the logic of CONFIG_SCHED_AUTOGROUP and qemu VMs already doing 
