RE: [Qemu-devel] [PATCH v4] Fixes related to processing of qemu's -numa option

2012-07-18 Thread Vinod, Chegu
Thanks Eduardo !

Hi Anthony, if you are OK with this patch...could you please pull these changes
into upstream, or suggest whom I should talk to to get these changes in?

Thanks!
Vinod

-Original Message-
From: Eduardo Habkost [mailto:ehabk...@redhat.com] 
Sent: Wednesday, July 18, 2012 10:15 AM
To: Vinod, Chegu
Cc: qemu-de...@nongnu.org; aligu...@us.ibm.com; kvm@vger.kernel.org
Subject: Re: [Qemu-devel] [PATCH v4] Fixes related to processing of qemu's 
-numa option

On Mon, Jul 16, 2012 at 09:31:30PM -0700, Chegu Vinod wrote:
 Changes since v3:
- Using bitmap_set() instead of set_bit() in the numa_add() routine.
- Removed the call to bitmap_zero() since bitmap_new() already zeroes the bitmap.
- Rebased to the latest qemu.

Tested-by: Eduardo Habkost ehabk...@redhat.com
Reviewed-by: Eduardo Habkost ehabk...@redhat.com


 
 Changes since v2:
- Using unsigned long * for the node_cpumask[].
- Use bitmap_new() instead of g_malloc0() for allocation.
- Don't rely on max_cpus since it may not be initialized
  before the numa-related qemu options are parsed and processed.
 
 Note: Continuing to use a new constant for allocation of
   the mask. (This constant is currently set to 255 since,
   with an 8-bit APIC ID, VCPUs can range from 0-254 in a
   guest; the APIC ID 255 (0xFF) is reserved for broadcast.)
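For reference, the allocation/assignment pattern described above looks roughly like the following sketch using qemu's bitmap helpers (the constant name MAX_CPUMASK_BITS and the node_cpumask[] array shown here are illustrative assumptions, not necessarily the exact identifiers used in the patch):

    #include "bitmap.h"    /* qemu bitmap helpers: bitmap_new(), bitmap_set() */

    #define MAX_CPUMASK_BITS 255   /* assumed name: 8-bit APIC IDs 0-254; 255 is broadcast */
    #define MAX_NODES 64

    static unsigned long *node_cpumask[MAX_NODES];   /* one bitmap per fake numa node */

    static void numa_init_cpumasks(void)
    {
        int i;
        for (i = 0; i < MAX_NODES; i++) {
            /* bitmap_new() zeroes the allocation, so no separate bitmap_zero() is needed */
            node_cpumask[i] = bitmap_new(MAX_CPUMASK_BITS);
        }
    }

    /* Assign VCPUs [start, end] to a node, e.g. from "-numa node,cpus=start-end". */
    static void numa_set_cpu_range(int nodeid, int start, int end)
    {
        bitmap_set(node_cpumask[nodeid], start, end - start + 1);
    }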
 
 Changes since v1:
 
- Use bitmap functions that are already in qemu (instead
  of the cpu_set_t macros from sched.h)
- Added a check for endvalue >= max_cpus.
- Fix to address the round-robin assignment when
  cpus are not explicitly specified.
 ---
 
 v1:
 --
 
 The -numa option to qemu is used to create [fake] numa nodes and 
 expose them to the guest OS instance.
 
 There are a couple of issues with the -numa option:
 
 a) The max VCPUs that can be specified for a guest while using
    qemu's -numa option is 64. Due to a typecasting issue,
    when the number of VCPUs is > 32 the VCPUs don't show up
    under the specified [fake] numa nodes.
 
 b) KVM currently has support for 160 VCPUs per guest, but
    qemu's -numa option only supports up to 64 VCPUs
    per guest.
 This patch addresses these two issues.
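The typecasting issue in (a) is the classic 32-bit shift truncation; the exact qemu code is not quoted here, but the general failure mode can be illustrated as follows (assumption: the cpu bit is set with an int-typed shift into a 64-bit mask):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t mask_bad = 0, mask_good = 0;
        int cpu = 40;                       /* any VCPU index >= 32 */

        /* "1" is a 32-bit int, so the shift is undefined for cpu >= 32 and the
         * bit for CPU 40 is lost (or garbage) before widening to 64 bits. */
        mask_bad |= 1 << cpu;

        /* A 64-bit constant (or a bitmap helper such as bitmap_set()) keeps it. */
        mask_good |= 1ULL << cpu;

        printf("bad=%#llx good=%#llx\n",
               (unsigned long long)mask_bad, (unsigned long long)mask_good);
        return 0;
    }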
 
 Below are examples of (a) and (b)
 
 a) > 32 VCPUs are specified with the -numa option:
 
 /usr/local/bin/qemu-system-x86_64 \
 -enable-kvm \
 71:01:01 \
 -net tap,ifname=tap0,script=no,downscript=no \ -vnc :4
 
 ...
 Upstream qemu :
 --
 
 QEMU 1.1.50 monitor - type 'help' for more information
 (qemu) info numa
 6 nodes
 node 0 cpus: 0 1 2 3 4 5 6 7 8 9 32 33 34 35 36 37 38 39 40 41
 node 0 size: 131072 MB
 node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51
 node 1 size: 131072 MB
 node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59
 node 2 size: 131072 MB
 node 3 cpus: 30
 node 3 size: 131072 MB
 node 4 cpus:
 node 4 size: 131072 MB
 node 5 cpus: 31
 node 5 size: 131072 MB
 
 With the patch applied :
 ---
 
 QEMU 1.1.50 monitor - type 'help' for more information
 (qemu) info numa
 6 nodes
 node 0 cpus: 0 1 2 3 4 5 6 7 8 9
 node 0 size: 131072 MB
 node 1 cpus: 10 11 12 13 14 15 16 17 18 19
 node 1 size: 131072 MB
 node 2 cpus: 20 21 22 23 24 25 26 27 28 29
 node 2 size: 131072 MB
 node 3 cpus: 30 31 32 33 34 35 36 37 38 39
 node 3 size: 131072 MB
 node 4 cpus: 40 41 42 43 44 45 46 47 48 49
 node 4 size: 131072 MB
 node 5 cpus: 50 51 52 53 54 55 56 57 58 59
 node 5 size: 131072 MB
 
 b) > 64 VCPUs specified with -numa option:
 
 /usr/local/bin/qemu-system-x86_64 \
 -enable-kvm \
 -cpu 
 Westmere,+rdtscp,+pdpe1gb,+dca,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl
 ,+monitor,+dtes64,+pclmuldq,+pbe,+tm,+ht,+ss,+acpi,+d-vnc :4
 
 ...
 
 Upstream qemu :
 --
 
 only 63 CPUs in NUMA mode supported.
 only 64 CPUs in NUMA mode supported.
 QEMU 1.1.50 monitor - type 'help' for more information
 (qemu) info numa
 8 nodes
 node 0 cpus: 6 7 8 9 38 39 40 41 70 71 72 73
 node 0 size: 65536 MB
 node 1 cpus: 10 11 12 13 14 15 16 17 18 19 42 43 44 45 46 47 48 49 50 51 74 75 76 77 78 79
 node 1 size: 65536 MB
 node 2 cpus: 20 21 22 23 24 25 26 27 28 29 52 53 54 55 56 57 58 59 60 61
 node 2 size: 65536 MB
 node 3 cpus: 30 62
 node 3 size: 65536 MB
 node 4 cpus:
 node 4 size: 65536 MB
 node 5 cpus:
 node 5 size: 65536 MB
 node 6 cpus: 31 63
 node 6 size: 65536 MB
 node 7 cpus: 0 1 2 3 4 5 32 33 34 35 36 37 64 65 66 67 68 69
 node 7 size: 65536 MB
 
 With the patch applied :
 ---
 
 QEMU 1.1.50 monitor - type 'help' for more information
 (qemu) info numa
 8 nodes
 node 0 cpus: 0 1 2 3 4 5 6 7 8 9
 node 0 size: 65536 MB
 node 1 cpus: 10 11 12 13 14 15 16 17 18 19
 node 1 size: 65536 MB
 node 2 cpus: 20 21 22 23 24 25 26 27 28 29
 node 2 size: 65536 MB
 node 3 cpus: 30 31 32 33 34 35 36 37 38 39
 node 3 size: 65536 MB
 node 4 cpus: 40 41 42 43 44 45 46 47 48 49
 node 4 size: 65536 MB
 node 5 cpus: 50 51 52 53 54 55 56 57 58 59
 node 5 size: 65536 MB

Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Marcelo Tosatti
On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
  Back to original point though current
  situation is that calling kvm_set_irq() under spinlock is not worse 
  for
  scalability than calling it not under one.
 
 Yes. Still the specific use can just use an atomic flag,
 lock+bool is not needed, and we won't need to undo it later.


Actually, no, replacing it with an atomic is racy.

CPU0 (inject)                     CPU1 (EOI)
atomic_cmpxchg(asserted, 0, 1)
                                  atomic_cmpxchg(asserted, 1, 0)
                                  kvm_set_irq(0)
kvm_set_irq(1)
                                  eventfd_signal

The interrupt is now stuck on until another interrupt is injected.
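To make the interleaving above concrete, here is an illustrative sketch of the atomic-flag variant under discussion (names such as asserted, inject(), eoi() and resamplefd are hypothetical, not taken from the patch):

    /* Illustration only: assumed context for a level-triggered irqfd. */
    static struct kvm *kvm;
    static struct eventfd_ctx *resamplefd;
    static int irq_source_id;
    static u32 gsi;
    static atomic_t asserted;          /* the single flag proposed instead of lock+bool */

    static void inject(void)           /* CPU0: irqfd fires */
    {
        if (atomic_cmpxchg(&asserted, 0, 1) == 0)
            kvm_set_irq(kvm, irq_source_id, gsi, 1);    /* (B) */
    }

    static void eoi(void)              /* CPU1: resample on guest EOI */
    {
        if (atomic_cmpxchg(&asserted, 1, 0) == 1) {
            kvm_set_irq(kvm, irq_source_id, gsi, 0);    /* (A) */
            eventfd_signal(resamplefd, 1);
        }
    }

    /* Interleaving: CPU0 flips asserted 0->1, CPU1 flips it back 1->0 and runs (A),
     * then CPU0 runs (B).  The flag now says "deasserted" while the line is left
     * high; the interrupt stays stuck on until the next injection. */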

   
   Well EOI somehow happened here before interrupt so it's a bug somewhere
   else?
  
  Interrupts can be shared.  We also can't guarantee that the guest won't
  write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
  irq source id... I'm not sure it can.
 
 I guess if Avi OKs adding another kvm_set_irq under spinlock that's
 the best we can do for now.

Why can't a mutex be used instead of a spinlock again?




Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Gleb Natapov
On Wed, Jul 18, 2012 at 03:42:09PM -0300, Marcelo Tosatti wrote:
 On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
   Back to original point though current
   situation is that calling kvm_set_irq() under spinlock is not 
   worse for
   scalability than calling it not under one.
  
  Yes. Still the specific use can just use an atomic flag,
  lock+bool is not needed, and we won't need to undo it later.
 
 
 Actually, no, replacing it with an atomic is racy.
 
 CPU0 (inject)                     CPU1 (EOI)
 atomic_cmpxchg(asserted, 0, 1)
                                   atomic_cmpxchg(asserted, 1, 0)
                                   kvm_set_irq(0)
 kvm_set_irq(1)
                                   eventfd_signal
 
 The interrupt is now stuck on until another interrupt is injected.
 

Well EOI somehow happened here before interrupt so it's a bug somewhere
else?
   
   Interrupts can be shared.  We also can't guarantee that the guest won't
   write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
   irq source id... I'm not sure it can.
  
  I guess if Avi OKs adding another kvm_set_irq under spinlock that's
  the best we can do for now.
 
 Why can't a mutex be used instead of a spinlock again?
 
Why was it changed in the first place? The commit message says that the function is
called from unsleepable context, but gives no stack trace.

--
Gleb.


Re: [RFC v9 11/27] virtio-blk: Indirect vring and flush support

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 04:07:38PM +0100, Stefan Hajnoczi wrote:
 RHEL6 and other new guest kernels use indirect vring descriptors to
 increase the number of requests that can be batched.  This fundamentally
 changes vring from a scheme that requires fixed resources to something
 more dynamic (although there is still an absolute maximum number of
 descriptors).  Cope with indirect vrings by taking on as many requests
 as we can in one go and then postponing the remaining requests until the
 first batch completes.
 
 It would be possible to switch to dynamic resource management so iovec
 and iocb structs are malloced.  This would allow the entire ring to be
 processed even with indirect descriptors, but would probably hit a
 bottleneck when io_submit refuses to queue more requests.  Therefore,
 stick with the simpler scheme for now.
 
 Unfortunately Linux AIO does not support asynchronous fsync/fdatasync on
 all files.  In particular, an O_DIRECT opened file on ext4 does not
 support Linux AIO fdsync.  Work around this by performing fdatasync()
 synchronously for now.
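A minimal sketch of the kind of synchronous fallback described above (illustrative only; the actual dataplane code and its completion/status handling are not reproduced here):

    #include <stdio.h>
    #include <unistd.h>

    /* Complete a virtio-blk FLUSH by calling fdatasync() synchronously, because
     * Linux AIO fdsync (io_prep_fdsync/io_submit) is not supported on all files,
     * e.g. O_DIRECT files on ext4. */
    static int handle_flush(int fd)
    {
        if (fdatasync(fd) < 0) {
            perror("fdatasync");
            return -1;      /* would be reported to the guest as VIRTIO_BLK_S_IOERR */
        }
        return 0;           /* VIRTIO_BLK_S_OK */
    }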
 
 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 ---
  hw/dataplane/ioq.h   |   18 -
  hw/dataplane/vring.h |  103 
 +++---
  hw/virtio-blk.c  |   75 ++--
  3 files changed, 144 insertions(+), 52 deletions(-)
 
 diff --git a/hw/dataplane/ioq.h b/hw/dataplane/ioq.h
 index 7200e87..d1545d6 100644
 --- a/hw/dataplane/ioq.h
 +++ b/hw/dataplane/ioq.h
 @@ -3,7 +3,7 @@
  
  typedef struct {
  int fd; /* file descriptor */
 -unsigned int max_reqs;   /* max length of freelist and queue */
 +unsigned int max_reqs;  /* max length of freelist and queue */
  
  io_context_t io_ctx;/* Linux AIO context */
  EventNotifier io_notifier;  /* Linux AIO eventfd */
 @@ -91,18 +91,16 @@ static struct iocb *ioq_rdwr(IOQueue *ioq, bool read, 
 struct iovec *iov, unsigne
  return iocb;
  }
  
 -static struct iocb *ioq_fdsync(IOQueue *ioq)
 -{
 -    struct iocb *iocb = ioq_get_iocb(ioq);
 -
 -    io_prep_fdsync(iocb, ioq->fd);
 -    io_set_eventfd(iocb, event_notifier_get_fd(&ioq->io_notifier));
 -    return iocb;
 -}
 -
  static int ioq_submit(IOQueue *ioq)
  {
      int rc = io_submit(ioq->io_ctx, ioq->queue_idx, ioq->queue);
 +    if (unlikely(rc < 0)) {
 +        unsigned int i;
 +        fprintf(stderr, "io_submit io_ctx=%#lx nr=%d iovecs=%p\n",
 +                (uint64_t)ioq->io_ctx, ioq->queue_idx, ioq->queue);
 +        for (i = 0; i < ioq->queue_idx; i++) {
 +            fprintf(stderr, "[%u] type=%#x fd=%d\n", i,
 +                    ioq->queue[i]->aio_lio_opcode, ioq->queue[i]->aio_fildes);
 +        }
 +    }
      ioq->queue_idx = 0; /* reset */
      return rc;
  }
 diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
 index 70675e5..3eab4b4 100644
 --- a/hw/dataplane/vring.h
 +++ b/hw/dataplane/vring.h
 @@ -64,6 +64,86 @@ static void vring_setup(Vring *vring, VirtIODevice *vdev, 
 int n)
  vring-vr.desc, vring-vr.avail, vring-vr.used);
  }
  
 +static bool vring_more_avail(Vring *vring)
 +{
 +    return vring->vr.avail->idx != vring->last_avail_idx;
 +}
 +
 +/* This is stolen from linux-2.6/drivers/vhost/vhost.c. */

So add a Red Hat copyright pls.

 +static bool get_indirect(Vring *vring,
 +                         struct iovec iov[], struct iovec *iov_end,
 +                         unsigned int *out_num, unsigned int *in_num,
 +                         struct vring_desc *indirect)
 +{
 +    struct vring_desc desc;
 +    unsigned int i = 0, count, found = 0;
 +
 +    /* Sanity check */
 +    if (unlikely(indirect->len % sizeof desc)) {
 +        fprintf(stderr, "Invalid length in indirect descriptor: "
 +                "len 0x%llx not multiple of 0x%zx\n",
 +                (unsigned long long)indirect->len,
 +                sizeof desc);
 +        exit(1);
 +    }
 +
 +    count = indirect->len / sizeof desc;
 +    /* Buffers are chained via a 16 bit next field, so
 +     * we can have at most 2^16 of these. */
 +    if (unlikely(count > USHRT_MAX + 1)) {
 +        fprintf(stderr, "Indirect buffer length too big: %d\n",
 +                indirect->len);
 +        exit(1);
 +    }
 +
 +    /* Point to translate indirect desc chain */
 +    indirect = phys_to_host(vring, indirect->addr);
 +
 +    /* We will use the result as an address to read from, so most
 +     * architectures only need a compiler barrier here. */
 +    __sync_synchronize(); /* read_barrier_depends(); */
 +
 +    do {
 +        if (unlikely(++found > count)) {
 +            fprintf(stderr, "Loop detected: last one at %u "
 +                    "indirect size %u\n",
 +                    i, count);
 +            exit(1);
 +        }
 +
 +        desc = *indirect++;
 +        if (unlikely(desc.flags & VRING_DESC_F_INDIRECT)) {
 + fprintf(stderr, Nested indirect 

Re: [RFC v9 12/27] virtio-blk: Add workaround for BUG_ON() dependency in virtio_ring.h

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 04:07:39PM +0100, Stefan Hajnoczi wrote:
 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 ---
  hw/dataplane/vring.h |5 +
  1 file changed, 5 insertions(+)
 
 diff --git a/hw/dataplane/vring.h b/hw/dataplane/vring.h
 index 3eab4b4..44ef4a9 100644
 --- a/hw/dataplane/vring.h
 +++ b/hw/dataplane/vring.h
 @@ -1,6 +1,11 @@
  #ifndef VRING_H
  #define VRING_H
  
 +/* Some virtio_ring.h files use BUG_ON() */

It's a bug then. Do we really need to work around broken systems?
If yes let's just ship our own headers ...

 +#ifndef BUG_ON
 +#define BUG_ON(x)
 +#endif
 +
   #include <linux/virtio_ring.h>
   #include "qemu-common.h"
  
 -- 
 1.7.10.4


Re: [RFC v9 22/27] virtio-blk: Fix request merging

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 04:07:49PM +0100, Stefan Hajnoczi wrote:
 Khoa Huynh k...@us.ibm.com discovered that request merging is broken.
 The merged iocb is not updated to reflect the total number of iovecs and
 the offset is also outdated.
 
 This patch fixes request merging.
 
 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

So all these fixups need to be folded in making it correct first time.

 ---
  hw/virtio-blk.c |   10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)
 
 diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
 index 9131a7a..51807b5 100644
 --- a/hw/virtio-blk.c
 +++ b/hw/virtio-blk.c
 @@ -178,13 +178,17 @@ static void merge_request(struct iocb *iocb_a, struct iocb *iocb_b)
          req_a->len = iocb_nbytes(iocb_a);
      }
  
 -    iocb_b->u.v.vec = iovec;
 -    req_b->len = iocb_nbytes(iocb_b);
 -    req_b->next_merged = req_a;
      /*
      fprintf(stderr, "merged %p (%u) and %p (%u), %u iovecs in total\n",
              req_a, iocb_a->u.v.nr, req_b, iocb_b->u.v.nr, iocb_a->u.v.nr + iocb_b->u.v.nr);
      */
 +
 +    iocb_b->u.v.vec = iovec;
 +    iocb_b->u.v.nr += iocb_a->u.v.nr;
 +    iocb_b->u.v.offset = iocb_a->u.v.offset;
 +
 +    req_b->len = iocb_nbytes(iocb_b);
 +    req_b->next_merged = req_a;
  }
  
  static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int 
 out_num, unsigned int in_num, unsigned int head)
 -- 
 1.7.10.4


Re: [RFC v9 23/27] virtio-blk: Stub out SCSI commands

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 04:07:50PM +0100, Stefan Hajnoczi wrote:
 Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com

Why?

 ---
  hw/virtio-blk.c |   25 +
  1 file changed, 17 insertions(+), 8 deletions(-)
 
 diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
 index 51807b5..8734029 100644
 --- a/hw/virtio-blk.c
 +++ b/hw/virtio-blk.c
 @@ -215,14 +215,8 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
  
      /* TODO Linux sets the barrier bit even when not advertised! */
      uint32_t type = outhdr->type & ~VIRTIO_BLK_T_BARRIER;
 -
 -    if (unlikely(type & ~(VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH))) {
 -        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
 -        exit(1);
 -    }
 -
      struct iocb *iocb;
 -    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_FLUSH)) {
 +    switch (type & (VIRTIO_BLK_T_OUT | VIRTIO_BLK_T_SCSI_CMD | VIRTIO_BLK_T_FLUSH)) {
      case VIRTIO_BLK_T_IN:
          if (unlikely(out_num != 1)) {
              fprintf(stderr, "virtio-blk invalid read request\n");
 @@ -239,6 +233,21 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
          iocb = ioq_rdwr(ioq, false, &iov[1], out_num - 1, outhdr->sector * 512UL); /* TODO is it always 512? */
          break;
  
 +    case VIRTIO_BLK_T_SCSI_CMD:
 +        if (unlikely(in_num == 0)) {
 +            fprintf(stderr, "virtio-blk invalid SCSI command request\n");
 +            exit(1);
 +        }
 +
 +        /* TODO support SCSI commands */
 +        {
 +            VirtIOBlock *s = container_of(ioq, VirtIOBlock, ioqueue);
 +            inhdr->status = VIRTIO_BLK_S_UNSUPP;
 +            vring_push(&s->vring, head, sizeof *inhdr);
 +            virtio_blk_notify_guest(s);
 +        }
 +        return;
 +
      case VIRTIO_BLK_T_FLUSH:
          if (unlikely(in_num != 1 || out_num != 1)) {
              fprintf(stderr, "virtio-blk invalid flush request\n");
 @@ -256,7 +265,7 @@ static void process_request(IOQueue *ioq, struct iovec iov[], unsigned int out_n
          return;
  
      default:
 -        fprintf(stderr, "virtio-blk multiple request type bits set\n");
 +        fprintf(stderr, "virtio-blk unsupported request type %#x\n", outhdr->type);
          exit(1);
      }
  
 -- 
 1.7.10.4


Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Alex Williamson
On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
 On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
   Back to original point though current
   situation is that calling kvm_set_irq() under spinlock is not 
   worse for
   scalability than calling it not under one.
  
  Yes. Still the specific use can just use an atomic flag,
  lock+bool is not needed, and we won't need to undo it later.
 
 
 Actually, no, replacing it with an atomic is racy.
 
 CPU0 (inject)                     CPU1 (EOI)
 atomic_cmpxchg(asserted, 0, 1)
                                   atomic_cmpxchg(asserted, 1, 0)
                                   kvm_set_irq(0)
 kvm_set_irq(1)
                                   eventfd_signal
 
 The interrupt is now stuck on until another interrupt is injected.
 

Well EOI somehow happened here before interrupt so it's a bug somewhere
else?
   
   Interrupts can be shared.  We also can't guarantee that the guest won't
   write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
   irq source id... I'm not sure it can.
  
  I guess if Avi OKs adding another kvm_set_irq under spinlock that's
  the best we can do for now.
 
 Why can't a mutex be used instead of a spinlock again?

eventfd_signal calls the inject function from atomic context.



Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6

2012-07-18 Thread Anthony Liguori

On 07/18/2012 11:47 AM, James Bottomley wrote:

On Wed, 2012-07-18 at 11:00 -0500, Anthony Liguori wrote:

Of course: Think about the consequences: you want to upgrade one array
on your SAN.  You definitely don't want to shut down your entire data
centre to achieve it.  In place upgrades on running SANs have been
common in enterprise environments for a while.


Would firmware upgrades ever result in major OS visible changes though?

Maybe OSes are more robust with SCSI than with other types of buses, but I don't 
think it's safe to completely ignore the problem.



I agree that in general, SCSI targets don't need this, but I'm pretty sure that
if a guest probes for a command, you migrate to an old version, and that command
is no longer there, badness will ensue.


What command are we talking about?  Operation of initiators is usually
just READ and WRITE.  So perhaps we might have inline UNMAP ... but the
world wouldn't come to an end even if the latter stopped working.


Is that true for all OSes?  Linux may handle things gracefully if UNMAP starts 
throwing errors but that doesn't mean that Windows will.


There are other cases where this creates problems.  Windows (and some other 
OSes) fingerprint the hardware profile in order to do license enforcement.  If 
the hardware changes beyond a certain amount, then they refuse to boot.


Windows does this with a points system and I do believe that INQUIRY responses 
from any local disks are included in this tally.



Most of the complex SCSI stuff is done at start of day; it's actually
only then we'd notice things like changes in INQUIRY strings or mode
pages.

Failover, which is what you're talking about, requires reinstatement of
all the operating parameters of the source/target system, but that's not
wholly the responsibility of the storage system ...


It's the responsibility of the hypervisor when dealing with live migration.

Regards,

Anthony Liguori



James


It's different when you're talking about a reboot happening or a
disconnect/reconnect due to firmware upgrade.  The OS would naturally be
reprobing in this case.








Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Alex Williamson
On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:
 On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
  On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
Back to original point though current
situation is that calling kvm_set_irq() under spinlock is not 
worse for
scalability than calling it not under one.
   
   Yes. Still the specific use can just use an atomic flag,
   lock+bool is not needed, and we won't need to undo it later.
  
  
  Actually, no, replacing it with an atomic is racy.
  
   CPU0 (inject)                     CPU1 (EOI)
   atomic_cmpxchg(asserted, 0, 1)
                                     atomic_cmpxchg(asserted, 1, 0)
                                     kvm_set_irq(0)
   kvm_set_irq(1)
                                     eventfd_signal
  
  The interrupt is now stuck on until another interrupt is injected.
  
 
 Well EOI somehow happened here before interrupt so it's a bug 
 somewhere
 else?

Interrupts can be shared.  We also can't guarantee that the guest won't
write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter on
irq source id... I'm not sure it can.
   
   I guess if Avi OKs adding another kvm_set_irq under spinlock that's
   the best we can do for now.
  
  Why can't a mutex be used instead of a spinlock again?
 
 eventfd_signal calls the inject function from atomic context.

Actually, that's called from a workq.  I'll have to switch it back and
turn on lockdep to remember why I couldn't sleep there.



Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 01:13:06PM -0600, Alex Williamson wrote:
 On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:
  On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
   On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
 Back to original point though current
 situation is that calling kvm_set_irq() under spinlock is not 
 worse for
 scalability than calling it not under one.

Yes. Still the specific use can just use an atomic flag,
lock+bool is not needed, and we won't need to undo it later.
   
   
   Actually, no, replacing it with an atomic is racy.
   
    CPU0 (inject)                     CPU1 (EOI)
    atomic_cmpxchg(asserted, 0, 1)
                                      atomic_cmpxchg(asserted, 1, 0)
                                      kvm_set_irq(0)
    kvm_set_irq(1)
                                      eventfd_signal
   
   The interrupt is now stuck on until another interrupt is injected.
   
  
  Well EOI somehow happened here before interrupt so it's a bug 
  somewhere
  else?
 
 Interrupts can be shared.  We also can't guarantee that the guest 
 won't
 write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter 
 on
 irq source id... I'm not sure it can.

I guess if Avi OKs adding another kvm_set_irq under spinlock that's
the best we can do for now.
   
   Why can't a mutex be used instead of a spinlock again?
  
  eventfd_signal calls the inject function from atomic context.
 
 Actually, that's called from a workq.  I'll have to switch it back and
 turn on lockdep to remember why I couldn't sleep there.


I'll try to fix kvm_set_irq so it returns 0 if level was already 0.
Then you do not need extra state.

-- 
MST


Re: [PATCH 0/8] KVM: Optimize MMU notifier's THP page invalidation -v4

2012-07-18 Thread Marcelo Tosatti
On Mon, Jul 02, 2012 at 05:52:39PM +0900, Takuya Yoshikawa wrote:
 v3-v4: Resolved trace_kvm_age_page() issue -- patch 6,7
 v2-v3: Fixed intersection calculations. -- patch 3, 8
 
   Takuya
 
 Takuya Yoshikawa (8):
   KVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva()
   KVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva()
   KVM: MMU: Make kvm_handle_hva() handle range of addresses
   KVM: Introduce kvm_unmap_hva_range() for 
 kvm_mmu_notifier_invalidate_range_start()
   KVM: Separate rmap_pde from kvm_lpage_info-write_count
   KVM: MMU: Add memslot parameter to hva handlers
   KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
   KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
 
  arch/powerpc/include/asm/kvm_host.h |2 +
  arch/powerpc/kvm/book3s_64_mmu_hv.c |   47 ---
  arch/x86/include/asm/kvm_host.h |3 +-
  arch/x86/kvm/mmu.c  |  107 
 +++
  arch/x86/kvm/x86.c  |   11 
  include/linux/kvm_host.h|8 +++
  virt/kvm/kvm_main.c |3 +-
  7 files changed, 131 insertions(+), 50 deletions(-)
 
 
 From v2:
 
 The new test result was impressively good, see below, and THP page
 invalidation was more than 5 times faster on my x86 machine.
 
 Before:
   ...
   19.852 us  |  __mmu_notifier_invalidate_range_start();
   28.033 us  |  __mmu_notifier_invalidate_range_start();
   19.066 us  |  __mmu_notifier_invalidate_range_start();
   44.715 us  |  __mmu_notifier_invalidate_range_start();
   31.613 us  |  __mmu_notifier_invalidate_range_start();
   20.659 us  |  __mmu_notifier_invalidate_range_start();
   19.979 us  |  __mmu_notifier_invalidate_range_start();
   20.416 us  |  __mmu_notifier_invalidate_range_start();
   20.632 us  |  __mmu_notifier_invalidate_range_start();
   22.316 us  |  __mmu_notifier_invalidate_range_start();
   ...
 
 After:
   ...
   4.089 us   |  __mmu_notifier_invalidate_range_start();
   4.096 us   |  __mmu_notifier_invalidate_range_start();
   3.560 us   |  __mmu_notifier_invalidate_range_start();
   3.376 us   |  __mmu_notifier_invalidate_range_start();
   3.772 us   |  __mmu_notifier_invalidate_range_start();
   3.353 us   |  __mmu_notifier_invalidate_range_start();
   3.332 us   |  __mmu_notifier_invalidate_range_start();
   3.332 us   |  __mmu_notifier_invalidate_range_start();
   3.332 us   |  __mmu_notifier_invalidate_range_start();
   3.337 us   |  __mmu_notifier_invalidate_range_start();
   ...

Applied, thanks.



Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6

2012-07-18 Thread Rustad, Mark D
On Jul 18, 2012, at 10:17 AM, Michael S. Tsirkin wrote:

snip

 You do have to pay very close attention to some things however. Don't change 
 the device identity in any way - even version information, otherwise a 
 Windows initiator will blue-screen. I made that mistake myself, so I 
 remember it well. It seemed like such an innocent change. I don't recall 
 there being any issue with adding commands and we did do that on occasion.
 
 How about removing commands?

Good question. With the storage system I am familiar with, that would only be a 
risk if firmware were downgraded. Downgrading would never have been 
recommended. I am sure that if something like persistent reserve suddenly went 
away it would cause big trouble for some initiators.

-- 
Mark Rustad, LAN Access Division, Intel Corporation



Re: [PATCH] x86, hyper: fix build with !CONFIG_KVM_GUEST

2012-07-18 Thread Marcelo Tosatti
On Wed, Jul 18, 2012 at 11:48:50AM +0300, Avi Kivity wrote:
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
  arch/x86/kernel/cpu/hypervisor.c | 2 ++
  1 file changed, 2 insertions(+)

Applied to next, thanks.



Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Alex Williamson
On Wed, 2012-07-18 at 13:13 -0600, Alex Williamson wrote:
 On Wed, 2012-07-18 at 13:07 -0600, Alex Williamson wrote:
  On Wed, 2012-07-18 at 15:42 -0300, Marcelo Tosatti wrote:
   On Wed, Jul 18, 2012 at 06:58:24PM +0300, Michael S. Tsirkin wrote:
 Back to original point though current
 situation is that calling kvm_set_irq() under spinlock is not 
 worse for
 scalability than calling it not under one.

Yes. Still the specific use can just use an atomic flag,
lock+bool is not needed, and we won't need to undo it later.
   
   
   Actually, no, replacing it with an atomic is racy.
   
    CPU0 (inject)                     CPU1 (EOI)
    atomic_cmpxchg(asserted, 0, 1)
                                      atomic_cmpxchg(asserted, 1, 0)
                                      kvm_set_irq(0)
    kvm_set_irq(1)
                                      eventfd_signal
   
   The interrupt is now stuck on until another interrupt is injected.
   
  
  Well EOI somehow happened here before interrupt so it's a bug 
  somewhere
  else?
 
 Interrupts can be shared.  We also can't guarantee that the guest 
 won't
 write a bogus EOI to the ioapic.  The irq ack notifier doesn't filter 
 on
 irq source id... I'm not sure it can.

I guess if Avi OKs adding another kvm_set_irq under spinlock that's
the best we can do for now.
   
   Why can't a mutex be used instead of a spinlock again?
  
  eventfd_signal calls the inject function from atomic context.
 
 Actually, that's called from a workq.  I'll have to switch it back and
 turn on lockdep to remember why I couldn't sleep there.

switching to a mutex results in:

BUG: sleeping function called from invalid context at kernel/mutex.c:269
in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
INFO: lockdep is turned off.
Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
Call Trace:
 [81088425] __might_sleep+0xf5/0x130
 [81564c6f] mutex_lock_nested+0x2f/0x60
 [a07db7d5] eoifd_event+0x25/0x70 [kvm]
 [a07daea4] kvm_notify_acked_irq+0xa4/0x140 [kvm]
 [a07dae2a] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
 [a07d9bb4] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
 [a0806c43] apic_set_eoi+0x123/0x130 [kvm]
 [a0806fd8] apic_reg_write+0x388/0x670 [kvm]
 [a07eb03c] ? vcpu_enter_guest+0x32c/0x740 [kvm]
 [a0807481] kvm_lapic_set_eoi+0x21/0x30 [kvm]
 [a04ba3f9] handle_apic_access+0x69/0x80 [kvm_intel]
 [a04ba02a] vmx_handle_exit+0xaa/0x260 [kvm_intel]




Re: [PATCH] KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended

2012-07-18 Thread Marcelo Tosatti
On Thu, Jul 12, 2012 at 06:35:09PM +0900, Takuya Yoshikawa wrote:
 On Thu, 5 Jul 2012 23:05:46 +0900
 Takuya Yoshikawa takuya.yoshik...@gmail.com wrote:
 
  On Thu, 5 Jul 2012 14:50:00 +0300
  Gleb Natapov g...@redhat.com wrote:
  
Note that if (!nr_to_scan--) check is removed since we do not try to
free mmu pages from more than one VM.

   IIRC this was proposed in the past that we should iterate over vm list
   until freeing something eventually, but Avi was against it. I think the
   probability of a VM with kvm-arch.n_used_mmu_pages == 0 is low, so
   it looks OK to drop nr_to_scan to me.
  
  Since our batch size is 128, the minimum positive @nr_to_scan, it's almost
  impossible to see the effect of the check.
 
 Thinking more about this:
 
 I think freeing mmu pages by shrink_slab() is problematic.
 
 For example, if we do
  # echo 2 > /proc/sys/vm/drop_caches
 on the host, some mmu pages will be freed.
 
 This is not what most people expect, probably.
 
 
  Although this patch is needed to care about shadow paging's extreme
  mmu page usage, we should do something better in the future.
 
 What I think reasonable is not treating all mmu pages as freeable:
  - determine some base number of mmu pages: base_mmu_pages
  - return (total_mmu_pages - base_mmu_pages) to the caller
 
  * We may use n_max_mmu_pages for calculating this base number.
 
 By doing so, we can avoid freeing mmu pages especially when EPT/NPT ON.
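 A rough sketch of that accounting (illustrative only; base_mmu_pages and the
 choice of baseline are assumptions from this mail, not an actual kernel patch):

     /*
      * Report only MMU pages above a per-VM baseline as freeable, so that
      * shrink_slab()/drop_caches does not strip a VM down to zero mmu pages.
      */
     static unsigned long kvm_mmu_freeable_pages(struct kvm *kvm)
     {
         /* assumed baseline derived from n_max_mmu_pages */
         unsigned long base_mmu_pages = kvm->arch.n_max_mmu_pages / 2;

         if (kvm->arch.n_used_mmu_pages <= base_mmu_pages)
             return 0;

         return kvm->arch.n_used_mmu_pages - base_mmu_pages;
     }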

Takuya,

Can't understand, can you please expand more clearly? 

TIA



Re: [PATCH] qemu kvm: Recognize PCID feature

2012-07-18 Thread Marcelo Tosatti
On Wed, Jul 18, 2012 at 11:29:36AM +0200, Jan Kiszka wrote:
 On 2012-07-18 10:44, Mao, Junjie wrote:
  Hi, Avi
  
  Any comments on this patch? :)
 
 Always include qemu-devel when you are changing QEMU; qemu-kvm is just
 staging for the latter. This patch can actually go into upstream
 directly, maybe even via qemu-trivial as it just makes that flag selectable.
 
 Jan

Agreed, CCing Stefan.



Re: Latest Centos6.3 kernel, rebooting from inside centos6.3 VM - gets stuck on seabios/grub loop - previous kernels fine

2012-07-18 Thread Marcelo Tosatti
On Wed, Jul 11, 2012 at 10:54:34PM +0100, Morgan Cox wrote:
 Hi
 
 I have an Ubuntu 12.04 KVM server
 
 In Centos6 VM's - when I install the latest kernel for centos 6.3 -
 2.6.32-279.1.1.el6 - if you reboot from inside a Centos6 vm it gets
 stuck in a loop between seabios/grub.
 
 If i use virsh/virt-manager to reboot its fine, only from inside a
  centos6 vm (with latest centos kernel) does this occur - I am 100% sure it's
  connected with the latest kernel (or at least the 6.3 updates).
 
 As a test I installed centos 6.2 - this was 100% fine *until* I did a
 yum update then I got the same issue.
 
 If I use the older kernel with all the rest of the Centos 6.3 updates
 the issue does not occur either (so it 'must'? be related to the
 kernel update)
 
 How can this be fixed ?

Morgan,

What is the last working kernel RPM version?

Can you disable kvmclock? (by appending no-kvmclock to the end of the
kernel line of 2.6.32-279.1.1.el6 entry in /boot/grub/menu.lst).



Re: [RFC-v3 3/4] vhost: Add vhost_scsi specific defines

2012-07-18 Thread Nicholas A. Bellinger
On Wed, 2012-07-18 at 16:05 +0300, Michael S. Tsirkin wrote:
 On Wed, Jul 18, 2012 at 12:59:31AM +, Nicholas A. Bellinger wrote:
  From: Nicholas Bellinger n...@risingtidesystems.com
  
  This patch adds the initial vhost_scsi_ioctl() callers for 
  VHOST_SCSI_SET_ENDPOINT
  and VHOST_SCSI_CLEAR_ENDPOINT respectively, and also adds struct 
  vhost_vring_target
  that is used by tcm_vhost code when locating target ports during qemu setup.
  
  Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
  Cc: Zhi Yong Wu wu...@cn.ibm.com
  Cc: Michael S. Tsirkin m...@redhat.com
  Cc: Paolo Bonzini pbonz...@redhat.com,
  Signed-off-by: Nicholas A. Bellinger n...@risingtidesystems.com
  ---
   include/linux/vhost.h |9 +
   1 files changed, 9 insertions(+), 0 deletions(-)
  
  diff --git a/include/linux/vhost.h b/include/linux/vhost.h
  index e847f1e..33b313b 100644
  --- a/include/linux/vhost.h
  +++ b/include/linux/vhost.h
  @@ -24,7 +24,11 @@ struct vhost_vring_state {
   struct vhost_vring_file {
  unsigned int index;
  int fd; /* Pass -1 to unbind from file. */
  +};
   
  +struct vhost_vring_target {
 
 Can this be renamed vhost_scsi_target?

Done

 
  +   unsigned char vhost_wwpn[224];
 
 224? I am guessing ISCSI_NAME_LEN from include/scsi/iscsi_proto.h?
 Unfortunately we can't include iscsi_proto.h here as it
 is not exported to users. But let's add a comment for now.
 

This is actually from target/target_core_base.h:TRANSPORT_IQN_LEN.

Fixing this up now..

  +   unsigned short vhost_tpgt;
   };
   
   struct vhost_vring_addr {
  @@ -121,6 +125,11 @@ struct vhost_memory {
* device.  This can be used to stop the ring (e.g. for migration). */
   #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct 
  vhost_vring_file)
   
  +/* VHOST_SCSI specific defines */
  +
  +#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct 
  vhost_vring_target)
  +#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct 
  vhost_vring_target)
  +
   /* Feature bits */
   /* Log all write descriptors. Can be changed while device is active. */
 
  Can these go into an appropriate ifdef CONFIG_TCM_VHOST please?
 

M, I don't think we can do that with CONFIG_TCM_VHOST=m, or at least
not with the following patch

diff --git a/include/linux/vhost.h b/include/linux/vhost.h
index 33b313b..e4b1ee3 100644
--- a/include/linux/vhost.h
+++ b/include/linux/vhost.h
@@ -125,10 +126,14 @@ struct vhost_memory {
  * device.  This can be used to stop the ring (e.g. for migration). */
 #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)
 
+#ifdef CONFIG_TCM_VHOST
+
 /* VHOST_SCSI specific defines */
 
-#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct 
vhost_vring_target)
-#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct 
vhost_vring_target)
+# define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct 
vhost_scsi_target)
+# define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct 
vhost_scsi_target)
+
+#endif

--

drivers/vhost/tcm/tcm_vhost.c: In function ‘vhost_scsi_ioctl’:
drivers/vhost/tcm/tcm_vhost.c:970: error: ‘VHOST_SCSI_SET_ENDPOINT’ undeclared 
(first use in this function)
drivers/vhost/tcm/tcm_vhost.c:970: error: (Each undeclared identifier is 
reported only once
drivers/vhost/tcm/tcm_vhost.c:970: error: for each function it appears in.)
drivers/vhost/tcm/tcm_vhost.c:975: error: ‘VHOST_SCSI_CLEAR_ENDPOINT’ 
undeclared (first use in this function)
make[3]: *** [drivers/vhost/tcm/tcm_vhost.o] Error 1




Re: [RFC-v3 4/4] tcm_vhost: Initial merge for vhost level target fabric driver

2012-07-18 Thread Nicholas A. Bellinger
On Wed, 2012-07-18 at 19:09 +0300, Michael S. Tsirkin wrote:
 On Wed, Jul 18, 2012 at 12:59:32AM +, Nicholas A. Bellinger wrote:

SNIP

  
  Changelog v2 - v3:
  
Unlock on error in tcm_vhost_drop_nexus() (DanC)
Fix strlen() doesn't count the terminator (DanC)
Call kfree() on an error path (DanC)
Convert tcm_vhost_write_pending to use target_execute_cmd (hch + nab)
Fix another strlen() off by one in tcm_vhost_make_tport (DanC)
Add option under drivers/staging/Kconfig, and move to drivers/vhost/tcm/
as requested by MST (nab)
  
  ---
   drivers/staging/Kconfig   |2 +
   drivers/vhost/Makefile|2 +
   drivers/vhost/tcm/Kconfig |6 +
   drivers/vhost/tcm/Makefile|1 +
   drivers/vhost/tcm/tcm_vhost.c | 1611 
  +
   drivers/vhost/tcm/tcm_vhost.h |   74 ++
   6 files changed, 1696 insertions(+), 0 deletions(-)
   create mode 100644 drivers/vhost/tcm/Kconfig
   create mode 100644 drivers/vhost/tcm/Makefile
   create mode 100644 drivers/vhost/tcm/tcm_vhost.c
   create mode 100644 drivers/vhost/tcm/tcm_vhost.h
  
 
 Really sorry about making you run around like that,
 I did not mean moving all of tcm to a directory,
 just adding tcm/Kconfig or adding drivers/vhost/Kconfig.tcm
 because eventually it's easier to keep it all together
 in one place.
 

Er, apologies for the slight mis-understanding here..  Moving back now +
fixing up the Kbuild bits.



Re: [RFC-v3 3/4] vhost: Add vhost_scsi specific defines

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 02:17:05PM -0700, Nicholas A. Bellinger wrote:
 On Wed, 2012-07-18 at 16:05 +0300, Michael S. Tsirkin wrote:
  On Wed, Jul 18, 2012 at 12:59:31AM +, Nicholas A. Bellinger wrote:
   From: Nicholas Bellinger n...@risingtidesystems.com
   
   This patch adds the initial vhost_scsi_ioctl() callers for 
   VHOST_SCSI_SET_ENDPOINT
   and VHOST_SCSI_CLEAR_ENDPOINT respectively, and also adds struct 
   vhost_vring_target
   that is used by tcm_vhost code when locating target ports during qemu 
   setup.
   
   Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
   Cc: Zhi Yong Wu wu...@cn.ibm.com
   Cc: Michael S. Tsirkin m...@redhat.com
   Cc: Paolo Bonzini pbonz...@redhat.com,
   Signed-off-by: Nicholas A. Bellinger n...@risingtidesystems.com
   ---
include/linux/vhost.h |9 +
1 files changed, 9 insertions(+), 0 deletions(-)
   
   diff --git a/include/linux/vhost.h b/include/linux/vhost.h
   index e847f1e..33b313b 100644
   --- a/include/linux/vhost.h
   +++ b/include/linux/vhost.h
   @@ -24,7 +24,11 @@ struct vhost_vring_state {
struct vhost_vring_file {
 unsigned int index;
 int fd; /* Pass -1 to unbind from file. */
   +};

   +struct vhost_vring_target {
  
  Can this be renamed vhost_scsi_target?
 
 Done
 
  
   + unsigned char vhost_wwpn[224];
  
  224? I am guessing ISCSI_NAME_LEN from include/scsi/iscsi_proto.h?
  Unfortunately we can't include iscsi_proto.h here as it
  is not exported to users. But let's add a comment for now.
  
 
 This is actually from target/target_core_base.h:TRANSPORT_IQN_LEN.
 
 Fixing this up now..
 
   + unsigned short vhost_tpgt;
};

struct vhost_vring_addr {
   @@ -121,6 +125,11 @@ struct vhost_memory {
 * device.  This can be used to stop the ring (e.g. for migration). */
#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct 
   vhost_vring_file)

   +/* VHOST_SCSI specific defines */
   +
   +#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct 
   vhost_vring_target)
   +#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct 
   vhost_vring_target)
   +
/* Feature bits */
/* Log all write descriptors. Can be changed while device is active. */
  
   Can these go into an appropriate ifdef CONFIG_TCM_VHOST please?
  
 
 M, I don't think we can do that with CONFIG_TCM_VHOST=m, or at least
 not with the following patch
 
 diff --git a/include/linux/vhost.h b/include/linux/vhost.h
 index 33b313b..e4b1ee3 100644
 --- a/include/linux/vhost.h
 +++ b/include/linux/vhost.h
 @@ -125,10 +126,14 @@ struct vhost_memory {
   * device.  This can be used to stop the ring (e.g. for migration). */
  #define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct 
 vhost_vring_file)
  
 +#ifdef CONFIG_TCM_VHOST
 +
  /* VHOST_SCSI specific defines */
  
 -#define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct 
 vhost_vring_target)
 -#define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct 
 vhost_vring_target)
 +# define VHOST_SCSI_SET_ENDPOINT _IOW(VHOST_VIRTIO, 0x40, struct 
 vhost_scsi_target)
 +# define VHOST_SCSI_CLEAR_ENDPOINT _IOW(VHOST_VIRTIO, 0x41, struct 
 vhost_scsi_target)
 +
 +#endif
 
 --
 
 drivers/vhost/tcm/tcm_vhost.c: In function ‘vhost_scsi_ioctl’:
 drivers/vhost/tcm/tcm_vhost.c:970: error: ‘VHOST_SCSI_SET_ENDPOINT’ 
 undeclared (first use in this function)
 drivers/vhost/tcm/tcm_vhost.c:970: error: (Each undeclared identifier is 
 reported only once
 drivers/vhost/tcm/tcm_vhost.c:970: error: for each function it appears in.)
 drivers/vhost/tcm/tcm_vhost.c:975: error: ‘VHOST_SCSI_CLEAR_ENDPOINT’ 
 undeclared (first use in this function)
 make[3]: *** [drivers/vhost/tcm/tcm_vhost.o] Error 1


Maybe ifdefs only work for booleans? If yes you can probably add a boolean and 
select it?
What I want to prevent is exposing tcm stuff in the header if it is
configured off. I'll play with it tomorrow.


Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Marcelo Tosatti
On Wed, Jul 18, 2012 at 02:28:34PM -0600, Alex Williamson wrote:
  turn on lockdep to remember why I couldn't sleep there.
 
 switching to a mutex results in:
 
 BUG: sleeping function called from invalid context at kernel/mutex.c:269
 in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
 INFO: lockdep is turned off.
 Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
 Call Trace:
  [81088425] __might_sleep+0xf5/0x130
  [81564c6f] mutex_lock_nested+0x2f/0x60
  [a07db7d5] eoifd_event+0x25/0x70 [kvm]
  [a07daea4] kvm_notify_acked_irq+0xa4/0x140 [kvm]
  [a07dae2a] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
  [a07d9bb4] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
  [a0806c43] apic_set_eoi+0x123/0x130 [kvm]
  [a0806fd8] apic_reg_write+0x388/0x670 [kvm]
  [a07eb03c] ? vcpu_enter_guest+0x32c/0x740 [kvm]
  [a0807481] kvm_lapic_set_eoi+0x21/0x30 [kvm]
  [a04ba3f9] handle_apic_access+0x69/0x80 [kvm_intel]
  [a04ba02a] vmx_handle_exit+0xaa/0x260 [kvm_intel]

Its RCU from ack notifiers, OK. 




Re: [PATCH v5 1/4] kvm: Extend irqfd to support level interrupts

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 06:23:34PM -0300, Marcelo Tosatti wrote:
 On Wed, Jul 18, 2012 at 02:28:34PM -0600, Alex Williamson wrote:
   turn on lockdep to remember why I couldn't sleep there.
  
  switching to a mutex results in:
  
  BUG: sleeping function called from invalid context at kernel/mutex.c:269
  in_atomic(): 1, irqs_disabled(): 0, pid: 30025, name: qemu-system-x86
  INFO: lockdep is turned off.
  Pid: 30025, comm: qemu-system-x86 Not tainted 3.5.0-rc4+ #109
  Call Trace:
   [81088425] __might_sleep+0xf5/0x130
   [81564c6f] mutex_lock_nested+0x2f/0x60
   [a07db7d5] eoifd_event+0x25/0x70 [kvm]
   [a07daea4] kvm_notify_acked_irq+0xa4/0x140 [kvm]
   [a07dae2a] ? kvm_notify_acked_irq+0x2a/0x140 [kvm]
   [a07d9bb4] kvm_ioapic_update_eoi+0x84/0xf0 [kvm]
   [a0806c43] apic_set_eoi+0x123/0x130 [kvm]
   [a0806fd8] apic_reg_write+0x388/0x670 [kvm]
   [a07eb03c] ? vcpu_enter_guest+0x32c/0x740 [kvm]
   [a0807481] kvm_lapic_set_eoi+0x21/0x30 [kvm]
   [a04ba3f9] handle_apic_access+0x69/0x80 [kvm_intel]
   [a04ba02a] vmx_handle_exit+0xaa/0x260 [kvm_intel]
 
 Its RCU from ack notifiers, OK. 

I'm testing a patch that moves all bitmap handling to under
pic/ioapic lock. After this we can teach kvm_set_irq to report
when a bit that is cleared/set is already clear/set, without
races.

And then no tracking will be necessary in irqfd - we can just
call kvm_set_irq(..., 0) and look at the return status.

-- 
MST


[PATCH] kvm: fix race with level interrupts

2012-07-18 Thread Michael S. Tsirkin
When more than 1 source id is in use for the same GSI, we have the
following race related to handling irq_states:

CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0.
CPU 1 sets bit in irq_states.  CPU 1 calls kvm_ioapic_set_irq(1). CPU 0
calls kvm_ioapic_set_irq(0).

Now ioapic thinks the level is 0 but irq_state is not 0.

Note that above is valid behaviour if CPU0 and CPU1 are using different
source ids.

Fix by performing all irq_states bitmap handling under pic/ioapic lock.
This also removes the need for atomics with irq_states handling.

Reported-by: Gleb Natapov g...@redhat.com
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

This survives stress testing for me (though I only tried virtio, not
device assignment).
Avi, Marcelo, though we have not observed this in the field,
it's a bugfix so probably 3.5 material?
I assume yes so the patch is against 3.5-rc7.
Also stable? It's a very old bug.


 arch/x86/include/asm/kvm_host.h | 15 ++-
 arch/x86/kvm/i8259.c| 14 --
 virt/kvm/ioapic.c   | 13 -
 virt/kvm/ioapic.h   |  4 +++-
 virt/kvm/irq_comm.c | 31 ---
 5 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..fe6e9f1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct 
kvm_mmu *mmu,
 void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
 
-int kvm_pic_set_irq(void *opaque, int irq, int level);
+static inline int kvm_irq_line_state(unsigned long *irq_state,
+int irq_source_id, int level)
+{
+   /* Logical OR for level trig interrupt */
+   if (level)
+   __set_bit(irq_source_id, irq_state);
+   else
+   __clear_bit(irq_source_id, irq_state);
+
+   return !!(*irq_state);
+}
+
+int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
level);
+void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
 
 void kvm_inject_nmi(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 81cf4fa..95b504b 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
pic_unlock(s);
 }
 
-int kvm_pic_set_irq(void *opaque, int irq, int level)
+int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
 {
-   struct kvm_pic *s = opaque;
int ret = -1;
+   int level;
 
pic_lock(s);
	if (irq >= 0 && irq < PIC_NUM_PINS) {
+		level = kvm_irq_line_state(&s->irq_states[irq], src_id, l);
		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
		pic_update_irq(s);
		trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
@@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
return ret;
 }
 
+void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
+{
+	int i;
+	pic_lock(s);
+	for (i = 0; i < PIC_NUM_PINS; i++)
+		__clear_bit(src_id, &s->irq_states[i]);
+	pic_unlock(s);
+}
+
 /*
  * acknowledge interrupt 'irq'
  */
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 26fd54d..268b332 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int 
irq)
	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
 }
 
-int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
+int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l)
 {
u32 old_irr;
	u32 mask = 1 << irq;
	union kvm_ioapic_redirect_entry entry;
	int ret = 1;
+	int level;
 
	spin_lock(&ioapic->lock);
	old_irr = ioapic->irr;
	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
		entry = ioapic->redirtbl[irq];
+		level = kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l);
		level ^= entry.fields.polarity;
		if (!level)
			ioapic->irr &= ~mask;
@@ -221,6 +223,15 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, 
int level)
return ret;
 }
 
+void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int src_id)
+{
+	int i;
+	spin_lock(&ioapic->lock);
+	for (i = 0; i < KVM_IOAPIC_NUM_PINS; i++)
+		__clear_bit(src_id, &ioapic->irq_states[i]);
+	spin_unlock(&ioapic->lock);
+}
+
 static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,
 int trigger_mode)
 {
diff --git a/virt/kvm/ioapic.h b/virt/kvm/ioapic.h
index 32872a0..a30abfe 100644
--- a/virt/kvm/ioapic.h
+++ b/virt/kvm/ioapic.h
@@ -74,7 +74,9 @@ void 

Re: [PATCH v5 3/4] kvm: Create kvm_clear_irq()

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 09:27:42AM +0300, Gleb Natapov wrote:
 On Tue, Jul 17, 2012 at 07:14:52PM +0300, Michael S. Tsirkin wrote:
   _Seems_ racy, or _is_ racy?  Please identify the race.
  
  Look at this:
  
  static inline int kvm_irq_line_state(unsigned long *irq_state,
   int irq_source_id, int level)
  {
  /* Logical OR for level trig interrupt */
  if (level)
  set_bit(irq_source_id, irq_state);
  else
  clear_bit(irq_source_id, irq_state);
  
  return !!(*irq_state);
  }
  
  
  Now:
  If other CPU changes some other bit after the atomic change,
  it looks like !!(*irq_state) might return a stale value.
  
  CPU 0 clears bit 0. CPU 1 sets bit 1. CPU 1 sets level to 1.
  If CPU 0 sees a stale value now it will return 0 here
  and interrupt will get cleared.
  
 This will hardly happen on x86 especially since bit is set with
 serialized instruction. But there is actually a race here.
 CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1.
 CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
  Now ioapic thinks the level is 0 but irq_state is not 0.
 
 This untested and un-compiled patch should fix it.

Getting rid of atomics completely makes me more comfortable,
and by moving all bitmap handling to under pic/ioapic lock
we can do just that.
I just tested and posted a patch that fixes the race in this way.
Could you take a look pls?

-- 
MST


[PATCH RFC untested] kvm_set_irq: report coalesced for clear

2012-07-18 Thread Michael S. Tsirkin
This creates a way to detect when kvm_set_irq(...,0) was run
twice with the same source id by returning 0 in this case.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

This is on top of my bugfix patch.  Uncompiled and untested.  Alex, I
think something like this patch will make it possible for you to simply
do
if (kvm_set_irq(..., 0))
    eventfd_signal()

without need for further tricks.
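As a rough illustration of the intended usage (field and function names here are hypothetical; the real irqfd resample code will differ), the EOI handler could then collapse to:

    /* Sketch only: on guest EOI, deassert the line and notify userspace via the
     * resample eventfd, but only if kvm_set_irq(..., 0) reports that the line
     * state actually changed (non-zero return), i.e. it was not already clear. */
    static void irqfd_resampler_ack(struct kvm_irq_ack_notifier *kian)
    {
        struct _irqfd *irqfd = container_of(kian, struct _irqfd, ack_notifier);

        if (kvm_set_irq(irqfd->kvm, irqfd->source_id, irqfd->gsi, 0))
            eventfd_signal(irqfd->resamplefd, 1);
    }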

 arch/x86/include/asm/kvm_host.h |  9 -
 arch/x86/kvm/i8259.c| 11 +++
 virt/kvm/ioapic.c   | 17 ++---
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fe6e9f1..6f5ed0a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -802,16 +802,15 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct 
kvm_mmu *mmu,
 void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
 
-static inline int kvm_irq_line_state(unsigned long *irq_state,
+/* Tweak line state. Return true if state was changed. */
+static inline bool kvm_irq_line_state(unsigned long *irq_state,
 int irq_source_id, int level)
 {
/* Logical OR for level trig interrupt */
if (level)
-   __set_bit(irq_source_id, irq_state);
+   return !__test_and_set_bit(irq_source_id, irq_state);
else
-   __clear_bit(irq_source_id, irq_state);
-
-   return !!(*irq_state);
+   return __test_and_clear_bit(irq_source_id, irq_state);
 }
 
 int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
level);
diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
index 95b504b..44e3635 100644
--- a/arch/x86/kvm/i8259.c
+++ b/arch/x86/kvm/i8259.c
@@ -190,12 +190,15 @@ void kvm_pic_update_irq(struct kvm_pic *s)
 
 int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
 {
-   int ret = -1;
-   int level;
+   int ret, level;
 
	pic_lock(s);
-	if (irq >= 0 && irq < PIC_NUM_PINS) {
-		level = kvm_irq_line_state(&s->irq_states[irq], src_id, l);
+	if (irq < 0 || irq >= PIC_NUM_PINS) {
+		ret = -1;
+	} else if (!kvm_irq_line_state(&s->irq_states[irq], src_id, l)) {
+		ret = 0;
+	} else {
+		level = !!s->irq_states[irq];
		ret = pic_set_irq1(&s->pics[irq >> 3], irq & 7, level);
		pic_update_irq(s);
		trace_kvm_pic_set_irq(irq >> 3, irq & 7, s->pics[irq >> 3].elcr,
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 268b332..edc9445 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -196,18 +196,21 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
irq, int src_id, int l)
u32 old_irr;
	u32 mask = 1 << irq;
	union kvm_ioapic_redirect_entry entry;
-	int ret = 1;
-	int level;
+	int ret, level;
 
	spin_lock(&ioapic->lock);
	old_irr = ioapic->irr;
-	if (irq >= 0 && irq < IOAPIC_NUM_PINS) {
+	if (irq < 0 || irq >= IOAPIC_NUM_PINS) {
+		ret = -1;
+	} else if (!kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l)) {
+		ret = 0;
+	} else {
		entry = ioapic->redirtbl[irq];
-		level = kvm_irq_line_state(&ioapic->irq_states[irq], src_id, l);
-		level ^= entry.fields.polarity;
-		if (!level)
+		level = (!!ioapic->irq_states[irq]) ^ entry.fields.polarity;
+		if (!level) {
			ioapic->irr &= ~mask;
-		else {
+			ret = 1;
+		} else {
			int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);
			ioapic->irr |= mask;
			if ((edge && old_irr != ioapic->irr) ||
-- 
MST


Re: [PATCH] kvm: fix race with level interrupts

2012-07-18 Thread Alex Williamson
On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote:
 When more than 1 source id is in use for the same GSI, we have the
 following race related to handling irq_states:
 
 CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0.
 CPU 1 sets bit in irq_states.  CPU 1 calls kvm_ioapic_set_irq(1). CPU 0
 calls kvm_ioapic_set_irq(0).
 
 Now ioapic thinks the level is 0 but irq_state is not 0.
 
 Note that above is valid behaviour if CPU0 and CPU1 are using different
 source ids.
 
 Fix by performing all irq_states bitmap handling under pic/ioapic lock.
 This also removes the need for atomics with irq_states handling.
 
 Reported-by: Gleb Natapov g...@redhat.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 This survives stress testing for me (though I only tried virtio, not
 device assignment).
 Avi, Marcelo, though we have not observed this in the field,
 it's a bugfix so probably 3.5 material?
 I assume yes so the patch is against 3.5-rc7.
 Also stable? It's a very old bug.
 
 
  arch/x86/include/asm/kvm_host.h | 15 ++-
  arch/x86/kvm/i8259.c| 14 --
  virt/kvm/ioapic.c   | 13 -
  virt/kvm/ioapic.h   |  4 +++-
  virt/kvm/irq_comm.c | 31 ---
  5 files changed, 45 insertions(+), 32 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index db7c1f2..fe6e9f1 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
 struct kvm_mmu *mmu,
  void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
  
 -int kvm_pic_set_irq(void *opaque, int irq, int level);
 +static inline int kvm_irq_line_state(unsigned long *irq_state,
 +  int irq_source_id, int level)
 +{
 + /* Logical OR for level trig interrupt */
 + if (level)
 + __set_bit(irq_source_id, irq_state);
 + else
 + __clear_bit(irq_source_id, irq_state);
 +
 + return !!(*irq_state);
 +}
 +
 +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
 level);
 +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
  
  void kvm_inject_nmi(struct kvm_vcpu *vcpu);
  
 diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
 index 81cf4fa..95b504b 100644
 --- a/arch/x86/kvm/i8259.c
 +++ b/arch/x86/kvm/i8259.c
 @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
   pic_unlock(s);
  }
  
 -int kvm_pic_set_irq(void *opaque, int irq, int level)
 +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)

Please use irq_source_id and keep level.  I hate reviewing code
where I have to differentiate 'l' vs '1'.  Same below, mixing src_id and
irq_source_id in various places.

  {
 - struct kvm_pic *s = opaque;
   int ret = -1;
 + int level;

And you'd avoid this variable

  
   pic_lock(s);
   if (irq = 0  irq  PIC_NUM_PINS) {
 + level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
   ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
   pic_update_irq(s);
   trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
 @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
   return ret;
  }
  
 +void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
 +{
 + int i;
 + pic_lock(s);
 + for (i = 0; i  PIC_NUM_PINS; i++)
 + __clear_bit(src_id, s-irq_states[i]);
 + pic_unlock(s);
 +}
 +
  /*
   * acknowledge interrupt 'irq'
   */
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 26fd54d..268b332 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, 
 int irq)
   return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe);
  }
  
 -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
 +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l)
  {
   u32 old_irr;
   u32 mask = 1  irq;
   union kvm_ioapic_redirect_entry entry;
   int ret = 1;
 + int level;
  
   spin_lock(ioapic-lock);
   old_irr = ioapic-irr;
   if (irq = 0  irq  IOAPIC_NUM_PINS) {
   entry = ioapic-redirtbl[irq];
 + level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, l);
   level ^= entry.fields.polarity;
   if (!level)
   ioapic-irr = ~mask;
 @@ -221,6 +223,15 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
 irq, int level)
   return ret;
  }
  
 +void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int src_id)
 +{
 + int i;
 + spin_lock(ioapic-lock);
 + for (i = 0; i  KVM_IOAPIC_NUM_PINS; i++)
 + __clear_bit(src_id, ioapic-irq_states[i]);
 + 

Re: [PATCH] kvm: fix race with level interrupts

2012-07-18 Thread Alex Williamson
On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote:
 When more than 1 source id is in use for the same GSI, we have the
 following race related to handling irq_states:
 
 CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0.
 CPU 1 sets bit in irq_states.  CPU 1 calls kvm_ioapic_set_irq(1). CPU 0
 calls kvm_ioapic_set_irq(0).
 
 Now ioapic thinks the level is 0 but irq_state is not 0.
 
 Note that above is valid behaviour if CPU0 and CPU1 are using different
 source ids.
 
 Fix by performing all irq_states bitmap handling under pic/ioapic lock.
 This also removes the need for atomics with irq_states handling.
 
 Reported-by: Gleb Natapov g...@redhat.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 This survives stress testing for me (though I only tried virtio, not
 device assignment).
 Avi, Marcelo, though we have not observed this in the field,
 it's a bugfix so probably 3.5 material?
 I assume yes so the patch is against 3.5-rc7.
 Also stable? It's a very old bug.
 
 
  arch/x86/include/asm/kvm_host.h | 15 ++-
  arch/x86/kvm/i8259.c| 14 --
  virt/kvm/ioapic.c   | 13 -
  virt/kvm/ioapic.h   |  4 +++-
  virt/kvm/irq_comm.c | 31 ---
  5 files changed, 45 insertions(+), 32 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index db7c1f2..fe6e9f1 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
 struct kvm_mmu *mmu,
  void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
  
 -int kvm_pic_set_irq(void *opaque, int irq, int level);
 +static inline int kvm_irq_line_state(unsigned long *irq_state,
 +  int irq_source_id, int level)

This should probably be __kvm_irq_line_state given the calling
restrictions.

 +{
 + /* Logical OR for level trig interrupt */
 + if (level)
 + __set_bit(irq_source_id, irq_state);
 + else
 + __clear_bit(irq_source_id, irq_state);
 +
 + return !!(*irq_state);
 +}
 +
 +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
 level);
 +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
  
  void kvm_inject_nmi(struct kvm_vcpu *vcpu);
  
 diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
 index 81cf4fa..95b504b 100644
 --- a/arch/x86/kvm/i8259.c
 +++ b/arch/x86/kvm/i8259.c
 @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
   pic_unlock(s);
  }
  
 -int kvm_pic_set_irq(void *opaque, int irq, int level)
 +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
  {
 - struct kvm_pic *s = opaque;
   int ret = -1;
 + int level;
  
   pic_lock(s);
   if (irq = 0  irq  PIC_NUM_PINS) {
 + level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
   ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
   pic_update_irq(s);
   trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
 @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
   return ret;
  }
  
 +void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
 +{
 + int i;
 + pic_lock(s);
 + for (i = 0; i  PIC_NUM_PINS; i++)
 + __clear_bit(src_id, s-irq_states[i]);
 + pic_unlock(s);
 +}
 +
  /*
   * acknowledge interrupt 'irq'
   */
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 26fd54d..268b332 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, 
 int irq)
   return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe);
  }
  
 -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
 +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int l)
  {
   u32 old_irr;
   u32 mask = 1  irq;
   union kvm_ioapic_redirect_entry entry;
   int ret = 1;
 + int level;
  
   spin_lock(ioapic-lock);
   old_irr = ioapic-irr;
   if (irq = 0  irq  IOAPIC_NUM_PINS) {
   entry = ioapic-redirtbl[irq];
 + level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, l);
   level ^= entry.fields.polarity;
   if (!level)
   ioapic-irr = ~mask;
 @@ -221,6 +223,15 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
 irq, int level)
   return ret;
  }
  
 +void kvm_ioapic_clear_all(struct kvm_ioapic *ioapic, int src_id)
 +{
 + int i;
 + spin_lock(ioapic-lock);
 + for (i = 0; i  KVM_IOAPIC_NUM_PINS; i++)
 + __clear_bit(src_id, ioapic-irq_states[i]);
 + spin_unlock(ioapic-lock);
 +}
 +
  static void __kvm_ioapic_update_eoi(struct kvm_ioapic *ioapic, int vector,

Re: [PATCH RFC untested] kvm_set_irq: report coalesced for clear

2012-07-18 Thread Alex Williamson
On Thu, 2012-07-19 at 01:11 +0300, Michael S. Tsirkin wrote:
 This creates a way to detect when kvm_set_irq(...,0) was run
 twice with the same source id by returning 0 in this case.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 This is on top of my bugfix patch.  Uncompiled and untested.  Alex, I
 think something like this patch will make it possible for you to simply
 do
   if (kvm_set_irq(, 0))
   eventfd_signal()
 
 without need for further tricks.
 
  arch/x86/include/asm/kvm_host.h |  9 -
  arch/x86/kvm/i8259.c| 11 +++
  virt/kvm/ioapic.c   | 17 ++---
  3 files changed, 21 insertions(+), 16 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index fe6e9f1..6f5ed0a 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -802,16 +802,15 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
 struct kvm_mmu *mmu,
  void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
  
 -static inline int kvm_irq_line_state(unsigned long *irq_state,
 +/* Tweak line state. Return true if state was changed. */
 +static inline bool kvm_irq_line_state(unsigned long *irq_state,
int irq_source_id, int level)
  {
   /* Logical OR for level trig interrupt */
   if (level)
 - __set_bit(irq_source_id, irq_state);
 + return !__test_and_set_bit(irq_source_id, irq_state);
   else
 - __clear_bit(irq_source_id, irq_state);
 -
 - return !!(*irq_state);
 + return __test_and_clear_bit(irq_source_id, irq_state);
  }
  
  int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
 level);
 diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
 index 95b504b..44e3635 100644
 --- a/arch/x86/kvm/i8259.c
 +++ b/arch/x86/kvm/i8259.c
 @@ -190,12 +190,15 @@ void kvm_pic_update_irq(struct kvm_pic *s)
  
  int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
  {
 - int ret = -1;
 - int level;
 + int ret, level;
  
   pic_lock(s);
 - if (irq = 0  irq  PIC_NUM_PINS) {
 - level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
 + if (irq  0 || irq = PIC_NUM_PINS) {

Why test this under lock?  Return error before lock, then it becomes

int ret = 0;

if (irq < 0 || irq >= PIC_NUM_PINS)
	return -1;

pic_lock(s);

if (kvm_irq_line_state(&s->irq_states[irq], irq_source_id, level)) {
	level = !!s->irq_states[irq];
...
}

pic_unlock(s);

return ret;


 + ret = -1;
 + } else if (!kvm_irq_line_state(s-irq_states[irq], src_id, l)) {
 + ret = 0;
 + } else {
 + level = !!s-irq_states[irq];
   ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
   pic_update_irq(s);
   trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 268b332..edc9445 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -196,18 +196,21 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
 irq, int src_id, int l)
   u32 old_irr;
   u32 mask = 1  irq;
   union kvm_ioapic_redirect_entry entry;
 - int ret = 1;
 - int level;
 + int ret, level;
  
   spin_lock(ioapic-lock);
   old_irr = ioapic-irr;
 - if (irq = 0  irq  IOAPIC_NUM_PINS) {
 + if (irq  0 || irq = IOAPIC_NUM_PINS) {

Same here

 + ret = -1;
 + } else if (!kvm_irq_line_state(ioapic-irq_states[irq], src_id, l)) {
 + ret = 0;
 + } else {
   entry = ioapic-redirtbl[irq];
 - level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, l);
 - level ^= entry.fields.polarity;
 - if (!level)
 + level = (!!ioapic-irq_states[irq]) ^ entry.fields.polarity;
 + if (!level) {
   ioapic-irr = ~mask;
 - else {
 + ret = 1;
 + } else {
   int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);
   ioapic-irr |= mask;
   if ((edge  old_irr != ioapic-irr) ||





Re: [PATCH] kvm: fix race with level interrupts

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 04:16:14PM -0600, Alex Williamson wrote:
 On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote:
  When more than 1 source id is in use for the same GSI, we have the
  following race related to handling irq_states:
  
  CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0.
  CPU 1 sets bit in irq_states.  CPU 1 calls kvm_ioapic_set_irq(1). CPU 0
  calls kvm_ioapic_set_irq(0).
  
  Now ioapic thinks the level is 0 but irq_state is not 0.
  
  Note that above is valid behaviour if CPU0 and CPU1 are using different
  source ids.
  
  Fix by performing all irq_states bitmap handling under pic/ioapic lock.
  This also removes the need for atomics with irq_states handling.
  
  Reported-by: Gleb Natapov g...@redhat.com
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
  
  This survives stress testing for me (though I only tried virtio, not
  device assignment).
  Avi, Marcelo, though we have not observed this in the field,
  it's a bugfix so probably 3.5 material?
  I assume yes so the patch is against 3.5-rc7.
  Also stable? It's a very old bug.
  
  
   arch/x86/include/asm/kvm_host.h | 15 ++-
   arch/x86/kvm/i8259.c| 14 --
   virt/kvm/ioapic.c   | 13 -
   virt/kvm/ioapic.h   |  4 +++-
   virt/kvm/irq_comm.c | 31 ---
   5 files changed, 45 insertions(+), 32 deletions(-)
  
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index db7c1f2..fe6e9f1 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
  struct kvm_mmu *mmu,
   void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception 
  *fault);
   bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
   
  -int kvm_pic_set_irq(void *opaque, int irq, int level);
  +static inline int kvm_irq_line_state(unsigned long *irq_state,
  +int irq_source_id, int level)
  +{
  +   /* Logical OR for level trig interrupt */
  +   if (level)
  +   __set_bit(irq_source_id, irq_state);
  +   else
  +   __clear_bit(irq_source_id, irq_state);
  +
  +   return !!(*irq_state);
  +}
  +
  +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
  level);
  +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
   
   void kvm_inject_nmi(struct kvm_vcpu *vcpu);
   
  diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
  index 81cf4fa..95b504b 100644
  --- a/arch/x86/kvm/i8259.c
  +++ b/arch/x86/kvm/i8259.c
  @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
  pic_unlock(s);
   }
   
  -int kvm_pic_set_irq(void *opaque, int irq, int level)
  +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
 
 Please use irq_source_id and keep level.  I hate reviewing code
 where I have to differentiate 'l' vs '1'.

l is an illegal variable name?  Switch to a different font.

  Same below, mixing src_id and
 irq_source_id in various places.

irq_source_id is too long, coding style says variable names
should be short:
LOCAL variable names should be short, and to the point.

   {
  -   struct kvm_pic *s = opaque;
  int ret = -1;
  +   int level;
 
 And you'd avoid this variable

by giving a different meaning to the same variable in
different parts of a function? Now *that* is a bad idea.

l is level for a specific source, level is the resulting irq value.

   
  pic_lock(s);
  if (irq = 0  irq  PIC_NUM_PINS) {
  +   level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
  ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
  pic_update_irq(s);
  trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
  @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
  return ret;
   }
   
  +void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
  +{
  +   int i;
  +   pic_lock(s);
  +   for (i = 0; i  PIC_NUM_PINS; i++)
  +   __clear_bit(src_id, s-irq_states[i]);
  +   pic_unlock(s);
  +}
  +
   /*
* acknowledge interrupt 'irq'
*/
  diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
  index 26fd54d..268b332 100644
  --- a/virt/kvm/ioapic.c
  +++ b/virt/kvm/ioapic.c
  @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, 
  int irq)
  return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe);
   }
   
  -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
  +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int 
  l)
   {
  u32 old_irr;
  u32 mask = 1  irq;
  union kvm_ioapic_redirect_entry entry;
  int ret = 1;
  +   int level;
   
  spin_lock(ioapic-lock);
  old_irr = ioapic-irr;
  if (irq = 0  irq  IOAPIC_NUM_PINS) {
  entry = ioapic-redirtbl[irq];
  +   level = 

Re: [RFC-v2 0/4] tcm_vhost+cmwq fabric driver code for-3.6

2012-07-18 Thread Nicholas A. Bellinger
On Wed, 2012-07-18 at 08:42 -0500, Anthony Liguori wrote:
 On 07/17/2012 04:50 PM, Nicholas A. Bellinger wrote:
  On Tue, 2012-07-17 at 13:55 -0500, Anthony Liguori wrote:
  On 07/17/2012 10:05 AM, Michael S. Tsirkin wrote:
  On Wed, Jul 11, 2012 at 09:15:00PM +, Nicholas A. Bellinger wrote:
 

SNIP

 
  But I do think the kernel should carefully consider whether it wants to 
  support
  an interface like this.  This an extremely complicated ABI with a lot of 
  subtle
  details around state and compatibility.
 
  Are you absolutely confident that you can support a userspace application 
  that
  expects to get exactly the same response from all possible commands in 20 
  kernel
  versions from now?  Virtualization requires absolutely precise 
  compatibility in
  terms of bugs and features.  This is probably not something the TCM stack 
  has
  had to consider yet.
 
 
  We most certainly have thought about long term userspace compatibility
  with TCM.  Our userspace code (that's now available in all major
  distros) is completely forward-compatible with new fabric modules such
  as tcm_vhost.  No update required.
 
 I'm not sure we're talking about the same thing when we say compatibility.
 
 I'm not talking about the API.  I'm talking about the behavior of the 
 commands 
 that tcm_vhost supports.
 

OK, I understand what you're getting at now.

 If you add support for a new command, you need to provide userspace a way to 
 disable this command.  If you change what gets reported for VPD, you need to 
 provide userspace a way to make VPD look like what it did in a previous 
 version.
 
 Basically, you need to be able to make a TCM device behave 100% the same as 
 it 
 did in an older version of the kernel.
 
 This is unique to virtualization due to live migration.  If you migrate from 
 a 
 3.6 kernel to a 3.8 kernel, you need to make sure that the 3.8 kernel's TCM 
 device behaves exactly like the 3.6 kernel because the guest that is 
 interacting 
 with it does not realize that live migration happened.
 
 Yes, you can add knobs via configfs to control this behavior, but I think the 
 question is, what's the plan for this?
 

So we already allow some types of CDB emulation to be toggled via
backend device attributes:

root@tifa:/usr/src/target-pending.git# tree 
/sys/kernel/config/target/core/iblock_2/fioa/attrib/
/sys/kernel/config/target/core/iblock_2/fioa/attrib/
├── block_size
├── emulate_dpo
├── emulate_fua_read
├── emulate_fua_write
├── emulate_rest_reord
├── emulate_tas
├── emulate_tpu
├── emulate_tpws
├── emulate_ua_intlck_ctrl
├── emulate_write_cache
├── enforce_pr_isids

SNIP

Some things like SPC-3 persistent reservations + implicit/explicit ALUA
multipath currently can't be disabled, but adding two more backend
attributes to disable/enable this logic individually is easy enough to do.

So that said, I don't have a problem with adding the necessary device
attributes to limit what type of CDBs a backend device is capable of
processing.

Trying to limiting this per-guest (instead of per-device) is where
things get a little more tricky..

 BTW, I think this is a good thing to cover in 
 Documentation/vhost/tcm_vhost.txt. 
   I think that's probably the only change that's needed here.
 

Sure, but I'll need to know what else you'd like to optionally
restrict in terms of CDB processing that's not already there.

Thanks for your feedback!

--nab



Re: [PATCH] kvm: fix race with level interrupts

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 04:26:41PM -0600, Alex Williamson wrote:
 On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote:
  When more than 1 source id is in use for the same GSI, we have the
  following race related to handling irq_states:
  
  CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0.
  CPU 1 sets bit in irq_states.  CPU 1 calls kvm_ioapic_set_irq(1). CPU 0
  calls kvm_ioapic_set_irq(0).
  
  Now ioapic thinks the level is 0 but irq_state is not 0.
  
  Note that above is valid behaviour if CPU0 and CPU1 are using different
  source ids.
  
  Fix by performing all irq_states bitmap handling under pic/ioapic lock.
  This also removes the need for atomics with irq_states handling.
  
  Reported-by: Gleb Natapov g...@redhat.com
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
  
  This survives stress testing for me (though I only tried virtio, not
  device assignment).
  Avi, Marcelo, though we have not observed this in the field,
  it's a bugfix so probably 3.5 material?
  I assume yes so the patch is against 3.5-rc7.
  Also stable? It's a very old bug.
  
  
   arch/x86/include/asm/kvm_host.h | 15 ++-
   arch/x86/kvm/i8259.c| 14 --
   virt/kvm/ioapic.c   | 13 -
   virt/kvm/ioapic.h   |  4 +++-
   virt/kvm/irq_comm.c | 31 ---
   5 files changed, 45 insertions(+), 32 deletions(-)
  
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index db7c1f2..fe6e9f1 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
  struct kvm_mmu *mmu,
   void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception 
  *fault);
   bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
   
  -int kvm_pic_set_irq(void *opaque, int irq, int level);
  +static inline int kvm_irq_line_state(unsigned long *irq_state,
  +int irq_source_id, int level)
 
 This should probably be __kvm_irq_line_state given the calling
 restrictions.

It's such a trivial helper, do we need to split hairs about this?

Look it's not a good time for minor coding style nits.
3.5 is imminent, it's about 1am here and I really don't have time to retest
today, so we'll release another kernel with a bug.

Could you focus on reviewing the functionality and correctness, and
leave ideas for better variable naming aside until 3.6?

  +{
  +   /* Logical OR for level trig interrupt */
  +   if (level)
  +   __set_bit(irq_source_id, irq_state);
  +   else
  +   __clear_bit(irq_source_id, irq_state);
  +
  +   return !!(*irq_state);
  +}
  +
  +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
  level);
  +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);
   
   void kvm_inject_nmi(struct kvm_vcpu *vcpu);
   
  diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
  index 81cf4fa..95b504b 100644
  --- a/arch/x86/kvm/i8259.c
  +++ b/arch/x86/kvm/i8259.c
  @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
  pic_unlock(s);
   }
   
  -int kvm_pic_set_irq(void *opaque, int irq, int level)
  +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
   {
  -   struct kvm_pic *s = opaque;
  int ret = -1;
  +   int level;
   
  pic_lock(s);
  if (irq = 0  irq  PIC_NUM_PINS) {
  +   level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
  ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
  pic_update_irq(s);
  trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
  @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
  return ret;
   }
   
  +void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
  +{
  +   int i;
  +   pic_lock(s);
  +   for (i = 0; i  PIC_NUM_PINS; i++)
  +   __clear_bit(src_id, s-irq_states[i]);
  +   pic_unlock(s);
  +}
  +
   /*
* acknowledge interrupt 'irq'
*/
  diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
  index 26fd54d..268b332 100644
  --- a/virt/kvm/ioapic.c
  +++ b/virt/kvm/ioapic.c
  @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, 
  int irq)
  return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe);
   }
   
  -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
  +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, int 
  l)
   {
  u32 old_irr;
  u32 mask = 1  irq;
  union kvm_ioapic_redirect_entry entry;
  int ret = 1;
  +   int level;
   
  spin_lock(ioapic-lock);
  old_irr = ioapic-irr;
  if (irq = 0  irq  IOAPIC_NUM_PINS) {
  entry = ioapic-redirtbl[irq];
  +   level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, l);
  level ^= entry.fields.polarity;
  if (!level)
 

Re: [PATCH RFC untested] kvm_set_irq: report coalesced for clear

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 04:40:38PM -0600, Alex Williamson wrote:
 On Thu, 2012-07-19 at 01:11 +0300, Michael S. Tsirkin wrote:
  This creates a way to detect when kvm_set_irq(...,0) was run
  twice with the same source id by returning 0 in this case.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
  
  This is on top of my bugfix patch.  Uncompiled and untested.  Alex, I
  think something like this patch will make it possible for you to simply
  do
  if (kvm_set_irq(, 0))
  eventfd_signal()
  
  without need for further tricks.
  
   arch/x86/include/asm/kvm_host.h |  9 -
   arch/x86/kvm/i8259.c| 11 +++
   virt/kvm/ioapic.c   | 17 ++---
   3 files changed, 21 insertions(+), 16 deletions(-)
  
  diff --git a/arch/x86/include/asm/kvm_host.h 
  b/arch/x86/include/asm/kvm_host.h
  index fe6e9f1..6f5ed0a 100644
  --- a/arch/x86/include/asm/kvm_host.h
  +++ b/arch/x86/include/asm/kvm_host.h
  @@ -802,16 +802,15 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
  struct kvm_mmu *mmu,
   void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception 
  *fault);
   bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
   
  -static inline int kvm_irq_line_state(unsigned long *irq_state,
  +/* Tweak line state. Return true if state was changed. */
  +static inline bool kvm_irq_line_state(unsigned long *irq_state,
   int irq_source_id, int level)
   {
  /* Logical OR for level trig interrupt */
  if (level)
  -   __set_bit(irq_source_id, irq_state);
  +   return !__test_and_set_bit(irq_source_id, irq_state);
  else
  -   __clear_bit(irq_source_id, irq_state);
  -
  -   return !!(*irq_state);
  +   return __test_and_clear_bit(irq_source_id, irq_state);
   }
   
   int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
  level);
  diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
  index 95b504b..44e3635 100644
  --- a/arch/x86/kvm/i8259.c
  +++ b/arch/x86/kvm/i8259.c
  @@ -190,12 +190,15 @@ void kvm_pic_update_irq(struct kvm_pic *s)
   
   int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
   {
  -   int ret = -1;
  -   int level;
  +   int ret, level;
   
  pic_lock(s);
  -   if (irq = 0  irq  PIC_NUM_PINS) {
  -   level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
  +   if (irq  0 || irq = PIC_NUM_PINS) {
 
 Why test this under lock?  Return error before lock, then it becomes
 
 int ret = 0;
 
 if (irq < 0 || irq >= PIC_NUM_PINS)
 	return -1;
 
 pic_lock(s);
 
 if (kvm_irq_line_state(&s->irq_states[irq], irq_source_id, level)) {
 	level = !!s->irq_states[irq];
   ...
 }
 
 pic_unlock(s);
 
 return ret;
 
 
  +   ret = -1;
  +   } else if (!kvm_irq_line_state(s-irq_states[irq], src_id, l)) {
  +   ret = 0;
  +   } else {
  +   level = !!s-irq_states[irq];
  ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
  pic_update_irq(s);
  trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
  diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
  index 268b332..edc9445 100644
  --- a/virt/kvm/ioapic.c
  +++ b/virt/kvm/ioapic.c
  @@ -196,18 +196,21 @@ int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int 
  irq, int src_id, int l)
  u32 old_irr;
  u32 mask = 1  irq;
  union kvm_ioapic_redirect_entry entry;
  -   int ret = 1;
  -   int level;
  +   int ret, level;
   
  spin_lock(ioapic-lock);
  old_irr = ioapic-irr;
  -   if (irq = 0  irq  IOAPIC_NUM_PINS) {
  +   if (irq  0 || irq = IOAPIC_NUM_PINS) {
 
 Same here
 
  +   ret = -1;
  +   } else if (!kvm_irq_line_state(ioapic-irq_states[irq], src_id, l)) {
  +   ret = 0;
  +   } else {
  entry = ioapic-redirtbl[irq];
  -   level = kvm_irq_line_state(ioapic-irq_states[irq], src_id, l);
  -   level ^= entry.fields.polarity;
  -   if (!level)
  +   level = (!!ioapic-irq_states[irq]) ^ entry.fields.polarity;
  +   if (!level) {
  ioapic-irr = ~mask;
  -   else {
  +   ret = 1;
  +   } else {
  int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);
  ioapic-irr |= mask;
  if ((edge  old_irr != ioapic-irr) ||
 


Sure, go ahead. I just sent this to show how to address the locking with
the EOI patches; I don't intend to pursue it myself.



Re: [PATCH] kvm: fix race with level interrupts

2012-07-18 Thread Alex Williamson
On Thu, 2012-07-19 at 01:49 +0300, Michael S. Tsirkin wrote:
 On Wed, Jul 18, 2012 at 04:26:41PM -0600, Alex Williamson wrote:
  On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote:
   When more than 1 source id is in use for the same GSI, we have the
   following race related to handling irq_states:
   
   CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0.
   CPU 1 sets bit in irq_states.  CPU 1 calls kvm_ioapic_set_irq(1). CPU 0
   calls kvm_ioapic_set_irq(0).
   
   Now ioapic thinks the level is 0 but irq_state is not 0.
   
   Note that above is valid behaviour if CPU0 and CPU1 are using different
   source ids.
   
   Fix by performing all irq_states bitmap handling under pic/ioapic lock.
   This also removes the need for atomics with irq_states handling.
   
   Reported-by: Gleb Natapov g...@redhat.com
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   ---
   
   This survives stress testing for me (though I only tried virtio, not
   device assignment).
   Avi, Marcelo, though we have not observed this in the field,
   it's a bugfix so probably 3.5 material?
   I assume yes so the patch is against 3.5-rc7.
   Also stable? It's a very old bug.
   
   
arch/x86/include/asm/kvm_host.h | 15 ++-
arch/x86/kvm/i8259.c| 14 --
virt/kvm/ioapic.c   | 13 -
virt/kvm/ioapic.h   |  4 +++-
virt/kvm/irq_comm.c | 31 ---
5 files changed, 45 insertions(+), 32 deletions(-)
   
   diff --git a/arch/x86/include/asm/kvm_host.h 
   b/arch/x86/include/asm/kvm_host.h
   index db7c1f2..fe6e9f1 100644
   --- a/arch/x86/include/asm/kvm_host.h
   +++ b/arch/x86/include/asm/kvm_host.h
   @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
   struct kvm_mmu *mmu,
void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception 
   *fault);
bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);

   -int kvm_pic_set_irq(void *opaque, int irq, int level);
   +static inline int kvm_irq_line_state(unsigned long *irq_state,
   +  int irq_source_id, int level)
  
  This should probably be __kvm_irq_line_state given the calling
  restrictions.
 
 It's such a trivial helper, do we need to split hairs about this?
 
 Look it's not a good time for minor coding style nits.
 3.5 is imminent, it's about 1am here and I really don't have time to retest
 today, so we'll release another kernel with a bug.
 
 Could you focus on reviewing the functionality and correctness, and
 leave ideas for better variable naming aside until 3.6?

Please get off your high horse, this bug has existed for a long time and
nobody has noticed.  __ indicates locking and makes people think twice
about arbitrarily calling it.  Correct naming prevents future bugs,
which is something you've been riding me on in other areas...

   +{
   + /* Logical OR for level trig interrupt */
   + if (level)
   + __set_bit(irq_source_id, irq_state);
   + else
   + __clear_bit(irq_source_id, irq_state);
   +
   + return !!(*irq_state);
   +}
   +
   +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
   level);
   +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);

void kvm_inject_nmi(struct kvm_vcpu *vcpu);

   diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
   index 81cf4fa..95b504b 100644
   --- a/arch/x86/kvm/i8259.c
   +++ b/arch/x86/kvm/i8259.c
   @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
 pic_unlock(s);
}

   -int kvm_pic_set_irq(void *opaque, int irq, int level)
   +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
{
   - struct kvm_pic *s = opaque;
 int ret = -1;
   + int level;

 pic_lock(s);
 if (irq = 0  irq  PIC_NUM_PINS) {
   + level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
 ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
 pic_update_irq(s);
 trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
   @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
 return ret;
}

   +void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
   +{
   + int i;
   + pic_lock(s);
   + for (i = 0; i  PIC_NUM_PINS; i++)
   + __clear_bit(src_id, s-irq_states[i]);
   + pic_unlock(s);
   +}
   +
/*
 * acknowledge interrupt 'irq'
 */
   diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
   index 26fd54d..268b332 100644
   --- a/virt/kvm/ioapic.c
   +++ b/virt/kvm/ioapic.c
   @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic 
   *ioapic, int irq)
 return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe);
}

   -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
   +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, 
   int l)
{
 u32 old_irr;
 

Re: [PATCH] kvm: fix race with level interrupts

2012-07-18 Thread Alex Williamson
On Thu, 2012-07-19 at 01:44 +0300, Michael S. Tsirkin wrote:
 On Wed, Jul 18, 2012 at 04:16:14PM -0600, Alex Williamson wrote:
  On Thu, 2012-07-19 at 00:52 +0300, Michael S. Tsirkin wrote:
   When more than 1 source id is in use for the same GSI, we have the
   following race related to handling irq_states:
   
   CPU 0 clears bit in irq_states. CPU 0 reads irq_state as 0.
   CPU 1 sets bit in irq_states.  CPU 1 calls kvm_ioapic_set_irq(1). CPU 0
   calls kvm_ioapic_set_irq(0).
   
   Now ioapic thinks the level is 0 but irq_state is not 0.
   
   Note that above is valid behaviour if CPU0 and CPU1 are using different
   source ids.
   
   Fix by performing all irq_states bitmap handling under pic/ioapic lock.
   This also removes the need for atomics with irq_states handling.
   
   Reported-by: Gleb Natapov g...@redhat.com
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   ---
   
   This survives stress testing for me (though I only tried virtio, not
   device assignment).
   Avi, Marcelo, though we have not observed this in the field,
   it's a bugfix so probably 3.5 material?
   I assume yes so the patch is against 3.5-rc7.
   Also stable? It's a very old bug.
   
   
arch/x86/include/asm/kvm_host.h | 15 ++-
arch/x86/kvm/i8259.c| 14 --
virt/kvm/ioapic.c   | 13 -
virt/kvm/ioapic.h   |  4 +++-
virt/kvm/irq_comm.c | 31 ---
5 files changed, 45 insertions(+), 32 deletions(-)
   
   diff --git a/arch/x86/include/asm/kvm_host.h 
   b/arch/x86/include/asm/kvm_host.h
   index db7c1f2..fe6e9f1 100644
   --- a/arch/x86/include/asm/kvm_host.h
   +++ b/arch/x86/include/asm/kvm_host.h
   @@ -802,7 +802,20 @@ int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, 
   struct kvm_mmu *mmu,
void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception 
   *fault);
bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);

   -int kvm_pic_set_irq(void *opaque, int irq, int level);
   +static inline int kvm_irq_line_state(unsigned long *irq_state,
   +  int irq_source_id, int level)
   +{
   + /* Logical OR for level trig interrupt */
   + if (level)
   + __set_bit(irq_source_id, irq_state);
   + else
   + __clear_bit(irq_source_id, irq_state);
   +
   + return !!(*irq_state);
   +}
   +
   +int kvm_pic_set_irq(struct kvm_pic *pic, int irq, int irq_source_id, int 
   level);
   +void kvm_pic_clear_all(struct kvm_pic *pic, int irq_source_id);

void kvm_inject_nmi(struct kvm_vcpu *vcpu);

   diff --git a/arch/x86/kvm/i8259.c b/arch/x86/kvm/i8259.c
   index 81cf4fa..95b504b 100644
   --- a/arch/x86/kvm/i8259.c
   +++ b/arch/x86/kvm/i8259.c
   @@ -188,13 +188,14 @@ void kvm_pic_update_irq(struct kvm_pic *s)
 pic_unlock(s);
}

   -int kvm_pic_set_irq(void *opaque, int irq, int level)
   +int kvm_pic_set_irq(struct kvm_pic *s, int irq, int src_id, int l)
  
  Please use irq_source_id and keep level.  I hate reviewing code
  where I have to differentiate 'l' vs '1'.
 
 l is an illegal variable name?  Switch to a different font.

WTF

   Same below, mixing src_id and
  irq_source_id in various places.
 
 irq_source_id is too long, coding style says variable names
 should be short:
   LOCAL variable names should be short, and to the point.

irq_source_id is used all over the code, so either convert them all or
be consistent.

{
   - struct kvm_pic *s = opaque;
 int ret = -1;
   + int level;
  
  And you'd avoid this variable
 
 by giving a different meaning to the same variable in
 different parts of a function? Now *that* is a bad idea.
 
 l is level for a specific source, level is the resulting irq value.


Then maybe 'l' isn't a meaningful variable name.


 pic_lock(s);
 if (irq = 0  irq  PIC_NUM_PINS) {
   + level = kvm_irq_line_state(s-irq_states[irq], src_id, l);
 ret = pic_set_irq1(s-pics[irq  3], irq  7, level);
 pic_update_irq(s);
 trace_kvm_pic_set_irq(irq  3, irq  7, s-pics[irq  3].elcr,
   @@ -205,6 +206,15 @@ int kvm_pic_set_irq(void *opaque, int irq, int level)
 return ret;
}

   +void kvm_pic_clear_all(struct kvm_pic *s, int src_id)
   +{
   + int i;
   + pic_lock(s);
   + for (i = 0; i  PIC_NUM_PINS; i++)
   + __clear_bit(src_id, s-irq_states[i]);
   + pic_unlock(s);
   +}
   +
/*
 * acknowledge interrupt 'irq'
 */
   diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
   index 26fd54d..268b332 100644
   --- a/virt/kvm/ioapic.c
   +++ b/virt/kvm/ioapic.c
   @@ -191,17 +191,19 @@ static int ioapic_deliver(struct kvm_ioapic 
   *ioapic, int irq)
 return kvm_irq_delivery_to_apic(ioapic-kvm, NULL, irqe);
}

   -int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int level)
   +int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int src_id, 
   int l)
{
 u32 

Re: UIO: missing resource mapping

2012-07-18 Thread Hans J. Koch
On Wed, Jul 18, 2012 at 12:40:47PM +0200, Dominic Eschweiler wrote:
 Am Montag, den 16.07.2012, 23:58 +0200 schrieb Hans J. Koch:
  Try to hack up a patch to add generic BAR mapping to uio_pci_generic.c
  and post it for review.
  
 
 Here we go ...

Great! I'm a bit under time pressure with my current work ATM. A call to
open("/dev/hansjkoch", O_NONBLOCK) only returns -EBUSY. So it'll take me one or
two days to review it thoroughly. But at first sight, it is what I had in mind.

You'll hear from me soon, thanks for your work! Comments and reviews from
others are welcome...

Hans

  
 Signed-off-by: Dominic Eschweiler eschwei...@fias.uni-frankfurt.de
 diff --git a/drivers/uio/uio_pci_generic.c b/drivers/uio/uio_pci_generic.c
 index 0bd08ef..e25991e 100644
 --- a/drivers/uio/uio_pci_generic.c
 +++ b/drivers/uio/uio_pci_generic.c
 @@ -25,10 +25,12 @@
  #include <linux/slab.h>
  #include <linux/uio_driver.h>
  
 -#define DRIVER_VERSION	"0.01.0"
 +#define DRIVER_VERSION	"0.02.0"
  #define DRIVER_AUTHOR	"Michael S. Tsirkin <m...@redhat.com>"
  #define DRIVER_DESC	"Generic UIO driver for PCI 2.3 devices"
  
 +#define DRV_NAME "uio_pci_generic"
 +
  struct uio_pci_generic_dev {
  	struct uio_info info;
  	struct pci_dev *pdev;
 @@ -58,6 +60,7 @@ static int __devinit probe(struct pci_dev *pdev,
  {
  	struct uio_pci_generic_dev *gdev;
  	int err;
 +	int i;
  
  	err = pci_enable_device(pdev);
  	if (err) {
 @@ -67,8 +70,7 @@ static int __devinit probe(struct pci_dev *pdev,
  	}
  
  	if (!pdev->irq) {
 -		dev_warn(&pdev->dev, "No IRQ assigned to device: "
 -			 "no support for interrupts?\n");
 +		dev_warn(&pdev->dev, "No IRQ assigned to device: no support for interrupts?\n");
  		pci_disable_device(pdev);
  		return -ENODEV;
  	}
 @@ -91,10 +93,31 @@ static int __devinit probe(struct pci_dev *pdev,
  	gdev->info.handler = irqhandler;
  	gdev->pdev = pdev;
  
 +	/* request regions */
 +	err = pci_request_regions(pdev, DRV_NAME);
 +	if (err) {
 +		dev_err(&pdev->dev, "Couldn't get PCI resources, aborting\n");
 +		return err;
 +	}
 +
 +	/* create attributes for BAR mappings */
 +	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
 +		if (pdev->resource[i].flags &&
 +		    (pdev->resource[i].flags & IORESOURCE_MEM)) {
 +			gdev->info.mem[i].addr = pci_resource_start(pdev, i);
 +			gdev->info.mem[i].size = pci_resource_len(pdev, i);
 +			gdev->info.mem[i].internal_addr = NULL;
 +			gdev->info.mem[i].memtype = UIO_MEM_PHYS;
 +		}
 +	}
 +
  	if (uio_register_device(&pdev->dev, &gdev->info))
  		goto err_register;
  	pci_set_drvdata(pdev, gdev);
  
 +	pr_info("UIO_PCI_GENERIC : initialized new device (%x %x)\n",
 +		pdev->vendor, pdev->device);
 +
  	return 0;
  err_register:
  	kfree(gdev);
 @@ -107,17 +130,21 @@ err_verify:
  static void remove(struct pci_dev *pdev)
  {
  	struct uio_pci_generic_dev *gdev = pci_get_drvdata(pdev);
 -
  	uio_unregister_device(&gdev->info);
 +
 +	pci_release_regions(pdev);
  	pci_disable_device(pdev);
  	kfree(gdev);
 +
 +	pr_info("UIO_PCI_GENERIC : removed device (%x %x)\n",
 +		pdev->vendor, pdev->device);
  }
  
  static struct pci_driver driver = {
 -	.name = "uio_pci_generic",
 +	.name = DRV_NAME,
  	.id_table = NULL, /* only dynamic id's */
 -	.probe = probe,
 -	.remove = remove,
 +	.probe	= probe,
 +	.remove	= remove,
  };
  
  static int __init init(void)
 
 -- 
 Gruß
   Dominic
 
 Frankfurt Institute for Advanced Studies (FIAS)
 Ruth-Moufang-Straße 1
 D-60438 Frankfurt am Main
 Germany
 
 Phone:  +49 69 79844114
 
 
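
As a rough illustration of how the new BAR mappings would be consumed from
userspace (a sketch only: the device node, map index, and size are made up,
and would normally come from /sys/class/uio/uioX/maps/mapN/), UIO exposes
each mapping at offset N * page_size of the character device:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/dev/uio0", O_RDWR);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		/* Map the first memory BAR (map index 0); the real size should
		 * be read from the corresponding sysfs "size" attribute. */
		size_t len = 4096;
		void *bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
				 fd, 0 * getpagesize());
		if (bar == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		printf("first register: 0x%08x\n", *(volatile uint32_t *)bar);
		munmap(bar, len);
		close(fd);
		return 0;
	}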


Re: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation

2012-07-18 Thread Scott Wood
On 07/18/2012 04:39 AM, Bharat Bhushan wrote:
 This patch adds watchdog emulation to KVM. The watchdog
 emulation is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl.
 A kernel timer is used to emulate the h/w watchdog state machine. On
 watchdog timer expiry, it exits to QEMU if TCR.WRC is non-zero. QEMU can
 then reset/shutdown etc. depending upon how it is configured.
 
 Signed-off-by: Liu Yu yu@freescale.com
 Signed-off-by: Scott Wood scottw...@freescale.com
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com

Please put a note before your signoff stating that it's been modified
since the previous signoffs, such as:

[bharat.bhus...@freescale.com: reworked patch]

 @@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
   u8 osi_needed;
   u8 osi_enabled;
   u8 papr_enabled;
 + u8 watchdog_enable;

s/enable;/enabled;/

 +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
 +{
 + unsigned long nr_jiffies;
 +
 + nr_jiffies = watchdog_next_timeout(vcpu);
 + spin_lock(vcpu-arch.wdt_lock);

Do watchdog_next_timeout() from within the lock.

 +void kvmppc_watchdog_func(unsigned long data)
 +{
 + struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
 + u32 tsr, new_tsr;
 + int final;
 +
 + do {
 + new_tsr = tsr = vcpu-arch.tsr;
 + final = 0;
 +
 + /* Time out event */
 + if (tsr  TSR_ENW) {
 + if (tsr  TSR_WIS) {
 + final = 1;
 + } else {
 + new_tsr = tsr | TSR_WIS;
 + }

Unnecessary braces on the inner if/else.

 + } else {
 + new_tsr = tsr | TSR_ENW;
 + }
 + } while (cmpxchg(vcpu-arch.tsr, tsr, new_tsr) != tsr);
 +
 + if (new_tsr  TSR_WIS) {
 + smp_wmb();
 + kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
 + kvm_vcpu_kick(vcpu);
 + }
 +
 + /*
 +  * If this is final watchdog expiry and some action is required
 +  * then exit to userspace.
 +  */
 + if (final  (vcpu-arch.tcr  TCR_WRC_MASK)) {
 + smp_wmb();
 + kvm_make_request(KVM_REQ_WATCHDOG, vcpu);
 + kvm_vcpu_kick(vcpu);
 + }
 +
 +
 + /*
 +  * Stop running the watchdog timer after final expiration to
 +  * prevent the host from being flooded with timers if the
 +  * guest sets a short period.
 +  * Timers will resume when TSR/TCR is updated next time.
 +  */
 + if (!final)
 + arm_next_watchdog(vcpu);
 +}
  static void update_timer_ints(struct kvm_vcpu *vcpu)

Blank line between functions

  static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
 @@ -516,6 +627,13 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
   goto out;
   }
  
 + if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu)  
 + vcpu-arch.watchdog_enable) {

Check .watchdog_enabled when you make the request, not here.

 + kvm_run-exit_reason = KVM_EXIT_WATCHDOG;
 + ret = 0;
 + goto out;
 + }

I think we should check that ENW/WIS are still set.  Now that we don't
use TSR[WRS], this is the only way QEMU can preemptively clear a pending
final expiration after leaving debug halt.  If QEMU doesn't do this, and
KVM comes back with a watchdog exit, QEMU won't know if it's a new
legitimate one, or if it expired during the debug halt.  This would be
less ugly than QEMU doing a reset-TSR[WIS,ENW]-and-ignore on the first
watchdog exit after a debug halt, and letting the expiration happen
again if the guest is really stuck.
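
One way to implement the check suggested here (a sketch, not code from the
posted patch) is to take the watchdog exit in kvmppc_vcpu_run() only while a
timeout is still pending in the TSR:

	if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) &&
	    (vcpu->arch.tsr & (TSR_ENW | TSR_WIS))) {
		kvm_run->exit_reason = KVM_EXIT_WATCHDOG;
		ret = 0;
		goto out;
	}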

   if (sregs-u.e.update_special  KVM_SREGS_E_UPDATE_TSR) {
 + u32 old_tsr = vcpu-arch.tsr;
 +
   vcpu-arch.tsr = sregs-u.e.tsr;
 +
 + if ((old_tsr ^ vcpu-arch.tsr) 
 + (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
 + arm_next_watchdog(vcpu);

Do we still do anything with TSR[WRS] at all?

 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 87f4dc8..3c50b81 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -38,9 +38,11 @@
  
  int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
  {
 - return !(v-arch.shared-msr  MSR_WE) ||
 -!!(v-arch.pending_exceptions) ||
 -v-requests;
 + bool ret = !(v-arch.shared-msr  MSR_WE) ||
 +!!(v-arch.pending_exceptions) ||
 +v-requests;
 +
 + return ret;
  }

There's no need to change this function anymore.

 +#define KVM_CAP_PPC_WDT  81

s/WDT/WATCHDOG/

or better, s/WDT/BOOKE_WATCHDOG/

-Scott




RE: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation

2012-07-18 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, July 19, 2012 7:56 AM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Bhushan 
 Bharat-
 R65777
 Subject: Re: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation
 
 On 07/18/2012 04:39 AM, Bharat Bhushan wrote:
  This patch adds the watchdog emulation in KVM. The watchdog
  emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl.
  The kernel timer are used for watchdog emulation and emulates
  h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU
  if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how
  it is configured.
 
  Signed-off-by: Liu Yu yu@freescale.com
  Signed-off-by: Scott Wood scottw...@freescale.com
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 
 Please put a note before your signoff stating that it's been modified
 since the previous signoffs, such as:
 
 [bharat.bhus...@freescale.com: reworked patch]
 
  @@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
  u8 osi_needed;
  u8 osi_enabled;
  u8 papr_enabled;
  +   u8 watchdog_enable;
 
 s/enable;/enabled;/
 
  +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
  +{
  +   unsigned long nr_jiffies;
  +
  +   nr_jiffies = watchdog_next_timeout(vcpu);
  +   spin_lock(vcpu-arch.wdt_lock);
 
 Do watchdog_next_timeout() from within the lock.

Why do you think so? Does the next timeout calculation need to be protected?

 
  +void kvmppc_watchdog_func(unsigned long data)
  +{
  +   struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
  +   u32 tsr, new_tsr;
  +   int final;
  +
  +   do {
  +   new_tsr = tsr = vcpu-arch.tsr;
  +   final = 0;
  +
  +   /* Time out event */
  +   if (tsr  TSR_ENW) {
  +   if (tsr  TSR_WIS) {
  +   final = 1;
  +   } else {
  +   new_tsr = tsr | TSR_WIS;
  +   }
 
 Unnecessary braces on the inner if/else.
 
  +   } else {
  +   new_tsr = tsr | TSR_ENW;
  +   }
  +   } while (cmpxchg(vcpu-arch.tsr, tsr, new_tsr) != tsr);
  +
  +   if (new_tsr  TSR_WIS) {
  +   smp_wmb();
  +   kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
  +   kvm_vcpu_kick(vcpu);
  +   }
  +
  +   /*
  +* If this is final watchdog expiry and some action is required
  +* then exit to userspace.
  +*/
  +   if (final  (vcpu-arch.tcr  TCR_WRC_MASK)) {
  +   smp_wmb();
  +   kvm_make_request(KVM_REQ_WATCHDOG, vcpu);
  +   kvm_vcpu_kick(vcpu);
  +   }
  +
  +
  +   /*
  +* Stop running the watchdog timer after final expiration to
  +* prevent the host from being flooded with timers if the
  +* guest sets a short period.
  +* Timers will resume when TSR/TCR is updated next time.
  +*/
  +   if (!final)
  +   arm_next_watchdog(vcpu);
  +}
   static void update_timer_ints(struct kvm_vcpu *vcpu)
 
 Blank line between functions
 
   static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
  @@ -516,6 +627,13 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct
 kvm_vcpu *vcpu)
  goto out;
  }
 
  +   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) 
  +   vcpu-arch.watchdog_enable) {
 
 Check .watchdog_enabled when you make the request, not here.
 
  +   kvm_run-exit_reason = KVM_EXIT_WATCHDOG;
  +   ret = 0;
  +   goto out;
  +   }
 
 I think we should check that ENW/WIS are still set.  Now that we don't
 use TSR[WRS], this is the only way QEMU can preemptively clear a pending
 final expiration after leaving debug halt.  If QEMU doesn't do this, and
 KVM comes back with a watchdog exit, QEMU won't know if it's a new
 legitimate one, or if it expired during the debug halt.  This would be
 less ugly than QEMU doing a reset-TSR[WIS,ENW]-and-ignore on the first
 watchdog exit after a debug halt, and letting the expiration happen
 again if the guest is really stuck.

ahh, What about clearing KVM_REQ_WATCHDOG request when QEMU clears ENW|WIS (or 
whenever KVM or QEMU clears ENW|WIS)?

Thanks
-Bharat
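
(A sketch of that idea, purely illustrative and not part of the posted patch:
when the vcpu thread applies a new TSR that no longer has ENW/WIS set, any
stale watchdog exit request could be dropped at the same point.)

	if (!(vcpu->arch.tsr & (TSR_ENW | TSR_WIS)))
		clear_bit(KVM_REQ_WATCHDOG, &vcpu->requests);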

 
  if (sregs-u.e.update_special  KVM_SREGS_E_UPDATE_TSR) {
  +   u32 old_tsr = vcpu-arch.tsr;
  +
  vcpu-arch.tsr = sregs-u.e.tsr;
  +
  +   if ((old_tsr ^ vcpu-arch.tsr) 
  +   (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
  +   arm_next_watchdog(vcpu);
 
 Do we still do anything with TSR[WRS] at all?
 
  diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
  index 87f4dc8..3c50b81 100644
  --- a/arch/powerpc/kvm/powerpc.c
  +++ b/arch/powerpc/kvm/powerpc.c
  @@ -38,9 +38,11 @@
 
   int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
   {
  -   return !(v-arch.shared-msr  MSR_WE) ||
  -  !!(v-arch.pending_exceptions) ||
  -  v-requests;
  +   bool ret = !(v-arch.shared-msr  MSR_WE) ||
  +  

[PATCH 2/2 v3] KVM: PPC: booke: Add watchdog emulation

2012-07-18 Thread Bharat Bhushan
This patch adds watchdog emulation to KVM. The watchdog
emulation is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl.
A kernel timer is used to emulate the h/w watchdog state machine. On
watchdog timer expiry, it exits to QEMU if TCR.WRC is non-zero. QEMU can
then reset/shutdown etc. depending upon how it is configured.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v3:
 - Using KVM_REQ_WATCHDOG for userspace exit.
 - TSR changes left for vcpu thread.
 - Other review comments on v2

 arch/powerpc/include/asm/kvm_host.h  |3 +
 arch/powerpc/include/asm/kvm_ppc.h   |3 +
 arch/powerpc/include/asm/reg_booke.h |7 ++
 arch/powerpc/kvm/booke.c |  140 ++
 arch/powerpc/kvm/booke_emulate.c |8 ++
 arch/powerpc/kvm/powerpc.c   |   27 ++-
 include/linux/kvm.h  |2 +
 include/linux/kvm_host.h |1 +
 8 files changed, 187 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 50ea12f..01047f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -467,6 +467,8 @@ struct kvm_vcpu_arch {
ulong fault_esr;
ulong queued_dear;
ulong queued_esr;
+   spinlock_t wdt_lock;
+   struct timer_list wdt_timer;
u32 tlbcfg[4];
u32 mmucfg;
u32 epr;
@@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
u8 osi_needed;
u8 osi_enabled;
u8 papr_enabled;
+   u8 watchdog_enable;
u8 sane;
u8 cpu_type;
u8 hcall_needed;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 0124937..e5cf4b9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -67,6 +67,7 @@ extern int kvmppc_emulate_mmio(struct kvm_run *run, struct 
kvm_vcpu *vcpu);
 extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
 extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
 extern void kvmppc_decrementer_func(unsigned long data);
+extern void kvmppc_watchdog_func(unsigned long data);
 extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);
 
 /* Core-specific hooks */
@@ -104,6 +105,8 @@ extern void kvmppc_core_queue_external(struct kvm_vcpu 
*vcpu,
struct kvm_interrupt *irq);
 extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
  struct kvm_interrupt *irq);
+extern void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu);
+extern void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu);
 
 extern int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
   unsigned int op, int *advance);
diff --git a/arch/powerpc/include/asm/reg_booke.h 
b/arch/powerpc/include/asm/reg_booke.h
index 2d916c4..e07e6af 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -539,6 +539,13 @@
 #define TCR_FIE		0x00800000	/* FIT Interrupt Enable */
 #define TCR_ARE		0x00400000	/* Auto Reload Enable */
 
+#ifdef CONFIG_E500
+#define TCR_GET_WP(tcr)  ((((tcr) & 0xC0000000) >> 30) | \
+			      (((tcr) & 0x1E0000) >> 15))
+#else
+#define TCR_GET_WP(tcr)  (((tcr) & 0xC0000000) >> 30)
+#endif
+
 /* Bit definitions for the TSR. */
 #define TSR_ENW		0x80000000	/* Enable Next Watchdog */
 #define TSR_WIS		0x40000000	/* WDT Interrupt Status */
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index d25a097..9682506 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -206,6 +206,16 @@ void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
 clear_bit(BOOKE_IRQPRIO_EXTERNAL_LEVEL, &vcpu->arch.pending_exceptions);
 }
 
+void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu)
+{
+   kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_WATCHDOG);
+}
+
+void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu)
+{
+   clear_bit(BOOKE_IRQPRIO_WATCHDOG, &vcpu->arch.pending_exceptions);
+}
+
 static void set_guest_srr(struct kvm_vcpu *vcpu, unsigned long srr0, u32 srr1)
 {
 #ifdef CONFIG_KVM_BOOKE_HV
@@ -325,6 +335,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
*vcpu,
msr_mask = MSR_CE | MSR_ME | MSR_DE;
int_class = INT_CLASS_NONCRIT;
break;
+   case BOOKE_IRQPRIO_WATCHDOG:
case BOOKE_IRQPRIO_CRITICAL:
case BOOKE_IRQPRIO_DBELL_CRIT:
 allowed = vcpu->arch.shared->msr & MSR_CE;
@@ -404,12 +415,112 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
*vcpu,
return allowed;
 }
 
+/*
+ * Return the number of jiffies until the next timeout.  If the timeout is
+ * longer than the NEXT_TIMER_MAX_DELTA, then return NEXT_TIMER_MAX_DELTA
+ * because the larger value can break the timer APIs.
+ */
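For orientation, a next-timeout computation along those lines can be sketched
as follows (a rough sketch rather than the exact patch code: it assumes the
period is taken from TCR[WP] via TCR_GET_WP, converted from timebase ticks to
jiffies with tb_ticks_per_jiffy from asm/time.h, and clamped to
NEXT_TIMER_MAX_DELTA from linux/timer.h):

static unsigned long watchdog_next_timeout(struct kvm_vcpu *vcpu)
{
	u64 tb, wdt_tb, wdt_ticks = 0;
	u64 nr_jiffies = 0;
	u32 period = TCR_GET_WP(vcpu->arch.tcr);

	wdt_tb = 1ULL << (63 - period);
	tb = get_tb();
	/* The watchdog fires when the selected timebase bit toggles 0->1. */
	if (tb & wdt_tb)
		wdt_ticks = wdt_tb;
	wdt_ticks += wdt_tb - (tb & (wdt_tb - 1));

	/* Convert timebase ticks to jiffies, rounding up. */
	nr_jiffies = wdt_ticks;
	if (do_div(nr_jiffies, tb_ticks_per_jiffy))
		nr_jiffies++;

	return min_t(unsigned long long, nr_jiffies, NEXT_TIMER_MAX_DELTA);
}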

[PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation

2012-07-18 Thread Bharat Bhushan
This patch adds watchdog emulation to KVM. The watchdog
emulation is enabled by the KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl.
A kernel timer is used for the watchdog emulation and emulates the
h/w watchdog state machine. On watchdog timer expiry, KVM exits to QEMU
if TCR.WRC is non-zero. QEMU can then reset, shut down, etc. depending on
how it is configured.

Signed-off-by: Liu Yu yu@freescale.com
Signed-off-by: Scott Wood scottw...@freescale.com
Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v4:
 - in v3 I forgot to add Scott Wood's and Liu Yu's signoffs

v3:
 - Using KVM_REQ_WATCHDOG for userspace exit.
 - TSR changes left for vcpu thread.
 - Other review comments on v2

 arch/powerpc/include/asm/kvm_host.h  |3 +
 arch/powerpc/include/asm/kvm_ppc.h   |3 +
 arch/powerpc/include/asm/reg_booke.h |7 ++
 arch/powerpc/kvm/booke.c |  140 ++
 arch/powerpc/kvm/booke_emulate.c |8 ++
 arch/powerpc/kvm/powerpc.c   |   27 ++-
 include/linux/kvm.h  |2 +
 include/linux/kvm_host.h |1 +
 8 files changed, 187 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 50ea12f..01047f4 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -467,6 +467,8 @@ struct kvm_vcpu_arch {
ulong fault_esr;
ulong queued_dear;
ulong queued_esr;
+   spinlock_t wdt_lock;
+   struct timer_list wdt_timer;
u32 tlbcfg[4];
u32 mmucfg;
u32 epr;
@@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
u8 osi_needed;
u8 osi_enabled;
u8 papr_enabled;
+   u8 watchdog_enable;
u8 sane;
u8 cpu_type;
u8 hcall_needed;
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 0124937..e5cf4b9 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -67,6 +67,7 @@ extern int kvmppc_emulate_mmio(struct kvm_run *run, struct 
kvm_vcpu *vcpu);
 extern void kvmppc_emulate_dec(struct kvm_vcpu *vcpu);
 extern u32 kvmppc_get_dec(struct kvm_vcpu *vcpu, u64 tb);
 extern void kvmppc_decrementer_func(unsigned long data);
+extern void kvmppc_watchdog_func(unsigned long data);
 extern int kvmppc_sanity_check(struct kvm_vcpu *vcpu);
 
 /* Core-specific hooks */
@@ -104,6 +105,8 @@ extern void kvmppc_core_queue_external(struct kvm_vcpu 
*vcpu,
struct kvm_interrupt *irq);
 extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
  struct kvm_interrupt *irq);
+extern void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu);
+extern void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu);
 
 extern int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
   unsigned int op, int *advance);
diff --git a/arch/powerpc/include/asm/reg_booke.h 
b/arch/powerpc/include/asm/reg_booke.h
index 2d916c4..e07e6af 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -539,6 +539,13 @@
 #define TCR_FIE		0x00800000	/* FIT Interrupt Enable */
 #define TCR_ARE		0x00400000	/* Auto Reload Enable */
 
+#ifdef CONFIG_E500
+#define TCR_GET_WP(tcr)  ((((tcr) & 0xC0000000) >> 30) | \
+			      (((tcr) & 0x1E0000) >> 15))
+#else
+#define TCR_GET_WP(tcr)  (((tcr) & 0xC0000000) >> 30)
+#endif
+
 /* Bit definitions for the TSR. */
 #define TSR_ENW		0x80000000	/* Enable Next Watchdog */
 #define TSR_WIS		0x40000000	/* WDT Interrupt Status */
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index d25a097..9682506 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -206,6 +206,16 @@ void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
 clear_bit(BOOKE_IRQPRIO_EXTERNAL_LEVEL, &vcpu->arch.pending_exceptions);
 }
 
+void kvmppc_core_queue_watchdog(struct kvm_vcpu *vcpu)
+{
+   kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_WATCHDOG);
+}
+
+void kvmppc_core_dequeue_watchdog(struct kvm_vcpu *vcpu)
+{
+   clear_bit(BOOKE_IRQPRIO_WATCHDOG, &vcpu->arch.pending_exceptions);
+}
+
 static void set_guest_srr(struct kvm_vcpu *vcpu, unsigned long srr0, u32 srr1)
 {
 #ifdef CONFIG_KVM_BOOKE_HV
@@ -325,6 +335,7 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
*vcpu,
msr_mask = MSR_CE | MSR_ME | MSR_DE;
int_class = INT_CLASS_NONCRIT;
break;
+   case BOOKE_IRQPRIO_WATCHDOG:
case BOOKE_IRQPRIO_CRITICAL:
case BOOKE_IRQPRIO_DBELL_CRIT:
 allowed = vcpu->arch.shared->msr & MSR_CE;
@@ -404,12 +415,112 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
*vcpu,
return allowed;
 }
 
+/*
+ * Return the number of jiffies until the next timeout.  If the timeout is
+ * longer than the NEXT_TIMER_MAX_DELTA, then return NEXT_TIMER_MAX_DELTA
+ * because the larger value can break the timer APIs.
+ */

[PATCH 3/3 v3] Enable kvm emulated watchdog

2012-07-18 Thread Bharat Bhushan
Enable the KVM-emulated watchdog if KVM supports it (use the
capability enablement in the watchdog handler). Watchdog exit
(KVM_EXIT_WATCHDOG) handling is also added.
The watchdog state machine is cleared whenever the VM state changes to running.
This handles cases like returning from a debug halt.
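
As an illustration of the exit handling this adds, the run-loop side can be
reduced to a helper like the one below, called from the KVM_EXIT_WATCHDOG case
of kvm_arch_handle_exit() (a hedged sketch rather than the patch hunk itself;
watchdog_perform_action() is QEMU's existing hook from hw/watchdog.h that
applies the action chosen with -watchdog-action, and the helper name is made up
here):

#include "hw/watchdog.h"

/* Sketch: the guest's emulated watchdog reached its final expiry and KVM
 * exited to userspace; let QEMU's generic watchdog code decide whether to
 * reset, shut down, pause, etc. */
static int handle_watchdog_exit(CPUPPCState *env, struct kvm_run *run)
{
    (void)env;
    (void)run;
    watchdog_perform_action();
    return 0;
}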

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v3:
 - TSR clearing is removed from watchdog exit handling as it is no longer needed.

v2:
 - Merged ([PATCH 3/4] Watchdog exit handling support) and ([PATCH 4/4] Enable 
to use kvm emulated watchdog)
 - Clear watchdog state machine when VM state changes to running.

 hw/ppc_booke.c|5 +++
 linux-headers/linux/kvm.h |2 +
 target-ppc/cpu.h  |1 +
 target-ppc/kvm.c  |   63 +
 4 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/hw/ppc_booke.c b/hw/ppc_booke.c
index 837a5b6..a9fba15 100644
--- a/hw/ppc_booke.c
+++ b/hw/ppc_booke.c
@@ -203,6 +203,11 @@ static void booke_wdt_cb(void *opaque)
  booke_timer-wdt_timer);
 }
 
+void ppc_booke_wdt_clear_tsr(CPUPPCState *env, target_ulong tsr)
+{
+    env->spr[SPR_BOOKE_TSR] = tsr & ~(TSR_ENW | TSR_WIS | TSR_WRS_MASK);
+}
+
 void store_booke_tsr(CPUPPCState *env, target_ulong val)
 {
     env->spr[SPR_BOOKE_TSR] &= ~val;
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 4b9e575..64d94da 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -163,6 +163,7 @@ struct kvm_pit_config {
 #define KVM_EXIT_OSI  18
 #define KVM_EXIT_PAPR_HCALL  19
 #define KVM_EXIT_S390_UCONTROL   20
+#define KVM_EXIT_WATCHDOG 21
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 #define KVM_INTERNAL_ERROR_EMULATION 1
@@ -618,6 +619,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
 #define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_PPC_WDT81
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h
index ca2fc21..78212b4 100644
--- a/target-ppc/cpu.h
+++ b/target-ppc/cpu.h
@@ -1191,6 +1191,7 @@ void store_40x_dbcr0 (CPUPPCState *env, uint32_t val);
 void store_40x_sler (CPUPPCState *env, uint32_t val);
 void store_booke_tcr (CPUPPCState *env, target_ulong val);
 void store_booke_tsr (CPUPPCState *env, target_ulong val);
+void ppc_booke_wdt_clear_tsr(CPUPPCState *env, target_ulong tsr);
 void booke206_flush_tlb(CPUPPCState *env, int flags, const int check_iprot);
 target_phys_addr_t booke206_tlb_to_page_size(CPUPPCState *env, ppcmas_tlb_t 
*tlb);
 int ppcmas_tlb_check(CPUPPCState *env, ppcmas_tlb_t *tlb,
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index b6ef72d..393214e 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -32,6 +32,7 @@
 #include "device_tree.h"
 #include "hw/sysbus.h"
 #include "hw/spapr.h"
+#include "hw/watchdog.h"
 
 #include "hw/sysbus.h"
 #include "hw/spapr.h"
@@ -60,6 +61,7 @@ static int cap_booke_sregs;
 static int cap_ppc_smt;
 static int cap_ppc_rma;
 static int cap_spapr_tce;
+static int cap_ppc_wdt;
 
 /* XXX We have a race condition where we actually have a level triggered
  * interrupt, but the infrastructure can't expose that yet, so the guest
@@ -86,6 +88,7 @@ int kvm_arch_init(KVMState *s)
 cap_ppc_smt = kvm_check_extension(s, KVM_CAP_PPC_SMT);
 cap_ppc_rma = kvm_check_extension(s, KVM_CAP_PPC_RMA);
 cap_spapr_tce = kvm_check_extension(s, KVM_CAP_SPAPR_TCE);
+cap_ppc_wdt = kvm_check_extension(s, KVM_CAP_PPC_WDT);
 
 if (!cap_interrupt_level) {
 fprintf(stderr, KVM: Couldn't find level irq capability. Expect the 
@@ -168,6 +171,25 @@ static int kvm_booke206_tlb_init(CPUPPCState *env)
 return 0;
 }
 
+static int kvm_wdt_enable(CPUPPCState *env)
+{
+    int ret;
+    struct kvm_enable_cap encap = {};
+
+    if (!kvm_enabled() || !cap_ppc_wdt) {
+        return 0;
+    }
+
+    encap.cap = KVM_CAP_PPC_WDT;
+    ret = kvm_vcpu_ioctl(env, KVM_ENABLE_CAP, &encap);
+    if (ret < 0) {
+        fprintf(stderr, "%s: couldn't enable KVM_CAP_PPC_WDT: %s\n",
+                __func__, strerror(-ret));
+        return ret;
+    }
+
+    return ret;
+}
 
 #if defined(TARGET_PPC64)
 static void kvm_get_fallback_smmu_info(CPUPPCState *env,
@@ -371,6 +393,33 @@ static inline void kvm_fixup_page_sizes(CPUPPCState *env)
 
 #endif /* !defined (TARGET_PPC64) */
 
+static void cpu_state_change_handler(void *opaque, int running, RunState state)
+{
+    CPUPPCState *env = opaque;
+
+    struct kvm_sregs sregs;
+
+    printf("running = %d, state = %d \n", running, state);
+    if (!running)
+        return;
+
+    /*
+     * Clear watchdog interrupt condition by clearing TSR.
+     * Similar logic needed to be implemented for watchdog
+     * emulation in qemu.
+     */
+    if (cap_booke_sregs && cap_ppc_wdt) {
+        kvm_vcpu_ioctl(env, KVM_GET_SREGS, &sregs);
+
+        /* Clear TSR.ENW, TSR.WIS and TSR.WRS */
+        ppc_booke_wdt_clear_tsr(env, sregs.u.e.tsr);

Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-07-18 Thread Michael S. Tsirkin
On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
 Added (msi|msix)_set_message() functions.
 
 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.
 
 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.
 
 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

So guests do enable MSI through config space, but do
not fill in vectors? Very strange. Are you sure it's not
just a guest bug? How does it work for other PCI devices?
Can't we just fix guest drivers to program the vectors properly?

Also pls address the comment below.

Thanks!

 ---
  hw/msi.c  |   13 +
  hw/msi.h  |1 +
  hw/msix.c |9 +
  hw/msix.h |2 ++
  4 files changed, 25 insertions(+)
 
 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..cc6102f 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* 
 dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg)
 +{
 +    uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
 +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
 +
 +    if (msi64bit) {
 +        pci_set_quad(dev->config + msi_address_lo_off(dev), msg.address);
 +    } else {
 +        pci_set_long(dev->config + msi_address_lo_off(dev), msg.address);
 +    }
 +    pci_set_word(dev->config + msi_data_off(dev, msi64bit), msg.data);
 +}
 +

Please add documentation. Something like

/*
 * Special API for POWER to configure the vectors through
 * a side channel. Should never be used by devices.
 */

  bool msi_enabled(const PCIDevice *dev)
  {
  return msi_present(dev) &&
 diff --git a/hw/msi.h b/hw/msi.h
 index 75747ab..6ec1f99 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -31,6 +31,7 @@ struct MSIMessage {
  
  extern bool msi_supported;
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg);
  bool msi_enabled(const PCIDevice *dev);
  int msi_init(struct PCIDevice *dev, uint8_t offset,
   unsigned int nr_vectors, bool msi64bit, bool 
 msi_per_vector_mask);
 diff --git a/hw/msix.c b/hw/msix.c
 index ded3c55..5f7d6d3 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
 unsigned vector)
  return msg;
  }
  
 +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
 +{
 +    uint8_t *table_entry = dev->msix_table_page + vector * PCI_MSIX_ENTRY_SIZE;
 +
 +    pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
 +    pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
 +    table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 +}
 +
  /* Add MSI-X capability to the config space for the device. */
  /* Given a bar and its size, add MSI-X table on top of it
   * and fill MSI-X capability in the config space.
 diff --git a/hw/msix.h b/hw/msix.h
 index 50aee82..26a437e 100644
 --- a/hw/msix.h
 +++ b/hw/msix.h
 @@ -4,6 +4,8 @@
  #include qemu-common.h
  #include pci.h
  
 +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
 +
  int msix_init(PCIDevice *pdev, unsigned short nentries,
MemoryRegion *bar,
unsigned bar_nr, unsigned bar_size);
 -- 
 1.7.10
 
 ps. double '-' and git version is an end-of-patch scissor as I read 
 somewhere, cannot recall where exactly :)
 
 
 
 
 
 
 On 21/06/12 20:56, Jan Kiszka wrote:
  On 2012-06-21 12:50, Alexey Kardashevskiy wrote:
  On 21/06/12 20:38, Jan Kiszka wrote:
  On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
  On 21/06/12 17:39, Jan Kiszka wrote:
  On 2012-06-21 09:18, Alexey Kardashevskiy wrote:
 
  agrhhh. sha1 of the patch changed after rebasing :)
 
 
 
  Added (msi|msix)_(set|get)_message() function for whoever might
  want to use them.
 
  Currently msi_notify()/msix_notify() write to these vectors to
  signal the guest about an interrupt so the correct values have to
  written there by the guest or QEMU.
 
  For example, POWER guest never initializes MSI/MSIX vectors, instead
  it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
  POWER we have to initialize MSI/MSIX message from QEMU.
 
  As only set* function are required by now, the get functions were 
  added
  or made public for a symmetry.
 
  Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
  ---
   hw/msi.c  |   29 +
   hw/msi.h  |2 ++
   hw/msix.c |   11 ++-
   hw/msix.h |3 +++
   4 files changed, 44 insertions(+), 1 deletion(-)
 
  diff --git a/hw/msi.c b/hw/msi.c
  index 5233204..9ad84a4 100644
  --- a/hw/msi.c
  +++ b/hw/msi.c
  @@ -105,6 +105,35 @@ static inline uint8_t msi_pending_off(const 
  PCIDevice* dev, bool 

Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-07-18 Thread Alexey Kardashevskiy
On 18/07/12 22:43, Michael S. Tsirkin wrote:
 On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
 Added (msi|msix)_set_message() functions.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
 
 So guests do enable MSI through config space, but do
 not fill in vectors? 

Yes. msix_capability_init() calls arch_setup_msi_irqs() which does everything 
it needs to do (i.e. calls hypervisor) before msix_capability_init() writes 
PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS register.

These vectors are the PCI bus addresses, the way they are set is specific for a 
PCI host controller, I do not see why the current scheme is a bug.
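
As a usage illustration, the intended caller on the QEMU side is the sPAPR RTAS
MSI configuration path, roughly along these lines (a hypothetical sketch: the
helper name, the doorbell-address parameter and the irq-number-as-data encoding
are assumptions here, not code from this patch):

/* Hypothetical: once the guest has requested MSI-X through RTAS, QEMU
 * picks an interrupt number and writes the matching message into the
 * device's MSI-X table so that later msix_notify() calls hit the right
 * doorbell. */
static void spapr_setup_msix_vector(PCIDevice *dev, int vector,
                                    uint64_t msi_window, uint32_t irq_num)
{
    MSIMessage msg = {
        .address = msi_window,  /* assumed per-PHB MSI doorbell address */
        .data    = irq_num,
    };

    msix_set_message(dev, vector, msg);
}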


 Very strange. Are you sure it's not
 just a guest bug? How does it work for other PCI devices?

Did not get the question. It works the same for every PCI device under POWER 
guest.


 Can't we just fix guest drivers to program the vectors properly?
 
 Also pls address the comment below.

Comment below.

 Thanks!
 
 ---
  hw/msi.c  |   13 +
  hw/msi.h  |1 +
  hw/msix.c |9 +
  hw/msix.h |2 ++
  4 files changed, 25 insertions(+)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..cc6102f 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const PCIDevice* 
 dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg)
 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev-config + msi_address_lo_off(dev), msg.address);
 +} else {
 +pci_set_long(dev-config + msi_address_lo_off(dev), msg.address);
 +}
 +pci_set_word(dev-config + msi_data_off(dev, msi64bit), msg.data);
 +}
 +
 
 Please add documentation. Something like
 
 /*
  * Special API for POWER to configure the vectors through
  * a side channel. Should never be used by devices.
  */


It is useful for any para-virtualized environment I believe, is not it?
For s390 as well. Of course, if it supports PCI, for example, what I am not 
sure it does though :)



  bool msi_enabled(const PCIDevice *dev)
  {
  return msi_present(dev) 
 diff --git a/hw/msi.h b/hw/msi.h
 index 75747ab..6ec1f99 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -31,6 +31,7 @@ struct MSIMessage {
  
  extern bool msi_supported;
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg);
  bool msi_enabled(const PCIDevice *dev);
  int msi_init(struct PCIDevice *dev, uint8_t offset,
   unsigned int nr_vectors, bool msi64bit, bool 
 msi_per_vector_mask);
 diff --git a/hw/msix.c b/hw/msix.c
 index ded3c55..5f7d6d3 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
 unsigned vector)
  return msg;
  }
  
 +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
 +{
 +uint8_t *table_entry = dev-msix_table_page + vector * 
 PCI_MSIX_ENTRY_SIZE;
 +
 +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
 +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
 +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] = ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
 +}
 +
  /* Add MSI-X capability to the config space for the device. */
  /* Given a bar and its size, add MSI-X table on top of it
   * and fill MSI-X capability in the config space.
 diff --git a/hw/msix.h b/hw/msix.h
 index 50aee82..26a437e 100644
 --- a/hw/msix.h
 +++ b/hw/msix.h
 @@ -4,6 +4,8 @@
  #include qemu-common.h
  #include pci.h
  
 +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
 +
  int msix_init(PCIDevice *pdev, unsigned short nentries,
MemoryRegion *bar,
unsigned bar_nr, unsigned bar_size);
 -- 
 1.7.10

 ps. double '-' and git version is an end-of-patch scissor as I read 
 somewhere, cannot recall where exactly :)






 On 21/06/12 20:56, Jan Kiszka wrote:
 On 2012-06-21 12:50, Alexey Kardashevskiy wrote:
 On 21/06/12 20:38, Jan Kiszka wrote:
 On 2012-06-21 12:28, Alexey Kardashevskiy wrote:
 On 21/06/12 17:39, Jan Kiszka wrote:
 On 2012-06-21 09:18, Alexey Kardashevskiy wrote:

 agrhhh. sha1 of the patch changed after rebasing :)



 Added (msi|msix)_(set|get)_message() function for whoever might
 want to use them.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead
 

Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-07-18 Thread Michael S. Tsirkin
On Wed, Jul 18, 2012 at 11:17:12PM +1000, Alexey Kardashevskiy wrote:
 On 18/07/12 22:43, Michael S. Tsirkin wrote:
  On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
  Added (msi|msix)_set_message() functions.
 
  Currently msi_notify()/msix_notify() write to these vectors to
  signal the guest about an interrupt so the correct values have to
  written there by the guest or QEMU.
 
  For example, POWER guest never initializes MSI/MSIX vectors, instead
  it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
  POWER we have to initialize MSI/MSIX message from QEMU.
 
  Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru
  
  So guests do enable MSI through config space, but do
  not fill in vectors? 
 
 Yes. msix_capability_init() calls arch_setup_msi_irqs() which does everything 
 it needs to do (i.e. calls hypervisor) before msix_capability_init() writes 
 PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS register.
 
 These vectors are the PCI bus addresses, the way they are set is specific for 
 a PCI host controller, I do not see why the current scheme is a bug.

It won't work with any real PCI device, will it? Real PCI devices expect
vectors to be written into their memory.

  Very strange. Are you sure it's not
  just a guest bug? How does it work for other PCI devices?
 
 Did not get the question. It works the same for every PCI device under POWER 
 guest.

I mean for real PCI devices.

  Can't we just fix guest drivers to program the vectors properly?
  
  Also pls address the comment below.
 
 Comment below.
 
  Thanks!
  
  ---
   hw/msi.c  |   13 +
   hw/msi.h  |1 +
   hw/msix.c |9 +
   hw/msix.h |2 ++
   4 files changed, 25 insertions(+)
 
  diff --git a/hw/msi.c b/hw/msi.c
  index 5233204..cc6102f 100644
  --- a/hw/msi.c
  +++ b/hw/msi.c
  @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const 
  PCIDevice* dev, bool msi64bit)
   return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
  PCI_MSI_PENDING_32);
   }
   
  +void msi_set_message(PCIDevice *dev, MSIMessage msg)
  +{
  +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
  +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
  +
  +if (msi64bit) {
  +pci_set_quad(dev-config + msi_address_lo_off(dev), msg.address);
  +} else {
  +pci_set_long(dev-config + msi_address_lo_off(dev), msg.address);
  +}
  +pci_set_word(dev-config + msi_data_off(dev, msi64bit), msg.data);
  +}
  +
  
  Please add documentation. Something like
  
  /*
   * Special API for POWER to configure the vectors through
   * a side channel. Should never be used by devices.
   */
 
 
 It is useful for any para-virtualized environment I believe, is not it?
 For s390 as well. Of course, if it supports PCI, for example, what I am not 
 sure it does though :)

I expect the normal guest to program the address into MSI register using
config accesses, same way that it enables MSI/MSIX.
Why POWER does it differently I did not yet figure out but I hope
this weirdness is not so widespread.

   bool msi_enabled(const PCIDevice *dev)
   {
   return msi_present(dev) 
  diff --git a/hw/msi.h b/hw/msi.h
  index 75747ab..6ec1f99 100644
  --- a/hw/msi.h
  +++ b/hw/msi.h
  @@ -31,6 +31,7 @@ struct MSIMessage {
   
   extern bool msi_supported;
   
  +void msi_set_message(PCIDevice *dev, MSIMessage msg);
   bool msi_enabled(const PCIDevice *dev);
   int msi_init(struct PCIDevice *dev, uint8_t offset,
unsigned int nr_vectors, bool msi64bit, bool 
  msi_per_vector_mask);
  diff --git a/hw/msix.c b/hw/msix.c
  index ded3c55..5f7d6d3 100644
  --- a/hw/msix.c
  +++ b/hw/msix.c
  @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
  unsigned vector)
   return msg;
   }
   
  +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
  +{
  +uint8_t *table_entry = dev-msix_table_page + vector * 
  PCI_MSIX_ENTRY_SIZE;
  +
  +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
  +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
  +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] = 
  ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
  +}
  +
   /* Add MSI-X capability to the config space for the device. */
   /* Given a bar and its size, add MSI-X table on top of it
* and fill MSI-X capability in the config space.
  diff --git a/hw/msix.h b/hw/msix.h
  index 50aee82..26a437e 100644
  --- a/hw/msix.h
  +++ b/hw/msix.h
  @@ -4,6 +4,8 @@
   #include qemu-common.h
   #include pci.h
   
  +void msix_set_message(PCIDevice *dev, int vector, MSIMessage msg);
  +
   int msix_init(PCIDevice *pdev, unsigned short nentries,
 MemoryRegion *bar,
 unsigned bar_nr, unsigned bar_size);
  -- 
  1.7.10
 
  ps. double '-' and git version is an end-of-patch scissor as I read 
  somewhere, cannot recall where exactly :)
 
 
 
 
 
 
  On 21/06/12 20:56, Jan Kiszka wrote:
  On 2012-06-21 12:50, 

Re: [PATCH 0/8] KVM: Optimize MMU notifier's THP page invalidation -v4

2012-07-18 Thread Marcelo Tosatti
On Mon, Jul 02, 2012 at 05:52:39PM +0900, Takuya Yoshikawa wrote:
 v3->v4: Resolved trace_kvm_age_page() issue -- patch 6,7
 v2->v3: Fixed intersection calculations. -- patch 3, 8
 
   Takuya
 
 Takuya Yoshikawa (8):
   KVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva()
   KVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva()
   KVM: MMU: Make kvm_handle_hva() handle range of addresses
   KVM: Introduce kvm_unmap_hva_range() for 
 kvm_mmu_notifier_invalidate_range_start()
    KVM: Separate rmap_pde from kvm_lpage_info->write_count
   KVM: MMU: Add memslot parameter to hva handlers
   KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
   KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
 
  arch/powerpc/include/asm/kvm_host.h |2 +
  arch/powerpc/kvm/book3s_64_mmu_hv.c |   47 ---
  arch/x86/include/asm/kvm_host.h |3 +-
  arch/x86/kvm/mmu.c  |  107 
 +++
  arch/x86/kvm/x86.c  |   11 
  include/linux/kvm_host.h|8 +++
  virt/kvm/kvm_main.c |3 +-
  7 files changed, 131 insertions(+), 50 deletions(-)
 
 
 From v2:
 
 The new test result was impressively good, see below, and THP page
 invalidation was more than 5 times faster on my x86 machine.
 
 Before:
   ...
   19.852 us  |  __mmu_notifier_invalidate_range_start();
   28.033 us  |  __mmu_notifier_invalidate_range_start();
   19.066 us  |  __mmu_notifier_invalidate_range_start();
   44.715 us  |  __mmu_notifier_invalidate_range_start();
   31.613 us  |  __mmu_notifier_invalidate_range_start();
   20.659 us  |  __mmu_notifier_invalidate_range_start();
   19.979 us  |  __mmu_notifier_invalidate_range_start();
   20.416 us  |  __mmu_notifier_invalidate_range_start();
   20.632 us  |  __mmu_notifier_invalidate_range_start();
   22.316 us  |  __mmu_notifier_invalidate_range_start();
   ...
 
 After:
   ...
   4.089 us   |  __mmu_notifier_invalidate_range_start();
   4.096 us   |  __mmu_notifier_invalidate_range_start();
   3.560 us   |  __mmu_notifier_invalidate_range_start();
   3.376 us   |  __mmu_notifier_invalidate_range_start();
   3.772 us   |  __mmu_notifier_invalidate_range_start();
   3.353 us   |  __mmu_notifier_invalidate_range_start();
   3.332 us   |  __mmu_notifier_invalidate_range_start();
   3.332 us   |  __mmu_notifier_invalidate_range_start();
   3.332 us   |  __mmu_notifier_invalidate_range_start();
   3.337 us   |  __mmu_notifier_invalidate_range_start();
   ...

Applied, thanks.



Re: [PATCH] msi/msix: added API to set MSI message address and data

2012-07-18 Thread Alexey Kardashevskiy
On 19/07/12 01:23, Michael S. Tsirkin wrote:
 On Wed, Jul 18, 2012 at 11:17:12PM +1000, Alexey Kardashevskiy wrote:
 On 18/07/12 22:43, Michael S. Tsirkin wrote:
 On Thu, Jun 21, 2012 at 09:39:10PM +1000, Alexey Kardashevskiy wrote:
 Added (msi|msix)_set_message() functions.

 Currently msi_notify()/msix_notify() write to these vectors to
 signal the guest about an interrupt so the correct values have to
 written there by the guest or QEMU.

 For example, POWER guest never initializes MSI/MSIX vectors, instead
 it uses RTAS hypercalls. So in order to support MSIX for virtio-pci on
 POWER we have to initialize MSI/MSIX message from QEMU.

 Signed-off-by: Alexey Kardashevskiy a...@ozlabs.ru

 So guests do enable MSI through config space, but do
 not fill in vectors? 

 Yes. msix_capability_init() calls arch_setup_msi_irqs() which does 
 everything it needs to do (i.e. calls hypervisor) before 
 msix_capability_init() writes PCI_MSIX_FLAGS_ENABLE to the PCI_MSIX_FLAGS 
 register.

 These vectors are the PCI bus addresses, the way they are set is specific 
 for a PCI host controller, I do not see why the current scheme is a bug.
 
 I won't work with any real PCI device, will it? Real pci devices expect
 vectors to be written into their memory.


Yes. And the hypervisor does this. On POWER (at least book3s, i.e. server
powerpc), the whole config space kitchen is hidden behind RTAS (a kind of BIOS).
For the guest, this RTAS is implemented in the hypervisor; for the host, in the
system firmware. So powerpc Linux does not have to have PHB drivers. Kinda cool.

A usual powerpc server runs without a host Linux at all; it runs a hypervisor
called pHyp. And every guest knows that it is a guest; there is no full machine
emulation, it is para-virtualization. In POWER KVM we replace that pHyp with the
host Linux, and now QEMU plays the hypervisor role. Some day we may move the
hypervisor into the host kernel completely (?) but for now it is in QEMU.


 Very strange. Are you sure it's not
 just a guest bug? How does it work for other PCI devices?

 Did not get the question. It works the same for every PCI device under POWER 
 guest.
 
 I mean for real PCI devices.
 
 Can't we just fix guest drivers to program the vectors properly?

 Also pls address the comment below.

 Comment below.

 Thanks!

 ---
  hw/msi.c  |   13 +
  hw/msi.h  |1 +
  hw/msix.c |9 +
  hw/msix.h |2 ++
  4 files changed, 25 insertions(+)

 diff --git a/hw/msi.c b/hw/msi.c
 index 5233204..cc6102f 100644
 --- a/hw/msi.c
 +++ b/hw/msi.c
 @@ -105,6 +105,19 @@ static inline uint8_t msi_pending_off(const 
 PCIDevice* dev, bool msi64bit)
  return dev-msi_cap + (msi64bit ? PCI_MSI_PENDING_64 : 
 PCI_MSI_PENDING_32);
  }
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg)
 +{
 +uint16_t flags = pci_get_word(dev-config + msi_flags_off(dev));
 +bool msi64bit = flags  PCI_MSI_FLAGS_64BIT;
 +
 +if (msi64bit) {
 +pci_set_quad(dev-config + msi_address_lo_off(dev), msg.address);
 +} else {
 +pci_set_long(dev-config + msi_address_lo_off(dev), msg.address);
 +}
 +pci_set_word(dev-config + msi_data_off(dev, msi64bit), msg.data);
 +}
 +

 Please add documentation. Something like

 /*
  * Special API for POWER to configure the vectors through
  * a side channel. Should never be used by devices.
  */


 It is useful for any para-virtualized environment I believe, is not it?
 For s390 as well. Of course, if it supports PCI, for example, what I am not 
 sure it does though :)
 
 I expect the normal guest to program the address into MSI register using
 config accesses, same way that it enables MSI/MSIX.
 Why POWER does it differently I did not yet figure out but I hope
 this weirdness is not so widespread.


In para-virt I would expect the guest not to touch config space at all. At 
least it should use one interface rather than two but this is how it is.


  bool msi_enabled(const PCIDevice *dev)
  {
  return msi_present(dev) 
 diff --git a/hw/msi.h b/hw/msi.h
 index 75747ab..6ec1f99 100644
 --- a/hw/msi.h
 +++ b/hw/msi.h
 @@ -31,6 +31,7 @@ struct MSIMessage {
  
  extern bool msi_supported;
  
 +void msi_set_message(PCIDevice *dev, MSIMessage msg);
  bool msi_enabled(const PCIDevice *dev);
  int msi_init(struct PCIDevice *dev, uint8_t offset,
   unsigned int nr_vectors, bool msi64bit, bool 
 msi_per_vector_mask);
 diff --git a/hw/msix.c b/hw/msix.c
 index ded3c55..5f7d6d3 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -45,6 +45,15 @@ static MSIMessage msix_get_message(PCIDevice *dev, 
 unsigned vector)
  return msg;
  }
  
 +void msix_set_message(PCIDevice *dev, int vector, struct MSIMessage msg)
 +{
 +uint8_t *table_entry = dev-msix_table_page + vector * 
 PCI_MSIX_ENTRY_SIZE;
 +
 +pci_set_quad(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR, msg.address);
 +pci_set_long(table_entry + PCI_MSIX_ENTRY_DATA, msg.data);
 +table_entry[PCI_MSIX_ENTRY_VECTOR_CTRL] = 
 

RE: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation

2012-07-18 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, July 19, 2012 7:56 AM
 To: Bhushan Bharat-R65777
 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; ag...@suse.de; Bhushan 
 Bharat-
 R65777
 Subject: Re: [PATCH 2/2 v4] KVM: PPC: booke: Add watchdog emulation
 
 On 07/18/2012 04:39 AM, Bharat Bhushan wrote:
  This patch adds the watchdog emulation in KVM. The watchdog
  emulation is enabled by KVM_ENABLE_CAP(KVM_CAP_PPC_WDT) ioctl.
  The kernel timer are used for watchdog emulation and emulates
  h/w watchdog state machine. On watchdog timer expiry, it exit to QEMU
  if TCR.WRC is non ZERO. QEMU can reset/shutdown etc depending upon how
  it is configured.
 
  Signed-off-by: Liu Yu yu@freescale.com
  Signed-off-by: Scott Wood scottw...@freescale.com
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 
 Please put a note before your signoff stating that it's been modified
 since the previous signoffs, such as:
 
 [bharat.bhus...@freescale.com: reworked patch]
 
  @@ -482,6 +484,7 @@ struct kvm_vcpu_arch {
  u8 osi_needed;
  u8 osi_enabled;
  u8 papr_enabled;
  +   u8 watchdog_enable;
 
 s/enable;/enabled;/
 
  +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
  +{
  +   unsigned long nr_jiffies;
  +
  +   nr_jiffies = watchdog_next_timeout(vcpu);
  +   spin_lock(&vcpu->arch.wdt_lock);
 
 Do watchdog_next_timeout() from within the lock.

Why do you think so? Does the next timeout calculation need to be protected?
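
For context, the ordering being suggested looks roughly like this (an
illustration of the review comment, not the final code; mod_timer()/del_timer()
are the existing kernel timer APIs):

	spin_lock(&vcpu->arch.wdt_lock);
	nr_jiffies = watchdog_next_timeout(vcpu);
	/* The period computed from TCR and the timer armed with it stay
	 * consistent because both happen under wdt_lock. */
	if (nr_jiffies < NEXT_TIMER_MAX_DELTA)
		mod_timer(&vcpu->arch.wdt_timer, jiffies + nr_jiffies);
	else
		del_timer(&vcpu->arch.wdt_timer);
	spin_unlock(&vcpu->arch.wdt_lock);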

 
  +void kvmppc_watchdog_func(unsigned long data)
  +{
  +   struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
  +   u32 tsr, new_tsr;
  +   int final;
  +
  +   do {
  +   new_tsr = tsr = vcpu->arch.tsr;
  +   final = 0;
  +
  +   /* Time out event */
  +   if (tsr & TSR_ENW) {
  +   if (tsr & TSR_WIS) {
  +   final = 1;
  +   } else {
  +   new_tsr = tsr | TSR_WIS;
  +   }
 
 Unnecessary braces on the inner if/else.
 
  +   } else {
  +   new_tsr = tsr | TSR_ENW;
  +   }
  +   } while (cmpxchg(&vcpu->arch.tsr, tsr, new_tsr) != tsr);
  +
  +   if (new_tsr & TSR_WIS) {
  +   smp_wmb();
  +   kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
  +   kvm_vcpu_kick(vcpu);
  +   }
  +
  +   /*
  +* If this is final watchdog expiry and some action is required
  +* then exit to userspace.
  +*/
  +   if (final && (vcpu->arch.tcr & TCR_WRC_MASK)) {
  +   smp_wmb();
  +   kvm_make_request(KVM_REQ_WATCHDOG, vcpu);
  +   kvm_vcpu_kick(vcpu);
  +   }
  +
  +
  +   /*
  +* Stop running the watchdog timer after final expiration to
  +* prevent the host from being flooded with timers if the
  +* guest sets a short period.
  +* Timers will resume when TSR/TCR is updated next time.
  +*/
  +   if (!final)
  +   arm_next_watchdog(vcpu);
  +}
   static void update_timer_ints(struct kvm_vcpu *vcpu)
 
 Blank line between functions
 
   static void kvmppc_core_check_exceptions(struct kvm_vcpu *vcpu)
  @@ -516,6 +627,13 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct
 kvm_vcpu *vcpu)
  goto out;
  }
 
  +   if (kvm_check_request(KVM_REQ_WATCHDOG, vcpu) &&
  +   vcpu->arch.watchdog_enable) {
 
 Check .watchdog_enabled when you make the request, not here.
 
  +   kvm_run->exit_reason = KVM_EXIT_WATCHDOG;
  +   ret = 0;
  +   goto out;
  +   }
 
 I think we should check that ENW/WIS are still set.  Now that we don't
 use TSR[WRS], this is the only way QEMU can preemptively clear a pending
 final expiration after leaving debug halt.  If QEMU doesn't do this, and
 KVM comes back with a watchdog exit, QEMU won't know if it's a new
 legitimate one, or if it expired during the debug halt.  This would be
 less ugly than QEMU doing a reset-TSR[WIS,ENW]-and-ignore on the first
 watchdog exit after a debug halt, and letting the expiration happen
 again if the guest is really stuck.

Ahh, what about clearing the KVM_REQ_WATCHDOG request when QEMU clears ENW|WIS (or
whenever KVM or QEMU clears ENW|WIS)?
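
A minimal sketch of that idea, e.g. in whatever path re-evaluates TSR
(illustrative only; it assumes KVM_REQ_WATCHDOG is an ordinary request bit in
vcpu->requests):

	/*
	 * If TSR[ENW] and TSR[WIS] are no longer both set, any previously
	 * queued watchdog exit is stale, so drop the request.
	 */
	if ((vcpu->arch.tsr & (TSR_ENW | TSR_WIS)) != (TSR_ENW | TSR_WIS))
		clear_bit(KVM_REQ_WATCHDOG, &vcpu->requests);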

Thanks
-Bharat

 
  if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_TSR) {
  +   u32 old_tsr = vcpu->arch.tsr;
  +
  vcpu->arch.tsr = sregs->u.e.tsr;
  +
  +   if ((old_tsr ^ vcpu->arch.tsr) &
  +   (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
  +   arm_next_watchdog(vcpu);
 
 Do we still do anything with TSR[WRS] at all?
 
  diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
  index 87f4dc8..3c50b81 100644
  --- a/arch/powerpc/kvm/powerpc.c
  +++ b/arch/powerpc/kvm/powerpc.c
  @@ -38,9 +38,11 @@
 
   int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
   {
  -   return !(v->arch.shared->msr & MSR_WE) ||
  -  !!(v->arch.pending_exceptions) ||
  -  v->requests;
  +   bool ret = !(v->arch.shared->msr & MSR_WE) ||
  +  

<    1   2