Re: [Qemu-devel] [RFC] alpha qemu arithmetic exceptions

2014-07-02 Thread Al Viro
More bugs: addl/v should sign-extend the result, as addl does.
As it is, we have
uint64_t helper_addlv(CPUAlphaState *env, uint64_t op1, uint64_t op2)
{
    uint64_t tmp = op1;
    op1 = (uint32_t)(op1 + op2);
    if (unlikely((tmp ^ op2 ^ (-1UL)) & (tmp ^ op1) & (1UL << 31))) {
        arith_excp(env, GETPC(), EXC_M_IOV, 0);
    }
    return op1;
}

IOW,
#include <stdio.h>

long r;
void __attribute__((noinline)) f(void)
{
    asm __volatile(
        "subl   $31, 1, $0\n\t"
        "addl   $0, $0, $1\n\t"
        "addl/v $0, $0, $0\n\t"
        "subq   $0, $1, $0\n\t"
        "stq    $0, %0\n\t"
        : "=m"(r): :"$0", "$1");
}

int main(void)
{
    f();
    printf("%ld\n", r);
    return 0;
}

ends up printing 0 on actual hardware (all variants) and 4294967296 on
qemu.  Similar problem with subl/v - 

#include <stdio.h>

long r;
void __attribute__((noinline)) f(void)
{
    asm __volatile(
        "subl   $31, 1, $0\n\t"
        "subl/v $31, 1, $1\n\t"
        "subq   $0, $1, $0\n\t"
        "stq    $0, %0\n\t"
        : "=m"(r): :"$0", "$1");
}

int main(void)
{
    f();
    printf("%ld\n", r);
    return 0;
}

prints 0 on actual hw and -4294967296 on qemu.  What constraints do we have
on qemu host, anyway?  Two's-complement, (int32_t)(uint32_t)x == x for any
int32_t x?  helper_mullv() seems to assume that...
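
A minimal sketch of the sign-extending behaviour described above, assuming a
two's-complement host where converting a uint32_t to int32_t wraps. The names
model_addlv(), model_sublv() and struct lv_result are illustrative only and are
not QEMU code; overflow is reported through a flag instead of arith_excp() so
that the sketch can be compiled on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the register value plus an overflow flag. */
struct lv_result {
    uint64_t value;
    bool overflow;
};

/* Model of addl/v: 32-bit add, result sign-extended to 64 bits. */
static struct lv_result model_addlv(uint64_t op1, uint64_t op2)
{
    uint32_t a = (uint32_t)op1;
    uint32_t b = (uint32_t)op2;
    uint32_t sum = a + b;                  /* wraps modulo 2^32 */
    struct lv_result r;

    /* Overflow: both operands have the same sign and the sum's sign differs. */
    r.overflow = ((~(a ^ b) & (a ^ sum)) >> 31) != 0;
    /* Sign-extend the low 32 bits, as plain addl does. */
    r.value = (uint64_t)(int64_t)(int32_t)sum;
    return r;
}

/* Model of subl/v: 32-bit subtract, result sign-extended to 64 bits. */
static struct lv_result model_sublv(uint64_t op1, uint64_t op2)
{
    uint32_t a = (uint32_t)op1;
    uint32_t b = (uint32_t)op2;
    uint32_t diff = a - b;                 /* wraps modulo 2^32 */
    struct lv_result r;

    /* Overflow: operands differ in sign and the result's sign differs from a. */
    r.overflow = (((a ^ b) & (a ^ diff)) >> 31) != 0;
    r.value = (uint64_t)(int64_t)(int32_t)diff;
    return r;
}
```

With this model, adding -1 to -1 yields -2 in all 64 bits, so subtracting the
plain addl result from it gives 0, matching what the test programs print on
hardware.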

Oh, crap - our mull/v is sensitive to the upper 32 bits of the multiplicands.
If you put 1UL<<32 into one register, 1 into another and say mull/v, the
result on hardware will be 0 with no overflow.  qemu does
int64_t res = (int64_t)op1 * (int64_t)op2;

if (unlikely((int32_t)res != res)) {
arith_excp(env, GETPC(), EXC_M_IOV, 0);
}
return (int64_t)((int32_t)res);
which leads to overflow trap triggered for no good reason...
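
A minimal sketch of a mull/v model that ignores the upper 32 bits of both
operands, assuming a two's-complement host. model_mullv() and struct
mullv_result are hypothetical names used for illustration, and the overflow
flag stands in for the trap so the code can run standalone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the register value plus an overflow flag. */
struct mullv_result {
    uint64_t value;
    bool overflow;
};

static struct mullv_result model_mullv(uint64_t op1, uint64_t op2)
{
    /* Only the low 32 bits of each operand take part in the multiply. */
    int64_t a = (int32_t)(uint32_t)op1;
    int64_t b = (int32_t)(uint32_t)op2;
    int64_t res = a * b;    /* cannot overflow: both factors fit in 32 bits */
    int64_t low = (int32_t)(uint32_t)(uint64_t)res;  /* sign-extended low half */
    struct mullv_result r;

    r.overflow = (low != res);
    r.value = (uint64_t)low;
    return r;
}
```

With the operands truncated first, 1UL<<32 times 1 gives 0 with no overflow
flagged, while 0x10000 times 0x10000 still reports overflow.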

Incidentally, all those guys ({add,sub,mul}[lq]/v) *do* assign the result
(same as the variant without /v would) before entering the trap.  So
arith_excp() is wrong here.

FWIW, why not just generate
trunc_i64_i32 tmp, va
trunc_i64_i32 tmp2, vb
muls2_i32 tmp2, tmp, tmp, tmp2
ext32s_i64 vc, tmp2
maybe_overflow_32 tmp
where maybe_overflow throws IOV unless tmp is 0 or -1?  That would appear
to suffice for mull/v.  mulq/v would be
muls2_i64 vc, tmp, va, vb
maybe_overflow_64 tmp
addl/v:
trunc_i64_i32 tmp, va
trunc_i64_i32 tmp2, vb
add2_i32 tmp2, tmp, tmp, zero, tmp2, zero
ext32s_i64 vc, tmp2
maybe_overflow_32 tmp
etc.

We'd need two helpers, differing only in argument type.  Simple
if (unlikely(arg && ~arg))
   arith_excp(env, GETPC(), EXC_M_IOV, 0);
would do.  Not sure what flags would be needed in DEFINE_HELPER_... for
those, though.  Comments?
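
A small standalone sketch of the "arg && ~arg" test quoted above, showing why
two helpers that differ only in argument type would be needed. The function
names are made up for illustration; the real helpers would call arith_excp()
instead of returning a flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the check would raise IOV: arg is neither 0 nor all ones. */
static bool iov_check_32(uint32_t arg)
{
    return arg && ~arg;
}

/* Same predicate on a 64-bit argument. */
static bool iov_check_64(uint64_t arg)
{
    return arg && ~arg;
}
```

The value 0xffffffff is "all ones" for the 32-bit variant but not for the
64-bit one, so the width of the argument changes the outcome.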



Re: [Qemu-devel] [PATCH for 2.1 1/2] memory: introduce memory_region_init_ram_nofail() and memory_region_init_ram_ptr_nofail()

2014-07-02 Thread Michael S. Tsirkin
On Thu, Jul 03, 2014 at 02:10:55PM +0800, Hu Tao wrote:
> Introduce memory_region_init_ram_nofail() and
> memory_region_init_ram_ptr_nofail(), which are the same as
> memory_region_init_ram() and memory_region_init_ram_ptr()
> respectively. They will exit qemu if there is an error, this is the
> behaviour of old memory_region_init_ram() and
> memory_region_init_ram_ptr().
> 
> All existing calls to memory_region_init_ram() and
> memory_region_init_ram_ptr() are replaced with
> memory_region_init_ram_nofail() and memory_region_init_ram_ptr_nofail().
> 
> memory_region_init_ram() and memory_region_init_ram_ptr() are added an
> extra parameter errp to let callers handle the error.
> 
> This patch solves a problem that qemu just exits when using monitor
> command object_add to add a memory backend whose size is way too large.
> In the case we'd better give an error message and keep guest running.
> 
> How to reproduce:
> 
> 1. run qemu
> 2. (monitor)object_add memory-backend-ram,size=10G,id=ram0
> 
> 

Don't put two empty lines in a row please.

> Signed-off-by: Hu Tao 
> ---
>  backends/hostmem-ram.c   |  2 +-
>  exec.c   | 30 +
>  hw/block/pflash_cfi01.c  |  5 -
>  hw/block/pflash_cfi02.c  |  5 -
>  hw/core/loader.c |  2 +-
>  hw/display/vga.c |  2 +-
>  hw/display/vmware_vga.c  |  3 ++-
>  hw/i386/kvm/pci-assign.c |  9 
>  hw/i386/pc.c |  2 +-
>  hw/i386/pc_sysfw.c   |  4 ++--
>  hw/misc/ivshmem.c|  9 
>  hw/misc/vfio.c   |  3 ++-
>  hw/pci/pci.c |  2 +-
>  include/exec/memory.h| 32 ---
>  include/exec/ram_addr.h  |  4 ++--
>  memory.c | 57 
> +++-
>  numa.c   |  4 ++--
>  17 files changed, 134 insertions(+), 41 deletions(-)
> 
> diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
> index d9a8290..a67a134 100644
> --- a/backends/hostmem-ram.c
> +++ b/backends/hostmem-ram.c
> @@ -27,7 +27,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error 
> **errp)
>  
>  path = object_get_canonical_path_component(OBJECT(backend));
>  memory_region_init_ram(&backend->mr, OBJECT(backend), path,
> -   backend->size);
> +   backend->size, errp);
>  g_free(path);
>  }
>  

Sigh.  So you are still mixing a huge mechanical rename with
a bugfix.  I'm not merging this, please split up the patch:
1. rename existing functions and convert all users to _nofail
2. add parameter to qemu_ram_alloc variants,
   add new function and use in hostmem-ram


> diff --git a/exec.c b/exec.c
> index 5a2a25e..8c2a91d 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1224,7 +1224,7 @@ static int memory_try_enable_merging(void *addr, size_t 
> len)
>  return qemu_madvise(addr, len, QEMU_MADV_MERGEABLE);
>  }
>  
> -static ram_addr_t ram_block_add(RAMBlock *new_block)
> +static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
>  {
>  RAMBlock *block;
>  ram_addr_t old_ram_size, new_ram_size;
> @@ -1241,9 +1241,11 @@ static ram_addr_t ram_block_add(RAMBlock *new_block)
>  } else {
>  new_block->host = phys_mem_alloc(new_block->length);
>  if (!new_block->host) {
> -fprintf(stderr, "Cannot set up guest memory '%s': %s\n",
> -new_block->mr->name, strerror(errno));
> -exit(1);
> +error_setg_errno(errp, errno,
> + "cannot set up guest memory '%s'",
> + new_block->mr->name);
> +qemu_mutex_unlock_ramlist();
> +return -1;
>  }
>  memory_try_enable_merging(new_block->host, new_block->length);
>  }
> @@ -1294,6 +1296,7 @@ ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, 
> MemoryRegion *mr,
>  Error **errp)
>  {
>  RAMBlock *new_block;
> +ram_addr_t addr;
>  
>  if (xen_enabled()) {
>  error_setg(errp, "-mem-path not supported with Xen");
> @@ -1323,14 +1326,19 @@ ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, 
> MemoryRegion *mr,
>  return -1;
>  }
>  
> -return ram_block_add(new_block);
> +addr = ram_block_add(new_block, errp);
> +if (errp && *errp) {
> +g_free(new_block);

You want return -1 here. Don't rely on ram_block_add to return -1.

> +}
> +return addr;
>  }
>  #endif
>  
>  ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
> -   MemoryRegion *mr)
> +   MemoryRegion *mr, Error **errp)
>  {
>  RAMBlock *new_block;
> +ram_addr_t addr;
>  
>  size = TARGET_PAGE_ALIGN(size);
>  new_block = g_malloc0(sizeof(*new_block));
> @@ -1341,12 +1349,16 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, 
> void *hos

Re: [Qemu-devel] [Qemu-ppc] [PATCH v7 1/4] cpus: Define callback for QEMU "nmi" command

2014-07-02 Thread Nikunj A Dadhania
Alexey Kardashevskiy  writes:
> diff --git a/hw/core/nmi.c b/hw/core/nmi.c
> new file mode 100644
> index 000..db1295f
> --- /dev/null
> +++ b/hw/core/nmi.c
> @@ -0,0 +1,84 @@

[...]

> +
> +static void nmi_children(Object *o, struct do_nmi_s *ns);
> +

[...]

> +
> +void nmi_children(Object *o, struct do_nmi_s *ns)

Above declared as static and implemented non-static.
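
For illustration, a minimal sketch of the consistent form, with "static" on
both the forward declaration and the definition. C accepts the mismatched
version too, since the definition inherits internal linkage from the earlier
declaration, but repeating the keyword avoids confusion. The struct and
function names below are stand-ins, not the ones from the patch.

```c
/* Hypothetical stand-in for struct do_nmi_s. */
struct do_nmi_s_sketch {
    int visited;
};

/* Forward declaration with internal linkage. */
static void nmi_children_sketch(struct do_nmi_s_sketch *ns);

/* The definition repeats "static" so it visibly matches the declaration. */
static void nmi_children_sketch(struct do_nmi_s_sketch *ns)
{
    ns->visited++;
}
```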

Regards
Nikunj




Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Michael S. Tsirkin
On Thu, Jul 03, 2014 at 01:57:24PM +0800, Chen, Tiejun wrote:
> On 2014/7/2 23:27, Michael S. Tsirkin wrote:
> >On Wed, Jul 02, 2014 at 03:15:02PM +, Ross Philipson wrote:
> >>>-Original Message-
> >>>From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> >>>Sent: Wednesday, July 02, 2014 7:33 AM
> >>>To: Ross Philipson; Michael S. Tsirkin; Stefano Stabellini
> >>>Cc: peter.mayd...@linaro.org; xen-de...@lists.xensource.com; Allen M.
> >>>Kay; kelly.zyta...@amd.com; qemu-devel@nongnu.org;
> >>>yang.z.zh...@intel.com; anth...@codemonkey.ws; Anthony Perard; Chen,
> >>>Tiejun
> >>>Subject: Re: [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough
> >>>support
> >>>
> >>>Il 01/07/2014 19:39, Ross Philipson ha scritto:
> >>>>
> >>>>We do IGD pass-through in our project (XenClient). The patches
> >>>>originally came from our project. We surface the same ISA bridge and
> >>>>have never had activation issues on any version of Windows from XP to
> >>>>Win8. We do not normally run server platforms so I can't say for sure
> >>>>there.
> >>>
> >>>The problem is not activation, the problem is that the patches are
> >>>making assumptions on the driver and the firmware that might work today
> >>>but are IMHO just not sane.
> >>
> >>Sure I don't think anybody is suggesting that activation is
> >>the main problem. It was just a potential problem with respect
> >>to one of the proposed solutions.
> >>
> >>When we first started doing this (back in 2009ish) we ran into
> >>all these problems with surfacing ISA bridges, giving guest
> >>drivers access to registers in the host bridge. etc. Nothing seemed
> >>sane; I sympathize.
> >
> >At some level, maybe Paolo is right.  Ignore existing drivers and ask
> >intel developers to update their drivers to do something sane on
> >hypervisors, even if they do ugly things on real hardware.
> >
> >A simple proposal since what I wrote earlier though apparently wasn't
> >very clear:
> >
> >   Detect Xen subsystem vendor id on vga card.
> >   If there, avoid poking at chipset. Instead
> > - use subsystem device # for card type
> > - use second half of BAR0 of device
> > - instead of access to pci host
> >
> >hypervisors will simply take BAR0 and double it in size,
> >make second part map to what would be the pci host.
> >
> >Tiejun, is there a chance this can be done not only
> >on Linux but on windows as well?
> 
> MST,
> 
> Looks like this is a paravirtualized way, right?
> 
> I can post this requirement to check but please make sure I really
> understand what you mean,
> 
> #1 We need to define a new Xen subsystem vendor id and emulate this value on
> vga card
> 
> #2 Native driver need to do:
> 
>   * if the subsystem id on vga is that emulated XEN subsystem id, the 
> native
> driver can get all necessary access including PCI host bridge at 0.0 and ISA
> bridge at 1f.0 from second half of that emulated BAR0 double the real size.
> 
> Right? If yes, I'd like to ask them.

Yes.

And in addition, get the device type from the low bits of
the subsystem id as opposed to the ISA bridge.

This way you don't need to modify the PC type at all.

> But question is how to walk from PCI config on PCI host to BAR0 on VGA:
> 
>   dev_priv->bridge_dev = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));
> 
>   pci_write/read_config_dword(dev_priv->bridge_dev,,,)
> 
> Thanks
> Tiejun

So you would have a helper: set dev_priv->bridge_dev to NULL, and then


i915_write/read_host_dword(dev, dev_priv, offset)
{
if (dev_priv->bridge_dev)
pci_write/read_config_dword(dev_priv->bridge_dev,,,)
else
iowrite/read16(dev->pv_io_base, )
}

The point being not touching anything except the vga device at all.
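
A rough, self-contained model of that dispatch. Everything here is
hypothetical: the struct layouts, the sketch_ names and the use of plain
arrays in place of PCI config space and the second half of BAR0 are
assumptions made so the idea can be compiled outside a kernel tree.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CFG_DWORDS 64

/* Stand-in for the host bridge's PCI config space. */
struct sketch_bridge {
    uint32_t cfg[SKETCH_CFG_DWORDS];
};

struct sketch_priv {
    struct sketch_bridge *bridge_dev;  /* NULL when the Xen subsystem id is seen */
    uint32_t *pv_io_base;              /* assumed mapping of BAR0's second half */
};

static uint32_t sketch_read_host_dword(const struct sketch_priv *p,
                                       unsigned offset)
{
    unsigned idx = (offset / 4) % SKETCH_CFG_DWORDS;

    if (p->bridge_dev) {
        return p->bridge_dev->cfg[idx];    /* native: real host bridge */
    }
    return p->pv_io_base[idx];             /* paravirtual: vga device only */
}

static void sketch_write_host_dword(struct sketch_priv *p, unsigned offset,
                                    uint32_t val)
{
    unsigned idx = (offset / 4) % SKETCH_CFG_DWORDS;

    if (p->bridge_dev) {
        p->bridge_dev->cfg[idx] = val;
    } else {
        p->pv_io_base[idx] = val;
    }
}
```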

Note: we can't allow guests to change the config of the real host
bridge, so we end up whitelisting specific cards and specific registers
anyway.  So the only problem this solves is that of conflicts between
the host bridge emulated by qemu and the hardware one.

Also, maybe driver guys will see the pain this causes them
and will put some pressure on the hardware guys to
do it like this in future hardware :)

> >
> >
> >>>
> >>>I would have no problem with a clean patchset that adds a new machine
> >>>type and doesn't touch code in "-M pc", but it looks like mst disagrees.
> >>>   Ultimately, if a patchset is too hacky for upstream, you can include
> >>>it in your downstream XenClient (and XenServer) QEMU branch.  It
> >>>happens.
> >>>
> >>>Paolo
> >>>
> >



[Qemu-devel] [Bug 1307473] Re: guest hang due to missing clock interrupt

2014-07-02 Thread Ilya Almametov
I can confirm that it's more of a kernel issue than a qemu one. I'm running
kernel 3.11.0-24-generic, left over after the upgrade from Saucy, and have
had no issues for at least two days. Before that, with the current
3.13.0-30-generic kernel, my Windows guests crashed every 3-4 hours.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1307473

Title:
  guest hang due to missing clock interrupt

Status in QEMU:
  New
Status in “linux” package in Ubuntu:
  Confirmed
Status in “qemu” package in Ubuntu:
  Confirmed

Bug description:
  
  I noticed on 2 different systems that after upgrading from precise to the latest trusty, VMs are crashing:

  - in case of Windows VMs I'm getting BSOD with error message: "A clock interrupt was not received on a secondary processor within the allocated time interval."
  - On linux VMs I'm noticing "hrtimer: interrupt took 2992229 ns" messages
  - On some proprietary virtual appliances I'm noticing crashes due to missing timer interrupts

  QEMU version is:
  QEMU emulator version 1.7.91 (Debian 2.0.0~rc1+dfsg-0ubuntu3)

  Full command line:

  qemu-system-x86_64 -enable-kvm -name win7eval -S \
    -machine pc-i440fx-1.7,accel=kvm,usb=off \
    -cpu host -m 4096 -realtime mlock=off \
    -smp 4,sockets=1,cores=4,threads=1 \
    -uuid 05e5089a-4aa1-6bb2-ef06-ab4d020a \
    -no-user-config -nodefaults \
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/win7eval.monitor,server,nowait \
    -mon chardev=charmonitor,id=monitor,mode=control \
    -rtc base=localtime -no-shutdown -boot strict=on \
    -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
    -drive file=/var/vm/win7eval.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
    -drive file=/home/damarion/iso/7600.16385.090713-1255_x86fre_enterprise_en-us_EVAL_Eval_Enterprise-GRMCENEVAL_EN_DVD.iso,if=none,id=drive-ide0-0-0,readonly=on,format=raw \
    -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
    -drive file=/home/damarion/iso/virtio-win-0.1-74.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw \
    -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
    -netdev tap,fd=24,id=hostnet0 \
    -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:38:31:0a,bus=pci.0,addr=0x3 \
    -chardev pty,id=charserial0 \
    -device isa-serial,chardev=charserial0,id=serial0 \
    -device usb-tablet,id=input0 \
    -vnc 127.0.0.1:1 \
    -device VGA,id=video0,bus=pci.0,addr=0x2 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1307473/+subscriptions



[Qemu-devel] [PATCH for 2.1 1/2] memory: introduce memory_region_init_ram_nofail() and memory_region_init_ram_ptr_nofail()

2014-07-02 Thread Hu Tao
Introduce memory_region_init_ram_nofail() and
memory_region_init_ram_ptr_nofail(), which behave like the old
memory_region_init_ram() and memory_region_init_ram_ptr()
respectively: they exit qemu if there is an error.

All existing calls to memory_region_init_ram() and
memory_region_init_ram_ptr() are replaced with
memory_region_init_ram_nofail() and memory_region_init_ram_ptr_nofail().

memory_region_init_ram() and memory_region_init_ram_ptr() gain an
extra parameter errp to let callers handle the error.

This patch solves a problem where qemu just exits when the monitor
command object_add is used to add a memory backend whose size is way
too large. In that case it is better to give an error message and keep
the guest running.
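
The split between a fallible function and its _nofail wrapper can be sketched
outside qemu roughly as follows. This is only an illustration under stated
assumptions: struct sketch_error stands in for qemu's Error, the sketch_
function names are made up, and the "limit" parameter is a hypothetical way
to simulate an allocation that is too large.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an error object. */
struct sketch_error {
    char msg[128];
};

/* Fallible variant: reports failure through *err and returns NULL. */
static void *sketch_region_init(size_t size, size_t limit,
                                struct sketch_error *err)
{
    void *p = NULL;

    if (size != 0 && size <= limit) {
        p = malloc(size);
    }
    if (!p && err) {
        snprintf(err->msg, sizeof(err->msg),
                 "cannot set up guest memory of %zu bytes", size);
    }
    return p;
}

/* _nofail variant: keeps the old behaviour of exiting on error. */
static void *sketch_region_init_nofail(size_t size, size_t limit)
{
    struct sketch_error err = { "" };
    void *p = sketch_region_init(size, limit, &err);

    if (!p) {
        fprintf(stderr, "%s\n", err.msg);
        exit(1);
    }
    return p;
}
```

A caller such as a monitor command would use the fallible variant and print
the message, while board setup code that cannot continue without the memory
would keep using the wrapper.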

How to reproduce:

1. run qemu
2. (monitor)object_add memory-backend-ram,size=10G,id=ram0


Signed-off-by: Hu Tao 
---
 backends/hostmem-ram.c   |  2 +-
 exec.c   | 30 +
 hw/block/pflash_cfi01.c  |  5 -
 hw/block/pflash_cfi02.c  |  5 -
 hw/core/loader.c |  2 +-
 hw/display/vga.c |  2 +-
 hw/display/vmware_vga.c  |  3 ++-
 hw/i386/kvm/pci-assign.c |  9 
 hw/i386/pc.c |  2 +-
 hw/i386/pc_sysfw.c   |  4 ++--
 hw/misc/ivshmem.c|  9 
 hw/misc/vfio.c   |  3 ++-
 hw/pci/pci.c |  2 +-
 include/exec/memory.h| 32 ---
 include/exec/ram_addr.h  |  4 ++--
 memory.c | 57 +++-
 numa.c   |  4 ++--
 17 files changed, 134 insertions(+), 41 deletions(-)

diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index d9a8290..a67a134 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -27,7 +27,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 
 path = object_get_canonical_path_component(OBJECT(backend));
 memory_region_init_ram(&backend->mr, OBJECT(backend), path,
-   backend->size);
+   backend->size, errp);
 g_free(path);
 }
 
diff --git a/exec.c b/exec.c
index 5a2a25e..8c2a91d 100644
--- a/exec.c
+++ b/exec.c
@@ -1224,7 +1224,7 @@ static int memory_try_enable_merging(void *addr, size_t 
len)
 return qemu_madvise(addr, len, QEMU_MADV_MERGEABLE);
 }
 
-static ram_addr_t ram_block_add(RAMBlock *new_block)
+static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
 {
 RAMBlock *block;
 ram_addr_t old_ram_size, new_ram_size;
@@ -1241,9 +1241,11 @@ static ram_addr_t ram_block_add(RAMBlock *new_block)
 } else {
 new_block->host = phys_mem_alloc(new_block->length);
 if (!new_block->host) {
-fprintf(stderr, "Cannot set up guest memory '%s': %s\n",
-new_block->mr->name, strerror(errno));
-exit(1);
+error_setg_errno(errp, errno,
+ "cannot set up guest memory '%s'",
+ new_block->mr->name);
+qemu_mutex_unlock_ramlist();
+return -1;
 }
 memory_try_enable_merging(new_block->host, new_block->length);
 }
@@ -1294,6 +1296,7 @@ ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, 
MemoryRegion *mr,
 Error **errp)
 {
 RAMBlock *new_block;
+ram_addr_t addr;
 
 if (xen_enabled()) {
 error_setg(errp, "-mem-path not supported with Xen");
@@ -1323,14 +1326,19 @@ ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, 
MemoryRegion *mr,
 return -1;
 }
 
-return ram_block_add(new_block);
+addr = ram_block_add(new_block, errp);
+if (errp && *errp) {
+g_free(new_block);
+}
+return addr;
 }
 #endif
 
 ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
-   MemoryRegion *mr)
+   MemoryRegion *mr, Error **errp)
 {
 RAMBlock *new_block;
+ram_addr_t addr;
 
 size = TARGET_PAGE_ALIGN(size);
 new_block = g_malloc0(sizeof(*new_block));
@@ -1341,12 +1349,16 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, 
void *host,
 if (host) {
 new_block->flags |= RAM_PREALLOC;
 }
-return ram_block_add(new_block);
+addr = ram_block_add(new_block, errp);
+if (errp && *errp) {
+g_free(new_block);
+}
+return addr;
 }
 
-ram_addr_t qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr)
+ram_addr_t qemu_ram_alloc(ram_addr_t size, MemoryRegion *mr, Error **errp)
 {
-return qemu_ram_alloc_from_ptr(size, NULL, mr);
+return qemu_ram_alloc_from_ptr(size, NULL, mr, errp);
 }
 
 void qemu_ram_free_from_ptr(ram_addr_t addr)
diff --git a/hw/block/pflash_cfi01.c b/hw/block/pflash_cfi01.c
index f9507b4..92b8b87 100644
--- a/hw/block/pflash_cfi01.c
+++ b/hw/block/pflash_cfi01.

[Qemu-devel] [PATCH for 2.1 2/2] memory-backend-file: improve error handling

2014-07-02 Thread Hu Tao
This patch fixes two problems with memory-backend-file:

1. If the user adds a memory-backend-file object using the object_add
   command and specifies a nonexistent directory for the mem-path
   property, qemu dumps core with the message:

 /nonexistingdir: No such file or directory
 Bad ram offset f000
 Aborted (core dumped)

2. If the user adds a memory-backend-file object using the object_add
   command and specifies a size smaller than the huge page size, qemu
   dumps core with the message:

 Bad ram offset f000
 Aborted (core dumped)

Signed-off-by: Hu Tao 
---
 exec.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/exec.c b/exec.c
index 8c2a91d..35c2dcb 100644
--- a/exec.c
+++ b/exec.c
@@ -996,7 +996,7 @@ void qemu_mutex_unlock_ramlist(void)
 
 #define HUGETLBFS_MAGIC   0x958458f6
 
-static long gethugepagesize(const char *path)
+static long gethugepagesize(const char *path, Error **errp)
 {
 struct statfs fs;
 int ret;
@@ -1006,7 +1006,7 @@ static long gethugepagesize(const char *path)
 } while (ret != 0 && errno == EINTR);
 
 if (ret != 0) {
-perror(path);
+error_setg_errno(errp, errno, "failed to stat file %s", path);
 return 0;
 }
 
@@ -1024,17 +1024,19 @@ static void *file_ram_alloc(RAMBlock *block,
 char *filename;
 char *sanitized_name;
 char *c;
-void *area;
+void *area = NULL;
 int fd;
 unsigned long hpagesize;
 
-hpagesize = gethugepagesize(path);
+hpagesize = gethugepagesize(path, errp);
 if (!hpagesize) {
 goto error;
 }
 
 if (memory < hpagesize) {
-return NULL;
+error_setg(errp, "memory size 0x" RAM_ADDR_FMT " should be larger "
+   "than huge page size 0x%" PRIx64, memory, hpagesize);
+goto error;
 }
 
 if (kvm_enabled() && !kvm_has_sync_mmu()) {
@@ -1094,8 +1096,8 @@ static void *file_ram_alloc(RAMBlock *block,
 return area;
 
 error:
-if (mem_prealloc) {
-exit(1);
+if (area && area != MAP_FAILED) {
+munmap(area, memory);
 }
 return NULL;
 }
-- 
1.9.3
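
As a standalone illustration of the gethugepagesize() change above, here is a
sketch that reports a statfs() failure to the caller instead of printing it.
It assumes Linux (for <sys/vfs.h>); the function name and the plain character
buffer used in place of Error **errp are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/vfs.h>

/*
 * Returns the filesystem block size for path, or 0 on failure with a
 * message written to errbuf. errbuf is left untouched on success.
 */
static long sketch_fs_block_size(const char *path, char *errbuf, size_t errlen)
{
    struct statfs fs;
    int ret;

    /* Retry if the call is interrupted by a signal, as the original does. */
    do {
        ret = statfs(path, &fs);
    } while (ret != 0 && errno == EINTR);

    if (ret != 0) {
        snprintf(errbuf, errlen, "failed to stat file %s: %s",
                 path, strerror(errno));
        return 0;
    }
    return (long)fs.f_bsize;
}
```

The caller decides what to do with the message, which is what lets object_add
fail cleanly rather than abort the whole process.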




[Qemu-devel] [PATCH for 2.1 0/2] bug fixs for memory backend

2014-07-02 Thread Hu Tao
This series includes two patches to fix bugs of memory backend. See each
patch for the bugs and how to reproduce them.

Hu Tao (2):
  memory: introduce memory_region_init_ram_nofail() and
memory_region_init_ram_ptr_nofail()
  memory-backend-file: improve error handling

 backends/hostmem-ram.c   |  2 +-
 exec.c   | 46 --
 hw/block/pflash_cfi01.c  |  5 -
 hw/block/pflash_cfi02.c  |  5 -
 hw/core/loader.c |  2 +-
 hw/display/vga.c |  2 +-
 hw/display/vmware_vga.c  |  3 ++-
 hw/i386/kvm/pci-assign.c |  9 
 hw/i386/pc.c |  2 +-
 hw/i386/pc_sysfw.c   |  4 ++--
 hw/misc/ivshmem.c|  9 
 hw/misc/vfio.c   |  3 ++-
 hw/pci/pci.c |  2 +-
 include/exec/memory.h| 32 ---
 include/exec/ram_addr.h  |  4 ++--
 memory.c | 57 +++-
 numa.c   |  4 ++--
 17 files changed, 143 insertions(+), 48 deletions(-)

-- 
1.9.3




[Qemu-devel] [Bug 1307473] Re: guest hang due to missing clock interrupt

2014-07-02 Thread urusha
After installing kernel 3.15.1-031501-generic from kernel-ppa, both
machines have worked without issues since 2014-06-25. It seems to be a
kernel issue that has already been solved upstream.




Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Chen, Tiejun

On 2014/7/2 23:27, Michael S. Tsirkin wrote:
>On Wed, Jul 02, 2014 at 03:15:02PM +, Ross Philipson wrote:
>>>-Original Message-
>>>From: Paolo Bonzini [mailto:pbonz...@redhat.com]
>>>Sent: Wednesday, July 02, 2014 7:33 AM
>>>To: Ross Philipson; Michael S. Tsirkin; Stefano Stabellini
>>>Cc: peter.mayd...@linaro.org; xen-de...@lists.xensource.com; Allen M.
>>>Kay; kelly.zyta...@amd.com; qemu-devel@nongnu.org;
>>>yang.z.zh...@intel.com; anth...@codemonkey.ws; Anthony Perard; Chen,
>>>Tiejun
>>>Subject: Re: [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough
>>>support
>>>
>>>Il 01/07/2014 19:39, Ross Philipson ha scritto:
>>>>
>>>>We do IGD pass-through in our project (XenClient). The patches
>>>>originally came from our project. We surface the same ISA bridge and
>>>>have never had activation issues on any version of Windows from XP to
>>>>Win8. We do not normally run server platforms so I can't say for sure
>>>>there.
>>>
>>>The problem is not activation, the problem is that the patches are
>>>making assumptions on the driver and the firmware that might work today
>>>but are IMHO just not sane.
>>
>>Sure I don't think anybody is suggesting that activation is
>>the main problem. It was just a potential problem with respect
>>to one of the proposed solutions.
>>
>>When we first started doing this (back in 2009ish) we ran into
>>all these problems with surfacing ISA bridges, giving guest
>>drivers access to registers in the host bridge. etc. Nothing seemed
>>sane; I sympathize.
>
>At some level, maybe Paolo is right.  Ignore existing drivers and ask
>intel developers to update their drivers to do something sane on
>hypervisors, even if they do ugly things on real hardware.
>
>A simple proposal since what I wrote earlier though apparently wasn't
>very clear:
>
>   Detect Xen subsystem vendor id on vga card.
>   If there, avoid poking at chipset. Instead
> - use subsystem device # for card type
> - use second half of BAR0 of device
> - instead of access to pci host
>
>hypervisors will simply take BAR0 and double it in size,
>make second part map to what would be the pci host.
>
>Tiejun, is there a chance this can be done not only
>on Linux but on windows as well?

MST,

Looks like this is a paravirtualized way, right?

I can post this requirement to check but please make sure I really 
understand what you mean,


#1 We need to define a new Xen subsystem vendor id and emulate this 
value on vga card


#2 Native driver need to do:

	* if the subsystem id on vga is that emulated XEN subsystem id, the 
native driver can get all necessary access including PCI host bridge at 
0.0 and ISA bridge at 1f.0 from second half of that emulated BAR0 double 
the real size.


Right? If yes, I'd like to ask them.

But question is how to walk from PCI config on PCI host to BAR0 on VGA:

dev_priv->bridge_dev = pci_get_bus_and_slot(0, PCI_DEVFN(0, 0));

pci_write/read_config_dword(dev_priv->bridge_dev,,,)

Thanks
Tiejun






>>>I would have no problem with a clean patchset that adds a new machine
>>>type and doesn't touch code in "-M pc", but it looks like mst disagrees.
>>>   Ultimately, if a patchset is too hacky for upstream, you can include
>>>it in your downstream XenClient (and XenServer) QEMU branch.  It
>>>happens.
>>>
>>>Paolo







Re: [Qemu-devel] [Xen-devel] [RFC PATCH V3 1/2] xen: pass kernel initrd to qemu [and 1 more messages]

2014-07-02 Thread Chun Yan Liu


>>> On 7/2/2014 at 11:17 PM, in message
<21428.8829.273127.394...@mariner.uk.xensource.com>, Ian Jackson
 wrote: 
> Ian Campbell writes ("Re: [RFC PATCH V3 1/2] xen: pass kernel initrd to  
> qemu"): 
> > On Mon, 2014-06-23 at 15:22 +0100, Ian Jackson wrote: 
> > > If we are going to do this then I think the kernel, cmdline and 
> > > ramdisk (and bootloader) parameters shoudl be moved into the main part 
> > > of the domain_build_info struct.  This will involve a compatibility 
> > > layer: temporarily (for at least one release) 
> >  
> > I don't think so -- we would need to retain it forever or at least until 
> > some sort of "API break" event. We still guarantee that applications 
> > using the 4.2 API will be supported. 
>  
> Yes.  Sorry, I meant that the compatibility should be retained for 
> some considerable time.  So for now we should honour all the existing 
> config struct members plus also the new cmdline member which should 
> IMO be in the main part of the struct and not inside pv. 

No new member is created; it's always 'cmdline' in libxl_domain_build_info.
'root', 'extra' and the new 'cmdline' are only keywords in the config file.

Previously, libxl_domain_build_info only had u.pv.kernel|cmdline|ramdisk.
Now that both PV and HVM support them, in theory we should move them
to the main part, but given the compatibility issue I'm not sure which
option is better:
1. add u.hvm.kernel|cmdline|ramdisk and add hvm processing only (as in V2)
2. add u.kernel|cmdline|ramdisk (since now both PV and HVM have these) but
keep u.pv.kernel|cmdline|ramdisk, add hvm processing, and add pv processing
of u.kernel|cmdline|ramdisk too so that new users can use the new APIs
(as in V3)
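
To make option 2 concrete, here is a rough sketch of what a compatibility
lookup could look like. It is not libxl code: the struct is a simplified
stand-in, the new fields are assumed to sit outside the type-specific part,
and preferring the new field over the legacy PV one is an assumption rather
than something decided in this thread.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-in for libxl_domain_build_info. */
struct sketch_build_info {
    const char *kernel;    /* new, type-independent location */
    const char *cmdline;
    const char *ramdisk;
    struct {
        struct {
            const char *kernel;    /* legacy PV-only location, kept for */
            const char *cmdline;   /* callers written against the old API */
            const char *ramdisk;
        } pv;
    } u;
};

/* Prefer the new field and fall back to the legacy PV field if it is unset. */
static const char *sketch_effective_kernel(const struct sketch_build_info *b)
{
    return b->kernel ? b->kernel : b->u.pv.kernel;
}
```

Old callers that only fill in u.pv.kernel keep working, while new callers can
set the common field for either guest type.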

Your suggestions?

- Chunyan



>  
> > > Why are you deprecating root= and extra= ? 
> >  
> > I suggested this. They are suckful interfaces which expose Linux 
> > specifics (e.g. the root= syntax) in our guest cfg files. cmdline is the 
> > generic equivalent. 
>  
> I have spoken to Ian C about this and he has convinced me that I was 
> wrong to object to deprecating root= and extra=.  So please do what 
> Ian C says, not what I say.  Sorry. 
>  
> Thanks, 
> Ian. 
>  





Re: [Qemu-devel] [PATCH] memory: introduce memory_region_init_ram_nofail() and memory_region_init_ram_ptr_nofail()

2014-07-02 Thread Hu Tao
Hi,

Sorry that I forgot to send a follow-up patch to this one, I'll resend
this patch with the follow-up.

Regards,
Hu



Re: [Qemu-devel] [PATCH v5] ppc: spapr-rtas - implement os-term rtas call

2014-07-02 Thread Nikunj A Dadhania
Alexey Kardashevskiy  writes:

> On 06/30/2014 06:35 PM, Nikunj A Dadhania wrote:
>> PAPR compliant guest calls this in absence of kdump. This finally
>> reaches the guest and can be handled according to the policies set by
>> higher level tools(like taking dump) for further analysis by tools like
>> crash.
>> 
>> Linux kernel calls ibm,os-term when extended property of os-term is set.
>> This makes sure that a return to the linux kernel is guaranteed.
>> 
>> CC: Benjamin Herrenschmidt 
>> CC: Anton Blanchard 
>> CC: Alexander Graf 
>> CC: Tyrel Datwyler 
>> Signed-off-by: Nikunj A Dadhania 
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index bbba51a..4e96381 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -382,7 +382,6 @@ int spapr_allocate_irq_block(int num, bool lsi, bool 
>> msi);
>>  #define RTAS_GET_SENSOR_STATE   (RTAS_TOKEN_BASE + 0x1D)
>>  #define RTAS_IBM_CONFIGURE_CONNECTOR(RTAS_TOKEN_BASE + 0x1E)
>>  #define RTAS_IBM_OS_TERM(RTAS_TOKEN_BASE + 0x1F)
>> -#define RTAS_IBM_EXTENDED_OS_TERM   (RTAS_TOKEN_BASE + 0x20)
>
>
> So we are never ever going to implement this RTAS call?

Yeah, as it's an RTAS property and not a call.

> I'd keep the number.

Regards
Nikunj




Re: [Qemu-devel] [PATCH v7 3/4] s390x: Migrate to new NMI interface

2014-07-02 Thread Alexey Kardashevskiy
On 06/23/2014 11:32 PM, Alexey Kardashevskiy wrote:
> On 06/16/2014 06:37 PM, Alexander Graf wrote:
>>
>> On 16.06.14 10:33, Alexey Kardashevskiy wrote:
>>> On 06/16/2014 05:16 PM, Cornelia Huck wrote:
>>>> On Sat, 14 Jun 2014 12:41:50 +1000
>>>> Alexey Kardashevskiy  wrote:
>>>>
>>>>> On 06/13/2014 04:00 PM, Cornelia Huck wrote:
>>>>>> On Fri, 13 Jun 2014 13:36:58 +1000
>>>>>> Alexey Kardashevskiy  wrote:
>>>>>>
>>>>>>> This implements an NMI interface for s390 and s390-ccw machines.
>>>>>>>
>>>>>>> This removes #ifdef s390 branch in qmp_inject_nmi so new s390's
>>>>>>> nmi_monitor_handler() callback is going to be used for NMI.
>>>>>>>
>>>>>>> Since nmi_monitor_handler()-calling code is platform independent,
>>>>>>> CPUState::cpu_index is used instead of S390CPU::env.cpu_num.
>>>>>>> There should not be any change in behaviour as both @cpu_index and
>>>>>>> @cpu_num are global CPU numbers.
>>>>>>>
>>>>>>> Also, s390_cpu_restart() takes care of preforming operations in
>>>>>>> the specific CPU thread so no extra measure is required here either.
>>>>>> I find this paragraph a bit confusing; I'd just remove it.
>>>>> Besides bad english (please feel free to adjust it), what else is
>>>>> confusing
>>>>> here? I put it there because the spapr patch makes use of
>>>>> async_run_on_cpu() and maintainers may ask why I do not do the same for
>>>>> other platforms. This way I hoped I could reduce number of versions to
>>>>> post :)
>>>> What about
>>>>
>>>> "Note that s390_cpu_restart() already takes care of the specified cpu,
>>>> so we don't need to schedule via async_run_on_cpu()."
>>> I fail to see how exactly this is better or different but ok :)
>>>
>>>
>>> Alex, should I repost it with Cornelia's suggestion? What should happen
>>> next to this patchset? Who is supposed to pick it up? Thanks.
>>
>> Just post v8 of that single patch with the right message-id as reference. I
>> can pick up the patches, but I'd like at least an ack from Paolo on the
>> whole set.
> 
> 
> Anybody, ping? Or we are waiting till x86 machines got QOM'ed and then I'll
> repost it with x86 NMI handler? Thanks!


Paolo promised to ack (in irc) and obviously forgot :) Should I give up and
stop bothering noble people? :)



-- 
Alexey



Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Ming Lei
On Thu, Jul 3, 2014 at 12:21 AM, Paolo Bonzini  wrote:
> Il 02/07/2014 17:45, Ming Lei ha scritto:
>> The attachment debug patch skips aio_notify() if qemu_bh_schedule
>> is running from current aio context, but looks there is still 120K
>> writes triggered. (without the patch, 400K can be observed in
>> same test)
>
> Nice.  Another observation is that after aio_dispatch we'll always
> re-evaluate everything (bottom halves, file descriptors and timeouts),

The idea is very good.

If aio_notify() is called from the 1st aio_dispatch() in aio_poll(),
ctx->notifier might need to be set, but it can be handled easily.

> so we can skip the aio_notify if we're inside aio_dispatch.
>
> So what about this untested patch:
>
> diff --git a/aio-posix.c b/aio-posix.c
> index f921d4f..a23d85d 100644
> --- a/aio-posix.c
> +++ b/aio-posix.c

#include "qemu/atomic.h"

> @@ -124,6 +124,9 @@ static bool aio_dispatch(AioContext *ctx)
>  AioHandler *node;
>  bool progress = false;
>
> +/* No need to set the event notifier during aio_notify.  */
> +ctx->running++;
> +
>  /*
>   * We have to walk very carefully in case qemu_aio_set_fd_handler is
>   * called while we're walking.
> @@ -169,6 +171,11 @@ static bool aio_dispatch(AioContext *ctx)
>  /* Run our timers */
>  progress |= timerlistgroup_run_timers(&ctx->tlg);
>
> +smp_wmb();
> +ctx->iter_count++;
> +smp_wmb();
> +ctx->running--;
> +
>  return progress;
>  }
>
> diff --git a/async.c b/async.c
> index 5b6fe6b..1f56afa 100644
> --- a/async.c
> +++ b/async.c

#include "qemu/atomic.h"

> @@ -249,7 +249,19 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
>
>  void aio_notify(AioContext *ctx)
>  {
> -event_notifier_set(&ctx->notifier);
> +uint32_t iter_count;
> +do {
> +iter_count = ctx->iter_count;
> +/* Read ctx->iter_count before ctx->running.  */
> +smb_rmb();

s/smb/smp

> +if (!ctx->running) {
> +event_notifier_set(&ctx->notifier);
> +return;
> +}
> +/* Read ctx->running before ctx->iter_count.  */
> +smb_rmb();

s/smb/smp

> +/* ctx might have gone to sleep.  */
> +} while (iter_count != ctx->iter_count);
>  }

Since both 'running' and 'iter_count' may be read locklessly, something
like ACCESS_ONCE() should be used to keep the compiler from caching or
merging these loads.
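
To illustrate the ACCESS_ONCE() point, here is a minimal sketch of the
read side only. It is not the patch: it assumes a kernel-style
ACCESS_ONCE() macro built on a volatile cast (defined locally), uses the
GCC/Clang __atomic_thread_fence() builtin as a stand-in for smp_rmb(),
and replaces AioContext with a hypothetical struct ctx_model whose
'notified' counter stands in for event_notifier_set().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed kernel-style macro: force a single real load of x by going
 * through a volatile-qualified lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical reduced model of the two fields added by the patch. */
struct ctx_model {
    uint32_t iter_count;
    uint32_t running;
    unsigned notified;   /* counts would-be event_notifier_set() calls */
};

/* Returns true if the notifier was set, false if the wakeup was skipped
 * because a dispatch was in progress for the whole check. */
static bool notify_model(struct ctx_model *ctx)
{
    uint32_t iter_count;

    do {
        iter_count = ACCESS_ONCE(ctx->iter_count);
        /* Read iter_count before running (stand-in for smp_rmb()). */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        if (!ACCESS_ONCE(ctx->running)) {
            ctx->notified++;
            return true;
        }
        /* Read running before re-reading iter_count. */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        /* The dispatcher might have finished an iteration meanwhile. */
    } while (iter_count != ACCESS_ONCE(ctx->iter_count));

    return false;
}
```

With the volatile accesses the compiler has to reload both fields on
every pass through the loop instead of reusing a value it read earlier.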

>
>  static void aio_timerlist_notify(void *opaque)
> @@ -269,6 +279,7 @@ AioContext *aio_context_new(void)
>  ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
>  ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
>  ctx->thread_pool = NULL;
> +ctx->iter_count = ctx->running = 0;
>  qemu_mutex_init(&ctx->bh_lock);
>  rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
>  event_notifier_init(&ctx->notifier, false);
> diff --git a/include/block/aio.h b/include/block/aio.h
> index a92511b..9f51c4f 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -51,6 +51,9 @@ struct AioContext {
>  /* Protects all fields from multi-threaded access */
>  RFifoLock lock;
>
> +/* Used to avoid aio_notify while dispatching event handlers.
> + * Writes protected by lock or BQL, reads are lockless.
> + */
> +uint32_t iter_count, running;
> +
>  /* The list of registered AIO handlers */
>  QLIST_HEAD(, AioHandler) aio_handlers;
>

In my test, it does reduce the number of write() calls considerably, and
I hope a formal version can be applied soon.


Thanks,
-- 
Ming Lei



Re: [Qemu-devel] from which version qemu support clone on rbd

2014-07-02 Thread Brian Jackson

On Wednesday, July 2, 2014 8:12:17 PM CDT, yue wrote:

Could you tell me why 'Qemu doesn't handle that level of abstraction'?
I know qcow2 well, so you can explain it by comparison with that.



Qemu would have to have a lot of extra code (that already exists elsewhere) 
to support taking snapshots/clones/etc of every backend device it supports. 
In the case of qcow2, the support exists because of the whole system 
snapshotting feature. Otherwise it's "bloat" that Qemu doesn't normally 
handle (it's left to the management layers above to handle).


--Iggy


 
thanks








At 2014-07-02 11:36:12, "Brian Jackson"  wrote:
Qemu doesn't handle that level of abstraction. The closest approximation 
you could probably come up with is qemu-img's backing file support for 
qcow2 images.


You should stick to using the rbd tool to create clones of rbd devices. 
Alternatively, use a higher level tool (like openstack, etc) that  ...








Re: [Qemu-devel] [PATCH] qemu-img info: show nocow info

2014-07-02 Thread Chun Yan Liu


>>> On 7/2/2014 at 09:03 PM, in message <53b4031e.3030...@redhat.com>, Eric Blake wrote: 
> On 07/02/2014 03:50 AM, Chunyan Liu wrote: 
> > Add nocow info in 'qemu-img info' output to show whether the file 
> > currently has NOCOW flag set or not. 
> >  
> > Signed-off-by: Chunyan Liu  
> > --- 
> >  block/qapi.c | 25 + 
> >  qapi/block-core.json |  3 ++- 
> >  2 files changed, 27 insertions(+), 1 deletion(-) 
> >  
>  
> > + 
> > +/* get NOCOW info */ 
> > +fd = qemu_open(bs->filename, O_RDONLY | O_NONBLOCK); 
> > +if (fd >= 0) { 
> > +if (ioctl(fd, FS_IOC_GETFLAGS, &attr) == 0 && (attr & 
> > FS_NOCOW_FL)) { 
> > +info->has_nocow = true; 
> > +info->nocow = true; 
>  
> Better is: 
>  
> if (ioctl... == 0) { 
> info->has_nocow = true; 
> info->nocow = !!(attr & FS_NOCOW_FL) 
> } 
>  
It won't work. FS_IOC_GETFLAGS is an ioctl common to all file systems; it only
returns the flags that are currently set on the file, in the output 'attr'.
Even if the filesystem doesn't support NOCOW, that ioctl can still return 0.
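
A minimal sketch of the query being discussed, assuming Linux and a
hypothetical helper name query_nocow() (not part of the patch). It shows
why a successful ioctl alone cannot tell "NOCOW supported but clear"
apart from "filesystem has no NOCOW at all": both end up as a return
value of 0 here.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef FS_NOCOW_FL
#define FS_NOCOW_FL 0x00800000 /* value from linux/fs.h, for older headers */
#endif

/* Hypothetical helper:
 *   1  -> FS_NOCOW_FL is currently set on the file
 *   0  -> flag not set (the filesystem may or may not support NOCOW)
 *  -1  -> the file could not be opened or the flags could not be read */
static int query_nocow(const char *path)
{
    int attr = 0;
    int ret;
    int fd = open(path, O_RDONLY | O_NONBLOCK);

    if (fd < 0) {
        return -1;
    }
    ret = ioctl(fd, FS_IOC_GETFLAGS, &attr);
    close(fd);
    if (ret != 0) {
        return -1;
    }
    return (attr & FS_NOCOW_FL) ? 1 : 0;
}
```

For example, query_nocow("disk.img") returning 0 on a filesystem without
NOCOW support is indistinguishable from the same result on btrfs with
the flag cleared, which is the ambiguity described above.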

> to explicitly document the cases where we know nocow works but is not 
> set (and omitting it in cases where nocow doesn't exist because of the fs) 
>  
>  
> > +++ b/qapi/block-core.json 
> > @@ -126,7 +126,8 @@ 
> > '*backing-filename': 'str', '*full-backing-filename': 'str', 
> > '*backing-filename-format': 'str', '*snapshots':  
> ['SnapshotInfo'], 
> > '*backing-image': 'ImageInfo', 
> > -   '*format-specific': 'ImageInfoSpecific' } } 
> > +   '*format-specific': 'ImageInfoSpecific', 
> > +   '*nocow': 'bool' } } 
>  
> Missing documentation. When adding that, remember to add '(since 2.2)' 
> (or even since 2.1 if you think this is a bug fix that should go in 
> during hard freeze) 

OK. Thanks.

-Chunyan

>  
> --  
> Eric Blake   eblake redhat com+1-919-301-3266 
> Libvirt virtualization library http://libvirt.org 
>  
>  





Re: [Qemu-devel] [PATCH v5] ppc: spapr-rtas - implement os-term rtas call

2014-07-02 Thread Alexey Kardashevskiy
On 07/03/2014 01:41 PM, Alexey Kardashevskiy wrote:
> On 06/30/2014 06:35 PM, Nikunj A Dadhania wrote:
>> PAPR compliant guest calls this in absence of kdump. This finally
>> reaches the guest and can be handled according to the policies set by
>> higher level tools(like taking dump) for further analysis by tools like
>> crash.
>>
>> Linux kernel calls ibm,os-term when extended property of os-term is set.
>> This makes sure that a return to the linux kernel is guaranteed.
>>
>> CC: Benjamin Herrenschmidt 
>> CC: Anton Blanchard 
>> CC: Alexander Graf 
>> CC: Tyrel Datwyler 
>> Signed-off-by: Nikunj A Dadhania 
>>
>> ---
>>
>> v2: rebase to ppcnext
>> v3: Do not stop the VM, and update comments
>> v4: update spapr_register_rtas and qapi_event changes
>> v5: set ibm,extended-os-term as null encoded property
>> ---
>>  hw/ppc/spapr.c |  9 +
>>  hw/ppc/spapr_rtas.c| 15 +++
>>  include/hw/ppc/spapr.h |  1 -
>>  3 files changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 307c58d..e6c9014 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -520,6 +520,15 @@ static void *spapr_create_fdt_skel(hwaddr initrd_base,
>>  
>>  _FDT((fdt_property_cell(fdt, "rtas-error-log-max", 
>> RTAS_ERROR_LOG_MAX)));
>>  
>> +/*
>> + * According to PAPR, rtas ibm,os-term, does not guarantee a return
>> + * back to the guest cpu.
>> + *
>> + * While an additional ibm,extended-os-term property indicates that
>> + * rtas call return will always occur. Set this property.
>> + */
>> +_FDT((fdt_property(fdt, "ibm,extended-os-term", NULL, 0)));
>> +
>>  _FDT((fdt_end_node(fdt)));
>>  
>>  /* interrupt controller */
>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>> index 9ba1ba6..2ec2a8e 100644
>> --- a/hw/ppc/spapr_rtas.c
>> +++ b/hw/ppc/spapr_rtas.c
>> @@ -277,6 +277,19 @@ static void rtas_ibm_set_system_parameter(PowerPCCPU 
>> *cpu,
>>  rtas_st(rets, 0, ret);
>>  }
>>  
>> +static void rtas_ibm_os_term(PowerPCCPU *cpu,
>> +sPAPREnvironment *spapr,
>> +uint32_t token, uint32_t nargs,
>> +target_ulong args,
>> +uint32_t nret, target_ulong rets)
>> +{
>> +target_ulong ret = 0;
>> +
>> +qapi_event_send_guest_panicked(GUEST_PANIC_ACTION_PAUSE, &error_abort);
>> +
>> +rtas_st(rets, 0, ret);
>> +}
>> +
>>  static struct rtas_call {
>>  const char *name;
>>  spapr_rtas_fn fn;
>> @@ -404,6 +417,8 @@ static void core_rtas_register_types(void)
>>  spapr_rtas_register(RTAS_IBM_SET_SYSTEM_PARAMETER,
>>  "ibm,set-system-parameter",
>>  rtas_ibm_set_system_parameter);
>> +spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
>> +rtas_ibm_os_term);
>>  }
>>  
>>  type_init(core_rtas_register_types)
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index bbba51a..4e96381 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -382,7 +382,6 @@ int spapr_allocate_irq_block(int num, bool lsi, bool 
>> msi);
>>  #define RTAS_GET_SENSOR_STATE   (RTAS_TOKEN_BASE + 0x1D)
>>  #define RTAS_IBM_CONFIGURE_CONNECTOR(RTAS_TOKEN_BASE + 0x1E)
>>  #define RTAS_IBM_OS_TERM(RTAS_TOKEN_BASE + 0x1F)
>> -#define RTAS_IBM_EXTENDED_OS_TERM   (RTAS_TOKEN_BASE + 0x20)
> 
> 
> So we are never ever going to implement this RTAS call?
> I'd keep the number.

ah, it is in ppc-next-2.2 already. Never mind :)

> 
> 
>>  
>>  #define RTAS_TOKEN_MAX  (RTAS_TOKEN_BASE + 0x21)
>>  
>>
> 
> 


-- 
Alexey



Re: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-02 Thread Hongyang Yang

Hi David,

On 07/01/2014 08:12 PM, Dr. David Alan Gilbert wrote:

* Hongyang Yang (yan...@cn.fujitsu.com) wrote:

Hi Yang,


Background:
   The COLO HA project is a high-availability solution. Both the primary
VM (PVM) and the secondary VM (SVM) run in parallel. They receive the
same requests from the client, and generate responses in parallel too.
If the response packets from the PVM and SVM are identical, they are
released immediately. Otherwise, a VM checkpoint (on demand) is
conducted. The idea was presented at Xen Summit 2012 and 2013,
and in an academic paper at SOCC 2013. It was also presented at KVM Forum
2013:
http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf
Please refer to the above document for detailed information.


Yes, I remember that talk - very interesting.

I didn't quite understand a couple of things though, perhaps you
can explain:
   1) If we ignore the TCP sequence number problem, in an SMP machine
don't we get other randomnesses - e.g. which core completes something
first, or who wins a lock contention, so the output stream might not
be identical - so do those normal bits of randomness cause the machines
to flag as out-of-sync?


That is about the COLO agent. CCing Congyang, who can give a detailed
explanation.



   2) If the PVM has decided that the SVM is out of sync (due to 1) and
the PVM fails at about the same point - can we switch over to the SVM?


Yes, we can switch over. We have some mechanisms to ensure that the SVM's
state is consistent:
- Memory cache
  The memory cache is initially the same as the PVM's memory. At a
checkpoint, we cache the dirty memory of the PVM while transferring the
memory, and write the cached memory to the SVM once we have received all
of the PVM memory (we only need to write memory that was dirtied on the
PVM and the SVM since the last checkpoint). This solves problem 2) you've
mentioned above: if the PVM fails while checkpointing, the SVM will discard
the cached memory and continue to run and provide service as it is.

- COLO Disk Manager
  Like the memory cache, the COLO Disk Manager caches the disk modifications
of the PVM and writes them to the SVM disk when checkpointing. If the PVM
fails while checkpointing, the SVM will discard the cached disk modifications.



I'm worried that due to (1) there are periods where the system
is out-of-sync and a failure of the PVM is not protected.  Does that happen?
If so how often?


Attached is the architecture of kvm-COLO that we proposed.
   - COLO Manager: requires modifications to QEMU
     - COLO Controller
       The COLO Controller includes modifications to the save/restore
       flow, just like MC (microcheckpointing), a memory cache on the
       secondary VM which caches the dirty pages of the primary VM,
       and a failover module which provides APIs to communicate
       with an external heartbeat module.
     - COLO Disk Manager
       When the PVM writes data into its image, the COLO Disk Manager
       captures this data and sends it to the COLO Disk Manager,
       which makes sure the content of the SVM's image is consistent
       with the content of the PVM's image.


I wonder if there is any way to coordinate this between COLO, Michael
Hines' microcheckpointing and the two separate reverse-execution
projects that also need to do some similar things.
Are there any standard APIs for the heartbeat thing we can already
tie into?


Sadly, we have checked MC, and it does not have heartbeat support for now.




   - COLO Agent("Proxy module" in the arch picture)
   We need an agent to compare the packets returned by
 Primary VM and Secondary VM, and decide whether to start a
 checkpoint according to some rules. It is a linux kernel
 module for host.


Why is that a kernel module, and how does it communicate the state
to the QEMU instance?


The reason we made this a kernel module is to gain better performance.
We can easily hook the packets in a kernel module.
QEMU instance uses ioctl() to communicate with the COLO Agent.




   - Other minor modifications
   We may need other modifications for better performance.


Dave
P.S. I'm starting to look at fault-tolerance stuff, but haven't
got very far yet, so starting to try and understand the details
of COLO, microcheckpointing, etc


--
Thanks,
Yang.



--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
.



--
Thanks,
Yang.



Re: [Qemu-devel] [PATCH v5] ppc: spapr-rtas - implement os-term rtas call

2014-07-02 Thread Alexey Kardashevskiy
On 06/30/2014 06:35 PM, Nikunj A Dadhania wrote:
> PAPR compliant guest calls this in absence of kdump. This finally
> reaches the guest and can be handled according to the policies set by
> higher level tools(like taking dump) for further analysis by tools like
> crash.
> 
> Linux kernel calls ibm,os-term when extended property of os-term is set.
> This makes sure that a return to the linux kernel is guaranteed.
> 
> CC: Benjamin Herrenschmidt 
> CC: Anton Blanchard 
> CC: Alexander Graf 
> CC: Tyrel Datwyler 
> Signed-off-by: Nikunj A Dadhania 
> 
> ---
> 
> v2: rebase to ppcnext
> v3: Do not stop the VM, and update comments
> v4: update spapr_register_rtas and qapi_event changes
> v5: set ibm,extended-os-term as null encoded property
> ---
>  hw/ppc/spapr.c |  9 +
>  hw/ppc/spapr_rtas.c| 15 +++
>  include/hw/ppc/spapr.h |  1 -
>  3 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 307c58d..e6c9014 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -520,6 +520,15 @@ static void *spapr_create_fdt_skel(hwaddr initrd_base,
>  
>  _FDT((fdt_property_cell(fdt, "rtas-error-log-max", RTAS_ERROR_LOG_MAX)));
>  
> +/*
> + * According to PAPR, rtas ibm,os-term, does not guarantee a return
> + * back to the guest cpu.
> + *
> + * While an additional ibm,extended-os-term property indicates that
> + * rtas call return will always occur. Set this property.
> + */
> +_FDT((fdt_property(fdt, "ibm,extended-os-term", NULL, 0)));
> +
>  _FDT((fdt_end_node(fdt)));
>  
>  /* interrupt controller */
> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
> index 9ba1ba6..2ec2a8e 100644
> --- a/hw/ppc/spapr_rtas.c
> +++ b/hw/ppc/spapr_rtas.c
> @@ -277,6 +277,19 @@ static void rtas_ibm_set_system_parameter(PowerPCCPU 
> *cpu,
>  rtas_st(rets, 0, ret);
>  }
>  
> +static void rtas_ibm_os_term(PowerPCCPU *cpu,
> +sPAPREnvironment *spapr,
> +uint32_t token, uint32_t nargs,
> +target_ulong args,
> +uint32_t nret, target_ulong rets)
> +{
> +target_ulong ret = 0;
> +
> +qapi_event_send_guest_panicked(GUEST_PANIC_ACTION_PAUSE, &error_abort);
> +
> +rtas_st(rets, 0, ret);
> +}
> +
>  static struct rtas_call {
>  const char *name;
>  spapr_rtas_fn fn;
> @@ -404,6 +417,8 @@ static void core_rtas_register_types(void)
>  spapr_rtas_register(RTAS_IBM_SET_SYSTEM_PARAMETER,
>  "ibm,set-system-parameter",
>  rtas_ibm_set_system_parameter);
> +spapr_rtas_register(RTAS_IBM_OS_TERM, "ibm,os-term",
> +rtas_ibm_os_term);
>  }
>  
>  type_init(core_rtas_register_types)
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index bbba51a..4e96381 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -382,7 +382,6 @@ int spapr_allocate_irq_block(int num, bool lsi, bool msi);
>  #define RTAS_GET_SENSOR_STATE   (RTAS_TOKEN_BASE + 0x1D)
>  #define RTAS_IBM_CONFIGURE_CONNECTOR(RTAS_TOKEN_BASE + 0x1E)
>  #define RTAS_IBM_OS_TERM(RTAS_TOKEN_BASE + 0x1F)
> -#define RTAS_IBM_EXTENDED_OS_TERM   (RTAS_TOKEN_BASE + 0x20)


So we are never ever going to implement this RTAS call?
I'd keep the number.


>  
>  #define RTAS_TOKEN_MAX  (RTAS_TOKEN_BASE + 0x21)
>  
> 


-- 
Alexey



[Qemu-devel] [PATCH v3 1/6] spapr: Move DT memory node rendering to a helper

2014-07-02 Thread Alexey Kardashevskiy
This moves recurring bits of code related to memory@xxx nodes
creation to a helper.

This makes use of the new helper for node@0.

Signed-off-by: Alexey Kardashevskiy 
---
 hw/ppc/spapr.c | 48 
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 506d4fc..a5ffcba 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -605,6 +605,31 @@ int spapr_h_cas_compose_response(target_ulong addr, 
target_ulong size)
 return 0;
 }
 
+static void spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
+   hwaddr size)
+{
+uint32_t associativity[] = {
+cpu_to_be32(0x4), /* length */
+cpu_to_be32(0x0), cpu_to_be32(0x0),
+cpu_to_be32(nodeid), cpu_to_be32(nodeid)
+};
+char mem_name[32];
+uint64_t mem_reg_property[2];
+int off;
+
+mem_reg_property[0] = cpu_to_be64(start);
+mem_reg_property[1] = cpu_to_be64(size);
+
+sprintf(mem_name, "memory@" TARGET_FMT_lx, start);
+off = fdt_add_subnode(fdt, 0, mem_name);
+_FDT(off);
+_FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
+_FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
+  sizeof(mem_reg_property))));
+_FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
+  sizeof(associativity))));
+}
+
 static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
 {
 uint32_t associativity[] = {cpu_to_be32(0x4), cpu_to_be32(0x0),
@@ -623,29 +648,12 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, 
void *fdt)
 }
 
 /* RMA */
-mem_reg_property[0] = 0;
-mem_reg_property[1] = cpu_to_be64(spapr->rma_size);
-off = fdt_add_subnode(fdt, 0, "memory@0");
-_FDT(off);
-_FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
-_FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
-  sizeof(mem_reg_property))));
-_FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
-  sizeof(associativity))));
+spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
 
 /* RAM: Node 0 */
 if (node0_size > spapr->rma_size) {
-mem_reg_property[0] = cpu_to_be64(spapr->rma_size);
-mem_reg_property[1] = cpu_to_be64(node0_size - spapr->rma_size);
-
-sprintf(mem_name, "memory@" TARGET_FMT_lx, spapr->rma_size);
-off = fdt_add_subnode(fdt, 0, mem_name);
-_FDT(off);
-_FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
-_FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
-  sizeof(mem_reg_property))));
-_FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
-  sizeof(associativity))));
+spapr_populate_memory_node(fdt, 0, spapr->rma_size,
+   node0_size - spapr->rma_size);
 }
 
 /* RAM: Node 1 and beyond */
-- 
2.0.0




[Qemu-devel] [PATCH v3 5/6] spapr: Add a helper for node0_size calculation

2014-07-02 Thread Alexey Kardashevskiy
In multiple places there is a node0_size variable calculation
which assumes that NUMA node #0 and memory node #0 are the same
thing, which they are not. Since we are going to change it and
do not want to change it in multiple places, let's make a helper.

This adds a spapr_node0_size() helper and makes use of it.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* fixed bug when QEMU is started with RAM size
bigger than the only NUMA node like this:
 -m 8192 -smp 4 -numa node,nodeid=0,cpus=0-3,mem=4096

v2:
* removed duplicated "return ram_size" from spapr_node0_size()
---
 hw/ppc/spapr.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 680d7f9..c71ce1f 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -282,6 +282,19 @@ static size_t create_page_sizes_prop(CPUPPCState *env, 
uint32_t *prop,
 return (p - prop) * sizeof(uint32_t);
 }
 
+static hwaddr spapr_node0_size(void)
+{
+if (nb_numa_nodes) {
+int i;
+for (i = 0; i < nb_numa_nodes; ++i) {
+if (numa_info[i].node_mem) {
+return MIN(pow2floor(numa_info[i].node_mem), ram_size);
+}
+}
+}
+return ram_size;
+}
+
 #define _FDT(exp) \
 do { \
 int ret = (exp);   \
@@ -803,9 +816,8 @@ static void spapr_reset_htab(sPAPREnvironment *spapr)
 
 /* Update the RMA size if necessary */
 if (spapr->vrma_adjust) {
-hwaddr node0_size = (nb_numa_nodes > 1) ?
-numa_info[0].node_mem : ram_size;
-spapr->rma_size = kvmppc_rma_size(node0_size, spapr->htab_shift);
+spapr->rma_size = kvmppc_rma_size(spapr_node0_size(),
+  spapr->htab_shift);
 }
 }
 
@@ -1236,7 +1248,7 @@ static void ppc_spapr_init(MachineState *machine)
 MemoryRegion *sysmem = get_system_memory();
 MemoryRegion *ram = g_new(MemoryRegion, 1);
 hwaddr rma_alloc_size;
-hwaddr node0_size = (nb_numa_nodes > 1) ? numa_info[0].node_mem : ram_size;
+hwaddr node0_size = spapr_node0_size();
 uint32_t initrd_base = 0;
 long kernel_size = 0, initrd_size = 0;
 long load_limit, rtas_limit, fw_size;
-- 
2.0.0
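
For readers who want to try the rule outside QEMU, here is a standalone
sketch of what the helper computes: the first NUMA node that has memory
decides node0's size, rounded down to a power of two and capped at the
total RAM size. The names node0_size_model() and pow2floor_model() are
hypothetical stand-ins for spapr_node0_size() and QEMU's pow2floor(),
and the node sizes are passed in as a plain array instead of numa_info[].

```c
#include <stddef.h>
#include <stdint.h>

/* Largest power of two that is <= v; returns 0 for v == 0. */
static uint64_t pow2floor_model(uint64_t v)
{
    uint64_t p = 1;

    if (!v) {
        return 0;
    }
    while (p <= v / 2) {
        p <<= 1;
    }
    return p;
}

/* Model of the node0 size rule: use the first node with memory,
 * otherwise fall back to the whole RAM size. */
static uint64_t node0_size_model(const uint64_t *node_mem, int nb_nodes,
                                 uint64_t ram_size)
{
    int i;

    for (i = 0; i < nb_nodes; ++i) {
        if (node_mem[i]) {
            uint64_t sz = pow2floor_model(node_mem[i]);
            return sz < ram_size ? sz : ram_size;
        }
    }
    return ram_size;
}
```

With a single 4096 MB node and 8192 MB of RAM (the v3 bug case from the
changelog) this yields 4096; with no NUMA nodes it yields the RAM size.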




[Qemu-devel] [PATCH v3 2/6] spapr: Use DT memory node rendering helper for other nodes

2014-07-02 Thread Alexey Kardashevskiy
This finishes refactoring by using the spapr_populate_memory_node helper
for all nodes and removing leftovers from spapr_populate_memory().

This is not a part of the previous patch because the patches look
nicer apart.

Signed-off-by: Alexey Kardashevskiy 
---
 hw/ppc/spapr.c | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index a5ffcba..832bfcf 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -632,13 +632,8 @@ static void spapr_populate_memory_node(void *fdt, int 
nodeid, hwaddr start,
 
 static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
 {
-uint32_t associativity[] = {cpu_to_be32(0x4), cpu_to_be32(0x0),
-cpu_to_be32(0x0), cpu_to_be32(0x0),
-cpu_to_be32(0x0)};
-char mem_name[32];
 hwaddr node0_size, mem_start, node_size;
-uint64_t mem_reg_property[2];
-int i, off;
+int i;
 
 /* memory node(s) */
 if (nb_numa_nodes > 1 && numa_info[0].node_mem < ram_size) {
@@ -659,7 +654,6 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, 
void *fdt)
 /* RAM: Node 1 and beyond */
 mem_start = node0_size;
 for (i = 1; i < nb_numa_nodes; i++) {
-mem_reg_property[0] = cpu_to_be64(mem_start);
 if (mem_start >= ram_size) {
 node_size = 0;
 } else {
@@ -668,16 +662,7 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, 
void *fdt)
 node_size = ram_size - mem_start;
 }
 }
-mem_reg_property[1] = cpu_to_be64(node_size);
-associativity[3] = associativity[4] = cpu_to_be32(i);
-sprintf(mem_name, "memory@" TARGET_FMT_lx, mem_start);
-off = fdt_add_subnode(fdt, 0, mem_name);
-_FDT(off);
-_FDT((fdt_setprop_string(fdt, off, "device_type", "memory")));
-_FDT((fdt_setprop(fdt, off, "reg", mem_reg_property,
-  sizeof(mem_reg_property))));
-_FDT((fdt_setprop(fdt, off, "ibm,associativity", associativity,
-  sizeof(associativity))));
+spapr_populate_memory_node(fdt, i, mem_start, node_size);
 mem_start += node_size;
 }
 
-- 
2.0.0




[Qemu-devel] [PATCH v3 6/6] spapr: Fix ibm,associativity for memory nodes

2014-07-02 Thread Alexey Kardashevskiy
We want the associativity lists of memory and CPU nodes to match, but
memory nodes have an incorrect domain#3, which is zero for CPUs, so they
won't match.

This clears domain#3 in the list to match the CPUs' associativity lists.

Signed-off-by: Alexey Kardashevskiy 
---
 hw/ppc/spapr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index c71ce1f..166b2fc 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -624,7 +624,7 @@ static void spapr_populate_memory_node(void *fdt, int 
nodeid, hwaddr start,
 uint32_t associativity[] = {
 cpu_to_be32(0x4), /* length */
 cpu_to_be32(0x0), cpu_to_be32(0x0),
-cpu_to_be32(nodeid), cpu_to_be32(nodeid)
+cpu_to_be32(0x0), cpu_to_be32(nodeid)
 };
 char mem_name[32];
 uint64_t mem_reg_property[2];
-- 
2.0.0
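
As an illustration of the resulting property layout, the sketch here
builds the five big-endian cells of a memory node's ibm,associativity
value as it looks after this patch: the length cell (4), three zero
domains, then the node id. The helper names put_be32() and
build_mem_associativity() are hypothetical; the bytes are written
explicitly so the result does not depend on host byte order, which is
the job cpu_to_be32() does in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define ASSOC_CELLS 5   /* length cell + 4 domain cells */

/* Store v as four big-endian bytes. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Fill out[] with { 4, 0, 0, 0, nodeid } as big-endian 32-bit cells. */
static void build_mem_associativity(uint32_t nodeid,
                                    uint8_t out[ASSOC_CELLS * 4])
{
    const uint32_t cells[ASSOC_CELLS] = { 4, 0, 0, 0, nodeid };
    size_t i;

    for (i = 0; i < ASSOC_CELLS; i++) {
        put_be32(out + 4 * i, cells[i]);
    }
}
```

Before the patch the fourth cell also carried nodeid, so for any node
other than 0 it was non-zero.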




[Qemu-devel] [PATCH v3 4/6] spapr: Split memory nodes to power-of-two blocks

2014-07-02 Thread Alexey Kardashevskiy
Linux kernel expects nodes to have power-of-two size and
does WARN_ON if this is not the case:
[0.041456] WARNING: at drivers/base/memory.c:115
which is:
===
/* Validate blk_sz is a power of 2 and not less than section size */
if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE)) {
WARN_ON(1);
block_sz = MIN_MEMORY_BLOCK_SIZE;
}
===

This splits memory nodes into a set of smaller blocks with
a size which is a power of two. This makes sure the start
address of every node is aligned to the node size.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* s/ffs/ffsl/ as addresses are 64bit long

v2:
* tiny code cleanup in "sizetmp = MIN(sizetmp, 1 << (ffs(mem_start) - 1))"
* updated commit log with a piece of kernel code doing WARN_ON
---
 hw/ppc/spapr.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index ec6d541..680d7f9 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -662,8 +662,18 @@ static int spapr_populate_memory(sPAPREnvironment *spapr, 
void *fdt)
 mem_start += spapr->rma_size;
 node_size -= spapr->rma_size;
 }
-spapr_populate_memory_node(fdt, i, mem_start, node_size);
-mem_start += node_size;
+for ( ; node_size; ) {
+hwaddr sizetmp = pow2floor(node_size);
+
+/* mem_start != 0 here */
+if (ffsl(mem_start) < ffsl(sizetmp)) {
+sizetmp = 1ULL << (ffsl(mem_start) - 1);
+}
+
+spapr_populate_memory_node(fdt, i, mem_start, sizetmp);
+node_size -= sizetmp;
+mem_start += sizetmp;
+}
 }
 
 return 0;
-- 
2.0.0
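
The splitting rule can be tried in isolation with the sketch here. It
is a standalone model, not the QEMU code: split_pow2() and
pow2floor_model() are hypothetical names, the blocks are written to
caller-provided arrays instead of being passed to
spapr_populate_memory_node(), and the alignment of the start address is
taken from its lowest set bit rather than from ffsl(), which avoids
depending on the width of long. Unlike the loop in the patch, the model
also accepts a start address of 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest power of two that is <= v; returns 0 for v == 0. */
static uint64_t pow2floor_model(uint64_t v)
{
    uint64_t p = 1;

    if (!v) {
        return 0;
    }
    while (p <= v / 2) {
        p <<= 1;
    }
    return p;
}

/* Lowest set bit of v, i.e. the largest power of two dividing v. */
static uint64_t lowest_bit(uint64_t v)
{
    return v & (~v + 1);
}

/* Split [start, start + size) into power-of-two blocks, each aligned
 * to its own size. Returns the number of blocks written (at most max). */
static size_t split_pow2(uint64_t start, uint64_t size,
                         uint64_t *starts, uint64_t *sizes, size_t max)
{
    size_t n = 0;

    while (size && n < max) {
        uint64_t blk = pow2floor_model(size);

        /* Shrink the block if the start address is less aligned. */
        if (start && lowest_bit(start) < blk) {
            blk = lowest_bit(start);
        }
        starts[n] = start;
        sizes[n] = blk;
        n++;
        start += blk;
        size -= blk;
    }
    return n;
}
```

For a 1280 MB node starting at 768 MB this produces a 256 MB block at
768 MB followed by a 1 GB block at 1 GB.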




[Qemu-devel] [PATCH v3 0/6] spapr: rework memory nodes

2014-07-02 Thread Alexey Kardashevskiy

c4177479 "spapr: make sure RMA is in first mode of first memory node"
introduced a regression which prevents running guests with a memoryless
NUMA node#0, which may happen on real POWER8 boxes and which it would make
sense to debug in QEMU.

This patchset aims to fix that and also to fix various code problems in
memory node generation.

These 2 patches could be merged (the resulting patch looks rather ugly):
spapr: Use DT memory node rendering helper for other nodes
spapr: Move DT memory node rendering to a helper


Alex, there are "numa: enable sparse node numbering ..." patches from Nish.
Which set should go first, so that the other can be rebased on top of it? Thanks!



Changes:
v3:
* fixed bug with ram_size bigger than the only NUMA node
* fixed bug with 64bit addresses in memory node creation loop

v2:
* minor cosmetic change in spapr_node0_size()
* spapr_populate_memory() fixed to work in a no-numa config
* patch changing max numa nodes is removed

Please comment. Thanks!




Alexey Kardashevskiy (6):
  spapr: Move DT memory node rendering to a helper
  spapr: Use DT memory node rendering helper for other nodes
  spapr: Refactor spapr_populate_memory() to allow memoryless nodes
  spapr: Split memory nodes to power-of-two blocks
  spapr: Add a helper for node0_size calculation
  spapr: Fix ibm,associativity for memory nodes

 hw/ppc/spapr.c | 111 -
 1 file changed, 63 insertions(+), 48 deletions(-)

-- 
2.0.0




[Qemu-devel] [PATCH v3 3/6] spapr: Refactor spapr_populate_memory() to allow memoryless nodes

2014-07-02 Thread Alexey Kardashevskiy
Current QEMU does not support memoryless NUMA nodes, however
actual hardware may have them so it makes sense to have a way
to emulate them in QEMU. This prepares SPAPR for that.

This moves 2 calls of spapr_populate_memory_node() into
the existing loop over numa nodes so first several nodes may
have no memory and this still will work.

If there is no numa configuration, the code assumes there is just
a single node at 0 and it has all the guest memory.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* rebased on top of new NodeInfo code

v2:
* fixed spapr_populate_memory() to work in no-numa config
---
 hw/ppc/spapr.c | 40 
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 832bfcf..ec6d541 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -632,36 +632,36 @@ static void spapr_populate_memory_node(void *fdt, int 
nodeid, hwaddr start,
 
 static int spapr_populate_memory(sPAPREnvironment *spapr, void *fdt)
 {
-    hwaddr node0_size, mem_start, node_size;
-    int i;
+    hwaddr mem_start, node_size;
+    int i, nb_nodes = nb_numa_nodes;
+    NodeInfo *nodes = numa_info;
+    NodeInfo ramnode;
 
-    /* memory node(s) */
-    if (nb_numa_nodes > 1 && numa_info[0].node_mem < ram_size) {
-        node0_size = numa_info[0].node_mem;
-    } else {
-        node0_size = ram_size;
+    /* No NUMA nodes, assume there is just one node with whole RAM */
+    if (!nb_numa_nodes) {
+        nb_nodes = 1;
+        ramnode.node_mem = ram_size;
+        nodes = &ramnode;
     }
 
-    /* RMA */
-    spapr_populate_memory_node(fdt, 0, 0, spapr->rma_size);
-
-    /* RAM: Node 0 */
-    if (node0_size > spapr->rma_size) {
-        spapr_populate_memory_node(fdt, 0, spapr->rma_size,
-                                   node0_size - spapr->rma_size);
-    }
-
-    /* RAM: Node 1 and beyond */
-    mem_start = node0_size;
-    for (i = 1; i < nb_numa_nodes; i++) {
+    for (i = 0, mem_start = 0; i < nb_nodes; ++i) {
+        if (!nodes[i].node_mem) {
+            continue;
+        }
         if (mem_start >= ram_size) {
             node_size = 0;
         } else {
-            node_size = numa_info[i].node_mem;
+            node_size = nodes[i].node_mem;
             if (node_size > ram_size - mem_start) {
                 node_size = ram_size - mem_start;
             }
         }
+        if (!mem_start) {
+            /* ppc_spapr_init() checks for rma_size <= node0_size already */
+            spapr_populate_memory_node(fdt, i, 0, spapr->rma_size);
+            mem_start += spapr->rma_size;
+            node_size -= spapr->rma_size;
+        }
         spapr_populate_memory_node(fdt, i, mem_start, node_size);
         mem_start += node_size;
     }
-- 
2.0.0
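
For readers who want to try the new loop outside of QEMU, here is a
minimal standalone sketch of the layout logic in the patch above. It is
only a model: struct mem_region and layout_memory() are hypothetical
names that do not exist in hw/ppc/spapr.c, and instead of writing device
tree nodes the sketch just records (nodeid, start, size) triples in the
order the patched loop would call spapr_populate_memory_node().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one rendered memory node (not a QEMU type). */
struct mem_region {
    int nodeid;
    uint64_t start;
    uint64_t size;
};

/*
 * Model of the refactored loop: skip memoryless nodes, clamp each node
 * to the RAM that is still unassigned, and carve the RMA out of the
 * first node that has memory. Returns the number of regions recorded.
 * With nb_nodes == 0 a single node holding all of RAM is assumed.
 */
static size_t layout_memory(const uint64_t *node_mem, int nb_nodes,
                            uint64_t ram_size, uint64_t rma_size,
                            struct mem_region *out, size_t max_out)
{
    uint64_t whole_ram = ram_size;
    uint64_t mem_start = 0;
    size_t n = 0;

    if (nb_nodes == 0) {
        node_mem = &whole_ram;
        nb_nodes = 1;
    }

    for (int i = 0; i < nb_nodes; i++) {
        uint64_t node_size;

        if (node_mem[i] == 0) {
            continue;                   /* memoryless node */
        }
        if (mem_start >= ram_size) {
            node_size = 0;
        } else {
            node_size = node_mem[i];
            if (node_size > ram_size - mem_start) {
                node_size = ram_size - mem_start;
            }
        }
        if (mem_start == 0) {
            /* The patch relies on an earlier rma_size <= node0_size check. */
            assert(rma_size <= node_size);
            assert(n < max_out);
            out[n++] = (struct mem_region){ i, 0, rma_size };
            mem_start += rma_size;
            node_size -= rma_size;
        }
        assert(n < max_out);
        out[n++] = (struct mem_region){ i, mem_start, node_size };
        mem_start += node_size;
    }
    return n;
}
```

Calling it with node sizes {0, 1024, 2048}, ram_size 3072 and rma_size 256
leaves node 0 empty and puts the RMA at the start of node 1, which is the
memoryless-first-node case the commit message describes.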




Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Chen, Tiejun


Is that about correct?

What are folks timezones and the best days next week to talk about
this on either Google Hangout or the phone?


UK timezone. Maybe Friday afternoon so that afterwards we can go have
enough beers to forget about all this.



Is this determined formally?

I mean, I can ask whether someone at Intel can attend this talk; they may
then be able to answer your questions about GFX, the Linux native driver,
and the Windows driver.


Tiejun



Re: [Qemu-devel] [Xen-devel] [RFC PATCH V4 1/2] xen: pass kernel initrd to qemu

2014-07-02 Thread Chun Yan Liu


>>> On 7/2/2014 at 11:16 PM, in message
<1404314181.8137.7.ca...@kazak.uk.xensource.com>, Ian Campbell
 wrote: 
> On Tue, 2014-07-01 at 15:06 +0800, Chunyan Liu wrote: 
> > xen side patch to support xen HVM direct kernel boot: 
> > support 'kernel', 'ramdisk', 'cmdline' (and 'root', 'extra' as well 
> > which would be deprecated later) in HVM config file, parse config file, 
> > pass -kernel, -initrd, -append parameters to qemu. 
> >  
> > It's working with qemu-xen when using the default BIOS (seabios). 
> >  
> > [config example] 
> > kernel="/mnt/vmlinuz-3.0.13-0.27-default" 
> > ramdisk="/mnt/initrd-3.0.13-0.27-default" 
> > root="/dev/hda2" 
> > extra="console=tty0 console=ttyS0" 
>  
> Is this example complete? I think it will create a PV guest which isn't 
> what you are trying to demonstrate. 

This is an HVM example.
PV guests already supported direct kernel boot before this patch, so they
are used as before, with no changes to the config file.
I'll update the examples after re-adding the 'cmdline' parameter.

Thank you.
-Chunyan

>  
> Ian. 
>  
>  
> ___ 
> Xen-devel mailing list 
> xen-de...@lists.xen.org 
> http://lists.xen.org/xen-devel 
>  
>  





Re: [Qemu-devel] [PATCH] ide: fix double free

2014-07-02 Thread ChenLiang
On 2014/7/2 20:19, Paolo Bonzini wrote:

> On 02/07/2014 13:57, ChenLiang wrote:
>>>> Hmm, dbs->in_cancel will always be true. Although this avoids freeing
>>>> dbs in dma_complete, it may be a mistake.
>>>
>>> This was on purpose; I'm doing the free myself in dma_aio_cancel, so I 
>>> wanted to avoid the qemu_aio_release from dma_complete.  This was in case 
>>> of a recursive call to dma_complete.  But I don't see how that recursive 
>>> call could happen outside the "if (dbs->acb)"; and inside the "if" the 
>>> protection is there already.
>>>
>>> Can you gather the backtraces for _both_ calls to qemu_aio_release, rather 
>>> than just the second?
>>
>> (gdb) bt
>> #0  qemu_aio_release (p=0x7f44788d1290) at block.c:4260
>> #1  0x7f4477494e5e in dma_complete (dbs=0x7f44788d1290, ret=0) at 
>> dma-helpers.c:135
>> #2  0x7f44774952c2 in dma_aio_cancel (acb=0x7f44788d1290) at 
>> dma-helpers.c:195
>> #3  0x7f447744825b in bdrv_aio_cancel (acb=0x7f44788d1290) at 
>> block.c:3848
>> #4  0x7f4477513911 in ide_bus_reset (bus=0x7f44785f1bd8) at 
>> hw/ide/core.c:1957
>> #5  0x7f4477516b3c in piix3_reset (opaque=0x7f44785f1530) at 
>> hw/ide/piix.c:113
>> #6  0x7f4477647b9f in qemu_devices_reset () at vl.c:2131
>> #7  0x7f4477647c0f in qemu_system_reset (report=true) at vl.c:2140
>> #8  0x7f4477648127 in main_loop_should_exit () at vl.c:2274
>> #9  0x7f447764823a in main_loop () at vl.c:2323
>> #10 0x7f447764f6da in main (argc=57, argv=0x7fff5d194378, 
>> envp=0x7fff5d194548) at vl.c:4803
> 
> And the second is
> 
> #7  0x7f3cb525de5e in dma_complete (dbs=0x7f3cb63f3220, ret=0) at 
> dma-helpers.c:135
> #8  0x7f3cb525df3d in dma_bdrv_cb (opaque=0x7f3cb63f3220, ret=0) at 
> dma-helpers.c:152
> #9  0x7f3cb5212102 in bdrv_co_em_bh (opaque=0x7f3cb6398980) at 
> block.c:4127
> #10 0x7f3cb51f6cef in aio_bh_poll (ctx=0x7f3cb622a8f0) at async.c:70
> #11 0x7f3cb51f695a in aio_poll (ctx=0x7f3cb622a8f0, blocking=false) at 
> aio-posix.c:185
> #12 0x7f3cb51f7056 in aio_ctx_dispatch (source=0x7f3cb622a8f0, 
> callback=0x0, user_data=0x0)
> at async.c:167
> #13 0x7f3cb48b969a in g_main_context_dispatch () from 
> /usr/lib64/libglib-2.0.so.0
> 
> This explains why my patch "fixes" the bug.  It turns a double free
> into a dangling pointer: the second call now sees in_cancel == true
> and skips the free.
> 
> The second call should have happened within dma_aio_cancel's call to
> bdrv_aio_cancel.  This is the real bug.
> 
> What is your version of QEMU?  I cannot see any where bdrv_co_em_bh is
> at line 4127 or bdrv_aio_cancel is at line 3848.  Can you reproduce it
> with qemu.git master?
> 
> Paolo
> 
> .
> 


qemu master branch bt:

Program received signal SIGABRT, Aborted.
0x7fd548355b55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7fd548355b55 in raise () from /lib64/libc.so.6
#1  0x7fd548357131 in abort () from /lib64/libc.so.6
#2  0x7fd548393e0f in __libc_message () from /lib64/libc.so.6
#3  0x7fd548399618 in malloc_printerr () from /lib64/libc.so.6
#4  0x7fd54b15e80e in free_and_trace (mem=0x7fd54beb2230) at vl.c:2815
#5  0x7fd54b3453cd in qemu_aio_release (p=0x7fd54beb2230) at block.c:4813
#6  0x7fd54b15717d in dma_complete (dbs=0x7fd54beb2230, ret=0) at 
dma-helpers.c:132
#7  0x7fd54b157253 in dma_bdrv_cb (opaque=0x7fd54beb2230, ret=0) at 
dma-helpers.c:148
#8  0x7fd54b344db8 in bdrv_co_em_bh (opaque=0x7fd54bea4b30) at block.c:4676
#9  0x7fd54b335a72 in aio_bh_poll (ctx=0x7fd54bcec990) at async.c:81
#10 0x7fd54b34b1b4 in aio_poll (ctx=0x7fd54bcec990, blocking=false) at 
aio-posix.c:188
#11 0x7fd54b335ee0 in aio_ctx_dispatch (source=0x7fd54bcec990, 
callback=0x0, user_data=0x0) at async.c:211
#12 0x7fd549e3669a in g_main_context_dispatch () from 
/usr/lib64/libglib-2.0.so.0
#13 0x7fd54b348c45 in glib_pollfds_poll () at main-loop.c:190
#14 0x7fd54b348d3d in os_host_main_loop_wait (timeout=0) at main-loop.c:235
#15 0x7fd54b348e2f in main_loop_wait (nonblocking=0) at main-loop.c:484
#16 0x7fd54b15b0f8 in main_loop () at vl.c:2007
#17 0x7fd54b162a35 in main (argc=57, argv=0x7fff152720a8, 
envp=0x7fff15272278) at vl.c:4526

(gdb) bt
#0  qemu_aio_release (p=0x7f86420ebec0) at block.c:4811
#1  0x7f86412b617d in dma_complete (dbs=0x7f86420ebec0, ret=0) at 
dma-helpers.c:132
#2  0x7f86412b65ab in dma_aio_cancel (acb=0x7f86420ebec0) at 
dma-helpers.c:192
#3  0x7f86414a3996 in bdrv_aio_cancel (acb=0x7f86420ebec0) at block.c:4559
#4  0x7f86413906af in ide_bus_reset (bus=0x7f8641fe3a20) at 
hw/ide/core.c:2056
#5  0x7f86413967d6 in piix3_reset (opaque=0x7f8641fe32a0) at 
hw/ide/piix.c:114
#6  0x7f86412b9a37 in qemu_devices_reset () at vl.c:1829
#7  0x7f86412b9aef in qemu_system_reset (report=true) at vl.c:1842
#8  0x7f86412b9fe2 in main_loop_should_exit () at vl.c:1971
#9  0x7f86412ba100 in main_loop () at vl.c:2011
#10 0x7f86412c

Re: [Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Once a userfaultfd is created, MADV_USERFAULT regions talk through
> the userfaultfd protocol with the thread responsible for doing the
> memory externalization of the process.
> 
> The protocol starts by userland writing the requested/preferred
> USERFAULT_PROTOCOL version into the userfault fd (64bit write), if
> kernel knows it, it will ack it by allowing userland to read 64bit
> from the userfault fd that will contain the same 64bit
> USERFAULT_PROTOCOL version that userland asked. Otherwise userfault
> will read __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
> will have to try again by writing an older protocol version if
> suitable for its usage too, and read it back again until it stops
> reading -1ULL. After that the userfaultfd protocol starts.
> 
> The protocol consists of 64-bit reads from the userfault fd that
> provide userland with the fault addresses. After a userfault address has
> been read and the fault is resolved by userland, the application must
> write back 128bits in the form of [ start, end ] range (64bit each)
> that will tell the kernel such a range has been mapped. Multiple read
> userfaults can be resolved in a single range write. poll() can be used
> to know when there are new userfaults to read (POLLIN) and when there
> are threads waiting a wakeup through a range write (POLLOUT).
> 
> Signed-off-by: Andrea Arcangeli 
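
To make the version handshake described above concrete, here is a small
simulation of it. This is only a sketch under stated assumptions: it does
not open a real userfaultfd, struct mock_ufd and all function names are
hypothetical, and the mock "kernel" is assumed to know every version from
1 up to max_known. The only behaviour taken from the description is that
an unknown version reads back as -1ULL and userland then retries with an
older version.

```c
#include <assert.h>
#include <stdint.h>

/* Stands in for the -1ULL "unknown protocol" reply described above. */
#define UNKNOWN_PROTOCOL UINT64_MAX

/* Hypothetical mock of the kernel side of the fd. */
struct mock_ufd {
    uint64_t max_known;   /* assumption: versions 1..max_known are known */
    uint64_t reply;       /* value the next 64-bit read will return */
};

/* Models the 64-bit write of the requested protocol version. */
static void mock_write_version(struct mock_ufd *u, uint64_t requested)
{
    if (requested >= 1 && requested <= u->max_known) {
        u->reply = requested;          /* ack: echo the same version */
    } else {
        u->reply = UNKNOWN_PROTOCOL;
    }
}

/* Models the 64-bit read that follows the write. */
static uint64_t mock_read_version(const struct mock_ufd *u)
{
    return u->reply;
}

/*
 * Userland side: ask for the preferred version first, then fall back to
 * older ones until the read stops returning UNKNOWN_PROTOCOL or no
 * usable version is left.
 */
static uint64_t negotiate(struct mock_ufd *u, uint64_t preferred,
                          uint64_t oldest_usable)
{
    for (uint64_t v = preferred; v >= oldest_usable && v >= 1; v--) {
        mock_write_version(u, v);
        uint64_t got = mock_read_version(u);
        if (got != UNKNOWN_PROTOCOL) {
            return got;
        }
    }
    return UNKNOWN_PROTOCOL;
}
```

With max_known = 2, negotiate(&u, 3, 1) settles on version 2 after one
rejected attempt; if the oldest version userland can use is newer than
anything the mock knows, the result stays UNKNOWN_PROTOCOL.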

> +#ifdef CONFIG_PROC_FS
> +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
> +{
> + struct userfaultfd_ctx *ctx = f->private_data;
> + int ret;
> + wait_queue_t *wq;
> + struct userfaultfd_wait_queue *uwq;
> + unsigned long pending = 0, total = 0;
> +
> + spin_lock(&ctx->fault_wqh.lock);
> + list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
> + uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
> + if (uwq->pending)
> + pending++;
> + total++;
> + }
> + spin_unlock(&ctx->fault_wqh.lock);
> +
> + ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.

> +
> +SYSCALL_DEFINE1(userfaultfd, int, flags)
> +{
> + int fd, error;
> + struct file *file;

This looks like it can't be used more than once in a process.  That will
be unfortunate for libraries.  Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd?  (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)




Re: [Qemu-devel] [PATCH 00/10] RFC: userfault

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
> Hello everyone,
> 
> There's a large CC list for this RFC because this adds two new
> syscalls (userfaultfd and remap_anon_pages) and
> MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
> or on a completely different API if somebody has better ideas are
> welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

> 
> The combination of these features is what I would propose to
> implement postcopy live migration in qemu, and in general demand
> paging of remote memory, hosted in different cloud nodes.
> 
> The MADV_USERFAULT feature should be generic enough that it can
> provide the userfaults to the Android volatile range feature too, on
> access of reclaimed volatile pages.
> 
> If the access could ever happen in kernel context through syscalls
> (not just from userland context), then userfaultfd has to be used
> to make the userfault unnoticeable to the syscall (no error will be
> returned). This latter feature is more advanced than what volatile
> ranges alone could do with SIGBUS so far (but it's optional, if the
> process doesn't call userfaultfd, the regular SIGBUS will fire, if the
> fd is closed SIGBUS will also fire for any blocked userfault that was
> waiting a userfaultfd_write ack).
> 
> userfaultfd is also a generic enough feature, that it allows KVM to
> implement postcopy live migration without having to modify a single
> line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
> other GUP features works just fine in combination with userfaults
> (userfaults trigger async page faults in the guest scheduler so those
> guest processes that aren't waiting for userfaults can keep running in
> the guest vcpus).
> 
> remap_anon_pages is the syscall to use to resolve the userfaults (it's
> not mandatory, vmsplice will likely still be used in the case of local
> postcopy live migration just to upgrade the qemu binary, but
> remap_anon_pages is faster and ideal for transferring memory across
> the network, it's zerocopy and doesn't touch the vma: it only holds
> the mmap_sem for reading).
> 
> The current behavior of remap_anon_pages is very strict to avoid any
> chance of memory corruption going unnoticed. mremap is not strict like
> that: if there's a synchronization bug it would drop the destination
> range silently resulting in subtle memory corruption for
> example. remap_anon_pages would return -EEXIST in that case. If there
> are holes in the source range remap_anon_pages will return -ENOENT.
> 
> If remap_anon_pages is always used with 2M naturally aligned
> addresses, transparent hugepages will not be split. If there could
> be 4k (or any size) holes in the 2M (or any size) source range,
> remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
> relax some of its strict checks (-ENOENT won't be returned if
> RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
> a noop on any hole in the source range). This flag is generally useful
> when implementing userfaults with THP granularity, but it shouldn't be
> set if doing the userfaults with PAGE_SIZE granularity if the
> developer wants to benefit from the strict -ENOENT behavior.
> 
> The remap_anon_pages syscall API is not vectored, as I expect it to be
> used mainly for demand paging (where there can be just one faulting
> range per userfault) or for large ranges (with the THP model as an
> alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
> granularity before starting the guest in the destination node) where
> vectoring isn't going to provide much performance advantages (thanks
> to the THP coarser granularity).
> 
> On the rmap side remap_anon_pages doesn't add much complexity: there's
> no need of nonlinear anon vmas to support it because I added the
> constraint that it will fail if the mapcount is more than 1. So in
> general the source range of remap_anon_pages should be marked
> MADV_DONTFORK to prevent any risk of failure if the process ever
> forks (like qemu can in some case).
> 
> One part that hasn't been tested is the poll() syscall on the
> userfaultfd because the postcopy migration thread currently is more
> efficient waiting on blocking read()s (I'll write some code to test
> poll() too). I also appended below a patch to trinity to exercise
> remap_anon_pages and userfaultfd and it completes trinity
> successfully.
> 
> The code can be found here:
> 
> git clone --reference linux 
> git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
> 
> The branch is rebased so you can get updates for example with:
> 
> git fetch && git checkout -f origin/userfault
> 
> Comments welcome, thanks!
> Andrea
> 
> From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli 
> Date: Wed, 2 Jul 2014 18:32:35 +0200
> Subject: [PATCH] add remap_anon_pages and userfaultfd
> 
> Signed-off-by: Andrea Arcangeli 
> ---
>  include/syscall

Re: [Qemu-devel] from which version qemu support clone on rbd

2014-07-02 Thread yue

Could you tell me why 'Qemu doesn't handle that level of abstraction'?
I know qcow2 well, so you could explain it by comparison with qcow2.
 
thanks







At 2014-07-02 11:36:12, "Brian Jackson"  wrote:
>Qemu doesn't handle that level of abstraction. The closest approximation 
>you could probably come up with is qemu-img's backing file support for 
>qcow2 images.
>
>You should stick to using the rbd tool to create clones of rbd devices. 
>Alternatively, use a higher level tool (like openstack, etc) that 
>supports this.
>
>--Iggy
>
>
>On 7/2/2014 10:17 AM, yue wrote:
>> Hi all,
>> I am looking at qemu 2.0 and cannot find any rbd API usage related to clone.
>> Does qemu support this function, and from which version?
>> The rbd clone API is very simple (a single call), so why does qemu not
>> implement it? What is the reason?
>> Thanks.
>>
>>
>


Re: [Qemu-devel] VM id

2014-07-02 Thread Eric Blake
On 07/02/2014 06:29 PM, Gary Jordan wrote:

[please don't top-post on technical lists]

> If I open two sessions in a migration operation, how does qemu know
> which one should be accepted? I saw there was a ram_list to check, but no
> id of the guest.

I suggest using higher level software, like libvirt, to manage your
migrations.  The recipient qemu must have the same command line setup as
the sending side, and it is up to you (or your management app) to make
sure you are connecting to the correct port.  As I said before, qemu is
only concerned about a single guest, so as soon as you are involving
multiple qemu processes, you need a higher level application to keep
things straight.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] VM id

2014-07-02 Thread Gary Jordan
If I open two sessions in a migration operation, how does qemu know
which one should be accepted? I saw there was a ram_list to check, but no
id of the guest.


2014-07-02 17:59 GMT-04:00 Eric Blake :

> On 07/02/2014 01:20 PM, Gary Jordan wrote:
> > Does Qemu have a VM id allocated for each VM?  I did not find this id in
> > qemu. How does qemu identify each VM, using a thread id or some other
> > identifier?
>
> Each qemu process manages exactly one VM, so qemu doesn't care what id a
> guest has.  Higher-level management software, such as libvirt, has
> notions of a VM name and UUID (both of which can be specified on the
> command line parameters given to qemu, and the UUID can even be
> propagated to the guest, such as by SMBIOS readable by dmidecode in the
> guest), as well as a VM id (in libvirt's case, a sequentially increasing
> number for each VM that libvirt spawns a qemu process for).  But that's
> getting outside the realm of qemu, since qemu doesn't care what name or
> uuid you picked, only whether you have access to the monitor of the qemu
> process.
>
> --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
>


Re: [Qemu-devel] e1000 autoneg timing, piix/osx

2014-07-02 Thread Gabriel L. Somlo
On Wed, Jul 02, 2014 at 05:14:26PM -0400, Gabriel L. Somlo wrote:
> On Wed, Jul 02, 2014 at 11:02:30PM +0200, Alexander Graf wrote:
> > 
> > On 02.07.14 22:49, Gabriel L. Somlo wrote:
> > >So it turns out everything I thought I knew (which was little indeed)
> > >was more or less wrong. The problem, as far as I'm observing it now,
> > >is that on PIIX, the OS X guest obsessively reads the ICR in a tight
> > >loop. It reads the injected LSC (and probably discards it) before
> > >unmasking the corresponding interrupt bit; later on, when it unmasks
> > >LSC, giving the emulated e1000 hardware a chance to raise the irq
> > >line, the actual LSC event has been flushed from the ICR, and the
> > >driver does not detect the link coming up.
> > >
> > > [...]
> > >
> > >Any clue as to why ICR gets read like that on PIIX, but not Q35 ?
> > 
> > Either way, why does the bit get cleared even though it hasn't been raised?
> > What does real hardware do with interrupts that have been masked?
> 
> The e1000 manual says ICR bits are cleared on read. It also says PCI
> interrupts are only generated if the corresponding bit in *both* ICR
> and IMS registers is 1. ICR bits are still cleared if read, even if
> masked and no actual interrupt is raised.
> 
> > Maybe it's using MSI on q35? :)
> > Maybe we also share the same IRQ line with another device on PIIX that gets
> > polled all the time? IDE maybe?
> 
> Even if that were the case, how come it's reading precisely our
> device's ICR register ? Isn't that too much of a coincidence ? 

Unless some other device on a shared IRQ line generated the interrupt,
and our e1000 driver then checked ICR. Although if that happened, our
driver should happily acknowledge the link-up event (I checked, and it
does NOT read IMS, so it has no way of knowing it's masked). Except it
does -- it knows the last thing it did was set_imc(0x), and
that it hasn't called set_ims(anything) yet... 

And I think you're right:

(qemu) info pci
  Bus  0, device   0, function 0:
Host bridge: PCI device 8086:1237
  id ""
  Bus  0, device   1, function 0:
ISA bridge: PCI device 8086:7000
  id ""
  Bus  0, device   1, function 1:
IDE controller: PCI device 8086:7010
  BAR4: I/O at 0xc080 [0xc08f].
  id ""
  Bus  0, device   1, function 2:
USB controller: PCI device 8086:7020
  IRQ 11.
  BAR4: I/O at 0xc040 [0xc05f].
  id ""
  Bus  0, device   1, function 3:
Bridge: PCI device 8086:7113
  IRQ 9.
  id ""
  Bus  0, device   2, function 0:
VGA controller: PCI device 1013:00b8
  BAR0: 32 bit prefetchable memory at 0xfc00 [0xfdff].
  BAR1: 32 bit memory at 0xfebf [0xfebf0fff].
  BAR6: 32 bit memory at 0x [0xfffe].
  id ""
  Bus  0, device   3, function 0:
SATA controller: PCI device 8086:2922
  IRQ 11.
  BAR4: I/O at 0xc060 [0xc07f].
  BAR5: 32 bit memory at 0xfebf1000 [0xfebf1fff].
  id "ide"
  Bus  0, device   4, function 0:
Ethernet controller: PCI device 8086:100f
  IRQ 11.
  BAR0: 32 bit memory at 0xfebc [0xfebd].
  BAR1: I/O at 0x [0x003e].
  BAR6: 32 bit memory at 0x [0x0003fffe].
  id "vnet0"

so Ethernet, SATA, and USB, all sharing IRQ 11. Is there an easy way
to force one of those to use a different IRQ ?





Re: [Qemu-devel] VM id

2014-07-02 Thread Eric Blake
On 07/02/2014 01:20 PM, Gary Jordan wrote:
> Does Qemu have a VM id allocated for each VM?  I did not find this id in
> qemu. How does qemu identify each VM, using a thread id or some other
> identifier?

Each qemu process manages exactly one VM, so qemu doesn't care what id a
guest has.  Higher-level management software, such as libvirt, has
notions of a VM name and UUID (both of which can be specified on the
command line parameters given to qemu, and the UUID can even be
propagated to the guest, such as by SMBIOS readable by dmidecode in the
guest), as well as a VM id (in libvirt's case, a sequentially increasing
number for each VM that libvirt spawns a qemu process for).  But that's
getting outside the realm of qemu, since qemu doesn't care what name or
uuid you picked, only whether you have access to the monitor of the qemu
process.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] e1000 autoneg timing, piix/osx

2014-07-02 Thread Alexander Graf


> On 02.07.2014 at 23:14, "Gabriel L. Somlo" wrote:
> 
>> On Wed, Jul 02, 2014 at 11:02:30PM +0200, Alexander Graf wrote:
>> 
>>> On 02.07.14 22:49, Gabriel L. Somlo wrote:
>>> So it turns out everything I thought I knew (which was little indeed)
>>> was more or less wrong. The problem, as far as I'm observing it now,
>>> is that on PIIX, the OS X guest obsessively reads the ICR in a tight
>>> loop. It reads the injected LSC (and probably discards it) before
>>> unmasking the corresponding interrupt bit; later on, when it unmasks
>>> LSC, giving the emulated e1000 hardware a chance to raise the irq
>>> line, the actual LSC event has been flushed from the ICR, and the
>>> driver does not detect the link coming up.
>>> 
>>> [...]
>>> 
>>> Any clue as to why ICR gets read like that on PIIX, but not Q35 ?
>> 
>> Either way, why does the bit get cleared even though it hasn't been raised?
>> What does real hardware do with interrupts that have been masked?
> 
> The e1000 manual says ICR bits are cleared on read. It also says PCI
> interrupts are only generated if the corresponding bit in *both* ICR
> and IMS registers is 1. ICR bits are still cleared if read, even if
> masked and no actual interrupt is raised.
> 
>> Maybe it's using MSI on q35? :)
>> Maybe we also share the same IRQ line with another device on PIIX that gets
>> polled all the time? IDE maybe?
> 
> Even if that were the case, how come it's reading precisely our
> device's ICR register ? Isn't that too much of a coincidence ? 

When PCI devices share an irq line, the OS needs to poke all devices on that 
line to figure out which device the IRQ came from. Maybe it's reading ICR for 
that purpose?

Alex

> 
> /me goes back to reading about PIIX and ICH10 (840 pages FTW!) :)
> 



Re: [Qemu-devel] [PATCH 2/2 v5] numa: enable sparse node numbering on ppc

2014-07-02 Thread Eduardo Habkost
On Wed, Jul 02, 2014 at 02:02:14PM -0700, Nishanth Aravamudan wrote:
> On 02.07.2014 [15:21:38 -0300], Eduardo Habkost wrote:
> > On Tue, Jul 01, 2014 at 01:50:06PM -0700, Nishanth Aravamudan wrote:
> > > On 01.07.2014 [17:39:57 -0300], Eduardo Habkost wrote:
> > > > On Tue, Jul 01, 2014 at 01:13:28PM -0700, Nishanth Aravamudan wrote:
> > > > [...]
> > > > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > > > > index 12472c6..cdefafe 100644
> > > > > --- a/hw/i386/pc.c
> > > > > +++ b/hw/i386/pc.c
> > > > > @@ -1121,6 +1121,18 @@ PcGuestInfo *pc_guest_info_init(ram_addr_t 
> > > > > below_4g_mem_size,
> > > > >  guest_info->ram_size = below_4g_mem_size + above_4g_mem_size;
> > > > >  guest_info->apic_id_limit = pc_apic_id_limit(max_cpus);
> > > > >  guest_info->apic_xrupt_override = kvm_allows_irq0_override();
> > > > > +/* No support for sparse NUMA node IDs yet: */
> > > > > +for (i = max_numa_nodeid - 1; i >= 0; i--) {
> > > > > +/* Report large node IDs first, to make mistakes easier to 
> > > > > spot */
> > > > > +if (!numa_info[i].present) {
> > > > > +error_report("numa: Node ID missing: %d", i);
> > > > > +exit(EXIT_FAILURE);
> > > > > +}
> > > > > +}
> > > > > +
> > > > > +/* This must be always true if all nodes are present */
> > > > > +assert(num_numa_nodes == max_numa_nodeid);
> > > > > +
> > > > 
> > > > I wonder if there's a better place where we could put this check.
> > > 
> > > Well, only i386 and ppc support NUMA, afaict. So I'm not sure where it
> > > makes sense to put it. I guess we could have a flag that the
> > > architectures set that indicates sparse NUMA support or not, and put
> > > this in the generic code.
> > > 
> > > Or do you mean putting this check somewhere else in the PC init code?
> > 
> > I mean somewhere else in the PC init code. But as today the code that
> > calls pc_guest_info_init() and pc_memory_init() is duplicated in both
> > pc_piix.c and pc_q35.c, this looks like the best place we have.
> 
> Ok, so if I send out another revision with the fixed j initialization
> below, is there anything else in my changes that you would like fixed?

I don't see any additional issues.

> 
[...]
> > > > Except for that, patch looks good to me. But I would be more comfortable
> > > > with it if we had automated tests to help ensure we are not breaking
> > > > compatibility of existing NUMA command-line combinations with these
> > > > changes.
> > > 
> > > Is that the test target in the qemu source? Are there examples of any
> > > such NUMA tests already?
> > 
> > I use 'make check' to run them, they are in the tests/ directory.
> 
> Got it, thanks.
> 
> > I am not aware of any NUMA-related test, but I see two possible ways of
> > testing it: using qtest and asking for the NUMA node info through
> > the monitor, or a unit test for numa.c that simply calls
> > numa_node_parse() and set_numa_nodes(), and then checks the result on
> > numa_info[] directly.
> 
> Do you have a preference for which of these to do?

The one we find to be easier. :)

A unit test may require untangling numa.o dependencies. qtest will
probably require parsing the "info numa" output.

A qtest case would cover more code (not just numa.c, but command-line
handling in vl.c, and monitor code).

> 
> > A third option may be using qtest and checking the resulting ACPI tables
> > directly. It would cover even more code, but would be specific to PC.
> 
> I'm not comfortable saying I can get to this, as I still don't really
> know the ACPI code, but I can put it on my todo list, at least.
> 
> > The tests won't be a requirement to me, but they would surely be welcome
> > (and would have detected the j=0 mistake above).
> 
> I think it makes sense to put this in now, as it would have caught the
> original issue(s) with sparse node numbering as well.
> 
> Thanks,
> Nish
> 

-- 
Eduardo



Re: [Qemu-devel] e1000 autoneg timing, piix/osx

2014-07-02 Thread Gabriel L. Somlo
On Wed, Jul 02, 2014 at 11:02:30PM +0200, Alexander Graf wrote:
> 
> On 02.07.14 22:49, Gabriel L. Somlo wrote:
> >So it turns out everything I thought I knew (which was little indeed)
> >was more or less wrong. The problem, as far as I'm observing it now,
> >is that on PIIX, the OS X guest obsessively reads the ICR in a tight
> >loop. It reads the injected LSC (and probably discards it) before
> >unmasking the corresponding interrupt bit; later on, when it unmasks
> >LSC, giving the emulated e1000 hardware a chance to raise the irq
> >line, the actual LSC event has been flushed from the ICR, and the
> >driver does not detect the link coming up.
> >
> > [...]
> >
> >Any clue as to why ICR gets read like that on PIIX, but not Q35 ?
> 
> Either way, why does the bit get cleared even though it hasn't been raised?
> What does real hardware do with interrupts that have been masked?

The e1000 manual says ICR bits are cleared on read. It also says PCI
interrupts are only generated if the corresponding bit in *both* ICR
and IMS registers is 1. ICR bits are still cleared if read, even if
masked and no actual interrupt is raised.
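
The sequence that loses the link-up event can be reproduced with a tiny
register model. This is a sketch, not the hw/net/e1000.c code: struct
nic_model and the helper names are made up for illustration, and the
model implements only the three rules stated above (ICR is cleared on
read, an interrupt needs the bit set in both ICR and IMS, and a masked
read still clears). The LSC bit position (bit 2) matches the e1000
register definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ICR_LSC (1u << 2)   /* Link Status Change cause bit */

/* Hypothetical minimal model of the two registers involved. */
struct nic_model {
    uint32_t icr;   /* interrupt cause, cleared when read */
    uint32_t ims;   /* interrupt mask, 1 = cause may raise the irq */
};

/* The irq line is high only if a cause is both pending and unmasked. */
static bool irq_asserted(const struct nic_model *n)
{
    return (n->icr & n->ims) != 0;
}

/* Device side: latch a new cause (for example the injected LSC). */
static void inject_cause(struct nic_model *n, uint32_t bits)
{
    n->icr |= bits;
}

/* Driver side: reading ICR returns the causes and clears them. */
static uint32_t read_icr(struct nic_model *n)
{
    uint32_t v = n->icr;

    n->icr = 0;
    return v;
}

/* Driver side: writing IMS unmasks bits, writing IMC masks them. */
static void set_ims(struct nic_model *n, uint32_t bits)
{
    n->ims |= bits;
}

static void set_imc(struct nic_model *n, uint32_t bits)
{
    n->ims &= ~bits;
}
```

Replaying the observed order (mask everything, inject LSC, read ICR, then
unmask LSC) ends with irq_asserted() false and ICR empty, so a driver that
discarded the early read never sees the link come up.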

> Maybe it's using MSI on q35? :)
> Maybe we also share the same IRQ line with another device on PIIX that gets
> polled all the time? IDE maybe?

Even if that were the case, how come it's reading precisely our
device's ICR register ? Isn't that too much of a coincidence ? 

/me goes back to reading about PIIX and ICH10 (840 pages FTW!) :)




Re: [Qemu-devel] [PATCH 2/2 v5] numa: enable sparse node numbering on ppc

2014-07-02 Thread Nishanth Aravamudan
On 02.07.2014 [15:21:38 -0300], Eduardo Habkost wrote:
> On Tue, Jul 01, 2014 at 01:50:06PM -0700, Nishanth Aravamudan wrote:
> > On 01.07.2014 [17:39:57 -0300], Eduardo Habkost wrote:
> > > On Tue, Jul 01, 2014 at 01:13:28PM -0700, Nishanth Aravamudan wrote:
> > > [...]
> > > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > > > index 12472c6..cdefafe 100644
> > > > --- a/hw/i386/pc.c
> > > > +++ b/hw/i386/pc.c
> > > > @@ -1121,6 +1121,18 @@ PcGuestInfo *pc_guest_info_init(ram_addr_t 
> > > > below_4g_mem_size,
> > > >  guest_info->ram_size = below_4g_mem_size + above_4g_mem_size;
> > > >  guest_info->apic_id_limit = pc_apic_id_limit(max_cpus);
> > > >  guest_info->apic_xrupt_override = kvm_allows_irq0_override();
> > > > +/* No support for sparse NUMA node IDs yet: */
> > > > +for (i = max_numa_nodeid - 1; i >= 0; i--) {
> > > > +/* Report large node IDs first, to make mistakes easier to 
> > > > spot */
> > > > +if (!numa_info[i].present) {
> > > > +error_report("numa: Node ID missing: %d", i);
> > > > +exit(EXIT_FAILURE);
> > > > +}
> > > > +}
> > > > +
> > > > +/* This must be always true if all nodes are present */
> > > > +assert(num_numa_nodes == max_numa_nodeid);
> > > > +
> > > 
> > > I wonder if there's a better place where we could put this check.
> > 
> > Well, only i386 and ppc support NUMA, afaict. So I'm not sure where it
> > makes sense to put it. I guess we could have a flag that the
> > architectures set that indicates sparse NUMA support or not, and put
> > this in the generic code.
> > 
> > Or do you mean putting this check somewhere else in the PC init code?
> 
> I mean somewhere else in the PC init code. But as today the code that
> calls pc_guest_info_init() and pc_memory_init() is duplicated in both
> pc_piix.c and pc_q35.c, this looks like the best place we have.

Ok, so if I send out another revision with the fixed j initialization
below, is there anything else in my changes that you would like fixed?

> > > >  guest_info->numa_nodes = num_numa_nodes;
> > > >  guest_info->node_mem = g_malloc0(guest_info->numa_nodes *
> > > >  sizeof *guest_info->node_mem);
> > > [...]
> > > > diff --git a/numa.c b/numa.c
> > > > index 5930df0..a689e52 100644
> > > > --- a/numa.c
> > > > +++ b/numa.c
> > > [...]
> > > > @@ -225,9 +220,12 @@ void set_numa_nodes(void)
> > > > >   * must cope with this anyway, because there are BIOSes out there in
> > > >   * real machines which also use this scheme.
> > > >   */
> > > > -if (i == num_numa_nodes) {
> > > > -for (i = 0; i < max_cpus; i++) {
> > > > -set_bit(i, numa_info[i % num_numa_nodes].node_cpu);
> > > > +if (i == max_numa_nodeid) {
> > > > +for (i = 0, j = 0; i < max_cpus; i++) {
> > > 
> > > Doesn't j need to be initialized to -1, here?
> > 
> > Arrgh, sorry had been messing with your suggestion to use a while loop.
> > You're right, it needs to be -1 here.
> > 
> > > Except for that, patch looks good to me. But I would be more comfortable
> > > with it if we had automated tests to help ensure we are not breaking
> > > compatibility of existing NUMA command-line combinations with these
> > > changes.
> > 
> > Is that the test target in the qemu source? Are there examples of any
> > such NUMA tests already?
> 
> I use 'make check' to run them, they are in the tests/ directory.

Got it, thanks.

> I am not aware of any NUMA-related test, but I see two possible ways of
> testing it: using qtest and asking for the NUMA node info through
> the monitor, or a unit test for numa.c that simply calls
> numa_node_parse() and set_numa_nodes(), and then checks the result on
> numa_info[] directly.

Do you have a preference for which of these to do?

> A third option may be using qtest and checking the resulting ACPI tables
> directly. It would cover even more code, but would be specific to PC.

I'm not comfortable saying I can get to this, as I still don't really
know the ACPI code, but I can put it on my todo list, at least.

> The tests won't be a requirement to me, but they would surely be welcome
> (and would have detected the j=0 mistake above).

I think it makes sense to put this in now, as it would have caught the
original issue(s) with sparse node numbering as well.

Thanks,
Nish




Re: [Qemu-devel] e1000 autoneg timing, piix/osx

2014-07-02 Thread Alexander Graf


On 02.07.14 22:49, Gabriel L. Somlo wrote:

On Wed, Jul 02, 2014 at 11:16:52AM +0200, Alexander Graf wrote:

Are you sure there's not just simply some irq unmasking event
after 5500ms we don't handle properly?

I poked around a bit, and the e1000 interrupt mask register is NOT the
problem (the LSC mask bit is clear at all times). If anything, maybe
the PIIX southbridge (or something further up "north") is masking PCI
interrupts (at least from e1000) until roughly 5500 ms into the boot
process ? Any ideas on how I could go about verifying this (without
access to the guest source, obviously :) ) would be very helpful...

Yeah, maybe the interrupt is masked and doesn't get delivered properly? See
if you can trace when the e1000 emulation starts kicking an interrupt and
when the guest tries to fetch it (there should be an ack register for IRQs
somewhere).

If we kick it but the guest doesn't react, the problem is further down -
check whether the IRQ ever got injected into the guest with trace points.

If we don't kick it, we mask it somewhere in the e1000 emulation and need to
make sure we do kick once we unmask :). I don't know whether the LSC mask is
the only one involved.

So it turns out everything I thought I knew (which was little indeed)
was more or less wrong. The problem, as far as I'm observing it now,
is that on PIIX, the OS X guest obsessively reads the ICR in a tight
loop. It reads the injected LSC (and probably discards it) before
unmasking the corresponding interrupt bit; later on, when it unmasks
LSC, giving the emulated e1000 hardware a chance to raise the irq
line, the actual LSC event has been flushed from the ICR, and the
driver does not detect the link coming up.

Here's how things work normally on Q35 (with INTERRUPT and PHY debugging
enabled, // my comments on the side):

e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: Start link auto negotiation
e1000: Auto negotiation is completed
e1000: set_ics 4, ICR 0, IMR 0   // autoneg timer injects LSC
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_ics 2, ICR 4, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_ics 2, ICR 6, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_ims df                // guest unmasks interrupts
e1000: set_ics 0, ICR 6, IMR df
e1000: set_interrupt_cause: mit_irq_level=1  // first raising irq edge
e1000: ICR read: 6   // guest receives LSC (+more)

... and things work nicely from here on out :)


On PIIX, however, things look like this:

e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0

... <8155 "ICR read" repetitions deleted> ...

e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: Start link auto negotiation
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0

... <145 "ICR read" repetitions deleted> ...

e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: Auto negotiation is completed
e1000: set_ics 4, ICR 0, IMR 0   // autoneg timer inje

Re: [Qemu-devel] [PATCH 5/6] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Alexander Graf


On 02.07.14 21:34, Scott Wood wrote:

> On Wed, 2014-07-02 at 19:59 +0200, Alexander Graf wrote:
>> On 02.07.14 19:52, Scott Wood wrote:
>>> On Wed, 2014-07-02 at 19:30 +0200, Alexander Graf wrote:
>>>> On 02.07.14 19:26, Scott Wood wrote:
>>>>> On Wed, 2014-07-02 at 19:12 +0200, Alexander Graf wrote:
>>>>>> On 02.07.14 00:50, Scott Wood wrote:
>>>>>>> Plus, let's please not hardcode any more addresses that are going to be
>>>>>>> a problem for giving guests a large amount of RAM (yes, CCSRBAR is also
>>>>>>> blocking that, but that has a TODO to parameterize).  How about
>>>>>>> 0xfULL?  Unless of course we're emulating an e500v1, which
>>>>>>> doesn't support 36-bit physical addressing.  Parameterization would help
>>>>>>> there as well.
>>>>>> I don't think we have to worry about e500v1. I'll change it :).
>>>>> We theoretically support it elsewhere...  Once parameterized, it
>>>>> shouldn't be hard to base the address for this, CCSRBAR, and similar
>>>>> things on whether MAS7 is supported.
>>>> It gets parametrized in the machine file, CPU selection comes after
>>>> machine selection. So parameterizing it doesn't really solve it.
>>> Why can't e500plat_init() look at args->cpu_model?  Or the
>>> parameterization could take two sets of addresses, one for a 32-bit
>>> layout and one for a 36-bit layout.  It might make sense to make this a
>>> user-settable parameter; some OSes might not be able to handle a 36-bit
>>> layout (or might not be as efficient at handling it) even on e500v2.
>>> Many of the e500v2 boards can be built for either a 32 or 36 bit address
>>> layout in U-Boot.
>>>
>>>> However, again, I don't think we have to worry about it.
>>> It's not a huge worry, but it'd be nice to not break it gratuitously.
>>> If we do break it we should explicitly disallow e500v1 with e500plat.
>> I'd prefer if we don't overparameterize - it'll just become a headache
>> further down.
> "We shouldn't overparameterize" is a tautology.  The question is what
> constitutes "over".  I don't think this is excessive.  Again, it's
> parameterization that U-Boot already does, even disregarding e500v1, and
> QEMU plays the role of U-Boot to a certain degree (even in the new mode
> of actually running U-Boot, the address map is fixed).
>
> Perhaps it could be simplified by just saying that, in 36-bit mode, all
> physical addresses other than RAM have 0xf prepended.  This is similar
> to what U-Boot does.

I think we should just have another machine type for that case - one
that is 32bit and one that is 36bit compatible.

>> Today we don't explicitly disallow anything anywhere - you
>> could theoretically stick a G3 into e500plat. I don't see why we should
>> start with heavy sanity checks now :).
> Ugh.
>
> It should at least be documented, since unlike a G3, e500v1 isn't an
> unrealistic expectation on a platform called e500plat.
>
>> Plus, the machine works just fine today if you don't pass in -device
>> eTSEC. It's not like we're moving all devices to the new "platform bus".
> We have a TODO to move CCSR as well.

Yes, that's certainly a good goal to have :).


Alex




Re: [Qemu-devel] e1000 autoneg timing, piix/osx

2014-07-02 Thread Gabriel L. Somlo
On Wed, Jul 02, 2014 at 11:16:52AM +0200, Alexander Graf wrote:
>>> Are you sure there's not just simply some irq unmasking event
>>> after 5500ms we don't handle properly?
>> I poked around a bit, and the e1000 interrupt mask register is NOT the
>> problem (the LSC mask bit is clear at all times). If anything, maybe
>> the PIIX southbridge (or something further up "north") is masking PCI
>> interrupts (at least from e1000) until roughly 5500 ms into the boot
>> process ? Any ideas on how I could go about verifying this (without
>> access to the guest source, obviously :) ) would be very helpful...
> 
> Yeah, maybe the interrupt is masked and doesn't get delivered properly? See
> if you can trace when the e1000 emulation starts kicking an interrupt and
> when the guest tries to fetch it (there should be an ack register for IRQs
> somewhere).
> 
> If we kick it but the guest doesn't react, the problem is further down -
> check whether the IRQ ever got injected into the guest with trace points.
> 
> If we don't kick it, we mask it somewhere in the e1000 emulation and need to
> make sure we do kick once we unmask :). I don't know whether the LSC mask is
> the only one involved.

So it turns out everything I thought I knew (which was little indeed)
was more or less wrong. The problem, as far as I'm observing it now,
is that on PIIX, the OS X guest obsessively reads the ICR in a tight
loop. It reads the injected LSC (and probably discards it) before
unmasking the corresponding interrupt bit; later on, when it unmasks
LSC, giving the emulated e1000 hardware a chance to raise the irq
line, the actual LSC event has been flushed from the ICR, and the
driver does not detect the link coming up.
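
The ordering problem can be replayed in a few lines of C. This is only a
toy model of the sequence described above, not the QEMU e1000 code; the
function name and the CAUSE_LSC macro are made up for illustration (the
value 0x4 is the LSC cause as it shows up in "set_ics 4" in the logs).

```c
#include <stdbool.h>
#include <stdint.h>

#define CAUSE_LSC 0x4u   /* link status change, "set_ics 4" in the logs */

/*
 * Replays the event order against a toy ICR/IMS pair.
 * Returns true if the guest's interrupt handler ends up seeing LSC.
 */
bool guest_sees_lsc(bool polls_icr_before_unmask)
{
    uint32_t icr = 0, ims = 0;

    icr |= CAUSE_LSC;            /* autoneg timer injects LSC while masked */

    if (polls_icr_before_unmask) {
        icr = 0;                 /* ICR is clear-on-read: the poll eats LSC */
    }

    ims |= CAUSE_LSC;            /* guest unmasks LSC */

    if ((icr & ims) == 0) {
        return false;            /* nothing pending, the irq line never rises */
    }
    return (icr & CAUSE_LSC) != 0;   /* handler reads ICR and finds LSC */
}
```

guest_sees_lsc(false) corresponds to the Q35 ordering and returns true;
guest_sees_lsc(true) corresponds to the PIIX ordering and returns false.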

Here's how things work normally on Q35 (with INTERRUPT and PHY debugging
enabled, // my comments on the side):

e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: Start link auto negotiation
e1000: Auto negotiation is completed
e1000: set_ics 4, ICR 0, IMR 0   // autoneg timer injects LSC
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_ics 2, ICR 4, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_ics 2, ICR 6, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_ims df                // guest unmasks interrupts
e1000: set_ics 0, ICR 6, IMR df
e1000: set_interrupt_cause: mit_irq_level=1  // first raising irq edge
e1000: ICR read: 6   // guest receives LSC (+more)

... and things work nicely from here on out :)


On PIIX, however, things look like this:

e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0

... <8155 "ICR read" repetitions deleted> ...

e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: set_imc 
e1000: set_ics 0, ICR 0, IMR 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: Start link auto negotiation
e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0

... <145 "ICR read" repetitions deleted> ...

e1000: ICR read: 0
e1000: set_interrupt_cause: mit_irq_level=0
e1000: Auto negotiation is completed
e1000: set_ics 4, ICR 0, IMR 0   // autoneg timer in

Re: [Qemu-devel] [PATCH 5/6] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Scott Wood
On Wed, 2014-07-02 at 19:59 +0200, Alexander Graf wrote:
> On 02.07.14 19:52, Scott Wood wrote:
> > On Wed, 2014-07-02 at 19:30 +0200, Alexander Graf wrote:
> >> On 02.07.14 19:26, Scott Wood wrote:
> >>> On Wed, 2014-07-02 at 19:12 +0200, Alexander Graf wrote:
> >>>> On 02.07.14 00:50, Scott Wood wrote:
> >>>>> Plus, let's please not hardcode any more addresses that are going to be
> >>>>> a problem for giving guests a large amount of RAM (yes, CCSRBAR is also
> >>>>> blocking that, but that has a TODO to parameterize).  How about
> >>>>> 0xfULL?  Unless of course we're emulating an e500v1, which
> >>>>> doesn't support 36-bit physical addressing.  Parameterization would help
> >>>>> there as well.
> >>>> I don't think we have to worry about e500v1. I'll change it :).
> >>> We theoretically support it elsewhere...  Once parameterized, it
> >>> shouldn't be hard to base the address for this, CCSRBAR, and similar
> >>> things on whether MAS7 is supported.
> >> It gets parametrized in the machine file, CPU selection comes after
> >> machine selection. So parameterizing it doesn't really solve it.
> > Why can't e500plat_init() look at args->cpu_model?  Or the
> > parameterization could take two sets of addresses, one for a 32-bit
> > layout and one for a 36-bit layout.  It might make sense to make this a
> > user-settable parameter; some OSes might not be able to handle a 36-bit
> > layout (or might not be as efficient at handling it) even on e500v2.
> > Many of the e500v2 boards can be built for either a 32 or 36 bit address
> > layout in U-Boot.
> >
> >> However, again, I don't think we have to worry about it.
> > It's not a huge worry, but it'd be nice to not break it gratuitously.
> > If we do break it we should explicitly disallow e500v1 with e500plat.
> 
> I'd prefer if we don't overparameterize - it'll just become a headache 
> further down.

"We shouldn't overparameterize" is a tautology.  The question is what
constitutes "over".  I don't think this is excessive.  Again, it's
parameterization that U-Boot already does, even disregarding e500v1, and
QEMU plays the role of U-Boot to a certain degree (even in the new mode
of actually running U-Boot, the address map is fixed).

Perhaps it could be simplified by just saying that, in 36-bit mode, all
physical addresses other than RAM have 0xf prepended.  This is similar
to what U-Boot does.
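
As a rough sketch of that rule (the helper name and the boolean switch are
hypothetical, nothing like this exists in the e500 code today), the mapping
for non-RAM addresses would just set bits 35:32:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: map a device's 32-bit physical address into the
 * chosen layout. In the 36-bit layout, 0xf is prepended (bits 35:32 set);
 * RAM addresses would not go through this function.
 */
uint64_t device_phys_addr(uint32_t addr32, bool use_36bit_layout)
{
    if (!use_36bit_layout) {
        return addr32;
    }
    return (UINT64_C(0xf) << 32) | addr32;
}
```

For example, 0xe0000000 stays 0xe0000000 in the 32-bit layout and becomes
0xfe0000000 in the 36-bit one.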

> Today we don't explicitly disallow anything anywhere - you 
> could theoretically stick a G3 into e500plat. I don't see why we should 
> start with heavy sanity checks now :).

Ugh.

It should at least be documented, since unlike a G3, e500v1 isn't an
unrealistic expectation on a platform called e500plat.

> Plus, the machine works just fine today if you don't pass in -device 
> eTSEC. It's not like we're moving all devices to the new "platform bus".

We have a TODO to move CCSR as well.

-Scott





Re: [Qemu-devel] ResettRe: [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Alex Williamson
On Wed, 2014-07-02 at 18:12 +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 02, 2014 at 04:50:15PM +0200, Paolo Bonzini wrote:
> > Il 02/07/2014 16:00, Konrad Rzeszutek Wilk ha scritto:
> > >With this long thread I lost a bit of context about the challenges
> > >that exist. But let me try summarizing it here - which will hopefully
> > >get some consensus.
> > >
> > >1). Fix IGD hardware to not use Southbridge magic addresses.
> > >We can moan and moan but I doubt it is going to change.
> > 
> > There are two problems:
> > 
> > - Northbridge (i.e. MCH i.e. PCI host bridge) configuration space addresses
> > 
> > - Southbridge (i.e. PCH i.e. ISA bridge) vendor/device ID; some versions of
> > the driver identify it by class, some versions identify it by slot (1f.0).
> > 
> > To solve the first, make a new machine type, PIIX4-based, and pass through
> > the registers you need.  The patch must document _exactly_ why the registers
> > are safe to pass.  If they are not reserved on PIIX4, the patch must
> > document what the same offsets mean on PIIX4, and why it's sensible to
> > assume that firmware for virtual machine will not read/write them.  Bonus
> > point for also documenting the same for Q35.
> > 
> > Regarding the second, fixing IGD hardware to not rely on chipset magic is a
> > no-go, I agree.  I disagree that it's a no-go to define a "backdoor" that
> > lets a hypervisor pass the right information to the driver without hacking
> > the chipset device model.
> > 
> > The hardware folks would have to give us a place for a pair of registers
> > (something like data/address), and a bit somewhere else that would be always
> > 0 on hardware and always 1 if the hypervisor is implementing the pair of
> > registers.  This is similar to CPUID, which has the HYPERVISOR bit +
> > hypervisor-defined leaves at 0x40000000.
> > 
> > The data/address pair could be in a BAR, in configuration space, in the low
> > VGA ports at 0x3c0-0x3df, wherever.  The hypervisor bit can be in the same

This all looks like wishful thinking to me.  Just look through the i915
driver: hardware seems to change arbitrarily between chips, and I expect
the drivers have a hard enough time supporting real hardware.  I would
like to see a concise document/comment from Intel listing which
registers, opregions, gtt mappings are required to be mirrored to the
guest and what needs write access through to the host per device
generation though.  The dependency on MCH/PCH IDs is only part of the
story.  Things like opregions and gtt mappings may require identity
mapping to the host and therefore require reserved memory regions and
guest access.  In order to provide that access, we need to know exactly
what we're providing access to and whether it compromises the host
isolation.

I do want to note that we should not add any dependency on VGA space if
we do go the path of a paravirt interface.  VGA routing is a nightmare
and for a VFIO path forward, I think we'll want to rely on legacy-free
UEFI drivers.

> > place or somewhere else---again, whatever is convenient for the hardware
> > folks.  We just need *one bit* that is known-zero on all hardware, and 8
> > bytes in a reserved area.  I don't think it's too hard to find this space,
> > and I really, really would like Intel to follow up on a paravirtualized
> > backdoor.
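
For reference, the guest side of such a backdoor might look like the sketch
below. Everything here is hypothetical: no such registers exist, and the
struct layout, names and index semantics are invented purely to illustrate
the "one known-zero bit plus an address/data pair" idea.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the proposed registers (all names are assumptions). */
struct pv_regs {
    bool hypervisor_bit;     /* would read 0 on bare metal */
    uint32_t address;        /* guest writes the index of the item it wants */
    const uint32_t *items;   /* toy backing store for the "data" register */
    size_t n_items;
};

/* Returns true and fills *value only if a hypervisor implements the pair. */
bool pv_query(struct pv_regs *r, uint32_t index, uint32_t *value)
{
    if (!r->hypervisor_bit) {
        return false;                /* bare metal: registers are reserved */
    }
    r->address = index;              /* "address" register write */
    if (r->address >= r->n_items) {
        return false;                /* index not defined by the hypervisor */
    }
    *value = r->items[r->address];   /* "data" register read */
    return true;
}
```

A driver would fall back to its existing probing whenever pv_query()
returns false.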
> > 
> > That said, we have the problem of existing guests, so I agree something else
> > is needed.
> > 
> > > a) Two bridges - one 'passthrough' and the legacy ISA bridge
> > >that QEMU emulates. Both Linux and Windows are OK with
> > > >two bridges (even though it is pretty weird).
> > 
> > This is pretty much the only solution for existing Linux guests that look up
> > the southbridge by class.
> > 
> > The proposed solution here is to define a new "pci stub" device in QEMU that
> > lets you define a do-nothing device with your desired vendor ID, device ID,
> > class and optionally subsystem IDs.
> > 
> > The new machine type (the one that instantiates the special
> > IGD-passthrough-enabled northbridge) can then instantiate this stub device
> > at 1f.0 with the desired vendor ID, device ID and class ID.
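
A minimal sketch of what such a stub has to expose, assuming nothing beyond
the standard PCI configuration header layout (vendor ID at 0x00, device ID
at 0x02, subclass at 0x0a, base class at 0x0b). The helper name is made up,
and this is plain C rather than the QEMU PCI device API:

```c
#include <stdint.h>
#include <string.h>

/* Offsets from the standard PCI configuration header. */
enum {
    CFG_VENDOR_ID  = 0x00,   /* 16 bit, little endian */
    CFG_DEVICE_ID  = 0x02,   /* 16 bit, little endian */
    CFG_SUBCLASS   = 0x0a,
    CFG_BASE_CLASS = 0x0b,
};

/* Hypothetical helper: fill the ID fields of a do-nothing stub device. */
void stub_fill_ids(uint8_t cfg[256], uint16_t vendor, uint16_t device,
                   uint8_t base_class, uint8_t subclass)
{
    memset(cfg, 0, 256);     /* no BARs, no capabilities, no interrupts */
    cfg[CFG_VENDOR_ID]     = (uint8_t)(vendor & 0xff);
    cfg[CFG_VENDOR_ID + 1] = (uint8_t)(vendor >> 8);
    cfg[CFG_DEVICE_ID]     = (uint8_t)(device & 0xff);
    cfg[CFG_DEVICE_ID + 1] = (uint8_t)(device >> 8);
    cfg[CFG_SUBCLASS]      = subclass;
    cfg[CFG_BASE_CLASS]    = base_class;
}
```

For example, stub_fill_ids(cfg, 0x8086, 0x1234, 0x06, 0x01) would describe
an Intel device with class 0x0601 (ISA bridge); the device ID 0x1234 is a
placeholder, the real value would be copied from the host's PCH.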
> > 
> > If we cannot get the paravirtualized backdoor, it would also make sense to:
> > 
> > - have drivers standardize on a single way to probe the southbridge
> > 
> > - make this be neither by class (because the firmware wants to distinguish
> > the actual ISA bridge from the stub, and it can do so by looking up the
> > class), nor by slot (because this conflicts with the Q35 chipset model that
> > has the southbridge at 1f.0).
> > 
> > mst's proposal was to probe by subsystem id.  I'm not sure I understood the
> > details exactly, but I trust him. :)  However, in case it wasn't clear I
> > think a paravirtualized backdoor would still be better.
> 
> This was a paravirtualized idea actually.
> Since ISA bridge is just needed for type
> identification, stick this info in subsystem device id.
> guest could do
>   

[Qemu-devel] VM id

2014-07-02 Thread Gary Jordan
Does QEMU allocate a VM ID for each VM?  I did not find such an ID in
QEMU. How does QEMU identify each VM: by thread ID or by some other
identifier?


Re: [Qemu-devel] [PATCH 2/2 v5] numa: enable sparse node numbering on ppc

2014-07-02 Thread Eduardo Habkost
On Tue, Jul 01, 2014 at 01:50:06PM -0700, Nishanth Aravamudan wrote:
> On 01.07.2014 [17:39:57 -0300], Eduardo Habkost wrote:
> > On Tue, Jul 01, 2014 at 01:13:28PM -0700, Nishanth Aravamudan wrote:
> > [...]
> > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > > index 12472c6..cdefafe 100644
> > > --- a/hw/i386/pc.c
> > > +++ b/hw/i386/pc.c
> > > @@ -1121,6 +1121,18 @@ PcGuestInfo *pc_guest_info_init(ram_addr_t below_4g_mem_size,
> > >      guest_info->ram_size = below_4g_mem_size + above_4g_mem_size;
> > >      guest_info->apic_id_limit = pc_apic_id_limit(max_cpus);
> > >      guest_info->apic_xrupt_override = kvm_allows_irq0_override();
> > > +    /* No support for sparse NUMA node IDs yet: */
> > > +    for (i = max_numa_nodeid - 1; i >= 0; i--) {
> > > +        /* Report large node IDs first, to make mistakes easier to spot */
> > > +        if (!numa_info[i].present) {
> > > +            error_report("numa: Node ID missing: %d", i);
> > > +            exit(EXIT_FAILURE);
> > > +        }
> > > +    }
> > > +
> > > +    /* This must be always true if all nodes are present */
> > > +    assert(num_numa_nodes == max_numa_nodeid);
> > > +
> > 
> > I wonder if there's a better place where we could put this check.
> 
> Well, only i386 and ppc support NUMA, afaict. So I'm not sure where it
> makes sense to put it. I guess we could have a flag that the
> architectures set that indicates sparse NUMA support or not, and put
> this in the generic code.
> 
> Or do you mean putting this check somewhere else in the PC init code?

I mean somewhere else in the PC init code. But as today the code that
calls pc_guest_info_init() and pc_memory_init() is duplicated in both
pc_piix.c and pc_q35.c, this looks like the best place we have.
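
If the check ever does move, it could be factored into a small helper that
both callers share. A rough sketch, with a hypothetical function name and a
plain bool array standing in for numa_info[].present:

```c
#include <stdbool.h>

/*
 * Returns the highest node ID below max_nodeid that is not present, or -1
 * if IDs 0..max_nodeid-1 are all present (i.e. the numbering is not sparse).
 * Scans from the top, like the loop in the patch, so the largest missing
 * ID is reported first.
 */
int highest_missing_node(const bool *present, int max_nodeid)
{
    for (int i = max_nodeid - 1; i >= 0; i--) {
        if (!present[i]) {
            return i;
        }
    }
    return -1;
}
```

The PC code would then call error_report() and exit when the result is not
-1, while ppc would simply not call it.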

> 
> > >  guest_info->numa_nodes = num_numa_nodes;
> > >  guest_info->node_mem = g_malloc0(guest_info->numa_nodes *
> > >  sizeof *guest_info->node_mem);
> > [...]
> > > diff --git a/numa.c b/numa.c
> > > index 5930df0..a689e52 100644
> > > --- a/numa.c
> > > +++ b/numa.c
> > [...]
> > > @@ -225,9 +220,12 @@ void set_numa_nodes(void)
> > >   * must cope with this anyway, because there are BIOSes out there in
> > >   * real machines which also use this scheme.
> > >   */
> > > -if (i == num_numa_nodes) {
> > > -for (i = 0; i < max_cpus; i++) {
> > > -set_bit(i, numa_info[i % num_numa_nodes].node_cpu);
> > > +if (i == max_numa_nodeid) {
> > > +for (i = 0, j = 0; i < max_cpus; i++) {
> > 
> > Doesn't j need to be initialized to -1, here?
> 
> Arrgh, sorry had been messing with your suggestion to use a while loop.
> You're right, it needs to be -1 here.
> 
> > Except for that, patch looks good to me. But I would be more comfortable
> > with it if we had automated tests to help ensure we are not breaking
> > compatibility of existing NUMA command-line combinations with these
> > changes.
> 
> Is that the test target in the qemu source? Are there examples of any
> such NUMA tests already?

I use 'make check' to run them, they are in the tests/ directory.

I am not aware of any NUMA-related test, but I see two possible ways of
testing it: using qtest and asking for the NUMA node info through
the monitor, or a unit test for numa.c that simply calls
numa_node_parse() and set_numa_nodes(), and then checks the result on
numa_info[] directly.

A third option may be using qtest and checking the resulting ACPI tables
directly. It would cover even more code, but would be specific to PC.

The tests won't be a requirement to me, but they would surely be welcome
(and would have detected the j=0 mistake above).
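
To illustrate what such a test would catch, here is a standalone sketch of
the round-robin assignment. The loop body is an assumption on my part (the
hunk quoted above is cut off after the for statement): it advances j to the
next present node before each assignment, which is why j has to start at -1.
The function name and the cpu_to_node array are made up; the real code sets
bits in numa_info[j].node_cpu.

```c
#include <stdbool.h>

/*
 * Assign CPUs round-robin to present nodes, starting the node cursor at
 * j_start. Assumed loop shape: advance j to the next present node, then
 * assign. Requires at least one present node below max_nodeid.
 */
void assign_cpus(const bool *present, int max_nodeid, int j_start,
                 int max_cpus, int *cpu_to_node)
{
    int j = j_start;

    for (int i = 0; i < max_cpus; i++) {
        do {
            j = (j + 1) % max_nodeid;   /* skip over missing node IDs */
        } while (!present[j]);
        cpu_to_node[i] = j;
    }
}
```

With nodes 0 and 2 present and max_nodeid = 3, j_start = -1 gives CPUs
0..3 the nodes 0, 2, 0, 2. With j_start = 0 the first CPU lands on node 2
instead of node 0, which changes the mapping for existing command lines.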

-- 
Eduardo



[Qemu-devel] [PATCH v2 9/9] PPC: Fix default config ordering and add eTSEC for ppc64

2014-07-02 Thread Alexander Graf
We messed up the ordering in our default configs for PPC. The top entries
are generic entries, then come sections whose entries are only enabled
because a particular platform (such as PReP) needs them.

Fix the ordering again and while at it add eTSEC support to the ppc64 target
so that we can spawn eTSEC adapters with qemu-system-ppc64.

Signed-off-by: Alexander Graf 
---
 default-configs/ppc-softmmu.mak   | 4 ++--
 default-configs/ppc64-softmmu.mak | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/default-configs/ppc-softmmu.mak b/default-configs/ppc-softmmu.mak
index 33f8d84..d725b23 100644
--- a/default-configs/ppc-softmmu.mak
+++ b/default-configs/ppc-softmmu.mak
@@ -45,8 +45,8 @@ CONFIG_PREP=y
 CONFIG_MAC=y
 CONFIG_E500=y
 CONFIG_OPENPIC_KVM=$(and $(CONFIG_E500),$(CONFIG_KVM))
+CONFIG_ETSEC=y
+CONFIG_LIBDECNUMBER=y
 # For PReP
 CONFIG_MC146818RTC=y
-CONFIG_ETSEC=y
 CONFIG_ISA_TESTDEV=y
-CONFIG_LIBDECNUMBER=y
diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index 37a15b7..bd30d69 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -46,6 +46,8 @@ CONFIG_PREP=y
 CONFIG_MAC=y
 CONFIG_E500=y
 CONFIG_OPENPIC_KVM=$(and $(CONFIG_E500),$(CONFIG_KVM))
+CONFIG_ETSEC=y
+CONFIG_LIBDECNUMBER=y
 # For pSeries
 CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(and $(CONFIG_PSERIES),$(CONFIG_KVM))
@@ -58,4 +60,3 @@ CONFIG_I82374=y
 CONFIG_I8257=y
 CONFIG_MC146818RTC=y
 CONFIG_ISA_TESTDEV=y
-CONFIG_LIBDECNUMBER=y
-- 
1.8.1.4




[Qemu-devel] [PATCH v2 3/9] qom: Expose property helpers for get/set of integers

2014-07-02 Thread Alexander Graf
We have helper functions to easily expose integers as QOM object properties.
However, these are read only.

This patch makes the getter function world accessible and adds a generic
setter for integer properties.

We can use these later with the generic object_property_add to not duplicate
simple logic all over the place.

Signed-off-by: Alexander Graf 
---
 include/qom/property.h | 128 +
 qom/property.c |  14 +-
 2 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/include/qom/property.h b/include/qom/property.h
index bb09523..470c722 100644
--- a/include/qom/property.h
+++ b/include/qom/property.h
@@ -63,6 +63,38 @@ void object_property_add_uint8_ptr(Object *obj, const char *name,
const uint8_t *v, Error **errp);
 
 /**
+ * object_property_get_uint8_ptr:
+ * @obj: the object to get the property from
+ * @v: visitor to the property
+ * @opaque: pointer to the integer value we write the result to
+ * @name: name of the property
+ * @errp: if an error occurs, a pointer to an area to store the error
+ *
+ * Visitor function to read an integer value of type 'uint8' into the visitor.
+ * Use this as the 'get' argument in object_property_add if your field is a
+ * uint8_t value.
+ */
+void object_property_get_uint8_ptr(Object *obj, struct Visitor *v,
+   void *opaque, const char *name,
+   Error **errp);
+
+/**
+ * object_property_set_uint8_ptr:
+ * @obj: the object to set the property in
+ * @v: visitor to the property
+ * @opaque: pointer to the integer value
+ * @name: name of the property
+ * @errp: if an error occurs, a pointer to an area to store the error
+ *
+ * Visitor function to set an integer value of type 'uint8' to a given value.
+ * Use this as the 'set' argument in object_property_add if your field is a
+ * uint8_t value.
+ */
+void object_property_set_uint8_ptr(Object *obj, struct Visitor *v,
+   void *opaque, const char *name,
+   Error **errp);
+
+/**
  * object_property_add_uint16_ptr:
  * @obj: the object to add a property to
  * @name: the name of the property
@@ -76,6 +108,38 @@ void object_property_add_uint16_ptr(Object *obj, const char *name,
 const uint16_t *v, Error **errp);
 
 /**
+ * object_property_get_uint16_ptr:
+ * @obj: the object to get the property from
+ * @v: visitor to the property
+ * @opaque: pointer to the integer value we write the result to
+ * @name: name of the property
+ * @errp: if an error occurs, a pointer to an area to store the error
+ *
+ * Visitor function to read an integer value of type 'uint16' into the visitor.
+ * Use this as the 'get' argument in object_property_add if your field is a
+ * uint16_t value.
+ */
+void object_property_get_uint16_ptr(Object *obj, struct Visitor *v,
+void *opaque, const char *name,
+Error **errp);
+
+/**
+ * object_property_set_uint16_ptr:
+ * @obj: the object to set the property in
+ * @v: visitor to the property
+ * @opaque: pointer to the integer value
+ * @name: name of the property
+ * @errp: if an error occurs, a pointer to an area to store the error
+ *
+ * Visitor function to set an integer value of type 'uint16' to a given value.
+ * Use this as the 'set' argument in object_property_add if your field is a
+ * uint16_t value.
+ */
+void object_property_set_uint16_ptr(Object *obj, struct Visitor *v,
+void *opaque, const char *name,
+Error **errp);
+
+/**
  * object_property_add_uint32_ptr:
  * @obj: the object to add a property to
  * @name: the name of the property
@@ -89,6 +153,38 @@ void object_property_add_uint32_ptr(Object *obj, const char *name,
 const uint32_t *v, Error **errp);
 
 /**
+ * object_property_get_uint32_ptr:
+ * @obj: the object to get the property from
+ * @v: visitor to the property
+ * @opaque: pointer to the integer value to read
+ * @name: name of the property
+ * @errp: if an error occurs, a pointer to an area to store the error
+ *
+ * Visitor function to read an integer value of type 'uint32' into the visitor.
+ * Use this as the 'get' argument in object_property_add if your field is a
+ * uint32_t value.
+ */
+void object_property_get_uint32_ptr(Object *obj, struct Visitor *v,
+void *opaque, const char *name,
+Error **errp);
+
+/**
+ * object_property_set_uint32_ptr:
+ * @obj: the object to set the property in
+ * @v: visitor to the property
+ * @opaque: pointer to the integer value
+ * @name: name of the property
+ * @errp: if an error occurs, a pointer to an area to store the error
+ *
+ * Visitor function to set an integer value of type 'uint32' to a given 

[Qemu-devel] [PATCH v2 5/9] sysbus: Add user map hints

2014-07-02 Thread Alexander Graf
We want to let the user tell the machine file where devices should be mapped.
This patch adds code to create these hints dynamically and expose them as
object properties that can be set only once.
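
The write-once behaviour can be modelled outside of QEMU as a minimal sketch.
It assumes a sentinel value that means "no hint given", in the role that
SYSBUS_DYNAMIC plays in the diff below. The names HINT_DYNAMIC, hint_init and
hint_set are hypothetical and are not QEMU API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel meaning "no user hint given" (hypothetical name). */
#define HINT_DYNAMIC UINT64_MAX

/* Put a hint slot into the "dynamic" state. */
static void hint_init(uint64_t *slot)
{
    assert(slot != NULL);
    *slot = HINT_DYNAMIC;
}

/* Store a user hint. Returns false if a hint was already stored. */
static bool hint_set(uint64_t *slot, uint64_t value)
{
    assert(slot != NULL);
    if (*slot != HINT_DYNAMIC) {
        return false; /* a second write is rejected */
    }
    *slot = value;
    return true;
}
```

One consequence of using a sentinel is that writing the sentinel value itself
leaves the slot writable, so that value cannot be used as a real hint.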

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - make irq and pio properties uint64
  - ensure qom exposed pointers don't change due to realloc
  - fix sysbus_pass_irq
  - make properties write-once, not write-before-realize
  - make props only available via qom, no state pointers left
---
 hw/core/sysbus.c| 43 ---
 include/hw/sysbus.h |  1 +
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
index f4e760d..e551e16 100644
--- a/hw/core/sysbus.c
+++ b/hw/core/sysbus.c
@@ -20,6 +20,7 @@
 #include "hw/sysbus.h"
 #include "monitor/monitor.h"
 #include "exec/address-spaces.h"
+#include "qom/property.h"
 
 static void sysbus_dev_print(Monitor *mon, DeviceState *dev, int indent);
 static char *sysbus_get_fw_dev_path(DeviceState *dev);
@@ -86,6 +87,34 @@ void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
 sysbus_mmio_map_common(dev, n, addr, true, priority);
 }
 
+static void sysbus_property_set_uint64_ptr(Object *obj, Visitor *v,
+   void *opaque, const char *name,
+   Error **errp)
+{
+uint64_t *valp = opaque;
+
+if (*valp != SYSBUS_DYNAMIC) {
+error_setg(errp, "Attempt to set property '%s' twice", name);
+return;
+}
+
+object_property_set_uint64_ptr(obj, v, opaque, name, errp);
+}
+
+static void sysbus_init_int64_prop(SysBusDevice *dev, const char *propstr,
+   int n)
+{
+char *name = g_strdup_printf(propstr, n);
+Object *obj = OBJECT(dev);
+uint64_t *user_val = g_new(uint64_t, 1);
+
+*user_val = SYSBUS_DYNAMIC;
+
+object_property_add(obj, name, "uint64", object_property_get_uint64_ptr,
+sysbus_property_set_uint64_ptr,
+object_property_release_g_free, user_val, NULL);
+}
+
 /* Request an IRQ source.  The actual IRQ object may be populated later.  */
 void sysbus_init_irq(SysBusDevice *dev, qemu_irq *p)
 {
@@ -94,6 +123,8 @@ void sysbus_init_irq(SysBusDevice *dev, qemu_irq *p)
 assert(dev->num_irq < QDEV_MAX_IRQ);
 n = dev->num_irq++;
 dev->irqp[n] = p;
+
+sysbus_init_int64_prop(dev, "irq[%d]", n);
 }
 
 /* Pass IRQs from a target device.  */
@@ -101,9 +132,8 @@ void sysbus_pass_irq(SysBusDevice *dev, SysBusDevice *target)
 {
 int i;
 assert(dev->num_irq == 0);
-dev->num_irq = target->num_irq;
-for (i = 0; i < dev->num_irq; i++) {
+for (i = 0; i < target->num_irq; i++) {
-dev->irqp[i] = target->irqp[i];
+sysbus_init_irq(dev, target->irqp[i]);
 }
 }
 
@@ -115,6 +145,8 @@ void sysbus_init_mmio(SysBusDevice *dev, MemoryRegion *memory)
 n = dev->num_mmio++;
 dev->mmio[n].addr = -1;
 dev->mmio[n].memory = memory;
+
+sysbus_init_int64_prop(dev, "mmio[%d]", n);
 }
 
 MemoryRegion *sysbus_mmio_get_region(SysBusDevice *dev, int n)
@@ -127,8 +159,13 @@ void sysbus_init_ioports(SysBusDevice *dev, pio_addr_t ioport, pio_addr_t size)
 pio_addr_t i;
 
 for (i = 0; i < size; i++) {
+int n;
+
 assert(dev->num_pio < QDEV_MAX_PIO);
-dev->pio[dev->num_pio++] = ioport++;
+n = dev->num_pio++;
+dev->pio[n] = ioport++;
+
+sysbus_init_int64_prop(dev, "pio[%d]", n);
 }
 }
 
diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
index f5aaa05..533184a 100644
--- a/include/hw/sysbus.h
+++ b/include/hw/sysbus.h
@@ -9,6 +9,7 @@
 #define QDEV_MAX_MMIO 32
 #define QDEV_MAX_PIO 32
 #define QDEV_MAX_IRQ 512
+#define SYSBUS_DYNAMIC -1ULL
 
 #define TYPE_SYSTEM_BUS "System"
 #define SYSTEM_BUS(obj) OBJECT_CHECK(IDEBus, (obj), TYPE_IDE_BUS)
-- 
1.8.1.4




[Qemu-devel] [PATCH v2 4/9] qom: Add generic object property g_free helper

2014-07-02 Thread Alexander Graf
Many properties are backed by nothing more than g_new / g_malloc allocated
memory. There is no reason to have a separate release helper for each of them.

This patch introduces a new g_free() based helper for property release and
replaces existing duplicated code implementations in object.c as well as
property.c with it.
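
As a standalone illustration of the idea, the sketch below uses plain free()
in place of g_free() and a simplified callback signature. The types Prop and
release_fn and the functions release_free and prop_release are hypothetical
names chosen for this example only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical release-callback type, called when a property goes away. */
typedef void (*release_fn)(const char *name, void *opaque);

/* One generic helper replaces per-type helpers that differed only in a cast. */
static void release_free(const char *name, void *opaque)
{
    (void)name; /* unused, kept for the callback signature */
    free(opaque);
}

typedef struct Prop {
    const char *name;
    void *opaque;
    release_fn release;
} Prop;

/* Run the release hook, if any, and clear the slot. */
static void prop_release(Prop *p)
{
    assert(p != NULL);
    if (p->release) {
        p->release(p->name, p->opaque);
    }
    p->opaque = NULL;
}
```

Because the helper never looks at the pointed-to type, string, bool and alias
properties can all share it.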

Signed-off-by: Alexander Graf 
---
 include/qom/property.h | 12 
 qom/object.c   |  9 +
 qom/property.c | 17 -
 3 files changed, 17 insertions(+), 21 deletions(-)

diff --git a/include/qom/property.h b/include/qom/property.h
index 470c722..52675ba 100644
--- a/include/qom/property.h
+++ b/include/qom/property.h
@@ -229,4 +229,16 @@ void object_property_set_uint64_ptr(Object *obj, struct Visitor *v,
 void *opaque, const char *name,
 Error **errp);
 
+/**
+ * object_property_release_g_free:
+ * @obj: the object that owns the property
+ * @name: name of the property
+ * @opaque: pointer to glib allocated memory
+ *
+ * Generic helper function to free a property using g_free(). Use this if you
+ * want to make sure that your property is properly g_free()'d when the property
+ * gets released.
+ */
+void object_property_release_g_free(Object *obj, const char *name, void *opaque);
+
 #endif /* !QEMU_PROPERTY_H */
diff --git a/qom/object.c b/qom/object.c
index 94a19ce..b3dabb2 100644
--- a/qom/object.c
+++ b/qom/object.c
@@ -1406,13 +1406,6 @@ static Object *property_resolve_alias(Object *obj, void *opaque,
 return object_resolve_path_component(prop->target_obj, prop->target_name);
 }
 
-static void property_release_alias(Object *obj, const char *name, void *opaque)
-{
-AliasProperty *prop = opaque;
-
-g_free(prop);
-}
-
 void object_property_add_alias(Object *obj, const char *name,
Object *target_obj, const char *target_name,
Error **errp)
@@ -1441,7 +1434,7 @@ void object_property_add_alias(Object *obj, const char *name,
 op = object_property_add(obj, name, prop_type,
  property_get_alias,
  property_set_alias,
- property_release_alias,
+ object_property_release_g_free,
  prop, errp);
 op->resolve = property_resolve_alias;
 
diff --git a/qom/property.c b/qom/property.c
index b4f27e9..ce80a17 100644
--- a/qom/property.c
+++ b/qom/property.c
@@ -61,11 +61,9 @@ static void property_set_str(Object *obj, Visitor *v, void *opaque,
 g_free(value);
 }
 
-static void property_release_str(Object *obj, const char *name,
- void *opaque)
+void object_property_release_g_free(Object *obj, const char *name, void *opaque)
 {
-StringProperty *prop = opaque;
-g_free(prop);
+g_free(opaque);
 }
 
 void object_property_add_str(Object *obj, const char *name,
@@ -82,7 +80,7 @@ void object_property_add_str(Object *obj, const char *name,
 object_property_add(obj, name, "string",
 get ? property_get_str : NULL,
 set ? property_set_str : NULL,
-property_release_str,
+object_property_release_g_free,
 prop, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
@@ -122,13 +120,6 @@ static void property_set_bool(Object *obj, Visitor *v, void *opaque,
 prop->set(obj, value, errp);
 }
 
-static void property_release_bool(Object *obj, const char *name,
-  void *opaque)
-{
-BoolProperty *prop = opaque;
-g_free(prop);
-}
-
 void object_property_add_bool(Object *obj, const char *name,
   bool (*get)(Object *, Error **),
   void (*set)(Object *, bool, Error **),
@@ -143,7 +134,7 @@ void object_property_add_bool(Object *obj, const char *name,
 object_property_add(obj, name, "bool",
 get ? property_get_bool : NULL,
 set ? property_set_bool : NULL,
-property_release_bool,
+object_property_release_g_free,
 prop, &local_err);
 if (local_err) {
 error_propagate(errp, local_err);
-- 
1.8.1.4




[Qemu-devel] [PATCH v2 1/9] qom: Move property helpers to own file

2014-07-02 Thread Alexander Graf
We have accumulated a number of friendly helpers that make registration
of properties easier. However, their number keeps growing and they are
starting to clutter the core object.c file.

So let's move them into a separate C file and thus ensure that we have
room to grow :).

Signed-off-by: Alexander Graf 
---
 backends/hostmem-file.c |   1 +
 backends/hostmem.c  |   1 +
 backends/rng-egd.c  |   1 +
 backends/rng-random.c   |   1 +
 backends/rng.c  |   1 +
 backends/tpm.c  |   1 +
 hw/acpi/ich9.c  |   1 +
 hw/acpi/pcihp.c |   1 +
 hw/acpi/piix4.c |   1 +
 hw/core/machine.c   |   1 +
 hw/core/qdev.c  |   1 +
 hw/i386/acpi-build.c|   1 +
 hw/isa/lpc_ich9.c   |   1 +
 hw/ppc/spapr.c  |   1 +
 include/qom/object.h|  85 ---
 include/qom/property.h  | 104 
 memory.c|   1 +
 qom/Makefile.objs   |   2 +-
 qom/object.c| 197 ++--
 qom/property.c  | 212 
 target-i386/cpu.c   |   1 +
 ui/console.c|   1 +
 22 files changed, 340 insertions(+), 277 deletions(-)
 create mode 100644 include/qom/property.h
 create mode 100644 qom/property.c

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 5179994..5a944a3 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -13,6 +13,7 @@
 #include "sysemu/hostmem.h"
 #include "sysemu/sysemu.h"
 #include "qom/object_interfaces.h"
+#include "qom/property.h"
 
 /* hostmem-file.c */
 /**
diff --git a/backends/hostmem.c b/backends/hostmem.c
index ca10c51..18fc8ba 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -16,6 +16,7 @@
 #include "qapi/qmp/qerror.h"
 #include "qemu/config-file.h"
 #include "qom/object_interfaces.h"
+#include "qom/property.h"
 
 #ifdef CONFIG_NUMA
 #include 
diff --git a/backends/rng-egd.c b/backends/rng-egd.c
index 25bb3b4..f055ac1 100644
--- a/backends/rng-egd.c
+++ b/backends/rng-egd.c
@@ -14,6 +14,7 @@
 #include "sysemu/char.h"
 #include "qapi/qmp/qerror.h"
 #include "hw/qdev.h" /* just for DEFINE_PROP_CHR */
+#include "qom/property.h"
 
 #define TYPE_RNG_EGD "rng-egd"
 #define RNG_EGD(obj) OBJECT_CHECK(RngEgd, (obj), TYPE_RNG_EGD)
diff --git a/backends/rng-random.c b/backends/rng-random.c
index 601d9dc..c7f3ff7 100644
--- a/backends/rng-random.c
+++ b/backends/rng-random.c
@@ -14,6 +14,7 @@
 #include "sysemu/rng.h"
 #include "qapi/qmp/qerror.h"
 #include "qemu/main-loop.h"
+#include "qom/property.h"
 
 struct RndRandom
 {
diff --git a/backends/rng.c b/backends/rng.c
index 0f2fc11..7b82894 100644
--- a/backends/rng.c
+++ b/backends/rng.c
@@ -13,6 +13,7 @@
 #include "sysemu/rng.h"
 #include "qapi/qmp/qerror.h"
 #include "qom/object_interfaces.h"
+#include "qom/property.h"
 
 void rng_backend_request_entropy(RngBackend *s, size_t size,
  EntropyReceiveFunc *receive_entropy,
diff --git a/backends/tpm.c b/backends/tpm.c
index 01860c4..769a9b8 100644
--- a/backends/tpm.c
+++ b/backends/tpm.c
@@ -17,6 +17,7 @@
 #include "sysemu/tpm.h"
 #include "qemu/thread.h"
 #include "sysemu/tpm_backend_int.h"
+#include "qom/property.h"
 
 enum TpmType tpm_backend_get_type(TPMBackend *s)
 {
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index e7d6c77..2c1eb13 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -32,6 +32,7 @@
 #include "hw/acpi/acpi.h"
 #include "sysemu/kvm.h"
 #include "exec/address-spaces.h"
+#include "qom/property.h"
 
 #include "hw/i386/ich9.h"
 #include "hw/mem/pc-dimm.h"
diff --git a/hw/acpi/pcihp.c b/hw/acpi/pcihp.c
index fae663a..641378b 100644
--- a/hw/acpi/pcihp.c
+++ b/hw/acpi/pcihp.c
@@ -36,6 +36,7 @@
 #include "exec/address-spaces.h"
 #include "hw/pci/pci_bus.h"
 #include "qom/qom-qobject.h"
+#include "qom/property.h"
 #include "qapi/qmp/qint.h"
 
 //#define DEBUG
diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
index b72b34e..fc7d5b3 100644
--- a/hw/acpi/piix4.c
+++ b/hw/acpi/piix4.c
@@ -36,6 +36,7 @@
 #include "hw/mem/pc-dimm.h"
 #include "hw/acpi/memory_hotplug.h"
 #include "hw/acpi/acpi_dev_interface.h"
+#include "qom/property.h"
 
 //#define DEBUG
 
diff --git a/hw/core/machine.c b/hw/core/machine.c
index cbba679..c25cc07 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -12,6 +12,7 @@
 
 #include "hw/boards.h"
 #include "qapi/visitor.h"
+#include "qom/property.h"
 
 static char *machine_get_accel(Object *obj, Error **errp)
 {
diff --git a/hw/core/qdev.c b/hw/core/qdev.c
index 3bdda8e..a4dca33 100644
--- a/hw/core/qdev.c
+++ b/hw/core/qdev.c
@@ -35,6 +35,7 @@
 #include "hw/hotplug.h"
 #include "hw/boards.h"
 #include "qapi-event.h"
+#include "qom/property.h"
 
 int qdev_hotplug = 0;
 static bool qdev_hot_added = false;
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index ebc5f03..f95464c 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -51,6 +51,7 @@
 
 #include "qapi/qmp/qint.h"
 #include 

[Qemu-devel] [PATCH v2 6/9] sysbus: Make devices spawnable via -device

2014-07-02 Thread Alexander Graf
Now that we can properly map sysbus devices that haven't been connected to
something forcefully by C code, we can allow the -device command line option
to spawn them.

For machines that don't implement dynamic sysbus assignment in their board
files we add a new bool "has_dynamic_sysbus" to the machine class.
When that flag is false (the default), we bail out when we see dynamically
spawned sysbus devices, like we did before.
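
The check can be pictured with a small standalone model. This is only a
sketch: it assumes a simplified object tree, and Node, contains_sysbus and
machine_accepts are hypothetical names, not QEMU API. The real patch walks the
QOM tree and exits with an error instead of returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node {
    bool is_sysbus;           /* true for a sysbus device */
    const struct Node *child; /* first child, for containers */
    const struct Node *next;  /* next sibling */
} Node;

/*
 * Return true if a sysbus device exists in the sibling list starting
 * at 'n' or anywhere below it.
 */
static bool contains_sysbus(const Node *n)
{
    for (; n != NULL; n = n->next) {
        if (n->is_sysbus || contains_sysbus(n->child)) {
            return true;
        }
    }
    return false;
}

/*
 * A machine accepts the peripherals if it handles dynamic sysbus
 * devices, or if none were created.
 */
static bool machine_accepts(bool has_dynamic_sysbus, const Node *peripherals)
{
    return has_dynamic_sysbus || !contains_sysbus(peripherals);
}
```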

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - use bool in MachineClass rather than property
---
 hw/core/machine.c   | 43 +++
 hw/core/sysbus.c|  7 ---
 include/hw/boards.h |  8 ++--
 vl.c|  1 +
 4 files changed, 50 insertions(+), 9 deletions(-)

diff --git a/hw/core/machine.c b/hw/core/machine.c
index c25cc07..713c9d8 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -13,6 +13,9 @@
 #include "hw/boards.h"
 #include "qapi/visitor.h"
 #include "qom/property.h"
+#include "hw/sysbus.h"
+#include "sysemu/sysemu.h"
+#include "qemu/error-report.h"
 
 static char *machine_get_accel(Object *obj, Error **errp)
 {
@@ -236,8 +239,44 @@ static void machine_set_firmware(Object *obj, const char *value, Error **errp)
 ms->firmware = g_strdup(value);
 }
 
+static int search_for_sysbus_device(Object *obj, void *opaque)
+{
+if (!object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE)) {
+/* Container or different device, traverse it for children */
+return object_child_foreach(obj, search_for_sysbus_device, opaque);
+}
+
+error_report("Device '%s' can not be handled by this machine",
+ qdev_fw_name(DEVICE(obj)));
+exit(1);
+}
+
+static void machine_init_notify(Notifier *notifier, void *data)
+{
+Object *machine = qdev_get_machine();
+ObjectClass *oc = object_get_class(machine);
+MachineClass *mc = MACHINE_CLASS(oc);
+Object *container;
+
+if (mc->has_dynamic_sysbus) {
+/* Our machine can handle dynamic sysbus devices, we're all good */
+return;
+}
+
+/*
+ * Loop through all dynamically created devices and check whether there
+ * are sysbus devices among them. If there are, error out.
+ */
+container = container_get(machine, "/peripheral");
+search_for_sysbus_device(container, NULL);
+container = container_get(machine, "/peripheral-anon");
+search_for_sysbus_device(container, NULL);
+}
+
 static void machine_initfn(Object *obj)
 {
+MachineState *ms = MACHINE(obj);
+
 object_property_add_str(obj, "accel",
 machine_get_accel, machine_set_accel, NULL);
 object_property_add_bool(obj, "kernel_irqchip",
@@ -275,6 +314,10 @@ static void machine_initfn(Object *obj)
 object_property_add_bool(obj, "usb", machine_get_usb, machine_set_usb, NULL);
 object_property_add_str(obj, "firmware",
 machine_get_firmware, machine_set_firmware, NULL);
+
+/* Register notifier when init is done for sysbus sanity checks */
+ms->sysbus_notifier.notify = machine_init_notify;
+qemu_add_machine_init_done_notifier(&ms->sysbus_notifier);
 }
 
 static void machine_finalize(Object *obj)
diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
index e551e16..aacc446 100644
--- a/hw/core/sysbus.c
+++ b/hw/core/sysbus.c
@@ -294,13 +294,6 @@ static void sysbus_device_class_init(ObjectClass *klass, void *data)
 DeviceClass *k = DEVICE_CLASS(klass);
 k->init = sysbus_device_init;
 k->bus_type = TYPE_SYSTEM_BUS;
-/*
- * device_add plugs devices into suitable bus.  For "real" buses,
- * that actually connects the device.  For sysbus, the connections
- * need to be made separately, and device_add can't do that.  The
- * device would be left unconnected, and could not possibly work.
- */
-k->cannot_instantiate_with_device_add_yet = true;
 }
 
 static const TypeInfo sysbus_device_type_info = {
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 605a970..9514c62 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -35,7 +35,8 @@ struct QEMUMachine {
 use_sclp:1,
 no_floppy:1,
 no_cdrom:1,
-no_sdcard:1;
+no_sdcard:1,
+has_dynamic_sysbus:1;
 int is_default;
 const char *default_machine_opts;
 const char *default_boot_order;
@@ -93,7 +94,8 @@ struct MachineClass {
 use_sclp:1,
 no_floppy:1,
 no_cdrom:1,
-no_sdcard:1;
+no_sdcard:1,
+has_dynamic_sysbus:1;
 int is_default;
 const char *default_machine_opts;
 const char *default_boot_order;
@@ -110,6 +112,8 @@ struct MachineClass {
 struct MachineState {
 /*< private >*/
 Object parent_obj;
+Notifier sysbus_notifier;
+
 /*< public >*/
 
 char *accel;
diff --git a/vl.c b/vl.c
index 6e084c2..1df8d4e 100644
--- a/vl.c
+++ b/vl.c
@@ -1569,6 +1569,7 @@ static void machine_class_init(ObjectClass *oc, void *data)
 mc->no_floppy = qm->no_flop

[Qemu-devel] [PATCH v2 7/9] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Alexander Graf
For e500 our approach to supporting dynamically spawned sysbus devices is to
create a simple bus from the guest's point of view within which we map those
devices dynamically.

We always allocate memory regions within the "platform" hole in the address
space and map IRQs to predetermined IRQ lines that are reserved for platform
device usage.

This maps really nicely into device tree logic, so we can just tell the
guest about our virtual simple bus in device tree as well.
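
A rough sketch of how IRQ slots could be handed out from a reserved range,
honouring a user hint when one is present. This is not the code from the
patch: PLAT_NUM_IRQS, PLAT_DYNAMIC and plat_irq_alloc are hypothetical names,
and the number of reserved lines is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLAT_NUM_IRQS 10u        /* hypothetical number of reserved lines */
#define PLAT_DYNAMIC  UINT64_MAX /* "no user hint" sentinel */

/*
 * Claim a platform IRQ slot. 'hint' is either PLAT_DYNAMIC or a slot
 * number requested by the user. 'used' is a bitmap of taken slots.
 * Returns the slot number, or -1 on failure.
 */
static int plat_irq_alloc(uint32_t *used, uint64_t hint)
{
    assert(used != NULL);
    if (hint != PLAT_DYNAMIC) {
        if (hint >= PLAT_NUM_IRQS || (*used & (1u << hint))) {
            return -1; /* out of range or already taken */
        }
        *used |= 1u << hint;
        return (int)hint;
    }
    for (unsigned i = 0; i < PLAT_NUM_IRQS; i++) {
        if (!(*used & (1u << i))) {
            *used |= 1u << i;
            return (int)i;
        }
    }
    return -1; /* no free slot left */
}
```

The slot number is relative to the platform bus. The interrupt number seen by
the guest would then be the first reserved line plus the slot.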

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - access sysbus properties via qom
  - move platform bus definitions to params
  - move platform bus to 36bit address space
  - make naming more consistent
  - remove device_type from platform bus dt node
  - remove id field in dt generation
---
 hw/ppc/e500.c | 251 ++
 hw/ppc/e500.h |   5 ++
 hw/ppc/e500plat.c |   6 ++
 3 files changed, 262 insertions(+)

diff --git a/hw/ppc/e500.c b/hw/ppc/e500.c
index bb2e75f..f7fe41c 100644
--- a/hw/ppc/e500.c
+++ b/hw/ppc/e500.c
@@ -36,6 +36,7 @@
 #include "exec/address-spaces.h"
 #include "qemu/host-utils.h"
 #include "hw/pci-host/ppce500.h"
+#include "qemu/error-report.h"
 
 #define EPAPR_MAGIC                (0x45504150)
 #define BINARY_DEVICE_TREE_FILE    "mpc8544ds.dtb"
@@ -47,6 +48,8 @@
 
 #define RAM_SIZES_ALIGN            (64UL << 20)
 
+#define E500_PLATFORM_BUS_PAGE_SHIFT 12
+
 /* TODO: parameterize */
 #define MPC8544_CCSRBAR_BASE       0xE0000000ULL
 #define MPC8544_CCSRBAR_SIZE       0x00100000ULL
@@ -122,6 +125,76 @@ static void dt_serial_create(void *fdt, unsigned long long offset,
 }
 }
 
+typedef struct PlatformDevtreeData {
+void *fdt;
+const char *mpic;
+int irq_start;
+const char *node;
+} PlatformDevtreeData;
+
+static int sysbus_device_create_devtree(Object *obj, void *opaque)
+{
+PlatformDevtreeData *data = opaque;
+Object *dev;
+SysBusDevice *sbdev;
+bool matched = false;
+
+dev = object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE);
+sbdev = (SysBusDevice *)dev;
+
+if (!sbdev) {
+/* Container, traverse it for children */
+return object_child_foreach(obj, sysbus_device_create_devtree, data);
+}
+
+if (!matched) {
+error_report("Device %s is not supported by this machine yet.",
+ qdev_fw_name(DEVICE(dev)));
+exit(1);
+}
+
+return 0;
+}
+
+static void platform_bus_create_devtree(PPCE500Params *params, void *fdt,
+const char *mpic)
+{
+gchar *node = g_strdup_printf("/platform@%"PRIx64, params->platform_bus_base);
+const char platcomp[] = "qemu,platform\0simple-bus";
+PlatformDevtreeData data;
+Object *container;
+uint64_t addr = params->platform_bus_base;
+uint64_t size = params->platform_bus_size;
+int irq_start = params->platform_bus_first_irq;
+
+/* Create a /platform node that we can put all devices into */
+
+qemu_fdt_add_subnode(fdt, node);
+qemu_fdt_setprop(fdt, node, "compatible", platcomp, sizeof(platcomp));
+
+/* Our platform bus region is less than 32bit big, so 1 cell is enough for
+   address and size */
+qemu_fdt_setprop_cells(fdt, node, "#size-cells", 1);
+qemu_fdt_setprop_cells(fdt, node, "#address-cells", 1);
+qemu_fdt_setprop_cells(fdt, node, "ranges", 0, addr >> 32, addr, size);
+
+qemu_fdt_setprop_phandle(fdt, node, "interrupt-parent", mpic);
+
+/* Loop through all devices and create nodes for known ones */
+
+data.fdt = fdt;
+data.mpic = mpic;
+data.irq_start = irq_start;
+data.node = node;
+
+container = container_get(qdev_get_machine(), "/peripheral");
+sysbus_device_create_devtree(container, &data);
+container = container_get(qdev_get_machine(), "/peripheral-anon");
+sysbus_device_create_devtree(container, &data);
+
+g_free(node);
+}
+
 static int ppce500_load_device_tree(MachineState *machine,
 PPCE500Params *params,
 hwaddr addr,
@@ -379,6 +452,10 @@ static int ppce500_load_device_tree(MachineState *machine,
 qemu_fdt_setprop_cell(fdt, pci, "#address-cells", 3);
 qemu_fdt_setprop_string(fdt, "/aliases", "pci0", pci);
 
+if (params->has_platform_bus) {
+platform_bus_create_devtree(params, fdt, mpic);
+}
+
 params->fixup_devtree(params, fdt);
 
 if (toplevel_compat) {
@@ -618,6 +695,169 @@ static qemu_irq *ppce500_init_mpic(PPCE500Params *params, MemoryRegion *ccsr,
 return mpic;
 }
 
+typedef struct PlatformBusNotifier {
+Notifier notifier;
+MemoryRegion *address_space_mem;
+qemu_irq *mpic;
+PPCE500Params params;
+} PlatformBusNotifier;
+
+typedef struct PlatformBusInitData {
+unsigned long *used_irqs;
+unsigned long *used_mem;
+MemoryRegion *mem;
+qemu_irq *irqs;
+int device_count;
+PPCE500Params *params;
+} PlatformBusInitData;
+
+static int platform_bus

[Qemu-devel] [PATCH v2 8/9] e500: Add support for eTSEC in device tree

2014-07-02 Thread Alexander Graf
This patch adds support for exposing eTSEC devices in the dynamically created
guest facing device tree. This allows us to expose eTSEC devices to guests
without changes in the machine file.

Because we can now tell the guest about eTSEC devices, this patch also allows
the user to create them via -device in the first place.
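
To make the node layout concrete, here is a standalone sketch of the two
pieces of arithmetic involved: naming the node after its register address and
building the interrupt cells. The helpers etsec_node_path and etsec_irq_cells
are hypothetical and use only the C library, not the qemu_fdt_* API.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "<parent>/ethernet@<hex address>" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int etsec_node_path(char *buf, size_t len, const char *parent,
                           uint64_t mmio)
{
    int n;

    assert(buf != NULL && parent != NULL);
    n = snprintf(buf, len, "%s/ethernet@%" PRIx64, parent, mmio);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Fill three (irq, flags) cell pairs; 'flags' is passed through unchanged. */
static void etsec_irq_cells(uint32_t cells[6], uint32_t irq_start,
                            const uint32_t irq[3], uint32_t flags)
{
    for (int i = 0; i < 3; i++) {
        cells[2 * i] = irq_start + irq[i];
        cells[2 * i + 1] = flags;
    }
}
```

With a parent of "/platform" and a register offset of 0x1000, the path comes
out as "/platform/ethernet@1000".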

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - fix device name (base on reg for value after @)
  - use qom properties to fetch mmio and irq props
  - remove useless interrupt-parent
  - make interrupts level triggered
---
 hw/ppc/e500.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/hw/ppc/e500.c b/hw/ppc/e500.c
index f7fe41c..cbabcdf 100644
--- a/hw/ppc/e500.c
+++ b/hw/ppc/e500.c
@@ -37,6 +37,7 @@
 #include "qemu/host-utils.h"
 #include "hw/pci-host/ppce500.h"
 #include "qemu/error-report.h"
+#include "hw/net/fsl_etsec/etsec.h"
 
 #define EPAPR_MAGIC                (0x45504150)
 #define BINARY_DEVICE_TREE_FILE    "mpc8544ds.dtb"
@@ -132,6 +133,37 @@ typedef struct PlatformDevtreeData {
 const char *node;
 } PlatformDevtreeData;
 
+static int create_devtree_etsec(eTSEC *etsec, PlatformDevtreeData *data)
+{
+Object *obj = OBJECT(etsec);
+uint64_t mmio0 = object_property_get_int(obj, "mmio[0]", NULL);
+uint64_t irq0 = object_property_get_int(obj, "irq[0]", NULL);
+uint64_t irq1 = object_property_get_int(obj, "irq[1]", NULL);
+uint64_t irq2 = object_property_get_int(obj, "irq[2]", NULL);
+gchar *node = g_strdup_printf("/platform/ethernet@%"PRIx64, mmio0);
+gchar *group = g_strdup_printf("%s/queue-group", node);
+void *fdt = data->fdt;
+
+qemu_fdt_add_subnode(fdt, node);
+qemu_fdt_setprop_string(fdt, node, "device_type", "network");
+qemu_fdt_setprop_string(fdt, node, "compatible", "fsl,etsec2");
+qemu_fdt_setprop_string(fdt, node, "model", "eTSEC");
+qemu_fdt_setprop(fdt, node, "local-mac-address", etsec->conf.macaddr.a, 6);
+qemu_fdt_setprop_cells(fdt, node, "fixed-link", 0, 1, 1000, 0, 0);
+
+qemu_fdt_add_subnode(fdt, group);
+qemu_fdt_setprop_cells(fdt, group, "reg", mmio0, 0x1000);
+qemu_fdt_setprop_cells(fdt, group, "interrupts",
+data->irq_start + irq0, 0x2,
+data->irq_start + irq1, 0x2,
+data->irq_start + irq2, 0x2);
+
+g_free(node);
+g_free(group);
+
+return 0;
+}
+
 static int sysbus_device_create_devtree(Object *obj, void *opaque)
 {
 PlatformDevtreeData *data = opaque;
@@ -147,6 +179,11 @@ static int sysbus_device_create_devtree(Object *obj, void *opaque)
 return object_child_foreach(obj, sysbus_device_create_devtree, data);
 }
 
+if (object_dynamic_cast(obj, TYPE_ETSEC_COMMON)) {
+create_devtree_etsec(ETSEC_COMMON(dev), data);
+matched = true;
+}
+
 if (!matched) {
 error_report("Device %s is not supported by this machine yet.",
  qdev_fw_name(DEVICE(dev)));
-- 
1.8.1.4




[Qemu-devel] [PATCH v2 2/9] qom: macroify integer property helpers

2014-07-02 Thread Alexander Graf
We have a bunch of nice helpers that allow us to easily register an integer
field as QOM property. However, we have those duplicated for every integer
size available.

This is very cumbersome (and prone to bugs) to work with and extend, so let's
strip out the only difference there is (the size) and generate the actual
functions via a macro.
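
The technique in isolation looks roughly like the sketch below. It uses the
standard ## token-pasting operator rather than QEMU's glue() macro, and the
names DEFINE_GET_PTR and get_<type>_ptr are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generate one getter per integer width; only the value type differs. */
#define DEFINE_GET_PTR(suffix, valtype)                     \
    static uint64_t get_##suffix##_ptr(const void *opaque)  \
    {                                                       \
        assert(opaque != NULL);                             \
        return *(const valtype *)opaque;                    \
    }

DEFINE_GET_PTR(uint8, uint8_t)
DEFINE_GET_PTR(uint16, uint16_t)
DEFINE_GET_PTR(uint32, uint32_t)
DEFINE_GET_PTR(uint64, uint64_t)
```

Adding another width is then a single extra macro invocation instead of a
copied function body.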

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - Rename DECLARE_INTEGER_VISITOR to DECLARE_PROP_SET_GET
  - Make macro take name and type arguments, enabling generation of "bool"
    types later
---
 qom/property.c | 82 +-
 1 file changed, 24 insertions(+), 58 deletions(-)

diff --git a/qom/property.c b/qom/property.c
index 9e7c92a..944daff 100644
--- a/qom/property.c
+++ b/qom/property.c
@@ -151,62 +151,28 @@ void object_property_add_bool(Object *obj, const char *name,
 }
 }
 
-static void property_get_uint8_ptr(Object *obj, Visitor *v,
-   void *opaque, const char *name,
-   Error **errp)
-{
-uint8_t value = *(uint8_t *)opaque;
-visit_type_uint8(v, &value, name, errp);
-}
-
-static void property_get_uint16_ptr(Object *obj, Visitor *v,
-   void *opaque, const char *name,
-   Error **errp)
-{
-uint16_t value = *(uint16_t *)opaque;
-visit_type_uint16(v, &value, name, errp);
-}
-
-static void property_get_uint32_ptr(Object *obj, Visitor *v,
-   void *opaque, const char *name,
-   Error **errp)
-{
-uint32_t value = *(uint32_t *)opaque;
-visit_type_uint32(v, &value, name, errp);
-}
-
-static void property_get_uint64_ptr(Object *obj, Visitor *v,
-   void *opaque, const char *name,
-   Error **errp)
-{
-uint64_t value = *(uint64_t *)opaque;
-visit_type_uint64(v, &value, name, errp);
-}
+#define DECLARE_PROP_SET_GET(name, valtype) \
+ \
+static void glue(glue(property_get_,name),_ptr)(Object *obj, Visitor *v, \
+                                                void *opaque, \
+                                                const char *name, \
+                                                Error **errp) \
+{ \
+    valtype value = *(valtype *)opaque; \
+    glue(visit_type_,name)(v, &value, name, errp); \
+} \
+ \
+void glue(glue(object_property_add_,name),_ptr)(Object *obj, const char *name, \
+                                                const valtype *v, \
+                                                Error **errp) \
+{ \
+    ObjectPropertyAccessor *get = glue(glue(property_get_,name),_ptr); \
+    object_property_add(obj, name, stringify(name), get, NULL, NULL, (void *)v, \
+                        errp); \
+} \
+
+DECLARE_PROP_SET_GET(uint8, uint8_t)
+DECLARE_PROP_SET_GET(uint16, uint16_t)
+DECLARE_PROP_SET_GET(uint32, uint32_t)
+DECLARE_PROP_SET_GET(uint64, uint64_t)
 
-void object_property_add_uint8_ptr(Object *obj, const char *name,
-   const uint8_t *v, Error **errp)
-{
-object_property_add(obj, name, "uint8", property_get_uint8_ptr,
-NULL, NULL, (void *)v, errp);
-}
-
-void object_property_add_uint16_ptr(Object *obj, const char *name,
-const uint16_t *v, Error **errp)
-{
-object_property_add(obj, name, "uint16", property_get_uint16_ptr,
-NULL, NULL, (void *)v, errp);
-}
-
-void object_property_add_uint32_ptr(Object *obj, const char *name,
-const uint32_t *v, Error **errp)
-{
-object_property_add(obj, name, "uint32", property_get_uint32_ptr,
-NULL, NULL, (void *)v, errp);
-}
-
-void object_property_add_uint64_ptr(Object *obj, const char *name,
-const uint64_t *v, Error **errp)
-{
-object_property_add(obj, name, "uint64", property_get_uint64_ptr,
-NULL, NULL, (void *)v, errp);
-}
-- 
1.8.1.4




[Qemu-devel] [PATCH v2 0/9] Dynamic sysbus device allocation support

2014-07-02 Thread Alexander Graf
Platforms without ISA and/or PCI have had a seriously hard time in the dynamic
device creation world of QEMU. Devices on these were modeled as SysBus devices
which can only be instantiated in machine files, not through -device.

Why is that so?

For Sysbus devices we didn't know who should be responsible for mapping them
when the machine file didn't do it. Turns out, the machine file is the perfect
place to map them even when it doesn't create them :).

This patch set lets machine files declare that sysbus devices may be created
via the -device command line option. The machine file can then map such
devices wherever the machine considers fitting.
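
One possible placement policy is first-fit over a page bitmap, sketched below
as standalone C. The window size, the page size and the names PAGE_SHIFT,
NUM_PAGES, pages_free and window_alloc are assumptions for illustration; a
machine is free to use any other policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u /* 4 KiB pages, hypothetical granularity */
#define NUM_PAGES  32u /* hypothetical window size in pages */

/* True if pages [first, first + count) are all free in the bitmap. */
static bool pages_free(uint32_t used, unsigned first, unsigned count)
{
    for (unsigned i = first; i < first + count; i++) {
        if (used & (1u << i)) {
            return false;
        }
    }
    return true;
}

/*
 * First-fit: reserve room for 'size' bytes and return its offset within
 * the window, or UINT64_MAX if nothing fits.
 */
static uint64_t window_alloc(uint32_t *used, uint64_t size)
{
    unsigned count;

    assert(used != NULL && size > 0);
    if (size > ((uint64_t)NUM_PAGES << PAGE_SHIFT)) {
        return UINT64_MAX; /* larger than the whole window */
    }
    count = (unsigned)((size + (1u << PAGE_SHIFT) - 1) >> PAGE_SHIFT);
    for (unsigned first = 0; first + count <= NUM_PAGES; first++) {
        if (pages_free(*used, first, count)) {
            for (unsigned i = first; i < first + count; i++) {
                *used |= 1u << i;
            }
            return (uint64_t)first << PAGE_SHIFT;
        }
    }
    return UINT64_MAX;
}
```

The returned offset is relative to the start of the window, so the guest
physical address would be the window base plus that offset.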

Sometimes users do want to specify manually where to map a device. This is
very useful when you want to have stable offsets in memory and irq space.
This patch set adds support for "user mapping hints" that the machine can use
to map a device at a certain location.

As an example, this patch set enables only the eTSEC device on the e500plat
machine type. Previously this device could not be added to the machine at all.

  $ qemu-system-ppc -nographic -M ppce500 -device eTSEC,netdev=nd \
-netdev user,id=nd

The idea can easily be extended to any sysbus device on any machine type though.


This patch set is based on previous ideas and discussions, most notably:

  https://lists.gnu.org/archive/html/qemu-devel/2013-07/msg03614.html
  https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg00849.html



v1 -> v2:

  - new patch: qom: Move property helpers to own file
  - reworked patch: qom: Expose property helpers for get/set of integers
  - new patch: qom: Add generic object property g_free helper
  - new patch: PPC: Fix default config ordering and add eTSEC for ppc64
  - Rename DECLARE_INTEGER_VISITOR to DECLARE_PROP_SET_GET
  - Make macro take name and type arguments, enabling generation of "bool"
    types later
  - make irq and pio properties uint64
  - ensure qom exposed pointers don't change due to realloc
  - fix sysbus_pass_irq
  - make properties write-once, not write-before-realize
  - make props only available via qom, no state pointers left
  - use bool in MachineClass rather than property
  - access sysbus properties via qom
  - move platform bus definitions to params
  - move platform bus to 36bit address space
  - make naming more consistent
  - remove device_type from platform bus dt node
  - remove id field in dt generation
  - fix device name (base on reg for value after @)
  - use qom properties to fetch mmio and irq props
  - remove useless interrupt-parent
  - make interrupts level triggered

Alexander Graf (9):
  qom: Move property helpers to own file
  qom: macroify integer property helpers
  qom: Expose property helpers for get/set of integers
  qom: Add generic object property g_free helper
  sysbus: Add user map hints
  sysbus: Make devices spawnable via -device
  PPC: e500: Support dynamically spawned sysbus devices
  e500: Add support for eTSEC in device tree
  PPC: Fix default config ordering and add eTSEC for ppc64

 backends/hostmem-file.c           |   1 +
 backends/hostmem.c                |   1 +
 backends/rng-egd.c                |   1 +
 backends/rng-random.c             |   1 +
 backends/rng.c                    |   1 +
 backends/tpm.c                    |   1 +
 default-configs/ppc-softmmu.mak   |   4 +-
 default-configs/ppc64-softmmu.mak |   3 +-
 hw/acpi/ich9.c                    |   1 +
 hw/acpi/pcihp.c                   |   1 +
 hw/acpi/piix4.c                   |   1 +
 hw/core/machine.c                 |  44 ++
 hw/core/qdev.c                    |   1 +
 hw/core/sysbus.c                  |  50 +--
 hw/i386/acpi-build.c              |   1 +
 hw/isa/lpc_ich9.c                 |   1 +
 hw/ppc/e500.c                     | 288 ++
 hw/ppc/e500.h                     |   5 +
 hw/ppc/e500plat.c                 |   6 +
 hw/ppc/spapr.c                    |   1 +
 include/hw/boards.h               |   8 +-
 include/hw/sysbus.h               |   1 +
 include/qom/object.h              |  85 ---
 include/qom/property.h            | 244
 memory.c                          |   1 +
 qom/Makefile.objs                 |   2 +-
 qom/object.c                      | 206 +--
 qom/property.c                    | 179 +++
 target-i386/cpu.c                 |   1 +
 ui/console.c                      |   1 +
 vl.c                              |   1 +
 31 files changed, 842 insertions(+), 300 deletions(-)
 create mode 100644 include/qom/property.h
 create mode 100644 qom/property.c

-- 
1.8.1.4




Re: [Qemu-devel] [PATCH 5/6] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Alexander Graf


On 02.07.14 19:52, Scott Wood wrote:

On Wed, 2014-07-02 at 19:30 +0200, Alexander Graf wrote:

On 02.07.14 19:26, Scott Wood wrote:

On Wed, 2014-07-02 at 19:12 +0200, Alexander Graf wrote:

On 02.07.14 00:50, Scott Wood wrote:

Plus, let's please not hardcode any more addresses that are going to be
a problem for giving guests a large amount of RAM (yes, CCSRBAR is also
blocking that, but that has a TODO to parameterize).  How about
0xfULL?  Unless of course we're emulating an e500v1, which
doesn't support 36-bit physical addressing.  Parameterization would help
there as well.

I don't think we have to worry about e500v1. I'll change it :).

We theoretically support it elsewhere...  Once parameterized, it
shouldn't be hard to base the address for this, CCSRBAR, and similar
things on whether MAS7 is supported.

It gets parametrized in the machine file, CPU selection comes after
machine selection. So parameterizing it doesn't really solve it.

Why can't e500plat_init() look at args->cpu_model?  Or the
parameterization could take two sets of addresses, one for a 32-bit
layout and one for a 36-bit layout.  It might make sense to make this a
user-settable parameter; some OSes might not be able to handle a 36-bit
layout (or might not be as efficient at handling it) even on e500v2.
Many of the e500v2 boards can be built for either a 32 or 36 bit address
layout in U-Boot.


However, again, I don't think we have to worry about it.

It's not a huge worry, but it'd be nice to not break it gratuitously.
If we do break it we should explicitly disallow e500v1 with e500plat.


I'd prefer if we don't overparameterize - it'll just become a headache 
further down. Today we don't explicitly disallow anything anywhere - you 
could theoretically stick a G3 into e500plat. I don't see why we should 
start with heavy sanity checks now :).


Plus, the machine works just fine today if you don't pass in -device 
eTSEC. It's not like we're moving all devices to the new "platform bus".





@@ -122,6 +131,77 @@ static void dt_serial_create(void *fdt, unsigned long long offset,
}
}

+typedef struct PlatformDevtreeData {

+void *fdt;
+const char *mpic;
+int irq_start;
+const char *node;
+int id;
+} PlatformDevtreeData;

What is id?  How does irq_start work?

"id" is just a linear counter over all devices in the platform bus so
that if you need to have a unique identifier, you can have one.

"irq_start" is the offset of the first mpic irq that's connected to the
platform bus.

OK, but why is that here but no irq_end, and no address range?  How do
allocations from the irq range happen?

There are 2 phases:

1) Device association with the machine
2) Device tree generation

The allocation of IRQ ranges happens during the association phase. That
phase also updates all the hints in the devices to reflect their current
IRQ (and MMIO) mappings. The device tree generation phase only needs to
read those bits then - and add the IRQ offset to get from the "platform
bus IRQ range" to "MPIC IRQ range".

I think the answer to my original question is that irqs are allocated
based on zero because they go in an array, while memory regions are
allocated with their actual addresses because they don't.


Memory regions are allocated based on zero as well, they get mapped into 
a subregion. From a device's point of view, the regions for MMIO and 
IRQs that it sees all start at 0 relative to the platform bus.
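
As a rough sketch of that translation (stand-alone C, not the e500.c code;
PlatformBusWindow and both helper names are hypothetical): devices only ever
see bus-relative numbers, and the machine adds the window base when it needs
MPIC IRQ numbers or guest physical addresses.

```c
#include <stdint.h>

/* Hypothetical description of where the platform bus sits in the machine. */
typedef struct {
    uint64_t mmio_base;  /* guest physical address of the bus MMIO window */
    uint32_t irq_start;  /* first MPIC IRQ that is wired to the bus */
} PlatformBusWindow;

/* Bus-relative IRQ (starting at 0) -> MPIC IRQ number. */
static uint32_t platform_irq_to_mpic(const PlatformBusWindow *w,
                                     uint32_t bus_irq)
{
    return w->irq_start + bus_irq;
}

/* Bus-relative MMIO offset (starting at 0) -> guest physical address. */
static uint64_t platform_mmio_to_phys(const PlatformBusWindow *w,
                                      uint64_t bus_offset)
{
    return w->mmio_base + bus_offset;
}
```

With this split the association phase only has to store the bus-relative
values in the device, and the device tree phase applies the offsets once.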





+static int sysbus_device_create_devtree(Object *obj, void *opaque)
+{
+PlatformDevtreeData *data = opaque;
+Object *dev;
+SysBusDevice *sbdev;
+bool matched = false;
+
+dev = object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE);
+sbdev = (SysBusDevice *)dev;
+
+if (!sbdev) {
+/* Container, traverse it for children */
+return object_child_foreach(obj, sysbus_device_create_devtree, data);
+}
+
+if (matched) {
+data->id++;
+} else {
+error_report("Device %s is not supported by this machine yet.",
+ qdev_fw_name(DEVICE(dev)));
+exit(1);
+}
+
+return 0;
+}

It's not clear to me how this function is creating a device tree node.

It's not yet - it's only the stub that allows to plug in specific device
code that then generates device tree nodes :).

How does the plugging in work?

It looks like all this does is increment id.

I'm not sure I understand. The plugging in is different code :). This
really only does increment an id. Maybe I'll just remove it if it
confuses you?

My confusion is that it is called sysbus_device_create_devtree(), not
sysbus_device_alloc_id().  Am I missing some sort of virtual function
mechanism here that would allow this function to be replaced?


I've removed the id bit - hope that makes it more obvious :).



/me looks at patch 6/6 again

Oh, you just add to this function in future patches.  I was expecting
something fancier given the QOM stuff and my misunderstanding about

Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Michael S. Tsirkin
On Wed, Jul 02, 2014 at 06:29:23PM +0200, Paolo Bonzini wrote:
> Il 02/07/2014 17:27, Michael S. Tsirkin ha scritto:
> > At some level, maybe Paolo is right.  Ignore existing drivers and ask
> > intel developers to update their drivers to do something sane on
> > hypervisors, even if they do ugly things on real hardware.
> > 
> > A simple proposal since what I wrote earlier though apparently wasn't
> > very clear:
> > 
> >   Detect Xen subsystem vendor id on vga card.
> >   If there, avoid poking at chipset. Instead
> > - use subsystem device # for card type
> 
> You mean for PCH type (aka PCH device id).
> 
> > - use second half of BAR0 of device
> > - instead of access to pci host
> > 
> > hypervisors will simply take BAR0 and double it in size,
> > make second part map to what would be the pci host.
> 
> Nice.  Detecting the backdoor via the subsystem vendor id
> is clever.
> 
> I'm not sure if it's possible to just double the size of BAR0 
> or not,

Why won't it be?
You just make one bit that is RO in hw RW in QEMU.

> but my laptop has:
> 
>   Region 0: Memory at d000 (64-bit, non-prefetchable) [size=4M]
>   Region 2: Memory at c000 (64-bit, prefetchable) [size=256M]
>   Region 4: I/O ports at 5000 [size=64]
> 
> and I hope we can reserve a few KB for hypervisors within those
> 4M, or 8 bytes for an address/data pair (like cf8/cfc) within BAR4's
> 64 bytes (or grow BAR4 to 128 bytes, or something like that).

Wasting IO isn't a good idea usually.

> Xen can still add the hacky machine type if they want for existing 
> hosts, but this would be a nice way forward.
> 
> Paolo

We need to get agreement from driver writers though,
specifically windows guys.

-- 
MST



Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Michael S. Tsirkin
On Wed, Jul 02, 2014 at 12:05:27PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 02, 2014 at 05:08:43PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jul 02, 2014 at 10:00:33AM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Wed, Jul 02, 2014 at 01:33:09PM +0200, Paolo Bonzini wrote:
> > > > Il 01/07/2014 19:39, Ross Philipson ha scritto:
> > > > >
> > > > >We do IGD pass-through in our project (XenClient). The patches
> > > > >originally came from our project. We surface the same ISA bridge and
> > > > >have never had activation issues on any version of Windows from XP to
> > > > >Win8. We do not normally run server platforms so I can't say for sure
> > > > >there.
> > > > 
> > > > The problem is not activation, the problem is that the patches are making
> > > > assumptions on the driver and the firmware that might work today but are
> > > > IMHO just not sane.
> > > > 
> > > > I would have no problem with a clean patchset that adds a new machine type
> > > > and doesn't touch code in "-M pc", but it looks like mst disagrees.
> > > > Ultimately, if a patchset is too hacky for upstream, you can include it in
> > > > your downstream XenClient (and XenServer) QEMU branch.  It happens.
> > > 
> > > And then this discussion will come back again in a year when folks
> > > rebase and ask: Why hasn't this been done upstream.
> > > 
> > > Then the discussion resumes ..
> > > 
> > > With this long thread I lost a bit of context about the challenges
> > > that exist. But let me try summarizing it here - which will hopefully
> > > get some consensus.
> > 
> > Before I answer could you clarify please:
> > by Southbridge do you mean the PCH at slot 1f or the MCH at slot 0 or both?
> 
> MCH slot. We read/write from this (see intel_setup_mchbar) from a couple of
> registers (0x44 and 0x48 if gen >= 4, otherwise 0x54). It is hard-coded
> in the i915_get_bridge_dev (see ec2a4c3fdc8e82fe82a25d800e85c1ea06b74372)
> as 0:0.0 BDF.
> 
> The PCH (does not matter where it sits) we only use the model:vendor id
> to figure out the pch_type (see intel_detect_pch).
> 
> I don't see why that model:vendor_id can't be exposed via checking the
> type of device:vendor_id of the IGD itself. CC-ing some Intel i915 authors.
> 
> So for the discussion here, when I say Southbridge I mean MCH.

OK so PIIX spec says:

0x10-4F reserved.

So far so good: it is likely harmless to stick something at 0x44 and
0x48, and most guests will very likely just keep ticking.

0x54-0x57 deal with RAM though.
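
A minimal sketch of that check, assuming only the reserved range quoted above
(0x10-0x4F); the function and macro names are hypothetical and this is not
code from QEMU or Xen. It answers whether a config space access lies entirely
inside the reserved window:

```c
#include <stdbool.h>
#include <stdint.h>

/* Reserved config space window as quoted in this mail. */
#define CFG_RESERVED_FIRST 0x10u
#define CFG_RESERVED_LAST  0x4fu

/* True if [offset, offset + len) lies completely in the reserved window. */
static bool cfg_access_in_reserved_range(uint32_t offset, uint32_t len)
{
    if (len == 0 || offset < CFG_RESERVED_FIRST || offset > CFG_RESERVED_LAST) {
        return false;
    }
    /* Bytes remaining in the window, counted from offset. */
    return len <= CFG_RESERVED_LAST - offset + 1;
}
```

Under this assumption 4-byte accesses at 0x44 and 0x48 pass, while one at 0x54
does not.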

Maybe we can just stick to emulating gen >= 4 for now:
detect it on host and fail assignment.
How old is gen 4?





> > 
> > > 1). Fix IGD hardware to not use Southbridge magic addresses.
> > > We can moan and moan but I doubt it is going to change.
> > > 
> > > 2). Since we need the Southbridge magic addresses, we can expose
> > > a bridge. [I think everybody agrees that we need to do
> > > that since 1) is a no go].
> > > 
> > > 3). What kind of bridge. We can do:
> > > 
> > >  a) Two bridges - one 'passthrough' and the legacy ISA bridge
> > > that QEMU emulates. Both Linux and Windows are OK with
> > > two bridges (even though it is pretty weird).
> > > 
> > >  b) One bridge - the one that QEMU emulates - and let's emulate
> > > more of the registers (by emulate - I mean for some get the
> > > data from the real hardware).
> > > 
> > >b1). We can't use the legacy because the registers are
> > > above 256 (is that correct? Did I miss something?)
> > > 
> > >b2)  We would need to use the Q35.
> > > b2a). If we need Q35, that needs to be exposed
> > >   for Xen guests. That means exposing the
> > >   MMCONFIG and restructuring the E820 to fit that
> > >   in.
> > >   Problem:
> > > - Migration is not working with Q35.
> > >   (But for v1 you wouldn't migrate, however
> > >later hardware will surely have SR-IOV so
> > >we will need to migrate).
> > > 
> > > - There are no developers who have an OK
> > >   from their management to focus on this.
> > >   (Potential solution: Poke Intel management to see
> > >   if they can get more developers on it)
> > >   
> > > 
> > > 4). Code does a bit of sysfs that could use some refactoring with
> > > the KVM code.
> > > Problem: More time needed to do the code restructuring.
> > > 
> > > 
> > > Is that about correct?
> > > 
> > > What are folks' timezones and the best days next week to talk about
> > > this on either Google Hangout or the phone?



Re: [Qemu-devel] [PATCH 5/6] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Scott Wood
On Wed, 2014-07-02 at 19:30 +0200, Alexander Graf wrote:
> On 02.07.14 19:26, Scott Wood wrote:
> > On Wed, 2014-07-02 at 19:12 +0200, Alexander Graf wrote:
> >> On 02.07.14 00:50, Scott Wood wrote:
> >>> Plus, let's please not hardcode any more addresses that are going to be
> >>> a problem for giving guests a large amount of RAM (yes, CCSRBAR is also
> >>> blocking that, but that has a TODO to parameterize).  How about
> >>> 0xfULL?  Unless of course we're emulating an e500v1, which
> >>> doesn't support 36-bit physical addressing.  Parameterization would help
> >>> there as well.
> >> I don't think we have to worry about e500v1. I'll change it :).
> > We theoretically support it elsewhere...  Once parameterized, it
> > shouldn't be hard to base the address for this, CCSRBAR, and similar
> > things on whether MAS7 is supported.
> 
> It gets parametrized in the machine file, CPU selection comes after 
> machine selection. So parameterizing it doesn't really solve it.

Why can't e500plat_init() look at args->cpu_model?  Or the
parameterization could take two sets of addresses, one for a 32-bit
layout and one for a 36-bit layout.  It might make sense to make this a
user-settable parameter; some OSes might not be able to handle a 36-bit
layout (or might not be as efficient at handling it) even on e500v2.
Many of the e500v2 boards can be built for either a 32 or 36 bit address
layout in U-Boot.
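
To make the "two sets of addresses" idea concrete, a small stand-alone sketch
(hypothetical names, and the addresses are placeholders picked only for
illustration, not the values e500plat uses):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-layout parameters a machine file could carry. */
typedef struct {
    uint64_t ccsrbar_base;
    uint64_t platform_bus_base;
} BoardLayout;

/* Placeholder values: one set below 4G, one set above it. */
static const BoardLayout layout_32bit = { 0xe0000000ULL,  0xf0000000ULL };
static const BoardLayout layout_36bit = { 0xfe0000000ULL, 0xf00000000ULL };

/*
 * Pick the 36-bit layout only if the CPU has MAS7 and the user did not
 * ask for the 32-bit layout explicitly.
 */
static const BoardLayout *board_pick_layout(bool cpu_has_mas7,
                                            bool user_wants_32bit)
{
    if (!cpu_has_mas7 || user_wants_32bit) {
        return &layout_32bit;
    }
    return &layout_36bit;
}
```

The user_wants_32bit flag stands in for the user-settable parameter mentioned
above, for guests that cannot cope with a 36-bit layout even on e500v2.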

> However, again, I don't think we have to worry about it.

It's not a huge worry, but it'd be nice to not break it gratuitously.
If we do break it we should explicitly disallow e500v1 with e500plat.

>  @@ -122,6 +131,77 @@ static void dt_serial_create(void *fdt, unsigned long long offset,
> }
> }
> 
>  +typedef struct PlatformDevtreeData {
>  +void *fdt;
>  +const char *mpic;
>  +int irq_start;
>  +const char *node;
>  +int id;
>  +} PlatformDevtreeData;
> >>> What is id?  How does irq_start work?
> >> "id" is just a linear counter over all devices in the platform bus so
> >> that if you need to have a unique identifier, you can have one.
> >>
> >> "irq_start" is the offset of the first mpic irq that's connected to the
> >> platform bus.
> > OK, but why is that here but no irq_end, and no address range?  How do
> > allocations from the irq range happen?
> 
> There are 2 phases:
> 
>1) Device association with the machine
>2) Device tree generation
> 
> The allocation of IRQ ranges happens during the association phase. That 
> phase also updates all the hints in the devices to reflect their current 
> IRQ (and MMIO) mappings. The device tree generation phase only needs to 
> read those bits then - and add the IRQ offset to get from the "platform 
> bus IRQ range" to "MPIC IRQ range".

I think the answer to my original question is that irqs are allocated
based on zero because they go in an array, while memory regions are
allocated with their actual addresses because they don't.

>  +static int sysbus_device_create_devtree(Object *obj, void *opaque)
>  +{
>  +PlatformDevtreeData *data = opaque;
>  +Object *dev;
>  +SysBusDevice *sbdev;
>  +bool matched = false;
>  +
>  +dev = object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE);
>  +sbdev = (SysBusDevice *)dev;
>  +
>  +if (!sbdev) {
>  +/* Container, traverse it for children */
>  +return object_child_foreach(obj, sysbus_device_create_devtree, data);
>  +}
>  +
>  +if (matched) {
>  +data->id++;
>  +} else {
>  +error_report("Device %s is not supported by this machine yet.",
>  + qdev_fw_name(DEVICE(dev)));
>  +exit(1);
>  +}
>  +
>  +return 0;
>  +}
> >>> It's not clear to me how this function is creating a device tree node.
> >> It's not yet - it's only the stub that allows to plug in specific device
> >> code that then generates device tree nodes :).
> > How does the plugging in work?
> >
> > It looks like all this does is increment id.
> 
> I'm not sure I understand. The plugging in is different code :). This 
> really only does increment an id. Maybe I'll just remove it if it 
> confuses you?

My confusion is that it is called sysbus_device_create_devtree(), not
sysbus_device_alloc_id().  Am I missing some sort of virtual function
mechanism here that would allow this function to be replaced?

/me looks at patch 6/6 again

Oh, you just add to this function in future patches.  I was expecting
something fancier given the QOM stuff and my misunderstanding about what
file patch 6/6 was touching. :-)

-Scott





Re: [Qemu-devel] [PATCH 6/6] e500: Add support for eTSEC in device tree

2014-07-02 Thread Alexander Graf


On 02.07.14 19:32, Scott Wood wrote:

On Wed, 2014-07-02 at 19:24 +0200, Alexander Graf wrote:

On 02.07.14 00:56, Scott Wood wrote:

On Tue, 2014-07-01 at 23:49 +0200, Alexander Graf wrote:

This patch adds support to expose eTSEC devices in the dynamically created
guest facing device tree. This allows us to expose eTSEC devices into guests
without changes in the machine file.

Because we can now tell the guest about eTSEC devices this patch allows the
user to specify eTSEC devices via -device at all.

Signed-off-by: Alexander Graf 
---
   hw/ppc/e500.c | 34 ++
   1 file changed, 34 insertions(+)

diff --git a/hw/ppc/e500.c b/hw/ppc/e500.c
index bf704b0..bebff6f 100644
--- a/hw/ppc/e500.c
+++ b/hw/ppc/e500.c
@@ -37,6 +37,7 @@
   #include "qemu/host-utils.h"
   #include "hw/pci-host/ppce500.h"
   #include "qemu/error-report.h"
+#include "hw/net/fsl_etsec/etsec.h"
   
   #define EPAPR_MAGIC                (0x45504150)
   #define BINARY_DEVICE_TREE_FILE    "mpc8544ds.dtb"
@@ -139,6 +140,34 @@ typedef struct PlatformDevtreeData {
   int id;
   } PlatformDevtreeData;
   
+static int create_devtree_etsec(eTSEC *etsec, PlatformDevtreeData *data)

+{
+SysBusDevice *sbdev = &etsec->busdev;
+gchar *node = g_strdup_printf("/platform/ethernet@%d", data->id);

The unit address is supposed to match reg.  It's not an arbitrary
disambiguator.

So what do we do in case we don't have any reg, but only an IRQ line? Oh
well - I guess we can cross that line when we get to it.

To be theoretically correct (i.e. something that wouldn't break if used
in a real Open Firmware) you'd either leave out the unit address and put
the disambiguation directly in the name, or have a zero-length reg that
corresponds to ranges or a child node's reg.

If you just want to match what we currently do in the real hardware fdt,
use the reg of the first group node.


+qemu_fdt_setprop_cells(fdt, group, "interrupts",
+data->irq_start + sbdev->user_irqs[0], 0x0,
+data->irq_start + sbdev->user_irqs[1], 0x0,
+data->irq_start + sbdev->user_irqs[2], 0x0);

Are we still using two-cell interrupt specifiers?  If so, we should
switch before the assumption gets encoded into random device files.

Random device files should never get any device tree bits encoded.
Device tree generation is the responsibility of the machine file.

Sigh.  I missed that this is in e500.c rather than the eTSEC file.  This
approach will not scale if we ever have multiple platforms wanting to
create device trees with overlap in the devices they want to describe.


It has to - device trees differ too much between architectures (and 
potentially even machines) to have them reasonably live in device files.



Alex




Re: [Qemu-devel] [PATCH 6/6] e500: Add support for eTSEC in device tree

2014-07-02 Thread Scott Wood
On Wed, 2014-07-02 at 19:24 +0200, Alexander Graf wrote:
> On 02.07.14 00:56, Scott Wood wrote:
> > On Tue, 2014-07-01 at 23:49 +0200, Alexander Graf wrote:
> >> This patch adds support to expose eTSEC devices in the dynamically created
> >> guest facing device tree. This allows us to expose eTSEC devices into guests
> >> without changes in the machine file.
> >>
> >> Because we can now tell the guest about eTSEC devices this patch allows the
> >> user to specify eTSEC devices via -device at all.
> >>
> >> Signed-off-by: Alexander Graf 
> >> ---
> >>   hw/ppc/e500.c | 34 ++
> >>   1 file changed, 34 insertions(+)
> >>
> >> diff --git a/hw/ppc/e500.c b/hw/ppc/e500.c
> >> index bf704b0..bebff6f 100644
> >> --- a/hw/ppc/e500.c
> >> +++ b/hw/ppc/e500.c
> >> @@ -37,6 +37,7 @@
> >>   #include "qemu/host-utils.h"
> >>   #include "hw/pci-host/ppce500.h"
> >>   #include "qemu/error-report.h"
> >> +#include "hw/net/fsl_etsec/etsec.h"
> >>   
> >>   #define EPAPR_MAGIC                (0x45504150)
> >>   #define BINARY_DEVICE_TREE_FILE    "mpc8544ds.dtb"
> >> @@ -139,6 +140,34 @@ typedef struct PlatformDevtreeData {
> >>   int id;
> >>   } PlatformDevtreeData;
> >>   
> >> +static int create_devtree_etsec(eTSEC *etsec, PlatformDevtreeData *data)
> >> +{
> >> +SysBusDevice *sbdev = &etsec->busdev;
> >> +gchar *node = g_strdup_printf("/platform/ethernet@%d", data->id);
> > The unit address is supposed to match reg.  It's not an arbitrary
> > disambiguator.
> 
> So what do we do in case we don't have any reg, but only an IRQ line? Oh 
> well - I guess we can cross that line when we get to it.

To be theoretically correct (i.e. something that wouldn't break if used
in a real Open Firmware) you'd either leave out the unit address and put
the disambiguation directly in the name, or have a zero-length reg that
corresponds to ranges or a child node's reg.

If you just want to match what we currently do in the real hardware fdt,
use the reg of the first group node.
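
For example, a node name derived from reg could be built like this (a
stand-alone sketch; fdt_node_path() is a hypothetical helper, not an existing
QEMU function). The unit address is the reg value in hex without a 0x prefix:

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "<parent>/<name>@<reg>" into buf, with reg printed in hex.
 * Returns what snprintf() returns: the length the full string needs.
 */
static int fdt_node_path(char *buf, size_t len, const char *parent,
                         const char *name, uint64_t reg)
{
    return snprintf(buf, len, "%s/%s@%" PRIx64, parent, name, reg);
}
```

Called with "/platform", "ethernet" and a reg of 0x1000 this yields
"/platform/ethernet@1000", so two eTSECs at different offsets get distinct
names without needing a separate counter.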

> >> +qemu_fdt_setprop_cells(fdt, group, "interrupts",
> >> +data->irq_start + sbdev->user_irqs[0], 0x0,
> >> +data->irq_start + sbdev->user_irqs[1], 0x0,
> >> +data->irq_start + sbdev->user_irqs[2], 0x0);
> > Are we still using two-cell interrupt specifiers?  If so, we should
> > switch before the assumption gets encoded into random device files.
> 
> Random device files should never get any device tree bits encoded. 
> Device tree generation is the responsibility of the machine file.

Sigh.  I missed that this is in e500.c rather than the eTSEC file.  This
approach will not scale if we ever have multiple platforms wanting to
create device trees with overlap in the devices they want to describe.

-Scott





Re: [Qemu-devel] [PATCH 5/6] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Alexander Graf


On 02.07.14 19:26, Scott Wood wrote:

On Wed, 2014-07-02 at 19:12 +0200, Alexander Graf wrote:

On 02.07.14 00:50, Scott Wood wrote:

Plus, let's please not hardcode any more addresses that are going to be
a problem for giving guests a large amount of RAM (yes, CCSRBAR is also
blocking that, but that has a TODO to parameterize).  How about
0xfULL?  Unless of course we're emulating an e500v1, which
doesn't support 36-bit physical addressing.  Parameterization would help
there as well.

I don't think we have to worry about e500v1. I'll change it :).

We theoretically support it elsewhere...  Once parameterized, it
shouldn't be hard to base the address for this, CCSRBAR, and similar
things on whether MAS7 is supported.


It gets parametrized in the machine file, CPU selection comes after 
machine selection. So parameterizing it doesn't really solve it.


However, again, I don't think we have to worry about it.




@@ -122,6 +131,77 @@ static void dt_serial_create(void *fdt, unsigned long long offset,
   }
   }
   
+typedef struct PlatformDevtreeData {

+void *fdt;
+const char *mpic;
+int irq_start;
+const char *node;
+int id;
+} PlatformDevtreeData;

What is id?  How does irq_start work?

"id" is just a linear counter over all devices in the platform bus so
that if you need to have a unique identifier, you can have one.

"irq_start" is the offset of the first mpic irq that's connected to the
platform bus.

OK, but why is that here but no irq_end, and no address range?  How do
allocations from the irq range happen?


There are 2 phases:

  1) Device association with the machine
  2) Device tree generation

The allocation of IRQ ranges happens during the association phase. That 
phase also updates all the hints in the devices to reflect their current 
IRQ (and MMIO) mappings. The device tree generation phase only needs to 
read those bits then - and add the IRQ offset to get from the "platform 
bus IRQ range" to "MPIC IRQ range".





+static int sysbus_device_create_devtree(Object *obj, void *opaque)
+{
+PlatformDevtreeData *data = opaque;
+Object *dev;
+SysBusDevice *sbdev;
+bool matched = false;
+
+dev = object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE);
+sbdev = (SysBusDevice *)dev;
+
+if (!sbdev) {
+/* Container, traverse it for children */
+return object_child_foreach(obj, sysbus_device_create_devtree, data);
+}
+
+if (matched) {
+data->id++;
+} else {
+error_report("Device %s is not supported by this machine yet.",
+ qdev_fw_name(DEVICE(dev)));
+exit(1);
+}
+
+return 0;
+}

It's not clear to me how this function is creating a device tree node.

It's not yet - it's only the stub that allows to plug in specific device
code that then generates device tree nodes :).

How does the plugging in work?

It looks like all this does is increment id.


I'm not sure I understand. The plugging in is different code :). This 
really only does increment an id. Maybe I'll just remove it if it 
confuses you?



Alex




Re: [Qemu-devel] [PATCH 5/6] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Scott Wood
On Wed, 2014-07-02 at 19:12 +0200, Alexander Graf wrote:
> On 02.07.14 00:50, Scott Wood wrote:
> > Plus, let's please not hardcode any more addresses that are going to be
> > a problem for giving guests a large amount of RAM (yes, CCSRBAR is also
> > blocking that, but that has a TODO to parameterize).  How about
> > 0xfULL?  Unless of course we're emulating an e500v1, which
> > doesn't support 36-bit physical addressing.  Parameterization would help
> > there as well.
> 
> I don't think we have to worry about e500v1. I'll change it :).

We theoretically support it elsewhere...  Once parameterized, it
shouldn't be hard to base the address for this, CCSRBAR, and similar
things on whether MAS7 is supported.

> >> @@ -122,6 +131,77 @@ static void dt_serial_create(void *fdt, unsigned long long offset,
> >>   }
> >>   }
> >>   
> >> +typedef struct PlatformDevtreeData {
> >> +void *fdt;
> >> +const char *mpic;
> >> +int irq_start;
> >> +const char *node;
> >> +int id;
> >> +} PlatformDevtreeData;
> > What is id?  How does irq_start work?
> 
> "id" is just a linear counter over all devices in the platform bus so 
> that if you need to have a unique identifier, you can have one.
> 
> "irq_start" is the offset of the first mpic irq that's connected to the 
> platform bus.

OK, but why is that here but no irq_end, and no address range?  How do
allocations from the irq range happen?

> >> +static int sysbus_device_create_devtree(Object *obj, void *opaque)
> >> +{
> >> +PlatformDevtreeData *data = opaque;
> >> +Object *dev;
> >> +SysBusDevice *sbdev;
> >> +bool matched = false;
> >> +
> >> +dev = object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE);
> >> +sbdev = (SysBusDevice *)dev;
> >> +
> >> +if (!sbdev) {
> >> +/* Container, traverse it for children */
> >> +return object_child_foreach(obj, sysbus_device_create_devtree, data);
> >> +}
> >> +
> >> +if (matched) {
> >> +data->id++;
> >> +} else {
> >> +error_report("Device %s is not supported by this machine yet.",
> >> + qdev_fw_name(DEVICE(dev)));
> >> +exit(1);
> >> +}
> >> +
> >> +return 0;
> >> +}
> > It's not clear to me how this function is creating a device tree node.
> 
> It's not yet - it's only the stub that allows to plug in specific device 
> code that then generates device tree nodes :).

How does the plugging in work?

It looks like all this does is increment id.

-Scott





Re: [Qemu-devel] [PATCH 6/6] e500: Add support for eTSEC in device tree

2014-07-02 Thread Alexander Graf


On 02.07.14 00:56, Scott Wood wrote:

On Tue, 2014-07-01 at 23:49 +0200, Alexander Graf wrote:

This patch adds support to expose eTSEC devices in the dynamically created
guest facing device tree. This allows us to expose eTSEC devices into guests
without changes in the machine file.

Because we can now tell the guest about eTSEC devices this patch allows the
user to specify eTSEC devices via -device at all.

Signed-off-by: Alexander Graf 
---
  hw/ppc/e500.c | 34 ++
  1 file changed, 34 insertions(+)

diff --git a/hw/ppc/e500.c b/hw/ppc/e500.c
index bf704b0..bebff6f 100644
--- a/hw/ppc/e500.c
+++ b/hw/ppc/e500.c
@@ -37,6 +37,7 @@
  #include "qemu/host-utils.h"
  #include "hw/pci-host/ppce500.h"
  #include "qemu/error-report.h"
+#include "hw/net/fsl_etsec/etsec.h"
  
  #define EPAPR_MAGIC                (0x45504150)
  #define BINARY_DEVICE_TREE_FILE    "mpc8544ds.dtb"
@@ -139,6 +140,34 @@ typedef struct PlatformDevtreeData {
  int id;
  } PlatformDevtreeData;
  
+static int create_devtree_etsec(eTSEC *etsec, PlatformDevtreeData *data)

+{
+SysBusDevice *sbdev = &etsec->busdev;
+gchar *node = g_strdup_printf("/platform/ethernet@%d", data->id);

The unit address is supposed to match reg.  It's not an arbitrary
disambiguator.


So what do we do in case we don't have any reg, but only an IRQ line? Oh 
well - I guess we can cross that line when we get to it.





+gchar *group = g_strdup_printf("%s/queue-group", node);
+void *fdt = data->fdt;
+
+qemu_fdt_add_subnode(fdt, node);
+qemu_fdt_setprop_string(fdt, node, "device_type", "network");
+qemu_fdt_setprop_string(fdt, node, "compatible", "fsl,etsec2");
+qemu_fdt_setprop_string(fdt, node, "model", "eTSEC");
+qemu_fdt_setprop(fdt, node, "local-mac-address", etsec->conf.macaddr.a, 6);
+qemu_fdt_setprop_cells(fdt, node, "fixed-link", 0, 1, 1000, 0, 0);
+
+qemu_fdt_add_subnode(fdt, group);
+qemu_fdt_setprop_cells(fdt, group, "reg", sbdev->user_mmios[0], 0x1000);
+qemu_fdt_setprop_phandle(fdt, group, "interrupt-parent", data->mpic);

Why not do interrupt-parent in the parent node, or top of tree?


Parent sounds appealing :). In fact, it's already there - this copy is 
simply useless.





+qemu_fdt_setprop_cells(fdt, group, "interrupts",
+data->irq_start + sbdev->user_irqs[0], 0x0,
+data->irq_start + sbdev->user_irqs[1], 0x0,
+data->irq_start + sbdev->user_irqs[2], 0x0);

Are we still using two-cell interrupt specifiers?  If so, we should
switch before the assumption gets encoded into random device files.


Random device files should never get any device tree bits encoded. 
Device tree generation is the responsibility of the machine file.


So we can easily convert the whole thing to the 4-cell when we start to 
support different interrupt types :)



Also, why are these interrupts edge triggered?


Good catch - they're 0x2 on real hardware.
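
For reference, a sketch of how the two-cell specifiers could be assembled with
the sense value as a parameter instead of a hardcoded 0x0. This is stand-alone
C with a hypothetical helper name; it leaves out the big-endian conversion
that real FDT cells need.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill "cells" with n two-cell interrupt specifiers: <irq number, sense>.
 * bus_irqs[] holds platform-bus-relative IRQs, irq_start is the MPIC offset.
 * The caller must provide room for 2 * n cells.
 * Returns the number of cells written.
 */
static size_t fill_interrupt_cells(uint32_t *cells, uint32_t irq_start,
                                   const uint32_t *bus_irqs, size_t n,
                                   uint32_t sense)
{
    for (size_t i = 0; i < n; i++) {
        cells[2 * i]     = irq_start + bus_irqs[i];
        cells[2 * i + 1] = sense;
    }
    return 2 * n;
}
```

Passing 0x2 as sense for all three eTSEC interrupts would match the value
mentioned above for real hardware.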


Alex




[Qemu-devel] [PATCH 03/10] mm: PT lock: export double_pt_lock/unlock

2014-07-02 Thread Andrea Arcangeli
Those two helpers are needed by remap_anon_pages.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h |  4
 mm/fremap.c        | 29 +
 2 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00faeda..0a7f0e1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1401,6 +1401,10 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
+/* mm/fremap.c */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
diff --git a/mm/fremap.c b/mm/fremap.c
index 72b8fa3..1e509f7 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -281,3 +281,32 @@ out_freed:
 
 	return err;
 }
+
+void double_pt_lock(spinlock_t *ptl1,
+		    spinlock_t *ptl2)
+	__acquires(ptl1)
+	__acquires(ptl2)
+{
+	spinlock_t *ptl_tmp;
+
+	if (ptl1 > ptl2) {
+		/* exchange ptl1 and ptl2 */
+		ptl_tmp = ptl1;
+		ptl1 = ptl2;
+		ptl2 = ptl_tmp;
+	}
+	/* lock in virtual address order to avoid lock inversion */
+	spin_lock(ptl1);
+	if (ptl1 != ptl2)
+		spin_lock_nested(ptl2, SINGLE_DEPTH_NESTING);
+}
+
+void double_pt_unlock(spinlock_t *ptl1,
+		      spinlock_t *ptl2)
+	__releases(ptl1)
+	__releases(ptl2)
+{
+	spin_unlock(ptl1);
+	if (ptl1 != ptl2)
+		spin_unlock(ptl2);
+}



[Qemu-devel] [PATCH for-2.1] PPC: Fix booke206 TLB with phys addrs > 32bit

2014-07-02 Thread Alexander Graf
We were truncating physical addresses to 32bit when using qemu-system-ppc
with a booke206 TLB implementation. This patch fixes that and makes the full
address space available.
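
To make the truncation concrete, here is a minimal stand-alone sketch. It is
not the QEMU code: the type names guest_ulong and phys_addr, both function
names, and the assumption that the real address is formed as
(rpn & mask) | (ea & ~mask) are illustrative only. It shows how a page mask
held in a 32bit variable zero-extends and wipes the upper address bits.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins, not the QEMU typedefs. */
typedef uint32_t guest_ulong;   /* 32bit guest virtual address */
typedef uint64_t phys_addr;     /* physical address, up to 64bit */

/*
 * Page mask kept in a 32bit variable: when it is combined with the
 * 64bit rpn it zero-extends, so bits 32 and up of the result are lost.
 */
static phys_addr real_addr_narrow_mask(phys_addr rpn, guest_ulong ea,
                                       phys_addr page_size)
{
    assert(page_size && (page_size & (page_size - 1)) == 0);
    guest_ulong mask = (guest_ulong)~(page_size - 1);

    return (rpn & mask) | (ea & ~mask);
}

/* Same computation with a mask as wide as the physical address. */
static phys_addr real_addr_wide_mask(phys_addr rpn, guest_ulong ea,
                                     phys_addr page_size)
{
    assert(page_size && (page_size & (page_size - 1)) == 0);
    phys_addr mask = ~(page_size - 1);

    return (rpn & mask) | (ea & ~mask);
}
```

With a 4k page, rpn 0xF12345000 and ea 0x10000678, the narrow variant yields
0x12345678 while the wide one keeps bit 32 and above and yields 0xF12345678.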

Signed-off-by: Alexander Graf 
---
 target-ppc/mmu_helper.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/target-ppc/mmu_helper.c b/target-ppc/mmu_helper.c
index 4d6b1e2..4a34a73 100644
--- a/target-ppc/mmu_helper.c
+++ b/target-ppc/mmu_helper.c
@@ -897,10 +897,10 @@ static hwaddr booke206_tlb_to_page_size(CPUPPCState *env,
 
 /* TLB check function for MAS based SoftTLBs */
 static int ppcmas_tlb_check(CPUPPCState *env, ppcmas_tlb_t *tlb,
-hwaddr *raddrp,
- target_ulong address, uint32_t pid)
+hwaddr *raddrp, target_ulong address,
+uint32_t pid)
 {
-target_ulong mask;
+hwaddr mask;
 uint32_t tlb_pid;
 
 if (!msr_cm) {
-- 
1.8.1.4




Re: [Qemu-devel] [PATCH 5/6] PPC: e500: Support dynamically spawned sysbus devices

2014-07-02 Thread Alexander Graf


On 02.07.14 00:50, Scott Wood wrote:

On Tue, 2014-07-01 at 23:49 +0200, Alexander Graf wrote:

For e500 our approach to supporting dynamically spawned sysbus devices is to
create a simple bus from the guest's point of view within which we map those
devices dynamically.

We allocate memory regions always within the "platform" hole in address
space and map IRQs to predetermined IRQ lines that are reserved for platform
device usage.

This maps really nicely into device tree logic, so we can just tell the
guest about our virtual simple bus in device tree as well.

Signed-off-by: Alexander Graf 
---
  hw/ppc/e500.c | 251 ++
  hw/ppc/e500.h |   1 +
  hw/ppc/e500plat.c |   1 +
  3 files changed, 253 insertions(+)

diff --git a/hw/ppc/e500.c b/hw/ppc/e500.c
index bb2e75f..bf704b0 100644
--- a/hw/ppc/e500.c
+++ b/hw/ppc/e500.c
@@ -36,6 +36,7 @@
  #include "exec/address-spaces.h"
  #include "qemu/host-utils.h"
  #include "hw/pci-host/ppce500.h"
+#include "qemu/error-report.h"
  
  #define EPAPR_MAGIC                (0x45504150)
  #define BINARY_DEVICE_TREE_FILE    "mpc8544ds.dtb"
@@ -47,6 +48,14 @@
  
  #define RAM_SIZES_ALIGN            (64UL << 20)
  
+#define E500_PLATFORM_BASE 0xF0000000ULL

Should the IRQ and address range be parameterized?  Even if the platform
bus is going to be restricted to only e500plat, it seems like it would
be good to keep all assumptions that are e500plat-specific inside the
e500plat file.


Good idea. The only thing I'll leave here is the page_shift. The fact 
that we only allocate in 4k chunks is inherently implementation specific.



Plus, let's please not hardcode any more addresses that are going to be
a problem for giving guests a large amount of RAM (yes, CCSRBAR is also
blocking that, but that has a TODO to parameterize).  How about
0xf00000000ULL?  Unless of course we're emulating an e500v1, which
doesn't support 36-bit physical addressing.  Parameterization would help
there as well.


I don't think we have to worry about e500v1. I'll change it :).




+#define E500_PLATFORM_HOLE         (128ULL * 1024 * 1024) /* 128 MB */
+#define E500_PLATFORM_PAGE_SHIFT   12
+#define E500_PLATFORM_HOLE_PAGES   (E500_PLATFORM_HOLE >> \
+                                    E500_PLATFORM_PAGE_SHIFT)
+#define E500_PLATFORM_FIRST_IRQ    5
+#define E500_PLATFORM_NUM_IRQS     10

What is the "hole"?  If that's meant to be the size to go along with
E500_PLATFORM_BASE, that seems like odd terminology.


True - renamed to "size".




+
  /* TODO: parameterize */
  #define MPC8544_CCSRBAR_BASE       0xE0000000ULL
  #define MPC8544_CCSRBAR_SIZE       0x00100000ULL
@@ -122,6 +131,77 @@ static void dt_serial_create(void *fdt, unsigned long long offset,
  }
  }
  
+typedef struct PlatformDevtreeData {

+void *fdt;
+const char *mpic;
+int irq_start;
+const char *node;
+int id;
+} PlatformDevtreeData;

What is id?  How does irq_start work?


"id" is just a linear counter over all devices in the platform bus so 
that if you need to have a unique identifier, you can have one.


"irq_start" is the offset of the first mpic irq that's connected to the 
platform bus.





+static int sysbus_device_create_devtree(Object *obj, void *opaque)
+{
+PlatformDevtreeData *data = opaque;
+Object *dev;
+SysBusDevice *sbdev;
+bool matched = false;
+
+dev = object_dynamic_cast(obj, TYPE_SYS_BUS_DEVICE);
+sbdev = (SysBusDevice *)dev;
+
+if (!sbdev) {
+/* Container, traverse it for children */
+return object_child_foreach(obj, sysbus_device_create_devtree, data);
+}
+
+if (matched) {
+data->id++;
+} else {
+error_report("Device %s is not supported by this machine yet.",
+ qdev_fw_name(DEVICE(dev)));
+exit(1);
+}
+
+return 0;
+}

It's not clear to me how this function is creating a device tree node.


It's not yet - it's only the stub that lets us plug in specific device 
code that then generates device tree nodes :).





+
+static void platform_create_devtree(void *fdt, const char *node, uint64_t addr,
+const char *mpic, int irq_start,
+int nr_irqs)
+{
+const char platcomp[] = "qemu,platform\0simple-bus";
+PlatformDevtreeData data;
+Object *container;
+
+/* Create a /platform node that we can put all devices into */
+
+qemu_fdt_add_subnode(fdt, node);
+qemu_fdt_setprop(fdt, node, "compatible", platcomp, sizeof(platcomp));
+qemu_fdt_setprop_string(fdt, node, "device_type", "platform");

Where did this device_type come from?

device_type is deprecated and new uses should not be introduced.


Fair enough, will remove it :)




diff --git a/hw/ppc/e500.h b/hw/ppc/e500.h
index 08b25fa..3a588ed 100644
--- a/hw/ppc/e500.h
+++ b/hw/ppc/e500.h
@@ -11,6 +11,7 @@ typedef struct PPCE500Params {
  void (*fixup_devtree)(struct PPCE500P

[Qemu-devel] [PATCH 06/10] mm: sys_remap_anon_pages

2014-07-02 Thread Andrea Arcangeli
This new syscall will move anon pages across vmas, atomically and
without touching the vmas.

It only works on non shared anonymous pages because those can be
relocated without generating non linear anon_vmas in the rmap code.

It is the ideal mechanism to handle userspace page faults. Normally
the destination vma will have VM_USERFAULT set with
madvise(MADV_USERFAULT) while the source vma will normally have
VM_DONTCOPY set with madvise(MADV_DONTFORK).

MADV_DONTFORK set in the source vma prevents remap_anon_pages from
failing if the process forks during the userland page fault.

The thread triggering the sigbus signal handler by touching an
unmapped hole in the MADV_USERFAULT region, should take care to
receive the data belonging in the faulting virtual address in the
source vma. The data can come from the network, storage or any other
I/O device. After the data has been safely received in the private
area in the source vma, it will call remap_anon_pages to map the page
in the faulting address in the destination vma atomically. And finally
it will return from the signal handler.

It is an alternative to mremap.

It only works if the vma protection bits are identical in the source
and destination vma.

It can remap non shared anonymous pages within the same vma too.

If the source virtual memory range has any unmapped holes, or if the
destination virtual memory range is not a whole unmapped hole,
remap_anon_pages will fail respectively with -ENOENT or -EEXIST. This
provides a very strict behavior to avoid any chance of memory
corruption going unnoticed if there are userland race conditions. Only
one thread should resolve the userland page fault at any given time
for any given faulting address. This means that if two threads try to
both call remap_anon_pages on the same destination address at the same
time, the second thread will get an explicit error from this syscall.

The syscall will return "len" if successful. The syscall however
can be interrupted by fatal signals or errors. If interrupted it will
return the number of bytes successfully remapped before the
interruption if any, or the negative error if none. It will never
return zero. Either it will return an error or an amount of bytes
successfully moved. If the retval reports a "short" remap, the
remap_anon_pages syscall should be repeated by userland with
src+retval, dst+retval, len-retval if it wants to know about the error
that interrupted it.
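
The retry rule above could be wrapped in a small helper. This is only a
sketch: remap_fn stands in for the syscall so the loop can be exercised
without a patched kernel, remap_all and fake_remap are made-up names, and
the (dst, src, len, flags) argument order is taken from the test program
further down.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for remap_anon_pages: returns the number of bytes remapped
 * (always > 0) or a negative error, never zero.
 */
typedef long (*remap_fn)(unsigned long dst, unsigned long src,
                         unsigned long len, unsigned long flags);

/*
 * Repeat the call after every "short" remap until the whole range has
 * been moved or the error that interrupted it shows up.
 * Returns 0 on success or the negative error.
 */
static long remap_all(remap_fn remap, unsigned long dst,
                      unsigned long src, unsigned long len)
{
    assert(remap && len > 0);
    while (len > 0) {
        long ret = remap(dst, src, len, 0);

        if (ret < 0)
            return ret;
        assert(ret > 0 && (unsigned long)ret <= len);
        dst += ret;
        src += ret;
        len -= ret;
    }
    return 0;
}

/* Fake that moves at most one 4k page per call to provoke short remaps. */
static unsigned long fake_calls;

static long fake_remap(unsigned long dst, unsigned long src,
                       unsigned long len, unsigned long flags)
{
    (void)dst;
    (void)src;
    (void)flags;
    fake_calls++;
    return len < 4096 ? (long)len : 4096;
}
```

Passing fake_remap with a three page range makes remap_all issue three
calls before it returns 0.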

The RAP_ALLOW_SRC_HOLES flag can be specified to prevent -ENOENT
errors from materializing if there are holes in the source virtual range
that is being remapped. The holes will be accounted as successfully
remapped in the retval of the syscall. This is mostly useful to remap
hugepage naturally aligned virtual regions without knowing if there
are transparent hugepages in the regions or not, while avoiding the
risk of having to split the hugepmd during the remap.

The main difference with mremap is that if used to fill holes in
unmapped anonymous memory vmas (if used in combination with
MADV_USERFAULT) remap_anon_pages won't create lots of unmergeable
vmas. mremap instead would create lots of vmas (because of non linear
vma->vm_pgoff) leading to -ENOMEM failures (the number of vmas is
limited).

MADV_USERFAULT and remap_anon_pages() can be tested with a program
like below:

===
 #define _GNU_SOURCE
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 

 #define USE_USERFAULT
 #define THP

 #define MADV_USERFAULT 18

 #define SIZE (1024*1024*1024)

 #define SYS_remap_anon_pages 317

 static volatile unsigned char *c, *tmp;

 void userfault_sighandler(int signum, siginfo_t *info, void *ctx)
 {
	unsigned char *addr = info->si_addr;
	int len = 4096;
	int ret;

	addr = (unsigned char *) ((unsigned long) addr & ~((getpagesize())-1));
 #ifdef THP
	addr = (unsigned char *) ((unsigned long) addr & ~((2*1024*1024)-1));
	len = 2*1024*1024;
 #endif
	if (addr >= c && addr < c + SIZE) {
		unsigned long offset = addr - c;
		ret = syscall(SYS_remap_anon_pages, c+offset, tmp+offset, len, 0);
		if (ret != len)
			perror("sigbus remap_anon_pages"), exit(1);
		//printf("sigbus offset %lu\n", offset);
		return;
	}

	printf("sigbus error addr %p c %p tmp %p\n", addr, c, tmp), exit(1);
 }

 int main()
 {
	struct sigaction sa;
	int ret;
	unsigned long i;
 #ifndef THP
	/*
	 * Fails with THP due to lack of alignment because of memset
	 * pre-filling the destination
	 */
	c = mmap(0, SIZE, PROT_READ|PROT_WRITE,
		 MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (c == MAP_FAILED)
		perror("mmap"), exit(1);
	tmp = mmap(0, SIZE, PROT_READ|PROT_WRITE,
		   MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (tmp == MAP_FAILED)
   

[Qemu-devel] [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-02 Thread Andrea Arcangeli
Once a userfaultfd is created, MADV_USERFAULT regions talk through
the userfaultfd protocol with the thread responsible for doing the
memory externalization of the process.

The protocol starts with userland writing the requested/preferred
USERFAULTFD_PROTOCOL version into the userfault fd (64bit write). If
the kernel knows it, it will ack it by allowing userland to read 64bit
from the userfault fd that will contain the same 64bit
USERFAULTFD_PROTOCOL version that userland asked for. Otherwise userland
will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
will have to try again by writing an older protocol version, if
suitable for its usage too, and read it back again until it stops
reading -1ULL. After that the userfaultfd protocol starts.

The protocol consists of 64bit reads from the userfault fd that
provide userland with the fault addresses. After a userfault address has
been read and the fault is resolved by userland, the application must
write back 128bits in the form of a [ start, end ] range (64bit each)
that will tell the kernel such a range has been mapped. Multiple read
userfaults can be resolved in a single range write. poll() can be used
to know when there are new userfaults to read (POLLIN) and when there
are threads waiting for a wakeup through a range write (POLLOUT).
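
The negotiation and the range write could look roughly like this from the
userland side. It is a sketch under assumptions: handshake_fn models one
write-then-read round trip on the userfault fd, negotiate, pack_range and
fake_kernel are invented names, and whether "end" is inclusive is not
stated above, so the packing only shows the two 64bit slots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNKNOWN_PROTOCOL ((uint64_t)-1)   /* the -1ULL answer */

/* One round trip: write the proposed version, read back the answer. */
typedef uint64_t (*handshake_fn)(uint64_t proposed);

/*
 * Propose versions from most to least preferred until one is acked.
 * Returns the accepted version, or UNKNOWN_PROTOCOL if none was.
 */
static uint64_t negotiate(handshake_fn round_trip,
                          const uint64_t *versions, size_t n)
{
    assert(round_trip && versions);
    for (size_t i = 0; i < n; i++) {
        if (round_trip(versions[i]) != UNKNOWN_PROTOCOL)
            return versions[i];
    }
    return UNKNOWN_PROTOCOL;
}

/* Pack the 128bit [ start, end ] message: two 64bit values. */
static size_t pack_range(unsigned char buf[16], uint64_t start,
                         uint64_t end)
{
    memcpy(buf, &start, sizeof(start));
    memcpy(buf + 8, &end, sizeof(end));
    return 16;
}

/* Fake kernel side that only knows protocol version 0xaa. */
static uint64_t fake_kernel(uint64_t proposed)
{
    return proposed == 0xaa ? proposed : UNKNOWN_PROTOCOL;
}
```

A caller that prefers a newer, unknown version first would fall back to
0xaa against fake_kernel, and then send one 16 byte buffer per resolved
range.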

Signed-off-by: Andrea Arcangeli 
---
 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/Makefile                      |   1 +
 fs/userfaultfd.c                 | 557 +++
 include/linux/syscalls.h         |   1 +
 include/linux/userfaultfd.h      |  40 +++
 init/Kconfig                     |  10 +
 kernel/sys_ni.c                  |   1 +
 mm/huge_memory.c                 |  20 +-
 mm/memory.c                      |   5 +-
 10 files changed, 629 insertions(+), 8 deletions(-)
 create mode 100644 fs/userfaultfd.c
 create mode 100644 include/linux/userfaultfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 08bc856..5aa2da4 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -361,3 +361,4 @@
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
 354	i386	remap_anon_pages	sys_remap_anon_pages
+355	i386	userfaultfd		sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 37bd179..7dca902 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -324,6 +324,7 @@
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
 317	common	remap_anon_pages	sys_remap_anon_pages
+318	common	userfaultfd		sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..e00e243 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
 obj-$(CONFIG_SIGNALFD) += signalfd.o
 obj-$(CONFIG_TIMERFD)  += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
+obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 0000000..4902fa3
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,557 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2014  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct userfaultfd_ctx {
+   /* pseudo fd refcounting */
+   struct kref kref;
+   /* waitqueue head for the userfaultfd page faults */
+   wait_queue_head_t fault_wqh;
+   /* waitqueue head for the pseudo fd to wakeup poll/read */
+   wait_queue_head_t fd_wqh;
+   /* userfaultfd syscall flags */
+   unsigned int flags;
+   /* state machine */
+   unsigned int state;
+   /* released */
+   bool released;
+};
+
+struct userfaultfd_wait_queue {
+   unsigned long address;
+   wait_queue_t wq;
+   bool pending;
+   struct userfaultfd_ctx *ctx;
+};
+
+#define USERFAULTFD_PROTOCOL ((__u64) 0xaa)
+#define USERFAULTFD_UNKNOWN_PROTOCOL ((__u64) -1ULL)
+
+enum {
+   USERFAULTFD_STATE_ASK_PROTOCOL,
+   USERFAULTFD_STATE_ACK_PROTOCOL,
+   USERFAULTFD_STATE_ACK_UNKNOWN_PROTOCOL,
+   USERFAULTFD_STATE_RUNNING,
+};
+
+/**
+ * struct mm_slot - userlandfd information 

[Qemu-devel] [PATCH 10/10] userfaultfd: use VM_FAULT_RETRY in handle_userfault()

2014-07-02 Thread Andrea Arcangeli
This optimizes the userfault handler to repeat the fault without
returning to userland if it's a page fault, and it teaches it to
handle FOLL_NOWAIT if it's a nonblocking gup invocation from KVM. The
FOLL_NOWAIT part is actually more than an optimization because if
FOLL_NOWAIT is set the gup caller assumes the mmap_sem cannot be
released (and it could assume that the structures protected by it
potentially read earlier cannot have become stale).

The locking rules to comply with the FAULT_FLAG_KILLABLE,
FAULT_FLAG_ALLOW_RETRY and FAULT_FLAG_RETRY_NOWAIT flags look quite
convoluted (and not well documented, aside from a "Caution" comment in
__lock_page_or_retry) so this is not a trivial change and in turn it's
kept incremental at the end of the patchset.
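
As a reading aid, the two outcomes that are fully visible in the
handle_userfault() hunk of this patch can be modelled as pure functions.
This is a simplified sketch, not the kernel code: the flag bit values, the
outcome names and the decision struct are illustrative only, and the
FOLL_NOWAIT branch is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only, not the kernel's definitions. */
enum { ALLOW_RETRY = 1 << 0, RETRY_NOWAIT = 1 << 1, KILLABLE = 1 << 2 };

enum outcome { OUT_DONE, OUT_RETRY, OUT_SIGBUS };

struct decision {
    enum outcome ret;
    bool release_mmap_sem;
};

/* The userfault was resolved by userland (the !uwq.pending branch). */
static struct decision on_resolved(unsigned int flags)
{
    struct decision d = { OUT_DONE, false };

    assert(!(flags & ~(unsigned int)(ALLOW_RETRY | RETRY_NOWAIT | KILLABLE)));
    if (flags & ALLOW_RETRY) {
        d.ret = OUT_RETRY;
        /* With RETRY_NOWAIT the mmap_sem has to stay held. */
        d.release_mmap_sem = !(flags & RETRY_NOWAIT);
    }
    return d;
}

/* The wait ended because of a fatal signal or a released fd. */
static struct decision on_abort(unsigned int flags, bool fatal_signal)
{
    struct decision d = { OUT_SIGBUS, false };

    assert(!(flags & ~(unsigned int)(ALLOW_RETRY | RETRY_NOWAIT | KILLABLE)));
    if (fatal_signal && (flags & KILLABLE)) {
        d.ret = OUT_RETRY;
        d.release_mmap_sem = true;
    }
    return d;
}
```

Keeping the decision separate from the locking makes it easy to see that
the mmap_sem is dropped in exactly two cases: a retry without RETRY_NOWAIT,
and a killable fault with a fatal signal pending.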

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c            | 68 ++---
 include/linux/userfaultfd.h |  6 ++--
 mm/huge_memory.c            |  8 +++---
 mm/memory.c                 |  4 +--
 4 files changed, 74 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index deed8cb..b8b0fb7 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -155,12 +155,29 @@ static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
 	kref_put(&ctx->kref, userfaultfd_free);
 }
 
-int handle_userfault(struct vm_area_struct *vma, unsigned long address)
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY_NOWAIT and
+ * FAULT_FLAG_KILLABLE are not straightforward. The "Caution"
+ * recommendation in __lock_page_or_retry is not an understatement.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set, the mmap_sem must be released
+ * before returning VM_FAULT_RETRY only if FAULT_FLAG_RETRY_NOWAIT is
+ * not set.
+ *
+ * If FAULT_FLAG_ALLOW_RETRY is set but FAULT_FLAG_KILLABLE is not
+ * set, VM_FAULT_RETRY can still be returned if and only if there are
+ * fatal_signal_pending()s, and the mmap_sem must be released before
+ * returning it.
+ */
+int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+		     unsigned int flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct mm_slot *slot;
 	struct userfaultfd_ctx *ctx;
 	struct userfaultfd_wait_queue uwq;
+	int ret;
 
 	BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
@@ -188,10 +205,53 @@ int handle_userfault(struct vm_area_struct *vma, unsigned long address)
 	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (fatal_signal_pending(current))
+		if (fatal_signal_pending(current) || ctx->released) {
+			/*
+			 * If we have to fail because the task is
+			 * killed or the file was released, simulate
+			 * VM_FAULT_SIGBUS or just return to userland
+			 * through VM_FAULT_RETRY if we come from a
+			 * page fault.
+			 */
+			ret = VM_FAULT_SIGBUS;
+			if (fatal_signal_pending(current) &&
+			    (flags & FAULT_FLAG_KILLABLE)) {
+				/*
+				 * If FAULT_FLAG_KILLABLE is set and
+				 * there's a fatal signal pending we
+				 * can return VM_FAULT_RETRY
+				 * regardless of whether
+				 * FAULT_FLAG_ALLOW_RETRY is set or
+				 * not, as long as we release the
+				 * mmap_sem. The page fault will
+				 * then return straight to userland
+				 * to handle the fatal signal.
+				 */
+				up_read(&mm->mmap_sem);
+				ret = VM_FAULT_RETRY;
+			}
+			break;
+		}
+		if (!uwq.pending) {
+			ret = 0;
+			if (flags & FAULT_FLAG_ALLOW_RETRY) {
+				ret = VM_FAULT_RETRY;
+				if (!(flags & FAULT_FLAG_RETRY_NOWAIT))
+					up_read(&mm->mmap_sem);
+			}
 			break;
-		if (!uwq.pending)
+		}
+		if (((FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT) &
+		     flags) ==
+		    (FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_RETRY_NOWAIT)) {
+			ret = VM_FAULT_RETRY;
+			/*
+			 * The mmap_sem must not be released if
+			 * FAULT_FLAG_RETRY_NOWAIT is set even though we
+			 * return VM_FAULT_RETRY (FOLL_NOWAIT case).
+			 */
  

Re: [Qemu-devel] ResettRe: [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Michael S. Tsirkin
On Wed, Jul 02, 2014 at 12:23:37PM -0400, Konrad Rzeszutek Wilk wrote:
> On Wed, Jul 02, 2014 at 04:50:15PM +0200, Paolo Bonzini wrote:
> > Il 02/07/2014 16:00, Konrad Rzeszutek Wilk ha scritto:
> > >With this long thread I lost a bit of context about the challenges
> > >that exist. But let me try summarizing it here - which will hopefully
> > >get some consensus.
> > >
> > >1). Fix IGD hardware to not use Southbridge magic addresses.
> > >We can moan and moan but I doubt it is going to change.
> > 
> > There are two problems:
> > 
> > - Northbridge (i.e. MCH i.e. PCI host bridge) configuration space addresses
> 
> Right. So in  drivers/gpu/drm/i915/i915_dma.c:
> 1135 #define MCHBAR_I915 0x44
> 1136 #define MCHBAR_I965 0x48
> 
> 1147 int reg = INTEL_INFO(dev)->gen >= 4 ? MCHBAR_I965 : MCHBAR_I915;
> 1152 if (INTEL_INFO(dev)->gen >= 4)
> 1153 pci_read_config_dword(dev_priv->bridge_dev, reg + 4, &temp_hi);
> 1154 pci_read_config_dword(dev_priv->bridge_dev, reg, &temp_lo);
> 1155 mchbar_addr = ((u64)temp_hi << 32) | temp_lo;
> 
> and
> 
> 1139 #define DEVEN_REG 0x54
> 
> 1193 int mchbar_reg = INTEL_INFO(dev)->gen >= 4 ? MCHBAR_I965 : MCHBAR_I915;
> 1202 if (IS_I915G(dev) || IS_I915GM(dev)) {
> 1203 pci_read_config_dword(dev_priv->bridge_dev, DEVEN_REG, &temp);
> 1204 enabled = !!(temp & DEVEN_MCHBAR_EN);
> 1205 } else {
> 1206 pci_read_config_dword(dev_priv->bridge_dev, mchbar_reg, &temp);
> 1207 enabled = temp & 1;
> 1208 }
> > 
> > - Southbridge (i.e. PCH i.e. ISA bridge) vendor/device ID; some versions of
> > the driver identify it by class, some versions identify it by slot (1f.0).
> 
> Right, So in  drivers/gpu/drm/i915/i915_drv.c the giant intel_detect_pch
> which sets the pch_type based on :
> 
>  432 if (pch->vendor == PCI_VENDOR_ID_INTEL) {
>  433 unsigned short id = pch->device & INTEL_PCH_DEVICE_ID_MASK;
>  434 dev_priv->pch_id = id;
>  435
>  436 if (id == INTEL_PCH_IBX_DEVICE_ID_TYPE) {
> 
> It checks for 0x3b00, 0x1c00, 0x1e00, 0x8c00 and 0x9c00.
> The INTEL_PCH_DEVICE_ID_MASK is 0xff00
> > 
> > To solve the first, make a new machine type, PIIX4-based, and pass through
> > the registers you need.  The patch must document _exactly_ why the registers
> > are safe to pass.  If they are not reserved on PIIX4, the patch must
> > document what the same offsets mean on PIIX4, and why it's sensible to
> > assume that firmware for virtual machine will not read/write them.  Bonus
> > point for also documenting the same for Q35.
> 
> OK. They look to be related to setting up an MBAR , but I don't understand
> why it is needed. Hopefully some of the i915 folks CC-ed here can answer.
> 
> > 
> > Regarding the second, fixing IGD hardware to not rely on chipset magic is a
> > no-go, I agree.  I disagree that it's a no-go to define a "backdoor" that
> > lets a hypervisor pass the right information to the driver without hacking
> > the chipset device model.
> > 
> > The hardware folks would have to give us a place for a pair of registers
> > (something like data/address), and a bit somewhere else that would be always
> > 0 on hardware and always 1 if the hypervisor is implementing the pair of
> > registers.  This is similar to CPUID, which has the HYPERVISOR bit +
> > hypervisor-defined leaves at 0x40000000.
> > 
> > The data/address pair could be in a BAR, in configuration space, in the low
> > VGA ports at 0x3c0-0x3df, wherever.  The hypervisor bit can be in the same
> > place or somewhere else---again, whatever is convenient for the hardware
> > folks.  We just need *one bit* that is known-zero on all hardware, and 8
> > bytes in a reserved area.  I don't think it's too hard to find this space,
> > and I really, really would like Intel to follow up on a paravirtualized
> > backdoor.
> > 
> > That said, we have the problem of existing guests, so I agree something else
> > is needed.
> > 
> > > a) Two bridges - one 'passthrough' and the legacy ISA bridge
> > >that QEMU emulates. Both Linux and Windows are OK with
> > >two bridges (even though it is pretty weird).
> > 
> > This is pretty much the only solution for existing Linux guests that look u

[Qemu-devel] [PATCH 00/10] RFC: userfault

2014-07-02 Thread Andrea Arcangeli
Hello everyone,

There's a large CC list for this RFC because this adds two new
syscalls (userfaultfd and remap_anon_pages) and
MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
or on a completely different API if somebody has better ideas are
welcome now.

The combination of these features is what I would propose to
implement postcopy live migration in qemu, and in general demand
paging of remote memory, hosted in different cloud nodes.

The MADV_USERFAULT feature should be generic enough that it can
provide the userfaults to the Android volatile range feature too, on
access of reclaimed volatile pages.

If the access could ever happen in kernel context through syscalls
(not just from userland context), then userfaultfd has to be used
to make the userfault unnoticeable to the syscall (no error will be
returned). This latter feature is more advanced than what volatile
ranges alone could do with SIGBUS so far (but it's optional: if the
process doesn't call userfaultfd, the regular SIGBUS will fire, and if the
fd is closed SIGBUS will also fire for any blocked userfault that was
waiting for a userfaultfd_write ack).

userfaultfd is also generic enough that it allows KVM to
implement postcopy live migration without having to modify a single
line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
other GUP features work just fine in combination with userfaults
(userfaults trigger async page faults in the guest scheduler so those
guest processes that aren't waiting for userfaults can keep running in
the guest vcpus).

remap_anon_pages is the syscall to use to resolve the userfaults (it's
not mandatory, vmsplice will likely still be used in the case of local
postcopy live migration just to upgrade the qemu binary, but
remap_anon_pages is faster and ideal for transferring memory across
the network, it's zerocopy and doesn't touch the vma: it only holds
the mmap_sem for reading).

The current behavior of remap_anon_pages is very strict to avoid any
chance of memory corruption going unnoticed. mremap is not strict like
that: if there's a synchronization bug it would drop the destination
range silently resulting in subtle memory corruption for
example. remap_anon_pages would return -EEXIST in that case. If there
are holes in the source range remap_anon_pages will return -ENOENT.

If remap_anon_pages is always used with 2M naturally aligned
addresses, transparent hugepages will not be split. If there could
be 4k (or any size) holes in the 2M (or any size) source range,
remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
relax some of its strict checks (-ENOENT won't be returned if
RAP_ALLOW_SRC_HOLES is set, remap_anon_pages then will just behave as
a noop on any hole in the source range). This flag is generally useful
when implementing userfaults with THP granularity, but it shouldn't be
set if doing the userfaults with PAGE_SIZE granularity if the
developer wants to benefit from the strict -ENOENT behavior.
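
For the THP case, "2M naturally aligned" boils down to a couple of mask
operations. The helpers below are a sketch with made-up names
(hpage_align_down, hpage_aligned_range), and they assume a 2M hugepage
size as in the paragraph above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)   /* assumed THP size: 2M */

/* Round a faulting address down to the start of its 2M region. */
static uintptr_t hpage_align_down(uintptr_t addr)
{
    assert((HPAGE_SIZE & (HPAGE_SIZE - 1)) == 0);
    return addr & ~(uintptr_t)(HPAGE_SIZE - 1);
}

/*
 * True if dst, src and len are all multiples of 2M, i.e. the remap
 * covers whole hugepage-sized regions on both sides.
 */
static bool hpage_aligned_range(uintptr_t dst, uintptr_t src,
                                uintptr_t len)
{
    return len != 0 && ((dst | src | len) & (HPAGE_SIZE - 1)) == 0;
}
```

A fault at 0x40345678 would be resolved by remapping the 2M region that
starts at 0x40200000.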

The remap_anon_pages syscall API is not vectored, as I expect it to be
used mainly for demand paging (where there can be just one faulting
range per userfault) or for large ranges (with the THP model as an
alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
granularity before starting the guest in the destination node) where
vectoring isn't going to provide much of a performance advantage (thanks
to the THP coarser granularity).

On the rmap side remap_anon_pages doesn't add much complexity: there's
no need for nonlinear anon vmas to support it because I added the
constraint that it will fail if the mapcount is more than 1. So in
general the source range of remap_anon_pages should be marked
MADV_DONTFORK to prevent any risk of failure if the process ever
forks (like qemu can in some cases).

One part that hasn't been tested is the poll() syscall on the
userfaultfd because the postcopy migration thread currently is more
efficient waiting on blocking read()s (I'll write some code to test
poll() too). I also appended below a patch to trinity to exercise
remap_anon_pages and userfaultfd and it completes trinity
successfully.

The code can be found here:

git clone --reference linux git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault

The branch is rebased so you can get updates for example with:

git fetch && git checkout -f origin/userfault

Comments welcome, thanks!
Andrea

From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli 
Date: Wed, 2 Jul 2014 18:32:35 +0200
Subject: [PATCH] add remap_anon_pages and userfaultfd

Signed-off-by: Andrea Arcangeli 
---
 include/syscalls-x86_64.h   |   2 +
 syscalls/remap_anon_pages.c | 100 
 syscalls/syscalls.h         |   2 +
 syscalls/userfaultfd.c      |  12 ++
 4 files changed, 116 insertions(+)
 create mode 100644 syscalls/remap_anon_pages.c
 create mode 100644 syscalls/userfaultfd.c

diff --

[Qemu-devel] [PATCH 05/10] mm: swp_entry_swapcount

2014-07-02 Thread Andrea Arcangeli
Provide a new swapfile method for remap_anon_pages to verify the swap
entry is mapped only in one vma before relocating the swap entry to a
different virtual address. Otherwise if the swap entry is mapped
in multiple vmas, when the page is swapped back in, it could get
mapped in a non linear way in some anon_vma.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/swap.h |  6 ++
 mm/swapfile.c        | 13 +
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3d86a9a..3d7cae5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,6 +452,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -553,6 +554,11 @@ static inline int page_swapcount(struct page *page)
 	return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+   return 0;
+}
+
 #define reuse_swap_page(page)  (page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4c524f7..f516555 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -877,6 +877,19 @@ int page_swapcount(struct page *page)
 	return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+   int count = 0;
+   struct swap_info_struct *p;
+
+   p = swap_info_get(entry);
+   if (p) {
+   count = swap_count(p->swap_map[swp_offset(entry)]);
+   spin_unlock(&p->lock);
+   }
+   return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content



[Qemu-devel] [PATCH 04/10] mm: rmap preparation for remap_anon_pages

2014-07-02 Thread Andrea Arcangeli
remap_anon_pages (unlike remap_file_pages) tries to be non intrusive
in the rmap code.

As far as the rmap code is concerned, remap_anon_pages only alters
page->mapping and page->index, and it does so while holding the page
lock. However, there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places must be
updated to re-check that page->mapping didn't change after they
obtained the anon_vma lock. remap_anon_pages takes the anon_vma lock
for writing before altering page->mapping, so if page->mapping is
still the same after obtaining the anon_vma lock (without the page
lock), the rmap walks can go ahead safely (and remap_anon_pages will
wait for them to complete before proceeding).
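
The lock-then-recheck pattern described above can be sketched in plain
C11. This is a user-space model, not kernel code: struct owner, struct
item and lock_owner_stable() are hypothetical names standing in for the
anon_vma, the page and the retry loop, and the reference counting done
by page_get_anon_vma/put_anon_vma is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for an anon_vma: nothing but a spinlock. */
struct owner {
    atomic_flag locked;
};

/* Stand-in for a page: 'owner' plays the role of page->mapping. */
struct item {
    _Atomic(struct owner *) owner;
};

void owner_lock(struct owner *o)
{
    while (atomic_flag_test_and_set_explicit(&o->locked, memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

void owner_unlock(struct owner *o)
{
    atomic_flag_clear_explicit(&o->locked, memory_order_release);
}

/*
 * Lock the owner of 'it' without holding any per-item lock.
 * Returns the locked owner, or NULL if the item has no owner.
 */
struct owner *lock_owner_stable(struct item *it)
{
    assert(it != NULL);
    for (;;) {
        struct owner *o = atomic_load(&it->owner);

        if (o == NULL)
            return NULL;
        owner_lock(o);
        /*
         * Assumption: writers change it->owner only while holding the
         * old owner's lock, so an unchanged value is stable from here on.
         */
        if (o == atomic_load(&it->owner))
            return o;
        owner_unlock(o); /* lost the race: drop the lock and retry */
    }
}
```

The caller gets back a locked owner and releases it with owner_unlock()
once the walk is done.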

remap_anon_pages serializes against itself with the page lock.

All other places that take the anon_vma lock while holding the
mmap_sem for writing don't need to check whether page->mapping has
changed after taking the anon_vma lock, regardless of the page lock,
because remap_anon_pages holds the mmap_sem for reading.

Overall this looks like a fairly small change to the rmap code, notably
less intrusive than the nonlinear vmas created by remap_file_pages.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_anon_pages must be mapped only in one
vma, but this is not a limitation when used to handle userland page
faults with MADV_USERFAULT. The source addresses passed to
remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
avoid any risk of the mapcount of the pages increasing, if fork runs
in parallel in another thread, before or while remap_anon_pages runs.

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 24 
 mm/rmap.c|  9 +
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1928463..94c37ca 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1907,6 +1907,7 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 {
struct anon_vma *anon_vma;
int ret = 1;
+   struct address_space *mapping;
 
BUG_ON(is_huge_zero_page(page));
BUG_ON(!PageAnon(page));
@@ -1918,10 +1919,24 @@ int split_huge_page_to_list(struct page *page, struct 
list_head *list)
 * page_lock_anon_vma_read except the write lock is taken to serialise
 * against parallel split or collapse operations.
 */
-   anon_vma = page_get_anon_vma(page);
-   if (!anon_vma)
-   goto out;
-   anon_vma_lock_write(anon_vma);
+   for (;;) {
+   mapping = ACCESS_ONCE(page->mapping);
+   anon_vma = page_get_anon_vma(page);
+   if (!anon_vma)
+   goto out;
+   anon_vma_lock_write(anon_vma);
+   /*
+* We don't hold the page lock here so
+* remap_anon_pages_huge_pmd can change the anon_vma
+* from under us until we obtain the anon_vma
+* lock. Verify that we obtained the anon_vma lock
+* before remap_anon_pages did.
+*/
+   if (likely(mapping == ACCESS_ONCE(page->mapping)))
+   break;
+   anon_vma_unlock_write(anon_vma);
+   put_anon_vma(anon_vma);
+   }
 
ret = 0;
if (!PageCompound(page))
@@ -2420,6 +2435,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 * Prevent all access to pagetables with the exception of
 * gup_fast later hanlded by the ptep_clear_flush and the VM
 * handled by the anon_vma lock + PG_lock.
+* remap_anon_pages is prevented to race as well by the mmap_sem.
 */
down_write(&mm->mmap_sem);
if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/rmap.c b/mm/rmap.c
index b7e94eb..59a7e7d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -450,6 +450,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;
 
+repeat:
rcu_read_lock();
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -488,6 +489,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
rcu_read_unlock();
anon_vma_lock_read(anon_vma);
 
+   /* check if remap_anon_pages changed the anon_vma */
+   if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != 
anon_mapping)) {
+   anon_vma_unlock_read(anon_vma);
+   put_anon_vma(anon_vma);
+   anon_vma = NULL;
+   goto repeat;
+   }
+
if (atomic_dec_and_test(&anon_vma->refcount)) {
/*
 * Oops, we held the last refcount, release the l

[Qemu-devel] [PATCH 09/10] userfaultfd: make userfaultfd_write non blocking

2014-07-02 Thread Andrea Arcangeli
It is generally inefficient to request the wakeup of a userfault range
for which no userfault address was read earlier through
userfaultfd_read, and which therefore has nobody waiting for a wakeup.
However, it may come in handy to wake up the same userfault range twice
when multiple threads fault on the same address. We should still
return an error, so that an application that assumes this can never
happen will know it hit a bug. So just return -ENOENT instead of
blocking.
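
A minimal user-space sketch of that return convention follows. It is
not the kernel code: struct fault_waiter and wake_range_nonblocking()
are hypothetical, and treating the range as half-open [start, end) is
an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one blocked faulting thread. */
struct fault_waiter {
    uint64_t addr;   /* faulting address */
    int waiting;     /* non-zero while the thread is blocked */
};

/*
 * Wake every waiter whose address falls in [range[0], range[1]).
 * Never blocks: returns the number of bytes consumed on success,
 * -ERANGE for an empty range, -ENOENT if nobody was waiting there.
 */
long wake_range_nonblocking(struct fault_waiter *w, size_t n,
                            const uint64_t range[2])
{
    size_t i;
    int woken = 0;

    assert(w != NULL || n == 0);
    if (range[0] >= range[1])
        return -ERANGE;
    for (i = 0; i < n; i++) {
        if (w[i].waiting && w[i].addr >= range[0] && w[i].addr < range[1]) {
            w[i].waiting = 0;
            woken++;
        }
    }
    return woken ? (long)(2 * sizeof(uint64_t)) : -ENOENT;
}
```

Waking the same range a second time is harmless in this model, but the
caller sees -ENOENT and can decide whether that indicates a bug.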

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 34 +-
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 4902fa3..deed8cb 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -378,9 +378,7 @@ static ssize_t userfaultfd_write(struct file *file, const 
char __user *buf,
 size_t count, loff_t *ppos)
 {
struct userfaultfd_ctx *ctx = file->private_data;
-   ssize_t res;
__u64 range[2];
-   DECLARE_WAITQUEUE(wait, current);
 
if (ctx->state == USERFAULTFD_STATE_ASK_PROTOCOL) {
__u64 protocol;
@@ -408,34 +406,12 @@ static ssize_t userfaultfd_write(struct file *file, const 
char __user *buf,
if (range[0] >= range[1])
return -ERANGE;
 
-   spin_lock(&ctx->fd_wqh.lock);
-   __add_wait_queue(&ctx->fd_wqh, &wait);
-   for (;;) {
-   set_current_state(TASK_INTERRUPTIBLE);
-   /* always take the fd_wqh lock before the fault_wqh lock */
-   if (find_userfault(ctx, NULL, POLLOUT)) {
-   if (!wake_userfault(ctx, range)) {
-   res = sizeof(range);
-   break;
-   }
-   }
-   if (signal_pending(current)) {
-   res = -ERESTARTSYS;
-   break;
-   }
-   if (file->f_flags & O_NONBLOCK) {
-   res = -EAGAIN;
-   break;
-   }
-   spin_unlock(&ctx->fd_wqh.lock);
-   schedule();
-   spin_lock(&ctx->fd_wqh.lock);
-   }
-   __remove_wait_queue(&ctx->fd_wqh, &wait);
-   __set_current_state(TASK_RUNNING);
-   spin_unlock(&ctx->fd_wqh.lock);
+   /* always take the fd_wqh lock before the fault_wqh lock */
+   if (find_userfault(ctx, NULL, POLLOUT))
+   if (!wake_userfault(ctx, range))
+   return sizeof(range);
 
-   return res;
+   return -ENOENT;
 }
 
 #ifdef CONFIG_PROC_FS



[Qemu-devel] [PATCH 07/10] waitqueue: add nr wake parameter to __wake_up_locked_key

2014-07-02 Thread Andrea Arcangeli
Userfaultfd needs to wake all waitqueues (pass 0 as nr parameter),
instead of the current hardcoded 1 (that would wake just the first
waitqueue in the head list).
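
For readers unfamiliar with the nr convention, here is a small model of
how I understand __wake_up_common to treat it: non-exclusive waiters are
always woken, and the walk stops once nr exclusive waiters have been
woken, so nr == 0 never reaches the stop condition. struct waiter and
wake_up_model() are hypothetical names for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wait queue entry. */
struct waiter {
    bool exclusive;  /* models WQ_FLAG_EXCLUSIVE */
    bool woken;
};

/*
 * Walk the queue in order and wake entries. Stop after nr_exclusive
 * exclusive waiters have been woken; with nr_exclusive == 0 the
 * counter goes negative and the walk covers the whole queue.
 * Returns the number of entries woken.
 */
int wake_up_model(struct waiter *q, size_t n, int nr_exclusive)
{
    size_t i;
    int woken = 0;

    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (q[i].woken)
            continue;
        q[i].woken = true;
        woken++;
        if (q[i].exclusive && !--nr_exclusive)
            break;
    }
    return woken;
}
```

With a queue of one non-exclusive waiter followed by two exclusive ones
and another non-exclusive one, nr = 1 stops after the first exclusive
waiter, while nr = 0 wakes all four.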

Signed-off-by: Andrea Arcangeli 
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..b28be5a 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -142,7 +142,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t 
*old)
 }
 
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void 
*key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -173,7 +174,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)  \
-   __wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+   __wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)   \
__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)  \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..551007f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -105,9 +105,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int 
mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key)
 {
-   __wake_up_common(q, mode, 1, 0, key);
+   __wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -282,7 +283,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, 
wait_queue_t *wait,
if (!list_empty(&wait->task_list))
list_del_init(&wait->task_list);
else if (waitqueue_active(q))
-   __wake_up_locked_key(q, mode, key);
+   __wake_up_locked_key(q, mode, 1, key);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index c0365c1..d4ffd68 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
ret = atomic_dec_and_test(&task->tk_count);
if (waitqueue_active(wq))
-   __wake_up_locked_key(wq, TASK_NORMAL, &k);
+   __wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
spin_unlock_irqrestore(&wq->lock, flags);
return ret;
 }



[Qemu-devel] [PATCH 02/10] mm: madvise MADV_USERFAULT

2014-07-02 Thread Andrea Arcangeli
MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the
vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if
userland touches a still unmapped virtual address, a sigbus signal is
sent instead of allocating a new page. The sigbus signal handler will
then resolve the page fault in userland by calling the
remap_anon_pages syscall.

This functionality is needed to reliably implement postcopy live
migration in KVM (without having to use a special chardevice that
would disable all advanced Linux VM features, like swapping, KSM, THP,
automatic NUMA balancing, etc...).

MADV_USERFAULT could also be used to offload parts of anonymous memory
regions to remote nodes or to implement network distributed shared
memory.

Here I enlarged vm_flags to 64 bits because we ran out of bits (a no-op
on 64-bit kernels). An alternative is to find some combination of flags
that are mutually exclusive if set.

Signed-off-by: Andrea Arcangeli 
---
 arch/alpha/include/uapi/asm/mman.h |  3 ++
 arch/mips/include/uapi/asm/mman.h  |  3 ++
 arch/parisc/include/uapi/asm/mman.h|  3 ++
 arch/xtensa/include/uapi/asm/mman.h|  3 ++
 fs/proc/task_mmu.c |  1 +
 include/linux/mm.h |  1 +
 include/uapi/asm-generic/mman-common.h |  3 ++
 mm/huge_memory.c   | 61 +-
 mm/madvise.c   | 17 ++
 mm/memory.c| 13 
 10 files changed, 85 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h 
b/arch/alpha/include/uapi/asm/mman.h
index 0086b47..a10313c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -60,6 +60,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP17  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 18  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 
diff --git a/arch/mips/include/uapi/asm/mman.h 
b/arch/mips/include/uapi/asm/mman.h
index cfcb876..d9d11a4 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -84,6 +84,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP17  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 18  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h 
b/arch/parisc/include/uapi/asm/mman.h
index 294d251..7bc7b7b 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -66,6 +66,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP70  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 71  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 72/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 #define MAP_VARIABLE   0
diff --git a/arch/xtensa/include/uapi/asm/mman.h 
b/arch/xtensa/include/uapi/asm/mman.h
index 00eed67..5448d88 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -90,6 +90,9 @@
   overrides the coredump filter bits */
 #define MADV_DODUMP17  /* Clear the MADV_NODUMP flag */
 
+#define MADV_USERFAULT 18  /* Trigger user faults if not mapped */
+#define MADV_NOUSERFAULT 19/* Don't trigger user faults */
+
 /* compatibility flags */
 #define MAP_FILE   0
 
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fb91692..8636cda 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -568,6 +568,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct 
vm_area_struct *vma)
[ilog2(VM_HUGEPAGE)]= "hg",
[ilog2(VM_NOHUGEPAGE)]  = "nh",
[ilog2(VM_MERGEABLE)]   = "mg",
+   [ilog2(VM_USERFAULT)]   = "uf",
};
size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..00faeda 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -139,6 +139,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HUGEPAGE0x2000  /* MADV_HUGEPAGE marked this vma */
 #define VM_NOHUGEPAGE  0x4000  /* MADV_NOHUGEPAGE marked this vma */
 #define VM_MERGEABLE   0x8000  /* KSM may merge identical pages */
+#define VM_USERFAULT   0x1ULL  /* Trigger user faults if not mapped */
 
 #if defined(CONFIG_X86)
 # define VM_PATVM_ARCH_1   /* PAT reserves whole VMA at 
once (x86) */
diff --git a/include/uapi/asm

[Qemu-devel] [PATCH 01/10] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits

2014-07-02 Thread Andrea Arcangeli
We ran out of the 32 bits in vm_flags; this is a no-op change for 64-bit archs.
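
To illustrate why both the wider type and the extra mnemonics slot are
needed, here is a small stand-alone sketch. VM_EXAMPLE_BIT32 is a
hypothetical flag; that the new flag lands exactly on bit 32 is my
assumption, and flag_index() only imitates the kernel's ilog2() for
non-zero input.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned long long vm_flags_t;  /* at least 64 bits on every arch */

#define VM_EXAMPLE_BIT32 (1ULL << 32)   /* hypothetical flag above bit 31 */

/* Index of the highest set bit of a non-zero flag word. */
int flag_index(vm_flags_t flag)
{
    int idx = -1;

    assert(flag != 0);
    while (flag) {
        flag >>= 1;
        idx++;
    }
    return idx;
}

/* What would survive if the flags were stored in a 32-bit word. */
uint32_t truncate_to_32(vm_flags_t flags)
{
    return (uint32_t)flags;
}
```

On a 32-bit arch BITS_PER_LONG is 32, so a flag with index 32 needs a
table of BITS_PER_LONG+1 entries, and storing it in a 32-bit unsigned
long would silently drop it.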

Signed-off-by: Andrea Arcangeli 
---
 fs/proc/task_mmu.c   | 4 ++--
 include/linux/huge_mm.h  | 4 ++--
 include/linux/ksm.h  | 4 ++--
 include/linux/mm_types.h | 2 +-
 mm/huge_memory.c | 2 +-
 mm/ksm.c | 2 +-
 mm/madvise.c | 2 +-
 mm/mremap.c  | 2 +-
 8 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index cfa63ee..fb91692 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, 
struct vm_area_struct *vma)
/*
 * Don't forget to update Documentation/ on changes.
 */
-   static const char mnemonics[BITS_PER_LONG][2] = {
+   static const char mnemonics[BITS_PER_LONG+1][2] = {
/*
 * In case if we meet a flag we don't know about.
 */
-   [0 ... (BITS_PER_LONG-1)] = "??",
+   [0 ... (BITS_PER_LONG)] = "??",
 
[ilog2(VM_READ)]= "rd",
[ilog2(VM_WRITE)]   = "wr",
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b826239..3a2c57e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -125,7 +125,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, 
unsigned long address,
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
 extern int hugepage_madvise(struct vm_area_struct *vma,
-   unsigned long *vm_flags, int advice);
+   vm_flags_t *vm_flags, int advice);
 extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
@@ -187,7 +187,7 @@ static inline int split_huge_page(struct page *page)
 #define split_huge_page_pmd_mm(__mm, __address, __pmd) \
do { } while (0)
 static inline int hugepage_madvise(struct vm_area_struct *vma,
-  unsigned long *vm_flags, int advice)
+  vm_flags_t *vm_flags, int advice)
 {
BUG();
return 0;
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 3be6bb1..8b35253 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -18,7 +18,7 @@ struct mem_cgroup;
 
 #ifdef CONFIG_KSM
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, int advice, unsigned long *vm_flags);
+   unsigned long end, int advice, vm_flags_t *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
 
@@ -94,7 +94,7 @@ static inline int PageKsm(struct page *page)
 
 #ifdef CONFIG_MMU
 static inline int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, int advice, unsigned long *vm_flags)
+   unsigned long end, int advice, vm_flags_t *vm_flags)
 {
return 0;
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 96c5750..cd42c8c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -217,7 +217,7 @@ struct page_frag {
 #endif
 };
 
-typedef unsigned long __nocast vm_flags_t;
+typedef unsigned long long __nocast vm_flags_t;
 
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33514d8..7e0776a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1929,7 +1929,7 @@ out:
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
-unsigned long *vm_flags, int advice)
+vm_flags_t *vm_flags, int advice)
 {
switch (advice) {
case MADV_HUGEPAGE:
diff --git a/mm/ksm.c b/mm/ksm.c
index 346ddc9..6052cf2 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1736,7 +1736,7 @@ static int ksm_scan_thread(void *nothing)
 }
 
 int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
-   unsigned long end, int advice, unsigned long *vm_flags)
+   unsigned long end, int advice, vm_flags_t *vm_flags)
 {
struct mm_struct *mm = vma->vm_mm;
int err;
diff --git a/mm/madvise.c b/mm/madvise.c
index a402f8f..b31aad1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -49,7 +49,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
int error = 0;
pgoff_t pgoff;
-   unsigned long new_flags = vma->vm_flags;
+   vm_flags_t new_flags = vma->vm_flags;
 
switch (behavior) {
case MADV_NORMAL:
diff --git a/mm/mremap.c b/mm/mremap.c
index 05f1180..fa7db87 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -239,7 +239,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 {
struct mm_struct *mm = vma->vm_mm;
struct vm_area_s

Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Konrad Rzeszutek Wilk
On Wed, Jul 02, 2014 at 06:29:23PM +0200, Paolo Bonzini wrote:
> On 02/07/2014 17:27, Michael S. Tsirkin wrote:
> > At some level, maybe Paolo is right.  Ignore existing drivers and ask
> > intel developers to update their drivers to do something sane on
> > hypervisors, even if they do ugly things on real hardware.
> > 
> > A simple proposal since what I wrote earlier though apparently wasn't
> > very clear:
> > 
> >   Detect Xen subsystem vendor id on vga card.
> >   If there, avoid poking at chipset. Instead
> > - use subsystem device # for card type
> 
> You mean for PCH type (aka PCH device id).
> 
> > - use second half of BAR0 of device
> > - instead of access to pci host
> > 
> > hypervisors will simply take BAR0 and double it in size,
> > make second part map to what would be the pci host.
> 
> Nice.  Detecting the backdoor via the subsystem vendor id
> is clever.
> 
> I'm not sure if it's possible to just double the size of BAR0 
> or not, but my laptop has:
> 
>   Region 0: Memory at d000 (64-bit, non-prefetchable) [size=4M]
>   Region 2: Memory at c000 (64-bit, prefetchable) [size=256M]
>   Region 4: I/O ports at 5000 [size=64]
> 
> and I hope we can reserve a few KB for hypervisors within those
> 4M, or 8 bytes for an address/data pair (like cf8/cfc) within BAR4's
> 64 bytes (or grow BAR4 to 128 bytes, or something like that).
> 
> Xen can still add the hacky machine type if they want for existing 
> hosts, but this would be a nice way forward.

It would be good to understand first why i915 needs to set up the
bridge MBAR at all if it has not been set. As in, why is this region
needed? Is it needed to flush the pipeline (as in you need to write
there?) or ..

Perhaps it is not needed anymore with the current hardware and
what can be done is put a stake in the ground saying that only
genX or later will be supported.

The commit ids allude to power management and the earlier commits
did poke there - but I don't see it on the latest tree.
> 
> Paolo
> 



Re: [Qemu-devel] [PATCH for-2.1] hw/arm/vexpress: Alias NOR flash at 0 for vexpress-a9

2014-07-02 Thread Greg Bellows
Reviewed-by: Greg Bellows 


On 2 July 2014 09:07, Peter Maydell  wrote:

> Make the vexpress-a9 board alias the first NOR flash region at
> address zero, like vexpress-a15. This makes "-bios" actually usable
> on this board.
>
> Signed-off-by: Peter Maydell 
> ---
> Looking back through the archives to 2012 when the vexpress-a15
> flash alias went in, I seem to have been under the impression that
> the A9 daughterboard didn't have a similar alias, but it does.
> (For both boards, there is a mechanism for letting the guest
> dynamically remap lowmem which we don't implement; the rationale
> for defaulting to flash at 0 holds for both.)
>
> This is a fairly long standing bug but it's more interesting
> now we support -bios on this board which is a new-in-2.1 thing.
> Hence the for-2.1 tag.
> ---
>  hw/arm/vexpress.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/arm/vexpress.c b/hw/arm/vexpress.c
> index 3d83e6c..a88732c 100644
> --- a/hw/arm/vexpress.c
> +++ b/hw/arm/vexpress.c
> @@ -84,6 +84,7 @@ enum {
>  };
>
>  static hwaddr motherboard_legacy_map[] = {
> +[VE_NORFLASHALIAS] = 0,
>  /* CS7: 0x1000 .. 0x1002 */
>  [VE_SYSREGS] = 0x1000,
>  [VE_SP810] = 0x10001000,
> @@ -114,7 +115,6 @@ static hwaddr motherboard_legacy_map[] = {
>  [VE_VIDEORAM] = 0x4c00,
>  [VE_ETHERNET] = 0x4e00,
>  [VE_USB] = 0x4f00,
> -[VE_NORFLASHALIAS] = -1, /* not present */
>  };
>
>  static hwaddr motherboard_aseries_map[] = {
> --
> 2.0.0
>
>
>


Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Ming Lei
On Thu, Jul 3, 2014 at 12:38 AM, Paolo Bonzini  wrote:
>
>> On Thu, Jul 3, 2014 at 12:23 AM, Paolo Bonzini  wrote:
>> > On 02/07/2014 18:13, Ming Lei wrote:
>> >
>> >> That must be for generating guest irq, which should have been
>> >> processed as batch easily.
>> >
>> >
>> > No, guest irqs are generated (with notify_guest) on every I/O completion
>> > even in 2.0.
>>
>> In 2.0, only notify_guest() is called after event batch is completed,
>
> Ah, you're right.
>
>> I wrote one patch days ago for fixing the problem, and today
>> it is just verified that writes can be decreased to 10K from 120K.
>
> Great.  Can you test/review my aio_notify patch and submit both then?  Bonus
> points, but not a requirement IMO,  if it also helps non-dataplane.

OK, I will take a look and test tomorrow, and it is a bit late for me today.


Thanks,
-- 
Ming Lei



Re: [Qemu-devel] [libvirt] Not able to run "virsh qemu-agent-command" when socat is working

2014-07-02 Thread Eric Blake
On 07/02/2014 01:13 AM, Puneet Bakshi wrote:
> Hi,
> 
> I am running qemu guest agent in Windows 2k8. I am able to execute
> "qemu-agent-commands" using socat but not through "virsh
> qemu-agent-command".
> 

> *Host CentOS system*
> 
> socat returns response appropriately.
> [root@sdsr720-14 ~]# echo "{'execute':'guest-ping'}" | socat
> stdio,ignoreeof /var/lib/libvirt/qemu/g06.agent
> {"return": {}}

Note your spelling...

> 
> *"virsh qemu-agent-command" returns blank.*
> 
> [root@sdsr720-14 ~]# virsh qemu-agent-command vm_win_06 '{ "execute":
> "guest_ping"}'

and compare it to here.  There is no guest_ping command, only
guest-ping.  It's a bug in libvirt that qemu-agent-command doesn't
output a useful error message when attempting to run a non-existing
command, but you'll never hit that bug if you pass valid commands to the
agent in the first place.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Paolo Bonzini

> On Thu, Jul 3, 2014 at 12:23 AM, Paolo Bonzini  wrote:
> > On 02/07/2014 18:13, Ming Lei wrote:
> >
> >> That must be for generating guest irq, which should have been
> >> processed as batch easily.
> >
> >
> > No, guest irqs are generated (with notify_guest) on every I/O completion
> > even in 2.0.
> 
> In 2.0, only notify_guest() is called after event batch is completed,

Ah, you're right.

> I wrote one patch days ago for fixing the problem, and today
> it is just verified that writes can be decreased to 10K from 120K.

Great.  Can you test/review my aio_notify patch and submit both then?  Bonus
points, but not a requirement IMO,  if it also helps non-dataplane.

This would only leave rt_sigprocmask.

> BTW, what do you think about the patch v4 for submitting I/O as batch?

I was waiting for Kevin to review it.

Paolo



Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Paolo Bonzini
On 02/07/2014 17:27, Michael S. Tsirkin wrote:
> At some level, maybe Paolo is right.  Ignore existing drivers and ask
> intel developers to update their drivers to do something sane on
> hypervisors, even if they do ugly things on real hardware.
> 
> A simple proposal since what I wrote earlier though apparently wasn't
> very clear:
> 
>   Detect Xen subsystem vendor id on vga card.
>   If there, avoid poking at chipset. Instead
>   - use subsystem device # for card type

You mean for PCH type (aka PCH device id).

>   - use second half of BAR0 of device
>   - instead of access to pci host
> 
> hypervisors will simply take BAR0 and double it in size,
> make second part map to what would be the pci host.

Nice.  Detecting the backdoor via the subsystem vendor id
is clever.

I'm not sure if it's possible to just double the size of BAR0 
or not, but my laptop has:

Region 0: Memory at d000 (64-bit, non-prefetchable) [size=4M]
Region 2: Memory at c000 (64-bit, prefetchable) [size=256M]
Region 4: I/O ports at 5000 [size=64]

and I hope we can reserve a few KB for hypervisors within those
4M, or 8 bytes for an address/data pair (like cf8/cfc) within BAR4's
64 bytes (or grow BAR4 to 128 bytes, or something like that).
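
As a rough illustration of what such an address/data pair could look
like from the device-model side, here is a small sketch. Everything in
it is hypothetical (struct backdoor, the dword alignment, returning
all-ones for out-of-range reads); it only shows the access pattern
borrowed from cf8/cfc, not any agreed-upon register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte window: one address register, one data register. */
struct backdoor {
    uint32_t addr;         /* last value written to the address register */
    const uint32_t *regs;  /* register space exposed by the hypervisor */
    size_t nregs;          /* number of 32-bit words in 'regs' */
};

/* Guest writes the byte offset it wants to read next. */
void backdoor_write_addr(struct backdoor *b, uint32_t byte_offset)
{
    assert(b != NULL);
    b->addr = byte_offset & ~3u;  /* force dword alignment */
}

/* Guest reads the data register: returns the selected word. */
uint32_t backdoor_read_data(const struct backdoor *b)
{
    size_t idx;

    assert(b != NULL);
    idx = b->addr / 4;
    /* Out-of-range reads return all-ones, like reads from absent devices. */
    return idx < b->nregs ? b->regs[idx] : 0xffffffffu;
}
```

The driver would write the offset of the host-bridge register it wants
and then read the data register, instead of touching the real bridge.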

Xen can still add the hacky machine type if they want for existing 
hosts, but this would be a nice way forward.

Paolo



Re: [Qemu-devel] ResettRe: [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Paolo Bonzini

On 02/07/2014 18:23, Konrad Rzeszutek Wilk wrote:
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 651e65e..03f2829 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -433,6 +433,8 @@ void intel_detect_pch(struct drm_device *dev)
> unsigned short id = pch->device & INTEL_PCH_DEVICE_ID_MASK;
> dev_priv->pch_id = id;
>
> +   if (pch->subsystem_vendor == PCI_VENDOR_ID_XEN)
> +   id = pch->device & INTEL_PCH_DEVICE_ID_MASK;

Actually you could look at *dev*'s subsystem IDs and skip the pch lookup
completely.

Paolo

> if (id == INTEL_PCH_IBX_DEVICE_ID_TYPE) {
> dev_priv->pch_type = PCH_IBX;
> DRM_DEBUG_KMS("Found Ibex Peak PCH\n");





Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Ming Lei
On Thu, Jul 3, 2014 at 12:23 AM, Paolo Bonzini  wrote:
> On 02/07/2014 18:13, Ming Lei wrote:
>
>> That must be for generating guest irq, which should have been
>> processed as batch easily.
>
>
> No, guest irqs are generated (with notify_guest) on every I/O completion
> even in 2.0.

In 2.0, notify_guest() is only called after an event batch is completed.
I wrote a patch a few days ago to fix the problem, and today I verified
that the writes can be decreased from 120K to 10K.

BTW, what do you think about the patch v4 for submitting I/O
as batch?


Thanks,
-- 
Ming Lei



Re: [Qemu-devel] ResettRe: [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Konrad Rzeszutek Wilk
On Wed, Jul 02, 2014 at 04:50:15PM +0200, Paolo Bonzini wrote:
> On 02/07/2014 16:00, Konrad Rzeszutek Wilk wrote:
> >With this long thread I lost a bit context about the challenges
> >that exists. But let me try summarizing it here - which will hopefully
> >get some consensus.
> >
> >1). Fix IGD hardware to not use Southbridge magic addresses.
> >We can moan and moan but I doubt it is going to change.
> 
> There are two problems:
> 
> - Northbridge (i.e. MCH i.e. PCI host bridge) configuration space addresses

Right. So in drivers/gpu/drm/i915/i915_dma.c:

    #define MCHBAR_I915 0x44
    #define MCHBAR_I965 0x48

    int reg = INTEL_INFO(dev)->gen >= 4 ? MCHBAR_I965 : MCHBAR_I915;

    if (INTEL_INFO(dev)->gen >= 4)
            pci_read_config_dword(dev_priv->bridge_dev, reg + 4, &temp_hi);
    pci_read_config_dword(dev_priv->bridge_dev, reg, &temp_lo);
    mchbar_addr = ((u64)temp_hi << 32) | temp_lo;

and

    #define DEVEN_REG 0x54

    int mchbar_reg = INTEL_INFO(dev)->gen >= 4 ? MCHBAR_I965 : MCHBAR_I915;

    if (IS_I915G(dev) || IS_I915GM(dev)) {
            pci_read_config_dword(dev_priv->bridge_dev, DEVEN_REG, &temp);
            enabled = !!(temp & DEVEN_MCHBAR_EN);
    } else {
            pci_read_config_dword(dev_priv->bridge_dev, mchbar_reg, &temp);
            enabled = temp & 1;
    }
> 
> - Southbridge (i.e. PCH i.e. ISA bridge) vendor/device ID; some versions of
> the driver identify it by class, some versions identify it by slot (1f.0).

Right. So in drivers/gpu/drm/i915/i915_drv.c, the giant intel_detect_pch
sets the pch_type based on:

    if (pch->vendor == PCI_VENDOR_ID_INTEL) {
            unsigned short id = pch->device & INTEL_PCH_DEVICE_ID_MASK;
            dev_priv->pch_id = id;

            if (id == INTEL_PCH_IBX_DEVICE_ID_TYPE) {

It checks for 0x3b00, 0x1c00, 0x1e00, 0x8c00 and 0x9c00.
The INTEL_PCH_DEVICE_ID_MASK is 0xff00.
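
Put together, the matching boils down to something like the sketch
below. pch_id_is_known() and PCH_DEVICE_ID_MASK are names made up for
this example; the real driver maps each masked id to a pch_type instead
of returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCH_DEVICE_ID_MASK 0xff00u  /* same value as INTEL_PCH_DEVICE_ID_MASK above */

/* True if the masked device id is one of the five ids listed above. */
bool pch_id_is_known(uint16_t device)
{
    static const uint16_t known[] = {
        0x3b00, 0x1c00, 0x1e00, 0x8c00, 0x9c00
    };
    uint16_t id = device & PCH_DEVICE_ID_MASK;
    size_t i;

    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (known[i] == id)
            return true;
    }
    return false;
}
```

Only the upper byte of the device id matters, which is why a stub
bridge with the right vendor and device id is enough to satisfy the
check.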
> 
> To solve the first, make a new machine type, PIIX4-based, and pass through
> the registers you need.  The patch must document _exactly_ why the registers
> are safe to pass.  If they are not reserved on PIIX4, the patch must
> document what the same offsets mean on PIIX4, and why it's sensible to
> assume that firmware for virtual machine will not read/write them.  Bonus
> point for also documenting the same for Q35.

OK. They look to be related to setting up an MBAR, but I don't understand
why it is needed. Hopefully some of the i915 folks CC-ed here can answer.

> 
> Regarding the second, fixing IGD hardware to not rely on chipset magic is a
> no-go, I agree.  I disagree that it's a no-go to define a "backdoor" that
> lets a hypervisor pass the right information to the driver without hacking
> the chipset device model.
> 
> The hardware folks would have to give us a place for a pair of registers
> (something like data/address), and a bit somewhere else that would be always
> 0 on hardware and always 1 if the hypervisor is implementing the pair of
> registers.  This is similar to CPUID, which has the HYPERVISOR bit +
> hypervisor-defined leaves at 0x4000.
> 
> The data/address pair could be in a BAR, in configuration space, in the low
> VGA ports at 0x3c0-0x3df, wherever.  The hypervisor bit can be in the same
> place or somewhere else---again, whatever is convenient for the hardware
> folks.  We just need *one bit* that is known-zero on all hardware, and 8
> bytes in a reserved area.  I don't think it's too hard to find this space,
> and I really, really would like Intel to follow up on a paravirtualized
> backdoor.
> 
> That said, we have the problem of existing guests, so I agree something else
> is needed.
> 
> > a) Two bridges - one 'passthrough' and the legacy ISA bridge
> >that QEMU emulates. Both Linux and Windows are OK with
> >two bridges (even though it is pretty weird).
> 
> This is pretty much the only solution for existing Linux guests that look up
> the southbridge by class.

Right.
> 
> The proposed solution here is to define a new "pci stub" device in QEMU that
> lets you define a do-nothing device with your desired vendor ID, device ID,
> class and optionally subsystem IDs.


> 
> The new machine type (the one that insta

Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Paolo Bonzini

On 02/07/2014 18:13, Ming Lei wrote:
> That must be for generating guest irq, which should have been
> processed as batch easily.


No, guest irqs are generated (with notify_guest) on every I/O completion 
even in 2.0.


Paolo



Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Paolo Bonzini
On 02/07/2014 17:45, Ming Lei wrote:
> The attached debug patch skips aio_notify() if qemu_bh_schedule
> is running from the current aio context, but it looks like there are
> still 120K writes triggered. (Without the patch, 400K can be observed
> in the same test.)

Nice.  Another observation is that after aio_dispatch we'll always
re-evaluate everything (bottom halves, file descriptors and timeouts),
so we can skip the aio_notify if we're inside aio_dispatch.

So what about this untested patch:

diff --git a/aio-posix.c b/aio-posix.c
index f921d4f..a23d85d 100644
--- a/aio-posix.c
+++ b/aio-posix.c
@@ -124,6 +124,9 @@ static bool aio_dispatch(AioContext *ctx)
 AioHandler *node;
 bool progress = false;
 
+/* No need to set the event notifier during aio_notify.  */
+ctx->running++;
+
 /*
  * We have to walk very carefully in case qemu_aio_set_fd_handler is
  * called while we're walking.
@@ -169,6 +171,11 @@ static bool aio_dispatch(AioContext *ctx)
 /* Run our timers */
 progress |= timerlistgroup_run_timers(&ctx->tlg);
 
+smp_wmb();
+ctx->iter_count++;
+smp_wmb();
+ctx->running--;
+
 return progress;
 }
 
diff --git a/async.c b/async.c
index 5b6fe6b..1f56afa 100644
--- a/async.c
+++ b/async.c
@@ -249,7 +249,19 @@ ThreadPool *aio_get_thread_pool(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-event_notifier_set(&ctx->notifier);
+uint32_t iter_count;
+do {
+iter_count = ctx->iter_count;
+/* Read ctx->iter_count before ctx->running.  */
+smp_rmb();
+if (!ctx->running) {
+event_notifier_set(&ctx->notifier);
+return;
+}
+/* Read ctx->running before ctx->iter_count.  */
+smp_rmb();
+/* ctx might have gone to sleep.  */
+} while (iter_count != ctx->iter_count);
 }
 
 static void aio_timerlist_notify(void *opaque)
@@ -269,6 +279,7 @@ AioContext *aio_context_new(void)
 ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
 ctx->pollfds = g_array_new(FALSE, FALSE, sizeof(GPollFD));
 ctx->thread_pool = NULL;
+ctx->iter_count = ctx->running = 0;
 qemu_mutex_init(&ctx->bh_lock);
 rfifolock_init(&ctx->lock, aio_rfifolock_cb, ctx);
 event_notifier_init(&ctx->notifier, false);
diff --git a/include/block/aio.h b/include/block/aio.h
index a92511b..9f51c4f 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -51,6 +51,9 @@ struct AioContext {
 /* Protects all fields from multi-threaded access */
 RFifoLock lock;
 
+/* Used to avoid aio_notify while dispatching event handlers.
+ * Writes protected by lock or BQL, reads are lockless.
+ */
+uint32_t iter_count, running;
+
 /* The list of registered AIO handlers */
 QLIST_HEAD(, AioHandler) aio_handlers;
 

Please review carefully.
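
To make the intent easier to check, here is a small standalone model of the same idea. It is only a sketch under assumptions: the struct and function names are invented, a plain counter stands in for event_notifier_set(), and sequentially consistent C11 atomics replace the explicit barriers, so it says nothing about whether the barrier placement in the patch itself is sufficient.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the AioContext fields touched by the patch. */
struct ctx_sketch {
    atomic_uint iter_count;    /* bumped once at the end of each dispatch pass */
    atomic_uint running;       /* nonzero while a dispatch pass is in progress */
    atomic_uint notify_writes; /* counts what would be event_notifier_set() calls */
};

static void ctx_sketch_init(struct ctx_sketch *c)
{
    atomic_init(&c->iter_count, 0);
    atomic_init(&c->running, 0);
    atomic_init(&c->notify_writes, 0);
}

/* Called on entry to the dispatch pass. */
static void dispatch_begin(struct ctx_sketch *c)
{
    atomic_fetch_add(&c->running, 1);
}

/* Called on exit: publish a new iteration, then drop the running count. */
static void dispatch_end(struct ctx_sketch *c)
{
    atomic_fetch_add(&c->iter_count, 1);
    atomic_fetch_sub(&c->running, 1);
}

/* Returns true if the notifier was written, false if the write was skipped. */
static bool notify_sketch(struct ctx_sketch *c)
{
    unsigned seen;

    do {
        seen = atomic_load(&c->iter_count);
        if (atomic_load(&c->running) == 0) {
            atomic_fetch_add(&c->notify_writes, 1);
            return true;
        }
        /* A dispatch pass was running; retry if it finished meanwhile. */
    } while (seen != atomic_load(&c->iter_count));

    return false;
}
```

The write is skipped only when a dispatch pass was observed running and iter_count did not change across the check, which is the case where the loop is going to re-evaluate everything anyway.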

> So are there still other writes not found in the path?

What do perf or gdb say? :)

Paolo




Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Ming Lei
On Wed, Jul 2, 2014 at 11:45 PM, Ming Lei  wrote:
> On Wed, Jul 2, 2014 at 4:54 PM, Stefan Hajnoczi  wrote:
>> On Tue, Jul 01, 2014 at 06:49:30PM +0200, Paolo Bonzini wrote:
>>> On 01/07/2014 16:49, Ming Lei wrote:
>>> >Let me provide some data when running randread(bs 4k, libaio)
>>> >from VM for 10sec:
>>> >
>>> >1), qemu.git/master
>>> >- write(): 731K
>>> >- rt_sigprocmask(): 417K
>>> >- read(): 21K
>>> >- ppoll(): 10K
>>> >- io_submit(): 5K
>>> >- io_getevents(): 4K
>>> >
>>> >2), qemu 2.0
>>> >- write(): 9K
>>> >- read(): 28K
>>> >- ppoll(): 16K
>>> >- io_submit(): 12K
>>> >- io_getevents(): 10K
>>> >
>>> >>> The sigprocmask can probably be optimized away since the thread's
>>> >>> signal mask remains unchanged most of the time.
>>> >>>
>>> >>> I'm not sure what is causing the write().
>>> >I am investigating it...
>>>
>>> I would guess sigprocmask is getcontext (from qemu_coroutine_new) and write
>>> is aio_notify (from qemu_bh_schedule).
>>
>> Aha!  We shouldn't be executing qemu_coroutine_new() very often since we
>> try to keep a freelist of coroutines.
>>
>> I think a tweak to the freelist could make the rt_sigprocmask() calls go
>> away since we should be reusing coroutines instead of allocating/freeing
>> them all the time.
>>
>>> Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
>>> that bypasses coroutines in the common case of no I/O throttling, no
>>> copy-on-write, etc.
>>
>> I tried that in 2012 and couldn't measure an improvement above the noise
>> threshold, although it was without dataplane.
>>
>> BTW, we cannot eliminate the BH because the block layer guarantees that
>> callbacks are not invoked with reentrancy.  They are always invoked
>> directly from the event loop through a BH.  This simplifies callers
>> since they don't need to worry about callbacks happening while they are
>> still in bdrv_aio_readv(), for example.
>>
>> Removing this guarantee (by making callers safe first) is orthogonal to
>> coroutines.  But it's hard to do since it requires auditing a lot of
>> code.
>>
>> Another idea is to skip aio_notify() when we're sure the event loop
>> isn't blocked in g_poll().  Doing this in a thread-safe and lockless way
>> might be tricky though.
>
> The attached debug patch skips aio_notify() if qemu_bh_schedule
> is running from the current aio context, but it looks like there are
> still 120K writes triggered. (Without the patch, 400K can be observed
> in the same test.)
>
> So are there still other writes not found in the path?

That must be for generating guest irq, which should have been
processed as batch easily.


Thanks,
-- 
Ming Lei



Re: [Qemu-devel] [Xen-devel] [v5][PATCH 0/5] xen: add Intel IGD passthrough support

2014-07-02 Thread Konrad Rzeszutek Wilk
On Wed, Jul 02, 2014 at 05:08:43PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jul 02, 2014 at 10:00:33AM -0400, Konrad Rzeszutek Wilk wrote:
> > On Wed, Jul 02, 2014 at 01:33:09PM +0200, Paolo Bonzini wrote:
> > > On 01/07/2014 19:39, Ross Philipson wrote:
> > > >
> > > >We do IGD pass-through in our project (XenClient). The patches
> > > >originally came from our project. We surface the same ISA bridge and
> > > >have never had activation issues on any version of Windows from XP to
> > > >Win8. We do not normally run server platforms so I can't say for sure
> > > >there.
> > > 
> > > The problem is not activation, the problem is that the patches are making
> > > assumptions on the driver and the firmware that might work today but are
> > > IMHO just not sane.
> > > 
> > > I would have no problem with a clean patchset that adds a new machine type
> > > and doesn't touch code in "-M pc", but it looks like mst disagrees.
> > > Ultimately, if a patchset is too hacky for upstream, you can include it in
> > > your downstream XenClient (and XenServer) QEMU branch.  It happens.
> > 
> > And then this discussion will come back again in a year when folks
> > rebase and ask: Why hasn't this been done upstream.
> > 
> > Then the discussion resumes ..
> > 
> > With this long thread I lost a bit of context about the challenges
> > that exist. But let me try summarizing them here, which will hopefully
> > get us some consensus.
> 
> Before I answer could you clarify please:
> by Southbridge do you mean the PCH at slot 1f or the MCH at slot 0 or both?

MCH slot. We read from and write to a couple of its registers (see
intel_setup_mchbar): 0x44 and 0x48 if gen >= 4, otherwise 0x54. The slot is
hard-coded in i915_get_bridge_dev (see ec2a4c3fdc8e82fe82a25d800e85c1ea06b74372)
as 0:0.0 BDF.

For the PCH (it does not matter where it sits) we only use the model:vendor id
to figure out the pch_type (see intel_detect_pch).

I don't see why that model:vendor_id can't be exposed via checking the
type of device:vendor_id of the IGD itself. CC-ing some Intel i915 authors.

So for the discussion here, when I say Southbridge I mean MCH.
> 
> > 1). Fix IGD hardware to not use Southbridge magic addresses.
> > We can moan and moan but I doubt it is going to change.
> > 
> > 2). Since we need the Southbridge magic addresses, we can expose
> > a bridge. [I think everybody agrees that we need to do
> > that since 1) is a no-go.]
> > 
> > 3). What kind of bridge. We can do:
> > 
> >  a) Two bridges - one 'passthrough' and the legacy ISA bridge
> > that QEMU emulates. Both Linux and Windows are OK with
> > two bridges (even though it is pretty weird).
> > 
> >  b) One bridge - the one that QEMU emulates - and let's emulate
> > more of the registers (by emulate - I mean for some get the
> > data from the real hardware).
> > 
> >b1). We can't use the legacy because the registers are
> > above 256 (is that correct? Did I miss something?)
> > 
> >b2)  We would need to use the Q35.
> > b2a). If we need Q35, that needs to be exposed in
> >   for Xen guests. That means exposing the 
> >   MMCONFIG and restructuring the E820 to fit that
> >   in.
> >   Problem:
> > - Migration is not working with Q35.
> >   (But for v1 you wouldn't migrate, however
> >later hardware will surely have SR-IOV so
> >we will need to migrate).
> > 
> > - There are no developers who have an OK
> >   from their management to focus on this.
> >(Potential solution: Poke Intel management to see
> > if they can get more developers on it)
> >   
> > 
> > 4). Code does a bit of sysfs that could use some refactoring with
> > the KVM code.
> > Problem: More time needed to do the code restructuring.
> > 
> > 
> > Is that about correct?
> > 
> > What are folks' timezones and the best days next week to talk about
> > this on either Google Hangout or the phone?



Re: [Qemu-devel] [PATCH v2 2.1 1/3] blockjob: Fix recent BLOCK_JOB_READY regression

2014-07-02 Thread Eric Blake
On 07/02/2014 09:03 AM, Paolo Bonzini wrote:

>>>> What if an underlying device doesn't support [rw]error=stop?  Not all
>>>> do...
>>>
>>> Then the "fix" is to add support to the underlying device.  IDE, SCSI
>>> and virtio-blk (plus virtio-scsi via SCSI of course) are covered;
>>
>> Where "covered" means "device model calls bdrv_error_action() somewhere"
>> rather than "device model calls bdrv_error_action() exactly when it
>> should".
>>
>> Case in point: SCSI calls it when UNMAP fails, but IDE doesn't call it
>> when TRIM fails.  IDE and virtio-blk call it for I/O beyond the end of
>> the medium, but SCSI doesn't.
>>
>> This is of course fixable.  I'm working on it.
>>
>>>   the
>>> main one that's left out is SD.
>>
>> Qdevified devices with a qdev_prop_drive: isa-fdc, sysbus-fdc,
>> SUNW,fdtwo, nand, onenand, cfi.pflash01, cfi.pflash02, spapr-nvram,
>> scsi-generic, nvme.  SD isn't in this list, because it still hasn't been
>> qdevified.  There may be more.
> 
> I think there is a page with unfinished transition.  Can you add this one?

Listed: http://wiki.qemu.org/CodeTransitions#Reliable_block_job_polling

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-devel] [RFC] alpha qemu arithmetic exceptions

2014-07-02 Thread Al Viro
On Wed, Jul 02, 2014 at 08:26:53AM -0700, Richard Henderson wrote:
> On 07/01/2014 11:17 PM, Al Viro wrote:
> > If we don't want FE_INEXACT seen by fetestexcept() after rounding 4.5, we'd
> > better not use FPCR.INE - *all* variants of actual hardware (at least from
> > 21064A to 21264) set that sucker, and 4.7 in Architecture Reference Manual
> > very clearly requires such behaviour for any subset that isn't completely
> > without floating point support.
> 
> Um, where do you see that?  I see:
> 
> # 4.7.6.4 IEEE-Compliant Arithmetic Without Inexact Exception
> # This model is similar to the model in Section 4.7.6.3, except this
> # model does not signal inexact results either by the inexact status
> # flag or by trapping. [...] This model is implemented by using IEEE
> # floating-point instructions with the /SU or /SV trap qualifiers.
> 
> The important words to me being "does not signal" and "inexact status flag".
> 
> Thus in sysdeps/alpha/fpu/s_nearbyint.c I explicitly use cvttq/svd and not
> cvttq/svid.  By my reading that means no inexact shall be raised.

What does that have to do with exceptions?  cvttq/svd is not going to raise
one; it *does* set that bit in FPCR, though.  What happens afterwards is
that fetestexcept() calls osf_getsysinfo(2) with GSI_IEEE_FP_CONTROL for op.
Which does
w = current_thread_info()->ieee_state & IEEE_SW_MASK;
w = swcr_update_status(w, rdfpcr());
and hands the value of w to caller.  Now, look at swcr_update_status()
(in arch/alpha/include/uapi/asm/fpu.h these days) and note that on 21264
it will throw away the status bits of ->ieee_state and use 6 bits from
FPCR instead.
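
To make the last point concrete, here is a standalone model of what that update amounts to. It is a sketch, not the kernel code: the names are invented, the implementation check is turned into a parameter, and the bit positions (six status bits at 17..22 of the software word, the matching FPCR bits 35 positions higher) are written from memory and should be treated as assumptions to be checked against the header.

```c
#include <stdint.h>

/* Assumed layout: six IEEE status bits at positions 17..22 of the swcr. */
#define SKETCH_IEEE_STATUS_MASK  (0x3fULL << 17)
/* Assumed distance between an FPCR status bit and its swcr counterpart. */
#define SKETCH_FPCR_STATUS_SHIFT 35

/*
 * Model of the status update: when the hardware keeps the status bits
 * (is_ev6 != 0), the software copy is discarded and replaced by the six
 * bits taken from the FPCR; otherwise the software copy is left alone.
 */
static uint64_t sketch_swcr_update_status(uint64_t swcr, uint64_t fpcr, int is_ev6)
{
    if (is_ev6) {
        swcr &= ~SKETCH_IEEE_STATUS_MASK;
        swcr |= (fpcr >> SKETCH_FPCR_STATUS_SHIFT) & SKETCH_IEEE_STATUS_MASK;
    }
    return swcr;
}
```

Under those assumptions, an inexact bit left in the FPCR by the conversion shows up in the word handed back to the caller no matter what ->ieee_state says.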

Note, BTW, that appendix B (IEEE conformance) claims (in B.1) conversions as
hardware-implemented, with "Software routines support remainder, round to
integer in floating-point format, and convert binary to/from decimal" right
next to it.



Re: [Qemu-devel] [regression] dataplane: throughout -40% by commit 580b6b2aa2

2014-07-02 Thread Ming Lei
On Wed, Jul 2, 2014 at 4:54 PM, Stefan Hajnoczi  wrote:
> On Tue, Jul 01, 2014 at 06:49:30PM +0200, Paolo Bonzini wrote:
>> On 01/07/2014 16:49, Ming Lei wrote:
>> >Let me provide some data when running randread(bs 4k, libaio)
>> >from VM for 10sec:
>> >
>> >1), qemu.git/master
>> >- write(): 731K
>> >- rt_sigprocmask(): 417K
>> >- read(): 21K
>> >- ppoll(): 10K
>> >- io_submit(): 5K
>> >- io_getevents(): 4K
>> >
>> >2), qemu 2.0
>> >- write(): 9K
>> >- read(): 28K
>> >- ppoll(): 16K
>> >- io_submit(): 12K
>> >- io_getevents(): 10K
>> >
>> >>> The sigprocmask can probably be optimized away since the thread's
>> >>> signal mask remains unchanged most of the time.
>> >>>
>> >>> I'm not sure what is causing the write().
>> >I am investigating it...
>>
>> I would guess sigprocmask is getcontext (from qemu_coroutine_new) and write
>> is aio_notify (from qemu_bh_schedule).
>
> Aha!  We shouldn't be executing qemu_coroutine_new() very often since we
> try to keep a freelist of coroutines.
>
> I think a tweak to the freelist could make the rt_sigprocmask() calls go
> away since we should be reusing coroutines instead of allocating/freeing
> them all the time.
>
>> Both can be eliminated by introducing a fast path in bdrv_aio_{read,write}v,
>> that bypasses coroutines in the common case of no I/O throttling, no
>> copy-on-write, etc.
>
> I tried that in 2012 and couldn't measure an improvement above the noise
> threshold, although it was without dataplane.
>
> BTW, we cannot eliminate the BH because the block layer guarantees that
> callbacks are not invoked with reentrancy.  They are always invoked
> directly from the event loop through a BH.  This simplifies callers
> since they don't need to worry about callbacks happening while they are
> still in bdrv_aio_readv(), for example.
>
> Removing this guarantee (by making callers safe first) is orthogonal to
> coroutines.  But it's hard to do since it requires auditing a lot of
> code.
>
> Another idea is to skip aio_notify() when we're sure the event loop
> isn't blocked in g_poll().  Doing this in a thread-safe and lockless way
> might be tricky though.

The attached debug patch skips aio_notify() if qemu_bh_schedule
is running from the current aio context, but it looks like there are
still 120K writes triggered. (Without the patch, 400K can be observed
in the same test.)

So are there still other writes not found in the path?


Thanks,
-- 
Ming Lei
diff --git a/async.c b/async.c
index 5b6fe6b..5aa9982 100644
--- a/async.c
+++ b/async.c
@@ -40,6 +40,18 @@ struct QEMUBH {
 bool deleted;
 };
 
+static __thread AioContext *my_ctx = NULL;
+
+AioContext *get_current_aio_context(void)
+{
+return my_ctx;
+}
+
+void set_current_aio_context(AioContext *ctx)
+{
+my_ctx = ctx;
+}
+
 QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
 {
 QEMUBH *bh;
@@ -131,7 +143,9 @@ void qemu_bh_schedule(QEMUBH *bh)
  */
 smp_mb();
 bh->scheduled = 1;
-aio_notify(ctx);
+
+if (get_current_aio_context() != ctx)
+aio_notify(ctx);
 }
 
 
diff --git a/include/block/aio.h b/include/block/aio.h
index a92511b..29f29e2 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -307,4 +307,7 @@ static inline void aio_timer_init(AioContext *ctx,
 timer_init(ts, ctx->tlg.tl[type], scale, cb, opaque);
 }
 
+AioContext *get_current_aio_context(void);
+void set_current_aio_context(AioContext *ctx);
+
 #endif
diff --git a/iothread.c b/iothread.c
index 1fbf9f1..beb32ad 100644
--- a/iothread.c
+++ b/iothread.c
@@ -36,6 +36,8 @@ static void *iothread_run(void *opaque)
 qemu_cond_signal(&iothread->init_done_cond);
 qemu_mutex_unlock(&iothread->init_done_lock);
 
+set_current_aio_context(iothread->ctx);
+
 while (!iothread->stopping) {
 aio_context_acquire(iothread->ctx);
 while (!iothread->stopping && aio_poll(iothread->ctx, true)) {


Re: [Qemu-devel] from which version qemu support clone on rbd

2014-07-02 Thread Brian Jackson
Qemu doesn't handle that level of abstraction. The closest approximation 
you could probably come up with is qemu-img's backing file support for 
qcow2 images.


You should stick to using the rbd tool to create clones of rbd devices. 
Alternatively, use a higher level tool (like openstack, etc) that 
supports this.


--Iggy


On 7/2/2014 10:17 AM, yue wrote:

> Hi all,
> I am looking at qemu 2.0 and do not find any rbd API related to clone.
> Does qemu support this function, and from which version?
> The clone API of rbd is very simple (one API); why does qemu not
> implement it? What is the reason?
> Thanks.






Re: [Qemu-devel] [PATCH 4/6] sysbus: Make devices spawnable via -device

2014-07-02 Thread Alexander Graf


On 02.07.14 08:32, Paolo Bonzini wrote:
> On 01/07/2014 23:49, Alexander Graf wrote:
>> +
>> +static void machine_init_notify(Notifier *notifier, void *data)
>> +{
>> +    Object *machine = qdev_get_machine();
>> +    Object *container;
>> +
>> +    if (object_property_find(machine, "has-dynamic-sysbus", NULL)) {
>> +        /* Our machine can handle dynamic sysbus devices, we're all good */
>> +        return;
>> +    }
>
> Does it need to be a property, or can it simply be a bool in
> MachineClass?

Sure - I'll change it to be a bool in MachineClass and QEMUMachine (I
really don't want to have the qom-machinification also as part of this
patch set).



Alex




[Qemu-devel] from which version qemu support clone on rbd

2014-07-02 Thread yue
Hi all,

I am looking at qemu 2.0 and do not find any rbd API related to clone.
Does qemu support this function, and from which version?
The clone API of rbd is very simple (one API); why does qemu not implement it?
What is the reason?

Thanks.
