date:20150305

Re: [Qemu-devel] [Bug] qemu_coroutine_enter abort and report error "Co-routine re-entered recursively"

2015-03-05 Thread Halsey Pian

 

Qemu version: qemu-2.2.0 release

Platform: x86_64

 

 

From: Halsey Pian [mailto:halsey.p...@gmail.com] 
Sent: March 6, 2015 15:04
To: qemu-devel@nongnu.org
Cc: halsey.p...@gmail.com
Subject: [Qemu-devel][Bug] qemu_coroutine_enter abort and report error 
"Co-routine re-entered recursively"

 

Hi All,

 

I have two threads writing two separate qcow2 files, but after a while the
write aborts in qemu_coroutine_enter and reports the error "Co-routine
re-entered recursively".

 

Qemu should be thread safe, right? It seems that some variables are not
thread safe. Could you take a look at it? Thanks!

 

Call stack:

 

#0 0x75e18989 __GI_raise(sig=sig@entry=6) (../nptl/sysdeps/unix/sysv/linux/raise.c:56)

#1 0x75e1a098 __GI_abort() (abort.c:90)

#2 0x7728c034 qemu_coroutine_enter(co=0x7fffe0004800, opaque=0x0) (qemu-coroutine.c:117)

#3 0x7727df39 bdrv_co_io_em_complete(opaque=0x77fd6ae0, ret=0) (block.c:4847)

#4 0x77270314 thread_pool_completion_bh(opaque=0x7fffe0006ad0) (thread-pool.c:187)

#5 0x7726f873 aio_bh_poll(ctx=0x7fffe0001d00) (async.c:82)

#6 0x7728340b aio_dispatch(ctx=0x7fffe0001d00) (aio-posix.c:137)

#7 0x772837b0 aio_poll(ctx=0x7fffe0001d00, blocking=true) (aio-posix.c:248)

#8 ?? 0x772795a8 in bdrv_prwv_co (bs=0x7fffdc0021c0, 
offset=12071639552, qiov=0x7fffe67fa590, is_write=true, flags=(unknown: 0)) 
(block.c:2703)

#9 ?? 0x7727966a in bdrv_rw_co (bs=0x7fffdc0021c0, 
sector_num=23577421, buf=0x7fffe4629250 
"\234\b\335Ǽ\254\213q\301\366\315=\005oI\301\245=\373\004+2?H\212\025\035+\262\274C;X\301FaP\324\335\061ҝ&Y\316=\347\335\020\365\003goɿ\214\312S=\v2]\373\363C\311\341\334\r5k\346k\204\332\023\264\315陌\230\203J\222u\214\066",
 nb_sectors=128, is_write=true, flags=(unknown: 0)) (block.c:2726)

#10 0x77279758  bdrv_write(bs=0x7fffdc0021c0, sector_num=23577421, 
buf=0x7fffe4629250 
"\234\b\335Ǽ\254\213q\301\366\315=\005oI\301\245=\373\004+2?H\212\025\035+\262\274C;X\301FaP\324\335\061ҝ&Y\316=\347\335\020\365\003goɿ\214\312S=\v2]\373\363C\311\341\334\r5k\346k\204\332\023\264\315陌\230\203J\222u\214\066",
 nb_sectors=128) (block.c:2760)
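
For readers following the trace, here is a minimal sketch of the kind of guard
that produces this abort. It is a simplified model, not QEMU's code: the names
Task, task_enter and root_task are hypothetical, and the sketch returns false
where QEMU prints "Co-routine re-entered recursively" and calls abort(). The
assumption is that entering a coroutine marks it as in use by recording its
caller, so a second enter before the first one returns is detected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a coroutine; not QEMU's Coroutine struct. */
typedef struct Task Task;
struct Task {
    Task *caller;               /* non-NULL while the task is entered */
    void (*body)(Task *co);     /* work done while the task is entered */
    int reentry_rejected;       /* counts rejected re-entry attempts */
};

/* Dummy "current context" used as the recorded caller. */
static Task root_task;

/* Enter a task; refuse (return false) if it is already entered. */
static bool task_enter(Task *co)
{
    if (co->caller != NULL) {
        return false;           /* the real code aborts at this point */
    }
    co->caller = &root_task;    /* mark as entered */
    co->body(co);
    co->caller = NULL;          /* mark as left */
    return true;
}

/* A body that tries to enter its own task again. */
static void reentrant_body(Task *co)
{
    if (!task_enter(co)) {
        co->reentry_rejected++;
    }
}

/* A body that does nothing special. */
static void plain_body(Task *co)
{
    (void)co;
}
```

The marker field in this sketch is plain, unsynchronized state. If two threads
ended up entering the same task, the second could see the marker set by the
first, which would look the same as recursion. Whether that is what happened in
the trace above is not established here.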

 

 

Best Regards

Halsey Pian

Re: [Qemu-devel] 9pfs-proxy: -retval vs errno vs -1

2015-03-05 Thread Aneesh Kumar K.V

Michael Tokarev  writes:

> Another interesting tidbit is in hw/9pfs/virtio-9p-proxy.c.
>
> All filesystem methods use common v9fs_request() function,
> which returns -errno.  So far so good.
>
> Now, *all* places which call this function do this:
>
>     retval = v9fs_request(...);
>     if (retval < 0) {
>         errno = -retval;
>     }
>     return retval;
>
> and *some* do this:
>
>     retval = v9fs_request(...);
>     if (retval < 0) {
>         errno = -retval;
>         retval = -1;
>     }
>     return retval;

We should be able to drop that retval = -1;

-aneesh
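
To make the difference between the two quoted patterns concrete, here is a
small self-contained sketch. fake_request() is a hypothetical stand-in for
v9fs_request(), assumed only to return 0 on success and -errno on failure as
described above; the two wrappers mirror the two quoted snippets and are not
taken from virtio-9p-proxy.c.

```c
#include <errno.h>

/* Hypothetical stand-in for v9fs_request(): 0 on success, -errno on failure. */
static int fake_request(int fail_with)
{
    return fail_with ? -fail_with : 0;
}

/* First pattern: copy the error into errno and keep returning -errno. */
static int call_keep_retval(int fail_with)
{
    int retval = fake_request(fail_with);
    if (retval < 0) {
        errno = -retval;
    }
    return retval;
}

/* Second pattern: copy the error into errno and collapse the result to -1. */
static int call_minus_one(int fail_with)
{
    int retval = fake_request(fail_with);
    if (retval < 0) {
        errno = -retval;
        retval = -1;
    }
    return retval;
}
```

Both wrappers leave the error code in errno, so a caller that tests
retval < 0 and then reads errno behaves the same with either one. Only a
caller that compares against -1 exactly depends on the second form.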

[Qemu-devel] [PATCH 0/6] Clean up ISA dependencies so we make ISA optional to build

2015-03-05 Thread David Gibson

At present, ISA bus support is always included in the build for all
targets.  However these days there are a number of targets that have
never had ISA, and even more where many of the individual machines
don't have ISA.

Unfortunately there are some awkward dependencies in the core code on
ISA, although b19c1c0 "isa: remove isa_mem_base variable" did already
remove one.

This series engages in some yak shaving to make the necessary
dependency cleanups, then makes inclusion of ISA support optional.

Given the date, this is obviously aimed at qemu 2.4, not 2.3.

David Gibson (6):
  Split serial-isa into its own config option
  Remove monitor.c dependency on CONFIG_I8259
  pc: Use MachineClass callbacks for "irq" and "pic" hmp commands
  target-ppc: Convert PReP to machine class
  prep: Use MachineClass callbacks for "irq" and "pic" hmp commands
  Allow ISA bus to be configured out

 default-configs/alpha-softmmu.mak |  1 +
 default-configs/arm-softmmu.mak   |  1 +
 default-configs/i386-softmmu.mak  |  1 +
 default-configs/mips-softmmu.mak  |  1 +
 default-configs/mips64-softmmu.mak|  1 +
 default-configs/mips64el-softmmu.mak  |  1 +
 default-configs/mipsel-softmmu.mak|  1 +
 default-configs/moxie-softmmu.mak |  2 ++
 default-configs/pci.mak   |  1 +
 default-configs/ppc-softmmu.mak   |  1 +
 default-configs/ppc64-softmmu.mak |  1 +
 default-configs/ppcemb-softmmu.mak|  1 +
 default-configs/sh4-softmmu.mak   |  1 +
 default-configs/sh4eb-softmmu.mak |  1 +
 default-configs/sparc-softmmu.mak |  1 +
 default-configs/sparc64-softmmu.mak   |  1 +
 default-configs/unicore32-softmmu.mak |  1 +
 default-configs/x86_64-softmmu.mak|  1 +
 hw/char/Makefile.objs |  3 +-
 hw/i386/pc.c  |  2 ++
 hw/intc/i8259.c   |  4 +--
 hw/isa/Makefile.objs  |  2 +-
 hw/ppc/prep.c | 32 ++--
 include/hw/boards.h   |  2 ++
 include/hw/i386/pc.h  |  4 +--
 monitor.c | 57 ++-
 26 files changed, 95 insertions(+), 30 deletions(-)

-- 
2.1.0

[Qemu-devel] [PATCH 3/6] pc: Use MachineClass callbacks for "irq" and "pic" hmp commands

2015-03-05 Thread David Gibson

Currently PC machine types rely on fallback code in the monitor
implementation to correctly implement these hmp commands.  Now that we have
MachineClass callbacks to control this properly, instantiate them in
pc_generic_machine_class_init().

Since this sets the MachineClass callbacks correctly for all x86 machine
types, we can now remove the TARGET_I386 fallback case from the monitor
code.

Signed-off-by: David Gibson 
---
 hw/i386/pc.c | 2 ++
 monitor.c| 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index b229856..cb48165 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1522,6 +1522,8 @@ static void pc_generic_machine_class_init(ObjectClass *oc, void *data)
 mc->default_display = qm->default_display;
 mc->compat_props = qm->compat_props;
 mc->hw_version = qm->hw_version;
+mc->hmp_info_irq = i8259_hmp_info_irq;
+mc->hmp_info_pic = i8259_hmp_info_pic;
 }
 
 void qemu_register_pc_machine(QEMUMachine *m)
diff --git a/monitor.c b/monitor.c
index ca226a9..30da438 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1078,7 +1078,7 @@ static void hmp_info_pic(Monitor *mon, const QDict *qdict)
 sun4m_hmp_info_pic(mon, qdict);
 #elif defined(TARGET_LM32)
 lm32_hmp_info_pic(mon, qdict);
-#elif defined(TARGET_I386) || defined(TARGET_PPC) || defined(TARGET_MIPS)
+#elif defined(TARGET_PPC) || defined(TARGET_MIPS)
 i8259_hmp_info_pic(mon, qdict);
 #endif
 }
@@ -1100,7 +1100,7 @@ static void hmp_info_irq(Monitor *mon, const QDict *qdict)
 sun4m_hmp_info_irq(mon, qdict);
 #elif defined(TARGET_LM32)
 lm32_hmp_info_irq(mon, qdict);
-#elif defined(TARGET_I386) || defined(TARGET_PPC) || defined(TARGET_MIPS)
+#elif defined(TARGET_PPC) || defined(TARGET_MIPS)
 i8259_hmp_info_irq(mon, qdict);
 #endif
 }
-- 
2.1.0

[Qemu-devel] [PATCH 6/6] Allow ISA bus to be configured out

2015-03-05 Thread David Gibson

Currently, the code to handle the legacy ISA bus is always included in
qemu.  However there are lots of platforms that don't include ISA legacy
devices, and quite a few that have never used ISA legacy devices at all.

This patch allows the ISA bus code to be disabled in the configuration for
platforms where it doesn't make sense.  For now, the default configs are
adjusted to include ISA on all platforms including PCI (since
CONFIG_IDE_CORE which is in pci.mak requires ISA support) and also several
others which include ISA devices.  We may want to pare this down in future.

This patch becomes more useful after b19c1c0 "isa: remove isa_mem_base
variable", since that removes a dependency on isa-bus.c from vga.c.

Signed-off-by: David Gibson 
---
 default-configs/moxie-softmmu.mak | 1 +
 default-configs/pci.mak   | 1 +
 default-configs/sparc-softmmu.mak | 1 +
 default-configs/unicore32-softmmu.mak | 1 +
 hw/isa/Makefile.objs  | 2 +-
 5 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/default-configs/moxie-softmmu.mak b/default-configs/moxie-softmmu.mak
index 7e22863..e00d099 100644
--- a/default-configs/moxie-softmmu.mak
+++ b/default-configs/moxie-softmmu.mak
@@ -1,5 +1,6 @@
 # Default configuration for moxie-softmmu
 
+CONFIG_ISA_BUS=y
 CONFIG_MC146818RTC=y
 CONFIG_SERIAL=y
 CONFIG_SERIAL_ISA=y
diff --git a/default-configs/pci.mak b/default-configs/pci.mak
index 58a2c0a..b082500 100644
--- a/default-configs/pci.mak
+++ b/default-configs/pci.mak
@@ -1,4 +1,5 @@
 CONFIG_PCI=y
+CONFIG_ISA_BUS=y
 CONFIG_VIRTIO_PCI=y
 CONFIG_VIRTIO=y
 CONFIG_USB_UHCI=y
diff --git a/default-configs/sparc-softmmu.mak b/default-configs/sparc-softmmu.mak
index ab796b3..004b0f4 100644
--- a/default-configs/sparc-softmmu.mak
+++ b/default-configs/sparc-softmmu.mak
@@ -1,5 +1,6 @@
 # Default configuration for sparc-softmmu
 
+CONFIG_ISA_BUS=y
 CONFIG_ECC=y
 CONFIG_ESP=y
 CONFIG_ESCC=y
diff --git a/default-configs/unicore32-softmmu.mak b/default-configs/unicore32-softmmu.mak
index de38577..5f6c4a8 100644
--- a/default-configs/unicore32-softmmu.mak
+++ b/default-configs/unicore32-softmmu.mak
@@ -1,4 +1,5 @@
 # Default configuration for unicore32-softmmu
+CONFIG_ISA_BUS=y
 CONFIG_PUV3=y
 CONFIG_PTIMER=y
 CONFIG_PCKBD=y
diff --git a/hw/isa/Makefile.objs b/hw/isa/Makefile.objs
index 9164556..fb37c55 100644
--- a/hw/isa/Makefile.objs
+++ b/hw/isa/Makefile.objs
@@ -1,4 +1,4 @@
-common-obj-y += isa-bus.o
+common-obj-$(CONFIG_ISA_BUS) += isa-bus.o
 common-obj-$(CONFIG_APM) += apm.o
 common-obj-$(CONFIG_I82378) += i82378.o
 common-obj-$(CONFIG_PC87312) += pc87312.o
-- 
2.1.0

[Qemu-devel] [PATCH 5/6] prep: Use MachineClass callbacks for "irq" and "pic" hmp commands

2015-03-05 Thread David Gibson

Currently all ppc targets rely on fallback code in monitor.c to implement
the "irq" and "pic" hmp commands, by calling into the i8259 code.  For the
PReP machine type, which does usually have an ISA bridge and legacy IO,
including an i8259, this patch correctly sets the MachineClass callbacks
to implement those commands properly without the fallback.

In fact PReP is the only ppc machine for which the i8259 implementation
of those hmp commands makes sense.  The other machine types won't typically
have an i8259 at all.  So we can remove the fallback case from the monitor
meaning that other ppc targets will correctly implement those commands
as no-ops.

Signed-off-by: David Gibson 
---
 hw/ppc/prep.c | 2 ++
 monitor.c | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/prep.c b/hw/ppc/prep.c
index dfc8689..b99e87d 100644
--- a/hw/ppc/prep.c
+++ b/hw/ppc/prep.c
@@ -572,6 +572,8 @@ static void prep_machine_class_init(ObjectClass *oc, void *data)
 mc->init = ppc_prep_init;
 mc->max_cpus = MAX_CPUS;
 mc->default_boot_order = "cad";
+mc->hmp_info_irq = i8259_hmp_info_irq;
+mc->hmp_info_pic = i8259_hmp_info_pic;
 }
 
 static const TypeInfo prep_machine_info = {
diff --git a/monitor.c b/monitor.c
index 30da438..3165539 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1078,7 +1078,7 @@ static void hmp_info_pic(Monitor *mon, const QDict *qdict)
 sun4m_hmp_info_pic(mon, qdict);
 #elif defined(TARGET_LM32)
 lm32_hmp_info_pic(mon, qdict);
-#elif defined(TARGET_PPC) || defined(TARGET_MIPS)
+#elif defined(TARGET_MIPS)
 i8259_hmp_info_pic(mon, qdict);
 #endif
 }
@@ -1100,7 +1100,7 @@ static void hmp_info_irq(Monitor *mon, const QDict *qdict)
 sun4m_hmp_info_irq(mon, qdict);
 #elif defined(TARGET_LM32)
 lm32_hmp_info_irq(mon, qdict);
-#elif defined(TARGET_PPC) || defined(TARGET_MIPS)
+#elif defined(TARGET_MIPS)
 i8259_hmp_info_irq(mon, qdict);
 #endif
 }
-- 
2.1.0

[Qemu-devel] [PATCH 4/6] target-ppc: Convert PReP to machine class

2015-03-05 Thread David Gibson

The more commonly used ppc machine types, spapr and newworld Mac, have
already been converted to the newer MachineClass representation, but some
others still use QEMUMachine.

This patch cleans things up slightly, by converting the "prep" machine
type to the new style.

Signed-off-by: David Gibson 
---
 hw/ppc/prep.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/hw/ppc/prep.c b/hw/ppc/prep.c
index 15df7f3..dfc8689 100644
--- a/hw/ppc/prep.c
+++ b/hw/ppc/prep.c
@@ -44,6 +44,8 @@
 #include "exec/address-spaces.h"
 #include "elf.h"
 
+#define TYPE_PREP_MACHINE "PReP-machine"
+
 //#define HARD_DEBUG_PPC_IO
 //#define DEBUG_PPC_IO
 
@@ -561,17 +563,27 @@ static void ppc_prep_init(MachineState *machine)
  graphic_width, graphic_height, graphic_depth);
 }
 
-static QEMUMachine prep_machine = {
-.name = "prep",
-.desc = "PowerPC PREP platform",
-.init = ppc_prep_init,
-.max_cpus = MAX_CPUS,
-.default_boot_order = "cad",
+static void prep_machine_class_init(ObjectClass *oc, void *data)
+{
+MachineClass *mc = MACHINE_CLASS(oc);
+
+mc->name = "prep";
+mc->desc = "PowerPC PREP platform";
+mc->init = ppc_prep_init;
+mc->max_cpus = MAX_CPUS;
+mc->default_boot_order = "cad";
+}
+
+static const TypeInfo prep_machine_info = {
+.name  = TYPE_PREP_MACHINE,
+.parent= TYPE_MACHINE,
+.instance_size = sizeof(MachineState),
+.class_init= prep_machine_class_init,
 };
 
-static void prep_machine_init(void)
+static void prep_machine_register_types(void)
 {
-qemu_register_machine(&prep_machine);
+type_register_static(&prep_machine_info);
 }
 
-machine_init(prep_machine_init);
+type_init(prep_machine_register_types)
-- 
2.1.0

[Qemu-devel] [PATCH 1/6] Split serial-isa into its own config option

2015-03-05 Thread David Gibson

At present, the core device model code for 8250-like serial ports
(serial.c) and the code for serial ports attached to ISA-style legacy IO
(serial-isa.c) are both controlled by the CONFIG_SERIAL variable.

There are lots and lots of embedded platforms that have 8250-like serial
ports but have never had anything resembling ISA legacy IO.  Therefore,
split serial-isa into its own CONFIG_SERIAL_ISA option so it can be
disabled for platforms where it's not appropriate.

For now, I enabled CONFIG_SERIAL_ISA in every default-config where
CONFIG_SERIAL is enabled, excepting microblaze and xtensa, where it's
pretty clear there isn't legacy IO stuff.

Signed-off-by: David Gibson 
---
 default-configs/alpha-softmmu.mak| 1 +
 default-configs/arm-softmmu.mak  | 1 +
 default-configs/i386-softmmu.mak | 1 +
 default-configs/mips-softmmu.mak | 1 +
 default-configs/mips64-softmmu.mak   | 1 +
 default-configs/mips64el-softmmu.mak | 1 +
 default-configs/mipsel-softmmu.mak   | 1 +
 default-configs/moxie-softmmu.mak| 1 +
 default-configs/ppc-softmmu.mak  | 1 +
 default-configs/ppc64-softmmu.mak| 1 +
 default-configs/ppcemb-softmmu.mak   | 1 +
 default-configs/sh4-softmmu.mak  | 1 +
 default-configs/sh4eb-softmmu.mak| 1 +
 default-configs/sparc64-softmmu.mak  | 1 +
 default-configs/x86_64-softmmu.mak   | 1 +
 hw/char/Makefile.objs| 3 ++-
 16 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/default-configs/alpha-softmmu.mak b/default-configs/alpha-softmmu.mak
index 7f6161e..e0d75e3 100644
--- a/default-configs/alpha-softmmu.mak
+++ b/default-configs/alpha-softmmu.mak
@@ -3,6 +3,7 @@
 include pci.mak
 include usb.mak
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_I8254=y
 CONFIG_PCKBD=y
 CONFIG_VGA_CIRRUS=y
diff --git a/default-configs/arm-softmmu.mak b/default-configs/arm-softmmu.mak
index 149ae1b..d862268 100644
--- a/default-configs/arm-softmmu.mak
+++ b/default-configs/arm-softmmu.mak
@@ -7,6 +7,7 @@ CONFIG_ISA_MMIO=y
 CONFIG_NAND=y
 CONFIG_ECC=y
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_PTIMER=y
 CONFIG_SD=y
 CONFIG_MAX7310=y
diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 0b8ce4b..6e9c6c1 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -9,6 +9,7 @@ CONFIG_VGA_CIRRUS=y
 CONFIG_VMWARE_VGA=y
 CONFIG_VMMOUSE=y
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_PARALLEL=y
 CONFIG_I8254=y
 CONFIG_PCSPK=y
diff --git a/default-configs/mips-softmmu.mak b/default-configs/mips-softmmu.mak
index cce2c81..28dee61 100644
--- a/default-configs/mips-softmmu.mak
+++ b/default-configs/mips-softmmu.mak
@@ -9,6 +9,7 @@ CONFIG_VGA_ISA_MM=y
 CONFIG_VGA_CIRRUS=y
 CONFIG_VMWARE_VGA=y
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_PARALLEL=y
 CONFIG_I8254=y
 CONFIG_PCSPK=y
diff --git a/default-configs/mips64-softmmu.mak b/default-configs/mips64-softmmu.mak
index 7a88a08..464e8f1 100644
--- a/default-configs/mips64-softmmu.mak
+++ b/default-configs/mips64-softmmu.mak
@@ -9,6 +9,7 @@ CONFIG_VGA_ISA_MM=y
 CONFIG_VGA_CIRRUS=y
 CONFIG_VMWARE_VGA=y
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_PARALLEL=y
 CONFIG_I8254=y
 CONFIG_PCSPK=y
diff --git a/default-configs/mips64el-softmmu.mak b/default-configs/mips64el-softmmu.mak
index 095de43..1b5d3f6 100644
--- a/default-configs/mips64el-softmmu.mak
+++ b/default-configs/mips64el-softmmu.mak
@@ -9,6 +9,7 @@ CONFIG_VGA_ISA_MM=y
 CONFIG_VGA_CIRRUS=y
 CONFIG_VMWARE_VGA=y
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_PARALLEL=y
 CONFIG_I8254=y
 CONFIG_PCSPK=y
diff --git a/default-configs/mipsel-softmmu.mak b/default-configs/mipsel-softmmu.mak
index 0e25108..ff0e2c6 100644
--- a/default-configs/mipsel-softmmu.mak
+++ b/default-configs/mipsel-softmmu.mak
@@ -9,6 +9,7 @@ CONFIG_VGA_ISA_MM=y
 CONFIG_VGA_CIRRUS=y
 CONFIG_VMWARE_VGA=y
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_PARALLEL=y
 CONFIG_I8254=y
 CONFIG_PCSPK=y
diff --git a/default-configs/moxie-softmmu.mak b/default-configs/moxie-softmmu.mak
index 1a95476..7e22863 100644
--- a/default-configs/moxie-softmmu.mak
+++ b/default-configs/moxie-softmmu.mak
@@ -2,4 +2,5 @@
 
 CONFIG_MC146818RTC=y
 CONFIG_SERIAL=y
+CONFIG_SERIAL_ISA=y
 CONFIG_VGA=y
diff --git a/default-configs/ppc-softmmu.mak b/default-configs/ppc-softmmu.mak
index 4b60e69..c969b5b 100644
--- a/default-configs/ppc-softmmu.mak
+++ b/default-configs/ppc-softmmu.mak
@@ -47,5 +47,6 @@ CONFIG_PLATFORM_BUS=y
 CONFIG_ETSEC=y
 CONFIG_LIBDECNUMBER=y
 # For PReP
+CONFIG_SERIAL_ISA=y
 CONFIG_MC146818RTC=y
 CONFIG_ISA_TESTDEV=y
diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index de71e41..3a1e26a 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -51,6 +51,7 @@ CONFIG_LIBDECNUMBER=y
 CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(and $(CONFIG_PSERIES),$(CONFIG_KVM))
 # For PReP
+CONFIG_SERIAL_ISA=y
 CONFIG_I82378=y
 CONFIG_I8259=y
 CONFIG_I8254=y
diff --git a/default-configs/ppcemb-softmmu.mak

[Qemu-devel] [PATCH 2/6] Remove monitor.c dependency on CONFIG_I8259

2015-03-05 Thread David Gibson

The hmp commands "irq" and "pic" are a bit of a mess.  They're implemented
on a number of targets, but not all.  On sparc32 and LM32 they do target
specific things, but on the remainder (i386, ppc and mips) they call into
the i8259 PIC code.

But really, what these commands do shouldn't be dependent on the target
arch, but on the specific machine that's in use.  On ppc, for example,
the "prep" machine usually does have an ISA bridge with an i8259, but
most of the other machine types have never had an i8259 at all.  Similarly
the sparc specific target would stop working if we ever had a sparc32
machine that wasn't sun4m.

This patch cleans things up by implementing these hmp commands on all
targets via a MachineClass callback.  If the callback is NULL, for now
we fall back to target specific defaults that match the existing behaviour.
The hope is we can remove those later with target specific cleanups.
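
As a rough illustration of the dispatch described above, the following
standalone sketch models a per-machine callback with a fallback used when the
callback is NULL.  All names here (DemoMachineClass, demo_hmp_info_pic and
friends) are hypothetical and only stand in for the real MachineClass and
monitor code; the handlers return strings instead of printing to a Monitor
so the behaviour is easy to check.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for MachineClass. */
typedef struct DemoMachineClass {
    const char *name;
    const char *(*info_pic)(void);   /* NULL means "no machine-specific handler" */
} DemoMachineClass;

/* Machine-specific handler, e.g. for a machine that really has an i8259. */
static const char *demo_i8259_info_pic(void)
{
    return "i8259";
}

/* Stand-in for the target-specific compatibility fallback. */
static const char *demo_fallback_info_pic(void)
{
    return "target-default";
}

/* Monitor-side entry point: prefer the machine's callback, else fall back. */
static const char *demo_hmp_info_pic(const DemoMachineClass *mc)
{
    if (mc->info_pic) {
        return mc->info_pic();
    }
    return demo_fallback_info_pic();
}
```

With this shape, a machine opts in by filling in the callback in its class
init, and the fallback can be deleted once every machine that needs the
command has done so.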

Signed-off-by: David Gibson 
---
 hw/intc/i8259.c  |  4 ++--
 include/hw/boards.h  |  2 ++
 include/hw/i386/pc.h |  4 ++--
 monitor.c| 57 ++--
 4 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/hw/intc/i8259.c b/hw/intc/i8259.c
index 0f5c025..43e90b9 100644
--- a/hw/intc/i8259.c
+++ b/hw/intc/i8259.c
@@ -429,7 +429,7 @@ static void pic_realize(DeviceState *dev, Error **errp)
 pc->parent_realize(dev, errp);
 }
 
-void hmp_info_pic(Monitor *mon, const QDict *qdict)
+void i8259_hmp_info_pic(Monitor *mon, const QDict *qdict)
 {
 int i;
 PICCommonState *s;
@@ -447,7 +447,7 @@ void hmp_info_pic(Monitor *mon, const QDict *qdict)
 }
 }
 
-void hmp_info_irq(Monitor *mon, const QDict *qdict)
+void i8259_hmp_info_irq(Monitor *mon, const QDict *qdict)
 {
 #ifndef DEBUG_IRQ_COUNT
 monitor_printf(mon, "irq statistic code not compiled.\n");
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 3ddc449..214a778 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -111,6 +111,8 @@ struct MachineClass {
 
 HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
DeviceState *dev);
+void (*hmp_info_irq)(Monitor *mon, const QDict *qdict);
+void (*hmp_info_pic)(Monitor *mon, const QDict *qdict);
 };
 
 /**
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 08ab67d..0f376c6 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -121,8 +121,8 @@ qemu_irq *i8259_init(ISABus *bus, qemu_irq parent_irq);
 qemu_irq *kvm_i8259_init(ISABus *bus);
 int pic_read_irq(DeviceState *d);
 int pic_get_output(DeviceState *d);
-void hmp_info_pic(Monitor *mon, const QDict *qdict);
-void hmp_info_irq(Monitor *mon, const QDict *qdict);
+void i8259_hmp_info_pic(Monitor *mon, const QDict *qdict);
+void i8259_hmp_info_irq(Monitor *mon, const QDict *qdict);
 
 /* Global System Interrupts */
 
diff --git a/monitor.c b/monitor.c
index c86a89e..ca226a9 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1064,6 +1064,48 @@ static void hmp_info_history(Monitor *mon, const QDict *qdict)
 }
 }
 
+static void hmp_info_pic(Monitor *mon, const QDict *qdict)
+{
+MachineClass *mc = MACHINE_GET_CLASS(current_machine);
+
+if (mc->hmp_info_pic) {
+(mc->hmp_info_pic)(mon, qdict);
+} else {
+/* FIXME: Backwards compat fallbacks.  These can go away once
+ * we've finished converting to natively using MachineClass,
+ * rather than QEMUMachine */
+#if defined(TARGET_SPARC) && !defined(TARGET_SPARC64)
+sun4m_hmp_info_pic(mon, qdict);
+#elif defined(TARGET_LM32)
+lm32_hmp_info_pic(mon, qdict);
+#elif defined(TARGET_I386) || defined(TARGET_PPC) || defined(TARGET_MIPS)
+i8259_hmp_info_pic(mon, qdict);
+#endif
+}
+}
+
+static void hmp_info_irq(Monitor *mon, const QDict *qdict)
+{
+/* FIXME: The ifdefs can go away once the sun4m and LM32 machines
+ * are converted to use machine classes natively */
+MachineClass *mc = MACHINE_GET_CLASS(current_machine);
+
+if (mc->hmp_info_irq) {
+(mc->hmp_info_irq)(mon, qdict);
+} else {
+/* FIXME: Backwards compat fallbacks.  These can go away once
+ * we've finished converting to natively using MachineClass,
+ * rather than QEMUMachine */
+#if defined(TARGET_SPARC) && !defined(TARGET_SPARC64)
+sun4m_hmp_info_irq(mon, qdict);
+#elif defined(TARGET_LM32)
+lm32_hmp_info_irq(mon, qdict);
+#elif defined(TARGET_I386) || defined(TARGET_PPC) || defined(TARGET_MIPS)
+i8259_hmp_info_irq(mon, qdict);
+#endif
+}
+}
+
 static void hmp_info_cpustats(Monitor *mon, const QDict *qdict)
 {
 CPUState *cpu;
@@ -2661,35 +2703,20 @@ static mon_cmd_t info_cmds[] = {
 .help   = "show the command line history",
 .mhandler.cmd = hmp_info_history,
 },
-#if defined(TARGET_I386) || defined(TARGET_PPC) || defined(TARGET_MIPS) || \
-defined(TARGET_LM32) || (defined(TARGET_SPARC) && !defined(TARGET_SPARC

Re: [Qemu-devel] [PATCH v4 01/10] cpu/apic: drop icc bus/bridge/

2015-03-05 Thread Chen Fan



On 03/06/2015 02:17 AM, Eduardo Habkost wrote:

On Fri, Feb 13, 2015 at 06:25:24PM +0800, Zhu Guihua wrote:

From: Chen Fan 

ICC bus was invented only to provide hotplug capability to
CPU and APIC because at the time being hotplug was available only for
BUS attached devices.

Now this patch is to drop ICC bus impl, and switch to bus-less
CPU+APIC hotplug, handling them in the same manner as pc-dimm.

Signed-off-by: Chen Fan 
Signed-off-by: Zhu Guihua 
---
  hw/i386/kvm/apic.c  | 10 --
  hw/i386/pc.c| 21 +
  hw/i386/pc_piix.c   |  9 +
  hw/i386/pc_q35.c|  9 +
  hw/intc/apic.c  | 16 +++-
  hw/intc/apic_common.c   | 14 +-
  include/hw/i386/apic_internal.h |  6 ++
  include/hw/i386/pc.h|  3 ++-
  target-i386/cpu.c   | 19 +++
  target-i386/cpu.h   |  3 +--
  10 files changed, 43 insertions(+), 67 deletions(-)

What about hw/i386/xen/xen_apic.c:xen_apic_realize()?

   $ make
     CC    x86_64-softmmu/hw/i386/xen/xen_apic.o
   /home/ehabkost/rh/proj/virt/qemu/hw/i386/xen/xen_apic.c: In function ‘xen_apic_realize’:
   /home/ehabkost/rh/proj/virt/qemu/hw/i386/xen/xen_apic.c:44:29: error: ‘APICCommonState’ has no member named ‘io_memory’
        memory_region_init_io(&s->io_memory, OBJECT(s), &xen_apic_io_ops, s,
                                ^
   /home/ehabkost/rh/proj/virt/qemu/rules.mak:57: recipe for target 'hw/i386/xen/xen_apic.o' failed
   make[1]: *** [hw/i386/xen/xen_apic.o] Error 1
   Makefile:169: recipe for target 'subdir-x86_64-softmmu' failed
   make: *** [subdir-x86_64-softmmu] Error 2

Oh, I'm sorry about that. We don't have a Xen environment, so we forgot
Xen ;). I will fix it and rebase our patches onto your x86 tree.

Thanks,
Chen

Re: [Qemu-devel] [PATCH RFC v3 24/27] COLO NIC: Implement NIC checkpoint and failover

2015-03-05 Thread zhanghailiang


On 2015/3/6 1:12, Dr. David Alan Gilbert wrote:

* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:

Signed-off-by: zhanghailiang 
Signed-off-by: Gao feng 
---
  include/net/colo-nic.h |  3 ++-
  migration/colo.c   | 22 ++
  net/colo-nic.c | 19 +++
  3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/include/net/colo-nic.h b/include/net/colo-nic.h
index 67c9807..ddc21cd 100644
--- a/include/net/colo-nic.h
+++ b/include/net/colo-nic.h
@@ -20,5 +20,6 @@ void colo_add_nic_devices(NetClientState *nc);
  void colo_remove_nic_devices(NetClientState *nc);

  int colo_proxy_compare(void);
-
+int colo_proxy_failover(void);
+int colo_proxy_checkpoint(void);
  #endif
diff --git a/migration/colo.c b/migration/colo.c
index 579aabf..874971c 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -94,6 +94,11 @@ static void slave_do_failover(void)
  ;
  }

+if (colo_proxy_failover() != 0) {
+error_report("colo proxy failed to do failover");
+}
+colo_proxy_destroy(COLO_SECONDARY_MODE);




Hi, Dave


I'm not sure if this is the best thing to do on a secondary failover.
If I understand correctly, when it's running, we have:


---+
|br0---eth0
|
  slave +-tun - xt_SECCOLO - br1---eth1
|
---+

what I think that colo-proxy-destroy  is doing is rewiring that as:


---+
| +--br0---eth0
| |
  slave +-tun +  br1---eth1
|
---+



Yes, you got it.


but now we've lost the sequence number adjustment data that
was held in xt_SECCOLO and so you are likely to break existing TCP
connections.



In our tests we didn't run into the 'break existing TCP connections'
situation. We only adjust the sequence number while a connection is being
established; once the connection is established, this data in xt_SECCOLO is
no longer needed ...
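
For context on the concern being discussed, here is a generic sketch of what
a per-connection sequence number adjustment can look like. It is not taken
from xt_SECCOLO: SeqMap and the helper names are hypothetical, and the
assumption is simply that the two sides picked different initial sequence
numbers and that a fixed offset, computed once at connection setup, maps one
numbering onto the other with 32-bit wraparound.

```c
#include <stdint.h>

/* Hypothetical per-connection state: offset between the two numberings. */
typedef struct SeqMap {
    uint32_t delta;   /* primary ISN minus secondary ISN, modulo 2^32 */
} SeqMap;

/* Compute the offset once, when the connection is set up. */
static SeqMap seq_map_init(uint32_t primary_isn, uint32_t secondary_isn)
{
    SeqMap m = { primary_isn - secondary_isn };
    return m;
}

/* Translate a sequence number from the secondary's numbering to the primary's. */
static uint32_t seq_to_primary(const SeqMap *m, uint32_t secondary_seq)
{
    return secondary_seq + m->delta;
}

/* Translate in the opposite direction. */
static uint32_t seq_to_secondary(const SeqMap *m, uint32_t primary_seq)
{
    return primary_seq - m->delta;
}
```

In a scheme like this the offset would be needed for every later packet of
the connection, not only at setup, which is the situation Dave's question is
about. Whether the real proxy works this way is exactly the open point in
this thread.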


Also, I don't think colo-proxy-script is passed a flag to let it
know whether the reason it's doing a slave_uninstall is due to
a failover or a simple shutdown; and so it assumes it has
to do the rewire for a failover.
(Actually the script in the qemu repo is newer than the script in
the colo-proxy repo, that one doesn't have the rewire at all).



You are right, we should distinguish between shutdown and failover in
slave_uninstall. Actually, using a script to do this work may not be
appropriate; we are trying to fix the net-related part.

Thanks,
zhanghailiang

Dave


+
  colo = NULL;

  if (!autostart) {
@@ -115,7 +120,7 @@ static void master_do_failover(void)
  if (!colo_runstate_is_stopped()) {
  vm_stop_force_state(RUN_STATE_COLO);
  }
-
+colo_proxy_destroy(COLO_PRIMARY_MODE);
  if (s->state != MIG_STATE_ERROR) {
  migrate_set_state(s, MIG_STATE_COLO, MIG_STATE_COMPLETED);
  }
@@ -245,6 +250,11 @@ static int do_colo_transaction(MigrationState *s, QEMUFile *control)

  qemu_fflush(trans);

+ret = colo_proxy_checkpoint();
+if (ret < 0) {
+goto out;
+}
+
  ret = colo_ctl_put(s->file, COLO_CHECKPOINT_SEND);
  if (ret < 0) {
  goto out;
@@ -387,8 +397,6 @@ out:
  qemu_bh_schedule(s->cleanup_bh);
  qemu_mutex_unlock_iothread();

-colo_proxy_destroy(COLO_PRIMARY_MODE);
-
  return NULL;
  }

@@ -508,6 +516,12 @@ void *colo_process_incoming_checkpoints(void *opaque)
  goto out;
  }

+ret = colo_proxy_checkpoint();
+if (ret < 0) {
+goto out;
+}
+DPRINTF("proxy begin to do checkpoint\n");
+
  ret = colo_ctl_get(f, COLO_CHECKPOINT_SEND);
  if (ret < 0) {
  goto out;
@@ -584,6 +598,7 @@ out:
  * just kill slave
  */
  error_report("SVM is going to exit!");
+colo_proxy_destroy(COLO_SECONDARY_MODE);
  exit(1);
  } else {
  /* if we went here, means master may dead, we are doing failover */
@@ -610,6 +625,5 @@ out:

  loadvm_exit_colo();

-colo_proxy_destroy(COLO_SECONDARY_MODE);
  return NULL;
  }
diff --git a/net/colo-nic.c b/net/colo-nic.c
index 563d661..02a454d 100644
--- a/net/colo-nic.c
+++ b/net/colo-nic.c
@@ -379,6 +379,25 @@ void colo_proxy_destroy(int side)
  cp_info.index = -1;
  colo_nic_side = -1;
  }
+
+int colo_proxy_failover(void)
+{
+if (colo_proxy_send(NULL, 0, COLO_FAILOVER) < 0) {
+return -1;
+}
+
+return 0;
+}
+
+int colo_proxy_checkpoint(void)
+{
+if (colo_proxy_send(NULL, 0, COLO_CHECKPOINT) < 0) {
+return -1;
+}
+
+return 0;
+}
+
  /*
  do checkpoint: return 1
  error: return -1
--
1.7.12.4



--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

.

Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo

2015-03-05 Thread zhanghailiang


On 2015/3/6 9:48, zhanghailiang wrote:

On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:

From: "Dr. David Alan Gilbert" 


Hi Dave,



Hi,
   I'm getting COLO running on a couple of our machines here
and wanted to see what was actually going on, so I merged
in my recent rolling-stats code:

http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html

with the following patch, and now I get on the primary side,
info migrate shows me:

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off 
colo: on
Migration status: colo
total time: 0 milliseconds
colo checkpoint (ms): Min/Max: 0, 1 Mean: -1.1415868e-13 (Weighted: 
4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 
0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 
0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) 
Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 
62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 
62@1425561742681, 61@1425561742743, 80@1425561742824
colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 
127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 
227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 
96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 
55294@1425561742744, 145582@1425561742825

which suggests I've got a problem with the packet comparison; but that's
a separate issue I'll look at.



There is an obvious mistake we made in the proxy: the macro 
'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,


s/IPS_UNTRACKED_BIT/IPS_COLO_TEMPLATE_BIT


so please fix it before running the following test. Sorry for this careless mistake; 
we should have run a full test before posting it. ;)

To be honest, the proxy part on GitHub is not integrated; we cut it down just 
to make it easy to review and understand, so there may be some mistakes.

Thanks,
zhanghailiang



Dave

Dr. David Alan Gilbert (1):
   COLO: Add primary side rolling statistics

  hmp.c | 12 
  include/migration/migration.h |  3 +++
  migration/colo.c  | 15 +++
  migration/migration.c | 30 ++
  qapi-schema.json  | 11 ++-
  5 files changed, 70 insertions(+), 1 deletion(-)

Re: [Qemu-devel] [RFC 0/1] Rolling stats on colo

2015-03-05 Thread zhanghailiang


On 2015/3/5 21:31, Dr. David Alan Gilbert (git) wrote:

From: "Dr. David Alan Gilbert" 


Hi Dave,



Hi,
   I'm getting COLO running on a couple of our machines here
and wanted to see what was actually going on, so I merged
in my recent rolling-stats code:

http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg00648.html

with the following patch, and now I get on the primary side,
info migrate shows me:

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off 
colo: on
Migration status: colo
total time: 0 milliseconds
colo checkpoint (ms): Min/Max: 0, 1 Mean: -1.1415868e-13 (Weighted: 
4.3136025e-158) Count: 4020 Values: 0@1425561742237, 0@1425561742300, 
0@1425561742363, 0@1425561742426, 0@1425561742489, 0@1425561742555, 
0@1425561742618, 0@1425561742681, 0@1425561742743, 0@1425561742824
colo paused time (ms): Min/Max: 55, 2789 Mean: 63.9 (Weighted: 76.243584) 
Count: 4019 Values: 62@1425561742237, 62@1425561742300, 62@1425561742363, 
62@1425561742426, 61@1425561742489, 65@1425561742555, 62@1425561742618, 
62@1425561742681, 61@1425561742743, 80@1425561742824
colo checkpoint size: Min/Max: 18351, 2.1731606e+08 Mean: 150096.4 (Weighted: 
127195.56) Count: 4020 Values: 211246@1425561742238, 186622@1425561742301, 
227662@1425561742364, 219454@1425561742428, 268702@1425561742490, 
96334@1425561742556, 47086@1425561742619, 42982@1425561742682, 
55294@1425561742744, 145582@1425561742825

which suggests I've got a problem with the packet comparison; but that's
a separate issue I'll look at.



There is an obvious mistake we made in the proxy: the macro 
'IPS_UNTRACKED_BIT' in colo-patch-for-kernel.patch should be 14,
so please fix it before running the following test. Sorry for this careless mistake; 
we should have run a full test before posting it. ;)

To be honest, the proxy part on GitHub is not integrated; we cut it down just 
to make it easy to review and understand, so there may be some mistakes.

Thanks,
zhanghailiang



Dave

Dr. David Alan Gilbert (1):
   COLO: Add primary side rolling statistics

  hmp.c | 12 
  include/migration/migration.h |  3 +++
  migration/colo.c  | 15 +++
  migration/migration.c | 30 ++
  qapi-schema.json  | 11 ++-
  5 files changed, 70 insertions(+), 1 deletion(-)

Re: [Qemu-devel] E5-2620v2 - emulation stop error

2015-03-05 Thread Andrey Korolyov

On Fri, Mar 6, 2015 at 1:14 AM, Andrey Korolyov  wrote:
> Hello,
>
> recently I got a couple of shiny new Intel 2620v2s as a future
> replacement for the E5-2620v1, but I have experienced relatively many
> emulation errors; all traces look similar to the one below. I am
> running qemu-2.1 on x86 on top of the 3.10 branch for testing purposes but
> can switch to other versions if necessary. Most of the crashes
> happened during a reboot cycle or at the end of an ACPI-based shutdown,
> if that helps. I have no clue what could introduce such
> a mess within the same processor family using identical software, as the
> 2620v1 never had a similar problem. Please let me know if there are any
> additional measures that could make the picture clearer.
>
> Thanks!
>
> KVM internal error. Suberror: 2
> extra data[0]: 80d1
> extra data[1]: 8b0d
> EAX=0003 EBX= ECX= EDX=
> ESI= EDI= EBP= ESP=6cd4
> EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =   9300
> CS =f000 000f  9b00
> SS =   9300
> DS =   9300
> FS =   9300
> GS =   9300
> LDT=   8200
> TR =   8b00
> GDT= 000f6e98 0037
> IDT=  03ff
> CR0=0010 CR2= CR3= CR4=
> DR0= DR1= DR2=
> DR3=
> DR6=0ff0 DR7=0400
> EFER=
> Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb 
> 10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
> b8 00 e0 00 00 8e


It turns out that those errors are introduced by APICv, which gets
enabled due to the different feature set. If anyone is interested in
reproducing/fixing this specifically on 3.10, it takes about one hundred
migrations/power state changes for the issue to appear; the guest OS can be
Linux or Windows.

[Qemu-devel] [PATCH] user-exec.c: fix build on NetBSD/sparc64 and NetBSD/arm

2015-03-05 Thread Tobias Nygren

A couple of #ifdef changes necessary to use NetBSD's ucontext
structs on sparc64 and arm.

Signed-off-by: Tobias Nygren 
---
 user-exec.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/user-exec.c b/user-exec.c
index 1ff8673..8f57e8a 100644
--- a/user-exec.c
+++ b/user-exec.c
@@ -404,6 +404,10 @@ int cpu_signal_handler(int host_signum, void *pinfo,
 struct sigcontext *uc = puc;
 unsigned long pc = uc->sc_pc;
 void *sigmask = (void *)(long)uc->sc_mask;
+#elif defined(__NetBSD__)
+ucontext_t *uc = puc;
+unsigned long pc = _UC_MACHINE_PC(uc);
+void *sigmask = (void *)&uc->uc_sigmask;
 #endif
 #endif
 
@@ -441,15 +445,25 @@ int cpu_signal_handler(int host_signum, void *pinfo,
 
 #elif defined(__arm__)
 
+#if defined(__NetBSD__)
+#include 
+#endif
+
 int cpu_signal_handler(int host_signum, void *pinfo,
void *puc)
 {
 siginfo_t *info = pinfo;
+#if defined(__NetBSD__)
+ucontext_t *uc = puc;
+#else
 struct ucontext *uc = puc;
+#endif
 unsigned long pc;
 int is_write;
 
-#if defined(__GLIBC__) && (__GLIBC__ < 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ <= 3))
+#if defined(__NetBSD__)
+pc = uc->uc_mcontext.__gregs[_REG_R15];
+#elif defined(__GLIBC__) && (__GLIBC__ < 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ <= 3))
 pc = uc->uc_mcontext.gregs[R15];
 #else
 pc = uc->uc_mcontext.arm_pc;
-- 
2.3.0

Re: [Qemu-devel] [PATCH v3 for-2.3 10/24] hw/apci: add _PRT method for extra PCI root busses

2015-03-05 Thread Michael S. Tsirkin

On Thu, Mar 05, 2015 at 11:55:34PM +0200, Marcel Apfelbaum wrote:
> On 03/05/2015 09:52 PM, Michael S. Tsirkin wrote:
> >On Thu, Mar 05, 2015 at 04:55:08PM +0200, Marcel Apfelbaum wrote:
> >>Signed-off-by: Marcel Apfelbaum 
> >>---
> >>  hw/i386/acpi-build.c | 78 
> >> 
> >>  1 file changed, 78 insertions(+)
> >>
> >>diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> >>index e5709e8..f0401d2 100644
> >>--- a/hw/i386/acpi-build.c
> >>+++ b/hw/i386/acpi-build.c
> >>@@ -664,6 +664,83 @@ static void build_append_pci_bus_devices(Aml *parent_scope, PCIBus *bus,
> >>  aml_append(parent_scope, method);
> >>  }
> >>
> >>+static Aml *build_prt(void)
> >>+{
> >>+Aml *method, *pkg, *if_ctx, *while_ctx;
> >>+
> >>+method = aml_method("_PRT", 0);
> >>+
> >>+aml_append(method, aml_store(aml_package(128), aml_local(0)));
> >>+aml_append(method, aml_store(aml_int(0), aml_local(1)));
> >>+while_ctx = aml_while(aml_lless(aml_local(1), aml_int(128)));
> >>+{
> >>+aml_append(while_ctx,
> >>+aml_store(aml_shiftright(aml_local(1), aml_int(2)), aml_local(2)));
> >>+aml_append(while_ctx,
> >>+aml_store(aml_and(aml_add(aml_local(1), aml_local(2)), aml_int(3)),
> >>+  aml_local(3)));
> >>+
> >>+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(0)));
> >>+{
> >>+pkg = aml_package(4);
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_name("LNKD"));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> >>+}
> >>+aml_append(while_ctx, if_ctx);
> >>+
> >>+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(1)));
> >>+{
> >>+pkg = aml_package(4);
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_name("LNKA"));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> >>+}
> >>+aml_append(while_ctx, if_ctx);
> >>+
> >>+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(2)));
> >>+{
> >>+pkg = aml_package(4);
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_name("LNKB"));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> >>+}
> >>+aml_append(while_ctx, if_ctx);
> >>+
> >>+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(3)));
> >>+{
> >>+pkg = aml_package(4);
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(pkg, aml_name("LNKC"));
> >>+aml_append(pkg, aml_int(0));
> >>+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> >>+}
> >>+aml_append(while_ctx, if_ctx);
> >>+
> >>+aml_append(while_ctx,
> >>+aml_store(aml_or(aml_shiftleft(aml_local(2), aml_int(16)),
> >>+ aml_int(0x)),
> >>+  aml_index(aml_local(4), aml_int(0))));
> >>+aml_append(while_ctx,
> >>+aml_store(aml_and(aml_local(1), aml_int(3)),
> >>+  aml_index(aml_local(4), aml_int(1))));
> >>+aml_append(while_ctx,
> >>+aml_store(aml_local(4), aml_index(aml_local(0), aml_local(1))));
> >>+aml_append(while_ctx, aml_increment(aml_local(1)));
> >>+}
> >>+aml_append(method, while_ctx);
> >>+aml_append(method, aml_return(aml_local(0)));
> >>+
> >>+return method;
> >>+}
> >>+
> >
> >Pls improve readability of this code using comments, sub-functions and
> >local variables.
> It is an exact "copy" of the static AML code we had, which by itself
> wasn't so nice, so there is only so much I can do.

That one has *some* comments at least.
But yes, we can do better.

Duplication of code is also a problem.



> However, I'll try to improve it.
> 
> Thanks,
> Marcel
> 
> 
> >
> >
> >>  static void
> >>  build_ssdt(GArray *table_data, GArray *linker,
> >> AcpiCpuInfo *cpu, AcpiPmInfo *pm, AcpiMiscInfo *misc,
> >>@@ -708,6 +785,7 @@ build_ssdt(GArray *table_data, GArray *linker,
> >>  aml_append(dev, aml_name_decl("_HID", aml_string("PNP0A03")));
> >>  aml_append(dev,
> >>  aml_name_decl("_BBN", aml_int((uint8_t)bus_info->bus)));
> >>+aml_append(dev, build_prt());
> >>  aml_append(scope, dev);
> >>  aml_append(ssdt, scope);
> >>  }
> >>--
> >>2.1.0
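
For readers following the review comment about readability, the routing rule that the quoted build_prt() loop encodes can be sketched in plain C. This is only an illustration of the arithmetic visible in the patch (Local2 = index >> 2, Local3 = (index + Local2) & 3, and the four aml_if() branches); the struct, the function name and the PRT_ANY_FUNCTION constant are hypothetical. In particular, the constant OR'ed into the address is garbled in the archive, so 0xFFFF is an assumption based on the usual ACPI "any function" convention.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical plain-C view of one _PRT package built by the loop. */
struct prt_entry {
    uint32_t address;   /* (slot << 16) | function field */
    uint8_t pin;        /* 0..3 for INTA..INTD */
    const char *link;   /* name of the link device */
};

/* Assumption: the archive shows only "0x"; 0xFFFF is the usual ACPI
 * "any function" value and is used here for illustration only. */
#define PRT_ANY_FUNCTION 0xFFFFu

static struct prt_entry prt_entry_for(unsigned index)
{
    /* Selector order copied from the four aml_if() branches in the patch:
     * 0 -> LNKD, 1 -> LNKA, 2 -> LNKB, 3 -> LNKC. */
    static const char *const links[4] = { "LNKD", "LNKA", "LNKB", "LNKC" };
    struct prt_entry e;
    unsigned slot, sel;

    assert(index < 128);              /* the AML loop runs over 0..127 */
    slot = index >> 2;                /* Local2 in the AML */
    sel = (index + slot) & 3;         /* Local3 in the AML */

    e.address = ((uint32_t)slot << 16) | PRT_ANY_FUNCTION;
    e.pin = (uint8_t)(index & 3);
    e.link = links[sel];
    return e;
}
```

Written this way, the four near-identical if blocks collapse into one table lookup, which may be one way to address the duplication mentioned above.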

[Qemu-devel] [PATCH 3/8] net/dp8393x: always calculate proper checksums

2015-03-05 Thread Hervé Poussineau

Signed-off-by: Hervé Poussineau 
---
 hw/net/dp8393x.c |   12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 4f3e8a2..802f2b0 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -21,16 +21,10 @@
 #include "qemu/timer.h"
 #include "net/net.h"
 #include "hw/mips/mips.h"
+#include 
 
 //#define DEBUG_SONIC
 
-/* Calculate CRCs properly on Rx packets */
-#define SONIC_CALCULATE_RXCRC
-
-#if defined(SONIC_CALCULATE_RXCRC)
-/* For crc32 */
-#include 
-#endif
 
 #ifdef DEBUG_SONIC
 #define DPRINTF(fmt, ...) \
@@ -763,11 +757,7 @@ static ssize_t nic_receive(NetClientState *nc, const uint8_t * buf, size_t size)
 s->regs[SONIC_TRBA0] = s->regs[SONIC_CRBA0];
 
 /* Calculate the ethernet checksum */
-#ifdef SONIC_CALCULATE_RXCRC
 checksum = cpu_to_le32(crc32(0, buf, rx_len));
-#else
-checksum = 0;
-#endif
 
 /* Put packet into RBA */
 DPRINTF("Receive packet at %08x\n", (s->regs[SONIC_CRBA1] << 16) | s->regs[SONIC_CRBA0]);
-- 
1.7.10.4
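
As a side note on what the now-unconditional crc32(0, buf, rx_len) call computes: the sketch below is a standalone bitwise CRC-32 using the reflected IEEE 802.3 polynomial, which is intended to give the same result as the library call used in the patch. It is not the code the patch uses; the function name crc32_ieee is hypothetical and the bit-at-a-time loop is chosen for clarity, not speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and
 * final XOR of all ones). Passing crc = 0 starts a new checksum;
 * passing a previous result continues it over more data. */
static uint32_t crc32_ieee(uint32_t crc, const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift out one bit; XOR in the polynomial if that bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

The receive path in the patch then converts the result with cpu_to_le32() before appending it to the frame, which this sketch leaves out.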

[Qemu-devel] [PATCH 7/8] net/dp8393x: add PROM to store MAC address

2015-03-05 Thread Hervé Poussineau

Signed-off-by: Laurent Vivier 
Signed-off-by: Hervé Poussineau 
---
 hw/mips/mips_jazz.c |1 +
 hw/net/dp8393x.c|   18 ++
 2 files changed, 19 insertions(+)

diff --git a/hw/mips/mips_jazz.c b/hw/mips/mips_jazz.c
index 16a8368..cb33c9c 100644
--- a/hw/mips/mips_jazz.c
+++ b/hw/mips/mips_jazz.c
@@ -280,6 +280,7 @@ static void mips_jazz_init(MachineState *machine,
 qdev_init_nofail(dev);
 sysbus = SYS_BUS_DEVICE(dev);
 sysbus_mmio_map(sysbus, 0, 0x80001000);
+sysbus_mmio_map(sysbus, 1, 0x8000b000);
 sysbus_connect_irq(sysbus, 0, rc4030[4]);
 break;
 } else if (is_help_option(nd->model)) {
diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 53c0cdc..7b658d9 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -25,6 +25,7 @@
 
 //#define DEBUG_SONIC
 
+#define SONIC_PROM_SIZE 0x1000
 
 #ifdef DEBUG_SONIC
 #define DPRINTF(fmt, ...) \
@@ -156,6 +157,7 @@ typedef struct dp8393xState {
 NICConf conf;
 NICState *nic;
 MemoryRegion mmio;
+MemoryRegion prom;
 
 /* Registers */
 uint8_t cam[16][6];
@@ -813,12 +815,15 @@ static void dp8393x_instance_init(Object *obj)
 dp8393xState *s = DP8393X(obj);
 
 sysbus_init_mmio(sbd, &s->mmio);
+sysbus_init_mmio(sbd, &s->prom);
 sysbus_init_irq(sbd, &s->irq);
 }
 
 static void dp8393x_realize(DeviceState *dev, Error **errp)
 {
 dp8393xState *s = DP8393X(dev);
+int i, checksum;
+uint8_t *prom;
 
 address_space_init(&s->as, s->dma_mr, "dp8393x");
 memory_region_init_io(&s->mmio, NULL, &dp8393x_ops, s,
@@ -830,6 +835,19 @@ static void dp8393x_realize(DeviceState *dev, Error **errp)
 
 s->watchdog = timer_new_ns(QEMU_CLOCK_VIRTUAL, dp8393x_watchdog, s);
 s->regs[SONIC_SR] = 0x0004; /* only revision recognized by Linux */
+
+memory_region_init_rom_device(&s->prom, NULL, NULL, NULL,
+  "dp8393x-prom", SONIC_PROM_SIZE, NULL);
+prom = memory_region_get_ram_ptr(&s->prom);
+checksum = 0;
+for (i = 0; i < 6; i++) {
+prom[i] = s->conf.macaddr.a[i];
+checksum += prom[i];
+if (checksum > 0xff) {
+checksum = (checksum + 1) & 0xff;
+}
+}
+prom[7] = 0xff - checksum;
 }
 
 static Property dp8393x_properties[] = {
-- 
1.7.10.4
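
The checksum loop added to dp8393x_realize() can be read in isolation as follows. This is a sketch that copies the arithmetic from the patch into a hypothetical helper, sonic_prom_checksum(), so it can be tried outside QEMU; it makes no claim about what real hardware PROMs contain beyond what the patch itself does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the six MAC bytes; whenever the running sum exceeds 0xff, add one
 * and keep the low byte (same steps as the patch). The value the patch
 * stores in prom[7] is 0xff minus that sum. */
static uint8_t sonic_prom_checksum(const uint8_t mac[6])
{
    unsigned checksum = 0;

    assert(mac != NULL);
    for (size_t i = 0; i < 6; i++) {
        checksum += mac[i];
        if (checksum > 0xff) {
            checksum = (checksum + 1) & 0xff;
        }
    }
    return (uint8_t)(0xff - checksum);
}
```

For example, for the MAC 52:54:00:12:34:56 the running sum reaches 0x142 on the last byte, is folded to 0x43, and the stored byte becomes 0xbc.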

[Qemu-devel] [PATCH 6/8] net/dp8393x: QOM'ify

2015-03-05 Thread Hervé Poussineau

Signed-off-by: Laurent Vivier 
Signed-off-by: Hervé Poussineau 
---
 hw/mips/mips_jazz.c|   12 +--
 hw/net/dp8393x.c   |   83 +---
 include/hw/mips/mips.h |5 ---
 3 files changed, 67 insertions(+), 33 deletions(-)

diff --git a/hw/mips/mips_jazz.c b/hw/mips/mips_jazz.c
index 84fb87d..16a8368 100644
--- a/hw/mips/mips_jazz.c
+++ b/hw/mips/mips_jazz.c
@@ -271,8 +271,16 @@ static void mips_jazz_init(MachineState *machine,
 if (!nd->model)
 nd->model = g_strdup("dp83932");
 if (strcmp(nd->model, "dp83932") == 0) {
-dp83932_init(nd, 0x80001000, 2, get_system_memory(), rc4030[4],
- rc4030_dma_mr);
+qemu_check_nic_model(nd, "dp83932");
+
+dev = qdev_create(NULL, "dp8393x");
+qdev_set_nic_properties(dev, nd);
+qdev_prop_set_uint8(dev, "it_shift", 2);
+qdev_prop_set_ptr(dev, "dma_mr", rc4030_dma_mr);
+qdev_init_nofail(dev);
+sysbus = SYS_BUS_DEVICE(dev);
+sysbus_mmio_map(sysbus, 0, 0x80001000);
+sysbus_connect_irq(sysbus, 0, rc4030[4]);
 break;
 } else if (is_help_option(nd->model)) {
 fprintf(stderr, "qemu: Supported NICs: dp83932\n");
diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 809f493..53c0cdc 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -17,10 +17,10 @@
  * with this program; if not, see .
  */
 
-#include "hw/hw.h"
-#include "qemu/timer.h"
+#include "hw/sysbus.h"
+#include "hw/devices.h"
 #include "net/net.h"
-#include "hw/mips/mips.h"
+#include "qemu/timer.h"
 #include 
 
 //#define DEBUG_SONIC
@@ -139,9 +139,14 @@ do { printf("sonic ERROR: %s: " fmt, __func__ , ## __VA_ARGS__); } while (0)
 #define SONIC_ISR_PINT   0x0800
 #define SONIC_ISR_LCD0x1000
 
+#define TYPE_DP8393X "dp8393x"
+#define DP8393X(obj) OBJECT_CHECK(dp8393xState, (obj), TYPE_DP8393X)
+
 typedef struct dp8393xState {
+SysBusDevice parent_obj;
+
 /* Hardware */
-int it_shift;
+uint8_t it_shift;
 qemu_irq irq;
 #ifdef DEBUG_SONIC
 int irq_level;
@@ -150,7 +155,6 @@ typedef struct dp8393xState {
 int64_t wt_last_update;
 NICConf conf;
 NICState *nic;
-MemoryRegion *address_space;
 MemoryRegion mmio;
 
 /* Registers */
@@ -162,6 +166,7 @@ typedef struct dp8393xState {
 int loopback_packet;
 
 /* Memory access */
+void *dma_mr;
 AddressSpace as;
 } dp8393xState;
 
@@ -771,9 +776,9 @@ static ssize_t dp8393x_receive(NetClientState *nc, const uint8_t * buf,
 return size;
 }
 
-static void dp8393x_reset(void *opaque)
+static void dp8393x_reset(DeviceState *dev)
 {
-dp8393xState *s = opaque;
+dp8393xState *s = DP8393X(dev);
 timer_del(s->watchdog);
 
 s->regs[SONIC_CR] = SONIC_CR_RST | SONIC_CR_STP | SONIC_CR_RXDIS;
@@ -802,33 +807,59 @@ static NetClientInfo net_dp83932_info = {
 .receive = dp8393x_receive,
 };
 
-void dp83932_init(NICInfo *nd, hwaddr base, int it_shift,
-  MemoryRegion *address_space,
-  qemu_irq irq, MemoryRegion *dma_mr)
+static void dp8393x_instance_init(Object *obj)
 {
-dp8393xState *s;
+SysBusDevice *sbd = SYS_BUS_DEVICE(obj);
+dp8393xState *s = DP8393X(obj);
 
-qemu_check_nic_model(nd, "dp83932");
+sysbus_init_mmio(sbd, &s->mmio);
+sysbus_init_irq(sbd, &s->irq);
+}
 
-s = g_malloc0(sizeof(dp8393xState));
+static void dp8393x_realize(DeviceState *dev, Error **errp)
+{
+dp8393xState *s = DP8393X(dev);
+
+address_space_init(&s->as, s->dma_mr, "dp8393x");
+memory_region_init_io(&s->mmio, NULL, &dp8393x_ops, s,
+  "dp8393x", 0x40 << s->it_shift);
+
+s->nic = qemu_new_nic(&net_dp83932_info, &s->conf,
+  object_get_typename(OBJECT(dev)), dev->id, s);
+qemu_format_nic_info_str(qemu_get_queue(s->nic), s->conf.macaddr.a);
 
-s->address_space = address_space;
-address_space_init(&s->as, dma_mr, "dp8393x-dma");
-s->it_shift = it_shift;
-s->irq = irq;
 s->watchdog = timer_new_ns(QEMU_CLOCK_VIRTUAL, dp8393x_watchdog, s);
 s->regs[SONIC_SR] = 0x0004; /* only revision recognized by Linux */
+}
 
-s->conf.macaddr = nd->macaddr;
-s->conf.peers.ncs[0] = nd->netdev;
+static Property dp8393x_properties[] = {
+DEFINE_NIC_PROPERTIES(dp8393xState, conf),
+DEFINE_PROP_PTR("dma_mr", dp8393xState, dma_mr),
+DEFINE_PROP_UINT8("it_shift", dp8393xState, it_shift, 0),
+DEFINE_PROP_END_OF_LIST(),
+};
 
-s->nic = qemu_new_nic(&net_dp83932_info, &s->conf, nd->model, nd->name, s);
+static void dp8393x_class_init(ObjectClass *klass, void *data)
+{
+DeviceClass *dc = DEVICE_CLASS(klass);
 
-qemu_format_nic_info_str(qemu_get_queue(s->nic), s->conf.macaddr.a);
-qemu_register_reset(dp8393x_reset, s);
-dp8393x_reset(s);
+set_bit(DEVICE_CATEGORY_NETWORK, dc->categories)

[Qemu-devel] [PATCH 4/8] net/dp8393x: do not use old_mmio accesses

2015-03-05 Thread Hervé Poussineau

Signed-off-by: Hervé Poussineau 
---
 hw/net/dp8393x.c |  112 ++
 1 file changed, 28 insertions(+), 84 deletions(-)

diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 802f2b0..f86a281 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -473,8 +473,10 @@ static void do_command(dp8393xState *s, uint16_t command)
 do_load_cam(s);
 }
 
-static uint16_t read_register(dp8393xState *s, int reg)
+static uint64_t dp8393x_read(void *opaque, hwaddr addr, unsigned int size)
 {
+dp8393xState *s = opaque;
+int reg = addr >> s->it_shift;
 uint16_t val = 0;
 
 switch (reg) {
@@ -503,14 +505,18 @@ static uint16_t read_register(dp8393xState *s, int reg)
 return val;
 }
 
-static void write_register(dp8393xState *s, int reg, uint16_t val)
+static void dp8393x_write(void *opaque, hwaddr addr, uint64_t data,
+  unsigned int size)
 {
+dp8393xState *s = opaque;
+int reg = addr >> s->it_shift;
+
 DPRINTF("write 0x%04x to reg %s\n", val, reg_names[reg]);
 
 switch (reg) {
 /* Command register */
 case SONIC_CR:
-do_command(s, val);
+do_command(s, data);
 break;
 /* Prevent write to read-only registers */
 case SONIC_CAP2:
@@ -523,36 +529,36 @@ static void write_register(dp8393xState *s, int reg, uint16_t val)
 /* Accept write to some registers only when in reset mode */
 case SONIC_DCR:
 if (s->regs[SONIC_CR] & SONIC_CR_RST) {
-s->regs[reg] = val & 0xbfff;
+s->regs[reg] = data & 0xbfff;
 } else {
 DPRINTF("writing to DCR invalid\n");
 }
 break;
 case SONIC_DCR2:
 if (s->regs[SONIC_CR] & SONIC_CR_RST) {
-s->regs[reg] = val & 0xf017;
+s->regs[reg] = data & 0xf017;
 } else {
 DPRINTF("writing to DCR2 invalid\n");
 }
 break;
 /* 12 lower bytes are Read Only */
 case SONIC_TCR:
-s->regs[reg] = val & 0xf000;
+s->regs[reg] = data & 0xf000;
 break;
 /* 9 lower bytes are Read Only */
 case SONIC_RCR:
-s->regs[reg] = val & 0xffe0;
+s->regs[reg] = data & 0xffe0;
 break;
 /* Ignore most significant bit */
 case SONIC_IMR:
-s->regs[reg] = val & 0x7fff;
+s->regs[reg] = data & 0x7fff;
 dp8393x_update_irq(s);
 break;
 /* Clear bits by writing 1 to them */
 case SONIC_ISR:
-val &= s->regs[reg];
-s->regs[reg] &= ~val;
-if (val & SONIC_ISR_RBE) {
+data &= s->regs[reg];
+s->regs[reg] &= ~data;
+if (data & SONIC_ISR_RBE) {
 do_read_rra(s);
 }
 dp8393x_update_irq(s);
@@ -562,17 +568,17 @@ static void write_register(dp8393xState *s, int reg, uint16_t val)
 case SONIC_REA:
 case SONIC_RRP:
 case SONIC_RWP:
-s->regs[reg] = val & 0xfffe;
+s->regs[reg] = data & 0xfffe;
 break;
 /* Invert written value for some registers */
 case SONIC_CRCT:
 case SONIC_FAET:
 case SONIC_MPT:
-s->regs[reg] = val ^ 0x;
+s->regs[reg] = data ^ 0x;
 break;
 /* All other registers have no special contrainst */
 default:
-s->regs[reg] = val;
+s->regs[reg] = data;
 }
 
 if (reg == SONIC_WT0 || reg == SONIC_WT1) {
@@ -580,6 +586,14 @@ static void write_register(dp8393xState *s, int reg, uint16_t val)
 }
 }
 
+static const MemoryRegionOps dp8393x_ops = {
+.read = dp8393x_read,
+.write = dp8393x_write,
+.impl.min_access_size = 2,
+.impl.max_access_size = 2,
+.endianness = DEVICE_NATIVE_ENDIAN,
+};
+
 static void dp8393x_watchdog(void *opaque)
 {
 dp8393xState *s = opaque;
@@ -597,76 +611,6 @@ static void dp8393x_watchdog(void *opaque)
 dp8393x_update_irq(s);
 }
 
-static uint32_t dp8393x_readw(void *opaque, hwaddr addr)
-{
-dp8393xState *s = opaque;
-int reg;
-
-if ((addr & ((1 << s->it_shift) - 1)) != 0) {
-return 0;
-}
-
-reg = addr >> s->it_shift;
-return read_register(s, reg);
-}
-
-static uint32_t dp8393x_readb(void *opaque, hwaddr addr)
-{
-uint16_t v = dp8393x_readw(opaque, addr & ~0x1);
-return (v >> (8 * (addr & 0x1))) & 0xff;
-}
-
-static uint32_t dp8393x_readl(void *opaque, hwaddr addr)
-{
-uint32_t v;
-v = dp8393x_readw(opaque, addr);
-v |= dp8393x_readw(opaque, addr + 2) << 16;
-return v;
-}
-
-static void dp8393x_writew(void *opaque, hwaddr addr, uint32_t val)
-{
-dp8393xState *s = opaque;
-int reg;
-
-if ((addr & ((1 << s->it_shift) - 1)) != 0) {
-return;
-}
-
-reg = addr >> s-
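
Two pieces of register logic in this patch are easy to check in isolation: the register index derived from the bus address (addr >> it_shift) and the write-1-to-clear handling of SONIC_ISR. The sketch below restates both as small pure functions; the names reg_index() and write_one_to_clear() are hypothetical and only mirror the expressions visible in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Register number for a bus address, as in "reg = addr >> s->it_shift". */
static unsigned reg_index(uint64_t addr, unsigned it_shift)
{
    assert(it_shift < 16);
    return (unsigned)(addr >> it_shift);
}

/* ISR semantics from the SONIC_ISR case: only bits that are currently
 * set can be cleared, and writing 1 to a set bit clears it. Returns the
 * new register value. */
static uint16_t write_one_to_clear(uint16_t current, uint16_t written)
{
    uint16_t cleared = (uint16_t)(written & current);

    return (uint16_t)(current & ~cleared);
}
```

With it_shift = 2, as configured for the Jazz machine in this series, the bus address 0x0c selects register 3.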

[Qemu-devel] [PATCH 8/8] net/dp8393x: add load/save support

2015-03-05 Thread Hervé Poussineau

Signed-off-by: Hervé Poussineau 
---
 hw/net/dp8393x.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 7b658d9..49fa2a8 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -850,6 +850,17 @@ static void dp8393x_realize(DeviceState *dev, Error **errp)
 prom[7] = 0xff - checksum;
 }
 
+static const VMStateDescription vmstate_dp8393x = {
+.name = "dp8393x",
+.version_id = 0,
+.minimum_version_id = 0,
+.fields = (VMStateField []) {
+VMSTATE_BUFFER_UNSAFE(cam, dp8393xState, 0, 16 * 6),
+VMSTATE_UINT16_ARRAY(regs, dp8393xState, 0x40),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static Property dp8393x_properties[] = {
 DEFINE_NIC_PROPERTIES(dp8393xState, conf),
 DEFINE_PROP_PTR("dma_mr", dp8393xState, dma_mr),
@@ -864,6 +875,7 @@ static void dp8393x_class_init(ObjectClass *klass, void *data)
 set_bit(DEVICE_CATEGORY_NETWORK, dc->categories);
 dc->realize = dp8393x_realize;
 dc->reset = dp8393x_reset;
+dc->vmsd = &vmstate_dp8393x;
 dc->props = dp8393x_properties;
 }
 
-- 
1.7.10.4

[Qemu-devel] [PATCH 2/8] rc4030: use AddressSpace and address_space_rw in users

2015-03-05 Thread Hervé Poussineau

Now that rc4030 internally uses an AddressSpace for DMA handling, make its root
memory region public. This is especially useful for the dp8393x netcard, which now
uses well-known QEMU types and methods.

Signed-off-by: Hervé Poussineau 
---
 hw/dma/rc4030.c|   14 --
 hw/mips/mips_jazz.c|6 +++---
 hw/net/dp8393x.c   |   38 ++
 include/hw/mips/mips.h |   10 --
 4 files changed, 29 insertions(+), 39 deletions(-)

diff --git a/hw/dma/rc4030.c b/hw/dma/rc4030.c
index 93a52f6..adea807 100644
--- a/hw/dma/rc4030.c
+++ b/hw/dma/rc4030.c
@@ -763,12 +763,6 @@ static void rc4030_save(QEMUFile *f, void *opaque)
 qemu_put_be32(f, s->itr);
 }
 
-void rc4030_dma_memory_rw(void *opaque, hwaddr addr, uint8_t *buf, int len, int is_write)
-{
-rc4030State *s = opaque;
-address_space_rw(&s->dma_as, addr, buf, len, is_write);
-}
-
 static void rc4030_do_dma(void *opaque, int n, uint8_t *buf, int len, int is_write)
 {
 rc4030State *s = opaque;
@@ -864,9 +858,9 @@ static rc4030_dma *rc4030_allocate_dmas(void *opaque, int n)
 return s;
 }
 
-void *rc4030_init(qemu_irq timer, qemu_irq jazz_bus,
-  qemu_irq **irqs, rc4030_dma **dmas,
-  MemoryRegion *sysmem)
+MemoryRegion *rc4030_init(qemu_irq timer, qemu_irq jazz_bus,
+  qemu_irq **irqs, rc4030_dma **dmas,
+  MemoryRegion *sysmem)
 {
 rc4030State *s;
 int i;
@@ -901,5 +895,5 @@ void *rc4030_init(qemu_irq timer, qemu_irq jazz_bus,
 &s->dma_mrs[i]);
 }
 address_space_init(&s->dma_as, &s->dma_mr, "rc4030_dma");
-return s;
+return &s->dma_mr;
 }
diff --git a/hw/mips/mips_jazz.c b/hw/mips/mips_jazz.c
index ef5dd7d..84fb87d 100644
--- a/hw/mips/mips_jazz.c
+++ b/hw/mips/mips_jazz.c
@@ -135,7 +135,7 @@ static void mips_jazz_init(MachineState *machine,
 CPUMIPSState *env;
 qemu_irq *rc4030, *i8259;
 rc4030_dma *dmas;
-void* rc4030_opaque;
+MemoryRegion *rc4030_dma_mr;
 MemoryRegion *isa_mem = g_new(MemoryRegion, 1);
 MemoryRegion *isa_io = g_new(MemoryRegion, 1);
 MemoryRegion *rtc = g_new(MemoryRegion, 1);
@@ -217,7 +217,7 @@ static void mips_jazz_init(MachineState *machine,
 cpu_mips_clock_init(env);
 
 /* Chipset */
-rc4030_opaque = rc4030_init(env->irq[6], env->irq[3], &rc4030, &dmas,
+rc4030_dma_mr = rc4030_init(env->irq[6], env->irq[3], &rc4030, &dmas,
 address_space);
 memory_region_init_io(dma_dummy, NULL, &dma_dummy_ops, NULL, "dummy_dma", 0x1000);
 memory_region_add_subregion(address_space, 0x8000d000, dma_dummy);
@@ -272,7 +272,7 @@ static void mips_jazz_init(MachineState *machine,
 nd->model = g_strdup("dp83932");
 if (strcmp(nd->model, "dp83932") == 0) {
 dp83932_init(nd, 0x80001000, 2, get_system_memory(), rc4030[4],
- rc4030_opaque, rc4030_dma_memory_rw);
+ rc4030_dma_mr);
 break;
 } else if (is_help_option(nd->model)) {
 fprintf(stderr, "qemu: Supported NICs: dp83932\n");
diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 7ce13d2..4f3e8a2 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -168,8 +168,7 @@ typedef struct dp8393xState {
 int loopback_packet;
 
 /* Memory access */
-void (*memory_rw)(void *opaque, hwaddr addr, uint8_t *buf, int len, int is_write);
-void* mem_opaque;
+AddressSpace as;
 } dp8393xState;
 
 static void dp8393x_update_irq(dp8393xState *s)
@@ -201,7 +200,7 @@ static void do_load_cam(dp8393xState *s)
 
 while (s->regs[SONIC_CDC] & 0x1f) {
 /* Fill current entry */
-s->memory_rw(s->mem_opaque,
+address_space_rw(&s->as,
 (s->regs[SONIC_URRA] << 16) | s->regs[SONIC_CDP],
 (uint8_t *)data, size, 0);
 s->cam[index][0] = data[1 * width] & 0xff;
@@ -220,7 +219,7 @@ static void do_load_cam(dp8393xState *s)
 }
 
 /* Read CAM enable */
-s->memory_rw(s->mem_opaque,
+address_space_rw(&s->as,
 (s->regs[SONIC_URRA] << 16) | s->regs[SONIC_CDP],
 (uint8_t *)data, size, 0);
 s->regs[SONIC_CE] = data[0 * width];
@@ -240,7 +239,7 @@ static void do_read_rra(dp8393xState *s)
 /* Read memory */
 width = (s->regs[SONIC_DCR] & SONIC_DCR_DW) ? 2 : 1;
 size = sizeof(uint16_t) * 4 * width;
-s->memory_rw(s->mem_opaque,
+address_space_rw(&s->as,
 (s->regs[SONIC_URRA] << 16) | s->regs[SONIC_RRP],
 (uint8_t *)data, size, 0);
 
@@ -353,7 +352,7 @@ static void do_transmit_packets(dp8393xState *s)
 (s->regs[SONIC_UTDA] << 16) | s->regs[SONIC_CTDA]);
 size = sizeof(uint16_t) * 6 * width;
 s->regs[SONIC_TTDA] = s->regs[SONIC_CTDA];
-s->memory_rw(s->mem_opaque,
+address_space_rw(&s->as,
 ((s->regs[SONIC_UTDA] << 16) | s->regs[SONIC_TTDA]) +
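
The DMA accesses converted above all build their target address from a pair of 16-bit registers, (upper << 16) | lower, and size the transfer as sizeof(uint16_t) * words * width. A small sketch of those two computations follows; the helper names are hypothetical. The explicit uint32_t cast is a choice made for this sketch, so that the shift is done in an unsigned type even when the upper register has its top bit set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine an upper/lower register pair (e.g. URRA and RRP) into one
 * 32-bit DMA address. */
static uint32_t sonic_dma_addr(uint16_t upper, uint16_t lower)
{
    return ((uint32_t)upper << 16) | lower;
}

/* Size in bytes of a descriptor of 'words' 16-bit fields. In the patch,
 * width is 2 when SONIC_DCR_DW is set and 1 otherwise. */
static size_t sonic_desc_bytes(unsigned words, int data_width_32)
{
    unsigned width = data_width_32 ? 2 : 1;

    assert(words > 0);
    return sizeof(uint16_t) * words * width;
}
```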

[Qemu-devel] [PATCH 1/8] rc4030: create custom DMA address space

2015-03-05 Thread Hervé Poussineau

Add a new memory region in the system address space that holds the DMA address
space definition (the 'translation table'), so we can update the DMA address
space on the fly.

Signed-off-by: Hervé Poussineau 
---
 hw/dma/rc4030.c |  154 ++-
 1 file changed, 117 insertions(+), 37 deletions(-)

diff --git a/hw/dma/rc4030.c b/hw/dma/rc4030.c
index af26632..93a52f6 100644
--- a/hw/dma/rc4030.c
+++ b/hw/dma/rc4030.c
@@ -25,6 +25,7 @@
 #include "hw/hw.h"
 #include "hw/mips/mips.h"
 #include "qemu/timer.h"
+#include "exec/address-spaces.h"
 
 //
 /* debug rc4030 */
@@ -47,6 +48,8 @@ do { fprintf(stderr, "rc4030 ERROR: %s: " fmt, __func__ , ## __VA_ARGS__); } whi
 //
 /* rc4030 emulation */
 
+#define MAX_TL_ENTRIES 512
+
 typedef struct dma_pagetable_entry {
 int32_t frame;
 int32_t owner;
@@ -96,6 +99,11 @@ typedef struct rc4030State
 qemu_irq timer_irq;
 qemu_irq jazz_bus_irq;
 
+MemoryRegion dma_tt; /* translation table */
+MemoryRegion dma_mrs[MAX_TL_ENTRIES]; /* translation aliases */
+MemoryRegion dma_mr; /* whole DMA memory region */
+AddressSpace dma_as;
+
 MemoryRegion iomem_chipset;
 MemoryRegion iomem_jazzio;
 } rc4030State;
@@ -265,6 +273,89 @@ static uint32_t rc4030_readb(void *opaque, hwaddr addr)
 return (v >> (8 * (addr & 0x3))) & 0xff;
 }
 
+static void rc4030_dma_as_update_one(rc4030State *s, int index, uint32_t frame)
+{
+if (index < MAX_TL_ENTRIES) {
+memory_region_set_enabled(&s->dma_mrs[index], false);
+}
+
+if (!frame) {
+return;
+}
+
+if (index >= MAX_TL_ENTRIES) {
+qemu_log_mask(LOG_UNIMP,
+  "rc4030: trying to use too high "
+  "translation table entry %d (max allowed=%d)",
+  index, MAX_TL_ENTRIES);
+return;
+}
+memory_region_set_alias_offset(&s->dma_mrs[index], frame);
+memory_region_set_enabled(&s->dma_mrs[index], true);
+}
+
+static void rc4030_dma_tt_write(void *opaque, hwaddr addr, uint64_t data,
+unsigned int size)
+{
+rc4030State *s = opaque;
+
+/* write memory */
+memcpy(memory_region_get_ram_ptr(&s->dma_tt) + addr, &data, size);
+
+/* update dma address space (only if frame field has been written) */
+if (addr % sizeof(dma_pagetable_entry) == 0) {
+int index = addr / sizeof(dma_pagetable_entry);
+memory_region_transaction_begin();
+rc4030_dma_as_update_one(s, index, (uint32_t)data);
+memory_region_transaction_commit();
+}
+}
+
+static const MemoryRegionOps rc4030_dma_tt_ops = {
+.write = rc4030_dma_tt_write,
+.impl.min_access_size = 4,
+.impl.max_access_size = 4,
+};
+
+static void rc4030_dma_tt_update(rc4030State *s, uint32_t new_tl_base,
+ uint32_t new_tl_limit)
+{
+int entries, i;
+dma_pagetable_entry *dma_tl_contents;
+
+if (s->dma_tl_limit) {
+/* write old dma tl table to physical memory */
+memory_region_del_subregion(get_system_memory(), &s->dma_tt);
+cpu_physical_memory_write(s->dma_tl_limit & 0x7fff,
+  memory_region_get_ram_ptr(&s->dma_tt),
+  s->dma_tl_limit);
+}
+
+s->dma_tl_base = new_tl_base;
+s->dma_tl_limit = new_tl_limit;
+new_tl_base &= 0x7fff;
+
+if (s->dma_tl_limit) {
+memory_region_init_rom_device(&s->dma_tt, NULL,
+  &rc4030_dma_tt_ops, s, "dma_tt",
+  s->dma_tl_limit, NULL);
+dma_tl_contents = memory_region_get_ram_ptr(&s->dma_tt);
+cpu_physical_memory_read(new_tl_base, dma_tl_contents, s->dma_tl_limit);
+
+memory_region_transaction_begin();
+entries = s->dma_tl_limit / sizeof(dma_pagetable_entry);
+for (i = 0; i < entries; i++) {
+rc4030_dma_as_update_one(s, i, dma_tl_contents[i].frame);
+}
+memory_region_add_subregion(get_system_memory(), new_tl_base,
+&s->dma_tt);
+memory_region_transaction_commit();
+} else {
+memory_region_init(&s->dma_tt, NULL, "dma_tt", 0);
+}
+}
+
+
 static void rc4030_writel(void *opaque, hwaddr addr, uint32_t val)
 {
 rc4030State *s = opaque;
@@ -279,11 +370,11 @@ static void rc4030_writel(void *opaque, hwaddr addr, uint32_t val)
 break;
 /* DMA transl. table base */
 case 0x0018:
-s->dma_tl_base = val;
+rc4030_dma_tt_update(s, val, s->dma_tl_limit);
 break;
 /* DMA transl. table limit */
 case 0x0020:
-s->dma_tl_limit = val;
+rc4030_dma_tt_update(s, s->dma_tl_base, val);
 break;
 /* DMA transl. table invalidated */
 case 0x0028:
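
The write handler in this patch only touches the DMA address space when a write lands on the first word of a translation table entry. A minimal sketch of that address arithmetic follows. The entry layout is not shown in this excerpt, so the 8-byte `tt_entry` with `frame` first is an assumption, and `tt_entry` and `tt_frame_index` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: the excerpt only shows that an entry has a `frame` field. */
typedef struct tt_entry {
    uint32_t frame;
    uint32_t owner;
} tt_entry;

/*
 * Map a byte offset into the translation table to an entry index.
 * Returns the index when the offset is the start of an entry (its frame
 * field), or -1 when the write hits some other part of an entry.
 */
static int tt_frame_index(uint64_t addr)
{
    assert(sizeof(tt_entry) == 8);   /* holds for the assumed layout */

    if (addr % sizeof(tt_entry) != 0) {
        return -1;
    }
    return (int)(addr / sizeof(tt_entry));
}
```

With this layout a 4-byte write at offset 8 updates entry 1, while a write at offset 12 only changes the stored table and leaves the address space alone.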

[Qemu-devel] [PATCH 0/8] net/dp8393x improvements

2015-03-05 Thread Hervé Poussineau

Hi,

This patchset brings the dp8393x network card emulation up to current QEMU
standards, mostly decouples it from the MIPS rc4030 chipset emulation, and adds
PROM and load/save functionality.
Only the required cleanup has been done on the rc4030 side.

Patchset has been tested on MIPS Jazz emulation and on (yet unpublished)
m68k Quadra 800 emulation.

I expect these patches to go through a MIPS tree, as rc4030 and dp8393x are
currently only used in MIPS Jazz emulation.

Hervé Poussineau (8):
  rc4030: create custom DMA address space
  rc4030: use AddressSpace and address_space_rw in users
  net/dp8393x: always calculate proper checksums
  net/dp8393x: do not use old_mmio accesses
  net/dp8393x: use dp8393x_ prefix for all functions
  net/dp8393x: QOM'ify
  net/dp8393x: add PROM to store MAC address
  net/dp8393x: add load/save support

 hw/dma/rc4030.c|  166 ---
 hw/mips/mips_jazz.c|   17 ++-
 hw/net/dp8393x.c   |  343 
 include/hw/mips/mips.h |   13 +-
 4 files changed, 305 insertions(+), 234 deletions(-)

-- 
1.7.10.4

[Qemu-devel] [PATCH 5/8] net/dp8393x: use dp8393x_ prefix for all functions

2015-03-05 Thread Hervé Poussineau

Signed-off-by: Hervé Poussineau 
---
 hw/net/dp8393x.c |   80 --
 1 file changed, 41 insertions(+), 39 deletions(-)

diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index f86a281..809f493 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -183,7 +183,7 @@ static void dp8393x_update_irq(dp8393xState *s)
 qemu_set_irq(s->irq, level);
 }
 
-static void do_load_cam(dp8393xState *s)
+static void dp8393x_do_load_cam(dp8393xState *s)
 {
 uint16_t data[8];
 int width, size;
@@ -225,7 +225,7 @@ static void do_load_cam(dp8393xState *s)
 dp8393x_update_irq(s);
 }
 
-static void do_read_rra(dp8393xState *s)
+static void dp8393x_do_read_rra(dp8393xState *s)
 {
 uint16_t data[8];
 int width, size;
@@ -265,7 +265,7 @@ static void do_read_rra(dp8393xState *s)
 s->regs[SONIC_CR] &= ~SONIC_CR_RRRA;
 }
 
-static void do_software_reset(dp8393xState *s)
+static void dp8393x_do_software_reset(dp8393xState *s)
 {
 timer_del(s->watchdog);
 
@@ -273,7 +273,7 @@ static void do_software_reset(dp8393xState *s)
 s->regs[SONIC_CR] |= SONIC_CR_RST | SONIC_CR_RXDIS;
 }
 
-static void set_next_tick(dp8393xState *s)
+static void dp8393x_set_next_tick(dp8393xState *s)
 {
 uint32_t ticks;
 int64_t delay;
@@ -289,7 +289,7 @@ static void set_next_tick(dp8393xState *s)
 timer_mod(s->watchdog, s->wt_last_update + delay);
 }
 
-static void update_wt_regs(dp8393xState *s)
+static void dp8393x_update_wt_regs(dp8393xState *s)
 {
 int64_t elapsed;
 uint32_t val;
@@ -304,33 +304,33 @@ static void update_wt_regs(dp8393xState *s)
 val -= elapsed / 500;
 s->regs[SONIC_WT1] = (val >> 16) & 0xffff;
 s->regs[SONIC_WT0] = (val >> 0)  & 0xffff;
-set_next_tick(s);
+dp8393x_set_next_tick(s);
 
 }
 
-static void do_start_timer(dp8393xState *s)
+static void dp8393x_do_start_timer(dp8393xState *s)
 {
 s->regs[SONIC_CR] &= ~SONIC_CR_STP;
-set_next_tick(s);
+dp8393x_set_next_tick(s);
 }
 
-static void do_stop_timer(dp8393xState *s)
+static void dp8393x_do_stop_timer(dp8393xState *s)
 {
 s->regs[SONIC_CR] &= ~SONIC_CR_ST;
-update_wt_regs(s);
+dp8393x_update_wt_regs(s);
 }
 
-static void do_receiver_enable(dp8393xState *s)
+static void dp8393x_do_receiver_enable(dp8393xState *s)
 {
 s->regs[SONIC_CR] &= ~SONIC_CR_RXDIS;
 }
 
-static void do_receiver_disable(dp8393xState *s)
+static void dp8393x_do_receiver_disable(dp8393xState *s)
 {
 s->regs[SONIC_CR] &= ~SONIC_CR_RXEN;
 }
 
-static void do_transmit_packets(dp8393xState *s)
+static void dp8393x_do_transmit_packets(dp8393xState *s)
 {
 NetClientState *nc = qemu_get_queue(s->nic);
 uint16_t data[12];
@@ -439,12 +439,12 @@ static void do_transmit_packets(dp8393xState *s)
 dp8393x_update_irq(s);
 }
 
-static void do_halt_transmission(dp8393xState *s)
+static void dp8393x_do_halt_transmission(dp8393xState *s)
 {
 /* Nothing to do */
 }
 
-static void do_command(dp8393xState *s, uint16_t command)
+static void dp8393x_do_command(dp8393xState *s, uint16_t command)
 {
 if ((s->regs[SONIC_CR] & SONIC_CR_RST) && !(command & SONIC_CR_RST)) {
 s->regs[SONIC_CR] &= ~SONIC_CR_RST;
@@ -454,23 +454,23 @@ static void do_command(dp8393xState *s, uint16_t command)
 s->regs[SONIC_CR] |= (command & SONIC_CR_MASK);
 
 if (command & SONIC_CR_HTX)
-do_halt_transmission(s);
+dp8393x_do_halt_transmission(s);
 if (command & SONIC_CR_TXP)
-do_transmit_packets(s);
+dp8393x_do_transmit_packets(s);
 if (command & SONIC_CR_RXDIS)
-do_receiver_disable(s);
+dp8393x_do_receiver_disable(s);
 if (command & SONIC_CR_RXEN)
-do_receiver_enable(s);
+dp8393x_do_receiver_enable(s);
 if (command & SONIC_CR_STP)
-do_stop_timer(s);
+dp8393x_do_stop_timer(s);
 if (command & SONIC_CR_ST)
-do_start_timer(s);
+dp8393x_do_start_timer(s);
 if (command & SONIC_CR_RST)
-do_software_reset(s);
+dp8393x_do_software_reset(s);
 if (command & SONIC_CR_RRRA)
-do_read_rra(s);
+dp8393x_do_read_rra(s);
 if (command & SONIC_CR_LCAM)
-do_load_cam(s);
+dp8393x_do_load_cam(s);
 }
 
 static uint64_t dp8393x_read(void *opaque, hwaddr addr, unsigned int size)
@@ -483,7 +483,7 @@ static uint64_t dp8393x_read(void *opaque, hwaddr addr, unsigned int size)
 /* Update data before reading it */
 case SONIC_WT0:
 case SONIC_WT1:
-update_wt_regs(s);
+dp8393x_update_wt_regs(s);
 val = s->regs[reg];
 break;
 /* Accept read to some registers only when in reset mode */
@@ -516,7 +516,7 @@ static void dp8393x_write(void *opaque, hwaddr addr, uint64_t data,
 switch (reg) {
 /* Command register */
 case SONIC_CR:
-do_command(s, data);
+dp8393x_do_command(s, data);
 break;
 /* Prevent

[Qemu-devel] E5-2620v2 - emulation stop error

2015-03-05 Thread Andrey Korolyov

Hello,

Recently I got a couple of shiny new Intel E5-2620v2s as a future
replacement for the E5-2620v1, but I have hit relatively many emulation
errors; all traces look similar to the one below. I am running qemu-2.1
on x86 on top of the 3.10 branch for testing purposes but can switch to
other versions if necessary. Most of the crashes happened during a reboot
cycle or at the end of an ACPI-based shutdown, if that helps. I have no
clue what could introduce such a mess within the same processor family
using identical software, as the 2620v1 never showed a similar problem.
Please let me know if there are any additional measures I can take to
make the picture clearer.

Thanks!

KVM internal error. Suberror: 2
extra data[0]: 80d1
extra data[1]: 8b0d
EAX=0003 EBX= ECX= EDX=
ESI= EDI= EBP= ESP=6cd4
EIP=d3f9 EFL=00010202 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT= 000f6e98 0037
IDT=  03ff
CR0=0010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=48 18 67 8c 00 8c d1 8e d9 66 5a 66 58 66 5d 66 c3 cd 02 cb 
10 cb cd 13 cb cd 15 cb cd 16 cb cd 18 cb cd 19 cb cd 1c cb fa fc 66
b8 00 e0 00 00 8e

[Qemu-devel] [PATCH 09/21] userfaultfd: prevent khugepaged to merge if userfaultfd is armed

2015-03-05 Thread Andrea Arcangeli

If userfaultfd is armed on a certain vma we can't "fill" the holes
with zeroes or we'll break userland on-demand paging. When userfaultfd
is armed, the holes are really missing information (not zeroes) that
userland has to load from the network or elsewhere.

The same issue happens for wrprotected ptes that we can't just convert
into a single writable pmd_trans_huge.

We could however in theory still merge across zeropages if only
VM_UFFD_MISSING is set (so if VM_UFFD_WP is not set)... that could be
slightly improved but it'd be much more complex code for a tiny corner
case.

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5374132..8f1b6a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2145,7 +2145,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 _pte++, address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-   if (++none_or_zero <= khugepaged_max_ptes_none)
+   if (!userfaultfd_armed(vma) &&
+   ++none_or_zero <= khugepaged_max_ptes_none)
continue;
else
goto out;
@@ -2593,7 +2594,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 _pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-   if (++none_or_zero <= khugepaged_max_ptes_none)
+   if (!userfaultfd_armed(vma) &&
+   ++none_or_zero <= khugepaged_max_ptes_none)
continue;
else
goto out_unmap;
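
The effect of the two patched conditions can be sketched outside the kernel as follows. This is only a model under stated assumptions: `vma_model` and `may_collapse` are hypothetical names, and the real code walks ptes rather than counting holes up front.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-vma state the kernel consults. */
struct vma_model {
    bool uffd_armed;
};

/*
 * Model of the scan loop: `holes` is the number of pte_none or zero-page
 * entries met in the range. With userfaultfd armed, the first hole aborts
 * the collapse; otherwise holes are tolerated up to `max_ptes_none`.
 */
static bool may_collapse(const struct vma_model *vma, unsigned holes,
                         unsigned max_ptes_none)
{
    unsigned none_or_zero = 0;

    for (unsigned i = 0; i < holes; i++) {
        if (!vma->uffd_armed && ++none_or_zero <= max_ptes_none) {
            continue;   /* hole accepted, keep scanning */
        }
        return false;   /* corresponds to "goto out" / "goto out_unmap" */
    }
    return true;
}
```

A fully populated range can still be collapsed while userfaultfd is armed; only ranges with holes are refused.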

[Qemu-devel] [PATCH 18/21] userfaultfd: UFFDIO_REMAP uABI

2015-03-05 Thread Andrea Arcangeli

This implements the uABI of UFFDIO_REMAP.

Notably one mode bitflag is also forwarded (and in turn known) by the
lowlevel remap_pages method.

Signed-off-by: Andrea Arcangeli 
---
 include/uapi/linux/userfaultfd.h | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 61251e6..db6e99a 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -19,7 +19,8 @@
 #define UFFD_API_RANGE_IOCTLS  \
((__u64)1 << _UFFDIO_WAKE | \
 (__u64)1 << _UFFDIO_COPY | \
-(__u64)1 << _UFFDIO_ZEROPAGE)
+(__u64)1 << _UFFDIO_ZEROPAGE | \
+(__u64)1 << _UFFDIO_REMAP)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -34,6 +35,7 @@
 #define _UFFDIO_WAKE   (0x02)
 #define _UFFDIO_COPY   (0x03)
 #define _UFFDIO_ZEROPAGE   (0x04)
+#define _UFFDIO_REMAP  (0x05)
 #define _UFFDIO_API(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -50,6 +52,8 @@
  struct uffdio_copy)
 #define UFFDIO_ZEROPAGE_IOWR(UFFDIO, _UFFDIO_ZEROPAGE, \
  struct uffdio_zeropage)
+#define UFFDIO_REMAP   _IOWR(UFFDIO, _UFFDIO_REMAP,\
+ struct uffdio_remap)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -122,4 +126,25 @@ struct uffdio_zeropage {
__s64 wake;
 };
 
+struct uffdio_remap {
+   __u64 dst;
+   __u64 src;
+   __u64 len;
+   /*
+* Especially if used to atomically remove memory from the
+* address space the wake on the dst range is not needed.
+*/
+#define UFFDIO_REMAP_MODE_DONTWAKE ((__u64)1<<0)
+#define UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES  ((__u64)1<<1)
+   __u64 mode;
+
+   /*
+* "remap" and "wake" are written by the ioctl and must be at
+* the end: the copy_from_user will not read the last 16
+* bytes.
+*/
+   __s64 remap;
+   __s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
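
As a rough illustration of the layout rule stated in the struct comment (inputs first, the two result fields last), the sketch below mirrors the proposed structure locally instead of relying on an installed header. The names `uffdio_remap_sketch`, `remap_input_size` and `remap_request_init` are hypothetical, and no ioctl is issued here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the proposed uABI struct, for illustration only. */
struct uffdio_remap_sketch {
    uint64_t dst;
    uint64_t src;
    uint64_t len;
    uint64_t mode;
    int64_t remap;   /* written by the ioctl */
    int64_t wake;    /* written by the ioctl */
};

#define REMAP_MODE_DONTWAKE        (UINT64_C(1) << 0)
#define REMAP_MODE_ALLOW_SRC_HOLES (UINT64_C(1) << 1)

/* Number of bytes that are inputs: everything before the result fields. */
static size_t remap_input_size(void)
{
    return offsetof(struct uffdio_remap_sketch, remap);
}

/* Fill the input part of a request and clear the result fields. */
static void remap_request_init(struct uffdio_remap_sketch *req, uint64_t dst,
                               uint64_t src, uint64_t len, uint64_t mode)
{
    memset(req, 0, sizeof(*req));
    req->dst = dst;
    req->src = src;
    req->len = len;
    req->mode = mode;
}
```

Under this layout the input part is 32 bytes and the trailing 16 bytes hold the results, which matches the "last 16 bytes" wording in the comment.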

Re: [Qemu-devel] [PATCH v3 for-2.3 10/24] hw/apci: add _PRT method for extra PCI root busses

2015-03-05 Thread Marcel Apfelbaum


On 03/05/2015 09:52 PM, Michael S. Tsirkin wrote:

On Thu, Mar 05, 2015 at 04:55:08PM +0200, Marcel Apfelbaum wrote:

Signed-off-by: Marcel Apfelbaum 
---
  hw/i386/acpi-build.c | 78 
  1 file changed, 78 insertions(+)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index e5709e8..f0401d2 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -664,6 +664,83 @@ static void build_append_pci_bus_devices(Aml *parent_scope, PCIBus *bus,
  aml_append(parent_scope, method);
  }

+static Aml *build_prt(void)
+{
+Aml *method, *pkg, *if_ctx, *while_ctx;
+
+method = aml_method("_PRT", 0);
+
+aml_append(method, aml_store(aml_package(128), aml_local(0)));
+aml_append(method, aml_store(aml_int(0), aml_local(1)));
+while_ctx = aml_while(aml_lless(aml_local(1), aml_int(128)));
+{
+aml_append(while_ctx,
+aml_store(aml_shiftright(aml_local(1), aml_int(2)), aml_local(2)));
+aml_append(while_ctx,
+aml_store(aml_and(aml_add(aml_local(1), aml_local(2)), aml_int(3)),
+  aml_local(3)));
+
+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(0)));
+{
+pkg = aml_package(4);
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_name("LNKD"));
+aml_append(pkg, aml_int(0));
+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
+}
+aml_append(while_ctx, if_ctx);
+
+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(1)));
+{
+pkg = aml_package(4);
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_name("LNKA"));
+aml_append(pkg, aml_int(0));
+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
+}
+aml_append(while_ctx, if_ctx);
+
+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(2)));
+{
+pkg = aml_package(4);
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_name("LNKB"));
+aml_append(pkg, aml_int(0));
+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
+}
+aml_append(while_ctx, if_ctx);
+
+if_ctx = aml_if(aml_equal(aml_local(3), aml_int(3)));
+{
+pkg = aml_package(4);
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_int(0));
+aml_append(pkg, aml_name("LNKC"));
+aml_append(pkg, aml_int(0));
+aml_append(if_ctx, aml_store(pkg, aml_local(4)));
+}
+aml_append(while_ctx, if_ctx);
+
+aml_append(while_ctx,
+aml_store(aml_or(aml_shiftleft(aml_local(2), aml_int(16)),
+ aml_int(0xFFFF)),
+  aml_index(aml_local(4), aml_int(0))));
+aml_append(while_ctx,
+aml_store(aml_and(aml_local(1), aml_int(3)),
+  aml_index(aml_local(4), aml_int(1))));
+aml_append(while_ctx,
+aml_store(aml_local(4), aml_index(aml_local(0), aml_local(1))));
+aml_append(while_ctx, aml_increment(aml_local(1)));
+}
+aml_append(method, while_ctx);
+aml_append(method, aml_return(aml_local(0)));
+
+return method;
+}
+


Please improve the readability of this code using comments, sub-functions and
local variables.

It is an exact "copy" of the static AML code we had, which by itself
wasn't so nice, so there is only so much I can do.
However, I'll try to improve it.

Thanks,
Marcel






  static void
  build_ssdt(GArray *table_data, GArray *linker,
 AcpiCpuInfo *cpu, AcpiPmInfo *pm, AcpiMiscInfo *misc,
@@ -708,6 +785,7 @@ build_ssdt(GArray *table_data, GArray *linker,
  aml_append(dev, aml_name_decl("_HID", aml_string("PNP0A03")));
  aml_append(dev,
  aml_name_decl("_BBN", aml_int((uint8_t)bus_info->bus)));
+aml_append(dev, build_prt());
  aml_append(scope, dev);
  aml_append(ssdt, scope);
  }
--
2.1.0
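
For readers following the readability discussion, the routing that the generated _PRT loop encodes can be written as plain C. This is a sketch of the arithmetic only, not proposed patch code: `prt_entry` and `prt_entry_for` are hypothetical names, and the 0xFFFF low word is the ACPI convention for "any function".

```c
#include <assert.h>
#include <stdint.h>

/* One _PRT package: address, pin, link source name (source index is 0). */
struct prt_entry {
    uint32_t address;
    uint8_t pin;
    const char *link;
};

/*
 * Compute entry `idx` (0..127) the same way the AML loop does:
 * slot = idx >> 2, pin = idx & 3, link chosen by (idx + slot) & 3.
 */
static struct prt_entry prt_entry_for(unsigned idx)
{
    static const char *const links[4] = { "LNKD", "LNKA", "LNKB", "LNKC" };
    struct prt_entry e;
    unsigned slot = idx >> 2;

    assert(idx < 128);
    e.address = (slot << 16) | 0xFFFF;
    e.pin = (uint8_t)(idx & 3);
    e.link = links[(idx + slot) & 3];
    return e;
}
```

The table lookup replaces the four near-identical `aml_if` blocks, which is one way the AML builder code could be condensed as well.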

[Qemu-devel] [PATCH v3 for-2.3 15/24] hw/pci: made pci_bus_num a PCIBusClass method

2015-03-05 Thread Marcel Apfelbaum

From: Marcel Apfelbaum 

Refactoring it as a method of PCIBusClass will allow
different implementations for subclasses.

Signed-off-by: Marcel Apfelbaum 
---
 hw/i386/kvm/pci-assign.c |  1 +
 hw/pci/pci.c |  7 ---
 hw/pci/pci_bus.c | 10 ++
 hw/scsi/megasas.c|  1 +
 hw/xen/xen_pt.c  |  1 +
 include/hw/pci/pci.h |  1 -
 include/hw/pci/pci_bus.h |  6 ++
 7 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index 9db7c77..ad573ec 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -35,6 +35,7 @@
 #include "qemu/range.h"
 #include "sysemu/sysemu.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/pci_bus.h"
 #include "hw/pci/msi.h"
 #include "kvm_i386.h"
 
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 196989f..e386f2c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -301,13 +301,6 @@ PCIBus *pci_register_bus(DeviceState *parent, const char *name,
 return bus;
 }
 
-int pci_bus_num(PCIBus *s)
-{
-if (pci_bus_is_root(s))
-return 0;   /* pci host bridge */
-return s->parent_dev->config[PCI_SECONDARY_BUS];
-}
-
 static int get_pci_config_device(QEMUFile *f, void *pv, size_t size)
 {
 PCIDevice *s = container_of(pv, PCIDevice, config);
diff --git a/hw/pci/pci_bus.c b/hw/pci/pci_bus.c
index 0922a75..ed99208 100644
--- a/hw/pci/pci_bus.c
+++ b/hw/pci/pci_bus.c
@@ -469,6 +469,15 @@ static bool pcibus_is_root(PCIBus *bus)
 return !bus->parent_dev;
 }
 
+static int pcibus_num(PCIBus *bus)
+{
+if (pcibus_is_root(bus)) {
+return 0;   /* pci host bridge */
+}
+
+return bus->parent_dev->config[PCI_SECONDARY_BUS];
+}
+
 static void pci_bus_class_init(ObjectClass *klass, void *data)
 {
 BusClass *k = BUS_CLASS(klass);
@@ -482,6 +491,7 @@ static void pci_bus_class_init(ObjectClass *klass, void *data)
 k->reset = pcibus_reset;
 
 pbc->is_root = pcibus_is_root;
+pbc->bus_num = pcibus_num;
 }
 
 static const TypeInfo pci_bus_info = {
diff --git a/hw/scsi/megasas.c b/hw/scsi/megasas.c
index 4852237..fa4e3d0 100644
--- a/hw/scsi/megasas.c
+++ b/hw/scsi/megasas.c
@@ -20,6 +20,7 @@
 
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/pci_bus.h"
 #include "sysemu/dma.h"
 #include "sysemu/block-backend.h"
 #include "hw/pci/msi.h"
diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index f2893b2..cf56a48 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -55,6 +55,7 @@
 #include 
 
 #include "hw/pci/pci.h"
+#include "hw/pci/pci_bus.h"
 #include "hw/xen/xen.h"
 #include "hw/xen/xen_backend.h"
 #include "xen_pt.h"
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index ae2c4a5..a69cf94 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -375,7 +375,6 @@ PCIDevice *pci_nic_init_nofail(NICInfo *nd, PCIBus *rootbus,
 
 PCIDevice *pci_vga_init(PCIBus *bus);
 
-int pci_bus_num(PCIBus *s);
 void pci_for_each_device(PCIBus *bus, int bus_num,
  void (*fn)(PCIBus *bus, PCIDevice *d, void *opaque),
  void *opaque);
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 306ef10..553814e 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -24,6 +24,7 @@ typedef struct PCIBusClass {
 /*< public >*/
 
 bool (*is_root)(PCIBus *bus);
+int (*bus_num)(PCIBus *bus);
 } PCIBusClass;
 
 struct PCIBus {
@@ -54,6 +55,11 @@ static inline bool pci_bus_is_root(PCIBus *bus)
 return PCI_BUS_GET_CLASS(bus)->is_root(bus);
 }
 
+static inline int pci_bus_num(PCIBus *bus)
+{
+return PCI_BUS_GET_CLASS(bus)->bus_num(bus);
+}
+
 typedef struct PCIBridgeWindows PCIBridgeWindows;
 
 /*
-- 
2.1.0
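
The point of moving the bus number lookup into the class can be shown with a small stand-alone model of the dispatch. It does not use QOM; `BusSketch`, `BusClassSketch` and the "extra root" class are hypothetical names chosen for illustration.

```c
typedef struct BusSketch BusSketch;

/* Per-class method table, playing the role of PCIBusClass. */
typedef struct BusClassSketch {
    int (*bus_num)(BusSketch *bus);
} BusClassSketch;

struct BusSketch {
    const BusClassSketch *klass;
    int has_parent_bridge;   /* 0 for a root bus */
    int secondary_bus;       /* number programmed into the parent bridge */
    int fixed_num;           /* used only by the hypothetical subclass */
};

/* Default behaviour: root buses are 0, others use the bridge register. */
static int default_bus_num(BusSketch *bus)
{
    return bus->has_parent_bridge ? bus->secondary_bus : 0;
}

/* Hypothetical subclass: a root bus that reports a non-zero number. */
static int extra_root_bus_num(BusSketch *bus)
{
    return bus->fixed_num;
}

static const BusClassSketch default_class = { default_bus_num };
static const BusClassSketch extra_root_class = { extra_root_bus_num };

/* Callers go through the class, as pci_bus_num() does after this patch. */
static int bus_num(BusSketch *bus)
{
    return bus->klass->bus_num(bus);
}
```

Callers keep a single entry point while each class decides how the number is derived.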

[Qemu-devel] [PATCH v3 for-2.3 23/24] hw/pci_bus: add support for NUMA nodes

2015-03-05 Thread Marcel Apfelbaum

PCI root buses can be attached to a specific NUMA node.
PCI buses are not attached by default to a NUMA node.

Signed-off-by: Marcel Apfelbaum 
---
 hw/pci/pci_bus.c | 7 +++
 include/hw/pci/pci_bus.h | 6 ++
 include/sysemu/sysemu.h  | 1 +
 3 files changed, 14 insertions(+)

diff --git a/hw/pci/pci_bus.c b/hw/pci/pci_bus.c
index ed99208..15882a7 100644
--- a/hw/pci/pci_bus.c
+++ b/hw/pci/pci_bus.c
@@ -13,6 +13,7 @@
 #include "hw/pci/pci_bus.h"
 #include "hw/pci/pci_bridge.h"
 #include "monitor/monitor.h"
+#include "sysemu/sysemu.h"
 
 typedef struct {
 uint16_t class;
@@ -478,6 +479,11 @@ static int pcibus_num(PCIBus *bus)
 return bus->parent_dev->config[PCI_SECONDARY_BUS];
 }
 
+static uint16_t pcibus_numa_node(PCIBus *bus)
+{
+return NUMA_NODE_UNASSIGNED;
+}
+
 static void pci_bus_class_init(ObjectClass *klass, void *data)
 {
 BusClass *k = BUS_CLASS(klass);
@@ -492,6 +498,7 @@ static void pci_bus_class_init(ObjectClass *klass, void *data)
 
 pbc->is_root = pcibus_is_root;
 pbc->bus_num = pcibus_num;
+pbc->numa_node = pcibus_numa_node;
 }
 
 static const TypeInfo pci_bus_info = {
diff --git a/include/hw/pci/pci_bus.h b/include/hw/pci/pci_bus.h
index 553814e..75cd1fa 100644
--- a/include/hw/pci/pci_bus.h
+++ b/include/hw/pci/pci_bus.h
@@ -25,6 +25,7 @@ typedef struct PCIBusClass {
 
 bool (*is_root)(PCIBus *bus);
 int (*bus_num)(PCIBus *bus);
+uint16_t (*numa_node)(PCIBus *bus);
 } PCIBusClass;
 
 struct PCIBus {
@@ -60,6 +61,11 @@ static inline int pci_bus_num(PCIBus *bus)
 return PCI_BUS_GET_CLASS(bus)->bus_num(bus);
 }
 
+static inline int pci_bus_numa_node(PCIBus *bus)
+{
+return PCI_BUS_GET_CLASS(bus)->numa_node(bus);
+}
+
 typedef struct PCIBridgeWindows PCIBridgeWindows;
 
 /*
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index e7135e1..934eb5d 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -136,6 +136,7 @@ extern const char *mem_path;
 extern int mem_prealloc;
 
 #define MAX_NODES 128
+#define NUMA_NODE_UNASSIGNED MAX_NODES
 
 /* The following shall be true for all CPUs:
  *   cpu->cpu_index < max_cpus <= MAX_CPUMASK_BITS
-- 
2.1.0

[Qemu-devel] [PATCH v2 2/2] iotests: add O_DIRECT alignment probing test

2015-03-05 Thread Stefan Hajnoczi

This test case checks that image files can be opened even if I/O
produces EIO errors.  QEMU should not refuse opening failed disks since
the guest may be configured for multipath I/O where accessing failed
disks is expected.

Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/128 | 82 ++
 tests/qemu-iotests/128.out |  5 +++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 88 insertions(+)
 create mode 100755 tests/qemu-iotests/128
 create mode 100644 tests/qemu-iotests/128.out

diff --git a/tests/qemu-iotests/128 b/tests/qemu-iotests/128
new file mode 100755
index 000..249a865
--- /dev/null
+++ b/tests/qemu-iotests/128
@@ -0,0 +1,82 @@
+#!/bin/bash
+#
+# Test that opening O_DIRECT succeeds when image file I/O produces EIO
+#
+# Copyright (C) 2015 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see .
+#
+
+# creator
+owner=stefa...@redhat.com
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+devname="eiodev$$"
+
+_setup_eiodev()
+{
+   # This test should either be run as root or with passwordless sudo
+   for cmd in "" "sudo -n"; do
+   echo "0 $((1024 * 1024 * 1024 / 512)) error" | \
+   $cmd dmsetup create "$devname" 2>/dev/null
+   if [ "$?" -eq 0 ]; then
+   return
+   fi
+   done
+   _notrun "root privileges required to run dmsetup"
+}
+
+_cleanup_eiodev()
+{
+   for cmd in "" "sudo -n"; do
+   $cmd dmsetup remove "$devname" 2>/dev/null
+   if [ "$?" -eq 0 ]; then
+   return
+   fi
+   done
+}
+
+_cleanup()
+{
+   _cleanup_eiodev
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common.rc
+. ./common.filter
+
+_supported_fmt raw
+_supported_proto file
+_supported_os Linux
+
+_setup_eiodev
+
+TEST_IMG="/dev/mapper/$devname"
+
+echo
+echo "== reading from error device =="
+# Opening image should succeed but the read operation should fail
+$QEMU_IO --format "$IMGFMT" --nocache -c "read 0 65536" "$TEST_IMG" | _filter_qemu_io
+
+# success, all done
+echo "*** done"
+rm -f $seq.full
+status=0
diff --git a/tests/qemu-iotests/128.out b/tests/qemu-iotests/128.out
new file mode 100644
index 000..4e43f5f
--- /dev/null
+++ b/tests/qemu-iotests/128.out
@@ -0,0 +1,5 @@
+QA output created by 128
+
+== reading from error device ==
+read failed: Input/output error
+*** done
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index 87eec39..71f19d4 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -121,3 +121,4 @@
 114 rw auto quick
 116 rw auto quick
 123 rw auto quick
+128 rw auto quick
-- 
2.1.0

[Qemu-devel] [PATCH v2 1/2] block/raw-posix: fix launching with failed disks

2015-03-05 Thread Stefan Hajnoczi

Since commit c25f53b06eba1575d5d0e92a0132455c97825b83 ("raw: Probe
required direct I/O alignment") QEMU has failed to launch if image files
produce I/O errors.

Previously, QEMU would launch successfully and the guest would see the
errors when attempting I/O.

This is a regression and may prevent multipath I/O inside the guest,
where QEMU must launch and let the guest figure out by itself which
disks are online.

Tweak the alignment probing code in raw-posix.c to explicitly look for
EINVAL on Linux instead of bailing.  The kernel refuses misaligned
requests with this error code and other error codes can be ignored.

Signed-off-by: Stefan Hajnoczi 
---
 block/raw-posix.c | 29 +++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 3263d2b..f0b4488 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -272,6 +272,31 @@ static int probe_physical_blocksize(int fd, unsigned int *blk_size)
 #endif
 }
 
+/* Check if read is allowed with given memory buffer and length.
+ *
+ * This function is used to check O_DIRECT memory buffer and request alignment.
+ */
+static bool raw_is_io_aligned(int fd, void *buf, size_t len)
+{
+ssize_t ret = pread(fd, buf, len, 0);
+
+if (ret >= 0) {
+return true;
+}
+
+#ifdef __linux__
+/* The Linux kernel returns EINVAL for misaligned O_DIRECT reads.  Ignore
+ * other errors (e.g. real I/O error), which could happen on a failed
+ * drive, since we only care about probing alignment.
+ */
+if (errno != EINVAL) {
+return true;
+}
+#endif
+
+return false;
+}
+
 static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 {
 BDRVRawState *s = bs->opaque;
@@ -307,7 +332,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 size_t align;
 buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
 for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
-if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
+if (raw_is_io_aligned(fd, buf + align, MAX_BLOCKSIZE)) {
 s->buf_align = align;
 break;
 }
@@ -319,7 +344,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 size_t align;
 buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
 for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
-if (pread(fd, buf, align, 0) >= 0) {
+if (raw_is_io_aligned(fd, buf, align)) {
 bs->request_alignment = align;
 break;
 }
-- 
2.1.0
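
The probing rule from this patch can be exercised without a real device by injecting the read function. The sketch below covers only the Linux branch (any error other than EINVAL counts as "aligned"); `probe_read_fn`, the two fake readers and `probe_request_alignment` are hypothetical helpers, not QEMU functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical read hook standing in for pread(fd, buf, len, 0). */
typedef ssize_t (*probe_read_fn)(void *buf, size_t len);

/* Linux rule: only EINVAL means the request was misaligned. */
static bool is_io_aligned(probe_read_fn rd, void *buf, size_t len)
{
    if (rd(buf, len) >= 0) {
        return true;
    }
    return errno != EINVAL;
}

/* Return the smallest power-of-two length accepted, or 0 if none is. */
static size_t probe_request_alignment(probe_read_fn rd, void *buf, size_t max)
{
    for (size_t align = 512; align <= max; align <<= 1) {
        if (is_io_aligned(rd, buf, align)) {
            return align;
        }
    }
    return 0;
}

/* Fake device that rejects lengths that are not multiples of 4096. */
static ssize_t fake_read_needs_4k(void *buf, size_t len)
{
    (void)buf;
    if (len % 4096 != 0) {
        errno = EINVAL;
        return -1;
    }
    return (ssize_t)len;
}

/* Fake failed disk: every read returns EIO regardless of alignment. */
static ssize_t fake_read_failed_disk(void *buf, size_t len)
{
    (void)buf;
    (void)len;
    errno = EIO;
    return -1;
}
```

With the failed-disk reader the probe settles on 512 instead of giving up, which is what lets the image open and leaves the I/O errors for the guest to see.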

[Qemu-devel] [PATCH v2 0/2] block/raw-posix: fix launching with failed disks

2015-03-05 Thread Stefan Hajnoczi

Guests configured for multipath I/O might be started up with failed disks
attached.  QEMU should not refuse starting when a disk returns I/O errors (and
in the past this behavior was implemented correctly).

This patch series fixes a regression that prevents QEMU from opening failed
disks and adds a qemu-iotests test case to cover this use case.

Stefan Hajnoczi (2):
  block/raw-posix: fix launching with failed disks
  iotests: add O_DIRECT alignment probing test

 block/raw-posix.c  | 29 ++--
 tests/qemu-iotests/128 | 82 ++
 tests/qemu-iotests/128.out |  5 +++
 tests/qemu-iotests/group   |  1 +
 4 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100755 tests/qemu-iotests/128
 create mode 100644 tests/qemu-iotests/128.out

-- 
2.1.0

[Qemu-devel] [PATCH 4/6 v5] linux-user: Support tilegx architecture in syscall

2015-03-05 Thread Chen Gang

Add the tilegx architecture to "syscall_defs.h"; all related features (ioctl
and stat) are based on the Linux kernel's tilegx 64-bit implementation.

Signed-off-by: Chen Gang 
---
 linux-user/syscall_defs.h | 38 ++
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/linux-user/syscall_defs.h b/linux-user/syscall_defs.h
index edd5f3c..023f4b5 100644
--- a/linux-user/syscall_defs.h
+++ b/linux-user/syscall_defs.h
@@ -64,8 +64,9 @@
 #endif
 
 #if defined(TARGET_I386) || defined(TARGET_ARM) || defined(TARGET_SH4) \
-|| defined(TARGET_M68K) || defined(TARGET_CRIS) || defined(TARGET_UNICORE32) \
-|| defined(TARGET_S390X) || defined(TARGET_OPENRISC)
+|| defined(TARGET_M68K) || defined(TARGET_CRIS) \
+|| defined(TARGET_UNICORE32) || defined(TARGET_S390X) \
+|| defined(TARGET_OPENRISC) || defined(TARGET_TILEGX)
 
 #define TARGET_IOC_SIZEBITS14
 #define TARGET_IOC_DIRBITS 2
@@ -365,7 +366,8 @@ int do_sigaction(int sig, const struct target_sigaction *act,
 || defined(TARGET_PPC) || defined(TARGET_MIPS) || defined(TARGET_SH4) \
 || defined(TARGET_M68K) || defined(TARGET_ALPHA) || defined(TARGET_CRIS) \
 || defined(TARGET_MICROBLAZE) || defined(TARGET_UNICORE32) \
-|| defined(TARGET_S390X) || defined(TARGET_OPENRISC)
+|| defined(TARGET_S390X) || defined(TARGET_OPENRISC) \
+|| defined(TARGET_TILEGX)
 
 #if defined(TARGET_SPARC)
 #define TARGET_SA_NOCLDSTOP8u
@@ -1922,6 +1924,32 @@ struct target_stat64 {
 unsigned int __unused5;
 };
 
+#elif defined(TARGET_TILEGX)
+
+/* Copy from Linux kernel "uapi/asm-generic/stat.h" */
+struct target_stat {
+abi_ulong st_dev;   /* Device.  */
+abi_ulong st_ino;   /* File serial number.  */
+unsigned int st_mode;   /* File mode.  */
+unsigned int st_nlink;  /* Link count.  */
+unsigned int st_uid;/* User ID of the file's owner.  */
+unsigned int st_gid;/* Group ID of the file's group. */
+abi_ulong st_rdev;  /* Device number, if device.  */
+abi_ulong __pad1;
+abi_long  st_size;  /* Size of file, in bytes.  */
+int st_blksize; /* Optimal block size for I/O.  */
+int __pad2;
+abi_long st_blocks; /* Number 512-byte blocks allocated. */
+abi_long target_st_atime;   /* Time of last access.  */
+abi_ulong target_st_atime_nsec;
+abi_long target_st_mtime;   /* Time of last modification.  */
+abi_ulong target_st_mtime_nsec;
+abi_long target_st_ctime;   /* Time of last status change.  */
+abi_ulong target_st_ctime_nsec;
+unsigned int __unused4;
+unsigned int __unused5;
+};
+
 #else
 #error unsupported CPU
 #endif
@@ -2264,7 +2292,9 @@ struct target_flock {
 struct target_flock64 {
short  l_type;
short  l_whence;
-#if defined(TARGET_PPC) || defined(TARGET_X86_64) || defined(TARGET_MIPS) || defined(TARGET_SPARC) || defined(TARGET_HPPA) || defined (TARGET_MICROBLAZE)
+#if defined(TARGET_PPC) || defined(TARGET_X86_64) || defined(TARGET_MIPS) \
+|| defined(TARGET_SPARC) || defined(TARGET_HPPA) \
+|| defined(TARGET_MICROBLAZE) || defined(TARGET_TILEGX)
 int __pad;
 #endif
unsigned long long l_start;
-- 
1.9.3
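
Putting tilegx into the group with 14 size bits and 2 direction bits fixes how ioctl numbers are packed for that target. The sketch below shows the packing under the assumption that the number and type fields are 8 bits each and that the direction values are 0 (none), 1 (write) and 2 (read), as in the generic Linux layout; those details are not visible in the hunk above, and the `IOC_*` and `ioc_*` names are hypothetical.

```c
#include <stdint.h>

enum {
    IOC_NRBITS = 8,      /* assumed, generic Linux value */
    IOC_TYPEBITS = 8,    /* assumed, generic Linux value */
    IOC_SIZEBITS = 14,   /* from the hunk above */
    IOC_DIRBITS = 2      /* from the hunk above */
};

enum {
    IOC_NRSHIFT = 0,
    IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS,
    IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS,
    IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS
};

enum { IOC_NONE = 0, IOC_WRITE = 1, IOC_READ = 2 };

/* Pack direction, type, number and argument size into one command word. */
static uint32_t ioc_encode(uint32_t dir, uint32_t type, uint32_t nr,
                           uint32_t size)
{
    return (dir << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT) |
           (type << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT);
}

/* Extract the argument size field from a command word. */
static uint32_t ioc_size(uint32_t cmd)
{
    return (cmd >> IOC_SIZESHIFT) & ((1u << IOC_SIZEBITS) - 1);
}

/* Extract the direction field from a command word. */
static uint32_t ioc_dir(uint32_t cmd)
{
    return (cmd >> IOC_DIRSHIFT) & ((1u << IOC_DIRBITS) - 1);
}
```

For example, a read-direction command of type 'f', number 1 and an 8-byte argument packs to 0x80086601 under these assumptions.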

Re: [Qemu-devel] [PATCH 6/6] target-i386: Call cpu_exec_init() on realize

2015-03-05 Thread Eduardo Habkost

On Thu, Mar 05, 2015 at 05:44:58PM +0100, Andreas Färber wrote:
> Am 05.03.2015 um 17:42 schrieb Igor Mammedov:
[...]
> >> @@ -2840,7 +2842,6 @@ static void x86_cpu_initfn(Object *obj)
> >>  CPUX86State *env = &cpu->env;
> >>  
> >>  cs->env_ptr = env;
> >> -cpu_exec_init(env);
> > looks wrong, later in this function we do
> >  env->cpuid_apic_id = x86_cpu_apic_id_from_index(cs->cpu_index);
> > and with this patch will always yield 0
> 
> Being tackled in Eduardo's APIC series. ;)

Which is already queued at the x86 tree mentioned in the cover letter,
BTW:
  https://github.com/ehabkost/qemu.git x86

-- 
Eduardo

[Qemu-devel] [PATCH 3/6 v5] linux-user: tilegx: Add target features support within qemu

2015-03-05 Thread Chen Gang

These headers cover the target features that live within QEMU and are
independent of anything outside it.

Signed-off-by: Chen Gang 
---
 linux-user/tilegx/target_cpu.h | 35 +++
 linux-user/tilegx/target_signal.h  | 28 ++
 linux-user/tilegx/target_structs.h | 48 ++
 3 files changed, 111 insertions(+)
 create mode 100644 linux-user/tilegx/target_cpu.h
 create mode 100644 linux-user/tilegx/target_signal.h
 create mode 100644 linux-user/tilegx/target_structs.h

diff --git a/linux-user/tilegx/target_cpu.h b/linux-user/tilegx/target_cpu.h
new file mode 100644
index 000..c96e81d
--- /dev/null
+++ b/linux-user/tilegx/target_cpu.h
@@ -0,0 +1,35 @@
+/*
+ * TILE-Gx specific CPU ABI and functions for linux-user
+ *
+ * Copyright (c) 2015 Chen Gang
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see .
+ */
+#ifndef TARGET_CPU_H
+#define TARGET_CPU_H
+
+static inline void cpu_clone_regs(CPUTLGState *env, target_ulong newsp)
+{
+    if (newsp) {
+        env->regs[TILEGX_R_SP] = newsp;
+    }
+    env->regs[TILEGX_R_RE] = 0;
+}
+
+static inline void cpu_set_tls(CPUTLGState *env, target_ulong newtls)
+{
+    env->regs[TILEGX_R_TP] = newtls;
+}
+
+#endif
diff --git a/linux-user/tilegx/target_signal.h b/linux-user/tilegx/target_signal.h
new file mode 100644
index 000..fbab216
--- /dev/null
+++ b/linux-user/tilegx/target_signal.h
@@ -0,0 +1,28 @@
+#ifndef TARGET_SIGNAL_H
+#define TARGET_SIGNAL_H
+
+#include "cpu.h"
+
+/* this struct defines a stack used during syscall handling */
+
+typedef struct target_sigaltstack {
+    abi_ulong ss_sp;
+    abi_ulong ss_size;
+    abi_long ss_flags;
+} target_stack_t;
+
+/*
+ * sigaltstack controls
+ */
+#define TARGET_SS_ONSTACK 1
+#define TARGET_SS_DISABLE 2
+
+#define TARGET_MINSIGSTKSZ    2048
+#define TARGET_SIGSTKSZ       8192
+
+static inline abi_ulong get_sp_from_cpustate(CPUTLGState *state)
+{
+    return state->regs[TILEGX_R_SP];
+}
+
+#endif /* TARGET_SIGNAL_H */
diff --git a/linux-user/tilegx/target_structs.h b/linux-user/tilegx/target_structs.h
new file mode 100644
index 000..13a1505
--- /dev/null
+++ b/linux-user/tilegx/target_structs.h
@@ -0,0 +1,48 @@
+/*
+ * TILE-Gx specific structures for linux-user
+ *
+ * Copyright (c) 2015 Chen Gang
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see .
+ */
+#ifndef TARGET_STRUCTS_H
+#define TARGET_STRUCTS_H
+
+struct target_ipc_perm {
+    abi_int __key;                      /* Key.  */
+    abi_uint uid;                       /* Owner's user ID.  */
+    abi_uint gid;                       /* Owner's group ID.  */
+    abi_uint cuid;                      /* Creator's user ID.  */
+    abi_uint cgid;                      /* Creator's group ID.  */
+    abi_uint mode;                      /* Read/write permission.  */
+    abi_ushort __seq;                   /* Sequence number.  */
+    abi_ushort __pad2;
+    abi_ulong __unused1;
+    abi_ulong __unused2;
+};
+
+struct target_shmid_ds {
+    struct target_ipc_perm shm_perm;    /* operation permission struct */
+    abi_long shm_segsz;                 /* size of segment in bytes */
+    abi_ulong shm_atime;                /* time of last shmat() */
+    abi_ulong shm_dtime;                /* time of last shmdt() */
+    abi_ulong shm_ctime;                /* time of last change by shmctl() */
+    abi_int shm_cpid;                   /* pid of creator */
+    abi_int shm_lpid;                   /* pid of last shmop */
+    abi_ulong shm_nattch;               /* number of current attaches */
+    abi_ulong __unused4;
+    abi_ulong __unused5;
+};
+
+#endif
-- 
1.9.3
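
The effect of cpu_clone_regs() and cpu_set_tls() from target_cpu.h above can be tried outside qemu with a small mock. This is a sketch under stated assumptions: the CPUTLGState layout and target_ulong here are stand-ins for the real definitions in target-tilegx/cpu.h, the TP and SP indexes follow the register comments elsewhere in this series (53 and 54), TILEGX_R_RE is assumed to be r0, and clone_child() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real ones live in target-tilegx/cpu.h. */
typedef uint64_t target_ulong;

enum {
    TILEGX_R_RE = 0,      /* assumed: return-value register r0 */
    TILEGX_R_TP = 53,     /* thread pointer */
    TILEGX_R_SP = 54,     /* stack pointer */
    TILEGX_R_COUNT = 56
};

typedef struct CPUTLGState {
    uint64_t regs[TILEGX_R_COUNT];
} CPUTLGState;

/* Same logic as the patch: optionally switch stacks, return 0 in the child. */
static inline void cpu_clone_regs(CPUTLGState *env, target_ulong newsp)
{
    if (newsp) {
        env->regs[TILEGX_R_SP] = newsp;
    }
    env->regs[TILEGX_R_RE] = 0;
}

static inline void cpu_set_tls(CPUTLGState *env, target_ulong newtls)
{
    env->regs[TILEGX_R_TP] = newtls;
}

/* Hypothetical helper: derive a child state the way a clone() path might. */
static CPUTLGState clone_child(const CPUTLGState *parent,
                               target_ulong newsp, target_ulong newtls)
{
    CPUTLGState child = *parent;   /* child starts as a copy of the parent */

    cpu_clone_regs(&child, newsp);
    if (newtls) {
        cpu_set_tls(&child, newtls);
    }
    return child;
}
```

Passing newsp == 0 keeps the parent's stack pointer, which matches fork-like behaviour; a non-zero newsp matches thread creation with a fresh stack.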

[Qemu-devel] [PATCH 6/6 v5] linux-user/syscall.c: conditionalize syscalls which are not defined in tilegx

2015-03-05 Thread Chen Gang

For tilegx, several syscall macros are not defined, so wrap the code that
uses them in #ifdef to avoid breaking the build.

Signed-off-by: Chen Gang 
---
 linux-user/syscall.c | 50 +-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 5720195..d1a00ad 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -213,7 +213,7 @@ static int gettid(void) {
 return -ENOSYS;
 }
 #endif
-#ifdef __NR_getdents
+#if defined(TARGET_NR_getdents) && defined(__NR_getdents)
 _syscall3(int, sys_getdents, uint, fd, struct linux_dirent *, dirp, uint, count);
 #endif
 #if !defined(__NR_getdents) || \
@@ -5580,6 +5580,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 ret = get_errno(write(arg1, p, arg3));
 unlock_user(p, arg2, 0);
 break;
+#ifdef TARGET_NR_open
 case TARGET_NR_open:
 if (!(p = lock_user_string(arg1)))
 goto efault;
@@ -5588,6 +5589,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
   arg3));
 unlock_user(p, arg1, 0);
 break;
+#endif
 case TARGET_NR_openat:
 if (!(p = lock_user_string(arg2)))
 goto efault;
@@ -5602,9 +5604,11 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 case TARGET_NR_brk:
 ret = do_brk(arg1);
 break;
+#ifdef TARGET_NR_fork
 case TARGET_NR_fork:
 ret = get_errno(do_fork(cpu_env, SIGCHLD, 0, 0, 0, 0));
 break;
+#endif
 #ifdef TARGET_NR_waitpid
 case TARGET_NR_waitpid:
 {
@@ -5639,6 +5643,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 unlock_user(p, arg1, 0);
 break;
 #endif
+#ifdef TARGET_NR_link
 case TARGET_NR_link:
 {
 void * p2;
@@ -5652,6 +5657,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 unlock_user(p, arg1, 0);
 }
 break;
+#endif
 #if defined(TARGET_NR_linkat)
 case TARGET_NR_linkat:
 {
@@ -5669,12 +5675,14 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 }
 break;
 #endif
+#ifdef TARGET_NR_unlink
 case TARGET_NR_unlink:
 if (!(p = lock_user_string(arg1)))
 goto efault;
 ret = get_errno(unlink(p));
 unlock_user(p, arg1, 0);
 break;
+#endif
 #if defined(TARGET_NR_unlinkat)
 case TARGET_NR_unlinkat:
 if (!(p = lock_user_string(arg2)))
@@ -5791,12 +5799,14 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 }
 break;
 #endif
+#ifdef TARGET_NR_mknod
 case TARGET_NR_mknod:
 if (!(p = lock_user_string(arg1)))
 goto efault;
 ret = get_errno(mknod(p, arg2, arg3));
 unlock_user(p, arg1, 0);
 break;
+#endif
 #if defined(TARGET_NR_mknodat)
 case TARGET_NR_mknodat:
 if (!(p = lock_user_string(arg2)))
@@ -5805,12 +5815,14 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 unlock_user(p, arg2, 0);
 break;
 #endif
+#ifdef TARGET_NR_chmod
 case TARGET_NR_chmod:
 if (!(p = lock_user_string(arg1)))
 goto efault;
 ret = get_errno(chmod(p, arg2));
 unlock_user(p, arg1, 0);
 break;
+#endif
 #ifdef TARGET_NR_break
 case TARGET_NR_break:
 goto unimplemented;
@@ -5945,6 +5957,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 }
 break;
 #endif
+#ifdef TARGET_NR_utimes
 case TARGET_NR_utimes:
 {
 struct timeval *tvp, tv[2];
@@ -5963,6 +5976,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 unlock_user(p, arg1, 0);
 }
 break;
+#endif
 #if defined(TARGET_NR_futimesat)
 case TARGET_NR_futimesat:
 {
@@ -5991,12 +6005,14 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 case TARGET_NR_gtty:
 goto unimplemented;
 #endif
+#ifdef TARGET_NR_access
 case TARGET_NR_access:
 if (!(p = lock_user_string(arg1)))
 goto efault;
 ret = get_errno(access(path(p), arg2));
 unlock_user(p, arg1, 0);
 break;
+#endif
 #if defined(TARGET_NR_faccessat) && defined(__NR_faccessat)
 case TARGET_NR_faccessat:
 if (!(p = lock_user_string(arg2)))
@@ -6021,6 +6037,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 case TARGET_NR_kill:
 ret = get_errno(kill(arg1, target_to_host_signal(arg2)));
 break;
+#ifdef TARGET_NR_rename
 case TARGET_NR_rename:
 {
 void *p2;
@@ -6034,6 +6051,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 unlock_user(p, arg1, 0);
 }
 break;
+#endif
 #if defined(TARGET_NR_renameat)
 case TARGET_NR_renameat:
 {
@@ -6049,12 +6067,14 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
 }
 break;
 #endif
+#ifdef TARG
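
The pattern this patch applies throughout do_syscall() can be shown in a reduced form. This is only a sketch: do_syscall_sketch() and its return values are hypothetical, and the single syscall number defined here is for illustration. The point is that a case label guarded by #ifdef simply disappears on targets whose syscall_nr.h does not define the macro, so the switch still compiles.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical target that, like tilegx, provides openat but not the
 * legacy open syscall. TARGET_NR_open is deliberately left undefined.
 */
#define TARGET_NR_openat 56

static long do_syscall_sketch(int num)
{
    switch (num) {
#ifdef TARGET_NR_open
    case TARGET_NR_open:        /* compiled only if the target has open */
        return 1;
#endif
    case TARGET_NR_openat:      /* always present in this sketch */
        return 2;
    default:
        return -ENOSYS;         /* unknown or unsupported syscall */
    }
}
```

Without the guard, the undefined TARGET_NR_open identifier would be a compile error rather than a runtime -ENOSYS.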

Re: [Qemu-devel] [PATCH 0/6] target-i386: Remove side-effects from X86CPU::instance_init

2015-03-05 Thread Eduardo Habkost

On Thu, Mar 05, 2015 at 12:38:44PM -0300, Eduardo Habkost wrote:
> Eduardo Habkost (6):
>   cpu: No need to zero-initialize numa_node
>   cpu: Initialize breakpoint/watchpoint lists on cpu_common_initfn()
>   cpu: Reorder cpu->as and cpu->thread_id initialization

Andreas, do you want to queue the patches above through your qom-cpu
tree? They are not a hard requirement for the patches below.

(The only difference is that the commit message in patch 6/6 refers to
the modified version of cpu_exec_init())

>   target-i386: Rename optimize_flags_init()
>   target-i386: Move TCG initialization to realize time
>   target-i386: Call cpu_exec_init() on realize
> 
>  exec.c  | 12 +---
>  qom/cpu.c   |  2 ++
>  target-i386/cpu.c   | 16 
>  target-i386/cpu.h   |  2 +-
>  target-i386/translate.c |  2 +-
>  5 files changed, 17 insertions(+), 17 deletions(-)
> 
> -- 
> 2.1.0
> 
> 

-- 
Eduardo

Re: [Qemu-devel] [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation

2015-03-05 Thread Pavel Emelyanov

> +static int mcopy_atomic_pte(struct mm_struct *dst_mm,
> + pmd_t *dst_pmd,
> + struct vm_area_struct *dst_vma,
> + unsigned long dst_addr,
> + unsigned long src_addr)
> +{
> + struct mem_cgroup *memcg;
> + pte_t _dst_pte, *dst_pte;
> + spinlock_t *ptl;
> + struct page *page;
> + void *page_kaddr;
> + int ret;
> +
> + ret = -ENOMEM;
> + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
> + if (!page)
> + goto out;

Not a fatal thing, but still quite inconvenient. If there are two tasks that
have anonymous private VMAs that are still not COW-ed from each other, then
it will be impossible to keep the pages shared with userfault. Thus if we do
post-copy memory migration for tasks, then these guys will have their
memory COW-ed.


Thanks,
Pavel
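
For readers less familiar with the sharing Pavel refers to, the user-visible side of copy-on-write for a private anonymous mapping can be observed with fork(). This is a plain userspace sketch, unrelated to the userfaultfd code itself; parent_value_after_child_write() is a hypothetical helper name, and the example assumes a Linux/POSIX host.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one private anonymous page, fork, let the child write to it, and
 * return the value the parent still sees (or -1 on setup failure).
 */
static int parent_value_after_child_write(void)
{
    int *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return -1;
    }
    page[0] = 1;                /* populate the page before forking */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(page, 4096);
        return -1;
    }
    if (pid == 0) {
        page[0] = 2;            /* the write gives the child its own copy */
        _exit(0);
    }

    int status;
    int seen = -1;
    if (waitpid(pid, &status, 0) == pid) {
        seen = page[0];         /* parent's copy is unaffected */
    }
    munmap(page, 4096);
    return seen;
}
```

Until one side writes, both tasks reference the same physical page; the concern in the mail is that filling such a range through a freshly allocated page gives up that sharing.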

[Qemu-devel] [PATCH 5/6 v5] linux-user: Support tilegx architecture in linux-user

2015-03-05 Thread Chen Gang

Add the main execution loop and support for loading elf64 tilegx binaries,
based on the Linux kernel's 64-bit tilegx implementation.

After this patch, qemu can successfully load an elf64 tilegx binary for
linux-user, and execution reaches the first correct instruction
position "__start".

Signed-off-by: Chen Gang 
---
 include/elf.h|  2 ++
 linux-user/elfload.c | 23 
 linux-user/main.c| 74 
 3 files changed, 99 insertions(+)

diff --git a/include/elf.h b/include/elf.h
index a516584..139b22d 100644
--- a/include/elf.h
+++ b/include/elf.h
@@ -133,6 +133,8 @@ typedef int64_t  Elf64_Sxword;
 
 #define EM_AARCH64  183
 
+#define EM_TILEGX   191 /* TILE-Gx */
+
 /* This is the info that is needed to parse the dynamic section of the file */
 #define DT_NULL0
 #define DT_NEEDED  1
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 399c021..2571cb8 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -1189,6 +1189,29 @@ static inline void init_thread(struct target_pt_regs *regs, struct image_info *i
 
 #endif /* TARGET_S390X */
 
+#ifdef TARGET_TILEGX
+
+/* 42 bits real used address, a half for user mode */
+#define ELF_START_MMAP (0x00200ULL)
+
+#define elf_check_arch(x) ((x) == EM_TILEGX)
+
+#define ELF_CLASS   ELFCLASS64
+#define ELF_DATA    ELFDATA2LSB
+#define ELF_ARCH    EM_TILEGX
+
+static inline void init_thread(struct target_pt_regs *regs,
+                               struct image_info *infop)
+{
+    regs->lr = infop->entry;
+    regs->sp = infop->start_stack;
+
+}
+
+#define ELF_EXEC_PAGESIZE        65536 /* TILE-Gx page size is 64KB */
+
+#endif /* TARGET_TILEGX */
+
 #ifndef ELF_PLATFORM
 #define ELF_PLATFORM (NULL)
 #endif
diff --git a/linux-user/main.c b/linux-user/main.c
index d92702a..8d98ca4 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -3418,6 +3418,20 @@ void cpu_loop(CPUS390XState *env)
 
 #endif /* TARGET_S390X */
 
+#ifdef TARGET_TILEGX
+void cpu_loop(CPUTLGState *env)
+{
+    CPUState *cs = CPU(tilegx_env_get_cpu(env));
+
+    while (1) {
+        cpu_exec_start(cs);
+        cpu_tilegx_exec(env);
+        cpu_exec_end(cs);
+        process_pending_signals(env);
+    }
+}
+#endif
+
 THREAD CPUState *thread_cpu;
 
 void task_settid(TaskState *ts)
@@ -4392,6 +4406,66 @@ int main(int argc, char **argv, char **envp)
 env->psw.mask = regs->psw.mask;
 env->psw.addr = regs->psw.addr;
 }
+#elif defined(TARGET_TILEGX)
+{
+env->regs[0] = regs->r0;
+env->regs[1] = regs->r1;
+env->regs[2] = regs->r2;
+env->regs[3] = regs->r3;
+env->regs[4] = regs->r4;
+env->regs[5] = regs->r5;
+env->regs[6] = regs->r6;
+env->regs[7] = regs->r7;
+env->regs[8] = regs->r8;
+env->regs[9] = regs->r9;
+env->regs[10] = regs->r10;
+env->regs[11] = regs->r11;
+env->regs[12] = regs->r12;
+env->regs[13] = regs->r13;
+env->regs[14] = regs->r14;
+env->regs[15] = regs->r15;
+env->regs[16] = regs->r16;
+env->regs[17] = regs->r17;
+env->regs[18] = regs->r18;
+env->regs[19] = regs->r19;
+env->regs[20] = regs->r20;
+env->regs[21] = regs->r21;
+env->regs[22] = regs->r22;
+env->regs[23] = regs->r23;
+env->regs[24] = regs->r24;
+env->regs[25] = regs->r25;
+env->regs[26] = regs->r26;
+env->regs[27] = regs->r27;
+env->regs[28] = regs->r28;
+env->regs[29] = regs->r29;
+env->regs[30] = regs->r30;
+env->regs[31] = regs->r31;
+env->regs[32] = regs->r32;
+env->regs[33] = regs->r33;
+env->regs[34] = regs->r34;
+env->regs[35] = regs->r35;
+env->regs[36] = regs->r36;
+env->regs[37] = regs->r37;
+env->regs[38] = regs->r38;
+env->regs[39] = regs->r39;
+env->regs[40] = regs->r40;
+env->regs[41] = regs->r41;
+env->regs[42] = regs->r42;
+env->regs[43] = regs->r43;
+env->regs[44] = regs->r44;
+env->regs[45] = regs->r45;
+env->regs[46] = regs->r46;
+env->regs[47] = regs->r47;
+env->regs[48] = regs->r48;
+env->regs[49] = regs->r49;
+env->regs[50] = regs->r50;
+env->regs[51] = regs->r51;
+env->regs[52] = regs->r52; /* TILEGX_R_BP */
+env->regs[53] = regs->tp;  /* TILEGX_R_TP */
+env->regs[54] = regs->sp;  /* TILEGX_R_SP */
+env->regs[55] = regs->lr;  /* TILEGX_R_LR */
+env->pc = regs->lr;
+}
 #else
 #error unsupported target CPU
 #endif
-- 
1.9.3
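
How the ELF entry point ends up in the program counter in this patch can be followed with a reduced model. This is a sketch, not the patch itself: the structures below keep only the fields needed for the illustration (the general registers r0..r52 are elided), and load_initial_regs() is a hypothetical name for the register-copy block added to main().

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-ins for the real qemu structures. */
struct image_info {
    uint64_t entry;         /* ELF entry point */
    uint64_t start_stack;   /* initial stack pointer */
};

struct target_pt_regs {
    uint64_t tp, sp, lr;    /* general registers elided in this sketch */
};

typedef struct CPUTLGState {
    uint64_t regs[56];
    uint64_t pc;
} CPUTLGState;

/* As in elfload.c above: the entry point is parked in lr. */
static void init_thread(struct target_pt_regs *regs,
                        const struct image_info *infop)
{
    regs->lr = infop->entry;
    regs->sp = infop->start_stack;
}

/* As in main.c above: pc is then taken from lr. */
static void load_initial_regs(CPUTLGState *env,
                              const struct target_pt_regs *regs)
{
    env->regs[53] = regs->tp;   /* TILEGX_R_TP */
    env->regs[54] = regs->sp;   /* TILEGX_R_SP */
    env->regs[55] = regs->lr;   /* TILEGX_R_LR */
    env->pc = regs->lr;
}
```

Since target_pt_regs has no pc field here, lr doubles as the carrier for the entry address until the CPU state is set up.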

[Qemu-devel] [PATCH 0/6 v5] tilegx: Can load elf64 tilegx binary successfully for linux-user

2015-03-05 Thread Chen Gang

After loading an elf64 tilegx binary for linux-user, execution reaches the
first correct instruction "__start". The next step is to load all
instructions for qemu to use.

This patch is based on Linux kernel tile architecture tilegx 64-bit
implementation, and also based on tilegx architecture ABI reference.

The related test:

  [root@localhost qemu]# ./configure --target-list=tilegx-linux-user && make 
  [root@localhost qemu]# ./tilegx-linux-user/qemu-tilegx -d all ./test.tgx
  CPU Reset (CPU 0)
  CPU Reset (CPU 0)
  host mmap_min_addr=0x1
  Reserved 0xe bytes of guest address space
  Relocating guest address space from 0x0001 to 0x1
  guest_base  0x0
  start    end      size     prot
  0001-000e 000d r-x
  000e-000f 0001 rw-
  0040-0041 0001 ---
  0041-00400081 0080 rw-
  start_brk   0x
  end_code    0x000d86f7
  start_code  0x0001
  start_data  0x000e86f8
  end_data    0x000ea208
  start_stack 0x00400080f250
  brk         0x000ec2b0
  entry       0x00010f60
  PROLOGUE: [size=40]
  0x7fcc44c716f0:  push   %rbp
  0x7fcc44c716f1:  push   %rbx
  0x7fcc44c716f2:  push   %r12
  0x7fcc44c716f4:  push   %r13
  0x7fcc44c716f6:  push   %r14
  0x7fcc44c716f8:  push   %r15
  0x7fcc44c716fa:  mov    %rdi,%r14
  0x7fcc44c716fd:  add    $0xfb78,%rsp
  0x7fcc44c71704:  jmpq   *%rsi
  0x7fcc44c71706:  add    $0x488,%rsp
  0x7fcc44c7170d:  pop    %r15
  0x7fcc44c7170f:  pop    %r14
  0x7fcc44c71711:  pop    %r13
  0x7fcc44c71713:  pop    %r12
  0x7fcc44c71715:  pop    %rbx
  0x7fcc44c71716:  pop    %rbp
  0x7fcc44c71717:  retq

  Load elf64 tilegx successfully
  reach code start position: [00010f60] _start

  [root@localhost qemu]# echo $?
  0
  [root@localhost qemu]#


Chen Gang (6):
  target-tilegx: Firstly add TILE-Gx with minimized features
  linux-user: tilegx: Firstly add architecture related features
  linux-user: tilegx: Add target features support within qemu
  linux-user: Support tilegx architecture in syscall
  linux-user: Support tilegx architecture in linux-user
  linux-user/syscall.c: conditionalize syscalls which are not defined in
tilegx

 configure |   3 +
 default-configs/tilegx-linux-user.mak |   1 +
 include/elf.h |   2 +
 linux-user/elfload.c  |  23 +++
 linux-user/main.c |  74 +
 linux-user/syscall.c  |  50 +-
 linux-user/syscall_defs.h |  38 -
 linux-user/tilegx/syscall.h   |  80 ++
 linux-user/tilegx/syscall_nr.h| 278 +
 linux-user/tilegx/target_cpu.h|  35 +
 linux-user/tilegx/target_signal.h |  28 
 linux-user/tilegx/target_structs.h|  48 ++
 linux-user/tilegx/termbits.h  | 285 ++
 target-tilegx/Makefile.objs   |   1 +
 target-tilegx/cpu-qom.h   |  71 +
 target-tilegx/cpu.c   | 153 ++
 target-tilegx/cpu.h   |  85 ++
 target-tilegx/helper.h|   0
 target-tilegx/translate.c |  53 +++
 19 files changed, 1303 insertions(+), 5 deletions(-)
 create mode 100644 default-configs/tilegx-linux-user.mak
 create mode 100644 linux-user/tilegx/syscall.h
 create mode 100644 linux-user/tilegx/syscall_nr.h
 create mode 100644 linux-user/tilegx/target_cpu.h
 create mode 100644 linux-user/tilegx/target_signal.h
 create mode 100644 linux-user/tilegx/target_structs.h
 create mode 100644 linux-user/tilegx/termbits.h
 create mode 100644 target-tilegx/Makefile.objs
 create mode 100644 target-tilegx/cpu-qom.h
 create mode 100644 target-tilegx/cpu.c
 create mode 100644 target-tilegx/cpu.h
 create mode 100644 target-tilegx/helper.h
 create mode 100644 target-tilegx/translate.c

-- 
1.9.3

[Qemu-devel] [PATCH 1/6 v5] target-tilegx: Firstly add TILE-Gx with minimized features

2015-03-05 Thread Chen Gang

This adds configure and build system support for TILE-Gx ("tilegx" is the
name used in configure and for the sub-directory). At present it is
linux-user only.

Signed-off-by: Chen Gang 
---
 configure |   3 +
 default-configs/tilegx-linux-user.mak |   1 +
 target-tilegx/Makefile.objs   |   1 +
 target-tilegx/cpu-qom.h   |  71 
 target-tilegx/cpu.c   | 153 ++
 target-tilegx/cpu.h   |  85 +++
 target-tilegx/helper.h|   0
 target-tilegx/translate.c |  53 
 8 files changed, 367 insertions(+)
 create mode 100644 default-configs/tilegx-linux-user.mak
 create mode 100644 target-tilegx/Makefile.objs
 create mode 100644 target-tilegx/cpu-qom.h
 create mode 100644 target-tilegx/cpu.c
 create mode 100644 target-tilegx/cpu.h
 create mode 100644 target-tilegx/helper.h
 create mode 100644 target-tilegx/translate.c

diff --git a/configure b/configure
index 7ba4bcb..9586502 100755
--- a/configure
+++ b/configure
@@ -5191,6 +5191,9 @@ case "$target_name" in
   s390x)
 gdb_xml_files="s390x-core64.xml s390-acr.xml s390-fpr.xml"
   ;;
+  tilegx)
+TARGET_ARCH=tilegx
+  ;;
   unicore32)
   ;;
   xtensa|xtensaeb)
diff --git a/default-configs/tilegx-linux-user.mak b/default-configs/tilegx-linux-user.mak
new file mode 100644
index 000..3e47493
--- /dev/null
+++ b/default-configs/tilegx-linux-user.mak
@@ -0,0 +1 @@
+# Default configuration for tilegx-linux-user
diff --git a/target-tilegx/Makefile.objs b/target-tilegx/Makefile.objs
new file mode 100644
index 000..dcf2fe4
--- /dev/null
+++ b/target-tilegx/Makefile.objs
@@ -0,0 +1 @@
+obj-y += cpu.o translate.o
diff --git a/target-tilegx/cpu-qom.h b/target-tilegx/cpu-qom.h
new file mode 100644
index 000..4ee11e1
--- /dev/null
+++ b/target-tilegx/cpu-qom.h
@@ -0,0 +1,71 @@
+/*
+ * QEMU TILE-Gx CPU
+ *
+ * Copyright (c) 2015 Chen Gang
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see
+ * 
+ */
+#ifndef QEMU_TILEGX_CPU_QOM_H
+#define QEMU_TILEGX_CPU_QOM_H
+
+#include "qom/cpu.h"
+
+#define TYPE_TILEGX_CPU "tilegx-cpu"
+
+#define TILEGX_CPU_CLASS(klass) \
+OBJECT_CLASS_CHECK(TileGXCPUClass, (klass), TYPE_TILEGX_CPU)
+#define TILEGX_CPU(obj) \
+OBJECT_CHECK(TileGXCPU, (obj), TYPE_TILEGX_CPU)
+#define TILEGX_CPU_GET_CLASS(obj) \
+OBJECT_GET_CLASS(TileGXCPUClass, (obj), TYPE_TILEGX_CPU)
+
+/**
+ * TileGXCPUClass:
+ * @parent_realize: The parent class' realize handler.
+ * @parent_reset: The parent class' reset handler.
+ *
+ * A Tile-Gx CPU model.
+ */
+typedef struct TileGXCPUClass {
+/*< private >*/
+CPUClass parent_class;
+/*< public >*/
+
+DeviceRealize parent_realize;
+void (*parent_reset)(CPUState *cpu);
+} TileGXCPUClass;
+
+/**
+ * TileGXCPU:
+ * @env: #CPUTLGState
+ *
+ * A Tile-GX CPU.
+ */
+typedef struct TileGXCPU {
+/*< private >*/
+CPUState parent_obj;
+/*< public >*/
+
+CPUTLGState env;
+} TileGXCPU;
+
+static inline TileGXCPU *tilegx_env_get_cpu(CPUTLGState *env)
+{
+return container_of(env, TileGXCPU, env);
+}
+
+#define ENV_GET_CPU(e) CPU(tilegx_env_get_cpu(e))
+
+#endif
diff --git a/target-tilegx/cpu.c b/target-tilegx/cpu.c
new file mode 100644
index 000..cf46b8b
--- /dev/null
+++ b/target-tilegx/cpu.c
@@ -0,0 +1,153 @@
+/*
+ * QEMU TILE-Gx CPU
+ *
+ *  Copyright (c) 2015 Chen Gang
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see
+ * 
+ */
+
+#include "cpu.h"
+#include "qemu-common.h"
+#include "hw/qdev-properties.h"
+#include "migration/vmstate.h"
+
+TileGXCPU *cpu_tilegx_init(const char *cpu_model)
+{
+TileGXCPU *cpu;
+
+cpu = TILEGX_CPU(ob
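
The tilegx_env_get_cpu() helper in cpu-qom.h above relies on container_of to get from the embedded env member back to the enclosing CPU object. A self-contained sketch of that technique follows. The container_of macro here is a simplified offsetof-based version (QEMU's own macro differs in detail), and the struct layouts are reduced stand-ins for the real types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified container_of: step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced stand-ins for the QOM types used in the patch. */
typedef struct CPUState {
    int cpu_index;
} CPUState;

typedef struct CPUTLGState {
    uint64_t regs[56];
    uint64_t pc;
} CPUTLGState;

typedef struct TileGXCPU {
    CPUState parent_obj;    /* must stay first for QOM-style casts */
    CPUTLGState env;        /* embedded, not a pointer */
} TileGXCPU;

static inline TileGXCPU *tilegx_env_get_cpu(CPUTLGState *env)
{
    return container_of(env, TileGXCPU, env);
}
```

Because env is embedded rather than pointed to, no back-pointer needs to be stored; the address arithmetic recovers the owner from the member's offset.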

[Qemu-devel] [PATCH 2/6 v5] linux-user: tilegx: Firstly add architecture related features

2015-03-05 Thread Chen Gang

They are based on the Linux kernel tilegx architecture for 64-bit binaries
and on the tilegx ABI reference document.

Signed-off-by: Chen Gang 
---
 linux-user/tilegx/syscall.h|  80 
 linux-user/tilegx/syscall_nr.h | 278 
 linux-user/tilegx/termbits.h   | 285 +
 3 files changed, 643 insertions(+)
 create mode 100644 linux-user/tilegx/syscall.h
 create mode 100644 linux-user/tilegx/syscall_nr.h
 create mode 100644 linux-user/tilegx/termbits.h

diff --git a/linux-user/tilegx/syscall.h b/linux-user/tilegx/syscall.h
new file mode 100644
index 000..2edae92
--- /dev/null
+++ b/linux-user/tilegx/syscall.h
@@ -0,0 +1,80 @@
+#ifndef TILEGX_SYSCALLS_H
+#define TILEGX_SYSCALLS_H
+
+#define UNAME_MACHINE "tilegx"
+#define UNAME_MINIMUM_RELEASE "3.19"
+
+/* We use tilegx to keep things similar to the kernel sources.  */
+typedef uint64_t tilegx_reg_t;
+
+struct target_pt_regs {
+
+/* Argument registers (can be used to pass parameters) */
+tilegx_reg_t r0;
+tilegx_reg_t r1;
+tilegx_reg_t r2;
+tilegx_reg_t r3;
+tilegx_reg_t r4;
+tilegx_reg_t r5;
+tilegx_reg_t r6;
+tilegx_reg_t r7;
+tilegx_reg_t r8;
+tilegx_reg_t r9;
+
+/* General purpose, caller saved */
+tilegx_reg_t r10;
+tilegx_reg_t r11;
+tilegx_reg_t r12;
+tilegx_reg_t r13;
+tilegx_reg_t r14;
+tilegx_reg_t r15;
+tilegx_reg_t r16;
+tilegx_reg_t r17;
+tilegx_reg_t r18;
+tilegx_reg_t r19;
+tilegx_reg_t r20;
+tilegx_reg_t r21;
+tilegx_reg_t r22;
+tilegx_reg_t r23;
+tilegx_reg_t r24;
+tilegx_reg_t r25;
+tilegx_reg_t r26;
+tilegx_reg_t r27;
+tilegx_reg_t r28;
+tilegx_reg_t r29;
+
+/* General purpose, callee saved */
+tilegx_reg_t r30;
+tilegx_reg_t r31;
+tilegx_reg_t r32;
+tilegx_reg_t r33;
+tilegx_reg_t r34;
+tilegx_reg_t r35;
+tilegx_reg_t r36;
+tilegx_reg_t r37;
+tilegx_reg_t r38;
+tilegx_reg_t r39;
+tilegx_reg_t r40;
+tilegx_reg_t r41;
+tilegx_reg_t r42;
+tilegx_reg_t r43;
+tilegx_reg_t r44;
+tilegx_reg_t r45;
+tilegx_reg_t r46;
+tilegx_reg_t r47;
+tilegx_reg_t r48;
+tilegx_reg_t r49;
+tilegx_reg_t r50;
+tilegx_reg_t r51;
+
+/* Special-purpose registers */
+tilegx_reg_t r52;   /* optional frame pointer */
+tilegx_reg_t tp;    /* thread-local data */
+tilegx_reg_t sp;    /* stack pointer */
+tilegx_reg_t lr;    /* link register (return address) */
+};
+
+#define TARGET_MLOCKALL_MCL_CURRENT 1
+#define TARGET_MLOCKALL_MCL_FUTURE  2
+
+#endif
diff --git a/linux-user/tilegx/syscall_nr.h b/linux-user/tilegx/syscall_nr.h
new file mode 100644
index 000..8121154
--- /dev/null
+++ b/linux-user/tilegx/syscall_nr.h
@@ -0,0 +1,278 @@
+#ifndef TILEGX_SYSCALL_NR
+#define TILEGX_SYSCALL_NR
+
+/*
+ * Copied from the Linux kernel asm-generic/unistd.h, which tilegx uses.
+ */
+#define TARGET_NR_io_setup                      0
+#define TARGET_NR_io_destroy                    1
+#define TARGET_NR_io_submit                     2
+#define TARGET_NR_io_cancel                     3
+#define TARGET_NR_io_getevents                  4
+#define TARGET_NR_setxattr                      5
+#define TARGET_NR_lsetxattr                     6
+#define TARGET_NR_fsetxattr                     7
+#define TARGET_NR_getxattr                      8
+#define TARGET_NR_lgetxattr                     9
+#define TARGET_NR_fgetxattr                     10
+#define TARGET_NR_listxattr                     11
+#define TARGET_NR_llistxattr                    12
+#define TARGET_NR_flistxattr                    13
+#define TARGET_NR_removexattr                   14
+#define TARGET_NR_lremovexattr                  15
+#define TARGET_NR_fremovexattr                  16
+#define TARGET_NR_getcwd                        17
+#define TARGET_NR_lookup_dcookie                18
+#define TARGET_NR_eventfd2                      19
+#define TARGET_NR_epoll_create1                 20
+#define TARGET_NR_epoll_ctl                     21
+#define TARGET_NR_epoll_pwait                   22
+#define TARGET_NR_dup                           23
+#define TARGET_NR_dup3                          24
+#define TARGET_NR_fcntl                         25
+#define TARGET_NR_inotify_init1                 26
+#define TARGET_NR_inotify_add_watch             27
+#define TARGET_NR_inotify_rm_watch              28
+#define TARGET_NR_ioctl                         29
+#define TARGET_NR_ioprio_set                    30
+#define TARGET_NR_ioprio_get                    31
+#define TARGET_NR_flock                         32
+#define TARGET_NR_mknodat                       33
+#define TARGET_NR_mkdirat                       34
+#define TARGET_NR_unlinkat                      35
+#define TARGET_NR_symlinkat                     36
+#define TARGET_NR_linkat                        37
+#define TARGET_NR_renameat                      38
+#define TARGET_NR_umount2                       39
+#define TARGET

Re: [Qemu-devel] [PATCH 00/21] RFC: userfaultfd v3

2015-03-05 Thread Pavel Emelyanov


> All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
> live migration and the UFFD can be passed to a manager process through
> unix domain sockets to satisfy point 5).

Yup :) That's the best (from my POV) point of ufd -- the ability to delegate
the descriptor to some other task. Though there are several limitations (I've
expressed them in other e-mails), I'm definitely supporting this!

The respective CRIU code is still quite sloppy; I will try to brush it up and
show it soon.

Thanks,
Pavel

Re: [Qemu-devel] [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

2015-03-05 Thread Pavel Emelyanov

> diff --git a/kernel/fork.c b/kernel/fork.c
> index cf65139..cb215c0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>   goto fail_nomem_anon_vma_fork;
>   tmp->vm_flags &= ~VM_LOCKED;
>   tmp->vm_next = tmp->vm_prev = NULL;
> + tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;

This creates an interesting effect when the userfaultfd is used outside of
the process which created and activated one. If I try to monitor the memory
usage of one task with another, once the first task fork()-s, its child
begins to see zero-pages in the places where the monitor task was supposed
to insert pages with data.

>   file = tmp->vm_file;
>   if (file) {
>   struct inode *inode = file_inode(file);
> .
>

Re: [Qemu-devel] [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Pavel Emelyanov

> +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> + unsigned long dst_start, unsigned long src_start,
> + unsigned long len, __u64 mode)
> +{
> + struct vm_area_struct *src_vma, *dst_vma;
> + long err = -EINVAL;
> + pmd_t *src_pmd, *dst_pmd;
> + pte_t *src_pte, *dst_pte;
> + spinlock_t *dst_ptl, *src_ptl;
> + unsigned long src_addr, dst_addr;
> + int thp_aligned = -1;
> + ssize_t moved = 0;
> +
> + /*
> +  * Sanitize the command parameters:
> +  */
> + BUG_ON(src_start & ~PAGE_MASK);
> + BUG_ON(dst_start & ~PAGE_MASK);
> + BUG_ON(len & ~PAGE_MASK);
> +
> + /* Does the address range wrap, or is the span zero-sized? */
> + BUG_ON(src_start + len <= src_start);
> + BUG_ON(dst_start + len <= dst_start);
> +
> + /*
> +  * Because these are read semaphores there's no risk of lock
> +  * inversion.
> +  */
> + down_read(&dst_mm->mmap_sem);
> + if (dst_mm != src_mm)
> + down_read(&src_mm->mmap_sem);
> +
> + /*
> +  * Make sure the vma is not shared, that the src and dst remap
> +  * ranges are both valid and fully within a single existing
> +  * vma.
> +  */
> + src_vma = find_vma(src_mm, src_start);
> + if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> + goto out;
> + if (src_start < src_vma->vm_start ||
> + src_start + len > src_vma->vm_end)
> + goto out;
> +
> + dst_vma = find_vma(dst_mm, dst_start);
> + if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> + goto out;

I again have a concern about the case when one task monitors the VM of the
other one. If the target task (owning the mm) unmaps a VMA then the monitor
task (holding and operating on the ufd) will get plain EINVAL on UFFDIO_REMAP
request. This is not fatal, but still inconvenient as it will be hard to
find out the reason for failure -- dst VMA is removed and the monitor should
just drop the respective pages with data, or some other error has occurred
and some other actions should be taken.

Thanks,
Pavel
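
The ambiguity described above comes from several distinct checks all ending in the same -EINVAL. A userspace sketch of the quoted validation, returning a separate reason for each failure, shows what a monitor would ideally be able to distinguish. Everything here is hypothetical: the page size, the vma_range struct (a NULL pointer stands for "no VMA found"), the enum names and check_remap_range() are illustrations, and the kernel code uses BUG_ON for the alignment and wrap checks rather than an error return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL                     /* assumption */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Hypothetical reduced VMA: only what the quoted checks look at. */
struct vma_range {
    unsigned long vm_start;
    unsigned long vm_end;
    bool shared;
};

enum remap_check {
    REMAP_OK = 0,
    REMAP_UNALIGNED,        /* start or len not page aligned */
    REMAP_WRAPS_OR_EMPTY,   /* range wraps around or has zero length */
    REMAP_NO_VMA,           /* nothing mapped there (e.g. it was unmapped) */
    REMAP_SHARED,           /* VM_SHARED mappings are rejected */
    REMAP_OUT_OF_VMA        /* range not fully inside one VMA */
};

static enum remap_check check_remap_range(const struct vma_range *vma,
                                          unsigned long start,
                                          unsigned long len)
{
    if ((start & ~SKETCH_PAGE_MASK) || (len & ~SKETCH_PAGE_MASK)) {
        return REMAP_UNALIGNED;
    }
    if (start + len <= start) {     /* unsigned wraparound is well defined */
        return REMAP_WRAPS_OR_EMPTY;
    }
    if (!vma) {
        return REMAP_NO_VMA;
    }
    if (vma->shared) {
        return REMAP_SHARED;
    }
    if (start < vma->vm_start || start + len > vma->vm_end) {
        return REMAP_OUT_OF_VMA;
    }
    return REMAP_OK;
}
```

With a single EINVAL, the REMAP_NO_VMA case (drop the pages) cannot be told apart from the others (treat as a real error).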

Re: [Qemu-devel] [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization

2015-03-05 Thread Pavel Emelyanov


> +int handle_userfault(struct vm_area_struct *vma, unsigned long address,
> +  unsigned int flags, unsigned long reason)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct userfaultfd_ctx *ctx;
> + struct userfaultfd_wait_queue uwq;
> +
> + BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
> +
> + ctx = vma->vm_userfaultfd_ctx.ctx;
> + if (!ctx)
> + return VM_FAULT_SIGBUS;
> +
> + BUG_ON(ctx->mm != mm);
> +
> + VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> + VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +
> + /*
> +  * If it's already released don't get it. This avoids to loop
> +  * in __get_user_pages if userfaultfd_release waits on the
> +  * caller of handle_userfault to release the mmap_sem.
> +  */
> + if (unlikely(ACCESS_ONCE(ctx->released)))
> + return VM_FAULT_SIGBUS;
> +
> + /* check that we can return VM_FAULT_RETRY */
> + if (unlikely(!(flags & FAULT_FLAG_ALLOW_RETRY))) {
> + /*
> +  * Validate the invariant that nowait must allow retry
> +  * to be sure not to return SIGBUS erroneously on
> +  * nowait invocations.
> +  */
> + BUG_ON(flags & FAULT_FLAG_RETRY_NOWAIT);
> +#ifdef CONFIG_DEBUG_VM
> + if (printk_ratelimit()) {
> + printk(KERN_WARNING
> +"FAULT_FLAG_ALLOW_RETRY missing %x\n", flags);
> + dump_stack();
> + }
> +#endif
> + return VM_FAULT_SIGBUS;
> + }
> +
> + /*
> +  * Handle nowait, not much to do other than tell it to retry
> +  * and wait.
> +  */
> + if (flags & FAULT_FLAG_RETRY_NOWAIT)
> + return VM_FAULT_RETRY;
> +
> + /* take the reference before dropping the mmap_sem */
> + userfaultfd_ctx_get(ctx);
> +
> + /* be gentle and immediately relinquish the mmap_sem */
> + up_read(&mm->mmap_sem);
> +
> + init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> + uwq.wq.private = current;
> + uwq.address = userfault_address(address, flags, reason);

Since we report only the virtual address of the fault, this will cause
difficulties for a task monitoring the address space of some other task.
For example:

Let's assume a task creates a userfaultfd, activates it, registers several
VMAs in it and then sends the ufd descriptor to another task. If the first
task later remaps those VMAs and starts touching pages, the monitor will
start receiving fault addresses from which it cannot tell which VMA the
requests come from.

Thanks,
Pavel
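
To illustrate the concern, here is a small user-space model. It is only a sketch under stated assumptions, not kernel or QEMU code: `struct region` and `find_region()` are hypothetical names for the bookkeeping a monitor might keep, filled in at registration time and searched by fault address alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping kept by a monitor process: the ranges it
 * believes are registered in the monitored task's userfaultfd. */
struct region {
    uintptr_t start;
    size_t len;
};

/* Look up a reported fault address in the monitor's table.
 * Returns the index of the matching region, or -1 if none matches. */
int find_region(const struct region *tab, size_t n, uintptr_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= tab[i].start && addr - tab[i].start < tab[i].len) {
            return (int)i;
        }
    }
    return -1;
}
```

Once the monitored task moves a registered range with mremap(), a fault reported at the new address matches no entry in such a table, which is the situation described in the mail.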

[Qemu-devel] [Bug 1428657] [NEW] qemu-system-arm does not ignore the lowest bit of pc when returning from interrrupt

2015-03-05 Thread Anders Esbensen

Public bug reported:

This was observed in qemu v2.1.3, running a sample app from

FreeRTOS(FreeRTOSV7.5.2/FreeRTOS/Demo/CORTEX_LM3S_Eclipse/RTOSDemo)

The sample code was compiled with arm-none-eabi-gcc, version 4.8.2
(4.8.2-14ubuntu1+6).

qemu seems to be executing the wrong instruction after returning from
the SVCHandler. The svc handler changes the PSP register and the new
stack contains an odd return address, which should be allowed
(http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka12545.html).
The lowest bit of the address should be ignored, but it seems that qemu
executes garbage after returning from the interrupt.

qemu is run like this:

qemu-system-arm -semihosting -machine lm3s6965evb -kernel RTOSDemo.axf
-gdb tcp::1234 -S


this is the arm-gdb trace
Program received signal SIGINT, Interrupt.
IntDefaultHandler () at startup.c:231
231 {
(gdb) bt
#0  IntDefaultHandler () at startup.c:231
#1  0xfffc in ?? ()

(gdb) info registers 
r0             0x0         0
r1             0x14b4b4b4  347387060
r2             0xa5a5a5a5  -1515870811
r3             0xa5a5a53d  -1515870915
r4             0xa5a5a5a5  -1515870811
r5             0xa5a5a5a5  -1515870811
r6             0xa5a5a5a5  -1515870811
r7             0x40d00542  1087374658
r8             0xa5a5a5a5  -1515870811
r9             0xa5a5a5a5  -1515870811
r10            0xa5a5a5a5  -1515870811
r11            0xa5a5a5a5  -1515870811
r12            0xa5a5a5a5  -1515870811
sp             0x20008380  0x20008380
lr             0xfffffffd  -3
pc             0xc648      0xc648
cpsr           0x20000173  536871283

This exception occurs after running the SVC handler code:

(gdb) disassemble vPortSVCHandler 
Dump of assembler code for function vPortSVCHandler:
   0xc24c <+0>:     ldr     r3, [pc, #24]   ; (0xc268 )
   0xc24e <+2>:     ldr     r1, [r3, #0]
   0xc250 <+4>:     ldr     r0, [r1, #0]
   0xc252 <+6>:     ldmia.w r0!, {r4, r5, r6, r7, r8, r9, r10, r11}
   0xc256 <+10>:    msr     PSP, r0
   0xc25a <+14>:    mov.w   r0, #0
   0xc25e <+18>:    msr     BASEPRI, r0
   0xc262 <+22>:    orr.w   lr, lr, #13
   0xc266 <+26>:    bx      lr
   0xc268 <+28>:    andcs   r2, r0, r12, ror #5
End of assembler dump.

This loads the following stack into the PSP register:
(gdb) x /32 0x200052c8
0x200052c8: 0xa5a5a5a5  0xa5a5a5a5  0xa5a5a5a5  0xa5a5a5a5
0x200052d8: 0xa5a5a5a5  0xa5a5a5a5  0xa5a5a5a5  0xa5a5a5a5
0x200052e8: 0x  0x14b4b4b4  0xa5a5a5a5  0xa5a5a53d
0x200052f8: 0xa5a5a5a5  0x  0x3b49  0x2100
0x20005308: 0xa5a5a5a5  0xa5a5a5a5  0x200081b8  0x0058
0x20005318: 0x  0x  0x  0x
0x20005328: 0x  0x20005330  0x  0x20005330
0x20005338: 0x20005330  0x  0x20005344  0x

It seems that qemu actually executes 0x3b49 after the interrupt, but
it should execute 0x3b48
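
The behaviour the report expects can be stated as a one-line rule. The sketch below is not QEMU's code; `exception_return_pc()` is a hypothetical helper that takes the PC word popped from the exception stack frame and drops bit 0, as the report says should happen.

```c
#include <stdint.h>

/* Hypothetical helper: compute the address execution resumes at from the
 * PC word popped off the exception stack frame. Thumb instructions are
 * halfword aligned, so bit 0 carries no address information here. */
uint32_t exception_return_pc(uint32_t stacked_pc)
{
    return stacked_pc & ~UINT32_C(1);
}
```

With the stacked value from the dump above, `exception_return_pc(0x3b49)` yields 0x3b48, the address the reporter expects to be executed.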

** Affects: qemu
 Importance: Undecided
 Status: New


** Tags: arm cortex-m3

** Attachment added: "Test program for -machine lm3s6965evb"
   
https://bugs.launchpad.net/bugs/1428657/+attachment/4335262/+files/RTOSDemo.axf

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1428657


Re: [Qemu-devel] [PATCH v3 for-2.3 10/24] hw/apci: add _PRT method for extra PCI root busses

2015-03-05 Thread Michael S. Tsirkin

On Thu, Mar 05, 2015 at 04:55:08PM +0200, Marcel Apfelbaum wrote:
> Signed-off-by: Marcel Apfelbaum 
> ---
>  hw/i386/acpi-build.c | 78 
> 
>  1 file changed, 78 insertions(+)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index e5709e8..f0401d2 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -664,6 +664,83 @@ static void build_append_pci_bus_devices(Aml 
> *parent_scope, PCIBus *bus,
>  aml_append(parent_scope, method);
>  }
>  
> +static Aml *build_prt(void)
> +{
> +Aml *method, *pkg, *if_ctx, *while_ctx;
> +
> +method = aml_method("_PRT", 0);
> +
> +aml_append(method, aml_store(aml_package(128), aml_local(0)));
> +aml_append(method, aml_store(aml_int(0), aml_local(1)));
> +while_ctx = aml_while(aml_lless(aml_local(1), aml_int(128)));
> +{
> +aml_append(while_ctx,
> +aml_store(aml_shiftright(aml_local(1), aml_int(2)), 
> aml_local(2)));
> +aml_append(while_ctx,
> +aml_store(aml_and(aml_add(aml_local(1), aml_local(2)), 
> aml_int(3)),
> +  aml_local(3)));
> +
> +if_ctx = aml_if(aml_equal(aml_local(3), aml_int(0)));
> +{
> +pkg = aml_package(4);
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_name("LNKD"));
> +aml_append(pkg, aml_int(0));
> +aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> +}
> +aml_append(while_ctx, if_ctx);
> +
> +if_ctx = aml_if(aml_equal(aml_local(3), aml_int(1)));
> +{
> +pkg = aml_package(4);
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_name("LNKA"));
> +aml_append(pkg, aml_int(0));
> +aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> +}
> +aml_append(while_ctx, if_ctx);
> +
> +if_ctx = aml_if(aml_equal(aml_local(3), aml_int(2)));
> +{
> +pkg = aml_package(4);
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_name("LNKB"));
> +aml_append(pkg, aml_int(0));
> +aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> +}
> +aml_append(while_ctx, if_ctx);
> +
> +if_ctx = aml_if(aml_equal(aml_local(3), aml_int(3)));
> +{
> +pkg = aml_package(4);
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_int(0));
> +aml_append(pkg, aml_name("LNKC"));
> +aml_append(pkg, aml_int(0));
> +aml_append(if_ctx, aml_store(pkg, aml_local(4)));
> +}
> +aml_append(while_ctx, if_ctx);
> +
> +aml_append(while_ctx,
> +aml_store(aml_or(aml_shiftleft(aml_local(2), aml_int(16)),
> + aml_int(0xFFFF)),
> +  aml_index(aml_local(4), aml_int(0))));
> +aml_append(while_ctx,
> +aml_store(aml_and(aml_local(1), aml_int(3)),
> +  aml_index(aml_local(4), aml_int(1))));
> +aml_append(while_ctx,
> +aml_store(aml_local(4), aml_index(aml_local(0), aml_local(1))));
> +aml_append(while_ctx, aml_increment(aml_local(1)));
> +}
> +aml_append(method, while_ctx);
> +aml_append(method, aml_return(aml_local(0)));
> +
> +return method;
> +}
> +

Please improve the readability of this code using comments, sub-functions
and local variables.
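
As one way to see what the generated AML computes, the routing can be written out in plain C. This is a sketch for illustration, not a proposed replacement for the patch: `struct prt_entry` and `prt_route()` are hypothetical names, and the 0xFFFF low word of the address is assumed to be the usual _PRT "any function" encoding.

```c
#include <stdint.h>

/* One _PRT entry as computed by the loop in build_prt(). */
struct prt_entry {
    uint32_t address;   /* (slot << 16) | 0xFFFF: any function of the slot */
    uint32_t pin;       /* 0..3, i.e. INTA..INTD */
    const char *link;   /* name of the link device */
};

/* idx runs over 0..127: 32 slots times 4 pins. */
struct prt_entry prt_route(unsigned idx)
{
    /* Same order as the four aml_if() blocks: 0, 1, 2, 3. */
    static const char *const links[4] = { "LNKD", "LNKA", "LNKB", "LNKC" };
    unsigned slot = idx >> 2;           /* aml_local(2) in the patch */
    struct prt_entry e;

    e.address = ((uint32_t)slot << 16) | 0xFFFFu;
    e.pin = idx & 3;
    e.link = links[(idx + slot) & 3];   /* aml_local(3) in the patch */
    return e;
}
```

Naming the slot, the pin and the link index this way is the kind of local variable the review asks for; the four near-identical `aml_if()` blocks then differ only in the link name.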


>  static void
>  build_ssdt(GArray *table_data, GArray *linker,
> AcpiCpuInfo *cpu, AcpiPmInfo *pm, AcpiMiscInfo *misc,
> @@ -708,6 +785,7 @@ build_ssdt(GArray *table_data, GArray *linker,
>  aml_append(dev, aml_name_decl("_HID", aml_string("PNP0A03")));
>  aml_append(dev,
>  aml_name_decl("_BBN", aml_int((uint8_t)bus_info->bus)));
> +aml_append(dev, build_prt());
>  aml_append(scope, dev);
>  aml_append(ssdt, scope);
>  }
> -- 
> 2.1.0

Re: [Qemu-devel] [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Linus Torvalds

On Thu, Mar 5, 2015 at 10:51 AM, Andrea Arcangeli  wrote:
>
> Thanks for your idea that the UFFDIO_COPY is faster, the userland code
> we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE, it never uses
> UFFDIO_REMAP.

Ok. So there's no actual expected use of the remap interface. Good.
That makes this series more palatable, since the rest didn't raise my
hackles much.

(But yeah, the documentation patch didn't really explain the uses very
much or at all, so I think something more is needed in that area).

   Linus

Re: [Qemu-devel] [PATCH v4 2/5] target-i386: Remove unused APIC ID default code

2015-03-05 Thread Eduardo Habkost

On Thu, Mar 05, 2015 at 07:35:17PM +0100, Andreas Färber wrote:
> Am 05.03.2015 um 14:43 schrieb Eduardo Habkost:
> > On Tue, Mar 03, 2015 at 11:13:41PM -0300, Eduardo Habkost wrote:
> >> The existing apic_id = cpu_index code has no visible effect: the PC code
> >> already initializes the APIC ID according to the topology on
> >> pc_new_cpu(), and linux-user memcpy()s the CPU state (including
> >> cpuid_apic_id) on cpu_copy().
> >>
> >> Remove the dead code and simply let APIC ID be 0 by default. This
> >> doesn't change behavior of PC because apic-id is already explicitly set,
> >> and doesn't affect linux-user because APIC ID was already always 0.
> >>
> >> Signed-off-by: Eduardo Habkost 
> > 
> > This patch is holding the rest of the series, so a Reviewed-by or
> > Acked-by would be welcome.
> > 
> > This change removes the 254-CPU limit from {i386,x86_64}-linux-user that
> > Peter and I discussed previously.
> 
> Reviewed-by: Andreas Färber 
> 
> Are you going to send a new pull for the 2 plus these 5 now?

Yes. I plan to send a pull request tomorrow.

(If we get reviews in time, the pull request may include the
instance_init series as well)

-- 
Eduardo

Re: [Qemu-devel] [RFC PATCH 01/14] docs: block replication's description

2015-03-05 Thread Dr. David Alan Gilbert

* Wen Congyang (we...@cn.fujitsu.com) wrote:
> On 03/05/2015 12:35 AM, Dr. David Alan Gilbert wrote:
> > * Wen Congyang (we...@cn.fujitsu.com) wrote:
> >> Signed-off-by: Wen Congyang 
> >> Signed-off-by: Paolo Bonzini 
> >> Signed-off-by: Yang Hongyang 
> >> Signed-off-by: zhanghailiang 
> >> Signed-off-by: Gonglei 
> > 
> > Hi,
> > 
> >> ---
> >>  docs/block-replication.txt | 129 
> >> +
> >>  1 file changed, 129 insertions(+)
> >>  create mode 100644 docs/block-replication.txt
> >>
> >> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> >> new file mode 100644
> >> index 000..59150b8
> >> --- /dev/null
> >> +++ b/docs/block-replication.txt
> >> @@ -0,0 +1,129 @@
> >> +Block replication
> >> +
> >> +Copyright Fujitsu, Corp. 2015
> >> +Copyright (c) 2015 Intel Corporation
> >> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.
> >> +
> >> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> +See the COPYING file in the top-level directory.
> >> +
> >> +The block replication is used for continuous checkpoints. It is designed
> >> +for COLO that Secondary VM is running. It can also be applied for FT/HA
> >> +scene that Secondary VM is not running.
> >> +
> >> +This document gives an overview of block replication's design.
> >> +
> >> +== Background ==
> >> +High availability solutions such as micro checkpoint and COLO will do
> >> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is
> >> +identical right after a VM checkpoint, but becomes different as the VM
> >> +executes till the next checkpoint. To support disk contents checkpoint,
> >> +the modified disk contents in the Secondary VM must be buffered, and are
> >> +only dropped at next checkpoint time. To reduce the network transportation
> >> +effort at the time of checkpoint, the disk modification operations of
> >> +Primary disk are asynchronously forwarded to the Secondary node.
> > 
> > Can you explain how the block data is synchronised with the main checkpoint
> > stream?  i.e. when the secondary receives a new checkpoint how does it know
> > it's received all of the block writes from the primary associated with that
> > checkpoint and that all the following writes that it receives are for the
> > next checkpoint period?
> 
> NBD server will do it. Writing to NBD client will return after NBD server 
> replies
> the result(ACK or error).

Ah OK, so if the NBD client is synchronous then yes I can see that;
(I was confused by the word 'asynchronously' in your description above
but I guess that means asynchronous to the checkpoint stream).
I see that 'do_colo_transaction' keeps the primary stopped until after
the secondary does blk_do_checkpoint and then sends 'LOADED'.

I think yes that should work; although potentially you could make it faster;
since the primary doesn't need to know that its write has been committed
until the next checkpoint, and if you could mark the separation between the two
checkpoints, then you could start the primary running again earlier.  But that's
all more complicated; this should work OK.

Thanks for the explanation,

Dave

> Thanks
> Wen Congyang
> 
> > 
> > Dave
> > 
> >> +
> >> +== Workflow ==
> >> +The following is the image of block replication workflow:
> >> +
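> >> ++----------------------+            +------------------------+
> >> +|Primary Write Requests|            |Secondary Write Requests|
> >> ++----------------------+            +------------------------+
> >> +          |                                       |
> >> +          |                                      (4)
> >> +          |                                       V
> >> +          |                              /-------------\
> >> +          |      Copy and Forward        |             |
> >> +          |---------(1)----------+       | Disk Buffer |
> >> +          |                      |       |             |
> >> +          |                     (3)      \-------------/
> >> +          |                 speculative      ^
> >> +          |                write through    (2)
> >> +          |                      |           |
> >> +          V                      V           |
> >> +   +--------------+           +----------------+
> >> +   | Primary Disk |           | Secondary Disk |
> >> +   +--------------+           +----------------+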
> >> +
> >> +1) Primary write requests will be copied and forwarded to Secondary
> >> +   QEMU.
> >> +2) Before Primary write requests are written to Secondary disk, the
> >> +   original sector content will be read from Secondary disk and
> >> +   buffered in the Disk buffer, but it will not overwrite the existing
> >> +   sector content in the Disk buffer.
> >> +3) Primary write requests will

Re: [Qemu-devel] [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Andrea Arcangeli

On Thu, Mar 05, 2015 at 09:39:48AM -0800, Linus Torvalds wrote:
> Is this really worth it? On real loads? That people are expected to use?

I fully agree that it's not worth merging UFFDIO_REMAP upstream until
(and unless) a real-world use for it shows up. To clarify further:
had this not been an RFC, the patchset would have stopped at
patch number 15/21 inclusive.

Merging UFFDIO_REMAP with no real-life users would just increase the
attack surface of the kernel for no good reason.

Thanks for your idea that the UFFDIO_COPY is faster, the userland code
we submitted for qemu only uses UFFDIO_COPY|ZEROPAGE, it never uses
UFFDIO_REMAP. I immediately agreed about UFFDIO_COPY being preferable
after you mentioned it during review of the previous RFC.

However, this being an RFC with a large audience, and UFFDIO_REMAP
allowing to "remove" memory (think of externalizing memory into
ceph with deduplication or such), I still added it just in case there
are real world use cases that may justify me keeping it around (even
if I would definitely not have submitted it for merging in the short
term regardless).

In addition to dropping the parts that aren't suitable for merging in
the short term like UFFDIO_REMAP, for any further submissions that don't
substantially alter the API, as happened between the v2 and v3
RFCs, I'll also shrink the To/Cc list considerably.

> Considering how we just got rid of one special magic VM remapping
> thing that nobody actually used, I'd really hate to add a new one.

Having to define an API somehow, I tried to think of all possible
future usages and make sure the API would allow for those if needed.

> Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
> argue for not merging the remap parts. I just don't see the point. It
> doesn't seem to add anything that is semantically very important -
> it's *potentially* a faster copy, but even that is
> 
>   (a) questionable in the first place

Yes, we already measured that UFFDIO_COPY is faster than UFFDIO_REMAP;
the userfault latency decreases by 20%.

> 
> and
> 
>  (b) unclear why anybody would ever care about performance of
> infrastructure that nobody actually uses today, and future use isn't
> even clear or shown to be particularly performance-sensitive.

The only potential _theoretical_ case that justifies the existence of
UFFDIO_REMAP is about "removing" memory from the address space. To
"add" memory, UFFDIO_COPY and UFFDIO_ZEROPAGE are always preferable,
as you suggested.

> So basically I'd like to see better documentation, a few real use
> cases (and by real I very much do *not* mean "you can use it for
> this", but actual patches to actual projects that matter and that are
> expected to care and merge them), and a simplified series that doesn't
> do the remap thing.

So far I wrote some doc in 2/21 and in the cover letter, but certainly
more docs are necessary. Trinity is also needed (I got trinity running
on the v2 API but I haven't adapted to the new API yet).

About the real world usages, this is the primary one:

http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

And it actually cannot be merged in qemu until userfaultfd is merged
in the kernel. There's simply no safe way to implement postcopy live
migration without something equivalent to the userfaultfd if all Linux
VM features are intended to be retained in the destination node.

> Because *every* time we add a new clever interface, we end up with
> approximately zero users and just pain down the line. Examples:
> splice, mremap, yadda yadda.

Aside from mremap which I think is widely used, I totally agree in
principle.

For now I can quite comfortably guarantee the above real-life user for
userfaultfd (qemu), but there are potentially 5 of them. None of them needs
UFFDIO_REMAP, which is again why I totally agree with not submitting it
for merging; it was intended only for the initial RFC, to share
the idea of "removing" memory with a larger audience before I
shrink the Cc/To list for further updates.

Thanks,
Andrea

Re: [Qemu-devel] [PATCH v4 09/10] cpu: add device_add foo-x86_64-cpu support

2015-03-05 Thread Eduardo Habkost

On Fri, Feb 13, 2015 at 06:25:32PM +0800, Zhu Guihua wrote:
> From: Chen Fan 
> 
> Add support to device_add foo-x86_64-cpu, and additional checks of
> apic id are added into x86_cpuid_set_apic_id() to avoid duplicate.
> Besides, in order to support "device/device_add foo-x86_64-cpu"
> which without specified apic id, we assign cpuid_apic_id with a
> default broadcast value (0xFFFFFFFF) in initfn, and a new function
> get_free_apic_id() to provide a free apic id to cpuid_apic_id if
> it still has the default at realize time (e.g. hot add foo-cpu without
> a specified apic id) to avoid apic id duplicates.
> 
> Thanks very much for Igor's suggestion.
> 
> Signed-off-by: Chen Fan 
> Signed-off-by: Gu Zheng 
> Signed-off-by: Zhu Guihua 
> ---
>  hw/acpi/cpu_hotplug.c |  6 --
>  hw/i386/pc.c  |  6 --
>  target-i386/cpu.c | 48 +---
>  3 files changed, 49 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/acpi/cpu_hotplug.c b/hw/acpi/cpu_hotplug.c
> index b8ebfad..8e4ed6e 100644
> --- a/hw/acpi/cpu_hotplug.c
> +++ b/hw/acpi/cpu_hotplug.c
> @@ -59,8 +59,10 @@ void acpi_cpu_plug_cb(ACPIREGS *ar, qemu_irq irq,
>  return;
>  }
>  
> -ar->gpe.sts[0] |= ACPI_CPU_HOTPLUG_STATUS;
> -acpi_update_sci(ar, irq);
> +/* Only trigger sci if cpu is hotplugged */
> +if (dev->hotplugged) {
> +acpi_send_gpe_event(ar, irq, ACPI_CPU_HOTPLUG_STATUS);
> +}
>  }
>  
>  void acpi_cpu_hotplug_init(MemoryRegion *parent, Object *owner,
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 500d369..1187e12 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1637,13 +1637,7 @@ static void pc_cpu_plug(HotplugHandler *hotplug_dev,
>  Error *local_err = NULL;
>  PCMachineState *pcms = PC_MACHINE(hotplug_dev);
>  
> -if (!dev->hotplugged) {
> -goto out;
> -}
> -
>  if (!pcms->acpi_dev) {
> -error_setg(&local_err,
> -   "cpu hotplug is not enabled: missing acpi device");
>  goto out;
>  }
>  
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 028063c..68a6aa4 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -1703,6 +1703,7 @@ static void x86_cpuid_set_apic_id(Object *obj, Visitor 
> *v, void *opaque,
>  const int64_t max = UINT32_MAX;
>  Error *error = NULL;
>  int64_t value;
> +X86CPUTopoInfo topo;
>  
>  if (dev->realized) {
>  error_setg(errp, "Attempt to set property '%s' on '%s' after "
> @@ -1722,6 +1723,19 @@ static void x86_cpuid_set_apic_id(Object *obj, Visitor 
> *v, void *opaque,
>  return;
>  }
>  
> +if (value > x86_cpu_apic_id_from_index(max_cpus - 1)) {
> +error_setg(errp, "CPU with APIC ID %" PRIi64
> +   " is more than MAX APIC ID limits", value);
> +return;
> +}
> +
> +x86_topo_ids_from_apic_id(smp_cores, smp_threads, value, &topo);
> +if (topo.smt_id >= smp_threads || topo.core_id >= smp_cores) {
> +error_setg(errp, "CPU with APIC ID %" PRIi64 " does not match "
> +   "topology configuration.", value);
> +return;
> +}
> +
>  if ((value != cpu->env.cpuid_apic_id) && cpu_exists(value)) {
>  error_setg(errp, "CPU with APIC ID %" PRIi64 " exists", value);
>  return;
> @@ -2166,8 +2180,10 @@ static void x86_cpu_cpudef_class_init(ObjectClass *oc, 
> void *data)
>  {
>  X86CPUDefinition *cpudef = data;
>  X86CPUClass *xcc = X86_CPU_CLASS(oc);
> +DeviceClass *dc = DEVICE_CLASS(oc);
>  
>  xcc->cpu_def = cpudef;
> +dc->cannot_instantiate_with_device_add_yet = false;
>  }
>  
>  static void x86_register_cpudef_type(X86CPUDefinition *def)
> @@ -2176,6 +2192,7 @@ static void x86_register_cpudef_type(X86CPUDefinition 
> *def)
>  TypeInfo ti = {
>  .name = typename,
>  .parent = TYPE_X86_CPU,
> +.instance_size = sizeof(X86CPU),
>  .class_init = x86_cpu_cpudef_class_init,
>  .class_data = def,
>  };
> @@ -2709,11 +2726,28 @@ static void mce_init(X86CPU *cpu)
>  }
>  
>  #ifndef CONFIG_USER_ONLY
> +static uint32_t get_free_apic_id(void)
> +{
> +int i;
> +
> +for (i = 0; i < max_cpus; i++) {
> +uint32_t id = x86_cpu_apic_id_from_index(i);
> +
> +if (!cpu_exists(id)) {
> +return id;
> +}
> +}
> +
> +return x86_cpu_apic_id_from_index(max_cpus);
> +}
> +
> +#define APIC_ID_NOT_SET (~0U)

This is inside #ifndef CONFIG_USER_ONLY...

> +
[...]
> @@ -2920,7 +2962,7 @@ static void x86_cpu_initfn(Object *obj)
>  NULL, NULL, (void *)cpu->filtered_features, NULL);
>  
>  cpu->hyperv_spinlock_attempts = HYPERV_SPINLOCK_NEVER_RETRY;
> -env->cpuid_apic_id = x86_cpu_apic_id_from_index(cs->cpu_index);
> +env->cpuid_apic_id = APIC_ID_NOT_SET;


...but this is not.


  CC    x86_64-linux-user/target-i386/cpu.o
  /home/ehabkost/rh/proj/virt/qemu/target-i386/cpu.c: In function ‘x86_cpu
Re: [Qemu-devel] [PATCH v4 2/5] target-i386: Remove unused APIC ID default code

2015-03-05 Thread Andreas Färber

Am 05.03.2015 um 14:43 schrieb Eduardo Habkost:
> On Tue, Mar 03, 2015 at 11:13:41PM -0300, Eduardo Habkost wrote:
>> The existing apic_id = cpu_index code has no visible effect: the PC code
>> already initializes the APIC ID according to the topology on
>> pc_new_cpu(), and linux-user memcpy()s the CPU state (including
>> cpuid_apic_id) on cpu_copy().
>>
>> Remove the dead code and simply let APIC ID be 0 by default. This
>> doesn't change behavior of PC because apic-id is already explicitly set,
>> and doesn't affect linux-user because APIC ID was already always 0.
>>
>> Signed-off-by: Eduardo Habkost 
> 
> This patch is holding the rest of the series, so a Reviewed-by or
> Acked-by would be welcome.
> 
> This change removes the 254-CPU limit from {i386,x86_64}-linux-user that
> Peter and I discussed previously.

Reviewed-by: Andreas Färber 

Are you going to send a new pull for the 2 plus these 5 now?

Thanks,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip Upmanyu,
Graham Norton; HRB 21284 (AG Nürnberg)

Re: [Qemu-devel] [PATCH v4 05/10] qom/cpu: move register_vmstate to common CPUClass.realizefn

2015-03-05 Thread Eduardo Habkost

On Fri, Feb 13, 2015 at 06:25:28PM +0800, Zhu Guihua wrote:
> From: Gu Zheng 
> 
> Move cpu vmstate register from cpu_exec_init into cpu_common_realizefn,
> and use cc->get_arch_id as the instance id that suggested by Igor to
> fix the migration issue.

If you are implementing something new, please do that in a separate patch,
either before or after moving the code. Makes it easier to review and easier to
revert in case something goes wrong.

See two additional issues below:

> 
> Signed-off-by: Gu Zheng 
> Signed-off-by: Zhu Guihua 
> ---
>  exec.c| 25 ++---
>  include/qom/cpu.h |  2 ++
>  qom/cpu.c |  4 
>  3 files changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 6dff7bc..8361591 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -513,10 +513,26 @@ void tcg_cpu_address_space_init(CPUState *cpu, 
> AddressSpace *as)
>  }
>  #endif
>  
> +void cpu_vmstate_register(CPUState *cpu)
> +{
> +CPUClass *cc = CPU_GET_CLASS(cpu);
> +int cpu_index = cc->get_arch_id(cpu) + max_cpus;


Breaks linux-user build:

  LINK  x86_64-linux-user/qemu-x86_64
exec.o: In function `cpu_vmstate_register':
/home/ehabkost/rh/proj/virt/qemu/exec.c:533: undefined reference to `max_cpus'
collect2: error: ld returned 1 exit status
Makefile:182: recipe for target 'qemu-x86_64' failed
make[1]: *** [qemu-x86_64] Error 1
Makefile:169: recipe for target 'subdir-x86_64-linux-user' failed
make: *** [subdir-x86_64-linux-user] Error 2


> +int compat_index = cc->get_compat_arch_id(cpu);
> +
> +if (qdev_get_vmsd(DEVICE(cpu)) == NULL) {
> +vmstate_register_with_alias_id(NULL, cpu_index, &vmstate_cpu_common,
> +   cpu, compat_index, 3);
> +}
> +
> +if (cc->vmsd != NULL) {
> +vmstate_register_with_alias_id(NULL, cpu_index, cc->vmsd,
> +   cpu, compat_index, 3);
> +}
> +}
> +
>  void cpu_exec_init(CPUArchState *env)
>  {
>  CPUState *cpu = ENV_GET_CPU(env);
> -CPUClass *cc = CPU_GET_CLASS(cpu);
>  CPUState *some_cpu;
>  int cpu_index;
>  
> @@ -539,18 +555,13 @@ void cpu_exec_init(CPUArchState *env)
>  #if defined(CONFIG_USER_ONLY)
>  cpu_list_unlock();
>  #endif
> -if (qdev_get_vmsd(DEVICE(cpu)) == NULL) {
> -vmstate_register(NULL, cpu_index, &vmstate_cpu_common, cpu);
> -}
>  #if defined(CPU_SAVE_VERSION) && !defined(CONFIG_USER_ONLY)
> +CPUClass *cc = CPU_GET_CLASS(cpu);
>  register_savevm(NULL, "cpu", cpu_index, CPU_SAVE_VERSION,
>  cpu_save, cpu_load, env);
>  assert(cc->vmsd == NULL);
>  assert(qdev_get_vmsd(DEVICE(cpu)) == NULL);
>  #endif
> -if (cc->vmsd != NULL) {
> -vmstate_register(NULL, cpu_index, cc->vmsd, cpu);
> -}
>  }
>  
>  #if defined(CONFIG_USER_ONLY)
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 2e68dd2..d0a50e2 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -565,6 +565,8 @@ void cpu_interrupt(CPUState *cpu, int mask);
>  
>  #endif /* USER_ONLY */
>  
> +void cpu_vmstate_register(CPUState *cpu);
> +
>  #ifdef CONFIG_SOFTMMU
>  static inline void cpu_unassigned_access(CPUState *cpu, hwaddr addr,
>   bool is_write, bool is_exec,
> diff --git a/qom/cpu.c b/qom/cpu.c
> index 83d7766..8e37045 100644
> --- a/qom/cpu.c
> +++ b/qom/cpu.c
> @@ -302,6 +302,10 @@ static void cpu_common_realizefn(DeviceState *dev, Error 
> **errp)
>  {
>  CPUState *cpu = CPU(dev);
>  
> +#if !defined(CONFIG_USER_ONLY)
> +cpu_vmstate_register(cpu);
> +#endif

CONFIG_USER_ONLY is never set on qom/cpu.c because it is target-independent
code.

Good news is that we already have vmstate stubs, so you shouldn't need any
CONFIG_USER_ONLY ifdefs around the vmstate code (but it looks like we will need
a max_cpus stub).

> +
>  if (dev->hotplugged) {
>  cpu_synchronize_post_init(cpu);
>  cpu_resume(cpu);
> -- 
> 1.9.3
> 
> 

-- 
Eduardo

Re: [Qemu-devel] [PATCH v3 3/4] migration: Convert 'status' of MigrationInfo to use an enum type

2015-03-05 Thread Markus Armbruster

zhanghailiang  writes:

> The original 'status' is an open-coded 'str' type, convert it to use an
> enum type.
> This conversion is backwards compatible, better documented and
> more convenient for future extensibility.
>
> We also rename 'MIGRATION_STATUS_ERROR' to 'MIGRATION_STATUS_FAILED'.
> In addition, Fix a typo for qapi-schema.json: comppleted -> completed
>
> Signed-off-by: zhanghailiang 
[...]
> diff --git a/qapi-schema.json b/qapi-schema.json
> index e16f8eb..3b5904b 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
>  ##
>  # @MigrationInfo
>  #
>  # Information about current migration process.
>  #
> -# @status: #optional string describing the current migration status.
> -#  As of 0.14.0 this can be 'setup', 'active', 'completed', 'failed' 
> or
> -#  'cancelled'. If this field is not returned, no migration process
> +# @status: #optional @MigState describing the current migration status.
> +#  If this field is not returned, no migration process
>  #  has been initiated
>  #
>  # @ram: #optional @MigrationStats containing detailed migration
>  #   status, only returned if status is 'active' or
> -#   'completed'. 'comppleted' (since 1.2)
> +#   'completed'. 'completed' (since 1.2)

Shouldn't this just be

+#   'completed' (since 1.2)

?

>  #
>  # @disk: #optional @MigrationStats containing detailed disk migration
>  #status, only returned if status is 'active' and it is a block
> @@ -453,7 +477,7 @@
>  # Since: 0.14.0
>  ##
>  { 'type': 'MigrationInfo',
> -  'data': {'*status': 'str', '*ram': 'MigrationStats',
> +  'data': {'*status': 'MigrationStatus', '*ram': 'MigrationStats',
> '*disk': 'MigrationStats',
> '*xbzrle-cache': 'XBZRLECacheStats',
> '*total-time': 'int',

Re: [Qemu-devel] [PATCH v4 01/10] cpu/apic: drop icc bus/bridge/

2015-03-05 Thread Eduardo Habkost

On Fri, Feb 13, 2015 at 06:25:24PM +0800, Zhu Guihua wrote:
> From: Chen Fan 
> 
> ICC bus was invented only to provide hotplug capability to
> CPU and APIC because at the time being hotplug was available only for
> BUS attached devices.
> 
> Now this patch is to drop ICC bus impl, and switch to bus-less
> CPU+APIC hotplug, handling them in the same manner as pc-dimm.
> 
> Signed-off-by: Chen Fan 
> Signed-off-by: Zhu Guihua 
> ---
>  hw/i386/kvm/apic.c  | 10 --
>  hw/i386/pc.c| 21 +
>  hw/i386/pc_piix.c   |  9 +
>  hw/i386/pc_q35.c|  9 +
>  hw/intc/apic.c  | 16 +++-
>  hw/intc/apic_common.c   | 14 +-
>  include/hw/i386/apic_internal.h |  6 ++
>  include/hw/i386/pc.h|  3 ++-
>  target-i386/cpu.c   | 19 +++
>  target-i386/cpu.h   |  3 +--
>  10 files changed, 43 insertions(+), 67 deletions(-)

What about hw/i386/xen/xen_apic.c:xen_apic_realize()?

  $ make
    CC    x86_64-softmmu/hw/i386/xen/xen_apic.o
  /home/ehabkost/rh/proj/virt/qemu/hw/i386/xen/xen_apic.c: In function ‘xen_apic_realize’:
  /home/ehabkost/rh/proj/virt/qemu/hw/i386/xen/xen_apic.c:44:29: error: ‘APICCommonState’ has no member named ‘io_memory’
   memory_region_init_io(&s->io_memory, OBJECT(s), &xen_apic_io_ops, s,
   ^
  /home/ehabkost/rh/proj/virt/qemu/rules.mak:57: recipe for target 'hw/i386/xen/xen_apic.o' failed
  make[1]: *** [hw/i386/xen/xen_apic.o] Error 1
  Makefile:169: recipe for target 'subdir-x86_64-softmmu' failed
  make: *** [subdir-x86_64-softmmu] Error 2

-- 
Eduardo

Re: [Qemu-devel] 9pfs-local: open2() deletes existing data?

2015-03-05 Thread Aneesh Kumar K.V

Michael Tokarev  writes:

> I was looking at various interesting functions in hw/9pfs/virtio-9p-local.c
> and noticed local_open2() which basically tries to open a file in a
> filesystem, and if that is successful, it tries to set file credentials
> using a configured mechanism, and if that fails, it deletes the file.
>
> Now I wonder what happens if we tried to open an existing file but was
> not able to set credentials for whatever reason -- eg, because the
> underlying filesystem does not support xattrs, or whatever.  It looks
> to me that we will remove the user file!
>
> If that's the case, it looks like it is a very serious bug...

That callback is used for create. Opening an existing file goes through local_open().

-aneesh

Re: [Qemu-devel] [PATCH] block/raw-posix: fix launching with failed disks

2015-03-05 Thread Stefan Hajnoczi

On Thu, Mar 05, 2015 at 01:53:57PM +0100, Kevin Wolf wrote:
> Am 04.03.2015 um 23:48 hat Stefan Hajnoczi geschrieben:
> > Since commit c25f53b06eba1575d5d0e92a0132455c97825b83 ("raw: Probe
> > required direct I/O alignment") QEMU has failed to launch if image files
> > produce I/O errors.
> > 
> > Previously, QEMU would launch successfully and the guest would see the
> > errors when attempting I/O.
> > 
> > This is a regression and may prevent multipath I/O inside the guest,
> > where QEMU must launch and let the guest figure out by itself which
> > disks are online.
> > 
> > Tweak the alignment probing code in raw-posix.c to explicitly look for
> > EINVAL on Linux instead of bailing.  The kernel refuses misaligned
> > requests with this error code and other error codes can be ignored.
> > 
> > Signed-off-by: Stefan Hajnoczi 
> 
> This seems to conflict with the geometry series. Please rebase on the
> current block branch.
> 
> Also, I would be surprised if this had been working by design. It's
> probably more by chance. If we want to make this a supported case, we
> need to add a qemu-iotests case, as this seems to be easy to break
> accidentally.

Will send v2.

Stefan
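
The probing rule described in the quoted patch can be sketched as follows. This is a minimal illustration, not the raw-posix.c code: the callback type, the function names and the two fake devices are hypothetical, and the 512-byte starting point is an assumption. The only behaviour taken from the thread is that EINVAL means "misaligned" and every other error is ignored during probing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical read callback: returns 0 on success or a negative errno. */
typedef int (*probe_read_fn)(size_t align, void *opaque);

/*
 * Return the smallest power-of-two alignment (512..max_align) that the
 * callback does not reject with -EINVAL. Any other error, for example
 * -EIO from a failed disk, is not treated as an alignment problem, so
 * probing stops and the caller can still start up.
 */
static size_t probe_alignment(probe_read_fn read_at, void *opaque,
                              size_t max_align)
{
    size_t align;

    assert(max_align >= 512);
    for (align = 512; align <= max_align; align <<= 1) {
        int ret = read_at(align, opaque);
        if (ret != -EINVAL) {
            return align;   /* success, or an unrelated I/O error */
        }
    }
    return max_align;       /* nothing accepted: fall back to the maximum */
}

/* Fake device that only accepts 4 KiB aligned requests (illustration). */
static int needs_4k(size_t align, void *opaque)
{
    (void)opaque;
    return align >= 4096 ? 0 : -EINVAL;
}

/* Fake device on which every request fails with an I/O error. */
static int failed_disk(size_t align, void *opaque)
{
    (void)align;
    (void)opaque;
    return -EIO;
}
```

With these fakes, probe_alignment(needs_4k, NULL, 4096) settles on 4096, while probe_alignment(failed_disk, NULL, 4096) returns the first candidate instead of failing, which is the "let the guest see the errors" behaviour the patch description asks for.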



Re: [Qemu-devel] [PATCH 7/9] throttle: Add throttle group support

2015-03-05 Thread Stefan Hajnoczi

On Wed, Mar 04, 2015 at 05:16:51PM +0100, Alberto Garcia wrote:
> On Wed, Mar 04, 2015 at 10:04:27AM -0600, Stefan Hajnoczi wrote:
> 
> > > > This pattern suggests throttle_timer_fired() should acquire the
> > > > lock internally instead.
> > > 
> > > The idea is that the ThrottleState code itself doesn't know
> > > anything about locks or groups. As I understood it Benoît
> > > designed the ThrottleState code to be independent from the block
> > > layer and reusable for other things (that's why it's in util/).
> > 
> > Then ThrottleGroup could offer an API throttle_group_timer_fired()
> > that does the locking.
> > 
> > The advantage of encapsulating locking in ThrottleGroup is that
> > > callers don't have to remember to take the lock.  (But they must
> > still be careful about sequences of calls which will not be atomic.)
> 
> No other code in ThrottleGroup takes the lock directly, so making this
> an exception could be confusing.

It shouldn't be an exception.  I'm proposing that the ThrottleGroup API
should handle locking internally instead of requiring callers to do it.

Stefan
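
A minimal sketch of the layering proposed here, assuming a pthread mutex and made-up type names: the Demo* structures and demo_* functions are hypothetical stand-ins for ThrottleState and ThrottleGroup, not the QEMU API. The core helper stays lock-agnostic, and the group-level wrapper takes the lock on behalf of the caller.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Lock-agnostic state, standing in for ThrottleState (hypothetical). */
typedef struct {
    int timers_fired;
} DemoThrottleState;

/* Group that owns the lock, standing in for ThrottleGroup (hypothetical). */
typedef struct {
    pthread_mutex_t lock;
    DemoThrottleState ts;
} DemoThrottleGroup;

/* Core helper: knows nothing about locks or groups. */
static bool demo_throttle_timer_fired(DemoThrottleState *ts)
{
    ts->timers_fired++;
    return true;
}

static void demo_throttle_group_init(DemoThrottleGroup *tg)
{
    pthread_mutex_init(&tg->lock, NULL);
    tg->ts.timers_fired = 0;
}

/* Group-level wrapper: takes the lock so that callers do not have to. */
static bool demo_throttle_group_timer_fired(DemoThrottleGroup *tg)
{
    bool ret;

    pthread_mutex_lock(&tg->lock);
    ret = demo_throttle_timer_fired(&tg->ts);
    pthread_mutex_unlock(&tg->lock);
    return ret;
}
```

As noted in the quoted text, this only makes each call atomic; a sequence of several wrapper calls is still not atomic as a whole.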



Re: [Qemu-devel] [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Linus Torvalds

On Thu, Mar 5, 2015 at 9:18 AM, Andrea Arcangeli  wrote:
> remap_pages is the lowlevel mm helper needed to implement
> UFFDIO_REMAP.

This function is nasty nasty nasty.

Is this really worth it? On real loads? That people are expected to use?

Considering how we just got rid of one special magic VM remapping
thing that nobody actually used, I'd really hate to add a new one.

The fact is, almost nobody ever uses anything that isn't standard
POSIX. There are no apps, and even for specialized things like
virtualization hypervisors this kind of thing is often simply not
worth it.

Quite frankly, *if* we ever merge userfaultfd, I would *strongly*
argue for not merging the remap parts. I just don't see the point. It
doesn't seem to add anything that is semantically very important -
it's *potentially* a faster copy, but even that is

  (a) questionable in the first place

and

 (b) unclear why anybody would ever care about performance of
infrastructure that nobody actually uses today, and future use isn't
even clear or shown to be particularly performance-sensitive.

So basically I'd like to see better documentation, a few real use
cases (and by real I very much do *not* mean "you can use it for
this", but actual patches to actual projects that matter and that are
expected to care and merge them), and a simplified series that doesn't
do the remap thing.

Because *every* time we add a new clever interface, we end up with
approximately zero users and just pain down the line. Examples:
splice, mremap, yadda yadda.

Linus

Re: [Qemu-devel] [PATCH] savevm: create snapshot failed when id_str already exits

2015-03-05 Thread Stefan Hajnoczi

On Thu, Mar 05, 2015 at 09:05:52PM +0800, Yi Wang wrote:
> Thanks for your reply and Happy Lantern Festival!
> I am afraid you misunderstood what I meant; maybe I didn't express it
> clearly :-) My patch handles the following case:
> First, the VM has two disks:
> [root@fox-host vmimg]# virsh domblklist win7
> Target Source
> 
> hdb /home/virtio_test.iso
> vda /home/vmimg/win7.img.1
> vdb /home/vmimg/win7.append
> 
> Secondly first disk has one snapshot with id_str "1", and another disk
> has three snapshots with id_str "1", "2", "3".
> [root@fox-host vmimg]# qemu-img snapshot -l win7.img.1
> Snapshot list:
> ID TAG VM SIZE DATE VM CLOCK
> 1 s1 0 2015-03-05 10:26:16 00:00:00.000
> 
> [root@fox-host vmimg]# qemu-img snapshot -l win7.append
> Snapshot list:
> ID TAG VM SIZE DATE VM CLOCK
> 1 s3 0 2015-03-05 10:29:21 00:00:00.000
> 2 s1 0 2015-03-05 10:29:38 00:00:00.000
> 3 s2 0 2015-03-05 10:30:49 00:00:00.000
> 
> In this case, creating a snapshot with a specified name fails,
> because id_str "2" already exists in disk vdb.
> [root@fox-host8 vmimg]# virsh snapshot-create-as win7-fox s4
> error: operation failed: Failed to take snapshot: Error while creating
> snapshot on 'drive-virtio-disk1'

This means that name != NULL but it is still unnecessary to duplicate ID
generation.

Does this work?

diff --git a/savevm.c b/savevm.c
index 08ec678..e81e4aa 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1047,6 +1047,7 @@ void do_savevm(Monitor *mon, const QDict *qdict)
 QEMUFile *f;
 int saved_vm_running;
 uint64_t vm_state_size;
+bool generate_ids = true;
 qemu_timeval tv;
 struct tm tm;
 const char *name = qdict_get_try_str(qdict, "name");
@@ -1088,6 +1089,7 @@ void do_savevm(Monitor *mon, const QDict *qdict)
 if (ret >= 0) {
 pstrcpy(sn->name, sizeof(sn->name), old_sn->name);
 pstrcpy(sn->id_str, sizeof(sn->id_str), old_sn->id_str);
+generate_ids = false;
 } else {
 pstrcpy(sn->name, sizeof(sn->name), name);
 }
@@ -1123,6 +1125,14 @@ void do_savevm(Monitor *mon, const QDict *qdict)
 if (bdrv_can_snapshot(bs1)) {
 /* Write VM state size only to the image that contains the state */
 sn->vm_state_size = (bs == bs1 ? vm_state_size : 0);
+
+/* Images may have existing IDs so let the ID be autogenerated if the
+ * user did not specify a name.
+ */
+if (generate_ids) {
+sn->id_str[0] = '\0';
+}
+
 ret = bdrv_snapshot_create(bs1, sn);
 if (ret < 0) {
 monitor_printf(mon, "Error while creating snapshot on '%s'\n",
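
Why clearing id_str helps can be shown with a small sketch. The function below is hypothetical and assumes that an image picks "largest existing numeric ID plus one" when it is asked to generate an ID; whether a given block driver uses exactly that rule is an assumption here. With the two disks from the report, a single shared ID of "2" collides on the second disk, while per-image generation yields 2 and 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Pick an ID that is unused within one image: one more than the largest
 * numeric ID already present. ids[] holds that image's existing id_str
 * values as decimal strings.
 */
static unsigned long demo_next_snapshot_id(const char *const *ids, size_t n)
{
    unsigned long max_id = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long id = strtoul(ids[i], NULL, 10);
        if (id > max_id) {
            max_id = id;
        }
    }
    return max_id + 1;
}
```

For the report above, an image holding only "1" would get 2, and an image holding "1", "2", "3" would get 4, so neither creation fails.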



Re: [Qemu-devel] [Xen-devel] [v2][PATCH] libxl: add one machine property to support IGD GFX passthrough

2015-03-05 Thread Ian Campbell

On Mon, 2015-03-02 at 09:20 +0800, Chen, Tiejun wrote:
> Is this expected?

Yes. Can you post it as a proper patch please.

I suggest you split the basic stuff and the kind override discussed
below into two patches.

> >> +(b_info->u.hvm.gfx_passthru &&
> >> + strncmp(b_info->u.hvm.gfx_passthru, "igd", 3) == 0) ) {
> 
> But as you mentioned previously,
> 
> "
> You might like to optionally consider add a forcing option somehow so
> that people with new devices not in the list can control things without
> the need to recompile (e.g. gfx_passthru_kind_override?).
> "

> 
> Here I was trying to convert "gfx_passthru" to address this thing. 
> According to your comment right now, you prefer we should introduce a 
> new field instead of the original 'gfx_passthru' to enumerate such a 
> type. So what about this?
> 
> libxl_gfx_passthru_kind_type = Enumeration("gfx_passthru_kind_type", [

"kind_type" is redundant. I think just "kind" will do.

>  (0, "unknown"),

"default" I think is better, i.e. if gfx_passthru is enabled then do the
probing from the PCI ID list thing.

>  (1, "igd"),
>  ])

You would then need to add a field of this type next to the gfx_passthru
one in b_config, let's say it's called gfx_passthru_kind.

> Then if we want to override this, just submit the following line into .cfg:
> 
> gfx_passthru_kind_override = "igd"

So, while we cannot change the defbool in the libxl interface we do
actually have a little more freedom in the xl cfg parsing because we can
detect bool/integer vs string.

So I think it should be possible to arrange to support any of:
gfx_passthru = 0  => sets build_info.u.gfx_passthru to false
gfx_passthru = 1  => sets build_info.u.gfx_passthru to true and
 build_info.u.gfx_passthru_kind to DEFAULT
gfx_passthru = "igd"  => sets build_info.u.gfx_passthru to true and
 build_info.u.gfx_passthru_kind to IGD

Take a look at how the "timer_mode" option is handled in xl_cmdimpl.c
(except none of these variants are deprecated). You should be able to use
the libxl_..._from_string enum helper to do the parsing in the latter
case.

Ian.
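
The bool/integer versus string handling suggested above might look roughly like this. It is a self-contained sketch, not xl or libxl code: every demo_* name is hypothetical, the value is taken as a plain string for simplicity, and it assumes that both "1" and a kind name enable passthrough.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirror of the proposed gfx_passthru_kind enumeration. */
typedef enum {
    DEMO_GFX_PASSTHRU_KIND_DEFAULT = 0,
    DEMO_GFX_PASSTHRU_KIND_IGD = 1,
} demo_gfx_passthru_kind;

typedef struct {
    bool gfx_passthru;
    demo_gfx_passthru_kind kind;
} demo_gfx_config;

/*
 * Accept "0", "1" or a kind name such as "igd".
 * Returns 0 on success, -1 if the value is not recognised.
 */
static int demo_parse_gfx_passthru(const char *value, demo_gfx_config *out)
{
    out->kind = DEMO_GFX_PASSTHRU_KIND_DEFAULT;
    if (strcmp(value, "0") == 0) {
        out->gfx_passthru = false;
    } else if (strcmp(value, "1") == 0) {
        out->gfx_passthru = true;   /* kind left to PCI ID probing */
    } else if (strcmp(value, "igd") == 0) {
        out->gfx_passthru = true;
        out->kind = DEMO_GFX_PASSTHRU_KIND_IGD;
    } else {
        out->gfx_passthru = false;
        return -1;
    }
    return 0;
}
```

In real xl code the string branch would presumably go through the generated from_string enum helper mentioned above rather than a hand-written strcmp chain.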

[Qemu-devel] [PATCH 05/21] userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

2015-03-05 Thread Andrea Arcangeli

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm_types.h | 11 +++
 kernel/fork.c|  1 +
 2 files changed, 12 insertions(+)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03a..fbf21f5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -247,6 +247,16 @@ struct vm_region {
* this region */
 };
 
+#ifdef CONFIG_USERFAULTFD
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) { NULL, })
+struct vm_userfaultfd_ctx {
+   struct userfaultfd_ctx *ctx;
+};
+#else /* CONFIG_USERFAULTFD */
+#define NULL_VM_UFFD_CTX ((struct vm_userfaultfd_ctx) {})
+struct vm_userfaultfd_ctx {};
+#endif /* CONFIG_USERFAULTFD */
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -313,6 +323,7 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
 #endif
+   struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
 };
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..cb215c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -425,6 +425,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
goto fail_nomem_anon_vma_fork;
tmp->vm_flags &= ~VM_LOCKED;
tmp->vm_next = tmp->vm_prev = NULL;
+   tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
file = tmp->vm_file;
if (file) {
struct inode *inode = file_inode(file);

[Qemu-devel] [PATCH 03/21] userfaultfd: uAPI

2015-03-05 Thread Andrea Arcangeli

Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli 
---
 Documentation/ioctl/ioctl-number.txt |  1 +
 include/uapi/linux/userfaultfd.h | 81 
 2 files changed, 82 insertions(+)
 create mode 100644 include/uapi/linux/userfaultfd.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 8136e1f..be2d4a2 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -301,6 +301,7 @@ Code  Seq#(hex) Include FileComments
 0xA3   80-8F   Port ACLin development:

 0xA3   90-9F   linux/dtlk.h
+0xAA   00-3F   linux/uapi/linux/userfaultfd.h
 0xAB   00-1F   linux/nbd.h
 0xAC   00-1F   linux/raw.h
 0xAD   00  Netfilter devicein development:
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
new file mode 100644
index 000..9a8cd56
--- /dev/null
+++ b/include/uapi/linux/userfaultfd.h
@@ -0,0 +1,81 @@
+/*
+ *  include/linux/userfaultfd.h
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_H
+#define _LINUX_USERFAULTFD_H
+
+#define UFFD_API ((__u64)0xAA)
+/* FIXME: add "|UFFD_BIT_WP" to UFFD_API_BITS after implementing it */
+#define UFFD_API_BITS (UFFD_BIT_WRITE)
+#define UFFD_API_IOCTLS\
+   ((__u64)1 << _UFFDIO_REGISTER | \
+(__u64)1 << _UFFDIO_UNREGISTER |   \
+(__u64)1 << _UFFDIO_API)
+#define UFFD_API_RANGE_IOCTLS  \
+   ((__u64)1 << _UFFDIO_WAKE)
+
+/*
+ * Valid ioctl command number range with this API is from 0x00 to
+ * 0x3F.  UFFDIO_API is the fixed number, everything else can be
+ * changed by implementing a different UFFD_API. If sticking to the
+ * same UFFD_API more ioctl can be added and userland will be aware of
+ * which ioctl the running kernel implements through the ioctl command
+ * bitmask written by the UFFDIO_API.
+ */
+#define _UFFDIO_REGISTER   (0x00)
+#define _UFFDIO_UNREGISTER (0x01)
+#define _UFFDIO_WAKE   (0x02)
+#define _UFFDIO_API    (0x3F)
+
+/* userfaultfd ioctl ids */
+#define UFFDIO 0xAA
+#define UFFDIO_API _IOWR(UFFDIO, _UFFDIO_API,  \
+ struct uffdio_api)
+#define UFFDIO_REGISTER _IOWR(UFFDIO, _UFFDIO_REGISTER, \
+ struct uffdio_register)
+#define UFFDIO_UNREGISTER  _IOR(UFFDIO, _UFFDIO_UNREGISTER,\
+struct uffdio_range)
+#define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE,  \
+struct uffdio_range)
+
+/*
+ * Valid bits below PAGE_SHIFT in the userfault address read through
+ * the read() syscall.
+ */
+#define UFFD_BIT_WRITE (1<<0)  /* this was a write fault, MISSING or WP */
+#define UFFD_BIT_WP    (1<<1)  /* handle_userfault() reason VM_UFFD_WP */
+#define UFFD_BITS  2   /* two above bits used for UFFD_BIT_* mask */
+
+struct uffdio_api {
+   /* userland asks for an API number */
+   __u64 api;
+
+   /* kernel answers below with the available features for the API */
+   __u64 bits;
+   __u64 ioctls;
+};
+
+struct uffdio_range {
+   __u64 start;
+   __u64 len;
+};
+
+struct uffdio_register {
+   struct uffdio_range range;
+#define UFFDIO_REGISTER_MODE_MISSING   ((__u64)1<<0)
+#define UFFDIO_REGISTER_MODE_WP        ((__u64)1<<1)
+   __u64 mode;
+
+   /*
+* kernel answers which ioctl commands are available for the
+* range, keep at the end as the last 8 bytes aren't read.
+*/
+   __u64 ioctls;
+};
+
+#endif /* _LINUX_USERFAULTFD_H */
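
The header comment says userland learns which ioctls the running kernel implements from the bitmask written back by UFFDIO_API. A sketch of how that mask could be tested follows; the command numbers are copied from this version of the patch, while the DEMO_ prefixes and helper names are hypothetical and no real ioctl is issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Command numbers copied from the patch above. */
#define DEMO_UFFDIO_REGISTER   0x00
#define DEMO_UFFDIO_UNREGISTER 0x01
#define DEMO_UFFDIO_WAKE       0x02
#define DEMO_UFFDIO_API        0x3F

/* True if the kernel advertised ioctl command number nr in the mask. */
static bool demo_uffd_has_ioctl(uint64_t ioctls, unsigned int nr)
{
    assert(nr < 64);
    return (ioctls & ((uint64_t)1 << nr)) != 0;
}

/* The mask this version of the patch defines as UFFD_API_IOCTLS. */
static uint64_t demo_uffd_api_ioctls(void)
{
    return (uint64_t)1 << DEMO_UFFDIO_REGISTER |
           (uint64_t)1 << DEMO_UFFDIO_UNREGISTER |
           (uint64_t)1 << DEMO_UFFDIO_API;
}
```

Under this layout the wake command is absent from the API-level mask because the patch reports it separately, per registered range, through UFFD_API_RANGE_IOCTLS.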

[Qemu-devel] [PATCH 0/1] target-i386: Move icc_bridge code to PC

2015-03-05 Thread Eduardo Habkost

This removes yet another chunk of PC-specific code from target-i386/cpu.c and
moves it to PC code.

With this we get closer to being able to change target-i386 to use
cpu_generic_init().

This series is based on my x86 tree, located at:
  https://github.com/ehabkost/qemu.git x86

Eduardo Habkost (1):
  target-i386: Remove icc_bridge parameter from cpu_x86_create()

 hw/i386/pc.c  |  6 +-
 target-i386/cpu.c | 14 ++
 target-i386/cpu.h |  3 +--
 3 files changed, 8 insertions(+), 15 deletions(-)

-- 
2.1.0

[Qemu-devel] [PATCH 1/1] target-i386: Remove icc_bridge parameter from cpu_x86_create()

2015-03-05 Thread Eduardo Habkost

Instead of passing icc_bridge from the PC initialization code to
cpu_x86_create(), make the PC initialization code attach the CPU to
icc_bridge.

The only difference here is that icc_bridge attachment will now be done
after x86_cpu_parse_featurestr() is called. But this shouldn't make any
difference, as property setters shouldn't depend on icc_bridge.

Signed-off-by: Eduardo Habkost 
---
 hw/i386/pc.c  |  6 +-
 target-i386/cpu.c | 14 ++
 target-i386/cpu.h |  3 +--
 3 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index ed54d93..66b9fa6 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -995,12 +995,16 @@ static X86CPU *pc_new_cpu(const char *cpu_model, int64_t apic_id,
 X86CPU *cpu;
 Error *local_err = NULL;
 
-cpu = cpu_x86_create(cpu_model, icc_bridge, &local_err);
+cpu = cpu_x86_create(cpu_model, &local_err);
 if (local_err != NULL) {
 error_propagate(errp, local_err);
 return NULL;
 }
 
+assert(icc_bridge);
+qdev_set_parent_bus(DEVICE(cpu), qdev_get_child_bus(icc_bridge, "icc"));
+object_unref(OBJECT(cpu));
+
 object_property_set_int(OBJECT(cpu), apic_id, "apic-id", &local_err);
 object_property_set_bool(OBJECT(cpu), true, "realized", &local_err);
 
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 50907d0..097924c 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2076,8 +2076,7 @@ static void x86_cpu_load_def(X86CPU *cpu, X86CPUDefinition *def, Error **errp)
 
 }
 
-X86CPU *cpu_x86_create(const char *cpu_model, DeviceState *icc_bridge,
-   Error **errp)
+X86CPU *cpu_x86_create(const char *cpu_model, Error **errp)
 {
 X86CPU *cpu = NULL;
 X86CPUClass *xcc;
@@ -2108,15 +2107,6 @@ X86CPU *cpu_x86_create(const char *cpu_model, DeviceState *icc_bridge,
 
 cpu = X86_CPU(object_new(object_class_get_name(oc)));
 
-#ifndef CONFIG_USER_ONLY
-if (icc_bridge == NULL) {
-error_setg(&error, "Invalid icc-bridge value");
-goto out;
-}
-qdev_set_parent_bus(DEVICE(cpu), qdev_get_child_bus(icc_bridge, "icc"));
-object_unref(OBJECT(cpu));
-#endif
-
 x86_cpu_parse_featurestr(CPU(cpu), features, &error);
 if (error) {
 goto out;
@@ -2139,7 +2129,7 @@ X86CPU *cpu_x86_init(const char *cpu_model)
 Error *error = NULL;
 X86CPU *cpu;
 
-cpu = cpu_x86_create(cpu_model, NULL, &error);
+cpu = cpu_x86_create(cpu_model, &error);
 if (error) {
 goto out;
 }
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 0638d24..8d748bd 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -982,8 +982,7 @@ typedef struct CPUX86State {
 #include "cpu-qom.h"
 
 X86CPU *cpu_x86_init(const char *cpu_model);
-X86CPU *cpu_x86_create(const char *cpu_model, DeviceState *icc_bridge,
-   Error **errp);
+X86CPU *cpu_x86_create(const char *cpu_model, Error **errp);
 int cpu_x86_exec(CPUX86State *s);
 void x86_cpu_list(FILE *f, fprintf_function cpu_fprintf);
 void x86_cpudef_setup(void);
-- 
2.1.0

Re: [Qemu-devel] [PATCH] Fix bug in implementation of SYSRET instruction for x86-64

2015-03-05 Thread John Snow


CC'ing X86 maintainers.

On 03/04/2015 12:48 PM, Bill Paul wrote:

Hi guys. I seem to have found a bug in the helper_sysret() function in

target-i386/seg_helper.c. I downloaded the Intel architecture manual
from here:

http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

And it describes the behavior of SYSRET with regards to the stack
selector (SS) as follows:

SS.Selector <-- (IA32_STAR[63:48]+8) OR 3; (* RPL forced to 3 *)

In other words, the value of the SS register is supposed to be loaded
from bits 63:48 of the IA32_STAR model-specific register (MSR),
incremented by 8, _AND_ ORed with 3. ORing in the 3 sets the privilege
level to 3 (user). This is done since SYSRET returns to user mode after
a system call.

The code in helper_sysret() performs the "increment by 8" step, but
_NOT_ the "OR with 3" step.

Here's another description of the SYSRET instruction which says
basically the same thing, though in slightly different format:

http://tptp.cc/mirrors/siyobik.info/instruction/SYSRET.html

[...]

SS(SEL) = IA32_STAR[63:48] + 8;

SS(PL) = 0x3;

[...]

The effect of this bug is that when x86-64 code uses the SYSCALL
instruction to enter kernel mode, the SYSRET instruction will return the
CPU to user mode with the SS register incorrectly set (the privilege
level bits will be 0). In my case, the original SS value for user mode
was 0x2B, but after the return from SYSRET, it was changed to 0x28. It
took me quite some time to realize this was due to a bug in QEMU rather
than my own code.
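
The selector computation quoted from the Intel manual can be written out as a small sketch. The function name is hypothetical; the IA32_STAR value used in the usage note is an assumption chosen so that the result matches the 0x2B and 0x28 values reported above.

```c
#include <assert.h>
#include <stdint.h>

/* SS selector loaded by SYSRET: (IA32_STAR[63:48] + 8) OR 3. */
static uint16_t demo_sysret_ss_selector(uint64_t ia32_star)
{
    uint16_t base = (uint16_t)(ia32_star >> 48);

    return (uint16_t)((base + 8) | 3);   /* RPL forced to 3 (user) */
}
```

With IA32_STAR[63:48] set to 0x20, the increment alone gives 0x28 and the OR with 3 gives 0x2B, which is the difference between the buggy and the expected SS value described in this report.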

This caused a problem later when the user-mode code was preempted by an
interrupt. At the end of the interrupt handling, an IRET instruction was
used to exit the interrupt context, and the IRET instruction would
generate a general protection fault because it would attempt to validate
the stack selector value and notice that it was set for PL 0
(supervisor) instead of PL 3 (user).

 From my reading, the code is quite clearly in error with respect to the
Intel documentation, and fixing the bug in my local sources results in
my test code (which runs correctly on real hardware) working correctly
in QEMU. It seems this bug has been there for a long time -- I happened
to have the sources for QEMU 0.10.5 and the bug is there too (in
target-i386/op_helper.c). I'm puzzled as to how it's gone unnoticed for
so long, which has me thinking that maybe I'm missing something. I'm
pretty sure this is wrong though.

I'm including a patch to fix this below. It seems to fix the problem
quite nicely on my QEMU 2.2.0 installation. I'm also attaching a
separate copy in case my mail client mangles the formatting on the
inline copy.

-Bill

--

=

-Bill Paul (510) 749-2329 | Senior Member of Technical Staff,

wp...@windriver.com | Master of Unix-Fu - Wind River Systems

=

"I put a dollar in a change machine. Nothing changed." - George Carlin

=

Signed-off-by: Bill Paul 
---
target-i386/seg_helper.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-i386/seg_helper.c b/target-i386/seg_helper.c
index fa374d0..2bc757a 100644
--- a/target-i386/seg_helper.c
+++ b/target-i386/seg_helper.c
@@ -1043,7 +1043,7 @@ void helper_sysret(CPUX86State *env, int dflag)
DESC_CS_MASK | DESC_R_MASK | DESC_A_MASK);
env->eip = (uint32_t)env->regs[R_ECX];
}
- cpu_x86_load_seg_cache(env, R_SS, selector + 8,
+ cpu_x86_load_seg_cache(env, R_SS, (selector + 8) | 3,
0, 0x,
DESC_G_MASK | DESC_B_MASK | DESC_P_MASK |
DESC_S_MASK | (3 << DESC_DPL_SHIFT) |
@@ -1056,7 +1056,7 @@ void helper_sysret(CPUX86State *env, int dflag)
DESC_S_MASK | (3 << DESC_DPL_SHIFT) |
DESC_CS_MASK | DESC_R_MASK | DESC_A_MASK);
env->eip = (uint32_t)env->regs[R_ECX];
- cpu_x86_load_seg_cache(env, R_SS, selector + 8,
+ cpu_x86_load_seg_cache(env, R_SS, (selector + 8) | 3,
0, 0x,
DESC_G_MASK | DESC_B_MASK | DESC_P_MASK |
DESC_S_MASK | (3 << DESC_DPL_SHIFT) |
--
1.8.0

[Qemu-devel] [PATCH 14/21] userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation

2015-03-05 Thread Andrea Arcangeli

This implements mcopy_atomic and mfill_zeropage, the low-level VM
methods invoked by the UFFDIO_COPY and UFFDIO_ZEROPAGE userfaultfd
commands, respectively.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h |   6 +
 mm/Makefile   |   1 +
 mm/userfaultfd.c  | 267 ++
 3 files changed, 274 insertions(+)
 create mode 100644 mm/userfaultfd.c

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e1e4360..587480a 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -30,6 +30,12 @@
 extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
unsigned int flags, unsigned long reason);
 
+extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long dst_start,
+   unsigned long src_start, unsigned long len);
+extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
+ unsigned long dst_start,
+ unsigned long len);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/Makefile b/mm/Makefile
index 3c1caa2..ea9828e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,3 +76,4 @@ obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)  += cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
 obj-$(CONFIG_PAGE_EXTENSION) += page_ext.o
+obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
new file mode 100644
index 000..3f4c0ef
--- /dev/null
+++ b/mm/userfaultfd.c
@@ -0,0 +1,267 @@
+/*
+ *  mm/userfaultfd.c
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "internal.h"
+
+static int mcopy_atomic_pte(struct mm_struct *dst_mm,
+   pmd_t *dst_pmd,
+   struct vm_area_struct *dst_vma,
+   unsigned long dst_addr,
+   unsigned long src_addr)
+{
+   struct mem_cgroup *memcg;
+   pte_t _dst_pte, *dst_pte;
+   spinlock_t *ptl;
+   struct page *page;
+   void *page_kaddr;
+   int ret;
+
+   ret = -ENOMEM;
+   page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, dst_vma, dst_addr);
+   if (!page)
+   goto out;
+
+   page_kaddr = kmap(page);
+   ret = -EFAULT;
+   if (copy_from_user(page_kaddr, (const void __user *) src_addr,
+  PAGE_SIZE))
+   goto out_kunmap_release;
+   kunmap(page);
+
+   /*
+* The memory barrier inside __SetPageUptodate makes sure that
+* preceding stores to the page contents become visible before
+* the set_pte_at() write.
+*/
+   __SetPageUptodate(page);
+
+   ret = -ENOMEM;
+   if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+   goto out_release;
+
+   _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+   if (dst_vma->vm_flags & VM_WRITE)
+   _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
+
+   ret = -EEXIST;
+   dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+   if (!pte_none(*dst_pte))
+   goto out_release_uncharge_unlock;
+
+   inc_mm_counter(dst_mm, MM_ANONPAGES);
+   page_add_new_anon_rmap(page, dst_vma, dst_addr);
+   mem_cgroup_commit_charge(page, memcg, false);
+   lru_cache_add_active_or_unevictable(page, dst_vma);
+
+   set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+   /* No need to invalidate - it was non-present before */
+   update_mmu_cache(dst_vma, dst_addr, dst_pte);
+
+   pte_unmap_unlock(dst_pte, ptl);
+   ret = 0;
+out:
+   return ret;
+out_release_uncharge_unlock:
+   pte_unmap_unlock(dst_pte, ptl);
+   mem_cgroup_cancel_charge(page, memcg);
+out_release:
+   page_cache_release(page);
+   goto out;
+out_kunmap_release:
+   kunmap(page);
+   goto out_release;
+}
+
+static int mfill_zeropage_pte(struct mm_struct *dst_mm,
+ pmd_t *dst_pmd,
+ struct vm_area_struct *dst_vma,
+ unsigned long dst_addr)
+{
+   pte_t _dst_pte, *dst_pte;
+   spinlock_t *ptl;
+   int ret;
+
+   _dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
+dst_vma->vm_page_prot));
+   ret = -EEXIST;
+   dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+   if (!pte_none(*dst_pte))
+   goto out_unlock;
+   set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
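
The "fill the page privately, then install it only if the destination is still empty" behaviour of mcopy_atomic_pte() can be illustrated in userspace. This is only an analogy under stated assumptions: it substitutes a C11 compare-and-swap on a pointer for the page table lock and pte_none() check used by the patch, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* One destination slot, standing in for a page table entry. */
typedef struct {
    _Atomic(void *) page;
} demo_slot;

/*
 * Copy src into a private page first, then publish it only if the slot
 * is still empty. Returns 0 on success, -ENOMEM on allocation failure,
 * or -EEXIST if something is already installed in the slot.
 */
static int demo_copy_atomic(demo_slot *slot, const void *src)
{
    void *expected = NULL;
    void *page = malloc(DEMO_PAGE_SIZE);

    if (!page) {
        return -ENOMEM;
    }
    memcpy(page, src, DEMO_PAGE_SIZE);   /* nobody can see the page yet */

    if (!atomic_compare_exchange_strong(&slot->page, &expected, page)) {
        free(page);                      /* slot taken: undo and report */
        return -EEXIST;
    }
    return 0;
}
```

A reader of the slot therefore sees either no page or a fully copied one, never a partially filled page, and a second copy into the same slot fails with -EEXIST just as the patch returns -EEXIST for a non-empty PTE.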

[Qemu-devel] [PATCH 21/21] userfaultfd: add userfaultfd_wp mm helpers

2015-03-05 Thread Andrea Arcangeli

These helpers will be used to decide whether to call handle_userfault()
during wrprotect faults in order to deliver those faults to userland.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3c39a4f..81f0d11 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -65,6 +65,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return vma->vm_flags & VM_UFFD_MISSING;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & VM_UFFD_WP;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
@@ -92,6 +97,11 @@ static inline bool userfaultfd_missing(struct vm_area_struct *vma)
return false;
 }
 
+static inline bool userfaultfd_wp(struct vm_area_struct *vma)
+{
+   return false;
+}
+
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
return false;

[Qemu-devel] [PATCH 10/21] userfaultfd: add new syscall to provide memory externalization

2015-03-05 Thread Andrea Arcangeli

Once a userfaultfd has been created and certain regions of the process
virtual address space have been registered into it, the thread
responsible for doing the memory externalization can manage the page
faults in userland by talking to the kernel using the userfaultfd
protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 977 +++
 1 file changed, 977 insertions(+)
 create mode 100644 fs/userfaultfd.c

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
new file mode 100644
index 000..6b31967
--- /dev/null
+++ b/fs/userfaultfd.c
@@ -0,0 +1,977 @@
+/*
+ *  fs/userfaultfd.c
+ *
+ *  Copyright (C) 2007  Davide Libenzi 
+ *  Copyright (C) 2008-2009 Red Hat, Inc.
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ *
+ *  Some part derived from fs/eventfd.c (anon inode setup) and
+ *  mm/ksm.c (mm hashing).
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum userfaultfd_state {
+   UFFD_STATE_WAIT_API,
+   UFFD_STATE_RUNNING,
+};
+
+struct userfaultfd_ctx {
+   /* pseudo fd refcounting */
+   atomic_t refcount;
+   /* waitqueue head for the userfaultfd page faults */
+   wait_queue_head_t fault_wqh;
+   /* waitqueue head for the pseudo fd to wakeup poll/read */
+   wait_queue_head_t fd_wqh;
+   /* userfaultfd syscall flags */
+   unsigned int flags;
+   /* state machine */
+   enum userfaultfd_state state;
+   /* released */
+   bool released;
+   /* mm with one or more vmas attached to this userfaultfd_ctx */
+   struct mm_struct *mm;
+};
+
+struct userfaultfd_wait_queue {
+   unsigned long address;
+   wait_queue_t wq;
+   bool pending;
+   struct userfaultfd_ctx *ctx;
+};
+
+struct userfaultfd_wake_range {
+   unsigned long start;
+   unsigned long len;
+};
+
+static int userfaultfd_wake_function(wait_queue_t *wq, unsigned mode,
+int wake_flags, void *key)
+{
+   struct userfaultfd_wake_range *range = key;
+   int ret;
+   struct userfaultfd_wait_queue *uwq;
+   unsigned long start, len;
+
+   uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
+   ret = 0;
+   /* don't wake the pending ones to avoid reads to block */
+   if (uwq->pending && !ACCESS_ONCE(uwq->ctx->released))
+   goto out;
+   /* len == 0 means wake all */
+   start = range->start;
+   len = range->len;
+   if (len && (start > uwq->address || start + len <= uwq->address))
+   goto out;
+   ret = wake_up_state(wq->private, mode);
+   if (ret)
+   /* wake only once, autoremove behavior */
+   list_del_init(&wq->task_list);
+out:
+   return ret;
+}
+
+/**
+ * userfaultfd_ctx_get - Acquires a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to the userfaultfd context.
+ *
+ * Returns: In case of success, returns not zero.
+ */
+static void userfaultfd_ctx_get(struct userfaultfd_ctx *ctx)
+{
+   if (!atomic_inc_not_zero(&ctx->refcount))
+   BUG();
+}
+
+/**
+ * userfaultfd_ctx_put - Releases a reference to the internal userfaultfd
+ * context.
+ * @ctx: [in] Pointer to userfaultfd context.
+ *
+ * The userfaultfd context reference must have been previously acquired either
+ * with userfaultfd_ctx_get() or userfaultfd_ctx_fdget().
+ */
+static void userfaultfd_ctx_put(struct userfaultfd_ctx *ctx)
+{
+   if (atomic_dec_and_test(&ctx->refcount)) {
+   mmdrop(ctx->mm);
+   kfree(ctx);
+   }
+}
+
+static inline unsigned long userfault_address(unsigned long address,
+ unsigned int flags,
+ unsigned long reason)
+{
+   BUILD_BUG_ON(PAGE_SHIFT < UFFD_BITS);
+   address &= PAGE_MASK;
+   if (flags & FAULT_FLAG_WRITE)
+   /*
+* Encode "write" fault information in the LSB of the
+* address read by userland, without depending on
+* FAULT_FLAG_WRITE kernel internal value.
+*/
+   address |= UFFD_BIT_WRITE;
+   if (reason & VM_UFFD_WP)
+   /*
+* Encode "reason" fault information as bit number 1
+* in the address read by userland. If bit number 1 is
+* clear it means the reason is a VM_FAULT_MISSING
+* fault.
+*/
+   address |= UFFD_BIT_WP;
+   return address;
+}
+
+/*
+ * The locking rules involved in returning VM_FAULT_RETRY depending on
+ * FAULT_FLAG_ALLOW_RETRY, FAULT_FLAG_RETRY

[Qemu-devel] [PATCH 04/21] userfaultfd: linux/userfaultfd_k.h

2015-03-05 Thread Andrea Arcangeli

Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h | 79 +++
 1 file changed, 79 insertions(+)
 create mode 100644 include/linux/userfaultfd_k.h

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
new file mode 100644
index 0000000..e1e4360
--- /dev/null
+++ b/include/linux/userfaultfd_k.h
@@ -0,0 +1,79 @@
+/*
+ *  include/linux/userfaultfd_k.h
+ *
+ *  Copyright (C) 2015  Red Hat, Inc.
+ *
+ */
+
+#ifndef _LINUX_USERFAULTFD_K_H
+#define _LINUX_USERFAULTFD_K_H
+
+#ifdef CONFIG_USERFAULTFD
+
+#include <linux/userfaultfd.h> /* linux/include/uapi/linux/userfaultfd.h */
+
+#include 
+
+/*
+ * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from userfaultfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+#define UFFD_CLOEXEC O_CLOEXEC
+#define UFFD_NONBLOCK O_NONBLOCK
+
+#define UFFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
+#define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
+
+extern int handle_userfault(struct vm_area_struct *vma, unsigned long address,
+   unsigned int flags, unsigned long reason);
+
+/* mm helpers */
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+   struct vm_userfaultfd_ctx vm_ctx)
+{
+   return vma->vm_userfaultfd_ctx.ctx == vm_ctx.ctx;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & VM_UFFD_MISSING;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+   return vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP);
+}
+
+#else /* CONFIG_USERFAULTFD */
+
+/* mm helpers */
+static inline int handle_userfault(struct vm_area_struct *vma,
+  unsigned long address,
+  unsigned int flags,
+  unsigned long reason)
+{
+   return VM_FAULT_SIGBUS;
+}
+
+static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
+   struct vm_userfaultfd_ctx vm_ctx)
+{
+   return true;
+}
+
+static inline bool userfaultfd_missing(struct vm_area_struct *vma)
+{
+   return false;
+}
+
+static inline bool userfaultfd_armed(struct vm_area_struct *vma)
+{
+   return false;
+}
+
+#endif /* CONFIG_USERFAULTFD */
+
+#endif /* _LINUX_USERFAULTFD_K_H */

[Qemu-devel] [PATCH 00/21] RFC: userfaultfd v3

2015-03-05 Thread Andrea Arcangeli

Hello everyone,

This is an RFC for the userfaultfd syscall API v3 that addresses the
feedback received on the previous v2 submission.

The main change from v2 is that MADV_USERFAULT/NOUSERFAULT
disappeared (they're replaced by the UFFDIO_REGISTER/UNREGISTER
ioctls). In short, userfaults are now only possible through the
userfaultfd. The remap_anon_pages syscall also disappeared, replaced
by the UFFDIO_REMAP ioctl, which is in turn mostly obsoleted by the
newer UFFDIO_COPY and UFFDIO_ZEROPAGE ioctls, which are more efficient
because they never have to flush the TLB. The suggestion to copy the
data instead of moving it, in order to resolve the userfault, was
immediately agreed upon.

The latest code can also be cloned here:

git clone --reference linux -b userfault git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git


Userfaults allow implementing on-demand paging from userland and, more
generally, they allow userland to take control of various types of
page faults more efficiently.

For example, userfaults allow a proper and more efficient
implementation of the PROT_NONE+SIGSEGV trick.

There has been interest from multiple users for different use cases:

1) KVM postcopy live migration (one form of cloud memory
   externalization). KVM postcopy live migration is the primary driver
   of this work:
   http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
   http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) KVM postcopy live snapshotting (allowing the memory usage to be
   limited/throttled, unlike fork, and avoiding the fork overhead in
   the first place).

   The syscall API is already contemplating the wrprotect fault
   tracking and it's generic enough to allow its later implementation
   in a backwards compatible fashion.

3) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to work also on tmpfs and then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is tmpfs
   backed.

4) An alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access, with
   the CPU, the virtual regions marked volatile. This also requires
   point 3) to be fulfilled, as volatile pages happily apply to tmpfs.

5) postcopy live migration of binaries inside linux containers.

Even though no real use case has requested it yet, the new API also
allows implementing distributed shared memory in a way that readonly
shared mappings can exist simultaneously on different hosts and become
exclusive at the first wrprotect fault.

The UFFDIO_REMAP method is still present in the patchset but it's
provided primarily to remove (not add) memory from the userfault
range. The addition of the UFFDIO_REMAP method is intentionally kept
at the end of the patchset. The postcopy live migration qemu code will
only use UFFDIO_COPY and UFFDIO_ZEROPAGE. UFFDIO_REMAP isn't intended
to be merged upstream in the short term, and it can be dropped later
if there's an agreement it's a bad idea to keep it around in the
patchset.

David ran some KVM postcopy live migration benchmarks on an 8-way CPU
system and measured that using UFFDIO_COPY instead of UFFDIO_REMAP
resulted in roughly a 20% reduction in latency, which is good. The
standard deviation of the latency measurement decreased
significantly as well (because the number of CPUs that required IPI
delivery was variable, while the copy always takes roughly the same
time). A bigger improvement is expected if measured on a larger host
with more CPUs.

All UFFDIO_COPY/ZEROPAGE/REMAP methods already support CRIU postcopy
live migration and the UFFD can be passed to a manager process through
unix domain sockets to satisfy point 5).

I look forward to discussing this further next week at the LSF/MM
summit. If you're attending the summit, see you soon!

Comments welcome, thanks,
Andrea

Credits: partially funded by the Orbit EU project.

PS. There is one TODO detail worth mentioning for completeness that
affects usage 2) and UFFDIO_REMAP if used to remove memory from the
userfault range: handle_userfault() is only effective if
FAULT_FLAG_ALLOW_RETRY is set... but that is only set at the first
attempted page fault. If by accident some thread was already faulting
in the range and the first page fault attempt returned VM_FAULT_RETRY
and UFFDIO_REMAP or UFFDIO_WP jumps in to arm the userfault just
before the second attempt starts, a SIGBUS would be raised by the page
fault. Stopping all thread access to the userfault ranges during
UFFDIO_REMAP/WP, while possible, isn't optimal. Currently (excluding
real filebacked mappings and handle_userfault() itself which is
clearly no problem) only tmpfs or a swapin can return
VM_FAULT_RETRY. To close this SIGBUS window for all usages, the
simplest solution would be that

[Qemu-devel] [PATCH 07/21] userfaultfd: call handle_userfault() for userfaultfd_missing() faults

2015-03-05 Thread Andrea Arcangeli

This is where the page faults must be modified to call
handle_userfault() if userfaultfd_missing() is true (so if the
vma->vm_flags had VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as a parameter so the "read|write"
kind of fault can be passed to userland.
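
As an illustration of how the "read|write" information can travel with
the fault address, here is a minimal userland sketch modeled on the
userfault_address() helper from another patch in this series. The
TOY_* names, the 4 KiB page size and the exact bit values are
assumptions made for this example: the helper's comments only say that
"write" goes in the LSB and the wrprotect reason in bit number 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only (not taken from the patch). */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  (~((UINT64_C(1) << TOY_PAGE_SHIFT) - 1))
#define TOY_BIT_WRITE  (UINT64_C(1) << 0) /* "write" fault, LSB */
#define TOY_BIT_WP     (UINT64_C(1) << 1) /* wrprotect reason, bit 1 */

struct toy_fault {
	uint64_t page_addr; /* page-aligned faulting address */
	bool is_write;      /* fault was triggered by a write */
	bool is_wp;         /* wrprotect fault; false means "missing" */
};

/* Pack the fault kind into the low bits of a page-aligned address. */
static uint64_t toy_encode_fault(uint64_t address, bool is_write, bool is_wp)
{
	uint64_t v = address & TOY_PAGE_MASK;

	/* The flag bits must fit below the page offset boundary. */
	assert((TOY_BIT_WRITE | TOY_BIT_WP) < (UINT64_C(1) << TOY_PAGE_SHIFT));
	if (is_write)
		v |= TOY_BIT_WRITE;
	if (is_wp)
		v |= TOY_BIT_WP;
	return v;
}

/* What a userland reader would do with the value it received. */
static struct toy_fault toy_decode_fault(uint64_t v)
{
	struct toy_fault f;

	f.page_addr = v & TOY_PAGE_MASK;
	f.is_write = (v & TOY_BIT_WRITE) != 0;
	f.is_wp = (v & TOY_BIT_WP) != 0;
	return f;
}
```

Because the address is page aligned before the bits are ORed in, the
low bits are free to carry the flags and decoding is a pair of mask
operations.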

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 68 ++--
 mm/memory.c  | 16 +
 2 files changed, 62 insertions(+), 22 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f0207cf..5374132 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -708,7 +709,7 @@ static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
-   struct page *page)
+   struct page *page, unsigned int flags)
 {
struct mem_cgroup *memcg;
pgtable_t pgtable;
@@ -716,12 +717,16 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
VM_BUG_ON_PAGE(!PageCompound(page), page);
 
-   if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg))
-   return VM_FAULT_OOM;
+   if (mem_cgroup_try_charge(page, mm, GFP_TRANSHUGE, &memcg)) {
+   put_page(page);
+   count_vm_event(THP_FAULT_FALLBACK);
+   return VM_FAULT_FALLBACK;
+   }
 
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable)) {
mem_cgroup_cancel_charge(page, memcg);
+   put_page(page);
return VM_FAULT_OOM;
}
 
@@ -741,6 +746,21 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
+
+   /* Deliver the page fault to userland */
+   if (userfaultfd_missing(vma)) {
+   int ret;
+
+   spin_unlock(ptl);
+   mem_cgroup_cancel_charge(page, memcg);
+   put_page(page);
+   pte_free(mm, pgtable);
+   ret = handle_userfault(vma, haddr, flags,
+  VM_UFFD_MISSING);
+   VM_BUG_ON(ret & VM_FAULT_FALLBACK);
+   return ret;
+   }
+
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr);
@@ -751,6 +771,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
atomic_long_inc(&mm->nr_ptes);
spin_unlock(ptl);
+   count_vm_event(THP_FAULT_ALLOC);
}
 
return 0;
@@ -762,19 +783,16 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 }
 
 /* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
 {
pmd_t entry;
-   if (!pmd_none(*pmd))
-   return false;
entry = mk_pmd(zero_page, vma->vm_page_prot);
entry = pmd_mkhuge(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
atomic_long_inc(&mm->nr_ptes);
-   return true;
 }
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -797,6 +815,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
pgtable_t pgtable;
struct page *zero_page;
bool set;
+   int ret;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
return VM_FAULT_OOM;
@@ -807,14 +826,28 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_FALLBACK;
}
ptl = pmd_lock(mm, pmd);
-   set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
-   zero_page);
-   spin_unlock(ptl);
+   ret = 0;
+   set = false;
+   if (pmd_none(*pmd)) {
+   if (userfaultfd_missing(vma)) {
+   spin_unlock(ptl);
+   ret = handle_userfault(vma, haddr,

[Qemu-devel] [PATCH 19/21] userfaultfd: remap_pages: UFFDIO_REMAP preparation

2015-03-05 Thread Andrea Arcangeli

remap_pages is the lowlevel mm helper needed to implement
UFFDIO_REMAP.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/userfaultfd_k.h |  17 ++
 mm/huge_memory.c  | 120 ++
 mm/userfaultfd.c  | 526 ++
 3 files changed, 663 insertions(+)

diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 587480a..3c39a4f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -36,6 +36,23 @@ extern ssize_t mfill_zeropage(struct mm_struct *dst_mm,
  unsigned long dst_start,
  unsigned long len);
 
+/* remap_pages */
+extern void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern void double_pt_unlock(spinlock_t *ptl1, spinlock_t *ptl2);
+extern ssize_t remap_pages(struct mm_struct *dst_mm,
+  struct mm_struct *src_mm,
+  unsigned long dst_start,
+  unsigned long src_start,
+  unsigned long len, __u64 flags);
+extern int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+   struct mm_struct *src_mm,
+   pmd_t *dst_pmd, pmd_t *src_pmd,
+   pmd_t dst_pmdval,
+   struct vm_area_struct *dst_vma,
+   struct vm_area_struct *src_vma,
+   unsigned long dst_addr,
+   unsigned long src_addr);
+
 /* mm helpers */
 static inline bool is_mergeable_vm_userfaultfd_ctx(struct vm_area_struct *vma,
struct vm_userfaultfd_ctx vm_ctx)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e25cb3..08c8afc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1531,6 +1531,124 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
return ret;
 }
 
+#ifdef CONFIG_USERFAULTFD
+/*
+ * The PT lock for src_pmd and the mmap_sem for reading are held by
+ * the caller, but it must return after releasing the
+ * page_table_lock. We're guaranteed the src_pmd is a pmd_trans_huge
+ * until the PT lock of the src_pmd is released. Just move the page
+ * from src_pmd to dst_pmd if possible. Return zero if succeeded in
+ * moving the page, -EAGAIN if it needs to be repeated by the caller,
+ * or other errors in case of failure.
+ */
+int remap_pages_huge_pmd(struct mm_struct *dst_mm,
+struct mm_struct *src_mm,
+pmd_t *dst_pmd, pmd_t *src_pmd,
+pmd_t dst_pmdval,
+struct vm_area_struct *dst_vma,
+struct vm_area_struct *src_vma,
+unsigned long dst_addr,
+unsigned long src_addr)
+{
+   pmd_t _dst_pmd, src_pmdval;
+   struct page *src_page;
+   struct anon_vma *src_anon_vma, *dst_anon_vma;
+   spinlock_t *src_ptl, *dst_ptl;
+   pgtable_t pgtable;
+
+   src_pmdval = *src_pmd;
+   src_ptl = pmd_lockptr(src_mm, src_pmd);
+
+   BUG_ON(!pmd_trans_huge(src_pmdval));
+   BUG_ON(pmd_trans_splitting(src_pmdval));
+   BUG_ON(!pmd_none(dst_pmdval));
+   BUG_ON(!spin_is_locked(src_ptl));
+   BUG_ON(!rwsem_is_locked(&src_mm->mmap_sem));
+   BUG_ON(!rwsem_is_locked(&dst_mm->mmap_sem));
+
+   src_page = pmd_page(src_pmdval);
+   BUG_ON(!PageHead(src_page));
+   BUG_ON(!PageAnon(src_page));
+   if (unlikely(page_mapcount(src_page) != 1)) {
+   spin_unlock(src_ptl);
+   return -EBUSY;
+   }
+
+   get_page(src_page);
+   spin_unlock(src_ptl);
+
+   mmu_notifier_invalidate_range_start(src_mm, src_addr,
+   src_addr + HPAGE_PMD_SIZE);
+
+   /* block all concurrent rmap walks */
+   lock_page(src_page);
+
+   /*
+* split_huge_page walks the anon_vma chain without the page
+* lock. Serialize against it with the anon_vma lock, the page
+* lock is not enough.
+*/
+   src_anon_vma = page_get_anon_vma(src_page);
+   if (!src_anon_vma) {
+   unlock_page(src_page);
+   put_page(src_page);
+   mmu_notifier_invalidate_range_end(src_mm, src_addr,
+ src_addr + HPAGE_PMD_SIZE);
+   return -EAGAIN;
+   }
+   anon_vma_lock_write(src_anon_vma);
+
+   dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
+   double_pt_lock(src_ptl, dst_ptl);
+   if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
+!pmd_same(*dst_pmd, dst_pmdval) ||
+page_mapcount(src_page) != 1)) {
+   double_pt_unlock(src_ptl, dst_ptl);
+   anon_vma_unlock_write(src_anon_vma);
+   put_anon_vma(src_anon_vma);
+   unlock_page(src_page);
+   put_pa

[Qemu-devel] [PATCH 20/21] userfaultfd: UFFDIO_REMAP

2015-03-05 Thread Andrea Arcangeli

This remap ioctl allows atomically moving a page in or out of a
userfaultfd address space. It's more expensive than "copy" (and of
course more expensive than "zerofill") as it requires a TLB flush on
the source range for each ioctl, which is an expensive operation on
SMP. Especially when handling only a few pages at a time, copying
without a TLB flush is faster.

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6230f22..b4c7f25 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -892,6 +892,54 @@ out:
return ret;
 }
 
+static int userfaultfd_remap(struct userfaultfd_ctx *ctx,
+unsigned long arg)
+{
+   __s64 ret;
+   struct uffdio_remap uffdio_remap;
+   struct uffdio_remap __user *user_uffdio_remap;
+   struct userfaultfd_wake_range range;
+
+   user_uffdio_remap = (struct uffdio_remap __user *) arg;
+
+   ret = -EFAULT;
+   if (copy_from_user(&uffdio_remap, user_uffdio_remap,
+  /* don't copy "remap" and "wake" last field */
+  sizeof(uffdio_remap)-sizeof(__s64)*2))
+   goto out;
+
+   ret = validate_range(ctx->mm, uffdio_remap.dst, uffdio_remap.len);
+   if (ret)
+   goto out;
+   ret = validate_range(current->mm, uffdio_remap.src, uffdio_remap.len);
+   if (ret)
+   goto out;
+   ret = -EINVAL;
+   if (uffdio_remap.mode & ~(UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES|
+ UFFDIO_REMAP_MODE_DONTWAKE))
+   goto out;
+
+   ret = remap_pages(ctx->mm, current->mm,
+ uffdio_remap.dst, uffdio_remap.src,
+ uffdio_remap.len, uffdio_remap.mode);
+   if (unlikely(put_user(ret, &user_uffdio_remap->remap)))
+   return -EFAULT;
+   if (ret < 0)
+   goto out;
+   /* len == 0 would wake all */
+   BUG_ON(!ret);
+   range.len = ret;
+   if (!(uffdio_remap.mode & UFFDIO_REMAP_MODE_DONTWAKE)) {
+   range.start = uffdio_remap.dst;
+   ret = wake_userfault(ctx, &range);
+   if (unlikely(put_user(ret, &user_uffdio_remap->wake)))
+   return -EFAULT;
+   }
+   ret = range.len == uffdio_remap.len ? 0 : -EAGAIN;
+out:
+   return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -955,6 +1003,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_ZEROPAGE:
ret = userfaultfd_zeropage(ctx, arg);
break;
+   case UFFDIO_REMAP:
+   ret = userfaultfd_remap(ctx, arg);
+   break;
}
return ret;
 }

[Qemu-devel] [PATCH 15/21] userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE

2015-03-05 Thread Andrea Arcangeli

These two ioctls allow either atomically copying pages or mapping
zeropages into the virtual address space. They are used by the thread
that opened the userfaultfd to resolve the userfaults.
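
The argument checks and the return convention used by
userfaultfd_copy() in the diff below can be sketched in plain userland
C. This is only a model under stated assumptions: the toy_* function
names are hypothetical, TOY_COPY_MODE_DONTWAKE is given an assumed
value, and the destination range validation done by validate_range()
is left out because its body is not part of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed value for illustration; the real flag is defined elsewhere. */
#define TOY_COPY_MODE_DONTWAKE (UINT64_C(1) << 0)

/*
 * Mirror of the source range and mode checks: reject a source range
 * that is empty or wraps around the address space, and reject any
 * mode bit other than DONTWAKE.
 */
static int toy_check_copy_args(uint64_t src, uint64_t len, uint64_t mode)
{
	/* Unsigned overflow wraps, so a wrapped or empty range gives
	 * src + len <= src. */
	if (src + len <= src)
		return -EINVAL;
	if (mode & ~TOY_COPY_MODE_DONTWAKE)
		return -EINVAL;
	return 0;
}

/*
 * Mirror of the final return value: a negative byte count is an error
 * code, a full copy returns 0, and a partial copy returns -EAGAIN so
 * the caller knows to retry for the remaining bytes.
 */
static int toy_copy_result(int64_t copied, uint64_t requested)
{
	if (copied < 0)
		return (int)copied;
	return (uint64_t)copied == requested ? 0 : -EAGAIN;
}
```

For example, toy_copy_result(4096, 8192) yields -EAGAIN: one page out
of two was copied, and in the patch the number of bytes actually
copied is reported separately through the "copy" field of the ioctl
argument.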

Signed-off-by: Andrea Arcangeli 
---
 fs/userfaultfd.c | 100 +++
 1 file changed, 100 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 6b31967..6230f22 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -798,6 +798,100 @@ out:
return ret;
 }
 
+static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
+   unsigned long arg)
+{
+   __s64 ret;
+   struct uffdio_copy uffdio_copy;
+   struct uffdio_copy __user *user_uffdio_copy;
+   struct userfaultfd_wake_range range;
+
+   user_uffdio_copy = (struct uffdio_copy __user *) arg;
+
+   ret = -EFAULT;
+   if (copy_from_user(&uffdio_copy, user_uffdio_copy,
+  /* don't copy "copy" and "wake" last field */
+  sizeof(uffdio_copy)-sizeof(__s64)*2))
+   goto out;
+
+   ret = validate_range(ctx->mm, uffdio_copy.dst, uffdio_copy.len);
+   if (ret)
+   goto out;
+   /*
+* double check for wraparound just in case. copy_from_user()
+* will later check uffdio_copy.src + uffdio_copy.len to fit
+* in the userland range.
+*/
+   ret = -EINVAL;
+   if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
+   goto out;
+   if (uffdio_copy.mode & ~UFFDIO_COPY_MODE_DONTWAKE)
+   goto out;
+
+   ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
+  uffdio_copy.len);
+   if (unlikely(put_user(ret, &user_uffdio_copy->copy)))
+   return -EFAULT;
+   if (ret < 0)
+   goto out;
+   BUG_ON(!ret);
+   /* len == 0 would wake all */
+   range.len = ret;
+   if (!(uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE)) {
+   range.start = uffdio_copy.dst;
+   ret = wake_userfault(ctx, &range);
+   if (unlikely(put_user(ret, &user_uffdio_copy->wake)))
+   return -EFAULT;
+   }
+   ret = range.len == uffdio_copy.len ? 0 : -EAGAIN;
+out:
+   return ret;
+}
+
+static int userfaultfd_zeropage(struct userfaultfd_ctx *ctx,
+   unsigned long arg)
+{
+   __s64 ret;
+   struct uffdio_zeropage uffdio_zeropage;
+   struct uffdio_zeropage __user *user_uffdio_zeropage;
+   struct userfaultfd_wake_range range;
+
+   user_uffdio_zeropage = (struct uffdio_zeropage __user *) arg;
+
+   ret = -EFAULT;
+   if (copy_from_user(&uffdio_zeropage, user_uffdio_zeropage,
+  /* don't copy "zeropage" and "wake" last field */
+  sizeof(uffdio_zeropage)-sizeof(__s64)*2))
+   goto out;
+
+   ret = validate_range(ctx->mm, uffdio_zeropage.range.start,
+uffdio_zeropage.range.len);
+   if (ret)
+   goto out;
+   ret = -EINVAL;
+   if (uffdio_zeropage.mode & ~UFFDIO_ZEROPAGE_MODE_DONTWAKE)
+   goto out;
+
+   ret = mfill_zeropage(ctx->mm, uffdio_zeropage.range.start,
+uffdio_zeropage.range.len);
+   if (unlikely(put_user(ret, &user_uffdio_zeropage->zeropage)))
+   return -EFAULT;
+   if (ret < 0)
+   goto out;
+   /* len == 0 would wake all */
+   BUG_ON(!ret);
+   range.len = ret;
+   if (!(uffdio_zeropage.mode & UFFDIO_ZEROPAGE_MODE_DONTWAKE)) {
+   range.start = uffdio_zeropage.range.start;
+   ret = wake_userfault(ctx, &range);
+   if (unlikely(put_user(ret, &user_uffdio_zeropage->wake)))
+   return -EFAULT;
+   }
+   ret = range.len == uffdio_zeropage.range.len ? 0 : -EAGAIN;
+out:
+   return ret;
+}
+
 /*
  * userland asks for a certain API version and we return which bits
  * and ioctl commands are implemented in this kernel for such API
@@ -855,6 +949,12 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
case UFFDIO_WAKE:
ret = userfaultfd_wake(ctx, arg);
break;
+   case UFFDIO_COPY:
+   ret = userfaultfd_copy(ctx, arg);
+   break;
+   case UFFDIO_ZEROPAGE:
+   ret = userfaultfd_zeropage(ctx, arg);
+   break;
}
return ret;
 }

[Qemu-devel] [PATCH 16/21] userfaultfd: remap_pages: rmap preparation

2015-03-05 Thread Andrea Arcangeli

As far as the rmap code is concerned, remap_pages only alters the
page->mapping and page->index. It does so while holding the page
lock. However, there are a few places that, in the presence of anon
pages, are allowed to do rmap walks without the page lock
(split_huge_page and page_referenced_anon). Those places that do rmap
walks without taking the page lock first must be updated to re-check
that the page->mapping didn't change after they obtained the anon_vma
lock. remap_pages takes the anon_vma lock for writing before altering
the page->mapping, so if the page->mapping is still the same after
obtaining the anon_vma lock (without the page lock), the rmap walks
can go ahead safely (and remap_pages will wait for them to complete
before proceeding).

remap_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem
for writing don't need to check if the page->mapping has changed
after taking the anon_vma lock, regardless of the page lock, because
remap_pages holds the mmap_sem for reading.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_pages must be mapped only in one vma, but
this is not a limitation when used to handle userland page faults. The
source addresses passed to remap_pages should be set as VM_DONTCOPY
with MADV_DONTFORK to avoid any risk of the mapcount of the pages
increasing, if fork runs in parallel in another thread, before or
while remap_pages runs.
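
The "take the lock, then re-check page->mapping" rule described above
can be shown with a small single-threaded model. Everything here is a
hypothetical stand-in (the toy_* types, the integer used as a lock and
the race hook that simulates remap_pages running in the window before
the lock is taken); it sketches only the retry pattern, not the kernel
locking itself.

```c
#include <assert.h>
#include <stddef.h>

struct toy_anon_vma { int locked; };                    /* stand-in lock */
struct toy_page { struct toy_anon_vma *mapping; };

/* Hypothetical hook: runs once between the snapshot and the lock. */
typedef void (*toy_race_fn)(struct toy_page *page, void *ctx);

/*
 * Snapshot the mapping, lock what the snapshot points to, then verify
 * the mapping did not change in between. If it did, drop the stale
 * lock and start over with the new mapping.
 */
static struct toy_anon_vma *toy_lock_stable_anon_vma(struct toy_page *page,
						     toy_race_fn race,
						     void *ctx, int *retries)
{
	assert(page != NULL);
	for (;;) {
		struct toy_anon_vma *seen = page->mapping;

		if (!seen)
			return NULL;
		if (race) {
			race(page, ctx); /* simulated concurrent remap */
			race = NULL;
		}
		seen->locked = 1;        /* "anon_vma_lock" */
		if (page->mapping == seen)
			return seen;     /* lock matches the current mapping */
		seen->locked = 0;        /* stale: unlock and retry */
		if (retries)
			(*retries)++;
	}
}

/* Simulated remap: points the page at a different anon_vma. */
static void toy_remap(struct toy_page *page, void *ctx)
{
	page->mapping = ctx;
}
```

In this model a walker that loses the race ends up holding the lock of
the new anon_vma after one retry, which is the outcome the re-check
loops added to split_huge_page_to_list() and page_lock_anon_vma_read()
in this patch aim for.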

Signed-off-by: Andrea Arcangeli 
---
 mm/huge_memory.c | 23 +++
 mm/rmap.c|  9 +
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f1b6a5..1e25cb3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1902,6 +1902,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
struct anon_vma *anon_vma;
int ret = 1;
+   struct address_space *mapping;
 
BUG_ON(is_huge_zero_page(page));
BUG_ON(!PageAnon(page));
@@ -1913,10 +1914,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 * page_lock_anon_vma_read except the write lock is taken to serialise
 * against parallel split or collapse operations.
 */
-   anon_vma = page_get_anon_vma(page);
-   if (!anon_vma)
-   goto out;
-   anon_vma_lock_write(anon_vma);
+   for (;;) {
+   mapping = ACCESS_ONCE(page->mapping);
+   anon_vma = page_get_anon_vma(page);
+   if (!anon_vma)
+   goto out;
+   anon_vma_lock_write(anon_vma);
+   /*
+* We don't hold the page lock here so
+* remap_pages_huge_pmd can change the anon_vma from
+* under us until we obtain the anon_vma lock. Verify
+* that we obtained the anon_vma lock before
+* remap_pages did.
+*/
+   if (likely(mapping == ACCESS_ONCE(page->mapping)))
+   break;
+   anon_vma_unlock_write(anon_vma);
+   put_anon_vma(anon_vma);
+   }
 
ret = 0;
if (!PageCompound(page))
diff --git a/mm/rmap.c b/mm/rmap.c
index 5e3e090..5ab2df1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -492,6 +492,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct anon_vma *root_anon_vma;
unsigned long anon_mapping;
 
+repeat:
rcu_read_lock();
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -530,6 +531,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page)
rcu_read_unlock();
anon_vma_lock_read(anon_vma);
 
+   /* check if remap_anon_pages changed the anon_vma */
+   if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+   anon_vma_unlock_read(anon_vma);
+   put_anon_vma(anon_vma);
+   anon_vma = NULL;
+   goto repeat;
+   }
+
if (atomic_dec_and_test(&anon_vma->refcount)) {
/*
 * Oops, we held the last refcount, release the lock

[Qemu-devel] [PATCH 01/21] userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key

2015-03-05 Thread Andrea Arcangeli

userfaultfd needs to wake all waitqueues (pass 0 as nr parameter),
instead of the current hardcoded 1 (that would wake just the first
waitqueue in the head list).
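
To make the meaning of the nr parameter concrete, here is a toy model
of the wake-up loop. It is a sketch written from my understanding of
how __wake_up_common() counts exclusive waiters, not a copy of the
kernel code; the toy_* names are hypothetical and the waiters are
assumed to be exclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_waiter {
	bool exclusive; /* counts against nr_exclusive when woken */
	bool woken;
};

/*
 * Wake waiters in queue order. Each woken exclusive waiter decrements
 * nr_exclusive and the walk stops when the counter reaches zero.
 * Starting from 0, the first decrement gives -1, so the counter never
 * hits zero and every waiter is woken.
 */
static int toy_wake_up_common(struct toy_waiter *w, size_t n, int nr_exclusive)
{
	int woken = 0;

	assert(w != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (w[i].woken)
			continue;
		w[i].woken = true;
		woken++;
		if (w[i].exclusive && !--nr_exclusive)
			break;
	}
	return woken;
}
```

With three exclusive waiters queued, passing 1 wakes only the first
one, while passing 0 wakes all three, which is the behaviour the
commit message asks for.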

Signed-off-by: Andrea Arcangeli 
---
 include/linux/wait.h | 5 +++--
 kernel/sched/wait.c  | 7 ---
 net/sunrpc/sched.c   | 2 +-
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2db8334..cf884cf 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,7 +147,8 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 
 typedef int wait_bit_action_f(struct wait_bit_key *);
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
@@ -179,7 +180,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, (void *) (m))
 #define wake_up_locked_poll(x, m)  \
-   __wake_up_locked_key((x), TASK_NORMAL, (void *) (m))
+   __wake_up_locked_key((x), TASK_NORMAL, 1, (void *) (m))
 #define wake_up_interruptible_poll(x, m)   \
__wake_up(x, TASK_INTERRUPTIBLE, 1, (void *) (m))
 #define wake_up_interruptible_sync_poll(x, m)  \
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 852143a..6da208dd2 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -106,9 +106,10 @@ void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr)
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked);
 
-void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key)
+void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, int nr,
+ void *key)
 {
-   __wake_up_common(q, mode, 1, 0, key);
+   __wake_up_common(q, mode, nr, 0, key);
 }
 EXPORT_SYMBOL_GPL(__wake_up_locked_key);
 
@@ -283,7 +284,7 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
if (!list_empty(&wait->task_list))
list_del_init(&wait->task_list);
else if (waitqueue_active(q))
-   __wake_up_locked_key(q, mode, key);
+   __wake_up_locked_key(q, mode, 1, key);
spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL(abort_exclusive_wait);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index b91fd9c..dead9e0 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -297,7 +297,7 @@ static int rpc_complete_task(struct rpc_task *task)
clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
ret = atomic_dec_and_test(&task->tk_count);
if (waitqueue_active(wq))
-   __wake_up_locked_key(wq, TASK_NORMAL, &k);
+   __wake_up_locked_key(wq, TASK_NORMAL, 1, &k);
spin_unlock_irqrestore(&wq->lock, flags);
return ret;
 }

[Qemu-devel] [PATCH 17/21] userfaultfd: remap_pages: swp_entry_swapcount() preparation

2015-03-05 Thread Andrea Arcangeli

Provide a new swapfile method for remap_pages() to verify the swap
entry is mapped only in one vma before relocating the swap entry to a
different virtual address. Otherwise, if the swap entry is mapped in
multiple vmas, when the page is swapped back in, it could get mapped
in a non-linear way in some anon_vma.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/swap.h |  6 ++
 mm/swapfile.c| 13 +
 2 files changed, 19 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4759491..9adda11 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -436,6 +436,7 @@ extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
+extern int swp_entry_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern int reuse_swap_page(struct page *);
 extern int try_to_free_swap(struct page *);
@@ -527,6 +528,11 @@ static inline int page_swapcount(struct page *page)
return 0;
 }
 
+static inline int swp_entry_swapcount(swp_entry_t entry)
+{
+   return 0;
+}
+
 #define reuse_swap_page(page)  (page_mapcount(page) == 1)
 
 static inline int try_to_free_swap(struct page *page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63f55cc..04c7621 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -874,6 +874,19 @@ int page_swapcount(struct page *page)
return count;
 }
 
+int swp_entry_swapcount(swp_entry_t entry)
+{
+   int count = 0;
+   struct swap_info_struct *p;
+
+   p = swap_info_get(entry);
+   if (p) {
+   count = swap_count(p->swap_map[swp_offset(entry)]);
+   spin_unlock(&p->lock);
+   }
+   return count;
+}
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content

[Qemu-devel] [PATCH 12/21] userfaultfd: activate syscall

2015-03-05 Thread Andrea Arcangeli

This activates the userfaultfd syscall.

Signed-off-by: Andrea Arcangeli 
---
 arch/powerpc/include/asm/systbl.h  | 1 +
 arch/powerpc/include/asm/unistd.h  | 2 +-
 arch/powerpc/include/uapi/asm/unistd.h | 1 +
 arch/x86/syscalls/syscall_32.tbl   | 1 +
 arch/x86/syscalls/syscall_64.tbl   | 1 +
 include/linux/syscalls.h   | 1 +
 kernel/sys_ni.c| 1 +
 7 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 91062ee..7f21cfd 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -367,3 +367,4 @@ SYSCALL_SPU(getrandom)
 SYSCALL_SPU(memfd_create)
 SYSCALL_SPU(bpf)
 COMPAT_SYS(execveat)
+SYSCALL_SPU(userfaultfd)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 36b79c3..f4f8b66 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include 
 
 
-#define __NR_syscalls  363
+#define __NR_syscalls  364
 
 #define __NR__exit __NR_exit
 #define NR_syscalls    __NR_syscalls
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index ef5b5b1..4b4f21e 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -385,5 +385,6 @@
 #define __NR_memfd_create  360
 #define __NR_bpf   361
 #define __NR_execveat  362
+#define __NR_userfaultfd   363
 
 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..a20f0b8 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356   i386   memfd_create   sys_memfd_create
 357   i386   bpf            sys_bpf
 358   i386   execveat       sys_execveat       stub32_execveat
+359   i386   userfaultfd    sys_userfaultfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..f320b19 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320   common   kexec_file_load   sys_kexec_file_load
 321   common   bpf               sys_bpf
 322   64       execveat          stub_execveat
+323   common   userfaultfd       sys_userfaultfd
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..adf5901 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -810,6 +810,7 @@ asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_userfaultfd(int flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5adcb0a..2a10e42 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -204,6 +204,7 @@ cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
 cond_syscall(sys_memfd_create);
+cond_syscall(sys_userfaultfd);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);

[Qemu-devel] [PATCH 13/21] userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI

2015-03-05 Thread Andrea Arcangeli

This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

Signed-off-by: Andrea Arcangeli 
---
 include/uapi/linux/userfaultfd.h | 46 +++-
 1 file changed, 45 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 9a8cd56..61251e6 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -17,7 +17,9 @@
 (__u64)1 << _UFFDIO_UNREGISTER |   \
 (__u64)1 << _UFFDIO_API)
 #define UFFD_API_RANGE_IOCTLS  \
-   ((__u64)1 << _UFFDIO_WAKE)
+   ((__u64)1 << _UFFDIO_WAKE | \
+(__u64)1 << _UFFDIO_COPY | \
+(__u64)1 << _UFFDIO_ZEROPAGE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
@@ -30,6 +32,8 @@
 #define _UFFDIO_REGISTER   (0x00)
 #define _UFFDIO_UNREGISTER (0x01)
 #define _UFFDIO_WAKE   (0x02)
+#define _UFFDIO_COPY   (0x03)
+#define _UFFDIO_ZEROPAGE   (0x04)
 #define _UFFDIO_API    (0x3F)
 
 /* userfaultfd ioctl ids */
@@ -42,6 +46,10 @@
 struct uffdio_range)
 #define UFFDIO_WAKE _IOR(UFFDIO, _UFFDIO_WAKE,  \
  struct uffdio_range)
+#define UFFDIO_COPY _IOWR(UFFDIO, _UFFDIO_COPY, \
+ struct uffdio_copy)
+#define UFFDIO_ZEROPAGE _IOWR(UFFDIO, _UFFDIO_ZEROPAGE, \
+ struct uffdio_zeropage)
 
 /*
  * Valid bits below PAGE_SHIFT in the userfault address read through
@@ -78,4 +86,40 @@ struct uffdio_register {
__u64 ioctls;
 };
 
+struct uffdio_copy {
+   __u64 dst;
+   __u64 src;
+   __u64 len;
+   /*
+* There will be a wrprotection flag later that allows to map
+* pages wrprotected on the fly. And such a flag will be
+* available if the wrprotection ioctl are implemented for the
+* range according to the uffdio_register.ioctls.
+*/
+#define UFFDIO_COPY_MODE_DONTWAKE  ((__u64)1<<0)
+   __u64 mode;
+
+   /*
+* "copy" and "wake" are written by the ioctl and must be at
+* the end: the copy_from_user will not read the last 16
+* bytes.
+*/
+   __s64 copy;
+   __s64 wake;
+};
+
+struct uffdio_zeropage {
+   struct uffdio_range range;
+#define UFFDIO_ZEROPAGE_MODE_DONTWAKE  ((__u64)1<<0)
+   __u64 mode;
+
+   /*
+* "zeropage" and "wake" are written by the ioctl and must be
+* at the end: the copy_from_user will not read the last 16
+* bytes.
+*/
+   __s64 zeropage;
+   __s64 wake;
+};
+
 #endif /* _LINUX_USERFAULTFD_H */
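
Editorial sketch (not part of the patch): the comments in struct uffdio_copy say the last 16 bytes are output-only, which implies a simple size calculation, and UFFD_API_RANGE_IOCTLS is a bitmask indexed by ioctl command number. The following is a minimal illustration of both points, assuming a local mirror of the proposed layout. Every name ending in `_rfc` and both helper functions are hypothetical, and the header that is eventually merged may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct uffdio_copy as proposed in this patch.
 * Hypothetical name; this is not the kernel's uapi header. */
struct uffdio_copy_rfc {
    uint64_t dst;
    uint64_t src;
    uint64_t len;
    uint64_t mode;
    /* Output-only fields, written back by the ioctl. */
    int64_t copy;
    int64_t wake;
};

/* ioctl command numbers from the patch, used as bit positions. */
enum {
    UFFDIO_WAKE_NR_RFC     = 0x02,
    UFFDIO_COPY_NR_RFC     = 0x03,
    UFFDIO_ZEROPAGE_NR_RFC = 0x04,
};

/* Bytes of input that precede the output-only fields. */
static size_t uffdio_copy_input_bytes(void)
{
    return offsetof(struct uffdio_copy_rfc, copy);
}

/* Mirror of UFFD_API_RANGE_IOCTLS after this patch. */
static uint64_t uffd_range_ioctls_rfc(void)
{
    return (uint64_t)1 << UFFDIO_WAKE_NR_RFC |
           (uint64_t)1 << UFFDIO_COPY_NR_RFC |
           (uint64_t)1 << UFFDIO_ZEROPAGE_NR_RFC;
}

/* Nonzero if the ioctl with command number nr is advertised in the mask. */
static int uffd_ioctl_supported(uint64_t ioctls, unsigned nr)
{
    assert(nr < 64); /* the mask only has 64 bit positions */
    return (int)((ioctls >> nr) & 1);
}
```

With this layout the struct is 48 bytes, of which the first 32 are input, matching the "last 16 bytes" wording in the comment. A caller would test `uffd_ioctl_supported(mask, UFFDIO_COPY_NR_RFC)` before relying on the copy ioctl for a range.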

[Qemu-devel] [PATCH 08/21] userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx

2015-03-05 Thread Andrea Arcangeli

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware of, so that we can merge vmas back to how they were
originally, before the userfaultfd was armed on some memory range.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h |  2 +-
 mm/madvise.c   |  3 ++-
 mm/mempolicy.c |  4 ++--
 mm/mlock.c |  3 ++-
 mm/mmap.c  | 39 +++
 mm/mprotect.c  |  3 ++-
 6 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 762ef9d..26cef61 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1879,7 +1879,7 @@ extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
struct vm_area_struct *prev, unsigned long addr, unsigned long end,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-   struct mempolicy *);
+   struct mempolicy *, struct vm_userfaultfd_ctx);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int split_vma(struct mm_struct *,
struct vm_area_struct *, unsigned long addr, int new_below);
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..10f62b7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -102,7 +102,8 @@ static long madvise_behavior(struct vm_area_struct *vma,
 
    pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
    *prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
-   vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
    if (*prev) {
        vma = *prev;
        goto success;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4721046..e1a2e9b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -722,8 +722,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
pgoff = vma->vm_pgoff +
((vmstart - vma->vm_start) >> PAGE_SHIFT);
prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
- vma->anon_vma, vma->vm_file, pgoff,
- new_pol);
+vma->anon_vma, vma->vm_file, pgoff,
+new_pol, vma->vm_userfaultfd_ctx);
if (prev) {
vma = prev;
next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73cf098..9725abe 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -566,7 +566,8 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
    pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
    *prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
- vma->vm_file, pgoff, vma_policy(vma));
+ vma->vm_file, pgoff, vma_policy(vma),
+ vma->vm_userfaultfd_ctx);
    if (*prev) {
        vma = *prev;
        goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index da9990a..135c2fa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -921,7 +922,8 @@ again:  remove_next = 1 + (end > next->vm_end);
  * per-vma resources, so we don't attempt to merge those.
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
-   struct file *file, unsigned long vm_flags)
+   struct file *file, unsigned long vm_flags,
+   struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
/*
 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -937,6 +939,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
return 0;
if (vma->vm_ops && vma->vm_ops->close)
return 0;
+   if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
+   return 0;
return 1;
 }
 
@@ -967,9 +971,11 @@ static inline int is_mergeable_anon_vma(struct anon_vma *anon_vma1,
  */
 static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
-   struct anon_vma *anon_vma, struct file *file, pgoff_t vm_pgoff)
+struct anon_vma *anon_vma, struct file *file,
+pgoff_t vm_pgoff,
+struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
 {
-   if (is_mergeable_vma(vma, file, vm_flags) &&
+   if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
if (vma->vm_pgoff == vm_pgoff)
return 1;
@@ -986,9 +992,11 @@ can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
  */
 static int
 can_vma_me

[Qemu-devel] [PATCH 11/21] userfaultfd: buildsystem activation

2015-03-05 Thread Andrea Arcangeli

This allows userfaultfd to be selected during configuration so that it is built.

Signed-off-by: Andrea Arcangeli 
---
 fs/Makefile  |  1 +
 init/Kconfig | 11 +++
 2 files changed, 12 insertions(+)

diff --git a/fs/Makefile b/fs/Makefile
index a88ac48..ba8ab62 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
 obj-$(CONFIG_SIGNALFD) += signalfd.o
 obj-$(CONFIG_TIMERFD)  += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
+obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
 obj-$(CONFIG_FS_DAX)   += dax.o
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d..580dae7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1550,6 +1550,17 @@ config ADVISE_SYSCALLS
  applications use these syscalls, you can disable this option to save
  space.
 
+config USERFAULTFD
+   bool "Enable userfaultfd() system call"
+   select ANON_INODES
+   default y
+   depends on MMU
+   help
+ Enable the userfaultfd() system call that allows to intercept and
+ handle page faults in userland.
+
+ If unsure, say Y.
+
 config PCI_QUIRKS
default y
bool "Enable PCI quirk workarounds" if EXPERT

[Qemu-devel] [PATCH 06/21] userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP

2015-03-05 Thread Andrea Arcangeli

These two flags get set in vma->vm_flags to tell the VM common code
if the userfaultfd is armed and in which mode (only tracking missing
faults, only tracking wrprotect faults or both). If neither flag is
set it means the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli 
---
 include/linux/mm.h | 2 ++
 kernel/fork.c  | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 47a9392..762ef9d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -123,8 +123,10 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_MAYSHARE    0x0080
 
 #define VM_GROWSDOWN   0x0100  /* general info on the segment */
+#define VM_UFFD_MISSING 0x0200  /* missing pages tracking */
 #define VM_PFNMAP  0x0400  /* Page-ranges managed without "struct page", just pure PFN */
 #define VM_DENYWRITE   0x0800  /* ETXTBSY on write attempts.. */
+#define VM_UFFD_WP 0x1000  /* wrprotect pages tracking */
 
 #define VM_LOCKED  0x2000
 #define VM_IO   0x4000 /* Memory mapped I/O or similar */
diff --git a/kernel/fork.c b/kernel/fork.c
index cb215c0..cfab6e9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -423,7 +423,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
        tmp->vm_mm = mm;
        if (anon_vma_fork(tmp, mpnt))
            goto fail_nomem_anon_vma_fork;
-       tmp->vm_flags &= ~VM_LOCKED;
+       tmp->vm_flags &= ~(VM_LOCKED|VM_UFFD_MISSING|VM_UFFD_WP);
        tmp->vm_next = tmp->vm_prev = NULL;
        tmp->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
        file = tmp->vm_file;
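
Editorial sketch (not part of the patch): the commit message describes four possible states encoded by two flag bits, and the fork hunk clears both bits in the child. A minimal user-space illustration follows, using the flag values listed in the hunk above. The enum and both function names are hypothetical and exist only for this example.

```c
#include <assert.h>

/* Flag values as listed in the include/linux/mm.h hunk above. */
#define VM_UFFD_MISSING 0x0200UL
#define VM_UFFD_WP      0x1000UL
#define VM_LOCKED       0x2000UL

/* Hypothetical decoding of the two userfaultfd bits. */
enum uffd_mode_sketch {
    UFFD_OFF,
    UFFD_MISSING_ONLY,
    UFFD_WP_ONLY,
    UFFD_BOTH,
};

/* Report which tracking mode, if any, is armed in vm_flags. */
static enum uffd_mode_sketch uffd_mode(unsigned long vm_flags)
{
    int missing = (vm_flags & VM_UFFD_MISSING) != 0;
    int wp = (vm_flags & VM_UFFD_WP) != 0;

    if (missing && wp)
        return UFFD_BOTH;
    if (missing)
        return UFFD_MISSING_ONLY;
    if (wp)
        return UFFD_WP_ONLY;
    return UFFD_OFF;
}

/* The masking dup_mmap() applies to the child's flags after this patch. */
static unsigned long child_vm_flags(unsigned long parent_flags)
{
    unsigned long child = parent_flags &
                          ~(VM_LOCKED | VM_UFFD_MISSING | VM_UFFD_WP);

    /* The child never inherits an armed userfaultfd. */
    assert(uffd_mode(child) == UFFD_OFF);
    return child;
}
```

All other bits pass through the mask unchanged, so a parent vma with VM_GROWSDOWN set keeps that bit in the child while losing the lock and userfaultfd bits.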

[Qemu-devel] [PATCH 02/21] userfaultfd: linux/Documentation/vm/userfaultfd.txt

2015-03-05 Thread Andrea Arcangeli

Add documentation.

Signed-off-by: Andrea Arcangeli 
---
 Documentation/vm/userfaultfd.txt | 97 
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/vm/userfaultfd.txt

diff --git a/Documentation/vm/userfaultfd.txt b/Documentation/vm/userfaultfd.txt
new file mode 100644
index 0000000..2ec296c
--- /dev/null
+++ b/Documentation/vm/userfaultfd.txt
@@ -0,0 +1,97 @@
+= Userfaultfd =
+
+== Objective ==
+
+Userfaults allow on-demand paging to be implemented from userland and,
+more generally, they allow userland to take control of various memory
+page faults, something otherwise only the kernel code could do.
+
+For example, userfaults allow a proper and more optimal implementation
+of the PROT_NONE+SIGSEGV trick.
+
+== Design ==
+
+Userfaults are delivered and resolved through the userfaultfd syscall.
+
+The userfaultfd (aside from registering and unregistering virtual
+memory ranges) provides for two primary functionalities:
+
+1) read/POLLIN protocol to notify a userland thread of the faults
+   happening
+
+2) various UFFDIO_* ioctls that can mangle over the virtual memory
+   regions registered in the userfaultfd that allows userland to
+   efficiently resolve the userfaults it receives via 1) or to mangle
+   the virtual memory in the background
+
+The real advantage of userfaults if compared to regular virtual memory
+management of mremap/mprotect is that the userfaults in all their
+operations never involve heavyweight structures like vmas (in fact the
+userfaultfd runtime load never takes the mmap_sem for writing).
+
+Vmas are not suitable for page(or hugepage)-granular fault tracking
+when dealing with virtual address spaces that could span
+Terabytes. Too many vmas would be needed for that.
+
+The userfaultfd, once opened by invoking the syscall, can also be
+passed using unix domain sockets to a manager process, so the same
+manager process could handle the userfaults of a multitude of
+different processes without them being aware of what is going on
+(well of course unless they later try to use the userfaultfd themselves
+on the same region the manager is already tracking, which is a corner
+case that would currently return -EBUSY).
+
+== API ==
+
+When first opened the userfaultfd must be enabled invoking the
+UFFDIO_API ioctl specifying an uffdio_api.api value set to UFFD_API
+which will specify the read/POLLIN protocol userland intends to speak
+on the UFFD. The UFFDIO_API ioctl if successful (i.e. if the requested
+uffdio_api.api is spoken also by the running kernel), will return into
+uffdio_api.bits and uffdio_api.ioctls two 64bit bitmasks of
+respectively the activated feature bits below PAGE_SHIFT in the
+userfault addresses returned by read(2) and the generic ioctl
+available.
+
+Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
+be invoked (if present in the returned uffdio_api.ioctls bitmask) to
+register a memory range in the userfaultfd by setting the
+uffdio_register structure accordingly. The uffdio_register.mode
+bitmask will specify to the kernel which kind of faults to track for
+the range (UFFDIO_REGISTER_MODE_MISSING would track missing
+pages). The UFFDIO_REGISTER ioctl will return the
+uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
+userfaults on the range registered. Not all ioctls will necessarily be
+supported for all memory types depending on the underlying virtual
+memory backend (anonymous memory vs tmpfs vs real filebacked
+mappings).
+
+Userland can use the uffdio_register.ioctls to mangle the virtual
+address space in the background (to add or potentially also remove
+memory from the userfaultfd registered range). This means an userfault
+could be triggering just before userland maps in the background the
+user-faulted page. To avoid POLLIN resulting in an unexpected blocking
+read (if the UFFD is not opened in nonblocking mode in the first
+place), we don't allow the background thread to wake userfaults that
+haven't been read by userland yet. If we did that, the
+UFFDIO_WAKE ioctl could likely be dropped. This may change in the future
+(with a UFFD_API protocol bump combined with the removal of the
+UFFDIO_WAKE ioctl) if it'll be demonstrated that it's a valid
+optimization and worthy to force userland to use the UFFD always in
+nonblocking mode if combined with POLLIN.
+
+userfaultfd is also a generic enough feature, that it allows KVM to
+implement postcopy live migration (one form of memory externalization
+consisting of a virtual machine running with part or all of its memory
+residing on a different node in the cloud) without having to modify a
+single line of KVM kernel code. Guest async page faults, FOLL_NOWAIT
+and all other GUP features work just fine in combination with
+userfaults (userfaults trigger async page faults in the guest
+scheduler so those guest processes that aren't waiting for userfaults
+can keep running in the guest vcpus).
+
+The primary ioctl to re

Re: [Qemu-devel] [PATCH RFC v3 24/27] COLO NIC: Implement NIC checkpoint and failover

2015-03-05 Thread Dr. David Alan Gilbert

* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:
> Signed-off-by: zhanghailiang 
> Signed-off-by: Gao feng 
> ---
>  include/net/colo-nic.h |  3 ++-
>  migration/colo.c   | 22 ++
>  net/colo-nic.c | 19 +++
>  3 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/include/net/colo-nic.h b/include/net/colo-nic.h
> index 67c9807..ddc21cd 100644
> --- a/include/net/colo-nic.h
> +++ b/include/net/colo-nic.h
> @@ -20,5 +20,6 @@ void colo_add_nic_devices(NetClientState *nc);
>  void colo_remove_nic_devices(NetClientState *nc);
>  
>  int colo_proxy_compare(void);
> -
> +int colo_proxy_failover(void);
> +int colo_proxy_checkpoint(void);
>  #endif
> diff --git a/migration/colo.c b/migration/colo.c
> index 579aabf..874971c 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -94,6 +94,11 @@ static void slave_do_failover(void)
>  ;
>  }
>  
> +if (colo_proxy_failover() != 0) {
> +error_report("colo proxy failed to do failover");
> +}
> +colo_proxy_destroy(COLO_SECONDARY_MODE);

I'm not sure if this is the best thing to do on a secondary failover.
If I understand correctly, when it's running, we have:


---+
   |br0---eth0
   |
 slave +-tun - xt_SECCOLO - br1---eth1
   |
---+

What I think colo-proxy-destroy is doing is rewiring that as:


---+
   | +--br0---eth0
   | |
 slave +-tun +  br1---eth1
   |
---+

but now we've lost the sequence number adjustment data that
was held in xt_SECCOLO and so you are likely to break existing TCP
connections.

Also, I don't think colo-proxy-script is passed a flag to let it
know whether the reason it's doing a slave_uninstall is due to
a failover or a simple shutdown; and so it assumes it has
to do the rewire for a failover.
(Actually the script in the qemu repo is newer than the script in
the colo-proxy repo, that one doesn't have the rewire at all).

Dave

> +
>  colo = NULL;
>  
>  if (!autostart) {
> @@ -115,7 +120,7 @@ static void master_do_failover(void)
>  if (!colo_runstate_is_stopped()) {
>  vm_stop_force_state(RUN_STATE_COLO);
>  }
> -
> +colo_proxy_destroy(COLO_PRIMARY_MODE);
>  if (s->state != MIG_STATE_ERROR) {
>  migrate_set_state(s, MIG_STATE_COLO, MIG_STATE_COMPLETED);
>  }
> @@ -245,6 +250,11 @@ static int do_colo_transaction(MigrationState *s, QEMUFile *control)
>  
>  qemu_fflush(trans);
>  
> +ret = colo_proxy_checkpoint();
> +if (ret < 0) {
> +goto out;
> +}
> +
>  ret = colo_ctl_put(s->file, COLO_CHECKPOINT_SEND);
>  if (ret < 0) {
>  goto out;
> @@ -387,8 +397,6 @@ out:
>  qemu_bh_schedule(s->cleanup_bh);
>  qemu_mutex_unlock_iothread();
>  
> -colo_proxy_destroy(COLO_PRIMARY_MODE);
> -
>  return NULL;
>  }
>  
> @@ -508,6 +516,12 @@ void *colo_process_incoming_checkpoints(void *opaque)
>  goto out;
>  }
>  
> +ret = colo_proxy_checkpoint();
> +if (ret < 0) {
> +goto out;
> +}
> +DPRINTF("proxy begin to do checkpoint\n");
> +
>  ret = colo_ctl_get(f, COLO_CHECKPOINT_SEND);
>  if (ret < 0) {
>  goto out;
> @@ -584,6 +598,7 @@ out:
>  * just kill slave
>  */
>  error_report("SVM is going to exit!");
> +colo_proxy_destroy(COLO_SECONDARY_MODE);
>  exit(1);
>  } else {
>  /* if we went here, means master may dead, we are doing failover */
> @@ -610,6 +625,5 @@ out:
>  
>  loadvm_exit_colo();
>  
> -colo_proxy_destroy(COLO_SECONDARY_MODE);
>  return NULL;
>  }
> diff --git a/net/colo-nic.c b/net/colo-nic.c
> index 563d661..02a454d 100644
> --- a/net/colo-nic.c
> +++ b/net/colo-nic.c
> @@ -379,6 +379,25 @@ void colo_proxy_destroy(int side)
>  cp_info.index = -1;
>  colo_nic_side = -1;
>  }
> +
> +int colo_proxy_failover(void)
> +{
> +if (colo_proxy_send(NULL, 0, COLO_FAILOVER) < 0) {
> +return -1;
> +}
> +
> +return 0;
> +}
> +
> +int colo_proxy_checkpoint(void)
> +{
> +if (colo_proxy_send(NULL, 0, COLO_CHECKPOINT) < 0) {
> +return -1;
> +}
> +
> +return 0;
> +}
> +
>  /*
>  do checkpoint: return 1
>  error: return -1
> -- 
> 1.7.12.4
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH 6/6] target-i386: Call cpu_exec_init() on realize

2015-03-05 Thread Andreas Färber

On 05.03.2015 at 17:42, Igor Mammedov wrote:
> On Thu,  5 Mar 2015 12:38:50 -0300
> Eduardo Habkost  wrote:
> 
>> To allow new code to ask the CPU classes for CPU model information and
>> allow QOM properties to be queried by qmp_device_list_properties(), we
>> need to be able to safely instantiate a X86CPU object without any
>> side-effects.
>>
>> cpu_exec_init() has lots of side-effects on global QEMU state, move it
>> to realize so it will be called only if the X86CPU instance is realized.
>>
>> For reference, this is the current cpu_exec_init() code:
>>
>>> void cpu_exec_init(CPUArchState *env)
>>> {
>>> CPUState *cpu = ENV_GET_CPU(env);
>>> CPUClass *cc = CPU_GET_CLASS(cpu);
>>> CPUState *some_cpu;
>>> int cpu_index;
>>>
>>> #ifndef CONFIG_USER_ONLY
>>> cpu->as = &address_space_memory;
>>> cpu->thread_id = qemu_get_thread_id();
>>> #endif
>>
>> Those fields should be used only after actually starting the VCPU and can be
>> initialized on realize.
>>
>>>
>>> #if defined(CONFIG_USER_ONLY)
>>> cpu_list_lock();
>>> #endif
>>> cpu_index = 0;
>>> CPU_FOREACH(some_cpu) {
>>> cpu_index++;
>>> }
>>> cpu->cpu_index = cpu_index;
>>> QTAILQ_INSERT_TAIL(&cpus, cpu, node);
>>> #if defined(CONFIG_USER_ONLY)
>>> cpu_list_unlock();
>>> #endif
>>
>> The above initializes cpu_index and add the CPU to the global CPU list.
>> This affects QEMU global state and must be done only on realize.
>>
>>> if (qdev_get_vmsd(DEVICE(cpu)) == NULL) {
>>> vmstate_register(NULL, cpu_index, &vmstate_cpu_common, cpu);
>>> }
>>> #if defined(CPU_SAVE_VERSION) && !defined(CONFIG_USER_ONLY)
>>> register_savevm(NULL, "cpu", cpu_index, CPU_SAVE_VERSION,
>>> cpu_save, cpu_load, env);
>>> assert(cc->vmsd == NULL);
>>> assert(qdev_get_vmsd(DEVICE(cpu)) == NULL);
>>> #endif
>>> if (cc->vmsd != NULL) {
>>> vmstate_register(NULL, cpu_index, cc->vmsd, cpu);
>>> }
>>
>> vmstate and savevm registration also affects global QEMU state and should be
>> done only on realize.
>>
>>> }
>>
>> Signed-off-by: Eduardo Habkost 
>> ---
>>  target-i386/cpu.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
>> index 400b1e0..8b76604 100644
>> --- a/target-i386/cpu.c
>> +++ b/target-i386/cpu.c
>> @@ -2758,6 +2758,8 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>>  static bool ht_warned;
>>  static bool tcg_initialized;
>>  
>> +cpu_exec_init(env);
>> +
>>  if (tcg_enabled() && !tcg_initialized) {
>>  tcg_initialized = 1;
>>  tcg_x86_init();
>> @@ -2840,7 +2842,6 @@ static void x86_cpu_initfn(Object *obj)
>>  CPUX86State *env = &cpu->env;
>>  
>>  cs->env_ptr = env;
>> -cpu_exec_init(env);
> Looks wrong: later in this function we do
>  env->cpuid_apic_id = x86_cpu_apic_id_from_index(cs->cpu_index);
> and with this patch that will always yield 0.

Being tackled in Eduardo's APIC series. ;)

Cheers,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip Upmanyu,
Graham Norton; HRB 21284 (AG Nürnberg)

Re: [Qemu-devel] [PATCH 6/6] target-i386: Call cpu_exec_init() on realize

2015-03-05 Thread Igor Mammedov

On Thu,  5 Mar 2015 12:38:50 -0300
Eduardo Habkost  wrote:

> To allow new code to ask the CPU classes for CPU model information and
> allow QOM properties to be queried by qmp_device_list_properties(), we
> need to be able to safely instantiate a X86CPU object without any
> side-effects.
> 
> cpu_exec_init() has lots of side-effects on global QEMU state, move it
> to realize so it will be called only if the X86CPU instance is realized.
> 
> For reference, this is the current cpu_exec_init() code:
> 
> > void cpu_exec_init(CPUArchState *env)
> > {
> > CPUState *cpu = ENV_GET_CPU(env);
> > CPUClass *cc = CPU_GET_CLASS(cpu);
> > CPUState *some_cpu;
> > int cpu_index;
> >
> > #ifndef CONFIG_USER_ONLY
> > cpu->as = &address_space_memory;
> > cpu->thread_id = qemu_get_thread_id();
> > #endif
> 
> Those fields should be used only after actually starting the VCPU and can be
> initialized on realize.
> 
> >
> > #if defined(CONFIG_USER_ONLY)
> > cpu_list_lock();
> > #endif
> > cpu_index = 0;
> > CPU_FOREACH(some_cpu) {
> > cpu_index++;
> > }
> > cpu->cpu_index = cpu_index;
> > QTAILQ_INSERT_TAIL(&cpus, cpu, node);
> > #if defined(CONFIG_USER_ONLY)
> > cpu_list_unlock();
> > #endif
> 
> The above initializes cpu_index and add the CPU to the global CPU list.
> This affects QEMU global state and must be done only on realize.
> 
> > if (qdev_get_vmsd(DEVICE(cpu)) == NULL) {
> > vmstate_register(NULL, cpu_index, &vmstate_cpu_common, cpu);
> > }
> > #if defined(CPU_SAVE_VERSION) && !defined(CONFIG_USER_ONLY)
> > register_savevm(NULL, "cpu", cpu_index, CPU_SAVE_VERSION,
> > cpu_save, cpu_load, env);
> > assert(cc->vmsd == NULL);
> > assert(qdev_get_vmsd(DEVICE(cpu)) == NULL);
> > #endif
> > if (cc->vmsd != NULL) {
> > vmstate_register(NULL, cpu_index, cc->vmsd, cpu);
> > }
> 
> vmstate and savevm registration also affects global QEMU state and should be
> done only on realize.
> 
> > }
> 
> Signed-off-by: Eduardo Habkost 
> ---
>  target-i386/cpu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 400b1e0..8b76604 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2758,6 +2758,8 @@ static void x86_cpu_realizefn(DeviceState *dev, Error **errp)
>  static bool ht_warned;
>  static bool tcg_initialized;
>  
> +cpu_exec_init(env);
> +
>  if (tcg_enabled() && !tcg_initialized) {
>  tcg_initialized = 1;
>  tcg_x86_init();
> @@ -2840,7 +2842,6 @@ static void x86_cpu_initfn(Object *obj)
>  CPUX86State *env = &cpu->env;
>  
>  cs->env_ptr = env;
> -cpu_exec_init(env);
Looks wrong: later in this function we do
 env->cpuid_apic_id = x86_cpu_apic_id_from_index(cs->cpu_index);
and with this patch that will always yield 0.

>  
>  object_property_add(obj, "family", "int",
>  x86_cpuid_version_get_family,
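
Editorial sketch (not part of the thread): the ordering problem raised in this review can be modelled in a few lines. Everything below is a toy stand-in with hypothetical `toy_` names; it does not use QEMU's real types, and the identity mapping from index to APIC ID is an assumption made only to keep the example small.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for CPUState/X86CPU; not QEMU's real types. */
struct toy_cpu {
    int cpu_index;   /* stays 0 until toy_exec_init() assigns it */
    int apic_id;
    int realized;
};

/* Stands in for the length of the global CPU list. */
static int toy_cpu_count;

/* Like cpu_exec_init(): the index is the number of CPUs already listed. */
static void toy_exec_init(struct toy_cpu *cpu)
{
    cpu->cpu_index = toy_cpu_count++;
}

/* Stand-in for x86_cpu_apic_id_from_index(); identity mapping assumed. */
static int toy_apic_id_from_index(int cpu_index)
{
    return cpu_index;
}

/* instance_init with the patch applied: the object is zero-filled and
 * cpu_exec_init() has not run yet, so cpu_index still reads as 0. */
static void toy_initfn(struct toy_cpu *cpu)
{
    memset(cpu, 0, sizeof(*cpu));
    cpu->apic_id = toy_apic_id_from_index(cpu->cpu_index);
}

/* realize with the patch applied: the index is assigned only now. */
static void toy_realizefn(struct toy_cpu *cpu)
{
    assert(!cpu->realized);
    toy_exec_init(cpu);
    cpu->realized = 1;
}
```

Creating and realizing two toy CPUs in sequence leaves the second one with `cpu_index == 1` but `apic_id == 0`, because the APIC ID was derived before the index existed. In this model the fix is to derive the APIC ID after the index is assigned.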

Re: [Qemu-devel] [PATCH 4/6] target-i386: Rename optimize_flags_init()

2015-03-05 Thread Eduardo Habkost

On Thu, Mar 05, 2015 at 05:31:39PM +0100, Igor Mammedov wrote:
> On Thu,  5 Mar 2015 12:38:48 -0300
> Eduardo Habkost  wrote:
> 
> > Rename the function so that the reason for its existence is clearer: it
> > does x86-specific initialization of TCG structures.
> > 
> > Signed-off-by: Eduardo Habkost 
> > ---
> >  target-i386/cpu.c   | 2 +-
> >  target-i386/cpu.h   | 2 +-
> >  target-i386/translate.c | 2 +-
> >  3 files changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index 50907d0..b4e70d3 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -2883,7 +2883,7 @@ static void x86_cpu_initfn(Object *obj)
> >  /* init various static tables used in TCG mode */
> >  if (tcg_enabled() && !inited) {
> >  inited = 1;
> > -optimize_flags_init();
> > +tcg_x86_init();
> >  }
> how about moving 'inited' handling inside of tcg_x86_init() along with renaming?

Makes sense, and it will help simplify patch 5/6. But I'll do that in a
separate patch.

> 
> >  }
> >  
> > diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> > index 0638d24..52b460a 100644
> > --- a/target-i386/cpu.h
> > +++ b/target-i386/cpu.h
> > @@ -1228,7 +1228,7 @@ static inline target_long lshift(target_long x, int n)
> >  #define ST1    ST(1)
> >  
> >  /* translate.c */
> > -void optimize_flags_init(void);
> > +void tcg_x86_init(void);
> >  
> >  #include "exec/cpu-all.h"
> >  #include "svm.h"
> > diff --git a/target-i386/translate.c b/target-i386/translate.c
> > index 094cec0..f19f20f 100644
> > --- a/target-i386/translate.c
> > +++ b/target-i386/translate.c
> > @@ -7852,7 +7852,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
> >  return s->pc;
> >  }
> >  
> > -void optimize_flags_init(void)
> > +void tcg_x86_init(void)
> >  {
> >  static const char reg_names[CPU_NB_REGS][4] = {
> >  #ifdef TARGET_X86_64
> 

-- 
Eduardo
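
Editorial sketch (not part of the thread): the suggestion discussed above is to keep the run-once guard inside the init function rather than at each call site. One possible shape is shown below. The `toy_` names are hypothetical, the counter stands in for the real table setup, and the guard is a plain static flag with no locking, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how often the one-time setup actually ran. */
static int toy_tables_built;

/* Hypothetical shape of tcg_x86_init() with the guard moved inside. */
static void toy_tcg_x86_init(void)
{
    static bool initialized;

    if (initialized) {
        return;             /* later calls are no-ops */
    }
    initialized = true;

    toy_tables_built++;     /* one-time table setup would go here */
    assert(toy_tables_built == 1);
}
```

A caller would then need only `if (tcg_enabled()) { toy_tcg_x86_init(); }`, without a static flag of its own.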

Re: [Qemu-devel] [PATCH 3/6] cpu: Reorder cpu->as and cpu->thread_id initialization

2015-03-05 Thread Igor Mammedov

On Thu,  5 Mar 2015 12:38:47 -0300
Eduardo Habkost  wrote:

> Instead of initializing cpu->as and cpu->thread_id while holding
> cpu_list_lock(), initialize it earlier.
> 
> This allows the code handling cpu_index and global CPU list to be
> isolated from the rest.
> 
> Signed-off-by: Eduardo Habkost 
Reviewed-by: Igor Mammedov 

> ---
>  exec.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 8220535..2e370d0 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -534,6 +534,11 @@ void cpu_exec_init(CPUArchState *env)
>  CPUState *some_cpu;
>  int cpu_index;
>  
> +#ifndef CONFIG_USER_ONLY
> +cpu->as = &address_space_memory;
> +cpu->thread_id = qemu_get_thread_id();
> +#endif
> +
>  #if defined(CONFIG_USER_ONLY)
>  cpu_list_lock();
>  #endif
> @@ -542,10 +547,6 @@ void cpu_exec_init(CPUArchState *env)
>  cpu_index++;
>  }
>  cpu->cpu_index = cpu_index;
> -#ifndef CONFIG_USER_ONLY
> -cpu->as = &address_space_memory;
> -cpu->thread_id = qemu_get_thread_id();
> -#endif
>  QTAILQ_INSERT_TAIL(&cpus, cpu, node);
>  #if defined(CONFIG_USER_ONLY)
>  cpu_list_unlock();

Re: [Qemu-devel] [PATCH 2/6] cpu: Initialize breakpoint/watchpoint lists on cpu_common_initfn()

2015-03-05 Thread Igor Mammedov

On Thu,  5 Mar 2015 12:38:46 -0300
Eduardo Habkost  wrote:

> One small step in the simplification of cpu_exec_init().
> 
> Signed-off-by: Eduardo Habkost 
Reviewed-by: Igor Mammedov 

> ---
>  exec.c| 2 --
>  qom/cpu.c | 2 ++
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 3a61e51..8220535 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -542,8 +542,6 @@ void cpu_exec_init(CPUArchState *env)
>  cpu_index++;
>  }
>  cpu->cpu_index = cpu_index;
> -QTAILQ_INIT(&cpu->breakpoints);
> -QTAILQ_INIT(&cpu->watchpoints);
>  #ifndef CONFIG_USER_ONLY
>  cpu->as = &address_space_memory;
>  cpu->thread_id = qemu_get_thread_id();
> diff --git a/qom/cpu.c b/qom/cpu.c
> index 970377e..b69ac41 100644
> --- a/qom/cpu.c
> +++ b/qom/cpu.c
> @@ -313,6 +313,8 @@ static void cpu_common_initfn(Object *obj)
>  CPUClass *cc = CPU_GET_CLASS(obj);
>  
>  cpu->gdb_num_regs = cpu->gdb_num_g_regs = cc->gdb_num_core_regs;
> +QTAILQ_INIT(&cpu->breakpoints);
> +QTAILQ_INIT(&cpu->watchpoints);
>  }
>  
>  static int64_t cpu_common_get_arch_id(CPUState *cpu)

Re: [Qemu-devel] [PATCH 4/6] target-i386: Rename optimize_flags_init()

2015-03-05 Thread Igor Mammedov

On Thu,  5 Mar 2015 12:38:48 -0300
Eduardo Habkost  wrote:

> Rename the function so that the reason for its existence is clearer: it
> does x86-specific initialization of TCG structures.
> 
> Signed-off-by: Eduardo Habkost 
> ---
>  target-i386/cpu.c   | 2 +-
>  target-i386/cpu.h   | 2 +-
>  target-i386/translate.c | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 50907d0..b4e70d3 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2883,7 +2883,7 @@ static void x86_cpu_initfn(Object *obj)
>  /* init various static tables used in TCG mode */
>  if (tcg_enabled() && !inited) {
>  inited = 1;
> -optimize_flags_init();
> +tcg_x86_init();
>  }
How about moving the 'inited' handling inside tcg_x86_init() along with the rename?

>  }
>  
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 0638d24..52b460a 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -1228,7 +1228,7 @@ static inline target_long lshift(target_long x, int n)
>  #define ST1    ST(1)
>  
>  /* translate.c */
> -void optimize_flags_init(void);
> +void tcg_x86_init(void);
>  
>  #include "exec/cpu-all.h"
>  #include "svm.h"
> diff --git a/target-i386/translate.c b/target-i386/translate.c
> index 094cec0..f19f20f 100644
> --- a/target-i386/translate.c
> +++ b/target-i386/translate.c
> @@ -7852,7 +7852,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
>  return s->pc;
>  }
>  
> -void optimize_flags_init(void)
> +void tcg_x86_init(void)
>  {
>  static const char reg_names[CPU_NB_REGS][4] = {
>  #ifdef TARGET_X86_64
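
The review suggestion above (folding the 'inited' handling into the renamed function) could look roughly like the following sketch. demo_tcg_init() and demo_init_runs are hypothetical names used only for illustration; this is not the code that was eventually committed.

```c
#include <stdbool.h>

/* Counts how often the initialization body actually executes. */
static int demo_init_runs;

/* Hypothetical stand-in for tcg_x86_init() with the guard folded in. */
void demo_tcg_init(void)
{
    static bool inited;

    if (inited) {
        return;     /* later calls are no-ops */
    }
    inited = true;

    /* The real function would set up its static tables here. */
    demo_init_runs++;
}
```

The caller would then shrink to a plain "if (tcg_enabled()) demo_tcg_init();". Like the plain flag in the quoted hunk, this guard is not protected against two threads making the first call concurrently.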

[Qemu-devel] [PATCH 1/6] cpu: No need to zero-initialize numa_node

2015-03-05 Thread Eduardo Habkost

QOM objects are already zero-filled when instantiated, so there's no need
to explicitly set numa_node to 0.

Signed-off-by: Eduardo Habkost 
---
 exec.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/exec.c b/exec.c
index c85321a..3a61e51 100644
--- a/exec.c
+++ b/exec.c
@@ -542,7 +542,6 @@ void cpu_exec_init(CPUArchState *env)
 cpu_index++;
 }
 cpu->cpu_index = cpu_index;
-cpu->numa_node = 0;
 QTAILQ_INIT(&cpu->breakpoints);
 QTAILQ_INIT(&cpu->watchpoints);
 #ifndef CONFIG_USER_ONLY
-- 
2.1.0
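
As a rough illustration of the reasoning in this commit message, the sketch below uses calloc() as a stand-in for an allocator that zero-fills new objects. DemoCPU and demo_cpu_new() are hypothetical names, not QOM API.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the CPU object. */
typedef struct DemoCPU {
    int cpu_index;
    int numa_node;
} DemoCPU;

/*
 * Stand-in for object instantiation: calloc() returns zero-filled
 * memory, so every integer field already reads as 0.
 */
static DemoCPU *demo_cpu_new(void)
{
    return calloc(1, sizeof(DemoCPU));
}
```

Any later "cpu->numa_node = 0;" on such an object stores a value that is already there, which is why the assignment can be removed without changing behavior.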

Re: [Qemu-devel] [PATCH 1/6] cpu: No need to zero-initialize numa_node

2015-03-05 Thread Igor Mammedov

On Thu,  5 Mar 2015 12:38:45 -0300
Eduardo Habkost  wrote:

> QOM objects are already zero-filled when instantiated, so there's no need
> to explicitly set numa_node to 0.
> 
> Signed-off-by: Eduardo Habkost 
Reviewed-by: Igor Mammedov 

> ---
>  exec.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/exec.c b/exec.c
> index c85321a..3a61e51 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -542,7 +542,6 @@ void cpu_exec_init(CPUArchState *env)
>  cpu_index++;
>  }
>  cpu->cpu_index = cpu_index;
> -cpu->numa_node = 0;
>  QTAILQ_INIT(&cpu->breakpoints);
>  QTAILQ_INIT(&cpu->watchpoints);
>  #ifndef CONFIG_USER_ONLY

Re: [Qemu-devel] [PATCH v4 4/5] target-i386: Move APIC ID compatibility code to pc.c

2015-03-05 Thread Andreas Färber

On 05.03.2015 at 14:37, Eduardo Habkost wrote:
> On Thu, Mar 05, 2015 at 01:32:02AM +0100, Andreas Färber wrote:
>> On 04.03.2015 at 03:13, Eduardo Habkost wrote:
>>> The APIC ID compatibility code is required only for PC, and now that
>>> x86_cpu_initfn() doesn't use x86_cpu_apic_id_from_index() anymore, that
>>> code can be moved to pc.c.
>>>
>>> Reviewed-by: Paolo Bonzini 
>>> Reviewed-by: Andreas Färber 
>>> Signed-off-by: Eduardo Habkost 
>>> ---
>>>  hw/i386/pc.c  | 35 +++
>>>  target-i386/cpu.c | 34 --
>>>  2 files changed, 35 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>> index b229856..8c3c470 100644
>>> --- a/hw/i386/pc.c
>>> +++ b/hw/i386/pc.c
[...]
>>> @@ -629,6 +631,39 @@ bool e820_get_entry(int idx, uint32_t type, uint64_t *address, uint64_t *length)
>>>  return false;
>>>  }
>>>  
>>> +/* Enables contiguous-apic-ID mode, for compatibility */
>>> +static bool compat_apic_id_mode;
>>> +
>>> +void enable_compat_apic_id_mode(void)
>>> +{
>>> +compat_apic_id_mode = true;
>>> +}
>>> +
>>> +/* Calculates initial APIC ID for a specific CPU index
>>> + *
>>> + * Currently we need to be able to calculate the APIC ID from the CPU index
>>> + * alone (without requiring a CPU object), as the QEMU<->Seabios 
>>> interfaces have
>>> + * no concept of "CPU index", and the NUMA tables on fw_cfg need the APIC 
>>> ID of
>>> + * all CPUs up to max_cpus.
>>> + */
>>> +uint32_t x86_cpu_apic_id_from_index(unsigned int cpu_index)
>>
>> Looking a bit closer here, as I am poking around its call site for the
>> socket modeling, can't this be made static while at it? (If so, don't
>> forget to drop the prototype.)
> 
> Yes, I'll make it static. I assume it's OK to do it before committing
> instead of resending the series because of that?

For me, sure. If you want to play it safe or be verbose, you can post the
diff as a reply when you apply.

Cheers,
Andreas

-- 
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip Upmanyu,
Graham Norton; HRB 21284 (AG Nürnberg)
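
For context on the function discussed in this thread, here is a hedged sketch of how an APIC ID might be derived from a CPU index alone, with a compatibility switch that falls back to contiguous IDs. The bit-packing scheme (package, core and thread fields, each wide enough for its count) and every demo_* name are assumptions made for illustration; the quoted patch does not show the real function body. The helper is declared static, in line with the review suggestion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed meaning of compat mode: APIC ID equals the CPU index. */
static bool demo_compat_apic_id_mode;

void demo_enable_compat_apic_id_mode(void)
{
    demo_compat_apic_id_mode = true;
}

/* Number of bits needed to hold the values 0..count-1. */
static unsigned demo_bits_for_count(unsigned count)
{
    unsigned bits = 0;

    while ((1u << bits) < count) {
        bits++;
    }
    return bits;
}

/* Assumes cores >= 1 and threads >= 1. */
static uint32_t demo_apic_id_from_index(unsigned cores, unsigned threads,
                                        unsigned cpu_index)
{
    unsigned smt_bits = demo_bits_for_count(threads);
    unsigned core_bits = demo_bits_for_count(cores);
    unsigned smt_id = cpu_index % threads;
    unsigned core_id = (cpu_index / threads) % cores;
    unsigned pkg_id = cpu_index / (threads * cores);

    if (demo_compat_apic_id_mode) {
        return cpu_index;
    }
    return (pkg_id << (core_bits + smt_bits)) | (core_id << smt_bits) | smt_id;
}
```

With 3 cores and 2 threads per core, the core field needs 2 bits, so the first CPU of the second package (index 6) gets ID 8 rather than 6. That gap is the kind of non-contiguity a compatibility mode would hide.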

[Qemu-devel] [PATCH 3/6] cpu: Reorder cpu->as and cpu->thread_id initialization

2015-03-05 Thread Eduardo Habkost

Instead of initializing cpu->as and cpu->thread_id while holding
cpu_list_lock(), initialize them earlier.

This allows the code handling cpu_index and global CPU list to be
isolated from the rest.

Signed-off-by: Eduardo Habkost 
---
 exec.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index 8220535..2e370d0 100644
--- a/exec.c
+++ b/exec.c
@@ -534,6 +534,11 @@ void cpu_exec_init(CPUArchState *env)
 CPUState *some_cpu;
 int cpu_index;
 
+#ifndef CONFIG_USER_ONLY
+cpu->as = &address_space_memory;
+cpu->thread_id = qemu_get_thread_id();
+#endif
+
 #if defined(CONFIG_USER_ONLY)
 cpu_list_lock();
 #endif
@@ -542,10 +547,6 @@ void cpu_exec_init(CPUArchState *env)
 cpu_index++;
 }
 cpu->cpu_index = cpu_index;
-#ifndef CONFIG_USER_ONLY
-cpu->as = &address_space_memory;
-cpu->thread_id = qemu_get_thread_id();
-#endif
 QTAILQ_INSERT_TAIL(&cpus, cpu, node);
 #if defined(CONFIG_USER_ONLY)
 cpu_list_unlock();
-- 
2.1.0
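
The reordering described in this patch follows a common locking pattern, sketched below in standalone form. The pthread mutex, the hand-rolled singly linked list and all demo_* names are hypothetical stand-ins; in the patch itself the lock is only taken in CONFIG_USER_ONLY builds.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the CPU object. */
typedef struct DemoCPU {
    int cpu_index;
    int thread_id;
    struct DemoCPU *next;
} DemoCPU;

static pthread_mutex_t demo_list_lock = PTHREAD_MUTEX_INITIALIZER;
static DemoCPU *demo_cpus;                     /* head of the global list */
static DemoCPU **demo_cpus_tail = &demo_cpus;  /* where the next node goes */

void demo_cpu_register(DemoCPU *cpu, int thread_id)
{
    DemoCPU *it;
    int index = 0;

    /* Per-CPU fields touch no shared state, so set them before locking. */
    cpu->thread_id = thread_id;
    cpu->next = NULL;

    /* Only index assignment and list insertion need the lock. */
    pthread_mutex_lock(&demo_list_lock);
    for (it = demo_cpus; it != NULL; it = it->next) {
        index++;
    }
    cpu->cpu_index = index;
    *demo_cpus_tail = cpu;
    demo_cpus_tail = &cpu->next;
    pthread_mutex_unlock(&demo_list_lock);
}
```

Keeping the locked region limited to the shared list makes it easier to separate that code from the rest of the initialization, which is the stated goal of the patch.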
