date:20151102

Re: [PATCH net-next rfc V2 0/2] basic busy polling support for vhost_net

2015-11-02 Thread Jason Wang



On 10/30/2015 07:58 PM, Jason Wang wrote:
>
> On 10/29/2015 04:45 PM, Jason Wang wrote:
>> Hi all:
>>
>> This series tries to add basic busy polling for vhost net. The idea is
>> simple: at the end of tx processing, busy polling for new tx added
>> descriptor and rx receive socket for a while. The maximum number of
>> time (in us) could be spent on busy polling was specified through
>> module parameter.
>>
>> Test were done through:
>>
>> - 50 us as busy loop timeout
>> - Netperf 2.6
>> - Two machines with back to back connected mlx4
>> - Guest with 8 vcpus and 1 queue
>>
>> Result shows very huge improvement on both tx (at most 158%) and rr
>> (at most 53%) while rx is as much as in the past. Most cases the cpu
>> utilization is also improved:
>>
> Just notice there's something wrong in the setup. So the numbers are
> incorrect here. Will re-run and post correct number here.
>
> Sorry.

Here's the updated testing result:

1) 1 vcpu 1 queue:

TCP_RR
size/session/+thu%/+normalize%
1/ 1/0%/  -25%
1/50/  +12%/0%
1/   100/  +12%/   +1%
1/   200/   +9%/   -1%
   64/ 1/   +3%/  -21%
   64/50/   +8%/0%
   64/   100/   +7%/0%
   64/   200/   +9%/0%
  256/ 1/   +1%/  -25%
  256/50/   +7%/   -2%
  256/   100/   +6%/   -2%
  256/   200/   +4%/   -2%
  512/ 1/   +2%/  -19%
  512/50/   +5%/   -2%
  512/   100/   +3%/   -3%
  512/   200/   +6%/   -2%
 1024/ 1/   +2%/  -20%
 1024/50/   +3%/   -3%
 1024/   100/   +5%/   -3%
 1024/   200/   +4%/   -2%
Guest RX
size/session/+thu%/+normalize%
   64/ 1/   -4%/   -5%
   64/ 4/   -3%/  -10%
   64/ 8/   -3%/   -5%
  512/ 1/  +15%/   +1%
  512/ 4/   -5%/   -5%
  512/ 8/   -2%/   -4%
 1024/ 1/   -5%/  -16%
 1024/ 4/   -2%/   -5%
 1024/ 8/   -6%/   -6%
 2048/ 1/  +10%/   +5%
 2048/ 4/   -8%/   -4%
 2048/ 8/   -1%/   -4%
 4096/ 1/   -9%/  -11%
 4096/ 4/   +1%/   -1%
 4096/ 8/   +1%/0%
16384/ 1/  +20%/  +11%
16384/ 4/0%/   -3%
16384/ 8/   +1%/0%
65535/ 1/  +36%/  +13%
65535/ 4/  -10%/   -9%
65535/ 8/   -3%/   -2%
Guest TX
size/session/+thu%/+normalize%
   64/ 1/   -7%/  -16%
   64/ 4/  -14%/  -23%
   64/ 8/   -9%/  -20%
  512/ 1/  -62%/  -56%
  512/ 4/  -62%/  -56%
  512/ 8/  -61%/  -53%
 1024/ 1/  -66%/  -61%
 1024/ 4/  -77%/  -73%
 1024/ 8/  -73%/  -67%
 2048/ 1/  -74%/  -75%
 2048/ 4/  -77%/  -74%
 2048/ 8/  -72%/  -68%
 4096/ 1/  -65%/  -68%
 4096/ 4/  -66%/  -63%
 4096/ 8/  -62%/  -57%
16384/ 1/  -25%/  -28%
16384/ 4/  -28%/  -17%
16384/ 8/  -24%/  -10%
65535/ 1/  -17%/  -14%
65535/ 4/  -22%/   -5%
65535/ 8/  -25%/   -9%

- obvious improvement on TCP_RR (at most 12%)
- improvement on guest RX
- huge decreasing on Guest TX (at most -75%), this is probably because
virtio-net driver suffers from buffer bloat by orphaning skb before
transmission. The faster vhost it is, the smaller packet it could
produced. To reduce the impact on this, turning off gso in guest can
result the following result:

size/session/+thu%/+normalize%
   64/ 1/   +3%/  -11%
   64/ 4/   +4%/  -10%
   64/ 8/   +4%/  -10%
  512/ 1/   +2%/   +5%
  512/ 4/0%/   -1%
  512/ 8/0%/0%
 1024/ 1/  +11%/0%
 1024/ 4/0%/   -1%
 1024/ 8/   +3%/   +1%
 2048/ 1/   +4%/   -1%
 2048/ 4/   +8%/   +3%
 2048/ 8/0%/   -1%
 4096/ 1/   +4%/   -1%
 4096/ 4/   +1%/0%
 4096/ 8/   +2%/0%
16384/ 1/   +2%/   -2%
16384/ 4/   +3%/   +1%
16384/ 8/0%/   -1%
65535/ 1/   +9%/   +7%
65535/ 4/0%/   -3%
65535/ 8/   -1%/   -1%

2) 8 vcpus 1 queue:

TCP_RR
size/session/+thu%/+normalize%
1/ 1/   +5%/  -14%
1/50/   +2%/   +1%
1/   100/0%/   -1%
1/   200/0%/0%
   64/ 1/0%/  -25%
   64/50/   +5%/   +5%
   64/   100/0%/0%
   64/   200/0%/   -1%
  256/ 1/0%/  -30%
  256/50/0%/0%
  256/   100/   -2%/   -2%
  256/   200/0%/0%
  512/ 1/   +1%/  -23%
  512/50/   +1%/   +1%
  512/   100/   +1%/0%
  512/   200/   +1%/   +1%
 1024/ 1/   +1%/  -23%
 1024/50/   +5%/   +5%
 1024/   100/0%/   -1%
 1024/   200/0%/0%
Guest RX
size/session/+thu%/+normalize%
   64/ 1/   +1%/   +1%
   64/ 4/   -2%/   +1%
   64/ 8/   +6%/  +19%
  512/ 1/   +5%/   -7%
  512/ 4/   -4%/   -4%
  512/ 8/0%/0%
 1024/ 1/   +1%/   +2%
 1024/ 4/   -2%/   -2%
 1024/ 8/   -1%/   +7%
 2048/ 1/   +8%/   -2%
 2048/ 4/0%/   +5%
 2048/ 8/   -1%/  +13%
 4096/ 1/   -1%/   +2%
 4096/ 4/0%/   +6%
 4096/ 8/   -2%/  +15%
16384/ 1/   -1%/0%
16384/ 4/   -2%/   -1%
16384/ 8/   -2%/   +2%
65535/ 1/   -2%/0%
65535/ 4/   -3%/   -3%
65535/ 8/   -2%/   +2%
Guest TX
size/session/+thu%/+normalize%
   64/ 1/   +6%/   +3%

RE: [PATCH v4 0/3] KVM: arm/arm64: Clean up some obsolete code

2015-11-02 Thread Pavel Fedin

 Hello!

> I ran this through my test scripts and I'm now quite sure that there's
> some breakage in here.
> 
> One of my tests is running two VMs in parallel, each booting up, running
> hackbench, and then doing reboot (from within the guest), and just
> repeating like that.
> 
> I've run your patches in the above config 100 times, and every time,
> the rebooting VMs got stuck before 50 reboots.
> 
> Without these patches, I could run the above config 100 times, and every
> time, the rebooting VMs passed 200 reboots.

 Huh, the description looks like some problem with vgic_retire_disabled_irqs(). 
By the way, during reboot, who does call it? The
only call i see is in vgic_handle_enable_reg(), which obviously just processes 
emulated register accesses...
 And the only thing i know is that in case of GICv2 the userland resets vGIC 
manually by resetting each register to its default
value (therefore all ENABLER are set to 0). At least qemu does this, and i'm 
not sure about kvmtool. And in case of vGICv3 nobody
can do this because there's no API to set registers yet. So, could we be 
rebooting with interrupts enabled or something like that?
 So: what kind of container are you running and what vGIC version? Does this 
problem reproduce with both vGICv2 and vGICv3?

 By this time i'll make a very minimal version of patch 0001, for you to test 
it. If we have problems with current 0001, which we
cannot solve quickly, we could stick to that version then, which will provide 
the necessary changes to plug in LPIs, yet with
minimal changes (it will only remove vgic_irq_lr_map).
 I guess i should have done it before. Or, i could even respin v5, with current 
0001 split up. This should make it easier to bisect
the problem.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kvm-unit-tests PATCH 00/14] ppc64: initial drop

2015-11-02 Thread Thomas Huth

On 03/08/15 16:41, Andrew Jones wrote:
> This series is the first series of a series of series that will
> bring support to kvm-unit-tests for ppc64, and eventually ppc64le.

 Hi Andrew,

may I ask about the current state of ppc64 support in the
kvm-unit-tests? Is there a newer version available than the one you
posted three months ago?

 Thanks,
  Thomas


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH v7 03/35] acpi: add aml_create_field

2015-11-02 Thread Shannon Zhao



On 2015/11/2 17:13, Xiao Guangrong wrote:
> Implement CreateField term which is used by NVDIMM _DSM method in later patch
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c | 13 +
>  include/hw/acpi/aml-build.h |  1 +
>  2 files changed, 14 insertions(+)
> 
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index a72214d..9fe5e7b 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1151,6 +1151,19 @@ Aml *aml_sizeof(Aml *arg)
>  return var;
>  }
>  
> +/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefCreateField */
> +Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name)
> +{
> +Aml *var = aml_alloc();
> +build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
> +build_append_byte(var->buf, 0x13); /* CreateFieldOp */
> +aml_append(var, srcbuf);
> +aml_append(var, index);
> +aml_append(var, len);
> +build_append_namestring(var->buf, "%s", name);
> +return var;
> +}
> +
>  void
>  build_header(GArray *linker, GArray *table_data,
>   AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 7296efb..7e1c43b 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -276,6 +276,7 @@ Aml *aml_touuid(const char *uuid);
>  Aml *aml_unicode(const char *str);
>  Aml *aml_derefof(Aml *arg);
>  Aml *aml_sizeof(Aml *arg);
> +Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
>  
Maybe this could be moved together with existing aml_create_dword_field.

>  void
>  build_header(GArray *linker, GArray *table_data,
> 

-- 
Shannon

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM: VMX: Fix commit which broke PML

2015-11-02 Thread Kai Huang

I found PML was broken since below commit:

commit feda805fe7c4ed9cf78158e73b1218752e3b4314
Author: Xiao Guangrong 
Date:   Wed Sep 9 14:05:55 2015 +0800

KVM: VMX: unify SECONDARY_VM_EXEC_CONTROL update

Unify the update in vmx_cpuid_update()

Signed-off-by: Xiao Guangrong 
[Rewrite to use vmcs_set_secondary_exec_control. - Paolo]
Signed-off-by: Paolo Bonzini 

The reason is PML after above commit vmx_cpuid_update calls
vmx_secondary_exec_control, in which PML is disabled unconditionally, as PML is
enabled in creating vcpu. Therefore if vcpu_cpuid_update is called after vcpu is
created, PML will be disabled unexpectedly while log-dirty code still think PML
is used. Actually looks calling vmx_secondary_exec_control in vmx_cpuid_update
is likely to break any VMX features that is enabled/disabled on demand by
updating SECONDARY_VM_EXEC_CONTROL, if vmx_cpuid_update is called between the
feature is enabled and disabled.

Fix this by calling vmcs_read32 to read out SECONDARY_VM_EXEC_CONTROL directly.

Signed-off-by: Kai Huang 
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4d0aa31..4525c0a7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8902,7 +8902,7 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   u32 secondary_exec_ctl = vmx_secondary_exec_control(vmx);
+   u32 secondary_exec_ctl = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
 
if (vmx_rdtscp_supported()) {
bool rdtscp_enabled = guest_cpuid_has_rdtscp(vcpu);
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 09/35] exec: allow file_ram_alloc to work on file

2015-11-02 Thread Xiao Guangrong




On 11/03/2015 05:12 AM, Paolo Bonzini wrote:



On 02/11/2015 10:13, Xiao Guangrong wrote:

Currently, file_ram_alloc() only works on directory - it creates a file
under @path and do mmap on it

This patch tries to allow it to work on file directly, if @path is a
directory it works as before, otherwise it treats @path as the target
file then directly allocate memory from it

Signed-off-by: Xiao Guangrong 
---
  exec.c | 80 ++
  1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/exec.c b/exec.c
index 9075f4d..db0fdaf 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
  }

  #ifdef __linux__
+static bool path_is_dir(const char *path)
+{
+struct stat fs;
+
+return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);
+}
+
+static int open_ram_file_path(RAMBlock *block, const char *path, size_t size)
+{
+char *filename;
+char *sanitized_name;
+char *c;
+int fd;
+
+if (!path_is_dir(path)) {
+int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
+
+flags |= O_EXCL;
+return open(path, flags);
+}
+
+/* Make name safe to use with mkstemp by replacing '/' with '_'. */
+sanitized_name = g_strdup(memory_region_name(block->mr));
+for (c = sanitized_name; *c != '\0'; c++) {
+if (*c == '/') {
+*c = '_';
+}
+}
+filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
+   sanitized_name);
+g_free(sanitized_name);
+fd = mkstemp(filename);
+if (fd >= 0) {
+unlink(filename);
+/*
+ * ftruncate is not supported by hugetlbfs in older
+ * hosts, so don't bother bailing out on errors.
+ * If anything goes wrong with it under other filesystems,
+ * mmap will fail.
+ */
+if (ftruncate(fd, size)) {
+perror("ftruncate");
+}
+}
+g_free(filename);
+
+return fd;
+}
+
  static void *file_ram_alloc(RAMBlock *block,
  ram_addr_t memory,
  const char *path,
  Error **errp)
  {
-char *filename;
-char *sanitized_name;
-char *c;
  void *area;
  int fd;
  uint64_t pagesize;
@@ -1212,38 +1258,14 @@ static void *file_ram_alloc(RAMBlock *block,
  goto error;
  }

-/* Make name safe to use with mkstemp by replacing '/' with '_'. */
-sanitized_name = g_strdup(memory_region_name(block->mr));
-for (c = sanitized_name; *c != '\0'; c++) {
-if (*c == '/')
-*c = '_';
-}
-
-filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
-   sanitized_name);
-g_free(sanitized_name);
+memory = ROUND_UP(memory, pagesize);

-fd = mkstemp(filename);
+fd = open_ram_file_path(block, path, memory);
  if (fd < 0) {
  error_setg_errno(errp, errno,
   "unable to create backing store for path %s", path);
-g_free(filename);
  goto error;
  }
-unlink(filename);
-g_free(filename);
-
-memory = ROUND_UP(memory, pagesize);
-
-/*
- * ftruncate is not supported by hugetlbfs in older
- * hosts, so don't bother bailing out on errors.
- * If anything goes wrong with it under other filesystems,
- * mmap will fail.
- */
-if (ftruncate(fd, memory)) {
-perror("ftruncate");
-}

  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
  if (area == MAP_FAILED) {



I was going to send tomorrow a pull request for a similar patch,
"backends/hostmem-file: Allow to specify full pathname for backing file".

The main difference seems to be your usage of O_EXCL.  Can you explain
why you added it?


It' used if we pass a block device as a NVDIMM backend memory:
 O_EXCL can be used without O_CREAT if pathname refers to a block device.  If 
the block device
 is in use by the system (e.g., mounted), open() fails with the error EBUSY
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 08/21] KVM: ARM64: Add reset and access handlers for PMXEVTYPER register

2015-11-02 Thread Shannon Zhao



On 2015/11/3 4:54, Christopher Covington wrote:
> Hi Shannon,
> 
> On 10/30/2015 02:21 AM, Shannon Zhao wrote:
>> From: Shannon Zhao 
>>
>> Since the reset value of PMXEVTYPER is UNKNOWN, use reset_unknown or
>> reset_unknown_cp15 for its reset handler. Add access handler which
>> emulates writing and reading PMXEVTYPER register. When writing to
>> PMXEVTYPER, call kvm_pmu_set_counter_event_type to create a perf_event
>> for the selected event type.
>>
>> Signed-off-by: Shannon Zhao 
>> ---
>>  arch/arm64/kvm/sys_regs.c | 26 --
>>  1 file changed, 24 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index cb82b15..4e606ea 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -491,6 +491,17 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>>  
>>  if (p->is_write) {
>>  switch (r->reg) {
>> +case PMXEVTYPER_EL0: {
>> +val = vcpu_sys_reg(vcpu, PMSELR_EL0);
>> +kvm_pmu_set_counter_event_type(vcpu,
>> +   *vcpu_reg(vcpu, p->Rt),
>> +   val);
>> +vcpu_sys_reg(vcpu, PMXEVTYPER_EL0) =
>> + *vcpu_reg(vcpu, p->Rt);
> 
> Why does PMXEVTYPER get set directly? It seems like it could have an accessor
> that redirected to PMEVTYPER.
> 
Yeah, that's what this patch does. It gets the counter index from
PMSELR_EL0 register, then set the event type, create perf_event, store
event type to PMEVTYPER, etc.

>> +vcpu_sys_reg(vcpu, PMEVTYPER0_EL0 + val) =
>> + *vcpu_reg(vcpu, p->Rt);
> 
> I tried to look around briefly but couldn't find counter number range checking
> in the PMSELR handler or here. Should there be some here and in PMXEVCNTR?
> 
Ok, will fix this. Thanks.

-- 
Shannon

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 07/21] KVM: ARM64: PMU: Add perf event map and introduce perf event creating function

2015-11-02 Thread Shannon Zhao



On 2015/11/3 4:13, Christopher Covington wrote:
> On 10/30/2015 02:21 AM, Shannon Zhao wrote:
>> From: Shannon Zhao 
>>
>> When we use tools like perf on host, perf passes the event type and the
>> id of this event type category to kernel, then kernel will map them to
>> hardware event number and write this number to PMU PMEVTYPER_EL0
>> register. When getting the event number in KVM, directly use raw event
>> type to create a perf_event for it.
>>
>> Signed-off-by: Shannon Zhao 
>> ---
>>  arch/arm64/include/asm/pmu.h |   2 +
>>  arch/arm64/kvm/Makefile  |   1 +
>>  include/kvm/arm_pmu.h|  13 +
>>  virt/kvm/arm/pmu.c   | 117 
>> +++
>>  4 files changed, 133 insertions(+)
>>  create mode 100644 virt/kvm/arm/pmu.c
>>
>> diff --git a/arch/arm64/include/asm/pmu.h b/arch/arm64/include/asm/pmu.h
>> index b9f394a..2c025f2 100644
>> --- a/arch/arm64/include/asm/pmu.h
>> +++ b/arch/arm64/include/asm/pmu.h
>> @@ -31,6 +31,8 @@
>>  #define ARMV8_PMCR_D(1 << 3) /* CCNT counts every 64th cpu 
>> cycle */
>>  #define ARMV8_PMCR_X(1 << 4) /* Export to ETM */
>>  #define ARMV8_PMCR_DP   (1 << 5) /* Disable CCNT if 
>> non-invasive debug*/
>> +/* Determines which PMCCNTR_EL0 bit generates an overflow */
>> +#define ARMV8_PMCR_LC   (1 << 6)
>>  #define ARMV8_PMCR_N_SHIFT  11   /* Number of counters 
>> supported */
>>  #define ARMV8_PMCR_N_MASK   0x1f
>>  #define ARMV8_PMCR_MASK 0x3f /* Mask for writable bits */
>> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
>> index 1949fe5..18d56d8 100644
>> --- a/arch/arm64/kvm/Makefile
>> +++ b/arch/arm64/kvm/Makefile
>> @@ -27,3 +27,4 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic-v3.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic-v3-emul.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += vgic-v3-switch.o
>>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
>> +kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
>> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
>> index 254d2b4..1908c88 100644
>> --- a/include/kvm/arm_pmu.h
>> +++ b/include/kvm/arm_pmu.h
>> @@ -38,4 +38,17 @@ struct kvm_pmu {
>>  #endif
>>  };
>>  
>> +#ifdef CONFIG_KVM_ARM_PMU
>> +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
>> select_idx);
>> +void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
>> +u32 select_idx);
>> +#else
>> +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
>> select_idx)
>> +{
>> +return 0;
>> +}
>> +void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
>> +u32 select_idx) {}
>> +#endif
>> +
>>  #endif
>> diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
>> new file mode 100644
>> index 000..900a64c
>> --- /dev/null
>> +++ b/virt/kvm/arm/pmu.c
>> @@ -0,0 +1,117 @@
>> +/*
>> + * Copyright (C) 2015 Linaro Ltd.
>> + * Author: Shannon Zhao 
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program.  If not, see .
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +/**
>> + * kvm_pmu_get_counter_value - get PMU counter value
>> + * @vcpu: The vcpu pointer
>> + * @select_idx: The counter index
>> + */
>> +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
>> select_idx)
>> +{
>> +u64 counter, enabled, running;
>> +struct kvm_pmu *pmu = &vcpu->arch.pmu;
>> +struct kvm_pmc *pmc = &pmu->pmc[select_idx];
>> +
>> +if (!vcpu_mode_is_32bit(vcpu))
>> +counter = vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + select_idx);
>> +else
>> +counter = vcpu_cp15(vcpu, c14_PMEVCNTR0 + select_idx);
>> +
>> +if (pmc->perf_event)
>> +counter += perf_event_read_value(pmc->perf_event, &enabled,
>> + &running);
>> +
>> +return counter & pmc->bitmask;
>> +}
>> +
>> +/**
>> + * kvm_pmu_stop_counter - stop PMU counter
>> + * @pmc: The PMU counter pointer
>> + *
>> + * If this counter has been configured to monitor some event, release it 
>> here.
>> + */
>> +static void kvm_pmu_stop_counter(struct kvm_pmc *pmc)
>> +{
>> +struct kvm_vcpu *vcpu = pmc->vcpu;
>> +u64 counter;
>> +
>> +if (pmc->perf_event) {
>> +counter = kvm_pmu_get_counter_value(vcpu, pmc->idx);
>> +

[Bug 102651] vcpuX unhandled rdmsr: 0x570

2015-11-02 Thread bugzilla-daemon

https://bugzilla.kernel.org/show_bug.cgi?id=102651

--- Comment #4 from Huaitong Han  ---
(In reply to Joachim Frieben from comment #3)
> I do see this warning by kernel-4.2.3-300.fc23 of a Fedora 23 system.

You can patch it on guest kernel,
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?id=73fdeb66592ee80dffb16fb8a9b7378a00c1a826,
although the warning has no effect on your guest.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt

On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > Then I would argue for naming this differently. Make it an optional
> > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > achieved via using a bypass or other means in the backend not the
> > business of the driver.
> > 
> 
> With a name like that, who wouldn't pass that flag? ;-)

xHCI for example, vs. something like 10G ethernet... but yes I agree it
sucks. I don't like that sort of policy anywhere in drivers. On the
other hand the platform doesn't have much information to make that sort
of decision either.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM/arm: kernel low level debug support for ARM32 virtual platforms

2015-11-02 Thread Mario Smarduch

Hello,
   this is a re-post from couple weeks ago, please take time to review this 
simple patch which simplifies DEBUG_LL and prevents kernel crash on virtual 
platforms.

Before this patch DEBUG_LL for 'dummy virtual machine':

( ) Kernel low-level debugging via EmbeddedICE DCC channel
( ) Kernel low-level debug output via semihosting I/O
( ) Kernel low-level debugging via 8250 UART
( ) Kernel low-level debugging via ARM Ltd PL01x Primecell

In summary if debug uart is not emulated kernel crashes.
And once you pass that hurdle, uart physical/virtual addresses are unknown.
DEBUG_LL comes in handy on many occasions and should be somewhat 
intuitive to use like it is for physical platforms. For virtual platforms
user may start daubting the host and get into a bigger mess.

After this patch is applied user gets:

(X) Kernel low-level debugging on QEMU Virtual Platform
( ) Kernel low-level debugging on Kvmtool Virtual Platform
. above repeated 

The virtual addresses selected follow arm reference models, high in vmalloc 
section with high mem enabled and guest running with >= 1GB of memory. The 
offset is leftover from arm reference models.

The patch is against 4.2.0-rc2 commit 43297dda0a51

Original Description

When booting a VM using QEMU or Kvmtool there are no clear ways to 
enable low level debugging for these virtual platforms. some menu port 
choices are not supported by the virtual platforms at all. And there is no
help on the location of physical and virtual addresses for the ports.
This may lead to wrong debug port and a frozen VM with a blank screen.

This patch adds menu selections for QEMU and Kvmtool virtual platforms for low 
level kernel print debugging. Help section displays port physical and
virutal addresses.

ARM reference models use the MIDR register to run-time select UART port address 
(for ARCH_VEXPRESS) based on A9 or A15 part numbers. Looked for a same approach
but couldn't find a way to differentiate between virtual platforms, something
like a platform register.

Acked-by: Christoffer Dall 
Signed-off-by: Mario Smarduch 
---
 arch/arm/Kconfig.debug | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/arm/Kconfig.debug b/arch/arm/Kconfig.debug
index a2e16f9..d126bd4 100644
--- a/arch/arm/Kconfig.debug
+++ b/arch/arm/Kconfig.debug
@@ -1155,6 +1155,28 @@ choice
  This option selects UART0 on VIA/Wondermedia System-on-a-chip
  devices, including VT8500, WM8505, WM8650 and WM8850.
 
+   config DEBUG_VIRT_UART_QEMU
+   bool "Kernel low-level debugging on QEMU Virtual Platform"
+   depends on ARCH_VIRT
+   select DEBUG_UART_PL01X
+   help
+ Say Y here if you want the debug print routines to direct
+ their output to PL011 UART port on QEMU Virtual Platform.
+ Appropriate address values are:
+   PHYSVIRT
+   0x900   0xf809
+
+   config DEBUG_VIRT_UART_KVMTOOL
+   bool "Kernel low-level debugging on Kvmtool Virtual Platform"
+   depends on ARCH_VIRT
+   select DEBUG_UART_8250
+   help
+ Say Y here if you want the debug print routines to direct
+ their output to 8250 UART port on Kvmtool Virtual
+ Platform. Appropriate address values are:
+   PHYSVIRT
+   0x3f8   0xf80903f8
+
config DEBUG_ICEDCC
bool "Kernel low-level debugging via EmbeddedICE DCC channel"
help
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt

On Mon, 2015-11-02 at 22:48 +, David Woodhouse wrote:
> 
> In the Intel case, the mapping setup is entirely per-device (except for
> crap devices and devices behind a PCIe-PCI bridge, etc.).
> 
> So you can happily have a passthrough mapping for *one* device, without
> making that same mapping available to another device. You can make that
> trade-off of speed vs. protection for each device in the system.

Same for me. I should have written "in the context of a device ...",
the problem reamains that it doesn't buy you much to do it *per
-mapping* which is what this API seems to be about.

> Currently we have the 'iommu=pt' option that makes us use passthrough
> for *all* devices (except those whose dma_mask can't reach all of
> physical memory). But I could live with something like Shamir is
> proposing, which lets us do the bypass only for performance-sensitive
> devices which actually *ask* for it (and of course we'd have a system-
> wide mode which declines that request and does normal mappings anyway).
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread David Woodhouse

On Mon, 2015-11-02 at 08:10 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> > Not sure this use case is possible for Infiniband where application hold
> > the data buffers and there is no way to force application to re use the 
> > buffer as suggested.
> > 
> > This is why I think there will be no easy way to bypass the DMA mapping cost
> > for all use cases unless we allow applications to request bypass/pass 
> > through
> > DMA mapping (implicitly or explicitly).
> 
> But but but ...
> 
> What I don't understand is how that brings you any safety.
> 
> Basically, either your bridge has a bypass window, or it doesn't. (Or
> it has one and it's enabled or not, same thing).
> 
> If you request the bypass on a per-mapping basis, you basically have to
> keep the window always enabled, unless you do some nasty refcounting of
> how many people have a bypass mapping in flight, but that doesn't seem
> useful.
> 
> Thus you have already lost all protection from the device, since your
> entire memory is accessible via the bypass mapping.
> 
> Which means what is the point of then having non-bypass mappings for
> other things ? Just to make things slower ?
> 
> I really don't see what this whole "bypass on a per-mapping basis" buys
> you.

In the Intel case, the mapping setup is entirely per-device (except for
crap devices and devices behind a PCIe-PCI bridge, etc.).

So you can happily have a passthrough mapping for *one* device, without
making that same mapping available to another device. You can make that
trade-off of speed vs. protection for each device in the system.

Currently we have the 'iommu=pt' option that makes us use passthrough
for *all* devices (except those whose dma_mask can't reach all of
physical memory). But I could live with something like Shamir is
proposing, which lets us do the bypass only for performance-sensitive
devices which actually *ask* for it (and of course we'd have a system-
wide mode which declines that request and does normal mappings anyway).

-- 
dwmw2

smime.p7s
Description: S/MIME cryptographic signature

Re: [kbuild-all] [PATCH 5/6] KVM: PPC: Book3S HV: Send IPI to host core to wake VCPU

2015-11-02 Thread Michael Ellerman

On Mon, 2015-11-02 at 16:10 -0600, Suresh E. Warrier wrote:

> Hi Fengguang,
> 
> I understand the problem.
> 
> I will send you the URL for the powerpc git tree once my patches
> are pulled in so that we can then pass that to the robot.

The robot already knows about the powerpc next tree.

It chose the KVM tree because of your subject line.

Sending a series to one tree (KVM) that depends on a series in another tree
(powerpc) is always problematic, even for humans, so the robot is not really at
fault here.

cheers

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kbuild-all] [PATCH 5/6] KVM: PPC: Book3S HV: Send IPI to host core to wake VCPU

2015-11-02 Thread Suresh E. Warrier

Hi Fengguang,

I understand the problem.

I will send you the URL for the powerpc git tree once my patches
are pulled in so that we can then pass that to the robot.

Thanks.
-suresh

On 11/01/2015 08:51 PM, Fengguang Wu wrote:
> Hi Suresh,
> 
> Sorry for the noise!
> 
> Do you have a git tree that the robot can monitor and test?
> 
> In this case of one patchset depending on another, there is no chance
> for the robot to do valid testing based on emailed patches.
> 
> Thanks,
> Fengguang
> 
> On Fri, Oct 30, 2015 at 10:16:06AM -0500, Suresh E. Warrier wrote:
>> This patch set depends upon a previous patch set that I had submitted to
>> linux-ppc. The URL for that is:
>>
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-October/135794.html
>>
>> -suresh
>>
>> On 10/29/2015 11:52 PM, kbuild test robot wrote:
>>> Hi Suresh,
>>>
>>> [auto build test ERROR on kvm/linux-next -- if it's inappropriate base, 
>>> please suggest rules for selecting the more suitable base]
>>>
>>> url:
>>> https://github.com/0day-ci/linux/commits/Suresh-Warrier/KVM-PPC-Book3S-HV-Optimize-wakeup-VCPU-from-H_IPI/20151030-081329
>>> config: powerpc-defconfig (attached as .config)
>>> reproduce:
>>> wget 
>>> https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
>>>  -O ~/bin/make.cross
>>> chmod +x ~/bin/make.cross
>>> # save the attached .config to linux build tree
>>> make.cross ARCH=powerpc 
>>>
>>> All errors (new ones prefixed by >>):
>>>
>>>arch/powerpc/kvm/book3s_hv_rm_xics.c: In function 'icp_rm_set_vcpu_irq':
> arch/powerpc/kvm/book3s_hv_rm_xics.c:142:4: error: implicit declaration 
> of function 'smp_muxed_ipi_rm_message_pass' 
> [-Werror=implicit-function-declaration]
>>>smp_muxed_ipi_rm_message_pass(hcpu,
>>>^
>>>arch/powerpc/kvm/book3s_hv_rm_xics.c:143:7: error: 
>>> 'PPC_MSG_RM_HOST_ACTION' undeclared (first use in this function)
>>>   PPC_MSG_RM_HOST_ACTION);
>>>   ^
>>>arch/powerpc/kvm/book3s_hv_rm_xics.c:143:7: note: each undeclared 
>>> identifier is reported only once for each function it appears in
>>>cc1: all warnings being treated as errors
>>>
>>> vim +/smp_muxed_ipi_rm_message_pass +142 
>>> arch/powerpc/kvm/book3s_hv_rm_xics.c
>>>
>>>136  hcore = -1;
>>>137  if (kvmppc_host_rm_ops_hv)
>>>138  hcore = 
>>> find_available_hostcore(XICS_RM_KICK_VCPU);
>>>139  if (hcore != -1) {
>>>140  hcpu = hcore << threads_shift;
>>>141  
>>> kvmppc_host_rm_ops_hv->rm_core[hcore].rm_data = vcpu;
>>>  > 142  smp_muxed_ipi_rm_message_pass(hcpu,
>>>143  
>>> PPC_MSG_RM_HOST_ACTION);
>>>144  } else {
>>>145  this_icp->rm_action |= 
>>> XICS_RM_KICK_VCPU;
>>>
>>> ---
>>> 0-DAY kernel test infrastructureOpen Source Technology 
>>> Center
>>> https://lists.01.org/pipermail/kbuild-all   Intel 
>>> Corporation
>>>
>>
>> ___
>> kbuild-all mailing list
>> kbuild-...@lists.01.org
>> https://lists.01.org/mailman/listinfo/kbuild-all
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch

On Tue, Nov 03, 2015 at 07:13:28AM +1100, Benjamin Herrenschmidt wrote:
> 
> Then I would argue for naming this differently. Make it an optional
> hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> achieved via using a bypass or other means in the backend not the
> business of the driver.
> 

Correct. This comment was also from internal review :-) Although
currently there is strong opposition to such new attribute.

> 
> It will partially only but it's just an example of another way the
> bakcend could provide some improved performances without a bypass.

Just curious.. 

In the Infiniband case where user application request the driver to DMA 
map some random pages they allocated - will your suggestion still
function? 

I mean can you use this limited 1:1 mapping window to map any page the
user application choose? How?

At the bypass we just use the physical address of the page (almost as is).
We use the fact that bypass address space cover the whole memory and so we
have mapping for any address.

In IOMMU case we use the IOMMU translation tables and so can map 64 bit 
address to some 32 bit address (or less).

In your case it is not clear to me what can we do with such limited 1:1
mapping.

> 
> Cheers,
> Ben.
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Arnd Bergmann

On Tuesday 03 November 2015 07:13:28 Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> > On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt
> > wrote:
> > > 
> > > Chosing on a per-mapping basis *in the back end* might still make
> > > some
> > 
> > In my case, choosing mapping based on the hardware that will use this
> > mappings makes more sense. Most hardware are not that performance 
> > sensitive as the Infiniband hardware.
> 
>  ...
> 
> > The driver know for what hardware it is mapping the memory so it know 
> > if the memory will be used by performance sensitive hardware or not.
> 
> Then I would argue for naming this differently. Make it an optional
> hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> achieved via using a bypass or other means in the backend not the
> business of the driver.
> 

With a name like that, who wouldn't pass that flag? ;-)

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM: VMX: fix SMEP and SMAP without EPT

2015-11-02 Thread Radim Krčmář

The comment in code had it mostly right, but we enable paging for
emulated real mode regardless of EPT.

Without EPT (which implies emulated real mode), secondary VCPUs won't
start unless we disable SM[AE]P when the guest doesn't use paging.

Signed-off-by: Radim Krčmář 
---
 arch/x86/kvm/vmx.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b680c2e0e8a3..ab598558a7a4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3788,20 +3788,21 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
if (!is_paging(vcpu)) {
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
-   /*
-* SMEP/SMAP is disabled if CPU is in non-paging mode
-* in hardware. However KVM always uses paging mode to
-* emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP/SMAP needs to be
-* manually disabled when guest switches to non-paging
-* mode.
-*/
-   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
}
 
+   if (!enable_unrestricted_guest && !is_paging(vcpu))
+   /*
+* SMEP/SMAP is disabled if CPU is in non-paging mode in
+* hardware.  However KVM always uses paging mode without
+* unrestricted guest.
+* To emulate this behavior, SMEP/SMAP needs to be manually
+* disabled when guest switches to non-paging mode.
+*/
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
+
vmcs_writel(CR4_READ_SHADOW, cr4);
vmcs_writel(GUEST_CR4, hw_cr4);
return 0;
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 17/21] KVM: ARM64: Add helper to handle PMCR register bits

2015-11-02 Thread Christopher Covington

On 10/30/2015 02:21 AM, Shannon Zhao wrote:
> From: Shannon Zhao 
> 
> According to ARMv8 spec, when writing 1 to PMCR.E, all counters are
> enabled by PMCNTENSET, while writing 0 to PMCR.E, all counters are
> disabled. When writing 1 to PMCR.P, reset all event counters, not
> including PMCCNTR, to zero. When writing 1 to PMCR.C, reset PMCCNTR to
> zero.

> diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> index ae21089..11d1bfb 100644
> --- a/virt/kvm/arm/pmu.c
> +++ b/virt/kvm/arm/pmu.c
> @@ -121,6 +121,56 @@ void kvm_pmu_disable_counter(struct kvm_vcpu *vcpu, u32 
> val)
>  }
>  
>  /**
> + * kvm_pmu_handle_pmcr - handle PMCR register
> + * @vcpu: The vcpu pointer
> + * @val: the value guest writes to PMCR register
> + */
> +void kvm_pmu_handle_pmcr(struct kvm_vcpu *vcpu, u32 val)
> +{
> + struct kvm_pmu *pmu = &vcpu->arch.pmu;
> + struct kvm_pmc *pmc;
> + u32 enable;
> + int i;
> +
> + if (val & ARMV8_PMCR_E) {
> + if (!vcpu_mode_is_32bit(vcpu))
> + enable = vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
> + else
> + enable = vcpu_cp15(vcpu, c9_PMCNTENSET);
> +
> + kvm_pmu_enable_counter(vcpu, enable, true);
> + } else
> + kvm_pmu_disable_counter(vcpu, 0xUL);

Nit: If using braces on one side of if-else, please use them on the other.
(Search for "braces in both branches" in Documentation/CodingStyle.)

> +
> + if (val & ARMV8_PMCR_C) {
> + pmc = &pmu->pmc[ARMV8_MAX_COUNTERS - 1];
> + if (pmc->perf_event)
> + local64_set(&pmc->perf_event->count, 0);
> + if (!vcpu_mode_is_32bit(vcpu))
> + vcpu_sys_reg(vcpu, PMCCNTR_EL0) = 0;
> + else
> + vcpu_cp15(vcpu, c9_PMCCNTR) = 0;
> + }
> +
> + if (val & ARMV8_PMCR_P) {
> + for (i = 0; i < ARMV8_MAX_COUNTERS - 1; i++) {
> + pmc = &pmu->pmc[i];
> + if (pmc->perf_event)
> + local64_set(&pmc->perf_event->count, 0);
> + if (!vcpu_mode_is_32bit(vcpu))
> + vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i) = 0;
> + else
> + vcpu_cp15(vcpu, c14_PMEVCNTR0 + i) = 0;
> + }
> + }
> +
> + if (val & ARMV8_PMCR_LC) {
> + pmc = &pmu->pmc[ARMV8_MAX_COUNTERS - 1];
> + pmc->bitmask = 0xUL;
> + }
> +}
> +
> +/**
>   * kvm_pmu_overflow_clear - clear PMU overflow interrupt
>   * @vcpu: The vcpu pointer
>   * @val: the value guest writes to PMOVSCLR register
> 


-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 0/3] KVM: arm/arm64: Clean up some obsolete code

2015-11-02 Thread Christoffer Dall

On Tue, Oct 27, 2015 at 11:37:28AM +0300, Pavel Fedin wrote:
> Current KVM code has lots of old redundancies, which can be cleaned up.
> This patchset is actually a better alternative to
> http://www.spinics.net/lists/arm-kernel/msg430726.html, which allows to
> keep piggy-backed LRs. The idea is based on the fact that our code also
> maintains LR state in elrsr, and this information is enough to track LR
> usage.
> 
> In case of problems this series can be applied partially, each patch is
> a complete refactoring step on its own.
> 
> Thanks to Andre Przywara for pinpointing some 4.3+ specifics.
> 
> This version has been tested on SMDK5410 development board
> (Exynos5410 SoC).

I ran this through my test scripts and I'm now quite sure that there's
some breakage in here.

One of my tests is running two VMs in parallel, each booting up, running
hackbench, and then doing reboot (from within the guest), and just
repeating like that.

I've run your patches in the above config 100 times, and every time,
the rebooting VMs got stuck before 50 reboots.

Without these patches, I could run the above config 100 times, and every
time, the rebooting VMs passed 200 reboots.

I'll try to test each patch individually soon.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 09/35] exec: allow file_ram_alloc to work on file

2015-11-02 Thread Paolo Bonzini



On 02/11/2015 10:13, Xiao Guangrong wrote:
> Currently, file_ram_alloc() only works on directory - it creates a file
> under @path and do mmap on it
> 
> This patch tries to allow it to work on file directly, if @path is a
> directory it works as before, otherwise it treats @path as the target
> file then directly allocate memory from it
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  exec.c | 80 
> ++
>  1 file changed, 51 insertions(+), 29 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 9075f4d..db0fdaf 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
>  }
>  
>  #ifdef __linux__
> +static bool path_is_dir(const char *path)
> +{
> +struct stat fs;
> +
> +return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);
> +}
> +
> +static int open_ram_file_path(RAMBlock *block, const char *path, size_t size)
> +{
> +char *filename;
> +char *sanitized_name;
> +char *c;
> +int fd;
> +
> +if (!path_is_dir(path)) {
> +int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
> +
> +flags |= O_EXCL;
> +return open(path, flags);
> +}
> +
> +/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> +sanitized_name = g_strdup(memory_region_name(block->mr));
> +for (c = sanitized_name; *c != '\0'; c++) {
> +if (*c == '/') {
> +*c = '_';
> +}
> +}
> +filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
> +   sanitized_name);
> +g_free(sanitized_name);
> +fd = mkstemp(filename);
> +if (fd >= 0) {
> +unlink(filename);
> +/*
> + * ftruncate is not supported by hugetlbfs in older
> + * hosts, so don't bother bailing out on errors.
> + * If anything goes wrong with it under other filesystems,
> + * mmap will fail.
> + */
> +if (ftruncate(fd, size)) {
> +perror("ftruncate");
> +}
> +}
> +g_free(filename);
> +
> +return fd;
> +}
> +
>  static void *file_ram_alloc(RAMBlock *block,
>  ram_addr_t memory,
>  const char *path,
>  Error **errp)
>  {
> -char *filename;
> -char *sanitized_name;
> -char *c;
>  void *area;
>  int fd;
>  uint64_t pagesize;
> @@ -1212,38 +1258,14 @@ static void *file_ram_alloc(RAMBlock *block,
>  goto error;
>  }
>  
> -/* Make name safe to use with mkstemp by replacing '/' with '_'. */
> -sanitized_name = g_strdup(memory_region_name(block->mr));
> -for (c = sanitized_name; *c != '\0'; c++) {
> -if (*c == '/')
> -*c = '_';
> -}
> -
> -filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
> -   sanitized_name);
> -g_free(sanitized_name);
> +memory = ROUND_UP(memory, pagesize);
>  
> -fd = mkstemp(filename);
> +fd = open_ram_file_path(block, path, memory);
>  if (fd < 0) {
>  error_setg_errno(errp, errno,
>   "unable to create backing store for path %s", path);
> -g_free(filename);
>  goto error;
>  }
> -unlink(filename);
> -g_free(filename);
> -
> -memory = ROUND_UP(memory, pagesize);
> -
> -/*
> - * ftruncate is not supported by hugetlbfs in older
> - * hosts, so don't bother bailing out on errors.
> - * If anything goes wrong with it under other filesystems,
> - * mmap will fail.
> - */
> -if (ftruncate(fd, memory)) {
> -perror("ftruncate");
> -}
>  
>  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
>  if (area == MAP_FAILED) {
> 

I was going to send tomorrow a pull request for a similar patch,
"backends/hostmem-file: Allow to specify full pathname for backing file".

The main difference seems to be your usage of O_EXCL.  Can you explain
why you added it?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 08/21] KVM: ARM64: Add reset and access handlers for PMXEVTYPER register

2015-11-02 Thread Christopher Covington

Hi Shannon,

On 10/30/2015 02:21 AM, Shannon Zhao wrote:
> From: Shannon Zhao 
> 
> Since the reset value of PMXEVTYPER is UNKNOWN, use reset_unknown or
> reset_unknown_cp15 for its reset handler. Add access handler which
> emulates writing and reading PMXEVTYPER register. When writing to
> PMXEVTYPER, call kvm_pmu_set_counter_event_type to create a perf_event
> for the selected event type.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm64/kvm/sys_regs.c | 26 --
>  1 file changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index cb82b15..4e606ea 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -491,6 +491,17 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>  
>   if (p->is_write) {
>   switch (r->reg) {
> + case PMXEVTYPER_EL0: {
> + val = vcpu_sys_reg(vcpu, PMSELR_EL0);
> + kvm_pmu_set_counter_event_type(vcpu,
> +*vcpu_reg(vcpu, p->Rt),
> +val);
> + vcpu_sys_reg(vcpu, PMXEVTYPER_EL0) =
> +  *vcpu_reg(vcpu, p->Rt);

Why does PMXEVTYPER get set directly? It seems like it could have an accessor
that redirected to PMEVTYPER.

> + vcpu_sys_reg(vcpu, PMEVTYPER0_EL0 + val) =
> +  *vcpu_reg(vcpu, p->Rt);

I tried to look around briefly but couldn't find counter number range checking
in the PMSELR handler or here. Should there be some here and in PMXEVCNTR?

Thanks,
Christopher Covington

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt

On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt
> wrote:
> > 
> > Chosing on a per-mapping basis *in the back end* might still make
> > some
> 
> In my case, choosing mapping based on the hardware that will use this
> mappings makes more sense. Most hardware are not that performance 
> sensitive as the Infiniband hardware.

 ...

> The driver know for what hardware it is mapping the memory so it know 
> if the memory will be used by performance sensitive hardware or not.

Then I would argue for naming this differently. Make it an optional
hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
achieved via using a bypass or other means in the backend not the
business of the driver.

> In your case, what will give the better performance - 1:1 mapping or
> IOMMU
> mapping? When you say 'relaxing the protection' you refer to 1:1
> mapping?

> Also, how this 1:1 window address the security concerns that other
> raised
> by other here?

It will partially only but it's just an example of another way the
bakcend could provide some improved performances without a bypass.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH/RFC 0/4] dma ops and virtio

2015-11-02 Thread Andy Lutomirski

On Mon, Nov 2, 2015 at 3:16 AM, Cornelia Huck  wrote:
> On Fri, 30 Oct 2015 13:33:07 -0700
> Andy Lutomirski  wrote:
>
>> On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck  
>> wrote:
>> > On Thu, 29 Oct 2015 15:50:38 -0700
>> > Andy Lutomirski  wrote:
>> >
>> >> Progress!  After getting that sort-of-working, I figured out what was
>> >> wrong with my earlier command, and I got that working, too.  Now I
>> >> get:
>> >>
>> >> qemu-system-s390x -fsdev
>> >> local,id=virtfs1,path=/,security_model=none,readonly -device
>> >> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
>> >> -nodefaults -device sclpconsole,chardev=console -parallel none -net
>> >> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
>> >> -serial chardev:console -mon chardev=console -vga none -display none
>> >> -kernel arch/s390/boot/bzImage -append
>> >> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init
>> >> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
>> >> TERM=xterm-256color rootfstype=9p
>> >> rootflags=ro,version=9p2000.L,trans=virtio,access=any
>> >> raid=noautodetect debug'
>> >
>> > The commandline looks sane AFAICS.
>> >
>> > (...)
>> >
>> >> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80
>> >> virtio_ccw 0.0.: Failed to set online: -5
>> >>
>> >> ^^^ bad news!
>> >
>> > I'd like to see where in the onlining process this fails. Could you set
>> > up qemu tracing for css_* and virtio_ccw_* (instructions in
>> > qemu/docs/tracing.txt)?
>>
>> I have a file called events that contains:
>>
>> css_*
>> virtio_ccw_*
>>
>> pointing -trace events= at it results in a trace- file that's 549
>> bytes long and contains nothing.  Are wildcards not as well-supported
>> as the docs suggest?
>
> Just tried it, seemed to work for me as expected. And as your messages
> indicate, at least some of the css tracepoints are guaranteed to be
> hit. Odd.
>
> Can you try the following sophisticated printf debug patch?
>
> diff --git a/hw/s390x/css.c b/hw/s390x/css.c
> index c033612..6a87bd6 100644
> --- a/hw/s390x/css.c
> +++ b/hw/s390x/css.c
> @@ -308,6 +308,8 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr 
> ccw_addr)
>  sch->ccw_no_data_cnt++;
>  }
>
> +fprintf(stderr, "CH DBG: %s: cmd_code=%x\n", __func__, ccw.cmd_code);
> +
>  /* Look at the command. */
>  switch (ccw.cmd_code) {
>  case CCW_CMD_NOOP:
> @@ -375,6 +377,7 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr 
> ccw_addr)
>  }
>  break;
>  }
> +fprintf(stderr, "CH DBG: %s: ret=%d\n", __func__, ret);
>  sch->last_cmd = ccw;
>  sch->last_cmd_valid = true;
>  if (ret == 0) {
>
>
>> > Which qemu version is this, btw.?
>> >
>>
>> git from yesterday.
>
> Hm. Might be worth trying the s390-ccw-virtio-2.4 machine instead.
>

No change.

With s390-ccw-virtio-2.4, I get:

Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Initializing cgroup subsys cpuacct
Linux version 4.3.0-rc7-8-gff230d6ec6b2
(l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red
Hat Cross 5.1.1-3) (GCC) ) #344 SMP Fri Oct 30 13:16:13 PDT 2015
setup: Linux is running under KVM in 64-bit mode
setup: Max memory size: 128MB
Zone ranges:
  DMA  [mem 0x-0x7fff]
  Normal   empty
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x-0x07ff]
Initmem setup node 0 [mem 0x-0x07ff]
On node 0 totalpages: 32768
  DMA zone: 512 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 32768 pages, LIFO batch:7
PERCPU: Embedded 466 pages/cpu @07605000 s1868032 r8192 d32512 u1908736
pcpu-alloc: s1868032 r8192 d32512 u1908736 alloc=466*4096
pcpu-alloc: [0] 0 [0] 1
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32256
Kernel command line:
init=/home/luto/devel/virtme/virtme/guest/virtme-init
psmouse.proto=exps "virtme_stty_con=rows 45 cols 150 iutf8"
TERM=xterm-256color rootfstype=9p
rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect
ro debug
PID hash table entries: 512 (order: 0, 4096 bytes)
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode-cache hash table entries: 8192 (order: 4, 65536 bytes)
Memory: 92520K/131072K available (8255K kernel code, 802K rwdata,
3860K rodata, 2384K init, 14382K bss, 38552K reserved, 0K
cma-reserved)
Write protected kernel read-only data: 0x10 - 0xcd4fff
SLUB: HWalign=256, Order=0-3, MinObjects=0, CPUs=2, Nodes=1
Running RCU self tests
Hierarchical RCU implementation.
RCU debugfs-based tracing is enabled.
RCU lockdep checking is enabled.
Build-time adjustment of leaf fanout to 64.
RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=2.
RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=2
NR_IRQS:3
clocksource: tod: mask: 0x max_cycles:
0x3b0a9be803b0a9, max_idle_ns: 1805497147909793 ns
console [ttyS1] enabled
Lock dependency valid

Re: [PATCH v4 07/21] KVM: ARM64: PMU: Add perf event map and introduce perf event creating function

2015-11-02 Thread Christopher Covington

On 10/30/2015 02:21 AM, Shannon Zhao wrote:
> From: Shannon Zhao 
> 
> When we use tools like perf on host, perf passes the event type and the
> id of this event type category to kernel, then kernel will map them to
> hardware event number and write this number to PMU PMEVTYPER_EL0
> register. When getting the event number in KVM, directly use raw event
> type to create a perf_event for it.
> 
> Signed-off-by: Shannon Zhao 
> ---
>  arch/arm64/include/asm/pmu.h |   2 +
>  arch/arm64/kvm/Makefile  |   1 +
>  include/kvm/arm_pmu.h|  13 +
>  virt/kvm/arm/pmu.c   | 117 
> +++
>  4 files changed, 133 insertions(+)
>  create mode 100644 virt/kvm/arm/pmu.c
> 
> diff --git a/arch/arm64/include/asm/pmu.h b/arch/arm64/include/asm/pmu.h
> index b9f394a..2c025f2 100644
> --- a/arch/arm64/include/asm/pmu.h
> +++ b/arch/arm64/include/asm/pmu.h
> @@ -31,6 +31,8 @@
>  #define ARMV8_PMCR_D (1 << 3) /* CCNT counts every 64th cpu cycle */
>  #define ARMV8_PMCR_X (1 << 4) /* Export to ETM */
>  #define ARMV8_PMCR_DP(1 << 5) /* Disable CCNT if 
> non-invasive debug*/
> +/* Determines which PMCCNTR_EL0 bit generates an overflow */
> +#define ARMV8_PMCR_LC(1 << 6)
>  #define  ARMV8_PMCR_N_SHIFT  11   /* Number of counters 
> supported */
>  #define  ARMV8_PMCR_N_MASK   0x1f
>  #define  ARMV8_PMCR_MASK 0x3f /* Mask for writable bits */
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 1949fe5..18d56d8 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -27,3 +27,4 @@ kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic-v3.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/vgic-v3-emul.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += vgic-v3-switch.o
>  kvm-$(CONFIG_KVM_ARM_HOST) += $(KVM)/arm/arch_timer.o
> +kvm-$(CONFIG_KVM_ARM_PMU) += $(KVM)/arm/pmu.o
> diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
> index 254d2b4..1908c88 100644
> --- a/include/kvm/arm_pmu.h
> +++ b/include/kvm/arm_pmu.h
> @@ -38,4 +38,17 @@ struct kvm_pmu {
>  #endif
>  };
>  
> +#ifdef CONFIG_KVM_ARM_PMU
> +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
> select_idx);
> +void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
> + u32 select_idx);
> +#else
> +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
> select_idx)
> +{
> + return 0;
> +}
> +void kvm_pmu_set_counter_event_type(struct kvm_vcpu *vcpu, u32 data,
> + u32 select_idx) {}
> +#endif
> +
>  #endif
> diff --git a/virt/kvm/arm/pmu.c b/virt/kvm/arm/pmu.c
> new file mode 100644
> index 000..900a64c
> --- /dev/null
> +++ b/virt/kvm/arm/pmu.c
> @@ -0,0 +1,117 @@
> +/*
> + * Copyright (C) 2015 Linaro Ltd.
> + * Author: Shannon Zhao 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * kvm_pmu_get_counter_value - get PMU counter value
> + * @vcpu: The vcpu pointer
> + * @select_idx: The counter index
> + */
> +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
> select_idx)
> +{
> + u64 counter, enabled, running;
> + struct kvm_pmu *pmu = &vcpu->arch.pmu;
> + struct kvm_pmc *pmc = &pmu->pmc[select_idx];
> +
> + if (!vcpu_mode_is_32bit(vcpu))
> + counter = vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + select_idx);
> + else
> + counter = vcpu_cp15(vcpu, c14_PMEVCNTR0 + select_idx);
> +
> + if (pmc->perf_event)
> + counter += perf_event_read_value(pmc->perf_event, &enabled,
> +  &running);
> +
> + return counter & pmc->bitmask;
> +}
> +
> +/**
> + * kvm_pmu_stop_counter - stop PMU counter
> + * @pmc: The PMU counter pointer
> + *
> + * If this counter has been configured to monitor some event, release it 
> here.
> + */
> +static void kvm_pmu_stop_counter(struct kvm_pmc *pmc)
> +{
> + struct kvm_vcpu *vcpu = pmc->vcpu;
> + u64 counter;
> +
> + if (pmc->perf_event) {
> + counter = kvm_pmu_get_counter_value(vcpu, pmc->idx);
> + if (!vcpu_mode_is_32bit(vcpu))
> + vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + pmc->idx) = counter;
> + else
> + vcpu_cp15(vcpu, c14

Re: [PATCH] vhost: move is_le setup to the backend

2015-11-02 Thread David Miller

From: Greg Kurz 
Date: Fri, 30 Oct 2015 12:42:35 +0100

> The vq->is_le field is used to fix endianness when accessing the vring via
> the cpu_to_vhost16() and vhost16_to_cpu() helpers in the following cases:
> 
> 1) host is big endian and device is modern virtio
> 
> 2) host has cross-endian support and device is legacy virtio with a different
>endianness than the host
> 
> Both cases rely on the VHOST_SET_FEATURES ioctl, but 2) also needs the
> VHOST_SET_VRING_ENDIAN ioctl to be called by userspace. Since vq->is_le
> is only needed when the backend is active, it was decided to set it at
> backend start.
> 
> This is currently done in vhost_init_used()->vhost_init_is_le() but it
> obfuscates the core vhost code. This patch moves the is_le setup to a
> dedicated function that is called from the backend code.
> 
> Note vhost_net is the only backend that can pass vq->private_data == NULL to
> vhost_init_used(), hence the "if (sock)" branch.
> 
> No behaviour change.
> 
> Signed-off-by: Greg Kurz 

Michael, I'm assuming that you will be the one taking this.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 05/21] KVM: ARM64: Add reset and access handlers for PMSELR register

2015-11-02 Thread Christopher Covington

On 10/30/2015 02:21 AM, Shannon Zhao wrote:
> From: Shannon Zhao 
> 
> Since the reset value of PMSELR_EL0 is UNKNOWN, use reset_unknown for
> its reset handler. As it doesn't need to deal with the acsessing action

Nit: accessing

> specially, it uses default case to emulate writing and reading PMSELR
> register.
> 
> Add a helper for CP15 registers reset to UNKNOWN.
> 
> Signed-off-by: Shannon Zhao 

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch

On Mon, Nov 02, 2015 at 03:44:27PM +0100, Joerg Roedel wrote:
> 
> How do you envision this per-mapping by-pass to work? For the DMA-API
> mappings you have only one device address space. For by-pass to work,
> you need to map all physical memory of the machine into this space,
> which leaves the question where you want to put the window for
> remapping.
> 
> You surely have to put it after the physical mappings, but any
> protection will be gone, as the device can access all physical memory.

Correct. This issue is one of the concerns here in the previous replies.
I will take different approach which will not require the IOMMU bypass
per mapping. Will try to shift to the x86 'iommu=pt' approach.

> 
> So instead of working around the shortcomings of DMA-API
> implementations, can you present us some numbers and analysis of how bad
> the performance impact with an IOMMU is and what causes it?

We had a bunch of issues around SPARC IOMMU. Not all of them relate to
performance. The first issue was that on SPARC, currently, we only have 
limited address space to IOMMU so we had issue to do large DMA mappings
for Infiniband. Second issue was that we identified high contention on 
the IOMMU locks even in ETH driver. 

> 
> I know that we have lock-contention issues in our IOMMU drivers, which
> can be fixed. Maybe the performance problems you are seeing can be fixed
> too, when you give us more details about them.
> 

I do not want to put too much information here but you can see some results:

rds-stress test from sparc t5-2 -> x86:

with iommu bypass:
-
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s   rx/s  tx+rx K/smbi K/smbo K/s tx us/c   rtt us cpu %
   3 141278  0 1165565.81   0.00   0.008.93   376.60 -1.00  
(average)

without iommu bypass:
-
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s   rx/s  tx+rx K/smbi K/smbo K/s tx us/c   rtt us cpu %
   3  78558  0  648101.41   0.00   0.00   15.05   876.72 -1.00  
(average)

+ RDMA tests are totally not working (might be due to failure to DMA map all 
the memory).

So IOMMU bypass give ~80% performance boost.

> 
>   Joerg
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 13/35] hostmem-file: use whole file size if possible

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 12:13, Xiao Guangrong wrote:

Use the whole file size if @size is not specified which is useful
if we want to directly pass a file to guest

Signed-off-by: Xiao Guangrong 
---
  backends/hostmem-file.c | 22 ++
  1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 9097a57..ea355c1 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -38,15 +38,29 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
  {
  HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
  
-if (!backend->size) {

-error_setg(errp, "can't create backend with size 0");
-return;
-}
  if (!fb->mem_path) {
  error_setg(errp, "mem-path property not set");
  return;
  }
  
+if (!backend->size) {

+Error *local_err = NULL;
+
+/*
+ * use the whole file size if @size is not specified.
+ */
+backend->size = qemu_file_getlength(fb->mem_path, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+
+if (!backend->size) {
+error_setg(errp, "can't create backend on the file whose size is 0");
+return;
+}
+
  backend->force_prealloc = mem_prealloc;
  memory_region_init_ram_from_file(&backend->mr, OBJECT(backend),
   object_get_canonical_path(OBJECT(backend)),


why not just

+if (!backend->size) {
+/*
+ * use the whole file size if @size is not specified.
+ */
+backend->size = qemu_file_getlength(fb->mem_path, errp);
+if (*errp) {
+return;
+}
+}


what the purpose of propagating?

--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Bug 102651] vcpuX unhandled rdmsr: 0x570

2015-11-02 Thread bugzilla-daemon

https://bugzilla.kernel.org/show_bug.cgi?id=102651

Joachim Namislow  changed:

   What|Removed |Added

 CC||jfrie...@hotmail.com

--- Comment #3 from Joachim Namislow  ---
I do see this warning by kernel-4.2.3-300.fc23 of a Fedora 23 system.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic

2015-11-02 Thread Paolo Bonzini



On 02/11/2015 18:01, Radim Krcmar wrote:
>> > Yes.  Both because the Virtuozzo people confirmed that kvm_arch_set_irq
>> > isn't needed for synic, and because synic is currently broken with APICv.
> Thanks.
> 
> (We can add direct delivery for |online vcpus| < X if performance with
>  low number of VCPUs happens to regress because of the schedule_work,)

Yeah, we will change the VFIO interrupt handler to non-threaded, which
will do more or less the same (what you lose now through schedule_work,
you will recoup by doing things directly in the ISR).

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic

2015-11-02 Thread Radim Krcmar

2015-11-02 17:08+0100, Paolo Bonzini:
> On 02/11/2015 15:59, Radim Krcmar wrote:
>>> We do not want to do too much work in atomic context, in particular
>>> not walking all the VCPUs of the virtual machine.  So we want
>>> to distinguish the architecture-specific injection function for irqfd
>>> from kvm_set_msi.  Since it's still empty, reuse the newly added
>>> kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.
>> 
>> kvm/queue uses kvm_arch_set_irq since b7184313f4b9 ("kvm/x86: Hyper-V
>> synthetic interrupt controller").
>> 
>> Is synic going to be dropped before this patch is merged?
> 
> Yes.  Both because the Virtuozzo people confirmed that kvm_arch_set_irq
> isn't needed for synic, and because synic is currently broken with APICv.

Thanks.

(We can add direct delivery for |online vcpus| < X if performance with
 low number of VCPUs happens to regress because of the schedule_work,)

Reviewed-by: Radim Krčmář 

[2/3] and [3/3] are good too.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] s390/dma: Allow per device dma ops

2015-11-02 Thread Sebastian Ott

On Fri, 30 Oct 2015, Christian Borntraeger wrote:

> As virtio-ccw now has dma ops, we can no longer default to the PCI ones.
> Make use of dev_archdata to keep the dma_ops per device. The pci devices
> now use that to override the default, and the default is changed to use
> the noop ops for everything that is not PCI. To compile without PCI
> support we also have to enable the DMA api with virtio.
> 
> Signed-off-by: Christian Borntraeger 

Acked-by: Sebastian Ott 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 12/35] util: let qemu_fd_getlength support block device

2015-11-02 Thread Xiao Guangrong




On 11/03/2015 12:11 AM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

lseek can not work for all block devices as the man page says:
| Some devices are incapable of seeking and POSIX does not specify
| which devices must support lseek().

This patch tries to add the support on Linux by using BLKGETSIZE64
ioctl

Signed-off-by: Xiao Guangrong 
---
  util/osdep.c | 20 
  1 file changed, 20 insertions(+)

diff --git a/util/osdep.c b/util/osdep.c
index 5a61e19..b20c793 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -45,6 +45,11 @@
  extern int madvise(caddr_t, size_t, int);
  #endif
+#ifdef CONFIG_LINUX
+#include 
+#include 
+#endif
+
  #include "qemu-common.h"
  #include "qemu/sockets.h"
  #include "qemu/error-report.h"
@@ -433,6 +438,21 @@ int64_t qemu_fd_getlength(int fd)
  {
  int64_t size;
+#ifdef CONFIG_LINUX
+struct stat stat_buf;
+if (fstat(fd, &stat_buf) < 0) {
+return -errno;
+}
+
+if ((S_ISBLK(stat_buf.st_mode)) && !ioctl(fd, BLKGETSIZE64, &size)) {
+/* The size of block device is larger than max int64_t? */
+if (size < 0) {
+return -EOVERFLOW;
+}
+return size;
+}
+#endif
+
  size = lseek(fd, 0, SEEK_END);
  if (size < 0) {
  return -errno;


Reviewed-by: Vladimir Sementsov-Ogievskiy 

just a question: is there any use for stat.st_size ? Is it always worse then 
lseek?


The man page says:
The  st_size field gives the size of the file (if it is a regular file or a 
symbolic link)
in bytes.  The size of a symbolic link is the length of the pathname it 
contains, without a
terminating null byte.

So it can not work on symbolic link.


Does it work for
blk?



Quickly checked with a program written by myself and 'stat' command, the answer 
is NO. :)


also, "This patch tries to add..". Hmm. It looks like this patch is not sure 
about will it success.
I'd prefer "This patch adds", but this is not important



Thanks for your sharing. I did not know the different, now, i got it. :)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 20/35] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 18:06, Xiao Guangrong wrote:



On 11/02/2015 10:26 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 16:08, Xiao Guangrong wrote:



On 11/02/2015 08:19 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Current code did not check the return value of get_memory_region 
as it

assumed the backend memory of pc-dimm is always properly initialized,
we make get_memory_region internally catch the case if something is
wrong


but here you call not pc-dimm's get_memory_region, but common 
ddc->get_memory_region, which may be
nvdimm or possibly other future dimm, so, why not check it here? And 
than pc_dimm_get_memory_region

may be left untouched (error_abort is ok, because errp is unused).


Hmm, because 'here' is not the only place calling ->get_memory_region, 
this method has

multiple callers:

$ git grep "\->get_memory_region"
hw/i386/pc.c:MemoryRegion *mr = ddc->get_memory_region(dimm);
hw/i386/pc.c:MemoryRegion *mr = ddc->get_memory_region(dimm);
hw/mem/dimm.c:mr = ddc->get_memory_region(dimm);
hw/mem/nvdimm.c:ddc->get_memory_region = nvdimm_get_memory_region;
hw/mem/pc-dimm.c:ddc->get_memory_region = pc_dimm_get_memory_region;
hw/ppc/spapr.c:MemoryRegion *mr = ddc->get_memory_region(dimm);

memory region validation is also done for NVDIMM in nvdimm device.

Ok, then it should be documented by a comment in dimm.h, where 
DIMMDeviceClass is defined, that this function should not fail


--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 12/35] util: let qemu_fd_getlength support block device

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 12:13, Xiao Guangrong wrote:

lseek can not work for all block devices as the man page says:
| Some devices are incapable of seeking and POSIX does not specify
| which devices must support lseek().

This patch tries to add the support on Linux by using BLKGETSIZE64
ioctl

Signed-off-by: Xiao Guangrong 
---
  util/osdep.c | 20 
  1 file changed, 20 insertions(+)

diff --git a/util/osdep.c b/util/osdep.c
index 5a61e19..b20c793 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -45,6 +45,11 @@
  extern int madvise(caddr_t, size_t, int);
  #endif
  
+#ifdef CONFIG_LINUX

+#include 
+#include 
+#endif
+
  #include "qemu-common.h"
  #include "qemu/sockets.h"
  #include "qemu/error-report.h"
@@ -433,6 +438,21 @@ int64_t qemu_fd_getlength(int fd)
  {
  int64_t size;
  
+#ifdef CONFIG_LINUX

+struct stat stat_buf;
+if (fstat(fd, &stat_buf) < 0) {
+return -errno;
+}
+
+if ((S_ISBLK(stat_buf.st_mode)) && !ioctl(fd, BLKGETSIZE64, &size)) {
+/* The size of block device is larger than max int64_t? */
+if (size < 0) {
+return -EOVERFLOW;
+}
+return size;
+}
+#endif
+
  size = lseek(fd, 0, SEEK_END);
  if (size < 0) {
  return -errno;


Reviewed-by: Vladimir Sementsov-Ogievskiy 

just a question: is there any use for stat.st_size ? Is it always worse 
then lseek? Does it work for blk?


also, "This patch tries to add..". Hmm. It looks like this patch is not 
sure about will it success. I'd prefer "This patch adds", but this is 
not important


--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic

2015-11-02 Thread Paolo Bonzini



On 02/11/2015 15:59, Radim Krcmar wrote:
> > We do not want to do too much work in atomic context, in particular
> > not walking all the VCPUs of the virtual machine.  So we want
> > to distinguish the architecture-specific injection function for irqfd
> > from kvm_set_msi.  Since it's still empty, reuse the newly added
> > kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.
> 
> kvm/queue uses kvm_arch_set_irq since b7184313f4b9 ("kvm/x86: Hyper-V
> synthetic interrupt controller").
> 
> Is synic going to be dropped before this patch is merged?

Yes.  Both because the Virtuozzo people confirmed that kvm_arch_set_irq
isn't needed for synic, and because synic is currently broken with APICv.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] kvmtool: Cleanup kernel loading

2015-11-02 Thread Andre Przywara

Hi Dimitri,

On 02/11/15 15:17, Dimitri John Ledkov wrote:
> On 2 November 2015 at 14:58, Will Deacon  wrote:
>> On Fri, Oct 30, 2015 at 06:26:53PM +, Andre Przywara wrote:
>>> Hi,
>>
>> Hello Andre,
>>
>>> this series cleans up kvmtool's kernel loading functionality a bit.
>>> It has been broken out of a previous series I sent [1] and contains
>>> just the cleanup and bug fix parts, which should be less controversial
>>> and thus easier to merge ;-)
>>> I will resend the pipe loading part later on as a separate series.
>>>
>>> The first patch properly abstracts kernel loading to move
>>> responsibility into each architecture's code. It removes quite some
>>> ugly code from the generic kvm.c file.
>>> The later patches address the naive usage of read(2) to, well, read
>>> data from files. Doing this without coping with the subtleties of
>>> the UNIX read semantics (returning with less or none data read is not
>>> an error) can provoke hard to debug failures.
>>> So these patches make use of the existing and one new wrapper function
>>> to make sure we read everything we actually wanted to.
>>> The last patch moves the ARM kernel loading code into the proper
>>> location to be in line with the other architectures.
>>>
>>> Please have a look and give some comments!
>>
>> Looks good to me, but I'd like to see some comments from some mips/ppc/x86
>> people on the changes you're making over there.
> 
> Looks mostly good to me, as one of the kvmtool down streams. Over at
> https://github.com/clearlinux/kvmtool we have some patches to tweak
> the x86 boot flow, which will need rebasing/retweaking) specifically
> this commit here -
> https://github.com/clearlinux/kvmtool/commit/a8dee709f85735d16739d0eda0cc00d3c1b17477

Awesome - I was actually thinking about coding something like this!
In the last week I move the MIPS ELF loading out of mips/ into /elf.c to
be able to load kvm-unit-tests' tests (which are Multiboot/ELF). As
multiboot requires entering in protected mode, I was thinking about
changing kvmtool to support entering a guest directly in protected mode
- seems like this is mostly what you've done here already.

Looks like we should both post our patches to merge them somehow ;-)

Cheers,
Andre.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] Provide simple noop dma ops

2015-11-02 Thread Christian Borntraeger

Am 02.11.2015 um 16:16 schrieb Joerg Roedel:
> On Fri, Oct 30, 2015 at 02:20:35PM +0100, Christian Borntraeger wrote:
>> +static void *dma_noop_alloc(struct device *dev, size_t size,
>> +dma_addr_t *dma_handle, gfp_t gfp,
>> +struct dma_attrs *attrs)
>> +{
>> +void *ret;
>> +
>> +ret = (void *)__get_free_pages(gfp, get_order(size));
>> +if (ret) {
>> +memset(ret, 0, size);
> 
> There is no need to zero out the memory here. If the user wants
> initialized memory it can call dma_zalloc_coherent. Having the memset
> here means to clear the memory twice in the dma_zalloc_coherent path.
> 
> Otherwise it looks good.

Thanks. Will fix.
In addition I will also make the compilation of dma-noop.o dependent 
on HAS_DMA.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH v7 09/35] exec: allow file_ram_alloc to work on file

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 18:25, Xiao Guangrong wrote:



On 11/02/2015 11:12 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

Currently, file_ram_alloc() only works on directory - it creates a file
under @path and do mmap on it

This patch tries to allow it to work on file directly, if @path is a


It isn't try to allow, it allows, as I understand)...


Err... Sorry for my English, but what is the different between:
”This patch tries to allow it to work on file directly“ and
"This patch allows it to work on file directly"

:(


Not sure that everyone is interested in this nit-picking discussion..

A allows B: if A then B
A tries to allow B: if A then _may be_ B

In any way it doesn't matter.






directory it works as before, otherwise it treats @path as the target
file then directly allocate memory from it

Signed-off-by: Xiao Guangrong 
---
  exec.c | 80 
++

  1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/exec.c b/exec.c
index 9075f4d..db0fdaf 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
  }
  #ifdef __linux__
+static bool path_is_dir(const char *path)
+{
+struct stat fs;
+
+return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);
+}
+
+static int open_ram_file_path(RAMBlock *block, const char *path, 
size_t size)

+{
+char *filename;
+char *sanitized_name;
+char *c;
+int fd;
+
+if (!path_is_dir(path)) {
+int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
+
+flags |= O_EXCL;
+return open(path, flags);
+}
+
+/* Make name safe to use with mkstemp by replacing '/' with 
'_'. */

+sanitized_name = g_strdup(memory_region_name(block->mr));
+for (c = sanitized_name; *c != '\0'; c++) {
+if (*c == '/') {
+*c = '_';
+}
+}
+filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
+   sanitized_name);
+g_free(sanitized_name);


one empty line will be very nice here, and it was in master branch


+fd = mkstemp(filename);
+if (fd >= 0) {
+unlink(filename);
+/*
+ * ftruncate is not supported by hugetlbfs in older
+ * hosts, so don't bother bailing out on errors.
+ * If anything goes wrong with it under other filesystems,
+ * mmap will fail.
+ */
+if (ftruncate(fd, size)) {
+perror("ftruncate");
+}
+}
+g_free(filename);
+
+return fd;
+}
+
  static void *file_ram_alloc(RAMBlock *block,
  ram_addr_t memory,
  const char *path,
  Error **errp)
  {
-char *filename;
-char *sanitized_name;
-char *c;
  void *area;
  int fd;
  uint64_t pagesize;
@@ -1212,38 +1258,14 @@ static void *file_ram_alloc(RAMBlock *block,
  goto error;
  }
-/* Make name safe to use with mkstemp by replacing '/' with 
'_'. */

-sanitized_name = g_strdup(memory_region_name(block->mr));
-for (c = sanitized_name; *c != '\0'; c++) {
-if (*c == '/')
-*c = '_';
-}
-
-filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
-   sanitized_name);
-g_free(sanitized_name);
+memory = ROUND_UP(memory, pagesize);
-fd = mkstemp(filename);
+fd = open_ram_file_path(block, path, memory);
  if (fd < 0) {
  error_setg_errno(errp, errno,
   "unable to create backing store for path 
%s", path);

-g_free(filename);
  goto error;
  }
-unlink(filename);
-g_free(filename);
-
-memory = ROUND_UP(memory, pagesize);
-
-/*
- * ftruncate is not supported by hugetlbfs in older
- * hosts, so don't bother bailing out on errors.
- * If anything goes wrong with it under other filesystems,
- * mmap will fail.
- */
-if (ftruncate(fd, memory)) {
-perror("ftruncate");
-}
  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & 
RAM_SHARED);

  if (area == MAP_FAILED) {



Reviewed-by: Vladimir Sementsov-Ogievskiy 


Thanks for your review.





--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [kvm-unit-tests PATCHv5 3/3] arm: pmu: Add CPI checking

2015-11-02 Thread Andrew Jones

On Fri, Oct 30, 2015 at 03:32:43PM -0400, Christopher Covington wrote:
> Hi Drew,
> 
> On 10/30/2015 09:00 AM, Andrew Jones wrote:
> > On Wed, Oct 28, 2015 at 03:12:55PM -0400, Christopher Covington wrote:
> >> Calculate the numbers of cycles per instruction (CPI) implied by ARM
> >> PMU cycle counter values. The code includes a strict checking facility
> >> intended for the -icount option in TCG mode but it is not yet enabled
> >> in the configuration file. Enabling it must wait on infrastructure
> >> improvements which allow for different tests to be run on TCG versus
> >> KVM.
> >>
> >> Signed-off-by: Christopher Covington 
> >> ---
> >>  arm/pmu.c | 103 
> >> +-
> >>  1 file changed, 102 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arm/pmu.c b/arm/pmu.c
> >> index 4334de4..76a 100644
> >> --- a/arm/pmu.c
> >> +++ b/arm/pmu.c
> >> @@ -43,6 +43,23 @@ static inline unsigned long get_pmccntr(void)
> >>asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
> >>return cycles;
> >>  }
> >> +
> >> +/*
> >> + * Extra instructions inserted by the compiler would be difficult to 
> >> compensate
> >> + * for, so hand assemble everything between, and including, the PMCR 
> >> accesses
> >> + * to start and stop counting.
> >> + */
> >> +static inline void loop(int i, uint32_t pmcr)
> >> +{
> >> +  asm volatile(
> >> +  "   mcr p15, 0, %[pmcr], c9, c12, 0\n"
> >> +  "1: subs%[i], %[i], #1\n"
> >> +  "   bgt 1b\n"
> >> +  "   mcr p15, 0, %[z], c9, c12, 0\n"
> >> +  : [i] "+r" (i)
> >> +  : [pmcr] "r" (pmcr), [z] "r" (0)
> >> +  : "cc");
> >> +}
> >>  #elif defined(__aarch64__)
> >>  static inline uint32_t get_pmcr(void)
> >>  {
> >> @@ -64,6 +81,23 @@ static inline unsigned long get_pmccntr(void)
> >>asm volatile("mrs %0, pmccntr_el0" : "=r" (cycles));
> >>return cycles;
> >>  }
> >> +
> >> +/*
> >> + * Extra instructions inserted by the compiler would be difficult to 
> >> compensate
> >> + * for, so hand assemble everything between, and including, the PMCR 
> >> accesses
> >> + * to start and stop counting.
> >> + */
> >> +static inline void loop(int i, uint32_t pmcr)
> >> +{
> >> +  asm volatile(
> >> +  "   msr pmcr_el0, %[pmcr]\n"
> >> +  "1: subs%[i], %[i], #1\n"
> >> +  "   b.gt1b\n"
> >> +  "   msr pmcr_el0, xzr\n"
> >> +  : [i] "+r" (i)
> >> +  : [pmcr] "r" (pmcr)
> >> +  : "cc");
> >> +}
> >>  #endif
> >>  
> >>  struct pmu_data {
> >> @@ -131,12 +165,79 @@ static bool check_cycles_increase(void)
> >>return true;
> >>  }
> >>  
> >> -int main(void)
> >> +/*
> >> + * Execute a known number of guest instructions. Only odd instruction 
> >> counts
> >> + * greater than or equal to 3 are supported by the in-line assembly code. 
> >> The
> >> + * control register (PMCR_EL0) is initialized with the provided value 
> >> (allowing
> >> + * for example for the cycle counter or event counters to be reset). At 
> >> the end
> >> + * of the exact instruction loop, zero is written to PMCR_EL0 to disable
> >> + * counting, allowing the cycle counter or event counters to be read at 
> >> the
> >> + * leisure of the calling code.
> >> + */
> >> +static void measure_instrs(int num, uint32_t pmcr)
> >> +{
> >> +  int i = (num - 1) / 2;
> >> +
> >> +  assert(num >= 3 && ((num - 1) % 2 == 0));
> >> +  loop(i, pmcr);
> >> +}
> >> +
> >> +/*
> >> + * Measure cycle counts for various known instruction counts. Ensure that 
> >> the
> >> + * cycle counter progresses (similar to check_cycles_increase() but with 
> >> more
> >> + * instructions and using reset and stop controls). If supplied a 
> >> positive,
> >> + * nonzero CPI parameter, also strictly check that every measurement 
> >> matches
> >> + * it. Strict CPI checking is used to test -icount mode.
> >> + */
> >> +static bool check_cpi(int cpi)
> >> +{
> >> +  struct pmu_data pmu = {0};
> >> +
> >> +  pmu.cycle_counter_reset = 1;
> >> +  pmu.enable = 1;
> >> +
> >> +  if (cpi > 0)
> >> +  printf("Checking for CPI=%d.\n", cpi);
> >> +  printf("instrs : cycles0 cycles1 ...\n");
> >> +
> >> +  for (int i = 3; i < 300; i += 32) {
> >> +  int avg, sum = 0;
> >> +
> >> +  printf("%d :", i);
> >> +  for (int j = 0; j < NR_SAMPLES; j++) {
> >> +  int cycles;
> >> +
> >> +  measure_instrs(i, pmu.pmcr_el0);
> >> +  cycles = get_pmccntr();
> >> +  printf(" %d", cycles);
> >> +
> >> +  if (!cycles || (cpi > 0 && cycles != i * cpi)) {
> >> +  printf("\n");
> >> +  return false;
> >> +  }
> >> +
> >> +  sum += cycles;
> >> +  }
> >> +  avg = sum / NR_SAMPLES;
> >> +  printf(" sum=%d avg=%d avg_ipc=%d avg_cpi=%d\n",
> >> +  sum, avg, i / avg, avg / i);
> >> +  }
> >> +
> >> +  return true;
> >> +}
> >> +

Re: [PATCH v7 08/35] exec: allow memory to be allocated from any kind of path

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 18:22, Xiao Guangrong wrote:



On 11/02/2015 10:51 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:
Currently file_ram_alloc() is designed for hugetlbfs, however, the 
memory
of nvdimm can come from either raw pmem device eg, /dev/pmem, or the 
file

locates at DAX enabled filesystem

So this patch let it work on any kind of path


No, this patch doesn't change any logic, but only fix variable name 
and some error messages.


Yes, it is.

'let it work' in my thought exactly was "fix variable name and some 
error messages"... okay,

if it confused you, how about change it to:

"This patch fixes variable name and some error messages to let it be 
aware of normal

path"


My english is not very good, I don't know figures of speech. For me 
"patch let it work" means that without this patch it will not work))
Your new variant is ok for me, or better (imo) "This patch fixes 
variable name and some error messages to be suitable for any kind of path"








Signed-off-by: Xiao Guangrong 
---
  exec.c | 24 
  1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/exec.c b/exec.c
index 9de38be..9075f4d 100644
--- a/exec.c
+++ b/exec.c
@@ -1184,25 +1184,25 @@ static void *file_ram_alloc(RAMBlock *block,
  char *c;
  void *area;
  int fd;
-uint64_t hpagesize;
+uint64_t pagesize;
  Error *local_err = NULL;
-hpagesize = qemu_file_get_page_size(path, &local_err);
+pagesize = qemu_file_get_page_size(path, &local_err);
  if (local_err) {
  error_propagate(errp, local_err);
  goto error;
  }
-if (hpagesize == getpagesize()) {
-fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
+if (pagesize == getpagesize()) {
+fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");


Why do you remove path from error message? It is good additional 
information.. What if we have

several memory file backends?


Good catch, will change it to:
fprintf(stderr, "Memory is not allocated from HugeTlbfs on path 
%s.\n", path);


BTW, if no other big change in the further, i will post the new 
version just for of this patch,





  }
-block->mr->align = hpagesize;
+block->mr->align = pagesize;
-if (memory < hpagesize) {
+if (memory < pagesize) {
  error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be 
equal to "

-   "or larger than huge page size 0x%" PRIx64,
-   memory, hpagesize);
+   "or larger than page size 0x%" PRIx64,
+   memory, pagesize);
  goto error;
  }
@@ -1226,14 +1226,14 @@ static void *file_ram_alloc(RAMBlock *block,
  fd = mkstemp(filename);
  if (fd < 0) {
  error_setg_errno(errp, errno,
- "unable to create backing store for 
hugepages");
+ "unable to create backing store for path 
%s", path);

  g_free(filename);
  goto error;
  }
  unlink(filename);
  g_free(filename);
-memory = ROUND_UP(memory, hpagesize);
+memory = ROUND_UP(memory, pagesize);
  /*
   * ftruncate is not supported by hugetlbfs in older
@@ -1245,10 +1245,10 @@ static void *file_ram_alloc(RAMBlock *block,
  perror("ftruncate");
  }
-area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & 
RAM_SHARED);
+area = qemu_ram_mmap(fd, memory, pagesize, block->flags & 
RAM_SHARED);

  if (area == MAP_FAILED) {
  error_setg_errno(errp, errno,
- "unable to map backing store for hugepages");
+ "unable to map backing store for path %s", 
path);

  close(fd);
  goto error;
  }





With these two fixes (any commit message variant):

Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 09/35] exec: allow file_ram_alloc to work on file

2015-11-02 Thread Xiao Guangrong




On 11/02/2015 11:12 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

Currently, file_ram_alloc() only works on directory - it creates a file
under @path and do mmap on it

This patch tries to allow it to work on file directly, if @path is a


It isn't try to allow, it allows, as I understand)...


Err... Sorry for my English, but what is the different between:
”This patch tries to allow it to work on file directly“ and
"This patch allows it to work on file directly"

:(




directory it works as before, otherwise it treats @path as the target
file then directly allocate memory from it

Signed-off-by: Xiao Guangrong 
---
  exec.c | 80 ++
  1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/exec.c b/exec.c
index 9075f4d..db0fdaf 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
  }
  #ifdef __linux__
+static bool path_is_dir(const char *path)
+{
+struct stat fs;
+
+return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);
+}
+
+static int open_ram_file_path(RAMBlock *block, const char *path, size_t size)
+{
+char *filename;
+char *sanitized_name;
+char *c;
+int fd;
+
+if (!path_is_dir(path)) {
+int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
+
+flags |= O_EXCL;
+return open(path, flags);
+}
+
+/* Make name safe to use with mkstemp by replacing '/' with '_'. */
+sanitized_name = g_strdup(memory_region_name(block->mr));
+for (c = sanitized_name; *c != '\0'; c++) {
+if (*c == '/') {
+*c = '_';
+}
+}
+filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
+   sanitized_name);
+g_free(sanitized_name);


one empty line will be very nice here, and it was in master branch


+fd = mkstemp(filename);
+if (fd >= 0) {
+unlink(filename);
+/*
+ * ftruncate is not supported by hugetlbfs in older
+ * hosts, so don't bother bailing out on errors.
+ * If anything goes wrong with it under other filesystems,
+ * mmap will fail.
+ */
+if (ftruncate(fd, size)) {
+perror("ftruncate");
+}
+}
+g_free(filename);
+
+return fd;
+}
+
  static void *file_ram_alloc(RAMBlock *block,
  ram_addr_t memory,
  const char *path,
  Error **errp)
  {
-char *filename;
-char *sanitized_name;
-char *c;
  void *area;
  int fd;
  uint64_t pagesize;
@@ -1212,38 +1258,14 @@ static void *file_ram_alloc(RAMBlock *block,
  goto error;
  }
-/* Make name safe to use with mkstemp by replacing '/' with '_'. */
-sanitized_name = g_strdup(memory_region_name(block->mr));
-for (c = sanitized_name; *c != '\0'; c++) {
-if (*c == '/')
-*c = '_';
-}
-
-filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
-   sanitized_name);
-g_free(sanitized_name);
+memory = ROUND_UP(memory, pagesize);
-fd = mkstemp(filename);
+fd = open_ram_file_path(block, path, memory);
  if (fd < 0) {
  error_setg_errno(errp, errno,
   "unable to create backing store for path %s", path);
-g_free(filename);
  goto error;
  }
-unlink(filename);
-g_free(filename);
-
-memory = ROUND_UP(memory, pagesize);
-
-/*
- * ftruncate is not supported by hugetlbfs in older
- * hosts, so don't bother bailing out on errors.
- * If anything goes wrong with it under other filesystems,
- * mmap will fail.
- */
-if (ftruncate(fd, memory)) {
-perror("ftruncate");
-}
  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
  if (area == MAP_FAILED) {



Reviewed-by: Vladimir Sementsov-Ogievskiy 


Thanks for your review.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 08/35] exec: allow memory to be allocated from any kind of path

2015-11-02 Thread Xiao Guangrong




On 11/02/2015 10:51 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

Currently file_ram_alloc() is designed for hugetlbfs, however, the memory
of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file
locates at DAX enabled filesystem

So this patch let it work on any kind of path


No, this patch doesn't change any logic, but only fix variable name and some 
error messages.


Yes, it is.

'let it work' in my thought exactly was "fix variable name and some error 
messages"... okay,
if it confused you, how about change it to:

"This patch fixes variable name and some error messages to let it be aware of 
normal
path"





Signed-off-by: Xiao Guangrong 
---
  exec.c | 24 
  1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/exec.c b/exec.c
index 9de38be..9075f4d 100644
--- a/exec.c
+++ b/exec.c
@@ -1184,25 +1184,25 @@ static void *file_ram_alloc(RAMBlock *block,
  char *c;
  void *area;
  int fd;
-uint64_t hpagesize;
+uint64_t pagesize;
  Error *local_err = NULL;
-hpagesize = qemu_file_get_page_size(path, &local_err);
+pagesize = qemu_file_get_page_size(path, &local_err);
  if (local_err) {
  error_propagate(errp, local_err);
  goto error;
  }
-if (hpagesize == getpagesize()) {
-fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
+if (pagesize == getpagesize()) {
+fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");


Why do you remove path from error message? It is good additional information.. 
What if we have
several memory file backends?


Good catch, will change it to:
fprintf(stderr, "Memory is not allocated from HugeTlbfs on path %s.\n", path);

BTW, if no other big change in the further, i will post the new version just 
for of this patch,




  }
-block->mr->align = hpagesize;
+block->mr->align = pagesize;
-if (memory < hpagesize) {
+if (memory < pagesize) {
  error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
-   "or larger than huge page size 0x%" PRIx64,
-   memory, hpagesize);
+   "or larger than page size 0x%" PRIx64,
+   memory, pagesize);
  goto error;
  }
@@ -1226,14 +1226,14 @@ static void *file_ram_alloc(RAMBlock *block,
  fd = mkstemp(filename);
  if (fd < 0) {
  error_setg_errno(errp, errno,
- "unable to create backing store for hugepages");
+ "unable to create backing store for path %s", path);
  g_free(filename);
  goto error;
  }
  unlink(filename);
  g_free(filename);
-memory = ROUND_UP(memory, hpagesize);
+memory = ROUND_UP(memory, pagesize);
  /*
   * ftruncate is not supported by hugetlbfs in older
@@ -1245,10 +1245,10 @@ static void *file_ram_alloc(RAMBlock *block,
  perror("ftruncate");
  }
-area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & RAM_SHARED);
+area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
  if (area == MAP_FAILED) {
  error_setg_errno(errp, errno,
- "unable to map backing store for hugepages");
+ "unable to map backing store for path %s", path);
  close(fd);
  goto error;
  }




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/3] s390/dma: Allow per device dma ops

2015-11-02 Thread Joerg Roedel

On Fri, Oct 30, 2015 at 02:20:37PM +0100, Christian Borntraeger wrote:
> As virtio-ccw now has dma ops, we can no longer default to the PCI ones.
> Make use of dev_archdata to keep the dma_ops per device. The pci devices
> now use that to override the default, and the default is changed to use
> the noop ops for everything that is not PCI. To compile without PCI
> support we also have to enable the DMA api with virtio.
> 
> Signed-off-by: Christian Borntraeger 
> ---
>  arch/s390/Kconfig   | 3 ++-
>  arch/s390/include/asm/device.h  | 6 +-
>  arch/s390/include/asm/dma-mapping.h | 6 --
>  arch/s390/pci/pci.c | 1 +
>  arch/s390/pci/pci_dma.c | 4 ++--
>  5 files changed, 14 insertions(+), 6 deletions(-)

Reviewed-by: Joerg Roedel 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 20/35] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

2015-11-02 Thread Xiao Guangrong




On 11/02/2015 10:26 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 16:08, Xiao Guangrong wrote:



On 11/02/2015 08:19 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Current code did not check the return value of get_memory_region as it
assumed the backend memory of pc-dimm is always properly initialized,
we make get_memory_region internally catch the case if something is
wrong


but here you call not pc-dimm's get_memory_region, but common 
ddc->get_memory_region, which may be
nvdimm or possibly other future dimm, so, why not check it here? And than 
pc_dimm_get_memory_region
may be left untouched (error_abort is ok, because errp is unused).


Hmm, because 'here' is not the only place calling ->get_memory_region, this 
method has
multiple callers:

$ git grep "\->get_memory_region"
hw/i386/pc.c:MemoryRegion *mr = ddc->get_memory_region(dimm);
hw/i386/pc.c:MemoryRegion *mr = ddc->get_memory_region(dimm);
hw/mem/dimm.c:mr = ddc->get_memory_region(dimm);
hw/mem/nvdimm.c:ddc->get_memory_region = nvdimm_get_memory_region;
hw/mem/pc-dimm.c:ddc->get_memory_region = pc_dimm_get_memory_region;
hw/ppc/spapr.c:MemoryRegion *mr = ddc->get_memory_region(dimm);

memory region validation is also done for NVDIMM in nvdimm device.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCHv2 0/3] dma ops and virtio

2015-11-02 Thread Sebastian Ott

Hi,

On Fri, 30 Oct 2015, Christian Borntraeger wrote:

> here is the 2nd version of providing an DMA API for s390.
> 
> There are some attempts to unify the dma ops (Christoph) as well
> as some attempts to make virtio use the dma API (Andy).
> 
> At kernel summit we concluded that we want to use the same code on all
> platforms, whereever possible, so having a dummy dma_op might be the
> easiest solution to keep virtio-ccw as similar as possible to
> virtio-pci.Together with a fixed up patch set from Andy Lutomirski
> this seems to work.  
> 
> We will also need a fixup for powerc and QEMU changes to make virtio
> work with iommu on power and x86.
> 
> TODO:
> - future add-on patches to also fold in x86 no iommu
>   - dma_mask
>   - checking?
> - make compilation of dma-noop dependent on something
> 
> v1->v2:
> - initial testing
> - always use dma_noop_ops if device has no private dma_ops
> - get rid of setup in virtio_ccw,kvm_virtio
> - set CONFIG_HAS_DMA(ATTRS) for virtio (fixes compile for !PCI)
> - rename s390_dma_ops to s390_pci_dma_ops
> 
> Christian Borntraeger (3):
>   Provide simple noop dma ops
>   alpha: use common noop dma ops
>   s390/dma: Allow per device dma ops
> 
>  arch/alpha/kernel/pci-noop.c| 46 ++
>  arch/s390/Kconfig   |  3 +-
>  arch/s390/include/asm/device.h  |  6 ++-
>  arch/s390/include/asm/dma-mapping.h |  6 ++-
>  arch/s390/pci/pci.c |  1 +
>  arch/s390/pci/pci_dma.c |  4 +-
>  include/linux/dma-mapping.h |  2 +
>  lib/Makefile|  2 +-
>  lib/dma-noop.c  | 77 
> +
>  9 files changed, 98 insertions(+), 49 deletions(-)
>  create mode 100644 lib/dma-noop.c
> 
> -- 

I agree with these changes in principle. As long as we don't do MMIO
(writel and friends) on a dummy mapping we're fine.

Regards,
Sebastian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 11/35] util: introduce qemu_file_getlength()

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 12:13, Xiao Guangrong wrote:

It is used to get the size of the specified file, also qemu_fd_getlength()
is introduced to unify the code with raw_getlength() in block/raw-posix.c

Signed-off-by: Xiao Guangrong 
---
  block/raw-posix.c|  7 +--
  include/qemu/osdep.h |  2 ++
  util/osdep.c | 31 +++
  3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 918c756..734e6dd 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1592,18 +1592,13 @@ static int64_t raw_getlength(BlockDriverState *bs)
  {
  BDRVRawState *s = bs->opaque;
  int ret;
-int64_t size;
  
  ret = fd_open(bs);

  if (ret < 0) {
  return ret;
  }
  
-size = lseek(s->fd, 0, SEEK_END);

-if (size < 0) {
-return -errno;
-}
-return size;
+return qemu_fd_getlength(s->fd);
  }
  #endif
  
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h

index dbc17dc..ca4c3fa 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -303,4 +303,6 @@ int qemu_read_password(char *buf, int buf_size);
  pid_t qemu_fork(Error **errp);
  
  size_t qemu_file_get_page_size(const char *mem_path, Error **errp);

+int64_t qemu_fd_getlength(int fd);
+size_t qemu_file_getlength(const char *file, Error **errp);
  #endif
diff --git a/util/osdep.c b/util/osdep.c
index 0092bb6..5a61e19 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -428,3 +428,34 @@ writev(int fd, const struct iovec *iov, int iov_cnt)
  return readv_writev(fd, iov, iov_cnt, true);
  }
  #endif
+
+int64_t qemu_fd_getlength(int fd)
+{
+int64_t size;
+
+size = lseek(fd, 0, SEEK_END);
+if (size < 0) {
+return -errno;
+}
+return size;
+}
+
+size_t qemu_file_getlength(const char *file, Error **errp)
+{
+int64_t size;
+int fd = qemu_open(file, O_RDONLY);
+
+if (fd < 0) {
+error_setg_file_open(errp, errno, file);
+return 0;
+}
+
+size = qemu_fd_getlength(fd);
+if (size < 0) {
+error_setg_errno(errp, -size, "can't get size of file %s", file);
+size = 0;
+}
+
+qemu_close(fd);
+return size;
+}


Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] alpha: use common noop dma ops

2015-11-02 Thread Joerg Roedel

On Fri, Oct 30, 2015 at 02:20:36PM +0100, Christian Borntraeger wrote:
> Some of the alpha pci noop dma ops are identical to the common ones.
> Use them.
> 
> Signed-off-by: Christian Borntraeger 
> ---
>  arch/alpha/kernel/pci-noop.c | 46 
> 
>  1 file changed, 4 insertions(+), 42 deletions(-)

Reviewed-by: Joerg Roedel 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] kvmtool: Cleanup kernel loading

2015-11-02 Thread Dimitri John Ledkov

On 2 November 2015 at 14:58, Will Deacon  wrote:
> On Fri, Oct 30, 2015 at 06:26:53PM +, Andre Przywara wrote:
>> Hi,
>
> Hello Andre,
>
>> this series cleans up kvmtool's kernel loading functionality a bit.
>> It has been broken out of a previous series I sent [1] and contains
>> just the cleanup and bug fix parts, which should be less controversial
>> and thus easier to merge ;-)
>> I will resend the pipe loading part later on as a separate series.
>>
>> The first patch properly abstracts kernel loading to move
>> responsibility into each architecture's code. It removes quite some
>> ugly code from the generic kvm.c file.
>> The later patches address the naive usage of read(2) to, well, read
>> data from files. Doing this without coping with the subtleties of
>> the UNIX read semantics (returning with less or none data read is not
>> an error) can provoke hard to debug failures.
>> So these patches make use of the existing and one new wrapper function
>> to make sure we read everything we actually wanted to.
>> The last patch moves the ARM kernel loading code into the proper
>> location to be in line with the other architectures.
>>
>> Please have a look and give some comments!
>
> Looks good to me, but I'd like to see some comments from some mips/ppc/x86
> people on the changes you're making over there.

Looks mostly good to me, as one of the kvmtool down streams. Over at
https://github.com/clearlinux/kvmtool we have some patches to tweak
the x86 boot flow, which will need rebasing/retweaking) specifically
this commit here -
https://github.com/clearlinux/kvmtool/commit/a8dee709f85735d16739d0eda0cc00d3c1b17477

-- 
Regards,

Dimitri.
53 sleeps till Christmas, or less

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] Provide simple noop dma ops

2015-11-02 Thread Joerg Roedel

On Fri, Oct 30, 2015 at 02:20:35PM +0100, Christian Borntraeger wrote:
> +static void *dma_noop_alloc(struct device *dev, size_t size,
> + dma_addr_t *dma_handle, gfp_t gfp,
> + struct dma_attrs *attrs)
> +{
> + void *ret;
> +
> + ret = (void *)__get_free_pages(gfp, get_order(size));
> + if (ret) {
> + memset(ret, 0, size);

There is no need to zero out the memory here. If the user wants
initialized memory it can call dma_zalloc_coherent. Having the memset
here means to clear the memory twice in the dma_zalloc_coherent path.

Otherwise it looks good.

Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 09/35] exec: allow file_ram_alloc to work on file

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 12:13, Xiao Guangrong wrote:

Currently, file_ram_alloc() only works on directory - it creates a file
under @path and do mmap on it

This patch tries to allow it to work on file directly, if @path is a


It isn't try to allow, it allows, as I understand)


directory it works as before, otherwise it treats @path as the target
file then directly allocate memory from it

Signed-off-by: Xiao Guangrong 
---
  exec.c | 80 ++
  1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/exec.c b/exec.c
index 9075f4d..db0fdaf 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,14 +1174,60 @@ void qemu_mutex_unlock_ramlist(void)
  }
  
  #ifdef __linux__

+static bool path_is_dir(const char *path)
+{
+struct stat fs;
+
+return stat(path, &fs) == 0 && S_ISDIR(fs.st_mode);
+}
+
+static int open_ram_file_path(RAMBlock *block, const char *path, size_t size)
+{
+char *filename;
+char *sanitized_name;
+char *c;
+int fd;
+
+if (!path_is_dir(path)) {
+int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
+
+flags |= O_EXCL;
+return open(path, flags);
+}
+
+/* Make name safe to use with mkstemp by replacing '/' with '_'. */
+sanitized_name = g_strdup(memory_region_name(block->mr));
+for (c = sanitized_name; *c != '\0'; c++) {
+if (*c == '/') {
+*c = '_';
+}
+}
+filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
+   sanitized_name);
+g_free(sanitized_name);


one empty line will be very nice here, and it was in master branch


+fd = mkstemp(filename);
+if (fd >= 0) {
+unlink(filename);
+/*
+ * ftruncate is not supported by hugetlbfs in older
+ * hosts, so don't bother bailing out on errors.
+ * If anything goes wrong with it under other filesystems,
+ * mmap will fail.
+ */
+if (ftruncate(fd, size)) {
+perror("ftruncate");
+}
+}
+g_free(filename);
+
+return fd;
+}
+
  static void *file_ram_alloc(RAMBlock *block,
  ram_addr_t memory,
  const char *path,
  Error **errp)
  {
-char *filename;
-char *sanitized_name;
-char *c;
  void *area;
  int fd;
  uint64_t pagesize;
@@ -1212,38 +1258,14 @@ static void *file_ram_alloc(RAMBlock *block,
  goto error;
  }
  
-/* Make name safe to use with mkstemp by replacing '/' with '_'. */

-sanitized_name = g_strdup(memory_region_name(block->mr));
-for (c = sanitized_name; *c != '\0'; c++) {
-if (*c == '/')
-*c = '_';
-}
-
-filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
-   sanitized_name);
-g_free(sanitized_name);
+memory = ROUND_UP(memory, pagesize);
  
-fd = mkstemp(filename);

+fd = open_ram_file_path(block, path, memory);
  if (fd < 0) {
  error_setg_errno(errp, errno,
   "unable to create backing store for path %s", path);
-g_free(filename);
  goto error;
  }
-unlink(filename);
-g_free(filename);
-
-memory = ROUND_UP(memory, pagesize);
-
-/*
- * ftruncate is not supported by hugetlbfs in older
- * hosts, so don't bother bailing out on errors.
- * If anything goes wrong with it under other filesystems,
- * mmap will fail.
- */
-if (ftruncate(fd, memory)) {
-perror("ftruncate");
-}
  
  area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);

  if (area == MAP_FAILED) {



Reviewed-by: Vladimir Sementsov-Ogievskiy 

--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Joerg Roedel

On Fri, Oct 30, 2015 at 11:32:06AM +0100, Arnd Bergmann wrote:
> I wonder if the 'iommu=force' attribute is too coarse-grained though,
> and if we should perhaps allow a per-device setting on architectures
> that allow this.

Yeah, definitly. Currently we only have iommu=pt to enable pass-through
mode for _all_ devices. I think it makes sense to introduce a per-device
opt-in for pass-through, but have it configured by the user and not by
the device driver.

If the user enables the IOMMU in his system, he expects to be secure
against DMA attacks. If drivers could opt-out, every protection would be
voided.

Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic

2015-11-02 Thread Radim Krcmar

2015-11-02 13:21+0100, Paolo Bonzini:
> We do not want to do too much work in atomic context, in particular
> not walking all the VCPUs of the virtual machine.  So we want
> to distinguish the architecture-specific injection function for irqfd
> from kvm_set_msi.  Since it's still empty, reuse the newly added
> kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.

kvm/queue uses kvm_arch_set_irq since b7184313f4b9 ("kvm/x86: Hyper-V
synthetic interrupt controller").

Is synic going to be dropped before this patch is merged?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] kvmtool: Cleanup kernel loading

2015-11-02 Thread Will Deacon

On Fri, Oct 30, 2015 at 06:26:53PM +, Andre Przywara wrote:
> Hi,

Hello Andre,

> this series cleans up kvmtool's kernel loading functionality a bit.
> It has been broken out of a previous series I sent [1] and contains
> just the cleanup and bug fix parts, which should be less controversial
> and thus easier to merge ;-)
> I will resend the pipe loading part later on as a separate series.
> 
> The first patch properly abstracts kernel loading to move
> responsibility into each architecture's code. It removes quite some
> ugly code from the generic kvm.c file.
> The later patches address the naive usage of read(2) to, well, read
> data from files. Doing this without coping with the subtleties of
> the UNIX read semantics (returning with less or none data read is not
> an error) can provoke hard to debug failures.
> So these patches make use of the existing and one new wrapper function
> to make sure we read everything we actually wanted to.
> The last patch moves the ARM kernel loading code into the proper
> location to be in line with the other architectures.
> 
> Please have a look and give some comments!

Looks good to me, but I'd like to see some comments from some mips/ppc/x86
people on the changes you're making over there.

Will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 08/35] exec: allow memory to be allocated from any kind of path

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 12:13, Xiao Guangrong wrote:

Currently file_ram_alloc() is designed for hugetlbfs, however, the memory
of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file
locates at DAX enabled filesystem

So this patch let it work on any kind of path


No, this patch doesn't change any logic, but only fix variable name and 
some error messages.




Signed-off-by: Xiao Guangrong 
---
  exec.c | 24 
  1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/exec.c b/exec.c
index 9de38be..9075f4d 100644
--- a/exec.c
+++ b/exec.c
@@ -1184,25 +1184,25 @@ static void *file_ram_alloc(RAMBlock *block,
  char *c;
  void *area;
  int fd;
-uint64_t hpagesize;
+uint64_t pagesize;
  Error *local_err = NULL;
  
-hpagesize = qemu_file_get_page_size(path, &local_err);

+pagesize = qemu_file_get_page_size(path, &local_err);
  if (local_err) {
  error_propagate(errp, local_err);
  goto error;
  }
  
-if (hpagesize == getpagesize()) {

-fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
+if (pagesize == getpagesize()) {
+fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");


Why do you remove path from error message? It is good additional 
information.. What if we have several memory file backends?



  }
  
-block->mr->align = hpagesize;

+block->mr->align = pagesize;
  
-if (memory < hpagesize) {

+if (memory < pagesize) {
  error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
-   "or larger than huge page size 0x%" PRIx64,
-   memory, hpagesize);
+   "or larger than page size 0x%" PRIx64,
+   memory, pagesize);
  goto error;
  }
  
@@ -1226,14 +1226,14 @@ static void *file_ram_alloc(RAMBlock *block,

  fd = mkstemp(filename);
  if (fd < 0) {
  error_setg_errno(errp, errno,
- "unable to create backing store for hugepages");
+ "unable to create backing store for path %s", path);
  g_free(filename);
  goto error;
  }
  unlink(filename);
  g_free(filename);
  
-memory = ROUND_UP(memory, hpagesize);

+memory = ROUND_UP(memory, pagesize);
  
  /*

   * ftruncate is not supported by hugetlbfs in older
@@ -1245,10 +1245,10 @@ static void *file_ram_alloc(RAMBlock *block,
  perror("ftruncate");
  }
  
-area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & RAM_SHARED);

+area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
  if (area == MAP_FAILED) {
  error_setg_errno(errp, errno,
- "unable to map backing store for hugepages");
+ "unable to map backing store for path %s", path);
  close(fd);
  goto error;
  }



--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Joerg Roedel

On Thu, Oct 29, 2015 at 09:32:32AM +0200, Shamir Rabinovitch wrote:
> For the IB case, setting and tearing DMA mappings for the drivers data buffers
> is expensive. But we could for example consider to map all the HW control 
> objects
> that validate the HW access to the drivers data buffers as IOMMU protected 
> and so
> have IOMMU protection on those critical objects while having fast 
> set-up/tear-down
> of driver data buffers. The HW control objects have stable mapping for the 
> lifetime
> of the system. So the case of using both types of DMA mappings is still valid.

How do you envision this per-mapping by-pass to work? For the DMA-API
mappings you have only one device address space. For by-pass to work,
you need to map all physical memory of the machine into this space,
which leaves the question where you want to put the window for
remapping.

You surely have to put it after the physical mappings, but any
protection will be gone, as the device can access all physical memory.

So instead of working around the shortcomings of DMA-API
implementations, can you present us some numbers and analysis of how bad
the performance impact with an IOMMU is and what causes it?

I know that we have lock-contention issues in our IOMMU drivers, which
can be fixed. Maybe the performance problems you are seeing can be fixed
too, when you give us more details about them.

Joerg
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 20/35] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 16:08, Xiao Guangrong wrote:



On 11/02/2015 08:19 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Current code did not check the return value of get_memory_region as it
assumed the backend memory of pc-dimm is always properly initialized,
we make get_memory_region internally catch the case if something is
wrong


but here you call not pc-dimm's get_memory_region, but common 
ddc->get_memory_region, which may be nvdimm or possibly other future 
dimm, so, why not check it here? And than pc_dimm_get_memory_region may 
be left untouched (error_abort is ok, because errp is unused).




Signed-off-by: Xiao Guangrong 
---
  hw/mem/dimm.c|  3 ++-
  hw/mem/pc-dimm.c | 12 +++-
  2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 4a63409..498d380 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -377,8 +377,9 @@ static void dimm_get_size(Object *obj, Visitor 
*v, void *opaque,

  int64_t value;
  MemoryRegion *mr;
  DIMMDevice *dimm = DIMM(obj);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(obj);
-mr = host_memory_backend_get_memory(dimm->hostmem, errp);
+mr = ddc->get_memory_region(dimm);
  value = memory_region_size(mr);
  visit_type_int(v, &value, name, errp);
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 38323e9..e6b6a9f 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -22,7 +22,17 @@
  static MemoryRegion *pc_dimm_get_memory_region(DIMMDevice *dimm)
  {
-return host_memory_backend_get_memory(dimm->hostmem, 
&error_abort);

+Error *local_err = NULL;
+MemoryRegion *mr;
+
+mr = host_memory_backend_get_memory(dimm->hostmem, &local_err);
+
+/*
+ * plug a pc-dimm device whose backend memory was not properly
+ * initialized?
+ */
+assert(!local_err && mr);
+return mr;
  }


this should squashed into previous patch, I think


You mean merger this patch with 19/35 (dimm: abstract dimm device from 
pc-dimm)?
The 19/35 mostly ‘moves’ the things, this one changes the core logic, 
it is not

a big deal. :D


stupid me, you are right)






  static void pc_dimm_class_init(ObjectClass *oc, void *data)


I've discovered suddenly, that

MemoryRegion *
host_memory_backend_get_memory(HostMemoryBackend *backend, Error **errp)
{
 return memory_region_size(&backend->mr) ? &backend->mr : NULL;
}

- it doesn't use errp at all. In my opinion, this should be fixed 
globally, by deleting useless
parameter in separate patch. Or just squash your function into 
previous patch.




Yup, this is a globally interface so i prefer to make a separate patch 
to do the

cleanup after this patchset.




--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 07/35] util: introduce qemu_file_get_page_size()

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 12:13, Xiao Guangrong wrote:

There are three places use the some logic to get the page size on
the file path or file fd

Windows did not support file hugepage, so it will return normal page
for this case. And this interface has not been used on windows so far
  
This patch introduces qemu_file_get_page_size() to unify the code


Signed-off-by: Xiao Guangrong 
---
  exec.c   | 33 ++---
  include/qemu/osdep.h |  1 +
  target-ppc/kvm.c | 23 +--
  util/oslib-posix.c   | 37 +
  util/oslib-win32.c   |  5 +
  5 files changed, 50 insertions(+), 49 deletions(-)

diff --git a/exec.c b/exec.c
index 8af2570..9de38be 100644
--- a/exec.c
+++ b/exec.c
@@ -1174,32 +1174,6 @@ void qemu_mutex_unlock_ramlist(void)
  }
  
  #ifdef __linux__

-
-#include 
-
-#define HUGETLBFS_MAGIC   0x958458f6
-
-static long gethugepagesize(const char *path, Error **errp)
-{
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(path, &fs);
-} while (ret != 0 && errno == EINTR);
-
-if (ret != 0) {
-error_setg_errno(errp, errno, "failed to get page size of file %s",
- path);
-return 0;
-}
-
-if (fs.f_type != HUGETLBFS_MAGIC)
-fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
-
-return fs.f_bsize;
-}
-
  static void *file_ram_alloc(RAMBlock *block,
  ram_addr_t memory,
  const char *path,
@@ -1213,11 +1187,16 @@ static void *file_ram_alloc(RAMBlock *block,
  uint64_t hpagesize;
  Error *local_err = NULL;
  
-hpagesize = gethugepagesize(path, &local_err);

+hpagesize = qemu_file_get_page_size(path, &local_err);
  if (local_err) {
  error_propagate(errp, local_err);
  goto error;
  }
+
+if (hpagesize == getpagesize()) {
+fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
+}
+
  block->mr->align = hpagesize;
  
  if (memory < hpagesize) {

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index b568424..dbc17dc 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -302,4 +302,5 @@ int qemu_read_password(char *buf, int buf_size);
   */
  pid_t qemu_fork(Error **errp);
  
+size_t qemu_file_get_page_size(const char *mem_path, Error **errp);

  #endif
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index ac70f08..d8760ea 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -308,28 +308,15 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct 
kvm_ppc_smmu_info *info)
  
  static long gethugepagesize(const char *mem_path)

  {
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(mem_path, &fs);
-} while (ret != 0 && errno == EINTR);
+Error *local_err = NULL;
+long size = qemu_file_get_page_size(mem_path, local_err);
  
-if (ret != 0) {

-fprintf(stderr, "Couldn't statfs() memory path: %s\n",
-strerror(errno));
+if (local_err) {
+error_report_err(local_err);
  exit(1);
  }
  
-#define HUGETLBFS_MAGIC   0x958458f6

-
-if (fs.f_type != HUGETLBFS_MAGIC) {
-/* Explicit mempath, but it's ordinary pages */
-return getpagesize();
-}
-
-/* It's hugepage, return the huge page size */
-return fs.f_bsize;
+return size;
  }
  
  static int find_max_supported_pagesize(Object *obj, void *opaque)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 914cef5..51437ff 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -340,7 +340,7 @@ static void sigbus_handler(int signal)
  siglongjmp(sigjump, 1);
  }
  
-static size_t fd_getpagesize(int fd)

+static size_t fd_getpagesize(int fd, Error **errp)
  {
  #ifdef CONFIG_LINUX
  struct statfs fs;
@@ -351,7 +351,12 @@ static size_t fd_getpagesize(int fd)
  ret = fstatfs(fd, &fs);
  } while (ret != 0 && errno == EINTR);
  
-if (ret == 0 && fs.f_type == HUGETLBFS_MAGIC) {

+if (ret) {
+error_setg_errno(errp, errno, "fstatfs is failed");
+return 0;
+}
+
+if (fs.f_type == HUGETLBFS_MAGIC) {
  return fs.f_bsize;
  }
  }
@@ -360,6 +365,22 @@ static size_t fd_getpagesize(int fd)
  return getpagesize();
  }
  
+size_t qemu_file_get_page_size(const char *path, Error **errp)

+{
+size_t size = 0;
+int fd = qemu_open(path, O_RDONLY);
+
+if (fd < 0) {
+error_setg_file_open(errp, errno, path);
+goto exit;
+}
+
+size = fd_getpagesize(fd, errp);
+qemu_close(fd);
+exit:
+return size;
+}
+
  void os_mem_prealloc(int fd, char *area, size_t memory)
  {
  int ret;
@@ -387,8 +408,16 @@ void os_mem_prealloc(int fd, char *area, size_t memory)
  exit(1);
  } else {
  int i;
-size_t hpagesize = fd_getpagesize(fd);
-size_t numpages = DIV_ROUND_UP(memory, hp

Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated

2015-11-02 Thread Haozhong Zhang

On Mon, Nov 02, 2015 at 09:40:18AM +, James Hogan wrote:
> On Mon, Nov 02, 2015 at 05:26:42PM +0800, Haozhong Zhang wrote:
> > The value of the migrated vcpu's TSC rate is determined as below.
> >  1. If a TSC rate is specified by the cpu option 'tsc-freq', then this
> > user-specified value will be used.
> >  2. If neither a user-specified TSC rate nor a migrated TSC rate is
> > present, we will use the TSC rate from KVM (returned by
> > KVM_GET_TSC_KHZ).
> >  3. Otherwise, we will use the migrated TSC rate.
> > 
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  include/sysemu/kvm.h |  2 ++
> >  kvm-all.c|  1 +
> >  target-arm/kvm.c |  5 +
> >  target-i386/kvm.c| 33 +
> >  target-mips/kvm.c|  5 +
> >  target-ppc/kvm.c |  5 +
> >  target-s390x/kvm.c   |  5 +
> >  7 files changed, 56 insertions(+)
> > 
> > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> > index 461ef65..0ec8b98 100644
> > --- a/include/sysemu/kvm.h
> > +++ b/include/sysemu/kvm.h
> > @@ -328,6 +328,8 @@ int kvm_arch_fixup_msi_route(struct 
> > kvm_irq_routing_entry *route,
> >  
> >  int kvm_arch_msi_data_to_gsi(uint32_t data);
> >  
> > +int kvm_arch_setup_tsc_khz(CPUState *cpu);
> > +
> >  int kvm_set_irq(KVMState *s, int irq, int level);
> >  int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg);
> >  
> > diff --git a/kvm-all.c b/kvm-all.c
> > index c442838..1ecaf04 100644
> > --- a/kvm-all.c
> > +++ b/kvm-all.c
> > @@ -1757,6 +1757,7 @@ static void do_kvm_cpu_synchronize_post_init(void 
> > *arg)
> >  {
> >  CPUState *cpu = arg;
> >  
> > +kvm_arch_setup_tsc_khz(cpu);
> 
> Sorry if this is a stupid question, but why aren't you doing this from
> the i386 kvm_arch_put_registers when level == KVM_PUT_FULL_STATE, rather
> than introducing x86 specifics to the generic KVM api?
> 
> Cheers
> James
>

Yes, I could call kvm_arch_setup_tsc_khz() in kvm_arch_put_registers()
of target-i386 when level == KVM_PUT_FULL_STATE, so that I will not need
to make kvm_arch_setup_tsc_khz() a generic KVM API (which looks weird
for targets other than i386).

Thanks, James!

Haozhong

> >  kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
> >  cpu->kvm_vcpu_dirty = false;
> >  }
> > diff --git a/target-arm/kvm.c b/target-arm/kvm.c
> > index 79ef4c6..a724f6d 100644
> > --- a/target-arm/kvm.c
> > +++ b/target-arm/kvm.c
> > @@ -614,3 +614,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
> >  {
> >  return (data - 32) & 0x;
> >  }
> > +
> > +int kvm_arch_setup_tsc_khz(CPUState *cs)
> > +{
> > +return 0;
> > +}
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 64046cb..aae5e58 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -3034,3 +3034,36 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
> >  {
> >  abort();
> >  }
> > +
> > +int kvm_arch_setup_tsc_khz(CPUState *cs)
> > +{
> > +X86CPU *cpu = X86_CPU(cs);
> > +CPUX86State *env = &cpu->env;
> > +int r;
> > +
> > +/*
> > + * Prepare vcpu's TSC rate to be migrated.
> > + *
> > + * - If the user specifies the TSC rate by cpu option 'tsc-freq',
> > + *   we will use the user-specified value.
> > + *
> > + * - If there is neither user-specified TSC rate nor migrated TSC
> > + *   rate, we will ask KVM for the TSC rate by calling
> > + *   KVM_GET_TSC_KHZ.
> > + *
> > + * - Otherwise, if there is a migrated TSC rate, we will use the
> > + *   migrated value.
> > + */
> > +if (env->tsc_khz) {
> > +env->tsc_khz_saved = env->tsc_khz;
> > +} else if (!env->tsc_khz_saved) {
> > +r = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ);
> > +if (r < 0) {
> > +fprintf(stderr, "KVM_GET_TSC_KHZ failed\n");
> > +return r;
> > +}
> > +env->tsc_khz_saved = r;
> > +}
> > +
> > +return 0;
> > +}
> > diff --git a/target-mips/kvm.c b/target-mips/kvm.c
> > index 12d7db3..fb26d7e 100644
> > --- a/target-mips/kvm.c
> > +++ b/target-mips/kvm.c
> > @@ -687,3 +687,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
> >  {
> >  abort();
> >  }
> > +
> > +int kvm_arch_setup_tsc_khz(CPUState *cs)
> > +{
> > +return 0;
> > +}
> > diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> > index ac70f08..c429f0c 100644
> > --- a/target-ppc/kvm.c
> > +++ b/target-ppc/kvm.c
> > @@ -2510,3 +2510,8 @@ int kvmppc_enable_hwrng(void)
> >  
> >  return kvmppc_enable_hcall(kvm_state, H_RANDOM);
> >  }
> > +
> > +int kvm_arch_setup_tsc_khz(CPUState *cs)
> > +{
> > +return 0;
> > +}
> > diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
> > index c3be180..db5d436 100644
> > --- a/target-s390x/kvm.c
> > +++ b/target-s390x/kvm.c
> > @@ -2248,3 +2248,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
> >  {
> >  abort();
> >  }
> > +
> > +int kvm_arch_setup_tsc_khz(CPUState *cs)
> > +{
> > +return 0;
> > +}
> > -- 
> > 2.4.8
> > 


--
To unsubscribe from this l

Re: [PATCH v7 20/35] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

2015-11-02 Thread Xiao Guangrong




On 11/02/2015 08:19 PM, Vladimir Sementsov-Ogievskiy wrote:

On 02.11.2015 12:13, Xiao Guangrong wrote:

Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Current code did not check the return value of get_memory_region as it
assumed the backend memory of pc-dimm is always properly initialized,
we make get_memory_region internally catch the case if something is
wrong

Signed-off-by: Xiao Guangrong 
---
  hw/mem/dimm.c|  3 ++-
  hw/mem/pc-dimm.c | 12 +++-
  2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 4a63409..498d380 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -377,8 +377,9 @@ static void dimm_get_size(Object *obj, Visitor *v, void 
*opaque,
  int64_t value;
  MemoryRegion *mr;
  DIMMDevice *dimm = DIMM(obj);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(obj);
-mr = host_memory_backend_get_memory(dimm->hostmem, errp);
+mr = ddc->get_memory_region(dimm);
  value = memory_region_size(mr);
  visit_type_int(v, &value, name, errp);
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 38323e9..e6b6a9f 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -22,7 +22,17 @@
  static MemoryRegion *pc_dimm_get_memory_region(DIMMDevice *dimm)
  {
-return host_memory_backend_get_memory(dimm->hostmem, &error_abort);
+Error *local_err = NULL;
+MemoryRegion *mr;
+
+mr = host_memory_backend_get_memory(dimm->hostmem, &local_err);
+
+/*
+ * plug a pc-dimm device whose backend memory was not properly
+ * initialized?
+ */
+assert(!local_err && mr);
+return mr;
  }


this should squashed into previous patch, I think


You mean merger this patch with 19/35 (dimm: abstract dimm device from pc-dimm)?
The 19/35 mostly ‘moves’ the things, this one changes the core logic, it is not
a big deal. :D




  static void pc_dimm_class_init(ObjectClass *oc, void *data)


I've discovered suddenly, that

MemoryRegion *
host_memory_backend_get_memory(HostMemoryBackend *backend, Error **errp)
{
 return memory_region_size(&backend->mr) ? &backend->mr : NULL;
}

- it doesn't use errp at all. In my opinion, this should be fixed globally, by 
deleting useless
parameter in separate patch. Or just squash your function into previous patch.



Yup, this is a globally interface so i prefer to make a separate patch to do the
cleanup after this patchset.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GIT PULL] Please pull my kvm-ppc-next branch

2015-11-02 Thread Paolo Bonzini

On 26/10/2015 05:17, Paul Mackerras wrote:
> Paolo,
> 
> Here is my current patch queue for KVM on PPC.  There's nothing much
> in the way of new features this time; it's mostly bug fixes, plus
> Nikunj has implemented support for KVM_CAP_NR_MEMSLOTS.  These are
> intended for the "next" branch of the KVM tree.  Please pull.
> 
> Thanks,
> Paul.

Pulled, thanks.

Paolo

> The following changes since commit 9ffecb10283508260936b96022d4ee43a7798b4c:
> 
>   Linux 4.3-rc3 (2015-09-27 07:50:08 -0400)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc.git 
> kvm-ppc-next
> 
> for you to fetch changes up to 70aa3961a196ac32baf54032b2051bac9a941118:
> 
>   KVM: PPC: Book3S HV: Handle H_DOORBELL on the guest exit path (2015-10-21 
> 16:31:52 +1100)
> 
> 
> Andrzej Hajda (1):
>   KVM: PPC: e500: fix handling local_sid_lookup result
> 
> Gautham R. Shenoy (1):
>   KVM: PPC: Book3S HV: Handle H_DOORBELL on the guest exit path
> 
> Mahesh Salgaonkar (1):
>   KVM: PPC: Book3S HV: Deliver machine check with MSR(RI=0) to guest as 
> MCE
> 
> Nikunj A Dadhania (1):
>   KVM: PPC: Implement extension to report number of memslots
> 
> Paul Mackerras (2):
>   KVM: PPC: Book3S HV: Don't fall back to smaller HPT size in allocation 
> ioctl
>   KVM: PPC: Book3S HV: Make H_REMOVE return correct HPTE value for absent 
> HPTEs
> 
> Tudor Laurentiu (3):
>   powerpc/e6500: add TMCFG0 register definition
>   KVM: PPC: e500: Emulate TMCFG0 TMRN register
>   KVM: PPC: e500: fix couple of shift operations on 64 bits
> 
>  arch/powerpc/include/asm/disassemble.h  |  5 +
>  arch/powerpc/include/asm/reg_booke.h|  6 ++
>  arch/powerpc/kvm/book3s_64_mmu_hv.c |  3 ++-
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c |  2 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 29 ++---
>  arch/powerpc/kvm/e500.c |  3 ++-
>  arch/powerpc/kvm/e500_emulate.c | 19 +++
>  arch/powerpc/kvm/e500_mmu_host.c|  4 ++--
>  arch/powerpc/kvm/powerpc.c  |  3 +++
>  9 files changed, 63 insertions(+), 11 deletions(-)
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kvm-unit-tests PATCH] x86: hyperv_synic: Hyper-V SynIC test

2015-11-02 Thread Roman Kagan

On Mon, Nov 02, 2015 at 01:16:02PM +0100, Paolo Bonzini wrote:
> On 26/10/2015 10:56, Andrey Smetanin wrote:
> > Hyper-V SynIC is a Hyper-V synthetic interrupt controller.
> > 
> > The test runs on every vCPU and performs the following steps:
> > * read from all Hyper-V SynIC MSR's
> > * setup Hyper-V SynIC evt/msg pages
> > * setup SINT's routing
> > * inject SINT's into destination vCPU by 'hyperv-synic-test-device'
> > * wait for SINT's isr's completion
> > * clear Hyper-V SynIC evt/msg pages and destroy SINT's routing
> > 
> > Signed-off-by: Andrey Smetanin 
> > Reviewed-by: Roman Kagan 
> > Signed-off-by: Denis V. Lunev 
> > CC: Vitaly Kuznetsov 
> > CC: "K. Y. Srinivasan" 
> > CC: Gleb Natapov 
> > CC: Paolo Bonzini 
> > CC: Roman Kagan 
> > CC: Denis V. Lunev 
> > CC: qemu-de...@nongnu.org
> > CC: virtualizat...@lists.linux-foundation.org
> 
> Bad news.
> 
> The test breaks with APICv, because of the following sequence of events:

Thanks for testing and analyzing this!

(... running around looking for an APICv-capable machine to be able to
catch this ourselves before we resubmit ...)

> The question then is... does Hyper-V actually use auto-EOI interrupts?
> If it doesn't, we might as well not implement them... :/

As Den wrote, we've yet to see a hyperv device which doesn't :(

Roman.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] KVM: device assignment: remove pointless #ifdefs

2015-11-02 Thread Paolo Bonzini

The symbols are always defined.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/assigned-dev.c | 25 -
 1 file changed, 25 deletions(-)

diff --git a/arch/x86/kvm/assigned-dev.c b/arch/x86/kvm/assigned-dev.c
index d090ecf08809..1c17ee807ef7 100644
--- a/arch/x86/kvm/assigned-dev.c
+++ b/arch/x86/kvm/assigned-dev.c
@@ -131,7 +131,6 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, 
void *dev_id)
return IRQ_HANDLED;
 }
 
-#ifdef __KVM_HAVE_MSI
 static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
 {
struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
@@ -150,9 +149,7 @@ static irqreturn_t kvm_assigned_dev_thread_msi(int irq, 
void *dev_id)
 
return IRQ_HANDLED;
 }
-#endif
 
-#ifdef __KVM_HAVE_MSIX
 static irqreturn_t kvm_assigned_dev_msix(int irq, void *dev_id)
 {
struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
@@ -183,7 +180,6 @@ static irqreturn_t kvm_assigned_dev_thread_msix(int irq, 
void *dev_id)
 
return IRQ_HANDLED;
 }
-#endif
 
 /* Ack the irq line for an assigned device */
 static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
@@ -386,7 +382,6 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
return 0;
 }
 
-#ifdef __KVM_HAVE_MSI
 static int assigned_device_enable_host_msi(struct kvm *kvm,
   struct kvm_assigned_dev_kernel *dev)
 {
@@ -408,9 +403,7 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
 
return 0;
 }
-#endif
 
-#ifdef __KVM_HAVE_MSIX
 static int assigned_device_enable_host_msix(struct kvm *kvm,
struct kvm_assigned_dev_kernel *dev)
 {
@@ -443,8 +436,6 @@ err:
return r;
 }
 
-#endif
-
 static int assigned_device_enable_guest_intx(struct kvm *kvm,
struct kvm_assigned_dev_kernel *dev,
struct kvm_assigned_irq *irq)
@@ -454,7 +445,6 @@ static int assigned_device_enable_guest_intx(struct kvm 
*kvm,
return 0;
 }
 
-#ifdef __KVM_HAVE_MSI
 static int assigned_device_enable_guest_msi(struct kvm *kvm,
struct kvm_assigned_dev_kernel *dev,
struct kvm_assigned_irq *irq)
@@ -463,9 +453,7 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
dev->ack_notifier.gsi = -1;
return 0;
 }
-#endif
 
-#ifdef __KVM_HAVE_MSIX
 static int assigned_device_enable_guest_msix(struct kvm *kvm,
struct kvm_assigned_dev_kernel *dev,
struct kvm_assigned_irq *irq)
@@ -474,7 +462,6 @@ static int assigned_device_enable_guest_msix(struct kvm 
*kvm,
dev->ack_notifier.gsi = -1;
return 0;
 }
-#endif
 
 static int assign_host_irq(struct kvm *kvm,
   struct kvm_assigned_dev_kernel *dev,
@@ -492,16 +479,12 @@ static int assign_host_irq(struct kvm *kvm,
case KVM_DEV_IRQ_HOST_INTX:
r = assigned_device_enable_host_intx(kvm, dev);
break;
-#ifdef __KVM_HAVE_MSI
case KVM_DEV_IRQ_HOST_MSI:
r = assigned_device_enable_host_msi(kvm, dev);
break;
-#endif
-#ifdef __KVM_HAVE_MSIX
case KVM_DEV_IRQ_HOST_MSIX:
r = assigned_device_enable_host_msix(kvm, dev);
break;
-#endif
default:
r = -EINVAL;
}
@@ -534,16 +517,12 @@ static int assign_guest_irq(struct kvm *kvm,
case KVM_DEV_IRQ_GUEST_INTX:
r = assigned_device_enable_guest_intx(kvm, dev, irq);
break;
-#ifdef __KVM_HAVE_MSI
case KVM_DEV_IRQ_GUEST_MSI:
r = assigned_device_enable_guest_msi(kvm, dev, irq);
break;
-#endif
-#ifdef __KVM_HAVE_MSIX
case KVM_DEV_IRQ_GUEST_MSIX:
r = assigned_device_enable_guest_msix(kvm, dev, irq);
break;
-#endif
default:
r = -EINVAL;
}
@@ -826,7 +805,6 @@ out:
 }
 
 
-#ifdef __KVM_HAVE_MSIX
 static int kvm_vm_ioctl_set_msix_nr(struct kvm *kvm,
struct kvm_assigned_msix_nr *entry_nr)
 {
@@ -906,7 +884,6 @@ msix_entry_out:
 
return r;
 }
-#endif
 
 static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm,
struct kvm_assigned_pci_dev *assigned_dev)
@@ -1012,7 +989,6 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, 
unsigned ioctl,
goto out;
break;
}
-#ifdef __KVM_HAVE_MSIX
case KVM_ASSIGN_SET_MSIX_NR: {
struct kvm_assigned_msix_nr entry_nr;
r = -EFAULT;
@@ -1033,7 +1009,6 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, 
unsigned ioctl,
goto out;
break;
}
-#endif
case KVM_ASSIGN_SET_INTX_MASK: {
struct kvm_assigned_pci_dev assigned_dev;
 
-- 
1.8.3.1


--
To unsubscribe from this list:

Re: [kvm-unit-tests PATCH] x86: hyperv_synic: Hyper-V SynIC test

2015-11-02 Thread Paolo Bonzini



On 02/11/2015 13:18, Denis V. Lunev wrote:
>> I'm keeping the kernel patches queued for my own testing, but this of
>> course has to be fixed before including them---which will delay this
>> feature to 4.5, unfortunately.
> 
> well, the problem is that it actually uses auto EOI

Ok, no big deal.  We can just disable APICv when SynIC is enabled.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/3] KVM: x86: clean up interrupt injection

2015-11-02 Thread Paolo Bonzini

Legacy device assignment attempted to only do lightweight work when
injecting interrupts from atomic context.  This will be important
if we let VFIO inject interrupts from a non-threaded interrupt handler.
This series lets irqfd ditinguish between atomic-context and generic
interrupt injection.

Patch 1 is the real change, everything else cleans up what's left behind.

Paolo

Paolo Bonzini (3):
  KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
  KVM: device assignment: remove pointless #ifdefs
  KVM: x86: move kvm_set_irq_inatomic to legacy device assignment

 arch/x86/kvm/assigned-dev.c | 62 +++--
 arch/x86/kvm/irq_comm.c | 44 +---
 include/linux/kvm_host.h|  8 +++---
 virt/kvm/eventfd.c  | 11 +++-
 4 files changed, 50 insertions(+), 75 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] KVM: x86: move kvm_set_irq_inatomic to legacy device assignment

2015-11-02 Thread Paolo Bonzini

The function is not used outside device assignment, and
kvm_arch_set_irq_inatomic has a different prototype.  Move it here and
make it static to avoid confusion.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/assigned-dev.c | 37 +
 arch/x86/kvm/irq_comm.c | 34 --
 include/linux/kvm_host.h|  1 -
 3 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/assigned-dev.c b/arch/x86/kvm/assigned-dev.c
index 1c17ee807ef7..9dc091acd5fb 100644
--- a/arch/x86/kvm/assigned-dev.c
+++ b/arch/x86/kvm/assigned-dev.c
@@ -21,6 +21,7 @@
 #include 
 #include "irq.h"
 #include "assigned-dev.h"
+#include "trace/events/kvm.h"
 
 struct kvm_assigned_dev_kernel {
struct kvm_irq_ack_notifier ack_notifier;
@@ -131,6 +132,42 @@ static irqreturn_t kvm_assigned_dev_thread_intx(int irq, 
void *dev_id)
return IRQ_HANDLED;
 }
 
+/*
+ * Deliver an IRQ in an atomic context if we can, or return a failure,
+ * user can retry in a process context.
+ * Return value:
+ *  -EWOULDBLOCK - Can't deliver in atomic context: retry in a process context.
+ *  Other values - No need to retry.
+ */
+static int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq,
+   int level)
+{
+   struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
+   struct kvm_kernel_irq_routing_entry *e;
+   int ret = -EINVAL;
+   int idx;
+
+   trace_kvm_set_irq(irq, level, irq_source_id);
+
+   /*
+* Injection into either PIC or IOAPIC might need to scan all CPUs,
+* which would need to be retried from thread context;  when same GSI
+* is connected to both PIC and IOAPIC, we'd have to report a
+* partial failure here.
+* Since there's no easy way to do this, we only support injecting MSI
+* which is limited to 1:1 GSI mapping.
+*/
+   idx = srcu_read_lock(&kvm->irq_srcu);
+   if (kvm_irq_map_gsi(kvm, entries, irq) > 0) {
+   e = &entries[0];
+   ret = kvm_arch_set_irq_inatomic(e, kvm, irq_source_id,
+   irq, level);
+   }
+   srcu_read_unlock(&kvm->irq_srcu, idx);
+   return ret;
+}
+
+
 static irqreturn_t kvm_assigned_dev_msi(int irq, void *dev_id)
 {
struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 26d830c1b7af..35b0dfb7f704 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -142,40 +142,6 @@ int kvm_arch_set_irq_inatomic(struct 
kvm_kernel_irq_routing_entry *e,
return -EWOULDBLOCK;
 }
 
-/*
- * Deliver an IRQ in an atomic context if we can, or return a failure,
- * user can retry in a process context.
- * Return value:
- *  -EWOULDBLOCK - Can't deliver in atomic context: retry in a process context.
- *  Other values - No need to retry.
- */
-int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int 
level)
-{
-   struct kvm_kernel_irq_routing_entry entries[KVM_NR_IRQCHIPS];
-   struct kvm_kernel_irq_routing_entry *e;
-   int ret = -EINVAL;
-   int idx;
-
-   trace_kvm_set_irq(irq, level, irq_source_id);
-
-   /*
-* Injection into either PIC or IOAPIC might need to scan all CPUs,
-* which would need to be retried from thread context;  when same GSI
-* is connected to both PIC and IOAPIC, we'd have to report a
-* partial failure here.
-* Since there's no easy way to do this, we only support injecting MSI
-* which is limited to 1:1 GSI mapping.
-*/
-   idx = srcu_read_lock(&kvm->irq_srcu);
-   if (kvm_irq_map_gsi(kvm, entries, irq) > 0) {
-   e = &entries[0];
-   ret = kvm_arch_set_irq_inatomic(e, kvm, irq_source_id,
-   irq, level);
-   }
-   srcu_read_unlock(&kvm->irq_srcu, idx);
-   return ret;
-}
-
 int kvm_request_irq_source_id(struct kvm *kvm)
 {
unsigned long *bitmap = &kvm->arch.irq_sources_bitmap;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 42fb9e089fc9..e5d517a9d922 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -825,7 +825,6 @@ int kvm_irq_map_chip_pin(struct kvm *kvm, unsigned irqchip, 
unsigned pin);
 
 int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
bool line_status);
-int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int 
level);
 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm 
*kvm,
int irq_source_id, int level, bool line_status);
 int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/m

[PATCH 1/3] KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic

2015-11-02 Thread Paolo Bonzini

We do not want to do too much work in atomic context, in particular
not walking all the VCPUs of the virtual machine.  So we want
to distinguish the architecture-specific injection function for irqfd
from kvm_set_msi.  Since it's still empty, reuse the newly added
kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/irq_comm.c  | 14 --
 include/linux/kvm_host.h |  7 +++
 virt/kvm/eventfd.c   | 11 ---
 3 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 8f4499c7ffc1..26d830c1b7af 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -124,12 +124,16 @@ int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 }
 
 
-static int kvm_set_msi_inatomic(struct kvm_kernel_irq_routing_entry *e,
-struct kvm *kvm)
+int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
+ struct kvm *kvm, int irq_source_id, int level,
+ bool line_status)
 {
struct kvm_lapic_irq irq;
int r;
 
+   if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
+   return -EWOULDBLOCK;
+
kvm_set_msi_irq(e, &irq);
 
if (kvm_irq_delivery_to_apic_fast(kvm, NULL, &irq, &r, NULL))
@@ -165,10 +169,8 @@ int kvm_set_irq_inatomic(struct kvm *kvm, int 
irq_source_id, u32 irq, int level)
idx = srcu_read_lock(&kvm->irq_srcu);
if (kvm_irq_map_gsi(kvm, entries, irq) > 0) {
e = &entries[0];
-   if (likely(e->type == KVM_IRQ_ROUTING_MSI))
-   ret = kvm_set_msi_inatomic(e, kvm);
-   else
-   ret = -EWOULDBLOCK;
+   ret = kvm_arch_set_irq_inatomic(e, kvm, irq_source_id,
+   irq, level);
}
srcu_read_unlock(&kvm->irq_srcu, idx);
return ret;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index eba9caebc9c1..42fb9e089fc9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -828,10 +828,9 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq, int level,
 int kvm_set_irq_inatomic(struct kvm *kvm, int irq_source_id, u32 irq, int 
level);
 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *irq_entry, struct kvm 
*kvm,
int irq_source_id, int level, bool line_status);
-
-int kvm_arch_set_irq(struct kvm_kernel_irq_routing_entry *irq, struct kvm *kvm,
-int irq_source_id, int level, bool line_status);
-
+int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
+  struct kvm *kvm, int irq_source_id,
+  int level, bool line_status);
 bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin);
 void kvm_notify_acked_gsi(struct kvm *kvm, int gsi);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin);
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index e29fd2640709..46dbc0a7dfc1 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -171,7 +171,7 @@ irqfd_deactivate(struct kvm_kernel_irqfd *irqfd)
queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
 }
 
-int __attribute__((weak)) kvm_arch_set_irq(
+int __attribute__((weak)) kvm_arch_set_irq_inatomic(
struct kvm_kernel_irq_routing_entry *irq,
struct kvm *kvm, int irq_source_id,
int level,
@@ -201,12 +201,9 @@ irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, 
void *key)
irq = irqfd->irq_entry;
} while (read_seqcount_retry(&irqfd->irq_entry_sc, seq));
/* An event has been signaled, inject an interrupt */
-   if (irq.type == KVM_IRQ_ROUTING_MSI)
-   kvm_set_msi(&irq, kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 1,
-   false);
-   else if (kvm_arch_set_irq(&irq, kvm,
- KVM_USERSPACE_IRQ_SOURCE_ID, 1,
- false) == -EWOULDBLOCK)
+   if (kvm_arch_set_irq_inatomic(&irq, kvm,
+ KVM_USERSPACE_IRQ_SOURCE_ID, 1,
+ false) == -EWOULDBLOCK)
schedule_work(&irqfd->inject);
srcu_read_unlock(&kvm->irq_srcu, idx);
}
-- 
1.8.3.1


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 20/35] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

2015-11-02 Thread Vladimir Sementsov-Ogievskiy


On 02.11.2015 12:13, Xiao Guangrong wrote:

Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Current code did not check the return value of get_memory_region as it
assumed the backend memory of pc-dimm is always properly initialized,
we make get_memory_region internally catch the case if something is
wrong

Signed-off-by: Xiao Guangrong 
---
  hw/mem/dimm.c|  3 ++-
  hw/mem/pc-dimm.c | 12 +++-
  2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 4a63409..498d380 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -377,8 +377,9 @@ static void dimm_get_size(Object *obj, Visitor *v, void 
*opaque,
  int64_t value;
  MemoryRegion *mr;
  DIMMDevice *dimm = DIMM(obj);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(obj);
  
-mr = host_memory_backend_get_memory(dimm->hostmem, errp);

+mr = ddc->get_memory_region(dimm);
  value = memory_region_size(mr);
  
  visit_type_int(v, &value, name, errp);

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 38323e9..e6b6a9f 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -22,7 +22,17 @@
  
  static MemoryRegion *pc_dimm_get_memory_region(DIMMDevice *dimm)

  {
-return host_memory_backend_get_memory(dimm->hostmem, &error_abort);
+Error *local_err = NULL;
+MemoryRegion *mr;
+
+mr = host_memory_backend_get_memory(dimm->hostmem, &local_err);
+
+/*
+ * plug a pc-dimm device whose backend memory was not properly
+ * initialized?
+ */
+assert(!local_err && mr);
+return mr;
  }


this should squashed into previous patch, I think

  
  static void pc_dimm_class_init(ObjectClass *oc, void *data)


I've discovered suddenly, that

MemoryRegion *
host_memory_backend_get_memory(HostMemoryBackend *backend, Error **errp)
{
return memory_region_size(&backend->mr) ? &backend->mr : NULL;
}

- it doesn't use errp at all. In my opinion, this should be fixed 
globally, by deleting useless parameter in separate patch. Or just 
squash your function into previous patch.


--
Best regards,
Vladimir
* now, @virtuozzo.com instead of @parallels.com. Sorry for this inconvenience.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kvm-unit-tests PATCH] x86: hyperv_synic: Hyper-V SynIC test

2015-11-02 Thread Denis V. Lunev


On 11/02/2015 03:16 PM, Paolo Bonzini wrote:

On 26/10/2015 10:56, Andrey Smetanin wrote:

Hyper-V SynIC is a Hyper-V synthetic interrupt controller.

The test runs on every vCPU and performs the following steps:
* read from all Hyper-V SynIC MSR's
* setup Hyper-V SynIC evt/msg pages
* setup SINT's routing
* inject SINT's into destination vCPU by 'hyperv-synic-test-device'
* wait for SINT's isr's completion
* clear Hyper-V SynIC evt/msg pages and destroy SINT's routing

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Vitaly Kuznetsov 
CC: "K. Y. Srinivasan" 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: Roman Kagan 
CC: Denis V. Lunev 
CC: qemu-de...@nongnu.org
CC: virtualizat...@lists.linux-foundation.org

Bad news.

The test breaks with APICv, because of the following sequence of events:

1) non-auto-EOI interrupt 176 is injected into IRR and ISR

2) The PPR register is now 176

3) auto-EOI interrupt 179 is injected into IRR only, because (179 &
0xf0) <= (PPR & 0xf0)

4) interrupt 176 ISR performs an EOI

5) at this point, because virtual interrupt delivery is enabled, the
processor does not perform TPR virtualization (SDM 29.1.2).

In addition (and even worse) because virtual interrupt delivery is
enabled, an auto-EOI interrupt that was stashed in IRR can be injected
by the processor, and the auto-EOI behavior will be skipped.

The solution is to have userspace enable KVM_CAP_HYPERV_SYNIC through
KVM_ENABLE_CAP, and modify vmx.c to not use apicv on VMs that have it
enabled.  This requires some changes to the callbacks that only work if
enable_apicv or !enable_apicv:

if (enable_apicv)
kvm_x86_ops->update_cr8_intercept = NULL;
else {
kvm_x86_ops->hwapic_irr_update = NULL;
kvm_x86_ops->hwapic_isr_update = NULL;
kvm_x86_ops->deliver_posted_interrupt = NULL;
kvm_x86_ops->sync_pir_to_irr = vmx_sync_pir_to_irr_dummy;
}

The question then is... does Hyper-V actually use auto-EOI interrupts?
If it doesn't, we might as well not implement them... :/

I'm keeping the kernel patches queued for my own testing, but this of
course has to be fixed before including them---which will delay this
feature to 4.5, unfortunately.

Paolo


well, the problem is that it actually uses auto EOI

Den
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [kvm-unit-tests PATCH] x86: hyperv_synic: Hyper-V SynIC test

2015-11-02 Thread Paolo Bonzini

On 26/10/2015 10:56, Andrey Smetanin wrote:
> Hyper-V SynIC is a Hyper-V synthetic interrupt controller.
> 
> The test runs on every vCPU and performs the following steps:
> * read from all Hyper-V SynIC MSR's
> * setup Hyper-V SynIC evt/msg pages
> * setup SINT's routing
> * inject SINT's into destination vCPU by 'hyperv-synic-test-device'
> * wait for SINT's isr's completion
> * clear Hyper-V SynIC evt/msg pages and destroy SINT's routing
> 
> Signed-off-by: Andrey Smetanin 
> Reviewed-by: Roman Kagan 
> Signed-off-by: Denis V. Lunev 
> CC: Vitaly Kuznetsov 
> CC: "K. Y. Srinivasan" 
> CC: Gleb Natapov 
> CC: Paolo Bonzini 
> CC: Roman Kagan 
> CC: Denis V. Lunev 
> CC: qemu-de...@nongnu.org
> CC: virtualizat...@lists.linux-foundation.org

Bad news.

The test breaks with APICv, because of the following sequence of events:

1) non-auto-EOI interrupt 176 is injected into IRR and ISR

2) The PPR register is now 176

3) auto-EOI interrupt 179 is injected into IRR only, because (179 &
0xf0) <= (PPR & 0xf0)

4) interrupt 176 ISR performs an EOI

5) at this point, because virtual interrupt delivery is enabled, the
processor does not perform TPR virtualization (SDM 29.1.2).

In addition (and even worse) because virtual interrupt delivery is
enabled, an auto-EOI interrupt that was stashed in IRR can be injected
by the processor, and the auto-EOI behavior will be skipped.

The solution is to have userspace enable KVM_CAP_HYPERV_SYNIC through
KVM_ENABLE_CAP, and modify vmx.c to not use apicv on VMs that have it
enabled.  This requires some changes to the callbacks that only work if
enable_apicv or !enable_apicv:

   if (enable_apicv)
   kvm_x86_ops->update_cr8_intercept = NULL;
   else {
   kvm_x86_ops->hwapic_irr_update = NULL;
   kvm_x86_ops->hwapic_isr_update = NULL;
   kvm_x86_ops->deliver_posted_interrupt = NULL;
   kvm_x86_ops->sync_pir_to_irr = vmx_sync_pir_to_irr_dummy;
   }

The question then is... does Hyper-V actually use auto-EOI interrupts?
If it doesn't, we might as well not implement them... :/

I'm keeping the kernel patches queued for my own testing, but this of
course has to be fixed before including them---which will delay this
feature to 4.5, unfortunately.

Paolo


> ---
>  config/config-x86-common.mak |   5 +-
>  lib/x86/msr.h|  23 +
>  x86/hyperv_synic.c   | 229 
> +++
>  x86/run  |  10 +-
>  x86/unittests.cfg|   5 +
>  5 files changed, 270 insertions(+), 2 deletions(-)
>  create mode 100644 x86/hyperv_synic.c
> 
> diff --git a/config/config-x86-common.mak b/config/config-x86-common.mak
> index c2f9908..dc80eaa 100644
> --- a/config/config-x86-common.mak
> +++ b/config/config-x86-common.mak
> @@ -36,7 +36,8 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat 
> \
> $(TEST_DIR)/kvmclock_test.flat  $(TEST_DIR)/eventinj.flat \
> $(TEST_DIR)/s3.flat $(TEST_DIR)/pmu.flat \
> $(TEST_DIR)/tsc_adjust.flat $(TEST_DIR)/asyncpf.flat \
> -   $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat
> +   $(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
> +   $(TEST_DIR)/hyperv_synic.flat
>  
>  ifdef API
>  tests-common += api/api-sample
> @@ -108,6 +109,8 @@ $(TEST_DIR)/vmx.elf: $(cstart.o) $(TEST_DIR)/vmx.o 
> $(TEST_DIR)/vmx_tests.o
>  
>  $(TEST_DIR)/debug.elf: $(cstart.o) $(TEST_DIR)/debug.o
>  
> +$(TEST_DIR)/hyperv_synic.elf: $(cstart.o) $(TEST_DIR)/hyperv_synic.o
> +
>  arch_clean:
>   $(RM) $(TEST_DIR)/*.o $(TEST_DIR)/*.flat $(TEST_DIR)/*.elf \
>   $(TEST_DIR)/.*.d lib/x86/.*.d
> diff --git a/lib/x86/msr.h b/lib/x86/msr.h
> index 281255a..54da420 100644
> --- a/lib/x86/msr.h
> +++ b/lib/x86/msr.h
> @@ -408,4 +408,27 @@
>  #define MSR_VM_IGNNE0xc0010115
>  #define MSR_VM_HSAVE_PA 0xc0010117
>  
> +/* Define synthetic interrupt controller model specific registers. */
> +#define HV_X64_MSR_SCONTROL 0x4080
> +#define HV_X64_MSR_SVERSION 0x4081
> +#define HV_X64_MSR_SIEFP0x4082
> +#define HV_X64_MSR_SIMP 0x4083
> +#define HV_X64_MSR_EOM  0x4084
> +#define HV_X64_MSR_SINT00x4090
> +#define HV_X64_MSR_SINT10x4091
> +#define HV_X64_MSR_SINT20x4092
> +#define HV_X64_MSR_SINT30x4093
> +#define HV_X64_MSR_SINT40x4094
> +#define HV_X64_MSR_SINT50x4095
> +#define HV_X64_MSR_SINT60x4096
> +#define HV_X64_MSR_SINT70x4097
> +#define HV_X64_MSR_SINT80x4098
> +#define HV_X64_MSR_SINT90x4099
> +#define HV_X64_MSR_SINT10

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch

On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote:
> 
> Chosing on a per-mapping basis *in the back end* might still make some

In my case, choosing mapping based on the hardware that will use this
mappings makes more sense. Most hardware are not that performance sensitive
as the Infiniband hardware.

> amount of sense. What I don't completely grasp is what does it give
> you to expose that choice to the *driver* all the way up the chain. Why
> does the driver knows better whether something should use the bypass or
> not ?

The driver know for what hardware it is mapping the memory so it know if
the memory will be used by performance sensitive hardware or not.

> 
> I can imagine some in-between setups, for example, on POWER (and
> probably x86), I could setup a window that is TCE-mapped (TCEs are our
> iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always
> map to the same physical page. I could then use map/unmap to adjust the
> protection, the idea being that only "relaxing" the protection requires
> flushing the IO TLB, ie, we could delay most flushes.

In your case, what will give the better performance - 1:1 mapping or IOMMU
mapping? When you say 'relaxing the protection' you refer to 1:1 mapping?

Also, how this 1:1 window address the security concerns that other raised
by other here?

> 
> But that sort of optimization only makes sense in the back-end.
> 
> So what was your original idea where you thought the driver was the
> right one to decide whether to use the bypass or not for a given map
> operation ? That's what I don't grasp... you might have a valid case
> that I just fail to see.

Please see above.

> 
> Cheers,
> Ben.
> 

I think that given that IOMMU bypass on per allocation basis raise some
concerns, the only path for SPARC is this:

1. Support 'iommu=pt' as x86 for total IOMMU as intermediate step. Systems
that use Infiniband will be set to pass through.

2. Add support in DVMA which allow less contention on the IOMMU resources
while doing the map/unmap, bigger address range and full protection.
Still this is not clear what will be the performance cost of using
DVMA.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: x86: removing unused variable

2015-11-02 Thread Paolo Bonzini



On 30/10/2015 08:26, Saurabh Sengar wrote:
> removing unused variables, found by coccinelle
> 
> Signed-off-by: Saurabh Sengar 
> ---
>  arch/x86/kvm/x86.c | 16 +---
>  1 file changed, 5 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a9a198..ec15294 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3424,41 +3424,35 @@ static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, 
> struct kvm_irqchip *chip)
>  
>  static int kvm_vm_ioctl_get_pit(struct kvm *kvm, struct kvm_pit_state *ps)
>  {
> - int r = 0;
> -
>   mutex_lock(&kvm->arch.vpit->pit_state.lock);
>   memcpy(ps, &kvm->arch.vpit->pit_state, sizeof(struct kvm_pit_state));
>   mutex_unlock(&kvm->arch.vpit->pit_state.lock);
> - return r;
> + return 0;
>  }
>  
>  static int kvm_vm_ioctl_set_pit(struct kvm *kvm, struct kvm_pit_state *ps)
>  {
> - int r = 0;
> -
>   mutex_lock(&kvm->arch.vpit->pit_state.lock);
>   memcpy(&kvm->arch.vpit->pit_state, ps, sizeof(struct kvm_pit_state));
>   kvm_pit_load_count(kvm, 0, ps->channels[0].count, 0);
>   mutex_unlock(&kvm->arch.vpit->pit_state.lock);
> - return r;
> + return 0;
>  }
>  
>  static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
>  {
> - int r = 0;
> -
>   mutex_lock(&kvm->arch.vpit->pit_state.lock);
>   memcpy(ps->channels, &kvm->arch.vpit->pit_state.channels,
>   sizeof(ps->channels));
>   ps->flags = kvm->arch.vpit->pit_state.flags;
>   mutex_unlock(&kvm->arch.vpit->pit_state.lock);
>   memset(&ps->reserved, 0, sizeof(ps->reserved));
> - return r;
> + return 0;
>  }
>  
>  static int kvm_vm_ioctl_set_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
>  {
> - int r = 0, start = 0;
> + int start = 0;
>   u32 prev_legacy, cur_legacy;
>   mutex_lock(&kvm->arch.vpit->pit_state.lock);
>   prev_legacy = kvm->arch.vpit->pit_state.flags & 
> KVM_PIT_FLAGS_HPET_LEGACY;
> @@ -3470,7 +3464,7 @@ static int kvm_vm_ioctl_set_pit2(struct kvm *kvm, 
> struct kvm_pit_state2 *ps)
>   kvm->arch.vpit->pit_state.flags = ps->flags;
>   kvm_pit_load_count(kvm, 0, kvm->arch.vpit->pit_state.channels[0].count, 
> start);
>   mutex_unlock(&kvm->arch.vpit->pit_state.lock);
> - return r;
> + return 0;
>  }
>  
>  static int kvm_vm_ioctl_reinject(struct kvm *kvm,
> 

Applied, thanks.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: x86: zero apic_arb_prio on reset

2015-11-02 Thread Paolo Bonzini



On 30/10/2015 15:48, Radim Krčmář wrote:
> BSP doesn't get INIT so its apic_arb_prio isn't zeroed after reboot.
> BSP won't get lowest priority interrupts until other VCPUs get enough
> interrupts to match their pre-reboot apic_arb_prio.
> 
> That behavior doesn't fit into KVM's round-robin-like interpretation of
> lowest priority delivery ... userspace should KVM_SET_LAPIC on reset, so
> just zero apic_arb_prio there.
> 
> Reported-by: Yuki Shibuya 
> Signed-off-by: Radim Krčmář 
> ---
>  arch/x86/kvm/lapic.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 1b02c44c7b8b..08655020417d 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1924,6 +1924,8 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
>   kvm_make_request(KVM_REQ_EVENT, vcpu);
>   if (ioapic_in_kernel(vcpu->kvm))
>   kvm_rtc_eoi_tracking_restore_one(vcpu);
> +
> + vcpu->arch.apic_arb_prio = 0;
>  }
>  
>  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
> 

Applied, thanks.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/7] Hyper-V Synthetic interrupt controller

2015-11-02 Thread Paolo Bonzini



On 26/10/2015 10:50, Andrey Smetanin wrote:
> Hyper-V SynIC (synthetic interrupt controller) device
> implementation.
> 
> The implementation contains:
> * msr's support
> * irq routing setup
> * irq injection
> * irq ack callback registration
> * event/message pages changes tracking at Hyper-V exit
> * Hyper-V test device to test SynIC by kvm-unit-tests
> 
> Andrey Smetanin (7):
>   standard-headers/x86: add Hyper-V SynIC constants
>   target-i386/kvm: Hyper-V SynIC MSR's support
>   linux-headers/kvm: add Hyper-V SynIC irq routing type and struct
>   kvm: Hyper-V SynIC irq routing support
>   linux-headers/kvm: KVM_EXIT_HYPERV type and struct
>   target-i386/hyperv: Hyper-V SynIC SINT routing and vCPU exit
>   hw/misc: Hyper-V test device 'hyperv-testdev'
> 
>  default-configs/i386-softmmu.mak  |   1 +
>  default-configs/x86_64-softmmu.mak|   1 +
>  hw/misc/Makefile.objs |   1 +
>  hw/misc/hyperv_testdev.c  | 164 
> ++
>  include/standard-headers/asm-x86/hyperv.h |  12 +++
>  include/sysemu/kvm.h  |   1 +
>  kvm-all.c |  33 ++
>  linux-headers/linux/kvm.h |  25 +
>  target-i386/Makefile.objs |   2 +-
>  target-i386/cpu-qom.h |   1 +
>  target-i386/cpu.c |   1 +
>  target-i386/cpu.h |   5 +
>  target-i386/hyperv.c  | 127 +++
>  target-i386/hyperv.h  |  42 
>  target-i386/kvm.c |  66 +++-
>  target-i386/machine.c |  39 +++
>  16 files changed, 519 insertions(+), 2 deletions(-)
>  create mode 100644 hw/misc/hyperv_testdev.c
>  create mode 100644 target-i386/hyperv.c
>  create mode 100644 target-i386/hyperv.h
> 
> Signed-off-by: Andrey Smetanin 
> Reviewed-by: Roman Kagan 
> Signed-off-by: Denis V. Lunev 
> CC: Vitaly Kuznetsov 
> CC: "K. Y. Srinivasan" 
> CC: Gleb Natapov 
> CC: Paolo Bonzini 
> CC: Roman Kagan 
> CC: Denis V. Lunev 
> CC: kvm@vger.kernel.org
> CC: virtualizat...@lists.linux-foundation.org
> 

Reviewed-by: Paolo Bonzini 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v7 00/35] implement vNVDIMM

2015-11-02 Thread Stefan Hajnoczi

I have reviewed ACPI interface:

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature

Re: [PATCH/RFC 0/4] dma ops and virtio

2015-11-02 Thread Cornelia Huck

On Fri, 30 Oct 2015 13:33:07 -0700
Andy Lutomirski  wrote:

> On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck  
> wrote:
> > On Thu, 29 Oct 2015 15:50:38 -0700
> > Andy Lutomirski  wrote:
> >
> >> Progress!  After getting that sort-of-working, I figured out what was
> >> wrong with my earlier command, and I got that working, too.  Now I
> >> get:
> >>
> >> qemu-system-s390x -fsdev
> >> local,id=virtfs1,path=/,security_model=none,readonly -device
> >> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
> >> -nodefaults -device sclpconsole,chardev=console -parallel none -net
> >> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
> >> -serial chardev:console -mon chardev=console -vga none -display none
> >> -kernel arch/s390/boot/bzImage -append
> >> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init
> >> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
> >> TERM=xterm-256color rootfstype=9p
> >> rootflags=ro,version=9p2000.L,trans=virtio,access=any
> >> raid=noautodetect debug'
> >
> > The commandline looks sane AFAICS.
> >
> > (...)
> >
> >> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80
> >> virtio_ccw 0.0.: Failed to set online: -5
> >>
> >> ^^^ bad news!
> >
> > I'd like to see where in the onlining process this fails. Could you set
> > up qemu tracing for css_* and virtio_ccw_* (instructions in
> > qemu/docs/tracing.txt)?
> 
> I have a file called events that contains:
> 
> css_*
> virtio_ccw_*
> 
> pointing -trace events= at it results in a trace- file that's 549
> bytes long and contains nothing.  Are wildcards not as well-supported
> as the docs suggest?

Just tried it, seemed to work for me as expected. And as your messages
indicate, at least some of the css tracepoints are guaranteed to be
hit. Odd.

Can you try the following sophisticated printf debug patch?

diff --git a/hw/s390x/css.c b/hw/s390x/css.c
index c033612..6a87bd6 100644
--- a/hw/s390x/css.c
+++ b/hw/s390x/css.c
@@ -308,6 +308,8 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr ccw_addr)
 sch->ccw_no_data_cnt++;
 }
 
+fprintf(stderr, "CH DBG: %s: cmd_code=%x\n", __func__, ccw.cmd_code);
+
 /* Look at the command. */
 switch (ccw.cmd_code) {
 case CCW_CMD_NOOP:
@@ -375,6 +377,7 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr ccw_addr)
 }
 break;
 }
+fprintf(stderr, "CH DBG: %s: ret=%d\n", __func__, ret);
 sch->last_cmd = ccw;
 sch->last_cmd_valid = true;
 if (ret == 0) {


> > Which qemu version is this, btw.?
> >
> 
> git from yesterday.

Hm. Might be worth trying the s390-ccw-virtio-2.4 machine instead.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt

On Mon, 2015-11-02 at 09:23 +0200, Shamir Rabinovitch wrote:
> To summary -
> 
> 1. The whole point of the IOMMU pass through was to get bigger address space
> and faster map/unmap operations for performance critical hardware
> 2. SPARC IOMMU in particular has the ability to DVMA which adress all the 
> protection concerns raised above. Not sure what will be the 
> performance
> impact though. This still need a lot of work before we could test 
> this.
> 3. On x86 we use IOMMU in pass through mode so all the above concerns are 
> valid
> 
> The question are -
> 
> 1. Does partial use of IOMMU while the pass through window is enabled add some
> protection?
> 2. Do we rather the x86 way of doing this which is enable / disable IOMMU 
> translations at kernel level?
> 
> I think that I can live with option (2) till I have DVMA if there is strong
> disagree on the need for per allocation IOMMU bypass.

Chosing on a per-mapping basis *in the back end* might still make some
amount of sense. What I don't completely grasp is what does it give
you to expose that choice to the *driver* all the way up the chain. Why
does the driver knows better whether something should use the bypass or
not ?

I can imagine some in-between setups, for example, on POWER (and
probably x86), I could setup a window that is TCE-mapped (TCEs are our
iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always
map to the same physical page. I could then use map/unmap to adjust the
protection, the idea being that only "relaxing" the protection requires
flushing the IO TLB, ie, we could delay most flushes.

But that sort of optimization only makes sense in the back-end.

So what was your original idea where you thought the driver was the
right one to decide whether to use the bypass or not for a given map
operation ? That's what I don't grasp... you might have a valid case
that I just fail to see.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] KVM: PPC: Book3S HV: Synthesize segment fault if SLB lookup fails

2015-11-02 Thread Thomas Huth

On 27/10/15 06:13, Paul Mackerras wrote:
> When handling a hypervisor data or instruction storage interrupt (HDSI
> or HISI), we look up the SLB entry for the address being accessed in
> order to translate the effective address to a virtual address which can
> be looked up in the guest HPT.  This lookup can occasionally fail due
> to the guest replacing an SLB entry without invalidating the evicted
> SLB entry.  In this situation an ERAT (effective to real address
> translation cache) entry can persist and be used by the hardware even
> though there is no longer a corresponding SLB entry.
> 
> Previously we would just deliver a data or instruction storage interrupt
> (DSI or ISI) to the guest in this case.  However, this is not correct
> and has been observed to cause guests to crash, typically with a
> data storage protection interrupt on a store to the vmemmap area.
> 
> Instead, what we do now is to synthesize a data or instruction segment
> interrupt.  That should cause the guest to reload an appropriate entry
> into the SLB and retry the faulting instruction.  If it still faults,
> we should find an appropriate SLB entry next time and be able to handle
> the fault.
> 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 20 
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b1dab8d..3c6badc 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -1749,7 +1749,8 @@ kvmppc_hdsi:
>   beq 3f
>   clrrdi  r0, r4, 28
>   PPC_SLBFEE_DOT(R5, R0)  /* if so, look up SLB */
> - bne 1f  /* if no SLB entry found */
> + li  r0, BOOK3S_INTERRUPT_DATA_SEGMENT
> + bne 7f  /* if no SLB entry found */
>  4:   std r4, VCPU_FAULT_DAR(r9)
>   stw r6, VCPU_FAULT_DSISR(r9)
>  
> @@ -1768,14 +1769,15 @@ kvmppc_hdsi:
>   cmpdi   r3, -2  /* MMIO emulation; need instr word */
>   beq 2f
>  
> - /* Synthesize a DSI for the guest */
> + /* Synthesize a DSI (or DSegI) for the guest */
>   ld  r4, VCPU_FAULT_DAR(r9)
>   mr  r6, r3
> -1:   mtspr   SPRN_DAR, r4
> +1:   li  r0, BOOK3S_INTERRUPT_DATA_STORAGE
>   mtspr   SPRN_DSISR, r6
> +7:   mtspr   SPRN_DAR, r4

I first though "hey, you're not setting DSISR for the DsegI now" ... but
then I saw in the PowerISA that DSISR is "undefined" for the DSegI, so
this should be fine, I think.

>   mtspr   SPRN_SRR0, r10
>   mtspr   SPRN_SRR1, r11
> - li  r10, BOOK3S_INTERRUPT_DATA_STORAGE
> + mr  r10, r0
>   bl  kvmppc_msr_interrupt
>  fast_interrupt_c_return:
>  6:   ld  r7, VCPU_CTR(r9)
> @@ -1823,7 +1825,8 @@ kvmppc_hisi:
>   beq 3f
>   clrrdi  r0, r10, 28
>   PPC_SLBFEE_DOT(R5, R0)  /* if so, look up SLB */
> - bne 1f  /* if no SLB entry found */
> + li  r0, BOOK3S_INTERRUPT_INST_SEGMENT
> + bne 7f  /* if no SLB entry found */
>  4:
>   /* Search the hash table. */
>   mr  r3, r9  /* vcpu pointer */
> @@ -1840,11 +1843,12 @@ kvmppc_hisi:
>   cmpdi   r3, -1  /* handle in kernel mode */
>   beq guest_exit_cont
>  
> - /* Synthesize an ISI for the guest */
> + /* Synthesize an ISI (or ISegI) for the guest */
>   mr  r11, r3
> -1:   mtspr   SPRN_SRR0, r10
> +1:   li  r0, BOOK3S_INTERRUPT_INST_STORAGE
> +7:   mtspr   SPRN_SRR0, r10
>   mtspr   SPRN_SRR1, r11
> - li  r10, BOOK3S_INTERRUPT_INST_STORAGE
> + mr  r10, r0
>   bl  kvmppc_msr_interrupt
>   b   fast_interrupt_c_return

As far as I can see (but I'm really not an expert here), the patch seems
to be fine. And we have a test running with it for almost 6 days now and
did not see any further guest crashes, so it seems to me like this is
the right fix for this issue! So if you like, you can add my
"Reviewed-by" or "Tested-by" to this patch.

 Thanks a lot for fixing this!
  Thomas

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated

2015-11-02 Thread James Hogan

On Mon, Nov 02, 2015 at 05:26:42PM +0800, Haozhong Zhang wrote:
> The value of the migrated vcpu's TSC rate is determined as below.
>  1. If a TSC rate is specified by the cpu option 'tsc-freq', then this
> user-specified value will be used.
>  2. If neither a user-specified TSC rate nor a migrated TSC rate is
> present, we will use the TSC rate from KVM (returned by
> KVM_GET_TSC_KHZ).
>  3. Otherwise, we will use the migrated TSC rate.
> 
> Signed-off-by: Haozhong Zhang 
> ---
>  include/sysemu/kvm.h |  2 ++
>  kvm-all.c|  1 +
>  target-arm/kvm.c |  5 +
>  target-i386/kvm.c| 33 +
>  target-mips/kvm.c|  5 +
>  target-ppc/kvm.c |  5 +
>  target-s390x/kvm.c   |  5 +
>  7 files changed, 56 insertions(+)
> 
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 461ef65..0ec8b98 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -328,6 +328,8 @@ int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry 
> *route,
>  
>  int kvm_arch_msi_data_to_gsi(uint32_t data);
>  
> +int kvm_arch_setup_tsc_khz(CPUState *cpu);
> +
>  int kvm_set_irq(KVMState *s, int irq, int level);
>  int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg);
>  
> diff --git a/kvm-all.c b/kvm-all.c
> index c442838..1ecaf04 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -1757,6 +1757,7 @@ static void do_kvm_cpu_synchronize_post_init(void *arg)
>  {
>  CPUState *cpu = arg;
>  
> +kvm_arch_setup_tsc_khz(cpu);

Sorry if this is a stupid question, but why aren't you doing this from
the i386 kvm_arch_put_registers when level == KVM_PUT_FULL_STATE, rather
than introducing x86 specifics to the generic KVM api?

Cheers
James

>  kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
>  cpu->kvm_vcpu_dirty = false;
>  }
> diff --git a/target-arm/kvm.c b/target-arm/kvm.c
> index 79ef4c6..a724f6d 100644
> --- a/target-arm/kvm.c
> +++ b/target-arm/kvm.c
> @@ -614,3 +614,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
>  {
>  return (data - 32) & 0x;
>  }
> +
> +int kvm_arch_setup_tsc_khz(CPUState *cs)
> +{
> +return 0;
> +}
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 64046cb..aae5e58 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -3034,3 +3034,36 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
>  {
>  abort();
>  }
> +
> +int kvm_arch_setup_tsc_khz(CPUState *cs)
> +{
> +X86CPU *cpu = X86_CPU(cs);
> +CPUX86State *env = &cpu->env;
> +int r;
> +
> +/*
> + * Prepare vcpu's TSC rate to be migrated.
> + *
> + * - If the user specifies the TSC rate by cpu option 'tsc-freq',
> + *   we will use the user-specified value.
> + *
> + * - If there is neither user-specified TSC rate nor migrated TSC
> + *   rate, we will ask KVM for the TSC rate by calling
> + *   KVM_GET_TSC_KHZ.
> + *
> + * - Otherwise, if there is a migrated TSC rate, we will use the
> + *   migrated value.
> + */
> +if (env->tsc_khz) {
> +env->tsc_khz_saved = env->tsc_khz;
> +} else if (!env->tsc_khz_saved) {
> +r = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ);
> +if (r < 0) {
> +fprintf(stderr, "KVM_GET_TSC_KHZ failed\n");
> +return r;
> +}
> +env->tsc_khz_saved = r;
> +}
> +
> +return 0;
> +}
> diff --git a/target-mips/kvm.c b/target-mips/kvm.c
> index 12d7db3..fb26d7e 100644
> --- a/target-mips/kvm.c
> +++ b/target-mips/kvm.c
> @@ -687,3 +687,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
>  {
>  abort();
>  }
> +
> +int kvm_arch_setup_tsc_khz(CPUState *cs)
> +{
> +return 0;
> +}
> diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> index ac70f08..c429f0c 100644
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -2510,3 +2510,8 @@ int kvmppc_enable_hwrng(void)
>  
>  return kvmppc_enable_hcall(kvm_state, H_RANDOM);
>  }
> +
> +int kvm_arch_setup_tsc_khz(CPUState *cs)
> +{
> +return 0;
> +}
> diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
> index c3be180..db5d436 100644
> --- a/target-s390x/kvm.c
> +++ b/target-s390x/kvm.c
> @@ -2248,3 +2248,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
>  {
>  abort();
>  }
> +
> +int kvm_arch_setup_tsc_khz(CPUState *cs)
> +{
> +return 0;
> +}
> -- 
> 2.4.8
> 


signature.asc
Description: Digital signature

Re: [PATCH 0/3] KVM: x86: simplify RSM into 64-bit protected mode

2015-11-02 Thread Paolo Bonzini



On 31/10/2015 20:50, Laszlo Ersek wrote:
> Tested-by: Laszlo Ersek 

Thanks Laszlo, I applied patches 1 and 2 (since your "part 2" never was :)).

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 3/3] target-i386: load the migrated vcpu's TSC rate

2015-11-02 Thread Haozhong Zhang

Set vcpu's TSC rate to the migrated value if the user does not specify a
TSC rate by cpu option 'tsc-freq' and a migrated TSC rate does exist. If
KVM supports TSC scaling, guest programs will observe TSC increasing in
the migrated rate other than the host TSC rate.

Signed-off-by: Haozhong Zhang 
---
 target-i386/kvm.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index aae5e58..2be70df 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -3042,6 +3042,27 @@ int kvm_arch_setup_tsc_khz(CPUState *cs)
 int r;
 
 /*
+ * If a TSC rate is migrated and the user does not specify the
+ * vcpu's TSC rate on the destination, the migrated TSC rate will
+ * be used on the destination after the migration.
+ */
+if (env->tsc_khz_saved && !env->tsc_khz) {
+if (kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL)) {
+r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz_saved);
+if (r < 0) {
+fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
+}
+} else {
+r = -1;
+fprintf(stderr, "KVM doesn't support TSC scaling\n");
+}
+if (r < 0) {
+fprintf(stderr, "Use host TSC frequency instead. "
+"Guest TSC may be inaccurate.\n");
+}
+}
+
+/*
  * Prepare vcpu's TSC rate to be migrated.
  *
  * - If the user specifies the TSC rate by cpu option 'tsc-freq',
-- 
2.4.8

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 1/3] target-i386: add a subsection for migrating vcpu's TSC rate

2015-11-02 Thread Haozhong Zhang

A new subsection 'vmstate_tsc_khz' is added to migrate vcpu's TSC
rate. For the backwards compatibility, this subsection is not migrated
on pc-*-2.4 and older machine types.

Signed-off-by: Haozhong Zhang 
---
 hw/i386/pc.c  |  1 +
 hw/i386/pc_piix.c |  1 +
 hw/i386/pc_q35.c  |  1 +
 include/hw/i386/pc.h  |  1 +
 target-i386/cpu.h |  1 +
 target-i386/machine.c | 21 +
 6 files changed, 26 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 0cb8afd..2f2fc93 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1952,6 +1952,7 @@ static void pc_machine_class_init(ObjectClass *oc, void 
*data)
 HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(oc);
 
 pcmc->get_hotplug_handler = mc->get_hotplug_handler;
+pcmc->save_tsc_khz = true;
 mc->get_hotplug_handler = pc_get_hotpug_handler;
 mc->cpu_index_to_socket_id = pc_cpu_index_to_socket_id;
 mc->default_boot_order = "cad";
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 393dcc4..fc71321 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -487,6 +487,7 @@ static void pc_i440fx_2_4_machine_options(MachineClass *m)
 m->alias = NULL;
 m->is_default = 0;
 pcmc->broken_reserved_end = true;
+pcmc->save_tsc_khz = false;
 SET_MACHINE_COMPAT(m, PC_COMPAT_2_4);
 }
 
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 2f8f396..858ed69 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -385,6 +385,7 @@ static void pc_q35_2_4_machine_options(MachineClass *m)
 pc_q35_2_5_machine_options(m);
 m->alias = NULL;
 pcmc->broken_reserved_end = true;
+pcmc->save_tsc_khz = false;
 SET_MACHINE_COMPAT(m, PC_COMPAT_2_4);
 }
 
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 606dbc2..875d099 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -60,6 +60,7 @@ struct PCMachineClass {
 
 /*< public >*/
 bool broken_reserved_end;
+bool save_tsc_khz;
 HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
DeviceState *dev);
 };
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 62f7879..4f2f4a3 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -970,6 +970,7 @@ typedef struct CPUX86State {
 uint32_t sipi_vector;
 bool tsc_valid;
 int64_t tsc_khz;
+int64_t tsc_khz_saved;
 void *kvm_xsave_buf;
 
 uint64_t mcg_cap;
diff --git a/target-i386/machine.c b/target-i386/machine.c
index a18e16e..4d8157c 100644
--- a/target-i386/machine.c
+++ b/target-i386/machine.c
@@ -775,6 +775,26 @@ static const VMStateDescription vmstate_xss = {
 }
 };
 
+static bool tsc_khz_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = &cpu->env;
+MachineClass *mc = MACHINE_GET_CLASS((qdev_get_machine()));
+PCMachineClass *pcmc = PC_MACHINE_CLASS(mc);
+return env->tsc_khz_saved && pcmc->save_tsc_khz;
+}
+
+static const VMStateDescription vmstate_tsc_khz = {
+.name = "cpu/tsc_khz",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = tsc_khz_needed,
+.fields = (VMStateField[]) {
+VMSTATE_INT64(env.tsc_khz_saved, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
 VMStateDescription vmstate_x86_cpu = {
 .name = "cpu",
 .version_id = 12,
@@ -895,6 +915,7 @@ VMStateDescription vmstate_x86_cpu = {
 &vmstate_msr_hyperv_runtime,
 &vmstate_avx512,
 &vmstate_xss,
+&vmstate_tsc_khz,
 NULL
 }
 };
-- 
2.4.8

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 2/3] target-i386: calculate vcpu's TSC rate to be migrated

2015-11-02 Thread Haozhong Zhang

The value of the migrated vcpu's TSC rate is determined as below.
 1. If a TSC rate is specified by the cpu option 'tsc-freq', then this
user-specified value will be used.
 2. If neither a user-specified TSC rate nor a migrated TSC rate is
present, we will use the TSC rate from KVM (returned by
KVM_GET_TSC_KHZ).
 3. Otherwise, we will use the migrated TSC rate.

Signed-off-by: Haozhong Zhang 
---
 include/sysemu/kvm.h |  2 ++
 kvm-all.c|  1 +
 target-arm/kvm.c |  5 +
 target-i386/kvm.c| 33 +
 target-mips/kvm.c|  5 +
 target-ppc/kvm.c |  5 +
 target-s390x/kvm.c   |  5 +
 7 files changed, 56 insertions(+)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 461ef65..0ec8b98 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -328,6 +328,8 @@ int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry 
*route,
 
 int kvm_arch_msi_data_to_gsi(uint32_t data);
 
+int kvm_arch_setup_tsc_khz(CPUState *cpu);
+
 int kvm_set_irq(KVMState *s, int irq, int level);
 int kvm_irqchip_send_msi(KVMState *s, MSIMessage msg);
 
diff --git a/kvm-all.c b/kvm-all.c
index c442838..1ecaf04 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1757,6 +1757,7 @@ static void do_kvm_cpu_synchronize_post_init(void *arg)
 {
 CPUState *cpu = arg;
 
+kvm_arch_setup_tsc_khz(cpu);
 kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
 cpu->kvm_vcpu_dirty = false;
 }
diff --git a/target-arm/kvm.c b/target-arm/kvm.c
index 79ef4c6..a724f6d 100644
--- a/target-arm/kvm.c
+++ b/target-arm/kvm.c
@@ -614,3 +614,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 return (data - 32) & 0x;
 }
+
+int kvm_arch_setup_tsc_khz(CPUState *cs)
+{
+return 0;
+}
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 64046cb..aae5e58 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -3034,3 +3034,36 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
 }
+
+int kvm_arch_setup_tsc_khz(CPUState *cs)
+{
+X86CPU *cpu = X86_CPU(cs);
+CPUX86State *env = &cpu->env;
+int r;
+
+/*
+ * Prepare vcpu's TSC rate to be migrated.
+ *
+ * - If the user specifies the TSC rate by cpu option 'tsc-freq',
+ *   we will use the user-specified value.
+ *
+ * - If there is neither user-specified TSC rate nor migrated TSC
+ *   rate, we will ask KVM for the TSC rate by calling
+ *   KVM_GET_TSC_KHZ.
+ *
+ * - Otherwise, if there is a migrated TSC rate, we will use the
+ *   migrated value.
+ */
+if (env->tsc_khz) {
+env->tsc_khz_saved = env->tsc_khz;
+} else if (!env->tsc_khz_saved) {
+r = kvm_vcpu_ioctl(cs, KVM_GET_TSC_KHZ);
+if (r < 0) {
+fprintf(stderr, "KVM_GET_TSC_KHZ failed\n");
+return r;
+}
+env->tsc_khz_saved = r;
+}
+
+return 0;
+}
diff --git a/target-mips/kvm.c b/target-mips/kvm.c
index 12d7db3..fb26d7e 100644
--- a/target-mips/kvm.c
+++ b/target-mips/kvm.c
@@ -687,3 +687,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
 }
+
+int kvm_arch_setup_tsc_khz(CPUState *cs)
+{
+return 0;
+}
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index ac70f08..c429f0c 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -2510,3 +2510,8 @@ int kvmppc_enable_hwrng(void)
 
 return kvmppc_enable_hcall(kvm_state, H_RANDOM);
 }
+
+int kvm_arch_setup_tsc_khz(CPUState *cs)
+{
+return 0;
+}
diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index c3be180..db5d436 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -2248,3 +2248,8 @@ int kvm_arch_msi_data_to_gsi(uint32_t data)
 {
 abort();
 }
+
+int kvm_arch_setup_tsc_khz(CPUState *cs)
+{
+return 0;
+}
-- 
2.4.8

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v3 0/3] target-i386: save/restore vcpu's TSC rate during migration

2015-11-02 Thread Haozhong Zhang

This patchset enables QEMU to save/restore vcpu's TSC rate during the
migration on machine types pc-*-2.5 or newer.

On the source machine:
 * If the vcpu's TSC rate is specified by the cpu option 'tsc-freq',
   then this user-specified TSC rate will be migrated.
 * Otherwise, the TSC rate returned by KVM_GET_TSC_KHZ will be
   migrated. For a fresh VM, this is the host TSC rate.

On the destination machine:
 * If the vcpu's TSC rate is specified by the cpu option 'tsc-freq',
   then QEMU will try to use this user-specified TSC rate rather than
   the migrated value.
 * Otherwise, QEMU will try to use the migrated TSC rate. If KVM on
   the destination supports TSC scaling, guest programs will observe a
   consistent TSC rate across the migration. If TSC scaling is not
   supported, the migration will not be aborted and QEMU will behave
   like before, i.e using the host TSC rate instead.

Changes in v3:
 * Change the cpu option 'save-tsc-freq' to an internal flag.
 * Remove the cpu option 'load-tsc-freq' and change the logic of
   loading the migrated TSC rate as above.
 * Move the setup of migrated TSC rate back to
   do_kvm_cpu_synchronize_post_init().

Changes in v2:
 * Add a pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' to
   control the migration of vcpu's TSC rate.
 * Move all logic of setting TSC rate to target-i386.
 * Remove the duplicated TSC setup in kvm_arch_init_vcpu().

Haozhong Zhang (3):
  target-i386: add a subsection for migrating vcpu's TSC rate
  target-i386: calculate vcpu's TSC rate to be migrated
  target-i386: load the migrated vcpu's TSC rate

 hw/i386/pc.c  |  1 +
 hw/i386/pc_piix.c |  1 +
 hw/i386/pc_q35.c  |  1 +
 include/hw/i386/pc.h  |  1 +
 include/sysemu/kvm.h  |  2 ++
 kvm-all.c |  1 +
 target-arm/kvm.c  |  5 +
 target-i386/cpu.h |  1 +
 target-i386/kvm.c | 54 +++
 target-i386/machine.c | 21 
 target-mips/kvm.c |  5 +
 target-ppc/kvm.c  |  5 +
 target-s390x/kvm.c|  5 +
 13 files changed, 103 insertions(+)

-- 
2.4.8

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Qemu-devel] [PATCH v6 18/33] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

2015-11-02 Thread Xiao Guangrong




On 10/31/2015 10:15 PM, Xiao Guangrong wrote:



On 10/31/2015 06:52 PM, Vladimir Sementsov-Ogievskiy wrote:

On 30.10.2015 08:56, Xiao Guangrong wrote:

Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Signed-off-by: Xiao Guangrong 
---
  hw/mem/dimm.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 4a63409..498d380 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -377,8 +377,9 @@ static void dimm_get_size(Object *obj, Visitor *v, void 
*opaque,
  int64_t value;
  MemoryRegion *mr;
  DIMMDevice *dimm = DIMM(obj);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(obj);
-mr = host_memory_backend_get_memory(dimm->hostmem, errp);
+mr = ddc->get_memory_region(dimm);
  value = memory_region_size(mr);
  visit_type_int(v, &value, name, errp);


Isn't it better to add Error** parameter to get_memory_region and not mix 
returning error in errp
parameter and aborting through &error_abort?



Yes, it's better, will do. Thanks for your suggestion.


Vladimir,

After checking the current code, the @errp in pc-dimm's get_memory_region() is 
useless
and all its caller did not check the return value as they assumed the backend 
memory of
pc-dimm is always properly initialized.

So we catch the error condition internally instead of propagating it to the 
caller:

 static MemoryRegion *pc_dimm_get_memory_region(DIMMDevice *dimm)
 {
-return host_memory_backend_get_memory(dimm->hostmem, &error_abort);
+Error *local_err = NULL;
+MemoryRegion *mr;
+
+mr = host_memory_backend_get_memory(dimm->hostmem, &local_err);
+
+/*
+ * plug a pc-dimm device whose backend memory was not properly
+ * initialized?
+ */
+assert(!local_err && mr);
+return mr;
 }

please review it in the new version i have CCed you.

Thanks for your review.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [GIT PULL 0/3] KVM: s390: Bugfix and cleanups for kvm/next (4.4)

2015-11-02 Thread Paolo Bonzini



On 29/10/2015 16:08, Christian Borntraeger wrote:
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux.git  
> tags/kvm-s390-next-20151028

Pulled, thanks!

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v7 35/35] nvdimm: add maintain info

2015-11-02 Thread Xiao Guangrong

Add NVDIMM maintainer

Signed-off-by: Xiao Guangrong 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3144113..865c0cf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -907,6 +907,13 @@ M: Jiri Pirko 
 S: Maintained
 F: hw/net/rocker/
 
+NVDIMM
+M: Xiao Guangrong 
+S: Maintained
+F: hw/acpi/nvdimm.c
+F: hw/mem/nvdimm.c
+F: include/hw/mem/nvdimm.h
+
 Subsystems
 --
 Audio
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v7 34/35] nvdimm acpi: support _FIT method

2015-11-02 Thread Xiao Guangrong

FIT buffer is not completely mapped into guest address space, so a new
function, Read FIT, function index 0x, is reserved by QEMU to
read the piece of FIT buffer. The buffer is concatenated before _FIT
return

Refer to docs/specs/acpi-nvdimm.txt for detailed design

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 168 +--
 1 file changed, 164 insertions(+), 4 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index eab5d9c..f5b5c12 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -384,6 +384,18 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+/*
+ * define UUID for NVDIMM Root Device according to Chapter 3 DSM Interface
+ * for NVDIMM Root Device - Example in DSM Spec Rev1.
+ */
+#define NVDIMM_DSM_ROOT_UUID "2F10E7A4-9E91-11E4-89D3-123B93F75CBA"
+
+/*
+ * Read FIT Function, which is a QEMU internal use only function, more detail
+ * refer to docs/specs/acpi_nvdimm.txt
+ */
+#define NVDIMM_DSM_FUNC_READ_FIT 0x
+
 /* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
 enum {
 /* Common return status codes. */
@@ -420,6 +432,11 @@ struct NvdimmFuncInSetLabelData {
 } QEMU_PACKED;
 typedef struct NvdimmFuncInSetLabelData NvdimmFuncInSetLabelData;
 
+struct NvdimmFuncInReadFit {
+uint32_t offset; /* fit offset */
+} QEMU_PACKED;
+typedef struct NvdimmFuncInReadFit NvdimmFuncInReadFit;
+
 struct NvdimmDsmIn {
 uint32_t handle;
 uint32_t revision;
@@ -429,6 +446,7 @@ struct NvdimmDsmIn {
 uint8_t arg3[0];
 NvdimmFuncInSetLabelData func_set_label_data;
 NvdimmFuncInGetLabelData func_get_label_data;
+NvdimmFuncInReadFit func_read_fit;
 };
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
@@ -450,13 +468,71 @@ struct NvdimmFuncOutGetLabelData {
 } QEMU_PACKED;
 typedef struct NvdimmFuncOutGetLabelData NvdimmFuncOutGetLabelData;
 
+struct NvdimmFuncOutReadFit {
+uint32_t status;/* return status code. */
+uint32_t length;/* the length of fit data we read. */
+uint8_t fit_data[0]; /* fit data. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncOutReadFit NvdimmFuncOutReadFit;
+
 static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
 {
 status = cpu_to_le32(status);
 build_append_int_noprefix(out, status, sizeof(status));
 }
 
-static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
+/* Build fit memory which is presented to guest via _FIT method. */
+static void nvdimm_build_fit(AcpiNVDIMMState *state)
+{
+if (!state->fit) {
+GSList *device_list = nvdimm_get_plugged_device_list();
+
+nvdimm_debug("Rebuild FIT...\n");
+state->fit = nvdimm_build_device_structure(device_list);
+g_slist_free(device_list);
+}
+}
+
+/* Read FIT data, defined in docs/specs/acpi_nvdimm.txt. */
+static void nvdimm_dsm_func_read_fit(AcpiNVDIMMState *state,
+ NvdimmDsmIn *in, GArray *out)
+{
+NvdimmFuncInReadFit *read_fit = &in->func_read_fit;
+NvdimmFuncOutReadFit fit_out;
+uint32_t read_length = TARGET_PAGE_SIZE - sizeof(NvdimmFuncOutReadFit);
+uint32_t status = NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS;
+
+nvdimm_build_fit(state);
+
+le32_to_cpus(&read_fit->offset);
+
+nvdimm_debug("Read FIT offset %#x.\n", read_fit->offset);
+
+if (read_fit->offset > state->fit->len) {
+nvdimm_debug("offset %#x is beyond fit size (%#x).\n",
+ read_fit->offset, state->fit->len);
+goto exit;
+}
+
+read_length = MIN(read_length, state->fit->len - read_fit->offset);
+nvdimm_debug("read length %#x.\n", read_length);
+
+fit_out.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
+fit_out.length = cpu_to_le32(read_length);
+g_array_append_vals(out, &fit_out, sizeof(fit_out));
+
+if (read_length) {
+g_array_append_vals(out, state->fit->data + read_fit->offset,
+read_length);
+}
+return;
+
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
+static void nvdimm_dsm_root(AcpiNVDIMMState *state, NvdimmDsmIn *in,
+GArray *out)
 {
 uint32_t status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 
@@ -475,6 +551,10 @@ static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
 return;
 }
 
+if (in->function == NVDIMM_DSM_FUNC_READ_FIT /* FIT Read */) {
+return nvdimm_dsm_func_read_fit(state, in, out);
+}
+
 nvdimm_debug("Return status %#x.\n", status);
 nvdimm_dsm_write_status(out, status);
 }
@@ -710,7 +790,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 
 /* Handle 0 is reserved for NVDIMM Root Device. */
 if (!in->handle) {
-nvdimm_dsm_root(in, out);
+nvdimm_dsm_root(state, in, out);
 goto exit;
 }
 
@@ -927,8 +1007,88 @@ static void nvdimm_build_acpi_devices(GSList 
*device_list, A

[PATCH v7 32/35] nvdimm acpi: support Set Namespace Label Data function

2015-11-02 Thread Xiao Guangrong

Function 6 is used to set Namespace Label Data

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index f30e2ff..2553be9 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -587,6 +587,34 @@ exit:
 nvdimm_dsm_write_status(out, status);
 }
 
+/*
+ * DSM Spec Rev1 4.6 Set Namespace Label Data (Function Index 6).
+ */
+static void nvdimm_dsm_func_set_label_data(NVDIMMDevice *nvdimm,
+   NvdimmDsmIn *in, GArray *out)
+{
+NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
+NvdimmFuncInSetLabelData *set_label_data = &in->func_set_label_data;
+uint32_t status;
+
+le32_to_cpus(&set_label_data->offset);
+le32_to_cpus(&set_label_data->length);
+
+nvdimm_debug("Write Label Data: offset %#x length %#x.\n",
+ set_label_data->offset, set_label_data->length);
+
+status = nvdimm_rw_label_data_check(nvdimm, set_label_data->offset,
+set_label_data->length);
+if (status != NVDIMM_DSM_STATUS_SUCCESS) {
+goto exit;
+}
+
+nvc->write_label_data(nvdimm, set_label_data->in_buf,
+  set_label_data->length, set_label_data->offset);
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
 static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -617,6 +645,9 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 case 0x5 /* Get Namespace Label Data */:
 nvdimm_dsm_func_get_label_data(nvdimm, in, out);
 goto free;
+case 0x6 /* Set Namespace Label Data */:
+nvdimm_dsm_func_set_label_data(nvdimm, in, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v7 33/35] nvdimm: allow using whole backend memory as pmem

2015-11-02 Thread Xiao Guangrong

Introduce a parameter, named "reserve-label-data", if it is
false which indicates that QEMU does not reserve any region
on the backend memory to support label data. It is a
'label-less' NVDIMM device mode that linux will use whole
memory on the device as a single namesapce

This is useful for the users who want to pass whole nvdimm
device and make its data completely be visible to guest

The parameter is false on default

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c| 12 
 hw/mem/nvdimm.c | 43 ---
 include/hw/mem/nvdimm.h |  6 ++
 3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 2553be9..eab5d9c 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -531,6 +531,12 @@ static void nvdimm_dsm_func_label_size(NVDIMMDevice 
*nvdimm, GArray *out)
 static uint32_t nvdimm_rw_label_data_check(NVDIMMDevice *nvdimm,
uint32_t offset, uint32_t length)
 {
+if (!nvdimm->reserve_label_data) {
+nvdimm_debug("read/write label request on the device without "
+ "label data reserved.\n");
+return NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+}
+
 if (offset + length < offset) {
 nvdimm_debug("offset %#x + length %#x is overflow.\n", offset,
  length);
@@ -637,6 +643,12 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
1 << 4 /* Get Namespace Label Size */ |
1 << 5 /* Get Namespace Label Data */ |
1 << 6 /* Set Namespace Label Data */);
+
+/* no function support if the device does not have label data. */
+if (!nvdimm->reserve_label_data) {
+cmd_list = cpu_to_le64(0);
+}
+
 build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
 goto free;
 case 0x4 /* Get Namespace Label Size */:
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index c310887..fde1c7f 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -39,14 +39,15 @@ static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
 {
 MemoryRegion *mr;
 NVDIMMDevice *nvdimm = NVDIMM(dimm);
-uint64_t size;
+uint64_t reserved_label_size, size;
 
 nvdimm->label_size = MIN_NAMESPACE_LABEL_SIZE;
+reserved_label_size = nvdimm->reserve_label_data ? nvdimm->label_size : 0;
 
 mr = host_memory_backend_get_memory(dimm->hostmem, errp);
 size = memory_region_size(mr);
 
-if (size <= nvdimm->label_size) {
+if (size <= reserved_label_size) {
 char *path = 
object_get_canonical_path_component(OBJECT(dimm->hostmem));
 error_setg(errp, "the size of memdev %s (0x%" PRIx64 ") is too small"
" to contain nvdimm namespace label (0x%" PRIx64 ")", path,
@@ -55,15 +56,19 @@ static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
 }
 
 memory_region_init_alias(&nvdimm->nvdimm_mr, OBJECT(dimm), "nvdimm-memory",
- mr, 0, size - nvdimm->label_size);
-nvdimm->label_data = memory_region_get_ram_ptr(mr) +
- memory_region_size(&nvdimm->nvdimm_mr);
+ mr, 0, size - reserved_label_size);
+
+if (reserved_label_size) {
+nvdimm->label_data = memory_region_get_ram_ptr(mr) +
+ memory_region_size(&nvdimm->nvdimm_mr);
+}
 }
 
 static void nvdimm_read_label_data(NVDIMMDevice *nvdimm, void *buf,
uint64_t size, uint64_t offset)
 {
-assert((nvdimm->label_size >= size + offset) && (offset + size > offset));
+assert(nvdimm->reserve_label_data &&
+   (nvdimm->label_size >= size + offset) && (offset + size > offset));
 
 memcpy(buf, nvdimm->label_data + offset, size);
 }
@@ -75,7 +80,8 @@ static void nvdimm_write_label_data(NVDIMMDevice *nvdimm, 
const void *buf,
 DIMMDevice *dimm = DIMM(nvdimm);
 uint64_t backend_offset;
 
-assert((nvdimm->label_size >= size + offset) && (offset + size > offset));
+assert(nvdimm->reserve_label_data &&
+   (nvdimm->label_size >= size + offset) && (offset + size > offset));
 
 memcpy(nvdimm->label_data + offset, buf, size);
 
@@ -100,10 +106,33 @@ static void nvdimm_class_init(ObjectClass *oc, void *data)
 nvc->write_label_data = nvdimm_write_label_data;
 }
 
+static bool nvdimm_get_reserve_label_data(Object *obj, Error **errp)
+{
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+return nvdimm->reserve_label_data;
+}
+
+static void
+nvdimm_set_reserve_label_data(Object *obj, bool value, Error **errp)
+{
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+nvdimm->reserve_label_data = value;
+}
+
+static void nvdimm_init(Object *obj)
+{
+object_property_add_bool(obj, "reserve-label-data",
+ nvdimm_get_reserve_label_data,
+ nvdimm_set_reserve_lab

[PATCH v7 31/35] nvdimm acpi: support Get Namespace Label Data function

2015-11-02 Thread Xiao Guangrong

Function 5 is used to get Namespace Label Data

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 63 
 1 file changed, 63 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index cdc361c..f30e2ff 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -428,6 +428,7 @@ struct NvdimmDsmIn {
 union {
 uint8_t arg3[0];
 NvdimmFuncInSetLabelData func_set_label_data;
+NvdimmFuncInGetLabelData func_get_label_data;
 };
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
@@ -527,6 +528,65 @@ static void nvdimm_dsm_func_label_size(NVDIMMDevice 
*nvdimm, GArray *out)
 g_array_append_vals(out, &func_label_size, sizeof(func_label_size));
 }
 
+static uint32_t nvdimm_rw_label_data_check(NVDIMMDevice *nvdimm,
+   uint32_t offset, uint32_t length)
+{
+if (offset + length < offset) {
+nvdimm_debug("offset %#x + length %#x is overflow.\n", offset,
+ length);
+return NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+}
+
+if (nvdimm->label_size < offset + length) {
+nvdimm_debug("position %#x is beyond label data (len = %#lx).\n",
+ offset + length, nvdimm->label_size);
+return NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+}
+
+if (length > nvdimm_get_max_xfer_label_size()) {
+nvdimm_debug("length (%#x) is larger than max_xfer (%#x).\n",
+ length, nvdimm_get_max_xfer_label_size());
+return NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+}
+
+return NVDIMM_DSM_STATUS_SUCCESS;
+}
+
+/*
+ * DSM Spec Rev1 4.5 Get Namespace Label Data (Function Index 5).
+ */
+static void nvdimm_dsm_func_get_label_data(NVDIMMDevice *nvdimm,
+   NvdimmDsmIn *in, GArray *out)
+{
+NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
+NvdimmFuncInGetLabelData *get_label_data = &in->func_get_label_data;
+void *buf;
+uint32_t status;
+
+le32_to_cpus(&get_label_data->offset);
+le32_to_cpus(&get_label_data->length);
+
+nvdimm_debug("Read Label Data: offset %#x length %#x.\n",
+ get_label_data->offset, get_label_data->length);
+
+status = nvdimm_rw_label_data_check(nvdimm, get_label_data->offset,
+get_label_data->length);
+if (status != NVDIMM_DSM_STATUS_SUCCESS) {
+goto exit;
+}
+
+/* write nvdimm_func_out_get_label_data.status. */
+nvdimm_dsm_write_status(out, status);
+/* write nvdimm_func_out_get_label_data.out_buf. */
+buf = acpi_data_push(out, get_label_data->length);
+nvc->read_label_data(nvdimm, buf, get_label_data->length,
+ get_label_data->offset);
+return;
+
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
 static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -554,6 +614,9 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 case 0x4 /* Get Namespace Label Size */:
 nvdimm_dsm_func_label_size(nvdimm, out);
 goto free;
+case 0x5 /* Get Namespace Label Data */:
+nvdimm_dsm_func_get_label_data(nvdimm, in, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v7 30/35] nvdimm acpi: support Get Namespace Label Size function

2015-11-02 Thread Xiao Guangrong

Function 4 is used to get Namespace label size

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 87 +++-
 1 file changed, 86 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 3895ad8..cdc361c 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -407,15 +407,48 @@ enum {
 NVDIMM_DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4,
 };
 
+struct NvdimmFuncInGetLabelData {
+uint32_t offset; /* the offset in the namespace label data area. */
+uint32_t length; /* the size of data is to be read via the function. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncInGetLabelData NvdimmFuncInGetLabelData;
+
+struct NvdimmFuncInSetLabelData {
+uint32_t offset; /* the offset in the namespace label data area. */
+uint32_t length; /* the size of data is to be written via the function. */
+uint8_t in_buf[0]; /* the data written to label data area. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncInSetLabelData NvdimmFuncInSetLabelData;
+
 struct NvdimmDsmIn {
 uint32_t handle;
 uint32_t revision;
 uint32_t function;
/* the remaining size in the page is used by arg3. */
-uint8_t arg3[0];
+union {
+uint8_t arg3[0];
+NvdimmFuncInSetLabelData func_set_label_data;
+};
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
 
+struct NvdimmFuncOutLabelSize {
+uint32_t status; /* return status code. */
+uint32_t label_size; /* the size of label data area. */
+/*
+ * Maximum size of the namespace label data length supported by
+ * the platform in Get/Set Namespace Label Data functions.
+ */
+uint32_t max_xfer;
+} QEMU_PACKED;
+typedef struct NvdimmFuncOutLabelSize NvdimmFuncOutLabelSize;
+
+struct NvdimmFuncOutGetLabelData {
+uint32_t status;/*return status code. */
+uint8_t out_buf[0]; /* the data got via Get Namesapce Label function. */
+} QEMU_PACKED;
+typedef struct NvdimmFuncOutGetLabelData NvdimmFuncOutGetLabelData;
+
 static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
 {
 status = cpu_to_le32(status);
@@ -445,6 +478,55 @@ static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
 nvdimm_dsm_write_status(out, status);
 }
 
+/*
+ * the max transfer size is the max size transferred by both a
+ * 'Get Namespace Label Data' function and a 'Set Namespace Label Data'
+ * function.
+ */
+static uint32_t nvdimm_get_max_xfer_label_size(void)
+{
+NvdimmDsmIn *in;
+uint32_t max_get_size, max_set_size, dsm_memory_size = TARGET_PAGE_SIZE;
+
+/*
+ * the max data ACPI can read one time which is transferred by
+ * the response of 'Get Namespace Label Data' function.
+ */
+max_get_size = dsm_memory_size - sizeof(NvdimmFuncOutGetLabelData);
+
+/*
+ * the max data ACPI can write one time which is transferred by
+ * 'Set Namespace Label Data' function.
+ */
+max_set_size = dsm_memory_size - offsetof(NvdimmDsmIn, arg3) -
+   sizeof(in->func_set_label_data);
+
+return MIN(max_get_size, max_set_size);
+}
+
+/*
+ * DSM Spec Rev1 4.4 Get Namespace Label Size (Function Index 4).
+ *
+ * It gets the size of Namespace Label data area and the max data size
+ * that Get/Set Namespace Label Data functions can transfer.
+ */
+static void nvdimm_dsm_func_label_size(NVDIMMDevice *nvdimm, GArray *out)
+{
+NvdimmFuncOutLabelSize func_label_size;
+uint32_t label_size, mxfer;
+
+label_size = nvdimm->label_size;
+mxfer = nvdimm_get_max_xfer_label_size();
+
+nvdimm_debug("label_size %#x, max_xfer %#x.\n", label_size, mxfer);
+
+func_label_size.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
+func_label_size.label_size = cpu_to_le32(label_size);
+func_label_size.max_xfer = cpu_to_le32(mxfer);
+
+g_array_append_vals(out, &func_label_size, sizeof(func_label_size));
+}
+
 static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -469,6 +551,9 @@ static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
1 << 6 /* Set Namespace Label Data */);
 build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
 goto free;
+case 0x4 /* Get Namespace Label Size */:
+nvdimm_dsm_func_label_size(nvdimm, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v7 24/35] docs: add NVDIMM ACPI documentation

2015-11-02 Thread Xiao Guangrong

It describes the basic concepts of NVDIMM ACPI and the interface
between QEMU and the ACPI BIOS

Signed-off-by: Xiao Guangrong 
---
 docs/specs/acpi_nvdimm.txt | 179 +
 1 file changed, 179 insertions(+)
 create mode 100644 docs/specs/acpi_nvdimm.txt

diff --git a/docs/specs/acpi_nvdimm.txt b/docs/specs/acpi_nvdimm.txt
new file mode 100644
index 000..cc5db2c
--- /dev/null
+++ b/docs/specs/acpi_nvdimm.txt
@@ -0,0 +1,179 @@
+QEMU<->ACPI BIOS NVDIMM interface
+-
+
+QEMU supports NVDIMM via ACPI. This document describes the basic concepts of
+NVDIMM ACPI and the interface between QEMU and the ACPI BIOS.
+
+NVDIMM ACPI Background
+--
+NVDIMM is introduced in ACPI 6.0 which defines an NVDIMM root device under
+_SB scope with a _HID of “ACPI0012”. For each NVDIMM present or intended
+to be supported by platform, platform firmware also exposes an ACPI
+Namespace Device under the root device.
+
+The NVDIMM child devices under the NVDIMM root device are defined with _ADR
+corresponding to the NFIT device handle. The NVDIMM root device and the
+NVDIMM devices can have device specific methods (_DSM) to provide additional
+functions specific to a particular NVDIMM implementation.
+
+This is an example from ACPI 6.0, a platform contains one NVDIMM:
+
+Scope (\_SB){
+   Device (NVDR) // Root device
+   {
+  Name (_HID, “ACPI0012”)
+  Method (_STA) {...}
+  Method (_FIT) {...}
+  Method (_DSM, ...) {...}
+  Device (NVD)
+  {
+ Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
+ Method (_DSM, ...) {...}
+  }
+   }
+}
+
+Methods supported on both NVDIMM root device and NVDIMM device are
+1) _STA(Status)
+   It returns the current status of a device, which can be one of the
+   following: enabled, disabled, or removed.
+
+   Arguments: None
+
+   Return Value:
+   It returns an An Integer which is defined as followings:
+   Bit [0] – Set if the device is present.
+   Bit [1] – Set if the device is enabled and decoding its resources.
+   Bit [2] – Set if the device should be shown in the UI.
+   Bit [3] – Set if the device is functioning properly (cleared if device
+ failed its diagnostics).
+   Bit [4] – Set if the battery is present.
+   Bits [31:5] – Reserved (must be cleared).
+
+2) _DSM (Device Specific Method)
+   It is a control method that enables devices to provide device specific
+   control functions that are consumed by the device driver.
+   The NVDIMM DSM specification can be found at:
+http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+
+   Arguments:
+   Arg0 – A Buffer containing a UUID (16 Bytes)
+   Arg1 – An Integer containing the Revision ID (4 Bytes)
+   Arg2 – An Integer containing the Function Index (4 Bytes)
+   Arg3 – A package containing parameters for the function specified by the
+  UUID, Revision ID, and Function Index
+
+   Return Value:
+   If Function Index = 0, a Buffer containing a function index bitfield.
+   Otherwise, the return value and type depends on the UUID, revision ID
+   and function index which are described in the DSM specification.
+
+Methods on NVDIMM ROOT Device
+_FIT(Firmware Interface Table)
+   It evaluates to a buffer returning data in the format of a series of NFIT
+   Type Structure.
+
+   Arguments: None
+
+   Return Value:
+   A Buffer containing a list of NFIT Type structure entries.
+
+   The detailed definition of the structure can be found at ACPI 6.0: 5.2.25
+   NVDIMM Firmware Interface Table (NFIT).
+
+QEMU NVDIMM Implemention
+
+QEMU reserves a page starting from 0xFF0 and 4 bytes IO Port starting
+from 0x0a18 for NVDIMM ACPI.
+
+Memory 0xFF0 - 0xFF00FFF:
+   This page is RAM-based and it is used to transfer data between _DSM
+   method and QEMU. If ACPI has control, this pages is owned by ACPI which
+   writes _DSM input data to it, otherwise, it is owned by QEMU which
+   emulates _DSM access and writes the output data to it.
+
+   ACPI Writes _DSM Input Data:
+   [0xFF0 - 0xFF3]: 4 bytes, NVDIMM Devcie Handle, 0 is reserved
+for NVDIMM Root device.
+   [0xFF4 - 0xFF7]: 4 bytes, Revision ID, that is the Arg1 of _DSM
+method.
+   [0xFF8 - 0xFFB]: 4 bytes. Function Index, that is the Arg2 of
+_DSM method.
+   [0xFFC - 0xFF00FFF]: 4084 bytes, the Arg3 of _DSM method
+
+   QEMU Writes Output Data:
+   [0xFF0 - 0xFF00FFF]: the DSM return result filled by QEMU
+
+IO Port 0x0a18 - 0xa1b:
+   ACPI uses it to transfer control from guest to QEMU and read the size
+   of return result filled by QEMU
+
+   Read Access:
+   [0x0a18 - 0xa1b]: 4 bytes, the buffer size of _DSM output data.
+
+_DSM process diagram:
+-
+The page, 0xFF0 - 0xFF00FFF, is used by _DSM Virtualization.
+
+ +--+  +

[PATCH v7 28/35] nvdimm acpi: save arg3 for NVDIMM device _DSM method

2015-11-02 Thread Xiao Guangrong

Check if the input Arg3 is valid then store it into dsm_in if needed

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 53ed675..e179a72 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -524,13 +524,38 @@ static void nvdimm_build_acpi_devices(GSList 
*device_list, Aml *sb_scope)
 
 method = aml_method_serialized("NCAL", 4);
 {
-Aml *buffer_size = aml_local(0);
+Aml *ifctx, *pckg, *buffer_size = aml_local(0);
 
 aml_append(method, aml_store(aml_arg(0), aml_name("HDLE")));
 aml_append(method, aml_store(aml_arg(1), aml_name("REVS")));
 aml_append(method, aml_store(aml_arg(2), aml_name("FUNC")));
 
 /*
+ * The fourth parameter (Arg3) of _DSM is a package which contains
+ * a buffer, the layout of the buffer is specified by UUID (Arg0),
+ * Revision ID (Arg1) and Function Index (Arg2) which are documented
+ * in the DSM Spec.
+ */
+pckg = aml_arg(3);
+ifctx = aml_if(aml_and(aml_equal(aml_object_type(pckg),
+ aml_int(4 /* Package */)),
+   aml_equal(aml_sizeof(pckg),
+ aml_int(1;
+{
+Aml *pckg_index, *pckg_buf;
+
+pckg_index = aml_local(2);
+pckg_buf = aml_local(3);
+
+aml_append(ifctx, aml_store(aml_index(pckg, aml_int(0)),
+pckg_index));
+aml_append(ifctx, aml_store(aml_derefof(pckg_index),
+pckg_buf));
+aml_append(ifctx, aml_store(pckg_buf, aml_name("ARG3")));
+}
+aml_append(method, ifctx);
+
+/*
  * transfer control to QEMU and the buffer size filled by
  * QEMU is returned.
  */
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v7 27/35] nvdimm acpi: build ACPI nvdimm devices

2015-11-02 Thread Xiao Guangrong

NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices

There is a root device under \_SB and specified NVDIMM devices are under the
root device. Each NVDIMM device has _ADR which returns its handle used to
associate MEMDEV structure in NFIT

We reserve handle 0 for root device. In this patch, we save handle, handle,
arg1 and arg2 to dsm memory. Arg3 is conditionally saved in later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 184 +++
 1 file changed, 184 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index dd84e5f..53ed675 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -368,6 +368,15 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+struct NvdimmDsmIn {
+uint32_t handle;
+uint32_t revision;
+uint32_t function;
+   /* the remaining size in the page is used by arg3. */
+uint8_t arg3[0];
+} QEMU_PACKED;
+typedef struct NvdimmDsmIn NvdimmDsmIn;
+
 static uint64_t
 nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 {
@@ -377,6 +386,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 static void
 nvdimm_dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
 {
+fprintf(stderr, "BUG: we never write DSM notification IO Port.\n");
 }
 
 static const MemoryRegionOps nvdimm_dsm_ops = {
@@ -402,6 +412,179 @@ void nvdimm_init_acpi_state(MemoryRegion *memory, 
MemoryRegion *io,
 memory_region_add_subregion(io, NVDIMM_ACPI_IO_BASE, &state->io_mr);
 }
 
+#define BUILD_STA_METHOD(_dev_, _method_)  \
+do {   \
+_method_ = aml_method("_STA", 0);  \
+aml_append(_method_, aml_return(aml_int(0x0f)));   \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_DSM_METHOD(_dev_, _method_, _handle_, _uuid_)\
+do {   \
+Aml *ifctx, *uuid; \
+_method_ = aml_method("_DSM", 4);  \
+/* check UUID if it is we expect, return the errorcode if not.*/   \
+uuid = aml_touuid(_uuid_); \
+ifctx = aml_if(aml_lnot(aml_equal(aml_arg(0), uuid))); \
+aml_append(ifctx, aml_return(aml_int(1 /* Not Supported */))); \
+aml_append(method, ifctx); \
+aml_append(method, aml_return(aml_call4("NCAL", aml_int(_handle_), \
+   aml_arg(1), aml_arg(2), aml_arg(3;  \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_FIELD_UNIT_SIZE(_field_, _byte_, _name_) \
+aml_append(_field_, aml_named_field(_name_, (_byte_) * BITS_PER_BYTE))
+
+#define BUILD_FIELD_UNIT_STRUCT(_field_, _s_, _f_, _name_) \
+BUILD_FIELD_UNIT_SIZE(_field_, sizeof(typeof_field(_s_, _f_)), _name_)
+
+static void build_nvdimm_devices(GSList *device_list, Aml *root_dev)
+{
+for (; device_list; device_list = device_list->next) {
+NVDIMMDevice *nvdimm = device_list->data;
+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
+   NULL);
+uint32_t handle = nvdimm_slot_to_handle(slot);
+Aml *dev, *method;
+
+dev = aml_device("NV%02X", slot);
+aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
+
+BUILD_STA_METHOD(dev, method);
+
+/*
+ * Chapter 4: _DSM Interface for NVDIMM Device (non-root) - Example
+ * in DSM Spec Rev1.
+ */
+BUILD_DSM_METHOD(dev, method,
+ handle /* NVDIMM Device Handle */,
+ "4309AC30-0D11-11E4-9191-0800200C9A66"
+ /* UUID for NVDIMM Devices. */);
+
+aml_append(root_dev, dev);
+}
+}
+
+static void nvdimm_build_acpi_devices(GSList *device_list, Aml *sb_scope)
+{
+Aml *dev, *method, *field;
+uint64_t page_size = TARGET_PAGE_SIZE;
+
+dev = aml_device("NVDR");
+aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
+
+/* map DSM memory and IO into ACPI namespace. */
+aml_append(dev, aml_operation_region("NPIO", AML_SYSTEM_IO,
+   NVDIMM_ACPI_IO_BASE, NVDIMM_ACPI_IO_LEN));
+aml_append(dev, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
+   NVDIMM_ACPI_MEM_BASE, page_size));
+
+/*
+ * DSM notifier:
+ * @NOTI: Read it will notify QEMU that _DSM method is being
+ *called and the parameters can be found in NvdimmDsmIn.
+ *The value read from it is the buffer size of

[PATCH v7 29/35] nvdimm acpi: support function 0

2015-11-02 Thread Xiao Guangrong

__DSM is defined in ACPI 6.0: 9.14.1 _DSM (Device Specific Method)

Function 0 is a query function. We do not support any function on root
device and only 3 functions are support for NVDIMM device, Get Namespace
Label Size, Get Namespace Label Data and Set Namespace Label Data, that
means we currently only allow to access device's Label Namespace

Reviewed-by: Stefan Hajnoczi 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c |   2 +-
 hw/acpi/nvdimm.c| 158 +++-
 include/hw/acpi/aml-build.h |   1 +
 3 files changed, 159 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 8bee8b2..90229c5 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -231,7 +231,7 @@ static void build_extop_package(GArray *package, uint8_t op)
 build_prepend_byte(package, 0x5B); /* ExtOpPrefix */
 }
 
-static void build_append_int_noprefix(GArray *table, uint64_t value, int size)
+void build_append_int_noprefix(GArray *table, uint64_t value, int size)
 {
 int i;
 
diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index e179a72..3895ad8 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -212,6 +212,22 @@ static uint32_t nvdimm_slot_to_dcr_index(int slot)
 return nvdimm_slot_to_spa_index(slot) + 1;
 }
 
+static NVDIMMDevice
+*nvdimm_get_device_by_handle(GSList *list, uint32_t handle)
+{
+for (; list; list = list->next) {
+NVDIMMDevice *nvdimm = list->data;
+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
+   NULL);
+
+if (nvdimm_slot_to_handle(slot) == handle) {
+return nvdimm;
+}
+}
+
+return NULL;
+}
+
 /* ACPI 6.0: 5.2.25.1 System Physical Address Range Structure */
 static void
 nvdimm_build_structure_spa(GArray *structures, NVDIMMDevice *nvdimm)
@@ -368,6 +384,29 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+/* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
+enum {
+/* Common return status codes. */
+/* Success */
+NVDIMM_DSM_STATUS_SUCCESS = 0,
+/* Not Supported */
+NVDIMM_DSM_STATUS_NOT_SUPPORTED = 1,
+
+/* NVDIMM Root Device _DSM function return status codes*/
+/* Invalid Input Parameters */
+NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS = 2,
+/* Function-Specific Error */
+NVDIMM_DSM_ROOT_DEV_STATUS_FUNCTION_SPECIFIC_ERROR = 3,
+
+/* NVDIMM Device (non-root) _DSM function return status codes*/
+/* Non-Existing Memory Device */
+NVDIMM_DSM_DEV_STATUS_NON_EXISTING_MEM_DEV = 2,
+/* Invalid Input Parameters */
+NVDIMM_DSM_DEV_STATUS_INVALID_PARAS = 3,
+/* Vendor Specific Error */
+NVDIMM_DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4,
+};
+
 struct NvdimmDsmIn {
 uint32_t handle;
 uint32_t revision;
@@ -377,10 +416,127 @@ struct NvdimmDsmIn {
 } QEMU_PACKED;
 typedef struct NvdimmDsmIn NvdimmDsmIn;
 
+static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
+{
+status = cpu_to_le32(status);
+build_append_int_noprefix(out, status, sizeof(status));
+}
+
+static void nvdimm_dsm_root(NvdimmDsmIn *in, GArray *out)
+{
+uint32_t status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+
+/*
+ * Query command implemented per ACPI Specification, it is defined in
+ * ACPI 6.0: 9.14.1 _DSM (Device Specific Method).
+ */
+if (in->function == 0x0) {
+/*
+ * Set it to zero to indicate no function is supported for NVDIMM
+ * root.
+ */
+uint64_t cmd_list = cpu_to_le64(0);
+
+build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
+return;
+}
+
+nvdimm_debug("Return status %#x.\n", status);
+nvdimm_dsm_write_status(out, status);
+}
+
+static void nvdimm_dsm_device(NvdimmDsmIn *in, GArray *out)
+{
+GSList *list = nvdimm_get_plugged_device_list();
+NVDIMMDevice *nvdimm = nvdimm_get_device_by_handle(list, in->handle);
+uint32_t status = NVDIMM_DSM_DEV_STATUS_NON_EXISTING_MEM_DEV;
+uint64_t cmd_list;
+
+if (!nvdimm) {
+goto set_status_free;
+}
+
+/* Encode DSM function according to DSM Spec Rev1. */
+switch (in->function) {
+/* see comments in nvdimm_dsm_root(). */
+case 0x0:
+cmd_list = cpu_to_le64(0x1 /* Bit 0 indicates whether there is
+  support for any functions other
+  than function 0.
+*/   |
+   1 << 4 /* Get Namespace Label Size */ |
+   1 << 5 /* Get Namespace Label Data */ |
+   1 << 6 /* Set Namespace Label Data */);
+build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
+goto free;
+default:
+status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+

[PATCH v7 25/35] nvdimm acpi: init the resource used by NVDIMM ACPI

2015-11-02 Thread Xiao Guangrong

A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
for detailed design

A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
that controls if nvdimm support is enabled, it is true on default and
it is false on 2.4 and its earlier version to keep compatibility

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak |  1 +
 default-configs/mips-softmmu.mak |  1 +
 default-configs/mips64-softmmu.mak   |  1 +
 default-configs/mips64el-softmmu.mak |  1 +
 default-configs/mipsel-softmmu.mak   |  1 +
 default-configs/x86_64-softmmu.mak   |  1 +
 hw/acpi/Makefile.objs|  1 +
 hw/acpi/ich9.c   | 24 ++
 hw/acpi/nvdimm.c | 63 
 hw/acpi/piix4.c  | 27 
 include/hw/acpi/ich9.h   |  3 ++
 include/hw/i386/pc.h | 10 ++
 include/hw/mem/nvdimm.h  | 34 +++
 13 files changed, 161 insertions(+), 7 deletions(-)
 create mode 100644 hw/acpi/nvdimm.c

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 4e84a1c..51e71d4 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -48,6 +48,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/default-configs/mips-softmmu.mak b/default-configs/mips-softmmu.mak
index 44467c3..6b8b70e 100644
--- a/default-configs/mips-softmmu.mak
+++ b/default-configs/mips-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mips64-softmmu.mak 
b/default-configs/mips64-softmmu.mak
index 66ed5f9..ea820f6 100644
--- a/default-configs/mips64-softmmu.mak
+++ b/default-configs/mips64-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mips64el-softmmu.mak 
b/default-configs/mips64el-softmmu.mak
index bfca2b2..8993851 100644
--- a/default-configs/mips64el-softmmu.mak
+++ b/default-configs/mips64el-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mipsel-softmmu.mak 
b/default-configs/mipsel-softmmu.mak
index 0162ef0..87ab964 100644
--- a/default-configs/mipsel-softmmu.mak
+++ b/default-configs/mipsel-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index e877a86..0a7dc10 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -48,6 +48,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
index 7d3230c..84c082d 100644
--- a/hw/acpi/Makefile.objs
+++ b/hw/acpi/Makefile.objs
@@ -2,6 +2,7 @@ common-obj-$(CONFIG_ACPI_X86) += core.o piix4.o pcihp.o
 common-obj-$(CONFIG_ACPI_X86_ICH) += ich9.o tco.o
 common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu_hotplug.o
 common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
+obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
 common-obj-$(CONFIG_ACPI) += acpi_interface.o
 common-obj-$(CONFIG_ACPI) += bios-linker-loader.o
 common-obj-$(CONFIG_ACPI) += aml-build.o
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index 1e9ae20..603c1bd 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -280,6 +280,12 @@ void ich9_pm_init(PCIDevice *lpc_pci, ICH9LPCPMRegs *pm,
 acpi_memory_hotplug_init(pci_address_space_io(lpc_pci), 
OBJECT(lpc_pci),
  &pm->acpi_memory_hotplug);
 }
+
+if (pm->acpi_nvdimm_state.is_enabled) {
+nvdimm_init_acpi_state(pci_address_space(lpc_pci),
+   pci_address_space_io(lpc_pci), OBJECT(lpc_pci),
+   &pm->acpi_nvdimm_state);
+}
 }
 
 static void ich9_pm_get_gpe0_blk(Object *obj, Visitor *v,
@@ -307,6 +313,20 @@ static void ich9_pm_set_memory_hotplug_support(Object 
*obj, bool value,
 s->pm.acpi_memory_hotplug.is_enabled = value;
 }
 
+static bool ich9_pm_get_nvdimm_support(Object *obj, Error **errp)
+{
+ICH9LPCState *s = ICH9_LPC_DEVICE(obj);
+
+return s->pm.acpi_nvdimm_state.is_enabled;
+}
+
+static void ich9_pm_set_nvdimm_support(Object *obj, bool value, Error **

[PATCH v7 26/35] nvdimm acpi: build ACPI NFIT table

2015-11-02 Thread Xiao Guangrong

NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)

Currently, we only support PMEM mode. Each device has 3 structures:
- SPA structure, defines the PMEM region info

- MEM DEV structure, it has the @handle which is used to associate specified
  ACPI NVDIMM  device we will introduce in later patch.
  Also we can happily ignored the memory device's interleave, the real
  nvdimm hardware access is hidden behind host

- DCR structure, it defines vendor ID used to associate specified vendor
  nvdimm driver. Since we only implement PMEM mode this time, Command
  window and Data window are not needed

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c| 355 
 hw/i386/acpi-build.c|   6 +
 include/hw/mem/nvdimm.h |  10 ++
 3 files changed, 371 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 1223da2..dd84e5f 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -26,8 +26,348 @@
  * License along with this library; if not, see 
  */
 
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
 #include "hw/mem/nvdimm.h"
 
+static int nvdimm_plugged_device_list(Object *obj, void *opaque)
+{
+GSList **list = opaque;
+
+if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+if (memory_region_is_mapped(&nvdimm->nvdimm_mr)) {
+*list = g_slist_append(*list, DEVICE(obj));
+}
+}
+
+object_child_foreach(obj, nvdimm_plugged_device_list, opaque);
+return 0;
+}
+
+/*
+ * inquire plugged NVDIMM devices and link them into the list which is
+ * returned to the caller.
+ *
+ * Note: it is the caller's responsibility to free the list to avoid
+ * memory leak.
+ */
+static GSList *nvdimm_get_plugged_device_list(void)
+{
+GSList *list = NULL;
+
+object_child_foreach(qdev_get_machine(), nvdimm_plugged_device_list,
+ &list);
+return list;
+}
+
+#define NVDIMM_UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7) \
+   { (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
+ (b) & 0xff, ((b) >> 8) & 0xff, (c) & 0xff, ((c) >> 8) & 0xff,  \
+ (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }
+/*
+ * define Byte Addressable Persistent Memory (PM) Region according to
+ * ACPI 6.0: 5.2.25.1 System Physical Address Range Structure.
+ */
+static const uint8_t nvdimm_nfit_spa_uuid[] =
+  NVDIMM_UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d, 0x33,
+ 0x18, 0xb7, 0x8c, 0xdb);
+
+/*
+ * NVDIMM Firmware Interface Table
+ * @signature: "NFIT"
+ *
+ * It provides information that allows OSPM to enumerate NVDIMM present in
+ * the platform and associate system physical address ranges created by the
+ * NVDIMMs.
+ *
+ * It is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
+ */
+struct NvdimmNfitHeader {
+ACPI_TABLE_HEADER_DEF
+uint32_t reserved;
+} QEMU_PACKED;
+typedef struct NvdimmNfitHeader NvdimmNfitHeader;
+
+/*
+ * define NFIT structures according to ACPI 6.0: 5.2.25 NVDIMM Firmware
+ * Interface Table (NFIT).
+ */
+
+/*
+ * System Physical Address Range Structure
+ *
+ * It describes the system physical address ranges occupied by NVDIMMs and
+ * the types of the regions.
+ */
+struct NvdimmNfitSpa {
+uint16_t type;
+uint16_t length;
+uint16_t spa_index;
+uint16_t flags;
+uint32_t reserved;
+uint32_t proximity_domain;
+uint8_t type_guid[16];
+uint64_t spa_base;
+uint64_t spa_length;
+uint64_t mem_attr;
+} QEMU_PACKED;
+typedef struct NvdimmNfitSpa NvdimmNfitSpa;
+
+/*
+ * Memory Device to System Physical Address Range Mapping Structure
+ *
+ * It enables identifying each NVDIMM region and the corresponding SPA
+ * describing the memory interleave
+ */
+struct NvdimmNfitMemDev {
+uint16_t type;
+uint16_t length;
+uint32_t nfit_handle;
+uint16_t phys_id;
+uint16_t region_id;
+uint16_t spa_index;
+uint16_t dcr_index;
+uint64_t region_len;
+uint64_t region_offset;
+uint64_t region_dpa;
+uint16_t interleave_index;
+uint16_t interleave_ways;
+uint16_t flags;
+uint16_t reserved;
+} QEMU_PACKED;
+typedef struct NvdimmNfitMemDev NvdimmNfitMemDev;
+
+/*
+ * NVDIMM Control Region Structure
+ *
+ * It describes the NVDIMM and if applicable, Block Control Window.
+ */
+struct NvdimmNfitControlRegion {
+uint16_t type;
+uint16_t length;
+uint16_t dcr_index;
+uint16_t vendor_id;
+uint16_t device_id;
+uint16_t revision_id;
+uint16_t sub_vendor_id;
+uint16_t sub_device_id;
+uint16_t sub_revision_id;
+uint8_t reserved[6];
+uint32_t serial_number;
+uint16_t fic;
+uint16_t num_bcw;
+uint64_t bcw_size;
+uint64_t cmd_offset;
+uint64_t cmd_size;
+uint64_t status_offset;
+uint64_t status_size;
+uint16_t flags;
+uint8_t reserved2[6];
+} QEMU_PACK

[PATCH v7 22/35] dimm: introduce realize callback

2015-11-02 Thread Xiao Guangrong

nvdimm need check if the backend memory is large enough to contain label
data and init its memory region when the device is realized, so introduce
realize callback which is called after common dimm has been realize

Reviewed-by: Vladimir Sementsov-Ogievskiy 
Signed-off-by: Xiao Guangrong 
---
 hw/mem/dimm.c | 5 +
 include/hw/mem/dimm.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 7d1..0ae23ce 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -426,6 +426,7 @@ static void dimm_init(Object *obj)
 static void dimm_realize(DeviceState *dev, Error **errp)
 {
 DIMMDevice *dimm = DIMM(dev);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(dimm);
 
 if (!dimm->hostmem) {
 error_setg(errp, "'" DIMM_MEMDEV_PROP "' property is not set");
@@ -438,6 +439,10 @@ static void dimm_realize(DeviceState *dev, Error **errp)
dimm->node, nb_numa_nodes ? nb_numa_nodes : 1);
 return;
 }
+
+if (ddc->realize) {
+ddc->realize(dimm, errp);
+}
 }
 
 static void dimm_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/mem/dimm.h b/include/hw/mem/dimm.h
index 50f768a..72ec24c 100644
--- a/include/hw/mem/dimm.h
+++ b/include/hw/mem/dimm.h
@@ -65,6 +65,7 @@ typedef struct DIMMDeviceClass {
 DeviceClass parent_class;
 
 /* public */
+void (*realize)(DIMMDevice *dimm, Error **errp);
 MemoryRegion *(*get_memory_region)(DIMMDevice *dimm);
 } DIMMDeviceClass;
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 >

1 - 100 of 124 matches

Mail list logo