Re: [PATCH v4 27/33] nvdimm acpi: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 12:04:48PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/19/2015 01:16 AM, Michael S. Tsirkin wrote:
> >On Mon, Oct 19, 2015 at 08:54:13AM +0800, Xiao Guangrong wrote:
> >>Check if the input Arg3 is valid then store it into dsm_in if needed
> >>
> >>Signed-off-by: Xiao Guangrong 
> >>---
> >>  hw/acpi/nvdimm.c | 19 +++
> >>  1 file changed, 19 insertions(+)
> >>
> >>diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> >>index 7e99889..b211b8b 100644
> >>--- a/hw/acpi/nvdimm.c
> >>+++ b/hw/acpi/nvdimm.c
> >>@@ -624,10 +624,29 @@ static void nvdimm_build_acpi_devices(NVDIMMState 
> >>*state, GSList *device_list,
> >>
> >>  method = aml_method_serialized("NCAL", 4);
> >>  {
> >>+Aml *ifctx;
> >>+
> >>  aml_append(method, aml_store(aml_arg(0), aml_name("HDLE")));
> >>  aml_append(method, aml_store(aml_arg(1), aml_name("REVS")));
> >>  aml_append(method, aml_store(aml_arg(2), aml_name("FUNC")));
> >>
> >>+/* Arg3 is passed as Package and it has one element? */
> >>+ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
> >>+ aml_int(4)),
> >>+   aml_equal(aml_sizeof(aml_arg(3)),
> >>+ aml_int(1;
> >>+{
> >>+/* Local0 = Index(Arg3, 0) */
> >>+aml_append(ifctx, aml_store(aml_index(aml_arg(3), aml_int(0)),
> >>+aml_local(0)));
> >>+/* Local3 = DeRefOf(Local0) */
> >>+aml_append(ifctx, aml_store(aml_derefof(aml_local(0)),
> >>+aml_local(3)));
> >>+/* ARG3 = Local3 */
> >>+aml_append(ifctx, aml_store(aml_local(3), aml_name("ARG3")));
> >>+}
> >>+aml_append(method, ifctx);
> >>+
> >>  aml_append(method, aml_store(aml_int(NOTIFY_VALUE), 
> >> aml_name("NOTI")));
> >>
> >>  aml_append(method, aml_store(aml_name("RLEN"), aml_local(6)));
> >
> >I commented on this patch on v3.
> >It doesn't look like this was addressed.
> >
> 
> Ah... I see no one commented this patch ([PATCH v3 26/32] nvdimm: save arg3 
> for NVDIMM
> device_DSM method) on v3.
> 
> Do you mean we need more and better comment to explain arg3? Or anything else?

Interesting. I have it in my sent mail file, but it doesn't seem to
be on list. I've just resent it, and a couple of other messages
that seem to have disappeared into the ether.

These are the messages I have for v3:

33587   F To Xiao Guangro Re: [PATCH v3 26/32] nvdimm: save arg3 for NVDIMM 
device _DSM method
33588   F To Xiao Guangro Re: [PATCH v3 22/32] nvdimm: init the address region 
used by NVDIMM ACPI
33589   F To Xiao Guangro Re: [PATCH v3 23/32] nvdimm: build ACPI NFIT table
33590   F To Xiao Guangro Re: [PATCH v3 00/32] implement vNVDIMM
33597   F To Xiao Guangro Re: [PATCH v3 23/32] nvdimm: build ACPI NFIT table
33598   F To Xiao Guangro Re: [PATCH v3 00/32] implement vNVDIMM
33599   F To Xiao Guangro Re: [PATCH v3 23/32] nvdimm: build ACPI NFIT table

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: x86: MMU: Initialize force_pt_level before calling mapping_level()

2015-10-19 Thread Takuya Yoshikawa
Commit fd1369021878 ("KVM: x86: MMU: Move mapping_level_dirty_bitmap()
call in mapping_level()") forgot to initialize force_pt_level to false
in FNAME(page_fault)() before calling mapping_level() like
nonpaging_map() does.  This can sometimes result in forcing page table
level mapping unnecessarily.

Fix this and move the first *force_pt_level check in mapping_level()
before kvm_vcpu_gfn_to_memslot() call to make it a bit clearer that
the variable must be initialized before mapping_level() gets called.

This change can also avoid calling kvm_vcpu_gfn_to_memslot() when
!check_hugepage_cache_consistency() check in tdp_page_fault() forces
page table level mapping.

Signed-off-by: Takuya Yoshikawa 
---
 arch/x86/kvm/mmu.c | 7 ---
 arch/x86/kvm/paging_tmpl.h | 2 +-
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index dd2a7c6..7d85bca 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -886,10 +886,11 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
large_gfn,
int host_level, level, max_level;
struct kvm_memory_slot *slot;
 
-   slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
+   if (unlikely(*force_pt_level))
+   return PT_PAGE_TABLE_LEVEL;
 
-   if (likely(!*force_pt_level))
-   *force_pt_level = !memslot_valid_for_gpte(slot, true);
+   slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
+   *force_pt_level = !memslot_valid_for_gpte(slot, true);
if (unlikely(*force_pt_level))
return PT_PAGE_TABLE_LEVEL;
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index bf39d0f..b41faa9 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -698,7 +698,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
int r;
pfn_t pfn;
int level = PT_PAGE_TABLE_LEVEL;
-   bool force_pt_level;
+   bool force_pt_level = false;
unsigned long mmu_seq;
bool map_writable, is_self_change_mapping;
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 12:39:24PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/19/2015 02:05 AM, Michael S. Tsirkin wrote:
> >On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote:
> >>__DSM is defined in ACPI 6.0: 9.14.1 _DSM (Device Specific Method)
> >>
> >>Function 0 is a query function. We do not support any function on root
> >>device and only 3 functions are support for NVDIMM device,
> >>DSM_DEV_FUN_NAMESPACE_LABEL_SIZE, DSM_DEV_FUN_GET_NAMESPACE_LABEL_DATA and
> >>DSM_DEV_FUN_SET_NAMESPACE_LABEL_DATA, that means we currently only allow to
> >>access device's Label Namespace
> >>
> >>Signed-off-by: Xiao Guangrong 
> >>---
> >>  hw/acpi/nvdimm.c | 184 
> >> ++-
> >>  1 file changed, 182 insertions(+), 2 deletions(-)
> >>
> >>diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> >>index b211b8b..37fea1c 100644
> >>--- a/hw/acpi/nvdimm.c
> >>+++ b/hw/acpi/nvdimm.c
> >>@@ -260,6 +260,22 @@ static uint32_t nvdimm_slot_to_dcr_index(int slot)
> >>  return nvdimm_slot_to_spa_index(slot) + 1;
> >>  }
> >>
> >>+static NVDIMMDevice
> >>+*nvdimm_get_device_by_handle(GSList *list, uint32_t handle)
> >>+{
> >>+for (; list; list = list->next) {
> >>+NVDIMMDevice *nvdimm = list->data;
> >>+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
> >>+   NULL);
> >>+
> >>+if (nvdimm_slot_to_handle(slot) == handle) {
> >>+return nvdimm;
> >>+}
> >>+}
> >>+
> >>+return NULL;
> >>+}
> >>+
> >>  /*
> >>   * Please refer to ACPI 6.0: 5.2.25.1 System Physical Address Range
> >>   * Structure
> >>@@ -411,6 +427,60 @@ static void nvdimm_build_nfit(GArray *structures, 
> >>GArray *table_offsets,
> >>  /* detailed _DSM design please refer to docs/specs/acpi_nvdimm.txt */
> >>  #define NOTIFY_VALUE  0x99
> >
> >Again, please prefix everything consistently.
> 
> Okay, will do. Sorry for i missed it.
> 
> >
> >>
> >>+enum {
> >>+DSM_FUN_IMPLEMENTED = 0,
> >>+
> >>+/* NVDIMM Root Device Functions */
> >>+DSM_ROOT_DEV_FUN_ARS_CAP = 1,
> >>+DSM_ROOT_DEV_FUN_ARS_START = 2,
> >>+DSM_ROOT_DEV_FUN_ARS_QUERY = 3,
> >>+
> >>+/* NVDIMM Device (non-root) Functions */
> >>+DSM_DEV_FUN_SMART = 1,
> >>+DSM_DEV_FUN_SMART_THRESHOLD = 2,
> >>+DSM_DEV_FUN_BLOCK_NVDIMM_FLAGS = 3,
> >>+DSM_DEV_FUN_NAMESPACE_LABEL_SIZE = 4,
> >>+DSM_DEV_FUN_GET_NAMESPACE_LABEL_DATA = 5,
> >>+DSM_DEV_FUN_SET_NAMESPACE_LABEL_DATA = 6,
> >>+DSM_DEV_FUN_VENDOR_EFFECT_LOG_SIZE = 7,
> >>+DSM_DEV_FUN_GET_VENDOR_EFFECT_LOG = 8,
> >>+DSM_DEV_FUN_VENDOR_SPECIFIC = 9,
> >>+};
> >
> >Does FUN stand for "function"? FUNC or FN is probably better.
> >
> 
> Yes.
> 
> >Please list exact names as they appear in spec so
> >they can be searched for.
> 
> The spec reference was at where this _FUN_ is used, eg:
> 
> /*
>  * Please refer to DSM specification 4.4.1 Get Namespace Label Size
>  * (Function Index 4).
>  *
>  * It gets the size of Namespace Label data area and the max data size
>  * that Get/Set Namespace Label Data functions can transfer.
>  */
> static void nvdimm_dsm_func_label_size(NVDIMMDevice *nvdimm, GArray *out)
> 
> I will follow your ‘single use’ comments below, these definitions will
> be dropped, the code will be like this:
> 
> switch (function) {
> case 4 /* DSM Spec 4.4.1 Get Namespace Label Size Get Namespace Label Size. 
> */:

If it's the same spec, you don't have to repeat it:

/* Encode function according to DSM Spec rev 1.0 */
> switch (function) {
> case 4 /* 4.4.1 Get Namespace Label Size Get Namespace Label Size. */:

same for chapter etc.

> nvdimm_dsm_func_label_size();
> case ...
> ...
> };
> 
> >
> >
> >
> >>+
> >>+enum {
> >>+/* Common return status codes. */
> >>+DSM_STATUS_SUCCESS = 0,   /* Success */
> >>+DSM_STATUS_NOT_SUPPORTED = 1, /* Not Supported */
> >>+
> >>+/* NVDIMM Root Device _DSM function return status codes*/
> >>+DSM_ROOT_DEV_STATUS_INVALID_PARAS = 2,/* Invalid Input Parameters 
> >>*/
> >>+DSM_ROOT_DEV_STATUS_FUNCTION_SPECIFIC_ERROR = 3, /* Function-Specific
> >>+Error */
> >>+
> >>+/* NVDIMM Device (non-root) _DSM function return status codes*/
> >>+DSM_DEV_STATUS_NON_EXISTING_MEM_DEV = 2,  /* Non-Existing Memory 
> >>Device */
> >>+DSM_DEV_STATUS_INVALID_PARAS = 3, /* Invalid Input Parameters 
> >>*/
> >>+DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4, /* Vendor Specific Error */
> >>+};
> >>+
> >>+/* Current revision supported by DSM specification is 1. */
> >>+#define DSM_REVISION(1)
> >>+
> >>+/*
> >>+ * please refer to ACPI 6.0: 9.14.1 _DSM (Device Specific Method): Return
> >>+ * Value Information:
> >
> >Drop "please refer to".
> 
> Okay.
> 
> >
> >>+ *   if set to zero, no functions are supported (other than function zero)

Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Michael S. Tsirkin
On Sun, Oct 11, 2015 at 11:52:54AM +0800, Xiao Guangrong wrote:
> We reserve the memory region 0xFF0 ~ 0xFFF0 for NVDIMM ACPI
> which is used as:
> - the first page is mapped as MMIO, ACPI write data to this page to
>   transfer the control to QEMU
> 
> - the second page is RAM-based which used to save the input info of
>   _DSM method and QEMU reuse it store output info
> 
> - the left is mapped as RAM, it's the buffer returned by _FIT method,
>   this is needed by NVDIMM hotplug
> 

Isn't there some way to document this in code, e.g. with
macros?

Adding text under docs/specs would also be a good idea.


> Signed-off-by: Xiao Guangrong 
> ---
>  hw/i386/pc.c|   3 ++
>  hw/mem/Makefile.objs|   2 +-
>  hw/mem/nvdimm/acpi.c| 120 
> 
>  include/hw/i386/pc.h|   2 +
>  include/hw/mem/nvdimm.h |  19 
>  5 files changed, 145 insertions(+), 1 deletion(-)
>  create mode 100644 hw/mem/nvdimm/acpi.c
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 6694b18..8fea4c3 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1360,6 +1360,9 @@ FWCfgState *pc_memory_init(PCMachineState *pcms,
>  exit(EXIT_FAILURE);
>  }
>  
> +nvdimm_init_memory_state(>nvdimm_memory, system_memory, 
> machine,
> + TARGET_PAGE_SIZE);
> +

Shouldn't this be conditional on presence of the nvdimm device?


>  pcms->hotplug_memory.base =
>  ROUND_UP(0x1ULL + pcms->above_4g_mem_size, 1ULL << 30);
>  
> diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
> index e0ff328..7310bac 100644
> --- a/hw/mem/Makefile.objs
> +++ b/hw/mem/Makefile.objs
> @@ -1,3 +1,3 @@
>  common-obj-$(CONFIG_DIMM) += dimm.o
>  common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
> -common-obj-$(CONFIG_NVDIMM) += nvdimm/nvdimm.o
> +common-obj-$(CONFIG_NVDIMM) += nvdimm/nvdimm.o nvdimm/acpi.o
> diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
> new file mode 100644
> index 000..b640874
> --- /dev/null
> +++ b/hw/mem/nvdimm/acpi.c
> @@ -0,0 +1,120 @@
> +/*
> + * NVDIMM ACPI Implementation
> + *
> + * Copyright(C) 2015 Intel Corporation.
> + *
> + * Author:
> + *  Xiao Guangrong 
> + *
> + * NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
> + * and the DSM specfication can be found at:
> + *   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> + *
> + * Currently, it only supports PMEM Virtualization.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see 
> 
> + */
> +
> +#include "qemu-common.h"
> +#include "hw/acpi/acpi.h"
> +#include "hw/acpi/aml-build.h"
> +#include "hw/mem/nvdimm.h"
> +#include "internal.h"
> +
> +/* System Physical Address Range Structure */
> +struct nfit_spa {
> +uint16_t type;
> +uint16_t length;
> +uint16_t spa_index;
> +uint16_t flags;
> +uint32_t reserved;
> +uint32_t proximity_domain;
> +uint8_t type_guid[16];
> +uint64_t spa_base;
> +uint64_t spa_length;
> +uint64_t mem_attr;
> +} QEMU_PACKED;
> +typedef struct nfit_spa nfit_spa;
> +
> +/* Memory Device to System Physical Address Range Mapping Structure */
> +struct nfit_memdev {
> +uint16_t type;
> +uint16_t length;
> +uint32_t nfit_handle;
> +uint16_t phys_id;
> +uint16_t region_id;
> +uint16_t spa_index;
> +uint16_t dcr_index;
> +uint64_t region_len;
> +uint64_t region_offset;
> +uint64_t region_dpa;
> +uint16_t interleave_index;
> +uint16_t interleave_ways;
> +uint16_t flags;
> +uint16_t reserved;
> +} QEMU_PACKED;
> +typedef struct nfit_memdev nfit_memdev;
> +
> +/* NVDIMM Control Region Structure */
> +struct nfit_dcr {
> +uint16_t type;
> +uint16_t length;
> +uint16_t dcr_index;
> +uint16_t vendor_id;
> +uint16_t device_id;
> +uint16_t revision_id;
> +uint16_t sub_vendor_id;
> +uint16_t sub_device_id;
> +uint16_t sub_revision_id;
> +uint8_t reserved[6];
> +uint32_t serial_number;
> +uint16_t fic;
> +uint16_t num_bcw;
> +uint64_t bcw_size;
> +uint64_t cmd_offset;
> +uint64_t cmd_size;
> +uint64_t status_offset;
> +uint64_t status_size;
> +uint16_t flags;
> +uint8_t reserved2[6];
> +} 

Re: [PATCH v3 00/32] implement vNVDIMM

2015-10-19 Thread Michael S. Tsirkin
On Sun, Oct 11, 2015 at 11:52:32AM +0800, Xiao Guangrong wrote:
> Changelog in v3:
> There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
> Michael for their valuable comments, the patchset finally gets better shape.

Thanks!
This needs some changes in coding style, and more comments, to
make it easier to maintain going forward.

High level comments - I didn't point out all instances,
please go over code and locate them yourself.
I focused on acpi code in this review.

- fix coding style violations, prefix eveything with nvdimm_ etc
- in apci code, avoid manual memory management/complex pointer math
- comments are needed to document apis & explain what's going on
- constants need comments too, refer to text that
  can be looked up in acpi spec verbatim


> - changes from Igor's comments:
>   1) abstract dimm device type from pc-dimm and create nvdimm device based on
>  dimm, then it uses memory backend device as nvdimm's memory and NUMA has
>  easily been implemented.
>   2) let file-backend device support any kind of filesystem not only for
>  hugetlbfs and let it work on file not only for directory which is
>  achieved by extending 'mem-path' - if it's a directory then it works as
>  current behavior, otherwise if it's file then directly allocates memory
>  from it.
>   3) we figure out a unused memory hole below 4G that is 0xFF0 ~ 
>  0xFFF0, this range is large enough for NVDIMM ACPI as build 64-bit
>  ACPI SSDT/DSDT table will break windows XP.
>  BTW, only make SSDT.rev = 2 can not work since the width is only depended
>  on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block)
>  in ACPI spec:
> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit 
> | width of Integer objects is dependent on the ComplianceRevision of the DSDT.
> | If the ComplianceRevision is less than 2, all integers are restricted to 32 
> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT 
> sets 
> | the global integer width for all integers, including integers in SSDTs.
>   4) use the lowest ACPI spec version to document AML terms.
>   5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"
> 
> - changes from Stefan's comments:
>   1) do not do endian adjustment in-place since _DSM memory is visible to 
> guest
>   2) use target platform's target page size instead of fixed PAGE_SIZE
>  definition
>   3) lots of code style improvement and typo fixes.
>   4) live migration fix
> - changes from Paolo's comments:
>   1) improve the name of memory region
>   
> - other changes:
>   1) return exact buffer size for _DSM method instead of the page size.
>   2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
>  devices.
>   3) NUMA support
>   4) implement _FIT method
>   5) rename "configdata" to "reserve-label-data"
>   6) simplify _DSM arg3 determination
>   7) main changelog update to let it reflect v3.
> 
> Changlog in v2:
> - Use litten endian for DSM method, thanks for Stefan's suggestion
> 
> - introduce a new parameter, @configdata, if it's false, Qemu will
>   build a static and readonly namespace in memory and use it serveing
>   for DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
>   reserved region is needed at the end of the @file, it is good for
>   the user who want to pass whole nvdimm device and make its data
>   completely be visible to guest
> 
> - divide the source code into separated files and add maintain info
> 
> BTW, PCOMMIT virtualization on KVM side is work in progress, hopefully will
> be posted on next week
> 
> == Background ==
> NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported
> on Intel's platform. They are discovered via ACPI and configured by _DSM
> method of NVDIMM device in ACPI. There has some supporting documents which
> can be found at:
> ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> DSM Interface Example: 
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> Driver Writer's Guide: 
> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> 
> Currently, the NVDIMM driver has been merged into upstream Linux Kernel and
> this patchset tries to enable it in virtualization field
> 
> == Design ==
> NVDIMM supports two mode accesses, one is PMEM which maps NVDIMM into CPU's
> address space then CPU can directly access it as normal memory, another is
> BLK which is used as block device to reduce the occupying of CPU address
> space
> 
> BLK mode accesses NVDIMM via Command Register window and Data Register window.
> BLK virtualization has high workload since each sector access will cause at
> least two VM-EXIT. So we currently only imperilment vPMEM in this patchset
> 
> --- vPMEM design ---
> We introduce a new device named "nvdimm", it uses memory 

Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 03:06 PM, Michael S. Tsirkin wrote:

On Mon, Oct 19, 2015 at 12:39:24PM +0800, Xiao Guangrong wrote:



On 10/19/2015 02:05 AM, Michael S. Tsirkin wrote:

On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote:

__DSM is defined in ACPI 6.0: 9.14.1 _DSM (Device Specific Method)

Function 0 is a query function. We do not support any function on root
device and only 3 functions are support for NVDIMM device,
DSM_DEV_FUN_NAMESPACE_LABEL_SIZE, DSM_DEV_FUN_GET_NAMESPACE_LABEL_DATA and
DSM_DEV_FUN_SET_NAMESPACE_LABEL_DATA, that means we currently only allow to
access device's Label Namespace

Signed-off-by: Xiao Guangrong 
---
  hw/acpi/nvdimm.c | 184 ++-
  1 file changed, 182 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index b211b8b..37fea1c 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -260,6 +260,22 @@ static uint32_t nvdimm_slot_to_dcr_index(int slot)
  return nvdimm_slot_to_spa_index(slot) + 1;
  }

+static NVDIMMDevice
+*nvdimm_get_device_by_handle(GSList *list, uint32_t handle)
+{
+for (; list; list = list->next) {
+NVDIMMDevice *nvdimm = list->data;
+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
+   NULL);
+
+if (nvdimm_slot_to_handle(slot) == handle) {
+return nvdimm;
+}
+}
+
+return NULL;
+}
+
  /*
   * Please refer to ACPI 6.0: 5.2.25.1 System Physical Address Range
   * Structure
@@ -411,6 +427,60 @@ static void nvdimm_build_nfit(GArray *structures, GArray 
*table_offsets,
  /* detailed _DSM design please refer to docs/specs/acpi_nvdimm.txt */
  #define NOTIFY_VALUE  0x99


Again, please prefix everything consistently.


Okay, will do. Sorry for i missed it.





+enum {
+DSM_FUN_IMPLEMENTED = 0,
+
+/* NVDIMM Root Device Functions */
+DSM_ROOT_DEV_FUN_ARS_CAP = 1,
+DSM_ROOT_DEV_FUN_ARS_START = 2,
+DSM_ROOT_DEV_FUN_ARS_QUERY = 3,
+
+/* NVDIMM Device (non-root) Functions */
+DSM_DEV_FUN_SMART = 1,
+DSM_DEV_FUN_SMART_THRESHOLD = 2,
+DSM_DEV_FUN_BLOCK_NVDIMM_FLAGS = 3,
+DSM_DEV_FUN_NAMESPACE_LABEL_SIZE = 4,
+DSM_DEV_FUN_GET_NAMESPACE_LABEL_DATA = 5,
+DSM_DEV_FUN_SET_NAMESPACE_LABEL_DATA = 6,
+DSM_DEV_FUN_VENDOR_EFFECT_LOG_SIZE = 7,
+DSM_DEV_FUN_GET_VENDOR_EFFECT_LOG = 8,
+DSM_DEV_FUN_VENDOR_SPECIFIC = 9,
+};


Does FUN stand for "function"? FUNC or FN is probably better.



Yes.


Please list exact names as they appear in spec so
they can be searched for.


The spec reference was at where this _FUN_ is used, eg:

/*
  * Please refer to DSM specification 4.4.1 Get Namespace Label Size
  * (Function Index 4).
  *
  * It gets the size of Namespace Label data area and the max data size
  * that Get/Set Namespace Label Data functions can transfer.
  */
static void nvdimm_dsm_func_label_size(NVDIMMDevice *nvdimm, GArray *out)

I will follow your ‘single use’ comments below, these definitions will
be dropped, the code will be like this:

switch (function) {
case 4 /* DSM Spec 4.4.1 Get Namespace Label Size Get Namespace Label Size. */:


If it's the same spec, you don't have to repeat it:

/* Encode function according to DSM Spec rev 1.0 */

switch (function) {
 case 4 /* 4.4.1 Get Namespace Label Size Get Namespace Label Size. */:


same for chapter etc.


Okay.




nvdimm_dsm_func_label_size();
case ...
...
};






+
+enum {
+/* Common return status codes. */
+DSM_STATUS_SUCCESS = 0,   /* Success */
+DSM_STATUS_NOT_SUPPORTED = 1, /* Not Supported */
+
+/* NVDIMM Root Device _DSM function return status codes*/
+DSM_ROOT_DEV_STATUS_INVALID_PARAS = 2,/* Invalid Input Parameters */
+DSM_ROOT_DEV_STATUS_FUNCTION_SPECIFIC_ERROR = 3, /* Function-Specific
+Error */
+
+/* NVDIMM Device (non-root) _DSM function return status codes*/
+DSM_DEV_STATUS_NON_EXISTING_MEM_DEV = 2,  /* Non-Existing Memory Device */
+DSM_DEV_STATUS_INVALID_PARAS = 3, /* Invalid Input Parameters */
+DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4, /* Vendor Specific Error */
+};
+
+/* Current revision supported by DSM specification is 1. */
+#define DSM_REVISION(1)
+
+/*
+ * please refer to ACPI 6.0: 9.14.1 _DSM (Device Specific Method): Return
+ * Value Information:


Drop "please refer to".


Okay.




+ *   if set to zero, no functions are supported (other than function zero)
+ *   for the specified UUID and Revision ID. If set to one, at least one
+ *   additional function is supported.
+ */
+
+/* do not support any function on root. */
+#define ROOT_SUPPORT_FUN (0ULL)


Needs a name that implies the comment somehow.


+#define DIMM_SUPPORT_FUN((1 << DSM_FUN_IMPLEMENTED)   \
+   | (1 << DSM_DEV_FUN_NAMESPACE_LABEL_SIZE)  

Re: [PATCH v3 26/32] nvdimm: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 02:50 PM, Michael S. Tsirkin wrote:

On Sun, Oct 11, 2015 at 11:52:58AM +0800, Xiao Guangrong wrote:

Check if the input Arg3 is valid then store it into dsm_in if needed

We only do the save on NVDIMM device since we are not going to support any
function on root device

Signed-off-by: Xiao Guangrong 
---
  hw/mem/nvdimm/acpi.c | 21 -
  1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
index d9fa0fd..3b9399c 100644
--- a/hw/mem/nvdimm/acpi.c
+++ b/hw/mem/nvdimm/acpi.c
@@ -442,7 +442,7 @@ static void build_nvdimm_devices(NVDIMMState *state, GSList 
*device_list,
  int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
 NULL);
  uint32_t handle = nvdimm_slot_to_handle(slot);
-Aml *dev, *method;
+Aml *dev, *method, *ifctx;

  dev = aml_device("NV%02X", slot);
  aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
@@ -452,6 +452,24 @@ static void build_nvdimm_devices(NVDIMMState *state, 
GSList *device_list,
  method = aml_method("_DSM", 4);
  {
  SAVE_ARG012_HANDLE_LOCK(method, aml_int(handle));
+
+/* Arg3 is passed as Package and it has one element? */
+ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
+ aml_int(4)),
+   aml_equal(aml_sizeof(aml_arg(3)),


aml_arg(3) is used many times below.
Pls give it a name that makes sense (not arg3! what is it for?)



Er. aml_arg(3) is just the fourth parameter of _DSM method. Will add some
comments:

/*
 * The fourth parameter (Arg3) of _DMS is a package which contains a buffer, the
 * layout of the buffer is specified by UUID (Arg0), Revision ID (Arg1) and
 * Function Index (Arg2) which are documented in the DSM specification.
 */


+ aml_int(1;


Pls document AML constants used.
Like this:

 ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
  aml_int(4 /* 4 - Package */) ),
aml_equal(aml_sizeof(aml_arg(3)),
  aml_int(1;


+{
+/* Local0 = Index(Arg3, 0) */
+aml_append(ifctx, aml_store(aml_index(aml_arg(3), aml_int(0)),
+aml_local(0)));
+/* Local3 = DeRefOf(Local0) */
+aml_append(ifctx, aml_store(aml_derefof(aml_local(0)),
+aml_local(3)));
+/* ARG3 = Local3 */
+aml_append(ifctx, aml_store(aml_local(3), aml_name("ARG3")));


This isn't a good way to comment things: you are
just adding ASL before the equivalent C.
Pls document what's going on.



Okay... i just thought C is little readable than AML. Will change the comment
to:

/* fetch buffer from the package (Arg3) and store it to DSM memory. */

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 26/32] nvdimm: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 03:14:28PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/19/2015 02:50 PM, Michael S. Tsirkin wrote:
> >On Sun, Oct 11, 2015 at 11:52:58AM +0800, Xiao Guangrong wrote:
> >>Check if the input Arg3 is valid then store it into dsm_in if needed
> >>
> >>We only do the save on NVDIMM device since we are not going to support any
> >>function on root device
> >>
> >>Signed-off-by: Xiao Guangrong 
> >>---
> >>  hw/mem/nvdimm/acpi.c | 21 -
> >>  1 file changed, 20 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
> >>index d9fa0fd..3b9399c 100644
> >>--- a/hw/mem/nvdimm/acpi.c
> >>+++ b/hw/mem/nvdimm/acpi.c
> >>@@ -442,7 +442,7 @@ static void build_nvdimm_devices(NVDIMMState *state, 
> >>GSList *device_list,
> >>  int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
> >> NULL);
> >>  uint32_t handle = nvdimm_slot_to_handle(slot);
> >>-Aml *dev, *method;
> >>+Aml *dev, *method, *ifctx;
> >>
> >>  dev = aml_device("NV%02X", slot);
> >>  aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
> >>@@ -452,6 +452,24 @@ static void build_nvdimm_devices(NVDIMMState *state, 
> >>GSList *device_list,
> >>  method = aml_method("_DSM", 4);
> >>  {
> >>  SAVE_ARG012_HANDLE_LOCK(method, aml_int(handle));
> >>+
> >>+/* Arg3 is passed as Package and it has one element? */
> >>+ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
> >>+ aml_int(4)),
> >>+   aml_equal(aml_sizeof(aml_arg(3)),
> >
> >aml_arg(3) is used many times below.
> >Pls give it a name that makes sense (not arg3! what is it for?)
> >
> 
> Er. aml_arg(3) is just the fourth parameter of _DSM method. Will add some
> comments:
> 
> /*
>  * The fourth parameter (Arg3) of _DMS is a package which contains a buffer, 
> the
>  * layout of the buffer is specified by UUID (Arg0), Revision ID (Arg1) and
>  * Function Index (Arg2) which are documented in the DSM specification.
>  */
> 
> >>+ aml_int(1;
> >
> >Pls document AML constants used.
> >Like this:
> >
> > ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
> >  aml_int(4 /* 4 - Package */) ),
> >aml_equal(aml_sizeof(aml_arg(3)),
> >  aml_int(1;
> >
> >>+{
> >>+/* Local0 = Index(Arg3, 0) */
> >>+aml_append(ifctx, aml_store(aml_index(aml_arg(3), 
> >>aml_int(0)),
> >>+aml_local(0)));
> >>+/* Local3 = DeRefOf(Local0) */
> >>+aml_append(ifctx, aml_store(aml_derefof(aml_local(0)),
> >>+aml_local(3)));
> >>+/* ARG3 = Local3 */
> >>+aml_append(ifctx, aml_store(aml_local(3), 
> >>aml_name("ARG3")));
> >
> >This isn't a good way to comment things: you are
> >just adding ASL before the equivalent C.
> >Pls document what's going on.
> >
> 
> Okay... i just thought C is little readable than AML. Will change the comment
> to:
> 
> /* fetch buffer from the package (Arg3) and store it to DSM memory. */
> 
> Thanks.

You can use variables to make the logic clear. E.g.:

Aml *pckg = aml_arg(3);
Aml *pckg_idx = aml_local(0);
Aml *pckg_buf = aml_local(3);

aml_append(ifctx, aml_store(aml_index(pckg, aml_int(0)), pckg_idx);
aml_append(ifctx, aml_store(aml_derefof(pckg_idx), pckg_buf));


This is also better than repeating aml_arg(3) many times.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: just an observation about USB

2015-10-19 Thread Gerd Hoffmann
On Fr, 2015-10-16 at 11:48 -0400, Eric S. Johansson wrote:
> 
> On 10/16/2015 07:55 AM, Stefan Hajnoczi wrote:
> > QEMU can emulate PCI soundcards, including the Intel HD Audio codec 
> > cards (-device intel-hda or -soundhw hda might do the trick). Low 
> > latency and power consumption are usually at odds with each other. 
> > That's because real-time audio requires small buffers many times per 
> > second, so lots of interrupts and power consumption. Anyway, PCI 
> > should be an improvement from USB audio. Stefan 
> 
> I set it up with ich9.  I switched the default audio to my headset. I 
> hear the windows startup sound in the headset. Dragon reports that the 
> mic is not plugged in.  I can see the audio level move in the sound 
> settings so I know the host is hearing the audio
> 
> what should I look at next?

Try '-device intel-hda -device hda-micro' (instead of -device intel-hda
-device hda-duplex', or '-soundhw hda' which is a shortcut for the
latter).

'hda-duplex' presents a codec with line-in and line-out to the guest.
'hda-micro' presents a codec with microphone and speaker to the guest.
Other than having in and out tagged differently the codecs are
identical.  But especially declaring the input being a mic seems to be
needed to make some picky windows software happy.

cheers,
  Gerd


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:
> >>+nvdimm_init_memory_state(>nvdimm_memory, system_memory, 
> >>machine,
> >>+ TARGET_PAGE_SIZE);
> >>+
> >
> >Shouldn't this be conditional on presence of the nvdimm device?
> >
> 
> We will enable hotplug on nvdimm devices in the near future once Linux driver 
> is
> ready. I'd keep it here for future development.

No, I don't think we should add stuff unconditionally. If not nvdimm,
some other flag should indicate user intends to hotplug things.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:

On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:

+nvdimm_init_memory_state(>nvdimm_memory, system_memory, machine,
+ TARGET_PAGE_SIZE);
+


Shouldn't this be conditional on presence of the nvdimm device?



We will enable hotplug on nvdimm devices in the near future once Linux driver is
ready. I'd keep it here for future development.


No, I don't think we should add stuff unconditionally. If not nvdimm,
some other flag should indicate user intends to hotplug things.



Actually, it is not unconditionally which is called if parameter "-m aaa, 
maxmem=bbb"
(aaa < bbb) is used. It is on the some path of memoy-hotplug initiation.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-19 Thread Paolo Bonzini


On 16/10/2015 17:38, Alex Williamson wrote:
> According to Microsoft documentation, the signature in the standard
> hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
> for reporting and diagnostic purposes only.  We can therefore allow
> the user to change it to whatever they want, within the 12 character
> limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
> for this, ex:
> 
>  -cpu host,hv_time,hv_vendor_id=KeenlyKVM
> 
> Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
> Signed-off-by: Alex Williamson 
> ---
> 
> v2: Replace abort() with truncating the string, error report updated
> 
> Igor also had the idea of creating a DEFINE_PROP_STRING_LEN property
> where we could enforce the length earlier in the parameter checking.
> If we like that idea, we probably need to do it first since we don't
> want to switch from truncating to erroring between releases.  I can
> work on that if preferred.  Thanks,

I applied this one, because the truncation matches what is done in other
places (ACPI tables, SCSI product/vendor, etc.)

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 26/32] nvdimm: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Michael S. Tsirkin
On Sun, Oct 11, 2015 at 11:52:58AM +0800, Xiao Guangrong wrote:
> Check if the input Arg3 is valid then store it into dsm_in if needed
> 
> We only do the save on NVDIMM device since we are not going to support any
> function on root device
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/mem/nvdimm/acpi.c | 21 -
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
> index d9fa0fd..3b9399c 100644
> --- a/hw/mem/nvdimm/acpi.c
> +++ b/hw/mem/nvdimm/acpi.c
> @@ -442,7 +442,7 @@ static void build_nvdimm_devices(NVDIMMState *state, 
> GSList *device_list,
>  int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
> NULL);
>  uint32_t handle = nvdimm_slot_to_handle(slot);
> -Aml *dev, *method;
> +Aml *dev, *method, *ifctx;
>  
>  dev = aml_device("NV%02X", slot);
>  aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
> @@ -452,6 +452,24 @@ static void build_nvdimm_devices(NVDIMMState *state, 
> GSList *device_list,
>  method = aml_method("_DSM", 4);
>  {
>  SAVE_ARG012_HANDLE_LOCK(method, aml_int(handle));
> +
> +/* Arg3 is passed as Package and it has one element? */
> +ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
> + aml_int(4)),
> +   aml_equal(aml_sizeof(aml_arg(3)),

aml_arg(3) is used many times below.
Pls give it a name that makes sense (not arg3! what is it for?)

> + aml_int(1;

Pls document AML constants used.
Like this:

ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
 aml_int(4 /* 4 - Package */) ),
   aml_equal(aml_sizeof(aml_arg(3)),
 aml_int(1;

> +{
> +/* Local0 = Index(Arg3, 0) */
> +aml_append(ifctx, aml_store(aml_index(aml_arg(3), 
> aml_int(0)),
> +aml_local(0)));
> +/* Local3 = DeRefOf(Local0) */
> +aml_append(ifctx, aml_store(aml_derefof(aml_local(0)),
> +aml_local(3)));
> +/* ARG3 = Local3 */
> +aml_append(ifctx, aml_store(aml_local(3), aml_name("ARG3")));

This isn't a good way to comment things: you are
just adding ASL before the equivalent C.
Pls document what's going on.




> +}
> +aml_append(method, ifctx);
> +
>  NOTIFY_AND_RETURN_UNLOCK(method);
>  }
>  aml_append(dev, method);
> @@ -534,6 +552,7 @@ static void nvdimm_build_acpi_devices(NVDIMMState *state, 
> GSList *device_list,
>  method = aml_method("_DSM", 4);
>  {
>  SAVE_ARG012_HANDLE_LOCK(method, aml_int(0));
> +/* no command we support on ROOT device has Arg3. */
>  NOTIFY_AND_RETURN_UNLOCK(method);
>  }
>  aml_append(dev, method);
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 00/32] implement vNVDIMM

2015-10-19 Thread Michael S. Tsirkin
On Tue, Oct 13, 2015 at 01:29:48PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/12/2015 07:55 PM, Michael S. Tsirkin wrote:
> >On Sun, Oct 11, 2015 at 11:52:32AM +0800, Xiao Guangrong wrote:
> >>Changelog in v3:
> >>There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
> >>Michael for their valuable comments, the patchset finally gets better shape.
> >
> >Thanks!
> >This needs some changes in coding style, and more comments, to
> >make it easier to maintain going forward.
> 
> Thanks for your review, Michael. I have learned lots of thing from
> your comments.
> 
> >
> >High level comments - I didn't point out all instances,
> >please go over code and locate them yourself.
> >I focused on acpi code in this review.
> 
> Okay, will do.
> 
> >
> > - fix coding style violations, prefix eveything with nvdimm_ etc
> 
> Actually i did not pay attention on naming the stuff which is only internally
> used. Thank you for pointing it out and will fix it in next version.
> 
> > - in apci code, avoid manual memory management/complex pointer math
> 
> I am not very good at ACPI ASL/AML, could you please more detail?

It's about C.

For example:
Foo *foo = acpi_data_push(table, sizeof *foo);
Bar *foo = acpi_data_push(table, sizeof *bar);
is pretty obviously safe, and it doesn't require you to do any
calculations.
char *buf = acpi_data_push(table, sizeof *foo + sizeof *bar);
is worse, now you need:
Bar *bar = (Bar *)(buf + sizeof *foo);
which will corrupt memory if you get the size wrong in push.

> > - comments are needed to document apis & explain what's going on
> > - constants need comments too, refer to text that
> >   can be looked up in acpi spec verbatim
> 
> Indeed, will document carefully.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 27/33] nvdimm acpi: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 12:04:48PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/19/2015 01:16 AM, Michael S. Tsirkin wrote:
> >On Mon, Oct 19, 2015 at 08:54:13AM +0800, Xiao Guangrong wrote:
> >>Check if the input Arg3 is valid then store it into dsm_in if needed
> >>
> >>Signed-off-by: Xiao Guangrong 
> >>---
> >>  hw/acpi/nvdimm.c | 19 +++
> >>  1 file changed, 19 insertions(+)
> >>
> >>diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> >>index 7e99889..b211b8b 100644
> >>--- a/hw/acpi/nvdimm.c
> >>+++ b/hw/acpi/nvdimm.c
> >>@@ -624,10 +624,29 @@ static void nvdimm_build_acpi_devices(NVDIMMState 
> >>*state, GSList *device_list,
> >>
> >>  method = aml_method_serialized("NCAL", 4);
> >>  {
> >>+Aml *ifctx;
> >>+
> >>  aml_append(method, aml_store(aml_arg(0), aml_name("HDLE")));
> >>  aml_append(method, aml_store(aml_arg(1), aml_name("REVS")));
> >>  aml_append(method, aml_store(aml_arg(2), aml_name("FUNC")));
> >>
> >>+/* Arg3 is passed as Package and it has one element? */
> >>+ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
> >>+ aml_int(4)),
> >>+   aml_equal(aml_sizeof(aml_arg(3)),
> >>+ aml_int(1;
> >>+{
> >>+/* Local0 = Index(Arg3, 0) */
> >>+aml_append(ifctx, aml_store(aml_index(aml_arg(3), aml_int(0)),
> >>+aml_local(0)));
> >>+/* Local3 = DeRefOf(Local0) */
> >>+aml_append(ifctx, aml_store(aml_derefof(aml_local(0)),
> >>+aml_local(3)));
> >>+/* ARG3 = Local3 */
> >>+aml_append(ifctx, aml_store(aml_local(3), aml_name("ARG3")));
> >>+}
> >>+aml_append(method, ifctx);
> >>+
> >>  aml_append(method, aml_store(aml_int(NOTIFY_VALUE), 
> >> aml_name("NOTI")));
> >>
> >>  aml_append(method, aml_store(aml_name("RLEN"), aml_local(6)));
> >
> >I commented on this patch on v3.
> >It doesn't look like this was addressed.
> >
> 
> Ah... I see no one commented this patch ([PATCH v3 26/32] nvdimm: save arg3 
> for NVDIMM
> device_DSM method) on v3.
> 
> Do you mean we need more and better comment to explain arg3? Or anything else?
> 


I mean don't use ASL to comment C. It's not more readable.
Describe why the code is the way it is. Use variables by preference,
C does not have weird limitations like ASL so you don't need to call
your variables "arg3". What does it hold?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 02:56 PM, Michael S. Tsirkin wrote:

On Sun, Oct 11, 2015 at 11:52:54AM +0800, Xiao Guangrong wrote:

We reserve the memory region 0xFF0 ~ 0xFFF0 for NVDIMM ACPI
which is used as:
- the first page is mapped as MMIO, ACPI write data to this page to
   transfer the control to QEMU

- the second page is RAM-based which used to save the input info of
   _DSM method and QEMU reuse it store output info

- the left is mapped as RAM, it's the buffer returned by _FIT method,
   this is needed by NVDIMM hotplug



Isn't there some way to document this in code, e.g. with
macros?



Yes. It's also documented when DSM memory is created, see
nvdimm_build_dsm_memory() introduced in
[PATCH v4 25/33] nvdimm acpi: init the address region used by DSM


Adding text under docs/specs would also be a good idea.



Yes. A doc has been added in v4.




Signed-off-by: Xiao Guangrong 
---
  hw/i386/pc.c|   3 ++
  hw/mem/Makefile.objs|   2 +-
  hw/mem/nvdimm/acpi.c| 120 
  include/hw/i386/pc.h|   2 +
  include/hw/mem/nvdimm.h |  19 
  5 files changed, 145 insertions(+), 1 deletion(-)
  create mode 100644 hw/mem/nvdimm/acpi.c

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6694b18..8fea4c3 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1360,6 +1360,9 @@ FWCfgState *pc_memory_init(PCMachineState *pcms,
  exit(EXIT_FAILURE);
  }

+nvdimm_init_memory_state(>nvdimm_memory, system_memory, machine,
+ TARGET_PAGE_SIZE);
+


Shouldn't this be conditional on presence of the nvdimm device?



We will enable hotplug on nvdimm devices in the near future once Linux driver is
ready. I'd keep it here for future development.




  pcms->hotplug_memory.base =
  ROUND_UP(0x1ULL + pcms->above_4g_mem_size, 1ULL << 30);

diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index e0ff328..7310bac 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1,3 +1,3 @@
  common-obj-$(CONFIG_DIMM) += dimm.o
  common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
-common-obj-$(CONFIG_NVDIMM) += nvdimm/nvdimm.o
+common-obj-$(CONFIG_NVDIMM) += nvdimm/nvdimm.o nvdimm/acpi.o
diff --git a/hw/mem/nvdimm/acpi.c b/hw/mem/nvdimm/acpi.c
new file mode 100644
index 000..b640874
--- /dev/null
+++ b/hw/mem/nvdimm/acpi.c
@@ -0,0 +1,120 @@
+/*
+ * NVDIMM ACPI Implementation
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong 
+ *
+ * NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
+ * and the DSM specfication can be found at:
+ *   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+ *
+ * Currently, it only supports PMEM Virtualization.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "qemu-common.h"
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
+#include "hw/mem/nvdimm.h"
+#include "internal.h"
+
+/* System Physical Address Range Structure */
+struct nfit_spa {
+uint16_t type;
+uint16_t length;
+uint16_t spa_index;
+uint16_t flags;
+uint32_t reserved;
+uint32_t proximity_domain;
+uint8_t type_guid[16];
+uint64_t spa_base;
+uint64_t spa_length;
+uint64_t mem_attr;
+} QEMU_PACKED;
+typedef struct nfit_spa nfit_spa;
+
+/* Memory Device to System Physical Address Range Mapping Structure */
+struct nfit_memdev {
+uint16_t type;
+uint16_t length;
+uint32_t nfit_handle;
+uint16_t phys_id;
+uint16_t region_id;
+uint16_t spa_index;
+uint16_t dcr_index;
+uint64_t region_len;
+uint64_t region_offset;
+uint64_t region_dpa;
+uint16_t interleave_index;
+uint16_t interleave_ways;
+uint16_t flags;
+uint16_t reserved;
+} QEMU_PACKED;
+typedef struct nfit_memdev nfit_memdev;
+
+/* NVDIMM Control Region Structure */
+struct nfit_dcr {
+uint16_t type;
+uint16_t length;
+uint16_t dcr_index;
+uint16_t vendor_id;
+uint16_t device_id;
+uint16_t revision_id;
+uint16_t sub_vendor_id;
+uint16_t sub_device_id;
+uint16_t sub_revision_id;
+uint8_t reserved[6];
+uint32_t serial_number;
+uint16_t fic;
+uint16_t num_bcw;
+uint64_t bcw_size;
+uint64_t cmd_offset;
+uint64_t 

Re: [PATCH v3 26/32] nvdimm: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 03:47 PM, Michael S. Tsirkin wrote:


aml_arg(3) is used many times below.
Pls give it a name that makes sense (not arg3! what is it for?)



Er. aml_arg(3) is just the fourth parameter of _DSM method. Will add some
comments:

/*
  * The fourth parameter (Arg3) of _DMS is a package which contains a buffer, 
the
  * layout of the buffer is specified by UUID (Arg0), Revision ID (Arg1) and
  * Function Index (Arg2) which are documented in the DSM specification.
  */


+ aml_int(1;


Pls document AML constants used.
Like this:

 ifctx = aml_if(aml_and(aml_equal(aml_object_type(aml_arg(3)),
  aml_int(4 /* 4 - Package */) ),
aml_equal(aml_sizeof(aml_arg(3)),
  aml_int(1;


+{
+/* Local0 = Index(Arg3, 0) */
+aml_append(ifctx, aml_store(aml_index(aml_arg(3), aml_int(0)),
+aml_local(0)));
+/* Local3 = DeRefOf(Local0) */
+aml_append(ifctx, aml_store(aml_derefof(aml_local(0)),
+aml_local(3)));
+/* ARG3 = Local3 */
+aml_append(ifctx, aml_store(aml_local(3), aml_name("ARG3")));


This isn't a good way to comment things: you are
just adding ASL before the equivalent C.
Pls document what's going on.



Okay... i just thought C is little readable than AML. Will change the comment
to:

/* fetch buffer from the package (Arg3) and store it to DSM memory. */

Thanks.


You can use variables to make the logic clear. E.g.:

Aml *pckg = aml_arg(3);
Aml *pckg_idx = aml_local(0);
Aml *pckg_buf = aml_local(3);

aml_append(ifctx, aml_store(aml_index(pckg, aml_int(0)), pckg_idx);
aml_append(ifctx, aml_store(aml_derefof(pckg_idx), pckg_buf));


This is also better than repeating aml_arg(3) many times.



Indeed, it's more clearer now.

Thanks for your review and really appreciate for your patience, Michael!


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] kvmtool/build: introduce GUEST_PRE_INIT target

2015-10-19 Thread Oleg Nesterov
This comes as a separate patch because I do not really understand
/usr/bin/make, probably it should be updated.

Change the main Makefile so that if an arch defines ARCH_PRE_INIT
then we

- build $GUEST_INIT without "-static"

- add -DCONFIG_GUEST_PRE_INIT to $CFLAGS

- build $ARCH_PRE_INIT as guest/guest_pre_init.o and embed it
  into lkvm the same as we do with guest/guest_init.o

This also means that ARCH_PRE_INIT case doesn't depend on glibc-static,
we can relax the SOURCE_STATIC check later.

Signed-off-by: Oleg Nesterov 
---
 .gitignore |  1 +
 Makefile   | 25 -
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/.gitignore b/.gitignore
index f10d3c5..697a63f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,6 +9,7 @@ include/common-cmds.h
 tests/boot/boot_test.iso
 tests/boot/rootfs/
 guest/init
+guest/pre_init
 guest/init_stage2
 KVMTOOLS-VERSION-FILE
 /x86/bios/bios.bin
diff --git a/Makefile b/Makefile
index f1701aa..0f8e003 100644
--- a/Makefile
+++ b/Makefile
@@ -281,6 +281,14 @@ ifeq ($(call try-build,$(SOURCE_STATIC),,-static),y)
CFLAGS  += -DCONFIG_GUEST_INIT
GUEST_INIT  := guest/init
GUEST_OBJS  = guest/guest_init.o
+   ifeq ($(ARCH_PRE_INIT),)
+   GUEST_INIT_FLAGS+= -static
+   else
+   CFLAGS  += -DCONFIG_GUEST_PRE_INIT
+   GUEST_INIT_FLAGS+= -DCONFIG_GUEST_PRE_INIT
+   GUEST_PRE_INIT  := guest/pre_init
+   GUEST_OBJS  += guest/guest_pre_init.o
+   endif
 else
$(warning No static libc found. Skipping guest init)
NOTFOUND+= static-libc
@@ -346,7 +354,7 @@ ifneq ($(WERROR),0)
CFLAGS += -Werror
 endif
 
-all: $(PROGRAM) $(PROGRAM_ALIAS) $(GUEST_INIT)
+all: $(PROGRAM) $(PROGRAM_ALIAS) $(GUEST_INIT) $(GUEST_PRE_INIT)
 
 # CFLAGS used when building objects
 # This is intentionally not assigned using :=
@@ -360,11 +368,11 @@ c_flags   = -Wp,-MD,$(depfile) $(CFLAGS)
 #
 STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
 
-$(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
+$(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT) $(GUEST_PRE_INIT)
$(E) "  LINK" $@
$(Q) $(CC) -static $(CFLAGS) $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_OBJS) 
$(LIBS) $(LIBS_STATOPT) -o $@
 
-$(PROGRAM): $(OBJS) $(OBJS_DYNOPT) $(OTHEROBJS) $(GUEST_INIT)
+$(PROGRAM): $(OBJS) $(OBJS_DYNOPT) $(OTHEROBJS) $(GUEST_INIT) $(GUEST_PRE_INIT)
$(E) "  LINK" $@
$(Q) $(CC) $(CFLAGS) $(OBJS) $(OBJS_DYNOPT) $(OTHEROBJS) $(GUEST_OBJS) 
$(LIBS) $(LIBS_DYNOPT) -o $@
 
@@ -372,9 +380,16 @@ $(PROGRAM_ALIAS): $(PROGRAM)
$(E) "  LN  " $@
$(Q) ln -f $(PROGRAM) $@
 
+ifneq ($(ARCH_PRE_INIT),)
+$(GUEST_PRE_INIT): $(ARCH_PRE_INIT)
+   $(E) "  LINK" $@
+   $(Q) $(CC) -s -nostdlib $(ARCH_PRE_INIT) -o $@
+   $(Q) $(LD) $(LDFLAGS) -r -b binary -o guest/guest_pre_init.o 
$(GUEST_PRE_INIT)
+endif
+
 $(GUEST_INIT): guest/init.c
$(E) "  LINK" $@
-   $(Q) $(CC) -static guest/init.c -o $@
+   $(Q) $(CC) $(GUEST_INIT_FLAGS) guest/init.c -o $@
$(Q) $(LD) $(LDFLAGS) -r -b binary -o guest/guest_init.o $(GUEST_INIT)
 
 %.s: %.c
@@ -473,7 +488,7 @@ clean:
$(Q) rm -f x86/bios/bios-rom.h
$(Q) rm -f tests/boot/boot_test.iso
$(Q) rm -rf tests/boot/rootfs/
-   $(Q) rm -f $(DEPS) $(OBJS) $(OTHEROBJS) $(OBJS_DYNOPT) $(STATIC_OBJS) 
$(PROGRAM) $(PROGRAM_ALIAS) $(PROGRAM)-static $(GUEST_INIT) $(GUEST_OBJS)
+   $(Q) rm -f $(DEPS) $(OBJS) $(OTHEROBJS) $(OBJS_DYNOPT) $(STATIC_OBJS) 
$(PROGRAM) $(PROGRAM_ALIAS) $(PROGRAM)-static $(GUEST_INIT) $(GUEST_PRE_INIT) 
$(GUEST_OBJS)
$(Q) rm -f cscope.*
$(Q) rm -f tags
$(Q) rm -f TAGS
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] kvmtool/x86: implement guest_pre_init

2015-10-19 Thread Oleg Nesterov
Add the tiny x86/init.S which just mounts /host and execs
/virt/init.

NOTE: of course, the usage of CONFIG_GUEST_PRE_INIT is ugly, we
need to cleanup this code. But I'd prefer to do this on top of
this minimal/simple change. And I think this needs cleanups in
any case, for example I think lkvm shouldn't abuse the "init="
kernel parameter at all.

TODO: x86/init.S doesn't handle i386, should be simple to add.

Signed-off-by: Oleg Nesterov 
---
 Makefile|  1 +
 builtin-run.c   |  4 
 builtin-setup.c | 14 +-
 guest/init.c|  2 ++
 x86/init.S  | 38 ++
 5 files changed, 58 insertions(+), 1 deletion(-)
 create mode 100644 x86/init.S

diff --git a/Makefile b/Makefile
index 0f8e003..f8f7cc4 100644
--- a/Makefile
+++ b/Makefile
@@ -110,6 +110,7 @@ endif
 ifeq ($(ARCH),x86_64)
ARCH := x86
DEFINES  += -DCONFIG_X86_64
+   ARCH_PRE_INIT = x86/init.S
 endif
 
 ### Arch-specific stuff
diff --git a/builtin-run.c b/builtin-run.c
index e0c8732..3da00da 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -600,7 +600,11 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
if (kvm->cfg.custom_rootfs) {
kvm_run_set_sandbox(kvm);
 
+#ifdef CONFIG_GUEST_PRE_INIT
+   strcat(real_cmdline, " init=/virt/pre_init");
+#else
strcat(real_cmdline, " init=/virt/init");
+#endif
 
if (!kvm->cfg.no_dhcp)
strcat(real_cmdline, "  ip=dhcp");
diff --git a/builtin-setup.c b/builtin-setup.c
index bb7c71c..1e5b1e4 100644
--- a/builtin-setup.c
+++ b/builtin-setup.c
@@ -144,12 +144,24 @@ static int extract_file(const char *guestfs_name, const 
char *filename,
 
 extern char _binary_guest_init_start;
 extern char _binary_guest_init_size;
+extern char _binary_guest_pre_init_start;
+extern char _binary_guest_pre_init_size;
 
 int kvm_setup_guest_init(const char *guestfs_name)
 {
-   return extract_file(guestfs_name, "virt/init",
+   int err;
+
+#ifdef CONFIG_GUEST_PRE_INIT
+   err = extract_file(guestfs_name, "virt/pre_init",
+   &_binary_guest_pre_init_start,
+   &_binary_guest_pre_init_size);
+   if (err)
+   return err;
+#endif
+   err = extract_file(guestfs_name, "virt/init",
&_binary_guest_init_start,
&_binary_guest_init_size);
+   return err;
 }
 #else
 int kvm_setup_guest_init(const char *guestfs_name)
diff --git a/guest/init.c b/guest/init.c
index 7277a07..46e3fa4 100644
--- a/guest/init.c
+++ b/guest/init.c
@@ -30,7 +30,9 @@ static int run_process_sandbox(char *filename)
 
 static void do_mounts(void)
 {
+#ifndef CONFIG_GUEST_PRE_INIT
mount("hostfs", "/host", "9p", MS_RDONLY, 
"trans=virtio,version=9p2000.L");
+#endif
mount("", "/sys", "sysfs", 0, NULL);
mount("proc", "/proc", "proc", 0, NULL);
mount("devtmpfs", "/dev", "devtmpfs", 0, NULL);
diff --git a/x86/init.S b/x86/init.S
new file mode 100644
index 000..488a93f
--- /dev/null
+++ b/x86/init.S
@@ -0,0 +1,38 @@
+.data
+
+.m_dev:
+.string "hostfs"
+.m_dir:
+.string "/host"
+.m_typ:
+.string "9p"
+.m_opt:
+.string "trans=virtio,version=9p2000.L"
+
+.e_nam:
+.string "/virt/init"
+
+.text
+.globl _start
+_start:
+
+   mov $165, %rax  # __NR_mount
+   mov $.m_dev, %rdi
+   mov $.m_dir, %rsi
+   mov $.m_typ, %rdx
+   mov $1, %r10# MS_RDONLY
+   mov $.m_opt, %r8
+   syscall
+
+   mov $59, %rax   # __NR_execve
+   mov $.e_nam, %rdi
+   lea 8(%rsp), %rsi   # argv[]
+   mov %rdi, (%rsi)# change argv[0]
+   pop %rcx# argc
+   inc %rcx
+   lea (%rsi,%rcx,8), %rdx # envp[]
+   syscall
+
+   mov $60, %rax   # __NR_exit
+   mov $1, %rdi
+   syscall # panic
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] kvmtool/setup: Introduce extract_file() helper

2015-10-19 Thread Oleg Nesterov
Turn kvm_setup_guest_init(guestfs_name) into a more generic helper,
extract_file(guestfs_name, filename, data, size) and reimplement
kvm_setup_guest_init() as a trivial wrapper.

Signed-off-by: Oleg Nesterov 
---
 builtin-setup.c | 24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/builtin-setup.c b/builtin-setup.c
index 40fef15..bb7c71c 100644
--- a/builtin-setup.c
+++ b/builtin-setup.c
@@ -122,30 +122,34 @@ static const char *guestfs_symlinks[] = {
 };
 
 #ifdef CONFIG_GUEST_INIT
-extern char _binary_guest_init_start;
-extern char _binary_guest_init_size;
-
-int kvm_setup_guest_init(const char *guestfs_name)
+static int extract_file(const char *guestfs_name, const char *filename,
+   const void *data, const void *_size)
 {
char path[PATH_MAX];
-   size_t size;
int fd, ret;
-   char *data;
 
-   size = (size_t)&_binary_guest_init_size;
-   data = (char *)&_binary_guest_init_start;
-   snprintf(path, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), 
guestfs_name);
+   snprintf(path, PATH_MAX, "%s%s/%s", kvm__get_dir(),
+   guestfs_name, filename);
remove(path);
fd = open(path, O_CREAT | O_WRONLY, 0755);
if (fd < 0)
die("Fail to setup %s", path);
-   ret = xwrite(fd, data, size);
+   ret = xwrite(fd, data, (size_t)_size);
if (ret < 0)
die("Fail to setup %s", path);
close(fd);
 
return 0;
+}
 
+extern char _binary_guest_init_start;
+extern char _binary_guest_init_size;
+
+int kvm_setup_guest_init(const char *guestfs_name)
+{
+   return extract_file(guestfs_name, "virt/init",
+   &_binary_guest_init_start,
+   &_binary_guest_init_size);
 }
 #else
 int kvm_setup_guest_init(const char *guestfs_name)
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] kvmtool: tiny init fox x86_64

2015-10-19 Thread Oleg Nesterov
Hello,

Yesterday I discovered kvmtool and it looks really cool! Thanks!
Looks like, it does exactly what I need.

But the static /virt/init doesn't look good. This series lessens
the size of lkvm binary from 1045952 to 196200 bytes on x86_64,
but this is minor. I want to write my /virt/init in shell or perl
and this change can help.

I don't really know who should be cc'ed, I've picked some names
from git-log guest/init.c.

Oleg.

 .gitignore  |  1 +
 Makefile| 26 +-
 builtin-run.c   |  4 
 builtin-setup.c | 36 ++--
 guest/init.c|  2 ++
 x86/init.S  | 38 ++
 6 files changed, 92 insertions(+), 15 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:
> >On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:
> +nvdimm_init_memory_state(>nvdimm_memory, system_memory, 
> machine,
> + TARGET_PAGE_SIZE);
> +
> >>>
> >>>Shouldn't this be conditional on presence of the nvdimm device?
> >>>
> >>
> >>We will enable hotplug on nvdimm devices in the near future once Linux 
> >>driver is
> >>ready. I'd keep it here for future development.
> >
> >No, I don't think we should add stuff unconditionally. If not nvdimm,
> >some other flag should indicate user intends to hotplug things.
> >
> 
> Actually, it is not unconditionally which is called if parameter "-m aaa, 
> maxmem=bbb"
> (aaa < bbb) is used. It is on the some path of memoy-hotplug initiation.
> 

Right, but that's not the same as nvdimm.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network hangs when communicating with host

2015-10-19 Thread Andre Przywara
Hi Dmitry,

On 19/10/15 10:05, Dmitry Vyukov wrote:
> On Fri, Oct 16, 2015 at 7:25 PM, Sasha Levin  wrote:
>> On 10/15/2015 04:20 PM, Dmitry Vyukov wrote:
>>> Hello,
>>>
>>> I am trying to run a program in lkvm sandbox so that it communicates
>>> with a program on host. I run lkvm as:
>>>
>>> ./lkvm sandbox --disk sandbox-test --mem=2048 --cpus=4 --kernel
>>> /arch/x86/boot/bzImage --network mode=user -- /my_prog
>>>
>>> /my_prog then connects to a program on host over a tcp socket.
>>> I see that host receives some data, sends some data back, but then
>>> my_prog hangs on network read.
>>>
>>> To localize this I wrote 2 programs (attached). ping is run on host
>>> and pong is run from lkvm sandbox. They successfully establish tcp
>>> connection, but after some iterations both hang on read.
>>>
>>> Networking code in Go runtime is there for more than 3 years, widely
>>> used in production and does not have any known bugs. However, it uses
>>> epoll edge-triggered readiness notifications that known to be tricky.
>>> Is it possible that lkvm contains some networking bug? Can it be
>>> related to the data races in lkvm I reported earlier today?

Just to let you know:
I think we have seen networking issues in the past - root over NFS had
issues IIRC. Will spent some time on debugging this and it looked like a
race condition in kvmtool's virtio implementation. I think pinning
kvmtool's virtio threads to one host core made this go away. However
although he tried hard (even by Will's standards!) he couldn't find a
the real root cause or a fix at the time he looked at it and we found
other ways to work around the issues (using virtio-blk or initrd's).

So it's quite possible that there are issues. I haven't had time yet to
look at your sanitizer reports, but it looks like a promising approach
to find the root cause.

Cheers,
Andre.

>>>
>>> I am on commit 3695adeb227813d96d9c41850703fb53a23845eb.
>>
>> Hey Dmitry,
>>
>> How long does it take to reproduce? I've been running ping/pong as you've
>> described and it looks like it doesn't get stuck (read/writes keep going
>> on both sides).
>>
>> Can you share your guest kernel config maybe?
> 
> 
> Humm it my setup it happens within a minute or so.
> 
> My kernel is not completely standard, but it works with qemu without
> any problems.
> It is not trivial to reproduce, but FWIW I on commit
> f9fbf6b72ffaaca8612979116c872c9d5d9cc1f5 of
> https://github.com/dvyukov/linux/commits/coverage branch. Config file
> is attached. Then, I build it with custom gcc: revision 228818 +
> https://codereview.appspot.com/267910043 patch. This is all per
> https://github.com/google/syzkaller instructions.
> 
> I run lkvm as:
> ./lkvm sandbox --disk sandbox-test --mem=2048 --cpus=4 --kernel
> /arch/x86/boot/bzImage --network mode=user -- /pong
> 
> kvmtool is on 3695adeb227813d96d9c41850703fb53a23845eb.
> 
> Just tried to do the same with qemu, it does not hang.
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] kvm: x86: zero EFER on INIT

2015-10-19 Thread Paolo Bonzini
Not zeroing EFER means that a 32-bit firmware cannot enter paging mode
without clearing EFER.LME first (which it should not know about).
Yang Zhang from Intel confirmed that the manual is wrong and EFER is
cleared to zero on INIT.

Fixes: d28bc9dd25ce023270d2e039e7c98d38ecbf7758
Cc: sta...@vger.kernel.org
Cc: Yang Z Zhang 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/svm.c | 11 +--
 arch/x86/kvm/vmx.c |  3 +--
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cd8659cfc632..f2c8e4917688 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1089,7 +1089,7 @@ static u64 svm_compute_tsc_offset(struct kvm_vcpu *vcpu, 
u64 target_tsc)
return target_tsc - tsc;
 }
 
-static void init_vmcb(struct vcpu_svm *svm, bool init_event)
+static void init_vmcb(struct vcpu_svm *svm)
 {
struct vmcb_control_area *control = >vmcb->control;
struct vmcb_save_area *save = >vmcb->save;
@@ -1160,8 +1160,7 @@ static void init_vmcb(struct vcpu_svm *svm, bool 
init_event)
init_sys_seg(>ldtr, SEG_TYPE_LDT);
init_sys_seg(>tr, SEG_TYPE_BUSY_TSS16);
 
-   if (!init_event)
-   svm_set_efer(>vcpu, 0);
+   svm_set_efer(>vcpu, 0);
save->dr6 = 0x0ff0;
kvm_set_rflags(>vcpu, 2);
save->rip = 0xfff0;
@@ -1215,7 +1214,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
init_event)
if (kvm_vcpu_is_reset_bsp(>vcpu))
svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
}
-   init_vmcb(svm, init_event);
+   init_vmcb(svm);
 
kvm_cpuid(vcpu, , , , );
kvm_register_write(vcpu, VCPU_REGS_RDX, eax);
@@ -1271,7 +1270,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, 
unsigned int id)
clear_page(svm->vmcb);
svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
svm->asid_generation = 0;
-   init_vmcb(svm, false);
+   init_vmcb(svm);
 
svm_init_osvw(>vcpu);
 
@@ -1893,7 +1892,7 @@ static int shutdown_interception(struct vcpu_svm *svm)
 * so reinitialize it.
 */
clear_page(svm->vmcb);
-   init_vmcb(svm, false);
+   init_vmcb(svm);
 
kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
return 0;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 15bff51d492a..4d0aa31a42ba 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4932,8 +4932,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool 
init_event)
vmx_set_cr0(vcpu, cr0); /* enter rmode */
vmx->vcpu.arch.cr0 = cr0;
vmx_set_cr4(vcpu, 0);
-   if (!init_event)
-   vmx_set_efer(vcpu, 0);
+   vmx_set_efer(vcpu, 0);
vmx_fpu_activate(vcpu);
update_exception_bitmap(vcpu);
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] KVM/arm: enable enhanced armv7 fp/simd lazy switch

2015-10-19 Thread Christoffer Dall
On Sat, Sep 26, 2015 at 04:43:29PM -0700, Mario Smarduch wrote:
> This patch enhances current lazy vfp/simd hardware switch. In addition to
> current lazy switch, it tracks vfp/simd hardware state with a vcpu 
> lazy flag. 
> 
> vcpu lazy flag is set on guest access and trap to vfp/simd hardware switch 
> handler. On vm-enter if lazy flag is set skip trap enable and saving 
> host fpexc. On vm-exit if flag is set skip hardware context switch
> and return to host with guest context.
> 
> On vcpu_put check if vcpu lazy flag is set, and execute a hardware context 
> switch to restore host.
> 
> Signed-off-by: Mario Smarduch 
> ---
>  arch/arm/include/asm/kvm_asm.h |  1 +
>  arch/arm/kvm/arm.c | 17 
>  arch/arm/kvm/interrupts.S  | 60 
> +++---
>  arch/arm/kvm/interrupts_head.S | 12 ++---
>  4 files changed, 71 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
> index 194c91b..4b45d79 100644
> --- a/arch/arm/include/asm/kvm_asm.h
> +++ b/arch/arm/include/asm/kvm_asm.h
> @@ -97,6 +97,7 @@ extern char __kvm_hyp_code_end[];
>  extern void __kvm_flush_vm_context(void);
>  extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
>  extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
> +extern void __kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu);
>  
>  extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
>  #endif
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index ce404a5..79f49c7 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -105,6 +105,20 @@ void kvm_arch_check_processor_compat(void *rtn)
>   *(int *)rtn = 0;
>  }
>  
> +/**
> + * kvm_switch_fp_regs() - switch guest/host VFP/SIMD registers
> + * @vcpu:pointer to vcpu structure.
> + *

nit: stray blank line

> + */
> +static void kvm_switch_fp_regs(struct kvm_vcpu *vcpu)
> +{
> +#ifdef CONFIG_ARM
> + if (vcpu->arch.vfp_lazy == 1) {
> + kvm_call_hyp(__kvm_restore_host_vfp_state, vcpu);

why do you have to do this in HYP mode ?

> + vcpu->arch.vfp_lazy = 0;
> + }
> +#endif

we've tried to put stuff like this in header files to avoid the ifdefs
so far.  Could that be done here as well?

> +}
>  
>  /**
>   * kvm_arch_init_vm - initializes a VM data structure
> @@ -295,6 +309,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> + /* Check if Guest accessed VFP registers */

misleading comment: this function does more than checking

> + kvm_switch_fp_regs(vcpu);
> +
>   /*
>* The arch-generic KVM code expects the cpu field of a vcpu to be -1
>* if the vcpu is no longer assigned to a cpu.  This is used for the
> diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
> index 900ef6d..6d98232 100644
> --- a/arch/arm/kvm/interrupts.S
> +++ b/arch/arm/kvm/interrupts.S
> @@ -96,6 +96,29 @@ ENTRY(__kvm_flush_vm_context)
>   bx  lr
>  ENDPROC(__kvm_flush_vm_context)
>  
> +/**
> + * void __kvm_restore_host_vfp_state(struct vcpu *vcpu) - Executes a lazy
> + * fp/simd switch, saves the guest, restores host.
> + *

nit: stray blank line

> + */
> +ENTRY(__kvm_restore_host_vfp_state)
> +#ifdef CONFIG_VFPv3
> + push{r3-r7}
> +
> + add r7, r0, #VCPU_VFP_GUEST
> + store_vfp_state r7
> +
> + add r7, r0, #VCPU_VFP_HOST
> + ldr r7, [r7]
> + restore_vfp_state r7
> +
> + ldr r3, [vcpu, #VCPU_VFP_FPEXC]

either use r0 or vcpu throughout this function please

> + VFPFMXR FPEXC, r3
> +
> + pop {r3-r7}
> +#endif
> + bx  lr
> +ENDPROC(__kvm_restore_host_vfp_state)
>  
>  /
>   *  Hypervisor world-switch code
> @@ -119,11 +142,15 @@ ENTRY(__kvm_vcpu_run)
>   @ If the host kernel has not been configured with VFPv3 support,
>   @ then it is safer if we deny guests from using it as well.
>  #ifdef CONFIG_VFPv3
> + @ r7 must be preserved until next vfp lazy check

I don't understand this comment

> + vfp_inlazy_mode r7, skip_save_host_fpexc
> +
>   @ Set FPEXC_EN so the guest doesn't trap floating point instructions
>   VFPFMRX r2, FPEXC   @ VMRS
> - push{r2}
> + str r2, [vcpu, #VCPU_VFP_FPEXC]
>   orr r2, r2, #FPEXC_EN
>   VFPFMXR FPEXC, r2   @ VMSR
> +skip_save_host_fpexc:
>  #endif
>  
>   @ Configure Hyp-role
> @@ -131,7 +158,14 @@ ENTRY(__kvm_vcpu_run)
>  
>   @ Trap coprocessor CRx accesses
>   set_hstr vmentry
> - set_hcptr vmentry, (HCPTR_TTA | HCPTR_TCP(10) | HCPTR_TCP(11))
> + set_hcptr vmentry, (HCPTR_TTA)
> +
> + @ check if vfp_lazy flag set
> + cmp r7, #1

if you meant that you need to preserve r7 down to here, could you
instead just move the VFP logic above down here and do the whole VFP

Re: [PATCH v2 1/2] KVM/arm: add hooks for armv7 fp/simd lazy switch support

2015-10-19 Thread Christoffer Dall
On Sat, Sep 26, 2015 at 04:43:28PM -0700, Mario Smarduch wrote:
> This patch adds vcpu fields to track lazy state, save host FPEXC, and 
> offsets to fields.
> 
> Signed-off-by: Mario Smarduch 
> ---
>  arch/arm/include/asm/kvm_host.h | 6 ++
>  arch/arm/kernel/asm-offsets.c   | 2 ++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index dcba0fa..194a8ef 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -111,6 +111,12 @@ struct kvm_vcpu_arch {
>   /* Interrupt related fields */
>   u32 irq_lines;  /* IRQ and FIQ levels */
>  
> + /* Track fp/simd lazy switch state */
> + u32 vfp_lazy;

so is this a flags field or basically a boolean?  If the latter, what is
does it mean when the field is true vs. false?

> +
> + /* Save host FPEXC register to restore on vcpu put */
> + u32 saved_fpexc;

is this only the host's state?  If so, why not name it host_fpexc?

Thanks,
-Christoffer

> +
>   /* Exception Information */
>   struct kvm_vcpu_fault_info fault;
>  
> diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
> index 871b826..e1c3a41 100644
> --- a/arch/arm/kernel/asm-offsets.c
> +++ b/arch/arm/kernel/asm-offsets.c
> @@ -186,6 +186,8 @@ int main(void)
>DEFINE(VCPU_CPSR,  offsetof(struct kvm_vcpu, 
> arch.regs.usr_regs.ARM_cpsr));
>DEFINE(VCPU_HCR,   offsetof(struct kvm_vcpu, arch.hcr));
>DEFINE(VCPU_IRQ_LINES, offsetof(struct kvm_vcpu, arch.irq_lines));
> +  DEFINE(VCPU_VFP_LAZY,  offsetof(struct kvm_vcpu, 
> arch.vfp_lazy));
> +  DEFINE(VCPU_VFP_FPEXC, offsetof(struct kvm_vcpu, arch.saved_fpexc));
>DEFINE(VCPU_HSR,   offsetof(struct kvm_vcpu, arch.fault.hsr));
>DEFINE(VCPU_HxFAR, offsetof(struct kvm_vcpu, arch.fault.hxfar));
>DEFINE(VCPU_HPFAR, offsetof(struct kvm_vcpu, arch.fault.hpfar));
> -- 
> 1.9.1
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Igor Mammedov
On Mon, 19 Oct 2015 09:56:12 +0300
"Michael S. Tsirkin"  wrote:

> On Sun, Oct 11, 2015 at 11:52:54AM +0800, Xiao Guangrong wrote:
[...]
> > diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> > index f6bd2c4..aa95961 100644
> > --- a/include/hw/mem/nvdimm.h
> > +++ b/include/hw/mem/nvdimm.h
> > @@ -15,6 +15,10 @@
> >  
> >  #include "hw/mem/dimm.h"
> >  
> > +/* Memory region 0xFF0 ~ 0xFFF0 is reserved for NVDIMM
> > ACPI. */ +#define NVDIMM_ACPI_MEM_BASE   0xFF00ULL
Michael,

If it's ok to map control RAM region directly from QEMU at arbitrary
location let's do the same for VMGENID too (i.e. use v16
implementation which does exactly the same thing as this series).

> > +#define NVDIMM_ACPI_MEM_SIZE   0xF0ULL
[...]

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network hangs when communicating with host

2015-10-19 Thread Dmitry Vyukov
On Mon, Oct 19, 2015 at 11:22 AM, Andre Przywara  wrote:
> Hi Dmitry,
>
> On 19/10/15 10:05, Dmitry Vyukov wrote:
>> On Fri, Oct 16, 2015 at 7:25 PM, Sasha Levin  wrote:
>>> On 10/15/2015 04:20 PM, Dmitry Vyukov wrote:
 Hello,

 I am trying to run a program in lkvm sandbox so that it communicates
 with a program on host. I run lkvm as:

 ./lkvm sandbox --disk sandbox-test --mem=2048 --cpus=4 --kernel
 /arch/x86/boot/bzImage --network mode=user -- /my_prog

 /my_prog then connects to a program on host over a tcp socket.
 I see that host receives some data, sends some data back, but then
 my_prog hangs on network read.

 To localize this I wrote 2 programs (attached). ping is run on host
 and pong is run from lkvm sandbox. They successfully establish tcp
 connection, but after some iterations both hang on read.

 Networking code in Go runtime is there for more than 3 years, widely
 used in production and does not have any known bugs. However, it uses
 epoll edge-triggered readiness notifications that known to be tricky.
 Is it possible that lkvm contains some networking bug? Can it be
 related to the data races in lkvm I reported earlier today?
>
> Just to let you know:
> I think we have seen networking issues in the past - root over NFS had
> issues IIRC. Will spent some time on debugging this and it looked like a
> race condition in kvmtool's virtio implementation. I think pinning
> kvmtool's virtio threads to one host core made this go away. However
> although he tried hard (even by Will's standards!) he couldn't find a
> the real root cause or a fix at the time he looked at it and we found
> other ways to work around the issues (using virtio-blk or initrd's).
>
> So it's quite possible that there are issues. I haven't had time yet to
> look at your sanitizer reports, but it looks like a promising approach
> to find the root cause.


Thanks, Andre!

ping/pong does not hang within at least 5 minutes when I run lkvm
under taskset 1.

And, yeah, this pretty strongly suggests a data race. ThreadSanitizer
can point you to the bug within a minute, so you just need to say
"aha! it is here". Or maybe not. There are no guarantees. But if you
already spent significant time on this, then checking the reports
definitely looks like a good idea.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: x86: MMU: Initialize force_pt_level before calling mapping_level()

2015-10-19 Thread Paolo Bonzini


On 19/10/2015 08:13, Takuya Yoshikawa wrote:
> Commit fd1369021878 ("KVM: x86: MMU: Move mapping_level_dirty_bitmap()
> call in mapping_level()") forgot to initialize force_pt_level to false
> in FNAME(page_fault)() before calling mapping_level() like
> nonpaging_map() does.  This can sometimes result in forcing page table
> level mapping unnecessarily.
> 
> Fix this and move the first *force_pt_level check in mapping_level()
> before kvm_vcpu_gfn_to_memslot() call to make it a bit clearer that
> the variable must be initialized before mapping_level() gets called.
> 
> This change can also avoid calling kvm_vcpu_gfn_to_memslot() when
> !check_hugepage_cache_consistency() check in tdp_page_fault() forces
> page table level mapping.
> 
> Signed-off-by: Takuya Yoshikawa 
> ---
>  arch/x86/kvm/mmu.c | 7 ---
>  arch/x86/kvm/paging_tmpl.h | 2 +-
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index dd2a7c6..7d85bca 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -886,10 +886,11 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t 
> large_gfn,
>   int host_level, level, max_level;
>   struct kvm_memory_slot *slot;
>  
> - slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
> + if (unlikely(*force_pt_level))
> + return PT_PAGE_TABLE_LEVEL;
>  
> - if (likely(!*force_pt_level))
> - *force_pt_level = !memslot_valid_for_gpte(slot, true);
> + slot = kvm_vcpu_gfn_to_memslot(vcpu, large_gfn);
> + *force_pt_level = !memslot_valid_for_gpte(slot, true);
>   if (unlikely(*force_pt_level))
>   return PT_PAGE_TABLE_LEVEL;
>  
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index bf39d0f..b41faa9 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -698,7 +698,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
> addr, u32 error_code,
>   int r;
>   pfn_t pfn;
>   int level = PT_PAGE_TABLE_LEVEL;
> - bool force_pt_level;
> + bool force_pt_level = false;
>   unsigned long mmu_seq;
>   bool map_writable, is_self_change_mapping;
>  
> 

Looks good, thanks.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: x86: zero EFER on INIT

2015-10-19 Thread Paolo Bonzini
Not zeroing EFER means that a 32-bit firmware cannot enter paging mode
without clearing EFER.LME first (which it should not know about).
Yang Zhang from Intel confirmed that the manual is wrong and EFER is
cleared to zero on INIT.

Fixes: d28bc9dd25ce023270d2e039e7c98d38ecbf7758
Cc: sta...@vger.kernel.org
Cc: Yang Z Zhang 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/kvm/svm.c | 7 +++
 arch/x86/kvm/vmx.c | 3 +--
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index cd8659cfc632..1336d68fa965 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1089,7 +1089,7 @@ static u64 svm_compute_tsc_offset(struct kvm_vcpu *vcpu, 
u64 target_tsc)
return target_tsc - tsc;
 }
 
-static void init_vmcb(struct vcpu_svm *svm, bool init_event)
+static void init_vmcb(struct vcpu_svm *svm)
 {
struct vmcb_control_area *control = >vmcb->control;
struct vmcb_save_area *save = >vmcb->save;
@@ -1160,8 +1160,7 @@ static void init_vmcb(struct vcpu_svm *svm, bool 
init_event)
init_sys_seg(>ldtr, SEG_TYPE_LDT);
init_sys_seg(>tr, SEG_TYPE_BUSY_TSS16);
 
-   if (!init_event)
-   svm_set_efer(>vcpu, 0);
+   svm_set_efer(>vcpu, 0);
save->dr6 = 0x0ff0;
kvm_set_rflags(>vcpu, 2);
save->rip = 0xfff0;
@@ -1215,7 +1214,7 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool 
init_event)
if (kvm_vcpu_is_reset_bsp(>vcpu))
svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
}
-   init_vmcb(svm, init_event);
+   init_vmcb(svm);
 
kvm_cpuid(vcpu, , , , );
kvm_register_write(vcpu, VCPU_REGS_RDX, eax);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 15bff51d492a..4d0aa31a42ba 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4932,8 +4932,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool 
init_event)
vmx_set_cr0(vcpu, cr0); /* enter rmode */
vmx->vcpu.arch.cr0 = cr0;
vmx_set_cr4(vcpu, 0);
-   if (!init_event)
-   vmx_set_efer(vcpu, 0);
+   vmx_set_efer(vcpu, 0);
vmx_fpu_activate(vcpu);
update_exception_bitmap(vcpu);
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 06:01:17PM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/19/2015 05:46 PM, Igor Mammedov wrote:
> >On Mon, 19 Oct 2015 12:17:22 +0300
> >"Michael S. Tsirkin"  wrote:
> >
> >>On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:
> >>>
> >>>
> >>>On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:
> On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:
> >>>+nvdimm_init_memory_state(>nvdimm_memory,
> >>>system_memory, machine,
> >>>+ TARGET_PAGE_SIZE);
> >>>+
> >>
> >>Shouldn't this be conditional on presence of the nvdimm device?
> >>
> >
> >We will enable hotplug on nvdimm devices in the near future once
> >Linux driver is ready. I'd keep it here for future development.
> 
> No, I don't think we should add stuff unconditionally. If not
> nvdimm, some other flag should indicate user intends to hotplug
> things.
> 
> >>>
> >>>Actually, it is not unconditionally which is called if parameter
> >>>"-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path of
> >>>memoy-hotplug initiation.
> >>>
> >>
> >>Right, but that's not the same as nvdimm.
> >>
> >
> >it could be pc-machine property, then it could be turned on like this:
> >  -machine nvdimm_support=on
> 
> Er, I do not understand why this separate switch is needed and why nvdimm
> and pc-dimm is different. :(
> 
> NVDIMM reuses memory-hotplug's framework, such as maxmem, slot, and dimm
> device, even some of ACPI logic to do hotplug things, etc. Both nvdimm
> and pc-dimm are built on the same infrastructure.
> 
> 
> 
> 

It does seem to add a bunch of devices in ACPI and memory regions in
memory space.
Whatever your device is, it's generally safe to assume that most
people don't need it.

-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Igor Mammedov
On Mon, 19 Oct 2015 12:17:22 +0300
"Michael S. Tsirkin"  wrote:

> On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:
> > 
> > 
> > On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:
> > >On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:
> > +nvdimm_init_memory_state(>nvdimm_memory,
> > system_memory, machine,
> > + TARGET_PAGE_SIZE);
> > +
> > >>>
> > >>>Shouldn't this be conditional on presence of the nvdimm device?
> > >>>
> > >>
> > >>We will enable hotplug on nvdimm devices in the near future once
> > >>Linux driver is ready. I'd keep it here for future development.
> > >
> > >No, I don't think we should add stuff unconditionally. If not
> > >nvdimm, some other flag should indicate user intends to hotplug
> > >things.
> > >
> > 
> > Actually, it is not unconditionally which is called if parameter
> > "-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path of
> > memoy-hotplug initiation.
> > 
> 
> Right, but that's not the same as nvdimm.
> 

it could be pc-machine property, then it could be turned on like this:
 -machine nvdimm_support=on
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 05:46 PM, Igor Mammedov wrote:

On Mon, 19 Oct 2015 12:17:22 +0300
"Michael S. Tsirkin"  wrote:


On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:



On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:

On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:

+nvdimm_init_memory_state(>nvdimm_memory,
system_memory, machine,
+ TARGET_PAGE_SIZE);
+


Shouldn't this be conditional on presence of the nvdimm device?



We will enable hotplug on nvdimm devices in the near future once
Linux driver is ready. I'd keep it here for future development.


No, I don't think we should add stuff unconditionally. If not
nvdimm, some other flag should indicate user intends to hotplug
things.



Actually, it is not unconditionally which is called if parameter
"-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path of
memoy-hotplug initiation.



Right, but that's not the same as nvdimm.



it could be pc-machine property, then it could be turned on like this:
  -machine nvdimm_support=on


Er, I do not understand why this separate switch is needed and why nvdimm
and pc-dimm is different. :(

NVDIMM reuses memory-hotplug's framework, such as maxmem, slot, and dimm
device, even some of ACPI logic to do hotplug things, etc. Both nvdimm
and pc-dimm are built on the same infrastructure.





--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs

2015-10-19 Thread Jan Beulich
The symbol was missing a KVM dependency.

Signed-off-by: Jan Beulich 
---
 virt/kvm/Kconfig |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 4.3-rc6/virt/kvm/Kconfig
+++ 4.3-rc6-KVM_COMPAT-dependency/virt/kvm/Kconfig
@@ -46,4 +46,4 @@ config KVM_GENERIC_DIRTYLOG_READ_PROTECT
 
 config KVM_COMPAT
def_bool y
-   depends on COMPAT && !S390
+   depends on KVM && COMPAT && !S390



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Igor Mammedov
On Mon, 19 Oct 2015 18:01:17 +0800
Xiao Guangrong  wrote:

> 
> 
> On 10/19/2015 05:46 PM, Igor Mammedov wrote:
> > On Mon, 19 Oct 2015 12:17:22 +0300
> > "Michael S. Tsirkin"  wrote:
> >
> >> On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:
> >>>
> >>>
> >>> On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:
>  On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:
> >>> +nvdimm_init_memory_state(>nvdimm_memory,
> >>> system_memory, machine,
> >>> + TARGET_PAGE_SIZE);
> >>> +
> >>
> >> Shouldn't this be conditional on presence of the nvdimm device?
> >>
> >
> > We will enable hotplug on nvdimm devices in the near future once
> > Linux driver is ready. I'd keep it here for future development.
> 
>  No, I don't think we should add stuff unconditionally. If not
>  nvdimm, some other flag should indicate user intends to hotplug
>  things.
> 
> >>>
> >>> Actually, it is not unconditionally which is called if parameter
> >>> "-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path
> >>> of memoy-hotplug initiation.
> >>>
> >>
> >> Right, but that's not the same as nvdimm.
> >>
> >
> > it could be pc-machine property, then it could be turned on like
> > this: -machine nvdimm_support=on
> 
> Er, I do not understand why this separate switch is needed and why
> nvdimm and pc-dimm is different. :(
> 
> NVDIMM reuses memory-hotplug's framework, such as maxmem, slot, and
> dimm device, even some of ACPI logic to do hotplug things, etc. Both
> nvdimm and pc-dimm are built on the same infrastructure.
NVDIMM support consumes precious low RAM  and MMIO resources and
not small amount at that. So turning it on unconditionally with
memory hotplug even if NVDIMM wouldn't ever be used isn't nice.

However that concern could be dropped if instead of allocating it's
own control MMIO/RAM regions, NVDIMM would reuse memory hotplug's MMIO
region and replace RAM region with serializing/marshaling label data
over the same MMIO interface (yes, it's slower but it's not
performance critical path).  

> 
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sanitizing kvmtool

2015-10-19 Thread Dmitry Vyukov
On Sat, Oct 17, 2015 at 4:16 PM, Sasha Levin  wrote:
> On 10/15/2015 06:21 AM, Dmitry Vyukov wrote:
>> Hello,
>>
>> I've run a set of sanitizers on
>> git://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git commit
>> 3695adeb227813d96d9c41850703fb53a23845eb. I've just booted a VM and
>> shut it down.
>>
>> AddressSanitizer detected a heap-use-after-free:
>>
>> AddressSanitizer: heap-use-after-free on address 0x6040df90 at pc
>> 0x004f46d0 bp 0x7ffc79def2d0 sp 0x7ffc79def2c8
>> READ of size 8 at 0x6040df90 thread T0
>> #0 0x4f46cf in kvm__pause kvm.c:436:7
>> #1 0x4f0d5d in ioport__unregister ioport.c:129:2
>> #2 0x4efb2f in serial8250__exit hw/serial.c:446:7
>> #3 0x516204 in init_list__exit util/init.c:59:8
>> #4 0x4ea956 in kvm_cmd_run_exit builtin-run.c:645:2
>> #5 0x4ea956 in kvm_cmd_run builtin-run.c:661
>> #6 0x51596f in handle_command kvm-cmd.c:84:8
>> #7 0x7fa398101ec4 in __libc_start_main
>> /build/buildd/eglibc-2.19/csu/libc-start.c:287
>> #8 0x41a505 in _start (lkvm+0x41a505)
>>
>> 0x6040df90 is located 0 bytes inside of 40-byte region
>> [0x6040df90,0x6040dfb8)
>> freed by thread T0 here:
>> #0 0x4b75a0 in free
>> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cc:30
>> #1 0x4f29e6 in kvm_cpu__exit kvm-cpu.c:263:2
>> #2 0x5160c4 in init_list__exit util/init.c:59:8
>>
>> previously allocated by thread T0 here:
>> #0 0x4b7a2c in calloc
>> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/asan/asan_malloc_linux.cc:56
>> #1 0x4f2491 in kvm_cpu__init kvm-cpu.c:221:14
>
> I'm sending a patch to fix this + another issue it uncovered. This was caused 
> by
> the kvm cpu exit function to be marked as late call rather than core call, so 
> it
> would free the vcpus before anything else had a chance to exit.
>
>>
>> ThreadSanitizer detected a whole bunch of data races, for example:
>>
>> WARNING: ThreadSanitizer: data race (pid=109228)
>>   Write of size 1 at 0x7d6c0001f384 by thread T55:
>> #0 memcpy 
>> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:608
>> (lkvm+0x0044b28b)
>> #1 virtio_pci__msix_mmio_callback virtio/pci.c:269:3 
>> (lkvm+0x004b3ee0)
>> #2 kvm__emulate_mmio mmio.c:131:3 (lkvm+0x004ac332)
>> #3 kvm_cpu__emulate_mmio x86/include/kvm/kvm-cpu-arch.h:46:9
>> (lkvm+0x004aa8c6)
>> #4 kvm_cpu__start kvm-cpu.c:147 (lkvm+0x004aa8c6)
>> #5 kvm_cpu_thread builtin-run.c:174:6 (lkvm+0x004a6e3e)
>>
>>   Previous read of size 4 at 0x7d6c0001f384 by thread T58:
>> #0 virtio_pci__signal_vq virtio/pci.c:290:29 (lkvm+0x004b36b6)
>> #1 virtio_net_tx_thread virtio/net.c:210:4 (lkvm+0x004b1fb5)
>>
>>   Location is heap block of size 1648 at 0x7d6c0001f100 allocated by
>> main thread:
>> #0 calloc 
>> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:544
>> (lkvm+0x0043e812)
>> #1 virtio_init virtio/core.c:191:12 (lkvm+0x004afa48)
>> #2 virtio_net__init_one virtio/net.c:846:2 (lkvm+0x004b095d)
>> #3 virtio_net__init virtio/net.c:868:3 (lkvm+0x004b0296)
>> #4 init_list__init util/init.c:40:8 (lkvm+0x004bc7ee)
>> #5 kvm_cmd_run_init builtin-run.c:621:6 (lkvm+0x004a6799)
>> #6 kvm_cmd_run builtin-run.c:656 (lkvm+0x004a6799)
>> #7 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
>> #8 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
>> #9 main main.c:18 (lkvm+0x004ac0b4)
>>
>>   Thread T55 'kvm-vcpu-2' (tid=109285, running) created by main thread at:
>> #0 pthread_create
>> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:848
>> (lkvm+0x004478a3)
>> #1 kvm_cmd_run_work builtin-run.c:633:7 (lkvm+0x004a683f)
>> #2 kvm_cmd_run builtin-run.c:660 (lkvm+0x004a683f)
>> #3 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
>> #4 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
>> #5 main main.c:18 (lkvm+0x004ac0b4)
>>
>>   Thread T58 'virtio-net-tx' (tid=109334, running) created by thread T53 at:
>> #0 pthread_create
>> /usr/local/google/home/dvyukov/src/llvm/build/../projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:848
>> (lkvm+0x004478a3)
>> #1 init_vq virtio/net.c:526:4 (lkvm+0x004b1523)
>> #2 virtio_pci__io_out virtio/pci.c:219:3 (lkvm+0x004b484c)
>> #3 kvm__emulate_io ioport.c:196:11 (lkvm+0x004aa0f8)
>> #4 virtio_pci__io_mmio_callback virtio/pci.c:340:2 (lkvm+0x004b3e55)
>> #5 kvm__emulate_mmio mmio.c:131:3 (lkvm+0x004ac332)
>> #6 kvm_cpu__emulate_mmio x86/include/kvm/kvm-cpu-arch.h:46:9
>> (lkvm+0x004aa8c6)
>> #7 kvm_cpu__start kvm-cpu.c:147 (lkvm+0x004aa8c6)
>> #8 kvm_cpu_thread builtin-run.c:174:6 

Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 11:18:36AM +0200, Igor Mammedov wrote:
> On Mon, 19 Oct 2015 09:56:12 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Sun, Oct 11, 2015 at 11:52:54AM +0800, Xiao Guangrong wrote:
> [...]
> > > diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> > > index f6bd2c4..aa95961 100644
> > > --- a/include/hw/mem/nvdimm.h
> > > +++ b/include/hw/mem/nvdimm.h
> > > @@ -15,6 +15,10 @@
> > >  
> > >  #include "hw/mem/dimm.h"
> > >  
> > > +/* Memory region 0xFF0 ~ 0xFFF0 is reserved for NVDIMM
> > > ACPI. */ +#define NVDIMM_ACPI_MEM_BASE   0xFF00ULL
> Michael,
> 
> If it's ok to map control RAM region directly from QEMU at arbitrary
> location

It's a fair question. Where is it reserved? In which spec?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs

2015-10-19 Thread Paolo Bonzini


On 19/10/2015 12:37, Jan Beulich wrote:
> The symbol was missing a KVM dependency.
> 
> Signed-off-by: Jan Beulich 
> ---
>  virt/kvm/Kconfig |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- 4.3-rc6/virt/kvm/Kconfig
> +++ 4.3-rc6-KVM_COMPAT-dependency/virt/kvm/Kconfig
> @@ -46,4 +46,4 @@ config KVM_GENERIC_DIRTYLOG_READ_PROTECT
>  
>  config KVM_COMPAT
> def_bool y
> -   depends on COMPAT && !S390
> +   depends on KVM && COMPAT && !S390
> 
> 
> 

Thanks, rebased and pushed to kvm/queue.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] VFIO platform reset module rework

2015-10-19 Thread Arnd Bergmann
On Sunday 18 October 2015 18:00:11 Eric Auger wrote:
> This series fixes the current implementation by getting rid of the
> usage of __symbol_get which caused a compilation issue with
> CONFIG_MODULES disabled. On top of this, the usage of MODULE_ALIAS makes
> possible to add a new reset module without being obliged to update the
> framework. The new implementation relies on the reset module registering
> its reset function to the vfio-platform driver.
> 
> The series is available at
> 
> https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.3-rc5-rework-xgbe-v2
> 
> 

Looks good for the most part, just a few comments on coding style
for patches 1 and 2.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] vfio: platform: add capability to register a reset function

2015-10-19 Thread Arnd Bergmann
On Sunday 18 October 2015 18:00:12 Eric Auger wrote:
> 
> +struct list_head reset_list;
> +LIST_HEAD(reset_list);
> 

These two should be marked 'static'.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] arm/arm64: KVM: Fix arch timer behavior for disabled interrupts

2015-10-19 Thread Eric Auger
Hi Christoffer,
On 10/17/2015 10:30 PM, Christoffer Dall wrote:
> We have an interesting issue when the guest disables the timer interrupt
> on the VGIC, which happens when turning VCPUs off using PSCI, for
> example.
> 
> The problem is that because the guest disables the virtual interrupt at
> the VGIC level, we never inject interrupts to the guest and therefore
> never mark the interrupt as active on the physical distributor.  The
> host also never takes the timer interrupt (we only use the timer device
> to trigger a guest exit and everything else is done in software), so the
> interrupt does not become active through normal means.
> 
> The result is that we keep entering the guest with a programmed timer
> that will always fire as soon as we context switch the hardware timer
> state and run the guest, preventing forward progress for the VCPU.
> 
> Since the active state on the physical distributor is really part of the
> timer logic, it is the job of our virtual arch timer driver to manage
> this state.
> 
> The timer->map->active boolean field indicates whether we have signalled
> this interrupt to the vgic and if that interrupt is still pending or
> active.  As long as that is the case, the hardware doesn't have to
> generate physical interrupts and therefore we mark the interrupt as
> active on the physical distributor.
> 
> Cc: Marc Zyngier 
> Reported-by: Lorenzo Pieralisi 
> Signed-off-by: Christoffer Dall 
> ---
>  virt/kvm/arm/arch_timer.c | 19 +++
>  virt/kvm/arm/vgic.c   | 43 +++
>  2 files changed, 30 insertions(+), 32 deletions(-)
> 
> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> index 48c6e1a..b9d3a32 100644
> --- a/virt/kvm/arm/arch_timer.c
> +++ b/virt/kvm/arm/arch_timer.c
> @@ -137,6 +137,8 @@ bool kvm_timer_should_fire(struct kvm_vcpu *vcpu)
>  void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
>  {
>   struct arch_timer_cpu *timer = >arch.timer_cpu;
> + bool phys_active;
> + int ret;
>  
>   /*
>* We're about to run this vcpu again, so there is no need to
> @@ -151,6 +153,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
>*/
>   if (kvm_timer_should_fire(vcpu))
>   kvm_timer_inject_irq(vcpu);
> +
> + /*
> +  * We keep track of whether the edge-triggered interrupt has been
> +  * signalled to the vgic/guest, and if so, we mask the interrupt and
> +  * the physical distributor to prevent the timer from raising a
> +  * physical interrupt whenever we run a guest, preventing forward
> +  * VCPU progress.
In practice don't you simply mark the IRQ as active at GIC physical
distributor level, hence preventing the same IRQ from hitting again
> +  */
> + if (kvm_vgic_get_phys_irq_active(timer->map))
> + phys_active = true;
> + else
> + phys_active = false;
> +
> + ret = irq_set_irqchip_state(timer->map->irq,
> + IRQCHIP_STATE_ACTIVE,
> + phys_active);

physical distributor state is set in arch timer flush. It relates to a
shared device behavior so I find it natural to do it there.

However the map->active is set in arch_timer IRQ injection and unset in
vgic sync. Why not doing the set in kvm_vgic_inject_mapped_irq?

> + WARN_ON(ret);
>  }
>  
>  /**
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 596455a..ea21bc2 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -1092,6 +1092,15 @@ static void vgic_retire_lr(int lr_nr, int irq, struct 
> kvm_vcpu *vcpu)
>   struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
>   struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
>  
> + /*
> +  * We must transfer the pending state back to the distributor before
> +  * retiring the LR, otherwise we may loose edge-triggered interrupts.
> +  */
> + if (vlr.state & LR_STATE_PENDING) {
> + vgic_dist_irq_set_pending(vcpu, irq);
> + vlr.hwirq = 0;
> + }
That fix applies to any edge-sensitive IRQ, ie. not especially the
timer's one? In the positive shouldn't you precise this in the commit
msg too?

Best Regards

Eric

> +
>   vlr.state = 0;
>   vgic_set_lr(vcpu, lr_nr, vlr);
>   clear_bit(lr_nr, vgic_cpu->lr_used);
> @@ -1241,7 +1250,7 @@ static void __kvm_vgic_flush_hwstate(struct kvm_vcpu 
> *vcpu)
>   struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
>   struct vgic_dist *dist = >kvm->arch.vgic;
>   unsigned long *pa_percpu, *pa_shared;
> - int i, vcpu_id, lr, ret;
> + int i, vcpu_id;
>   int overflow = 0;
>   int nr_shared = vgic_nr_shared_irqs(dist);
>  
> @@ -1296,31 +1305,6 @@ epilog:
>*/
>   clear_bit(vcpu_id, dist->irq_pending_on_cpu);
>   }
> -
> - for (lr = 0; lr < vgic->nr_lr; lr++) {
> - struct 

Re: [PATCH 1/3] arm/arm64: KVM: Fix arch timer behavior for disabled interrupts

2015-10-19 Thread Christoffer Dall
On Mon, Oct 19, 2015 at 03:27:42PM +0200, Eric Auger wrote:
> On 10/19/2015 03:14 PM, Christoffer Dall wrote:
> > On Mon, Oct 19, 2015 at 03:07:16PM +0200, Eric Auger wrote:
> >> Hi Christoffer,
> >> On 10/17/2015 10:30 PM, Christoffer Dall wrote:
> >>> We have an interesting issue when the guest disables the timer interrupt
> >>> on the VGIC, which happens when turning VCPUs off using PSCI, for
> >>> example.
> >>>
> >>> The problem is that because the guest disables the virtual interrupt at
> >>> the VGIC level, we never inject interrupts to the guest and therefore
> >>> never mark the interrupt as active on the physical distributor.  The
> >>> host also never takes the timer interrupt (we only use the timer device
> >>> to trigger a guest exit and everything else is done in software), so the
> >>> interrupt does not become active through normal means.
> >>>
> >>> The result is that we keep entering the guest with a programmed timer
> >>> that will always fire as soon as we context switch the hardware timer
> >>> state and run the guest, preventing forward progress for the VCPU.
> >>>
> >>> Since the active state on the physical distributor is really part of the
> >>> timer logic, it is the job of our virtual arch timer driver to manage
> >>> this state.
> >>>
> >>> The timer->map->active boolean field indicates whether we have signalled
> >>> this interrupt to the vgic and if that interrupt is still pending or
> >>> active.  As long as that is the case, the hardware doesn't have to
> >>> generate physical interrupts and therefore we mark the interrupt as
> >>> active on the physical distributor.
> >>>
> >>> Cc: Marc Zyngier 
> >>> Reported-by: Lorenzo Pieralisi 
> >>> Signed-off-by: Christoffer Dall 
> >>> ---
> >>>  virt/kvm/arm/arch_timer.c | 19 +++
> >>>  virt/kvm/arm/vgic.c   | 43 
> >>> +++
> >>>  2 files changed, 30 insertions(+), 32 deletions(-)
> >>>
> >>> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> >>> index 48c6e1a..b9d3a32 100644
> >>> --- a/virt/kvm/arm/arch_timer.c
> >>> +++ b/virt/kvm/arm/arch_timer.c
> >>> @@ -137,6 +137,8 @@ bool kvm_timer_should_fire(struct kvm_vcpu *vcpu)
> >>>  void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
> >>>  {
> >>>   struct arch_timer_cpu *timer = >arch.timer_cpu;
> >>> + bool phys_active;
> >>> + int ret;
> >>>  
> >>>   /*
> >>>* We're about to run this vcpu again, so there is no need to
> >>> @@ -151,6 +153,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
> >>>*/
> >>>   if (kvm_timer_should_fire(vcpu))
> >>>   kvm_timer_inject_irq(vcpu);
> >>> +
> >>> + /*
> >>> +  * We keep track of whether the edge-triggered interrupt has been
> >>> +  * signalled to the vgic/guest, and if so, we mask the interrupt and
> >>> +  * the physical distributor to prevent the timer from raising a
> >>> +  * physical interrupt whenever we run a guest, preventing forward
> >>> +  * VCPU progress.
> >> In practice don't you simply mark the IRQ as active at GIC physical
> >> distributor level, hence preventing the same IRQ from hitting again
> > 
> > yes, that's what I meant with my comment, I should reword to "...we mark
> > the interrupt as active on the physical distributor..."
> > 
> >>> +  */
> >>> + if (kvm_vgic_get_phys_irq_active(timer->map))
> >>> + phys_active = true;
> >>> + else
> >>> + phys_active = false;
> >>> +
> >>> + ret = irq_set_irqchip_state(timer->map->irq,
> >>> + IRQCHIP_STATE_ACTIVE,
> >>> + phys_active);
> >>
> >> physical distributor state is set in arch timer flush. It relates to a
> >> shared device behavior so I find it natural to do it there.
> >>
> >> However the map->active is set in arch_timer IRQ injection and unset in
> >> vgic sync. Why not doing the set in kvm_vgic_inject_mapped_irq?
> > 
> > Because you have to set it at every entry to the guest if you run
> > multiple VCPUs/VMs on this CPU or migrate this VCPU to a different CPU.
> I meant   kvm_vgic_set_phys_irq_active(timer->map, true) call in
> kvm_timer_inject_irq? Couldn' that been done in
> kvm_vgic_inject_mapped_irq instead. Doesn't this apply to all mapped IRQs?
> 
I could, but that would be outside the scope of this patch, which aims
to fix the behavior in 4.3 simply.  Unless I'm missing some sort of
change in functionality there?

The map->active stuff all goes away in 4.4 when we change to
level-triggered semantics anyway.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-19 Thread Eric Auger
On 10/19/2015 03:25 PM, Arnd Bergmann wrote:
> On Monday 19 October 2015 15:17:30 Eric Auger wrote:
>> Hi Arnd,
>> On 10/19/2015 03:04 PM, Arnd Bergmann wrote:
>>> On Sunday 18 October 2015 18:00:13 Eric Auger wrote:
 diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
 b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
 index 619dc7d..4f76b17 100644
 --- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
 +++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
 @@ -29,8 +29,7 @@
  #define DRIVER_VERSION  "0.1"
  #define DRIVER_AUTHOR   "Eric Auger "
  #define DRIVER_DESC "Reset support for Calxeda xgmac vfio platform 
 device"
 -
 -#define CALXEDAXGMAC_COMPAT "calxeda,hb-xgmac"
 +#define COMPAT"calxeda,hb-xgmac"
>>>
>>> Why the rename?
>> This define was not used. Since the code is meant to be duplicated from
>> one reset module to the other I thought it did not bring any have a
>> specialized name
> 
> I'd say it would be clearer to remove the macro then, and pass the
> string literal in the function call.
OK
> 
  /* XGMAC Register definitions */
  #define XGMAC_CONTROL   0x  /* MAC Configuration */
 @@ -80,6 +79,43 @@ int vfio_platform_calxedaxgmac_reset(struct 
 vfio_platform_device *vdev)
  }
  EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset);
  
 +static int __init vfio_platform_calxedaxgmac_init(void)
 +{
 +  int (*register_reset)(struct module *, char*,
 +  vfio_platform_reset_fn_t);
 +  int ret;
 +
 +  register_reset = symbol_get(vfio_platform_register_reset);
 +  if (!register_reset)
 +  return -EINVAL;
>>>
>>> I don't understand what you do the symbol_get() here for.
>>> Why not just call that function directly
>> the function is implemented in a separate module. This is just to make
>> sure the vfio-platform module is loaded, in case the end-user attempts
>> to load the reset module without having this latter loaded.
> 
> The module loader does all this for you.
Ah OK

I will respin shortly taking into account all your comments

Thanks for your time!

Best Regards

Eric


> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sanitizing kvmtool

2015-10-19 Thread Sasha Levin
On 10/19/2015 04:37 AM, Dmitry Vyukov wrote:
>> So in this case (and most of the other data race cases described in the full 
>> log) it
>> > seems like ThreadSanitizer is mixing with different accesses by the guest 
>> > to one underlying
>> > block of memory on the host.
>> >
>> > Here, for example, T55 accesses the msix block of the virtio-net PCI 
>> > device, and T58 is accessing
>> > the virtqueue exposed by that device. While they both get to the same 
>> > block of memory inside
> I don't understand this.
> Do you mean that this is a false positive? Or it is a real issue in lkvm?

I suspect it's a false positive because ThreadSanitizer sees the memory as one 
big
block, but the logic that makes sure we don't race here is within the guest vm, 
which
ThreadSanitizer doesn't see.


Thanks,
Sasha

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network hangs when communicating with host

2015-10-19 Thread Sasha Levin
On 10/19/2015 05:28 AM, Dmitry Vyukov wrote:
> On Mon, Oct 19, 2015 at 11:22 AM, Andre Przywara  
> wrote:
>> Hi Dmitry,
>>
>> On 19/10/15 10:05, Dmitry Vyukov wrote:
>>> On Fri, Oct 16, 2015 at 7:25 PM, Sasha Levin  wrote:
 On 10/15/2015 04:20 PM, Dmitry Vyukov wrote:
> Hello,
>
> I am trying to run a program in lkvm sandbox so that it communicates
> with a program on host. I run lkvm as:
>
> ./lkvm sandbox --disk sandbox-test --mem=2048 --cpus=4 --kernel
> /arch/x86/boot/bzImage --network mode=user -- /my_prog
>
> /my_prog then connects to a program on host over a tcp socket.
> I see that host receives some data, sends some data back, but then
> my_prog hangs on network read.
>
> To localize this I wrote 2 programs (attached). ping is run on host
> and pong is run from lkvm sandbox. They successfully establish tcp
> connection, but after some iterations both hang on read.
>
> Networking code in Go runtime is there for more than 3 years, widely
> used in production and does not have any known bugs. However, it uses
> epoll edge-triggered readiness notifications that known to be tricky.
> Is it possible that lkvm contains some networking bug? Can it be
> related to the data races in lkvm I reported earlier today?
>>
>> Just to let you know:
>> I think we have seen networking issues in the past - root over NFS had
>> issues IIRC. Will spent some time on debugging this and it looked like a
>> race condition in kvmtool's virtio implementation. I think pinning
>> kvmtool's virtio threads to one host core made this go away. However
>> although he tried hard (even by Will's standards!) he couldn't find a
>> the real root cause or a fix at the time he looked at it and we found
>> other ways to work around the issues (using virtio-blk or initrd's).
>>
>> So it's quite possible that there are issues. I haven't had time yet to
>> look at your sanitizer reports, but it looks like a promising approach
>> to find the root cause.
> 
> 
> Thanks, Andre!
> 
> ping/pong does not hang within at least 5 minutes when I run lkvm
> under taskset 1.
> 
> And, yeah, this pretty strongly suggests a data race. ThreadSanitizer
> can point you to the bug within a minute, so you just need to say
> "aha! it is here". Or maybe not. There are no guarantees. But if you
> already spent significant time on this, then checking the reports
> definitely looks like a good idea.

Okay, that's good to know.

I have a few busy days, but I'll definitely try to clear up these reports
as they seem to be pointing to real issues.


Thanks,
Sasha

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-19 Thread Arnd Bergmann
On Sunday 18 October 2015 18:00:13 Eric Auger wrote:
> diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
> b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> index 619dc7d..4f76b17 100644
> --- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> +++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> @@ -29,8 +29,7 @@
>  #define DRIVER_VERSION  "0.1"
>  #define DRIVER_AUTHOR   "Eric Auger "
>  #define DRIVER_DESC "Reset support for Calxeda xgmac vfio platform 
> device"
> -
> -#define CALXEDAXGMAC_COMPAT "calxeda,hb-xgmac"
> +#define COMPAT   "calxeda,hb-xgmac"

Why the rename?

>  /* XGMAC Register definitions */
>  #define XGMAC_CONTROL   0x  /* MAC Configuration */
> @@ -80,6 +79,43 @@ int vfio_platform_calxedaxgmac_reset(struct 
> vfio_platform_device *vdev)
>  }
>  EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset);
>  
> +static int __init vfio_platform_calxedaxgmac_init(void)
> +{
> + int (*register_reset)(struct module *, char*,
> + vfio_platform_reset_fn_t);
> + int ret;
> +
> + register_reset = symbol_get(vfio_platform_register_reset);
> + if (!register_reset)
> + return -EINVAL;

I don't understand what you do the symbol_get() here for.
Why not just call that function directly?

> + ret = register_reset(THIS_MODULE, COMPAT,
> + vfio_platform_calxedaxgmac_reset);

This is whitespace damaged, the second line needs to be indented
to the opening braces.

I would also suggest defining register_reset as a macro that
implicitly passes the THIS_MODULE argument, as other subsystems
do.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] arm/arm64: KVM: Fix arch timer behavior for disabled interrupts

2015-10-19 Thread Christoffer Dall
On Mon, Oct 19, 2015 at 03:07:16PM +0200, Eric Auger wrote:
> Hi Christoffer,
> On 10/17/2015 10:30 PM, Christoffer Dall wrote:
> > We have an interesting issue when the guest disables the timer interrupt
> > on the VGIC, which happens when turning VCPUs off using PSCI, for
> > example.
> > 
> > The problem is that because the guest disables the virtual interrupt at
> > the VGIC level, we never inject interrupts to the guest and therefore
> > never mark the interrupt as active on the physical distributor.  The
> > host also never takes the timer interrupt (we only use the timer device
> > to trigger a guest exit and everything else is done in software), so the
> > interrupt does not become active through normal means.
> > 
> > The result is that we keep entering the guest with a programmed timer
> > that will always fire as soon as we context switch the hardware timer
> > state and run the guest, preventing forward progress for the VCPU.
> > 
> > Since the active state on the physical distributor is really part of the
> > timer logic, it is the job of our virtual arch timer driver to manage
> > this state.
> > 
> > The timer->map->active boolean field indicates whether we have signalled
> > this interrupt to the vgic and if that interrupt is still pending or
> > active.  As long as that is the case, the hardware doesn't have to
> > generate physical interrupts and therefore we mark the interrupt as
> > active on the physical distributor.
> > 
> > Cc: Marc Zyngier 
> > Reported-by: Lorenzo Pieralisi 
> > Signed-off-by: Christoffer Dall 
> > ---
> >  virt/kvm/arm/arch_timer.c | 19 +++
> >  virt/kvm/arm/vgic.c   | 43 +++
> >  2 files changed, 30 insertions(+), 32 deletions(-)
> > 
> > diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
> > index 48c6e1a..b9d3a32 100644
> > --- a/virt/kvm/arm/arch_timer.c
> > +++ b/virt/kvm/arm/arch_timer.c
> > @@ -137,6 +137,8 @@ bool kvm_timer_should_fire(struct kvm_vcpu *vcpu)
> >  void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
> >  {
> > struct arch_timer_cpu *timer = >arch.timer_cpu;
> > +   bool phys_active;
> > +   int ret;
> >  
> > /*
> >  * We're about to run this vcpu again, so there is no need to
> > @@ -151,6 +153,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
> >  */
> > if (kvm_timer_should_fire(vcpu))
> > kvm_timer_inject_irq(vcpu);
> > +
> > +   /*
> > +* We keep track of whether the edge-triggered interrupt has been
> > +* signalled to the vgic/guest, and if so, we mask the interrupt and
> > +* the physical distributor to prevent the timer from raising a
> > +* physical interrupt whenever we run a guest, preventing forward
> > +* VCPU progress.
> In practice don't you simply mark the IRQ as active at GIC physical
> distributor level, hence preventing the same IRQ from hitting again

yes, that's what I meant with my comment, I should reword to "...we mark
the interrupt as active on the physical distributor..."

> > +*/
> > +   if (kvm_vgic_get_phys_irq_active(timer->map))
> > +   phys_active = true;
> > +   else
> > +   phys_active = false;
> > +
> > +   ret = irq_set_irqchip_state(timer->map->irq,
> > +   IRQCHIP_STATE_ACTIVE,
> > +   phys_active);
> 
> physical distributor state is set in arch timer flush. It relates to a
> shared device behavior so I find it natural to do it there.
> 
> However the map->active is set in arch_timer IRQ injection and unset in
> vgic sync. Why not doing the set in kvm_vgic_inject_mapped_irq?

Because you have to set it at every entry to the guest if you run
multiple VCPUs/VMs on this CPU or migrate this VCPU to a different CPU.

> 
> > +   WARN_ON(ret);
> >  }
> >  
> >  /**
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index 596455a..ea21bc2 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -1092,6 +1092,15 @@ static void vgic_retire_lr(int lr_nr, int irq, 
> > struct kvm_vcpu *vcpu)
> > struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
> > struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
> >  
> > +   /*
> > +* We must transfer the pending state back to the distributor before
> > +* retiring the LR, otherwise we may loose edge-triggered interrupts.
> > +*/
> > +   if (vlr.state & LR_STATE_PENDING) {
> > +   vgic_dist_irq_set_pending(vcpu, irq);
> > +   vlr.hwirq = 0;
> > +   }
> That fix applies to any edge-sensitive IRQ, ie. not especially the
> timer's one? In the positive shouldn't you precise this in the commit
> msg too?
> 

Probably, it could also be a separate patch.  I'll rework this.

Thanks,
-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH 2/4] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-19 Thread Eric Auger
Hi Arnd,
On 10/19/2015 03:04 PM, Arnd Bergmann wrote:
> On Sunday 18 October 2015 18:00:13 Eric Auger wrote:
>> diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
>> b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
>> index 619dc7d..4f76b17 100644
>> --- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
>> +++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
>> @@ -29,8 +29,7 @@
>>  #define DRIVER_VERSION  "0.1"
>>  #define DRIVER_AUTHOR   "Eric Auger "
>>  #define DRIVER_DESC "Reset support for Calxeda xgmac vfio platform 
>> device"
>> -
>> -#define CALXEDAXGMAC_COMPAT "calxeda,hb-xgmac"
>> +#define COMPAT  "calxeda,hb-xgmac"
> 
> Why the rename?
This define was not used. Since the code is meant to be duplicated from
one reset module to the other I thought it did not bring any have a
specialized name
> 
>>  /* XGMAC Register definitions */
>>  #define XGMAC_CONTROL   0x  /* MAC Configuration */
>> @@ -80,6 +79,43 @@ int vfio_platform_calxedaxgmac_reset(struct 
>> vfio_platform_device *vdev)
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset);
>>  
>> +static int __init vfio_platform_calxedaxgmac_init(void)
>> +{
>> +int (*register_reset)(struct module *, char*,
>> +vfio_platform_reset_fn_t);
>> +int ret;
>> +
>> +register_reset = symbol_get(vfio_platform_register_reset);
>> +if (!register_reset)
>> +return -EINVAL;
> 
> I don't understand what you do the symbol_get() here for.
> Why not just call that function directly
the function is implemented in a separate module. This is just to make
sure the vfio-platform module is loaded, in case the end-user attempts
to load the reset module without having this latter loaded.
> 
>> +ret = register_reset(THIS_MODULE, COMPAT,
>> +vfio_platform_calxedaxgmac_reset);
> 
> This is whitespace damaged, the second line needs to be indented
> to the opening braces.
ok
> 
> I would also suggest defining register_reset as a macro that
> implicitly passes the THIS_MODULE argument, as other subsystems
> do.
ok

Thanks

Best Regards

Eric
> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] arm/arm64: KVM: Fix arch timer behavior for disabled interrupts

2015-10-19 Thread Eric Auger
On 10/19/2015 03:14 PM, Christoffer Dall wrote:
> On Mon, Oct 19, 2015 at 03:07:16PM +0200, Eric Auger wrote:
>> Hi Christoffer,
>> On 10/17/2015 10:30 PM, Christoffer Dall wrote:
>>> We have an interesting issue when the guest disables the timer interrupt
>>> on the VGIC, which happens when turning VCPUs off using PSCI, for
>>> example.
>>>
>>> The problem is that because the guest disables the virtual interrupt at
>>> the VGIC level, we never inject interrupts to the guest and therefore
>>> never mark the interrupt as active on the physical distributor.  The
>>> host also never takes the timer interrupt (we only use the timer device
>>> to trigger a guest exit and everything else is done in software), so the
>>> interrupt does not become active through normal means.
>>>
>>> The result is that we keep entering the guest with a programmed timer
>>> that will always fire as soon as we context switch the hardware timer
>>> state and run the guest, preventing forward progress for the VCPU.
>>>
>>> Since the active state on the physical distributor is really part of the
>>> timer logic, it is the job of our virtual arch timer driver to manage
>>> this state.
>>>
>>> The timer->map->active boolean field indicates whether we have signalled
>>> this interrupt to the vgic and if that interrupt is still pending or
>>> active.  As long as that is the case, the hardware doesn't have to
>>> generate physical interrupts and therefore we mark the interrupt as
>>> active on the physical distributor.
>>>
>>> Cc: Marc Zyngier 
>>> Reported-by: Lorenzo Pieralisi 
>>> Signed-off-by: Christoffer Dall 
>>> ---
>>>  virt/kvm/arm/arch_timer.c | 19 +++
>>>  virt/kvm/arm/vgic.c   | 43 +++
>>>  2 files changed, 30 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
>>> index 48c6e1a..b9d3a32 100644
>>> --- a/virt/kvm/arm/arch_timer.c
>>> +++ b/virt/kvm/arm/arch_timer.c
>>> @@ -137,6 +137,8 @@ bool kvm_timer_should_fire(struct kvm_vcpu *vcpu)
>>>  void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
>>>  {
>>> struct arch_timer_cpu *timer = >arch.timer_cpu;
>>> +   bool phys_active;
>>> +   int ret;
>>>  
>>> /*
>>>  * We're about to run this vcpu again, so there is no need to
>>> @@ -151,6 +153,23 @@ void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
>>>  */
>>> if (kvm_timer_should_fire(vcpu))
>>> kvm_timer_inject_irq(vcpu);
>>> +
>>> +   /*
>>> +* We keep track of whether the edge-triggered interrupt has been
>>> +* signalled to the vgic/guest, and if so, we mask the interrupt and
>>> +* the physical distributor to prevent the timer from raising a
>>> +* physical interrupt whenever we run a guest, preventing forward
>>> +* VCPU progress.
>> In practice don't you simply mark the IRQ as active at GIC physical
>> distributor level, hence preventing the same IRQ from hitting again
> 
> yes, that's what I meant with my comment, I should reword to "...we mark
> the interrupt as active on the physical distributor..."
> 
>>> +*/
>>> +   if (kvm_vgic_get_phys_irq_active(timer->map))
>>> +   phys_active = true;
>>> +   else
>>> +   phys_active = false;
>>> +
>>> +   ret = irq_set_irqchip_state(timer->map->irq,
>>> +   IRQCHIP_STATE_ACTIVE,
>>> +   phys_active);
>>
>> physical distributor state is set in arch timer flush. It relates to a
>> shared device behavior so I find it natural to do it there.
>>
>> However the map->active is set in arch_timer IRQ injection and unset in
>> vgic sync. Why not doing the set in kvm_vgic_inject_mapped_irq?
> 
> Because you have to set it at every entry to the guest if you run
> multiple VCPUs/VMs on this CPU or migrate this VCPU to a different CPU.
I meant kvm_vgic_set_phys_irq_active(timer->map, true) call in
kvm_timer_inject_irq? Couldn' that been done in
kvm_vgic_inject_mapped_irq instead. Doesn't this apply to all mapped IRQs?

Eric
> 
>>
>>> +   WARN_ON(ret);
>>>  }
>>>  
>>>  /**
>>> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
>>> index 596455a..ea21bc2 100644
>>> --- a/virt/kvm/arm/vgic.c
>>> +++ b/virt/kvm/arm/vgic.c
>>> @@ -1092,6 +1092,15 @@ static void vgic_retire_lr(int lr_nr, int irq, 
>>> struct kvm_vcpu *vcpu)
>>> struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
>>> struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
>>>  
>>> +   /*
>>> +* We must transfer the pending state back to the distributor before
>>> +* retiring the LR, otherwise we may loose edge-triggered interrupts.
>>> +*/
>>> +   if (vlr.state & LR_STATE_PENDING) {
>>> +   vgic_dist_irq_set_pending(vcpu, irq);
>>> +   vlr.hwirq = 0;
>>> +   }
>> That fix applies to any edge-sensitive IRQ, ie. not especially the
>> timer's one? In the positive shouldn't you precise 

Re: [PATCH 2/4] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-19 Thread Arnd Bergmann
On Monday 19 October 2015 15:17:30 Eric Auger wrote:
> Hi Arnd,
> On 10/19/2015 03:04 PM, Arnd Bergmann wrote:
> > On Sunday 18 October 2015 18:00:13 Eric Auger wrote:
> >> diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
> >> b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> >> index 619dc7d..4f76b17 100644
> >> --- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> >> +++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
> >> @@ -29,8 +29,7 @@
> >>  #define DRIVER_VERSION  "0.1"
> >>  #define DRIVER_AUTHOR   "Eric Auger "
> >>  #define DRIVER_DESC "Reset support for Calxeda xgmac vfio platform 
> >> device"
> >> -
> >> -#define CALXEDAXGMAC_COMPAT "calxeda,hb-xgmac"
> >> +#define COMPAT"calxeda,hb-xgmac"
> > 
> > Why the rename?
> This define was not used. Since the code is meant to be duplicated from
> one reset module to the other I thought it did not bring any have a
> specialized name

I'd say it would be clearer to remove the macro then, and pass the
string literal in the function call.

> >>  /* XGMAC Register definitions */
> >>  #define XGMAC_CONTROL   0x  /* MAC Configuration */
> >> @@ -80,6 +79,43 @@ int vfio_platform_calxedaxgmac_reset(struct 
> >> vfio_platform_device *vdev)
> >>  }
> >>  EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset);
> >>  
> >> +static int __init vfio_platform_calxedaxgmac_init(void)
> >> +{
> >> +  int (*register_reset)(struct module *, char*,
> >> +  vfio_platform_reset_fn_t);
> >> +  int ret;
> >> +
> >> +  register_reset = symbol_get(vfio_platform_register_reset);
> >> +  if (!register_reset)
> >> +  return -EINVAL;
> > 
> > I don't understand what you do the symbol_get() here for.
> > Why not just call that function directly
> the function is implemented in a separate module. This is just to make
> sure the vfio-platform module is loaded, in case the end-user attempts
> to load the reset module without having this latter loaded.

The module loader does all this for you.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvmtool/term: unexport term_set_tty, term_init, term_exit

2015-10-19 Thread Oleg Nesterov
According to git grep they can be static.

term_got_escape can be static too, and we can even move it into
term_getc().

"int term_escape_char" doesn't make sense at least until we allow
to redefine it, turn it into preprocessor constant.

Signed-off-by: Oleg Nesterov 
---
 include/kvm/term.h |  3 ---
 term.c | 15 ---
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/include/kvm/term.h b/include/kvm/term.h
index dc9882e..b9793f0 100644
--- a/include/kvm/term.h
+++ b/include/kvm/term.h
@@ -18,9 +18,6 @@ int term_putc(char *addr, int cnt, int term);
 int term_getc(struct kvm *kvm, int term);
 
 bool term_readable(int term);
-void term_set_tty(int term);
-int term_init(struct kvm *kvm);
-int term_exit(struct kvm *kvm);
 int tty_parser(const struct option *opt, const char *arg, int unset);
 
 #endif /* KVM__TERM_H */
diff --git a/term.c b/term.c
index 9763211..aebd174 100644
--- a/term.c
+++ b/term.c
@@ -19,15 +19,16 @@
 
 static struct termios  orig_term;
 
-int term_escape_char   = 0x01; /* ctrl-a is used for escape */
-bool term_got_escape   = false;
-
-int term_fds[TERM_MAX_DEVS][2];
+static int term_fds[TERM_MAX_DEVS][2];
 
 static pthread_t term_poll_thread;
 
+/* ctrl-a is used for escape */
+#define term_escape_char   0x01
+
 int term_getc(struct kvm *kvm, int term)
 {
+   static bool term_got_escape = false;
unsigned char c;
 
if (read_in_full(term_fds[term][TERM_FD_IN], , 1) < 0)
@@ -137,7 +138,7 @@ static void term_sig_cleanup(int sig)
raise(sig);
 }
 
-void term_set_tty(int term)
+static void term_set_tty(int term)
 {
struct termios orig_term;
int master, slave;
@@ -167,7 +168,7 @@ int tty_parser(const struct option *opt, const char *arg, 
int unset)
return 0;
 }
 
-int term_init(struct kvm *kvm)
+static int term_init(struct kvm *kvm)
 {
struct termios term;
int i, r;
@@ -204,7 +205,7 @@ int term_init(struct kvm *kvm)
 }
 dev_init(term_init);
 
-int term_exit(struct kvm *kvm)
+static int term_exit(struct kvm *kvm)
 {
return 0;
 }
-- 
2.4.3


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sanitizing kvmtool

2015-10-19 Thread Dmitry Vyukov
On Mon, Oct 19, 2015 at 4:35 PM, Sasha Levin  wrote:
> On 10/19/2015 10:24 AM, Dmitry Vyukov wrote:
>> On Mon, Oct 19, 2015 at 4:19 PM, Sasha Levin  wrote:
>>> > On 10/19/2015 04:37 AM, Dmitry Vyukov wrote:
> >>> So in this case (and most of the other data race cases described in 
> >>> the full log) it
>> >>> > seems like ThreadSanitizer is mixing with different accesses by 
>> >>> > the guest to one underlying
>> >>> > block of memory on the host.
>> >>> >
>> >>> > Here, for example, T55 accesses the msix block of the virtio-net 
>> >>> > PCI device, and T58 is accessing
>> >>> > the virtqueue exposed by that device. While they both get to the 
>> >>> > same block of memory inside
 >> I don't understand this.
 >> Do you mean that this is a false positive? Or it is a real issue in 
 >> lkvm?
>>> >
>>> > I suspect it's a false positive because ThreadSanitizer sees the memory 
>>> > as one big
>>> > block, but the logic that makes sure we don't race here is within the 
>>> > guest vm, which
>>> > ThreadSanitizer doesn't see.
>>
>> But isn't the task of a hypervisor to sandbox the guest OS and to not
>> trust it in any way, shape or form? What if the guest VM intentionally
>> omits the synchronization? Can it exploit the host?
>
> Right, the memory areas that are accessed both by the hypervisor and the guest
> should be treated as untrusted input, but the hypervisor is supposed to 
> validate
> the input carefully before using it - so I'm not sure how data races would
> introduce anything new that we didn't catch during validation.


One possibility would be: if result of a racy read is passed to guest,
that can leak arbitrary host data into guest. Does not sound good.
Also, without usage of proper atomic operations, it is basically
impossible to verify untrusted data, as it can be changing under your
feet. And storing data into a local variable does not prevent the data
from changing.

There are also some data races with free(), it does not looks good at all.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kvm: x86: zero EFER on INIT

2015-10-19 Thread Radim Krčmář
2015-10-19 11:35+0200, Paolo Bonzini:
> Not zeroing EFER means that a 32-bit firmware cannot enter paging mode
> without clearing EFER.LME first (which it should not know about).
> Yang Zhang from Intel confirmed that the manual is wrong and EFER is
> cleared to zero on INIT.

Also in APM v2, "Table 14-1. Initial Processor State.".

> 
> Fixes: d28bc9dd25ce023270d2e039e7c98d38ecbf7758
> Cc: sta...@vger.kernel.org
> Cc: Yang Z Zhang 
> Signed-off-by: Paolo Bonzini 
> ---

Reviewed-by: Radim Krčmář 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sanitizing kvmtool

2015-10-19 Thread Sasha Levin
On 10/19/2015 10:47 AM, Dmitry Vyukov wrote:
>> Right, the memory areas that are accessed both by the hypervisor and the 
>> guest
>> > should be treated as untrusted input, but the hypervisor is supposed to 
>> > validate
>> > the input carefully before using it - so I'm not sure how data races would
>> > introduce anything new that we didn't catch during validation.
> 
> One possibility would be: if result of a racy read is passed to guest,
> that can leak arbitrary host data into guest. Does not sound good.
> Also, without usage of proper atomic operations, it is basically
> impossible to verify untrusted data, as it can be changing under your
> feet. And storing data into a local variable does not prevent the data
> from changing.

What's missing here is that the guest doesn't directly read/write the memory:
every time it accesses a memory that is shared with the host it will trigger
an exit, which will stop the vcpu thread that made the access and kernel side
kvm will pass the hypervisor the value the guest wrote (or the memory address
it attempted to read). The value/address can't change under us in that scenario.

> There are also some data races with free(), it does not looks good at all.

I definitely agree there are quite a few real bugs there :)


Thanks,
Sasha

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sanitizing kvmtool

2015-10-19 Thread Dmitry Vyukov
On Mon, Oct 19, 2015 at 4:19 PM, Sasha Levin  wrote:
> On 10/19/2015 04:37 AM, Dmitry Vyukov wrote:
>>> So in this case (and most of the other data race cases described in the 
>>> full log) it
>>> > seems like ThreadSanitizer is mixing with different accesses by the guest 
>>> > to one underlying
>>> > block of memory on the host.
>>> >
>>> > Here, for example, T55 accesses the msix block of the virtio-net PCI 
>>> > device, and T58 is accessing
>>> > the virtqueue exposed by that device. While they both get to the same 
>>> > block of memory inside
>> I don't understand this.
>> Do you mean that this is a false positive? Or it is a real issue in lkvm?
>
> I suspect it's a false positive because ThreadSanitizer sees the memory as 
> one big
> block, but the logic that makes sure we don't race here is within the guest 
> vm, which
> ThreadSanitizer doesn't see.


But isn't the task of a hypervisor to sandbox the guest OS and to not
trust it in any way, shape or form? What if the guest VM intentionally
omits the synchronization? Can it exploit the host?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sanitizing kvmtool

2015-10-19 Thread Sasha Levin
On 10/19/2015 10:24 AM, Dmitry Vyukov wrote:
> On Mon, Oct 19, 2015 at 4:19 PM, Sasha Levin  wrote:
>> > On 10/19/2015 04:37 AM, Dmitry Vyukov wrote:
 >>> So in this case (and most of the other data race cases described in 
 >>> the full log) it
> >>> > seems like ThreadSanitizer is mixing with different accesses by the 
> >>> > guest to one underlying
> >>> > block of memory on the host.
> >>> >
> >>> > Here, for example, T55 accesses the msix block of the virtio-net 
> >>> > PCI device, and T58 is accessing
> >>> > the virtqueue exposed by that device. While they both get to the 
> >>> > same block of memory inside
>>> >> I don't understand this.
>>> >> Do you mean that this is a false positive? Or it is a real issue in lkvm?
>> >
>> > I suspect it's a false positive because ThreadSanitizer sees the memory as 
>> > one big
>> > block, but the logic that makes sure we don't race here is within the 
>> > guest vm, which
>> > ThreadSanitizer doesn't see.
> 
> But isn't the task of a hypervisor to sandbox the guest OS and to not
> trust it in any way, shape or form? What if the guest VM intentionally
> omits the synchronization? Can it exploit the host?

Right, the memory areas that are accessed both by the hypervisor and the guest
should be treated as untrusted input, but the hypervisor is supposed to validate
the input carefully before using it - so I'm not sure how data races would
introduce anything new that we didn't catch during validation.


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sanitizing kvmtool

2015-10-19 Thread Dmitry Vyukov
On Mon, Oct 19, 2015 at 5:08 PM, Sasha Levin  wrote:
> On 10/19/2015 10:47 AM, Dmitry Vyukov wrote:
>>> Right, the memory areas that are accessed both by the hypervisor and the 
>>> guest
>>> > should be treated as untrusted input, but the hypervisor is supposed to 
>>> > validate
>>> > the input carefully before using it - so I'm not sure how data races would
>>> > introduce anything new that we didn't catch during validation.
>>
>> One possibility would be: if result of a racy read is passed to guest,
>> that can leak arbitrary host data into guest. Does not sound good.
>> Also, without usage of proper atomic operations, it is basically
>> impossible to verify untrusted data, as it can be changing under your
>> feet. And storing data into a local variable does not prevent the data
>> from changing.
>
> What's missing here is that the guest doesn't directly read/write the memory:
> every time it accesses a memory that is shared with the host it will trigger
> an exit, which will stop the vcpu thread that made the access and kernel side
> kvm will pass the hypervisor the value the guest wrote (or the memory address
> it attempted to read). The value/address can't change under us in that 
> scenario.

But still: if result of a racy read is passed to guest, that can leak
arbitrary host data into guest.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvmtool/run: append cfg.kernel_cmdline at the end of real_cmdline

2015-10-19 Thread Oleg Nesterov
This allows the user to always override the paramaters set by lkvm.
Say, currently 'lkvm run -p ro' doesn't work.

To keep the current logic we need to change strstr("root=") to check
cfg.kernel_cmdline, not real_cmdline. And perhaps we can even add a
simple helper add_param(name, val) to make this all more consistent;
it should only append "name=val" to real_cmdline if cfg.kernel_cmdline
doesn't include this paramater.

Signed-off-by: Oleg Nesterov 
---
 builtin-run.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/builtin-run.c b/builtin-run.c
index 3da00da..ab1fc2a 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -566,12 +566,6 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
memset(real_cmdline, 0, sizeof(real_cmdline));
kvm__arch_set_cmdline(real_cmdline, kvm->cfg.vnc || kvm->cfg.sdl || 
kvm->cfg.gtk);
 
-   if (strlen(real_cmdline) > 0)
-   strcat(real_cmdline, " ");
-
-   if (kvm->cfg.kernel_cmdline)
-   strlcat(real_cmdline, kvm->cfg.kernel_cmdline, 
sizeof(real_cmdline));
-
if (!kvm->cfg.guest_name) {
if (kvm->cfg.custom_rootfs) {
kvm->cfg.guest_name = kvm->cfg.custom_rootfs_name;
@@ -611,10 +605,15 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
if (kvm_setup_guest_init(kvm->cfg.custom_rootfs_name))
die("Failed to setup init for guest.");
}
-   } else if (!strstr(real_cmdline, "root=")) {
+   } else if (!strstr(kvm->cfg.kernel_cmdline, "root=")) {
strlcat(real_cmdline, " root=/dev/vda rw ", 
sizeof(real_cmdline));
}
 
+   if (kvm->cfg.kernel_cmdline) {
+   strcat(real_cmdline, " ");
+   strlcat(real_cmdline, kvm->cfg.kernel_cmdline, 
sizeof(real_cmdline));
+   }
+
kvm->cfg.real_cmdline = real_cmdline;
 
printf("  # %s run -k %s -m %Lu -c %d --name %s\n", KVM_BINARY_NAME,
-- 
2.4.3


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] arm/arm64: KVM: Clear map->active on pend/active clear

2015-10-19 Thread Christoffer Dall
On Mon, Oct 19, 2015 at 05:32:42PM +0200, Eric Auger wrote:
> Hi,
> On 10/17/2015 10:30 PM, Christoffer Dall wrote:
> > When a guest reboots or offlines/onlines CPUs, it is not uncommon for it
> > to clear the pending and active states of an interrupt through the
> > emulated VGIC distributor.  However, since we emulate an edge-triggered
> > based on a level-triggered device,
> I do not get this sentence.

Fair enough.  The guest expects a level-triggered timer, so it expects
to be able to do:

  disable interrupt, set timer(*), clear pending state, enable interrupt

and this only works because the device is level-triggered, so it should
hit again after clearing the pending state and the guest should see it
after enabling interrupts.

but because we have virtual edge-triggered behavior, we loose the
interrupt generated at (*) unless we inject it again, which we will
after this patch, because the actual device is level triggered, so we'll
sample its state in the KVM timer code contiuously and now  (after this
patch) notice that we have to inject the edge-triggered IRQ again.

Yes, I know, we're lucky we can make it work like this.

> 
> Besides that
> Reviewed-by: Eric Auger 

Thanks!

-Christoffer

>  the guest expects the timer interrupt
> > to hit even after clearing the pending state.
> > 
> > We currently do not signal the VGIC when the map->active field is true,
> > because it indicates that the guest has already been signalled of the
> > interrupt as required.  Normally this field is set to false when the
> > guest deactivates the virtual interrupt through the sync path.
> > 
> > We also need to catch the case where the guest deactivates the interrupt
> > through the emulated distributor, again allowing guests to boot even if
> > the original virtual timer signal hit before the guest's GIC
> > initialization sequence is run.
> > 
> > Signed-off-by: Christoffer Dall 
> > ---
> >  virt/kvm/arm/vgic.c | 32 +++-
> >  1 file changed, 31 insertions(+), 1 deletion(-)
> > 
> > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> > index ea21bc2..58b1256 100644
> > --- a/virt/kvm/arm/vgic.c
> > +++ b/virt/kvm/arm/vgic.c
> > @@ -531,6 +531,34 @@ bool vgic_handle_set_pending_reg(struct kvm *kvm,
> > return false;
> >  }
> >  
> > +/*
> > + * If a mapped interrupt's state has been modified by the guest such that 
> > it
> > + * is no longer active or pending, without it have gone through the sync 
> > path,
> > + * then the map->active field must be cleared so the interrupt can be taken
> > + * again.
> > + */
> > +static void vgic_handle_clear_mapped_irq(struct kvm_vcpu *vcpu)
> > +{
> > +   struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
> > +   struct list_head *root;
> > +   struct irq_phys_map_entry *entry;
> > +   struct irq_phys_map *map;
> > +
> > +   rcu_read_lock();
> > +
> > +   /* Check for PPIs */
> > +   root = _cpu->irq_phys_map_list;
> > +   list_for_each_entry_rcu(entry, root, entry) {
> > +   map = >map;
> > +
> > +   if (!vgic_dist_irq_is_pending(vcpu, map->virt_irq) &&
> > +   !vgic_irq_is_active(vcpu, map->virt_irq))
> > +   map->active = false;
> > +   }
> > +
> > +   rcu_read_unlock();
> > +}
> > +
> >  bool vgic_handle_clear_pending_reg(struct kvm *kvm,
> >struct kvm_exit_mmio *mmio,
> >phys_addr_t offset, int vcpu_id)
> > @@ -561,6 +589,7 @@ bool vgic_handle_clear_pending_reg(struct kvm *kvm,
> >   vcpu_id, offset);
> > vgic_reg_access(mmio, reg, offset, mode);
> >  
> > +   vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
> > vgic_update_state(kvm);
> > return true;
> > }
> > @@ -598,6 +627,7 @@ bool vgic_handle_clear_active_reg(struct kvm *kvm,
> > ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT);
> >  
> > if (mmio->is_write) {
> > +   vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
> > vgic_update_state(kvm);
> > return true;
> > }
> > @@ -1406,7 +1436,7 @@ static int vgic_sync_hwirq(struct kvm_vcpu *vcpu, 
> > struct vgic_lr vlr)
> > return 0;
> >  
> > map = vgic_irq_map_search(vcpu, vlr.irq);
> > -   BUG_ON(!map || !map->active);
> > +   BUG_ON(!map);
> >  
> > ret = irq_get_irqchip_state(map->irq,
> > IRQCHIP_STATE_ACTIVE,
> > 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] arm/arm64: KVM: Clear map->active on pend/active clear

2015-10-19 Thread Eric Auger
On 10/19/2015 05:39 PM, Christoffer Dall wrote:
> On Mon, Oct 19, 2015 at 05:32:42PM +0200, Eric Auger wrote:
>> Hi,
>> On 10/17/2015 10:30 PM, Christoffer Dall wrote:
>>> When a guest reboots or offlines/onlines CPUs, it is not uncommon for it
>>> to clear the pending and active states of an interrupt through the
>>> emulated VGIC distributor.  However, since we emulate an edge-triggered
>>> based on a level-triggered device,
>> I do not get this sentence.
> 
> Fair enough.  The guest expects a level-triggered timer, so it expects
> to be able to do:
> 
>   disable interrupt, set timer(*), clear pending state, enable interrupt
> 
> and this only works because the device is level-triggered, so it should
> hit again after clearing the pending state and the guest should see it
> after enabling interrupts.
> 
> but because we have virtual edge-triggered behavior, we loose the
> interrupt generated at (*) unless we inject it again, which we will
> after this patch, because the actual device is level triggered, so we'll
> sample its state in the KVM timer code contiuously and now  (after this
> patch) notice that we have to inject the edge-triggered IRQ again.

That helps ;-)

Eric
> 
> Yes, I know, we're lucky we can make it work like this.
> 
>>
>> Besides that
>> Reviewed-by: Eric Auger 
> 
> Thanks!
> 
> -Christoffer
> 
>>  the guest expects the timer interrupt
>>> to hit even after clearing the pending state.
>>>
>>> We currently do not signal the VGIC when the map->active field is true,
>>> because it indicates that the guest has already been signalled of the
>>> interrupt as required.  Normally this field is set to false when the
>>> guest deactivates the virtual interrupt through the sync path.
>>>
>>> We also need to catch the case where the guest deactivates the interrupt
>>> through the emulated distributor, again allowing guests to boot even if
>>> the original virtual timer signal hit before the guest's GIC
>>> initialization sequence is run.
>>>
>>> Signed-off-by: Christoffer Dall 
>>> ---
>>>  virt/kvm/arm/vgic.c | 32 +++-
>>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
>>> index ea21bc2..58b1256 100644
>>> --- a/virt/kvm/arm/vgic.c
>>> +++ b/virt/kvm/arm/vgic.c
>>> @@ -531,6 +531,34 @@ bool vgic_handle_set_pending_reg(struct kvm *kvm,
>>> return false;
>>>  }
>>>  
>>> +/*
>>> + * If a mapped interrupt's state has been modified by the guest such that 
>>> it
>>> + * is no longer active or pending, without it have gone through the sync 
>>> path,
>>> + * then the map->active field must be cleared so the interrupt can be taken
>>> + * again.
>>> + */
>>> +static void vgic_handle_clear_mapped_irq(struct kvm_vcpu *vcpu)
>>> +{
>>> +   struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
>>> +   struct list_head *root;
>>> +   struct irq_phys_map_entry *entry;
>>> +   struct irq_phys_map *map;
>>> +
>>> +   rcu_read_lock();
>>> +
>>> +   /* Check for PPIs */
>>> +   root = _cpu->irq_phys_map_list;
>>> +   list_for_each_entry_rcu(entry, root, entry) {
>>> +   map = >map;
>>> +
>>> +   if (!vgic_dist_irq_is_pending(vcpu, map->virt_irq) &&
>>> +   !vgic_irq_is_active(vcpu, map->virt_irq))
>>> +   map->active = false;
>>> +   }
>>> +
>>> +   rcu_read_unlock();
>>> +}
>>> +
>>>  bool vgic_handle_clear_pending_reg(struct kvm *kvm,
>>>struct kvm_exit_mmio *mmio,
>>>phys_addr_t offset, int vcpu_id)
>>> @@ -561,6 +589,7 @@ bool vgic_handle_clear_pending_reg(struct kvm *kvm,
>>>   vcpu_id, offset);
>>> vgic_reg_access(mmio, reg, offset, mode);
>>>  
>>> +   vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
>>> vgic_update_state(kvm);
>>> return true;
>>> }
>>> @@ -598,6 +627,7 @@ bool vgic_handle_clear_active_reg(struct kvm *kvm,
>>> ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT);
>>>  
>>> if (mmio->is_write) {
>>> +   vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
>>> vgic_update_state(kvm);
>>> return true;
>>> }
>>> @@ -1406,7 +1436,7 @@ static int vgic_sync_hwirq(struct kvm_vcpu *vcpu, 
>>> struct vgic_lr vlr)
>>> return 0;
>>>  
>>> map = vgic_irq_map_search(vcpu, vlr.irq);
>>> -   BUG_ON(!map || !map->active);
>>> +   BUG_ON(!map);
>>>  
>>> ret = irq_get_irqchip_state(map->irq,
>>> IRQCHIP_STATE_ACTIVE,
>>>
>>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] arm/arm64: KVM: Clear map->active on pend/active clear

2015-10-19 Thread Eric Auger
Hi,
On 10/17/2015 10:30 PM, Christoffer Dall wrote:
> When a guest reboots or offlines/onlines CPUs, it is not uncommon for it
> to clear the pending and active states of an interrupt through the
> emulated VGIC distributor.  However, since we emulate an edge-triggered
> based on a level-triggered device,
I do not get this sentence.

Besides that
Reviewed-by: Eric Auger 

Best Regards

Eric
 the guest expects the timer interrupt
> to hit even after clearing the pending state.
> 
> We currently do not signal the VGIC when the map->active field is true,
> because it indicates that the guest has already been signalled of the
> interrupt as required.  Normally this field is set to false when the
> guest deactivates the virtual interrupt through the sync path.
> 
> We also need to catch the case where the guest deactivates the interrupt
> through the emulated distributor, again allowing guests to boot even if
> the original virtual timer signal hit before the guest's GIC
> initialization sequence is run.
> 
> Signed-off-by: Christoffer Dall 
> ---
>  virt/kvm/arm/vgic.c | 32 +++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index ea21bc2..58b1256 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -531,6 +531,34 @@ bool vgic_handle_set_pending_reg(struct kvm *kvm,
>   return false;
>  }
>  
> +/*
> + * If a mapped interrupt's state has been modified by the guest such that it
> + * is no longer active or pending, without it have gone through the sync 
> path,
> + * then the map->active field must be cleared so the interrupt can be taken
> + * again.
> + */
> +static void vgic_handle_clear_mapped_irq(struct kvm_vcpu *vcpu)
> +{
> + struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
> + struct list_head *root;
> + struct irq_phys_map_entry *entry;
> + struct irq_phys_map *map;
> +
> + rcu_read_lock();
> +
> + /* Check for PPIs */
> + root = _cpu->irq_phys_map_list;
> + list_for_each_entry_rcu(entry, root, entry) {
> + map = >map;
> +
> + if (!vgic_dist_irq_is_pending(vcpu, map->virt_irq) &&
> + !vgic_irq_is_active(vcpu, map->virt_irq))
> + map->active = false;
> + }
> +
> + rcu_read_unlock();
> +}
> +
>  bool vgic_handle_clear_pending_reg(struct kvm *kvm,
>  struct kvm_exit_mmio *mmio,
>  phys_addr_t offset, int vcpu_id)
> @@ -561,6 +589,7 @@ bool vgic_handle_clear_pending_reg(struct kvm *kvm,
> vcpu_id, offset);
>   vgic_reg_access(mmio, reg, offset, mode);
>  
> + vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
>   vgic_update_state(kvm);
>   return true;
>   }
> @@ -598,6 +627,7 @@ bool vgic_handle_clear_active_reg(struct kvm *kvm,
>   ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT);
>  
>   if (mmio->is_write) {
> + vgic_handle_clear_mapped_irq(kvm_get_vcpu(kvm, vcpu_id));
>   vgic_update_state(kvm);
>   return true;
>   }
> @@ -1406,7 +1436,7 @@ static int vgic_sync_hwirq(struct kvm_vcpu *vcpu, 
> struct vgic_lr vlr)
>   return 0;
>  
>   map = vgic_irq_map_search(vcpu, vlr.irq);
> - BUG_ON(!map || !map->active);
> + BUG_ON(!map);
>  
>   ret = irq_get_irqchip_state(map->irq,
>   IRQCHIP_STATE_ACTIVE,
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [kvm-unit-tests PATCHv4 3/3] arm: pmu: Add CPI checking

2015-10-19 Thread Christopher Covington
Hi Drew,

I appreciate your feedback on these patches.

On 10/18/2015 02:28 PM, Andrew Jones wrote:

>> --- a/arm/pmu.c
>> +++ b/arm/pmu.c
>> @@ -37,6 +37,18 @@ static inline unsigned long get_pmccntr(void)
>>  asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
>>  return cycles;
>>  }
>> +
>> +static inline void loop(int i, uint32_t pmcr)
>> +{
>> +uint32_t z = 0;
>> +
>> +asm volatile(
>> +"   mcr p15, 0, %[pmcr], c9, c12, 0\n"
>> +"   1: subs %[i], %[i], #1\n"
>> +"   bgt 1b\n"
>> +"   mcr p15, 0, %[z], c9, c12, 0\n"
>> +: [i] "+r" (i) : [pmcr] "r" (pmcr), [z] "r" (z) : "cc");
> 
> Assembly is always ugly, but we can do a bit better formatting with tabs
> 
>   asm volatile(
>   "   mcr p15, 0, %[pmcr], c9, c12, 0\n"
>   "1: subs%[i], %[i], #1\n"
>   "   bgt 1b\n"
>   "   mcr p15, 0, %[z], c9, c12, 0\n"
>   : [i] "+r" (i)
>   : [pmcr] "r" (pmcr), [z] "r" (z)
>   : "cc");
> 
> Actually it can be even cleaner because you already created set_pmcr()
> 
>   set_pmcr(pmcr);
> 
>   asm volatile(
>   "1: subs%0, %0, #1\n"
>   "   bgt 1b\n"
>   : "+r" (i) : : "cc");
> 
>   set_pmcr(0);

Is there any way to ensure that the compiler won't for example put a `mov rd,
#0` between the `bgt 1b` and the `mcr , rn`?

>> @@ -125,12 +147,79 @@ static bool check_cycles_increase(void)
>>  return true;
>>  }
>>  
>> -int main(void)
>> +/*
>> + * Execute a known number of guest instructions. Only odd instruction counts
>> + * greater than or equal to 3 are supported by the in-line assembly code. 
>> The
> 
> Not all odd counts, right? But rather all multiples of 3? IIUC this is because
> the loop is two instructions (sub + branch), and then the clearing of the pmcr
> register counts as the 3rd?

Clearing the PMCR doesn't happen as part of the loop, but as part of the loop
exit or epilogue.

total_instrs = iteration_count * loop_instrs + eipilogue_instrs
total_instrs = iteration_count * 2 + 1

Thanks,
Christopher Covington

-- 
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 27/33] nvdimm acpi: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Dan Williams
On Mon, Oct 19, 2015 at 12:09 AM, Michael S. Tsirkin  wrote:
> On Mon, Oct 19, 2015 at 12:04:48PM +0800, Xiao Guangrong wrote:
> I mean don't use ASL to comment C. It's not more readable.
> Describe why the code is the way it is. Use variables by preference,
> C does not have weird limitations like ASL so you don't need to call
> your variables "arg3". What does it hold?
>

What it holds is function number specific.  It's similar to
SYSCALL_DEFINEx where the ASL code is there to marshal arguments from
the OS through ACPI to a BIOS routine.  See the definition of the
example _DSM functions here and the usages of "Arg3":
http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] userfault21 update

2015-10-19 Thread Andrea Arcangeli
Hello Patrick,

On Mon, Oct 12, 2015 at 11:04:11AM -0400, Patrick Donnelly wrote:
> Hello Andrea,
> 
> On Mon, Jun 15, 2015 at 1:22 PM, Andrea Arcangeli  wrote:
> > This is an incremental update to the userfaultfd code in -mm.
> 
> Sorry I'm late to this party. I'm curious how a ptrace monitor might
> use a userfaultfd to handle faults in all of its tracees. Is this
> possible without having each (newly forked) tracee "cooperate" by
> creating a userfaultfd and passing that to the tracer?

To make the non cooperative usage work, userfaulfd also needs more
features to track fork() and mremap() syscalls and such, as the
monitor needs to be aware about modifications to the address space of
each "mm" is managing and of new forked "mm" as well. So fork() won't
need to call userfaultfd once we add those features, but it still
doesn't need to know about the "pid". The uffd_msg already has padding
to add the features you need for that.

Pavel invented and developed those features for the non cooperative
usage to implement postcopy live migration of containers. He posted
some patchset on the lists too, but it probably needs to be rebased on
upstream.

The ptrace monitor thread can also fault into the userfault area if it
wants to (but only if it's not the userfault manager thread as well).
I didn't expect the ptrace monitor to want to be a userfault manager
too though.

On a side note, the signals the ptrace monitor sends to the tracee
(SIGCONT|STOP included) will only be executed by the tracee without
waiting for userfault resolution from the userfault manager, if the
tracees userfault wasn't triggered in kernel context (and in a non
cooperative usage that's not an assumption you can make). If the
tracee hits an userfault while running in kernel context, the
userfault manager must resolve the userfault before any signal (except
SIGKILL of course) can be executed by the tracee. Only SIGKILL is
instantly executed by all tracees no matter if it was an userfault in
kernel or user context. That may be another reason for not wanting the
ptrace monitor and the userfault manager in the same thread (they can
still be running in two different threads of the same external
process).

> Have you considered using one userfaultfd for an entire tree of
> processes (signaled through a flag)? Would not a process id included
> in the include/uapi/linux/userfaultfd.h:struct uffd_msg be sufficient
> to disambiguate faults?

I got a private email asking a corollary question about having the
faulting IP address in the uffd_msg recently, which I answered and I
take opportunity to quote it as well below, as it's somewhat connected
with your "pid" question and this adds more context.

===

At times it's the kernel accessing the page (copy-user get user pages)
like if the buffer is a parameter to the write or read syscalls, just
to make an example.

The IP address triggering the fault isn't necessarily a userland
address. Furthermore not even the pid is known, so you don't know
which process accessed it.

userfaultfd only notifies userland that a certain page is requested
and must be mapped ASAP. You don't know why or who touched it.

===

Now about adding the "pid": the association between "pid" and "mm"
isn't so strict in the kernel. You can tell which "pid" shares the
same "mm" but if you look from userland, you can't always tell which
"mm"(/process) the pid belongs to. At times async io threads or
vhost-net threads can impersonate the "mm" and in effect become part
of the process and you'd get those random "pid" of kernel threads.

It could also be a ptrace that triggers an userfault, with a "pid" that
isn't part of the application and the manager must still work
seamlessly no matter who or which "pid" triggered the userfault.

So overall dealing the "pid"s sounds like not very clean as the same
kernel thread "pid" can impersonate multiple "mm" and you wouldn't get
the information of which "mm" the "address" belongs to.

When userfaultfd() is called, it literally binds to the "mm" the
process is running on and it's pid agnostic. Then when a kernel thread
impersonating the "mm" faults into the "mm" with get_user_pages or
copy_user or when a ptrace faults into the "mm", the userafult manager
won't even see the difference.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 27/33] nvdimm acpi: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Dan Williams
On Mon, Oct 19, 2015 at 2:19 PM, Michael S. Tsirkin  wrote:
> On Mon, Oct 19, 2015 at 10:29:50AM -0700, Dan Williams wrote:
>> On Mon, Oct 19, 2015 at 12:09 AM, Michael S. Tsirkin  wrote:
>> > On Mon, Oct 19, 2015 at 12:04:48PM +0800, Xiao Guangrong wrote:
>> > I mean don't use ASL to comment C. It's not more readable.
>> > Describe why the code is the way it is. Use variables by preference,
>> > C does not have weird limitations like ASL so you don't need to call
>> > your variables "arg3". What does it hold?
>> >
>>
>> What it holds is function number specific.  It's similar to
>> SYSCALL_DEFINEx where the ASL code is there to marshal arguments from
>> the OS through ACPI to a BIOS routine.  See the definition of the
>> example _DSM functions here and the usages of "Arg3":
>> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
>
> So it seems to say "3.1.1.1.1 Input (Arg3)".
> Is that right? So
>
> Aml *input = aml_arg(3); /* Input (Arg3) */
> or even
> Aml *input_arg3 =  aml_arg(3); /* Input (Arg3) */
>
> My point is we are not writing ASL. There is no reason
> to use cryptic names.
>

Ah, ok, sounds good to me.  ASL is already hard to read and we
shouldn't be using it as code commentary.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Michael S. Tsirkin
On Tue, Oct 20, 2015 at 01:54:12AM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/19/2015 06:25 PM, Michael S. Tsirkin wrote:
> >On Mon, Oct 19, 2015 at 11:18:36AM +0200, Igor Mammedov wrote:
> >>On Mon, 19 Oct 2015 09:56:12 +0300
> >>"Michael S. Tsirkin"  wrote:
> >>
> >>>On Sun, Oct 11, 2015 at 11:52:54AM +0800, Xiao Guangrong wrote:
> >>[...]
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index f6bd2c4..aa95961 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -15,6 +15,10 @@
> 
>   #include "hw/mem/dimm.h"
> 
> +/* Memory region 0xFF0 ~ 0xFFF0 is reserved for NVDIMM
> ACPI. */ +#define NVDIMM_ACPI_MEM_BASE   0xFF00ULL
> >>Michael,
> >>
> >>If it's ok to map control RAM region directly from QEMU at arbitrary
> >>location
> >
> >It's a fair question. Where is it reserved? In which spec?
> >
> 
> The region 0xFF0 ~ 0xFFF0 is just reserved for vDSM implementation,
> it is not required by spec.

See Igor's comment then. You can't just steal it like that.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 2/2] block: enable dax for raw block devices

2015-10-19 Thread Dan Williams
On Sun, Oct 18, 2015 at 8:01 PM, Ross Zwisler
 wrote:
> On Fri, Oct 16, 2015 at 08:49:41PM -0400, Dan Williams wrote:
>> If an application wants exclusive access to all of the persistent memory
>> provided by an NVDIMM namespace it can use this raw-block-dax facility
>> to forgo establishing a filesystem.  This capability is targeted
>> primarily to hypervisors wanting to provision persistent memory for
>> guests.
>>
>> Cc: Jeff Moyer 
>> Cc: Christoph Hellwig 
>> Cc: Al Viro 
>> Cc: Andrew Morton 
>> Cc: Ross Zwisler 
>> Cc: Xiao Guangrong 
>> Signed-off-by: Dan Williams 
>> ---
>>
>> Only lighted tested so far, but seems to work, is the shortest path to a
>> DAX mapping, and makes it easier to trigger the pmd_fault path (no
>> fs-block-allocator interactions).
>>
>>  fs/block_dev.c |   84 
>> +++-
>>  1 file changed, 83 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 5277dd83d254..498b71455570 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -1687,13 +1687,95 @@ static const struct address_space_operations 
>> def_blk_aops = {
>>   .is_dirty_writeback = buffer_check_dirty_writeback,
>>  };
>>
>> +#ifdef CONFIG_FS_DAX
>> +static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault 
>> *vmf)
>> +{
>> + struct inode *bd_inode = file_bd_inode(vma->vm_file);
>> + struct block_device *bdev = I_BDEV(bd_inode);
>> + int ret;
>> +
>> + mutex_lock(>bd_mutex);
>> + ret = __dax_fault(vma, vmf, blkdev_get_block, NULL);
>> + mutex_unlock(>bd_mutex);
>> +
>> + return ret;
>> +}
>
> This all looks very straightforward.  The one comment I have is that this
> code is missing the calls to sb_[start|end]_pagefault(), and to
> file_update_time() that are found in ext[24]/xfs and the generic fault code.
>
> The previous version of this code used the generic fault implementation, and
> was calling these functions via filemap_page_mkwrite().
>
> It is possible that they were omitted for a reason - does protection from
> filesystem freezing still make sense when talking with a raw block device?
> For example, if that block device *has* a mounted filesystem on it that is
> frozen, does sb_start_pagefault() prevent against page faults on the raw
> device that try and make something writable?
>
> In any case, the presence of them in filemap_page_mkwrite() tells me that they
> at least aren't harmful, and I wanted to make sure they weren't needed before
> leaving them out.  If the omission was intentional, should we add a comment to
> explain why they are missing?

So, I left them out on purpose and labeled this "RFC" mainly because I
wasn't ready to assert that the new locking using bd_mutex did not
have bad interactions with those paths or the mmap path in general.

The access time and interactions with freezing are different for raw
block device files given that the inode for the per-instance
device-node file is separate from the inode for the block device
itself.  The device node file is typically on a tmpfs filesystem
mounted at /dev.  The access time of the device node doesn't have a
reliable correlation with the access time of the block-device due to
the fact that you can mknod() a new device node anywhere, on any
filesystem, and do i/o.

I'll consider adding them back if they do no harm, but "device special
files" are already sufficiently different than regular files that I'm
not sure it matters.  Either way I agree this could use more
commentary in the code.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 27/33] nvdimm acpi: save arg3 for NVDIMM device _DSM method

2015-10-19 Thread Michael S. Tsirkin
On Mon, Oct 19, 2015 at 10:29:50AM -0700, Dan Williams wrote:
> On Mon, Oct 19, 2015 at 12:09 AM, Michael S. Tsirkin  wrote:
> > On Mon, Oct 19, 2015 at 12:04:48PM +0800, Xiao Guangrong wrote:
> > I mean don't use ASL to comment C. It's not more readable.
> > Describe why the code is the way it is. Use variables by preference,
> > C does not have weird limitations like ASL so you don't need to call
> > your variables "arg3". What does it hold?
> >
> 
> What it holds is function number specific.  It's similar to
> SYSCALL_DEFINEx where the ASL code is there to marshal arguments from
> the OS through ACPI to a BIOS routine.  See the definition of the
> example _DSM functions here and the usages of "Arg3":
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

So it seems to say "3.1.1.1.1 Input (Arg3)".
Is that right? So

Aml *input = aml_arg(3); /* Input (Arg3) */
or even
Aml *input_arg3 =  aml_arg(3); /* Input (Arg3) */

My point is we are not writing ASL. There is no reason
to use cryptic names.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] vfio: platform: add capability to register a reset function

2015-10-19 Thread Alex Williamson
On Sun, 2015-10-18 at 18:00 +0200, Eric Auger wrote:
> In preparation for subsequent changes in reset function lookup,
> lets introduce a dynamic list of reset combos (compat string,
> reset module, reset function). The list can be populated/voided with
> two new functions, vfio_platform_register/unregister_reset. Those are
> not yet used in this patch.
> 
> Signed-off-by: Eric Auger 
> ---
>  drivers/vfio/platform/vfio_platform_common.c  | 55 
> +++
>  drivers/vfio/platform/vfio_platform_private.h | 14 +++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/vfio/platform/vfio_platform_common.c 
> b/drivers/vfio/platform/vfio_platform_common.c
> index e43efb5..d36afc9 100644
> --- a/drivers/vfio/platform/vfio_platform_common.c
> +++ b/drivers/vfio/platform/vfio_platform_common.c
> @@ -23,6 +23,8 @@
>  
>  #include "vfio_platform_private.h"
>  
> +struct list_head reset_list;

What's the purpose of this one above?

> +LIST_HEAD(reset_list);

static

I think you also need a mutex protecting this list, otherwise nothing
prevents concurrent list changes and walks.  A rw lock probably fits the
usage model best, but I don't expect much contention if you just want to
use a mutex.

>  static DEFINE_MUTEX(driver_lock);
>  
>  static const struct vfio_platform_reset_combo reset_lookup_table[] = {
> @@ -573,3 +575,56 @@ struct vfio_platform_device 
> *vfio_platform_remove_common(struct device *dev)
>   return vdev;
>  }
>  EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
> +
> +int vfio_platform_register_reset(struct module *reset_owner, char *compat,

const char *

> +  vfio_platform_reset_fn_t reset)
> +{
> + struct vfio_platform_reset_node *node, *iter;
> + bool found = false;
> +
> + list_for_each_entry(iter, _list, link) {
> + if (!strcmp(iter->compat, compat)) {
> + found = true;
> + break;
> + }
> + }
> + if (found)
> + return -EINVAL;
> +
> + node = kmalloc(sizeof(*node), GFP_KERNEL);
> + if (!node)
> + return -ENOMEM;
> +
> + node->compat = kstrdup(compat, GFP_KERNEL);
> + if (!node->compat)
> + return -ENOMEM;

Leaks node

> +
> + node->owner = reset_owner;
> + node->reset = reset;
> +
> + list_add(>link, _list);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_platform_register_reset);
> +
> +int vfio_platform_unregister_reset(char *compat)

const char *

> +{
> + struct vfio_platform_reset_node *iter;
> + bool found = false;
> +
> + list_for_each_entry(iter, _list, link) {
> + if (!strcmp(iter->compat, compat)) {
> + found = true;
> + break;
> + }
> + }
> + if (!found)
> + return -EINVAL;
> +
> + list_del(>link);
> + kfree(iter->compat);
> + kfree(iter);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_platform_unregister_reset);
> +
> diff --git a/drivers/vfio/platform/vfio_platform_private.h 
> b/drivers/vfio/platform/vfio_platform_private.h
> index 1c9b3d5..17323f0 100644
> --- a/drivers/vfio/platform/vfio_platform_private.h
> +++ b/drivers/vfio/platform/vfio_platform_private.h
> @@ -76,6 +76,15 @@ struct vfio_platform_reset_combo {
>   const char *module_name;
>  };
>  
> +typedef int (*vfio_platform_reset_fn_t)(struct vfio_platform_device *vdev);
> +
> +struct vfio_platform_reset_node {
> + struct list_head link;
> + char *compat;
> + struct module *owner;
> + vfio_platform_reset_fn_t reset;
> +};
> +
>  extern int vfio_platform_probe_common(struct vfio_platform_device *vdev,
> struct device *dev);
>  extern struct vfio_platform_device *vfio_platform_remove_common
> @@ -89,4 +98,9 @@ extern int vfio_platform_set_irqs_ioctl(struct 
> vfio_platform_device *vdev,
>   unsigned start, unsigned count,
>   void *data);
>  
> +extern int vfio_platform_register_reset(struct module *owner,
> + char *compat,
> + vfio_platform_reset_fn_t reset);
> +extern int vfio_platform_unregister_reset(char *compat);
> +
>  #endif /* VFIO_PLATFORM_PRIVATE_H */



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 06:42 PM, Igor Mammedov wrote:

On Mon, 19 Oct 2015 18:01:17 +0800
Xiao Guangrong  wrote:




On 10/19/2015 05:46 PM, Igor Mammedov wrote:

On Mon, 19 Oct 2015 12:17:22 +0300
"Michael S. Tsirkin"  wrote:


On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:



On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:

On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:

+nvdimm_init_memory_state(>nvdimm_memory,
system_memory, machine,
+ TARGET_PAGE_SIZE);
+


Shouldn't this be conditional on presence of the nvdimm device?



We will enable hotplug on nvdimm devices in the near future once
Linux driver is ready. I'd keep it here for future development.


No, I don't think we should add stuff unconditionally. If not
nvdimm, some other flag should indicate user intends to hotplug
things.



Actually, it is not unconditionally which is called if parameter
"-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path
of memoy-hotplug initiation.



Right, but that's not the same as nvdimm.



it could be pc-machine property, then it could be turned on like
this: -machine nvdimm_support=on


Er, I do not understand why this separate switch is needed and why
nvdimm and pc-dimm is different. :(

NVDIMM reuses memory-hotplug's framework, such as maxmem, slot, and
dimm device, even some of ACPI logic to do hotplug things, etc. Both
nvdimm and pc-dimm are built on the same infrastructure.

NVDIMM support consumes precious low RAM  and MMIO resources and
not small amount at that. So turning it on unconditionally with
memory hotplug even if NVDIMM wouldn't ever be used isn't nice.


Okay, understand... will introduce nvdimm_support as your suggestion.
Thank you, Igor and Michael!

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 06:25 PM, Michael S. Tsirkin wrote:

On Mon, Oct 19, 2015 at 11:18:36AM +0200, Igor Mammedov wrote:

On Mon, 19 Oct 2015 09:56:12 +0300
"Michael S. Tsirkin"  wrote:


On Sun, Oct 11, 2015 at 11:52:54AM +0800, Xiao Guangrong wrote:

[...]

diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index f6bd2c4..aa95961 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -15,6 +15,10 @@

  #include "hw/mem/dimm.h"

+/* Memory region 0xFF0 ~ 0xFFF0 is reserved for NVDIMM
ACPI. */ +#define NVDIMM_ACPI_MEM_BASE   0xFF00ULL

Michael,

If it's ok to map control RAM region directly from QEMU at arbitrary
location


It's a fair question. Where is it reserved? In which spec?



The region 0xFF0 ~ 0xFFF0 is just reserved for vDSM implementation,
it is not required by spec.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFT - PATCH v2 0/2] KVM/arm64: add fp/simd lazy switch support

2015-10-19 Thread Mario Smarduch


On 10/18/2015 2:07 PM, Christoffer Dall wrote:
> On Mon, Oct 12, 2015 at 09:29:23AM -0700, Mario Smarduch wrote:
>> Hi Christoffer, Marc -
>>   I just threw this test your way without any explanation.
> 
> I'm confused.  Did you send me something somewhere already?
Yes in the last patchset

https://lists.cs.columbia.edu/pipermail/kvmarm/2015-October/016698.html

I included a simple test I put together.

> 
>>
>> The test loops, does fp arithmetic and checks the truncated result.
>> It could be a little more dynamic have an initial run to
>> get the sum to compare against while looping, different fp
>> hardware may come up with a different sum, but truncation is
>> to 5'th decimal point.
>>
>> The rationale is that if there is any fp/simd corruption
>> one of these runs should fail. I think most likely scenario
>> for that is a world switch in midst of fp operation. I've
>> instrumented (basically add some tracing to vcpu_put()) and
>> validated vcpu_put gets called thousands of time (for v7,v8)
>> for an over night test running two guests/host crunching
>> fp operations.
>>
>> Other then that not sure how to really catch any problems
>> with the patches applied. Obviously this is a huge issues, if this has
>> any problems. If you or Marc have any other ideas I'd be happy
>> to enhance the test.
> 
> I think it's important to run two VMs at the same time, each with some
> floating-point work, and then run some floating point on the host at the
> same time.
> 
> You can make that even more interesting by doing 32-bit guests at the
> same time as well.

Yes that's the test combination I've been running.
> 
> I believe Marc was running Panranoia
> (http://www.netlib.org/paranoia/paranoia.c) to test the last lazy
> series.

I'll try this test and run it for several days, see if anything shows up.

Thanks.
> 
> Thanks,
> -Christoffer
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] KVM/arm: add hooks for armv7 fp/simd lazy switch support

2015-10-19 Thread Mario Smarduch


On 10/19/2015 1:53 AM, Christoffer Dall wrote:
> On Sat, Sep 26, 2015 at 04:43:28PM -0700, Mario Smarduch wrote:
>> This patch adds vcpu fields to track lazy state, save host FPEXC, and 
>> offsets to fields.
>>
>> Signed-off-by: Mario Smarduch 
>> ---
>>  arch/arm/include/asm/kvm_host.h | 6 ++
>>  arch/arm/kernel/asm-offsets.c   | 2 ++
>>  2 files changed, 8 insertions(+)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h 
>> b/arch/arm/include/asm/kvm_host.h
>> index dcba0fa..194a8ef 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -111,6 +111,12 @@ struct kvm_vcpu_arch {
>>  /* Interrupt related fields */
>>  u32 irq_lines;  /* IRQ and FIQ levels */
>>  
>> +/* Track fp/simd lazy switch state */
>> +u32 vfp_lazy;
> 
> so is this a flags field or basically a boolean?  If the latter, what is
> does it mean when the field is true vs. false?
Yes it's a bool will update, and clarify comments.
> 
>> +
>> +/* Save host FPEXC register to restore on vcpu put */
>> +u32 saved_fpexc;
> 
> is this only the host's state?  If so, why not name it host_fpexc?
That's right itis host state, will change.
> 
> Thanks,
> -Christoffer
> 
>> +
>>  /* Exception Information */
>>  struct kvm_vcpu_fault_info fault;
>>  
>> diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
>> index 871b826..e1c3a41 100644
>> --- a/arch/arm/kernel/asm-offsets.c
>> +++ b/arch/arm/kernel/asm-offsets.c
>> @@ -186,6 +186,8 @@ int main(void)
>>DEFINE(VCPU_CPSR, offsetof(struct kvm_vcpu, 
>> arch.regs.usr_regs.ARM_cpsr));
>>DEFINE(VCPU_HCR,  offsetof(struct kvm_vcpu, arch.hcr));
>>DEFINE(VCPU_IRQ_LINES,offsetof(struct kvm_vcpu, arch.irq_lines));
>> +  DEFINE(VCPU_VFP_LAZY, offsetof(struct kvm_vcpu, 
>> arch.vfp_lazy));
>> +  DEFINE(VCPU_VFP_FPEXC,offsetof(struct kvm_vcpu, arch.saved_fpexc));
>>DEFINE(VCPU_HSR,  offsetof(struct kvm_vcpu, arch.fault.hsr));
>>DEFINE(VCPU_HxFAR,offsetof(struct kvm_vcpu, 
>> arch.fault.hxfar));
>>DEFINE(VCPU_HPFAR,offsetof(struct kvm_vcpu, 
>> arch.fault.hpfar));
>> -- 
>> 1.9.1
>>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] KVM/arm: enable enhanced armv7 fp/simd lazy switch

2015-10-19 Thread Mario Smarduch


On 10/19/2015 3:14 AM, Christoffer Dall wrote:
> On Sat, Sep 26, 2015 at 04:43:29PM -0700, Mario Smarduch wrote:
>> This patch enhances current lazy vfp/simd hardware switch. In addition to
>> current lazy switch, it tracks vfp/simd hardware state with a vcpu 
>> lazy flag. 
>>
>> vcpu lazy flag is set on guest access and trap to vfp/simd hardware switch 
>> handler. On vm-enter if lazy flag is set skip trap enable and saving 
>> host fpexc. On vm-exit if flag is set skip hardware context switch
>> and return to host with guest context.
>>
>> On vcpu_put check if vcpu lazy flag is set, and execute a hardware context 
>> switch to restore host.
>>
>> Signed-off-by: Mario Smarduch 
>> ---
>>  arch/arm/include/asm/kvm_asm.h |  1 +
>>  arch/arm/kvm/arm.c | 17 
>>  arch/arm/kvm/interrupts.S  | 60 
>> +++---
>>  arch/arm/kvm/interrupts_head.S | 12 ++---
>>  4 files changed, 71 insertions(+), 19 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
>> index 194c91b..4b45d79 100644
>> --- a/arch/arm/include/asm/kvm_asm.h
>> +++ b/arch/arm/include/asm/kvm_asm.h
>> @@ -97,6 +97,7 @@ extern char __kvm_hyp_code_end[];
>>  extern void __kvm_flush_vm_context(void);
>>  extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
>>  extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
>> +extern void __kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu);
>>  
>>  extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
>>  #endif
>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> index ce404a5..79f49c7 100644
>> --- a/arch/arm/kvm/arm.c
>> +++ b/arch/arm/kvm/arm.c
>> @@ -105,6 +105,20 @@ void kvm_arch_check_processor_compat(void *rtn)
>>  *(int *)rtn = 0;
>>  }
>>  
>> +/**
>> + * kvm_switch_fp_regs() - switch guest/host VFP/SIMD registers
>> + * @vcpu:   pointer to vcpu structure.
>> + *
> 
> nit: stray blank line
ok
> 
>> + */
>> +static void kvm_switch_fp_regs(struct kvm_vcpu *vcpu)
>> +{
>> +#ifdef CONFIG_ARM
>> +if (vcpu->arch.vfp_lazy == 1) {
>> +kvm_call_hyp(__kvm_restore_host_vfp_state, vcpu);
> 
> why do you have to do this in HYP mode ?
 Calling it directly works fine. I moved the function outside hyp start/end
range in interrupts.S. Not thinking outside the box, just thought let them all
be hyp calls.

> 
>> +vcpu->arch.vfp_lazy = 0;
>> +}
>> +#endif
> 
> we've tried to put stuff like this in header files to avoid the ifdefs
> so far.  Could that be done here as well?

That was a to do, but didn't get around to it.
> 
>> +}
>>  
>>  /**
>>   * kvm_arch_init_vm - initializes a VM data structure
>> @@ -295,6 +309,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  
>>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>  {
>> +/* Check if Guest accessed VFP registers */
> 
> misleading comment: this function does more than checking
Yep sure does, will change.
> 
>> +kvm_switch_fp_regs(vcpu);
>> +
>>  /*
>>   * The arch-generic KVM code expects the cpu field of a vcpu to be -1
>>   * if the vcpu is no longer assigned to a cpu.  This is used for the
>> diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
>> index 900ef6d..6d98232 100644
>> --- a/arch/arm/kvm/interrupts.S
>> +++ b/arch/arm/kvm/interrupts.S
>> @@ -96,6 +96,29 @@ ENTRY(__kvm_flush_vm_context)
>>  bx  lr
>>  ENDPROC(__kvm_flush_vm_context)
>>  
>> +/**
>> + * void __kvm_restore_host_vfp_state(struct vcpu *vcpu) - Executes a lazy
>> + * fp/simd switch, saves the guest, restores host.
>> + *
> 
> nit: stray blank line
ok.
> 
>> + */
>> +ENTRY(__kvm_restore_host_vfp_state)
>> +#ifdef CONFIG_VFPv3
>> +push{r3-r7}
>> +
>> +add r7, r0, #VCPU_VFP_GUEST
>> +store_vfp_state r7
>> +
>> +add r7, r0, #VCPU_VFP_HOST
>> +ldr r7, [r7]
>> +restore_vfp_state r7
>> +
>> +ldr r3, [vcpu, #VCPU_VFP_FPEXC]
> 
> either use r0 or vcpu throughout this function please
Yeah that's bad - in the same function to
> 
>> +VFPFMXR FPEXC, r3
>> +
>> +pop {r3-r7}
>> +#endif
>> +bx  lr
>> +ENDPROC(__kvm_restore_host_vfp_state)
>>  
>>  /
>>   *  Hypervisor world-switch code
>> @@ -119,11 +142,15 @@ ENTRY(__kvm_vcpu_run)
>>  @ If the host kernel has not been configured with VFPv3 support,
>>  @ then it is safer if we deny guests from using it as well.
>>  #ifdef CONFIG_VFPv3
>> +@ r7 must be preserved until next vfp lazy check
> 
> I don't understand this comment
> 
>> +vfp_inlazy_mode r7, skip_save_host_fpexc
>> +
>>  @ Set FPEXC_EN so the guest doesn't trap floating point instructions
>>  VFPFMRX r2, FPEXC   @ VMRS
>> -push{r2}
>> +str r2, [vcpu, #VCPU_VFP_FPEXC]
>>  orr r2, r2, #FPEXC_EN
>>  VFPFMXR FPEXC, r2   @ VMSR
>> +skip_save_host_fpexc:

Steal time accounting in KVM. Benchmark.

2015-10-19 Thread Alexey Makhalov
Hi,

I did benchmarking of scheduler fairness with enabled steal time
accounting(STA) in KVM.
And results are really interesting.

Looks like STA provides worse scheduler fairness against disabled STA
(no-steal-acc cmdline param)

I created benchmark, main idea is: 2 cgroups with cpu.shares
proportion like 1/9, run identical work in both groups, and expecting
to get the same proportion of work done – 1/9. Condition – CPU
overcommit.

On bare metal it is fair +- some percentages of fluctation.
On KVM with no STA   it’s less fair. With STA enabled results are
ugly! One again – in CPU overcommit situation.


Host: ubuntu 14.04, Intel i5 – 2x2 CPUs
2 VMs (4 vCPU each) are working in parallel.   2:1 cpu overcommit.

Each VM has running benchmark:
cgroups cpu.shares proportion is 128/1152 (10%/90%), work – spinning
in cycle, number of cycles are being counted.

Results:
First column – work proportion. Expected value 90%
Second column – total number of loop cycles from all workers (x10).
Interval between printed lines – 1 sec

VM1: STA is disabled (no-steal-acc)
85% 29683
85% 29520
87% 29574
87% 31386
84% 29578
87% 29595
86% 29644
87% 29655
87% 30059
88% 3
90% 30037
90% 29720
90% 29951
86% 29958
88% 29971
90% 29829
82% 29841
82% 29998
82% 29752
80% 29722
88% 28825
85% 28283
85% 29746
77% 29657
82% 29674
84% 29764
87% 29567
79% 29529
84% 29633

VM2: STA is enabled
51% 30709
38% 29564
44% 29429
47% 29759
41% 29792
0% 29987
51% 52666
50% 29753
50% 29670
54% 29461
45% 29927
35% 29743
50% 29683
50% 29971
47% 30033
0% 121204
7% 116039
25% 119069
24% 40269
5% 37684
24% 30039
22% 30269
44% 31861
29% 29652
24% 29558
24% 29783
24% 29494
24% 29607
43% 29613
25% 29377

My benchmark sources:  https://github.com/YustasSwamp/steal-time-benchmark

Any ideas?
Do you have any steal time specific benchmarks?

Thanks,
Alexey
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v3 22/32] nvdimm: init the address region used by NVDIMM ACPI

2015-10-19 Thread Xiao Guangrong



On 10/19/2015 06:42 PM, Igor Mammedov wrote:

On Mon, 19 Oct 2015 18:01:17 +0800
Xiao Guangrong  wrote:




On 10/19/2015 05:46 PM, Igor Mammedov wrote:

On Mon, 19 Oct 2015 12:17:22 +0300
"Michael S. Tsirkin"  wrote:


On Mon, Oct 19, 2015 at 03:44:13PM +0800, Xiao Guangrong wrote:



On 10/19/2015 03:39 PM, Michael S. Tsirkin wrote:

On Mon, Oct 19, 2015 at 03:27:21PM +0800, Xiao Guangrong wrote:

+nvdimm_init_memory_state(>nvdimm_memory,
system_memory, machine,
+ TARGET_PAGE_SIZE);
+


Shouldn't this be conditional on presence of the nvdimm device?



We will enable hotplug on nvdimm devices in the near future once
Linux driver is ready. I'd keep it here for future development.


No, I don't think we should add stuff unconditionally. If not
nvdimm, some other flag should indicate user intends to hotplug
things.



Actually, it is not unconditionally which is called if parameter
"-m aaa, maxmem=bbb" (aaa < bbb) is used. It is on the some path
of memoy-hotplug initiation.



Right, but that's not the same as nvdimm.



it could be pc-machine property, then it could be turned on like
this: -machine nvdimm_support=on


Er, I do not understand why this separate switch is needed and why
nvdimm and pc-dimm is different. :(

NVDIMM reuses memory-hotplug's framework, such as maxmem, slot, and
dimm device, even some of ACPI logic to do hotplug things, etc. Both
nvdimm and pc-dimm are built on the same infrastructure.

NVDIMM support consumes precious low RAM  and MMIO resources and
not small amount at that. So turning it on unconditionally with
memory hotplug even if NVDIMM wouldn't ever be used isn't nice.

However that concern could be dropped if instead of allocating it's
own control MMIO/RAM regions, NVDIMM would reuse memory hotplug's MMIO
region and replace RAM region with serializing/marshaling label data
over the same MMIO interface (yes, it's slower but it's not
performance critical path).12


I really do not want to reuse all memory-hotplug's resource, NVDIMM and
memory-hotplug do not have the same ACPI logic, that makes the AML code
really complex.

Another point is, Microsoft does use label data area oon its own way - label
data area will not be used as namespace area at all, too slow access for
_DSM is not acceptable for vNVDIMM usage.

Most important point is, we do not want to slow down system boot with NVDIMM
attached, (imagine accessing 128K data with single 8 bytes MMIO access, crazy
slowly.), NVDIMM will be use as boot device and it will be used for
light-way virtualization, such as Clear Container and Hyper, which require
boot the system up as fast as possible.

I understand your concern that reserve big resource is not so acceptable - okay,
then how about just reserve 4 bit IO port and 1 RAM?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Greetings,

2015-10-19 Thread Invesco Perpetual
Greetings,

We the Invesco Investment Company, We invite you to partner with us and
benefit in our new Loan and Project funding program. We offer flexible
loans and funding for various projects by passing the usual rigorGreetings,

We the Invesco Investment Company, We invite you to partner with us and benefit 
in our new Loan and Project funding program. We offer flexible loans and 
funding for various projects by passing the usual rigorous procedures.This 
Funding program allows a client to enjoy low interest payback for as low as 3 - 
4% per annum for a period of 8-7 years. We can approve a loan/funding for up to 
USD 500,000,000.00 or more depending on
the nature of business. We are currently funding for:

* Starting up a Franchise
* Business Acquisition
* Business Expansion
* Commercial Real Estate purchase
* Contract Execution

We are open to having a good business relationship with you. If you think you 
have a solid background.

please do not hesitate to contact us for possible business co-operation @ our 
e-mail: (contactinve...@gmail.com)

NOTE:
Please make sure you send your response to our email stated above.

Sincerely,

Thanks for your co-operation
Advert/Publicity Personnel
Invesco Investment Company

---
This email is free from viruses and malware because avast! Antivirus protection 
is active.
https://www.avast.com/antivirus

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html