Re: [PATCH v4 17/33] dimm: abstract dimm device from pc-dimm

2015-10-23 Thread Bharata B Rao
On Mon, Oct 19, 2015 at 6:24 AM, Xiao Guangrong
 wrote:
> A base device, dimm, is abstracted from pc-dimm, so that we can
> build nvdimm device based on dimm in the later patch
>
> Signed-off-by: Xiao Guangrong 
> ---
>  default-configs/i386-softmmu.mak   |  1 +
>  default-configs/x86_64-softmmu.mak |  1 +
>  hw/mem/Makefile.objs   |  3 ++-
>  hw/mem/dimm.c  | 11 ++---
>  hw/mem/pc-dimm.c   | 46 
> ++
>  include/hw/mem/dimm.h  |  4 ++--
>  include/hw/mem/pc-dimm.h   |  7 ++
>  7 files changed, 61 insertions(+), 12 deletions(-)
>  create mode 100644 hw/mem/pc-dimm.c
>  create mode 100644 include/hw/mem/pc-dimm.h
>
> diff --git a/default-configs/i386-softmmu.mak 
> b/default-configs/i386-softmmu.mak
> index 43c96d1..3ece8bb 100644
> --- a/default-configs/i386-softmmu.mak
> +++ b/default-configs/i386-softmmu.mak
> @@ -18,6 +18,7 @@ CONFIG_FDC=y
>  CONFIG_ACPI=y
>  CONFIG_ACPI_X86=y
>  CONFIG_ACPI_X86_ICH=y
> +CONFIG_DIMM=y
>  CONFIG_ACPI_MEMORY_HOTPLUG=y
>  CONFIG_ACPI_CPU_HOTPLUG=y
>  CONFIG_APM=y
> diff --git a/default-configs/x86_64-softmmu.mak 
> b/default-configs/x86_64-softmmu.mak
> index dfb8095..92ea7c1 100644
> --- a/default-configs/x86_64-softmmu.mak
> +++ b/default-configs/x86_64-softmmu.mak
> @@ -18,6 +18,7 @@ CONFIG_FDC=y
>  CONFIG_ACPI=y
>  CONFIG_ACPI_X86=y
>  CONFIG_ACPI_X86_ICH=y
> +CONFIG_DIMM=y

Same change needs to be done in default-configs/ppc64-softmmu.mak too.
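
A sketch of the suggested hunk, in the same style as the i386/x86_64 ones above
(the surrounding context lines of ppc64-softmmu.mak are not quoted in this
thread, so only the added line is shown):

--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ ... @@
+CONFIG_DIMM=y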

Regards,
Bharata.


Re: [PATCH v3 00/32] implement vNVDIMM

2015-10-11 Thread Bharata B Rao
Xiao,

Are these patches present in any git tree so that they can be easily tried out?

Regards,
Bharata.

On Sun, Oct 11, 2015 at 9:22 AM, Xiao Guangrong
 wrote:
> Changelog in v3:
> There are huge changes in this version; thanks to Igor, Stefan, Paolo, Eduardo
> and Michael for their valuable comments, the patchset is finally in better shape.
> - changes from Igor's comments:
>   1) abstract a dimm device type from pc-dimm and create the nvdimm device
>  based on dimm; it then uses a memory backend device as nvdimm's memory,
>  so NUMA support is easily implemented.
>   2) let the file-backend device support any kind of filesystem, not only
>  hugetlbfs, and let it work on a file as well as a directory. This is
>  achieved by extending 'mem-path': if it is a directory it keeps the
>  current behavior, otherwise if it is a file memory is allocated
>  directly from it.
>   3) we found an unused memory hole below 4G (0xFF0 ~ 0xFFF0) which is
>  large enough for NVDIMM ACPI, since building a 64-bit ACPI SSDT/DSDT
>  table would break Windows XP.
>  BTW, making only SSDT.rev = 2 does not work, since the integer width
>  depends only on DSDT.rev, per 19.6.28 DefinitionBlock (Declare
>  Definition Block) in the ACPI spec:
> | Note: For compatibility with ACPI versions before ACPI 2.0, the bit
> | width of Integer objects is dependent on the ComplianceRevision of the DSDT.
> | If the ComplianceRevision is less than 2, all integers are restricted to 32
> | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets
> | the global integer width for all integers, including integers in SSDTs.
>   4) use the lowest ACPI spec version to document AML terms.
>   5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"
>
> - changes from Stefan's comments:
>   1) do not do endian adjustment in place since the _DSM memory is visible
>  to the guest
>   2) use the target platform's page size instead of a fixed PAGE_SIZE
>  definition
>   3) lots of code style improvement and typo fixes.
>   4) live migration fix
> - changes from Paolo's comments:
>   1) improve the name of memory region
>
> - other changes:
>   1) return exact buffer size for _DSM method instead of the page size.
>   2) introduce a mutex in NVDIMM ACPI since the _DSM memory is shared by all
>  nvdimm devices.
>   3) NUMA support
>   4) implement _FIT method
>   5) rename "configdata" to "reserve-label-data"
>   6) simplify _DSM arg3 determination
>   7) update the main changelog to reflect v3.
>
> Changelog in v2:
> - Use little endian for the DSM method, thanks to Stefan for the suggestion
>
> - introduce a new parameter, @configdata; if it is false, QEMU will
>   build a static, read-only namespace in memory and use it to serve
>   DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no
>   reserved region is needed at the end of the @file, which is good for
>   users who want to pass the whole nvdimm device and make its data
>   completely visible to the guest
>
> - divide the source code into separate files and add maintainer info
>
> BTW, PCOMMIT virtualization on the KVM side is work in progress; hopefully it
> will be posted next week
>
> == Background ==
> NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be supported
> on Intel's platform. NVDIMMs are discovered via ACPI and configured by the
> _DSM method of the NVDIMM device in ACPI. Supporting documents can be
> found at:
> ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> DSM Interface Example: 
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> Driver Writer's Guide: 
> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
> Currently, the NVDIMM driver has been merged into the upstream Linux kernel,
> and this patchset tries to enable it for virtualization
>
> == Design ==
> NVDIMM supports two access modes: PMEM, which maps the NVDIMM into the CPU's
> address space so that the CPU can access it directly as normal memory, and
> BLK, which exposes it as a block device to reduce the amount of CPU address
> space it occupies
>
> BLK mode accesses the NVDIMM via a Command Register window and a Data Register
> window. BLK virtualization has a high overhead since each sector access causes
> at least two VM exits, so we currently only implement vPMEM in this patchset
>
> --- vPMEM design ---
> We introduce a new device named "nvdimm" which uses a memory backend device as
> NVDIMM memory. The file in the file-backend device can be a regular file or a
> block device. Any file can be used for testing or emulation; however, in the
> real world, the files passed to the guest are:
> - a regular file, created on an NVDIMM device on the host, in a filesystem
>   with DAX enabled
> - the raw PMEM device on the host, e.g. /dev/pmem0
> Memory accesses to the address range created by mmap() on these kinds of files
> directly reach the NVDIMM
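
As an illustration of the direct-access model described above, here is a
minimal userspace sketch (not part of the patchset) that maps a PMEM backing
file and stores to it directly; the /dev/pmem0 path and the 4 KiB length are
just example values.

/* Minimal sketch: map a PMEM device (or a DAX file) and access it like RAM. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/pmem0";   /* example path; a DAX file works too */
    size_t len = 4096;                 /* example length */

    int fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* A MAP_SHARED mapping: loads and stores go straight to the NVDIMM pages. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    memcpy(p, "hello, pmem", 12);      /* an ordinary CPU store, no block I/O */

    munmap(p, len);
    close(fd);
    return 0;
}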

Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-09-08 Thread Bharata B Rao
On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote:
> On Wed, 2015-08-12 at 10:53 +0530, Bharata B Rao wrote:
> > On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
> > > Hello Bharata,
> > > 
> > > On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
> > > > Maybe it is a bit late to bring this up, but I needed the following fix
> > > > to userfault21 branch of your git tree to compile on powerpc.
> > > 
> > > Not late, just in time. I increased the number of syscalls in earlier
> > > versions, it must have gotten lost during a rejecting rebase, sorry.
> > > 
> > > I applied it to my tree and it can be applied to -mm and linux-next,
> > > thanks!
> > > 
> > > The syscall for arm32 are also ready and on their way to the arm tree,
> > > the testsuite worked fine there. ppc also should work fine if you
> > > could confirm it'd be interesting, just beware that I got a typo in
> > > the testcase:
> > 
> > The testsuite passes on powerpc.
> > 
> > 
> > running userfaultfd
> > 
> > nr_pages: 2040, nr_pages_per_cpu: 170
> > bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 
> > 2 96 13 128
> > bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 
> > 1 0
> > bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 
> > 114 96 1 0
> > bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
> > bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 
> > 71 20
> > bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
> > bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
> > bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
> > bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 
> > 3 6 0
> > bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 
> > 40 34 0
> > bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
> > bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
> > bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 
> > 0 0
> > bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
> > bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
> > bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
> > bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 
> > 0 180 75
> > bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 
> > 0 30
> > bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 
> > 68 1 2
> > bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 
> > 10 1
> > bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 
> > 51 37 1
> > bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 
> > 25 10
> > bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
> > bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
> > bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 
> > 7 4 5
> > bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 
> > 23 2
> > bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
> > bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
> > bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
> > bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
> > bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
> > bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
> > [PASS]
> 
> Hmm, not for me. See below.
> 
> What setup were you testing on, Bharata?

I was on commit a94572f5799dd of userfault21 branch in Andrea's tree
git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git

#uname -a
Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le 
GNU/Linux

In fact I had successfully done postcopy migration of sPAPR guest with
this setup.

> 
> Mine is:
> 
> $ uname -a
> Linux lebuntu 4.2.0-09705-g3a166acc1432 #2 SMP Tue Sep 8 15:18:00 AEST 2015 
> ppc64le ppc64le ppc64le GNU/Linux
> 
> Which is 7d9071a09502 plus a couple of powerpc patches.
> 
> $ zgrep USERFAULTFD /proc/config.gz
> CONFIG_USERFAULTFD=y
> 
> $ sudo ./userfaultfd 128 32
> nr_pages: 2048, nr_pages_per_cpu: 128
> bounces: 31, mode: rnd racing ver poll, error mutex 2 2
> error mutex 2 10



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-09-08 Thread Bharata B Rao
On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote:
> > In fact I had successfully done postcopy migration of sPAPR guest with
> > this setup.
> 
> Interesting - I'd not got that far myself on power; I was hitting a problem
> loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab 
> stream (htab_shift=25) )
> 
> Did you have to make any changes to the qemu code to get that happy?

I should have mentioned that I tried only QEMU-driven migration within
the same host, using the wp3-postcopy branch of your tree. I don't see the
above issue.

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off 
compress: off x-postcopy-ram: on 
Migration status: completed
total time: 39432 milliseconds
downtime: 162 milliseconds
setup: 14 milliseconds
transferred ram: 1297209 kbytes
throughput: 270.72 mbps
remaining ram: 0 kbytes
total ram: 4194560 kbytes
duplicate: 734015 pages
skipped: 0 pages
normal: 318469 pages
normal bytes: 1273876 kbytes
dirty sync count: 4

I will try migration between different hosts soon and check.

Regards,
Bharata.



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-09-08 Thread Bharata B Rao
On Tue, Sep 08, 2015 at 01:46:52PM +0100, Dr. David Alan Gilbert wrote:
> * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote:
> > On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote:
> > > * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote:
> > > > In fact I had successfully done postcopy migration of sPAPR guest with
> > > > this setup.
> > > 
> > > Interesting - I'd not got that far myself on power; I was hitting a 
> > > problem
> > > loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab 
> > > stream (htab_shift=25) )
> > > 
> > > Did you have to make any changes to the qemu code to get that happy?
> > 
> > I should have mentioned that I tried only QEMU driven migration within
> > the same host using wp3-postcopy branch of your tree. I don't see the
> > above issue.
> > 
> > (qemu) info migrate
> > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: 
> > off compress: off x-postcopy-ram: on 
> > Migration status: completed
> > total time: 39432 milliseconds
> > downtime: 162 milliseconds
> > setup: 14 milliseconds
> > transferred ram: 1297209 kbytes
> > throughput: 270.72 mbps
> > remaining ram: 0 kbytes
> > total ram: 4194560 kbytes
> > duplicate: 734015 pages
> > skipped: 0 pages
> > normal: 318469 pages
> > normal bytes: 1273876 kbytes
> > dirty sync count: 4
> > 
> > I will try migration between different hosts soon and check.
> 
> I hit that on the same host; are you sure you've switched into postcopy mode;
> i.e. issued a migrate_start_postcopy before the end of migration?

Sorry, I was following your discussion with Li in this thread:

https://www.marc.info/?l=qemu-devel=143035620026744=4

and it wasn't obvious to me that anything apart from turning on the
x-postcopy-ram capability was required :(

So I do see the problem now.

At the source
-
Error reading data from KVM HTAB fd: Bad file descriptor
Segmentation fault

At the target
-
htab_load() bad index 2113929216 (14336+0 entries) in htab stream 
(htab_shift=25)
qemu-system-ppc64: error while loading state section id 56(spapr/htab)
qemu-system-ppc64: postcopy_ram_listen_thread: loadvm failed: -22
qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 
0x1f: delta 0xffe1
qemu-system-ppc64: error while loading state for instance 0x0 of device 
'pci@8002000:00.0/virtio-net'
*** Error in `./ppc64-softmmu/qemu-system-ppc64': corrupted double-linked list: 
0x0100241234a0 ***
=== Backtrace: =
/lib64/power8/libc.so.6Segmentation fault



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-08-11 Thread Bharata B Rao
On Thu, May 14, 2015 at 07:31:16PM +0200, Andrea Arcangeli wrote:
 This activates the userfaultfd syscall.
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com
 ---
  arch/powerpc/include/asm/systbl.h  | 1 +
  arch/powerpc/include/uapi/asm/unistd.h | 1 +
  arch/x86/syscalls/syscall_32.tbl   | 1 +
  arch/x86/syscalls/syscall_64.tbl   | 1 +
  include/linux/syscalls.h   | 1 +
  kernel/sys_ni.c| 1 +
  6 files changed, 6 insertions(+)
 
 diff --git a/arch/powerpc/include/asm/systbl.h 
 b/arch/powerpc/include/asm/systbl.h
 index f1863a1..4741b15 100644
 --- a/arch/powerpc/include/asm/systbl.h
 +++ b/arch/powerpc/include/asm/systbl.h
 @@ -368,3 +368,4 @@ SYSCALL_SPU(memfd_create)
  SYSCALL_SPU(bpf)
  COMPAT_SYS(execveat)
  PPC64ONLY(switch_endian)
 +SYSCALL_SPU(userfaultfd)
 diff --git a/arch/powerpc/include/uapi/asm/unistd.h 
 b/arch/powerpc/include/uapi/asm/unistd.h
 index e4aa173..6ad58d4 100644
 --- a/arch/powerpc/include/uapi/asm/unistd.h
 +++ b/arch/powerpc/include/uapi/asm/unistd.h
 @@ -386,5 +386,6 @@
  #define __NR_bpf 361
  #define __NR_execveat 362
  #define __NR_switch_endian   363
 +#define __NR_userfaultfd 364

Maybe it is a bit late to bring this up, but I needed the following fix
to userfault21 branch of your git tree to compile on powerpc.


powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd

From: Bharata B Rao bhar...@linux.vnet.ibm.com

With userfaultfd syscall, the number of syscalls will be 365 on PowerPC.
Reflect the same in __NR_syscalls.

Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/unistd.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index f4f8b66..4a055b6 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -12,7 +12,7 @@
 #include <uapi/asm/unistd.h>
 
 
-#define __NR_syscalls  364
+#define __NR_syscalls  365
 
 #define __NR__exit __NR_exit
 #define NR_syscalls __NR_syscalls
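
For reference, a small sketch of how the newly wired-up syscall number can be
exercised from userspace once a kernel with this change is running; it assumes
headers that define __NR_userfaultfd (e.g. the patched uapi header above).

/* Sketch: check that the userfaultfd syscall is reachable on this kernel. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* The raw syscall returns a new userfaultfd file descriptor on success. */
    long uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) {
        perror("userfaultfd");  /* ENOSYS would mean the syscall isn't wired up */
        return 1;
    }
    printf("userfaultfd fd = %ld\n", uffd);
    close((int)uffd);
    return 0;
}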



Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall

2015-08-11 Thread Bharata B Rao
On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote:
 Hello Bharata,
 
 On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote:
   Maybe it is a bit late to bring this up, but I needed the following fix
  to userfault21 branch of your git tree to compile on powerpc.
 
 Not late, just in time. I increased the number of syscalls in earlier
 versions, it must have gotten lost during a rejecting rebase, sorry.
 
 I applied it to my tree and it can be applied to -mm and linux-next,
 thanks!
 
 The syscall for arm32 are also ready and on their way to the arm tree,
 the testsuite worked fine there. ppc also should work fine if you
 could confirm it'd be interesting, just beware that I got a typo in
 the testcase:

The testsuite passes on powerpc.


running userfaultfd

nr_pages: 2040, nr_pages_per_cpu: 170
bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 
13 128
bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0
bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 
1 0
bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2
bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20
bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6
bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0
bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57
bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0
bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 
34 0
bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0
bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0
bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0
bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0
bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0
bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0
bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 
180 75
bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30
bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 
2
bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1
bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 
37 1
bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10
bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1
bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28
bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 
5
bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2
bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0
bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0
bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0
bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0
bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0
bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1
[PASS]

Regards,
Bharata.



Re: [PATCH 07/23] KVM: PPC: Book3S: Allow reuse of vCPU object

2015-03-23 Thread Bharata B Rao
On Sat, Mar 21, 2015 at 8:28 PM, Alexander Graf ag...@suse.de wrote:


 On 20.03.15 16:51, Bharata B Rao wrote:
 On Fri, Mar 20, 2015 at 12:34:18PM +0100, Alexander Graf wrote:


 On 20.03.15 12:26, Paul Mackerras wrote:
 On Fri, Mar 20, 2015 at 12:01:32PM +0100, Alexander Graf wrote:


 On 20.03.15 10:39, Paul Mackerras wrote:
 From: Bharata B Rao bhar...@linux.vnet.ibm.com

 Since KVM isn't equipped to handle closure of vcpu fd from 
 userspace(QEMU)
 correctly, certain work arounds have to be employed to allow reuse of
 vcpu array slot in KVM during cpu hot plug/unplug from guest. One such
 proposed workaround is to park the vcpu fd in userspace during cpu unplug
 and reuse it later during next hotplug.

 More details can be found here:
 KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html
 QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html

 In order to support this workaround with PowerPC KVM, don't create or
 initialize ICP if the vCPU is found to be already associated with an ICP.

 Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com
 Signed-off-by: Paul Mackerras pau...@samba.org

 This probably makes some sense, but please make sure that user space has
 some way to figure out whether hotplug works at all.

 Bharata is working on the qemu side of all this, so I assume he has
 that covered.

 Well, so far the kernel doesn't expose anything he can query, so I
 suppose he just blindly assumes that older host kernels will randomly
 break and nobody cares. I'd rather prefer to see a CAP exposed that qemu
 can check on.

 I see that you have already taken this into your tree. I have an updated
 patch to expose a CAP. If the below patch looks ok, then let me know how
 you would prefer to take this patch in.

 Regards,
 Bharata.

 KVM: PPC: BOOK3S: Allow reuse of vCPU object

 From: Bharata B Rao bhar...@linux.vnet.ibm.com

 Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU)
 correctly, certain work arounds have to be employed to allow reuse of
 vcpu array slot in KVM during cpu hot plug/unplug from guest. One such
 proposed workaround is to park the vcpu fd in userspace during cpu unplug
 and reuse it later during next hotplug.

 More details can be found here:
 KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html
 QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html

 In order to support this workaround with PowerPC KVM, don't create or
 initialize ICP if the vCPU is found to be already associated with an ICP.
 User space (QEMU) can reuse the vCPU after checking for the availability
 of KVM_CAP_SPAPR_REUSE_VCPU capability.

 Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com
 ---
  arch/powerpc/kvm/book3s_xics.c |9 +++--
  arch/powerpc/kvm/powerpc.c |   12 
  include/uapi/linux/kvm.h   |1 +
  3 files changed, 20 insertions(+), 2 deletions(-)

 diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
 index a4a8d9f..ead3a35 100644
 --- a/arch/powerpc/kvm/book3s_xics.c
 +++ b/arch/powerpc/kvm/book3s_xics.c
 @@ -1313,8 +1313,13 @@ int kvmppc_xics_connect_vcpu(struct kvm_device *dev, 
 struct kvm_vcpu *vcpu,
   return -EPERM;
   if (xics->kvm != vcpu->kvm)
   return -EPERM;
 - if (vcpu->arch.irq_type)
 - return -EBUSY;
 +
 + /*
 +  * If irq_type is already set, don't reinitialize but
 +  * return success allowing this vcpu to be reused.
 +  */
 + if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
 + return 0;

   r = kvmppc_xics_create_icp(vcpu, xcpu);
   if (!r)
 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 27c0fac..5b7007c 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -564,6 +564,18 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
 ext)
   r = 1;
   break;
  #endif
 + case KVM_CAP_SPAPR_REUSE_VCPU:
 + /*
 +  * Kernel currently doesn't support closing of vCPU fd from
 +  * user space (QEMU) correctly. Hence the option available
 +  * is to park the vCPU fd in user space whenever a guest
 +  * CPU is hot removed and reuse the same later when another
 +  * guest CPU is hotplugged. This capability determines whether
 +  * it is safe to assume if parking of vCPU fd and reuse from
 +  * user space works for sPAPR guests.

 I don't see how the code you're changing here has anything to do with
 parking vcpus. It's all about being able to call connect on an already
 connected vcpu and not erroring out. Please reflect this in the cap name
 and description.

 You also need to update Documentation/virtual/kvm/api.txt.

 Furthermore, thinking about this a bit more, I might still miss the
 exact case why you need this. Why is QEMU issuing a connect again? Could
 it maybe just not do it?

Thinking


Re: [PATCH 07/23] KVM: PPC: Book3S: Allow reuse of vCPU object

2015-03-20 Thread Bharata B Rao
On Fri, Mar 20, 2015 at 12:34:18PM +0100, Alexander Graf wrote:
 
 
 On 20.03.15 12:26, Paul Mackerras wrote:
  On Fri, Mar 20, 2015 at 12:01:32PM +0100, Alexander Graf wrote:
 
 
  On 20.03.15 10:39, Paul Mackerras wrote:
  From: Bharata B Rao bhar...@linux.vnet.ibm.com
 
  Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU)
  correctly, certain work arounds have to be employed to allow reuse of
  vcpu array slot in KVM during cpu hot plug/unplug from guest. One such
  proposed workaround is to park the vcpu fd in userspace during cpu unplug
  and reuse it later during next hotplug.
 
  More details can be found here:
  KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html
  QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html
 
  In order to support this workaround with PowerPC KVM, don't create or
  initialize ICP if the vCPU is found to be already associated with an ICP.
 
  Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com
  Signed-off-by: Paul Mackerras pau...@samba.org
 
  This probably makes some sense, but please make sure that user space has
  some way to figure out whether hotplug works at all.
  
  Bharata is working on the qemu side of all this, so I assume he has
  that covered.
 
 Well, so far the kernel doesn't expose anything he can query, so I
 suppose he just blindly assumes that older host kernels will randomly
 break and nobody cares. I'd rather prefer to see a CAP exposed that qemu
 can check on.

I see that you have already taken this into your tree. I have an updated
patch to expose a CAP. If the below patch looks ok, then let me know how
you would prefer to take this patch in.

Regards,
Bharata.

KVM: PPC: BOOK3S: Allow reuse of vCPU object

From: Bharata B Rao bhar...@linux.vnet.ibm.com

Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU)
correctly, certain work arounds have to be employed to allow reuse of
vcpu array slot in KVM during cpu hot plug/unplug from guest. One such
proposed workaround is to park the vcpu fd in userspace during cpu unplug
and reuse it later during next hotplug.

More details can be found here:
KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html
QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html

In order to support this workaround with PowerPC KVM, don't create or
initialize ICP if the vCPU is found to be already associated with an ICP.
User space (QEMU) can reuse the vCPU after checking for the availability
of KVM_CAP_SPAPR_REUSE_VCPU capability.

Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_xics.c |9 +++--
 arch/powerpc/kvm/powerpc.c |   12 
 include/uapi/linux/kvm.h   |1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index a4a8d9f..ead3a35 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1313,8 +1313,13 @@ int kvmppc_xics_connect_vcpu(struct kvm_device *dev, 
struct kvm_vcpu *vcpu,
return -EPERM;
if (xics->kvm != vcpu->kvm)
return -EPERM;
-   if (vcpu->arch.irq_type)
-   return -EBUSY;
+
+   /*
+    * If irq_type is already set, don't reinitialize but
+    * return success allowing this vcpu to be reused.
+    */
+   if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
+   return 0;
 
r = kvmppc_xics_create_icp(vcpu, xcpu);
if (!r)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 27c0fac..5b7007c 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -564,6 +564,18 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
r = 1;
break;
 #endif
+   case KVM_CAP_SPAPR_REUSE_VCPU:
+   /*
+* Kernel currently doesn't support closing of vCPU fd from
+* user space (QEMU) correctly. Hence the option available
+* is to park the vCPU fd in user space whenever a guest
+* CPU is hot removed and reuse the same later when another
+* guest CPU is hotplugged. This capability determines whether
+* it is safe to assume if parking of vCPU fd and reuse from
+* user space works for sPAPR guests.
+*/
+   r = 1;
+   break;
default:
r = 0;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8055706..8464755 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -760,6 +760,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_ENABLE_HCALL 104
 #define KVM_CAP_CHECK_EXTENSION_VM 105
 #define KVM_CAP_S390_USER_SIGP 106
+#define KVM_CAP_SPAPR_REUSE_VCPU 107
 
 #ifdef KVM_CAP_IRQ_ROUTING
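
For completeness, a rough sketch of how userspace (QEMU or a test program)
could probe this capability with the standard KVM_CHECK_EXTENSION ioctl. Note
that KVM_CAP_SPAPR_REUSE_VCPU exists only with the proposed patch applied; the
fallback define below just mirrors the value from the hunk above.

/* Sketch: probe the proposed capability via KVM_CHECK_EXTENSION. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

#ifndef KVM_CAP_SPAPR_REUSE_VCPU
#define KVM_CAP_SPAPR_REUSE_VCPU 107   /* value from the proposed uapi change */
#endif

int main(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0) {
        perror("open /dev/kvm");
        return 1;
    }

    /* KVM_CHECK_EXTENSION on the system fd returns > 0 if the cap is present. */
    int r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_REUSE_VCPU);
    printf("vCPU reuse capability: %s\n", r > 0 ? "present" : "absent");

    close(kvm_fd);
    return 0;
}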
 


Re: [Qemu-devel] KVM call agenda for September 25th

2012-09-25 Thread Bharata B Rao
On Tue, Sep 25, 2012 at 04:51:15PM +0200, Kevin Wolf wrote:
 Am 25.09.2012 14:57, schrieb Anthony Liguori:
  qemu -device \
isa-serial,index=0,chr=tcp://localhost:1025/?server=on&wait=off
 
 Your examples kind of prove this: They aren't much shorter than what
 exists today, but they contain ? and &, which are nasty characters on
 the command line.

Right. '&' can't even be specified directly on the command line since that will
result in the qemu command being treated as a background job, with anything
after '&' being discarded. I realized that '&' needs to be escaped as %26.

Regards,
Bharata.



Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Bharata B Rao
On Tue, Nov 08, 2011 at 09:33:04AM -0800, Chris Wright wrote:
 * Alexander Graf (ag...@suse.de) wrote:
  On 29.10.2011, at 20:45, Bharata B Rao wrote:
   As guests become NUMA aware, it becomes important for the guests to
   have correct NUMA policies when they run on NUMA aware hosts.
   Currently limited support for NUMA binding is available via libvirt
   where it is possible to apply a NUMA policy to the guest as a whole.
   However multinode guests would benefit if guest memory belonging to
   different guest nodes are mapped appropriately to different host NUMA 
   nodes.
   
   To achieve this we would need QEMU to expose information about
   guest RAM ranges (Guest Physical Address - GPA) and their host virtual
   address mappings (Host Virtual Address - HVA). Using GPA and HVA, any 
   external
   tool like libvirt would be able to divide the guest RAM as per the guest 
   NUMA
   node geometry and bind guest memory nodes to corresponding host memory 
   nodes
   using HVA. This needs both QEMU (and libvirt) changes as well as changes
   in the kernel.
  
  Ok, let's take a step back here. You are basically growing libvirt into a 
  memory resource manager that know how much memory is available on which 
  nodes and how these nodes would possibly fit into the host's memory layout.
  
  Shouldn't that be the kernel's job? It seems to me that architecturally the 
  kernel is the place I would want my memory resource controls to be in.
 
 I think that both Peter and Andrea are looking at this.  Before we commit
 an API to QEMU that has a different semantic than a possible new kernel
 interface (that perhaps QEMU could use directly to inform kernel of the
 binding/relationship between vcpu thread and it's memory at VM startuup)
 it would be useful to see what these guys are working on...

I looked at Peter's recent work in this area.
(https://lkml.org/lkml/2011/11/17/204)

It introduces two interfaces:

1. ms_tbind() to bind a thread to a memsched(*) group
2. ms_mbind() to bind a memory region to memsched group

I assume the 2nd interface could be used by QEMU to create
memsched groups for each of the guest NUMA node memory regions.

In the past, Anthony has said that NUMA binding should be done from outside
of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041).
Though that was in a different context, maybe we should re-look at that
and see if QEMU still sticks to that position. I know it's a bit early, but if
needed we should ask Peter to consider extending ms_mbind() to take a tid
parameter too instead of working on the current task by default.

(*) memsched: An abstraction for representing coupling of threads with virtual
address ranges. Threads and virtual address ranges of a memsched group are
guaranteed (?) to be located on the same node.

Regards,
Bharata.



Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

2011-11-21 Thread Bharata B Rao
On Mon, Nov 21, 2011 at 04:25:26PM +0100, Peter Zijlstra wrote:
 On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:
 
  I looked at Peter's recent work in this area.
  (https://lkml.org/lkml/2011/11/17/204)
  
  It introduces two interfaces:
  
  1. ms_tbind() to bind a thread to a memsched(*) group
  2. ms_mbind() to bind a memory region to memsched group
  
  I assume the 2nd interface could be used by QEMU to create
  memsched groups for each of guest NUMA node memory regions.
 
 No, you would need both, you'll need to group vcpu threads _and_ some
 vaddress space together.
 
 I understood QEMU currently uses a single big anonymous mmap() to
 allocate the guest memory, using this you could either use multiple or
 carve up the big alloc into virtual nodes by assigning different parts
 to different ms groups.
 
 Example: suppose you want to create a 2 node guest with 8 vcpus, create
 2 ms groups, each with 4 vcpu threads and assign half the total guest
 mmap to either.
 
  In the past, Anthony has said that NUMA binding should be done from outside
  of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)
 
 If you want to expose a sense of virtual NUMA to your guest you really
 have no choice there. The only thing you can do externally is run whole
 VMs inside one particular node.
 
  Though that was in a different context, may be we should re-look at that
  and see if QEMU still sticks to that. I know its a bit early, but if needed
  we should ask Peter to consider extending ms_mbind() to take a tid parameter
  too instead of working on current task by default.
 
 Uh, what for? ms_mbind() works on the current process, not task.

In the original post of this mail thread, I proposed a way to export
guest RAM ranges (Guest Physical Address, GPA) and their corresponding
host virtual mappings (Host Virtual Address, HVA) from QEMU (via the QEMU
monitor). The idea was to use these GPA-to-HVA mappings from tools like
libvirt to bind specific parts of the guest RAM to different host nodes.
This needed an extension to the existing mbind() to allow binding the memory
of one process (QEMU) from a different process (libvirt), since we wanted to
do all this from libvirt.

Hence I was coming from that background when I asked for extending
ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA
binding should all be done from outside of QEMU, such an extension is needed;
otherwise what you have should be sufficient.
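
For context, here is a minimal sketch of what the in-process equivalent looks
like with the existing interface: binding one guest node's part of the
(anonymous) guest RAM mapping to a host node with mbind(2). The size and node
number are made up for illustration, and the point of the discussion above is
that libvirt cannot issue this on QEMU's behalf today.

/* Sketch: bind one guest node's RAM range to a host NUMA node from within the
 * process that owns the mapping (what QEMU itself could do with mbind() today).
 * Build with -lnuma; the size and node number are illustrative only. */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t guest_node_size = 1UL << 30;            /* 1 GiB, example value */
    void *hva = mmap(NULL, guest_node_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (hva == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    unsigned long nodemask = 1UL << 1;             /* host node 1, example */
    if (mbind(hva, guest_node_size, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, 0) != 0) {
        perror("mbind");
        return 1;
    }

    printf("bound %zu bytes at %p to host node 1\n", guest_node_size, hva);
    return 0;
}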

Regards,
Bharata.



Re: [RFC] CPU hard limits

2009-06-07 Thread Bharata B Rao
On Sun, Jun 07, 2009 at 09:04:49AM +0300, Avi Kivity wrote:
 Bharata B Rao wrote:
 On Fri, Jun 05, 2009 at 09:01:50AM +0300, Avi Kivity wrote:
   
 Bharata B Rao wrote:
 
 But could there be client models where you are required to strictly
 adhere to the limit within the bandwidth and not provide more (by advancing
 the bandwidth period) in the presence of idle cycles ?
 
 That's the limit part.  I'd like to be able to specify limits and   
 guarantees on the same host and for the same groups; I don't think 
 that  works when you advance the bandwidth period.

 I think we need to treat guarantees as first-class goals, not 
 something  derived from limits (in fact I think guarantees are more 
 useful as they  can be used to provide SLAs).
 

 I agree that guarantees are important, but I am not sure about

 1. specifying both limits and guarantees for groups and
   

 Why would you allow specifying a lower bound for cpu usage (a  
 guarantee), and upper bound (a limit), but not both?

I was saying that we specify only limits and not guarantees, since the
guarantees can be worked out from the limits. The initial thinking was that
the kernel would be made aware of only limits and users could set the limits
appropriately to obtain the desired guarantees. I understand your
concerns/objections on this and we will address them in the next version of
the RFC, as Balbir said.

Regards,
Bharata.


Re: [RFC] CPU hard limits

2009-06-05 Thread Bharata B Rao
On Fri, Jun 05, 2009 at 09:03:37AM +0300, Avi Kivity wrote:
 Balbir Singh wrote:
 I think so.  Given guarantees G1..Gn (0 <= Gi <= 1; sum(Gi) <= 1), 
 and a  cpu hog running in each group, how would the algorithm divide 
 resources?

 

 As per the matrix calculation, but as soon as we reach an idle point,
 we redistribute the b/w and start a new quantum so to speak, where all
 groups are charged up to their hard limits.

 For your question, if there is a CPU hog running, it would be as per
 the matrix calculation, since the system has no idle point during the
 bandwidth period.
   

 So the groups with guarantees get a priority boost.  That's not a good  
 side effect.

That happens only in the presence of idle cycles when other groups [with or
without guarantees] have nothing useful to do. So how would that matter
since there is nothing else to run anyway ?

Regards,
Bharata.


Re: [RFC] CPU hard limits

2009-06-05 Thread Bharata B Rao
On Fri, Jun 05, 2009 at 09:01:50AM +0300, Avi Kivity wrote:
 Bharata B Rao wrote:

 But could there be client models where you are required to strictly
 adhere to the limit within the bandwidth and not provide more (by advancing
 the bandwidth period) in the presence of idle cycles ?
   

 That's the limit part.  I'd like to be able to specify limits and  
 guarantees on the same host and for the same groups; I don't think that  
 works when you advance the bandwidth period.

 I think we need to treat guarantees as first-class goals, not something  
 derived from limits (in fact I think guarantees are more useful as they  
 can be used to provide SLAs).

I agree that guarantees are important, but I am not sure about

1. specifying both limits and guarantees for groups and
2. not deriving guarantees from limits.

Guarantees are met by some form of throttling or limiting and hence I think
limiting should drive the guarantees.

Regards,
Bharata.


Re: [RFC] CPU hard limits

2009-06-05 Thread Bharata B Rao
On Fri, Jun 05, 2009 at 01:53:15AM -0700, Paul Menage wrote:
 On Wed, Jun 3, 2009 at 10:36 PM, Bharata B
 Raobhar...@linux.vnet.ibm.com wrote:
  - Hard limits can be used to provide guarantees.
 
 
 This claim (and the subsequent long thread it generated on how limits
 can provide guarantees) confused me a bit.
 
 Why do we need limits to provide guarantees when we can already
 provide guarantees via shares?

The shares design is proportional and hence can't by itself provide
guarantees.

 
 Suppose 10 cgroups each want 10% of the machine's CPU. We can just
 give each cgroup an equal share, and they're guaranteed 10% if they
 try to use it; if they don't use it, other cgroups can get access to
 the idle cycles.

Now if an 11th group with the same shares comes in, each group will
get only about 9% of the CPU and that 10% guarantee breaks.
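
A tiny sketch of that arithmetic, assuming the default cpu.shares value of
1024 for every group: under proportional shares a group's CPU fraction is its
share divided by the sum of all runnable groups' shares, so the 11th group
dilutes everyone from 10% to roughly 9.09%.

/* Sketch: proportional shares dilute as equal-share groups are added. */
#include <stdio.h>

static double cpu_fraction(double my_share, int nr_groups, double each_share)
{
    return my_share / (nr_groups * each_share);   /* all groups assumed busy */
}

int main(void)
{
    printf("10 groups: %.2f%% each\n", 100.0 * cpu_fraction(1024, 10, 1024));
    printf("11 groups: %.2f%% each\n", 100.0 * cpu_fraction(1024, 11, 1024));
    return 0;
}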

Regards,
Bharata.


Re: [RFC] CPU hard limits

2009-06-04 Thread Bharata B Rao
On Thu, Jun 04, 2009 at 03:19:22PM +0300, Avi Kivity wrote:
 Bharata B Rao wrote:
 2. Need for hard limiting CPU resource
 --
 - Pay-per-use: In enterprise systems that cater to multiple clients/customers
   where a customer demands a certain share of CPU resources and pays only
   that, CPU hard limits will be useful to hard limit the customer's job
   to consume only the specified amount of CPU resource.
 - In container based virtualization environments running multiple containers,
   hard limits will be useful to ensure a container doesn't exceed its
   CPU entitlement.
 - Hard limits can be used to provide guarantees.
   
 How can hard limits provide guarantees?

 Let's take an example where I have 1 group that I wish to guarantee a  
 20% share of the cpu, and anther 8 groups with no limits or guarantees.

 One way to achieve the guarantee is to hard limit each of the 8 other  
 groups to 10%; the sum total of the limits is 80%, leaving 20% for the  
 guarantee group. The downside is the arbitrary limit imposed on the  
 other groups.

This method sounds very similar to the openvz method:
http://wiki.openvz.org/Containers/Guarantees_for_resources


 Another way is to place the 8 groups in a container group, and limit  
 that to 80%. But that doesn't work if I want to provide guarantees to  
 several groups.

Hmm why not ? Reduce the guarantee of the container group and provide
the same to additional groups ?
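
To make the arithmetic in the quoted example concrete, here is a small sketch
of one way to derive limits from guarantees in the openvz style referenced
above: give each group limit[i] = guarantee[i] + (1 - sum of all guarantees) /
(N - 1), so that the other groups' limits always add up to exactly
1 - guarantee[i]. This is only a model of the calculation, not scheduler code,
and it also ends up capping the guaranteed group itself.

/* Sketch: derive per-group hard limits from guarantees so that each group's
 * guarantee holds even when every other group runs flat out at its limit. */
#include <stdio.h>

int main(void)
{
    /* The example above: one group guaranteed 20%, eight groups with none. */
    double g[] = { 0.20, 0, 0, 0, 0, 0, 0, 0, 0 };
    int n = sizeof(g) / sizeof(g[0]);

    double sum_g = 0.0;
    for (int i = 0; i < n; i++)
        sum_g += g[i];

    /* Spare bandwidth shared equally among the N-1 "other" groups. */
    double slack = (1.0 - sum_g) / (n - 1);
    for (int i = 0; i < n; i++)
        printf("group %d: guarantee %4.1f%%  hard limit %4.1f%%\n",
               i, 100.0 * g[i], 100.0 * (g[i] + slack));
    /* Prints a 30% limit for the guaranteed group and 10% for each of the
     * eight others, i.e. at most 80% can ever be consumed by the others. */
    return 0;
}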

Regards,
Bharata.


Re: [RFC] CPU hard limits

2009-06-04 Thread Bharata B Rao
On Fri, Jun 05, 2009 at 01:27:55PM +0800, Balbir Singh wrote:
 * Avi Kivity a...@redhat.com [2009-06-05 08:21:43]:
 
  Balbir Singh wrote:
  But then there is no other way to make a *guarantee*, guarantees come
  at a cost of idling resources, no? Can you show me any other
  combination that will provide the guarantee and without idling the
  system for the specified guarantees?
  
 
  OK, I see part of your concern, but I think we could do some
  optimizations during design. For example if all groups have reached
  their hard-limit and the system is idle, should we do start a new hard
  limit interval and restart, so that idleness can be removed. Would
  that be an acceptable design point?
 
  I think so.  Given guarantees G1..Gn (0 <= Gi <= 1; sum(Gi) <= 1), and a  
  cpu hog running in each group, how would the algorithm divide resources?
 
 
 As per the matrix calculation, but as soon as we reach an idle point,
 we redistribute the b/w and start a new quantum so to speak, where all
 groups are charged up to their hard limits.

But could there be client models where you are required to strictly
adhere to the limit within the bandwidth and not provide more (by advancing
the bandwidth period) in the presence of idle cycles ?

Regards,
Bharata.


[RFC] CPU hard limits

2009-06-03 Thread Bharata B Rao
Hi,

This is an RFC about the CPU hard limits feature where I have explained
the need for the feature, the proposed plan and the issues around it.
Before I come up with an implementation for hard limits, I would like to
know community's thoughts on this scheduler enhancement and any feedback
and suggestions.

Regards,
Bharata.

1. CPU hard limit
2. Need for hard limiting CPU resource
3. Granularity of enforcing CPU hard limits
4. Existing solutions
5. Specifying hard limits
6. Per task group vs global bandwidth period
7. Configuring
8. Throttling of tasks
9. Group scheduler hierarchy considerations
10. SMP considerations
11. Starvation
12. Hard limit and fairness

1. CPU hard limit
-
CFS is a proportional share scheduler which tries to divide the CPU time
proportionately between tasks or groups of tasks (task group/cgroup) depending
on the priority/weight of the task or shares assigned to groups of tasks.
In CFS, a task/task group can get more than its share of CPU if there are
enough idle CPU cycles available in the system, due to the work conserving
nature of the scheduler.

However there are scenarios (Sec 2) where giving more than the desired
CPU share to a task/task group is not acceptable. In those scenarios, the
scheduler needs to put a hard stop on the CPU resource consumption of
task/task group if it exceeds a preset limit. This is usually achieved by
throttling the task/task group when it fully consumes its allocated CPU time.

2. Need for hard limiting CPU resource
--
- Pay-per-use: In enterprise systems that cater to multiple clients/customers
  where a customer demands a certain share of CPU resources and pays only
  that, CPU hard limits will be useful to hard limit the customer's job
  to consume only the specified amount of CPU resource.
- In container based virtualization environments running multiple containers,
  hard limits will be useful to ensure a container doesn't exceed its
  CPU entitlement.
- Hard limits can be used to provide guarantees.

3. Granularity of enforcing CPU hard limits
---
Conceptually, hard limits can be enforced either for individual tasks or for
groups of tasks. However, enforcing limits per task would be too fine
grained and would be a lot of work on the part of the system administrator
in terms of setting limits for every task. Based on the current understanding
of the users of this feature, it is felt that hard limiting is more useful
at the task group level than at the individual task level. Hence in the
subsequent paragraphs, the concept of hard limit as applicable to task
group/cgroup is discussed.

4. Existing solutions
-
- Both Linux-VServer and OpenVZ virtualization solutions support CPU hard
  limiting.
- Per task limit can be enforced using rlimits, but it is not rate based.

5. Specifying hard limits
-
CPU time consumed by a task group is generally measured over a
time period (called bandwidth period) and the task group gets throttled
when its CPU time reaches a limit (hard limit) within a bandwidth period.
The task group remains throttled until the bandwidth period gets
renewed, at which time additional CPU time becomes available
to the tasks in the system.

When a task group's hard limit is specified as a ratio X/Y, it means that
the group will get throttled if its CPU time consumption exceeds X seconds
in a bandwidth period of Y seconds.

Specifying the hard limit as X/Y requires us to specify the bandwidth
period also.

Is having a uniform/same bandwidth period for all the groups an option ?
If so, we could even specify the hard limit as a percentage, like
30% of a uniform bandwidth period.

6. Per task group vs global bandwidth period

The bandwidth period can either be per task group or global. With a global
bandwidth period, the runtimes of all the task groups need to be
replenished when the period ends. Though this appears conceptually simple,
the implementation might not scale. Instead, if every task group maintains its
bandwidth period separately, the refresh cycles of each group happen
independently of each other. Moreover, different groups might prefer different
bandwidth periods. Hence the first implementation will have a per task group
bandwidth period.

Timers can be used to trigger bandwidth refresh cycles. (similar to rt group
sched)
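
To illustrate the X/Y semantics and the per-group refresh described in
sections 5 and 6, here is a small userspace model of the accounting (a toy
sketch, not kernel code): runtime is charged against a per-group quota X, the
group is throttled once the quota is exhausted, and the per-group period timer
refresh unthrottles it.

/* Toy model of the proposed X/Y hard-limit semantics (not kernel code). */
#include <stdbool.h>
#include <stdio.h>

struct group {
    long long quota_ns;     /* X: allowed runtime per bandwidth period */
    long long period_ns;    /* Y: bandwidth period (driven by an external timer) */
    long long used_ns;      /* runtime consumed in the current period */
    bool throttled;
};

/* Called when the per-group period timer fires: replenish and unthrottle. */
static void refresh_period(struct group *g)
{
    g->used_ns = 0;
    g->throttled = false;
}

/* Called from the accounting path after the group ran for delta_ns. */
static void account_runtime(struct group *g, long long delta_ns)
{
    g->used_ns += delta_ns;
    if (g->used_ns >= g->quota_ns)
        g->throttled = true;   /* keep the group off the CPU until refresh */
}

int main(void)
{
    /* Example: X = 250 ms in every Y = 1 s period, i.e. a 25% hard limit. */
    struct group g = { 250000000LL, 1000000000LL, 0, false };

    account_runtime(&g, 200000000LL);
    printf("after 200ms: throttled=%d\n", g.throttled);   /* 0 */
    account_runtime(&g, 100000000LL);
    printf("after 300ms: throttled=%d\n", g.throttled);   /* 1 */
    refresh_period(&g);
    printf("after refresh: throttled=%d\n", g.throttled); /* 0 */
    return 0;
}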

7. Configuring
--
- User could set the hard limit (X and/or Y) through the cgroup fs.
- When the scheduler supports hard limiting, should it be enabled
  for all tasks groups of the system ? Or should user have an option
  to enable hard limiting per group ?
- When hard limiting is enabled for a group, should the limit be
  set to a default to start with ? Or should the user set the limit
  and the bandwidth before enabling the hard limiting ?
- What should be a sane default value for the bandwidth period ?

8. Throttling of