Re: [PATCH v4 17/33] dimm: abstract dimm device from pc-dimm
On Mon, Oct 19, 2015 at 6:24 AM, Xiao Guangrong wrote: > A base device, dimm, is abstracted from pc-dimm, so that we can > build nvdimm device based on dimm in the later patch > > Signed-off-by: Xiao Guangrong > --- > default-configs/i386-softmmu.mak | 1 + > default-configs/x86_64-softmmu.mak | 1 + > hw/mem/Makefile.objs | 3 ++- > hw/mem/dimm.c | 11 ++--- > hw/mem/pc-dimm.c | 46 > ++ > include/hw/mem/dimm.h | 4 ++-- > include/hw/mem/pc-dimm.h | 7 ++ > 7 files changed, 61 insertions(+), 12 deletions(-) > create mode 100644 hw/mem/pc-dimm.c > create mode 100644 include/hw/mem/pc-dimm.h > > diff --git a/default-configs/i386-softmmu.mak > b/default-configs/i386-softmmu.mak > index 43c96d1..3ece8bb 100644 > --- a/default-configs/i386-softmmu.mak > +++ b/default-configs/i386-softmmu.mak > @@ -18,6 +18,7 @@ CONFIG_FDC=y > CONFIG_ACPI=y > CONFIG_ACPI_X86=y > CONFIG_ACPI_X86_ICH=y > +CONFIG_DIMM=y > CONFIG_ACPI_MEMORY_HOTPLUG=y > CONFIG_ACPI_CPU_HOTPLUG=y > CONFIG_APM=y > diff --git a/default-configs/x86_64-softmmu.mak > b/default-configs/x86_64-softmmu.mak > index dfb8095..92ea7c1 100644 > --- a/default-configs/x86_64-softmmu.mak > +++ b/default-configs/x86_64-softmmu.mak > @@ -18,6 +18,7 @@ CONFIG_FDC=y > CONFIG_ACPI=y > CONFIG_ACPI_X86=y > CONFIG_ACPI_X86_ICH=y > +CONFIG_DIMM=y Same change needs to be done in default-configs/ppc64-softmmu.mak too. Regards, Bharata. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
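The ppc64 counterpart Bharata asks for would be the same one-line addition. A sketch of the fragment (the exact neighbouring lines in that file may differ, so placement here is approximate):

```
# default-configs/ppc64-softmmu.mak -- proposed addition, placement approximate
CONFIG_DIMM=y
```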
Re: [PATCH v3 00/32] implement vNVDIMM
Xiao, Are these patches present in any git tree so that they can be easily tried out? Regards, Bharata. On Sun, Oct 11, 2015 at 9:22 AM, Xiao Guangrong wrote: > Changelog in v3: > There is a huge change in this version; thanks to Igor, Stefan, Paolo, Eduardo, > Michael for their valuable comments, the patchset finally gets better shape. > - changes from Igor's comments: > 1) abstract dimm device type from pc-dimm and create nvdimm device based on > dimm, then it uses memory backend device as nvdimm's memory and NUMA has > easily been implemented. > 2) let file-backend device support any kind of filesystem not only for > hugetlbfs and let it work on file not only for directory which is > achieved by extending 'mem-path' - if it's a directory then it works as > current behavior, otherwise if it's file then directly allocates memory > from it. > 3) we figure out an unused memory hole below 4G that is 0xFF0 ~ > 0xFFF0, this range is large enough for NVDIMM ACPI as building a 64-bit > ACPI SSDT/DSDT table will break Windows XP. > BTW, making only SSDT.rev = 2 cannot work since the width depends only > on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block) > in ACPI spec: > | Note: For compatibility with ACPI versions before ACPI 2.0, the bit > | width of Integer objects is dependent on the ComplianceRevision of the DSDT. > | If the ComplianceRevision is less than 2, all integers are restricted to 32 > | bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets > | the global integer width for all integers, including integers in SSDTs. > 4) use the lowest ACPI spec version to document AML terms. > 5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm" > > - changes from Stefan's comments: > 1) do not do endian adjustment in-place since _DSM memory is visible to > guest > 2) use target platform's target page size instead of fixed PAGE_SIZE > definition > 3) lots of code style improvement and typo fixes.
> 4) live migration fix > - changes from Paolo's comments: > 1) improve the name of memory region > > - other changes: > 1) return exact buffer size for _DSM method instead of the page size. > 2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm > devices. > 3) NUMA support > 4) implement _FIT method > 5) rename "configdata" to "reserve-label-data" > 6) simplify _DSM arg3 determination > 7) main changelog update to let it reflect v3. > > Changelog in v2: > - Use little endian for DSM method, thanks for Stefan's suggestion > > - introduce a new parameter, @configdata; if it's false, Qemu will > build a static and readonly namespace in memory and use it to serve > DSM GET_CONFIG_SIZE/GET_CONFIG_DATA requests. In this case, no > reserved region is needed at the end of the @file, which is good for > the user who wants to pass a whole nvdimm device and make its data > completely visible to the guest > > - divide the source code into separate files and add maintainer info > > BTW, PCOMMIT virtualization on the KVM side is work in progress, and hopefully will > be posted next week > > == Background == > NVDIMM (A Non-Volatile Dual In-line Memory Module) is going to be supported > on Intel's platform. They are discovered via ACPI and configured by the _DSM > method of the NVDIMM device in ACPI.
There are some supporting documents, which > can be found at: > ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf > NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf > DSM Interface Example: > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf > Driver Writer's Guide: > http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf > > Currently, the NVDIMM driver has been merged into the upstream Linux kernel and > this patchset tries to enable it in the virtualization field > > == Design == > NVDIMM supports two access modes: one is PMEM, which maps NVDIMM into the CPU's > address space so that the CPU can directly access it as normal memory; the other is > BLK, which is used as a block device to reduce the consumption of CPU address > space > > BLK mode accesses NVDIMM via a Command Register window and a Data Register window. > BLK virtualization has a high workload since each sector access will cause at > least two VM-exits. So we currently only implement vPMEM in this patchset > > --- vPMEM design --- > We introduce a new device named "nvdimm"; it uses a memory backend device as > NVDIMM memory. The file in the file-backend device can be a regular file or a block > device. We can use any file when we do test or emulation; however, > in the real world, the files passed to the guest are: > - a regular file in a filesystem with DAX enabled created on an NVDIMM device > on the host > - the raw PMEM device on the host, e.g. /dev/pmem0 > Memory access on the address created by mmap on these kinds of files can > directly reach the NVDIMM
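The PMEM access path described above boils down to mmap()ing the backing file and touching the mapping directly. A minimal host-side sketch in C of that idea (the path and function name are stand-ins for illustration; in real use the file would live on a DAX-enabled filesystem, or be a raw device such as /dev/pmem0):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map a backing file and hand back the mapping, which is the essence of the
 * vPMEM path: loads and stores on the returned pointer go straight to the
 * backing file. The caller picks the path; here it is just a regular file.
 */
static char *map_backing_file(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the fd is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

As the cover letter notes, any ordinary file works for testing; only on real NVDIMM-backed files do the stores actually reach persistent media.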
Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
On Tue, Sep 08, 2015 at 04:08:06PM +1000, Michael Ellerman wrote: > On Wed, 2015-08-12 at 10:53 +0530, Bharata B Rao wrote: > > On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote: > > > Hello Bharata, > > > > > > On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote: > > > > May be it is a bit late to bring this up, but I needed the following fix > > > > to userfault21 branch of your git tree to compile on powerpc. > > > > > > Not late, just in time. I increased the number of syscalls in earlier > > > versions, it must have gotten lost during a rejecting rebase, sorry. > > > > > > I applied it to my tree and it can be applied to -mm and linux-next, > > > thanks! > > > > > > The syscall for arm32 are also ready and on their way to the arm tree, > > > the testsuite worked fine there. ppc also should work fine if you > > > could confirm it'd be interesting, just beware that I got a typo in > > > the testcase: > > > > The testsuite passes on powerpc. > > > > > > running userfaultfd > > > > nr_pages: 2040, nr_pages_per_cpu: 170 > > bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 > > 2 96 13 128 > > bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 > > 1 0 > > bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 > > 114 96 1 0 > > bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2 > > bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 > > 71 20 > > bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6 > > bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 > > 0 > > bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57 > > bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 > > 3 6 0 > > bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 > > 40 34 0 > > bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 
8 0 0 0 0 > > bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0 > > bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 > > 0 0 > > bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0 > > bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0 > > bounces: 16, mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0 > > bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 > > 0 180 75 > > bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 > > 0 30 > > bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 > > 68 1 2 > > bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 > > 10 1 > > bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 > > 51 37 1 > > bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 > > 25 10 > > bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1 > > bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28 > > bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 > > 7 4 5 > > bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 > > 23 2 > > bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0 > > bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0 > > bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0 > > bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0 > > bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0 > > bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1 > > [PASS] > > Hmm, not for me. See below. > > What setup were you testing on Bharata? 
I was on commit a94572f5799dd of userfault21 branch in Andrea's tree git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git #uname -a Linux 4.1.0-rc8+ #1 SMP Tue Aug 11 11:33:50 IST 2015 ppc64le ppc64le ppc64le GNU/Linux In fact I had successfully done postcopy migration of sPAPR guest with this setup. > > Mine is: > > $ uname -a > Linux lebuntu 4.2.0-09705-g3a166acc1432 #2 SMP Tue Sep 8 15:18:00 AEST 2015 > ppc64le ppc64le ppc64le GNU/Linux > > Which is 7d9071a09502 plus a couple of powerpc patches. > > $ zgrep USERFAULTFD /proc/config.gz > CONFIG_USERFAULTFD=y > > $ sudo ./userfaultfd 128 32 > nr_pages: 2048, nr_pages_per_cpu: 128 > bounces: 31, mode: rnd racing ver poll, error mutex 2 2 > error mutex 2 10
Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote: > * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote: > > In fact I had successfully done postcopy migration of sPAPR guest with > > this setup. > > Interesting - I'd not got that far myself on power; I was hitting a problem > loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab > stream (htab_shift=25) ) > > Did you have to make any changes to the qemu code to get that happy? I should have mentioned that I tried only QEMU driven migration within the same host using wp3-postcopy branch of your tree. I don't see the above issue. (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off x-postcopy-ram: on Migration status: completed total time: 39432 milliseconds downtime: 162 milliseconds setup: 14 milliseconds transferred ram: 1297209 kbytes throughput: 270.72 mbps remaining ram: 0 kbytes total ram: 4194560 kbytes duplicate: 734015 pages skipped: 0 pages normal: 318469 pages normal bytes: 1273876 kbytes dirty sync count: 4 I will try migration between different hosts soon and check. Regards, Bharata.
Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
On Tue, Sep 08, 2015 at 01:46:52PM +0100, Dr. David Alan Gilbert wrote: > * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote: > > On Tue, Sep 08, 2015 at 09:59:47AM +0100, Dr. David Alan Gilbert wrote: > > > * Bharata B Rao (bhar...@linux.vnet.ibm.com) wrote: > > > > In fact I had successfully done postcopy migration of sPAPR guest with > > > > this setup. > > > > > > Interesting - I'd not got that far myself on power; I was hitting a > > > problem > > > loading htab ( htab_load() bad index 2113929216 (14848+0 entries) in htab > > > stream (htab_shift=25) ) > > > > > > Did you have to make any changes to the qemu code to get that happy? > > > > I should have mentioned that I tried only QEMU driven migration within > > the same host using wp3-postcopy branch of your tree. I don't see the > > above issue. > > > > (qemu) info migrate > > capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: > > off compress: off x-postcopy-ram: on > > Migration status: completed > > total time: 39432 milliseconds > > downtime: 162 milliseconds > > setup: 14 milliseconds > > transferred ram: 1297209 kbytes > > throughput: 270.72 mbps > > remaining ram: 0 kbytes > > total ram: 4194560 kbytes > > duplicate: 734015 pages > > skipped: 0 pages > > normal: 318469 pages > > normal bytes: 1273876 kbytes > > dirty sync count: 4 > > > > I will try migration between different hosts soon and check. > > I hit that on the same host; are you sure you've switched into postcopy mode; > i.e. issued a migrate_start_postcopy before the end of migration? Sorry I was following your discussion with Li in this thread https://www.marc.info/?l=qemu-devel=143035620026744=4 and it wasn't obvious to me that anything apart from turning on the x-postcopy-ram capability was required :( So I do see the problem now. 
At the source - Error reading data from KVM HTAB fd: Bad file descriptor Segmentation fault At the target - htab_load() bad index 2113929216 (14336+0 entries) in htab stream (htab_shift=25) qemu-system-ppc64: error while loading state section id 56(spapr/htab) qemu-system-ppc64: postcopy_ram_listen_thread: loadvm failed: -22 qemu-system-ppc64: VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0x1f: delta 0xffe1 qemu-system-ppc64: error while loading state for instance 0x0 of device 'pci@8002000:00.0/virtio-net' *** Error in `./ppc64-softmmu/qemu-system-ppc64': corrupted double-linked list: 0x0100241234a0 *** === Backtrace: = /lib64/power8/libc.so.6 Segmentation fault
Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
On Thu, May 14, 2015 at 07:31:16PM +0200, Andrea Arcangeli wrote: This activates the userfaultfd syscall. Signed-off-by: Andrea Arcangeli aarca...@redhat.com --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/uapi/asm/unistd.h | 1 + arch/x86/syscalls/syscall_32.tbl | 1 + arch/x86/syscalls/syscall_64.tbl | 1 + include/linux/syscalls.h | 1 + kernel/sys_ni.c| 1 + 6 files changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index f1863a1..4741b15 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -368,3 +368,4 @@ SYSCALL_SPU(memfd_create) SYSCALL_SPU(bpf) COMPAT_SYS(execveat) PPC64ONLY(switch_endian) +SYSCALL_SPU(userfaultfd) diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index e4aa173..6ad58d4 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -386,5 +386,6 @@ #define __NR_bpf 361 #define __NR_execveat362 #define __NR_switch_endian 363 +#define __NR_userfaultfd 364 May be it is a bit late to bring this up, but I needed the following fix to userfault21 branch of your git tree to compile on powerpc. powerpc: Bump up __NR_syscalls to account for __NR_userfaultfd From: Bharata B Rao bhar...@linux.vnet.ibm.com With userfaultfd syscall, the number of syscalls will be 365 on PowerPC. Reflect the same in __NR_syscalls. 
Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com --- arch/powerpc/include/asm/unistd.h |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index f4f8b66..4a055b6 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -12,7 +12,7 @@ #include <uapi/asm/unistd.h> -#define __NR_syscalls 364 +#define __NR_syscalls 365 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls
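The invariant behind this one-line fix is simply that __NR_syscalls must stay one larger than the highest assigned syscall number, so the dispatch table covers every entry. A compile-time sketch of that check (the constants are redefined locally here with a DEMO_ prefix; the real definitions live in arch/powerpc/include/asm/unistd.h and its uapi counterpart):

```c
#include <assert.h>

/*
 * Local stand-ins for the arch constants touched by the patch: the last
 * previously assigned syscall, the newly wired-up userfaultfd, and the
 * table size that the fix bumps from 364 to 365.
 */
#define DEMO_NR_switch_endian 363
#define DEMO_NR_userfaultfd   364 /* the newly added syscall */
#define DEMO_NR_syscalls      365 /* must be highest syscall number + 1 */

/* The invariant the fix restores: the table is large enough for the entry. */
_Static_assert(DEMO_NR_userfaultfd < DEMO_NR_syscalls,
               "syscall table too small for userfaultfd");
```

Without the bump, DEMO_NR_syscalls would still be 364 and the assertion (and, in the kernel, the table sizing) would fail.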
Re: [Qemu-devel] [PATCH 19/23] userfaultfd: activate syscall
On Tue, Aug 11, 2015 at 03:48:26PM +0200, Andrea Arcangeli wrote: Hello Bharata, On Tue, Aug 11, 2015 at 03:37:29PM +0530, Bharata B Rao wrote: May be it is a bit late to bring this up, but I needed the following fix to userfault21 branch of your git tree to compile on powerpc. Not late, just in time. I increased the number of syscalls in earlier versions, it must have gotten lost during a rejecting rebase, sorry. I applied it to my tree and it can be applied to -mm and linux-next, thanks! The syscall for arm32 are also ready and on their way to the arm tree, the testsuite worked fine there. ppc also should work fine if you could confirm it'd be interesting, just beware that I got a typo in the testcase: The testsuite passes on powerpc. running userfaultfd nr_pages: 2040, nr_pages_per_cpu: 170 bounces: 31, mode: rnd racing ver poll, userfaults: 80 43 23 23 15 16 12 1 2 96 13 128 bounces: 30, mode: racing ver poll, userfaults: 35 54 62 49 47 48 2 8 0 78 1 0 bounces: 29, mode: rnd ver poll, userfaults: 114 153 70 106 78 57 143 92 114 96 1 0 bounces: 28, mode: ver poll, userfaults: 96 81 5 45 83 19 98 28 1 145 23 2 bounces: 27, mode: rnd racing poll, userfaults: 54 65 60 54 45 49 1 2 1 2 71 20 bounces: 26, mode: racing poll, userfaults: 90 83 35 29 37 35 30 42 3 4 49 6 bounces: 25, mode: rnd poll, userfaults: 52 50 178 112 51 41 23 42 18 99 59 0 bounces: 24, mode: poll, userfaults: 136 101 83 260 84 29 16 88 1 6 160 57 bounces: 23, mode: rnd racing ver, userfaults: 141 197 158 183 39 49 3 52 8 3 6 0 bounces: 22, mode: racing ver, userfaults: 242 266 244 180 162 32 87 43 31 40 34 0 bounces: 21, mode: rnd ver, userfaults: 636 158 175 24 253 104 48 8 0 0 0 0 bounces: 20, mode: ver, userfaults: 531 204 225 117 129 107 11 143 76 31 1 0 bounces: 19, mode: rnd racing, userfaults: 303 169 225 145 59 219 37 0 0 0 0 0 bounces: 18, mode: racing, userfaults: 374 372 37 144 126 90 25 12 15 17 0 0 bounces: 17, mode: rnd, userfaults: 313 412 134 108 80 99 7 56 85 0 0 0 bounces: 16, 
mode:, userfaults: 431 58 87 167 120 113 98 60 14 8 48 0 bounces: 15, mode: rnd racing ver poll, userfaults: 41 40 25 28 37 24 0 0 0 0 180 75 bounces: 14, mode: racing ver poll, userfaults: 43 53 30 28 25 15 19 0 0 0 0 30 bounces: 13, mode: rnd ver poll, userfaults: 136 91 114 91 92 79 114 77 75 68 1 2 bounces: 12, mode: ver poll, userfaults: 92 120 114 76 153 75 132 157 83 81 10 1 bounces: 11, mode: rnd racing poll, userfaults: 50 72 69 52 53 48 46 59 57 51 37 1 bounces: 10, mode: racing poll, userfaults: 33 49 38 68 35 63 57 49 49 47 25 10 bounces: 9, mode: rnd poll, userfaults: 167 150 67 123 39 75 1 2 9 125 1 1 bounces: 8, mode: poll, userfaults: 147 102 20 87 5 27 118 14 104 40 21 28 bounces: 7, mode: rnd racing ver, userfaults: 305 254 208 74 59 96 36 14 11 7 4 5 bounces: 6, mode: racing ver, userfaults: 290 114 191 94 162 114 34 6 6 32 23 2 bounces: 5, mode: rnd ver, userfaults: 370 381 22 273 21 106 17 55 0 0 0 0 bounces: 4, mode: ver, userfaults: 328 279 179 191 74 86 95 15 13 10 0 0 bounces: 3, mode: rnd racing, userfaults: 222 215 164 70 5 20 179 0 34 3 0 0 bounces: 2, mode: racing, userfaults: 316 385 112 160 225 5 30 49 42 2 4 0 bounces: 1, mode: rnd, userfaults: 273 139 253 176 163 71 85 2 0 0 0 0 bounces: 0, mode:, userfaults: 165 212 633 13 24 66 24 27 15 0 10 1 [PASS] Regards, Bharata.
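For reference, the nr_pages / nr_pages_per_cpu figures in these runs follow from the test's first argument (the test area size in MiB), the page size, and the CPU count. A small C sketch of that arithmetic; the 64 KiB page size and the 16- and 12-CPU counts are inferred from the quoted ppc64le output, and the division order (per-CPU first, so nr_pages ends up a multiple of the CPU count) is an assumption matched against those runs:

```c
#include <assert.h>

/*
 * Derive the testcase's reported geometry: divide the area evenly across
 * CPUs first (integer division), then recompute the total page count, so
 * nr_pages is always a multiple of nr_cpus.
 */
static void uffd_geometry(unsigned long area_mib, unsigned long page_size,
                          unsigned long nr_cpus,
                          unsigned long *nr_pages, unsigned long *per_cpu)
{
    *per_cpu = area_mib * 1024 * 1024 / page_size / nr_cpus;
    *nr_pages = *per_cpu * nr_cpus;
}
```

This explains why one 128 MiB run reports 2048 pages (16 CPUs) and the other only 2040 (12 CPUs, where 2048 does not divide evenly).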
Re: [PATCH 07/23] KVM: PPC: Book3S: Allow reuse of vCPU object
On Sat, Mar 21, 2015 at 8:28 PM, Alexander Graf ag...@suse.de wrote: On 20.03.15 16:51, Bharata B Rao wrote: On Fri, Mar 20, 2015 at 12:34:18PM +0100, Alexander Graf wrote: On 20.03.15 12:26, Paul Mackerras wrote: On Fri, Mar 20, 2015 at 12:01:32PM +0100, Alexander Graf wrote: On 20.03.15 10:39, Paul Mackerras wrote: From: Bharata B Rao bhar...@linux.vnet.ibm.com Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU) correctly, certain work arounds have to be employed to allow reuse of vcpu array slot in KVM during cpu hot plug/unplug from guest. One such proposed workaround is to park the vcpu fd in userspace during cpu unplug and reuse it later during next hotplug. More details can be found here: KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html In order to support this workaround with PowerPC KVM, don't create or initialize ICP if the vCPU is found to be already associated with an ICP. Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com Signed-off-by: Paul Mackerras pau...@samba.org This probably makes some sense, but please make sure that user space has some way to figure out whether hotplug works at all. Bharata is working on the qemu side of all this, so I assume he has that covered. Well, so far the kernel doesn't expose anything he can query, so I suppose he just blindly assumes that older host kernels will randomly break and nobody cares. I'd rather prefer to see a CAP exposed that qemu can check on. I see that you have already taken this into your tree. I have an updated patch to expose a CAP. If the below patch looks ok, then let me know how you would prefer to take this patch in. Regards, Bharata. 
KVM: PPC: BOOK3S: Allow reuse of vCPU object From: Bharata B Rao bhar...@linux.vnet.ibm.com Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU) correctly, certain work arounds have to be employed to allow reuse of vcpu array slot in KVM during cpu hot plug/unplug from guest. One such proposed workaround is to park the vcpu fd in userspace during cpu unplug and reuse it later during next hotplug. More details can be found here: KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html In order to support this workaround with PowerPC KVM, don't create or initialize ICP if the vCPU is found to be already associated with an ICP. User space (QEMU) can reuse the vCPU after checking for the availability of KVM_CAP_SPAPR_REUSE_VCPU capability. Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com --- arch/powerpc/kvm/book3s_xics.c |9 +++-- arch/powerpc/kvm/powerpc.c | 12 include/uapi/linux/kvm.h |1 + 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c index a4a8d9f..ead3a35 100644 --- a/arch/powerpc/kvm/book3s_xics.c +++ b/arch/powerpc/kvm/book3s_xics.c @@ -1313,8 +1313,13 @@ int kvmppc_xics_connect_vcpu(struct kvm_device *dev, struct kvm_vcpu *vcpu, return -EPERM; if (xics-kvm != vcpu-kvm) return -EPERM; - if (vcpu-arch.irq_type) - return -EBUSY; + + /* + * If irq_type is already set, don't reinialize but + * return success allowing this vcpu to be reused. 
+ */ + if (vcpu-arch.irq_type != KVMPPC_IRQ_DEFAULT) + return 0; r = kvmppc_xics_create_icp(vcpu, xcpu); if (!r) diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 27c0fac..5b7007c 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -564,6 +564,18 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = 1; break; #endif + case KVM_CAP_SPAPR_REUSE_VCPU: + /* + * Kernel currently doesn't support closing of vCPU fd from + * user space (QEMU) correctly. Hence the option available + * is to park the vCPU fd in user space whenever a guest + * CPU is hot removed and reuse the same later when another + * guest CPU is hotplugged. This capability determines whether + * it is safe to assume if parking of vCPU fd and reuse from + * user space works for sPAPR guests. I don't see how the code you're changing here has anything to do with parking vcpus. It's all about being able to call connect on an already connected vcpu and not erroring out. Please reflect this in the cap name and description. You also need to update Documentation/virtual/kvm/api.txt. Furthermore, thinking about this a bit more, I might still miss the exact case why you need this. Why is QEMU issuing a connect again? Could it maybe just not do it? Thinking
Re: [PATCH 07/23] KVM: PPC: Book3S: Allow reuse of vCPU object
On Fri, Mar 20, 2015 at 12:34:18PM +0100, Alexander Graf wrote: On 20.03.15 12:26, Paul Mackerras wrote: On Fri, Mar 20, 2015 at 12:01:32PM +0100, Alexander Graf wrote: On 20.03.15 10:39, Paul Mackerras wrote: From: Bharata B Rao bhar...@linux.vnet.ibm.com Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU) correctly, certain work arounds have to be employed to allow reuse of vcpu array slot in KVM during cpu hot plug/unplug from guest. One such proposed workaround is to park the vcpu fd in userspace during cpu unplug and reuse it later during next hotplug. More details can be found here: KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html In order to support this workaround with PowerPC KVM, don't create or initialize ICP if the vCPU is found to be already associated with an ICP. Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com Signed-off-by: Paul Mackerras pau...@samba.org This probably makes some sense, but please make sure that user space has some way to figure out whether hotplug works at all. Bharata is working on the qemu side of all this, so I assume he has that covered. Well, so far the kernel doesn't expose anything he can query, so I suppose he just blindly assumes that older host kernels will randomly break and nobody cares. I'd rather prefer to see a CAP exposed that qemu can check on. I see that you have already taken this into your tree. I have an updated patch to expose a CAP. If the below patch looks ok, then let me know how you would prefer to take this patch in. Regards, Bharata. KVM: PPC: BOOK3S: Allow reuse of vCPU object From: Bharata B Rao bhar...@linux.vnet.ibm.com Since KVM isn't equipped to handle closure of vcpu fd from userspace(QEMU) correctly, certain work arounds have to be employed to allow reuse of vcpu array slot in KVM during cpu hot plug/unplug from guest. 
One such proposed workaround is to park the vcpu fd in userspace during cpu unplug and reuse it later during the next hotplug. More details can be found here:
KVM: https://www.mail-archive.com/kvm@vger.kernel.org/msg102839.html
QEMU: http://lists.gnu.org/archive/html/qemu-devel/2014-12/msg00859.html

In order to support this workaround with PowerPC KVM, don't create or initialize ICP if the vCPU is found to be already associated with an ICP. User space (QEMU) can reuse the vCPU after checking for the availability of the KVM_CAP_SPAPR_REUSE_VCPU capability.

Signed-off-by: Bharata B Rao bhar...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_xics.c |  9 +++--
 arch/powerpc/kvm/powerpc.c     | 12 ++++++++++++
 include/uapi/linux/kvm.h       |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index a4a8d9f..ead3a35 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -1313,8 +1313,13 @@ int kvmppc_xics_connect_vcpu(struct kvm_device *dev, struct kvm_vcpu *vcpu,
 		return -EPERM;
 	if (xics->kvm != vcpu->kvm)
 		return -EPERM;
-	if (vcpu->arch.irq_type)
-		return -EBUSY;
+
+	/*
+	 * If irq_type is already set, don't reinitialize but
+	 * return success allowing this vcpu to be reused.
+	 */
+	if (vcpu->arch.irq_type != KVMPPC_IRQ_DEFAULT)
+		return 0;

 	r = kvmppc_xics_create_icp(vcpu, xcpu);
 	if (!r)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 27c0fac..5b7007c 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -564,6 +564,18 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = 1;
 		break;
 #endif
+	case KVM_CAP_SPAPR_REUSE_VCPU:
+		/*
+		 * Kernel currently doesn't support closing of vCPU fd from
+		 * user space (QEMU) correctly. Hence the option available
+		 * is to park the vCPU fd in user space whenever a guest
+		 * CPU is hot removed and reuse the same later when another
+		 * guest CPU is hotplugged. This capability determines whether
+		 * it is safe to assume that parking of vCPU fd and reuse from
+		 * user space works for sPAPR guests.
+		 */
+		r = 1;
+		break;
 	default:
 		r = 0;
 		break;

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 8055706..8464755 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -760,6 +760,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_ENABLE_HCALL 104
 #define KVM_CAP_CHECK_EXTENSION_VM 105
 #define KVM_CAP_S390_USER_SIGP 106
+#define KVM_CAP_SPAPR_REUSE_VCPU 107

 #ifdef KVM_CAP_IRQ_ROUTING
Re: [Qemu-devel] KVM call agenda for September 25th
On Tue, Sep 25, 2012 at 04:51:15PM +0200, Kevin Wolf wrote:
> Am 25.09.2012 14:57, schrieb Anthony Liguori:
> > qemu -device \
> >     isa-serial,index=0,chr=tcp://localhost:1025/?server=on&wait=off
>
> Your examples kind of prove this: They aren't much shorter than what exists today, but they contain ? and &, which are nasty characters on the command line.

Right. '&' can't even be specified directly on the command line since that will result in the qemu command being treated as a background job with anything after '&' being discarded. I realized that '&' needs to be escaped as %26.

Regards,
Bharata.
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Tue, Nov 08, 2011 at 09:33:04AM -0800, Chris Wright wrote:
> * Alexander Graf (ag...@suse.de) wrote:
> > On 29.10.2011, at 20:45, Bharata B Rao wrote:
> > > As guests become NUMA aware, it becomes important for the guests to have correct NUMA policies when they run on NUMA aware hosts. Currently limited support for NUMA binding is available via libvirt where it is possible to apply a NUMA policy to the guest as a whole. However multinode guests would benefit if guest memory belonging to different guest nodes were mapped appropriately to different host NUMA nodes.
> > >
> > > To achieve this we would need QEMU to expose information about guest RAM ranges (Guest Physical Address - GPA) and their host virtual address mappings (Host Virtual Address - HVA). Using GPA and HVA, any external tool like libvirt would be able to divide the guest RAM as per the guest NUMA node geometry and bind guest memory nodes to corresponding host memory nodes using HVA. This needs both QEMU (and libvirt) changes as well as changes in the kernel.
> >
> > Ok, let's take a step back here. You are basically growing libvirt into a memory resource manager that knows how much memory is available on which nodes and how these nodes would possibly fit into the host's memory layout. Shouldn't that be the kernel's job? It seems to me that architecturally the kernel is the place I would want my memory resource controls to be in.
>
> I think that both Peter and Andrea are looking at this. Before we commit an API to QEMU that has a different semantic than a possible new kernel interface (that perhaps QEMU could use directly to inform the kernel of the binding/relationship between a vcpu thread and its memory at VM startup) it would be useful to see what these guys are working on...

I looked at Peter's recent work in this area (https://lkml.org/lkml/2011/11/17/204). It introduces two interfaces:

1. ms_tbind() to bind a thread to a memsched(*) group
2. ms_mbind() to bind a memory region to a memsched group

I assume the 2nd interface could be used by QEMU to create memsched groups for each of the guest NUMA node memory regions.

In the past, Anthony has said that NUMA binding should be done from outside of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041). Though that was in a different context, maybe we should re-look at that and see if QEMU still sticks to it. I know it's a bit early, but if needed we should ask Peter to consider extending ms_mbind() to take a tid parameter too instead of working on the current task by default.

(*) memsched: An abstraction for representing coupling of threads with virtual address ranges. Threads and virtual address ranges of a memsched group are guaranteed (?) to be located on the same node.

Regards,
Bharata.
Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
On Mon, Nov 21, 2011 at 04:25:26PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 20:48 +0530, Bharata B Rao wrote:
> > I looked at Peter's recent work in this area (https://lkml.org/lkml/2011/11/17/204). It introduces two interfaces:
> > 1. ms_tbind() to bind a thread to a memsched(*) group
> > 2. ms_mbind() to bind a memory region to a memsched group
> > I assume the 2nd interface could be used by QEMU to create memsched groups for each of the guest NUMA node memory regions.
>
> No, you would need both; you'll need to group vcpu threads _and_ some vaddress space together. I understood QEMU currently uses a single big anonymous mmap() to allocate the guest memory; using this you could either use multiple mmaps or carve up the big alloc into virtual nodes by assigning different parts to different ms groups.
>
> Example: suppose you want to create a 2 node guest with 8 vcpus: create 2 ms groups, each with 4 vcpu threads, and assign half the total guest mmap to either.
>
> > In the past, Anthony has said that NUMA binding should be done from outside of QEMU (http://www.kerneltrap.org/mailarchive/linux-kvm/2010/8/31/6267041)
>
> If you want to expose a sense of virtual NUMA to your guest you really have no choice there. The only thing you can do externally is run whole VMs inside one particular node.
>
> > Though that was in a different context, maybe we should re-look at that and see if QEMU still sticks to it. I know it's a bit early, but if needed we should ask Peter to consider extending ms_mbind() to take a tid parameter too instead of working on the current task by default.
>
> Uh, what for? ms_mbind() works on the current process, not task.

In the original post of this mail thread, I proposed a way to export guest RAM ranges (Guest Physical Address - GPA) and their corresponding host virtual mappings (Host Virtual Address - HVA) from QEMU (via the QEMU monitor). The idea was to use these GPA to HVA mappings from tools like libvirt to bind specific parts of the guest RAM to different host nodes. This needed an extension to the existing mbind() to allow binding memory of a process (QEMU) from a different process (libvirt). This was needed since we wanted to do all this from libvirt. Hence I was coming from that background when I asked for extending ms_mbind() to take a tid parameter. If the QEMU community thinks that NUMA binding should all be done from outside of QEMU, it is needed; otherwise what you have should be sufficient.

Regards,
Bharata.
Re: [RFC] CPU hard limits
On Sun, Jun 07, 2009 at 09:04:49AM +0300, Avi Kivity wrote:
> Bharata B Rao wrote:
> > On Fri, Jun 05, 2009 at 09:01:50AM +0300, Avi Kivity wrote:
> > > Bharata B Rao wrote:
> > > > But could there be client models where you are required to strictly adhere to the limit within the bandwidth and not provide more (by advancing the bandwidth period) in the presence of idle cycles ?
> > >
> > > That's the limit part. I'd like to be able to specify limits and guarantees on the same host and for the same groups; I don't think that works when you advance the bandwidth period. I think we need to treat guarantees as first-class goals, not something derived from limits (in fact I think guarantees are more useful as they can be used to provide SLAs).
> >
> > I agree that guarantees are important, but I am not sure about 1. specifying both limits and guarantees for groups and
>
> Why would you allow specifying a lower bound for cpu usage (a guarantee), and an upper bound (a limit), but not both?

I was saying that we specify only limits and not guarantees since guarantees can be worked out from limits. The initial thinking was that the kernel would be made aware of only limits and users could set the limits appropriately to obtain the desired guarantees. I understand your concerns/objections on this and we will address this in our next version of the RFC as Balbir said.

Regards,
Bharata.
Re: [RFC] CPU hard limits
On Fri, Jun 05, 2009 at 09:03:37AM +0300, Avi Kivity wrote:
> Balbir Singh wrote:
> > > I think so. Given guarantees G1..Gn (0 <= Gi <= 1; sum(Gi) <= 1), and a cpu hog running in each group, how would the algorithm divide resources?
> >
> > As per the matrix calculation, but as soon as we reach an idle point, we redistribute the b/w and start a new quantum so to speak, where all groups are charged up to their hard limits. For your question, if there is a CPU hog running, it would be as per the matrix calculation, since the system has no idle point during the bandwidth period.
>
> So the groups with guarantees get a priority boost. That's not a good side effect.

That happens only in the presence of idle cycles when other groups [with or without guarantees] have nothing useful to do. So how would that matter since there is nothing else to run anyway ?

Regards,
Bharata.
Re: [RFC] CPU hard limits
On Fri, Jun 05, 2009 at 09:01:50AM +0300, Avi Kivity wrote:
> Bharata B Rao wrote:
> > But could there be client models where you are required to strictly adhere to the limit within the bandwidth and not provide more (by advancing the bandwidth period) in the presence of idle cycles ?
>
> That's the limit part. I'd like to be able to specify limits and guarantees on the same host and for the same groups; I don't think that works when you advance the bandwidth period. I think we need to treat guarantees as first-class goals, not something derived from limits (in fact I think guarantees are more useful as they can be used to provide SLAs).

I agree that guarantees are important, but I am not sure about:
1. specifying both limits and guarantees for groups and
2. not deriving guarantees from limits.

Guarantees are met by some form of throttling or limiting and hence I think limiting should drive the guarantees.

Regards,
Bharata.
Re: [RFC] CPU hard limits
On Fri, Jun 05, 2009 at 01:53:15AM -0700, Paul Menage wrote:
> On Wed, Jun 3, 2009 at 10:36 PM, Bharata B Rao bhar...@linux.vnet.ibm.com wrote:
> > - Hard limits can be used to provide guarantees.
>
> This claim (and the subsequent long thread it generated on how limits can provide guarantees) confused me a bit. Why do we need limits to provide guarantees when we can already provide guarantees via shares?

The shares design is proportional and hence it can't by itself provide guarantees.

> Suppose 10 cgroups each want 10% of the machine's CPU. We can just give each cgroup an equal share, and they're guaranteed 10% if they try to use it; if they don't use it, other cgroups can get access to the idle cycles.

Now if an 11th group with the same shares comes in, then each group will only get about 9% of the CPU and that 10% guarantee breaks.

Regards,
Bharata.
Re: [RFC] CPU hard limits
On Thu, Jun 04, 2009 at 03:19:22PM +0300, Avi Kivity wrote:
> Bharata B Rao wrote:
> > 2. Need for hard limiting CPU resource
> > --------------------------------------
> > - Pay-per-use: In enterprise systems that cater to multiple clients/customers where a customer demands a certain share of CPU resources and pays only for that, CPU hard limits will be useful to hard limit the customer's job to consume only the specified amount of CPU resource.
> > - In container based virtualization environments running multiple containers, hard limits will be useful to ensure a container doesn't exceed its CPU entitlement.
> > - Hard limits can be used to provide guarantees.
>
> How can hard limits provide guarantees? Let's take an example where I have 1 group that I wish to guarantee a 20% share of the cpu, and another 8 groups with no limits or guarantees. One way to achieve the guarantee is to hard limit each of the 8 other groups to 10%; the sum total of the limits is 80%, leaving 20% for the guarantee group. The downside is the arbitrary limit imposed on the other groups.

This method sounds very similar to the openvz method:
http://wiki.openvz.org/Containers/Guarantees_for_resources

> Another way is to place the 8 groups in a container group, and limit that to 80%. But that doesn't work if I want to provide guarantees to several groups.

Hmm, why not ? Reduce the guarantee of the container group and provide the same to additional groups ?

Regards,
Bharata.
Re: [RFC] CPU hard limits
On Fri, Jun 05, 2009 at 01:27:55PM +0800, Balbir Singh wrote:
> * Avi Kivity a...@redhat.com [2009-06-05 08:21:43]:
> > Balbir Singh wrote:
> > > But then there is no other way to make a *guarantee*; guarantees come at the cost of idling resources, no? Can you show me any other combination that will provide the guarantee without idling the system for the specified guarantees?
> > >
> > > OK, I see part of your concern, but I think we could do some optimizations during design. For example, if all groups have reached their hard limit and the system is idle, should we start a new hard limit interval and restart, so that idleness can be removed? Would that be an acceptable design point?
> >
> > I think so. Given guarantees G1..Gn (0 <= Gi <= 1; sum(Gi) <= 1), and a cpu hog running in each group, how would the algorithm divide resources?
>
> As per the matrix calculation, but as soon as we reach an idle point, we redistribute the b/w and start a new quantum so to speak, where all groups are charged up to their hard limits.

But could there be client models where you are required to strictly adhere to the limit within the bandwidth period and not provide more (by advancing the bandwidth period) in the presence of idle cycles ?

Regards,
Bharata.
[RFC] CPU hard limits
Hi,

This is an RFC about the CPU hard limits feature where I have explained the need for the feature, the proposed plan and the issues around it. Before I come up with an implementation for hard limits, I would like to know the community's thoughts on this scheduler enhancement and any feedback and suggestions.

Regards,
Bharata.

1. CPU hard limit
2. Need for hard limiting CPU resource
3. Granularity of enforcing CPU hard limits
4. Existing solutions
5. Specifying hard limits
6. Per task group vs global bandwidth period
7. Configuring
8. Throttling of tasks
9. Group scheduler hierarchy considerations
10. SMP considerations
11. Starvation
12. Hard limit and fairness

1. CPU hard limit
-----------------
CFS is a proportional share scheduler which tries to divide the CPU time proportionately between tasks or groups of tasks (task group/cgroup) depending on the priority/weight of the task or the shares assigned to groups of tasks. In CFS, a task/task group can get more than its share of CPU if there are enough idle CPU cycles available in the system, due to the work conserving nature of the scheduler. However there are scenarios (Sec 2) where giving more than the desired CPU share to a task/task group is not acceptable. In those scenarios, the scheduler needs to put a hard stop on the CPU resource consumption of a task/task group if it exceeds a preset limit. This is usually achieved by throttling the task/task group when it fully consumes its allocated CPU time.

2. Need for hard limiting CPU resource
--------------------------------------
- Pay-per-use: In enterprise systems that cater to multiple clients/customers where a customer demands a certain share of CPU resources and pays only for that, CPU hard limits will be useful to hard limit the customer's job to consume only the specified amount of CPU resource.
- In container based virtualization environments running multiple containers, hard limits will be useful to ensure a container doesn't exceed its CPU entitlement.
- Hard limits can be used to provide guarantees.

3. Granularity of enforcing CPU hard limits
-------------------------------------------
Conceptually, hard limits can either be enforced for individual tasks or for groups of tasks. However enforcing limits per task would be too fine grained and would be a lot of work on the part of the system administrator in terms of setting limits for every task. Based on the current understanding of the users of this feature, it is felt that hard limiting is more useful at the task group level than at the individual task level. Hence in the subsequent paragraphs, the concept of hard limits as applicable to task groups/cgroups is discussed.

4. Existing solutions
---------------------
- Both Linux-VServer and OpenVZ virtualization solutions support CPU hard limiting.
- A per task limit can be enforced using rlimits, but it is not rate based.

5. Specifying hard limits
-------------------------
CPU time consumed by a task group is generally measured over a time period (called the bandwidth period) and the task group gets throttled when its CPU time reaches a limit (hard limit) within a bandwidth period. The task group remains throttled until the bandwidth period gets renewed, at which time additional CPU time becomes available to the tasks in the system. When a task group's hard limit is specified as a ratio X/Y, it means that the group will get throttled if its CPU time consumption exceeds X seconds in a bandwidth period of Y seconds. Specifying the hard limit as X/Y requires us to specify the bandwidth period also. Is having a uniform/same bandwidth period for all the groups an option ? If so, we could even specify the hard limit as a percentage, like 30% of a uniform bandwidth period.

6. Per task group vs global bandwidth period
--------------------------------------------
The bandwidth period can either be per task group or global. With a global bandwidth period, the runtimes of all the task groups need to be replenished when the period ends. Though this appears conceptually simple, the implementation might not scale.
Instead, if every task group maintains its bandwidth period separately, the refresh cycles of each group happen independent of each other. Moreover, different groups might prefer different bandwidth periods. Hence the first implementation will have a per task group bandwidth period. Timers can be used to trigger bandwidth refresh cycles (similar to rt group sched).

7. Configuring
--------------
- The user could set the hard limit (X and/or Y) through the cgroup fs.
- When the scheduler supports hard limiting, should it be enabled for all task groups of the system ? Or should the user have an option to enable hard limiting per group ?
- When hard limiting is enabled for a group, should the limit be set to a default to start with ? Or should the user set the limit and the bandwidth before enabling hard limiting ?
- What should be a sane default value for the bandwidth period ?

8. Throttling of