Re: [libvirt] Need a better word than allocated or ascertained
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2011-01-07 15:23:54]:

CC'ing Balbir.

On Fri, 07 Jan 2011 10:33:08 +0100, Zdenek Styblik sty...@turnovfree.net wrote:
On 01/07/2011 10:10 AM, Justin Clift wrote:
On 07/01/2011, at 6:12 PM, Nikunj A. Dadhania wrote:
[snip]

"Guaranteed" sounds best to me.

That's not guaranteed, to the best of my knowledge. Balbir suggested "enforced"; I guess I dropped it somewhere.
https://www.redhat.com/archives/libvir-list/2010-August/msg00712.html

Balbir's suggested wording (from the email): "limit to enforce on memory contention".

Does that mean it's the minimum memory limit it would really like to have, but can't guarantee it? (i.e. it's not guaranteed) I'm getting a bit confused here.

"Enforced" really doesn't fit into the context, or does it? What should it say/explain? [soft-limit] Who is the target audience? And I think the last question is very important, because your technical mumbo-jumbo might be just fine and tip-top to the last bit, but if nobody else understands it, then such help seems to be a bit helpless to me. Meaning:
* allocated/guaranteed I can imagine;
* ascertained gave me a really nonsensical translation, although that might be caused by a crappy dictionary;
* enforced - uh... how? what? when? Is it when the host is running low on memory and/or there are many VMs competing for memory? If so, please explain it somewhere if it isn't already (yeah, I'm trying to figure out the meaning). Or what happens when memory reaches the 'soft-limit'?

"Enforced" is the same as policing or forcing, whether or not the application likes it. A soft limit is enforced when we hit resource contention (that is, the operating system finds it has to do work to find free memory for applications); soft limits kick in and try to push each cgroup down to its soft limit.

--
Three Cheers,
Balbir

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] Need a better word than allocated or ascertained
* Zdenek Styblik sty...@turnovfree.net [2011-01-10 14:08:43]:

On 01/10/2011 09:55 AM, Balbir Singh wrote:
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2011-01-07 15:23:54]:
[...]

Or what happens when memory reaches the 'soft-limit'?

"Enforced" is the same as policing or forcing, whether or not the application likes it. A soft limit is enforced when we hit resource contention (that is, the operating system finds it has to do work to find free memory for applications); soft limits kick in and try to push each cgroup down to its soft limit.

Such an explanation makes more sense to me than the proposed sentence. However, there are some critical factors like a] my lack of knowledge on many libvirt (or virtualization in general) topics and b] I'm not a native English speaker; which may or may not play a role.

--- SNIP ---
A soft limit is enforced when the host is running short on free resources or during resource contention. The guest's resources are then pushed down to the soft limit as an attempt to regain free resources. The limit is in kilobytes. Applies to QEMU and LXC only.
--- SNIP ---

Good, well stated IMHO.

I don't know. This is like the 10th version and wow, what a pile of nonsense I came up with :[ Guest memory won't be pushed below the soft limit, because the guest could go ape (OOM killer/whatever) about it, and we don't want that. Could it be understood as resource allocation/reservation like in e.g. VMware ESX? But it might work differently in QEMU/LXC than in VMware. Anyway, this is probably off-topic here. I just would go for a longer explanation rather than squeezing everything into 5 words, which seems impossible to me, or changing just one word.

~~~ non-relevant part ~~~

Other things I've noticed at the page... I would change the table to:

Name | Units | Required | Desc
--hard-limit limit | kB | optional | some description

Or:

Name | Required | Desc
--hard-limit limit | optional | some description (limit is in kilobytes)

Also, I think it should be 'kB', not 'kb', which means 'kilobits' [1]. I don't want to bitch or anything like that, please take it very, very easy. Although it's explained in the description that kb is meant as kilobytes, it might be only me who is used to the kb vs. kB thing. Dunno :\

I'd agree, conventions need to be properly followed.

I would put e.g. "QEMU and LXC only" on a new line, but this might be unnecessary (= just a formatting issue). There could also be a special column 'Applies to' and whatnot (at this point, I feel like I must be really bored to come up with such stuff; please apply "stfu" if necessary w/o hard feelings ;] ). There is also duplication of this info in the paragraph below in 'Platform or Hypervisor specific notes', thus if something changes it must be changed in two places.

Links:
[1] http://en.wikipedia.org/wiki/KB

Have a nice day,
Zdenek

--
Zdenek Styblik
Net/Linux admin
OS TurnovFree.net
email: sty...@turnovfree.net
jabber: sty...@jabber.turnovfree.net

--
Three Cheers,
Balbir
Re: [libvirt] [PATCH] Update docs for memory parameters and memtune command
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-10-18 14:03:53]:

On Mon, 18 Oct 2010 09:55:46 +0200, Matthias Bolte matthias.bo...@googlemail.com wrote:
2010/10/18 Nikunj A. Dadhania nik...@linux.vnet.ibm.com:
From: Nikunj A. Dadhania nik...@linux.vnet.ibm.com

docs/formatdomain.html.in: Add memtune element details
[...]

@@ -211,6 +216,22 @@
     <code>hugepages</code> element set within it. This tells the
     hypervisor that the guest should have its memory allocated using
     hugepages instead of the normal native page size.</dd>
+    <dt><code>memtune</code></dt>
+    <dd> The optional <code>memtune</code> element provides details
+    regarding the memory tuneable parameters for the domain. If this is
+    omitted, it defaults to the OS provided defaults.</dd>
+    <dt><code>hard_limit</code></dt>
+    <dd> The optional <code>hard_limit</code> element is the maximum memory
+    the guest can use. The units for this value are kilobytes (i.e. blocks
+    of 1024 bytes)</dd>

Well, the maximum memory a guest can use is also controlled by the memory and currentMemory elements in some way. How does hard_limit relate to those two?

memory and currentMemory are related to the balloon size, while these are operating-system-provided limits.

+    <dt><code>soft_limit</code></dt>
+    <dd> The optional <code>soft_limit</code> element is the memory limit to
+    enforce during memory contention. The units for this value are
+    kilobytes (i.e. blocks of 1024 bytes)</dd>

Is this an upper or a lower limit? Does it mean in case of contention this guest may only use up to soft_limit kilobytes of memory (upper limit)? Or does it mean in case of contention make sure that this guest can access at least soft_limit kilobytes of memory (lower limit)?

Upper limit of memory the guest can use (i.e. up to soft_limit) during contention. Balbir, correct me if this isn't correct.

Yes, that interpretation is correct. We try to push the guest back to the soft limit on contention; this is typically the case when the guest uses more than the assigned soft limit.

How does this relate to the memory and currentMemory elements?

At present no relation; they are implemented by the OS. This feature allows us to set useful limits; in the absence of contention no limits are enforced (IOW, this is work conserving, so to speak).

How is it related to the min_guarantee element?

It is not related to min_guarantee.

+    <dt><code>swap_hard_limit</code></dt>
+    <dd> The optional <code>swap_hard_limit</code> element is the maximum
+    swap the guest can use. The units for this value are kilobytes
+    (i.e. blocks of 1024 bytes)</dd>

What about the min_guarantee element anyway? It's not implemented in virsh.

Missed it, I will add the docs about min_guarantee and send the updated patch. It is not implemented in virsh. However, I have taken care of parsing them in the domain configuration.

--
Three Cheers,
Balbir
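Putting the elements discussed above together, a domain definition would carry something like the following fragment (the values are illustrative; units are kilobytes, as the documentation under review states):

```xml
<domain type='qemu'>
  ...
  <memory>1048576</memory>
  <currentMemory>524288</currentMemory>
  <memtune>
    <hard_limit>1048576</hard_limit>
    <soft_limit>786432</soft_limit>
    <swap_hard_limit>2097152</swap_hard_limit>
    <min_guarantee>262144</min_guarantee>
  </memtune>
  ...
</domain>
```

Here memory/currentMemory drive the balloon, while the memtune children are the OS-enforced limits, matching the distinction drawn in the reply above.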
Re: [libvirt] [PATCH v3 04/13] XML parsing for memory tunables
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-10-08 14:43:34]:

On Fri, 8 Oct 2010 14:10:53 +0530, Balbir Singh bal...@linux.vnet.ibm.com wrote:
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-10-08 12:00:44]:
On Thu, 7 Oct 2010 12:49:29 +0100, Daniel P. Berrange berra...@redhat.com wrote:
On Mon, Oct 04, 2010 at 12:47:22PM +0530, Nikunj A. Dadhania wrote:
On Mon, 4 Oct 2010 12:16:42 +0530, Balbir Singh bal...@linux.vnet.ibm.com wrote:
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-09-28 15:26:30]:
[snip]

+    unsigned long hard_limit;
+    unsigned long soft_limit;
+    unsigned long min_guarantee;
+    unsigned long swap_hard_limit;

The hard_limit, soft_limit, swap_hard_limit are s64 and the value is in bytes. What is the unit supported in this implementation?

Actually, if libvirt is built on 32-bit these aren't big enough - make them into 'unsigned long long' data types, I reckon.

I was thinking that as we are using a unit of KB, we would be able to represent 2^42 bytes of memory limit, i.e. 4 terabytes. Won't this suffice in the 32-bit case?

How would you represent -1 (2^63 - 1) as the unlimited or max limit we use today?

I think I have answered this question in the thread: it is specific to cgroups that -1 means unlimited; this may not be true for other HVs.

OK, so how do we handle unlimited values in general?

--
Three Cheers,
Balbir
Re: [libvirt] [PATCH v3 05/13] Implement cgroup memory controller tunables
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-09-28 15:26:35]:

From: Nikunj A. Dadhania nik...@linux.vnet.ibm.com

Provides interfaces for setting/getting memory tunables like hard_limit, soft_limit and swap_hard_limit.

Signed-off-by: Nikunj A. Dadhania nik...@linux.vnet.ibm.com

The changes look good to me. unsigned long kb should cover all values in bytes as well.

Acked-by: Balbir Singh bal...@linux.vnet.ibm.com

--
Three Cheers,
Balbir
Re: [libvirt] [PATCH v3 04/13] XML parsing for memory tunables
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-09-28 15:26:30]:

From: Nikunj A. Dadhania nik...@linux.vnet.ibm.com

Adding parsing code for memory tunables in the domain xml file

v2: + Fix typo min_guarantee

Signed-off-by: Nikunj A. Dadhania nik...@linux.vnet.ibm.com
---
 src/conf/domain_conf.c     |   50 +++-
 src/conf/domain_conf.h     |   12 ---
 src/esx/esx_vmx.c          |   30 +-
 src/lxc/lxc_controller.c   |    2 +-
 src/lxc/lxc_driver.c       |   12 +--
 src/openvz/openvz_driver.c |    8 ---
 src/qemu/qemu_conf.c       |    8 ---
 src/qemu/qemu_driver.c     |   18 ----
 src/test/test_driver.c     |   12 +--
 src/uml/uml_conf.c         |    2 +-
 src/uml/uml_driver.c       |   14 ++--
 11 files changed, 104 insertions(+), 64 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index e05d5d7..0dd74e4 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -4231,19 +4231,38 @@ static virDomainDefPtr virDomainDefParseXML(virCapsPtr caps,
     def->description = virXPathString("string(./description[1])", ctxt);

     /* Extract domain memory */
-    if (virXPathULong("string(./memory[1])", ctxt, &def->maxmem) < 0) {
+    if (virXPathULong("string(./memory[1])", ctxt,
+                      &def->mem.max_balloon) < 0) {
         virDomainReportError(VIR_ERR_INTERNAL_ERROR,
                              "%s", _("missing memory element"));
         goto error;
     }

-    if (virXPathULong("string(./currentMemory[1])", ctxt, &def->memory) < 0)
-        def->memory = def->maxmem;
+    if (virXPathULong("string(./currentMemory[1])", ctxt,
+                      &def->mem.cur_balloon) < 0)
+        def->mem.cur_balloon = def->mem.max_balloon;

     node = virXPathNode("./memoryBacking/hugepages", ctxt);
     if (node)
-        def->hugepage_backed = 1;
-
+        def->mem.hugepage_backed = 1;
+
+    /* Extract other memory tunables */
+    if (virXPathULong("string(./memtune/hard_limit)", ctxt,
+                      &def->mem.hard_limit) < 0)
+        def->mem.hard_limit = 0;
+
+    if (virXPathULong("string(./memtune/soft_limit[1])", ctxt,
+                      &def->mem.soft_limit) < 0)
+        def->mem.soft_limit = 0;
+
+    if (virXPathULong("string(./memtune/min_guarantee[1])", ctxt,
+                      &def->mem.min_guarantee) < 0)
+        def->mem.min_guarantee = 0;
+
+    if (virXPathULong("string(./memtune/swap_hard_limit[1])", ctxt,
+                      &def->mem.swap_hard_limit) < 0)
+        def->mem.swap_hard_limit = 0;
+

Quick question, does 0 represent invalid values? I'd presume you'd want to use something like -1.

We support unsigned long long for the values to be set (64-bit signed); unlimited translates to 2^63 - 1. Is ULong sufficient to represent that value?

     if (virXPathULong("string(./vcpu[1])", ctxt, &def->vcpus) < 0)
         def->vcpus = 1;

@@ -6382,10 +6401,25 @@ char *virDomainDefFormat(virDomainDefPtr def,
         virBufferEscapeString(buf, "  <description>%s</description>\n",
                               def->description);

-    virBufferVSprintf(buf, "  <memory>%lu</memory>\n", def->maxmem);
+    virBufferVSprintf(buf, "  <memory>%lu</memory>\n", def->mem.max_balloon);
     virBufferVSprintf(buf, "  <currentMemory>%lu</currentMemory>\n",
-                      def->memory);
-    if (def->hugepage_backed) {
+                      def->mem.cur_balloon);
+    virBufferVSprintf(buf, "  <memtune>\n");
+    if (def->mem.hard_limit) {
+        virBufferVSprintf(buf, "    <hard_limit>%lu</hard_limit>\n",
+                          def->mem.hard_limit);
+    }
+    if (def->mem.soft_limit) {
+        virBufferVSprintf(buf, "    <soft_limit>%lu</soft_limit>\n",
+                          def->mem.soft_limit);
+    }
+    if (def->mem.swap_hard_limit) {
+        virBufferVSprintf(buf, "    <swap_hard_limit>%lu</swap_hard_limit>\n",
+                          def->mem.swap_hard_limit);
+    }
+    virBufferVSprintf(buf, "  </memtune>\n");
+
+    if (def->mem.hugepage_backed) {
         virBufferAddLit(buf, "  <memoryBacking>\n");
         virBufferAddLit(buf, "    <hugepages/>\n");
         virBufferAddLit(buf, "  </memoryBacking>\n");

diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index 7195c04..2ecc2af 100644
--- a/src/conf/domain_conf.h
+++ b/src/conf/domain_conf.h
@@ -864,9 +864,15 @@ struct _virDomainDef {
     char *name;
     char *description;

-    unsigned long memory;
-    unsigned long maxmem;
-    unsigned char hugepage_backed;
+    struct {
+        unsigned long max_balloon;
+        unsigned long cur_balloon;
+        unsigned long hugepage_backed;
+        unsigned long hard_limit;
+        unsigned long soft_limit;
+        unsigned long min_guarantee;
+        unsigned long swap_hard_limit;

The hard_limit, soft_limit, swap_hard_limit are s64 and the value is in bytes. What is the unit supported in this implementation?
Re: [libvirt] [RFC] Memory controller exploitation in libvirt
"held" might not be the right word for soft limit. How about "Memory limit ensured during contention"?

I'd recommend "limit to enforce on memory contention".

Balbir
Re: [libvirt] [RFC] Memory controller exploitation in libvirt
On Mon, Aug 30, 2010 at 11:56 AM, Nikunj A. Dadhania nik...@linux.vnet.ibm.com wrote:
On Tue, 24 Aug 2010 11:07:29 +0100, Daniel P. Berrange berra...@redhat.com wrote:
On Tue, Aug 24, 2010 at 03:17:44PM +0530, Nikunj A. Dadhania wrote:
On Tue, 24 Aug 2010 11:02:49 +0200, Matthias Bolte matthias.bo...@googlemail.com wrote:
[snip]

Yes, the ESX driver allows control of ballooning through virDomainSetMemory and virDomainSetMaxMemory. ESX itself also allows setting what's called memoryMinGaurantee in this thread, but this is not exposed in libvirt.

The LXC driver uses virDomainSetMemory to set the memory hard limit, while QEmu/ESX use it to change the ballooning. And as you said, ESX does support memoryMinGaurantee; we can get this exported in libvirt using this new API. Here I am trying to group all the memory-related parameters into one single public API, as we have in virDomainSetSchedulerParameters. Currently, the names do not convey what they modify in the layer below and are confusing.

For historical design record, I think it would be good to write a short description of what memory tunables are available for each hypervisor, covering VMware, OpenVZ, Xen, KVM and LXC (the latter both cgroups based). I do recall that OpenVZ in particular had a huge number of memory tunables.

This is an attempt at covering the memory tunables supported by various hypervisors in libvirt. Let me know if I have missed any memory tunable. Moreover, input from the maintainers/key contributors of each HV on these parameters is appreciated. This would help in getting complete coverage of the memory tunables that libvirt can support.

1) OpenVZ
=========
vmguarpages: Memory allocation guarantee, in pages.
kmemsize: Size of unswappable kernel memory (in bytes), allocated for processes in this container.
oomguarpages: The guaranteed amount of memory for the case the memory is "over-booked" (out-of-memory kill guarantee), in pages.
privvmpages: Memory allocation limit, in pages.

The OpenVZ driver does not implement any of these functions: domainSetMemory, domainSetMaxMemory, domainGetMaxMemory. Although, the driver has an internal implementation for setting memory, openvzDomainSetMemoryInternal, that is read from the domain xml file.

2) VMware
=========
ConfiguredSize: Virtual memory the guest can have.
Shares: Priority of the VM, in case there is not enough memory or in case when there is more memory. It has symbolic values like Low, Normal, High and Custom.
Reservation: Guaranteed lower bound on the amount of physical memory that the host reserves for the VM even in case of overcommit. The VM is allowed to allocate up to this level, and after it has hit the reservation, those pages are not reclaimed. In case the guest is not using up to the reservation, the host can use that portion of memory.
Limit: This is the upper bound for the amount of physical memory that the host can allocate for the VM.
Memory Balloon

The ESX driver uses the following:
* domainSetMaxMemory to set the max virtual memory for the VM.
* domainSetMemory to inflate/deflate the balloon.
* ESX provides a lower bound (Reservation), but it is not being exploited currently.

3) Xen
======
maxmem_set: Maximum amount of memory reservation of the domain
mem_target_set: Set current memory usage of the domain

4) KVM / LXC
============
memory.limit_in_bytes: Memory hard limit
memory.soft_limit_in_bytes: Memory limit held during contention

"held" might not be the right word for soft limit.

memory.memsw_limit_in_bytes: Memory+swap hard limit
memory.swappiness: Controls the tendency of moving the VM processes to swap. Value range is 0-100, where 0 means avoid swapping as long as possible, and 100 means aggressively swap processes.

Statistics:
memory.usage_in_bytes: Current memory usage
memory.memsw_usage_in_bytes: Current memory+swap usage
memory.max_usage_in_bytes: Maximum memory usage recorded
memory.memsw_max_usage_in_bytes: Maximum memory+swap usage

We also have memory.stat, memory.use_hierarchy - the question is: do we care about hierarchical control? We also have controls to decide whether to move memory on moving from one cgroup to another, which might not apply to the LXC/QEMU case. There is also memory.failcnt, which I am not sure makes sense to export.

Balbir
Re: [libvirt] [RFC] Memory controller exploitation in libvirt
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-08-24 11:53:27]:

Subject: [RFC] Memory controller exploitation in libvirt

Memory CGroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that. At present, QEmu uses the memory ballooning feature, where the memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism where the host can have more control over the guest's memory usage. Memory CGroup provides features such as hard-limit and soft-limit for memory, and hard-limit for swap area.

Design 1: Provide new API and XML changes for resource management
=================================================================

Not all memory controller tunables are supported with the current abstractions provided by the libvirt API. libvirt works on various OSes. This new API will support GNU/Linux initially, and as and when other platforms start supporting memory tunables, the interface could be enabled for them.

Adding the following two function pointers to the virDriver interface:
1) domainSetMemoryParameters: which would take one or more name-value pairs. This makes the API extensible and agnostic to the kind of parameters supported by various hypervisors.
2) domainGetMemoryParameters: for getting current memory parameters

Corresponding libvirt public API:
int virDomainSetMemoryParameters(virDomainPtr domain,
                                 virMemoryParameterPtr params,
                                 unsigned int nparams);
int virDomainGetMemoryParameters(virDomainPtr domain,
                                 virMemoryParameterPtr params,
                                 unsigned int nparams);

Does nparams imply setting several parameters together? Does bulk loading help?

I would prefer splitting out the API if possible into:
virCgroupSetMemory() - already present in src/util/cgroup.c
virCgroupGetMemory() - already present in src/util/cgroup.c
virCgroupSetMemorySoftLimit()
virCgroupSetMemoryHardLimit()
virCgroupSetMemorySwapHardLimit()
virCgroupGetStats()

Parameter list supported:
MemoryHardLimits (memory.limit_in_bytes) - Maximum memory
MemorySoftLimits (memory.soft_limit_in_bytes) - Desired memory; soft limits allow you to set a memory limit on contention.
MemoryMinimumGuarantee - Minimum memory required (without this amount of memory, the VM should not be started)
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap
SwapSoftLimits (currently not supported by the kernel) - Desired swap space

We *don't* support SwapSoftLimits in the memory cgroup controller, with no plans to support it in the future either at this point. The semantics are just too hard to get right at the moment.

Tunables memory.limit_in_bytes, memory.soft_limit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.

I am not an expert here, so just listing what new elements need to be added to the XML schema:

<define name="resource">
  <element memory>
    <element memoryHardLimit/>
    <element memorySoftLimit/>
    <element memoryMinGaurantee/>
    <element swapHardLimit/>
    <element swapSoftLimit/>
  </element>
</define>

I'd prefer a syntax that integrates well with what we currently have:

<cgroup>
  <path>...</path>
  <controller>
    <name>..</name>
    <soft limit=".."/>
    <hard limit=".."/>
  </controller>
  ...
</cgroup>

But I am not an XML expert or an expert in designing XML configurations.

Pros:
* Supports all the tunables exported by the kernel
* More tunables can be added as and when required

Cons:
* Code changes would touch various levels
* Might need to redefine (change the scope of) existing memory APIs. Currently, domainSetMemory is used to set limit_in_bytes in LXC and memory ballooning in QEmu, while domainSetMaxMemory is not defined in QEmu and in the case of LXC it sets the internal object's maxmem variable.

Future:
* Later on, CPU/IO/Network controller related tunables can be added/enhanced along with the APIs/XML elements: CPUHardLimit, CPUSoftLimit, CPUShare, CPUPercentage, IO_BW_Softlimit, IO_BW_Hardlimit, IO_BW_percentage
* libvirt-cim support for resource management

Design 2: Reuse the current memory APIs in libvirt
==================================================

Use memory.limit_in_bytes to tweak memory hard limits.

Init - Set memory.limit_in_bytes to maximum mem.

Claiming memory from guest:
a) Reduce balloon size
b) If the guest does not co-operate (how do we know?), reduce memory.limit_in_bytes. This is a policy
Re: [libvirt] [RFC] Memory controller exploitation in libvirt
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-08-24 13:35:10]:

On Tue, 24 Aug 2010 13:05:26 +0530, Balbir Singh bal...@linux.vnet.ibm.com wrote:
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-08-24 11:53:27]:

Subject: [RFC] Memory controller exploitation in libvirt

Corresponding libvirt public API:
int virDomainSetMemoryParameters(virDomainPtr domain, virMemoryParameterPtr params, unsigned int nparams);
int virDomainGetMemoryParameters(virDomainPtr domain, virMemoryParameterPtr params, unsigned int nparams);

Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into:

Yes, it helps: when parsing the parameters from the domain xml file, we can call this API and set them all at once. BTW, it can also be called with one parameter if desired.

virCgroupSetMemory() - already present in src/util/cgroup.c
virCgroupGetMemory() - already present in src/util/cgroup.c
virCgroupSetMemorySoftLimit()
virCgroupSetMemoryHardLimit()
virCgroupSetMemorySwapHardLimit()
virCgroupGetStats()

This is at the cgroup level (internal API) and will be implemented in the way that is suggested. The RFC should not be specific to cgroups. libvirt is supported on multiple OSes and the APIs described in the RFC are public API.

I thought we were talking of cgroups in the QEMU driver for Linux. IMHO the generalization is too big. ESX, for example, already abstracts its WLM/RM needs in its driver.

SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap
SwapSoftLimits (currently not supported by the kernel) - Desired swap space

We *don't* support SwapSoftLimits in the memory cgroup controller, with no plans to support it in the future either at this point. The semantics are just too hard to get right at the moment.

Ok.

Tunables memory.limit_in_bytes, memory.soft_limit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.

I am not an expert here, so just listing what new elements need to be added to the XML schema:

<define name="resource">
  <element memory>
    <element memoryHardLimit/>
    <element memorySoftLimit/>
    <element memoryMinGaurantee/>
    <element swapHardLimit/>
    <element swapSoftLimit/>
  </element>
</define>

I'd prefer a syntax that integrates well with what we currently have:

<cgroup>
  <path>...</path>
  <controller>
    <name>..</name>
    <soft limit=".."/>
    <hard limit=".."/>
  </controller>
  ...
</cgroup>

Again, this is a libvirt domain xml file; IMO, it should not be cgroup specific.

See the comment above.

--
Three Cheers,
Balbir
Re: [libvirt] [RFC] Memory controller exploitation in libvirt
* Daniel P. Berrange berra...@redhat.com [2010-08-24 11:02:44]:

On Tue, Aug 24, 2010 at 01:05:26PM +0530, Balbir Singh wrote:
* Nikunj A. Dadhania nik...@linux.vnet.ibm.com [2010-08-24 11:53:27]:

Subject: [RFC] Memory controller exploitation in libvirt

Memory CGroup is a kernel feature that can be exploited effectively in the current libvirt/qemu driver. Here is a shot at that. At present, QEmu uses the memory ballooning feature, where the memory can be inflated/deflated as and when needed, co-operatively between the host and the guest. There should be some mechanism where the host can have more control over the guest's memory usage. Memory CGroup provides features such as hard-limit and soft-limit for memory, and hard-limit for swap area.

Design 1: Provide new API and XML changes for resource management
=================================================================

Not all memory controller tunables are supported with the current abstractions provided by the libvirt API. libvirt works on various OSes. This new API will support GNU/Linux initially, and as and when other platforms start supporting memory tunables, the interface could be enabled for them.

Adding the following two function pointers to the virDriver interface:
1) domainSetMemoryParameters: which would take one or more name-value pairs. This makes the API extensible and agnostic to the kind of parameters supported by various hypervisors.
2) domainGetMemoryParameters: for getting current memory parameters

Corresponding libvirt public API:
int virDomainSetMemoryParameters(virDomainPtr domain, virMemoryParameterPtr params, unsigned int nparams);
int virDomainGetMemoryParameters(virDomainPtr domain, virMemoryParameterPtr params, unsigned int nparams);

Does nparams imply setting several parameters together? Does bulk loading help? I would prefer splitting out the API if possible into:
virCgroupSetMemory() - already present in src/util/cgroup.c
virCgroupGetMemory() - already present in src/util/cgroup.c
virCgroupSetMemorySoftLimit()
virCgroupSetMemoryHardLimit()
virCgroupSetMemorySwapHardLimit()
virCgroupGetStats()

Nope, we don't want cgroups exposed in the public API, since this has to be applicable to the VMware and OpenVZ drivers too.

I am not talking about exposing these as public API, but making them part of src/util/cgroup.c, utilized by the qemu driver. It is good to abstract out the OS-independent parts, but my concern was double exposure through API like driver->setMemory() that is currently used and the newer API.

Parameter list supported:
MemoryHardLimits (memory.limit_in_bytes) - Maximum memory
MemorySoftLimits (memory.soft_limit_in_bytes) - Desired memory; soft limits allow you to set a memory limit on contention.
MemoryMinimumGuarantee - Minimum memory required (without this amount of memory, the VM should not be started)
SwapHardLimits (memory.memsw_limit_in_bytes) - Maximum swap
SwapSoftLimits (currently not supported by the kernel) - Desired swap space

We *don't* support SwapSoftLimits in the memory cgroup controller, with no plans to support it in the future either at this point. The semantics are just too hard to get right at the moment.

That's not a huge problem. Since we have many hypervisors to support in libvirt, I expect the set of tunables will expand over time, and not every hypervisor driver in libvirt will support every tunable. They'll just pick the tunables that apply to them. We can leave SwapSoftLimits out of the public API until we find a HV that needs it.

Tunables memory.limit_in_bytes, memory.soft_limit_in_bytes and memory.memsw_limit_in_bytes are provided by the memory controller in the Linux kernel.

I am not an expert here, so just listing what new elements need to be added to the XML schema:

<define name="resource">
  <element memory>
    <element memoryHardLimit/>
    <element memorySoftLimit/>
    <element memoryMinGaurantee/>
    <element swapHardLimit/>
    <element swapSoftLimit/>
  </element>
</define>

I'd prefer a syntax that integrates well with what we currently have:

<cgroup>
  <path>...</path>
  <controller>
    <name>..</name>
    <soft limit=".."/>
    <hard limit=".."/>
  </controller>
  ...
</cgroup>

That is exposing far too much info about the cgroups implementation details. The XML representation needs to be decoupled from the implementation
Re: [libvirt] About cgroup mechanism using in libvirt
On Mon, Jun 14, 2010 at 3:10 PM, Daniel P. Berrange berra...@redhat.com wrote:
On Sat, Jun 12, 2010 at 07:23:33AM -0400, Alex Jia wrote:

Hey Daniel,

The cgroup mechanism has been integrated into libvirt for the LXC and QEMU drivers; the LXC driver uses all cgroup controllers except net_cls and cpuset, while the QEMU driver only uses the cpu and devices controllers at present. From the user's point of view, the user can use some virsh commands to control some guest resources:

1. Using the 'virsh schedinfo' command to get/set CPU scheduler priority for a guest

QEMU + LXC use the cpu controller 'cpu_shares' tunable.

2. Using the 'virsh vcpupin' command to control guest vcpu affinity

QEMU pins the process directly, doesn't use cgroups. LXC hasn't implemented this yet.

3. Using the 'virsh setmem' command to change memory allocation
4. Using the 'virsh setmaxmem' command to change the maximum memory limit

QEMU uses the balloon driver. LXC uses the cgroups memory controller.

Not sure if I understand this, but the balloon driver and memory cgroups are not mutually exclusive. One could use both together, and I would certainly like to see additional commands to support cgroups. What happens if a guest (like FreeBSD) does not support ballooning? Are you suggesting we'll not need cgroups at all with QEMU?

5. Using the 'virsh setvcpus' command to change the number of virtual CPUs

QEMU uses cpu hotplug. LXC hasn't implemented this.

I just want to make sure that the above 1 uses the CPU scheduler controller, maybe 4 uses the memory controller, and maybe 5 uses the cpuset controller? I am not sure.

I think we'll need some notion of soft limits as well; not sure if they can be encapsulated using the current set. We need memory shares, for example, to encapsulate them.

Balbir
Re: [libvirt] [PATCH] cgroup: Enable memory.use_hierarchy of cgroup for domain
On Thu, May 6, 2010 at 7:40 PM, Ryota Ozaki ozaki.ry...@gmail.com wrote: Through conversation with Kumar L Srikanth-B22348, I found that the function for getting memory usage (e.g., virsh dominfo) doesn't work for LXC with the ns subsystem of cgroup enabled. This is because of features of the ns and memory subsystems. ns creates a child cgroup on every process fork, and as a result processes in a container are not assigned to the cgroup for the domain (e.g., libvirt/lxc/test1/). For example, libvirt_lxc and init (or whatever is specified in the XML) are assigned to libvirt/lxc/test1/8839/ and libvirt/lxc/test1/8839/8849/, respectively. On the other hand, the memory subsystem by default accounts memory usage only within a group of processes, i.e., it does not take any child (or descendant) groups into account. With these two features combined, virsh dominfo, which just checks the memory usage of the domain's cgroup, always returns zero because that cgroup has no processes. Setting memory.use_hierarchy on a group enables accounting (and limiting) of memory usage for every descendant group of that group. By setting it on the domain's cgroup, we can get proper memory usage for LXC with the ns subsystem enabled. (To be exact, the setting is required only when the memory and ns subsystems are enabled at the same time, e.g., mount -t cgroup none /cgroup.) --- This does sound like a valid use case and the correct fix.
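The accounting difference described above can be sketched with a toy model (the class and structure here are illustrative only, not libvirt code): with use_hierarchy off, a cgroup reports only the usage of tasks attached directly to it; with it on, it also includes all descendant groups, which is why the empty domain-level cgroup reports zero until the flag is set.

```python
class Cgroup:
    """Toy model of cgroup memory accounting (illustrative, not libvirt code)."""
    def __init__(self, name, own_usage=0):
        self.name = name
        self.own_usage = own_usage   # bytes used by tasks attached directly here
        self.children = []
        self.use_hierarchy = False

    def add_child(self, child):
        self.children.append(child)
        return child

    def _subtree_usage(self):
        return self.own_usage + sum(c._subtree_usage() for c in self.children)

    def usage_in_bytes(self):
        # memory.use_hierarchy = 0: only directly attached tasks are accounted.
        # memory.use_hierarchy = 1: all descendant groups are charged here too.
        return self._subtree_usage() if self.use_hierarchy else self.own_usage


# The ns subsystem puts each forked process in its own child group, so the
# domain-level group itself holds no tasks:
domain = Cgroup("libvirt/lxc/test1")
child = domain.add_child(Cgroup("8839", own_usage=4096))
child.add_child(Cgroup("8849", own_usage=8192))

print(domain.usage_in_bytes())   # 0 -- what virsh dominfo saw before the fix
domain.use_hierarchy = True
print(domain.usage_in_bytes())   # 12288 -- descendants accounted with the flag set
```

The example numbers (4096 and 8192 bytes) are arbitrary; the point is that the parent's reported usage flips from zero to the subtree total once the flag is enabled.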
src/util/cgroup.c | 49 ++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/src/util/cgroup.c b/src/util/cgroup.c
index b8b2eb5..93cd6a9 100644
--- a/src/util/cgroup.c
+++ b/src/util/cgroup.c
@@ -443,7 +443,38 @@ static int virCgroupCpuSetInherit(virCgroupPtr parent, virCgroupPtr group)
     return rc;
 }

-static int virCgroupMakeGroup(virCgroupPtr parent, virCgroupPtr group, int create)
+static int virCgroupSetMemoryUseHierarchy(virCgroupPtr group)
+{
+    int rc = 0;
+    unsigned long long value;
+    const char *filename = "memory.use_hierarchy";
+
+    rc = virCgroupGetValueU64(group,
+                              VIR_CGROUP_CONTROLLER_MEMORY,
+                              filename, &value);
+    if (rc != 0) {
+        VIR_ERROR("Failed to read %s/%s (%d)", group->path, filename, rc);
+        return rc;
+    }
+
+    /* Setting twice causes error, so if already enabled, skip setting */
+    if (value == 1)
+        return 0;
+
+    VIR_DEBUG("Setting up %s/%s", group->path, filename);
+    rc = virCgroupSetValueU64(group,
+                              VIR_CGROUP_CONTROLLER_MEMORY,
+                              filename, 1);
+
+    if (rc != 0) {
+        VIR_ERROR("Failed to set %s/%s (%d)", group->path, filename, rc);
+    }
+
+    return rc;
+}
+
+static int virCgroupMakeGroup(virCgroupPtr parent, virCgroupPtr group,
+                              int create, int memory_hierarchy)
 {
     int i;
     int rc = 0;
@@ -477,6 +508,16 @@ static int virCgroupMakeGroup(virCgroupPtr parent, virCgroupPtr group, int creat
                 break;
             }
         }

Can you please add a comment here stating that memory.use_hierarchy should always be set prior to creating subcgroups and attaching tasks?

+        if (memory_hierarchy &&
+            group->controllers[VIR_CGROUP_CONTROLLER_MEMORY].mountPoint != NULL &&
+            (i == VIR_CGROUP_CONTROLLER_MEMORY ||
+             STREQ(group->controllers[i].mountPoint, group->controllers[VIR_CGROUP_CONTROLLER_MEMORY].mountPoint))) {
+            rc = virCgroupSetMemoryUseHierarchy(group);
+            if (rc != 0) {
+                VIR_FREE(path);
+                break;
+            }
+        }
     }

     VIR_FREE(path);
@@ -553,7 +594,7 @@ static int virCgroupAppRoot(int privileged,
     if (rc != 0)
         goto cleanup;

-    rc = virCgroupMakeGroup(rootgrp, *group, create);
+    rc = virCgroupMakeGroup(rootgrp, *group, create, 0);

 cleanup:
     virCgroupFree(&rootgrp);
@@ -653,7 +694,7 @@ int virCgroupForDriver(const char *name,
     VIR_FREE(path);

     if (rc == 0) {
-        rc = virCgroupMakeGroup(rootgrp, *group, create);
+        rc = virCgroupMakeGroup(rootgrp, *group, create, 0);
         if (rc != 0)
             virCgroupFree(group);
     }
@@ -703,7 +744,7 @@ int virCgroupForDomain(virCgroupPtr driver,
     VIR_FREE(path);

     if (rc == 0) {
-        rc = virCgroupMakeGroup(driver, *group, create);
+        rc = virCgroupMakeGroup(driver, *group, create, 1);
         if (rc != 0)
             virCgroupFree(group);
     }

A comment on why domains get hierarchy support and drivers don't will help, unless it is very obvious to developers. Balbir -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH] dont't crash in virsh dominfo domain
On Thu, Mar 18, 2010 at 7:18 PM, Daniel Veillard veill...@redhat.com wrote: On Wed, Mar 17, 2010 at 09:11:07PM +0100, Guido Günther wrote: Hi, virsh dominfo domain crashes with:

#0 strlen () at ../sysdeps/i386/i486/strlen.S:69
#1 0x080891c9 in qemudNodeGetSecurityModel (conn=0x8133940, secmodel=0xb5676ede) at qemu/qemu_driver.c:4911
#2 0xb7eb5623 in virNodeGetSecurityModel (conn=0x8133940, secmodel=0x0) at libvirt.c:5118
#3 0x0806767a in remoteDispatchNodeGetSecurityModel (server=0x811, client=0x8134080, conn=0x8133940, hdr=0x81a8388, rerr=0xb56771d8, args=0xb56771a0, ret=0xb5677144) at remote.c:1306
#4 0x08068acc in remoteDispatchClientCall (server=0x811, client=0x8134080, msg=0x8168378) at dispatch.c:506
#5 0x08068ee3 in remoteDispatchClientRequest (server=0x811, client=0x8134080, msg=0x8168378) at dispatch.c:388
#6 0x0805baba in qemudWorker (data=0x811de2c) at libvirtd.c:1528
#7 0xb7bb8585 in start_thread (arg=0xb5677b70) at pthread_create.c:300
#8 0xb7b3a29e in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

if there's no primary security driver set, since we only initialize secmodel.model and secmodel.doi if we have one. The attached patch checks for primarySecurityDriver instead of securityDriver, since the latter is always set in qemudSecurityInit(). Cheers, -- Guido

From 1d26ec760739b0ea17d1b29730dbdb5632d3565c Mon Sep 17 00:00:00 2001
From: Guido Günther a...@sigxcpu.org
Date: Wed, 17 Mar 2010 21:04:11 +0100
Subject: [PATCH] Don't crash without a security driver

virsh dominfo vm crashes if there's no primary security driver set, since we only initialize secmodel.model and secmodel.doi if we have one. The attached patch checks for securityPrimaryDriver instead of securityDriver, since the latter is always set in qemudSecurityInit().
Closes: http://bugs.debian.org/574359
---
src/qemu/qemu_driver.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 67d9ade..e26c591 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -4956,7 +4956,7 @@ static int qemudNodeGetSecurityModel(virConnectPtr conn,
     int ret = 0;

     qemuDriverLock(driver);
-    if (!driver->securityDriver) {
+    if (!driver->securityPrimaryDriver) {
         memset(secmodel, 0, sizeof (*secmodel));
         goto cleanup;
     }
--

I've seen this issue too... I can confirm that this patch fixes the issue. Balbir -- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] kernel summit topic - 'containers end-game'
* Serge E. Hallyn se...@us.ibm.com [2009-06-30 15:06:13]: Quoting Balbir Singh (bal...@linux.vnet.ibm.com): On Tue, Jun 23, 2009 at 8:26 PM, Serge E. Hallynse...@us.ibm.com wrote: A topic on ksummit agenda is 'containers end-game and how do we get there'. So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable? More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include: 1. resource control We intend to hold a io-controller minisummit before KS, we should have updates on that front. We also need to discuss CPU hard limits and Memory soft limits. We need control for memory large page, mlock, OOM notification support, shared page accounting, etc. Eventually on the libvirt front, we want to isolate cgroup and lxc support into individual components (long term) Thanks, Balbir. By the last sentence, are you talking about having cgroup in its own libcgroup, or do you mean something else? On the topic of cgroups, does anyone not agree that we should try to get rid of the ns cgroup, at least once user namespaces can prevent root in a container from escaping their cgroup? I would have no objections to trying to obsolete ns cgroup once user namespaces can do what you suggest. -- Balbir -- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] kernel summit topic - 'containers end-game'
On Tue, Jun 23, 2009 at 8:26 PM, Serge E. Hallynse...@us.ibm.com wrote: A topic on ksummit agenda is 'containers end-game and how do we get there'. So for starters, looking just at application (and system) containers, what do the libvirt and liblxc projects want to see in kernel support that is currently missing? Are there specific things that should be done soon to make containers more useful and usable? More generally, the topic raises the question... what 'end-games' are there? A few I can think of off-hand include: 1. resource control We intend to hold a io-controller minisummit before KS, we should have updates on that front. We also need to discuss CPU hard limits and Memory soft limits. We need control for memory large page, mlock, OOM notification support, shared page accounting, etc. Eventually on the libvirt front, we want to isolate cgroup and lxc support into individual components (long term) 2. lightweight virtual servers 3. (or 2.5) unprivileged containers/jail-on-steroids (lightweight virtual servers in which you might, just maybe, almost, be able to give away a root account, at least as much as you could do so with a kvm/qemu/xen partition) 4. checkpoint, restart, and migration For each end-game, what kernel pieces do we think are missing? For instance, people seem agreed that resource control needs io control :) Containers imo need a user namespace. I think there are quite a few network namespace exploiters who require sysfs directory tagging (or some equivalent) to allow us to migrate physical devices into network namespaces. And checkpoint/restart needs... checkpoint/restart. Balbir Singh -- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 1 of 2] Add internal cgroup manipulation functions
Dan Smith wrote: This patch adds src/cgroup.{c,h} with support for creating and manipulating cgroups. It's quite naive at the moment, but should provide something to work with to move forward with resource controls. All groups created with the internal API are forced under $mount/libvirt/ to keep everything together. The first time a group is created, the libvirt directory is also created, and the settings from the root are inherited. The code scans the mount table to look for the first mount of type cgroup, and assumes that all controllers are mounted there. I think this could/should be updated to prefer a mount with just the controller(s) we want, if there are multiple ones. If you have the cpuset controller enabled, and cpuset.cpus_exclusive is 1, then creation of any cgroup will fail. Since we probably shouldn't blindly set the root to be non-exclusive, we may also want to consider this condition to mean no cgroup support.

diff -r 444e2614d0a2 -r 8e948eb88328 src/Makefile.am
--- a/src/Makefile.am Wed Sep 17 16:07:03 2008 +
+++ b/src/Makefile.am Mon Sep 29 09:37:42 2008 -0700
@@ -96,7 +96,8 @@
     lxc_conf.c lxc_conf.h \
     lxc_container.c lxc_container.h \
     lxc_controller.c \
-    veth.c veth.h
+    veth.c veth.h \
+    cgroup.c cgroup.h

 OPENVZ_DRIVER_SOURCES = \
     openvz_conf.c openvz_conf.h \

diff -r 444e2614d0a2 -r 8e948eb88328 src/cgroup.c
--- /dev/null Thu Jan 01 00:00:00 1970 +
+++ b/src/cgroup.c Mon Sep 29 09:37:42 2008 -0700
@@ -0,0 +1,526 @@
+/*
+ * cgroup.c: Tools for managing cgroups
+ *
+ * Copyright IBM Corp. 2008
+ *
+ * See COPYING.LIB for the License of this software
+ *
+ * Authors:
+ *  Dan Smith [EMAIL PROTECTED]
+ */
+#include <config.h>
+
+#include <stdio.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <mntent.h>
+#include <fcntl.h>
+#include <string.h>
+#include <errno.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <libgen.h>
+
+#include "internal.h"
+#include "util.h"
+#include "cgroup.h"
+
+#define DEBUG(fmt, ...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__)
+#define DEBUG0(msg) VIR_DEBUG(__FILE__, "%s", msg)
+
+struct virCgroup {
+    char *path;
+};
+

There is no support for permissions; is everything run as root?

+void virCgroupFree(virCgroupPtr *group)
+{
+    if (*group != NULL) {
+        free((*group)->path);
+        free(*group);
+        *group = NULL;
+    }
+}
+
+static virCgroupPtr cgroup_get_mount(void)
+{
+    FILE *mounts;
+    struct mntent entry;
+    char buf[512];

Is 512 arbitrary? How do we know it is going to be sufficient?

+    virCgroupPtr root = NULL;
+
+    root = calloc(1, sizeof(*root));
+    if (root == NULL)
+        return NULL;
+
+    mounts = fopen("/proc/mounts", "r");
+    if (mounts == NULL) {
+        DEBUG0("Unable to open /proc/mounts: %m");
+        goto err;
+    }
+
+    while (getmntent_r(mounts, &entry, buf, sizeof(buf)) != NULL) {
+        if (STREQ(entry.mnt_type, "cgroup")) {
+            root->path = strdup(entry.mnt_dir);
+            break;
+        }
+    }
+
+    if (root->path == NULL) {
+        DEBUG0("Did not find cgroup mount");

Or strdup failed due to ENOMEM.

+        goto err;
+    }
+
+    fclose(mounts);
+
+    return root;
+err:
+    virCgroupFree(&root);
+
+    return NULL;
+}
+
+int virCgroupHaveSupport(void)
+{
+    virCgroupPtr root;
+
+    root = cgroup_get_mount();
+    if (root == NULL)
+        return -1;
+
+    virCgroupFree(&root);
+

This is quite a horrible way of wasting computation.

+    return 0;
+}
+
+static int cgroup_path_of(const char *grppath,
+                          const char *key,
+                          char **path)
+{
+    virCgroupPtr root;
+    int rc = 0;
+
+    root = cgroup_get_mount();

So every routine calls cgroup_path_of(), which reads the mounts table, finds the entry for cgroup and returns it; why not do it just once and reuse the result?

+    if (root == NULL) {
+        rc = -ENOTDIR;
+        goto out;
+    }
+
+    if (asprintf(path, "%s/%s/%s", root->path, grppath, key) == -1)
+        rc = -ENOMEM;
+out:
+    virCgroupFree(&root);
+
+    return rc;
+}
+
+int virCgroupSetValueStr(virCgroupPtr group,
+                         const char *key,
+                         const char *value)
+{
+    int fd = -1;
+    int rc = 0;
+    char *keypath = NULL;
+
+    rc = cgroup_path_of(group->path, key, &keypath);
+    if (rc != 0)
+        return rc;
+
+    fd = open(keypath, O_WRONLY);

I see a mix of open and fopen calls. I would prefer to stick to just one; it helps with readability.

+    if (fd < 0) {
+        DEBUG("Unable to open %s: %m", keypath);
+        rc
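The mount-table scan at the heart of cgroup_get_mount() — and the reviewers' point that different controllers may live at different mount points — can be sketched outside of C. This toy version (hypothetical helper, for illustration only) parses /proc/mounts-style text and maps each cgroup controller to its mount point, rather than taking only the first cgroup mount:

```python
def cgroup_mounts(mounts_text):
    """Parse /proc/mounts-style lines and map each cgroup controller to
    the mount point where it is available (toy version for illustration)."""
    known = {"cpu", "cpuacct", "cpuset", "memory", "devices", "freezer", "net_cls"}
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts format: device, mount point, fstype, options, ...
        if len(fields) < 4 or fields[2] != "cgroup":
            continue
        mnt_dir, opts = fields[1], fields[3]
        for opt in opts.split(","):
            # Mount options name the attached controllers (rw, noexec, etc. skipped).
            if opt in known:
                result[opt] = mnt_dir
    return result

sample = """\
none /cgroup/memory cgroup rw,memory 0 0
none /cgroup/cpu cgroup rw,cpu,cpuacct 0 0
"""
print(cgroup_mounts(sample))
```

Taking only the first `cgroup` entry, as the patch does, would miss the memory controller entirely in the sample layout above.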
[libvirt] [discuss] The new cgroup patches for libvirt
Hi, Everyone, I've seen a new set of patches from Dan Smith which implement cgroup support for libvirt. While the patches seem simple, there are some issues that have been pointed out in the posting itself. I hope that libvirt will switch over (maybe after your concerns are addressed, and definitely in the longer run) to using libcgroups rather than having an internal implementation of cgroups. The advantage of switching over would be using the functionality that libcgroups (libcg.sf.net) already provides:

1. The ability to configure and mount cgroups and controllers via initscripts and a configuration file
2. An API to control and read cgroups information
3. Thread safety around API calls
4. Daemons to automatically classify a task based on a certain set of rules
5. An API to extract the current cgroup classification (where the task currently is in the cgroup hierarchy)

While re-implementing might sound like a cool thing to do, here are the drawbacks:

1. It leads to code duplication and reduces code reuse
2. It leads to confused users

I understand that in the past there has been a perception that libcgroups might not yet be ready, because we did not have ABI stability built into the library and the header file had old comments about things changing. I would urge the group to look at the current implementation of libcgroups (look at v0.32) and help us

1. Fix any issues you see, or point them out to us
2. Add new API, or request new API, that can help us integrate better with libvirt

-- Balbir -- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] Re: [discuss] The new cgroup patches for libvirt
On Fri, Oct 3, 2008 at 11:43 PM, Daniel P. Berrange [EMAIL PROTECTED] wrote: On Fri, Oct 03, 2008 at 09:31:52PM +0530, Balbir Singh wrote: I understand that in the past there has been a perception that libcgroups might not yet be ready, because we did not have ABI stability built into the library and the header file had old comments about things changing. I would urge the group to look at the current implementation of libcgroups (look at v0.32) and help us 1. Fix any issues you see or point them to us 2. Add new API or request for new API that can help us integrate better with libvirt

To expand on what I said in my other mail about providing value-add over the representation exposed by the kernel, here are some thoughts on the API exposed. Consider the following high level use case of libvirt: a set of groups in a 3-level hierarchy APPNAME/DRIVER/DOMAIN, controlling the ACL for block/char devices, and controlling memory limits. This translates into an underlying implementation in which I need to create 3 levels of cgroups in the filesystem, attach my PIDs at the 3rd level, use the memory and device controllers, and set values for the attributes exposed by the controllers. Notice I'm not actually setting any config params at the 1st and 2nd levels, but they do need to still exist to ensure namespace uniqueness amongst different applications using cgroups. The current cgroups API provides calls that directly map to individual actions wrt the kernel filesystem exposed. So as an application developer I have to explicitly create the 3 levels of hierarchy, tell it I want to use the memory and device controllers, format config values into the syntax required for each attribute, and remember the attribute names.
// Create the hierarchy APPNAME/DRIVER/DOMAIN
c1 = cgroup_new_cgroup("libvirt");
c2 = cgroup_new_cgroup_parent(c1, "lxc");
c3 = cgroup_new_cgroup_parent(c2, domain.name);

// Set up the controllers I want to use
cgroup_add_controller(c3, "devices");
cgroup_add_controller(c3, "memory");

// Add my domain's PID to the cgroup
cgroup_attach_task(c3, domain.pid);

// Set the device ACL limits
cgroup_set_value_string(c3, "devices.deny", "a");
char buf[1024];
sprintf(buf, "%c %d:%d", 'c', 1, 3);
cgroup_set_value_string(c3, "devices.allow", buf);

// Set memory limit
cgroup_set_value_uint64(c3, "memory.limit_in_bytes", domain.memory * 1024);

This really isn't providing any semantically useful abstraction over the direct filesystem manipulation. It is just a bunch of wrappers for mkdir(), mount() and read()/write() calls. My application still has to know far too much about the details of cgroups as exposed by the kernel.

True, it definitely does, and the way I look at APIs is that they are layers. We've built the first layer, which abstracts permissions, paths and strings into a set of useful APIs. The second layer does the things that you describe; the question then is why don't we have it yet? Let me try to answer that question: 1. We've been building configuration, classification and the low-level plumbing 2. We've been planning to build exactly what you describe (we call that the pluggable architecture, where controllers plug in their logic and provide the abstractions you need), but we haven't gotten there yet. When you announced cgroup support in libvirt, it was definitely going to be a user, and we hoped that you would come to us with the exact requirements that you've mentioned now (believe me, your feedback is very useful). The question then to ask is: is it cheaper for you to build these abstractions into libvirt, or to have helped us, or asked us, to do so? We would have gladly obliged.
You might say that the onus is on the maintainers to do the right thing without feedback, but I would beg to differ. What you've asked for I consider a layer on top of the API we have now, and it should be easy to build.

I do not care that there is a concept of 'controllers' at all; I just want to set device ACLs and memory limits. I do not care what the attributes in the filesystem are called; again, I just want to set device ACLs and memory limits. I do not care what the data format for device/memory settings must be. Memory settings could be stored in base-2, base-10 or base-16; I should not have to know this. With this style of API, the library provides no real value-add or compelling reason to use it. What might a more useful API look like? At least from my point of view, I'd like to be able to say:

// Tell it I want $PID placed in APPNAME/DRIVER/DOMAIN
char *path[] = { "libvirt", "lxc", domain.name };
cg = cgroup_new_path(path, domain.pid);

// I want to deny all devices
cgroup_deny_all_devices(cg);

// Allow /dev/null - either by node/major/minor
cgroup_allow_device_node(cg, 'c', 1, 3);

// Or more conveniently just give it a node to copy info
Re: [libvirt] Re: [discuss] The new cgroup patches for libvirt
On Sat, Oct 4, 2008 at 1:17 AM, Daniel P. Berrange [EMAIL PROTECTED] wrote: On Sat, Oct 04, 2008 at 12:13:38AM +0530, Balbir Singh wrote: On Fri, Oct 3, 2008 at 11:43 PM, Daniel P. Berrange [EMAIL PROTECTED] wrote: True, it definitely does and the way I look at APIs is that they are layers. We've built the first layer that abstracts permissions, paths and strings into a set of useful API. The second layer does things that you say, the question then is why don't we have it yet? Let me try and answer that question 1. We've been trying to build configuration, classification and the low level plumbing 2. We've been planning to build the exact same thing that you say, we call that the pluggable architecture, where controller plug in their logic and provide the abstractions you need, but not gotten there yet. When you announced cgroup support in libvirt, it was definitely going to be a user and we hoped that you would come to us with your exact requirements that you've mentioned now (believe me, your feedback is very useful). The question then to ask is, is it cheaper for you to build these abstractions into libvirt or either helped us or asked us to do so, we would have gladly obliged. You might say that the onus is on the maintainers to do the right thing without feedback, but I would beg to differ. The thing I didn't mention, is that until Dan posted his current patches actually implementing the cgroups stuff in LXC driver, I didn't have a good picture of what the ideal higher level interface would look like. If you try and imagine high level APIs, without having an app actually using them, its all too easy to design something that turns out to not be useful. So while I know the low level cgroups API isn't what we need, it needs the current proof of concept in the libvirt LXC driver to discover what is an effective approach for libcgroups. I suspect our code will evolve further as we learn from what we've got now. 
By doing this entirely within libvirt we can experiment with effective implementation strategies without having to lock down a formally supported API immediately. Once things settle down, it'll be easier for libcgroups to see exactly what is important for a high level API and thus make one that's useful to more apps in the long term.

Please remember my words: if you ever find that you have a code base that looks like what we have in libcgroups, please remember to switch over to libcgroup. I fear that you will reach that stage. The code that is going in right now has too many things hard-coded and will need a lot of changes going forward: things like adding support for new controllers are not going to be straightforward, your assumption that only root can create a container might be broken, and we'll build support for hierarchies, which will require further changes, etc. I am not trying to scare you, just trying to make sure we don't solve the same problems twice. Balbir -- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] Re: [discuss] The new cgroup patches for libvirt
The thing I didn't mention is that until Dan posted his current patches actually implementing the cgroups stuff in the LXC driver, I didn't have a good picture of what the ideal higher level interface would look like. If you try to imagine high level APIs without having an app actually using them, it's all too easy to design something that turns out to not be useful. So while I know the low level cgroups API isn't what we need, it takes the current proof of concept in the libvirt LXC driver to discover what an effective approach for libcgroups would be. I suspect our code will evolve further as we learn from what we've got now. By doing this entirely within libvirt we can experiment with effective implementation strategies without having to lock down a formally supported API immediately. Once things settle down, it'll be easier for libcgroups to see exactly what is important for a high level API and thus make one that's useful to more apps in the long term.

Agreed, the libvirt changes for cgroups have shown us a useful layer to build. We'll keep on top of it and try to build something that everyone can use. Balbir -- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
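The layering argument running through this thread can be sketched as two tiers: a thin low-level tier that only knows controller attribute keys and string values, and a semantic tier on top that hides controller names, attribute names, and value formats. All class and method names below are hypothetical, not the real libcgroup or libvirt API; the backing store is an in-memory dict standing in for the cgroup filesystem:

```python
class RawCgroup:
    """Low-level tier: bare key/value access to controller attributes
    (in-memory stand-in for the cgroup filesystem; hypothetical API)."""
    def __init__(self, path):
        self.path = path
        self.values = {}

    def set_value(self, key, value):
        self.values[key] = str(value)


class DomainCgroup(RawCgroup):
    """Semantic tier: callers say what they want (deny devices, cap
    memory) without knowing attribute names or value syntax."""
    def deny_all_devices(self):
        self.set_value("devices.deny", "a")

    def allow_device_node(self, dev_type, major, minor):
        self.set_value("devices.allow", "%c %d:%d" % (dev_type, major, minor))

    def set_memory_limit_kb(self, kb):
        self.set_value("memory.limit_in_bytes", kb * 1024)


cg = DomainCgroup("libvirt/lxc/mydomain")
cg.deny_all_devices()
cg.allow_device_node("c", 1, 3)      # /dev/null
cg.set_memory_limit_kb(512 * 1024)   # 512 MiB
print(cg.values["devices.allow"])    # c 1:3
```

The point of the sketch is Daniel's: the semantic tier is where the value-add lives, while the low-level tier is just wrappers around reads and writes.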
Re: [libvirt] [PATCH 0 of 2] [RFC] Add cgroup manipulation and LXC driver support
Daniel P. Berrange wrote: On Wed, Oct 01, 2008 at 08:41:19AM +0530, Balbir Singh wrote: Dan Smith wrote: DB At the same time having the controllers mounted is mandatory for DB libvirt to work and asking the admin to set things up manually DB also sucks. So perhaps we'll need to mount them automatically, but DB make this behaviour configurable in some way, so the admin can DB override it Perhaps we can: - Have a list of controllers we use (memory and devices so far) - Create each group in all mounts required to satisfy our necessary controllers - Select the appropriate mount when setting a cont.key value

I am not sure how libvirt provides thread safety, but I did not see any explicit coding for that? The thread safety model for libvirt has two levels: - A single virConnectPtr object must only be used by one thread. If you have multiple threads, you must provide each with its own connect object - Within a stateless driver (Xen, OpenVZ, Test), there is no shared state between virConnectPtr objects, so there are no thread issues in this respect - With a stateful driver, the libvirtd daemon ensures that only a single thread is active at once, so again there are no thread issues there either.

Now, in a short while I will be making the daemon fully multithreaded. When this happens, the stateful drivers will be required to maintain mutexes for locking. The locking model will have 2 levels: one lock over the driver as a whole, held only while acquiring a lock against the object being modified (eg the virtual domain object). Each virtual domain lives in one cgroup, so there is a single virCGroup object associated with each domain. The virCGroup object state is self-contained, so independent virCGroup objects can be accessed concurrently from multiple threads without any thread-safety issues.

Thanks, that was quite insightful. -- Balbir -- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
[libvirt] Re: [discuss] The new cgroup patches for libvirt
Daniel P. Berrange wrote: On Fri, Oct 03, 2008 at 09:31:52PM +0530, Balbir Singh wrote: Hi, Everyone, I've seen a new set of patches from Dan Smith, which implement cgroup support for libvirt. While the patches seem simple, there are some issues that have been pointed out in the posting itself. I hope that libvirt will switch over (maybe after your concerns are addressed and definitely in the longer run) to using libcgroups rather than having an internal implementation of cgroups. The advantages of switching over would be using the functionality that libcgroups (libcg.sf.net) already provides: 1. Ability to configure and mount cgroups and controllers via initscripts and a configuration file 2. An API to control and read cgroups information 3. Thread safety around API calls 4. Daemons to automatically classify a task based on a certain set of rules 5. API to extract current cgroup classification (where the task currently is in the cgroup hierarchy)

So from a functional point of view you are addressing essentially three use cases: 1. System configuration for controllers 2. Automatic task classification 3. Application development API for creating groups. If each piece is correctly designed, the choice of implementation for each of these can be, and in some cases must be, totally independent. Since the kernel restricts a single controller to being attached to only one cgroupfs mount point, and once attached this cannot be changed, the choice of how and where to mount controllers must remain outside the scope of applications. If any application using cgroups were to specify mount points, it would be inflicting its own requirements on every user of cgroups. This implies that applications must be designed to work with whatever controller mount configuration the admin has set up, and not configure this themselves. So the implementation of point 1 (configuration) must, by necessity, be completely independent of the implementation of point 3 (application API).
Considering automatic task classification: the task classification engine must be able to cope with the fact that applications have some functional requirements on cgroups setup. Taking libvirt as an example, we have a specific need to apply some controllers over a group of processes forming a container. A task classification engine must not re-classify individual tasks within a container, because that would conflict with the semantics required by libvirt. It is, however, free to re-classify the libvirtd daemon itself: whatever cgroup libvirtd is placed in, it will create the LXC cgroups below this point. So if libvirt is designed correctly, it will work with whatever cgroup task classification engine might be running. Similarly, if the task classification engine has been designed to co-operate with applications, there is no problem running it alongside libvirt. Thus the implementations of point 2 (task classification) and point 3 (application API) have no need to be formally tied together. Furthermore, tying them together does not magically solve the problem that both applications and the cgroups task classification engine need to be intelligently designed to co-operate. Agreed!

While re-implementing might sound like a cool thing to do, here are the drawbacks: 1. It leads to code duplication and reduces code reuse. This is important if the library code is providing significant value-add to the application using it. As it stands, libcgroup is merely a direct interface to the cgroups filesystem providing weakly typed setters and getters; with the exception of looking at the mount table to find where a controller lives, this is not hard or complex code, so the benefits of re-use are not particularly high. Please see my earlier email on the layering of APIs. In such a scenario, reducing code duplication is not in itself a benefit, since there are costs associated with using external libraries.
It is more complicated to integrate two independent styles of API, particularly with different views on error reporting and memory management, and varying expectations for the semantic models exposed. I disagree. I see a lot of code that does the same thing: looking through /proc/mounts, and reading and parsing values to write and read. I see two APIs you've built on top of what libcgroup has (one for setting the memory limit and the other for devices). Please compare the patch sizes as well and you'll see what I mean. There are a number of 'hard' questions wrt cgroups usage by applications, two of which are outlined above. Simply having all applications use a single API cannot magically solve any of these problems; no matter what API is used, application developers need to take care to design their usage of cgroups such that it 'plays nicely' with other applications. Playing nicely is a definite requirement, but not using existing code or contributing to it if something is broken and re
Re: [libvirt] [PATCH 0 of 2] [RFC] Add cgroup manipulation and LXC driver support
Daniel P. Berrange wrote: On Tue, Sep 30, 2008 at 11:11:57AM -0700, Dan Smith wrote:

BS For all practical purposes, it is not possible to mount all
BS controllers at the same place. Consider a simple case of ns: if
BS the ns controller is mounted, you need root permissions to create
BS new groups, which defeats the whole purpose of the cgroup
BS filesystem and assigning permissions, so that an application can
BS create groups on its own.

I don't think I'd go so far as saying that it defeats the whole purpose, but I understand your point. After just a small amount of playing around, it seems like it might be reasonable to just mount the controllers we care about somewhere just for libvirt.

- What to do if memory and device controllers aren't present
- What to do if the root group is set for exclusive cpuset behavior

BS These need to be fixed as well.

...that's why I pointed them out :) I'm thinking that mounting the controllers we care about at daemon startup (as mentioned above) would solve both of these issues as well. Does anyone have an opinion on taking that approach?

The trouble is then libvirt would be dictating policy to the host admin, because once you mount a particular controller, you can't change the way it is mounted. So if libvirt mounted each controller separately, then the admin couldn't have a mount with multiple controllers active, and vice versa. The kernel cgroups interface really sucks in this regard :-( At the same time, having the controllers mounted is mandatory for libvirt to work, and asking the admin to set things up manually also sucks. So perhaps we'll need to mount them automatically, but make this behaviour configurable in some way, so the admin can override it.

As I mentioned in my previous email, one could use cgconfigparser to automatically mount the controllers at initscripts time and then also use a policy to automatically classify tasks.
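For reference, the cgconfigparser approach Balbir mentions is driven by a config file (conventionally /etc/cgconfig.conf). A sketch only - mount points, the group name, and the limit value below are illustrative, not something libvirt prescribes:

```
mount {
    memory  = /cgroup/memory;
    devices = /cgroup/devices;
}

group libvirt {
    memory {
        memory.limit_in_bytes = 512M;
    }
}
```

Because the admin owns this file, it sidesteps the policy problem: libvirt can consume whatever layout the config establishes rather than dictating one by mounting controllers itself.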
-- Balbir -- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH 0 of 2] [RFC] Add cgroup manipulation and LXC driver support
Dan Smith wrote:

DB The trouble is then libvirt would be dictating policy to the host
DB admin, because once you mount a particular controller, you can't
DB change the way it is mounted. So if libvirt mounted each controller
DB separately, then the admin couldn't have a mount with multiple
DB controllers active, and vice versa.

Oh, I see. I had left that out of my quick test. I had assumed that it would behave as you would expect.

DB The kernel cgroups interface really sucks in this regard :-(

I was going to go with "surprisingly unideal" ...but yeah.

The interface was designed to allow the flexibility of separating controllers. One might need different resources for different tasks, and they should not be forced to share the same set of controllers. Cgroups has the notion of busy (as in, no new groups created underneath), so a hierarchy needs to be not busy before the way it is mounted can be changed. This has made our life while working on libcgroup very hard. The other thing that gets hard is controller interplay and rules. CPUsets, for example, has rules about not allowing tasks to attach without adding cpus and mems, and other rules about exclusivity and having certain files just in the root.

DB At the same time having the controllers mounted is mandatory for
DB libvirt to work and asking the admin to set things up manually
DB also sucks. So perhaps we'll need to mount them automatically, but
DB make this behaviour configurable in some way, so admin can
DB override it

Perhaps we can:
- Have a list of controllers we use (memory and devices so far)
- Create each group in all mounts required to satisfy our necessary controllers
- Select the appropriate mount when setting a cont.key value

I am not sure how libvirt provides thread safety, but I did not see any explicit coding for that? It will muck things up a bit, but I think it might be doable. I would really recommend looking at libcgroup in the long run and using it.
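Dan's three bullets could be sketched roughly as follows (class and method names are hypothetical; the test substitutes temporary directories for real controller mounts):

```python
import os

class CgroupManager:
    """Sketch of the proposal: know which controllers we need, create each
    group in every mount backing them, and route key writes to the right
    mount based on the controller prefix of the key."""

    controllers = ("memory", "devices")

    def __init__(self, controller_mounts):
        # controller_mounts maps controller name -> mount point directory
        missing = [c for c in self.controllers if c not in controller_mounts]
        if missing:
            raise RuntimeError("controllers not mounted: " + ", ".join(missing))
        self.mounts = controller_mounts

    def create_group(self, name):
        # Co-mounted controllers share a hierarchy, hence the set()
        for mount in set(self.mounts.values()):
            os.makedirs(os.path.join(mount, name), exist_ok=True)

    def set_value(self, name, key, value):
        # "memory.limit_in_bytes" -> controller "memory" selects the mount
        controller = key.split(".", 1)[0]
        with open(os.path.join(self.mounts[controller], name, key), "w") as f:
            f.write(str(value))
```

The constructor check also answers "what to do if memory and device controllers aren't present": fail early with a clear error rather than at the first write.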
-- Balbir
Re: [libvirt] [PATCH 0 of 2] [RFC] Add cgroup manipulation and LXC driver support
Dan Smith wrote:

This patch set adds basic cgroup support to the LXC driver. It consists of a small internal cgroup manipulation API, as well as changes to the driver itself to utilize the support. Currently, we just set a memory limit and the allowed devices list. The cgroup.{c,h} interface can be easily redirected to libcgroup in the future if and when the decision to move in that direction is made. Some discussion on the following points is probably warranted, to help determine how deep we want to go with this internal implementation, in terms of supporting complex system configurations, etc.

- What to do if controllers are mounted in multiple places

For all practical purposes, it is not possible to mount all controllers at the same place. Consider a simple case of ns: if the ns controller is mounted, you need root permissions to create new groups, which defeats the whole purpose of the cgroup filesystem and assigning permissions, so that an application can create groups on its own.

- What to do if memory and device controllers aren't present
- What to do if the root group is set for exclusive cpuset behavior

These need to be fixed as well.

-- Balbir
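The two operations the patch performs (memory limit and allowed-devices list) map to simple writes under the container's cgroup directory. A hedged sketch, not the patch's actual code - the function name and device specs are illustrative; the devices-cgroup format is "type major:minor perms", and on a real kernel each write to devices.allow/deny is a separate command rather than file content:

```python
import os

def apply_container_limits(group_dir, mem_bytes, allowed_devices):
    """Set a memory cap and an allowed-devices list for one container group."""
    with open(os.path.join(group_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_bytes))
    # Deny all devices first, then re-allow only the listed ones.
    with open(os.path.join(group_dir, "devices.deny"), "w") as f:
        f.write("a")
    for spec in allowed_devices:  # e.g. "c 136:* rwm" for /dev/pts devices
        with open(os.path.join(group_dir, "devices.allow"), "a") as f:
            f.write(spec + "\n")
```

The deny-all-then-allow ordering matters: it makes the allowlist authoritative regardless of what the group inherited from its parent.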
Re: [Libvir] [RFC] Add Container support to libvirt
* Daniel P. Berrange [EMAIL PROTECTED] [2008-01-15 15:52:13]: On Tue, Jan 15, 2008 at 12:26:43AM -0800, Dave Leskovec wrote:

Greetings, Following up on the XML format for the Linux Container support I proposed... I've made the following recommended changes:
* Changed mount tags
* Changed nameserver tag to be consistent with gateway
* Moved cpushare and memory tags outside container tag

This is the updated format:

<domain type='linuxcontainer'>
  <name>Container123</name>
  <uuid>8dfd44b31e76d8d335150a2d98211ea0</uuid>
  <container>
    <filesystem>
      <mount>
        <source dir='/home/user/lxc_files/etc/'/>
        <target dir='/etc/'/>
      </mount>
      <mount>
        <source dir='/home/user/lxc_files/var/'/>
        <target dir='/var/'/>
      </mount>
    </filesystem>

Comparing this to the Linux-VServer XML that Daniel posted, you're both pretty much representing the same concepts, so we need to make a decision about which format to use for filesystem mounts. OpenVZ also provides a /domain/container/filesystem tag, though it uses a concept of filesystem templates auto-cloned per container rather than explicit mounts. I think I'd like to see:

<filesystem type='mount'>
  <source dir='/home/user/lxc_files/etc/'/>
  <target dir='/etc/'/>
</filesystem>

For the existing OpenVZ XML, we can augment their filesystem tag with an attribute type='template'.

    <application>/usr/sbin/container_init</application>
    <network hostname='browndog'>
      <ip address='192.168.1.110' netmask='255.255.255.0'>
        <gateway address='192.168.1.1'/>
        <nameserver address='192.168.1.1'/>
      </ip>
    </network>

Again this is pretty similar to the needs of VServer / OpenVZ. In the existing OpenVZ XML, the gateway and nameserver tags are immediately within the network tag, rather than nested inside the ip tag. Aside from that it looks to be a consistent set of information.

  </container>
  <cpushare>40</cpushare>

As Daniel points out, we've thus far explicitly excluded tuning info from the XML. Not that I have any suggestion on where else to put it at this time. This is a minor thing though, easily implemented once we come to a decision.
At some point, we'll need resource management extensions to libvirt. VServer and OpenVZ both use them, and it will be useful for containers and kvm/qemu as well. I think we'll need a resource management feature extension to the XML format. Currently resource management is provided through control groups (I can send out links if desired). Ideally, once configured, the control groups should be persistent (visible across reboots, so we need to save state). Thoughts?

  <memory>65536</memory>
  <devices>
    <console tty='/dev/pts/4'/>
  </devices>
</domain>

Does this look ok now? All comments and questions are welcome.

Pretty close. Dan.

-- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL