On Mon, Aug 31, 2015 at 5:42 PM, 'Klaus Aehlig' via ganeti-devel <[email protected]> wrote:
> While Ganeti traditionally is very careful about making
> sure no instance every runs out of memory. However, there
> are situations where it can be desirable to accept parts of
> the memory to be swapped out, e.g., if, from the use case,
> it is known that always some machines will have memory not
> actively accessed. Add a design for this situation that by
> default preserves the current state of affairs.
>
> Signed-off-by: Klaus Aehlig <[email protected]>
> ---
>  Makefile.am                           |   1 +
>  doc/design-draft.rst                  |   1 +
>  doc/design-memory-over-commitment.rst | 176 ++++++++++++++++++++++++++++++++++
>  3 files changed, 178 insertions(+)
>  create mode 100644 doc/design-memory-over-commitment.rst
>
> diff --git a/Makefile.am b/Makefile.am
> index acb80c3..e0595ba 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -694,6 +694,7 @@ docinput = \
>   doc/design-location.rst \
>   doc/design-linuxha.rst \
>   doc/design-lu-generated-jobs.rst \
> + doc/design-memory-over-commitment.rst \
>   doc/design-migration-speed-hbal.rst \
>   doc/design-monitoring-agent.rst \
>   doc/design-move-instance-improvements.rst \
> diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> index b02937a..353f0cd 100644
> --- a/doc/design-draft.rst
> +++ b/doc/design-draft.rst
> @@ -26,6 +26,7 @@ Design document drafts
>     design-multi-storage-htools.rst
>     design-repaird.rst
>     design-migration-speed-hbal.rst
> +   design-memory-over-commitment.rst
>
>  .. vim: set textwidth=72 :
>  .. Local Variables:
> diff --git a/doc/design-memory-over-commitment.rst b/doc/design-memory-over-commitment.rst
> new file mode 100644
> index 0000000..2a84085
> --- /dev/null
> +++ b/doc/design-memory-over-commitment.rst
> @@ -0,0 +1,176 @@
> +======================
> +Memory Over Commitment
> +======================
> +
> +.. contents:: :depth: 4
> +
> +This document describes the proposed changes to support memory
> +overcommitment in Ganeti.
> +
> +Background
> +==========
> +
> +Memory is a non-preemptable resource, thus cannot be shared, e.g.,

s/thus/and thus/

> +in a round-robin fashion. Therefore, Ganeti is very careful to make
> +sure, there is always enough physical memory for the memory promised

s/,//

> +to the instances. In fact, even in an N+1 redundant way: should one
> +node fail, its instances can be relocated to other nodes while still
> +having enough physical memory for the memory promised to all instances.
> +
> +Overview over the current memory model
> +--------------------------------------
> +
> +To make decissions, ``htools`` query the following parameter from Ganeti.

s/ss/s/, s/parameter/parameters/

> +
> +- The amount of memory used by each instance. This is the state-of-record
> +  backend parameter ``maxmem`` for that instance (maybe inherited from
> +  group-level or cluster-level backend paramters). It tells the hypervisor
> +  the maximal amount of memory that instance may use.
> +
> +- The state-of-world paramters for the node memory. They are collected
> +  live and are hypervisor specific. The following paramters are collected.

s/paramters/parameters/, x2

> +
> +  - memory_total: the total memory size on the node
> +
> +  - memory_free: the available memory on the node for instances
> +
> +  - memory_dom0: the memory used by the node itself, if available
> +
> +  For Xen, the amount of total and free memory are obtained by parsing
> +  the outout of Xen ``info`` command (e.g., ``xm info``). The dom0

s/outout/output/

> +  memory is obtained by looking in the output of the ``list`` command
> +  for ``Domain-0``.
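
As a quick illustration of what that parsing amounts to -- the field names of
the ``xm`` output are from memory and the helper is made up, so take this as a
sketch rather than the actual Ganeti code:

    # Rough sketch only, not the actual Ganeti implementation.  Assumes
    # "xm info" prints lines like "total_memory  : 16384" (values in MiB)
    # and "xm list" has a "Domain-0" row whose third column is its memory.
    import subprocess

    def xen_node_memory():
        values = {}
        for line in subprocess.check_output(["xm", "info"]).decode().splitlines():
            key, sep, val = line.partition(":")
            if sep:
                values[key.strip()] = val.strip()
        memory_total = int(values["total_memory"])
        memory_free = int(values["free_memory"])

        memory_dom0 = None
        for line in subprocess.check_output(["xm", "list"]).decode().splitlines():
            fields = line.split()
            if fields and fields[0] == "Domain-0":
                memory_dom0 = int(fields[2])
        return memory_total, memory_free, memory_dom0
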
> +
> +  For the ``kvm`` hypervisor, all these paramters are obtained by
> +  reading ``/proc/memstate``, where the entries ``MemTotal`` and
> +  ``Active`` are considered the values for ``memory_total`` and
> +  ``memory_dom0``, respectively. The value for ``memory_free`` is
> +  taken as the sum of the entries ``MemFree``, ``Buffers``, and
> +  ``Cached``.
> +
> +
> +Current state and shortcomings
> +==============================
> +
> +While the current model of never over provisioning memory serves well

I believe over-provisioning means supplying a surplus of something, and
over-committing should be used here.

> +to provide reliability guarantees to instances, it does not suit well
> +situations were the actual use of memory in the instances is spiky. Consider
> +a scenario where instances only touch a small portion of their memory most
> +of the time, but occasionally use a large amount memory. Then, at any moment,

s/memory/of memory/

> +a large fraction of the memory used for the instances sits there without

s/there/around/

> +being actively used. By swapping out the not actively used memory, resources
> +can be used more efficiently.
> +
> +Proposed changes
> +================
> +
> +We propose to support over commitment of memory if desired by the
> +administrator. Memory will change from beeing a hard constraint to

s/beeing/being/

> +being a question of policy. The default will be not to over commit
> +memory.
> +
> +Extension of the policy by a new parameter
> +------------------------------------------
> +
> +The instance policy is extenden by a new real-number field ``memory-ratio``.

s/extenden/extended/

> +Policies on groups inherit this paramter from the cluster wide policy in the
> +same way as all other parameters of the instance policy.
> +
> +When a cluster is upgraded from an earlier version not containing
> +``memory-ratio``, the value ``1.0`` is inserted for this new field in
> +the cluster-level ``ipolicy``; in this way, the status quo of not over
> +committing memory is preserved via upgrades. The ``gnt-cluster
> +modify`` and ``gnt-group modify`` commands are extended to allow
> +setting of the ``memory-ratio``.
> +
> +The ``htools`` text format is extended to also contain this new
> +ipolicy parameter. It is added as an optional entry at the end of the
> +parameter list of an ipolicy line, to remain backwards compatible.
> +If the paramter is missing, the value ``1.0`` is assumed.

s/paramter/parameter/g

> +
> +Changes to the memory reporting on ``kvm``
> +------------------------------------------
> +
> +For ``kvm``, ``dom0`` corresponds to the amount of memory used by Ganeti
> +itself and all other non-``kvm`` processes running on this node. The amount
> +of memory currently reported for ``dom0``, however, includes the amount of
> +active memory of the ``kvm`` processes. This is in conflict with the underlying
> +assumption ``dom0`` memory is not available for instance.
> +
> +Therefore, for ``kvm`` we will report as ``dom0`` the state-of-record
> +backend paramter ``memory_dom0`` for the ``kvm`` hypervisor. As a hypervisor
> +backend paramter, it is run-time tunable and inheritable at group level.
> +If this paramter is not present, a default value of ``1024M`` will be used,
> +which is conservative estimate of the amount of memory used by Ganeti on a

a conservative estimate

> +medium-sized cluster.
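
Just to make sure I read the proposed ``kvm`` reporting right, it would
roughly amount to the sketch below (illustration only, not a patch; I am
assuming ``/proc/meminfo`` is the file meant above, and the helper names are
made up):

    # Illustrative sketch only, not a patch.  Assumes /proc/meminfo is the
    # intended source (the text above says /proc/memstate) and that the kvm
    # hypervisor parameters are available as a plain dict ("hvparams").
    DEFAULT_DOM0_MEM_MIB = 1024  # the conservative default proposed above

    def read_meminfo():
        """Return /proc/meminfo values in MiB, keyed by field name."""
        result = {}
        with open("/proc/meminfo") as fd:
            for line in fd:
                key, _, rest = line.partition(":")
                result[key.strip()] = int(rest.split()[0]) // 1024  # kB -> MiB
        return result

    def kvm_node_memory(hvparams):
        meminfo = read_meminfo()
        memory_total = meminfo["MemTotal"]
        memory_free = meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]
        # Proposed change: report the state-of-record dom0 reservation instead
        # of the live "Active" value, falling back to the conservative default.
        memory_dom0 = hvparams.get("memory_dom0", DEFAULT_DOM0_MEM_MIB)
        return memory_total, memory_free, memory_dom0
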
> The reason for using a state-of-record value is to
> +have a stable amount of reserved memory, irrsespectively of the current activity

irrespectively

> +of Ganeti.
> +
> +
> +Changes to the memory policy
> +----------------------------
> +
> +The memory policy will be changed in that we assume that one byte
> +of physical node memory can hold ``memory-ratio`` bytes of instance
> +memory, but still only one byte of Ganeti memory. Of course, in practise
> +this has to be backed by swap sapce; it is the administrator responsibility

s/sapce/space/

> +to ensure that each node has swap of at
> +least ``(memory-ratio - 1.0) * (memory_total - memory_dom0)``.

It would be possible and desirable to have Ganeti check the available swap
space as well during cluster verify. This might not be backwards compatible
though, and would have to be activated by an option.

> +
> +The new memory policy will be as follows.
> +
> +- The difference between the total memory of a node and its dom0
> +  memory will be considered the amount of *available memory*.
> +
> +- The amount of *used memory* will be (as is now) the sum of
> +  the memory of all instance and the reserved memory.
> +
> +- The *relative memory usage* is the fraction of used and available
> +  memory. Note that the relative usage can be bigger than ``1.0``.
> +
> +- The memory-related constraint for instance placement is that
> +  afterwards the relative memory usage be at most the
> +  memory-ratio. Again, if the ratio of the memory of the real
> +  instances on the node to available memory is bigger than the
> +  memory-ratio this is considered a hard violation, otherwise
> +  it is considered a soft violation.
> +
> +- The definition of N+1 redunandcy (including

s/redunandcy/redundancy/

> +  :doc:`design-shared-storage-redundancy`) is kept literally as is.
> +  Note, however, that the meaning does change, the definition depends

as the definition

> +  on the notion of allowed moves, which is changed by this proposal.
> +
> +
> +Changes to cluster verify
> +-------------------------
> +
> +The only place where the Ganeti core handles memory is
> +when ``gnt-cluster verify`` verifies N+1 redundancy. This code will be changed
> +to follow the new memory model.
> +
> +Changes to ``htools``
> +---------------------
> +
> +The underlying model of the cluster will be changed in accordance with
> +the suggested change of the memory policy. As all higher-level ``htools``
> +operations go through only the primitives of adding/moving an instance
> +if possible, and inspecting the cluster metrics, changing the base
> +model will make all ``htools`` compliant with the new memory model.
> +
> +Balancing
> +---------
> +
> +The cluster metric components will not be changed. Note the standard
> +deviation of relative memory usage is already one of the components.
> +For dynamic (load-based) balancing, the amount of not immediately
> +discardable memory will serve as an indication of memory activity;
> +as usual, the measure will be the standard deviation of the relative
> +value (i.e., the ratio of non-discardable memory to available
> +memory). The weighting for this metric component will have to be
> +determined by experimentation; as starting point we will use the
> +status quo value of ``1.0``.

One problem I see with this is that this metric will work well for
oversubscribed clusters, but may negatively influence clusters which are not
oversubscribed. It should ideally kick in only after a cluster becomes
oversubscribed.
We could scale it by an additional factor calculated as:

    max (per node) [ max(relative-memory-usage - 1, 0) / (memory_ratio - 1) ]  (*)

(*) I really miss LaTeX here.

> +
> --
> 2.5.0.457.gab17608
>

Hrvoje Ribicic
Ganeti Engineering
Google Germany GmbH
Dienerstr. 12, 80331, München
Geschäftsführer: Graham Law, Christine Elizabeth Flores
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
