https://virtuozzo.atlassian.net/browse/VSTOR-121489
Signed-off-by: Pavel Tikhomirov <[email protected]>
Feature: ve: ve generic structures
---
.../ve-cgroup-and-namespace.rst | 152 ++++++++++++++++++
1 file changed, 152 insertions(+)
create mode 100644
Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst
diff --git
a/Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst
b/Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst
new file mode 100644
index 000000000000..39da58feb080
--- /dev/null
+++ b/Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst
@@ -0,0 +1,152 @@
+=======================
+VE Cgroup and Namespace
+=======================
+
+Overview
+========
+
+Virtuozzo containers (VEs or Virtual Environments) are created through the
+cooperation of two kernel mechanisms: **VE cgroup** and **VE namespace**. These
+two components work together to provide better container isolation and resource
+management.
+
+Background
+==========
+
+In the Virtuozzo 9 kernel, containers were implemented using only **VE cgroup**
+in cgroup-v1 mode. The VE cgroup subsystem served a dual purpose:
+
+1. **Resource management**: Managing container resources through the cgroup
+ hierarchy
+2. **Container identification**: Tasks were associated with containers through
+ their cgroup membership
+
+However, as the kernel ecosystem moved toward cgroup-v2 and all modern
+distributions switched to it, we also needed to adapt our container
+implementation.
+
+Cgroup-v2 has a unified hierarchy, so the VE cgroup must be melded into it
+together with the other subsystems. This leads to complications: we can no
+longer put all container processes into a single VE cgroup, yet from the
+container virtualization perspective we still want them to appear as one VE.
+
+In the Virtuozzo 10 kernel, **VE namespace** was introduced to separate
+container identification from resource management.
+
+VE Cgroup
+=========
+
+VE cgroup is a cgroup subsystem (``ve_cgrp_subsys``) that provides:
+
+**Resource Management**
+ - Container lifecycle management (start, stop, state tracking)
+ - Resource limits and accounting
+ - Container configuration (features, time offsets, network limits, etc.)
+ - Release agent support (cgroup-v1 only)
+
+**Structure**
+ - Each VE container is represented by a ``struct ve_struct``
+ - The VE struct embeds a ``cgroup_subsys_state`` (css), making it part of the
+ cgroup hierarchy
+ - VE cgroup can operate in both cgroup-v1 and cgroup-v2 modes
+
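+The embedding described above can be sketched in userspace. This is a
+simplified model, not the kernel code: the struct layout and helper name
+``css_to_ve()`` are assumptions for illustration, showing only the
+``container_of()`` pattern that cgroup subsystems use to recover the
+enclosing structure from the embedded css.
+
+.. code-block:: c
+
+   /* Userspace sketch: a ve_struct-like structure embedding a css and
+    * being recovered from it with the container_of() pattern. */
+   #include <assert.h>
+   #include <stddef.h>
+   #include <string.h>
+
+   /* Simplified stand-ins for the kernel types. */
+   struct cgroup_subsys_state { int refcnt; };
+
+   struct ve_struct {
+           struct cgroup_subsys_state css;  /* embedded css, as in the kernel */
+           unsigned int veid;
+           char state[16];
+   };
+
+   /* container_of(): recover the enclosing structure from a member pointer. */
+   #define container_of(ptr, type, member) \
+           ((type *)((char *)(ptr) - offsetof(type, member)))
+
+   static struct ve_struct *css_to_ve(struct cgroup_subsys_state *css)
+   {
+           return container_of(css, struct ve_struct, css);
+   }
+
+   int main(void)
+   {
+           struct ve_struct ve = { .veid = 101 };
+           strcpy(ve.state, "STARTING");
+
+           /* The cgroup core only sees the css pointer... */
+           struct cgroup_subsys_state *css = &ve.css;
+
+           /* ...but the VE subsystem can get its ve_struct back. */
+           assert(css_to_ve(css) == &ve);
+           assert(css_to_ve(css)->veid == 101);
+           return 0;
+   }
+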
+**Cgroup Operations**
+ - ``ve_create()``: Creates a new VE cgroup when a cgroup directory is
+ created. Initializes VE structure, sets initial state to STARTING,
+ allocates per-VE resources (vdso, devtmpfs, logging).
+ - ``ve_destroy()``: Destroys VE cgroup when cgroup directory is removed AND
+ all references to VE are put. Frees all VE resources (vdso, devtmpfs
+ mounts, logging, per-CPU stats), sets state to DEAD.
+ - ``ve_start_container()``: Starts the container (triggered by writing
+ "START" to ``ve.state``). Validates prerequisites (VE ID set, correct
+ state, process is pidns init), grabs container context (credentials,
+ nsproxy), marks VE root(s) in cgroup hierarchy, starts kthreadd and UMH
+ workers, initializes hooks and release agent, transitions state to RUNNING.
+ - ``ve_stop_ns()``: Stops the container when PID namespace stops. Transitions
+ state to STOPPING, stops workqueues and UMH helpers, stops kthreadd.
+ - ``ve_exit_ns()``: Final cleanup when container PID namespace exits.
+ Unmarks VE roots, finalizes hooks, removes VE from global list,
+ drops container context, transitions state to STOPPED.
+ - ``ve_state_show()`` / ``ve_state_write()``: Read/write container state
+ through ``ve.state`` cgroup file. States: STARTING, RUNNING, STOPPING,
+ STOPPED, DEAD.
+
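+The lifecycle callbacks above effectively implement a state machine. The
+following userspace sketch (an assumption-level model, not the kernel
+implementation) illustrates the transition checks behind
+``ve_state_write()``: starting is only allowed from STARTING, stopping only
+from RUNNING.
+
+.. code-block:: c
+
+   /* Userspace model of the VE lifecycle:
+    * STARTING -> RUNNING -> STOPPING -> STOPPED -> DEAD. */
+   #include <assert.h>
+   #include <string.h>
+
+   enum ve_state { VE_STARTING, VE_RUNNING, VE_STOPPING, VE_STOPPED, VE_DEAD };
+
+   static const char *ve_state_names[] = {
+           "STARTING", "RUNNING", "STOPPING", "STOPPED", "DEAD",
+   };
+
+   /* Mirrors the prerequisite check: starting succeeds only from STARTING. */
+   static int ve_start(enum ve_state *s)
+   {
+           if (*s != VE_STARTING)
+                   return -1;       /* wrong state: refuse to start */
+           *s = VE_RUNNING;
+           return 0;
+   }
+
+   static int ve_stop(enum ve_state *s)
+   {
+           if (*s != VE_RUNNING)
+                   return -1;
+           *s = VE_STOPPING;
+           return 0;
+   }
+
+   int main(void)
+   {
+           enum ve_state s = VE_STARTING;
+
+           assert(ve_start(&s) == 0);                       /* -> RUNNING */
+           assert(strcmp(ve_state_names[s], "RUNNING") == 0);
+           assert(ve_start(&s) == -1);                      /* no double start */
+           assert(ve_stop(&s) == 0);                        /* -> STOPPING */
+           return 0;
+   }
+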
+VE Namespace
+============
+
+VE namespace is a Linux namespace type (``CLONE_NEWVE``) that provides:
+
+**Container Identification**
+ - Each task has a ``task->ve_ns`` pointer to its VE namespace
+ - Each task has a ``task->task_ve`` pointer directly to the VE struct
+ - The ``get_exec_env()`` macro returns ``current->task_ve`` for quick VE
+ lookup
+
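+The two per-task pointers can be modeled as below. This is a userspace
+sketch with names borrowed from the text; the ``current_task`` variable
+stands in for the kernel's ``current``, and the real field layouts differ.
+It shows why a ``get_exec_env()``-style macro is a cheap lookup: it reads
+the cached ``task_ve`` pointer directly instead of chasing the namespace
+link.
+
+.. code-block:: c
+
+   /* Userspace model of task->ve_ns, task->task_ve and get_exec_env(). */
+   #include <assert.h>
+
+   struct ve_struct { unsigned int veid; };
+   struct ve_namespace { struct ve_struct *ve; };
+
+   struct task_struct {
+           struct ve_namespace *ve_ns;  /* task's VE namespace */
+           struct ve_struct *task_ve;   /* direct shortcut to the VE */
+   };
+
+   static struct task_struct *current_task;  /* stand-in for "current" */
+
+   /* get_exec_env(): quick VE lookup via the cached pointer. */
+   #define get_exec_env() (current_task->task_ve)
+
+   int main(void)
+   {
+           struct ve_struct ve = { .veid = 200 };
+           struct ve_namespace ve_ns = { .ve = &ve };
+           struct task_struct task = { .ve_ns = &ve_ns, .task_ve = &ve };
+
+           current_task = &task;
+
+           /* The shortcut and the namespace link must agree. */
+           assert(get_exec_env() == task.ve_ns->ve);
+           assert(get_exec_env()->veid == 200);
+           return 0;
+   }
+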
+**1:1 Relationship with VE Cgroup**
+ - Each VE namespace has an exclusive link to exactly one VE cgroup
+ - The link is established when the namespace is created
+ - This relationship is maintained through ``ve_namespace->ve`` and
+ ``ve_struct->ve_ns`` pointers
+ - VE namespace holds the reference to VE
+ - Task holds the reference to VE namespace and implicitly to VE through it
+
+**Namespace Operations**
+ - ``clone_ve_ns()``: Creates a new VE namespace and links it to the VE cgroup
+ - ``copy_ve_ns()``: Creates a new VE namespace during clone3()
+ - ``unshare_ve_namespace()``: Unshares VE namespace for the calling task
+ - ``switch_ve_namespace()``: Changes a task's VE namespace and task_ve
+   pointer
+ - ``exit_ve_namespace()``: Releases the task's VE namespace
+
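+The invariant behind ``switch_ve_namespace()`` can be sketched as follows.
+This is a simplified userspace model with assumed semantics (plain integer
+refcounts instead of the kernel's refcount primitives): moving a task must
+update ``ve_ns`` and ``task_ve`` together, while the reference chain
+task -> ve_namespace -> ve_struct stays consistent.
+
+.. code-block:: c
+
+   /* Userspace model of switching a task between VE namespaces. */
+   #include <assert.h>
+
+   struct ve_struct { int refcnt; };
+   struct ve_namespace { int refcnt; struct ve_struct *ve; };
+
+   struct task_struct {
+           struct ve_namespace *ve_ns;
+           struct ve_struct *task_ve;
+   };
+
+   static void switch_ve_namespace(struct task_struct *tsk,
+                                   struct ve_namespace *new_ns)
+   {
+           struct ve_namespace *old = tsk->ve_ns;
+
+           new_ns->refcnt++;            /* task takes a ref on the new ns */
+           tsk->ve_ns = new_ns;
+           tsk->task_ve = new_ns->ve;   /* keep the shortcut in sync */
+           if (old)
+                   old->refcnt--;       /* drop the ref on the old ns */
+   }
+
+   int main(void)
+   {
+           struct ve_struct ve0 = { .refcnt = 1 }, ve1 = { .refcnt = 1 };
+           struct ve_namespace ns0 = { .refcnt = 0, .ve = &ve0 };
+           struct ve_namespace ns1 = { .refcnt = 0, .ve = &ve1 };
+           struct task_struct tsk = { 0 };
+
+           switch_ve_namespace(&tsk, &ns0);
+           assert(tsk.task_ve == &ve0 && ns0.refcnt == 1);
+
+           switch_ve_namespace(&tsk, &ns1);
+           assert(tsk.task_ve == &ve1);
+           assert(ns0.refcnt == 0 && ns1.refcnt == 1);
+           return 0;
+   }
+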
+How They Work Together
+=======================
+
+**Container Creation Flow (cgroup-v2)**
+
+1. A Container cgroup is created through cgroupfs (with a vz.slice child,
+   as on cgroup-v2 we need a leaf cgroup to be able to enter it later when
+   the container is populated with tasks; the kernel also puts auxiliary
+   kthreads there):
+
+.. code-block:: bash
+
+   mkdir /sys/fs/cgroup/machine.slice/$VEID{,/vz.slice}
+
+2. The VE controller is enabled in ancestor cgroups (together with other
+ cgroup controllers):
+
+.. code-block:: bash
+
+   echo "+ve" > /sys/fs/cgroup{,/machine.slice}/cgroup.subtree_control
+
+3. Unhide VE files (hidden by default) for the Container cgroup:
+
+.. code-block:: bash
+
+   echo "-ve" > /sys/fs/cgroup/machine.slice/$VEID/cgroup.controllers_hidden
+
+4. Cgroup is initialized with VE ID:
+
+.. code-block:: bash
+
+   echo $VEID > /sys/fs/cgroup/machine.slice/$VEID/ve.veid
+
+5. (Optional) Further configuration can be done (eBPF device control,
+   ve.mount_opts, ve.features, ve.sysfs_permissions, enabling ve.pseudosuper,
+   memory limits, time offsets, network limits, etc.)
+
+6. Task (Container manager) is attached to the Container root cgroup:
+
+.. code-block:: bash
+
+   echo $$ > /sys/fs/cgroup/machine.slice/$VEID/cgroup.procs
+
+7. When the task does clone3() for container init with ``CLONE_NEWVE`` flag
+ (together with other namespaces), a new VE namespace is created and
+ automatically linked to the VE cgroup that the task belongs to.
+
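+Step 7 can be sketched as below. Note the hedges: ``CLONE_NEWVE`` is
+Virtuozzo-specific and not in mainline uapi headers, so the flag value used
+here is purely hypothetical, the ``clone_args_sketch`` struct is a minimal
+local mirror, and the actual ``clone3()`` syscall is shown only in a comment
+rather than performed.
+
+.. code-block:: c
+
+   /* Sketch: building a clone3() request for a new VE namespace together
+    * with the other namespaces.  Does not perform the syscall. */
+   #define _GNU_SOURCE
+   #include <assert.h>
+   #include <sched.h>      /* CLONE_NEWPID, CLONE_NEWNS, ... */
+   #include <signal.h>     /* SIGCHLD */
+   #include <stdint.h>
+
+   #ifndef CLONE_NEWVE
+   #define CLONE_NEWVE (1ULL << 38)  /* hypothetical value, for illustration */
+   #endif
+
+   /* Minimal local mirror of struct clone_args (uapi linux/sched.h). */
+   struct clone_args_sketch {
+           uint64_t flags;
+           uint64_t exit_signal;
+   };
+
+   int main(void)
+   {
+           struct clone_args_sketch args = {
+                   .flags = CLONE_NEWVE | CLONE_NEWPID | CLONE_NEWNS |
+                            CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET,
+                   .exit_signal = SIGCHLD,
+           };
+
+           /* A real call would be:
+            *   syscall(SYS_clone3, &args, sizeof(args));
+            * after which the child runs in the new VE namespace, linked to
+            * the VE cgroup the parent was attached to in step 6. */
+           assert(args.flags & CLONE_NEWVE);
+           assert(args.flags & CLONE_NEWPID);
+           return 0;
+   }
+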
+8. Container (VE) is started (at this point the kernel checks that all the
+   necessary cgroups and namespaces are configured correctly):
+
+.. code-block:: bash
+
+   echo "START" > /sys/fs/cgroup/machine.slice/$VEID/ve.state
+
+9. The mount namespace and rootfs of the Container are set up.
+
+10. Pseudosuper is disabled. (Pseudosuper is designed to temporarily provide
+    the privileges needed for container setup and is disabled before container
+    init starts for security reasons.)
+
+11. Container init is executed (e.g. systemd), and now the container is fully
+    running.
--
2.52.0