https://virtuozzo.atlassian.net/browse/VSTOR-121489
Signed-off-by: Pavel Tikhomirov <[email protected]>

Feature: ve: ve generic structures
---
 .../ve-cgroup-and-namespace.rst               | 152 ++++++++++++++++++
 1 file changed, 152 insertions(+)
 create mode 100644 Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst

diff --git a/Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst b/Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst
new file mode 100644
index 000000000000..39da58feb080
--- /dev/null
+++ b/Documentation/Virtuozzo/FeatureDescriptions/ve-cgroup-and-namespace.rst
@@ -0,0 +1,152 @@
+=======================
+VE Cgroup and Namespace
+=======================
+
+Overview
+========
+
+Virtuozzo containers (VEs or Virtual Environments) are created through the
+cooperation of two kernel mechanisms: **VE cgroup** and **VE namespace**. These
+two components work together to provide better container isolation and resource
+management.
+
+Background
+==========
+
+In the Virtuozzo 9 kernel, containers were implemented using only **VE cgroup**
+in cgroup-v1 mode. The VE cgroup subsystem served two purposes:
+
+1. **Resource management**: Managing container resources through the cgroup
+   hierarchy
+2. **Container identification**: Tasks were associated with containers through
+   their cgroup membership
+
+However, as the kernel ecosystem moved toward cgroup-v2 and all modern
+distributions switched to it, we needed to adapt our container
+implementation as well.
+
+Cgroup-v2 uses a unified hierarchy, so the VE cgroup has to be merged into it
+alongside the other subsystems. This complicates things: we can no longer
+simply put all container processes into a single VE cgroup, yet from the
+container virtualization perspective we still want them to appear as one VE.
+
+In the Virtuozzo 10 kernel, **VE namespace** was introduced to separate
+container identification from resource management.
+
+VE Cgroup
+=========
+
+VE cgroup is a cgroup subsystem (``ve_cgrp_subsys``) that provides:
+
+**Resource Management**
+  - Container lifecycle management (start, stop, state tracking)
+  - Resource limits and accounting
+  - Container configuration (features, time offsets, network limits, etc.)
+  - Release agent support (cgroup-v1 only)
+
+**Structure**
+  - Each VE container is represented by a ``struct ve_struct``
+  - The VE struct embeds a ``cgroup_subsys_state`` (css), making it part of the
+    cgroup hierarchy
+  - VE cgroup can operate in both cgroup-v1 and cgroup-v2 modes
+
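+A minimal sketch of this embedding (only the ``css`` field is taken from the
+description above; the conversion helper follows the standard cgroup
+``container_of`` pattern and is illustrative):
+
+.. code-block:: c
+
+   /* Illustrative sketch: the VE is a cgroup subsystem state. */
+   struct ve_struct {
+           struct cgroup_subsys_state css;   /* embeds the VE into the
+                                              * cgroup hierarchy */
+           /* ... per-VE state: id, features, offsets, ... */
+   };
+
+   /* Standard pattern to get back from a css to its VE. */
+   static inline struct ve_struct *css_to_ve(struct cgroup_subsys_state *css)
+   {
+           return css ? container_of(css, struct ve_struct, css) : NULL;
+   }
+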
+**Cgroup Operations**
+  - ``ve_create()``: Creates a new VE cgroup when a cgroup directory is
+    created. Initializes VE structure, sets initial state to STARTING,
+    allocates per-VE resources (vdso, devtmpfs, logging).
+  - ``ve_destroy()``: Destroys VE cgroup when cgroup directory is removed AND
+    all references to VE are put. Frees all VE resources (vdso, devtmpfs
+    mounts, logging, per-CPU stats), sets state to DEAD.
+  - ``ve_start_container()``: Starts the container (triggered by writing
+    "START" to ``ve.state``). Validates prerequisites (VE ID set, correct
+    state, process is pidns init), grabs container context (credentials,
+    nsproxy), marks VE root(s) in cgroup hierarchy, starts kthreadd and UMH
+    workers, initializes hooks and release agent, transitions state to RUNNING.
+  - ``ve_stop_ns()``: Stops the container when PID namespace stops. Transitions
+    state to STOPPING, stops workqueues and UMH helpers, stops kthreadd.
+  - ``ve_exit_ns()``: Final cleanup when container PID namespace exits.
+    Unmarks VE roots, finalizes hooks, removes VE from global list,
+    drops container context, transitions state to STOPPED.
+  - ``ve_state_show()`` / ``ve_state_write()``: Read/write container state
+    through ``ve.state`` cgroup file. States: STARTING, RUNNING, STOPPING,
+    STOPPED, DEAD.
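+
+The lifecycle driven by these operations can be summarized as follows (the
+enum identifiers are illustrative, not the exact kernel names):
+
+.. code-block:: c
+
+   /* Illustrative state machine for a VE:
+    *
+    *   ve_create()          -> STARTING
+    *   ve_start_container() -> RUNNING
+    *   ve_stop_ns()         -> STOPPING
+    *   ve_exit_ns()         -> STOPPED
+    *   ve_destroy()         -> DEAD
+    */
+   enum ve_state {
+           VE_STATE_STARTING,
+           VE_STATE_RUNNING,
+           VE_STATE_STOPPING,
+           VE_STATE_STOPPED,
+           VE_STATE_DEAD,
+   };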
+
+VE Namespace
+============
+
+VE namespace is a Linux namespace type (``CLONE_NEWVE``) that provides:
+
+**Container Identification**
+  - Each task has a ``task->ve_ns`` pointer to its VE namespace
+  - Each task has a ``task->task_ve`` pointer directly to the VE struct
+  - The ``get_exec_env()`` macro returns ``current->task_ve`` for quick VE
+    lookup
+
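+Per the definitions above, the lookup path is a direct pointer read. A minimal
+sketch (the macro body is quoted from the description; the usage line is
+illustrative):
+
+.. code-block:: c
+
+   /* Quick lookup of the VE the current task belongs to. */
+   #define get_exec_env()  (current->task_ve)
+
+   /* Illustrative usage in kernel code: */
+   struct ve_struct *ve = get_exec_env();
+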
+**1:1 Relationship with VE Cgroup**
+  - Each VE namespace has an exclusive link to exactly one VE cgroup
+  - The link is established when the namespace is created
+  - This relationship is maintained through ``ve_namespace->ve`` and
+    ``ve_struct->ve_ns`` pointers
+  - VE namespace holds the reference to VE
+  - Task holds the reference to VE namespace and implicitly to VE through it
+
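+A sketch of the cross-links and reference ownership described above (only the
+fields named in this document are shown):
+
+.. code-block:: c
+
+   struct ve_namespace {
+           struct ve_struct *ve;        /* exclusive link; holds a
+                                         * reference to the VE */
+   };
+
+   struct ve_struct {
+           struct ve_namespace *ve_ns;  /* back-link to the namespace */
+           /* ... */
+   };
+
+   /* Reference chain: task -> ve_namespace -> ve_struct. A task pins
+    * its VE namespace and, through it, implicitly the VE itself. */
+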
+**Namespace Operations**
+  - ``clone_ve_ns()``: Creates a new VE namespace and links it to the VE cgroup
+  - ``copy_ve_ns()``: Creates a new VE namespace during clone3()
+  - ``unshare_ve_namespace()``: Unshares VE namespace for the calling task
+  - ``switch_ve_namespace()``: Changes a task's VE namespace and ``task_ve``
+    pointer
+  - ``exit_ve_namespace()``: Releases the task's VE namespace
+
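+A rough sketch of what ``switch_ve_namespace()`` must do, following the
+reference rules above (locking and error handling omitted; the ``get``/``put``
+helper names are illustrative):
+
+.. code-block:: c
+
+   /* Illustrative sketch, not the exact kernel implementation. */
+   static void switch_ve_namespace(struct task_struct *tsk,
+                                   struct ve_namespace *new_ns)
+   {
+           struct ve_namespace *old_ns = tsk->ve_ns;
+
+           get_ve_ns(new_ns);          /* the task takes a namespace ref */
+           tsk->ve_ns = new_ns;
+           tsk->task_ve = new_ns->ve;  /* keep the direct VE pointer in
+                                        * sync with the new namespace */
+           put_ve_ns(old_ns);          /* drop the ref to the old one */
+   }
+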
+How They Work Together
+======================
+
+**Container Creation Flow (cgroup-v2)**
+
+1. A Container cgroup is created through cgroupfs (with a vz.slice child: on
+   cgroup-v2 we need a leaf cgroup that can be entered later when the
+   container is populated with tasks; the kernel also puts auxiliary kthreads
+   there):
+
+.. code-block:: bash
+
+   mkdir /sys/fs/cgroup/machine.slice/$VEID{,/vz.slice}
+
+2. The VE controller is enabled in ancestor cgroups (together with other
+   cgroup controllers):
+
+.. code-block:: bash
+
+   echo "+ve" > /sys/fs/cgroup{,/machine.slice}/cgroup.subtree_control
+
+3. Unhide VE files (hidden by default) for the Container cgroup:
+
+.. code-block:: bash
+
+   echo "-ve" > /sys/fs/cgroup/machine.slice/$VEID/cgroup.controllers_hidden
+
+4. The cgroup is initialized with the VE ID:
+
+.. code-block:: bash
+
+   echo $VEID > /sys/fs/cgroup/machine.slice/$VEID/ve.veid
+
+5. (Optional) Additional configuration can be applied (eBPF device control,
+   ve.mount_opts, ve.features, ve.sysfs_permissions, enabling ve.pseudosuper,
+   memory limits, time offsets, network limits, etc.)
+
+6. The task (the Container manager) is attached to the Container root cgroup:
+
+.. code-block:: bash
+
+   echo $$ > /sys/fs/cgroup/machine.slice/$VEID/cgroup.procs
+
+7. When the task calls clone3() for the container init with the
+   ``CLONE_NEWVE`` flag (together with the other namespace flags), a new VE
+   namespace is created and automatically linked to the VE cgroup that the
+   task belongs to, as sketched below.
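+
+A hedged userspace sketch of this step; ``CLONE_NEWVE`` is assumed to come
+from the Virtuozzo kernel headers (it is not part of vanilla Linux), and the
+function name is illustrative:
+
+.. code-block:: c
+
+   #define _GNU_SOURCE
+   #include <linux/sched.h>     /* struct clone_args, CLONE_* flags */
+   #include <sys/syscall.h>
+   #include <signal.h>
+   #include <string.h>
+   #include <unistd.h>
+
+   /* Spawn the future container init in fresh namespaces,
+    * including the VE namespace. */
+   static pid_t spawn_ct_init(void)
+   {
+           struct clone_args args;
+
+           memset(&args, 0, sizeof(args));
+           args.flags = CLONE_NEWVE |   /* Virtuozzo-specific */
+                        CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS |
+                        CLONE_NEWIPC | CLONE_NEWNET;
+           args.exit_signal = SIGCHLD;
+
+           return syscall(SYS_clone3, &args, sizeof(args));
+   }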
+
+8. The Container (VE) is started (at this point the kernel checks that all the
+   necessary cgroups and namespaces are configured correctly):
+
+.. code-block:: bash
+
+   echo "START" > /sys/fs/cgroup/machine.slice/$VEID/ve.state
+
+9. The mount namespace and rootfs of the Container are set up.
+
+10. Pseudosuper is disabled. (Pseudosuper is designed to temporarily provide
+    privileges needed for container setup and is disabled before container init
+    starts for security reasons.)
+
+11. Container init is executed (e.g. systemd), and the container is now fully
+    running.
-- 
2.52.0
