Signed-off-by: Andy Lutomirski <l...@amacapital.net>
---
 Documentation/security/capabilities.txt | 161 ++++++++++++++++++++++++++++++++
 1 file changed, 161 insertions(+)
 create mode 100644 Documentation/security/capabilities.txt
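
A reading aid, not part of the file being added: the capset rules in the
"capset" section can be sketched as a small bitmask model.  This is a
hypothetical illustration, not kernel code.  The function name and the
`privileged` flag (standing in for "CAP_SETPCAP ∈ pE or euid is namespace
owner") are made up for this sketch, and the effective-set check is modeled
as pE' ⊆ pP', following the invariant pE ⊆ pP; that reading is an
assumption.

```python
# Capability bit numbers as defined in include/uapi/linux/capability.h.
CAP_CHOWN = 1 << 0
CAP_DAC_OVERRIDE = 1 << 1
CAP_NET_BIND_SERVICE = 1 << 10


def subset(a, b):
    """True if every bit set in mask a is also set in mask b."""
    return a & ~b == 0


def capset_allowed(pI, pP, pB, new_pI, new_pP, new_pE, privileged=False):
    """Hypothetical model of the capset checks (not the kernel code).

    privileged stands in for "CAP_SETPCAP in pE or euid is namespace owner".
    """
    # Unprivileged callers may only add pI bits that are already in pP.
    if not privileged and not subset(new_pI, pI | pP):
        return False
    # New inheritable bits are also limited by the bounding set.
    if not subset(new_pI, pI | pB):
        return False
    # The permitted set can only shrink.
    if not subset(new_pP, pP):
        return False
    # Assumption: the new effective set must fit in the new permitted set.
    return subset(new_pE, new_pP)
```

For example, a task with pP = CAP_DAC_OVERRIDE | CAP_NET_BIND_SERVICE can
raise pE to all of pP, but cannot add CAP_CHOWN to pP.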

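Another reading aid, not part of the file being added: steps 2, 3, 4 and 6
of the "execve" section can be sketched as below.  Step 5 (no_new_privs and
ptrace handling) and step 7 are left out.  The function and parameter names
are hypothetical; `uid` and `euid` are the values in effect for the new
program, `has_fcaps` says whether the file carried a capability xattr, and
step 3 is modeled as "abort if some fP bit did not make it into pP'", which
is this sketch's reading of the rule.

```python
def execve_caps(pI, pB, uid, euid, fI=0, fP=0, fE=False,
                has_fcaps=False, noroot=False):
    """Hypothetical model of how execve derives (pP', pE').

    All sets are integer bitmasks.  noroot models SECURE_NOROOT.
    """
    # Step 2: basic pP update rule.
    new_pP = (pB & fP) | (pI & fI)

    # Step 3 (assumed reading): a "legacy" binary must get all of fP.
    if fE and (fP & ~new_pP):
        raise PermissionError("fP bits missing from pP'")

    # Step 4: root fixup, unless SECURE_NOROOT is set.
    if not noroot:
        if has_fcaps and uid != 0 and euid == 0:
            pass  # the kernel only warns here; no fixup is applied
        else:
            if euid == 0 or uid == 0:
                new_pP = pB | pI
            if euid == 0:
                fE = True

    # Step 6: pE' is all of pP' or nothing.
    new_pE = new_pP if fE else 0
    return new_pP, new_pE
```

With a full bounding set, execve_caps(0, ALL, uid=0, euid=0) yields
pP' = pE' = ALL, while the same call with noroot=True yields (0, 0).
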
diff --git a/Documentation/security/capabilities.txt b/Documentation/security/capabilities.txt
new file mode 100644
index 0000000..dc7bc34
--- /dev/null
+++ b/Documentation/security/capabilities.txt
@@ -0,0 +1,161 @@
+                         Linux capabilities
+
+
+==== What are capabilities ====
+
+Various system calls check for appropriate privileges.  For example, a program
+may bypass normal file permission checking if it has the CAP_DAC_OVERRIDE
+capability.  There are a lot of capabilities; the complete list is in
+include/uapi/linux/capability.h.
+
+When reading this description, do not assume anything about the word
+"inheritable".  It probably does not do what you expect.
+
+Every task has the following pieces of capability-related state.
+
+ * Four capability bit masks:
+   * The effective set (pE).  Privileged operations check this set.
+   * The permitted set (pP).  Tasks may set these bits in pE.
+   * The inheritable set (pI).  This set is complicated.
+   * The bounding set (pB).  This partially limits new permitted capabilities.
+
+ * Secure bits.  Each bit has a corresponding "lock" bit.
+   * SECURE_NOROOT: Makes uid==0 and euid==0 less special at exec time.
+   * SECURE_KEEP_CAPS: Prevents setresuid() from removing permitted caps.
+   * SECURE_NO_SETUID_FIXUP: Makes setresuid() entirely nonmagical.
+
+ * no_new_privs: See Documentation/prctl/no_new_privs.txt
+
+There is one invariant: pE ⊆ pP.
+
+In addition, files can have capabilities.  If a file has capabilities, it
+specifies two masks and one bit:
+ * fP: The permitted or forced set.
+ * fI: The inheritable set.
+ * fE (a single bit): Supposedly true for "legacy" programs.
+
+libcap's setcap tool pretends that fE is a bitmask.  It's not.
+
+At the most basic level, only pE matters.  All of the complexity is in how
+pE and the other masks can change.  (This is a slight lie -- user namespaces
+change this.)
+
+==== System calls ====
+
+Capabilities and related state are affected by these syscalls:
+ * capset: Change capabilities directly.
+ * set[res]uid: Sometimes changes capabilities for legacy compatibility.
+ * prctl(PR_SET_KEEPCAPS): Used to twiddle SECURE_KEEP_CAPS.
+ * prctl(PR_SET_SECUREBITS): Used to twiddle securebits in general.
+ * prctl(PR_SET_NO_NEW_PRIVS): Used to set no_new_privs.
+ * prctl(PR_CAPBSET_DROP): Used to remove bits from pB.
+ * execve: Does all kinds of magic.
+
+==== capset ====
+
+capset changes pI, pP, and pE as requested, subject to:
+
+ - (CAP_SETPCAP ∈ pE or euid is namespace owner) or pI' ⊆ pI | pP
+ - pI' ⊆ pI | pB
+ - pP' ⊆ pP
+ - pE' ⊆ pP'
+
+In the event that pI ⊆ pB, the first two conditions simplify to pI' ⊆ pI | pP.
+
+==== set*uid ====
+
+After set[res]uid, if !SECURE_NO_SETUID_FIXUP, a fixup happens.  This fixup
+does two things:
+
+ - If !SECURE_KEEP_CAPS and some old uid was 0 and no new uid is 0, then
+   pP and pE are cleared.
+ - If euid becomes zero, then pE' = pP'.  Conversely, if euid becomes nonzero,
+   then pE' = 0.  (Note that this is independent of SECURE_KEEP_CAPS.)
+
+setfsuid has similar logic to tweak the fs-related pE bits.
+
+==== prctl ====
+
+---- PR_SET_KEEPCAPS ----
+
+This changes SECURE_KEEP_CAPS as long as !SECURE_KEEP_CAPS_LOCKED.
+CAP_SETPCAP is not required.
+
+---- PR_SET_SECUREBITS ----
+
+This changes securebits, subject to:
+ - The caller must have CAP_SETPCAP.
+ - The *_LOCKED bits can be set but not cleared.
+ - A locked bit cannot be changed.
+
+Note that an unprivileged process can change SECURE_KEEP_CAPS via
+PR_SET_KEEPCAPS but not via PR_SET_SECUREBITS.
+
+---- PR_SET_NO_NEW_PRIVS ----
+
+Sets the no_new_privs bit.  No privilege is required.  It is impossible
+to clear the no_new_privs bit.
+
+---- PR_CAPBSET_DROP ----
+
+Clears a single bit of pB.  Doing this requires CAP_SETPCAP.  There is no
+way to set a cleared bit of pB.
+
+==== execve ====
+
+execve's behavior is rather complicated.  It does this:
+
+Step 1: Load fI, fP, and fE.  If the file has no capabilities (the xattr
+is malformed or absent), then set fI = 0, fP = 0, and fE = false.  (In theory,
+fE is set on "legacy" binaries that don't know how to check their own
+capability sets.)
+
+Step 2: Apply the basic pP update rule:
+
+   pP' = (pB & fP) | (pI & fI)
+
+Step 3: If fE and fP ⊈ pP', then abort.  (This prevents legacy binaries from
+malfunctioning dangerously if pB is missing important bits.)
+
+Step 4: Apply a fixup for root if !SECURE_NOROOT.  The fixup is:
+
+ - If vfs caps were present, uid != 0, and euid == 0, then warn once per boot.
+ - Otherwise:
+   - If euid == 0 or uid == 0, then pP' = pB | pI.
+   - If euid == 0, then set fE = true.  (This does not affect the check
+     in step 3.)
+
+Step 5: Apply no_new_privs
+
+If no_new_privs is set (or if new euid != old uid or new egid != old gid and
+an unprivileged ptracer is attached), then set euid = uid, egid = gid,
+and set pP' = pP' & pP.  (Note: If CAP_SETUID is effective (in the old context)
+and no_new_privs is not set, then the euid and egid changes are skipped.)
+
+Step 6: Compute pE
+
+If fE, then pE' = pP'.  Else pE' = 0.
+
+Step 7: Clear SECURE_KEEP_CAPS.
+
+This happens regardless of the setting of SECURE_KEEP_CAPS_LOCKED.  Setting
+SECURE_KEEP_CAPS_LOCKED is therefore probably a mistake unless
+SECURE_NO_SETUID_FIXUP is set.
+
+
+In the absence of something like no_new_privs, either
+
+pP' = (pB & fP) | (pI & fI)  (the normal case)
+
+or
+
+pP' = pB | pI (if !SECURE_NOROOT and either euid or uid == 0)
+
+The latter condition means that, if euid or uid is zero, then execve acts
+(in part) as though fP = fI = <all bits set>.
+
+
+The upshot: pI bits can result in actual (pP or pE) privilege if you exec a
+program that has that fI bit set *or* you have !issecure(SECURE_NOROOT) and
+(euid == 0 || uid == 0).  (That latter case is possibly better understood
+as promoting pB bits to pP.)
-- 
1.7.11.7