Re: [systemd-devel] avoid unmounts in unprivileged containers

2021-03-01 Thread Rodny Molina
Right, in systemd's case there's no access to the external namespaces, but
being the first process in a container allows you to take a snapshot of
/proc/1/mountinfo during initialization (the container runtime would have
all the initial mountpoints ready by then), and store all these mountpoints
and their parameters in a hash map. That way, when an unmount request
arrives, you just need to compare it against your mountpoint DB to
determine whether it is "foreign" or "local".
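The snapshot-and-compare idea could be sketched along these lines (a
hypothetical Python simplification, not Sysbox's actual code; the
MountTracker name and field layout are illustrative, and only the first
five fields of each mountinfo line are used):

```python
# Hypothetical sketch of the snapshot-and-compare approach: parse the
# first five fields of each /proc/1/mountinfo line into a map at init
# time, then classify later unmount requests against that snapshot.

def parse_mountinfo(text):
    """Map each mountpoint to its mount id, major:minor device id and root."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        mount_id, _parent_id, majmin, root, mountpoint = fields[:5]
        table[mountpoint] = {"id": mount_id, "dev": majmin, "root": root}
    return table

class MountTracker:
    def __init__(self, initial_mountinfo):
        # Snapshot taken while we are PID 1, right after the container
        # runtime has set up the initial mounts.
        self.initial = parse_mountinfo(initial_mountinfo)

    def is_foreign(self, mountpoint):
        # Anything present in the snapshot was created by the runtime
        # ("foreign"); anything that shows up later is "local".
        return mountpoint in self.initial
```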

In our particular case (the Sysbox container runtime) the simplistic
approach above was not enough because, on top of that, we also needed to
prevent certain remount operations (read-only -> read-write) from
succeeding ...
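That remount check could be sketched as follows (hypothetical; it assumes
the runtime intercepts mount(2) calls and can see their flags, and that it
knows from its snapshot whether the target was mounted read-only):

```python
# Linux mount(2) flag values (from <sys/mount.h>).
MS_RDONLY = 0x1
MS_REMOUNT = 0x20

def should_deny_remount(flags, was_readonly):
    """Deny a remount that would drop the read-only protection on a
    runtime-owned ("foreign") mountpoint."""
    is_remount = bool(flags & MS_REMOUNT)
    keeps_readonly = bool(flags & MS_RDONLY)
    return is_remount and was_readonly and not keeps_readonly
```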

Hope it makes sense now.

cheers,

/Rodny



On Mon, Mar 1, 2021 at 1:29 PM Lennart Poettering 
wrote:

> On Sa, 27.02.21 11:28, Rodny Molina (rodnymol...@gmail.com) wrote:
>
> > Thanks for your detailed answer / explanation Lennart, it's fully
> > consistent with my code-browsing findings.
> >
> > I've been struggling with the problem you alluded to above: identifying
> > "foreign" mountpoints. After banging my head against the wall for a
> > while I ended up implementing a heuristic based on the
> > major:minor-number field of the /proc/pid/mountinfo file: if the
> > container mountpoint being considered has a major:minor id that matches
> > one of the major:minor ids present in the host mount namespace, then it
> > is likely a "foreign" mountpoint, and shouldn't be unmounted.
>
> Not sure I follow. We'd need this from inside the container, so that
> we don't even try to unmount the file system. But from "inside" we
> have no view into the host mount namespace...
>
> Lennart
>
> --
> Lennart Poettering, Berlin
>


-- 
/Rodny
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] avoid unmounts in unprivileged containers

2021-02-27 Thread Rodny Molina
Thanks for your detailed answer / explanation Lennart, it's fully
consistent with my code-browsing findings.

I've been struggling with the problem you alluded to above: identifying
"foreign" mountpoints. After banging my head against the wall for a
while I ended up implementing a heuristic based on the
major:minor-number field of the /proc/pid/mountinfo file: if the
container mountpoint being considered has a major:minor id that matches
one of the major:minor ids present in the host mount namespace, then it
is likely a "foreign" mountpoint, and shouldn't be unmounted.

Obviously, this would force you to extend the current systemd mountinfo
parser. And there is a caveat: not all file-systems use a unique /
differentiated ID for every new mountpoint (e.g. the "/dev/null" fs always
uses the same major:minor id across different mount namespaces), so there
could be false positives, but that doesn't represent a problem in our case.
Here is the specific code if you want to check it out:
https://github.com/nestybox/sysbox-fs/blob/master/mount/infoParser.go#L828
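For illustration, the heuristic could be sketched like this (a hypothetical
Python simplification of the linked Go code, not the actual infoParser.go
logic; field 3 of each mountinfo line is the major:minor device id):

```python
# Hypothetical sketch of the major:minor heuristic: a container mountpoint
# whose device id also shows up in the host mount namespace is likely
# "foreign" (runtime-created) and shouldn't be unmounted.

def device_ids(mountinfo_text):
    """Collect the major:minor ids (3rd field) seen in a mountinfo dump."""
    ids = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            ids.add(fields[2])
    return ids

def likely_foreign(container_mountinfo_line, host_dev_ids):
    """True if the mountpoint's major:minor id matches one seen in the
    host mount namespace (false positives possible, per the caveat)."""
    fields = container_mountinfo_line.split()
    return len(fields) >= 5 and fields[2] in host_dev_ids
```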

Please let me know if you ever find a better approach.

cheers,

/Rodny

On Wed, Feb 24, 2021 at 9:19 AM Lennart Poettering 
wrote:

> On Fr, 19.02.21 19:17, Rodny Molina (rodnymol...@gmail.com) wrote:
>
> > Hi,
> >
> > As part of a prototype I'm working on to run systemd within an
> unprivileged
> > docker container, I would like to prevent mountpoints created at runtime
> > from being unmounted during the container shutdown process. I understand
> > that systemd creates ".mount" units dynamically for
> > these mountpoints as they show up in /proc/pid/mountinfo, but after
> reading
> > the docs + code, I don't see a way to avoid these unmounts during the
> > shutdown.target execution.
>
> Yeah, it would be great if we could automatically determine "foreign
> owned" mounts, and then step away from them. But there's really no way
> for us to figure that out, at least to my knowledge. Ideally
> /proc/self/mountinfo would tell us about this in some field, but it
> really doesn't afaik.
>
> > Interestingly, I see that there's code
> > <
> https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398>
> > that
> > skips the unmounting cycle based on the ConditionVirtualization /
> > containerized settings, which is what I need, but I'm not able to see
> > that code being called during the container shutdown -- probably I'm not
> > understanding systemd's FSM unwinding logic well enough ...
>
> There are two phases of shutdown: the regular phase where we follow
> mount unit deps, and stuff is umounted via /sbin/umount. i.e. where
> the shutdown is handled by the usual unit logic.
>
> And then there's the second phase which shutdown.c implements: it's a
> separate binary that PID 1 invokes via execve() (so that it becomes
> the new PID 1) and then pretty robustly just tries to
> umount/detach/disassemble/… whatever might be left over, without any
> understanding of dependencies.
>
> The first phase hence is the "clean" shutdown logic and the second
> phase is the "dirty" fallback logic that tries really hard to sync/put
> file systems into a clean state if the first phase fails (maybe
> because of some misplaced deps).
>
> The second phase is skipped in containers, the first one is not. The
> second phase is unnecessary in containers since the container manager
> and namespace cleanup take care of this anyway, and even if it didn't,
> the host's shutdown logic can take responsibility for all this.
>
> Now, if the kernel would provide us with the info we'd generate the
> deps for .mount units synthesized from /proc/self/mountinfo in a way
> that "foreign owned" mounts won't get unmounted in phase 1, but we
> simply can't do that automatically since we can't distinguish
> them. :-(
>
> You could manually define .mount units for all units you know are
> owned by the outside container manager, but that is nasty and
> fragile. The mount units would have to carefully have the right deps
> (or better: should miss the right deps) to ensure things are clean
> when shutting down.
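For illustration, such a manually defined unit might look like the sketch
below (the What=/Where= values are placeholders; the key point, per
systemd.mount(5), is that DefaultDependencies=no omits the implicit
Conflicts=/Before= on umount.target, so no stop job gets queued for the
mount during the clean shutdown phase):

```ini
# Hypothetical /etc/systemd/system/var-lib-kubelet.mount
[Unit]
Description=Mountpoint owned by the outside container manager
# Omit the default Conflicts=umount.target / Before=umount.target,
# so the clean shutdown phase leaves this mount alone.
DefaultDependencies=no

[Mount]
What=/dev/sda1
Where=/var/lib/kubelet
Type=ext4
```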
>
> So yeah, I'd love to fix this properly, generically, but this requires
> some kernel work first, and that's not just a technical difficulty but,
> given the maintainer of said interfaces, also a political one.
>
> Lennart
>
> --
> Lennart Poettering, Berlin
>


-- 
/Rodny


Re: [systemd-devel] avoid unmounts in unprivileged containers

2021-02-23 Thread Rodny Molina
Partially answering my own questions ...

The code that I was referring to (systemd-shutdown), which takes
containerized environments into account (ConditionVirtualization) and
avoids doing the unmounts, is invoked at a later stage
<https://github.com/systemd/systemd/blob/main/src/core/main.c#L1558> in the
shutdown cycle. By the time this code executes, all the mountpoints that I
care about (those extracted from /proc/pid/mountinfo at runtime) are
already unmounted.

So I still have no answer to my original question: is there any config knob
to avoid doing the unmounts during the container-shutdown process?

Thanks!

On Fri, Feb 19, 2021 at 7:17 PM Rodny Molina  wrote:

> Hi,
>
> As part of a prototype I'm working on to run systemd within an
> unprivileged docker container, I would like to prevent mountpoints created
> at runtime from being unmounted during the container shutdown process. I
> understand that systemd creates ".mount" units dynamically for
> these mountpoints as they show up in /proc/pid/mountinfo, but after reading
> the docs + code, I don't see a way to avoid these unmounts during the
> shutdown.target execution.
>
> Interestingly, I see that there's code
> <https://github.com/systemd/systemd/blob/main/src/shutdown/shutdown.c#L398> 
> that
> skips the unmounting cycle based on the ConditionVirtualization /
> containerized settings, which is what I need, but I'm not able to see
> that code being called during the container shutdown -- probably I'm not
> understanding systemd's FSM unwinding logic well enough ...
>
> Any suggestions?
>
> Thanks!
>
> PS: Last few logs obtained during my container shutdown process ...
>
> ---
> Feb 20 03:00:23 08363a0a79ee umount[1273]: umount: /var/lib/kubelet: must
> be superuser to unmount.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: Received SIGCHLD from PID 1273
> (umount).
> Feb 20 03:00:23 08363a0a79ee systemd[1]: Child 1273 (umount) died
> (code=exited, status=32/n/a)
> Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Child 1273
> belongs to var-lib-kubelet.mount.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Mount
> process exited, code=exited, status=32/n/a
> Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Changed
> unmounting -> mounted
> Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Job 180
> var-lib-kubelet.mount/stop finished, result=failed
> Feb 20 03:00:23 08363a0a79ee systemd[1]: Failed unmounting
> /var/lib/kubelet.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-journald.service:
> Received EPOLLHUP on stored fd 47 (stored), closing.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: local-fs-pre.target changed
> active -> dead
> Feb 20 03:00:23 08363a0a79ee systemd[1]: local-fs-pre.target: Job 156
> local-fs-pre.target/stop finished, result=done
> Feb 20 03:00:23 08363a0a79ee systemd[1]: Stopped target Local File Systems
> (Pre).
> Feb 20 03:00:23 08363a0a79ee systemd[1]: umount.target changed dead ->
> active
> Feb 20 03:00:23 08363a0a79ee systemd[1]: umount.target: Job 168
> umount.target/start finished, result=done
> Feb 20 03:00:23 08363a0a79ee systemd[1]: Reached target Unmount All
> Filesystems.
> Feb 20 03:00:23 08363a0a79ee systemd[1]:
> systemd-tmpfiles-setup-dev.service: Succeeded.
> Feb 20 03:00:23 08363a0a79ee systemd[1]:
> systemd-tmpfiles-setup-dev.service: Service restart not allowed.
> Feb 20 03:00:23 08363a0a79ee systemd[1]:
> systemd-tmpfiles-setup-dev.service: Changed exited -> dead
> Feb 20 03:00:23 08363a0a79ee systemd[1]:
> systemd-tmpfiles-setup-dev.service: Job 105
> systemd-tmpfiles-setup-dev.service/stop finished, result=done
> Feb 20 03:00:23 08363a0a79ee systemd[1]: Stopped Create Static Device
> Nodes in /dev.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service:
> Succeeded.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service: Service
> restart not allowed.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service: Changed
> exited -> dead
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service: Job 164
> systemd-sysusers.service/stop finished, result=done
> Feb 20 03:00:23 08363a0a79ee systemd[1]: Stopped Create System Users.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service:
> Succeeded.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service:
> Service restart not allowed.
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service:
> Changed exited -> dead
> Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service: Job
> 117 systemd-remount-fs.service/stop finished, result=done
> Feb 20 03:00:23 08363a0a79ee systemd[1]:

[systemd-devel] avoid unmounts in unprivileged containers

2021-02-19 Thread Rodny Molina
Hi,

As part of a prototype I'm working on to run systemd within an unprivileged
docker container, I would like to prevent mountpoints created at runtime
from being unmounted during the container shutdown process. I understand
that systemd creates ".mount" units dynamically for
these mountpoints as they show up in /proc/pid/mountinfo, but after reading
the docs + code, I don't see a way to avoid these unmounts during the
shutdown.target execution.

Interestingly, I see that there's code

that
skips the unmounting cycle based on the ConditionVirtualization /
containerized settings, which is what I need, but I'm not able to see
that code being called during the container shutdown -- probably I'm not
understanding systemd's FSM unwinding logic well enough ...

Any suggestions?

Thanks!

PS: Last few logs obtained during my container shutdown process ...

---
Feb 20 03:00:23 08363a0a79ee umount[1273]: umount: /var/lib/kubelet: must
be superuser to unmount.
Feb 20 03:00:23 08363a0a79ee systemd[1]: Received SIGCHLD from PID 1273
(umount).
Feb 20 03:00:23 08363a0a79ee systemd[1]: Child 1273 (umount) died
(code=exited, status=32/n/a)
Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Child 1273
belongs to var-lib-kubelet.mount.
Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Mount
process exited, code=exited, status=32/n/a
Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Changed
unmounting -> mounted
Feb 20 03:00:23 08363a0a79ee systemd[1]: var-lib-kubelet.mount: Job 180
var-lib-kubelet.mount/stop finished, result=failed
Feb 20 03:00:23 08363a0a79ee systemd[1]: Failed unmounting /var/lib/kubelet.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-journald.service: Received
EPOLLHUP on stored fd 47 (stored), closing.
Feb 20 03:00:23 08363a0a79ee systemd[1]: local-fs-pre.target changed active
-> dead
Feb 20 03:00:23 08363a0a79ee systemd[1]: local-fs-pre.target: Job 156
local-fs-pre.target/stop finished, result=done
Feb 20 03:00:23 08363a0a79ee systemd[1]: Stopped target Local File Systems
(Pre).
Feb 20 03:00:23 08363a0a79ee systemd[1]: umount.target changed dead ->
active
Feb 20 03:00:23 08363a0a79ee systemd[1]: umount.target: Job 168
umount.target/start finished, result=done
Feb 20 03:00:23 08363a0a79ee systemd[1]: Reached target Unmount All
Filesystems.
Feb 20 03:00:23 08363a0a79ee systemd[1]:
systemd-tmpfiles-setup-dev.service: Succeeded.
Feb 20 03:00:23 08363a0a79ee systemd[1]:
systemd-tmpfiles-setup-dev.service: Service restart not allowed.
Feb 20 03:00:23 08363a0a79ee systemd[1]:
systemd-tmpfiles-setup-dev.service: Changed exited -> dead
Feb 20 03:00:23 08363a0a79ee systemd[1]:
systemd-tmpfiles-setup-dev.service: Job 105
systemd-tmpfiles-setup-dev.service/stop finished, result=done
Feb 20 03:00:23 08363a0a79ee systemd[1]: Stopped Create Static Device Nodes
in /dev.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service:
Succeeded.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service: Service
restart not allowed.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service: Changed
exited -> dead
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-sysusers.service: Job 164
systemd-sysusers.service/stop finished, result=done
Feb 20 03:00:23 08363a0a79ee systemd[1]: Stopped Create System Users.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service:
Succeeded.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service:
Service restart not allowed.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service:
Changed exited -> dead
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-remount-fs.service: Job
117 systemd-remount-fs.service/stop finished, result=done
Feb 20 03:00:23 08363a0a79ee systemd[1]: Stopped Remount Root and Kernel
File Systems.
Feb 20 03:00:23 08363a0a79ee systemd[1]: shutdown.target changed dead ->
active
Feb 20 03:00:23 08363a0a79ee systemd[1]: shutdown.target: Job 89
shutdown.target/start finished, result=done
Feb 20 03:00:23 08363a0a79ee systemd[1]: Reached target Shutdown.
Feb 20 03:00:23 08363a0a79ee systemd[1]: final.target changed dead -> active
Feb 20 03:00:23 08363a0a79ee systemd[1]: final.target: Job 167
final.target/start finished, result=done
Feb 20 03:00:23 08363a0a79ee systemd[1]: Reached target Final Step.
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-halt.service: Failed to
reset devices.allow/devices.deny: Operation not permitted
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-halt.service: Failed to
set invocation ID on control group /system.slice/systemd-halt.service,
ignoring: Operation not permitted
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-halt.service: Failed to
remove delegate flag on control group /system.slice/systemd-halt.service,
ignoring: Operation not permitted
Feb 20 03:00:23 08363a0a79ee systemd[1]: systemd-halt.service: Passing 0
fds to service
Feb 20 03:00:23 08363a0a79ee sys