Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Mon, 07.11.11 11:09, Williams, Dan J (dan.j.willi...@intel.com) wrote:

> > > What exactly is kill_all_processes()? is it SIGTERM or SIGKILL or both with a gap or ???
> >
> > SIGTERM followed by SIGKILL after 5s if the programs do not react to that in time. But note that this logic only applies to processes which for some reason managed to escape systemd's usual cgroup-based killing logic. Normal services are hence already killed at that time, and only processes which moved themselves out of any cgroup or for which the service files disabled killing might survive to this point.
>
> So I think mdmon should always try to escape itself from cgroup based killing. It follows the lifespan of the array, and if the array is not stopped by the cgroup exit (or the array lifespan is not controlled in a service file), then mdmon must keep running.

Well, I think when it gets killed by the cgroup-based killer then it should try to tear down its MD device.

In the mdmon service file use SendSIGKILL=no to disable sending of SIGKILL after the initial SIGTERM. With KillSignal= you can choose the signal you first want to be killed with, if you don't want it to be SIGTERM. With KillMode= you can choose whether only the main process of the service, all processes of the service, or no processes of the service shall be killed. With TimeoutSec= you can set the timeout between the SIGTERM and the SIGKILL. See systemd.service(5) for more information.

> > You have relatively flexible control of the first step in this code. The second step is then the hammer that tries to fix up what this step didn't accomplish. My suggestion to check argv[0][0] was to avoid the hammer.
>
> I notice that if the rootfs is on a dm or md device systemd/shutdown will always fall through to ultimate_send_signal() which will not discriminate against processes flagged with '@'. Since we aren't stopping the root md device I wonder if ultimate_send_signal() should also ignore flagged processes, or whether the failure to stop the root device is to be expected and let shutdown skip ultimate_send_signal() if the only remaining work is shutting down the rootfs-blockdev. I'm leaning towards the latter.

The idea was to skip processes flagged with '@' in both the ultimate_send_signal() and send_signal() calls.

Lennart

--
Lennart Poettering - Red Hat, Inc.
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/systemd-devel
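The SIGTERM-then-SIGKILL escalation described in this message can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not systemd's code: terminate_child() is a hypothetical helper, it only works for a direct child of the caller (systemd, as PID 1, reaps every process), and it polls with waitpid() on a 10 ms tick chosen purely for simplicity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: send SIGTERM to a direct child, wait up to
 * timeout_ms for it to exit, then fall back to SIGKILL.
 * Returns 0 if SIGTERM was enough, 1 if SIGKILL was needed, -1 on error. */
int terminate_child(pid_t pid, unsigned timeout_ms) {
    const struct timespec tick = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    assert(pid > 0);

    if (kill(pid, SIGTERM) < 0)
        return -1;

    /* Poll for exit in 10 ms steps until the timeout has elapsed. */
    for (unsigned waited = 0; waited < timeout_ms; waited += 10) {
        pid_t r = waitpid(pid, NULL, WNOHANG);
        if (r == pid)
            return 0;               /* exited after SIGTERM */
        if (r < 0 && errno != EINTR)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* The "hammer": the process did not react in time. */
    if (kill(pid, SIGKILL) < 0)
        return -1;
    while (waitpid(pid, NULL, 0) < 0)
        if (errno != EINTR)
            return -1;
    return 1;
}
```

A caller mirroring the 5s gap mentioned above would use terminate_child(pid, 5000). A service unit that sets SendSIGKILL=no corresponds to stopping after the polling loop instead of escalating.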
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On 11-11-08 01:11, Michal Soltys wrote:

> I've peeked into systemd, and from what I can see, it /only/ jumps back to initramfs (prepare_new_root() and pivot_to_new_root()) if the shutdown binary is present on initramfs. And whether mdmon is still running or not is not in any way a determinant for the pivot_root(2) call to succeed (or ... ?).
>
> If /run/initramfs/shutdown is not present, then systemd just does things the old way as far as I can see - it doesn't even attempt to pivot. And if it doesn't, then it can't umount the root (being itself tied to it) ?
>
> So essentially, if systemd execs /shutdown (after pivoting to /run/initramfs) - then it's dracut's modules.d/99shutdown, which itself sources hooks from other modules to do the rest of the cleaning job. And that should take care of all the remaining stuff (including terminating mdmon in a graceful way, and then umounting /oldroot). Either way - pretty simple to add the necessary functionality to dracut.
>
> So wouldn't simply a systemd cgroup named say - immortals - with mdmon (by default) in it suffice ? Pivot back as usual, leave mdmon alive, let dracut (or anything else used for initramfs) do the rest of the job (properly).

I did some testing today, and it's all working nicely as expected. Actually I modified the classic rc scripts I'm using under sysinit to perform a full umount/detach (using similar methods to systemd), with mdmon happily living through everything.

The only things needed after pivot_root were:

    mdmon --takeover --all
    telinit U

(so obviously my dracut image had mdmon, telinit and init, and a slightly adjusted shutdown script). Then everything from oldroot could be nicely and cleanly umounted.

Even more elegant would be if e.g. mdmon had an added option such as:

    --reroot newroot

to chroot() and reopen its files under newroot, and then systemd would call

    mdmon --reroot /run/initramfs --all --takeover

after prepare_new_root() and before pivot_to_new_root().

Then even an existing initramfs image could (probably) be mdmon-agnostic.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On 11-11-08 17:46, Michal Soltys wrote:

> Then even an existing initramfs image could (probably) be mdmon-agnostic.

Actually:

    chroot /run/initramfs mdmon --takeover --all

worked just fine (after preparing the new root - so after all mount --binds, and before pivot_root(8)).

So in the context of systemd instead of sysv scripts - a fork / chroot / exec mdmon / wait - instead of killing it would do the thing, followed by pivot_to_new_root().

Actually anything that could benefit from immortality in one or the other way (perhaps udevd, so e.g. lvm doesn't need --noudevsync ? - when taken over inside dracut's shutdown or anything similar after going back to initramfs) that can be pre-chrooted into /run/initramfs and exec'ed, should work just fine. For the record, udevd worked properly with pivot survival.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, Nov 8, 2011 at 6:43 AM, Lennart Poettering lenn...@poettering.net wrote:

> On Mon, 07.11.11 11:09, Williams, Dan J (dan.j.willi...@intel.com) wrote:
>
> > So I think mdmon should always try to escape itself from cgroup based killing. It follows the lifespan of the array, and if the array is not stopped by the cgroup exit (or the array lifespan is not controlled in a service file), then mdmon must keep running.
>
> Well, I think when it gets killed by the cgroup-based killer then it should try to tear down its MD device.

We can easily fall off the complexity cliff trying to tear down the MD device because it can be pinned by a mounted filesystem or be claimed anywhere in an arbitrary stack of DM or MD devices. I did not think cgroup exit would umount() filesystems?

[..]

> > I notice that if the rootfs is on a dm or md device systemd/shutdown will always fall through to ultimate_send_signal() which will not discriminate against processes flagged with '@'. Since we aren't stopping the root md device I wonder if ultimate_send_signal() should also ignore flagged processes, or whether the failure to stop the root device is to be expected and let shutdown skip ultimate_send_signal() if the only remaining work is shutting down the rootfs-blockdev. I'm leaning towards the latter.
>
> The idea was to skip processes flagged with '@' in both the ultimate_send_signal() and send_signal() calls.

Ok, that makes it easier.

--
Dan
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Mon, 07.11.11 13:52, NeilBrown (ne...@suse.de) wrote:

> > Why doesn't the kernel do that on its own?
>
> Because the kernel doesn't know about the format of the metadata that describes the array.

Yupp, my suggestion would be to change that.

> > What we do right now is this:
> >
> >     kill_all_processes();
> >     do {
> >         umount_all_file_systems_we_can();
> >         read_only_mount_all_remaining_file_systems();
> >     } while (we_had_some_success_with_that());
> >     jump_into_initrd();
> >
> > As long as mdmon references a file from the root disk we cannot umount it, so the loop wouldn't be effective.
>
> What exactly is kill_all_processes()? is it SIGTERM or SIGKILL or both with a gap or ???

SIGTERM followed by SIGKILL after 5s if the programs do not react to that in time. But note that this logic only applies to processes which for some reason managed to escape systemd's usual cgroup-based killing logic. Normal services are hence already killed at that time, and only processes which moved themselves out of any cgroup or for which the service files disabled killing might survive to this point.

> I assume a SIGKILL. I don't mind a SIGTERM and it could be useful to expedite mdmon cleaning up. However there is an important piece missing. When you remount,ro a filesystem, the block device doesn't get told so it thinks it is still open read/write. So md cannot tell mdmon that the array is now read-only.
>
> It would make a lot of sense for mdmon to exit after receiving a SIGTERM as soon as the device is marked read-only. But it just doesn't know.

As mentioned by Kay, you can get notifications for this by poll()ing on /proc/self/mountinfo. Note again however, that we kill first, and only then try to unmount/remount.

> We can probably fix that, but that doesn't really help for now.
>
> I think I would like:
>
> - add to the above loop "stop any virtual devices that we can". Exactly how to do that if /proc and /sys are already unmounted is unclear. Is one or both of these kept around somewhere?

/proc and /sys are not unmounted in this loop. Being virtual API file systems, we exclude them from this logic and leave them around until the initrd unmounts them if it wants to.

Actually, in the loop above there are three more steps: in each iteration we also try to detach all swap devices, all loopback devices and all DM devices. We probably could add a similar operation for MD devices here too. But note that this loop is more of a last-resort thing, and normally shouldn't do much.

> - allow processes to be marked some way so they get SIGTERM but not SIGKILL. I'm happy adding a magic char to argv[0].

Note that you can configure how you are killed relatively flexibly in the service files and that the loop pointed out above is only this last-resort thing which is applied to all processes/mount points/... which stick around after the normal shutdown.

Here's another attempt at explaining how this works:

<snip>
    terminate_all_mount_and_service_units();
    kill_all_remaining_processes();
    do {
        umount_all_remaining_file_systems_we_can();
        read_only_mount_all_remaining_file_systems();
        detach_all_remaining_loop_devices();
        detach_all_remaining_swap_devices();
        detach_all_remaining_dm_devices();
    } while (we_had_some_success_with_that());
    jump_into_initrd();
</snip>

You have relatively flexible control of the first step in this code. The second step is then the hammer that tries to fix up what this step didn't accomplish. My suggestion to check argv[0][0] was to avoid the hammer.

Lennart

--
Lennart Poettering - Red Hat, Inc.
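The "repeat while at least one step made progress" structure of the pseudocode above can be sketched in C as follows. This is a toy model under stated assumptions and not code from systemd's shutdown.c: detach_step_fn, run_detach_loop(), the max_rounds safety cap, and the toy_state example step are all hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A detach step returns true if it managed to change something
 * (unmounted a file system, detached a device, ...). */
typedef bool (*detach_step_fn)(void *state);

/* Run all steps repeatedly until a full round makes no progress,
 * or until max_rounds is reached. Returns the number of rounds run. */
unsigned run_detach_loop(const detach_step_fn *steps, size_t n_steps,
                         void *state, unsigned max_rounds) {
    unsigned rounds = 0;
    bool progress;

    assert(steps != NULL);

    do {
        progress = false;
        for (size_t i = 0; i < n_steps; i++)
            if (steps[i](state))
                progress = true;
        rounds++;
    } while (progress && rounds < max_rounds);

    return rounds;
}

/* Toy stand-in for umount_all_remaining_file_systems_we_can():
 * mounts are stacked, so only the topmost one can go in each round. */
struct toy_state {
    int stacked_mounts;
};

bool toy_umount_one(void *p) {
    struct toy_state *s = p;

    if (s->stacked_mounts == 0)
        return false;
    s->stacked_mounts--;
    return true;
}
```

The loop matters because layers pin each other: a file system may only become unmountable after a loop or DM device above it has been detached in an earlier round, so a single pass is not enough.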
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Mon, Nov 7, 2011 at 4:00 AM, Lennart Poettering lenn...@poettering.net wrote:

> On Mon, 07.11.11 13:52, NeilBrown (ne...@suse.de) wrote:
>
> > > Why doesn't the kernel do that on its own?
> >
> > Because the kernel doesn't know about the format of the metadata that describes the array.
>
> Yupp, my suggestion would be to change that.

It's quite a bit of idiosyncratic code that needs to be duplicated in kernel space and userspace (since userspace always needs to know how to parse the metadata for array assembly). All to record a dirty bit that flips at most every 5 seconds, or a disk failure event which is even less frequent. Throw in policy constraints like restricting which block devices can become part of the raid set. Rinse and repeat for every possible metadata format.

[..]

> > What exactly is kill_all_processes()? is it SIGTERM or SIGKILL or both with a gap or ???
>
> SIGTERM followed by SIGKILL after 5s if the programs do not react to that in time. But note that this logic only applies to processes which for some reason managed to escape systemd's usual cgroup-based killing logic. Normal services are hence already killed at that time, and only processes which moved themselves out of any cgroup or for which the service files disabled killing might survive to this point.

So I think mdmon should always try to escape itself from cgroup based killing. It follows the lifespan of the array, and if the array is not stopped by the cgroup exit (or the array lifespan is not controlled in a service file), then mdmon must keep running.

[..]

> Here's another attempt at explaining how this works:
>
> <snip>
>     terminate_all_mount_and_service_units();
>     kill_all_remaining_processes();
>     do {
>         umount_all_remaining_file_systems_we_can();
>         read_only_mount_all_remaining_file_systems();
>         detach_all_remaining_loop_devices();
>         detach_all_remaining_swap_devices();
>         detach_all_remaining_dm_devices();

So I've started putting together a md_detach_all() routine that will attempt to stop all md devices (via sysfs). This covers the case where all mdmon instances have escaped the initial killall thanks to the argv '@' flagging. Like the dm implementation it will address all but the root md device.

>     } while (we_had_some_success_with_that());
>     jump_into_initrd();

The final act of the initramfs is then mdadm --wait-clean --scan to communicate with the rootfs-blockdev-mdmon to be sure the metadata has been marked clean. All other mdmon instances should have exited naturally when their md devices stopped, but the --wait-clean --scan will have ensured shutdown can progress safely.

> You have relatively flexible control of the first step in this code. The second step is then the hammer that tries to fix up what this step didn't accomplish. My suggestion to check argv[0][0] was to avoid the hammer.

I notice that if the rootfs is on a dm or md device systemd/shutdown will always fall through to ultimate_send_signal() which will not discriminate against processes flagged with '@'. Since we aren't stopping the root md device I wonder if ultimate_send_signal() should also ignore flagged processes, or whether the failure to stop the root device is to be expected and let shutdown skip ultimate_send_signal() if the only remaining work is shutting down the rootfs-blockdev. I'm leaning towards the latter.

--
Dan
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On 11-11-02 14:32, Lennart Poettering wrote:

> What we do right now is this:
>
>     kill_all_processes();
>     do {
>         umount_all_file_systems_we_can();
>         read_only_mount_all_remaining_file_systems();
>     } while (we_had_some_success_with_that());
>     jump_into_initrd();
>
> As long as mdmon references a file from the root disk we cannot umount it, so the loop wouldn't be effective.

I've peeked into systemd, and from what I can see, it /only/ jumps back to initramfs (prepare_new_root() and pivot_to_new_root()) if the shutdown binary is present on initramfs. And whether mdmon is still running or not is not in any way a determinant for the pivot_root(2) call to succeed (or ... ?).

If /run/initramfs/shutdown is not present, then systemd just does things the old way as far as I can see - it doesn't even attempt to pivot. And if it doesn't, then it can't umount the root (being itself tied to it) ?

So essentially, if systemd execs /shutdown (after pivoting to /run/initramfs) - then it's dracut's modules.d/99shutdown, which itself sources hooks from other modules to do the rest of the cleaning job. And that should take care of all the remaining stuff (including terminating mdmon in a graceful way, and then umounting /oldroot). Either way - pretty simple to add the necessary functionality to dracut.

So wouldn't simply a systemd cgroup named say - immortals - with mdmon (by default) in it suffice ? Pivot back as usual, leave mdmon alive, let dracut (or anything else used for initramfs) do the rest of the job (properly).

p.s. Sorry if I missed something obvious, it was a quick and late peek over systemd's shutdown.c.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Wed, 2 Nov 2011 14:32:25 +0100 Lennart Poettering lenn...@poettering.net wrote:

> On Wed, 02.11.11 13:03, NeilBrown (ne...@suse.de) wrote:
>
> > Each instance of mdmon manages a set of arrays and must remain running until all of those arrays are readonly (or shut down). This allows it to record that all writes have completed and mark the array as 'clean' so a resync isn't needed at next boot.
>
> Why doesn't the kernel do that on its own?

Because the kernel doesn't know about the format of the metadata that describes the array.

> > You couldn't just do the equivalent of
> >
> >     fuser -k /some/filesystem
> >     umount /some/filesystem
> >
> > iterating over filesystems with '/' last? Then anything that only uses the /run filesystem will survive.
>
> What we do right now is this:
>
>     kill_all_processes();
>     do {
>         umount_all_file_systems_we_can();
>         read_only_mount_all_remaining_file_systems();
>     } while (we_had_some_success_with_that());
>     jump_into_initrd();
>
> As long as mdmon references a file from the root disk we cannot umount it, so the loop wouldn't be effective.

What exactly is kill_all_processes()? is it SIGTERM or SIGKILL or both with a gap or ???

I assume a SIGKILL. I don't mind a SIGTERM and it could be useful to expedite mdmon cleaning up. However there is an important piece missing. When you remount,ro a filesystem, the block device doesn't get told so it thinks it is still open read/write. So md cannot tell mdmon that the array is now read-only.

It would make a lot of sense for mdmon to exit after receiving a SIGTERM as soon as the device is marked read-only. But it just doesn't know.

We can probably fix that, but that doesn't really help for now.

I think I would like:

- add to the above loop "stop any virtual devices that we can". Exactly how to do that if /proc and /sys are already unmounted is unclear. Is one or both of these kept around somewhere?

- allow processes to be marked some way so they get SIGTERM but not SIGKILL. I'm happy adding a magic char to argv[0].

We should be able to make it work with those changes - if they are possible.

Thanks,
NeilBrown
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Mon, Nov 7, 2011 at 03:52, NeilBrown ne...@suse.de wrote:

> However there is an important piece missing. When you remount,ro a filesystem, the block device doesn't get told so it thinks it is still open read/write. So md cannot tell mdmon that the array is now read-only.

That ro/rw flag is visible in /proc/self/mountinfo, shouldn't it be possible for mdmon to poll() that file and let the kernel wake stuff up when the ro/rw flag changes, like we do for the usual mount changes already?

Kay
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Mon, 7 Nov 2011 04:42:54 +0100 Kay Sievers kay.siev...@vrfy.org wrote:

> On Mon, Nov 7, 2011 at 03:52, NeilBrown ne...@suse.de wrote:
>
> > However there is an important piece missing. When you remount,ro a filesystem, the block device doesn't get told so it thinks it is still open read/write. So md cannot tell mdmon that the array is now read-only.
>
> That ro/rw flag is visible in /proc/self/mountinfo, shouldn't it be possible for mdmon to poll() that file and let the kernel wake stuff up when the ro/rw flag changes, like we do for the usual mount changes already?
>
> Kay

The ro/rw flag for file systems is in /proc/self/mountinfo. However I want the ro/rw flag for the block device.

A block device can be partitioned so it might have multiple filesystems on it. And it might have swap too, or a dm table, or another md device, or an open file descriptor, or ...

Yes, I could maybe parse various different files and try to work out what is going on. But the kernel can easily *know* what is going on.

Making this work perfectly would require md dropping its write-access to member devices when the last write-access to the top level device goes. And the same for dm and loop and ... But just filesystems would go a long way to catching the common cases correctly.

Thanks,
NeilBrown
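What Kay's suggestion would look like from userspace can be sketched in C as below. This is a hedged illustration, not mdmon code: wait_for_mount_change() and mountinfo_line_is_ro() are hypothetical helpers, and the sketch assumes the usual Linux behaviour where a mount table change is reported as a priority event (POLLPRI) on a descriptor opened on /proc/self/mountinfo. As Neil points out, it only yields the per-mount ro/rw flag of a file system, not the state of the underlying block device.

```c
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>

/* Block until the mount table changes or timeout_ms expires.
 * fd is assumed to be open on /proc/self/mountinfo; after a wakeup the
 * caller has to re-read the file from the start. Returns poll()'s result. */
int wait_for_mount_change(int fd, int timeout_ms) {
    struct pollfd p = { .fd = fd, .events = POLLPRI };

    return poll(&p, 1, timeout_ms);
}

/* Inspect one mountinfo line. Returns 1 if mount_point is mounted
 * read-only, 0 if read-write, -1 if the line is malformed or describes
 * a different mount point. Field 5 is the mount point and field 6 the
 * per-mount options. */
int mountinfo_line_is_ro(const char *line, const char *mount_point) {
    char mp[256], opts[256];

    assert(line != NULL && mount_point != NULL);

    if (sscanf(line, "%*d %*d %*s %*s %255s %255s", mp, opts) != 2)
        return -1;
    if (strcmp(mp, mount_point) != 0)
        return -1;

    /* Compare whole comma-separated tokens so "relatime" is not
     * mistaken for "ro". */
    for (char *tok = strtok(opts, ","); tok != NULL; tok = strtok(NULL, ","))
        if (strcmp(tok, "ro") == 0)
            return 1;
    return 0;
}
```

A monitor loop would call wait_for_mount_change(), rewind and re-read the file, and run mountinfo_line_is_ro() on each line for the mount points it cares about.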
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Wed, Nov 2, 2011 at 7:33 AM, Kay Sievers kay.siev...@vrfy.org wrote:

> People who like to put their rootfs on a userspace managed raid device just get what they asked for. :)

Proper care and feeding of mdmon and userspace managed block devices / filesystems is a solvable problem. To me the :) runs the risk of implying we don't think we can get this right.

--
Dan
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Wed, Nov 2, 2011 at 20:31, Williams, Dan J dan.j.willi...@intel.com wrote:

> On Wed, Nov 2, 2011 at 11:49 AM, Kay Sievers kay.siev...@vrfy.org wrote:
>
> > On Wed, Nov 2, 2011 at 19:16, Williams, Dan J dan.j.willi...@intel.com wrote:
> >
> > > On Wed, Nov 2, 2011 at 7:33 AM, Kay Sievers kay.siev...@vrfy.org wrote:
> > >
> > > > People who like to put their rootfs on a userspace managed raid device just get what they asked for. :)
> > >
> > > Proper care and feeding of mdmon and userspace managed block devices / filesystems is a solvable problem. To me the :) runs the risk of implying we don't think we can get this right.
> >
> > It implied that I think it is totally insane what you guys try to accomplish. Managing the rootfs blockdev with tools contained in the rootfs itself is just crazy. No smiley this time.
>
> Yes, much clearer. Which is why the "never let mdmon run from an fs it is managing" approach is better than the current dance that was implemented to address the need to drop initramfs memory and get around the lack of a filesystem (like /run) that persisted from early boot.
>
> But we now run back into the problem of pinning initramfs memory. Does systemd already expect that the full initramfs sticks around to handle shutdown? If so then we have come full circle and don't really need the mdmon --takeover functionality versus just letting the initramfs-mdmon handle the entire lifetime of the rootfs blockdev.

It all depends on the initramfs implementation. Systemd is not involved here and has no knowledge about what was left behind, it just checks if there is a binary in /run provided by initramfs, and then it calls this binary instead of just bringing down the box itself.

So far only dracut implements this shutdown logic, which is just a go-back-to initramfs and disassemble/shut down everything that was assembled before the initramfs started the real init.

I wouldn't be surprised if we see more of these use cases from subsystems which put their rootfs on something that needs to be managed from userspace. The pinned memory for the tools in initramfs that stay around in tmpfs is probably the price to pay for these kinds of setups of the rootfs, when subsystems want to avoid adding the needed logic to the kernel to safely shut down the rootfs.

Kay
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering lenn...@poettering.net wrote:

> On Wed, 02.11.11 16:21, Kay Sievers (kay.siev...@vrfy.org) wrote:
>
> > On Wed, Nov 2, 2011 at 16:17, Lennart Poettering lenn...@poettering.net wrote:
> >
> > > Kernel threads we detect by checking whether /proc/$PID/cmdline is empty, hence I'd suggest we use the first char of argv[0][0] here, to detect whether something is a process to avoid killing. Question is which char to choose for that. I am tempted to use '@'.
> >
> > Maybe introduce a 'initramfs' cgroup and move the pids there?
>
> Well, in which hierarchy? I am a bit concerned about having other subsystems muck with the systemd cgroup hierarchy, before systemd has set it up.
>
> I can see some elegance in having all code from the initrd that remains running during boot in a cgroup of its own, but that's probably orthogonal to finding a way to recognize processes not to kill at shutdown. Why? Because there's stuff like Plymouth which also stays around from the initramfs, but actually is something we *do* want to kill on shutdown.

So how about rather than binaries self modifying themselves as "please don't kill me" with argv[][], shutdown can just avoid processes where /proc/$PID/cmdline starts with /run/initramfs? Then it's up to where the initramfs runs the binary to determine which instances it wants provenance over versus leaving to the init system.

For manually started arrays maybe we should arrange for an initramfs-started-mdmon to spawn new instances for user started containers, rather than using the local /sbin/mdmon. Then the mdadm -Ss initiated by /run/initramfs/shutdown can reliably stop any md device regardless of how it was started.

--
Dan
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Wed, 02.11.11 10:21, Williams, Dan J (dan.j.willi...@intel.com) wrote:

> > That means we'd:
> >
> > a) patch systemd to check whether argv[0][0] of a process is '@' and owned by root and exclude it from killing on shutdown.
> >
> > b) patch mdmon to set argv[0][0] of itself to '@' iff it is running from the initrd. If it is run from the main system it should not set that and just be shut down like any other service.
>
> Well, there are two cases to consider:
>
> 1/ user starts the array manually and stops it with mdadm -Ss (mdmon automatically shuts down). No need for a service, mdmon just follows the lifespan of the array.
>
> 2/ user starts the array but then expects it to be around until system shutdown
>
> In the latter case let the initramfs-mdmon take over all arrays with mdmon --takeover --all. But if all arrays may eventually be re-parented to an mdmon instance from /run, why not always start mdmon from there?

Well I am not sure how mdmon works, but let's say you booted up with an initrd lacking mdmon. Then, while the machine is up you set up some md device, and start mdmon for that. At this point it will be independent of the initrd. But that should be OK since at shutdown time it can be detached cleanly without any special magic, too, since mdmon is not stored on that md device.

So if you boot from md you need mdmon in the initrd. If you just use md outside of the root disk, then running mdmon as a normal service (i.e. one that is shut down like any other) should be perfectly fine.

This is why I suggested that only mdmon run from the initrd should set argv[0][0] = '@', because only that one needs the special handling that it cannot be terminated properly on shut down. The one running from the normal system can be treated like a standard systemd service.

Lennart

--
Lennart Poettering - Red Hat, Inc.
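The check proposed under a) can be sketched in C as follows. This is only an illustration of the idea as described in the thread, not the code that went into systemd: should_skip_kill() is a hypothetical helper that takes the raw bytes read from /proc/$PID/cmdline (NUL-separated arguments) plus the owning uid, and the caller is assumed to have read both from /proc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Decide whether the shutdown killing pass should leave a process alone.
 * cmdline/len: raw contents of /proc/<pid>/cmdline.
 * uid: owner of the process.
 * An empty cmdline is treated as a kernel thread; a root-owned process
 * whose argv[0][0] is '@' has asked not to be killed. */
bool should_skip_kill(const char *cmdline, size_t len, uid_t uid) {
    assert(cmdline != NULL || len == 0);

    if (len == 0)
        return true;                 /* kernel thread */

    /* Only honour the marker for root, so unprivileged processes
     * cannot exempt themselves from being killed. */
    return uid == 0 && cmdline[0] == '@';
}
```

On the mdmon side, b) would amount to overwriting the first byte of argv[0] with '@' when the daemon knows it was started from the initrd.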
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Wed, 02.11.11 15:18, Williams, Dan J (dan.j.willi...@intel.com) wrote:

> On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering lenn...@poettering.net wrote:
>
> > On Wed, 02.11.11 16:21, Kay Sievers (kay.siev...@vrfy.org) wrote:
> >
> > > On Wed, Nov 2, 2011 at 16:17, Lennart Poettering lenn...@poettering.net wrote:
> > >
> > > > Kernel threads we detect by checking whether /proc/$PID/cmdline is empty, hence I'd suggest we use the first char of argv[0][0] here, to detect whether something is a process to avoid killing. Question is which char to choose for that. I am tempted to use '@'.
> > >
> > > Maybe introduce a 'initramfs' cgroup and move the pids there?
> >
> > Well, in which hierarchy? I am a bit concerned about having other subsystems muck with the systemd cgroup hierarchy, before systemd has set it up.
> >
> > I can see some elegance in having all code from the initrd that remains running during boot in a cgroup of its own, but that's probably orthogonal to finding a way to recognize processes not to kill at shutdown. Why? Because there's stuff like Plymouth which also stays around from the initramfs, but actually is something we *do* want to kill on shutdown.
>
> So how about rather than binaries self modifying themselves as "please don't kill me" with argv[][], shutdown can just avoid processes where /proc/$PID/cmdline starts with /run/initramfs? Then it's up to where the initramfs runs the binary to determine which instances it wants provenance over versus leaving to the init system.

Nope, whether something should be excluded from killing during shutdown is orthogonal to being part of the initramfs. For example, Plymouth (i.e. the graphical boot splash thingy) is started from the initrd too, but we definitely want to kill it on shut down.

I am a bit concerned about checks against paths since the initrd might play some namespacing games and the paths might not appear to the main system the way you'd expect.

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Wed, Nov 2, 2011 at 4:39 PM, Lennart Poettering lenn...@poettering.net wrote:

> On Wed, 02.11.11 15:18, Williams, Dan J (dan.j.willi...@intel.com) wrote:
>
> > On Wed, Nov 2, 2011 at 8:29 AM, Lennart Poettering lenn...@poettering.net wrote:
> >
> > > On Wed, 02.11.11 16:21, Kay Sievers (kay.siev...@vrfy.org) wrote:
> > >
> > > > On Wed, Nov 2, 2011 at 16:17, Lennart Poettering lenn...@poettering.net wrote:
> > > >
> > > > > Kernel threads we detect by checking whether /proc/$PID/cmdline is empty, hence I'd suggest we use the first char of argv[0][0] here, to detect whether something is a process to avoid killing. Question is which char to choose for that. I am tempted to use '@'.
> > > >
> > > > Maybe introduce a 'initramfs' cgroup and move the pids there?
> > >
> > > Well, in which hierarchy? I am a bit concerned about having other subsystems muck with the systemd cgroup hierarchy, before systemd has set it up.
> > >
> > > I can see some elegance in having all code from the initrd that remains running during boot in a cgroup of its own, but that's probably orthogonal to finding a way to recognize processes not to kill at shutdown. Why? Because there's stuff like Plymouth which also stays around from the initramfs, but actually is something we *do* want to kill on shutdown.
> >
> > So how about rather than binaries self modifying themselves as "please don't kill me" with argv[][], shutdown can just avoid processes where /proc/$PID/cmdline starts with /run/initramfs? Then it's up to where the initramfs runs the binary to determine which instances it wants provenance over versus leaving to the init system.
>
> Nope, whether something should be excluded from killing during shutdown is orthogonal to being part of the initramfs. For example, Plymouth (i.e. the graphical boot splash thingy) is started from the initrd too, but we definitely want to kill it on shut down.

In the plymouth case the path would be /bin/plymouth, the initramfs would need to take special care to run mdmon from /run/initramfs to identify it as needing the initramfs environment to carry out its shutdown.

> I am a bit concerned about checks against paths since the initrd might play some namespacing games and the paths might not appear to the main system the way you'd expect.

The initramfs needs to be modified to either tell mdmon it is running from the initramfs or arrange for /proc/$MDMON_PID/cwd to appear to be from /run/initramfs. I only like the latter because it works with existing mdmon binaries, but we may need shutdown to always leave mdmon alone...

For user started md arrays the shutdown sequence still goes:

    killall -> umount

...and we would need to express:

    killall (but mdmon) -> umount -> mdadm -Ss (stops all not in use arrays)

So maybe we do the argv @ tagging in all cases and systemd never kills mdmon but arranges for all (stoppable) md devices to be stopped, then rely on /run/initramfs/shutdown to handle the rootfs blockdev.

--
Dan
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Mon, 31 Oct 2011 12:06:13 +0100 Lennart Poettering lenn...@poettering.net wrote: On Sun, 23.10.11 01:00, Dan Williams (dan.j.willi...@intel.com) wrote: Well, it would be nice if the md utils would offer something doing this without spawning multiple processes and killing them again. /me wonders why his raid5 resyncs every boot on Fedora 15 and has found this old thread. I'm tempted to: 1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora) This will not help you. We nowadays jump back into the initrd when we shut down, so that the initrd disassembles everything it assembled at boot time. This for the first time enables us to ensure that all layers of our stack are in a sane state (i.e. fully offline) when we shut down, regardless in which way you stack it. This sounds particularly elegant. Is there some part of the filesystem, that survives through the whole process - from before / is mounted until after it is unmounted? Presumably this would be /run if anything. mdmon must be running from the time that / becomes writable until after it becomes readonly. If we can have it from before it is mounted until after it is unmounted, that might be even better. (It is possible to start a new one which replaces the old one but if that was only used for version upgrades, that would be best). So if mdmon has a 'cwd' and all open files in /run (and the executable elsewhere in the same filesystem), could it easily survive the 'kill all processes before unmounting /' thing? However, just excluding mdmom from being killed will not make this work, simply because jumping into initrd only works sensibly if we can drop all references to all previous mounts which requires us to have only one process running at that time, and one process only. It always boils down to the same thing: mdmon must be something we can shutdown cleanly like every other process. Excluding it from that will just move the problem around, but not fix it. 
My ideal would be that you just ignore mdmon. After unmounting '/', you shutdown md arrays with mdadm -Ss and then mdmon will spontaneously disappear. 2/ arrange for mdadm --wait-clean --scan to be called after all filesytems have been mounted read only Won't help you really either, since we have to kill all processes before we jump into the initrd to fully disassemble mounts and storage. There'll always be this chicken and egg problem: we cannot disassmble all storage until all processes are gone and we are back in the initrd. But mdmon wants to stay running after we ...but a few things strike me. This does not seem to be what was being proposed above. Systemd does not treat dm devices like a service and takes care to shut them down explicitly (but in that case there is an api that it can call). Is it time for a libmd.so, so systemd can invoke the --wait-clean --scan process itself? Probably simpler to just SIGTERM mdmon and wait for it. We actually try to disassemble md already, i.e. we call the DM_DEV_REMOVE ioctl for all left-over devices. I am not really interested to link against libdm itself. :-) I get used to this .. people confusing md and dm, people confusing nfs-client with nfs-server, people confusing me with some other Mr Brown :-) NeilBrown signature.asc Description: PGP signature ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Sun, 23.10.11 01:00, Dan Williams (dan.j.willi...@intel.com) wrote:

>> Well, it would be nice if the md utils would offer something doing this without spawning multiple processes and killing them again.
>
> /me wonders why his raid5 resyncs every boot on Fedora 15 and has found this old thread.
>
> I'm tempted to:
> 1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora)

This will not help you. We nowadays jump back into the initrd when we shut down, so that the initrd disassembles everything it assembled at boot time. This for the first time enables us to ensure that all layers of our stack are in a sane state (i.e. fully offline) when we shut down, regardless of the way you stack them.

However, just excluding mdmon from being killed will not make this work, simply because jumping into the initrd only works sensibly if we can drop all references to all previous mounts, which requires us to have only one process running at that time, and one process only.

It always boils down to the same thing: mdmon must be something we can shut down cleanly like every other process. Excluding it from that will just move the problem around, but not fix it.

> 2/ arrange for mdadm --wait-clean --scan to be called after all filesystems have been mounted read only

Won't help you really either, since we have to kill all processes before we jump into the initrd to fully disassemble mounts and storage. There'll always be this chicken and egg problem: we cannot disassemble all storage until all processes are gone and we are back in the initrd. But mdmon wants to stay running after we

> ...but a few things strike me. This does not seem to be what was being proposed above. Systemd does not treat dm devices like a service and takes care to shut them down explicitly (but in that case there is an api that it can call). Is it time for a libmd.so, so systemd can invoke the --wait-clean --scan process itself? Probably simpler to just SIGTERM mdmon and wait for it.

We actually try to disassemble md already, i.e. we call the DM_DEV_REMOVE ioctl for all left-over devices. I am not really interested to link against libdm itself.

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Mon, 31.10.11 12:06, Lennart Poettering (lenn...@poettering.net) wrote:

> We actually try to disassemble md already, i.e. we call the DM_DEV_REMOVE ioctl for all left-over devices. I am not really interested to link against libdm itself.

Sorry, wasn't fully woken up yet and mixed up dm and md here. Ignore this sentence...

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Sunday, 23. October 2011 10:00:36 Dan Williams wrote:

> Is it time for a libmd.so, so systemd can invoke the --wait-clean --scan process itself? Probably simpler to just SIGTERM mdmon and wait for it.

The mdadm code makes good use of non-reentrant functions like ctime(), readdir() and others. Luckily systemd is single-threaded. If we provide a public interface, that would need fixing though.

Cheers,
Thomas
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Sun, 23 Oct 2011 01:00:36 -0700 Dan Williams dan.j.willi...@intel.com wrote: On Tue, Feb 8, 2011 at 9:28 AM, Lennart Poettering lenn...@poettering.net wrote: On Tue, 08.02.11 16:54, Andrey Borzenkov (arvidj...@mail.ru) wrote: a) mdmon is perfectly capable of restarting, it is already used to take over mdmon launched in initrd. The problem is to know when to restart - i.e. when respective libraries are changed. This is a job for package management in distribution. It is already employed for glibc, systemd and some others and can just as well be employed for mdmon. And this is totally unrelated to systemd :) Really, you are sying there is a synchronous way to make mdmon reexec itself? How does that work? I am not sure whether it qualifies as synchronous, but mdmon --takeover will kill any existing mdmon for this and start monitoring itself. I wonder if this is really fully synchronous, i.e. that a) there is no point in time where mdmon is not running during this restart and b) the mdmom --takeover command returns when the new daemon is fully up, and not right-away. Well, the root file systems cannot be unmounted, only remounted. So, is there a way to invoke mdmon so that it flushes all metadata changes to disk and immediately terminates then this should be all we need for a clean solution. We'd then shutdown the normal instances of mdmon down like any other daemon and simply invoke this metadata flushing command as part of late shutdown. Hmm ... it looks like you just need to start mdmon do mdadm --wait-clean After this you can kill mdmon again (assuming decide is no more in use). Well, it would be nice if the md utils would offer something doing this without spawning multiple processes and killing them again. /me wonders why his raid5 resyncs every boot on Fedora 15 and has found this old thread. 
I'm tempted to: 1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora) 2/ arrange for mdadm --wait-clean --scan to be called after all filesytems have been mounted read only ...but a few things strike me. This does not seem to be what was being proposed above. Systemd does not treat dm devices like a service and takes care to shut them down explicitly (but in that case there is an api that it can call). Is it time for a libmd.so, so systemd can invoke the --wait-clean --scan process itself? Probably simpler to just SIGTERM mdmon and wait for it. -- Dan Hi Dan, could you please explain in a bit more detail exactly what you think it is that is going wrong for you? I don't think it is anything like the original problem, as I don't think you are starting array manually. I think your problem is that 'mdmon' is being killed too early at shutdown. Clear we need to get whatever-kills-user-processes to skip mdmon - maybe by writing the pid to some magic file that 'ignore_proc' already knows about? Ultimately we probably want to get udev to start mdmon for us and have mdadm notice and not start it itself. We also need to get udev to notice arrays that are being reshaped and to start the mdadm which montiors the reshape so that mdadm doesn't have to fork it itself. That should fix the original problem, but I don't think it addresses your problem at all. I don't have a Fedora install so I cannot hunt around to see what is happening. I don't like the idea for a 'libmd.so' at all - certainly not until the problem is properly understood and other solutions (like running scripts) prove ineffective. NeilBrown signature.asc Description: PGP signature ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, Feb 8, 2011 at 9:28 AM, Lennart Poettering lenn...@poettering.net wrote: On Tue, 08.02.11 16:54, Andrey Borzenkov (arvidj...@mail.ru) wrote: a) mdmon is perfectly capable of restarting, it is already used to take over mdmon launched in initrd. The problem is to know when to restart - i.e. when respective libraries are changed. This is a job for package management in distribution. It is already employed for glibc, systemd and some others and can just as well be employed for mdmon. And this is totally unrelated to systemd :) Really, you are sying there is a synchronous way to make mdmon reexec itself? How does that work? I am not sure whether it qualifies as synchronous, but mdmon --takeover will kill any existing mdmon for this and start monitoring itself. I wonder if this is really fully synchronous, i.e. that a) there is no point in time where mdmon is not running during this restart and b) the mdmom --takeover command returns when the new daemon is fully up, and not right-away. Well, the root file systems cannot be unmounted, only remounted. So, is there a way to invoke mdmon so that it flushes all metadata changes to disk and immediately terminates then this should be all we need for a clean solution. We'd then shutdown the normal instances of mdmon down like any other daemon and simply invoke this metadata flushing command as part of late shutdown. Hmm ... it looks like you just need to start mdmon do mdadm --wait-clean After this you can kill mdmon again (assuming decide is no more in use). Well, it would be nice if the md utils would offer something doing this without spawning multiple processes and killing them again. /me wonders why his raid5 resyncs every boot on Fedora 15 and has found this old thread. I'm tempted to: 1/ teach ignore_proc() to scan for pid files in /dev/md/ (MDMON_DIR on Fedora) 2/ arrange for mdadm --wait-clean --scan to be called after all filesytems have been mounted read only ...but a few things strike me. 
This does not seem to be what was being proposed above. Systemd does not treat dm devices like a service and takes care to shut them down explicitly (but in that case there is an api that it can call). Is it time for a libmd.so, so systemd can invoke the --wait-clean --scan process itself? Probably simpler to just SIGTERM mdmon and wait for it. -- Dan ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, 08.02.11 12:07, Lennart Poettering (lenn...@poettering.net) wrote:

>> At this point we know it is container, know that it has external metadata and know that we need external metadata handler (mdmon). But it is too late for systemd.
>
> Kay, do you know why this change event is used here? Any chance we can get rid of it?

So, it seems that the change event does make some sense here. I have now added a new property to systemd: if you set SYSTEMD_READY=0 on a udev device then systemd will consider it unplugged even if it shows up in the udev tree. If this property is not set for a device, or is set to 1, we will consider the device plugged.

To make this md stuff compatible with systemd we hence just need to set SYSTEMD_READY=0 during the "add" event and drop it when the device is fully set up.

Andrey, since you are playing around with this, do you happen to know which attribute we should check to set SYSTEMD_READY=0 properly? It would be cool if we could come up with a default rule for inclusion in our systemd rules file that will ensure the device only shows up when it is ready.

Lennart

--
Lennart Poettering - Red Hat, Inc.
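The SYSTEMD_READY semantics described in this message can be written down as a small predicate. This is a sketch of the rule as stated in the mail, not systemd's implementation; the function name is made up, and treating any value other than "0" as "plugged" is an assumption, since the mail only specifies the unset, 0 and 1 cases.

```python
from typing import Mapping


def is_plugged(udev_props: Mapping[str, str]) -> bool:
    """Decide whether a device that is present in the udev tree counts as plugged.

    Hypothetical helper: unset or "1" means plugged, "0" means unplugged.
    Other values are treated as plugged here, which is an assumption.
    """
    return udev_props.get("SYSTEMD_READY", "1") != "0"
```

With this rule, an md device tagged SYSTEMD_READY=0 on its first event stays unplugged until a later event drops the property, which is the point where dependent units could be started.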
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Fri, 04.02.11 22:55, Andrey Borzenkov (arvidj...@mail.ru) wrote:

>>> That's right, but the names are not known in advance and can change between reboots. This means such units have to be generated dynamically, exist until reboot (ramfs?) and be removed when array is destroyed. Not sure it is really manageable.
>>
>> Hmm? It should be sufficient to just write the service template properly (mdmon@.service) and then instantiate it when needed with systemctl start mdmon@xyz.service or something equivalent. It's a matter of issuing a single dbus call.
>>
>>> And which instance should generate them? mdadm?
>>
>> i think it is much nicer to spawn the necessary mdadm service instance from a udev rule,
>
> Yes, this can be done relatively easily; as proof of concept:
>
>     SUBSYSTEM!="block", GOTO="systemd_md_end"
>     ACTION!="change", GOTO="systemd_md_end"
>     KERNEL!="md*", GOTO="systemd_md_end"
>     ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
>     LABEL="systemd_md_end"

Nah, it's much better to simply use the SYSTEMD_WANTS var on the device. Something like this:

    , ENV{SYSTEMD_WANTS}="mdmon@%k.service"

That way the device unit will simply have a wants dep on the service unit, and this is perfectly discoverable.

> Setting SYSTEMD_WANTS would be more elegant solution, but it does not work with current systemd implementation. It is capable of starting requested units only on add event (effectively the very first time device becomes plugged), while mdmon must be started on change event, as only then we know whether mdmon is required at all.

Oha, so you are actually aware of SYSTEMD_WANTS. Hmm. I need to think about this. Why does md employ the change event? Is this really necessary? Smells a bit foul.

> Running mdmon via systemd in this way opens up interesting possibility. E.g. service could be declared immortal and be exempt from usual shutdown sequence ... or is it possible to do already?

A service needs to conflict with shutdown.target to be shut down when we go down normally. If your service does not conflict with shutdown.target then it will stay around and be killed only after systemd is gone and PID1 is systemd-shutdown, which then kills all processes remaining (independent of any idea of service) and then unmounts all file systems. Normally all services conflict with shutdown.target implicitly, which you can turn off by setting DefaultDependencies=.

> Actually it can be implemented even without mdadm patches; apparently it is possible to suppress normal starting of mdmon by setting MDADM_NO_MDMON=1

At this point mdmon is simply broken: if glibc or mdmon itself (or any lib it is using) is upgraded, then mdmon will keep referencing the old .so or binary as long as it is running. This means that the fs these files are on cannot be remounted r/o. However mdmon insists on being shut down only after all fs got remounted ro. So you have a cyclic ordering loop here: mdmon wants to be shut down after the remount, but we need to shut it down before the remount.

This is unfixable unless a) mdmon learns reexecution of itself without losing state (like most init systems do), or b) mdmon would stop insisting on being shut down only after the remount.

In my eyes b) is very much preferable: it should be possible to shut down mdmon like any other service. And if some md related code then still needs to be run on late shutdown, this should be done from a new process. I would be willing to add some hooks for this, so that we can execute arbitrary drop-in processes as part of the final shutdown loop.

Lennart

--
Lennart Poettering - Red Hat, Inc.
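The cyclic ordering loop described in this message, and the way option b) breaks it, can be illustrated with a toy dependency graph. This is only a sketch under assumptions: the step names ("stop-mdmon", "remount-ro", "flush-md-metadata") are invented for the example, and nothing here reflects how systemd actually orders its shutdown jobs.

```python
from graphlib import CycleError, TopologicalSorter


def shutdown_order(must_run_before: dict[str, set[str]]) -> list[str]:
    """Return one valid order of shutdown steps.

    must_run_before[step] is the set of steps that have to finish first.
    Raises CycleError if the constraints contradict each other.
    """
    return list(TopologicalSorter(must_run_before).static_order())


def has_ordering_cycle(must_run_before: dict[str, set[str]]) -> bool:
    """True if no valid shutdown order exists."""
    try:
        shutdown_order(must_run_before)
    except CycleError:
        return True
    return False


# The situation from the mail: mdmon wants to stop after the r/o remount,
# but the remount needs mdmon (which pins old binaries) to be gone first.
conflicting = {
    "stop-mdmon": {"remount-ro"},
    "remount-ro": {"stop-mdmon"},
}

# Option b): stop mdmon like any other service, then run a separate,
# hypothetical late hook that flushes md metadata after the remount.
option_b = {
    "remount-ro": {"stop-mdmon"},
    "flush-md-metadata": {"remount-ro"},
}
```

The first graph has no valid order at all, which is the "unfixable" case; the second one linearizes cleanly because the late md work is done by a fresh process rather than by the long-running daemon.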
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, Feb 8, 2011 at 12:48 PM, Lennart Poettering lenn...@poettering.net wrote: On Fri, 04.02.11 22:55, Andrey Borzenkov (arvidj...@mail.ru) wrote: That's right, but the names are not known in advance and can change between reboots. This means such units have to be generated dynamically, exist until reboot (ramfs?) and be removed when array is destroyed. Not sure it is really manageable. Hmm? It should be sufficient to just write the service template properly (mdmon@.service) and then instantiate it when needed with systemctl start mdmon@xyz.service or something equivalent. itMs a matter of issuing a single dbus call. And which instance should generate them? mdadm? i think it is much nicer to spawn the necessary mdadm service instance from a udev rule, Yes, this can be done relatively easily; as proof of concept: SUBSYSTEM!=block, GOTO=systemd_md_end ACTION!=change, GOTO=systemd_md_end KERNEL!=md*, GOTO=systemd_md_end ATTR{md/metadata_version}==external:[A-Za-z]*, RUN+=/bin/systemctl start mdmon@%k.service LABEL=systemd_md_end Nah, it's much better to simply use the SYSTEMD_WANTS var on the device. Something like this: , ENV{SYSTEMD_WANTS}=mdmon@%k.service That way the device unit will simply have a wants dep on the service unit, and this is prefectly discoverable. Setting SYSTEMD_WANTS would be more elegant solution, but it does not work with current systemd implementation. It is capable of starting requested units only on add event (effectively the very first time device becomes plugged), while mdmon must be started on change event, as only then we know whether mdmon is required at all. Oha, so you are actually aware of SYSTEMD_WANTS. Hmm. I need to think about this. Why does md employ the change event? Is this really necessary, smells a bit foul. 
I am probably the wrong one to ask, but here is what happens when array is started (from udev perspective) UDEV [1297507039.109828] add /devices/virtual/block/md127 (block) UDEV_LOG=3 ACTION=add DEVPATH=/devices/virtual/block/md127 SUBSYSTEM=block DEVNAME=/dev/md127 DEVTYPE=disk SEQNUM=1742 UDISKS_PRESENTATION_NOPOLICY=1 MAJOR=9 MINOR=127 TAGS=:systemd: After this event device goes plugged and SYSTEMD_WANTS (if any) are triggered. But at this point we have zero information about array to decide anything. UDEV [1297507039.211940] change /devices/virtual/block/md127 (block) UDEV_LOG=3 ACTION=change DEVPATH=/devices/virtual/block/md127 SUBSYSTEM=block DEVNAME=/dev/md127 DEVTYPE=disk SEQNUM=1743 MD_LEVEL=container MD_DEVICES=2 MD_METADATA=ddf MD_UUID=f8362f39:0436b20f:cf338104:afec436e MD_DEVNAME=ddf0 UDISKS_PRESENTATION_NOPOLICY=1 MAJOR=9 MINOR=127 DEVLINKS=/dev/disk/by-id/md-uuid-f8362f39:0436b20f:cf338104:afec436e /dev/md/ddf0 TAGS=:systemd: At this point we know it is container, know that it has external metadata and know that we need external metadata handler (mdmon). But it is too late for systemd. Actually it can be implemented even without mdadm patches; apparently it is possible to suppress normal starting of mdmon by setting MDADM_NO_MDMON=1 A this point mdmon is simply broken: if glibc or mdmon itself (or any lib it is using) is upgraded, then mdmon will keep referencing the old .so or binary as long as it is running. This means that the fs these files are on cannot be remounted r/o. However mdmon insists on being shutdown only after all fs got remounted ro. So you have a cyclic ordering loop here: mdmon wants to be shut down after the remount, but we need to shut it down before the remount. Ehh ... a) mdmon is perfectly capable of restarting, it is already used to take over mdmon launched in initrd. The problem is to know when to restart - i.e. when respective libraries are changed. This is a job for package management in distribution. 
It is already employed for glibc, systemd and some others and can just as well be employed for mdmon. And this is totally unrelated to systemd :) b) having binary launched off some fs should not prevent this fs to be remountd ro - binaries are not opened rw This is unfixable unless a) mdmon learns reexecution of itself without losing state (like most init systems so), or b) mdmon would stop insisting on being shutdown only after the remount. As far as I can tell, both is true today; but remounting is not enough, unfortunately. In my eyes b) is very much preferebale: It should be possible to shut down mdmon like any other service. And if then some md related code still needs to be run on late shutdown this should be done from a new process. I would be willing to add some hooks for this, so that we can execute arbitrary drop-in processes as part of the final shutdown loop. mdmon is needed to ensure metadata were correctly updated. So it needs to exist as long as metadata *may* be updated. For practical purposes it means - until file system is unmounted and flushed to disks. I am not sure that remounting ro stops
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, Feb 8, 2011 at 2:07 PM, Lennart Poettering lenn...@poettering.net wrote: On Tue, 08.02.11 13:52, Andrey Borzenkov (arvidj...@mail.ru) wrote: I am probably the wrong one to ask, but here is what happens when array is started (from udev perspective) [...] After this event device goes plugged and SYSTEMD_WANTS (if any) are triggered. But at this point we have zero information about array to decide anything. [...] At this point we know it is container, know that it has external metadata and know that we need external metadata handler (mdmon). But it is too late for systemd. Kay, do you know why this change event is used here? Any chance we can get rid of it? Actually it can be implemented even without mdadm patches; apparently it is possible to suppress normal starting of mdmon by setting MDADM_NO_MDMON=1 A this point mdmon is simply broken: if glibc or mdmon itself (or any lib it is using) is upgraded, then mdmon will keep referencing the old .so or binary as long as it is running. This means that the fs these files are on cannot be remounted r/o. However mdmon insists on being shutdown only after all fs got remounted ro. So you have a cyclic ordering loop here: mdmon wants to be shut down after the remount, but we need to shut it down before the remount. Ehh ... a) mdmon is perfectly capable of restarting, it is already used to take over mdmon launched in initrd. The problem is to know when to restart - i.e. when respective libraries are changed. This is a job for package management in distribution. It is already employed for glibc, systemd and some others and can just as well be employed for mdmon. And this is totally unrelated to systemd :) Really, you are sying there is a synchronous way to make mdmon reexec itself? How does that work? I am not sure whether it qualifies as synchronous, but mdmon --takeover will kill any existing mdmon for this and start monitoring itself. 
b) having binary launched off some fs should not prevent this fs to be remountd ro - binaries are not opened rw If you run a binary and then the package manager replaces it then the running instance will still refer to the old copy and this will have the effect that the file isn't actually deleted until the proces exits/execs. And because that is the way it is the kernel will refuse unmounting of the fs until you terminated/reexeced your process. This is unfixable unless a) mdmon learns reexecution of itself without losing state (like most init systems so), or b) mdmon would stop insisting on being shutdown only after the remount. As far as I can tell, both is true today; but remounting is not enough, unfortunately. So, you are saying we can shut down mdmon without ill effects early? At least that's what I see. You can shutdown mdmon and continue to work with file system, even if it is mounted rw. Under some conditions mount will hang; i.e. start array kill mdmon try to mount mount will hang. If you start mdmon, it is mounted. But if you now umount kill mdmon mount it is mounted just fine. In my eyes b) is very much preferebale: It should be possible to shut down mdmon like any other service. And if then some md related code still needs to be run on late shutdown this should be done from a new process. I would be willing to add some hooks for this, so that we can execute arbitrary drop-in processes as part of the final shutdown loop. mdmon is needed to ensure metadata were correctly updated. So it needs to exist as long as metadata *may* be updated. For practical purposes it means - until file system is unmounted and flushed to disks. I am not sure that remounting ro stops all activity (at least, mounting ro definitely *writes* to device using some filesystems). Well, the root file systems cannot be unmounted, only remounted. 
So, is there a way to invoke mdmon so that it flushes all metadata changes to disk and immediately terminates then this should be all we need for a clean solution. We'd then shutdown the normal instances of mdmon down like any other daemon and simply invoke this metadata flushing command as part of late shutdown. Hmm ... it looks like you just need to start mdmon do mdadm --wait-clean After this you can kill mdmon again (assuming decide is no more in use). ___ systemd-devel mailing list systemd-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/systemd-devel
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, 08.02.11 16:54, Andrey Borzenkov (arvidj...@mail.ru) wrote:

>>> a) mdmon is perfectly capable of restarting, it is already used to take over mdmon launched in initrd. The problem is to know when to restart - i.e. when respective libraries are changed. This is a job for package management in distribution. It is already employed for glibc, systemd and some others and can just as well be employed for mdmon. And this is totally unrelated to systemd :)
>>
>> Really, you are saying there is a synchronous way to make mdmon reexec itself? How does that work?
>
> I am not sure whether it qualifies as synchronous, but mdmon --takeover will kill any existing mdmon for this and start monitoring itself.

I wonder if this is really fully synchronous, i.e. that a) there is no point in time where mdmon is not running during this restart and b) the mdmon --takeover command returns when the new daemon is fully up, and not right away.

>> Well, the root file systems cannot be unmounted, only remounted. So, if there is a way to invoke mdmon so that it flushes all metadata changes to disk and immediately terminates, then this should be all we need for a clean solution. We'd then shut down the normal instances of mdmon like any other daemon and simply invoke this metadata flushing command as part of late shutdown.
>
> Hmm ... it looks like you just need to
>
>   - start mdmon
>   - do mdadm --wait-clean
>
> After this you can kill mdmon again (assuming device is no more in use).

Well, it would be nice if the md utils would offer something doing this without spawning multiple processes and killing them again.

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, Jan 25, 2011 at 7:28 AM, Lennart Poettering lenn...@poettering.net wrote:
> On Tue, 25.01.11 06:58, Andrey Borzenkov (arvidj...@mail.ru) wrote:
>
>>> systemd supports instantiated services, for example to deal with the gettys (e.g. getty@tty5.service). It should be trivial to use the same for mdmon (e.g. mdmon@md3.service).
>>
>> That's right, but the names are not known in advance and can change between reboots. This means such units have to be generated dynamically, exist until reboot (ramfs?) and be removed when array is destroyed. Not sure it is really manageable.
>
> Hmm? It should be sufficient to just write the service template properly (mdmon@.service) and then instantiate it when needed with systemctl start mdmon@xyz.service or something equivalent. It's a matter of issuing a single dbus call.
>
>> And which instance should generate them? mdadm?
>
> i think it is much nicer to spawn the necessary mdadm service instance from a udev rule,

Yes, this can be done relatively easily; as proof of concept:

    SUBSYSTEM!="block", GOTO="systemd_md_end"
    ACTION!="change", GOTO="systemd_md_end"
    KERNEL!="md*", GOTO="systemd_md_end"
    ATTR{md/metadata_version}=="external:[A-Za-z]*", RUN+="/bin/systemctl start mdmon@%k.service"
    LABEL="systemd_md_end"

where mdmon@.service is

    [Unit]
    Description=mdmon service
    BindTo=dev-%i.device
    After=dev-%i.device

    [Service]
    Type=forking
    PIDFile=/dev/.mdadm/%i.pid
    ExecStart=/sbin/mdmon --takeover %i

With the result

    [root@localhost ~]# systemctl status mdmon@md127.service
    mdmon@md127.service - mdmon service
              Loaded: loaded (/etc/systemd/system/mdmon@.service)
              Active: active (running) since Tue, 08 Feb 2011 09:43:30 -0500; 5min ago
             Process: 1467 ExecStart=/sbin/mdmon --takeover %i (code=exited, status=0/SUCCESS)
            Main PID: 1468 (mdmon)
              CGroup: name=systemd:/system/mdmon@.service/md127
                      └ 1468 /sbin/mdmon --takeover md127

Setting SYSTEMD_WANTS would be a more elegant solution, but it does not work with the current systemd implementation. It is capable of starting requested units only on the add event (effectively the very first time the device becomes plugged), while mdmon must be started on the change event, as only then do we know whether mdmon is required at all.

Running mdmon via systemd in this way opens up an interesting possibility. E.g. the service could be declared immortal and be exempt from the usual shutdown sequence ... or is it possible to do already?

Actually it can be implemented even without mdadm patches; apparently it is possible to suppress normal starting of mdmon by setting MDADM_NO_MDMON=1

> but you could even run it from mdadm by invoking one dbus call from it.

It may turn out to be necessary still. If a container needs mdmon, the arrays it contains won't become read-write until mdmon is started. If mdmon is started asynchronously by udev, there is a window where someone may try to use the array before it is rw. As a trivial example, a mount unit which depends on the md device unit.

I do not think the mdadm maintainer will be happy to add a D-Bus dependency to something that is likely to be included in the initrd though :) But maybe we could simply try execl(/bin/systemctl, start, ...) before the current execl(/sbin/mdmon,... )?
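The match logic of the proof-of-concept udev rule in this message can be restated as a small predicate, which makes it easier to see which events would start mdmon. This is a sketch only: udev evaluates the rule itself, the function and parameter names are invented here, and Python's fnmatch is used as a stand-in for udev's shell-style pattern matching. The "external:/md127/0" value in the usage note is an assumed example of a non-container value, not something taken from the thread.

```python
from fnmatch import fnmatchcase


def rule_starts_mdmon(subsystem: str, action: str, kernel: str,
                      metadata_version: str) -> bool:
    """Mirror the GOTO chain of the rule: every test must pass to reach RUN+=.

    Hypothetical helper for illustration; the arguments correspond to
    SUBSYSTEM, ACTION, KERNEL and ATTR{md/metadata_version}.
    """
    return (
        subsystem == "block"
        and action == "change"
        and fnmatchcase(kernel, "md*")
        and fnmatchcase(metadata_version, "external:[A-Za-z]*")
    )


def unit_for(kernel: str) -> str:
    """Unit name the rule would start; %k expands to the kernel device name."""
    return f"mdmon@{kernel}.service"
```

For the container shown elsewhere in the thread (md127 with ddf metadata) the predicate is true on the change event and false on the add event. A value such as "external:/md127/0" would not match, because the character after the colon has to be a letter.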
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Sat, 22.01.11 20:55, Andrey Borzenkov (arvidj...@mail.ru) wrote:

> > > mdmon does not belong to user. User is not even aware that it is started. And it is likely not the last case. So systemd does need some framework which can move such processes out of user session. It probably needs some sd_daemon API to notify systemd that it is system level task even if it was started as result of user interaction.

> > Well, it is started by user, so it belongs to user. And systemd has an API to start system-level task as a result of user interaction: it is called systemctl start mdmon.service.

> mdmon is not a singleton - it is started for every array that needs it (not each array needs it). Can you pass extra parameters that identify object mdmon should monitor via systemctl?

systemd supports instantiated services, for example to deal with the gettys (e.g. getty@tty5.service). It should be trivial to use the same for mdmon (e.g. mdmon@md3.service).

Lennart

--
Lennart Poettering - Red Hat, Inc.
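The naming scheme behind instantiated services can be illustrated with a small helper. This is a sketch for illustration only; the function names are made up here, and it does not handle the escaping systemd applies to instance names that contain special characters.

```python
def instantiate(template, instance):
    """Turn a template unit name like 'mdmon@.service' into 'mdmon@md3.service'."""
    prefix, sep, suffix = template.partition("@")
    if not sep or not suffix.startswith("."):
        raise ValueError("not a template unit name: %r" % template)
    return "%s@%s%s" % (prefix, instance, suffix)


def instance_of(unit):
    """Return the instance part of 'name@instance.type', or None if there is none."""
    prefix, sep, rest = unit.partition("@")
    if not sep:
        return None
    instance, dot, _unit_type = rest.rpartition(".")
    return instance if dot and instance else None
```

With these, `instantiate("getty@.service", "tty5")` yields the getty example from the message, and the same template mechanism yields one mdmon unit per array.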
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Tue, Jan 25, 2011 at 6:44 AM, Lennart Poettering lenn...@poettering.net wrote:
> On Sat, 22.01.11 20:55, Andrey Borzenkov (arvidj...@mail.ru) wrote:

> > > > mdmon does not belong to user. User is not even aware that it is started. And it is likely not the last case. So systemd does need some framework which can move such processes out of user session. It probably needs some sd_daemon API to notify systemd that it is system level task even if it was started as result of user interaction.

> > > Well, it is started by user, so it belongs to user. And systemd has an API to start system-level task as a result of user interaction: it is called systemctl start mdmon.service.

> > mdmon is not a singleton - it is started for every array that needs it (not each array needs it). Can you pass extra parameters that identify object mdmon should monitor via systemctl?

> systemd supports instantiated services, for example to deal with the gettys (e.g. getty@tty5.service). It should be trivial to use the same for mdmon (e.g. mdmon@md3.service).

That's right, but the names are not known in advance and can change between reboots. This means such units have to be generated dynamically, exist until reboot (ramfs?) and be removed when array is destroyed. Not sure it is really manageable.

And which instance should generate them? mdadm?
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
2010/12/4 Tomasz Torcz to...@pipebreaker.pl:
> On Sat, Dec 04, 2010 at 03:08:05PM +0300, Andrey Borzenkov wrote:

> > > (/etc/pam.d/system-auth), which automatically creates cgroups by login session, which in turn gets killed when the user has completely logged out. That is why your mdadm gets terminated, too.

> > Sure.

> > > You can avoid that by adding create-session=0 to it, like:
> > >
> > >     # grep pam_systemd /etc/pam.d/system-auth
> > >     session optional pam_systemd.so create-session=0

> > But I do want user session to be created; and systemd was specifically extended to properly terminate user sessions on shutdown. It is just that mdmon does not belong to user session at all.

> Man page talks about kill-session= and kill-user= parameters, which may be useful to you.

> > > Which is the recommended way if you want processes (created by the user) to live on even when this user has fully logged out.

> > mdmon does not belong to user. User is not even aware that it is started. And it is likely not the last case. So systemd does need some framework which can move such processes out of user session. It probably needs some sd_daemon API to notify systemd that it is system level task even if it was started as result of user interaction.

> Well, it is started by user, so it belongs to user. And systemd has an API to start system-level task as a result of user interaction: it is called systemctl start mdmon.service.

mdmon is not a singleton - it is started for every array that needs it (not every array needs it). Can you pass extra parameters via systemctl that identify the object mdmon should monitor?

Using udev to listen for the new-array event and starting mdmon from there looks promising. I do not know whether it is possible to identify such arrays at this point, though, nor do I have hardware to test.
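Whether such arrays can be identified from a udev event is answered elsewhere in this thread by a proof-of-concept rule that matches on `md/metadata_version`. The decision that rule encodes can be sketched as a plain predicate. The function name and argument names are invented for this illustration, and the metadata value used in the usage note is only an example.

```python
from fnmatch import fnmatchcase


def needs_mdmon(subsystem, action, kernel, metadata_version):
    """Return True when an event would pass the thread's proof-of-concept rule.

    Mirrors the rule's four conditions: block subsystem, 'change' action,
    an md* kernel name, and a metadata_version of the form 'external:'
    followed by a letter.
    """
    if subsystem != "block":
        return False
    if action != "change":
        return False
    if not fnmatchcase(kernel, "md*"):
        return False
    # A missing attribute (None) is treated as "no external metadata".
    return fnmatchcase(metadata_version or "", "external:[A-Za-z]*")
```

For instance, `needs_mdmon("block", "change", "md127", "external:imsm")` is true, while the same device on an `add` event, or with native metadata such as `1.2`, is not.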
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Sat, 04.12.10 11:41, Andrey Borzenkov (arvidj...@gmail.com) wrote:

> If user starts array manually (mdadm -A -s as example) from within user session and array needs mdmon, mdmon becomes part of user session control group:

Are you suggesting that mdadm forks off mdmon from within the user session? This is horribly ugly and broken and they shouldn't do that.

> ├ user
> │ └ root
> │   └ 1
> │     ├ 1916 login -- root
> │     ├ 1930 -bash
> │     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
> │     └ 2062 mdmon md127
>
> It is then killed by systemd during shutdown as part of user session.

Well, only if you enable that the user session is completely killed on logout, which we currently don't do by default.

I wonder if it would make sense to add an option which kills user sessions on logout only for uid != 0. This might help here, but only half-way, since sudo would still break. But anyway, I'll add this to the todo list.

> It results in dirty array on next boot.

Hmm, that shouldn't happen.

> Is there any magic that allows daemon to be exempted from killing?

Well, I have been discussing this with Kay and we'll most likely add something like DontKillOnShutdown=yes or so, which if added to a unit file will exempt it from killing during the normal service shutdown phase, and the first killing spree (but not the second, post-umount killing spree). But that of course would require mdmon to be started like any other daemon, and not forked off mdadm.

That should mostly fix the problem, but then again I do believe that the whole idea of mdmon is just borked, since it will necessarily pin pages from the root fs into memory, which will create all kinds of problems, for example after upgrades (i.e. mdmon maps libc into memory, libc gets updated, the old libc is deleted, and that deletion cannot be written to disk as long as mdmon stays running and pins it, which will disallow the ultimate unmounting/remounting of the fs).

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Sat, 04.12.10 15:08, Andrey Borzenkov (arvidj...@gmail.com) wrote:

> > > It is then killed by systemd during shutdown as part of user session. It results in dirty array on next boot. Is there any magic that allows daemon to be exempted from killing?

> > While your raid should absolutely not be corrupted on next reboot when mdmon receives a SIGTERM,

> This won't be corrupted but it will initiate a rebuild. I have reports that such a rebuild may take hours, costing performance and loss of redundancy.

Well, eventually we need to be able to kill mdmon. Otherwise we might not be able to remount the root dir r/o. How exactly is mdmon supposed to behave on shutdown?

Lennart

--
Lennart Poettering - Red Hat, Inc.
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Fri, 7 Jan 2011 01:38:27 +0100 Lennart Poettering lenn...@poettering.net wrote:
> On Sat, 04.12.10 11:41, Andrey Borzenkov (arvidj...@gmail.com) wrote:

> > If user starts array manually (mdadm -A -s as example) from within user session and array needs mdmon, mdmon becomes part of user session control group:

> Are you suggesting that mdadm forks off mdmon from within the user session? This is horribly ugly and broken and they shouldn't do that.

What alternative would you suggest?

A daemon needs to be running while certain md arrays are running and writable.

NeilBrown
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Fri, 7 Jan 2011 02:09:32 +0100 Michael Biebl mbi...@gmail.com wrote:
> 2011/1/7 Lennart Poettering lenn...@poettering.net:

> > Well, I have been discussing this with Kay and we'll most likely add something like DontKillOnShutdown=yes or so, which if added to a unit

> Make that KillOnShutdown=no, please.

Agreed :)

That reminds me of hal-disable-polling --enable-polling ( http://ur1.ca/2rmis )

--
With respect,
Roman
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Fri, 07.01.11 12:16, NeilBrown (ne...@suse.de) wrote:
> On Fri, 7 Jan 2011 01:38:27 +0100 Lennart Poettering lenn...@poettering.net wrote:
> > On Sat, 04.12.10 11:41, Andrey Borzenkov (arvidj...@gmail.com) wrote:

> > > If user starts array manually (mdadm -A -s as example) from within user session and array needs mdmon, mdmon becomes part of user session control group:

> > Are you suggesting that mdadm forks off mdmon from within the user session? This is horribly ugly and broken and they shouldn't do that.

> What alternative would you suggest?

Start it as a normal service like any other.

But if you fork off the daemon from the user session then the daemon will run in a very broken context: the resource limits of the user apply, the audit trail will point to the user (i.e. /proc/self/loginuid), the cgroup will be of the user, and the daemon cannot be supervised like every other daemon. Also, the daemon will inherit all the other process properties from the user, which is almost definitely wrong, i.e. the env block, the sig mask and so on. Gazillions of small little properties. Of course, a big bunch of them you can reset in your code, but that's a race you cannot win: the kernel adds new process properties all the time, and you'd have to reset them manually.

It is really essential that daemons are started from a clean process environment, and are detached from the user session. SysV kinda provides that, for everything started on boot and in a limited way for stuff started via /sbin/service. systemd provides that too, and much more correctly. But just forking off things just like that is not a good solution.

A conceivable, relatively simple solution in a systemd world is to pull in the mdmon service from the udev device. The udev device would do all the necessary matching to figure out whether mdmon is needed or not. If you care about non-systemd environments something like this of course becomes a lot more complex.

> A daemon needs to be running while certain md arrays are running and writable.

Well, but auto-spawning it from the user session is not really a usable solution.

Lennart

--
Lennart Poettering - Red Hat, Inc.
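To make the list of inherited properties concrete, here is a small sketch that reads a few of them for the current process on Linux. It is only an illustration of what a daemon forked from a login shell would carry along; the function name and the choice of properties are assumptions for this example, and the list is far from complete.

```python
import os
import resource
import signal


def inherited_context():
    """Collect a few per-process properties a forked daemon would inherit."""
    try:
        with open("/proc/self/loginuid") as f:
            loginuid = f.read().strip()
    except OSError:
        loginuid = None  # no audit support, or /proc not available
    try:
        with open("/proc/self/cgroup") as f:
            cgroups = f.read().splitlines()
    except OSError:
        cgroups = []
    # Blocking an empty set changes nothing and returns the current mask.
    blocked = signal.pthread_sigmask(signal.SIG_BLOCK, [])
    return {
        "nofile_limit": resource.getrlimit(resource.RLIMIT_NOFILE),
        "loginuid": loginuid,
        "cgroups": cgroups,
        "environment_keys": sorted(os.environ),
        "blocked_signals": sorted(int(s) for s in blocked),
    }
```

Running this once from a login shell and once from a service started by the init system shows how much of the caller's context leaks into anything forked from the session.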
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Saturday, December 04, 2010 09:41:26 am Andrey Borzenkov wrote:

> If user starts array manually (mdadm -A -s as example) from within user session and array needs mdmon, mdmon becomes part of user session control group:
>
> ├ user
> │ └ root
> │   └ 1
> │     ├ 1916 login -- root
> │     ├ 1930 -bash
> │     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
> │     └ 2062 mdmon md127
>
> It is then killed by systemd during shutdown as part of user session. It results in dirty array on next boot. Is there any magic that allows daemon to be exempted from killing?

While your raid should absolutely not be corrupted on next reboot when mdmon receives a SIGTERM, I suspect you're using pam_systemd.so (/etc/pam.d/system-auth), which automatically creates cgroups by login session, which in turn gets killed when the user has completely logged out. That is why your mdadm gets terminated, too.

You can avoid that by adding create-session=0 to it, like:

    # grep pam_systemd /etc/pam.d/system-auth
    session optional pam_systemd.so create-session=0

Which is the recommended way if you want processes (created by the user) to live on even when this user has fully logged out.

Regards,
Christian Parpart.

p.s.: see pam_systemd(8)
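Whether a given process ended up in a login-session cgroup can be checked from `/proc/<pid>/cgroup`. The sketch below parses that file's `id:controllers:path` lines and looks at systemd's own `name=systemd` hierarchy. The helper names are invented here, and the `/user/...` path layout is an assumption based on the tree quoted above; other systemd versions arrange the hierarchy differently.

```python
def systemd_cgroup_path(cgroup_text):
    """Extract the path in the 'name=systemd' hierarchy from /proc/<pid>/cgroup text."""
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and parts[1] == "name=systemd":
            return parts[2]
    return None


def in_user_session(cgroup_text):
    """True if the process sits below /user in systemd's own hierarchy."""
    path = systemd_cgroup_path(cgroup_text)
    return path is not None and (path == "/user" or path.startswith("/user/"))
```

A process shown under `user/root/1` in the tree would be reported as belonging to a user session, whereas one started as a system service would not.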
Re: [systemd-devel] systemd kills mdmon if it was started manually by user
On Sat, Dec 4, 2010 at 12:12 PM, Christian Parpart tra...@gentoo.org wrote:
> On Saturday, December 04, 2010 09:41:26 am Andrey Borzenkov wrote:

> > If user starts array manually (mdadm -A -s as example) from within user session and array needs mdmon, mdmon becomes part of user session control group:
> >
> > ├ user
> > │ └ root
> > │   └ 1
> > │     ├ 1916 login -- root
> > │     ├ 1930 -bash
> > │     ├ 1964 gpg-agent --keep-display --daemon --write-env-file /root/.gnup...
> > │     └ 2062 mdmon md127
> >
> > It is then killed by systemd during shutdown as part of user session. It results in dirty array on next boot. Is there any magic that allows daemon to be exempted from killing?

> While your raid should absolutely not be corrupted on next reboot when mdmon receives a SIGTERM,

This won't be corrupted but it will initiate a rebuild. I have reports that such a rebuild may take hours, costing performance and loss of redundancy.

> I suspect you're using pam_systemd.so

Yes

> (/etc/pam.d/system-auth), which automatically creates cgroups by login session, which in turn gets killed when the user has completely logged out. That is why your mdadm gets terminated, too.

Sure.

> You can avoid that by adding create-session=0 to it, like:
>
>     # grep pam_systemd /etc/pam.d/system-auth
>     session optional pam_systemd.so create-session=0

But I do want the user session to be created; and systemd was specifically extended to properly terminate user sessions on shutdown. It is just that mdmon does not belong to the user session at all.

> Which is the recommended way if you want processes (created by the user) to live on even when this user has fully logged out.

mdmon does not belong to the user. The user is not even aware that it is started. And it is likely not the last such case. So systemd does need some framework which can move such processes out of the user session. It probably needs some sd_daemon API to notify systemd that a process is a system-level task even if it was started as a result of user interaction.