For very large IBM Power mainframe systems with hundreds of CPUs and TBs of RAM, booting can take a very long time.

Initial reports showed that booting a configuration of several hundred CPUs and 64TB of RAM would take more than 30 minutes and require kernel parameters of udev.children-max=1024 systemd.default_timeout_start_sec=3600 to prevent dropping into emergency mode.

Gathering information about what's happening during the boot is a bit challenging, but two main issues appeared to be: a large number of path lookups for non-existent files, and very high lock contention in the VFS during path walks, particularly in the dentry allocation code path.

The underlying cause of this was thought to be the sheer number of sysfs memory objects, 100,000+ for a 64TB memory configuration, as the hardware divides the memory into 256MB logical blocks. This is believed to be due to either IBM Power hardware design or a requirement of the mainframe software used to create logical partitions (LPARs, which are used to install an operating system to provide services), since these can be made up of a wide range of resources: CPU, memory, disks, etc.

It's not yet clear whether the creation of sysfs nodes for these memory devices can be postponed or spread out over a larger amount of time. That's because the high overhead looks to be due to notifications received by udev, which invokes a systemd program for each of them, and attempts by the systemd folks to improve this have not focused on changing the handling of these notifications, possibly because of difficulties with doing so. This remains an avenue of investigation.

Kernel traces show there are many path walks, with a fairly large portion of those for non-existent paths. However, looking at the systemd code invoked by the udev action, it appears there's only one additional lookup for each invocation, so the large number of negative lookups is most likely due to the large number of notifications rather than a fault with the systemd program.
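
For a sense of scale, the memory block count mentioned above can be worked out directly. The following is only a rough user-space sketch, not kernel code: memory_block_count() is a hypothetical helper, and it assumes "TB" and "MB" are meant as binary units (TiB and MiB).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: "TB" and "MB" in the text are binary units (TiB, MiB). */
#define MIB (UINT64_C(1) << 20)
#define TIB (UINT64_C(1) << 40)

/*
 * Hypothetical helper (not a kernel function): number of fixed-size
 * memory blocks needed to cover total_bytes, rounding up so that a
 * partial trailing block still counts as one block.
 */
static uint64_t memory_block_count(uint64_t total_bytes, uint64_t block_bytes)
{
	assert(block_bytes != 0);
	return total_bytes / block_bytes + (total_bytes % block_bytes != 0);
}
```

Under those assumptions, memory_block_count(64 * TIB, 256 * MIB) evaluates to 262144, which is consistent with the "100,000+" figure above. If each block triggers a udev notification, that is the rough number of notifications to be handled during boot.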

The series here tries to reduce the locking needed during path walks, based on the assumption that there are many path walks with a fairly large portion of those for non-existent paths, as described above.

That was done by adding kernfs negative dentry caching (for non-existent paths) to avoid the continual alloc/free cycle of dentries, and by introducing a read/write semaphore to increase kernfs concurrency during path walks.

With these changes we still need kernel parameters of udev.children-max=2048 and systemd.default_timeout_start_sec=300 for the fastest boot times of under 5 minutes.

There may be opportunities for further improvements, but the series here has seen a fair amount of testing, and a fair amount of thought about what those improvements could be. Discussing it with Rick Lindsay, I suspect improvements will get more difficult to implement for somewhat less improvement, so I think what we have here is a good start for now.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the attribute
  handling code.
---

Ian Kent (6):
      kernfs: switch kernfs to use an rwsem
      kernfs: move revalidate to be near lookup
      kernfs: improve kernfs path resolution
      kernfs: use revision to identify directory node changes
      kernfs: refactor attr locking
      kernfs: make attr_mutex a local kernfs node lock

 fs/kernfs/dir.c             |  284 ++++++++++++++++++++++++++++---------------
 fs/kernfs/file.c            |    4 -
 fs/kernfs/inode.c           |   58 +++++----
 fs/kernfs/kernfs-internal.h |   29 ++++
 fs/kernfs/mount.c           |   12 +-
 fs/kernfs/symlink.c         |    4 -
 include/linux/kernfs.h      |    7 +
 7 files changed, 259 insertions(+), 139 deletions(-)

--
Ian