On Fri, 2020-06-19 at 11:38 -0400, Tejun Heo wrote: > Hello, Ian. > > On Wed, Jun 17, 2020 at 03:37:43PM +0800, Ian Kent wrote: > > The series here tries to reduce the locking needed during path > > walks > > based on the assumption that there are many path walks with a > > fairly > > large portion of those for non-existent paths, as described above. > > > > That was done by adding kernfs negative dentry caching (non- > > existent > > paths) to avoid continual alloc/free cycle of dentries and a > > read/write > > semaphore introduced to increase kernfs concurrency during path > > walks. > > > > With these changes we still need kernel parameters of > > udev.children-max=2048 > > and systemd.default_timeout_start_sec=300 for the fastest boot > > times of > > under 5 minutes. > > I don't have strong objections to the series but the rationales don't > seem > particularly strong. It's solving a suspected problem but only half > way. It > isn't clear whether this can be the long term solution for the > problem > machine and whether it will benefit anyone else in a meaningful way > either. > > I think Greg already asked this but how are the 100,000+ memory > objects > used? Is that justified in the first place?
The problem is real enough, however, whether improvements can be made in other areas flowing on from the arch specific device creation notifications is not clear cut. There's no question that there is very high contention between the VFS and sysfs and that's all the series is trying to improve, nothing more. What both you and Greg have raised are good questions but are unfortunately very difficult to answer. I tried to add some discussion about it, to the extent that I could, in the cover letter. Basically the division of memory into 256M chunks is something that's needed to provide flexibility for arbitrary partition creation (a set of hardware allocated that's used for, essentially, a bare metal OS install). Whether that's many small partitions for load balanced server farms (or whatever) or much larger partitions for for demanding applications, such as Oracle systems, is not something that can be known in advance. So the division into small memory chunks can't change. The question of sysfs node creation, what uses them and when they are used is much harder. I'm not able to find that out and, I doubt even IBM would know, if their customers use applications that need to consult the sysfs file system for this information or when it's needed if it is need at all. So I'm stuck on this question. One thing is for sure though, it would be (at the very least) risky for a vendor to assume they either aren't needed or aren't needed early during system start up. OTOH I've looked at what gets invoked on udev notifications (which is the source of the heavy path walk activity, I admit I need to dig deeper) and that doesn't appear to be doing anything obviously wrong so that far seems ok. For my part, as long as the series proves to be sound, why not, it does substantially reduce contention between the VFS and sysfs is the face of heavy sysfs path walk activity so I think that stands alone as sufficient to consider the change worthwhile. Ian