>> locks and not hashing itself. Note that at the time vnode interlock
>> and vm object lock were the same thing and in this workload the lock
>> is under a lot of contention. Should the lock owner get preempted,
>> anyone doing a lookup on the affected vnode (e.g., libc) will be
>> holding the relevant per-cpu lock and will block on a turnstile.
>> Whoever ends up running on the affected cpu is likely to do a lookup
>> of their own, but the relevant per-cpu lock is taken, so they go off
>> cpu as well. The same thing happening on more than one cpu at a time
>> could easily result in a cascading failure, which I strongly suspect
>> is precisely what happened.
>>
>> That is, the win does not stem from rb trees but from finer-grained
>> locking which does not block other threads which look up something
>> else on the same cpu.
>
> Not on NetBSD. Kernel preemption is possible and allowed (mainly for
> real time applications), but happens infrequently during normal
> operation. There are a number of pieces of code that take advantage of
> that fact and are "optimistically per-CPU", and they work very well as
> preemption is rarely observed. Further, the blocking case on
> v_interlock in cache_lookup() is rare. That's not to say it doesn't
> happen, it does, but I don't think it's enough to explain the
> performance differences.
>

I noted suspected preemption was occurring because of contention on the
vm side. It should only take one preempted lock holder to start the
cascade.

>> As mentioned earlier I think rb trees instead of a hash are pessimal
>> here.
>>
>> First, a little step back. The lookup starts with securing vnodes
>> from cwdinfo. This represents a de facto global serialisation point
>> (times two, since the work has to be reverted later). In FreeBSD I
>> implemented an equivalent with copy-on-write semantics. The easy take
>> on it is that I take an rwlock-equivalent and then grab a reference
>> on the found struct. This provides me with an implicit reference on
>> the root and current working directory vnodes. If the struct is
>> unshared on fork, the aforementioned serialisation point becomes
>> localized to the process.
>
> Interesting. Sounds somewhat like what both NetBSD and FreeBSD do for
> process credentials.
>

I was thinking about doing precisely that, but I found it iffy to have
permanently stored references per-thread. With the proposal they get
"gained" around the actual lookup; otherwise this is very similar to
what mountcheckdirs is dealing with right now.
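To make the idea concrete, here is a rough sketch of the scheme. All
names are invented for the example and the primitives are only
stand-ins for whatever the kernel provides; this is not the actual
FreeBSD code.

#include <sys/atomic.h>
#include <sys/kmem.h>
#include <sys/rwlock.h>
#include <sys/vnode.h>

struct cwdref {
	volatile unsigned int	cr_refcnt;	/* refs on this snapshot */
	struct vnode		*cr_rootdir;	/* referenced root vnode */
	struct vnode		*cr_cwddir;	/* referenced cwd vnode */
};

struct cwdstate {
	krwlock_t	cs_lock;	/* the rwlock-equivalent */
	struct cwdref	*cs_cur;	/* current snapshot, refcnt >= 1 */
};

/*
 * Entering a lookup: briefly take the lock, bump the refcount on the
 * current snapshot and return it.  Holding the snapshot provides
 * implicit references on both vnodes for the duration of the lookup,
 * with no refcount traffic on the vnodes themselves.
 */
static struct cwdref *
cwdref_hold(struct cwdstate *cs)
{
	struct cwdref *cr;

	rw_enter(&cs->cs_lock, RW_READER);
	cr = cs->cs_cur;
	atomic_inc_uint(&cr->cr_refcnt);
	rw_exit(&cs->cs_lock);
	return cr;
}

/*
 * Leaving the lookup: drop the reference.  The last user releases the
 * vnode references and frees the snapshot.
 */
static void
cwdref_rele(struct cwdref *cr)
{
	if (atomic_dec_uint_nv(&cr->cr_refcnt) == 0) {
		vrele(cr->cr_rootdir);
		vrele(cr->cr_cwddir);
		kmem_free(cr, sizeof(*cr));
	}
}

/*
 * chdir/chroot would install a fresh snapshot under the writer lock
 * and cwdref_rele() the old one, never mutating a published snapshot;
 * in-flight lookups keep using their copy.  If the struct is unshared
 * on fork, contention on cs_lock is confined to the process.
 */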
>> In my tests even with lookups which share most path components, the
>> last one tends to be different. Using a hash means this typically
>> results in grabbing different locks for that case and consequently
>> fewer cache-line ping pongs.
>
> My experience has been different. What I've observed is that shared
> hash tables usually generate huge cache pressure unless they're small
> and rarely updated. If the hash were small, matching the access
> pattern (e.g. per-dir), then I think it would have the opportunity to
> make maximum use of the cache. That could be a great win and certainly
> better than rbtree.
>

Well, in my tests this is all heavily dominated by SMP effects, which I
expect to be exacerbated by having just one lock.

Side note: I had a look at your vput. The pre-read + VOP_UNLOCK +
actual loop to drop the ref definitely slows things down, if only a
little, as it can force a shared cacheline transition from under
someone who is cmpxchg-ing.

That said, can you generate a flamegraph from a fully patched kernel?
I'm curious where the time is spent now; my bet is spinning on vnode
locks.

-- 
Mateusz Guzik <mjguzik gmail.com>