On Mon, Jun 06, 2016 at 01:46:23PM -0700, Linus Torvalds wrote:

> So my gut feel is that we do want to have the same heuristics for
> rwsems and mutexes (well, modulo possible actual semantic differences
> due to the whole shared-vs-exclusive issues).
> 
> And I also suspect that the mutexes have gotten a lot more performance
> tuning done on them, so it's likely the correct thing to try to make
> the rwsem match the mutex code rather than the other way around.
> 
> I think we had Jason and Davidlohr do mutex work last year, let's see
> if they agree on that "yes, the mutex case is the likely more tuned
> case" feeling.
> 
> The fact that your performance improves when you do that obviously
> then also validates the assumption that the mutex spinning is the
> better optimized one.

FWIW, there's another fun issue on ramfs - dcache_readdir() is doing an
obscene amount of grabbing/releasing ->d_lock and once you take the external
serialization out, parallel getdents load hits contention on *that*.
In spades.  And unlike mutex (or rwsem exclusive), contention on ->d_lock
chews a lot of cycles.  The root cause is the use of cursors - we not only
move them more than we ought to (we do that on each entry reported, rather
than once before return from dcache_readdir()), we can't traverse the real
list entries (which remain nice and stable; another low-hanging fruit is
pointless grabbing ->d_lock on those) without ->d_lock on parent.

I think I have a kinda-sorta solution, but it has a problem.  What I want
to do is
        * list_move() only once per dcache_readdir()
        * ->d_lock taken for that and only for that.
        * list_move() itself surrounded with write_seqcount_{begin,end} on
some seqcount
        * traversal to the next real entry done under rcu_read_lock in a
seqretry loop.
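
Roughly, the shape of it as a userspace model.  This is a sketch, not kernel
code: struct dir, cursor_move_after() and next_real() are made-up names, a
C11 atomic counter stands in for the seqcount, and ->d_lock and RCU are left
out entirely, so it shows the control flow only and is not safe for real
concurrent use as written.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a child on the parent's list. */
struct node {
	struct node *prev, *next;
	bool is_cursor;		/* cursors are skipped by readers */
	int id;
};

/* Hypothetical stand-in for the directory: child list plus the counter. */
struct dir {
	struct node head;
	atomic_uint seq;	/* odd value: a cursor move is in progress */
};

static void dir_init(struct dir *d)
{
	d->head.prev = d->head.next = &d->head;
	d->head.is_cursor = false;
	d->head.id = -1;
	atomic_init(&d->seq, 0);
}

static void node_add_tail(struct dir *d, struct node *n)
{
	n->prev = d->head.prev;
	n->next = &d->head;
	d->head.prev->next = n;
	d->head.prev = n;
}

/*
 * Writer: called once per readdir pass.  In the real thing the caller
 * would hold ->d_lock of the parent around this and nothing else.
 * pos must not be the cursor itself.
 */
static void cursor_move_after(struct dir *d, struct node *cursor,
			      struct node *pos)
{
	atomic_fetch_add(&d->seq, 1);		/* begin: seq becomes odd */
	cursor->prev->next = cursor->next;	/* unlink the cursor */
	cursor->next->prev = cursor->prev;
	cursor->next = pos->next;		/* relink it after pos */
	cursor->prev = pos;
	pos->next->prev = cursor;
	pos->next = cursor;
	atomic_fetch_add(&d->seq, 1);		/* end: seq is even again */
}

/*
 * Reader: find the next real (non-cursor) entry after from without
 * taking any lock; redo the walk if a cursor move overlapped it.
 * Returns NULL at the end of the list.
 */
static struct node *next_real(struct dir *d, struct node *from)
{
	struct node *p;
	unsigned int seq;

	do {
		/* wait until no move is in progress */
		while ((seq = atomic_load(&d->seq)) & 1)
			;
		for (p = from->next; p != &d->head && p->is_cursor; p = p->next)
			;
	} while (atomic_load(&d->seq) != seq);

	return p == &d->head ? NULL : p;
}
```

The point of the split is that the real entries are never touched by the
writer, so the reader only has to worry about cursors moving under it, and
the counter tells it when that has happened.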

The only problem is where to put that seqcount (unsigned int, really).
->i_dir_seq is an obvious candidate, but that'll need careful profiling
on getdents/lookup mixes...
