On Mon, Aug 26, 2024 at 01:09:15PM -0400, Kent Overstreet wrote:
> On Mon, Aug 26, 2024 at 09:01:54AM GMT, Paul E. McKenney wrote:
> > But get_state_synchronize_srcu() is still a function call.  But the
> > offer of a function that checks multiple grace-period-state cookies in
> > one call still stands.
> 
> It shouldn't be though, it's just reading a sequence number - I'd much
> prefer if it could be at least a static inline.
> 
> Which you won't want to do, because it would mean exposing RCU private
> data structures; hence my approach of exposing an RCU-only api for
> getting a pointer to the sequence number.

True enough.

In addition, I have not yet been able to come up with any safe use of
this sequence number that does not require memory ordering.  And -that-
I most emphatically won't want to export to the caller.

Hence my offer of a function taking multiple grace-period-state cookies
in one go, allowing the memory-ordering overhead to be amortized over
the set of cookies.

> > > In my current version of the code, we call __get_state_synchronize_rcu()
> > > after we've disabled interrupts; we know the rcu gp seq isn't going to
> > > tick while we're in the critical path. But this doesn't apply if it's
> > > for SRCU, and I don't want to add if (src) srcu_read_lock() branches to
> > > it.
> > 
> > Actually, disabling interrupts does *not* prevent RCU's grace-period 
> > sequence number from changing.  For example, the following really can
> > happen:
> > 
> > o   RCU notices that the current grace period can end.
> > 
> > o   A CPU disables interrupts here.
> > 
> > o   A call to get_state_synchronize_rcu() returns a cookie
> >     corresponding to the end of the grace period following the
> >     current one.
> > 
> > o   RCU ends the current grace period, therefore updating the
> >     grace-period sequence number.
> > 
> > o   RCU starts a new grace period, therefore updating the
> >     grace-period sequence number once again.
> > 
> > o   RCU cannot complete this new grace period until that CPU
> >     re-enables interrupts, but it has already updated its grace-period
> >     sequence number not once, but twice.
> > 
> > So if your code knows that RCU's grace-period sequence number cannot
> > change while a given CPU has interrupts disabled, that code has a bug.
> > A low-probability bug, perhaps, but if your code is running on enough
> > systems, it will make its presence known.
> 
> Ok, good to know
> 
> > > Not at all essential, the races that result from this are harmless, but
> > > if we e.g. decide it's worthwhile to only kick off a gp if it hasn't
> > > ticked (i.e. only kick rcu if we're still on seq of the object being
> > > enqueued) it'd be nice.
> > 
> > Why not call get_state_synchronize_rcu(), and ask for a new grace period
> > if the value returned is different than that from the previous call?
> 
> We don't want to falsely think the object expires later than it actually
> does, and have more accumulated work than we need to.

Yes, that can happen, but grace periods are long and that race window
will normally be very short.  It will not matter in actual practice.

> > > Funny, I had the same thoughts trying to read your code... :)
> > 
> > Amazing how much easier it is to generate new code than to understand
> > existing code, isn't it?  ;-)
> 
> I would have much preferred if your existing code worked with SRCU. You
> think I'm doing this for fun?

To be fair, the "for fun" reason seemed to me to be one of the more
positive explanations for your providing an unsolicited rip-and-replace
patch for kfree_rcu() without prior consultation. ;-)

But, to your point, why haven't we already implemented kfree_srcu()?

Because in v6.10, there are six use cases for it, and they appear to have
infrequent updates.  Thus, kfree_srcu() is not yet not worth its weight.

> > > > Plus, unlike kfree_rcu() post-grace-period handling, call_rcu() callback
> > > > functions usually access the memory block passed to them, which means
> > > > that they are incurring that per-element cache miss in any case.
> > > 
> > > True. But this would allow us to prefetch those objects (several
> > > iterations in advance).
> > 
> > I need to see a CPU on which this actually make a significant difference
> > before adding this sort of complexity.
> 
> We would of course want benchmarks to show that this was worthwhile
> before switching call_rcu(), since absent a performance improvement we'd
> want to stick with the approach that doesn't allocate memory.

Agreed.

Furthermore, any approach that does allocate memory needs a non-allocating
fallback.  Out-of-memory deadlocks are not fun.

> > > Just processing a few items? hmm, would we want to though, when
> > > otherwise we'd be calling kfree_bulk()? I think icache locality means
> > > we'd generally prefer not to.
> > 
> > You might not want to yet, but you eventually would want this.
> 
> Because?

TL;DR:  Lessons learned from bitter experience.

For more detail, please read on.

Because certain inconvenient laws of physics (and I am looking at *you*,
finite speed of light and atomic nature of matter!) mean that global
agreement across a computer system is expensive.  This is especially
a problem if synchronous global agreement is required, which is one
reason why reader-writer locking is so slow, one way or another.
(You can either make readers slow or you can make updaters *really* slow.)

RCU does not completely avoid the cost of global agreement.  After all,
at the end of a grace period, there must be global agreement that all
pre-existing readers have completed.  But the computation of that global
agreement can be (and is) spread over time.  This spreading has two
good effects: First, it permits CPUs and tasks to use lighter-weight
operations for high-frequency operations such as rcu_read_lock(), and
second, the grace-period requests that arrived during one grace-period
computation can all be satisfied by the next grace-period computation,
in turn permitting the overhead of that next computation to be amortized
over all those requests.  Which is one reason why grace periods are not
unconditionally expedited.

But it is all high-amortization fun and games only until there is risk
of exhausting memory.  At that point, RCU takes steps to expedite
the current grace period, albeit in a light-weight fashion compared
to synchronize_rcu_expedited().  It has proven unwise to wait for mm
to complain to RCU (for example, via a shrinker), so RCU does this
lightweight expediting using heuristics based on the number and rate of
callbacks on each CPU.  And yes, RCU also uses shrinkers, just not as
the first line of defense.

So why don't we have such heuristics in SRCU?

Because SRCU has not yet been subjected to workloads that flood SRCU with
callbacks, so it has not yet proven necessary.  Obviously, should SRCU
start getting flooded, SRCU will start taking whatever forms of evasive
action are indicated by the situation at hand, quite likely including
expediting the current SRCU grace period.

Until that time, we keep it simple.  Or simpler, anyway.

                                                        Thanx, Paul

Reply via email to