On Fri, Oct 06, 2017 at 12:12:41PM +0200, Thomas Gleixner wrote:
> On Fri, 6 Oct 2017, Chris Wilson wrote:
> > Quoting Daniel Vetter (2017-10-06 10:06:37)
> > > stop_machine is not really a locking primitive we should use, except
> > > when the hw folks tell us the hw is broken and that's the only way to
> > > work around it.
> > > 
> > > This patch tries to address the locking abuse of stop_machine() from
> > > 
> > > commit 20e4933c478a1ca694b38fa4ac44d99e659941f5
> > > Author: Chris Wilson <ch...@chris-wilson.co.uk>
> > > Date:   Tue Nov 22 14:41:21 2016 +0000
> > > 
> > >     drm/i915: Stop the machine as we install the wedged submit_request handler
> > > 
> > > Chris said part of the reason for going with stop_machine() was that
> > > it has no overhead on the fast-path. But these callbacks use irqsave
> > > spinlocks and do a bunch of MMIO, and rcu_read_lock is _real_ fast.
> > 
> > I still want a discussion of the reasons for keeping the normal path clean
> > and why an alternative is sought here. That design leads into vv
> 
> stop_machine() is the last resort when serialization problems cannot be
> solved otherwise. We try to avoid it wherever we can. While at the call
> site it looks simple, it's invasive in terms of locking, as shown by the
> lockdep splat, and it imposes latencies and other side effects on all
> CPUs in the system. So if you don't have a compelling technical reason to
> use it, then it _is_ the wrong tool.
> 
> As Daniel has shown it's not required, so there is no technical reason why
> stomp_machine() has to be used here.

Well I'm not sure yet whether my fix is actually correct :-)

But imo there are a bunch more reasons why stop_machine is uncool, beyond
just the "it's a huge shotgun which doesn't play well with anything else"
aspect:

- What we actually seem to want is to make sure that all the
  engine->submit_request callbacks have completed, and those all happen to
  run in hardirq context. It's an artifact of stop_machine that it waits
  for all hardirq handlers to complete, but afaiui stop_machine is really
  just aimed at getting all cpus to execute a specific well-known loop (so
  that your callback can start patching .text and other evil stuff). If we
  move our callback into a thread that gets preempted, we have a problem.
  The rcu alternative is sketched right after this list.

- As a consequence, there are no lockdep annotations for the locking we
  actually want. And since this is for gpu hang recovery (something
  relatively rare that just _has_ to work) we really need all the help we
  can get from the debug tools to catch possible issues.

- Another consequence is that the read-side critical sections aren't
  annotated in the code. That makes it all the more likely that a redesign
  moves them out of hardirq context and breaks it all.

- Not relevant here (I think), but stop_machine doesn't remove the need
  for read-side (compiler) barriers. In other cases we might still need to
  sprinkle READ_ONCE all over to make sure gcc doesn't reload values and
  create races that way; the second sketch after this list illustrates
  this.
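
To make the first point concrete, here's a rough sketch of the rcu
pattern I have in mind. This is not the actual patch; "engine",
"request", submit_fn and nop_submit_request are just stand-ins for the
real i915 structures:

#include <linux/rcupdate.h>

struct request;
typedef void submit_fn(struct request *rq);

struct engine {
        submit_fn __rcu *submit_request;        /* swapped when we wedge */
        /* ... */
};

/* Wedged handler: would complete the request as -EIO, never touches the hw. */
static void nop_submit_request(struct request *rq)
{
        /* mark rq as failed here */
}

/* Reader side: runs in hardirq context when a request is submitted. */
static void submit_notify(struct engine *engine, struct request *rq)
{
        submit_fn *submit;

        rcu_read_lock();        /* the annotated read-side section */
        submit = rcu_dereference(engine->submit_request);
        submit(rq);             /* takes irqsave spinlocks, does a bunch of mmio */
        rcu_read_unlock();
}

/* Writer side: gpu hang recovery swaps in the wedged handler. */
static void engine_set_wedged(struct engine *engine)
{
        rcu_assign_pointer(engine->submit_request, nop_submit_request);
        synchronize_rcu();      /* wait out every in-flight submit_request */
        /* nobody executes the old handler beyond this point */
}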

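And a minimal illustration of the READ_ONCE point from the last item;
gpu_error, WEDGED, complete_as_eio and queue_to_hw are all made up for
the example:

#include <linux/compiler.h>

#define WEDGED          (1UL << 0)      /* made-up "gpu is wedged" flag */

struct request;
struct gpu_error {
        unsigned long flags;            /* written from the reset path */
};

void complete_as_eio(struct request *rq);       /* hypothetical helpers */
void queue_to_hw(struct request *rq);

/*
 * With a plain load of error->flags the compiler is free to re-read the
 * field for every use, so different parts of this function could see
 * different values. READ_ONCE forces a single load and everything below
 * operates on that one snapshot.
 */
static void submit(struct gpu_error *error, struct request *rq)
{
        unsigned long flags = READ_ONCE(error->flags);

        if (flags & WEDGED) {
                complete_as_eio(rq);    /* wedged: don't touch the hw */
                return;
        }
        queue_to_hw(rq);
}
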
rcu has all these bits covered, is maintained by very smart people, and
the overhead is somewhere between zero and an access to a cacheline we
touch anyway (preempt_count is also wrangled by our spinlocks in all the
callbacks). No way this will ever show up next to all the mmio writes the
callback does anyway.
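
For reference, this is roughly what the read side costs on a kernel
without preemptible rcu (simplified from include/linux/rcupdate.h; the
real thing has lockdep hooks on top):

/*
 * !CONFIG_PREEMPT_RCU: rcu_read_lock()/rcu_read_unlock() boil down to
 * preempt_disable()/preempt_enable(), and with !CONFIG_PREEMPT_COUNT
 * those are plain compiler barriers, i.e. no memory traffic at all.
 */
static inline void __rcu_read_lock(void)
{
        preempt_disable();
}

static inline void __rcu_read_unlock(void)
{
        preempt_enable();
}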
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
