On Mon, Sep 05, 2016 at 11:29:26AM -0400, Alan Stern wrote:
> On Mon, 5 Sep 2016, Peter Zijlstra wrote:
> 
> > > Actually, seeing it written out like this, one realizes that it really 
> > > ought to be:
> > > 
> > >   CPU 0                           CPU 1
> > >   -----                           -----
> > >   Start DMA                       Handle DMA-complete irq
> > >   Sleep until bh->state           smp_mb()
> > >                                   set bh->state
> > >                                   Wake up CPU 0
> > >   smp_mb()
> > >   Compute rc based on contents of the DMA buffer
> > > 
> > > (Bear in mind also that on some platforms, the I/O operation is carried 
> > > out by PIO rather than DMA.)

> > Also, smp_mb() doesn't necessarily interact with MMIO/DMA at all, IIRC.
> > It's only defined to do CPU/CPU interactions.
> 
> Suppose the DMA master finishes filling up an input buffer and issues a
> completion irq to CPU1.  Presumably the data would then be visible to
> CPU1 if the interrupt handler looked at it.  So if CPU1 executes smp_mb
> before setting bh->state and waking up CPU0, and if CPU0 executes
> smp_mb after testing bh->state and before reading the data buffer,
> wouldn't CPU0 then see the correct data in the buffer?  Even if CPU0 
> never did go to sleep?

A couple of notes here; I would expect the DMA master to make its stores
_globally_ visible on 'completion', because I'm not sure our smp_mb()
would help make them globally visible: it's only defined for CPU/CPU
interactions, not external agents.

Note that ARM, for example, makes this distinction: smp_mb() uses a
DMB-ISH barrier, dma_[rw]mb() uses DMB-OSH{LD,ST} and mb() uses DSB-SY.

The ISH domain is Inner-SHareable (IIRC) and only includes CPUs. OSH is
Outer-SHareable and adds external agents like DMA (it also includes the
CPUs). The DSB-SY thing is even heavier and syncs the whole system, or
something; I always forget these details.
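
Roughly, and from memory (the real definitions live in
arch/arm64/include/asm/barrier.h and the exact wrappers differ per
kernel version), the arm64 mapping looks like:

        #define mb()            dsb(sy)         /* whole system: CPUs, DMA, MMIO */
        #define dma_rmb()       dmb(oshld)      /* outer-shareable, loads */
        #define dma_wmb()       dmb(oshst)      /* outer-shareable, stores */
        #define smp_mb()        dmb(ish)        /* inner-shareable: CPUs only */
        #define smp_rmb()       dmb(ishld)
        #define smp_wmb()       dmb(ishst)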

> Or would something more be needed?

The thing is, this is x86 (TSO). Nearly everything is globally
ordered/visible and acts like a full barrier.

The only reordering allowed by TSO is that a later read can happen
before a prior store. That is, only:

        X = 1;
        smp_mb();
        r = Y;

actually needs a full barrier. All the other variants are no-ops.
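
To illustrate with a store-buffering litmus sketch (X, Y, r0, r1 are
made-up variables, all initially 0):

        CPU 0                           CPU 1
        X = 1;                          Y = 1;
        smp_mb();                       smp_mb();
        r0 = Y;                         r1 = X;

Without the smp_mb()s, TSO allows both loads to be satisfied before the
other CPU's store becomes visible, so r0 == 0 && r1 == 0 is a legal
outcome; with both barriers in place, at least one of r0/r1 must see 1.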

Note that x86's dma_[rw]mb() are no-ops (like the regular smp_[rw]mb()),
which seems to imply DMA is coherent and globally visible.
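
Again from memory (see arch/x86/include/asm/barrier.h for the real
thing), the x86 side is roughly:

        #define dma_rmb()       barrier()       /* compiler barrier only */
        #define dma_wmb()       barrier()
        #define smp_rmb()       barrier()
        #define smp_wmb()       barrier()
        #define smp_mb()        mb()            /* the one case needing a real fence */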

> > I would very much expect the device IO stuff to order things for us in
> > this case. "Start DMA" should very much include sufficient fences to
> > ensure the data under DMA is visible to the DMA engine, this would very
> > much include things like flushing store buffers and maybe even writeback
> > caches, depending on platform needs.
> > 
> > At the same time, I would expect "Handle DMA-complete irq", even if it
> > were done with a PIO polling loop, to guarantee ordering against later
> > operations such that 'complete' really means that.
> 
> That's what I would expect too.
> 
> Back in the original email thread where the problem was first reported, 
> Felipe says that the problem appears to be something else.  Here's what 
> it looks like now, in schematic form:
> 
>       CPU0
>       ----
>       get_next_command():
>         while (bh->state != BUF_STATE_EMPTY)
>               sleep_thread();
>         start an input request for bh
>         while (bh->state != BUF_STATE_FULL)
>               sleep_thread();
> 
> As mentioned above, the input involves DMA and is terminated by an irq.
> The request's completion handler is bulk_out_complete():
> 
>       CPU1
>       ----
>       bulk_out_complete():
>         bh->state = BUF_STATE_FULL;
>         wakeup_thread();
> 
> According to Felipe, when CPU0 wakes up and checks bh->state, it sees a 
> value different from BUF_STATE_FULL.  So it goes back to sleep again 
> and doesn't make any forward progress.
> 
> It's possible that something else is changing bh->state when it 
> shouldn't.  But if this were the explanation, why would Felipe see that 
> the problem goes away when he changes the memory barriers in 
> sleep_thread() and wakeup_thread() to smp_mb()?

Good question, let me stare at the code more.

Could you confirm that bulk_{in,out}_complete() work on different
usb_request structures, and that they cannot, at any time, get called on
the _same_ request?
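
For reference, the usual sleep/wake shape I'd expect this to follow is
roughly the below (a sketch only, not the actual f_mass_storage code;
'thread' and the condition are just illustrative, taken from Alan's
schematic above):

        /* sleeper (CPU0) */
        for (;;) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (bh->state == BUF_STATE_FULL)
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);

        /* waker (CPU1), 'thread' being the task sleeping above */
        bh->state = BUF_STATE_FULL;
        wake_up_process(thread);

set_current_state() and the wakeup path in try_to_wake_up() already
provide the ordering between the condition store and the task state, so
if the code really has this shape no extra smp_mb() should be needed.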

