Hi,

On Fri, Dec 11, 2020 at 5:32 PM Stephen Boyd <swb...@chromium.org> wrote:
>
> Quoting Doug Anderson (2020-12-10 17:51:53)
> > Hi,
> >
> > On Thu, Dec 10, 2020 at 5:39 PM Stephen Boyd <swb...@chromium.org> wrote:
> > >
> > > Quoting Doug Anderson (2020-12-10 17:30:17)
> > > > On Thu, Dec 10, 2020 at 5:21 PM Stephen Boyd <swb...@chromium.org>
> > > > wrote:
> > > > >
> > > > > Yeah and so if it comes way later because it timed out then what's the
> > > > > point of calling synchronize_irq() again? To make the completion
> > > > > variable set when it won't be tested again until it is reinitialized?
> > > >
> > > > Presumably the idea is to try to recover to a somewhat usable state
> > > > again? We're not rebooting the machine so, even though this transfer
> > > > failed, we will undoubtedly do another transfer later. If that
> > > > "abort" interrupt comes way later while we're setting up the next
> > > > transfer we'll really confuse ourselves.
> > >
> > > The interrupt handler just sets a completion variable. What does that
> > > confuse?
> >
> > The interrupt handler sees a "DONE" interrupt. If we've made it far
> > enough into setting up the next transfer that "cur_xfer" has been set
> > then it might do more, no?
>
> I thought it saw a cancel/abort EN bit?
>
>         if (m_irq & M_CMD_CANCEL_EN)
>                 complete(&mas->cancel_done);
>         if (m_irq & M_CMD_ABORT_EN)
>                 complete(&mas->abort_done)
>
> and only a DONE bit if a transfer happened.

Ah, true. The crazy thing is that since we do abort / cancel with
commands we get them together with "done". That "done" could
potentially confuse the next transfer...

In theory we could ignore DONE if we see ABORT / CANCEL, but I've now
spent a bunch of time on this and I think the best thing is to just
make sure we won't start the next transfer if any IRQs are pending.
I'll post patches...

> > > > I guess you could go the route of adding a synchronize_irq() at the
> > > > start of the next transfer, but I'd rather add the overhead in the
> > > > exceptional case (the timeout) than the normal case. In the normal
> > > > case we don't need to worry about random IRQs from the past transfer
> > > > suddenly showing up.
> > > >
> > >
> > > How does adding synchronize_irq() at the end guarantee that the abort is
> > > cleared out of the hardware though? It seems to assume that the abort is
> > > pending at the GIC when it could still be running through the hardware
> > > and not executed yet. It seems like a synchronize_irq() for that is
> > > wishful thinking that the irq is merely pending even though it timed
> > > out and possibly never ran. Maybe it's stuck in a write buffer in the
> > > CPU?
> >
> > I guess I'm asserting that if a full second passed (because we timed
> > out) and after that full second no interrupts are pending then the
> > interrupt will never come. That seems a reasonable assumption to me.
> > It seems hard to believe it'd be stuck in a write buffer for a full
> > second?
> >
>
> Ok, so if we don't expect an irq to come in why are we calling
> synchronize_irq()? I'm lost.

It turns out that synchronize_irq() doesn't do what I thought it did,
actually. :( Despite __synchronize_hardirq() talking about waiting
for "pending" interrupts, it actually passes in "IRQCHIP_STATE_ACTIVE"
and not "IRQCHIP_STATE_PENDING". So much for that.

...but, if it did, I guess my point (which no longer matters) was:

a) If you wait a second but don't wait for pending interrupts to be
done, interrupts might still come later if the CPU servicing
interrupts was blocked.

b) If you don't wait a second but wait for pending interrupts to be
done, interrupts might still come later because maybe the transaction
wasn't finished first.

c) If you wait a second (enough for the transaction to finish) and
then wait for pending interrupts (to handle ISR being blocked) then
you're good.

---

So I got tired of all this conjecture and decided to write some code.
I reproduced the problem with some test code that let me call
local_irq_disable() for a set amount of time based on sysfs.

In terminal 1:

  while true; do ectool version > /dev/null; done

In terminal 2, disable interrupts on cpu0 for 2000 ms:

  taskset -c 0 echo 2000 > /sys/module/spi_geni_qcom/parameters/doug_test

Of course, I got the timeout and the NULL dereference. Then I could
poke at all the corner cases. Posting patches for what I think is the
best solution...

-Doug
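
For anyone following along, the "DONE arrives together with ABORT" hazard discussed above can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions, not the driver code and not the patches mentioned: the bit values, `struct sim_master`, `sim_isr()`, and both `start_xfer_*()` helpers are hypothetical names made up for illustration, and a plain `uint32_t` stands in for the hardware IRQ status register. It shows a stale DONE being attributed to the next transfer when the ISR was blocked, and how refusing to start while any status is still latched avoids that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values for this simulation; the real register layout differs. */
#define M_CMD_DONE_EN   (1u << 0)
#define M_CMD_CANCEL_EN (1u << 1)
#define M_CMD_ABORT_EN  (1u << 2)

struct xfer {
	int id;
	bool finished;
};

/* Hypothetical stand-in for the controller state. */
struct sim_master {
	uint32_t hw_irq_status;  /* models status bits latched by the hardware */
	struct xfer *cur_xfer;   /* transfer the ISR believes is in flight */
	bool cancel_done;
	bool abort_done;
};

/* Simulated ISR: reads and acknowledges whatever status is latched. */
static void sim_isr(struct sim_master *mas)
{
	uint32_t m_irq = mas->hw_irq_status;

	mas->hw_irq_status = 0;

	if (m_irq & M_CMD_CANCEL_EN)
		mas->cancel_done = true;
	if (m_irq & M_CMD_ABORT_EN)
		mas->abort_done = true;

	/* DONE is credited to whichever transfer is current right now. */
	if ((m_irq & M_CMD_DONE_EN) && mas->cur_xfer) {
		mas->cur_xfer->finished = true;
		mas->cur_xfer = NULL;
	}
}

/* Starts the next transfer without looking at latched status. */
static void start_xfer_unguarded(struct sim_master *mas, struct xfer *x)
{
	assert(x != NULL);
	mas->cur_xfer = x;
}

/* Refuses to start while any status from an earlier command is still latched. */
static bool start_xfer_guarded(struct sim_master *mas, struct xfer *x)
{
	assert(x != NULL);
	if (mas->hw_irq_status != 0)
		return false;
	mas->cur_xfer = x;
	return true;
}
```

In this model, latching `M_CMD_ABORT_EN | M_CMD_DONE_EN`, calling `start_xfer_unguarded()`, and then running `sim_isr()` marks the new transfer as finished even though it never ran. With `start_xfer_guarded()` the start is refused until `sim_isr()` has drained the stale status, after which a retry succeeds.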