On Fri, Oct 26, 2012 at 2:48 AM, Jan Kiszka <jan.kis...@siemens.com> wrote: > On 2012-10-25 19:02, Avi Kivity wrote: >> On 10/25/2012 06:39 PM, Jan Kiszka wrote: >>>> >>>> That doesn't work cross-thread. >>>> >>>> vcpu A: write to device X, dma-ing to device Y >>>> vcpu B: write to device Y, dma-ing to device X >>> >>> We will deny DMA-ing from device X on behalf of a VCPU, ie. in dispatch >>> context, to Y. >>> >>> What we do not deny, though, is DMA-ing from an I/O thread that >>> processes an event for device X. >> >> I would really like to avoid depending on the context. In real hardware, >> there is no such thing. > > The point is how we deal with any kind of access to a device that > requires taking that device's lock while holding another lock, provided > that scenario can also take place in reverse order at the same time. > Known scenarios are: > > - vcpu 1 -> access device A -> access device B > - vcpu 2 -> access device B -> access device A > > - event 1 -> device A event processing -> access device B > - event 2 -> device B event processing -> access device A > > and combinations of those pairs. > >> >>> If the invoked callback of device X >>> holds the device lock across some DMA request to Y, then we risk to run >>> into the same ABBA issue. Hmm... >> >> Yup. >> >>> >>>> >>>> My suggestion was to drop the locks around DMA, then re-acquire the lock >>>> and re-validate data. >>> >>> Maybe possible, but hairy depending on the device model. >> >> It's unpleasant, yes. >> >> Note depending on the device, we may not need to re-validate data, it may be >> sufficient to load it into local variables to we know it is consistent at >> some point. But all those solutions suffer from requiring device model >> authors to understand all those issues, rather than just add a simple lock >> around access to their data structures. > > Right. And therefor it is a suboptimal way to start (patching). > >> >>>>>>> I see that we have a all-or-nothing problem here: to address this >>>>>>> properly, we need to convert the IRQ path to lock-less (or at least >>>>>>> compatible with holding per-device locks) as well. >>>>>> >>>>>> There is a transitional path where writing to a register that can cause >>>>>> IRQ changes takes both the big lock and the local lock. >>>>>> >>>>>> Eventually, though, of course all inner subsystems must be threaded for >>>>>> this work to have value. >>>>>> >>>>> >>>>> But that transitional path must not introduce regressions. Opening a >>>>> race window between IRQ cause update and event injection is such a >>>>> thing, just like dropping concurrent requests on the floor. >>>> >>>> Can you explain the race? >>> >>> Context A Context B >>> >>> device.lock >>> ... >>> device.set interrupt_cause = 0 >>> lower_irq = true >>> ... >>> device.unlock >>> device.lock >>> ... >>> device.interrupt_cause = 42 >>> raise_irq = true >>> ... >>> device.unlock >>> if (raise_irq) >>> bql.lock >>> set_irq(device.irqno) >>> bql.unlock >>> if (lower_irq) >>> bql.lock >>> clear_irq(device.irqno) >>> bql.unlock >>> >>> >>> And there it goes, our interrupt event. >> >> Obviously you'll need to reacquire the device lock after taking bql and >> revalidate stuff. But that is not what I am suggesting. Instead, any path >> that can lead to an irq update (or timer update etc) will take both the bql >> and the device lock. This will leave after the first pass only side effect >> free register reads and writes, which is silly if we keep it that way, but >> we will follow with a threaded timer and irq subsystem and we'll peel away >> those big locks. >> >> device_mmio_write: >> if register is involved in irq or timers or block layer or really >> anything that matters: >> bql.acquire >> device.lock.acquire >> do stuff >> device.lock.release >> if that big condition from above was true: >> bql.release > > Looks simpler than it is as you cannot wrap complete handlers with that > pattern. An example where it would fail (until we solved the locking > issues above): > > mmio_write: > bql.conditional_lock > device.lock > device.check_state > issue_dma > device.update_state > update_irq, play_with_timers, etc. > device.unlock > bql.conditional_unlock > > If that DMA request hits an unconverted MMIO region or one that takes > BQL conditionally as above, we will lock up (or bail out as our mutexes > detect the error). E1000's start_xmit looks like this so far, and that's > a pretty import service. > > Moreover, I prefer having a representative cut-through over enjoying to > merge a first step that excludes some 80% of the problems. For that > reason I would be even be inclined to start with addressing the IRQ > injection topic first (patch-wise), then the other necessary backend > services for the e1000 or whatever and convert some device(s) last. > > IOW: cut out anything from this series that touches e1000 until the > building blocks for converting it reasonably are finished. Carrying > experimental, partially broken conversion on top is fine, try to merge > pieces of that not, IMHO. > Agreed. Just want to take this opportunity to discuss what is the next and what is nut.
Regards, pingfan > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE > Corporate Competence Center Embedded Linux