On Wed, 2012-06-20 at 16:25 -0500, Anthony Liguori wrote:
> So this cancellation stuff is hopelessly broken
>
> It's simply not possible to fully cancel pending DMA in a synchronous
> callback.
Well, at least for PAPR H_PUT_TCE, cancellation must be synchronous,
ie the hypercall must not return until the cancellation is complete.

> Indeed, bdrv_aio_cancel ends up having a nasty little loop in it:
>
>     if (active) {
>         /* fail safe: if the aio could not be canceled, we wait for
>            it */
>         while (qemu_paio_error(acb) == EINPROGRESS)
>             ;
>     }
>
> That spins w/100% CPU.
>
> Can you explain when DMA cancellation really happens and what the
> effect would be if we simply ignored it?

It will almost never happen in practice. It will actually never happen
with the current patch series.

Where it -will- eventually happen in the long run is if the guest
removes a translation that is "in use" by a dma_map() mapping
established by a device. It's always a guest programming error though,
and it's not an attack vector since the guest can only shoot itself in
the foot, but it might make things like kdump less reliable inside the
guest.

We need a way to signal the device that the translation is going away,
and we need -a- way to synchronize, though it could be a two-step
process (see below).

> Can we do something more clever like use an asynchronous callback to
> handle flushing active DMA mappings?
>
> There's just no way a callback like this is going to work.

Ok, so first let's see what happens in real HW: one of the DMA
accesses gets a target abort return from the host bridge. The device
interrupts its operation and signals an error.

Now, I agree that requiring a cancel callback to act synchronously
might be a bit fishy, so what about we define the following semantics:

 - First, this assumes our iommu backend decides to implement that
   level of correctness; as I said above, none do in that patch
   series (yet).

 - The basic idea is that for most iommus, there's an MMIO to start a
   TLB flush and an MMIO the guest uses to spin on to get the status
   as to whether the TLB flush has completed, so we can do things
   asynchronously that way.
   However, we -still- need to do things synchronously for the
   hypercall used by PAPR, but as we discussed earlier that can be
   done without spinning, by delaying the completion of the hypercall.

 - So step 1: no callback at all. When an iommu TLB flush operation
   is started, we tag all pending maps (see below). We signal
   completion when all those maps have been unmapped.

 - The above tagging can be done using some kind of generation count
   along with an ordered list of maps; we keep track of the "oldest"
   map still active, that sort of thing. Not too hard.

 - Step 2: because some maps can be long-lived and we don't want to
   delay invalidations forever, we add a cancel callback which
   devices can optionally install along with a map. This callback is
   only meant to -initiate- a cancellation, in order to speed up when
   the unmap will occur.

What do you think? Would that work?

As I explained in my email exchange with Gerd, there are quite a few
issues in actually implementing cancellation properly anyway. For
example, today on PAPR, H_PUT_TCE is implemented by KVM in the kernel
in real mode for performance reasons. So we would need to change the
KVM API to be able to keep the kernel informed that there are maps
covering portions of the iommu space (a bitmap?) to force exits to
qemu when an invalidation collides with a map, for example.

Additionally, to be totally -correct-, we would need synchronization
with qemu to a much larger extent. Ie, any invalidation must also
make sure that anything that used a previous translation has
completed; even the simple dma_ld/st ops must be synchronized in
theory.

My conclusion is that the complexity of solving the problem is huge,
while the actual problem scope is close to non-existent. So I think
we can safely merge the series and ignore the issue for the time
being.

Cheers,
Ben.