Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-26 Thread Toke Høiland-Jørgensen via iommu
Halil Pasic  writes:

> On Fri, 25 Mar 2022 11:27:41 +
> Robin Murphy  wrote:
>
>> What muddies the waters a bit is that the opposite combination 
>> sync_for_cpu(DMA_TO_DEVICE) really *should* always be a no-op, and I for 
>> one have already made the case for eliding that in code elsewhere, but 
>> it doesn't necessarily hold for the inverse here, hence why I'm not sure 
>> there even is a robust common solution for peeking at a live 
>> DMA_FROM_DEVICE buffer.
>
> In https://lkml.org/lkml/2022/3/24/739 I also argued that a robust
> common solution for peeking at a live DMA_FROM_DEVICE buffer is
> probably not possible, at least not with the current programming model
> as described by Documentation/core-api/dma-api.rst.
>
> Namely AFAIU the programming model is based on exclusive ownership: the
> buffer is either owned by the device, which means CPU(s) are not allowed
> to *access* it, or it is owned by the CPU(s), and the device is not
> allowed to *access* it. Do we agree on this?
>
> Considering what Linus said here https://lkml.org/lkml/2022/3/24/775,
> I understand that if we abandon the idea that dma_sync_*_for_{cpu,device}
> always transfers ownership to the cpu and device respectively, and
> instead redefine ownership so that only the owner may write but a
> non-owner is allowed to read, then it may be possible to make the
> scenario under discussion work.
>
> The scenario in pseudo code:
>
> /* when invoked, the device might still be doing DMA into buf */
> rx_buf_complete(buf)
> {
>         prepare_peek(buf, DMA_FROM_DEVICE);
>         if (!is_ready(buf)) {
>                 /* let the device gain the buffer again */
>                 peek_done_not_ready(buf, DMA_FROM_DEVICE);
>                 return false;
>         }
>         peek_done_ready(buf, DMA_FROM_DEVICE);
>         process_buff(buf, DMA_FROM_DEVICE);
> }
>
> IMHO it is pretty obvious that prepare_peek() has to update the
> CPU copy of the data *without* transferring ownership to the CPU. Since
> the owner is still the device, it is legitimate for the device to keep
> modifying the buffer via DMA. In the swiotlb case we would copy the
> content of the bounce buffer to the original buffer, possibly after
> invalidating caches; in the non-swiotlb case we would just invalidate
> caches. So prepare_peek() could actually be something like
> dma_sync_single_for_cpu(buf, DMA_FROM_DEVICE,
> DMA_ATTR_NO_OWNERSHIP_TRANSFER), which would most likely end up being
> functionally the same as without the flag, since my guess is that
> ownership is only tracked in our heads.

Well, we also need to ensure that the CPU caches are properly
invalidated in either prepare_peek() or peek_done_not_ready(), so that
stale data is not cached between subsequent peeks. This could translate
to either turning prepare_peek() into dma_sync_single_for_cpu(buf,
DMA_FROM_DEVICE, DMA_ATTR_NO_OWNERSHIP_TRANSFER_BUT_INVALIDATE_CACHES),
or turning peek_done_not_ready() into something that just invalidates
the cache.

I was also toying with the idea of a copy-based peek helper like:

u32 data = dma_peek_word(buf, offset)

which copies out a single word from the buffer at the given offset
(from the bounce buffer or the real buffer, as appropriate) without
messing with the notion of ownership. The trouble with this idea is
that ath9k reads two different words that are 44 bytes apart, so it
would have to do two such calls, which would be racy :(
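
For reference, a back-of-the-envelope sketch of what such a helper
might look like (dma_peek_word() does not exist; I'm hand-waving the
interaction between swiotlb and non-coherent caches, and
arch_sync_dma_for_cpu() is only available inside the DMA core on
architectures that have it):

	static u32 dma_peek_word(struct device *dev, dma_addr_t addr, size_t offset)
	{
		phys_addr_t paddr = dma_to_phys(dev, addr) + offset;

		/* Non-bounce case: drop any stale cache lines covering
		 * the word so we read what the device last wrote. */
		if (!is_swiotlb_buffer(dev, paddr))
			arch_sync_dma_for_cpu(paddr, sizeof(u32), DMA_FROM_DEVICE);

		/* For swiotlb, paddr already points into the bounce slot
		 * the device is DMA'ing to, so a plain read suffices. */
		return READ_ONCE(*(u32 *)phys_to_virt(paddr));
	}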

-Toke


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-25 Thread Toke Høiland-Jørgensen via iommu
Christoph Hellwig  writes:

> On Thu, Mar 24, 2022 at 07:02:16PM +0100, Halil Pasic wrote:
>> > If 
>> > ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a 
>> > partial revert of just that one hunk.
>> >
>> 
>> I'm not against being pragmatic and doing the partial revert. But as
>> explained above, I do believe for correctness of swiotlb we ultimately
>> do need that change. So if the revert is the short term solution,
>> what should be our mid-term road-map?
>
> Unless I'm misunderstanding this thread we found the bug in ath9k
> and have a fix for that now?

According to Maxim's comment on the other subthread, that ath9k patch
wouldn't work on all platforms (and constitutes a bit of a violation of
the DMA API ownership abstraction). So not quite, I think?

-Toke


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-25 Thread Toke Høiland-Jørgensen via iommu
Robin Murphy  writes:

> On 2022-03-25 16:25, Toke Høiland-Jørgensen wrote:
>> Maxime Bizon  writes:
>> 
>>> On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
>>>

>>>> It's actually very natural in that situation to flush the caches from
>>>> the CPU side again. And so dma_sync_single_for_device() is a fairly
>>>> reasonable thing to do in that situation.

>>>
>>> In the non-cache-coherent scenario, and assuming dma_map() did an
>>> initial cache invalidation, you can write this:
>>>
>>> rx_buffer_complete_1(buf)
>>> {
>>>         invalidate_cache(buf, size)
>>>         if (!is_ready(buf))
>>>                 return;
>>>         /* ... */
>>> }
>>>
>>> or
>>>
>>> rx_buffer_complete_2(buf)
>>> {
>>>         if (!is_ready(buf)) {
>>>                 invalidate_cache(buf, size)
>>>                 return;
>>>         }
>>>         /* ... */
>>> }
>>>
>>> The latter is preferred for performance because dma_map() did the
>>> initial invalidate.
>>>
>>> Of course you could write:
>>>
>>> rx_buffer_complete_3(buf)
>>> {
>>>         invalidate_cache(buf, size)
>>>         if (!is_ready(buf)) {
>>>                 invalidate_cache(buf, size)
>>>                 return;
>>>         }
>>>         /* ... */
>>> }
>>>
>>>
>>> but it's a waste of CPU cycles.
>>>
>>> So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
>>> are both doing invalidation in existing implementations of arch DMA
>>> ops; implementers may have taken some liberties around the DMA API to
>>> avoid unnecessary cache operations (not to blame them).
>> 
>> I sense an implicit "and the driver can't (or shouldn't) influence
>> this" here, right?
>
> Right, drivers don't get a choice of how a given DMA API implementation 
> works.
>
>>> For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
>>>
>>> sync_single_for_device()
>>>   => __dma_page_cpu_to_dev()
>>>     => dma_cache_maint_page(op=dmac_map_area)
>>>       => cpu_cache.dma_map_area()
>>>
>>> sync_single_for_cpu()
>>>   => __dma_page_dev_to_cpu()
>>>     => dma_cache_maint_page(op=dmac_unmap_area)
>>>       => cpu_cache.dma_unmap_area()
>>>
>>> dma_map_area() always does a cache invalidate.
>>>
>>> But for a couple of CPU variants, dma_unmap_area() is a no-op, so
>>> sync_for_cpu() does nothing.
>>>
>>> Toke's patch will break ath9k on those platforms (mostly silent
>>> breakage: rx corruption leading to bad performance)
>> 
>> Okay, so that would be bad obviously. So if I'm reading you correctly
>> (cf my question above), we can't fix this properly from the driver side,
>> and we should go with the partial SWIOTLB revert instead?
>
> Do you have any other way of telling if DMA is idle, or temporarily
> pausing it before the sync_for_cpu, such that you could honour the
> notion of ownership transfer properly?

I'll go check with someone who has a better grasp of how the hardware
works, but I don't think so...

> As mentioned elsewhere I suspect the only "real" fix if you really do
> need to allow concurrent access is to use the coherent DMA API for
> buffers rather than streaming mappings, but that's obviously some far
> more significant surgery.

That would imply copying the packet data out of that (persistent)
coherent mapping each time we do a recv operation, though, right? That
would be quite a performance hit...
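
Concretely, I'd expect the RX path to end up doing something like the
following for every completed frame (a sketch only, all names made up;
this is not how ath9k is structured today):

	/* Copy a completed frame out of a persistent dma_alloc_coherent()
	 * ring area into a freshly allocated skb. The memcpy behind
	 * skb_put_data() on every received packet is the performance hit
	 * in question. */
	static struct sk_buff *rx_copy_frame(const void *ring_slot, unsigned int len)
	{
		struct sk_buff *skb = dev_alloc_skb(len);

		if (!skb)
			return NULL;
		skb_put_data(skb, ring_slot, len);
		return skb;
	}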

If all we need is a way to make dma_sync_single_for_cpu() guarantee a
cache invalidation, why can't we just add a separate version that does
that (dma_sync_single_for_cpu_peek() or something)? Using that with the
patch I posted earlier should be enough to resolve the issue, AFAICT?
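
The ath9k RX completion path would then become something like this
(sketch only, using the hypothetical peek variant;
dma_sync_single_for_cpu_peek() does not exist today):

	/* Refresh the CPU view but leave ownership with the device, with
	 * a guaranteed cache invalidate so repeated peeks see fresh data. */
	dma_sync_single_for_cpu_peek(sc->dev, bf->bf_buf_addr,
				     common->rx_bufsize, DMA_FROM_DEVICE);

	ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
	if (ret == -EINPROGRESS)
		/* Device still owns the buffer; no sync_for_device needed. */
		return false;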

-Toke

Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-25 Thread Toke Høiland-Jørgensen via iommu
Maxime Bizon  writes:

> On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
>
>> 
>> It's actually very natural in that situation to flush the caches from
>> the CPU side again. And so dma_sync_single_for_device() is a fairly
>> reasonable thing to do in that situation.
>> 
>
> In the non-cache-coherent scenario, and assuming dma_map() did an
> initial cache invalidation, you can write this:
>
> rx_buffer_complete_1(buf)
> {
>         invalidate_cache(buf, size)
>         if (!is_ready(buf))
>                 return;
>         /* ... */
> }
>
> or 
>
> rx_buffer_complete_2(buf)
> {
>         if (!is_ready(buf)) {
>                 invalidate_cache(buf, size)
>                 return;
>         }
>         /* ... */
> }
>
> The latter is preferred for performance because dma_map() did the
> initial invalidate.
>
> Of course you could write:
>
> rx_buffer_complete_3(buf)
> {
>         invalidate_cache(buf, size)
>         if (!is_ready(buf)) {
>                 invalidate_cache(buf, size)
>                 return;
>         }
>         /* ... */
> }
>
>
> but it's a waste of CPU cycles.

> So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
> are both doing invalidation in existing implementations of arch DMA
> ops; implementers may have taken some liberties around the DMA API to
> avoid unnecessary cache operations (not to blame them).

I sense an implicit "and the driver can't (or shouldn't) influence
this" here, right?

> For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
>
> sync_single_for_device()
>   => __dma_page_cpu_to_dev()
>     => dma_cache_maint_page(op=dmac_map_area)
>       => cpu_cache.dma_map_area()
>
> sync_single_for_cpu()
>   => __dma_page_dev_to_cpu()
>     => dma_cache_maint_page(op=dmac_unmap_area)
>       => cpu_cache.dma_unmap_area()
>
> dma_map_area() always does a cache invalidate.
>
> But for a couple of CPU variants, dma_unmap_area() is a no-op, so
> sync_for_cpu() does nothing.
>
> Toke's patch will break ath9k on those platforms (mostly silent
> breakage: rx corruption leading to bad performance)

Okay, so that would be bad obviously. So if I'm reading you correctly
(cf my question above), we can't fix this properly from the driver side,
and we should go with the partial SWIOTLB revert instead?

-Toke


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Toke Høiland-Jørgensen via iommu
Linus Torvalds  writes:

> On Thu, Mar 24, 2022 at 10:07 AM Toke Høiland-Jørgensen  wrote:
>>
>> Right, but is that sync_for_device call really needed?
>
> Well, imagine that you have a non-cache-coherent DMA (not bounce
> buffers - just bad hardware)...
>
> So the driver first does that dma_sync_single_for_cpu() so the CPU can
> see the current state (for the non-cache-coherent case it would just
> invalidate caches).
>
> The driver then examines the command buffer state, sees that it's
> still in progress, and returns -EINPROGRESS.
>
> It's actually very natural in that situation to flush the caches from
> the CPU side again. And so dma_sync_single_for_device() is a fairly
> reasonable thing to do in that situation.
>
> But it doesn't seem *required*, no. The CPU caches only have a copy of
> the data in them, no writeback needed (and writeback wouldn't work
> since DMA from the device may be in progress).
>
> So I don't think the dma_sync_single_for_device() is *wrong* per se,
> because the CPU didn't actually do any modifications.
>
> But yes, I think it's unnecessary - because any later CPU accesses
> would need that dma_sync_single_for_cpu() anyway, which should
> invalidate any stale caches.

OK, the above was basically how I understood it. Thank you for
confirming!
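
Just to have it written down, the full polling pattern as you describe
it would be something like this (a minimal sketch under the classic
exclusive-ownership model; desc_is_done() is a stand-in for the
device-specific completion check):

	static int poll_rx_desc(struct device *dev, dma_addr_t addr,
				void *vaddr, size_t size)
	{
		/* Take ownership; invalidates CPU caches on non-coherent HW. */
		dma_sync_single_for_cpu(dev, addr, size, DMA_FROM_DEVICE);

		if (!desc_is_done(vaddr)) {
			/* Hand the buffer back. Not strictly required on
			 * non-coherent (non-bounce) hardware, since the
			 * next poll starts with sync_for_cpu anyway. */
			dma_sync_single_for_device(dev, addr, size,
						   DMA_FROM_DEVICE);
			return -EINPROGRESS;
		}

		/* Descriptor complete: the buffer now belongs to the CPU. */
		return 0;
	}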

> And it clearly doesn't work in a bounce-buffer situation, but honestly
> I don't think a "CPU modified buffers concurrently with DMA" can
> *ever* work in that situation, so I think it's wrong for a bounce
> buffer model to ever do anything in the dma_sync_single_for_device()
> situation.

Right.

> Does removing that dma_sync_single_for_device() actually fix the
> problem for the ath driver?

I am hoping Oleksandr can help answer that since my own ath9k hardware
is currently on strike :(

-Toke

Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Toke Høiland-Jørgensen via iommu
Robin Murphy  writes:

> On 2022-03-24 16:31, Christoph Hellwig wrote:
>> On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
>>>> I'm looking into this; but in the interest of a speedy resolution of
>>>> the regression I would be in favour of merging that partial revert
>>>> and reinstating it if/when we identify (and fix) any bugs in ath9k :)
>>>
>>> This looks fishy:
>>>
>>> ath9k/recv.c
>>>
>>> 	/* We will now give hardware our shiny new allocated skb */
>>> 	new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
>>> 				      common->rx_bufsize, dma_type);
>>> 	if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
>>> 		dev_kfree_skb_any(requeue_skb);
>>> 		goto requeue_drop_frag;
>>> 	}
>>>
>>> 	/* Unmap the frame */
>>> 	dma_unmap_single(sc->dev, bf->bf_buf_addr,
>>> 			 common->rx_bufsize, dma_type);
>>>
>>> 	bf->bf_mpdu = requeue_skb;
>>> 	bf->bf_buf_addr = new_buf_addr;
>> 
>> Creating a new mapping for the same buffer before unmapping the
>> previous one does look rather bogus. But it does not fit the
>> pattern where reverting the sync_single changes makes the driver
>> work again.
>
> OK, you made me look :)
>
> Now that it's obvious what to look for, I can only conclude that during 
> the stanza in ath_edma_get_buffers(), the device is still writing to the 
> buffer while ownership has been transferred to the CPU, and whatever got 
> written while ath9k_hw_process_rxdesc_edma() was running then gets wiped 
> out by the subsequent sync_for_device, which currently resets the 
> SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of 
> the DMA API that's not allowed, but on the other hand I'm not sure if we 
> even have a good idiom for "I can't tell if the device has finished with 
> this buffer or not unless I look at it" :/

Right, but is that sync_for_device call really needed? AFAICT, that
ath9k_hw_process_rxdesc_edma() invocation doesn't actually modify any of
the data when it returns EINPROGRESS, so could we just skip it? Like
the patch below? Or am I misunderstanding the semantics here?

-Toke


diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index 0c0624a3b40d..19244d4c0ada 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -647,12 +647,8 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
 			common->rx_bufsize, DMA_FROM_DEVICE);
 
 	ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
-	if (ret == -EINPROGRESS) {
-		/*let device gain the buffer again*/
-		dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
-				common->rx_bufsize, DMA_FROM_DEVICE);
+	if (ret == -EINPROGRESS)
 		return false;
-	}
 
 	__skb_unlink(skb, &rx_edma->rx_fifo);
 	if (ret == -EINVAL) {


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Toke Høiland-Jørgensen via iommu
Robin Murphy  writes:

> On 2022-03-24 10:25, Oleksandr Natalenko wrote:
>> Hello.
>> 
>> On Thursday, 24 March 2022 at 6:57:32 CET, Christoph Hellwig wrote:
>>> On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
>>>> I'll admit I still never quite grasped the reason for also adding the
>>>> override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think
>>>> by that point we were increasingly tired and confused and starting to
>>>> second-guess ourselves (well, I was, at least). I don't think it's wrong
>>>> per se, but as I said I do think it can bite anyone who's been doing
>>>> dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3
>>>> alone turns out to work OK then I'd be inclined to try a partial revert of
>>>> just that one hunk.
>>>
>>> Agreed.  Let's try that first.
>>>
>>> Oleksandr, can you try the patch below:
>>>
>>>
>>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>>> index 6db1c475ec827..6c350555e5a1c 100644
>>> --- a/kernel/dma/swiotlb.c
>>> +++ b/kernel/dma/swiotlb.c
>>> @@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
>>>  void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
>>>  		size_t size, enum dma_data_direction dir)
>>>  {
>>> -	/*
>>> -	 * Unconditional bounce is necessary to avoid corruption on
>>> -	 * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
>>> -	 * the whole lengt of the bounce buffer.
>>> -	 */
>>> -	swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>>> -	BUG_ON(!valid_dma_direction(dir));
>>> +	if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +		swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>>> +	else
>>> +		BUG_ON(dir != DMA_FROM_DEVICE);
>>>  }
>>>  
>>>  void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
>>>
>> 
>> With this patch the AP works for me.
>
> Cool, thanks for confirming. So I think ath9k probably is doing 
> something dodgy with dma_sync_*(), but if Linus prefers to make the 
> above change rather than wait for that to get figured out, I believe 
> that should be fine.

I'm looking into this; but in the interest of a speedy resolution of the
regression I would be in favour of merging that partial revert and
reinstating it if/when we identify (and fix) any bugs in ath9k :)

-Toke