Re: [PATCH 2/2] ath10k: do not use coherent memory for tx buffers
Felix Fietkau writes:

>>> However, on the device that I'm testing on
>>> (IPQ806x based), this patch makes the difference between working and
>>> non-working wifi, fixing the regression introduced by your pre-allocated
>>> coherent memory patch.
>>
>> Thank you for the catch up and fix.
>> Btw, the regression can be fixed by using GFP_KERNEL, instead of
>> GFP_DMA, right?
>
> I just did some timing measurements, and it seems that the DMA coherent
> variant is roughly 200 nanoseconds faster. Maybe the extra latency is
> caused by the CPU filling the cacheline from RAM first.
>
> Kalle, please only merge the first one and drop this patch.
> I will send a replacement for it.

Ok, patch 2 dropped.

-- 
Kalle Valo
___
ath10k mailing list
ath10k@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/ath10k
Re: [PATCH 2/2] ath10k: do not use coherent memory for tx buffers
On 2015-11-23 19:50, Peter Oh wrote:
>
> On 11/23/2015 10:18 AM, Felix Fietkau wrote:
>> On 2015-11-23 18:25, Peter Oh wrote:
>>> Hi,
>>>
>>> Have you measured the peak throughput?
>>> The pre-allocated coherent memory concept was introduced as one of the
>>> peak throughput improvements.
>> It's all still pre-allocated and pre-mapped.
> Right. I guessed wrong based on the title.
>>
>>> IIRC, dma_map_single takes about 4 us on Cortex A7 and dma_unmap_single
>>> also takes time to invalidate the cache.
>> That's why I didn't put a map/unmap in the hot path. There is only a
>> cache sync there. With coherent memory, every single word access blocks
>> until the transaction is complete. With cached/mapped memory, the CPU
>> can fill the cachelines first, then flush them in one go. This usually
>> ends up being faster than working with coherent memory directly.
>>
>>> Please share your tput number before and after, so I don't need to worry
>>> about performance degradation.
>> I don't have an ideal setup for tput tests at the moment, so I can't
>> give you any numbers.
> Could you share any rough number?
>> However, on the device that I'm testing on
>> (IPQ806x based), this patch makes the difference between working and
>> non-working wifi, fixing the regression introduced by your pre-allocated
>> coherent memory patch.
> Thank you for the catch up and fix.
> Btw, the regression can be fixed by using GFP_KERNEL, instead of
> GFP_DMA, right?

I just did some timing measurements, and it seems that the DMA coherent
variant is roughly 200 nanoseconds faster. Maybe the extra latency is
caused by the CPU filling the cacheline from RAM first.

Kalle, please only merge the first one and drop this patch.
I will send a replacement for it.

- Felix
Re: [PATCH 2/2] ath10k: do not use coherent memory for tx buffers
On 23.11.2015 at 18:25, Peter Oh wrote:
> Hi,
>
> Have you measured the peak throughput?
> The pre-allocated coherent memory concept was introduced as one of the
> peak throughput improvements.
> IIRC, dma_map_single takes about 4 us on Cortex A7 and dma_unmap_single
> also takes time to invalidate the cache.
> Please share your tput number before and after, so I don't need to worry
> about performance degradation.

Yes, and this concept breaks the Qualcomm IPQ806x platform (which has two
QCA99XX cards by default). It does not work, since the pre-allocated
concept allocates more memory than is available in DMA space on that
platform.

Thanks,
Sebastian

> Thanks,
> Peter
>
> On 11/23/2015 05:18 AM, Felix Fietkau wrote:
>> Coherent memory is expensive to access, since all memory accesses bypass
>> the cache. It is also completely unnecessary for this case.
>> Convert to mapped memory instead and use the DMA API to flush the cache
>> where necessary.
>> Fixes allocation failures on embedded devices.
>>
>> Signed-off-by: Felix Fietkau
>> ---
>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 77 +---
>>  1 file changed, 51 insertions(+), 26 deletions(-)
>>
>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> index 8f76b9d..99d9793 100644
>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>> @@ -100,7 +100,7 @@ void ath10k_htt_tx_free_msdu_id(struct ath10k_htt *htt, u16 msdu_id)
>>  int ath10k_htt_tx_alloc(struct ath10k_htt *htt)
>>  {
>>  	struct ath10k *ar = htt->ar;
>> -	int ret, size;
>> +	int size;
>>
>>  	ath10k_dbg(ar, ATH10K_DBG_BOOT, "htt tx max num pending tx %d\n",
>>  		   htt->max_num_pending_tx);
>> @@ -109,39 +109,41 @@ int ath10k_htt_tx_alloc(struct ath10k_htt *htt)
>>  	idr_init(&htt->pending_tx);
>>
>>  	size = htt->max_num_pending_tx * sizeof(struct ath10k_htt_txbuf);
>> -	htt->txbuf.vaddr = dma_alloc_coherent(ar->dev, size,
>> -					      &htt->txbuf.paddr,
>> -					      GFP_DMA);
>> -	if (!htt->txbuf.vaddr) {
>> -		ath10k_err(ar, "failed to alloc tx buffer\n");
>> -		ret = -ENOMEM;
>> +	htt->txbuf.vaddr = kzalloc(size, GFP_KERNEL);
>> +	if (!htt->txbuf.vaddr)
>>  		goto free_idr_pending_tx;
>> -	}
>> +
>> +	htt->txbuf.paddr = dma_map_single(ar->dev, htt->txbuf.vaddr, size,
>> +					  DMA_TO_DEVICE);
>> +	if (dma_mapping_error(ar->dev, htt->txbuf.paddr))
>> +		goto free_txbuf_vaddr;
>>
>>  	if (!ar->hw_params.continuous_frag_desc)
>> -		goto skip_frag_desc_alloc;
>> +		return 0;
>>
>>  	size = htt->max_num_pending_tx * sizeof(struct htt_msdu_ext_desc);
>> -	htt->frag_desc.vaddr = dma_alloc_coherent(ar->dev, size,
>> -						  &htt->frag_desc.paddr,
>> -						  GFP_DMA);
>> -	if (!htt->frag_desc.vaddr) {
>> -		ath10k_warn(ar, "failed to alloc fragment desc memory\n");
>> -		ret = -ENOMEM;
>> +	htt->frag_desc.vaddr = kzalloc(size, GFP_KERNEL);
>> +	if (!htt->frag_desc.vaddr)
>>  		goto free_txbuf;
>> -	}
>>
>> -skip_frag_desc_alloc:
>> +	htt->frag_desc.paddr = dma_map_single(ar->dev, htt->frag_desc.vaddr,
>> +					      size, DMA_TO_DEVICE);
>> +	if (dma_mapping_error(ar->dev, htt->txbuf.paddr))
>> +		goto free_frag_desc;
>> +
>>  	return 0;
>>
>> +free_frag_desc:
>> +	kfree(htt->frag_desc.vaddr);
>>  free_txbuf:
>>  	size = htt->max_num_pending_tx *
>>  			  sizeof(struct ath10k_htt_txbuf);
>> -	dma_free_coherent(htt->ar->dev, size, htt->txbuf.vaddr,
>> -			  htt->txbuf.paddr);
>> +	dma_unmap_single(htt->ar->dev, htt->txbuf.paddr, size, DMA_TO_DEVICE);
>> +free_txbuf_vaddr:
>> +	kfree(htt->txbuf.vaddr);
>>  free_idr_pending_tx:
>>  	idr_destroy(&htt->pending_tx);
>> -	return ret;
>> +	return -ENOMEM;
>>  }
>>
>>  static int ath10k_htt_tx_clean_up_pending(int msdu_id, void *skb, void *ctx)
>> @@ -170,15 +172,17 @@ void ath10k_htt_tx_free(struct ath10k_htt *htt)
>>  	if (htt->txbuf.vaddr) {
>>  		size = htt->max_num_pending_tx *
>>  				  sizeof(struct ath10k_htt_txbuf);
>> -		dma_free_coherent(htt->ar->dev, size, htt->txbuf.vaddr,
>> -				  htt->txbuf.paddr);
>> +		dma_unmap_single(htt->ar->dev, htt->txbuf.paddr, size,
>> +				 DMA_TO_DEVICE);
>> +		kfree(htt->txbuf.vaddr);
>>  	}
>>
>>  	if (htt->frag_desc.vaddr) {
>>  		size = htt->max_num_pending_tx *
>>  				  sizeof(struct htt_msdu_ext_desc);
>> -		dma_free_coherent(htt->ar->dev, size, htt->frag_desc.vaddr,
>> -				  htt->frag_desc.paddr);
>> +		dma_unmap_single(htt->ar->dev, htt->frag_desc.paddr, size,
>> +				 DMA_TO_DEVICE);
>> +		kfree(htt->frag_desc.vaddr);
>>  	}
>>  }
>>
>> @@ -550,6 +554,7 @@ int ath10k_htt_tx(struct ath10k_htt *htt, struct sk_buff *msdu)
>>  	struct htt_msdu_ext_desc *ext_desc = NULL;
>>  	bool limit_mgmt_desc = false;
>>  	bool is_probe_resp = false;
>> +	int txbuf_offset, frag_offset, frag_size;
>>
>>  	if (unlikely(ieee80211_is_mgmt(hdr->frame_control)) &&
Re: [PATCH 2/2] ath10k: do not use coherent memory for tx buffers
On 2015-11-23 18:25, Peter Oh wrote:
> Hi,
>
> Have you measured the peak throughput?
> The pre-allocated coherent memory concept was introduced as one of the
> peak throughput improvements.
It's all still pre-allocated and pre-mapped.

> IIRC, dma_map_single takes about 4 us on Cortex A7 and dma_unmap_single
> also takes time to invalidate the cache.
That's why I didn't put a map/unmap in the hot path. There is only a
cache sync there. With coherent memory, every single word access blocks
until the transaction is complete. With cached/mapped memory, the CPU
can fill the cachelines first, then flush them in one go. This usually
ends up being faster than working with coherent memory directly.

> Please share your tput number before and after, so I don't need to worry
> about performance degradation.
I don't have an ideal setup for tput tests at the moment, so I can't
give you any numbers. However, on the device that I'm testing on
(IPQ806x based), this patch makes the difference between working and
non-working wifi, fixing the regression introduced by your pre-allocated
coherent memory patch.

- Felix
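Editor's sketch, not part of the original mail: the "only a cache sync in the hot path" idea relies on the whole array being mapped once, so that each transmit only needs the byte range of its own slot. The user-space helper below works out that range from an MSDU id. The names `txbuf_slot`, `sync_range` and `txbuf_sync_range`, and the 64-byte slot layout, are assumptions made for illustration and do not come from the patch. In a driver, the resulting offset and length would presumably be passed to something like dma_sync_single_range_for_device() after the CPU has written the slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct ath10k_htt_txbuf. The real structure
 * has a different layout; only its size matters for this sketch.
 */
struct txbuf_slot {
	uint8_t hdr[32];
	uint8_t frags[32];
};

/* Byte range inside the pre-mapped array that a single MSDU touches. */
struct sync_range {
	size_t offset;	/* distance from the start of the mapped buffer */
	size_t len;	/* number of bytes to flush to the device */
};

/*
 * Compute the slice of the pre-mapped buffer belonging to msdu_id.
 * max_pending plays the role of max_num_pending_tx: the number of slots
 * that were allocated and mapped up front.
 */
static struct sync_range txbuf_sync_range(uint16_t msdu_id,
					  uint16_t max_pending)
{
	struct sync_range r;

	/* An id outside the pre-allocated array would be a caller bug. */
	assert(msdu_id < max_pending);

	r.offset = (size_t)msdu_id * sizeof(struct txbuf_slot);
	r.len = sizeof(struct txbuf_slot);
	return r;
}
```

With this layout, calling txbuf_sync_range(3, 16) selects the fourth slot, so only that slot's cache lines would need to be written back, rather than the whole buffer being mapped and unmapped per packet.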
Re: [PATCH 2/2] ath10k: do not use coherent memory for tx buffers
On 11/23/2015 10:18 AM, Felix Fietkau wrote:
> On 2015-11-23 18:25, Peter Oh wrote:
>> Hi,
>>
>> Have you measured the peak throughput?
>> The pre-allocated coherent memory concept was introduced as one of the
>> peak throughput improvements.
> It's all still pre-allocated and pre-mapped.
Right. I guessed wrong based on the title.
>
>> IIRC, dma_map_single takes about 4 us on Cortex A7 and dma_unmap_single
>> also takes time to invalidate the cache.
> That's why I didn't put a map/unmap in the hot path. There is only a
> cache sync there. With coherent memory, every single word access blocks
> until the transaction is complete. With cached/mapped memory, the CPU
> can fill the cachelines first, then flush them in one go. This usually
> ends up being faster than working with coherent memory directly.
>
>> Please share your tput number before and after, so I don't need to worry
>> about performance degradation.
> I don't have an ideal setup for tput tests at the moment, so I can't
> give you any numbers.
Could you share any rough number?
> However, on the device that I'm testing on
> (IPQ806x based), this patch makes the difference between working and
> non-working wifi, fixing the regression introduced by your pre-allocated
> coherent memory patch.
Thank you for the catch up and fix.
Btw, the regression can be fixed by using GFP_KERNEL, instead of
GFP_DMA, right?
>
> - Felix

Thanks,
Peter