Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Thu, Apr 20, 2023 at 06:42:17PM +0200, Alexander Lobakin wrote:
> When there's no recycling of pages, then yes. And since recycling is
> done asynchronously, sometimes new allocations happen either way.
> Anyways, that was roughly a couple years ago right when you introduced
> dma_alloc_noncoherent(). Things might've been changed since then.
> I could try again while next is closed (i.e. starting this Sunday), the
> only thing I'd like to mention: Page Pool allocates pages via
> alloc_pages_bulk_array_node(). Bulking helps a lot (and PP uses bulks of
> 16 IIRC), explicit node setting helps when Rx queues are distributed
> between several nodes. We can then have one struct device for several
> nodes.
> As I can see, there's now no function to allocate in bulks and no
> explicit node setting option (e.g. mlx5 works around this using
> set_dev_node() + allocate + set_dev_node(orig_node)). Could such options
> be added in the near future? That would help a lot with switching to the
> functions intended for use when DMA mappings can stay for a long time.
> From what I see from the code, that shouldn't be a problem (except for
> non-direct DMA cases, where we'd need to introduce new callbacks or
> extend the existing ones).

So the node hint is something we can trivially pass through, and
something the mlx5 maintainers should have done from the beginning
instead of this nasty hack. Patches gladly accepted.

An alloc_pages_bulk_array_node-like allocator also seems doable, we
just need to make sure it has a decent fallback, as I don't think we
can wire it up to all the crazy legacy iommu drivers.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Tue, Apr 25, 2023 at 04:12:05AM -0400, Michael S. Tsirkin wrote:
> In theory, absolutely. In practice modern virtio devices are ok,
> the reason we are stuck supporting old legacy ones is because legacy
> devices are needed to run old guests. And then people turn
> around and run a new guest on the same device,
> for example because they switch back and forth e.g.
> for data recovery? Or because whoever is selling the
> host wants to opt for maximum compatibility.
>
> Teaching all of linux to sometimes use dma and sometimes not
> is a lot of work, and for limited benefit of these legacy systems.

It's not like virtio is the only case where blindly assuming you can
use DMA operations in a higher layer is the problem.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Thu, Apr 20, 2023 at 09:18:21AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 20, 2023 at 05:11:48PM +0800, Xuan Zhuo wrote:
> > I know that the current design of the DMA API only supports some
> > physical hardware, but can it be modified or expanded?
>
> I think the important point is that for some cases there is no need
> to dma map at all, and upper layers should be fine with that by just
> doing the dma mapping in helpers called by the driver.
>
> The virtio drivers then check if platform_access is set, then call the
> generic dma mapping helper, or if not just allocate memory using
> alloc_pages and also skip all the sync calls.

In theory, absolutely. In practice modern virtio devices are ok; the
reason we are stuck supporting old legacy ones is because legacy
devices are needed to run old guests. And then people turn around and
run a new guest on the same device, for example because they switch
back and forth, e.g. for data recovery, or because whoever is selling
the host wants to opt for maximum compatibility.

Teaching all of linux to sometimes use dma and sometimes not is a lot
of work, and for limited benefit of these legacy systems. We do it in
a limited number of cases, but generally making DMA itself DTRT sounds
more attractive. So special DMA ops for these makes some sense: yes,
the firmware-described DMA is wrong on these boxes, but buggy firmware
is not so unusual, is it?

Given that virtio devices actually are on a virtual bus (the virtio
bus), sticking the fake DMA ops on this bus seems to make sense as a
way to express this quirk. No?

--
MST
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, 24 Apr 2023 17:28:01 +0200, Alexander Lobakin wrote:
> From: Jakub Kicinski
> Date: Fri, 21 Apr 2023 06:50:59 -0700
>
> > On Fri, 21 Apr 2023 15:31:04 +0800 Xuan Zhuo wrote:
> >> I am not particularly familiar with dma-bufs. I want to know if
> >> this mechanism can solve the problem of virtio-net.
> >>
> >> I saw this framework, allowing the driver to do something inside
> >> the ops of dma-bufs.
> >>
> >> If so, is it possible to propose a new patch based on dma-bufs?
> >
> > I haven't looked in detail, maybe Olek has? AFAIU you'd need to rework
>
> Oh no, not me. I suck at dma-bufs, tried to understand them several
> times with no progress :D My knowledge is limited to "ok, if it's
> DMA + userspace, then it's likely dma-buf" :smile_with_tear:
>
> > uAPI of XSK to allow user to pass in a dma-buf region rather than just
> > a user VA. So it may be a larger effort but architecturally it may be
> > the right solution.
>
> I'm curious whether this could be done without tons of work. Switching
> Page Pool to dma_alloc_noncoherent() is simpler :D But, as I wrote
> above, we need to extend the DMA API first to provide bulk allocations
> and NUMA-aware allocations.
> Can't we provide a shim for back-compat, i.e. if a program passes just
> a user VA, create a dma-buf in the kernel already?

Yes, I think so too. If that is the case, would the workload be much
smaller? Let me try it.

Thanks.

> Thanks,
> Olek
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Fri, 21 Apr 2023 06:50:59 -0700, Jakub Kicinski wrote:
> On Fri, 21 Apr 2023 15:31:04 +0800 Xuan Zhuo wrote:
> > I am not particularly familiar with dma-bufs. I want to know if
> > this mechanism can solve the problem of virtio-net.
> >
> > I saw this framework, allowing the driver to do something inside
> > the ops of dma-bufs.
> >
> > If so, is it possible to propose a new patch based on dma-bufs?
>
> I haven't looked in detail, maybe Olek has? AFAIU you'd need to rework
> uAPI of XSK to allow user to pass in a dma-buf region rather than just
> a user VA.

This seems to be a big job. Can we accept this patch first?

Thanks.

> So it may be a larger effort but architecturally it may be
> the right solution.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Thu, 20 Apr 2023 07:13:49 -0700, Jakub Kicinski wrote:
> On Wed, 19 Apr 2023 23:19:22 -0700 Christoph Hellwig wrote:
> > > In this case yes, pinned user memory, it gets sliced up into MTU sized
> > > chunks, fed into an Rx queue of a device, and user can see packets
> > > without any copies.
> >
> > How long is the life time of these mappings? Because dma_map_*
> > assumes a temporary mapping and not one that is pinned basically
> > forever.
>
> Yeah, this one is "for ever".
>
> > > Quite similar use case #2 is upcoming io_uring / "direct placement"
> > > patches (former from Meta, latter from Google) which will try to
> > > receive just the TCP data into pinned user memory.
> >
> > I don't think we can just long term pin user memory here. E.g. for
> > confidential computing cases we can't even ever do DMA straight to
> > userspace. I had that conversation with Meta's block folks who
> > want to do something similar with io_uring and the only option is
> > an allocator for memory that is known DMAable, e.g. through dma-bufs.
> >
> > You guys really all need to get together and come up with a scheme
> > that actually works instead of piling these hacks over hacks.
>
> Okay, that simplifies various aspects. We'll just use dma-bufs from
> the start in the new APIs.

I am not particularly familiar with dma-bufs. I want to know if this
mechanism can solve the problem of virtio-net.

I saw this framework, allowing the driver to do something inside the
ops of dma-bufs.

If so, is it possible to propose a new patch based on dma-bufs?

Thanks.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Thu, Apr 20, 2023 at 05:11:48PM +0800, Xuan Zhuo wrote:
> I know that the current design of the DMA API only supports some
> physical hardware, but can it be modified or expanded?

I think the important point is that for some cases there is no need
to dma map at all, and upper layers should be fine with that by just
doing the dma mapping in helpers called by the driver.

The virtio drivers then check if platform_access is set, then call the
generic dma mapping helper, or if not just allocate memory using
alloc_pages and also skip all the sync calls.
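[Editorial note: the branch Christoph describes (map only when platform_access is set, otherwise hand the device the raw address and skip all later syncs) can be sketched in plain userspace C. Everything below — fake_buf, fake_dma_map, the platform_access flag — is a hypothetical stand-in for illustration, not the real virtio or DMA API.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel helpers discussed above. */
typedef uint64_t dma_addr_t;

struct fake_buf {
	void *cpu_addr;
	dma_addr_t dma; /* == raw CPU address when no real mapping is made */
	bool mapped;    /* true only when a real dma mapping was created */
};

/* Pretend "generic dma mapping helper": returns a translated address. */
static dma_addr_t fake_dma_map(void *cpu_addr)
{
	return (dma_addr_t)(uintptr_t)cpu_addr ^ 0x1000; /* fake IOVA */
}

/* Map only when the platform requires it; otherwise hand out the raw
 * address and remember that no sync calls are needed later. */
static int fake_buf_init(struct fake_buf *buf, bool platform_access)
{
	buf->cpu_addr = malloc(4096);
	if (!buf->cpu_addr)
		return -1;
	if (platform_access) {
		buf->dma = fake_dma_map(buf->cpu_addr);
		buf->mapped = true;
	} else {
		buf->dma = (dma_addr_t)(uintptr_t)buf->cpu_addr;
		buf->mapped = false;
	}
	return 0;
}
```

The point of the sketch is that the decision lives in one driver-side helper, so upper layers (like the XSk core) never have to know which case they are in.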
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Thu, Apr 20, 2023 at 03:59:39PM +0200, Alexander Lobakin wrote:
> Hmm, currently almost all Ethernet drivers map Rx pages once and then
> just recycle them, keeping the original DMA mapping. Which means pages
> can have the same first mapping for a very long time, often even for
> the lifetime of the struct device. Same for XDP sockets: the lifetime
> of DMA mappings equals the lifetime of the sockets.
> Does it mean we'd better review that approach and try switching to the
> dma_alloc_*() family (non-coherent/caching in our case)?

Yes, exactly. dma_alloc_noncoherent can be used exactly like
alloc_pages + dma_map_* by the driver (including the dma_sync_* calls
on reuse), but has a huge number of advantages.

> Also, I remember I tried to do that for one of my drivers, but the
> thing that all those functions zero the whole page(s) before returning
> them to the driver ruins the performance -- we don't need to zero
> buffers for receiving packets and spend a ton of cycles on it (esp. in
> cases when 4k gets zeroed each time, but your main body of traffic is
> 64-byte frames).

Hmm, does the single zeroing when doing the initial allocation really
show up in these profiles?
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Wed, 19 Apr 2023 23:19:22 -0700, Christoph Hellwig wrote:
> On Wed, Apr 19, 2023 at 09:45:06AM -0700, Jakub Kicinski wrote:
> > > Can you explain what the actual use case is?
> > >
> > > From the original patchset I suspect it is dma mapping something
> > > very long term and then maybe doing syncs on it as needed?
> >
> > In this case yes, pinned user memory, it gets sliced up into MTU sized
> > chunks, fed into an Rx queue of a device, and user can see packets
> > without any copies.
>
> How long is the life time of these mappings? Because dma_map_*
> assumes a temporary mapping and not one that is pinned basically
> forever.
>
> > Quite similar use case #2 is upcoming io_uring / "direct placement"
> > patches (former from Meta, latter from Google) which will try to
> > receive just the TCP data into pinned user memory.
>
> I don't think we can just long term pin user memory here. E.g. for
> confidential computing cases we can't even ever do DMA straight to
> userspace. I had that conversation with Meta's block folks who
> want to do something similar with io_uring and the only option is
> an allocator for memory that is known DMAable, e.g. through dma-bufs.
>
> You guys really all need to get together and come up with a scheme
> that actually works instead of piling these hacks over hacks.

I think the cases Jakub mentioned are new requirements. From the
implementation point of view, compared to the patch I submitted,
expanding the DMA API so that it is compatible with some special
hardware would be a good solution.

I know that the current design of the DMA API only supports some
physical hardware, but can it be modified or expanded?

Still the previous idea: can we add a new ops (not dma_ops) to struct
device, configured by the driver, so that the DMA API can be
compatible with some special hardware?

Thanks.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Wed, Apr 19, 2023 at 09:45:06AM -0700, Jakub Kicinski wrote:
> > Can you explain what the actual use case is?
> >
> > From the original patchset I suspect it is dma mapping something very
> > long term and then maybe doing syncs on it as needed?
>
> In this case yes, pinned user memory, it gets sliced up into MTU sized
> chunks, fed into an Rx queue of a device, and user can see packets
> without any copies.

How long is the life time of these mappings? Because dma_map_*
assumes a temporary mapping and not one that is pinned basically
forever.

> Quite similar use case #2 is upcoming io_uring / "direct placement"
> patches (former from Meta, latter from Google) which will try to
> receive just the TCP data into pinned user memory.

I don't think we can just long term pin user memory here. E.g. for
confidential computing cases we can't even ever do DMA straight to
userspace. I had that conversation with Meta's block folks who want to
do something similar with io_uring, and the only option is an
allocator for memory that is known DMAable, e.g. through dma-bufs.

You guys really all need to get together and come up with a scheme
that actually works instead of piling these hacks over hacks.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Wed, Apr 19, 2023 at 03:14:48PM +0200, Alexander Lobakin wrote:
> >>> dma addresses and thus dma mappings are completely driver specific.
> >>> Upper layers have no business looking at them.
>
> Here it's not an "upper layer". XSk core doesn't look at them or pass
> them between several drivers.

Same for upper layers :) They just do abstract operations that can sit
on top of a variety of drivers.

> It maps DMA solely via the struct device
> passed from the driver and then just gets-sets addresses for this
> driver only. Just like Page Pool does for regular Rx buffers. This got
> moved to the XSk core to not repeat the same code pattern in each
> driver.

Which assumes that:

 a) a DMA mapping needs to be done at all
 b) it can be done using a struct device exposed to it
 c) that DMA mapping is actually at the same granularity that it
    operates on

all of which might not be true.

> > From the original patchset I suspect it is dma mapping something very
> > long term and then maybe doing syncs on it as needed?
>
> As I mentioned, XSk provides some handy wrappers to map DMA for
> drivers. Previously, XSk was supported by real hardware drivers only,
> but here the developer tries to add support to virtio-net. I suspect
> he needs to use DMA mapping functions different from those the regular
> drivers use.

Yes. For actual hardware virtio and some more complex virtualized
setups it works just like real hardware. For legacy virtio there is no
DMA mapping involved at all. Because of that, all DMA mapping needs to
be done inside of virtio.

> So this is far from dma_map_ops, the author picked a wrong name :D
> And correct, for XSk we map one big piece of memory only once and then
> reuse it for buffers, no inflight map/unmap on hotpath (only syncs
> when needed). So this mapping is longterm and is stored in the XSk
> core structure assigned to the driver which this mapping was done for.
> I think Jakub thinks of something similar, but for the "regular"
> Rx/Tx, not only XDP sockets :)

FYI, dma_map_* is not intended for long term mappings; it can lead to
starvation issues. You need to use dma_alloc_* instead.

And "you" in that case is, as I said, the driver, not an upper layer.
If it's just a helper called by drivers and never from core code,
that's of course fine.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Wed, Apr 19, 2023 at 03:22:39PM +0200, Alexander Lobakin wrote:
> If DMA syncs are not needed on your x86_64 DMA-coherent system, it
> doesn't mean we all don't need it.

If the DMA isn't actually a DMA (as in the virtio case, or other cases
that instead have to do their own dma mapping at much lower layers),
syncs generally don't make sense.

> Instead of filling pointers with
> "default" callbacks, you could instead avoid indirect calls at all
> when no custom DMA ops are specified. Pls see how for example
> Christoph did that for direct DMA. It would cost only one if-else for
> the case without custom DMA ops here instead of an indirect call each
> time.

So yes, I think the abstraction here should not be another layer of
DMA ops, but the option to DMA map or not at all.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Wed, 19 Apr 2023 15:22:39 +0200, Alexander Lobakin wrote:
> From: Xuan Zhuo
> Date: Mon, 17 Apr 2023 11:27:50 +0800
>
> > The purpose of this patch is to allow the driver to pass its own
> > dma_ops to xsk.
> >
> > This is to cope with the case of virtio-net. If virtio does not have
> > VIRTIO_F_ACCESS_PLATFORM, then virtio cannot use the DMA API. In this
> > case, XSK cannot use the DMA API directly to obtain DMA addresses.
> > Based on this, we must let XSK support the driver's own dma_ops.
> >
> > On the other hand, XSK as high-level code should leave the underlying
> > DMA operations to the driver layer. The driver layer determines the
> > final DMA implementation; XSK should not make such assumptions.
> > Everything is simplified if DMA is done at the driver level.
> >
> > More is here:
> >
> > https://lore.kernel.org/virtualization/1681265026.6082013-1-xuanz...@linux.alibaba.com/
> >
> > Signed-off-by: Xuan Zhuo
>
> [...]
>
> >  struct xsk_buff_pool {
> >  	/* Members only used in the control path first. */
> >  	struct device *dev;
> > @@ -85,6 +102,7 @@ struct xsk_buff_pool {
> >  	 * sockets share a single cq when the same netdev and queue id is
> >  	 * shared.
> >  	 */
> >  	spinlock_t cq_lock;
> > +	struct xsk_dma_ops dma_ops;
>
> Why the full struct and not a const pointer? You'll have indirect
> calls either way; copying the full struct won't reclaim you much
> performance.
>
> >  	struct xdp_buff_xsk *free_heads[];
> >  };
>
> [...]
>
> > @@ -424,18 +426,29 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
> >  		return 0;
> >  	}
> >
> > +	if (!dma_ops) {
> > +		pool->dma_ops.map_page = dma_map_page_attrs;
> > +		pool->dma_ops.mapping_error = dma_mapping_error;
> > +		pool->dma_ops.need_sync = dma_need_sync;
> > +		pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
> > +		pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
> > +		dma_ops = &pool->dma_ops;
> > +	} else {
> > +		pool->dma_ops = *dma_ops;
> > +	}
>
> If DMA syncs are not needed on your x86_64 DMA-coherent system, it
> doesn't mean we all don't need it. Instead of filling pointers with
> "default" callbacks, you could instead avoid indirect calls at all
> when no custom DMA ops are specified. Pls see how for example
> Christoph did that for direct DMA. It would cost only one if-else for
> the case without custom DMA ops here instead of an indirect call each
> time.
>
> (I *could* suggest using INDIRECT_CALL_WRAPPER(), but I won't, since
> it's more expensive than direct checking and I feel like it's more
> appropriate to check directly here)

OK, I will fix it in the next version.

Thanks.

> > +
> > +	dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
> > +	if (!dma_map)
> > +		return -ENOMEM;
> [...]
>
> Thanks,
> Olek
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Wed, 19 Apr 2023 15:14:48 +0200, Alexander Lobakin wrote:
> From: Christoph Hellwig
> Date: Tue, 18 Apr 2023 22:16:53 -0700
>
> > On Mon, Apr 17, 2023 at 11:19:47PM -0700, Jakub Kicinski wrote:
> >>> You can't just do dma mapping outside the driver, because there are
> >>> drivers that do not require DMA mapping at all. virtio is an
> >>> example, but all the classic s390 drivers and some other odd
> >>> virtualization ones are others.
> >>
> >> What bus are the classic s390 on (in terms of the device model)?
> >
> > I think most of them are based on struct ccw_device, but I'll let the
> > s390 maintainers fill in.
> >
> > Another interesting case that isn't really relevant for you
> > networking guys, but that caused us problems, is RDMA. For hardware
> > RDMA devices it wants the ULPs to DMA map, but it turns out we have
> > various software drivers that map to network drivers that do their
> > own DMA mapping at a much lower layer and after potentially splitting
> > packets or even mangling them.
> >
> >>>> I don't think it's reasonable to be bubbling up custom
> >>>> per-subsystem DMA ops into all of them for the sake of virtio.
> >>>
> >>> dma addresses and thus dma mappings are completely driver specific.
> >>> Upper layers have no business looking at them.
>
> Here it's not an "upper layer". XSk core doesn't look at them or pass
> them between several drivers. It maps DMA solely via the struct device
> passed from the driver and then just gets-sets addresses for this
> driver only. Just like Page Pool does for regular Rx buffers. This got
> moved to the XSk core to not repeat the same code pattern in each
> driver.
>
> >> Damn, that's unfortunate. Thinking aloud -- that means that if we
> >> want to continue to pull memory management out of networking drivers
> >> to improve it for all, cross-optimize with the rest of the stack and
> >> allow various upcoming forms of zero copy -- then we need to add an
> >> equivalent of dma_ops and DMA API locally in networking?
>
> Managing DMA addresses is totally fine as long as you don't try to
> pass mapped addresses between different drivers :D Page Pool already
> does that and I don't see a problem in that in general.
>
> > Can you explain what the actual use case is?
> >
> > From the original patchset I suspect it is dma mapping something very
> > long term and then maybe doing syncs on it as needed?
>
> As I mentioned, XSk provides some handy wrappers to map DMA for
> drivers. Previously, XSk was supported by real hardware drivers only,
> but here the developer tries to add support to virtio-net. I suspect
> he needs to use DMA mapping functions different from those the regular
> drivers use. So this is far from dma_map_ops, the author picked a
> wrong name :D

Yes, dma_ops caused some misunderstandings; it is indeed a wrong name.

Thanks.

> And correct, for XSk we map one big piece of memory only once and then
> reuse it for buffers, no inflight map/unmap on hotpath (only syncs
> when needed). So this mapping is longterm and is stored in the XSk
> core structure assigned to the driver which this mapping was done for.
> I think Jakub thinks of something similar, but for the "regular"
> Rx/Tx, not only XDP sockets :)
>
> Thanks,
> Olek
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, Apr 17, 2023 at 11:19:47PM -0700, Jakub Kicinski wrote:
> > You can't just do dma mapping outside the driver, because there are
> > drivers that do not require DMA mapping at all. virtio is an example,
> > but all the classic s390 drivers and some other odd virtualization
> > ones are others.
>
> What bus are the classic s390 on (in terms of the device model)?

I think most of them are based on struct ccw_device, but I'll let the
s390 maintainers fill in.

Another interesting case that isn't really relevant for you networking
guys, but that caused us problems, is RDMA. For hardware RDMA devices
it wants the ULPs to DMA map, but it turns out we have various
software drivers that map to network drivers that do their own DMA
mapping at a much lower layer and after potentially splitting packets
or even mangling them.

> > > I don't think it's reasonable to be bubbling up custom
> > > per-subsystem DMA ops into all of them for the sake of virtio.
> >
> > dma addresses and thus dma mappings are completely driver specific.
> > Upper layers have no business looking at them.
>
> Damn, that's unfortunate. Thinking aloud -- that means that if we want
> to continue to pull memory management out of networking drivers to
> improve it for all, cross-optimize with the rest of the stack and
> allow various upcoming forms of zero copy -- then we need to add an
> equivalent of dma_ops and DMA API locally in networking?

Can you explain what the actual use case is?

From the original patchset I suspect it is dma mapping something very
long term and then maybe doing syncs on it as needed?
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, Apr 17, 2023 at 07:54:00PM -0700, Jakub Kicinski wrote:
> AF_XDP, io_uring, and an increasing number of pinned memory / zero
> copy implementations need to do DMA mapping outside the drivers.

You can't just do dma mapping outside the driver, because there are
drivers that do not require DMA mapping at all. virtio is an example,
but all the classic s390 drivers and some other odd virtualization
ones are others.

> I don't think it's reasonable to be bubbling up custom per-subsystem
> DMA ops into all of them for the sake of virtio.

dma addresses and thus dma mappings are completely driver specific.
Upper layers have no business looking at them.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, 17 Apr 2023 18:19:50 -0700, Jakub Kicinski wrote:
> On Tue, 18 Apr 2023 09:07:30 +0800 Jason Wang wrote:
> > > > Would you mind explaining this a bit more to folks like me who
> > > > are not familiar with VirtIO? DMA API is supposed to hide the DMA
> > > > mapping details from the stack, why is it not sufficient here.
> >
> > The reason is that legacy virtio devices don't use DMA
> > (vring_use_dma_api()).
> >
> > The AF_XDP code assumes DMA for the netdev, which doesn't work in
> > this case. We need a way to make it work.
>
> Can we not push this down to the bus level? virtio has its own bus; it
> can plug whatever magic it wants into dma ops.

That is actually not possible, see [1].

[1] https://lore.kernel.org/virtualization/zducdeylqawqv...@infradead.org/

> Doesn't have to be super fast for af_xdp's sake - for af_xdp dma
> mapping is on the control path. You can keep using the
> if (vring_use_dma_api()) elsewhere for now if there is a perf concern.

Sorry, I don't quite understand this passage. The question is: if
vring_use_dma_api() is false, then we cannot use the DMA API in AF_XDP
at all.

The good news is that, except for some of the sync operations,
everything is on the control path. I think the effect on performance
is very small, because in most cases the syncs are unnecessary.

> Otherwise it really seems like we're bubbling up a virtio hack into
> generic code :(

Can we see the purpose of this as handing the DMA operations back to
the driver? Although I don't know if there are other drivers with
similar requirements.

Thanks.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, 17 Apr 2023 11:57:53 -0700, Jakub Kicinski wrote:
> On Mon, 17 Apr 2023 11:56:10 -0700 Jakub Kicinski wrote:
> > > May misunderstand, here the "dma_ops" is not the "dma_ops" of DMA
> > > API.
> > >
> > > I mean the callbacks for xsk to do dma.
> > >
> > > Maybe, I should rename it in the next version.
> >
> > Would you mind explaining this a bit more to folks like me who are
> > not familiar with VirtIO? DMA API is supposed to hide the DMA mapping
> > details from the stack, why is it not sufficient here.
>
> Umm.. also it'd help to post the user of the API in the same series.
> I only see the XSK changes, maybe if the virtio changes were in
> the same series I could answer my own question.

This [1] is similar code. It is an early version, but the idea is
similar to this patch.

[1] https://lore.kernel.org/virtualization/20230202110058.130695-1-xuanz...@linux.alibaba.com/

Thanks.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Tue, Apr 18, 2023 at 2:58 AM Jakub Kicinski wrote:
>
> On Mon, 17 Apr 2023 11:56:10 -0700 Jakub Kicinski wrote:
> > > May misunderstand, here the "dma_ops" is not the "dma_ops" of DMA
> > > API.
> > >
> > > I mean the callbacks for xsk to do dma.
> > >
> > > Maybe, I should rename it in the next version.
> >
> > Would you mind explaining this a bit more to folks like me who are
> > not familiar with VirtIO? DMA API is supposed to hide the DMA mapping
> > details from the stack, why is it not sufficient here.

The reason is that legacy virtio devices don't use DMA
(vring_use_dma_api()).

The AF_XDP code assumes DMA for the netdev, which doesn't work in this
case. We need a way to make it work.

Thanks

> Umm.. also it'd help to post the user of the API in the same series.
> I only see the XSK changes, maybe if the virtio changes were in
> the same series I could answer my own question.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, 17 Apr 2023 02:43:32 -0400, "Michael S. Tsirkin" wrote:
> On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> > @@ -532,9 +545,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
> >  	xskb->xdp.data_meta = xskb->xdp.data;
> >
> >  	if (pool->dma_need_sync) {
> > -		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
> > -						 pool->frame_len,
> > -						 DMA_BIDIRECTIONAL);
> > +		pool->dma_ops.sync_single_range_for_device(pool->dev, xskb->dma, 0,
> > +							   pool->frame_len,
> > +							   DMA_BIDIRECTIONAL);
> >  	}
> >  	return &xskb->xdp;
> >  }
> > @@ -670,15 +683,15 @@ EXPORT_SYMBOL(xp_raw_get_dma);
> >
> >  void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb)
> >  {
> > -	dma_sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> > -				      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
> > +	xskb->pool->dma_ops.sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> > +						      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
> >  }
> >  EXPORT_SYMBOL(xp_dma_sync_for_cpu_slow);
> >
> >  void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
> >  				 size_t size)
> >  {
> > -	dma_sync_single_range_for_device(pool->dev, dma, 0,
> > -					 size, DMA_BIDIRECTIONAL);
> > +	pool->dma_ops.sync_single_range_for_device(pool->dev, dma, 0,
> > +						   size, DMA_BIDIRECTIONAL);
> >  }
> >  EXPORT_SYMBOL(xp_dma_sync_for_device_slow);
>
> So you add an indirect function call on the data path? Won't this be
> costly?

Yes, this may introduce some cost. The good news is that in some cases
the sync is not necessary; dma_need_sync should be false then.

Thanks.

> > --
> > 2.32.0.3.g01195cf9f
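[Editorial note: the mitigation Xuan points at — the indirect sync callback only fires when dma_need_sync is true, so DMA-coherent systems never reach it — can be sketched in plain userspace C. The fake_pool layout and callback names below are hypothetical stand-ins for the patch's xsk_buff_pool / xsk_dma_ops, not the real kernel structures.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int sync_calls;

/* Hypothetical miniature of the xp_alloc() hunk quoted above. */
struct fake_dma_ops {
	void (*sync_single_range_for_device)(void *dev, unsigned long dma,
					     size_t off, size_t len);
};

struct fake_pool {
	bool dma_need_sync; /* false on DMA-coherent systems */
	struct fake_dma_ops dma_ops;
};

/* Stub callback that just counts how often it was reached. */
static void count_sync(void *dev, unsigned long dma, size_t off, size_t len)
{
	(void)dev;
	(void)dma;
	(void)off;
	(void)len;
	sync_calls++;
}

/* Hot-path allocation: the indirect call is gated on dma_need_sync, so
 * the common coherent case pays only one predictable branch. */
static void fake_xp_alloc(struct fake_pool *pool, unsigned long dma)
{
	if (pool->dma_need_sync)
		pool->dma_ops.sync_single_range_for_device(NULL, dma, 0, 4096);
}
```

Under this sketch, a coherent pool never dereferences the ops table on allocation, which is why the indirect-call cost MST asks about stays off the fast path in the common case.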
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
Hi Xuan,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
patch link:    https://lore.kernel.org/r/20230417032750.7086-1-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next] xsk: introduce xsk_dma_ops
config: mips-randconfig-r021-20230416 (https://download.01.org/0day-ci/archive/20230417/202304171441.ezrwcnsh-...@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project 9638da200e00bd069e6dd63604e14cbafede9324)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install mips cross compiling tool for clang build
        # apt-get install binutils-mipsel-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/28e766603a33761d7bd1fdd3e107595408319f7d
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
        git checkout 28e766603a33761d7bd1fdd3e107595408319f7d
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=mips olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot
| Link: https://lore.kernel.org/oe-kbuild-all/202304171441.ezrwcnsh-...@intel.com/

All errors (new ones prefixed by >>):

>> net/xdp/xsk_buff_pool.c:430:26: error: incompatible function pointer types assigning to 'dma_addr_t (*)(struct device *, struct page *, unsigned long, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (*)(struct device *, struct page *, unsigned long, unsigned int, enum dma_data_direction, unsigned long)') from 'dma_addr_t (struct device *, struct page *, size_t, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (struct device *, struct page *, unsigned int, unsigned int, enum dma_data_direction, unsigned long)') [-Wincompatible-function-pointer-types]
                   pool->dma_ops.map_page = dma_map_page_attrs;
                                          ^ ~~
   1 error generated.


vim +430 net/xdp/xsk_buff_pool.c

   409	
   410	int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
   411		       struct xsk_dma_ops *dma_ops,
   412		       unsigned long attrs, struct page **pages, u32 nr_pages)
   413	{
   414		struct xsk_dma_map *dma_map;
   415		dma_addr_t dma;
   416		int err;
   417		u32 i;
   418	
   419		dma_map = xp_find_dma_map(pool);
   420		if (dma_map) {
   421			err = xp_init_dma_info(pool, dma_map);
   422			if (err)
   423				return err;
   424	
   425			refcount_inc(&dma_map->users);
   426			return 0;
   427		}
   428	
   429		if (!dma_ops) {
 > 430			pool->dma_ops.map_page = dma_map_page_attrs;
   431			pool->dma_ops.mapping_error = dma_mapping_error;
   432			pool->dma_ops.need_sync = dma_need_sync;
   433			pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
   434			pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
   435			dma_ops = &pool->dma_ops;
   436		} else {
   437			pool->dma_ops = *dma_ops;
   438		}
   439	
   440		dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
   441		if (!dma_map)
   442			return -ENOMEM;
   443	
   444		for (i = 0; i < dma_map->dma_pages_cnt; i++) {
   445			dma = dma_ops->map_page(dev, pages[i], 0, PAGE_SIZE,
   446						DMA_BIDIRECTIONAL, attrs);
   447			if (dma_ops->mapping_error(dev, dma)) {
   448				__xp_dma_unmap(dma_map, dma_ops, attrs);
   449			return -ENOMEM;
   450			}
   451			if (dma_ops->need_sync(dev, dma))
   452				dma_map->dma_need_sync = true;
   453			dma_map->dma_pages[i] = dma;
   454		}
   455	
   456		if (pool->unaligned)
   457			xp_check_dma_contiguity(dma_map);
   458	
   459		err = xp_init_dma_info(pool, dma_map);
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> @@ -532,9 +545,9 @@ struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool)
>  	xskb->xdp.data_meta = xskb->xdp.data;
>
>  	if (pool->dma_need_sync) {
> -		dma_sync_single_range_for_device(pool->dev, xskb->dma, 0,
> -						 pool->frame_len,
> -						 DMA_BIDIRECTIONAL);
> +		pool->dma_ops.sync_single_range_for_device(pool->dev, xskb->dma, 0,
> +							   pool->frame_len,
> +							   DMA_BIDIRECTIONAL);
>  	}
>  	return &xskb->xdp;
>  }
> @@ -670,15 +683,15 @@ EXPORT_SYMBOL(xp_raw_get_dma);
>
>  void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb)
>  {
> -	dma_sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> -				      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
> +	xskb->pool->dma_ops.sync_single_range_for_cpu(xskb->pool->dev, xskb->dma, 0,
> +						      xskb->pool->frame_len, DMA_BIDIRECTIONAL);
>  }
>  EXPORT_SYMBOL(xp_dma_sync_for_cpu_slow);
>
>  void xp_dma_sync_for_device_slow(struct xsk_buff_pool *pool, dma_addr_t dma,
>  				 size_t size)
>  {
> -	dma_sync_single_range_for_device(pool->dev, dma, 0,
> -					 size, DMA_BIDIRECTIONAL);
> +	pool->dma_ops.sync_single_range_for_device(pool->dev, dma, 0,
> +						   size, DMA_BIDIRECTIONAL);
>  }
>  EXPORT_SYMBOL(xp_dma_sync_for_device_slow);

So you add an indirect function call on data path? Won't this be costly?

> --
> 2.32.0.3.g01195cf9f
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
Hi Xuan,

kernel test robot noticed the following build errors:

[auto build test ERROR on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
patch link:    https://lore.kernel.org/r/20230417032750.7086-1-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next] xsk: introduce xsk_dma_ops
config: i386-randconfig-a011-20230417 (https://download.01.org/0day-ci/archive/20230417/202304171427.uaryn9jl-...@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/28e766603a33761d7bd1fdd3e107595408319f7d
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Xuan-Zhuo/xsk-introduce-xsk_dma_ops/20230417-112903
        git checkout 28e766603a33761d7bd1fdd3e107595408319f7d
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash net/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot
| Link: https://lore.kernel.org/oe-kbuild-all/202304171427.uaryn9jl-...@intel.com/

All errors (new ones prefixed by >>):

>> net/xdp/xsk_buff_pool.c:430:26: error: incompatible function pointer types assigning to 'dma_addr_t (*)(struct device *, struct page *, unsigned long, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (*)(struct device *, struct page *, unsigned long, unsigned int, enum dma_data_direction, unsigned long)') from 'dma_addr_t (struct device *, struct page *, size_t, size_t, enum dma_data_direction, unsigned long)' (aka 'unsigned int (struct device *, struct page *, unsigned int, unsigned int, enum dma_data_direction, unsigned long)') [-Werror,-Wincompatible-function-pointer-types]
                   pool->dma_ops.map_page = dma_map_page_attrs;
                                          ^ ~~
   1 error generated.


vim +430 net/xdp/xsk_buff_pool.c

   409	
   410	int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
   411		       struct xsk_dma_ops *dma_ops,
   412		       unsigned long attrs, struct page **pages, u32 nr_pages)
   413	{
   414		struct xsk_dma_map *dma_map;
   415		dma_addr_t dma;
   416		int err;
   417		u32 i;
   418	
   419		dma_map = xp_find_dma_map(pool);
   420		if (dma_map) {
   421			err = xp_init_dma_info(pool, dma_map);
   422			if (err)
   423				return err;
   424	
   425			refcount_inc(&dma_map->users);
   426			return 0;
   427		}
   428	
   429		if (!dma_ops) {
 > 430			pool->dma_ops.map_page = dma_map_page_attrs;
   431			pool->dma_ops.mapping_error = dma_mapping_error;
   432			pool->dma_ops.need_sync = dma_need_sync;
   433			pool->dma_ops.sync_single_range_for_device = dma_sync_single_range_for_device;
   434			pool->dma_ops.sync_single_range_for_cpu = dma_sync_single_range_for_cpu;
   435			dma_ops = &pool->dma_ops;
   436		} else {
   437			pool->dma_ops = *dma_ops;
   438		}
   439	
   440		dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem);
   441		if (!dma_map)
   442			return -ENOMEM;
   443	
   444		for (i = 0; i < dma_map->dma_pages_cnt; i++) {
   445			dma = dma_ops->map_page(dev, pages[i], 0, PAGE_SIZE,
   446						DMA_BIDIRECTIONAL, attrs);
   447			if (dma_ops->mapping_error(dev, dma)) {
   448				__xp_dma_unmap(dma_map, dma_ops, attrs);
   449				return -ENOMEM;
   450			}
   451			if (dma_ops->need_sync(dev, dma))
   452				dma_map->dma_need_sync = true;
   453			dma_map->dma_pages[i] = dma;
   454		}
   455	
   456		if (pool->unaligned)
   457			xp_check_dma_contiguity(dma_map);
   458	
   459		err = xp_init_dma_info(pool, dma_map);
   460		if (err) {
   461			__xp_dma_unmap(dma_map, dma_ops, attrs);
   462	
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Sun, 16 Apr 2023 21:24:32 -0700, Christoph Hellwig wrote:
> On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> > The purpose of this patch is to allow a driver to pass its own dma_ops
> > to xsk.
>
> Drivers have no business passing around dma_ops, or even knowing about
> them.

There may be a misunderstanding: the "dma_ops" here is not the "dma_ops"
of the DMA API. I mean the callbacks for xsk to do DMA.

Maybe I should rename it in the next version.

Thanks.
Re: [PATCH net-next] xsk: introduce xsk_dma_ops
On Mon, Apr 17, 2023 at 11:27:50AM +0800, Xuan Zhuo wrote:
> The purpose of this patch is to allow a driver to pass its own dma_ops
> to xsk.

Drivers have no business passing around dma_ops, or even knowing about
them.
[PATCH net-next] xsk: introduce xsk_dma_ops
The purpose of this patch is to allow a driver to pass its own dma_ops
to xsk.

This is to cope with the case of virtio-net. If virtio does not have
VIRTIO_F_ACCESS_PLATFORM, then virtio cannot use the DMA API, and in
that case XSK cannot use the DMA API directly to obtain the DMA address.
Given this, XSK must support using the driver's own DMA callbacks.

On the other hand, XSK, as high-level code, should leave the underlying
DMA operations to the driver layer. The driver layer determines the
final DMA implementation; XSK should not make such assumptions.
Everything is simplified if DMA is done at the driver level.

More is here:
    https://lore.kernel.org/virtualization/1681265026.6082013-1-xuanz...@linux.alibaba.com/

Signed-off-by: Xuan Zhuo
---
 include/net/xdp_sock_drv.h  | 20 +++-
 include/net/xsk_buff_pool.h | 19 +++
 net/xdp/xsk_buff_pool.c     | 47 +++-
 3 files changed, 68 insertions(+), 18 deletions(-)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 9c0d860609ba..181583ff6a26 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -67,7 +67,17 @@ static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool,
 {
 	struct xdp_umem *umem = pool->umem;
 
-	return xp_dma_map(pool, dev, attrs, umem->pgs, umem->npgs);
+	return xp_dma_map(pool, dev, NULL, attrs, umem->pgs, umem->npgs);
+}
+
+static inline int xsk_pool_dma_map_with_ops(struct xsk_buff_pool *pool,
+					    struct device *dev,
+					    struct xsk_dma_ops *dma_ops,
+					    unsigned long attrs)
+{
+	struct xdp_umem *umem = pool->umem;
+
+	return xp_dma_map(pool, dev, dma_ops, attrs, umem->pgs, umem->npgs);
 }
 
 static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp)
@@ -226,6 +236,14 @@ static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool,
 	return 0;
 }
 
+static inline int xsk_pool_dma_map_with_ops(struct xsk_buff_pool *pool,
+					    struct device *dev,
+					    struct xsk_dma_ops *dma_ops,
+					    unsigned long attrs)
+{
+	return 0;
+}
+
 static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp)
 {
 	return 0;
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 3e952e569418..1299b9d12484 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -43,6 +43,23 @@ struct xsk_dma_map {
 	bool dma_need_sync;
 };
 
+struct xsk_dma_ops {
+	dma_addr_t (*map_page)(struct device *dev, struct page *page,
+			       unsigned long offset, size_t size,
+			       enum dma_data_direction dir, unsigned long attrs);
+	void (*unmap_page)(struct device *dev, dma_addr_t dma_handle,
+			   size_t size, enum dma_data_direction dir,
+			   unsigned long attrs);
+	int (*mapping_error)(struct device *dev, dma_addr_t dma_addr);
+	bool (*need_sync)(struct device *dev, dma_addr_t dma);
+	void (*sync_single_range_for_cpu)(struct device *dev, dma_addr_t addr,
+					  unsigned long offset, size_t size,
+					  enum dma_data_direction dir);
+	void (*sync_single_range_for_device)(struct device *dev, dma_addr_t addr,
+					     unsigned long offset, size_t size,
+					     enum dma_data_direction dir);
+};
+
 struct xsk_buff_pool {
 	/* Members only used in the control path first. */
 	struct device *dev;
@@ -85,6 +102,7 @@ struct xsk_buff_pool {
 	 * sockets share a single cq when the same netdev and queue id is shared.
 	 */
 	spinlock_t cq_lock;
+	struct xsk_dma_ops dma_ops;
 	struct xdp_buff_xsk *free_heads[];
 };
 
@@ -131,6 +149,7 @@ static inline void xp_init_xskb_dma(struct xdp_buff_xsk *xskb, struct xsk_buff_p
 /* AF_XDP ZC drivers, via xdp_sock_buff.h */
 void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq);
 int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
+	       struct xsk_dma_ops *dma_ops,
 	       unsigned long attrs, struct page **pages, u32 nr_pages);
 void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs);
 struct xdp_buff *xp_alloc(struct xsk_buff_pool *pool);
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b2df1e0f8153..646090cae8ec 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -328,7 +328,8 @@ static void xp_destroy_dma_map(struct xsk_dma_map *dma_map)
 	kfree(dma_map);
 }
 
-static void __xp_dma_unmap(struct xsk_dma_map *dma_map, unsigned long attrs)
+static void __xp_dma_un