Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Tue, Oct 17, 2023 at 2:17 PM Xuan Zhuo wrote: > > On Tue, 17 Oct 2023 13:27:47 +0800, Jason Wang wrote: > > On Tue, Oct 17, 2023 at 11:28 AM Jason Wang wrote: > > > > > > On Tue, Oct 17, 2023 at 11:26 AM Xuan Zhuo > > > wrote: > > > > > > > > On Tue, 17 Oct 2023 11:20:41 +0800, Jason Wang > > > > wrote: > > > > > On Tue, Oct 17, 2023 at 11:11 AM Xuan Zhuo > > > > > wrote: > > > > > > > > > > > > On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang > > > > > > wrote: > > > > > > > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo > > > > > > > wrote: > > > > > > > > > > > > > > > > ## AF_XDP > > > > > > > > > > > > > > > > XDP socket(AF_XDP) is an excellent bypass kernel network > > > > > > > > framework. The zero > > > > > > > > copy feature of xsk (XDP socket) needs to be supported by the > > > > > > > > driver. The > > > > > > > > performance of zero copy is very good. mlx5 and intel ixgbe > > > > > > > > already support > > > > > > > > this feature, This patch set allows virtio-net to support xsk's > > > > > > > > zerocopy xmit > > > > > > > > feature. > > > > > > > > > > > > > > > > At present, we have completed some preparation: > > > > > > > > > > > > > > > > 1. vq-reset (virtio spec and kernel code) > > > > > > > > 2. virtio-core premapped dma > > > > > > > > 3. virtio-net xdp refactor > > > > > > > > > > > > > > > > So it is time for Virtio-Net to complete the support for the > > > > > > > > XDP Socket > > > > > > > > Zerocopy. > > > > > > > > > > > > > > > > Virtio-net can not increase the queue num at will, so xsk > > > > > > > > shares the queue with > > > > > > > > kernel. > > > > > > > > > > > > > > > > On the other hand, Virtio-Net does not support generate > > > > > > > > interrupt from driver > > > > > > > > manually, so when we wakeup tx xmit, we used some tips. If the > > > > > > > > CPU run by TX > > > > > > > > NAPI last time is other CPUs, use IPI to wake up NAPI on the > > > > > > > > remote CPU. If it > > > > > > > > is also the local CPU, then we wake up napi directly. > > > > > > > > > > > > > > > > This patch set includes some refactor to the virtio-net to let > > > > > > > > that to support > > > > > > > > AF_XDP. > > > > > > > > > > > > > > > > ## performance > > > > > > > > > > > > > > > > ENV: Qemu with vhost-user(polling mode). > > > > > > > > > > > > > > > > Sockperf: https://github.com/Mellanox/sockperf > > > > > > > > I use this tool to send udp packet by kernel syscall. > > > > > > > > > > > > > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > > > > > > > > > > > > > > > I write a tool that sends udp packets or recvs udp packets by > > > > > > > > AF_XDP. > > > > > > > > > > > > > > > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > > > > > > > > --|---|--| > > > > > > > > xmit by syscall | 100%| | 676,915 > > > > > > > > xmit by xsk | 59.1% | 100% | 5,447,168 > > > > > > > > recv by syscall | 60% | 100% | 932,288 > > > > > > > > recv by xsk | 35.7% | 100% | 3,343,168 > > > > > > > > > > > > > > Any chance we can get a testpmd result (which I guess should be > > > > > > > better > > > > > > > than PPS above)? > > > > > > > > > > > > Do you mean testpmd + DPDK + AF_XDP? > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > Yes. This is probably better because my tool does more work. That > > > > > > is not a > > > > > > complete testing tool used by our business. > > > > > > > > > > Probably, but it would be appealing for others. Especially considering > > > > > DPDK supports AF_XDP PMD now. > > > > > > > > OK. > > > > > > > > Let me try. > > > > > > > > But could you start to review firstly? > > > > > > Yes, it's in my todo list. > > > > Speaking too fast, I think if it doesn't take too long time, I would > > wait for the result first as netdim series. One reason is that I > > remember claims to be only 10% to 20% loss comparing to wire speed, so > > I'd expect it should be much faster. I vaguely remember, even a vhost > > can gives us more than 3M PPS if we disable SMAP, so the numbers here > > are not as impressive as expected. > > > What is SMAP? Cloud you give me more info? Supervisor Mode Access Prevention Vhost suffers from this. > > So if we think the 3M as the wire speed, you expect the result > can reach 2.8M pps/core, right? It's AF_XDP that claims to be 80% if my memory is correct. So a correct AF_XDP implementation should not sit behind this too much. > Now the recv result is 2.5M(2463646) pps/core. > Do you think there is a huge gap? You never describe your testing environment in details. For example, is this a virtual environment? What's the CPU model and frequency etc. Because I never see a NIC whose wire speed is 3M. > > My tool makes udp packet and lookup route, so it take more much cpu. That's why I suggest you to test raw PPS. Thanks >
Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Tue, 17 Oct 2023 13:27:47 +0800, Jason Wang wrote: > On Tue, Oct 17, 2023 at 11:28 AM Jason Wang wrote: > > > > On Tue, Oct 17, 2023 at 11:26 AM Xuan Zhuo > > wrote: > > > > > > On Tue, 17 Oct 2023 11:20:41 +0800, Jason Wang > > > wrote: > > > > On Tue, Oct 17, 2023 at 11:11 AM Xuan Zhuo > > > > wrote: > > > > > > > > > > On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang > > > > > wrote: > > > > > > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo > > > > > > wrote: > > > > > > > > > > > > > > ## AF_XDP > > > > > > > > > > > > > > XDP socket(AF_XDP) is an excellent bypass kernel network > > > > > > > framework. The zero > > > > > > > copy feature of xsk (XDP socket) needs to be supported by the > > > > > > > driver. The > > > > > > > performance of zero copy is very good. mlx5 and intel ixgbe > > > > > > > already support > > > > > > > this feature, This patch set allows virtio-net to support xsk's > > > > > > > zerocopy xmit > > > > > > > feature. > > > > > > > > > > > > > > At present, we have completed some preparation: > > > > > > > > > > > > > > 1. vq-reset (virtio spec and kernel code) > > > > > > > 2. virtio-core premapped dma > > > > > > > 3. virtio-net xdp refactor > > > > > > > > > > > > > > So it is time for Virtio-Net to complete the support for the XDP > > > > > > > Socket > > > > > > > Zerocopy. > > > > > > > > > > > > > > Virtio-net can not increase the queue num at will, so xsk shares > > > > > > > the queue with > > > > > > > kernel. > > > > > > > > > > > > > > On the other hand, Virtio-Net does not support generate interrupt > > > > > > > from driver > > > > > > > manually, so when we wakeup tx xmit, we used some tips. If the > > > > > > > CPU run by TX > > > > > > > NAPI last time is other CPUs, use IPI to wake up NAPI on the > > > > > > > remote CPU. If it > > > > > > > is also the local CPU, then we wake up napi directly. > > > > > > > > > > > > > > This patch set includes some refactor to the virtio-net to let > > > > > > > that to support > > > > > > > AF_XDP. > > > > > > > > > > > > > > ## performance > > > > > > > > > > > > > > ENV: Qemu with vhost-user(polling mode). > > > > > > > > > > > > > > Sockperf: https://github.com/Mellanox/sockperf > > > > > > > I use this tool to send udp packet by kernel syscall. > > > > > > > > > > > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > > > > > > > > > > > > > I write a tool that sends udp packets or recvs udp packets by > > > > > > > AF_XDP. > > > > > > > > > > > > > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > > > > > > > --|---|--| > > > > > > > xmit by syscall | 100%| | 676,915 > > > > > > > xmit by xsk | 59.1% | 100% | 5,447,168 > > > > > > > recv by syscall | 60% | 100% | 932,288 > > > > > > > recv by xsk | 35.7% | 100% | 3,343,168 > > > > > > > > > > > > Any chance we can get a testpmd result (which I guess should be > > > > > > better > > > > > > than PPS above)? > > > > > > > > > > Do you mean testpmd + DPDK + AF_XDP? > > > > > > > > Yes. > > > > > > > > > > > > > > Yes. This is probably better because my tool does more work. That is > > > > > not a > > > > > complete testing tool used by our business. > > > > > > > > Probably, but it would be appealing for others. Especially considering > > > > DPDK supports AF_XDP PMD now. > > > > > > OK. > > > > > > Let me try. > > > > > > But could you start to review firstly? > > > > Yes, it's in my todo list. > > Speaking too fast, I think if it doesn't take too long time, I would > wait for the result first as netdim series. One reason is that I > remember claims to be only 10% to 20% loss comparing to wire speed, so > I'd expect it should be much faster. I vaguely remember, even a vhost > can gives us more than 3M PPS if we disable SMAP, so the numbers here > are not as impressive as expected. What is SMAP? Cloud you give me more info? So if we think the 3M as the wire speed, you expect the result can reach 2.8M pps/core, right? Now the recv result is 2.5M(2463646) pps/core. Do you think there is a huge gap? My tool makes udp packet and lookup route, so it take more much cpu. I am confused. What is SMAP? Could you give me more information? So if we use 3M as the wire speed, you would expect the result to be 2.8M pps/core, right? Now the recv result is 2.5M (2463646 = 3,343,168/1.357) pps/core. Do you think the difference is big? My tool makes udp packets and looks up routes, so it requires more CPU. I'm confused. Is there something I misunderstood? Thanks. > > Thanks > > > > > > > > > > > > > > > > > > > > > > > What I noticed is that the hotspot is the driver writing virtio desc. > > > > > Because > > > > > the device is in busy mode. So there is race between driver and > > > > > device. > > > > > So I modified the virtio core and lazily updated avail idx. Then pps
Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Tue, Oct 17, 2023 at 11:28 AM Jason Wang wrote: > > On Tue, Oct 17, 2023 at 11:26 AM Xuan Zhuo wrote: > > > > On Tue, 17 Oct 2023 11:20:41 +0800, Jason Wang wrote: > > > On Tue, Oct 17, 2023 at 11:11 AM Xuan Zhuo > > > wrote: > > > > > > > > On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang > > > > wrote: > > > > > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo > > > > > wrote: > > > > > > > > > > > > ## AF_XDP > > > > > > > > > > > > XDP socket(AF_XDP) is an excellent bypass kernel network framework. > > > > > > The zero > > > > > > copy feature of xsk (XDP socket) needs to be supported by the > > > > > > driver. The > > > > > > performance of zero copy is very good. mlx5 and intel ixgbe already > > > > > > support > > > > > > this feature, This patch set allows virtio-net to support xsk's > > > > > > zerocopy xmit > > > > > > feature. > > > > > > > > > > > > At present, we have completed some preparation: > > > > > > > > > > > > 1. vq-reset (virtio spec and kernel code) > > > > > > 2. virtio-core premapped dma > > > > > > 3. virtio-net xdp refactor > > > > > > > > > > > > So it is time for Virtio-Net to complete the support for the XDP > > > > > > Socket > > > > > > Zerocopy. > > > > > > > > > > > > Virtio-net can not increase the queue num at will, so xsk shares > > > > > > the queue with > > > > > > kernel. > > > > > > > > > > > > On the other hand, Virtio-Net does not support generate interrupt > > > > > > from driver > > > > > > manually, so when we wakeup tx xmit, we used some tips. If the CPU > > > > > > run by TX > > > > > > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote > > > > > > CPU. If it > > > > > > is also the local CPU, then we wake up napi directly. > > > > > > > > > > > > This patch set includes some refactor to the virtio-net to let that > > > > > > to support > > > > > > AF_XDP. > > > > > > > > > > > > ## performance > > > > > > > > > > > > ENV: Qemu with vhost-user(polling mode). > > > > > > > > > > > > Sockperf: https://github.com/Mellanox/sockperf > > > > > > I use this tool to send udp packet by kernel syscall. > > > > > > > > > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > > > > > > > > > > > I write a tool that sends udp packets or recvs udp packets by > > > > > > AF_XDP. > > > > > > > > > > > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > > > > > > --|---|--| > > > > > > xmit by syscall | 100%| | 676,915 > > > > > > xmit by xsk | 59.1% | 100% | 5,447,168 > > > > > > recv by syscall | 60% | 100% | 932,288 > > > > > > recv by xsk | 35.7% | 100% | 3,343,168 > > > > > > > > > > Any chance we can get a testpmd result (which I guess should be better > > > > > than PPS above)? > > > > > > > > Do you mean testpmd + DPDK + AF_XDP? > > > > > > Yes. > > > > > > > > > > > Yes. This is probably better because my tool does more work. That is > > > > not a > > > > complete testing tool used by our business. > > > > > > Probably, but it would be appealing for others. Especially considering > > > DPDK supports AF_XDP PMD now. > > > > OK. > > > > Let me try. > > > > But could you start to review firstly? > > Yes, it's in my todo list. Speaking too fast, I think if it doesn't take too long time, I would wait for the result first as netdim series. One reason is that I remember claims to be only 10% to 20% loss comparing to wire speed, so I'd expect it should be much faster. I vaguely remember, even a vhost can gives us more than 3M PPS if we disable SMAP, so the numbers here are not as impressive as expected. Thanks > > > > > > > > > > > > > > > > What I noticed is that the hotspot is the driver writing virtio desc. > > > > Because > > > > the device is in busy mode. So there is race between driver and device. > > > > So I modified the virtio core and lazily updated avail idx. Then pps > > > > can reach > > > > 10,000,000. > > > > > > Care to post a draft for this? > > > > YES, I is thinking for this. > > But maybe that is just work for split. The packed mode has some troubles. > > Ok. > > Thanks > > > > > Thanks. > > > > > > > > Thanks > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > ## maintain > > > > > > > > > > > > I am currently a reviewer for virtio-net. I commit to maintain > > > > > > AF_XDP support in > > > > > > virtio-net. > > > > > > > > > > > > Please review. > > > > > > > > > > > > Thanks. > > > > > > > > > > > > v1: > > > > > > 1. remove two virtio commits. Push this patchset to net-next > > > > > > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to > > > > > > xsk: support tx > > > > > > 3. fix some warnings > > > > > > > > > > > > Xuan Zhuo (19): > > > > > > virtio_net: rename free_old_xmit_skbs to free_old_xmit > > > > > > virtio_net: unify the code for recycling the x
Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Tue, Oct 17, 2023 at 11:26 AM Xuan Zhuo wrote: > > On Tue, 17 Oct 2023 11:20:41 +0800, Jason Wang wrote: > > On Tue, Oct 17, 2023 at 11:11 AM Xuan Zhuo > > wrote: > > > > > > On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang > > > wrote: > > > > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo > > > > wrote: > > > > > > > > > > ## AF_XDP > > > > > > > > > > XDP socket(AF_XDP) is an excellent bypass kernel network framework. > > > > > The zero > > > > > copy feature of xsk (XDP socket) needs to be supported by the driver. > > > > > The > > > > > performance of zero copy is very good. mlx5 and intel ixgbe already > > > > > support > > > > > this feature, This patch set allows virtio-net to support xsk's > > > > > zerocopy xmit > > > > > feature. > > > > > > > > > > At present, we have completed some preparation: > > > > > > > > > > 1. vq-reset (virtio spec and kernel code) > > > > > 2. virtio-core premapped dma > > > > > 3. virtio-net xdp refactor > > > > > > > > > > So it is time for Virtio-Net to complete the support for the XDP > > > > > Socket > > > > > Zerocopy. > > > > > > > > > > Virtio-net can not increase the queue num at will, so xsk shares the > > > > > queue with > > > > > kernel. > > > > > > > > > > On the other hand, Virtio-Net does not support generate interrupt > > > > > from driver > > > > > manually, so when we wakeup tx xmit, we used some tips. If the CPU > > > > > run by TX > > > > > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote > > > > > CPU. If it > > > > > is also the local CPU, then we wake up napi directly. > > > > > > > > > > This patch set includes some refactor to the virtio-net to let that > > > > > to support > > > > > AF_XDP. > > > > > > > > > > ## performance > > > > > > > > > > ENV: Qemu with vhost-user(polling mode). > > > > > > > > > > Sockperf: https://github.com/Mellanox/sockperf > > > > > I use this tool to send udp packet by kernel syscall. > > > > > > > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > > > > > > > > > I write a tool that sends udp packets or recvs udp packets by AF_XDP. > > > > > > > > > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > > > > > --|---|--| > > > > > xmit by syscall | 100%| | 676,915 > > > > > xmit by xsk | 59.1% | 100% | 5,447,168 > > > > > recv by syscall | 60% | 100% | 932,288 > > > > > recv by xsk | 35.7% | 100% | 3,343,168 > > > > > > > > Any chance we can get a testpmd result (which I guess should be better > > > > than PPS above)? > > > > > > Do you mean testpmd + DPDK + AF_XDP? > > > > Yes. > > > > > > > > Yes. This is probably better because my tool does more work. That is not a > > > complete testing tool used by our business. > > > > Probably, but it would be appealing for others. Especially considering > > DPDK supports AF_XDP PMD now. > > OK. > > Let me try. > > But could you start to review firstly? Yes, it's in my todo list. > > > > > > > > > > What I noticed is that the hotspot is the driver writing virtio desc. > > > Because > > > the device is in busy mode. So there is race between driver and device. > > > So I modified the virtio core and lazily updated avail idx. Then pps can > > > reach > > > 10,000,000. > > > > Care to post a draft for this? > > YES, I is thinking for this. > But maybe that is just work for split. The packed mode has some troubles. Ok. Thanks > > Thanks. > > > > > Thanks > > > > > > > > Thanks. > > > > > > > > > > > Thanks > > > > > > > > > > > > > > ## maintain > > > > > > > > > > I am currently a reviewer for virtio-net. I commit to maintain AF_XDP > > > > > support in > > > > > virtio-net. > > > > > > > > > > Please review. > > > > > > > > > > Thanks. > > > > > > > > > > v1: > > > > > 1. remove two virtio commits. Push this patchset to net-next > > > > > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to > > > > > xsk: support tx > > > > > 3. fix some warnings > > > > > > > > > > Xuan Zhuo (19): > > > > > virtio_net: rename free_old_xmit_skbs to free_old_xmit > > > > > virtio_net: unify the code for recycling the xmit ptr > > > > > virtio_net: independent directory > > > > > virtio_net: move to virtio_net.h > > > > > virtio_net: add prefix virtnet to all struct/api inside virtio_net.h > > > > > virtio_net: separate virtnet_rx_resize() > > > > > virtio_net: separate virtnet_tx_resize() > > > > > virtio_net: sq support premapped mode > > > > > virtio_net: xsk: bind/unbind xsk > > > > > virtio_net: xsk: prevent disable tx napi > > > > > virtio_net: xsk: tx: support tx > > > > > virtio_net: xsk: tx: support wakeup > > > > > virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk > > > > > buffer > > > > > virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer > > > > > virtio_net: xsk: rx: introduce ad
Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Tue, 17 Oct 2023 11:20:41 +0800, Jason Wang wrote: > On Tue, Oct 17, 2023 at 11:11 AM Xuan Zhuo wrote: > > > > On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang wrote: > > > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo > > > wrote: > > > > > > > > ## AF_XDP > > > > > > > > XDP socket(AF_XDP) is an excellent bypass kernel network framework. The > > > > zero > > > > copy feature of xsk (XDP socket) needs to be supported by the driver. > > > > The > > > > performance of zero copy is very good. mlx5 and intel ixgbe already > > > > support > > > > this feature, This patch set allows virtio-net to support xsk's > > > > zerocopy xmit > > > > feature. > > > > > > > > At present, we have completed some preparation: > > > > > > > > 1. vq-reset (virtio spec and kernel code) > > > > 2. virtio-core premapped dma > > > > 3. virtio-net xdp refactor > > > > > > > > So it is time for Virtio-Net to complete the support for the XDP Socket > > > > Zerocopy. > > > > > > > > Virtio-net can not increase the queue num at will, so xsk shares the > > > > queue with > > > > kernel. > > > > > > > > On the other hand, Virtio-Net does not support generate interrupt from > > > > driver > > > > manually, so when we wakeup tx xmit, we used some tips. If the CPU run > > > > by TX > > > > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote > > > > CPU. If it > > > > is also the local CPU, then we wake up napi directly. > > > > > > > > This patch set includes some refactor to the virtio-net to let that to > > > > support > > > > AF_XDP. > > > > > > > > ## performance > > > > > > > > ENV: Qemu with vhost-user(polling mode). > > > > > > > > Sockperf: https://github.com/Mellanox/sockperf > > > > I use this tool to send udp packet by kernel syscall. > > > > > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > > > > > > > I write a tool that sends udp packets or recvs udp packets by AF_XDP. > > > > > > > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > > > > --|---|--| > > > > xmit by syscall | 100%| | 676,915 > > > > xmit by xsk | 59.1% | 100% | 5,447,168 > > > > recv by syscall | 60% | 100% | 932,288 > > > > recv by xsk | 35.7% | 100% | 3,343,168 > > > > > > Any chance we can get a testpmd result (which I guess should be better > > > than PPS above)? > > > > Do you mean testpmd + DPDK + AF_XDP? > > Yes. > > > > > Yes. This is probably better because my tool does more work. That is not a > > complete testing tool used by our business. > > Probably, but it would be appealing for others. Especially considering > DPDK supports AF_XDP PMD now. OK. Let me try. But could you start to review firstly? > > > > > What I noticed is that the hotspot is the driver writing virtio desc. > > Because > > the device is in busy mode. So there is race between driver and device. > > So I modified the virtio core and lazily updated avail idx. Then pps can > > reach > > 10,000,000. > > Care to post a draft for this? YES, I is thinking for this. But maybe that is just work for split. The packed mode has some troubles. Thanks. > > Thanks > > > > > Thanks. > > > > > > > > Thanks > > > > > > > > > > > ## maintain > > > > > > > > I am currently a reviewer for virtio-net. I commit to maintain AF_XDP > > > > support in > > > > virtio-net. > > > > > > > > Please review. > > > > > > > > Thanks. > > > > > > > > v1: > > > > 1. remove two virtio commits. Push this patchset to net-next > > > > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: > > > > support tx > > > > 3. fix some warnings > > > > > > > > Xuan Zhuo (19): > > > > virtio_net: rename free_old_xmit_skbs to free_old_xmit > > > > virtio_net: unify the code for recycling the xmit ptr > > > > virtio_net: independent directory > > > > virtio_net: move to virtio_net.h > > > > virtio_net: add prefix virtnet to all struct/api inside virtio_net.h > > > > virtio_net: separate virtnet_rx_resize() > > > > virtio_net: separate virtnet_tx_resize() > > > > virtio_net: sq support premapped mode > > > > virtio_net: xsk: bind/unbind xsk > > > > virtio_net: xsk: prevent disable tx napi > > > > virtio_net: xsk: tx: support tx > > > > virtio_net: xsk: tx: support wakeup > > > > virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer > > > > virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer > > > > virtio_net: xsk: rx: introduce add_recvbuf_xsk() > > > > virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer > > > > virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer > > > > virtio_net: update tx timeout record > > > > virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY > > > > > > > > MAINTAINERS | 2 +- > > > > drivers/net/Kconfig
Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Tue, Oct 17, 2023 at 11:11 AM Xuan Zhuo wrote: > > On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang wrote: > > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo > > wrote: > > > > > > ## AF_XDP > > > > > > XDP socket(AF_XDP) is an excellent bypass kernel network framework. The > > > zero > > > copy feature of xsk (XDP socket) needs to be supported by the driver. The > > > performance of zero copy is very good. mlx5 and intel ixgbe already > > > support > > > this feature, This patch set allows virtio-net to support xsk's zerocopy > > > xmit > > > feature. > > > > > > At present, we have completed some preparation: > > > > > > 1. vq-reset (virtio spec and kernel code) > > > 2. virtio-core premapped dma > > > 3. virtio-net xdp refactor > > > > > > So it is time for Virtio-Net to complete the support for the XDP Socket > > > Zerocopy. > > > > > > Virtio-net can not increase the queue num at will, so xsk shares the > > > queue with > > > kernel. > > > > > > On the other hand, Virtio-Net does not support generate interrupt from > > > driver > > > manually, so when we wakeup tx xmit, we used some tips. If the CPU run by > > > TX > > > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote CPU. > > > If it > > > is also the local CPU, then we wake up napi directly. > > > > > > This patch set includes some refactor to the virtio-net to let that to > > > support > > > AF_XDP. > > > > > > ## performance > > > > > > ENV: Qemu with vhost-user(polling mode). > > > > > > Sockperf: https://github.com/Mellanox/sockperf > > > I use this tool to send udp packet by kernel syscall. > > > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > > > > > I write a tool that sends udp packets or recvs udp packets by AF_XDP. > > > > > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > > > --|---|--| > > > xmit by syscall | 100%| | 676,915 > > > xmit by xsk | 59.1% | 100% | 5,447,168 > > > recv by syscall | 60% | 100% | 932,288 > > > recv by xsk | 35.7% | 100% | 3,343,168 > > > > Any chance we can get a testpmd result (which I guess should be better > > than PPS above)? > > Do you mean testpmd + DPDK + AF_XDP? Yes. > > Yes. This is probably better because my tool does more work. That is not a > complete testing tool used by our business. Probably, but it would be appealing for others. Especially considering DPDK supports AF_XDP PMD now. > > What I noticed is that the hotspot is the driver writing virtio desc. Because > the device is in busy mode. So there is race between driver and device. > So I modified the virtio core and lazily updated avail idx. Then pps can reach > 10,000,000. Care to post a draft for this? Thanks > > Thanks. > > > > > Thanks > > > > > > > > ## maintain > > > > > > I am currently a reviewer for virtio-net. I commit to maintain AF_XDP > > > support in > > > virtio-net. > > > > > > Please review. > > > > > > Thanks. > > > > > > v1: > > > 1. remove two virtio commits. Push this patchset to net-next > > > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: > > > support tx > > > 3. fix some warnings > > > > > > Xuan Zhuo (19): > > > virtio_net: rename free_old_xmit_skbs to free_old_xmit > > > virtio_net: unify the code for recycling the xmit ptr > > > virtio_net: independent directory > > > virtio_net: move to virtio_net.h > > > virtio_net: add prefix virtnet to all struct/api inside virtio_net.h > > > virtio_net: separate virtnet_rx_resize() > > > virtio_net: separate virtnet_tx_resize() > > > virtio_net: sq support premapped mode > > > virtio_net: xsk: bind/unbind xsk > > > virtio_net: xsk: prevent disable tx napi > > > virtio_net: xsk: tx: support tx > > > virtio_net: xsk: tx: support wakeup > > > virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer > > > virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer > > > virtio_net: xsk: rx: introduce add_recvbuf_xsk() > > > virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer > > > virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer > > > virtio_net: update tx timeout record > > > virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY > > > > > > MAINTAINERS | 2 +- > > > drivers/net/Kconfig | 8 +- > > > drivers/net/Makefile| 2 +- > > > drivers/net/virtio/Kconfig | 13 + > > > drivers/net/virtio/Makefile | 8 + > > > drivers/net/{virtio_net.c => virtio/main.c} | 652 +--- > > > drivers/net/virtio/virtio_net.h | 359 +++ > > > drivers/net/virtio/xsk.c| 545 > > > drivers/net/virtio/xsk.h| 32 + > > > 9 files changed, 1247 inser
Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang wrote: > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo wrote: > > > > ## AF_XDP > > > > XDP socket(AF_XDP) is an excellent bypass kernel network framework. The zero > > copy feature of xsk (XDP socket) needs to be supported by the driver. The > > performance of zero copy is very good. mlx5 and intel ixgbe already support > > this feature, This patch set allows virtio-net to support xsk's zerocopy > > xmit > > feature. > > > > At present, we have completed some preparation: > > > > 1. vq-reset (virtio spec and kernel code) > > 2. virtio-core premapped dma > > 3. virtio-net xdp refactor > > > > So it is time for Virtio-Net to complete the support for the XDP Socket > > Zerocopy. > > > > Virtio-net can not increase the queue num at will, so xsk shares the queue > > with > > kernel. > > > > On the other hand, Virtio-Net does not support generate interrupt from > > driver > > manually, so when we wakeup tx xmit, we used some tips. If the CPU run by TX > > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote CPU. If > > it > > is also the local CPU, then we wake up napi directly. > > > > This patch set includes some refactor to the virtio-net to let that to > > support > > AF_XDP. > > > > ## performance > > > > ENV: Qemu with vhost-user(polling mode). > > > > Sockperf: https://github.com/Mellanox/sockperf > > I use this tool to send udp packet by kernel syscall. > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > > > I write a tool that sends udp packets or recvs udp packets by AF_XDP. > > > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > > --|---|--| > > xmit by syscall | 100%| | 676,915 > > xmit by xsk | 59.1% | 100% | 5,447,168 > > recv by syscall | 60% | 100% | 932,288 > > recv by xsk | 35.7% | 100% | 3,343,168 > > Any chance we can get a testpmd result (which I guess should be better > than PPS above)? Do you mean testpmd + DPDK + AF_XDP? Yes. This is probably better because my tool does more work. That is not a complete testing tool used by our business. What I noticed is that the hotspot is the driver writing virtio desc. Because the device is in busy mode. So there is race between driver and device. So I modified the virtio core and lazily updated avail idx. Then pps can reach 10,000,000. Thanks. > > Thanks > > > > > ## maintain > > > > I am currently a reviewer for virtio-net. I commit to maintain AF_XDP > > support in > > virtio-net. > > > > Please review. > > > > Thanks. > > > > v1: > > 1. remove two virtio commits. Push this patchset to net-next > > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: > > support tx > > 3. fix some warnings > > > > Xuan Zhuo (19): > > virtio_net: rename free_old_xmit_skbs to free_old_xmit > > virtio_net: unify the code for recycling the xmit ptr > > virtio_net: independent directory > > virtio_net: move to virtio_net.h > > virtio_net: add prefix virtnet to all struct/api inside virtio_net.h > > virtio_net: separate virtnet_rx_resize() > > virtio_net: separate virtnet_tx_resize() > > virtio_net: sq support premapped mode > > virtio_net: xsk: bind/unbind xsk > > virtio_net: xsk: prevent disable tx napi > > virtio_net: xsk: tx: support tx > > virtio_net: xsk: tx: support wakeup > > virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer > > virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer > > virtio_net: xsk: rx: introduce add_recvbuf_xsk() > > virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer > > virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer > > virtio_net: update tx timeout record > > virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY > > > > MAINTAINERS | 2 +- > > drivers/net/Kconfig | 8 +- > > drivers/net/Makefile| 2 +- > > drivers/net/virtio/Kconfig | 13 + > > drivers/net/virtio/Makefile | 8 + > > drivers/net/{virtio_net.c => virtio/main.c} | 652 +--- > > drivers/net/virtio/virtio_net.h | 359 +++ > > drivers/net/virtio/xsk.c| 545 > > drivers/net/virtio/xsk.h| 32 + > > 9 files changed, 1247 insertions(+), 374 deletions(-) > > create mode 100644 drivers/net/virtio/Kconfig > > create mode 100644 drivers/net/virtio/Makefile > > rename drivers/net/{virtio_net.c => virtio/main.c} (91%) > > create mode 100644 drivers/net/virtio/virtio_net.h > > create mode 100644 drivers/net/virtio/xsk.c > > create mode 100644 drivers/net/virtio/xsk.h > > > > -- > > 2.32.0.3.g01195cf9f > > > ___ Virtualization mailing list Vi
Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo wrote: > > ## AF_XDP > > XDP socket(AF_XDP) is an excellent bypass kernel network framework. The zero > copy feature of xsk (XDP socket) needs to be supported by the driver. The > performance of zero copy is very good. mlx5 and intel ixgbe already support > this feature, This patch set allows virtio-net to support xsk's zerocopy xmit > feature. > > At present, we have completed some preparation: > > 1. vq-reset (virtio spec and kernel code) > 2. virtio-core premapped dma > 3. virtio-net xdp refactor > > So it is time for Virtio-Net to complete the support for the XDP Socket > Zerocopy. > > Virtio-net can not increase the queue num at will, so xsk shares the queue > with > kernel. > > On the other hand, Virtio-Net does not support generate interrupt from driver > manually, so when we wakeup tx xmit, we used some tips. If the CPU run by TX > NAPI last time is other CPUs, use IPI to wake up NAPI on the remote CPU. If it > is also the local CPU, then we wake up napi directly. > > This patch set includes some refactor to the virtio-net to let that to support > AF_XDP. > > ## performance > > ENV: Qemu with vhost-user(polling mode). > > Sockperf: https://github.com/Mellanox/sockperf > I use this tool to send udp packet by kernel syscall. > > xmit command: sockperf tp -i 10.0.3.1 -t 1000 > > I write a tool that sends udp packets or recvs udp packets by AF_XDP. > > | Guest APP CPU |Guest Softirq CPU | UDP PPS > --|---|--| > xmit by syscall | 100%| | 676,915 > xmit by xsk | 59.1% | 100% | 5,447,168 > recv by syscall | 60% | 100% | 932,288 > recv by xsk | 35.7% | 100% | 3,343,168 Any chance we can get a testpmd result (which I guess should be better than PPS above)? Thanks > > ## maintain > > I am currently a reviewer for virtio-net. I commit to maintain AF_XDP support > in > virtio-net. > > Please review. > > Thanks. > > v1: > 1. remove two virtio commits. Push this patchset to net-next > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: > support tx > 3. fix some warnings > > Xuan Zhuo (19): > virtio_net: rename free_old_xmit_skbs to free_old_xmit > virtio_net: unify the code for recycling the xmit ptr > virtio_net: independent directory > virtio_net: move to virtio_net.h > virtio_net: add prefix virtnet to all struct/api inside virtio_net.h > virtio_net: separate virtnet_rx_resize() > virtio_net: separate virtnet_tx_resize() > virtio_net: sq support premapped mode > virtio_net: xsk: bind/unbind xsk > virtio_net: xsk: prevent disable tx napi > virtio_net: xsk: tx: support tx > virtio_net: xsk: tx: support wakeup > virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer > virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer > virtio_net: xsk: rx: introduce add_recvbuf_xsk() > virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer > virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer > virtio_net: update tx timeout record > virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY > > MAINTAINERS | 2 +- > drivers/net/Kconfig | 8 +- > drivers/net/Makefile| 2 +- > drivers/net/virtio/Kconfig | 13 + > drivers/net/virtio/Makefile | 8 + > drivers/net/{virtio_net.c => virtio/main.c} | 652 +--- > drivers/net/virtio/virtio_net.h | 359 +++ > drivers/net/virtio/xsk.c| 545 > drivers/net/virtio/xsk.h| 32 + > 9 files changed, 1247 insertions(+), 374 deletions(-) > create mode 100644 drivers/net/virtio/Kconfig > create mode 100644 drivers/net/virtio/Makefile > rename drivers/net/{virtio_net.c => virtio/main.c} (91%) > create mode 100644 drivers/net/virtio/virtio_net.h > create mode 100644 drivers/net/virtio/xsk.c > create mode 100644 drivers/net/virtio/xsk.h > > -- > 2.32.0.3.g01195cf9f > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH
On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov wrote: > > On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki > wrote: > > > > On 2023/10/16 1:07, Alexei Starovoitov wrote: > > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki > > > wrote: > > >> > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >> index 0448700890f7..298634556fab 100644 > > >> --- a/include/uapi/linux/bpf.h > > >> +++ b/include/uapi/linux/bpf.h > > >> @@ -988,6 +988,7 @@ enum bpf_prog_type { > > >> BPF_PROG_TYPE_SK_LOOKUP, > > >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > > >> BPF_PROG_TYPE_NETFILTER, > > >> + BPF_PROG_TYPE_VNET_HASH, > > > > > > Sorry, we do not add new stable program types anymore. > > > > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { > > >> __u8 tstamp_type; > > >> __u32 :24; /* Padding, future use. */ > > >> __u64 hwtstamp; > > >> + > > >> + __u32 vnet_hash_value; > > >> + __u16 vnet_hash_report; > > >> + __u16 vnet_rss_queue; > > >> }; > > > > > > we also do not add anything to uapi __sk_buff. > > > > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > > >> + .get_func_proto = sk_filter_func_proto, > > >> + .is_valid_access= sk_filter_is_valid_access, > > >> + .convert_ctx_access = bpf_convert_ctx_access, > > >> + .gen_ld_abs = bpf_gen_ld_abs, > > >> +}; > > > > > > and we don't do ctx rewrites like this either. > > > > > > Please see how hid-bpf and cgroup rstat are hooking up bpf > > > in _unstable_ way. > > > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF > > and I'm worried if it may mean the interface stability. > > > > Let me describe the context. QEMU bundles an eBPF program that is used > > for the "eBPF steering program" feature of tun. Now I'm proposing to > > extend the feature to allow to return some values to the userspace and > > vhost_net. As such, the extension needs to be done in a way that ensures > > interface stability. > > bpf is not an option then. > we do not add stable bpf program types or hooks any more. Does this mean eBPF could not be used for any new use cases other than the existing ones? > If a kernel subsystem wants to use bpf it needs to accept the fact > that such bpf extensibility will be unstable and subsystem maintainers > can decide to remove such bpf support in the future. I don't see how it is different from the existing ones. Thanks > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu wrote: > > > > On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: > > On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: > >> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: > >>> > >>> > >>> On 10/12/2023 8:01 PM, Jason Wang wrote: > On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: > > Devices with on-chip IOMMU or vendor specific IOTLB implementation > > may need to restore iotlb mapping to the initial or default state > > using the .reset_map op, as it's desirable for some parent devices > > to solely manipulate mappings by its own, independent of virtio device > > state. For instance, device reset does not cause mapping go away on > > such IOTLB model in need of persistent mapping. Before vhost-vdpa > > is going away, give them a chance to reset iotlb back to the initial > > state in vhost_vdpa_cleanup(). > > > > Signed-off-by: Si-Wei Liu > > --- > >drivers/vhost/vdpa.c | 16 > >1 file changed, 16 insertions(+) > > > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c > > index 851535f..a3f8160 100644 > > --- a/drivers/vhost/vdpa.c > > +++ b/drivers/vhost/vdpa.c > > @@ -131,6 +131,15 @@ static struct vhost_vdpa_as > > *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, > > return vhost_vdpa_alloc_as(v, asid); > >} > > > > +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) > > +{ > > + struct vdpa_device *vdpa = v->vdpa; > > + const struct vdpa_config_ops *ops = vdpa->config; > > + > > + if (ops->reset_map) > > + ops->reset_map(vdpa, asid); > > +} > > + > >static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) > >{ > > struct vhost_vdpa_as *as = asid_to_as(v, asid); > > @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa > > *v, u32 asid) > > > > hlist_del(&as->hash_link); > > vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); > > + /* > > +* Devices with vendor specific IOMMU may need to restore > > +* iotlb to the initial or default state which is not done > > +* through device reset, as the IOTLB mapping manipulation > > +* could be decoupled from the virtio device life cycle. > > +*/ > Should we do this according to whether IOTLB_PRESIST is set? > >>> Well, in theory this seems like so but it's unnecessary code change > >>> actually, as that is the way how vDPA parent behind platform IOMMU works > >>> today, and userspace doesn't break as of today. :) > >> Well, this is one question I've ever asked before. You have explained > >> that one of the reason that we don't break userspace is that they may > >> couple IOTLB reset with vDPA reset as well. One example is the Qemu. > >> > >>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set > >>> it doesn't necessarily mean the iotlb will definitely be destroyed > >>> across reset (think about the platform IOMMU case), so userspace today > >>> is already tolerating enough with either good or bad IOMMU. I'm confused, how to define tolerating here? For example, if it has tolerance, why bother? > >>This code of > >>> not checking IOTLB_PERSIST being set is intentional, there's no point to > >>> emulate bad IOMMU behavior even for older userspace (with improper > >>> emulation to be done it would result in even worse performance). I can easily imagine a case: The old Qemu that works only with a setup like mlx5_vdpa. If we do this without a negotiation, IOTLB will not be clear but the Qemu will try to re-program the IOTLB after reset. Which will break? 1) stick the exact old behaviour with just one line of check 2) audit all the possible cases to avoid a one line of code 1) seems much easier than 2) > >> For two reasons: > >> > >> 1) backend features need acked by userspace this is by design > >> 2) keep the odd behaviour seems to be more safe as we can't audit > >> every userspace program > >> > > The old behavior (without flag ack) cannot be trusted already, as: Possibly but the point is to unbreak userspace no matter how weird the behaviour we've ever had. > > * Devices using platform IOMMU (in other words, not implementing > > neither .set_map nor .dma_map) does not unmap memory at virtio reset. > > * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do > > reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb > > called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset, > > now all backends work the same as far as I know., which was (and is) > > the way devices using the platform IOMMU works. > > > > The difference in behavior did not matter as QEMU unmaps all the > > memory unregistering the memory listener at vhost_vdpa_dev_start(..., > > start
Re: [PATCH net-next v1 13/19] virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
On Mon, 16 Oct 2023 16:44:34 -0700, Jakub Kicinski wrote: > On Mon, 16 Oct 2023 20:00:27 +0800 Xuan Zhuo wrote: > > @@ -305,9 +311,15 @@ static inline void virtnet_free_old_xmit(struct > > virtnet_sq *sq, bool in_napi, > > > > stats->bytes += xdp_get_frame_len(frame); > > xdp_return_frame(frame); > > + } else { > > + stats->bytes += virtnet_ptr_to_xsk(ptr); > > + ++xsknum; > > } > > stats->packets++; > > } > > + > > + if (xsknum) > > + xsk_tx_completed(sq->xsk.pool, xsknum); > > } > > sparse complains: > > drivers/net/virtio/virtio_net.h:322:41: warning: incorrect type in argument 1 > (different address spaces) > drivers/net/virtio/virtio_net.h:322:41:expected struct xsk_buff_pool *pool > drivers/net/virtio/virtio_net.h:322:41:got struct xsk_buff_pool > [noderef] __rcu *pool > > please build test with W=1 C=1 OK. I will add C=1 to may script. Thanks. > -- > pw-bot: cr ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 4/4] vdpa/mlx5: implement .reset_map driver op
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with virtio_vdpa"), mlx5_vdpa starts with preallocate 1:1 DMA MR at device creation time. This 1:1 DMA MR will be implicitly destroyed while the first .set_map call is invoked, in which case callers like vhost-vdpa will start to set up custom mappings. When the .reset callback is invoked, the custom mappings will be cleared and the 1:1 DMA MR will be re-created. In order to reduce excessive memory mapping cost in live migration, it is desirable to decouple the vhost-vdpa IOTLB abstraction from the virtio device life cycle, i.e. mappings can be kept around intact across virtio device reset. Leverage the .reset_map callback, which is meant to destroy the regular MR on the given ASID and recreate the initial DMA mapping. That way, the device .reset op can run free from having to maintain and clean up memory mappings by itself. The cvq mapping also needs to be cleared if is in the given ASID. Co-developed-by: Dragos Tatulea Signed-off-by: Dragos Tatulea Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 + drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 +- 3 files changed, 31 insertions(+), 5 deletions(-) diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h b/drivers/vdpa/mlx5/core/mlx5_vdpa.h index db988ced5a5d..84547d998bcf 100644 --- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h +++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h @@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev, struct vhost_iotlb *iotlb, unsigned int asid); int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev); +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid); #define mlx5_vdpa_warn(__dev, format, ...) \ dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, __func__, __LINE__, \ diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c index 66530e28f327..2197c46e563a 100644 --- a/drivers/vdpa/mlx5/core/mr.c +++ b/drivers/vdpa/mlx5/core/mr.c @@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev) return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0); } + +int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid) +{ + if (asid >= MLX5_VDPA_NUM_AS) + return -EINVAL; + + mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]); + + if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { + if (mlx5_vdpa_create_dma_mr(mvdev)) + mlx5_vdpa_warn(mvdev, "create DMA MR failed\n"); + } else { + mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid); + } + + return 0; +} diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c b/drivers/vdpa/mlx5/net/mlx5_vnet.c index 6abe02310f2b..928e71bc5571 100644 --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c @@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) unregister_link_notifier(ndev); teardown_driver(ndev); clear_vqs_ready(ndev); - mlx5_vdpa_destroy_mr_resources(&ndev->mvdev); ndev->mvdev.status = 0; ndev->mvdev.suspended = false; ndev->cur_num_vqs = 0; @@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev) init_group_to_asid_map(mvdev); ++mvdev->generation; - if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) { - if (mlx5_vdpa_create_dma_mr(mvdev)) - mlx5_vdpa_warn(mvdev, "create MR failed\n"); - } up_write(&ndev->reslock); return 0; @@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, unsigned int asid, return err; } +static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid) +{ + struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); + struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev); + int err; + + down_write(&ndev->reslock); + err = mlx5_vdpa_reset_mr(mvdev, asid); + up_write(&ndev->reslock); + return err; +} + static struct device *mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx) { struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev); @@ -3199,6 +3206,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = { .set_config = mlx5_vdpa_set_config, .get_generation = mlx5_vdpa_get_generation, .set_map = mlx5_vdpa_set_map, + .reset_map = mlx5_vdpa_reset_map, .set_group_asid = mlx5_set_group_asid, .get_vq_dma_dev = mlx5_get_vq_dma_dev, .free = mlx5_vdpa_free, -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 0/4] vdpa: decouple reset of iotlb mapping from device reset
In order to reduce needlessly high setup and teardown cost of iotlb mapping during live migration, it's crucial to decouple the vhost-vdpa iotlb abstraction from the virtio device life cycle, i.e. iotlb mappings should be left intact across virtio device reset [1]. For it to work, the on-chip IOMMU parent device could implement a separate .reset_map() operation callback to restore 1:1 DMA mapping without having to resort to the .reset() callback, the latter of which is mainly used to reset virtio device state. This new .reset_map() callback will be invoked only before the vhost-vdpa driver is to be removed and detached from the vdpa bus, such that other vdpa bus drivers, e.g. virtio-vdpa, can start with 1:1 DMA mapping when they are attached. For the context, those on-chip IOMMU parent devices, create the 1:1 DMA mapping at vdpa device creation, and they would implicitly destroy the 1:1 mapping when the first .set_map or .dma_map callback is invoked. This patchset is based off of the descriptor group v3 series from Dragos. [2] [1] Reducing vdpa migration downtime because of memory pin / maps https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html [2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatu...@nvidia.com/ --- v2: - improved commit message to clarify the intended csope of .reset_map API - improved commit messages to clarify no breakage on older userspace v1: - rewrote commit messages to include more detailed description and background - reword to vendor specific IOMMU implementation from on-chip IOMMU - include parent device backend feautres to persistent iotlb precondition - reimplement mlx5_vdpa patch on top of descriptor group series RFC v3: - fix missing return due to merge error in patch #4 RFC v2: - rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table group" series: https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/ --- Si-Wei Liu (4): vdpa: introduce .reset_map operation callback vhost-vdpa: reset vendor specific mapping to initial state in .release vhost-vdpa: introduce IOTLB_PERSIST backend feature bit vdpa/mlx5: implement .reset_map driver op drivers/vdpa/mlx5/core/mlx5_vdpa.h | 1 + drivers/vdpa/mlx5/core/mr.c| 17 drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 - drivers/vhost/vdpa.c | 31 ++ include/linux/vdpa.h | 10 ++ include/uapi/linux/vhost_types.h | 2 ++ 6 files changed, 74 insertions(+), 5 deletions(-) -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 1/4] vdpa: introduce .reset_map operation callback
Device specific IOMMU parent driver who wishes to see mapping to be decoupled from virtio or vdpa device life cycle (device reset) can use it to restore memory mapping in the device IOMMU to the initial or default state. The reset of mapping is done per address space basis. The reason why a separate .reset_map op is introduced is because this allows a simple on-chip IOMMU model without exposing too much device implementation details to the upper vdpa layer. The .dma_map/unmap or .set_map driver API is meant to be used to manipulate the IOTLB mappings, and has been abstracted in a way similar to how a real IOMMU device maps or unmaps pages for certain memory ranges. However, apart from this there also exists other mapping needs, in which case 1:1 passthrough mapping has to be used by other users (read virtio-vdpa). To ease parent/vendor driver implementation and to avoid abusing DMA ops in an unexpacted way, these on-chip IOMMU devices can start with 1:1 passthrough mapping mode initially at the he time of creation. Then the .reset_map op can be used to switch iotlb back to this initial state without having to expose a complex two-dimensional IOMMU device model. The .reset_map is not a MUST for every parent that implements the .dma_map or .set_map API, because there could be software vDPA devices (which has use_va=true) that don't have to pin kernel memory so they don't care much about high mapping cost during device reset. And those software devices may have also implemented their own DMA ops, so don't have to use .reset_map to achieve a simple IOMMU device model for 1:1 passthrough mapping, either. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez Acked-by: Jason Wang --- include/linux/vdpa.h | 10 ++ 1 file changed, 10 insertions(+) diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h index d376309b99cf..26ae6ae1eac3 100644 --- a/include/linux/vdpa.h +++ b/include/linux/vdpa.h @@ -327,6 +327,15 @@ struct vdpa_map_file { * @iova: iova to be unmapped * @size: size of the area * Returns integer: success (0) or error (< 0) + * @reset_map: Reset device memory mapping to the default + * state (optional) + * Needed for devices that are using device + * specific DMA translation and prefer mapping + * to be decoupled from the virtio life cycle, + * i.e. device .reset op does not reset mapping + * @vdev: vdpa device + * @asid: address space identifier + * Returns integer: success (0) or error (< 0) * @get_vq_dma_dev:Get the dma device for a specific * virtqueue (optional) * @vdev: vdpa device @@ -405,6 +414,7 @@ struct vdpa_config_ops { u64 iova, u64 size, u64 pa, u32 perm, void *opaque); int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid, u64 iova, u64 size); + int (*reset_map)(struct vdpa_device *vdev, unsigned int asid); int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group, unsigned int asid); struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx); -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f57b95..a3f8160c9807 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ + vhost_vdpa_reset_map(v, asid); kfree(as); return 0; -- 2.39.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
Userspace needs this feature flag to distinguish if vhost-vdpa iotlb in the kernel can be trusted to persist IOTLB mapping across vDPA reset. Without it, userspace has no way to tell apart if it's running on an older kernel, which could silently drop all iotlb mapping across vDPA reset, especially with broken parent driver implementation for the .reset driver op. The broken driver may incorrectly drop all mappings of its own as part of .reset, which inadvertently ends up with corrupted mapping state between vhost-vdpa userspace and the kernel. As a workaround, to make the mapping behaviour predictable across reset, userspace has to pro-actively remove all mappings before vDPA reset, and then restore all the mappings afterwards. This workaround is done unconditionally on top of all parent drivers today, due to the parent driver implementation issue and no means to differentiate. This workaround had been utilized in QEMU since day one when the corresponding vhost-vdpa userspace backend came to the world. There are 3 cases that backend may claim this feature bit on for: - parent device that has to work with platform IOMMU - parent device with on-chip IOMMU that has the expected .reset_map support in driver - parent device with vendor specific IOMMU implementation with persistent IOTLB mapping already that has to specifically declare this backend feature The reason why .reset_map is being one of the pre-condition for persistent iotlb is because without it, vhost-vdpa can't switch back iotlb to the initial state later on, especially for the on-chip IOMMU case which starts with identity mapping at device creation. virtio-vdpa requires on-chip IOMMU to perform 1:1 passthrough translation from PA to IOVA as-is to begin with, and .reset_map is the only means to turn back iotlb to the identity mapping mode after vhost-vdpa is gone. The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start( started = false), but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the driver bug has been solved. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Signed-off-by: Si-Wei Liu Acked-by: Eugenio Pérez --- drivers/vhost/vdpa.c | 15 +++ include/uapi/linux/vhost_types.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index a3f8160c9807..9202986a7d81 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -438,6 +438,15 @@ static u64 vhost_vdpa_get_backend_features(const struct vhost_vdpa *v) return ops->get_backend_features(vdpa); } +static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + return (!ops->set_map && !ops->dma_map) || ops->reset_map || + vhost_vdpa_get_backend_features(v) & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); +} + static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep) { struct vdpa_device *vdpa = v->vdpa; @@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, return -EFAULT; if (features & ~(VHOST_VDPA_BACKEND_FEATURES | BIT_ULL(VHOST_BACKEND_F_DESC_ASID) | +BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) | BIT_ULL(VHOST_BACKEND_F_SUSPEND) | BIT_ULL(VHOST_BACKEND_F_RESUME) | BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK))) @@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) && !vhost_vdpa_has_desc_group(v)) return -EOPNOTSUPP; + if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) && +!vhost_vdpa_has_persistent_map(v)) + return -EOPNOTSUPP; vhost_set_backend_features(&v->vdev, features); return 0; } @@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep, features |= BIT_ULL(VHOST_BACKEND_F_RESUME); if (vhost_vdpa_has_desc_group(v)) features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID); + if (vhost_vdpa_has_persistent_map(v)) + features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST); features |= vhost_vdpa_get_backend_features(v); if (copy_to_user(featurep, &features, sizeof(features))) r = -EFAULT; diff --git a/include/uapi/linu
Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH
On Mon, Oct 16, 2023 at 7:53 PM Alexei Starovoitov wrote: > > On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki > wrote: > > > > On 2023/10/16 1:07, Alexei Starovoitov wrote: > > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki > > > wrote: > > >> > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > >> index 0448700890f7..298634556fab 100644 > > >> --- a/include/uapi/linux/bpf.h > > >> +++ b/include/uapi/linux/bpf.h > > >> @@ -988,6 +988,7 @@ enum bpf_prog_type { > > >> BPF_PROG_TYPE_SK_LOOKUP, > > >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > > >> BPF_PROG_TYPE_NETFILTER, > > >> + BPF_PROG_TYPE_VNET_HASH, > > > > > > Sorry, we do not add new stable program types anymore. > > > > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { > > >> __u8 tstamp_type; > > >> __u32 :24; /* Padding, future use. */ > > >> __u64 hwtstamp; > > >> + > > >> + __u32 vnet_hash_value; > > >> + __u16 vnet_hash_report; > > >> + __u16 vnet_rss_queue; > > >> }; > > > > > > we also do not add anything to uapi __sk_buff. > > > > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > > >> + .get_func_proto = sk_filter_func_proto, > > >> + .is_valid_access= sk_filter_is_valid_access, > > >> + .convert_ctx_access = bpf_convert_ctx_access, > > >> + .gen_ld_abs = bpf_gen_ld_abs, > > >> +}; > > > > > > and we don't do ctx rewrites like this either. > > > > > > Please see how hid-bpf and cgroup rstat are hooking up bpf > > > in _unstable_ way. > > > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF > > and I'm worried if it may mean the interface stability. > > > > Let me describe the context. QEMU bundles an eBPF program that is used > > for the "eBPF steering program" feature of tun. Now I'm proposing to > > extend the feature to allow to return some values to the userspace and > > vhost_net. As such, the extension needs to be done in a way that ensures > > interface stability. > > bpf is not an option then. > we do not add stable bpf program types or hooks any more. > If a kernel subsystem wants to use bpf it needs to accept the fact > that such bpf extensibility will be unstable and subsystem maintainers > can decide to remove such bpf support in the future. Based on hooks for tracepoints and kfuncs, correct? Perhaps the existing stable flow dissector type is extensible to optionally compute the hash and report hash and hash type. Else we probably should revisit the previous version of this series. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH
On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki wrote: > > On 2023/10/16 1:07, Alexei Starovoitov wrote: > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki > > wrote: > >> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > >> index 0448700890f7..298634556fab 100644 > >> --- a/include/uapi/linux/bpf.h > >> +++ b/include/uapi/linux/bpf.h > >> @@ -988,6 +988,7 @@ enum bpf_prog_type { > >> BPF_PROG_TYPE_SK_LOOKUP, > >> BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */ > >> BPF_PROG_TYPE_NETFILTER, > >> + BPF_PROG_TYPE_VNET_HASH, > > > > Sorry, we do not add new stable program types anymore. > > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff { > >> __u8 tstamp_type; > >> __u32 :24; /* Padding, future use. */ > >> __u64 hwtstamp; > >> + > >> + __u32 vnet_hash_value; > >> + __u16 vnet_hash_report; > >> + __u16 vnet_rss_queue; > >> }; > > > > we also do not add anything to uapi __sk_buff. > > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = { > >> + .get_func_proto = sk_filter_func_proto, > >> + .is_valid_access= sk_filter_is_valid_access, > >> + .convert_ctx_access = bpf_convert_ctx_access, > >> + .gen_ld_abs = bpf_gen_ld_abs, > >> +}; > > > > and we don't do ctx rewrites like this either. > > > > Please see how hid-bpf and cgroup rstat are hooking up bpf > > in _unstable_ way. > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF > and I'm worried if it may mean the interface stability. > > Let me describe the context. QEMU bundles an eBPF program that is used > for the "eBPF steering program" feature of tun. Now I'm proposing to > extend the feature to allow to return some values to the userspace and > vhost_net. As such, the extension needs to be done in a way that ensures > interface stability. bpf is not an option then. we do not add stable bpf program types or hooks any more. If a kernel subsystem wants to use bpf it needs to accept the fact that such bpf extensibility will be unstable and subsystem maintainers can decide to remove such bpf support in the future. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote: On Mon, Oct 16, 2023 at 8:33 AM Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). For two reasons: 1) backend features need acked by userspace this is by design 2) keep the odd behaviour seems to be more safe as we can't audit every userspace program The old behavior (without flag ack) cannot be trusted already, as: * Devices using platform IOMMU (in other words, not implementing neither .set_map nor .dma_map) does not unmap memory at virtio reset. * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset, now all backends work the same as far as I know., which was (and is) the way devices using the platform IOMMU works. The difference in behavior did not matter as QEMU unmaps all the memory unregistering the memory listener at vhost_vdpa_dev_start(..., started = false), Exactly. It's not just QEMU, but any (older) userspace manipulates mappings through the vhost-vdpa iotlb interface has to unmap all mappings to workaround the vdpa parent driver bug. If they don't do explicit unmap, it would cause state inconsistency between vhost-vdpa and parent driver, then old mappings can't be restored, and new mapping can be added to iotlb after vDPA reset. There's no point to preserve this broken and inconsistent behavior between vhost-vdpa and parent driver, as userspace doesn't care at all! but the backend acknowledging this feature flag allows QEMU to make sure it is safe to skip this unmap & map in the case of vhost stop & start cycle. In that sense, this feature flag is actually a signal for userspace to know that the bug has been solved. Right, I couldn't say it better than you do, thanks! The feature flag is more of an unusual means to indicating kernel bug having been fixed, rather than introduce a new feature or new kernel behavior ending up in change of userspace's expectation. Not offering it indicates that userspace cannot trust the kernel will retain the maps. Si-Wei or Dragos, please correct m
Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release
On 10/15/2023 11:32 PM, Jason Wang wrote: On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu wrote: On 10/12/2023 8:01 PM, Jason Wang wrote: On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu wrote: Devices with on-chip IOMMU or vendor specific IOTLB implementation may need to restore iotlb mapping to the initial or default state using the .reset_map op, as it's desirable for some parent devices to solely manipulate mappings by its own, independent of virtio device state. For instance, device reset does not cause mapping go away on such IOTLB model in need of persistent mapping. Before vhost-vdpa is going away, give them a chance to reset iotlb back to the initial state in vhost_vdpa_cleanup(). Signed-off-by: Si-Wei Liu --- drivers/vhost/vdpa.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 851535f..a3f8160 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -131,6 +131,15 @@ static struct vhost_vdpa_as *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v, return vhost_vdpa_alloc_as(v, asid); } +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid) +{ + struct vdpa_device *vdpa = v->vdpa; + const struct vdpa_config_ops *ops = vdpa->config; + + if (ops->reset_map) + ops->reset_map(vdpa, asid); +} + static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) { struct vhost_vdpa_as *as = asid_to_as(v, asid); @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid) hlist_del(&as->hash_link); vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid); + /* +* Devices with vendor specific IOMMU may need to restore +* iotlb to the initial or default state which is not done +* through device reset, as the IOTLB mapping manipulation +* could be decoupled from the virtio device life cycle. +*/ Should we do this according to whether IOTLB_PRESIST is set? Well, in theory this seems like so but it's unnecessary code change actually, as that is the way how vDPA parent behind platform IOMMU works today, and userspace doesn't break as of today. :) Well, this is one question I've ever asked before. You have explained that one of the reason that we don't break userspace is that they may couple IOTLB reset with vDPA reset as well. One example is the Qemu. Nope, it was the opposite. Maybe it was not clear enough, let me try once more - userspace CANNOT decouple IOTLB reset from vDPA reset today. This is because of bug/discrepancy in mlx5_vdap and vdpa_sim already breaking userspace's expectation, rendering the brokenness/inconsistency on vhost-vdpa mapping interface from behaving what it promised and should have done. Only with the IOTLB_PERSIST flag seen userspace can trust vhost-vdpa kernel interface *reliably* to decouple IOTLB reset from vDPA reset. Without seeing this flag, no matter how the code in QEMU was written, today's older userspace was never like to assume the mappings will *definitely* be cleared by vDPA reset. If any userspace implementation wants to get consistent behavior for all vDPA parent devices, it still has to *explicitly* clear all existing mappings by its own by sending bunch of unmap (iotlb invalidate) requests to vhost-vdpa kernel before resetting the vDPA backend. In brief, userspace is already broken by kernel implementation today, and new userspace needs some device flag to know for sure if kernel bug has already been fixed; older userspace doesn't care about preserving the broken kernel behavior at all, regardless whether or not it wants to decouple IOTLB from vDPA reset. As explained in previous threads [1][2], when IOTLB_PERSIST is not set it doesn't necessarily mean the iotlb will definitely be destroyed across reset (think about the platform IOMMU case), so userspace today is already tolerating enough with either good or bad IOMMU. This code of not checking IOTLB_PERSIST being set is intentional, there's no point to emulate bad IOMMU behavior even for older userspace (with improper emulation to be done it would result in even worse performance). For two reasons: 1) backend features need acked by userspace this is by design There's no breakage on this part. Backend feature IOTLB_PERSIST won't be set if userspace doesn't ack. 2) keep the odd behaviour seems to be more safe as we can't audit every userspace program Definitely don't have to audit every userspace program, but I cannot think of a case where a sane userspace program can be broken. Can you elaborate one or two potential userspace usage that may break because of this? As said, platform IOMMU already did it this way. Regards, -Siwei Thanks I think the purpose of the IOTLB_PERSIST flag is just to give userspace 100% certainty of persistent iotlb mapping not getting lost across vdpa reset. Thanks, -Siwei [1] https://lore.kernel.org/virtualizat
Re: [RFC PATCH] ALSA: virtio: use copy and fill_silence callbacks
On Thu, Oct 12, 2023 at 05:10:50PM +0200, Matias Ezequiel Vara Larsen wrote: > This commit replaces the mmap mechanism with the copy() and > fill_silence() callbacks for both capturing and playback for the > virtio-sound driver. This change is required to prevent the updating of > the content of a buffer that is already in the available ring. > > The current mechanism splits a dma buffer into descriptors that are > exposed to the device. This dma buffer is shared with the user > application. When the device consumes a buffer, the driver moves the > request from the used ring to available ring. > > The driver exposes the buffer to the device without knowing if the > content has been updated from the user. The section 2.8.21.1 of the > virtio spec states that: "The device MAY access the descriptor chains > the driver created and the memory they refer to immediately". If the > device picks up buffers from the available ring just after it is > notified, it happens that the content may be old. > > By providing the copy() callback, the driver first updates the content > of the buffer, and then, exposes the buffer to the device by enqueuing > it in the available ring. Thus, device always picks up a buffer that is > updated. > > For capturing, the driver starts by exposing all the available buffers > to device. After device updates the content of a buffer, it enqueues it > in the used ring. It is only after the copy() for capturing is issued > that the driver re-enqueues the buffer in the available ring. > > Note that the copy() function assumes that user is always writing a > period. Testing shows that this is true but I may be wrong. This RFC > aims at clarifying this. This sounds like an ALSA driver API question. Jaroslav and Takashi: Any thoughts? Stefan > > Signed-off-by: Matias Ezequiel Vara Larsen > --- > sound/virtio/virtio_pcm.c | 11 ++-- > sound/virtio/virtio_pcm.h | 9 +++- > sound/virtio/virtio_pcm_msg.c | 50 --- > sound/virtio/virtio_pcm_ops.c | 94 +++ > 4 files changed, 137 insertions(+), 27 deletions(-) > > diff --git a/sound/virtio/virtio_pcm.c b/sound/virtio/virtio_pcm.c > index c10d91fff2fb..bfe982952303 100644 > --- a/sound/virtio/virtio_pcm.c > +++ b/sound/virtio/virtio_pcm.c > @@ -104,8 +104,6 @@ static int virtsnd_pcm_build_hw(struct > virtio_pcm_substream *vss, >* only message-based transport. >*/ > vss->hw.info = > - SNDRV_PCM_INFO_MMAP | > - SNDRV_PCM_INFO_MMAP_VALID | > SNDRV_PCM_INFO_BATCH | > SNDRV_PCM_INFO_BLOCK_TRANSFER | > SNDRV_PCM_INFO_INTERLEAVED | > @@ -471,12 +469,11 @@ int virtsnd_pcm_build_devs(struct virtio_snd *snd) > for (kss = ks->substream; kss; kss = kss->next) > vs->substreams[kss->number]->substream = kss; > > - snd_pcm_set_ops(vpcm->pcm, i, &virtsnd_pcm_ops); > + if (i == SNDRV_PCM_STREAM_CAPTURE) > + snd_pcm_set_ops(vpcm->pcm, i, > &virtsnd_pcm_capture_ops); > + else > + snd_pcm_set_ops(vpcm->pcm, i, > &virtsnd_pcm_playback_ops); > } > - > - snd_pcm_set_managed_buffer_all(vpcm->pcm, > -SNDRV_DMA_TYPE_VMALLOC, NULL, > -0, 0); > } > > return 0; > diff --git a/sound/virtio/virtio_pcm.h b/sound/virtio/virtio_pcm.h > index 062eb8e8f2cf..1c1106ec971f 100644 > --- a/sound/virtio/virtio_pcm.h > +++ b/sound/virtio/virtio_pcm.h > @@ -50,6 +50,8 @@ struct virtio_pcm_substream { > struct work_struct elapsed_period; > spinlock_t lock; > size_t buffer_bytes; > + u8 *buffer; > + size_t buffer_sz; > size_t hw_ptr; > bool xfer_enabled; > bool xfer_xrun; > @@ -90,7 +92,8 @@ struct virtio_pcm { > struct virtio_pcm_stream streams[SNDRV_PCM_STREAM_LAST + 1]; > }; > > -extern const struct snd_pcm_ops virtsnd_pcm_ops; > +extern const struct snd_pcm_ops virtsnd_pcm_playback_ops; > +extern const struct snd_pcm_ops virtsnd_pcm_capture_ops; > > int virtsnd_pcm_validate(struct virtio_device *vdev); > > @@ -117,7 +120,9 @@ int virtsnd_pcm_msg_alloc(struct virtio_pcm_substream > *vss, > > void virtsnd_pcm_msg_free(struct virtio_pcm_substream *vss); > > -int virtsnd_pcm_msg_send(struct virtio_pcm_substream *vss); > +int virtsnd_pcm_msg_send(struct virtio_pcm_substream *vss, bool single); > + > +int virtsnd_pcm_msg_send_locked(struct virtio_pcm_substream *vss, bool > single); > > unsigned int virtsnd_pcm_msg_pending_num(struct virtio_pcm_substream *vss); > > diff --git a/sound/virtio/virtio_pcm_msg.c b/sound/virtio/virtio_pcm_msg.c > index aca2dc1989ba..9a5f9814cb62 100644 > --- a/sound/virtio/virtio_pcm_msg.c > +++ b/sound/virtio/virtio_pcm_msg.c > @@ -132,7 +132,6 @@ static
Re: [PATCH v2 3/4] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2
On 16.10.23 16:29, Peter Zijlstra wrote: On Mon, Oct 16, 2023 at 02:39:32PM +0200, Juergen Gross wrote: Instead of stacking alternative and paravirt patching, use the new ALT_FLAG_CALL flag to switch those mixed calls to pure alternative handling. This eliminates the need to be careful regarding the sequence of alternative and paravirt patching. For call depth tracking callthunks_setup() needs to be adapted to patch calls at alternative patching sites instead of paravirt calls. Signed-off-by: Juergen Gross I cannot help but feel this would've been better as two patches, one introducing ALT_NOT_XEN and then a second with the rest. In case I need to respin I'll split it up. Regardless, Acked-by: Peter Zijlstra (Intel) Thanks, Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 4/4] x86/paravirt: remove no longer needed paravirt patching code
On Mon, Oct 16, 2023 at 02:39:33PM +0200, Juergen Gross wrote: > Now that paravirt is using the alternatives patching infrastructure, > remove the paravirt patching code. > > Signed-off-by: Juergen Gross > --- > arch/x86/include/asm/paravirt.h | 18 > arch/x86/include/asm/paravirt_types.h | 40 > arch/x86/include/asm/text-patching.h | 12 - > arch/x86/kernel/alternative.c | 66 +-- > arch/x86/kernel/paravirt.c| 30 > arch/x86/kernel/vmlinux.lds.S | 13 -- > arch/x86/tools/relocs.c | 2 +- > 7 files changed, 3 insertions(+), 178 deletions(-) More - more better! :-) Acked-by: Peter Zijlstra (Intel) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 3/4] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2
On Mon, Oct 16, 2023 at 02:39:32PM +0200, Juergen Gross wrote: > Instead of stacking alternative and paravirt patching, use the new > ALT_FLAG_CALL flag to switch those mixed calls to pure alternative > handling. > > This eliminates the need to be careful regarding the sequence of > alternative and paravirt patching. > > For call depth tracking callthunks_setup() needs to be adapted to patch > calls at alternative patching sites instead of paravirt calls. > > Signed-off-by: Juergen Gross I cannot help but feel this would've been better as two patches, one introducing ALT_NOT_XEN and then a second with the rest. Regardless, Acked-by: Peter Zijlstra (Intel) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 4/4] x86/paravirt: remove no longer needed paravirt patching code
Now that paravirt is using the alternatives patching infrastructure, remove the paravirt patching code. Signed-off-by: Juergen Gross --- arch/x86/include/asm/paravirt.h | 18 arch/x86/include/asm/paravirt_types.h | 40 arch/x86/include/asm/text-patching.h | 12 - arch/x86/kernel/alternative.c | 66 +-- arch/x86/kernel/paravirt.c| 30 arch/x86/kernel/vmlinux.lds.S | 13 -- arch/x86/tools/relocs.c | 2 +- 7 files changed, 3 insertions(+), 178 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 9c6c5cfa9fe2..f09acce9432c 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -725,31 +725,13 @@ void native_pv_lock_init(void) __init; #else /* __ASSEMBLY__ */ -#define _PVSITE(ptype, ops, word, algn)\ -771:; \ - ops;\ -772:; \ - .pushsection .parainstructions,"a"; \ -.align algn; \ -word 771b; \ -.byte ptype; \ -.byte 772b-771b; \ -_ASM_ALIGN;\ - .popsection - - #ifdef CONFIG_X86_64 #ifdef CONFIG_PARAVIRT_XXL #ifdef CONFIG_DEBUG_ENTRY -#define PARA_PATCH(off)((off) / 8) -#define PARA_SITE(ptype, ops) _PVSITE(ptype, ops, .quad, 8) #define PARA_INDIRECT(addr)*addr(%rip) .macro PARA_IRQ_save_fl - PARA_SITE(PARA_PATCH(PV_IRQ_save_fl), - ANNOTATE_RETPOLINE_SAFE; - call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);) ANNOTATE_RETPOLINE_SAFE; call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl); .endm diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 323dca625eea..756cb75d22b5 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -2,15 +2,6 @@ #ifndef _ASM_X86_PARAVIRT_TYPES_H #define _ASM_X86_PARAVIRT_TYPES_H -#ifndef __ASSEMBLY__ -/* These all sit in the .parainstructions section to tell us what to patch. */ -struct paravirt_patch_site { - u8 *instr; /* original instructions */ - u8 type;/* type of this instruction */ - u8 len; /* length of original instruction */ -}; -#endif - #ifdef CONFIG_PARAVIRT #ifndef __ASSEMBLY__ @@ -250,34 +241,6 @@ struct paravirt_patch_template { extern struct pv_info pv_info; extern struct paravirt_patch_template pv_ops; -#define PARAVIRT_PATCH(x) \ - (offsetof(struct paravirt_patch_template, x) / sizeof(void *)) - -#define paravirt_type(op) \ - [paravirt_typenum] "i" (PARAVIRT_PATCH(op)),\ - [paravirt_opptr] "m" (pv_ops.op) -/* - * Generate some code, and mark it as patchable by the - * apply_paravirt() alternate instruction patcher. - */ -#define _paravirt_alt(insn_string, type) \ - "771:\n\t" insn_string "\n" "772:\n"\ - ".pushsection .parainstructions,\"a\"\n"\ - _ASM_ALIGN "\n" \ - _ASM_PTR " 771b\n" \ - " .byte " type "\n"\ - " .byte 772b-771b\n" \ - _ASM_ALIGN "\n" \ - ".popsection\n" - -/* Generate patchable code, with the default asm parameters. */ -#define paravirt_alt(insn_string) \ - _paravirt_alt(insn_string, "%c[paravirt_typenum]") - -/* Simple instruction patching code. */ -#define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t" - -unsigned int paravirt_patch(u8 type, void *insn_buff, unsigned long addr, unsigned int len); #define paravirt_ptr(op) [paravirt_opptr] "m" (pv_ops.op) int paravirt_disable_iospace(void); @@ -545,9 +508,6 @@ unsigned long pv_native_read_cr2(void); #define paravirt_nop ((void *)x86_nop) -extern struct paravirt_patch_site __parainstructions[], - __parainstructions_end[]; - #endif /* __ASSEMBLY__ */ #define ALT_NOT_XENALT_NOT(X86_FEATURE_XENPV) diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h index 29832c338cdc..0b70653a98c1 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -6,18 +6,6 @@ #include #include -struct paravirt_patch_site; -#ifdef CONFIG_PARAVIRT -void apply_paravirt(struct paravirt_patch_site *start, - struct paravirt_patch_site *end); -#else -static inline void apply_paravirt(struct paravirt_patch_site *start, - struct paravirt_patch_site
[PATCH v2 3/4] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2
Instead of stacking alternative and paravirt patching, use the new ALT_FLAG_CALL flag to switch those mixed calls to pure alternative handling. This eliminates the need to be careful regarding the sequence of alternative and paravirt patching. For call depth tracking callthunks_setup() needs to be adapted to patch calls at alternative patching sites instead of paravirt calls. Signed-off-by: Juergen Gross --- arch/x86/include/asm/alternative.h| 5 +-- arch/x86/include/asm/paravirt.h | 49 +-- arch/x86/include/asm/paravirt_types.h | 29 +++- arch/x86/kernel/callthunks.c | 17 +- arch/x86/kernel/module.c | 20 +++ 5 files changed, 51 insertions(+), 69 deletions(-) diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h index dd63b96577e9..f6c0ff678e2e 100644 --- a/arch/x86/include/asm/alternative.h +++ b/arch/x86/include/asm/alternative.h @@ -89,6 +89,8 @@ struct alt_instr { u8 replacementlen; /* length of new instruction */ } __packed; +extern struct alt_instr __alt_instructions[], __alt_instructions_end[]; + /* * Debug flag that can be tested to see whether alternative * instructions were patched in already: @@ -104,11 +106,10 @@ extern void apply_fineibt(s32 *start_retpoline, s32 *end_retpoine, s32 *start_cfi, s32 *end_cfi); struct module; -struct paravirt_patch_site; struct callthunk_sites { s32 *call_start, *call_end; - struct paravirt_patch_site *pv_start, *pv_end; + struct alt_instr*alt_start, *alt_end; }; #ifdef CONFIG_CALL_THUNKS diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index ed5c7342f2ef..9c6c5cfa9fe2 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -142,8 +142,7 @@ static inline void write_cr0(unsigned long x) static __always_inline unsigned long read_cr2(void) { return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2, - "mov %%cr2, %%rax;", - ALT_NOT(X86_FEATURE_XENPV)); + "mov %%cr2, %%rax;", ALT_NOT_XEN); } static __always_inline void write_cr2(unsigned long x) @@ -154,13 +153,12 @@ static __always_inline void write_cr2(unsigned long x) static inline unsigned long __read_cr3(void) { return PVOP_ALT_CALL0(unsigned long, mmu.read_cr3, - "mov %%cr3, %%rax;", ALT_NOT(X86_FEATURE_XENPV)); + "mov %%cr3, %%rax;", ALT_NOT_XEN); } static inline void write_cr3(unsigned long x) { - PVOP_ALT_VCALL1(mmu.write_cr3, x, - "mov %%rdi, %%cr3", ALT_NOT(X86_FEATURE_XENPV)); + PVOP_ALT_VCALL1(mmu.write_cr3, x, "mov %%rdi, %%cr3", ALT_NOT_XEN); } static inline void __write_cr4(unsigned long x) @@ -182,7 +180,7 @@ extern noinstr void pv_native_wbinvd(void); static __always_inline void wbinvd(void) { - PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT(X86_FEATURE_XENPV)); + PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT_XEN); } static inline u64 paravirt_read_msr(unsigned msr) @@ -390,27 +388,25 @@ static inline void paravirt_release_p4d(unsigned long pfn) static inline pte_t __pte(pteval_t val) { return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val, - "mov %%rdi, %%rax", - ALT_NOT(X86_FEATURE_XENPV)) }; + "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pteval_t pte_val(pte_t pte) { return PVOP_ALT_CALLEE1(pteval_t, mmu.pte_val, pte.pte, - "mov %%rdi, %%rax", ALT_NOT(X86_FEATURE_XENPV)); + "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline pgd_t __pgd(pgdval_t val) { return (pgd_t) { PVOP_ALT_CALLEE1(pgdval_t, mmu.make_pgd, val, - "mov %%rdi, %%rax", - ALT_NOT(X86_FEATURE_XENPV)) }; + "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pgdval_t pgd_val(pgd_t pgd) { return PVOP_ALT_CALLEE1(pgdval_t, mmu.pgd_val, pgd.pgd, - "mov %%rdi, %%rax", ALT_NOT(X86_FEATURE_XENPV)); + "mov %%rdi, %%rax", ALT_NOT_XEN); } #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION @@ -444,14 +440,13 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd) static inline pmd_t __pmd(pmdval_t val) { return (pmd_t) { PVOP_ALT_CALLEE1(pmdval_t, mmu.make_pmd, val, - "mov %%rdi, %%rax", - ALT_NOT(X86_FEATURE_XENPV)) }; + "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inl
[PATCH v2 1/4] x86/paravirt: move some functions and defines to alternative
As a preparation for replacing paravirt patching completely by alternative patching, move some backend functions and #defines to alternative code and header. Signed-off-by: Juergen Gross Acked-by: Peter Zijlstra (Intel) --- arch/x86/include/asm/alternative.h| 16 arch/x86/include/asm/paravirt.h | 12 - arch/x86/include/asm/paravirt_types.h | 4 +-- arch/x86/include/asm/qspinlock_paravirt.h | 4 +-- arch/x86/kernel/alternative.c | 10 arch/x86/kernel/kvm.c | 4 +-- arch/x86/kernel/paravirt.c| 30 +++ arch/x86/xen/irq.c| 2 +- 8 files changed, 41 insertions(+), 41 deletions(-) diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h index 9c4da699e11a..67e50bd40bb8 100644 --- a/arch/x86/include/asm/alternative.h +++ b/arch/x86/include/asm/alternative.h @@ -330,6 +330,22 @@ static inline int alternatives_text_reserved(void *start, void *end) */ #define ASM_NO_INPUT_CLOBBER(clbr...) "i" (0) : clbr +/* Macro for creating assembler functions avoiding any C magic. */ +#define DEFINE_ASM_FUNC(func, instr, sec) \ + asm (".pushsection " #sec ", \"ax\"\n" \ +".global " #func "\n\t"\ +".type " #func ", @function\n\t" \ +ASM_FUNC_ALIGN "\n"\ +#func ":\n\t" \ +ASM_ENDBR \ +instr "\n\t" \ +ASM_RET\ +".size " #func ", . - " #func "\n\t" \ +".popsection") + +void x86_BUG(void); +void x86_nop(void); + #else /* __ASSEMBLY__ */ #ifdef CONFIG_SMP diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 6c8ff12140ae..ed5c7342f2ef 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -726,18 +726,6 @@ static __always_inline unsigned long arch_local_irq_save(void) #undef PVOP_VCALL4 #undef PVOP_CALL4 -#define DEFINE_PARAVIRT_ASM(func, instr, sec) \ - asm (".pushsection " #sec ", \"ax\"\n" \ -".global " #func "\n\t"\ -".type " #func ", @function\n\t" \ -ASM_FUNC_ALIGN "\n"\ -#func ":\n\t" \ -ASM_ENDBR \ -instr "\n\t" \ -ASM_RET\ -".size " #func ", . - " #func "\n\t" \ -".popsection") - extern void default_banner(void); void native_pv_lock_init(void) __init; diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 772d03487520..9f84dc074f88 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -542,8 +542,6 @@ int paravirt_disable_iospace(void); __PVOP_VCALL(op, PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2),\ PVOP_CALL_ARG3(arg3), PVOP_CALL_ARG4(arg4)) -void _paravirt_nop(void); -void paravirt_BUG(void); unsigned long paravirt_ret0(void); #ifdef CONFIG_PARAVIRT_XXL u64 _paravirt_ident_64(u64); @@ -553,7 +551,7 @@ void pv_native_irq_enable(void); unsigned long pv_native_read_cr2(void); #endif -#define paravirt_nop ((void *)_paravirt_nop) +#define paravirt_nop ((void *)x86_nop) extern struct paravirt_patch_site __parainstructions[], __parainstructions_end[]; diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h index 85b6e3609cb9..ef9697f20129 100644 --- a/arch/x86/include/asm/qspinlock_paravirt.h +++ b/arch/x86/include/asm/qspinlock_paravirt.h @@ -56,8 +56,8 @@ __PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath, ".spinlock.text"); "pop%rdx\n\t" \ FRAME_END -DEFINE_PARAVIRT_ASM(__raw_callee_save___pv_queued_spin_unlock, - PV_UNLOCK_ASM, .spinlock.text); +DEFINE_ASM_FUNC(__raw_callee_save___pv_queued_spin_unlock, + PV_UNLOCK_ASM, .spinlock.text); #else /* CONFIG_64BIT */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index 73be3931e4f0..1c8dd8e05f3f 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -385,6 +385,16 @@ apply_relocation(u8 *buf, size_t len, u8 *dest, u8 *src, size_t src_len) } } +/* Low-level backend functions usable from alternative code replacements. */ +DEFINE_ASM_FUNC(x86_nop, "", .entry.text); +EXPORT_SYMBOL_GPL(x86_nop); + +noinstr void x86_BUG(void) +{ + BUG(); +} +EXPORT_SYMBOL_GPL(x86_BUG); + /* * Replace instructions with better alternatives for this
[PATCH v2 0/4] x86/paravirt: Get rid of paravirt patching
This is a small series getting rid of paravirt patching by switching completely to alternative patching for the same functionality. The basic idea is to add the capability to switch from indirect to direct calls via a special alternative patching option. This removes _some_ of the paravirt macro maze, but most of it needs to stay due to the need of hiding the call instructions from the compiler in order to avoid needless register save/restore. What is going away is the nasty stacking of alternative and paravirt patching and (of course) the special .parainstructions linker section. I have tested the series on bare metal and as Xen PV domain to still work. Note that objtool might need some changes to cope with the new indirect call patching mechanism. Additionally some paravirt handling can probably be removed from it. Changes in V2: - split last patch into 2 - rebase of patch 2 as suggested by Peter - addressed Peter's comments for patch 3 Juergen Gross (4): x86/paravirt: move some functions and defines to alternative x86/alternative: add indirect call patching x86/paravirt: switch mixed paravirt/alternative calls to alternative_2 x86/paravirt: remove no longer needed paravirt patching code arch/x86/include/asm/alternative.h| 26 - arch/x86/include/asm/paravirt.h | 79 +-- arch/x86/include/asm/paravirt_types.h | 73 +++--- arch/x86/include/asm/qspinlock_paravirt.h | 4 +- arch/x86/include/asm/text-patching.h | 12 --- arch/x86/kernel/alternative.c | 116 ++ arch/x86/kernel/callthunks.c | 17 ++-- arch/x86/kernel/kvm.c | 4 +- arch/x86/kernel/module.c | 20 +--- arch/x86/kernel/paravirt.c| 54 ++ arch/x86/kernel/vmlinux.lds.S | 13 --- arch/x86/tools/relocs.c | 2 +- arch/x86/xen/irq.c| 2 +- 13 files changed, 137 insertions(+), 285 deletions(-) -- 2.35.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 19/19] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY
Now, we supported AF_XDP(xsk). Add NETDEV_XDP_ACT_XSK_ZEROCOPY to xdp_features. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index ac62d0955c13..c66d8509330e 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -4329,7 +4329,8 @@ static int virtnet_probe(struct virtio_device *vdev) dev->hw_features |= NETIF_F_GRO_HW; dev->vlan_features = dev->features; - dev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT; + dev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT | + NETDEV_XDP_ACT_XSK_ZEROCOPY; /* MTU range: 68 - 65535 */ dev->min_mtu = MIN_MTU; -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 13/19] virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk buffer) by the last two types bits. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/virtio_net.h | 16 ++-- drivers/net/virtio/xsk.h| 5 + 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index 7c72a8bb1813..d4e620a084f4 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -243,6 +243,11 @@ struct virtnet_info { #include "xsk.h" +static inline bool virtnet_is_skb_ptr(void *ptr) +{ + return !((unsigned long)ptr & VIRTIO_XMIT_DATA_MASK); +} + static inline bool virtnet_is_xdp_frame(void *ptr) { return (unsigned long)ptr & VIRTIO_XDP_FLAG; @@ -279,11 +284,12 @@ static inline void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data) static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi, struct virtnet_sq_stats *stats) { + unsigned int xsknum = 0; unsigned int len; void *ptr; while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (!virtnet_is_xdp_frame(ptr)) { + if (virtnet_is_skb_ptr(ptr)) { struct sk_buff *skb; if (sq->do_dma) @@ -295,7 +301,7 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi, stats->bytes += skb->len; napi_consume_skb(skb, in_napi); - } else { + } else if (virtnet_is_xdp_frame(ptr)) { struct xdp_frame *frame; if (sq->do_dma) @@ -305,9 +311,15 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi, stats->bytes += xdp_get_frame_len(frame); xdp_return_frame(frame); + } else { + stats->bytes += virtnet_ptr_to_xsk(ptr); + ++xsknum; } stats->packets++; } + + if (xsknum) + xsk_tx_completed(sq->xsk.pool, xsknum); } static inline void virtnet_vq_napi_schedule(struct napi_struct *napi, diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h index 1bd19dcda649..7ebc9bda7aee 100644 --- a/drivers/net/virtio/xsk.h +++ b/drivers/net/virtio/xsk.h @@ -14,6 +14,11 @@ static inline void *virtnet_xsk_to_ptr(u32 len) return (void *)(p | VIRTIO_XSK_FLAG); } +static inline u32 virtnet_ptr_to_xsk(void *ptr) +{ + return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET; +} + int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp); bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool, int budget); -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 16/19] virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer
Implementing the logic of xsk rx. If this packet is not for XSK determined in XDP, then we need to copy once to generate a SKB. If it is for XSK, it is a zerocopy receive packet process. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 14 ++-- drivers/net/virtio/virtio_net.h | 4 ++ drivers/net/virtio/xsk.c| 120 drivers/net/virtio/xsk.h| 4 ++ 4 files changed, 137 insertions(+), 5 deletions(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 0e740447b142..003dd67ab707 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -822,10 +822,10 @@ static void put_xdp_frags(struct xdp_buff *xdp) } } -static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp, - struct net_device *dev, - unsigned int *xdp_xmit, - struct virtnet_rq_stats *stats) +int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp, + struct net_device *dev, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats) { struct xdp_frame *xdpf; int err; @@ -1589,13 +1589,17 @@ static void receive_buf(struct virtnet_info *vi, struct virtnet_rq *rq, return; } - if (vi->mergeable_rx_bufs) + rcu_read_lock(); + if (rcu_dereference(rq->xsk.pool)) + skb = virtnet_receive_xsk(dev, vi, rq, buf, len, xdp_xmit, stats); + else if (vi->mergeable_rx_bufs) skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit, stats); else if (vi->big_packets) skb = receive_big(dev, vi, rq, buf, len, stats); else skb = receive_small(dev, vi, rq, buf, ctx, len, xdp_xmit, stats); + rcu_read_unlock(); if (unlikely(!skb)) return; diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index 6e71622fca45..fd7f34703c9b 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -346,6 +346,10 @@ static inline bool virtnet_is_xdp_raw_buffer_queue(struct virtnet_info *vi, int return false; } +int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp, + struct net_device *dev, + unsigned int *xdp_xmit, + struct virtnet_rq_stats *stats); void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq); void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq); void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq); diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 841fb078882a..f1c64414fac9 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -13,6 +13,18 @@ static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) sg->length = len; } +static unsigned int virtnet_receive_buf_num(struct virtnet_info *vi, char *buf) +{ + struct virtio_net_hdr_mrg_rxbuf *hdr; + + if (vi->mergeable_rx_bufs) { + hdr = (struct virtio_net_hdr_mrg_rxbuf *)buf; + return virtio16_to_cpu(vi->vdev, hdr->num_buffers); + } + + return 1; +} + static void virtnet_xsk_check_queue(struct virtnet_sq *sq) { struct virtnet_info *vi = sq->vq->vdev->priv; @@ -37,6 +49,114 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq) netif_stop_subqueue(dev, qnum); } +static void merge_drop_follow_xdp(struct net_device *dev, + struct virtnet_rq *rq, + u32 num_buf, + struct virtnet_rq_stats *stats) +{ + struct xdp_buff *xdp; + u32 len; + + while (num_buf-- > 1) { + xdp = virtqueue_get_buf(rq->vq, &len); + if (unlikely(!xdp)) { + pr_debug("%s: rx error: %d buffers missing\n", +dev->name, num_buf); + dev->stats.rx_length_errors++; + break; + } + stats->bytes += len; + xsk_buff_free(xdp); + } +} + +static struct sk_buff *construct_skb(struct virtnet_rq *rq, +struct xdp_buff *xdp) +{ + unsigned int metasize = xdp->data - xdp->data_meta; + struct sk_buff *skb; + unsigned int size; + + size = xdp->data_end - xdp->data_hard_start; + skb = napi_alloc_skb(&rq->napi, size); + if (unlikely(!skb)) + return NULL; + + skb_reserve(skb, xdp->data_meta - xdp->data_hard_start); + + size = xdp->data_end - xdp->data_meta; + memcpy(__skb_put(skb, size), xdp->data_meta, size); + + if (metasize) { +
[PATCH net-next v1 18/19] virtio_net: update tx timeout record
If send queue sent some packets, we update the tx timeout record to prevent the tx timeout. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/xsk.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index f1c64414fac9..5d3de505c56c 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -274,6 +274,16 @@ bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool, virtnet_xsk_check_queue(sq); + if (stats.packets) { + struct netdev_queue *txq; + struct virtnet_info *vi; + + vi = sq->vq->vdev->priv; + + txq = netdev_get_tx_queue(vi->dev, sq - vi->sq); + txq_trans_cond_update(txq); + } + u64_stats_update_begin(&sq->stats.syncp); sq->stats.packets += stats.packets; sq->stats.bytes += stats.bytes; -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 14/19] virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer
virtnet_sq_free_unused_buf() check xsk buffer. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 1a21352e..58bb38f9b453 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -3876,10 +3876,12 @@ static void free_receive_page_frags(struct virtnet_info *vi) void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf) { - if (!virtnet_is_xdp_frame(buf)) + if (virtnet_is_skb_ptr(buf)) dev_kfree_skb(buf); - else + else if (virtnet_is_xdp_frame(buf)) xdp_return_frame(virtnet_ptr_to_xdp(buf)); + + /* xsk buffer do not need handle. */ } void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf) -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 15/19] virtio_net: xsk: rx: introduce add_recvbuf_xsk()
Implement the logic of filling vq with XSK buffer. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 13 +++ drivers/net/virtio/virtio_net.h | 5 +++ drivers/net/virtio/xsk.c| 66 - drivers/net/virtio/xsk.h| 2 + 4 files changed, 85 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 58bb38f9b453..0e740447b142 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -1787,9 +1787,20 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq, gfp_t gfp) { + struct xsk_buff_pool *pool; int err; bool oom; + rcu_read_lock(); + pool = rcu_dereference(rq->xsk.pool); + if (pool) { + err = virtnet_add_recvbuf_xsk(vi, rq, pool, gfp); + oom = err == -ENOMEM; + rcu_read_unlock(); + goto kick; + } + rcu_read_unlock(); + do { if (vi->mergeable_rx_bufs) err = add_recvbuf_mergeable(vi, rq, gfp); @@ -1802,6 +1813,8 @@ static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq, if (err) break; } while (rq->vq->num_free); + +kick: if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) { unsigned long flags; diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index d4e620a084f4..6e71622fca45 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -156,6 +156,11 @@ struct virtnet_rq { /* xdp rxq used by xsk */ struct xdp_rxq_info xdp_rxq; + + struct xdp_buff **xsk_buffs; + u32 nxt_idx; + u32 num; + u32 size; } xsk; }; diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 973e783260c3..841fb078882a 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -37,6 +37,58 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq) netif_stop_subqueue(dev, qnum); } +static int virtnet_add_recvbuf_batch(struct virtnet_info *vi, struct virtnet_rq *rq, +struct xsk_buff_pool *pool, gfp_t gfp) +{ + struct xdp_buff **xsk_buffs; + dma_addr_t addr; + u32 len, i; + int err = 0; + + xsk_buffs = rq->xsk.xsk_buffs; + + if (rq->xsk.nxt_idx >= rq->xsk.num) { + rq->xsk.num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->xsk.size); + if (!rq->xsk.num) + return -ENOMEM; + rq->xsk.nxt_idx = 0; + } + + while (rq->xsk.nxt_idx < rq->xsk.num) { + i = rq->xsk.nxt_idx; + + /* use the part of XDP_PACKET_HEADROOM as the virtnet hdr space */ + addr = xsk_buff_xdp_get_dma(xsk_buffs[i]) - vi->hdr_len; + len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len; + + sg_init_table(rq->sg, 1); + sg_fill_dma(rq->sg, addr, len); + + err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp); + if (err) + return err; + + rq->xsk.nxt_idx++; + } + + return 0; +} + +int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq, + struct xsk_buff_pool *pool, gfp_t gfp) +{ + int err; + + do { + err = virtnet_add_recvbuf_batch(vi, rq, pool, gfp); + if (err) + return err; + + } while (rq->vq->num_free); + + return 0; +} + static int virtnet_xsk_xmit_one(struct virtnet_sq *sq, struct xsk_buff_pool *pool, struct xdp_desc *desc) @@ -244,7 +296,7 @@ static int virtnet_xsk_pool_enable(struct net_device *dev, struct virtnet_sq *sq; struct device *dma_dev; dma_addr_t hdr_dma; - int err; + int err, size; /* In big_packets mode, xdp cannot work, so there is no need to * initialize xsk of rq. @@ -276,6 +328,16 @@ static int virtnet_xsk_pool_enable(struct net_device *dev, if (!dma_dev) return -EPERM; + size = virtqueue_get_vring_size(rq->vq); + + rq->xsk.xsk_buffs = kcalloc(size, sizeof(*rq->xsk.xsk_buffs), GFP_KERNEL); + if (!rq->xsk.xsk_buffs) + return -ENOMEM; + + rq->xsk.size = size; + rq->xsk.nxt_idx = 0; + rq->xsk.num = 0; + hdr_dma = dma_map_single(dma_dev, &xsk_hdr, vi->hdr_len, DMA_TO_DEVICE); if (dma_mapping_error(dma_dev, hdr_dma)) return -ENOMEM; @@ -338,6 +400,8 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, u16 qid)
[PATCH net-next v1 12/19] virtio_net: xsk: tx: support wakeup
xsk wakeup is used to trigger the logic for xsk xmit by xsk framework or user. Virtio-Net does not support to actively generate an interruption, so it tries to trigger tx NAPI on the tx interrupt cpu. Consider the effect of cache. When interrupt triggers, it is generally fixed on a CPU. It is better to start TX Napi on the same CPU. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 3 ++ drivers/net/virtio/virtio_net.h | 8 + drivers/net/virtio/xsk.c| 57 + drivers/net/virtio/xsk.h| 1 + 4 files changed, 69 insertions(+) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index a08429bef61f..1a21352e 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -2066,6 +2066,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) return 0; } + sq->xsk.last_cpu = smp_processor_id(); + txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); virtqueue_disable_cb(sq->vq); @@ -3770,6 +3772,7 @@ static const struct net_device_ops virtnet_netdev = { .ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid, .ndo_bpf= virtnet_xdp, .ndo_xdp_xmit = virtnet_xdp_xmit, + .ndo_xsk_wakeup = virtnet_xsk_wakeup, .ndo_features_check = passthru_features_check, .ndo_get_phys_port_name = virtnet_get_phys_port_name, .ndo_set_features = virtnet_set_features, diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index 3bbb1f5baad5..7c72a8bb1813 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -101,6 +101,14 @@ struct virtnet_sq { struct xsk_buff_pool __rcu *pool; dma_addr_t hdr_dma_address; + + u32 last_cpu; + struct __call_single_data csd; + + /* The lock to prevent the repeat of calling +* smp_call_function_single_async(). +*/ + spinlock_t ipi_lock; } xsk; }; diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 0e775a9d270f..973e783260c3 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -115,6 +115,60 @@ bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool, return sent == budget; } +static void virtnet_remote_napi_schedule(void *info) +{ + struct virtnet_sq *sq = info; + + virtnet_vq_napi_schedule(&sq->napi, sq->vq); +} + +static void virtnet_remote_raise_napi(struct virtnet_sq *sq) +{ + u32 last_cpu, cur_cpu; + + last_cpu = sq->xsk.last_cpu; + cur_cpu = get_cpu(); + + /* On remote cpu, softirq will run automatically when ipi irq exit. On +* local cpu, smp_call_xxx will not trigger ipi interrupt, then softirq +* cannot be triggered automatically. So Call local_bh_enable after to +* trigger softIRQ processing. +*/ + if (last_cpu == cur_cpu) { + local_bh_disable(); + virtnet_vq_napi_schedule(&sq->napi, sq->vq); + local_bh_enable(); + } else { + if (spin_trylock(&sq->xsk.ipi_lock)) { + smp_call_function_single_async(last_cpu, &sq->xsk.csd); + spin_unlock(&sq->xsk.ipi_lock); + } + } + + put_cpu(); +} + +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag) +{ + struct virtnet_info *vi = netdev_priv(dev); + struct virtnet_sq *sq; + + if (!netif_running(dev)) + return -ENETDOWN; + + if (qid >= vi->curr_queue_pairs) + return -EINVAL; + + sq = &vi->sq[qid]; + + if (napi_if_scheduled_mark_missed(&sq->napi)) + return 0; + + virtnet_remote_raise_napi(sq); + + return 0; +} + static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *rq, struct xsk_buff_pool *pool) { @@ -240,6 +294,9 @@ static int virtnet_xsk_pool_enable(struct net_device *dev, sq->xsk.hdr_dma_address = hdr_dma; + INIT_CSD(&sq->xsk.csd, virtnet_remote_napi_schedule, sq); + spin_lock_init(&sq->xsk.ipi_lock); + return 0; err_sq: diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h index 73ca8cd5308b..1bd19dcda649 100644 --- a/drivers/net/virtio/xsk.h +++ b/drivers/net/virtio/xsk.h @@ -17,4 +17,5 @@ static inline void *virtnet_xsk_to_ptr(u32 len) int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp); bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool, int budget); +int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag); #endif -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization
[PATCH net-next v1 11/19] virtio_net: xsk: tx: support tx
The driver's tx napi is very important for XSK. It is responsible for obtaining data from the XSK queue and sending it out. At the beginning, we need to trigger tx napi. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 18 +- drivers/net/virtio/virtio_net.h | 3 +- drivers/net/virtio/xsk.c| 108 drivers/net/virtio/xsk.h| 13 4 files changed, 140 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index b320770e5f4e..a08429bef61f 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -2054,7 +2054,9 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) struct virtnet_sq *sq = container_of(napi, struct virtnet_sq, napi); struct virtnet_info *vi = sq->vq->vdev->priv; unsigned int index = vq2txq(sq->vq); + struct xsk_buff_pool *pool; struct netdev_queue *txq; + int busy = 0; int opaque; bool done; @@ -2067,11 +2069,25 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); virtqueue_disable_cb(sq->vq); - free_old_xmit(sq, true); + + rcu_read_lock(); + pool = rcu_dereference(sq->xsk.pool); + if (pool) { + busy |= virtnet_xsk_xmit(sq, pool, budget); + rcu_read_unlock(); + } else { + rcu_read_unlock(); + free_old_xmit(sq, true); + } if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) netif_tx_wake_queue(txq); + if (busy) { + __netif_tx_unlock(txq); + return budget; + } + opaque = virtqueue_enable_cb_prepare(sq->vq); done = napi_complete_done(napi, 0); diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index 9e69b6c5921b..3bbb1f5baad5 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -9,7 +9,8 @@ #include #define VIRTIO_XDP_FLAGBIT(0) -#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG) +#define VIRTIO_XSK_FLAGBIT(1) +#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG) /* RX packet size EWMA. The average packet size is used to determine the packet * buffer size when refilling RX rings. As the entire RX ring may be refilled diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c index 01962a3f..0e775a9d270f 100644 --- a/drivers/net/virtio/xsk.c +++ b/drivers/net/virtio/xsk.c @@ -7,6 +7,114 @@ static struct virtio_net_hdr_mrg_rxbuf xsk_hdr; +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) +{ + sg->dma_address = addr; + sg->length = len; +} + +static void virtnet_xsk_check_queue(struct virtnet_sq *sq) +{ + struct virtnet_info *vi = sq->vq->vdev->priv; + struct net_device *dev = vi->dev; + int qnum = sq - vi->sq; + + /* If it is a raw buffer queue, it does not check whether the status +* of the queue is stopped when sending. So there is no need to check +* the situation of the raw buffer queue. +*/ + if (virtnet_is_xdp_raw_buffer_queue(vi, qnum)) + return; + + /* If this sq is not the exclusive queue of the current cpu, +* then it may be called by start_xmit, so check it running out +* of space. +* +* Stop the queue to avoid getting packets that we are +* then unable to transmit. Then wait the tx interrupt. +*/ + if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) + netif_stop_subqueue(dev, qnum); +} + +static int virtnet_xsk_xmit_one(struct virtnet_sq *sq, + struct xsk_buff_pool *pool, + struct xdp_desc *desc) +{ + struct virtnet_info *vi; + dma_addr_t addr; + + vi = sq->vq->vdev->priv; + + addr = xsk_buff_raw_get_dma(pool, desc->addr); + xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len); + + sg_init_table(sq->sg, 2); + + sg_fill_dma(sq->sg, sq->xsk.hdr_dma_address, vi->hdr_len); + sg_fill_dma(sq->sg + 1, addr, desc->len); + + return virtqueue_add_outbuf(sq->vq, sq->sg, 2, + virtnet_xsk_to_ptr(desc->len), GFP_ATOMIC); +} + +static int virtnet_xsk_xmit_batch(struct virtnet_sq *sq, + struct xsk_buff_pool *pool, + unsigned int budget, + struct virtnet_sq_stats *stats) +{ + struct xdp_desc *descs = pool->tx_descs; + u32 nb_pkts, max_pkts, i; + bool kick = false; + int err; + + max_pkts = min_t(u32, budget, sq->vq->num_free / 2); + + nb_pkts = xsk_tx_peek_release_desc_batch(pool, max_pkts); + if (!nb_pkts) + return 0; + + for
[PATCH net-next v1 17/19] virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer
Since this will be called in other circumstances(freeze), we must check whether it is xsk's buffer in this function. It cannot be judged outside this function. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 13 + 1 file changed, 13 insertions(+) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 003dd67ab707..ac62d0955c13 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -3904,8 +3904,21 @@ void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf) void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf) { struct virtnet_info *vi = vq->vdev->priv; + struct xsk_buff_pool *pool; int i = vq2rxq(vq); + rcu_read_lock(); + pool = rcu_dereference(vi->rq[i].xsk.pool); + if (pool) { + struct xdp_buff *xdp; + + xdp = (struct xdp_buff *)buf; + xsk_buff_free(xdp); + rcu_read_unlock(); + return; + } + rcu_read_unlock(); + if (vi->mergeable_rx_bufs) put_page(virt_to_head_page(buf)); else if (vi->big_packets) -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 08/19] virtio_net: sq support premapped mode
If the xsk is enabling, the xsk tx will share the send queue. But the xsk requires that the send queue use the premapped mode. So the send queue must support premapped mode. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 108 drivers/net/virtio/virtio_net.h | 54 +++- 2 files changed, 149 insertions(+), 13 deletions(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 8da84ea9bcbe..02d27101fef1 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -514,20 +514,104 @@ static void *virtnet_rq_alloc(struct virtnet_rq *rq, u32 size, gfp_t gfp) return buf; } -static void virtnet_rq_set_premapped(struct virtnet_info *vi) +static int virtnet_sq_set_premapped(struct virtnet_sq *sq) { - int i; + struct virtnet_sq_dma *d; + int err, size, i; - /* disable for big mode */ - if (!vi->mergeable_rx_bufs && vi->big_packets) - return; + size = virtqueue_get_vring_size(sq->vq); + + size += MAX_SKB_FRAGS + 2; + + sq->dmainfo.head = kcalloc(size, sizeof(*sq->dmainfo.head), GFP_KERNEL); + if (!sq->dmainfo.head) + return -ENOMEM; + + err = virtqueue_set_dma_premapped(sq->vq); + if (err) { + kfree(sq->dmainfo.head); + return err; + } + + sq->dmainfo.free = NULL; + + sq->do_dma = true; + + for (i = 0; i < size; ++i) { + d = &sq->dmainfo.head[i]; + + d->next = sq->dmainfo.free; + sq->dmainfo.free = d; + } + + return 0; +} + +static void virtnet_set_premapped(struct virtnet_info *vi) +{ + int i; for (i = 0; i < vi->max_queue_pairs; i++) { - if (virtqueue_set_dma_premapped(vi->rq[i].vq)) + if (!virtnet_sq_set_premapped(&vi->sq[i])) + vi->sq[i].do_dma = true; + + /* disable for big mode */ + if (!vi->mergeable_rx_bufs && vi->big_packets) continue; - vi->rq[i].do_dma = true; + if (!virtqueue_set_dma_premapped(vi->rq[i].vq)) + vi->rq[i].do_dma = true; + } +} + +static struct virtnet_sq_dma *virtnet_sq_map_sg(struct virtnet_sq *sq, int nents, void *data) +{ + struct virtnet_sq_dma *d, *head; + struct scatterlist *sg; + int i; + + head = NULL; + + for_each_sg(sq->sg, sg, nents, i) { + sg->dma_address = virtqueue_dma_map_single_attrs(sq->vq, sg_virt(sg), +sg->length, +DMA_TO_DEVICE, 0); + if (virtqueue_dma_mapping_error(sq->vq, sg->dma_address)) + goto err; + + d = sq->dmainfo.free; + sq->dmainfo.free = d->next; + + d->addr = sg->dma_address; + d->len = sg->length; + + d->next = head; + head = d; + } + + head->data = data; + + return (void *)((unsigned long)head | ((unsigned long)data & VIRTIO_XMIT_DATA_MASK)); +err: + virtnet_sq_unmap(sq, head); + return NULL; +} + +static int virtnet_add_outbuf(struct virtnet_sq *sq, u32 num, void *data) +{ + int ret; + + if (sq->do_dma) { + data = virtnet_sq_map_sg(sq, num, data); + if (!data) + return -ENOMEM; } + + ret = virtqueue_add_outbuf(sq->vq, sq->sg, num, data, GFP_ATOMIC); + if (ret && sq->do_dma) + virtnet_sq_unmap(sq, data); + + return ret; } static void free_old_xmit(struct virtnet_sq *sq, bool in_napi) @@ -623,8 +707,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, skb_frag_size(frag), skb_frag_off(frag)); } - err = virtqueue_add_outbuf(sq->vq, sq->sg, nr_frags + 1, - xdp_to_ptr(xdpf), GFP_ATOMIC); + err = virtnet_add_outbuf(sq, nr_frags + 1, xdp_to_ptr(xdpf)); if (unlikely(err)) return -ENOSPC; /* Caller handle free/refcnt */ @@ -2060,7 +2143,8 @@ static int xmit_skb(struct virtnet_sq *sq, struct sk_buff *skb) return num_sg; num_sg++; } - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC); + + return virtnet_add_outbuf(sq, num_sg, skb); } static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) @@ -3717,6 +3801,8 @@ static void virtnet_free_queues(struct virtnet_info *vi) for (i = 0; i < vi->max_queue_pairs; i++) { __netif_napi_del(&vi->rq[i].napi); __netif_napi_del(&vi->sq[i].napi); + + kfree(vi->sq[i].dmainfo.head); } /* We called __netif_napi_del(), @@ -3974,7 +4060,7 @@ static i
[PATCH net-next v1 09/19] virtio_net: xsk: bind/unbind xsk
This patch implement the logic of bind/unbind xsk pool to sq and rq. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/Makefile | 2 +- drivers/net/virtio/main.c | 10 +- drivers/net/virtio/virtio_net.h | 18 drivers/net/virtio/xsk.c| 186 drivers/net/virtio/xsk.h| 7 ++ 5 files changed, 216 insertions(+), 7 deletions(-) create mode 100644 drivers/net/virtio/xsk.c create mode 100644 drivers/net/virtio/xsk.h diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile index 15ed7c97fd4f..8c2a884d2dba 100644 --- a/drivers/net/virtio/Makefile +++ b/drivers/net/virtio/Makefile @@ -5,4 +5,4 @@ obj-$(CONFIG_VIRTIO_NET) += virtio_net.o -virtio_net-y := main.o +virtio_net-y := main.o xsk.o diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 02d27101fef1..38733a782f12 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -8,7 +8,6 @@ #include #include #include -#include #include #include #include @@ -139,9 +138,6 @@ struct virtio_net_common_hdr { }; }; -static void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf); -static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf); - static void *xdp_to_ptr(struct xdp_frame *ptr) { return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG); @@ -3664,6 +3660,8 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return virtnet_xdp_set(dev, xdp->prog, xdp->extack); + case XDP_SETUP_XSK_POOL: + return virtnet_xsk_pool_setup(dev, xdp); default: return -EINVAL; } @@ -3849,7 +3847,7 @@ static void free_receive_page_frags(struct virtnet_info *vi) } } -static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf) +void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf) { if (!virtnet_is_xdp_frame(buf)) dev_kfree_skb(buf); @@ -3857,7 +3855,7 @@ static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf) xdp_return_frame(virtnet_ptr_to_xdp(buf)); } -static void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf) +void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf) { struct virtnet_info *vi = vq->vdev->priv; int i = vq2rxq(vq); diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index cc742756e19a..9e69b6c5921b 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -5,6 +5,8 @@ #include #include +#include +#include #define VIRTIO_XDP_FLAGBIT(0) #define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG) @@ -94,6 +96,11 @@ struct virtnet_sq { bool do_dma; struct virtnet_sq_dma_head dmainfo; + struct { + struct xsk_buff_pool __rcu *pool; + + dma_addr_t hdr_dma_address; + } xsk; }; /* Internal representation of a receive virtqueue */ @@ -134,6 +141,13 @@ struct virtnet_rq { /* Do dma by self */ bool do_dma; + + struct { + struct xsk_buff_pool __rcu *pool; + + /* xdp rxq used by xsk */ + struct xdp_rxq_info xdp_rxq; + } xsk; }; struct virtnet_info { @@ -218,6 +232,8 @@ struct virtnet_info { struct failover *failover; }; +#include "xsk.h" + static inline bool virtnet_is_xdp_frame(void *ptr) { return (unsigned long)ptr & VIRTIO_XDP_FLAG; @@ -308,4 +324,6 @@ void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq); void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq); void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq); void virtnet_tx_resume(struct virtnet_info *vi, struct virtnet_sq *sq); +void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf); +void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf); #endif diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c new file mode 100644 index ..01962a3f --- /dev/null +++ b/drivers/net/virtio/xsk.c @@ -0,0 +1,186 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * virtio-net xsk + */ + +#include "virtio_net.h" + +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr; + +static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq *rq, + struct xsk_buff_pool *pool) +{ + int err, qindex; + + qindex = rq - vi->rq; + + if (pool) { + err = xdp_rxq_info_reg(&rq->xsk.xdp_rxq, vi->dev, qindex, rq->napi.napi_id); + if (err < 0) + return err; + + err = xdp_rxq_info_reg_mem_model(&rq->xsk.xdp_rxq, +MEM_TYPE_XSK_BUFF_POOL, NULL); + if (err < 0) { + xdp_rxq_info_unr
[PATCH net-next v1 06/19] virtio_net: separate virtnet_rx_resize()
This patch separates two sub-functions from virtnet_rx_resize(): * virtnet_rx_pause * virtnet_rx_resume Then the subsequent reset rx for xsk can share these two functions. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 29 + drivers/net/virtio/virtio_net.h | 3 +++ 2 files changed, 24 insertions(+), 8 deletions(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index ba38b6078e1d..e6b262341619 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -2120,26 +2120,39 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) return NETDEV_TX_OK; } -static int virtnet_rx_resize(struct virtnet_info *vi, -struct virtnet_rq *rq, u32 ring_num) +void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq) { bool running = netif_running(vi->dev); - int err, qindex; - - qindex = rq - vi->rq; if (running) napi_disable(&rq->napi); +} - err = virtqueue_resize(rq->vq, ring_num, virtnet_rq_free_unused_buf); - if (err) - netdev_err(vi->dev, "resize rx fail: rx queue index: %d err: %d\n", qindex, err); +void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq) +{ + bool running = netif_running(vi->dev); if (!try_fill_recv(vi, rq, GFP_KERNEL)) schedule_delayed_work(&vi->refill, 0); if (running) virtnet_napi_enable(rq->vq, &rq->napi); +} + +static int virtnet_rx_resize(struct virtnet_info *vi, +struct virtnet_rq *rq, u32 ring_num) +{ + int err, qindex; + + qindex = rq - vi->rq; + + virtnet_rx_pause(vi, rq); + + err = virtqueue_resize(rq->vq, ring_num, virtnet_rq_free_unused_buf); + if (err) + netdev_err(vi->dev, "resize rx fail: rx queue index: %d err: %d\n", qindex, err); + + virtnet_rx_resume(vi, rq); return err; } diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index 282504d6639a..70eea23adba6 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -253,4 +253,7 @@ static inline bool virtnet_is_xdp_raw_buffer_queue(struct virtnet_info *vi, int else return false; } + +void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq); +void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq); #endif -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 10/19] virtio_net: xsk: prevent disable tx napi
Since xsk's TX queue is consumed by TX NAPI, if sq is bound to xsk, then we must stop tx napi from being disabled. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 38733a782f12..b320770e5f4e 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -3203,7 +3203,7 @@ static int virtnet_set_coalesce(struct net_device *dev, struct netlink_ext_ack *extack) { struct virtnet_info *vi = netdev_priv(dev); - int ret, queue_number, napi_weight; + int ret, queue_number, napi_weight, i; bool update_napi = false; /* Can't change NAPI weight if the link is up */ @@ -3232,6 +3232,14 @@ static int virtnet_set_coalesce(struct net_device *dev, return ret; if (update_napi) { + /* xsk xmit depends on the tx napi. So if xsk is active, +* prevent modifications to tx napi. +*/ + for (i = queue_number; i < vi->max_queue_pairs; i++) { + if (rtnl_dereference(vi->sq[i].xsk.pool)) + return -EBUSY; + } + for (; queue_number < vi->max_queue_pairs; queue_number++) vi->sq[queue_number].napi.weight = napi_weight; } -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 04/19] virtio_net: move to virtio_net.h
Move some structure definitions and inline functions into the virtio_net.h file. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 252 +-- drivers/net/virtio/virtio_net.h | 256 2 files changed, 258 insertions(+), 250 deletions(-) create mode 100644 drivers/net/virtio/virtio_net.h diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index 6cf77b6acdab..d8b6c0d86f29 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -6,7 +6,6 @@ //#define DEBUG #include #include -#include #include #include #include @@ -16,7 +15,6 @@ #include #include #include -#include #include #include #include @@ -24,6 +22,8 @@ #include #include +#include "virtio_net.h" + static int napi_weight = NAPI_POLL_WEIGHT; module_param(napi_weight, int, 0444); @@ -45,15 +45,6 @@ module_param(napi_tx, bool, 0644); #define VIRTIO_XDP_TX BIT(0) #define VIRTIO_XDP_REDIR BIT(1) -#define VIRTIO_XDP_FLAGBIT(0) - -/* RX packet size EWMA. The average packet size is used to determine the packet - * buffer size when refilling RX rings. As the entire RX ring may be refilled - * at once, the weight is chosen so that the EWMA will be insensitive to short- - * term, transient changes in packet size. - */ -DECLARE_EWMA(pkt_len, 0, 64) - #define VIRTNET_DRIVER_VERSION "1.0.0" static const unsigned long guest_offloads[] = { @@ -74,36 +65,6 @@ static const unsigned long guest_offloads[] = { (1ULL << VIRTIO_NET_F_GUEST_USO4) | \ (1ULL << VIRTIO_NET_F_GUEST_USO6)) -struct virtnet_stat_desc { - char desc[ETH_GSTRING_LEN]; - size_t offset; -}; - -struct virtnet_sq_stats { - struct u64_stats_sync syncp; - u64 packets; - u64 bytes; - u64 xdp_tx; - u64 xdp_tx_drops; - u64 kicks; - u64 tx_timeouts; -}; - -struct virtnet_rq_stats { - struct u64_stats_sync syncp; - u64 packets; - u64 bytes; - u64 drops; - u64 xdp_packets; - u64 xdp_tx; - u64 xdp_redirects; - u64 xdp_drops; - u64 kicks; -}; - -#define VIRTNET_SQ_STAT(m) offsetof(struct virtnet_sq_stats, m) -#define VIRTNET_RQ_STAT(m) offsetof(struct virtnet_rq_stats, m) - static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = { { "packets",VIRTNET_SQ_STAT(packets) }, { "bytes", VIRTNET_SQ_STAT(bytes) }, @@ -127,80 +88,6 @@ static const struct virtnet_stat_desc virtnet_rq_stats_desc[] = { #define VIRTNET_SQ_STATS_LEN ARRAY_SIZE(virtnet_sq_stats_desc) #define VIRTNET_RQ_STATS_LEN ARRAY_SIZE(virtnet_rq_stats_desc) -struct virtnet_interrupt_coalesce { - u32 max_packets; - u32 max_usecs; -}; - -/* The dma information of pages allocated at a time. */ -struct virtnet_rq_dma { - dma_addr_t addr; - u32 ref; - u16 len; - u16 need_sync; -}; - -/* Internal representation of a send virtqueue */ -struct send_queue { - /* Virtqueue associated with this send _queue */ - struct virtqueue *vq; - - /* TX: fragments + linear part + virtio header */ - struct scatterlist sg[MAX_SKB_FRAGS + 2]; - - /* Name of the send queue: output.$index */ - char name[16]; - - struct virtnet_sq_stats stats; - - struct virtnet_interrupt_coalesce intr_coal; - - struct napi_struct napi; - - /* Record whether sq is in reset state. */ - bool reset; -}; - -/* Internal representation of a receive virtqueue */ -struct receive_queue { - /* Virtqueue associated with this receive_queue */ - struct virtqueue *vq; - - struct napi_struct napi; - - struct bpf_prog __rcu *xdp_prog; - - struct virtnet_rq_stats stats; - - struct virtnet_interrupt_coalesce intr_coal; - - /* Chain pages by the private ptr. */ - struct page *pages; - - /* Average packet length for mergeable receive buffers. */ - struct ewma_pkt_len mrg_avg_pkt_len; - - /* Page frag for packet buffer allocation. */ - struct page_frag alloc_frag; - - /* RX: fragments + linear part + virtio header */ - struct scatterlist sg[MAX_SKB_FRAGS + 2]; - - /* Min single buffer size for mergeable buffers case. */ - unsigned int min_buf_len; - - /* Name of this receive queue: input.$index */ - char name[16]; - - struct xdp_rxq_info xdp_rxq; - - /* Record the last dma info to free after new pages is allocated. */ - struct virtnet_rq_dma *last_dma; - - /* Do dma by self */ - bool do_dma; -}; - /* This structure can contain rss message with maximum settings for indirection table and keysize * Note, that default structure that describes RSS configuration virtio_net_rss_config * contains same info but can't handle table values. @@ -234,88 +121,6 @@ struct control_buf {
[PATCH net-next v1 03/19] virtio_net: independent directory
Create a separate directory for virtio-net. AF_XDP support will be added later, then a separate xsk.c file will be added, so we should create a directory for virtio-net. Signed-off-by: Xuan Zhuo --- MAINTAINERS | 2 +- drivers/net/Kconfig | 8 +--- drivers/net/Makefile| 2 +- drivers/net/virtio/Kconfig | 13 + drivers/net/virtio/Makefile | 8 drivers/net/{virtio_net.c => virtio/main.c} | 0 6 files changed, 24 insertions(+), 9 deletions(-) create mode 100644 drivers/net/virtio/Kconfig create mode 100644 drivers/net/virtio/Makefile rename drivers/net/{virtio_net.c => virtio/main.c} (100%) diff --git a/MAINTAINERS b/MAINTAINERS index 9c186c214c54..e4fbcbc100e3 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -22768,7 +22768,7 @@ F: Documentation/devicetree/bindings/virtio/ F: Documentation/driver-api/virtio/ F: drivers/block/virtio_blk.c F: drivers/crypto/virtio/ -F: drivers/net/virtio_net.c +F: drivers/net/virtio/ F: drivers/vdpa/ F: drivers/virtio/ F: include/linux/vdpa.h diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 44eeb5d61ba9..54ee6fa4f4a6 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -430,13 +430,7 @@ config VETH When one end receives the packet it appears on its pair and vice versa. -config VIRTIO_NET - tristate "Virtio network driver" - depends on VIRTIO - select NET_FAILOVER - help - This is the virtual network driver for virtio. It can be used with - QEMU based VMMs (like KVM or Xen). Say Y or M. +source "drivers/net/virtio/Kconfig" config NLMON tristate "Virtual netlink monitoring device" diff --git a/drivers/net/Makefile b/drivers/net/Makefile index e26f98f897c5..47537dd0f120 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -31,7 +31,7 @@ obj-$(CONFIG_NET_TEAM) += team/ obj-$(CONFIG_TUN) += tun.o obj-$(CONFIG_TAP) += tap.o obj-$(CONFIG_VETH) += veth.o -obj-$(CONFIG_VIRTIO_NET) += virtio_net.o +obj-$(CONFIG_VIRTIO_NET) += virtio/ obj-$(CONFIG_VXLAN) += vxlan/ obj-$(CONFIG_GENEVE) += geneve.o obj-$(CONFIG_BAREUDP) += bareudp.o diff --git a/drivers/net/virtio/Kconfig b/drivers/net/virtio/Kconfig new file mode 100644 index ..d8ccb3ac49df --- /dev/null +++ b/drivers/net/virtio/Kconfig @@ -0,0 +1,13 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# virtio-net device configuration +# +config VIRTIO_NET + tristate "Virtio network driver" + depends on VIRTIO + select NET_FAILOVER + help + This is the virtual network driver for virtio. It can be used with + QEMU based VMMs (like KVM or Xen). + + Say Y or M. diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile new file mode 100644 index ..15ed7c97fd4f --- /dev/null +++ b/drivers/net/virtio/Makefile @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for the virtio network device drivers. +# + +obj-$(CONFIG_VIRTIO_NET) += virtio_net.o + +virtio_net-y := main.o diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio/main.c similarity index 100% rename from drivers/net/virtio_net.c rename to drivers/net/virtio/main.c -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 02/19] virtio_net: unify the code for recycling the xmit ptr
There are two completely similar and independent implementations. This is inconvenient for the subsequent addition of new types. So extract a function from this piece of code and call this function uniformly to recover old xmit ptr. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 76 +--- 1 file changed, 33 insertions(+), 43 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 3d87386d8220..6cf77b6acdab 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -352,6 +352,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr) return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); } +static void __free_old_xmit(struct send_queue *sq, bool in_napi, + struct virtnet_sq_stats *stats) +{ + unsigned int len; + void *ptr; + + while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { + if (!is_xdp_frame(ptr)) { + struct sk_buff *skb = ptr; + + pr_debug("Sent skb %p\n", skb); + + stats->bytes += skb->len; + napi_consume_skb(skb, in_napi); + } else { + struct xdp_frame *frame = ptr_to_xdp(ptr); + + stats->bytes += xdp_get_frame_len(frame); + xdp_return_frame(frame); + } + stats->packets++; + } +} + /* Converting between virtqueue no. and kernel tx/rx queue no. * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq */ @@ -746,37 +770,19 @@ static void virtnet_rq_set_premapped(struct virtnet_info *vi) static void free_old_xmit(struct send_queue *sq, bool in_napi) { - unsigned int len; - unsigned int packets = 0; - unsigned int bytes = 0; - void *ptr; + struct virtnet_sq_stats stats = {}; - while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (likely(!is_xdp_frame(ptr))) { - struct sk_buff *skb = ptr; - - pr_debug("Sent skb %p\n", skb); - - bytes += skb->len; - napi_consume_skb(skb, in_napi); - } else { - struct xdp_frame *frame = ptr_to_xdp(ptr); - - bytes += xdp_get_frame_len(frame); - xdp_return_frame(frame); - } - packets++; - } + __free_old_xmit(sq, in_napi, &stats); /* Avoid overhead when no packets have been processed * happens when called speculatively from start_xmit. */ - if (!packets) + if (!stats.packets) return; u64_stats_update_begin(&sq->stats.syncp); - sq->stats.bytes += bytes; - sq->stats.packets += packets; + sq->stats.bytes += stats.bytes; + sq->stats.packets += stats.packets; u64_stats_update_end(&sq->stats.syncp); } @@ -915,15 +921,12 @@ static int virtnet_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames, u32 flags) { struct virtnet_info *vi = netdev_priv(dev); + struct virtnet_sq_stats stats = {}; struct receive_queue *rq = vi->rq; struct bpf_prog *xdp_prog; struct send_queue *sq; - unsigned int len; - int packets = 0; - int bytes = 0; int nxmit = 0; int kicks = 0; - void *ptr; int ret; int i; @@ -942,20 +945,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, } /* Free up any pending old buffers before queueing new ones. */ - while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { - if (likely(is_xdp_frame(ptr))) { - struct xdp_frame *frame = ptr_to_xdp(ptr); - - bytes += xdp_get_frame_len(frame); - xdp_return_frame(frame); - } else { - struct sk_buff *skb = ptr; - - bytes += skb->len; - napi_consume_skb(skb, false); - } - packets++; - } + __free_old_xmit(sq, false, &stats); for (i = 0; i < n; i++) { struct xdp_frame *xdpf = frames[i]; @@ -975,8 +965,8 @@ static int virtnet_xdp_xmit(struct net_device *dev, } out: u64_stats_update_begin(&sq->stats.syncp); - sq->stats.bytes += bytes; - sq->stats.packets += packets; + sq->stats.bytes += stats.bytes; + sq->stats.packets += stats.packets; sq->stats.xdp_tx += n; sq->stats.xdp_tx_drops += n - nxmit; sq->stats.kicks += kicks; -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 07/19] virtio_net: separate virtnet_tx_resize()
This patch separates two sub-functions from virtnet_tx_resize(): * virtnet_tx_pause * virtnet_tx_resume Then the subsequent virtnet_tx_reset() can share these two functions. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 35 +++-- drivers/net/virtio/virtio_net.h | 2 ++ 2 files changed, 31 insertions(+), 6 deletions(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index e6b262341619..8da84ea9bcbe 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -2156,12 +2156,11 @@ static int virtnet_rx_resize(struct virtnet_info *vi, return err; } -static int virtnet_tx_resize(struct virtnet_info *vi, -struct virtnet_sq *sq, u32 ring_num) +void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq) { bool running = netif_running(vi->dev); struct netdev_queue *txq; - int err, qindex; + int qindex; qindex = sq - vi->sq; @@ -2182,10 +2181,17 @@ static int virtnet_tx_resize(struct virtnet_info *vi, netif_stop_subqueue(vi->dev, qindex); __netif_tx_unlock_bh(txq); +} - err = virtqueue_resize(sq->vq, ring_num, virtnet_sq_free_unused_buf); - if (err) - netdev_err(vi->dev, "resize tx fail: tx queue index: %d err: %d\n", qindex, err); +void virtnet_tx_resume(struct virtnet_info *vi, struct virtnet_sq *sq) +{ + bool running = netif_running(vi->dev); + struct netdev_queue *txq; + int qindex; + + qindex = sq - vi->sq; + + txq = netdev_get_tx_queue(vi->dev, qindex); __netif_tx_lock_bh(txq); sq->reset = false; @@ -2194,6 +2200,23 @@ static int virtnet_tx_resize(struct virtnet_info *vi, if (running) virtnet_napi_tx_enable(vi, sq->vq, &sq->napi); +} + +static int virtnet_tx_resize(struct virtnet_info *vi, struct virtnet_sq *sq, +u32 ring_num) +{ + int qindex, err; + + qindex = sq - vi->sq; + + virtnet_tx_pause(vi, sq); + + err = virtqueue_resize(sq->vq, ring_num, virtnet_sq_free_unused_buf); + if (err) + netdev_err(vi->dev, "resize tx fail: tx queue index: %d err: %d\n", qindex, err); + + virtnet_tx_resume(vi, sq); + return err; } diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h index 70eea23adba6..2f930af35364 100644 --- a/drivers/net/virtio/virtio_net.h +++ b/drivers/net/virtio/virtio_net.h @@ -256,4 +256,6 @@ static inline bool virtnet_is_xdp_raw_buffer_queue(struct virtnet_info *vi, int void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq); void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq); +void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq); +void virtnet_tx_resume(struct virtnet_info *vi, struct virtnet_sq *sq); #endif -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy
## AF_XDP XDP socket(AF_XDP) is an excellent bypass kernel network framework. The zero copy feature of xsk (XDP socket) needs to be supported by the driver. The performance of zero copy is very good. mlx5 and intel ixgbe already support this feature, This patch set allows virtio-net to support xsk's zerocopy xmit feature. At present, we have completed some preparation: 1. vq-reset (virtio spec and kernel code) 2. virtio-core premapped dma 3. virtio-net xdp refactor So it is time for Virtio-Net to complete the support for the XDP Socket Zerocopy. Virtio-net can not increase the queue num at will, so xsk shares the queue with kernel. On the other hand, Virtio-Net does not support generate interrupt from driver manually, so when we wakeup tx xmit, we used some tips. If the CPU run by TX NAPI last time is other CPUs, use IPI to wake up NAPI on the remote CPU. If it is also the local CPU, then we wake up napi directly. This patch set includes some refactor to the virtio-net to let that to support AF_XDP. ## performance ENV: Qemu with vhost-user(polling mode). Sockperf: https://github.com/Mellanox/sockperf I use this tool to send udp packet by kernel syscall. xmit command: sockperf tp -i 10.0.3.1 -t 1000 I write a tool that sends udp packets or recvs udp packets by AF_XDP. | Guest APP CPU |Guest Softirq CPU | UDP PPS --|---|--| xmit by syscall | 100%| | 676,915 xmit by xsk | 59.1% | 100% | 5,447,168 recv by syscall | 60% | 100% | 932,288 recv by xsk | 35.7% | 100% | 3,343,168 ## maintain I am currently a reviewer for virtio-net. I commit to maintain AF_XDP support in virtio-net. Please review. Thanks. v1: 1. remove two virtio commits. Push this patchset to net-next 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: support tx 3. fix some warnings Xuan Zhuo (19): virtio_net: rename free_old_xmit_skbs to free_old_xmit virtio_net: unify the code for recycling the xmit ptr virtio_net: independent directory virtio_net: move to virtio_net.h virtio_net: add prefix virtnet to all struct/api inside virtio_net.h virtio_net: separate virtnet_rx_resize() virtio_net: separate virtnet_tx_resize() virtio_net: sq support premapped mode virtio_net: xsk: bind/unbind xsk virtio_net: xsk: prevent disable tx napi virtio_net: xsk: tx: support tx virtio_net: xsk: tx: support wakeup virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer virtio_net: xsk: rx: introduce add_recvbuf_xsk() virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer virtio_net: update tx timeout record virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY MAINTAINERS | 2 +- drivers/net/Kconfig | 8 +- drivers/net/Makefile| 2 +- drivers/net/virtio/Kconfig | 13 + drivers/net/virtio/Makefile | 8 + drivers/net/{virtio_net.c => virtio/main.c} | 652 +--- drivers/net/virtio/virtio_net.h | 359 +++ drivers/net/virtio/xsk.c| 545 drivers/net/virtio/xsk.h| 32 + 9 files changed, 1247 insertions(+), 374 deletions(-) create mode 100644 drivers/net/virtio/Kconfig create mode 100644 drivers/net/virtio/Makefile rename drivers/net/{virtio_net.c => virtio/main.c} (91%) create mode 100644 drivers/net/virtio/virtio_net.h create mode 100644 drivers/net/virtio/xsk.c create mode 100644 drivers/net/virtio/xsk.h -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH net-next v1 05/19] virtio_net: add prefix virtnet to all struct/api inside virtio_net.h
We move some structures and APIs to the header file, but these structures and APIs do not prefixed with virtnet. This patch adds virtnet for these. Signed-off-by: Xuan Zhuo --- drivers/net/virtio/main.c | 122 drivers/net/virtio/virtio_net.h | 30 2 files changed, 76 insertions(+), 76 deletions(-) diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c index d8b6c0d86f29..ba38b6078e1d 100644 --- a/drivers/net/virtio/main.c +++ b/drivers/net/virtio/main.c @@ -180,7 +180,7 @@ skb_vnet_common_hdr(struct sk_buff *skb) * private is used to chain pages for big packets, put the whole * most recent used list in the beginning for reuse */ -static void give_pages(struct receive_queue *rq, struct page *page) +static void give_pages(struct virtnet_rq *rq, struct page *page) { struct page *end; @@ -190,7 +190,7 @@ static void give_pages(struct receive_queue *rq, struct page *page) rq->pages = page; } -static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask) +static struct page *get_a_page(struct virtnet_rq *rq, gfp_t gfp_mask) { struct page *p = rq->pages; @@ -225,7 +225,7 @@ static void virtqueue_napi_complete(struct napi_struct *napi, opaque = virtqueue_enable_cb_prepare(vq); if (napi_complete_done(napi, processed)) { if (unlikely(virtqueue_poll(vq, opaque))) - virtqueue_napi_schedule(napi, vq); + virtnet_vq_napi_schedule(napi, vq); } else { virtqueue_disable_cb(vq); } @@ -240,7 +240,7 @@ static void skb_xmit_done(struct virtqueue *vq) virtqueue_disable_cb(vq); if (napi->weight) - virtqueue_napi_schedule(napi, vq); + virtnet_vq_napi_schedule(napi, vq); else /* We were probably waiting for more output buffers. */ netif_wake_subqueue(vi->dev, vq2txq(vq)); @@ -281,7 +281,7 @@ static struct sk_buff *virtnet_build_skb(void *buf, unsigned int buflen, /* Called from bottom half context */ static struct sk_buff *page_to_skb(struct virtnet_info *vi, - struct receive_queue *rq, + struct virtnet_rq *rq, struct page *page, unsigned int offset, unsigned int len, unsigned int truesize, unsigned int headroom) @@ -380,7 +380,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi, return skb; } -static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len) +static void virtnet_rq_unmap(struct virtnet_rq *rq, void *buf, u32 len) { struct page *page = virt_to_head_page(buf); struct virtnet_rq_dma *dma; @@ -409,7 +409,7 @@ static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len) put_page(page); } -static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len, void **ctx) +static void *virtnet_rq_get_buf(struct virtnet_rq *rq, u32 *len, void **ctx) { void *buf; @@ -420,7 +420,7 @@ static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len, void **ctx) return buf; } -static void *virtnet_rq_detach_unused_buf(struct receive_queue *rq) +static void *virtnet_rq_detach_unused_buf(struct virtnet_rq *rq) { void *buf; @@ -431,7 +431,7 @@ static void *virtnet_rq_detach_unused_buf(struct receive_queue *rq) return buf; } -static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len) +static void virtnet_rq_init_one_sg(struct virtnet_rq *rq, void *buf, u32 len) { struct virtnet_rq_dma *dma; dma_addr_t addr; @@ -456,7 +456,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 len) rq->sg[0].length = len; } -static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp) +static void *virtnet_rq_alloc(struct virtnet_rq *rq, u32 size, gfp_t gfp) { struct page_frag *alloc_frag = &rq->alloc_frag; struct virtnet_rq_dma *dma; @@ -530,11 +530,11 @@ static void virtnet_rq_set_premapped(struct virtnet_info *vi) } } -static void free_old_xmit(struct send_queue *sq, bool in_napi) +static void free_old_xmit(struct virtnet_sq *sq, bool in_napi) { struct virtnet_sq_stats stats = {}; - __free_old_xmit(sq, in_napi, &stats); + virtnet_free_old_xmit(sq, in_napi, &stats); /* Avoid overhead when no packets have been processed * happens when called speculatively from start_xmit. @@ -550,7 +550,7 @@ static void free_old_xmit(struct send_queue *sq, bool in_napi) static void check_sq_full_and_disable(struct virtnet_info *vi, struct net_device *dev, - struct send_queue *sq) + struct virtn
[PATCH net-next v1 01/19] virtio_net: rename free_old_xmit_skbs to free_old_xmit
Since free_old_xmit_skbs not only deals with skb, but also xdp frame and subsequent added xsk, so change the name of this function to free_old_xmit. Signed-off-by: Xuan Zhuo --- drivers/net/virtio_net.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index fe7f314d65c9..3d87386d8220 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -744,7 +744,7 @@ static void virtnet_rq_set_premapped(struct virtnet_info *vi) } } -static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi) +static void free_old_xmit(struct send_queue *sq, bool in_napi) { unsigned int len; unsigned int packets = 0; @@ -816,7 +816,7 @@ static void check_sq_full_and_disable(struct virtnet_info *vi, virtqueue_napi_schedule(&sq->napi, sq->vq); } else if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) { /* More just got used, free them then recheck. */ - free_old_xmit_skbs(sq, false); + free_old_xmit(sq, false); if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) { netif_start_subqueue(dev, qnum); virtqueue_disable_cb(sq->vq); @@ -2124,7 +2124,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq) do { virtqueue_disable_cb(sq->vq); - free_old_xmit_skbs(sq, true); + free_old_xmit(sq, true); } while (unlikely(!virtqueue_enable_cb_delayed(sq->vq))); if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) @@ -2246,7 +2246,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int budget) txq = netdev_get_tx_queue(vi->dev, index); __netif_tx_lock(txq, raw_smp_processor_id()); virtqueue_disable_cb(sq->vq); - free_old_xmit_skbs(sq, true); + free_old_xmit(sq, true); if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) netif_tx_wake_queue(txq); @@ -2336,7 +2336,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev) if (use_napi) virtqueue_disable_cb(sq->vq); - free_old_xmit_skbs(sq, false); + free_old_xmit(sq, false); } while (use_napi && kick && unlikely(!virtqueue_enable_cb_delayed(sq->vq))); -- 2.32.0.3.g01195cf9f ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vfio 10/11] vfio/virtio: Expose admin commands over virtio device
On 10/16/2023 4:52 PM, Michael S. Tsirkin wrote: On Mon, Oct 16, 2023 at 04:33:10PM +0800, Zhu, Lingshan wrote: On 10/13/2023 9:50 PM, Michael S. Tsirkin wrote: On Fri, Oct 13, 2023 at 06:28:34PM +0800, Zhu, Lingshan wrote: On 10/12/2023 9:27 PM, Jason Gunthorpe wrote: On Thu, Oct 12, 2023 at 06:29:47PM +0800, Zhu, Lingshan wrote: sorry for the late reply, we have discussed this for weeks in virtio mailing list. I have proposed a live migration solution which is a config space solution. I'm sorry that can't be a serious proposal - config space can't do DMA, it is not suitable. config space only controls the live migration process and config the related facilities. We don't use config space to transfer data. The new added registers work like queue_enable or features. For example, we use DMA to report dirty pages and MMIO to fetch the dirty data. I remember in another thread you said:"you can't use DMA for any migration flows" And I agree to that statement, so we use config space registers to control the flow. Thanks, Zhu Lingshan Jason If you are using dma then I don't see what's wrong with admin vq. dma is all it does. dma != admin vq, Well they share the same issue that they don't work for nesting because DMA can not be intercepted. (hope this is not a spam to virtualization list and I try to keep this short) only use dma for host memory access, e.g., dirty page bitmap, no need to intercepted. and I think we have discussed many details in pros and cons in admin vq live migration proposal in virtio-comment. I am not sure we should span the discussions here, repeat them over again. Thanks Yea let's not. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vfio 10/11] vfio/virtio: Expose admin commands over virtio device
On Mon, Oct 16, 2023 at 04:33:10PM +0800, Zhu, Lingshan wrote: > > > On 10/13/2023 9:50 PM, Michael S. Tsirkin wrote: > > On Fri, Oct 13, 2023 at 06:28:34PM +0800, Zhu, Lingshan wrote: > > > > > > On 10/12/2023 9:27 PM, Jason Gunthorpe wrote: > > > > > > On Thu, Oct 12, 2023 at 06:29:47PM +0800, Zhu, Lingshan wrote: > > > > > > > > > sorry for the late reply, we have discussed this for weeks in > > > virtio mailing > > > list. I have proposed a live migration solution which is a > > > config space solution. > > > > > > I'm sorry that can't be a serious proposal - config space can't do > > > DMA, it is not suitable. > > > > > > config space only controls the live migration process and config the > > > related > > > facilities. > > > We don't use config space to transfer data. > > > > > > The new added registers work like queue_enable or features. > > > > > > For example, we use DMA to report dirty pages and MMIO to fetch the dirty > > > data. > > > > > > I remember in another thread you said:"you can't use DMA for any migration > > > flows" > > > > > > And I agree to that statement, so we use config space registers to > > > control the > > > flow. > > > > > > Thanks, > > > Zhu Lingshan > > > > > > > > > Jason > > > > > If you are using dma then I don't see what's wrong with admin vq. > > dma is all it does. > dma != admin vq, Well they share the same issue that they don't work for nesting because DMA can not be intercepted. > and I think we have discussed many details in pros and cons > in admin vq live migration proposal in virtio-comment. > I am not sure we should span the discussions here, repeat them over again. > > Thanks > > Yea let's not. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vfio 10/11] vfio/virtio: Expose admin commands over virtio device
On 10/13/2023 9:50 PM, Michael S. Tsirkin wrote: On Fri, Oct 13, 2023 at 06:28:34PM +0800, Zhu, Lingshan wrote: On 10/12/2023 9:27 PM, Jason Gunthorpe wrote: On Thu, Oct 12, 2023 at 06:29:47PM +0800, Zhu, Lingshan wrote: sorry for the late reply, we have discussed this for weeks in virtio mailing list. I have proposed a live migration solution which is a config space solution. I'm sorry that can't be a serious proposal - config space can't do DMA, it is not suitable. config space only controls the live migration process and config the related facilities. We don't use config space to transfer data. The new added registers work like queue_enable or features. For example, we use DMA to report dirty pages and MMIO to fetch the dirty data. I remember in another thread you said:"you can't use DMA for any migration flows" And I agree to that statement, so we use config space registers to control the flow. Thanks, Zhu Lingshan Jason If you are using dma then I don't see what's wrong with admin vq. dma is all it does. dma != admin vq, and I think we have discussed many details in pros and cons in admin vq live migration proposal in virtio-comment. I am not sure we should span the discussions here, repeat them over again. Thanks ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net-next 2/5] virtio-net: separate rx/tx coalescing moderation cmds
On Mon, Oct 16, 2023 at 03:45:38PM +0800, Heng Qi wrote: > > > 在 2023/10/14 上午9:11, Jakub Kicinski 写道: > > On Thu, 12 Oct 2023 15:44:06 +0800 Heng Qi wrote: > > > + > > > +static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi, > > > + struct ethtool_coalesce *ec) > > > +{ > > > + struct scatterlist sgs_rx; > > > + > > ../drivers/net/virtio_net.c: In function ‘virtnet_send_rx_notf_coal_cmds’: > > ../drivers/net/virtio_net.c:3306:14: error: ‘i’ undeclared (first use in > > this function); did you mean ‘vi’? > > 3306 | for (i = 0; i < vi->max_queue_pairs; i++) { > >| ^ > >| vi > > Will fix in the next version. > > Thanks! OK, however pls do test individual patches as well as the whole patchset. -- MST ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH][next] iommu/virtio: Add __counted_by for struct viommu_request and use struct_size()
On Mon, Oct 09, 2023 at 12:24:27PM -0600, Gustavo A. R. Silva wrote: > drivers/iommu/virtio-iommu.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Applied, thanks. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RESEND PATCH v2] vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE
Hi, On 9/27/23 16:05, Eric Auger wrote: > Commit e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb > entries") Forbade vhost iotlb msg with null size to prevent entries > with size = start = 0 and last = ULONG_MAX to end up in the iotlb. > > Then commit 95932ab2ea07 ("vhost: allow batching hint without size") > only applied the check for VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE > message types to fix a regression observed with batching hit. > > Still, the introduction of that check introduced a regression for > some users attempting to invalidate the whole ULONG_MAX range by > setting the size to 0. This is the case with qemu/smmuv3/vhost > integration which does not work anymore. It Looks safe to partially > revert the original commit and allow VHOST_IOTLB_INVALIDATE messages > with null size. vhost_iotlb_del_range() will compute a correct end > iova. Same for vhost_vdpa_iotlb_unmap(). > > Signed-off-by: Eric Auger > Fixes: e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb entries") > Cc: sta...@vger.kernel.org # v5.17+ > Acked-by: Jason Wang Gentle Ping. Thanks Eric > --- > drivers/vhost/vhost.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index c71d573f1c94..e0c181ad17e3 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -1458,9 +1458,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev, > goto done; > } > > - if ((msg.type == VHOST_IOTLB_UPDATE || > - msg.type == VHOST_IOTLB_INVALIDATE) && > - msg.size == 0) { > + if (msg.type == VHOST_IOTLB_UPDATE && msg.size == 0) { > ret = -EINVAL; > goto done; > } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization