Re: [PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy

2023-10-16 Thread Jason Wang
On Tue, Oct 17, 2023 at 11:28 AM Jason Wang  wrote:
>
> On Tue, Oct 17, 2023 at 11:26 AM Xuan Zhuo  wrote:
> >
> > On Tue, 17 Oct 2023 11:20:41 +0800, Jason Wang  wrote:
> > > On Tue, Oct 17, 2023 at 11:11 AM Xuan Zhuo  
> > > wrote:
> > > >
> > > > On Tue, 17 Oct 2023 10:53:44 +0800, Jason Wang  
> > > > wrote:
> > > > > On Mon, Oct 16, 2023 at 8:00 PM Xuan Zhuo 
> > > > >  wrote:
> > > > > >
> > > > > > ## AF_XDP
> > > > > >
> > > > > > XDP socket (AF_XDP) is an excellent kernel-bypass network framework.
> > > > > > The zero-copy feature of xsk (XDP socket) needs support from the
> > > > > > driver, and its performance is very good. mlx5 and Intel ixgbe already
> > > > > > support this feature. This patch set allows virtio-net to support
> > > > > > xsk's zero-copy xmit feature.
> > > > > >
> > > > > > At present, we have completed some preparation:
> > > > > >
> > > > > > 1. vq-reset (virtio spec and kernel code)
> > > > > > 2. virtio-core premapped dma
> > > > > > 3. virtio-net xdp refactor
> > > > > >
> > > > > > So it is time for virtio-net to complete its support for XDP socket
> > > > > > zero-copy.
> > > > > >
> > > > > > Virtio-net cannot increase the queue number at will, so xsk shares
> > > > > > the queues with the kernel.
> > > > > >
> > > > > > On the other hand, virtio-net does not let the driver generate an
> > > > > > interrupt manually, so we use a few tricks to wake up TX xmit: if TX
> > > > > > NAPI last ran on a different CPU, we use an IPI to wake up NAPI on
> > > > > > that remote CPU; if it last ran on the local CPU, we wake up NAPI
> > > > > > directly.
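
A minimal sketch of that wakeup dispatch (the field and helper names
below are illustrative assumptions, not the series' actual identifiers):

    /* Runs in IPI context on the remote CPU. */
    static void virtnet_xsk_napi_ipi(void *data)
    {
            struct virtnet_sq *sq = data;

            napi_schedule(&sq->napi);
    }

    static void virtnet_xsk_wakeup_sq(struct virtnet_sq *sq)
    {
            /* assumed bookkeeping: CPU that last ran this queue's TX NAPI */
            int last = READ_ONCE(sq->last_napi_cpu);

            if (last == raw_smp_processor_id())
                    napi_schedule(&sq->napi);       /* local CPU: wake NAPI directly */
            else
                    smp_call_function_single(last, virtnet_xsk_napi_ipi,
                                             sq, false); /* remote CPU: use an IPI */
    }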
> > > > > >
> > > > > > This patch set includes some refactoring of virtio-net to let it
> > > > > > support AF_XDP.
> > > > > >
> > > > > > ## performance
> > > > > >
> > > > > > ENV: QEMU with vhost-user (polling mode).
> > > > > >
> > > > > > Sockperf: https://github.com/Mellanox/sockperf
> > > > > > I use this tool to send UDP packets via kernel syscalls.
> > > > > >
> > > > > > xmit command: sockperf tp -i 10.0.3.1 -t 1000
> > > > > >
> > > > > > I wrote a tool that sends or receives UDP packets via AF_XDP.
> > > > > >
> > > > > >                   | Guest APP CPU | Guest Softirq CPU | UDP PPS
> > > > > > ------------------|---------------|-------------------|----------
> > > > > > xmit by syscall   |     100%      |                   |   676,915
> > > > > > xmit by xsk       |     59.1%     |       100%        | 5,447,168
> > > > > > recv by syscall   |     60%       |       100%        |   932,288
> > > > > > recv by xsk       |     35.7%     |       100%        | 3,343,168
> > > > >
> > > > > Any chance we can get a testpmd result (which I guess should be better
> > > > > than PPS above)?
> > > >
> > > > Do you mean testpmd + DPDK + AF_XDP?
> > >
> > > Yes.
> > >
> > > >
> > > > Yes. testpmd would probably do better, because my tool does more
> > > > work; mine is not a complete testing tool, just something used by
> > > > our business.
> > >
> > > Probably, but it would be appealing to others, especially considering
> > > that DPDK supports an AF_XDP PMD now.
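
For reference, a typical testpmd run over DPDK's AF_XDP PMD looks
something like this (illustrative parameters; the exact vdev arguments
depend on the DPDK version and the interface setup):

    dpdk-testpmd -l 0-1 --no-pci \
        --vdev net_af_xdp0,iface=eth0,queue_count=1 \
        -- -i --forward-mode=txonly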
> >
> > OK.
> >
> > Let me try.
> >
> > But could you start to review firstly?
>
> Yes, it's in my todo list.

I spoke too fast. If it doesn't take too long, I would rather wait for
the result first, as with the netdim series. One reason is that I
remember AF_XDP claims only a 10% to 20% loss compared to wire speed, so
I'd expect the numbers to be much higher. I also vaguely remember that
even vhost can give us more than 3M PPS if we disable SMAP, so the
numbers here are not as impressive as expected.

Thanks

>
> >
> >
> > >
> > > >
> > > > What I noticed is that the hotspot is the driver writing virtio
> > > > descriptors: the device is busy polling, so there is a race between
> > > > the driver and the device. I modified the virtio core to update the
> > > > avail idx lazily, and then the PPS can reach 10,000,000.
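
A rough sketch of that lazy avail-idx idea on the split ring (pending_idx
and both helpers are assumed names; packed mode would need a different
scheme, as noted in the thread):

    /* Queue a descriptor head without publishing it to the device yet. */
    static void vq_add_avail_lazy(struct vring_virtqueue *vq, u16 head)
    {
            u16 slot = vq->pending_idx++ & (vq->split.vring.num - 1);

            vq->split.vring.avail->ring[slot] = cpu_to_virtio16(vq->vq.vdev, head);
    }

    /* Publish the whole batch with one store and one barrier, so the
     * busy-polling device contends on the avail->idx cache line once per
     * batch instead of once per descriptor. */
    static void vq_publish_avail_batch(struct vring_virtqueue *vq)
    {
            virtio_wmb(vq->weak_barriers);
            vq->split.avail_idx_shadow = vq->pending_idx;
            vq->split.vring.avail->idx =
                    cpu_to_virtio16(vq->vq.vdev, vq->split.avail_idx_shadow);
    }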
> > >
> > > Care to post a draft for this?
> >
> > Yes, I am thinking about this. But that may only work for the split
> > ring; packed mode has some troubles.
>
> Ok.
>
> Thanks
>
> >
> > Thanks.
> >
> > >
> > > Thanks
> > >
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > ## maintain
> > > > > >
> > > > > > I am currently a reviewer for virtio-net, and I commit to
> > > > > > maintaining AF_XDP support in virtio-net.
> > > > > >
> > > > > > Please review.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > v1:
> > > > > > 1. remove two virtio commits; push this patchset to net-next
> > > > > > 2. squash "virtio_net: virtnet_poll_tx support rescheduled" into
> > > > > >    "xsk: support tx"
> > > > > > 3. fix some warnings
> > > > > >
> > > > > > Xuan Zhuo (19):
> > > > > >   virtio_net: rename free_old_xmit_skbs to free_old_xmit
> > > > > >   virtio_net: unify the code for recycling the xmit ptr
> > > > > >   virtio_net: independent directory
> > > > > >   virtio_net: move to virtio_net.h
> > > > > >   virtio_net: add prefix virtnet to all struct/api inside virtio_net.h
> > > > > >   virtio_net: separate virtnet_rx_resize()
> > > > > >   virtio_net: separate virtnet_tx_resize()
> > > > > >   virtio_net: sq support premapped mode
> > > > > >   virtio_net: xsk: bind/unbind xsk
> > > > > >   virtio_net: xsk: prevent disable tx napi
> > > > > >   virtio_net: xsk: tx: support tx
> > > > > >   virtio_net: xsk: tx: support wakeup
> > > > > >   virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
> > > > > >   virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer
> > > > > >   virtio_net: xsk: rx: introduce add_recvbuf_xsk()
> > > > > >   virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer
> > > > > >   virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer
> > > > > >   virtio_net: update tx timeout record
> > > > > >   virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY
> > > > > >
> > > > > >  MAINTAINERS                                 |   2 +-
> > > > > >  drivers/net/Kconfig                         |   8 +-
> > > > > >  drivers/net/Makefile                        |   2 +-
> > > > > >  drivers/net/virtio/Kconfig                  |  13 +
> > > > > >  drivers/net/virtio/Makefile                 |   8 +
> > > > > >  drivers/net/{virtio_net.c => virtio/main.c} | 652 +---
> > > > > >  drivers/net/virtio/virtio_net.h             | 359 +++
> > > > > >  drivers/net/virtio/xsk.c                    | 545
> > > > > >  drivers/net/virtio/xsk.h                    |  32 +
> > > > > >  9 files changed, 1247 insertions(+), 374 deletions(-)
> > > > > >  create mode 100644 drivers/net/virtio/Kconfig
> > > > > >  create mode 100644 drivers/net/virtio/Makefile
> > > > > >  rename drivers/net/{virtio_net.c => virtio/main.c} (91%)
> > > > > >  create mode 100644 drivers/net/virtio/virtio_net.h
> > > > > >  create mode 100644 drivers/net/virtio/xsk.c
> > > > > >  create mode 100644 drivers/net/virtio/xsk.h
> > > > > >
> > > > > > --
> > > > > > 2.32.0.3.g01195cf9f


Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH

2023-10-16 Thread Jason Wang
On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov wrote:
>
> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki  
> wrote:
> >
> > On 2023/10/16 1:07, Alexei Starovoitov wrote:
> > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki  
> > > wrote:
> > >>
> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> index 0448700890f7..298634556fab 100644
> > >> --- a/include/uapi/linux/bpf.h
> > >> +++ b/include/uapi/linux/bpf.h
> > >> @@ -988,6 +988,7 @@ enum bpf_prog_type {
> > >>  BPF_PROG_TYPE_SK_LOOKUP,
> > >>  BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> > >>  BPF_PROG_TYPE_NETFILTER,
> > >> +   BPF_PROG_TYPE_VNET_HASH,
> > >
> > > Sorry, we do not add new stable program types anymore.
> > >
> > >> @@ -6111,6 +6112,10 @@ struct __sk_buff {
> > >>  __u8  tstamp_type;
> > >>  __u32 :24;  /* Padding, future use. */
> > >>  __u64 hwtstamp;
> > >> +
> > >> +   __u32 vnet_hash_value;
> > >> +   __u16 vnet_hash_report;
> > >> +   __u16 vnet_rss_queue;
> > >>   };
> > >
> > > we also do not add anything to uapi __sk_buff.
> > >
> > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
> > >> +   .get_func_proto = sk_filter_func_proto,
> > >> +   .is_valid_access= sk_filter_is_valid_access,
> > >> +   .convert_ctx_access = bpf_convert_ctx_access,
> > >> +   .gen_ld_abs = bpf_gen_ld_abs,
> > >> +};
> > >
> > > and we don't do ctx rewrites like this either.
> > >
> > > Please see how hid-bpf and cgroup rstat are hooking up bpf
> > > in _unstable_ way.
> >
> > Can you describe what "stable" and "unstable" mean here? I'm new to BPF
> > and I'm worried that it may refer to interface stability.
> >
> > Let me describe the context. QEMU bundles an eBPF program that is used
> > for the "eBPF steering program" feature of tun. Now I'm proposing to
> > extend the feature to allow returning some values to userspace and
> > vhost_net. As such, the extension needs to be done in a way that ensures
> > interface stability.
>
> bpf is not an option then.
> we do not add stable bpf program types or hooks any more.

Does this mean eBPF cannot be used for any new use cases beyond the
existing ones?

> If a kernel subsystem wants to use bpf it needs to accept the fact
> that such bpf extensibility will be unstable and subsystem maintainers
> can decide to remove such bpf support in the future.

I don't see how it is different from the existing ones.

Thanks

>


Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

2023-10-16 Thread Jason Wang
On Tue, Oct 17, 2023 at 4:30 AM Si-Wei Liu  wrote:
>
>
>
> On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:
> > On Mon, Oct 16, 2023 at 8:33 AM Jason Wang  wrote:
> >> On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu  wrote:
> >>>
> >>>
> >>> On 10/12/2023 8:01 PM, Jason Wang wrote:
>  On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu  wrote:
> > Devices with an on-chip IOMMU or vendor specific IOTLB implementation
> > may need to restore the iotlb mapping to the initial or default state
> > using the .reset_map op, as it's desirable for some parent devices
> > to solely manipulate mappings on their own, independent of virtio device
> > state. For instance, device reset does not cause the mapping to go away
> > on such an IOTLB model in need of persistent mapping. Before vhost-vdpa
> > goes away, give them a chance to reset the iotlb back to the initial
> > state in vhost_vdpa_cleanup().
> >
> > Signed-off-by: Si-Wei Liu 
> > ---
> >drivers/vhost/vdpa.c | 16 
> >1 file changed, 16 insertions(+)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index 851535f..a3f8160 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -131,6 +131,15 @@ static struct vhost_vdpa_as 
> > *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >   return vhost_vdpa_alloc_as(v, asid);
> >}
> >
> > +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> > +{
> > +   struct vdpa_device *vdpa = v->vdpa;
> > +   const struct vdpa_config_ops *ops = vdpa->config;
> > +
> > +   if (ops->reset_map)
> > +   ops->reset_map(vdpa, asid);
> > +}
> > +
> >static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >{
> >   struct vhost_vdpa_as *as = asid_to_as(v, asid);
> > @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa 
> > *v, u32 asid)
> >
> >   hlist_del(&as->hash_link);
> >   vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> > +   /*
> > +* Devices with vendor specific IOMMU may need to restore
> > +* iotlb to the initial or default state which is not done
> > +* through device reset, as the IOTLB mapping manipulation
> > +* could be decoupled from the virtio device life cycle.
> > +*/
>  Should we do this according to whether IOTLB_PERSIST is set?
> >>> Well, in theory it seems so, but it's actually an unnecessary code
> >>> change, as that is how a vDPA parent behind the platform IOMMU works
> >>> today, and userspace doesn't break as of today. :)
> >> Well, this is a question I have asked before. You have explained
> >> that one of the reasons we don't break userspace is that it may
> >> couple IOTLB reset with vDPA reset as well. One example is QEMU.
> >>
> >>> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> >>> it doesn't necessarily mean the iotlb will definitely be destroyed
> >>> across reset (think about the platform IOMMU case), so userspace today
> >>> is already tolerant enough of either a good or a bad IOMMU.

I'm confused: how do we define "tolerating" here? For example, if
userspace is already tolerant, why bother?

> >>> This code not checking whether IOTLB_PERSIST is set is intentional;
> >>> there's no point emulating bad IOMMU behavior even for older userspace
> >>> (improper emulation would result in even worse performance).

I can easily imagine a case:

An old QEMU that works only with a setup like mlx5_vdpa. If we do
this without negotiation, the IOTLB will not be cleared, but QEMU will
still try to re-program the IOTLB after reset. Which of these will break?

1) stick to the exact old behaviour with just one line of check
2) audit all the possible cases to avoid that one line of code

1) seems much easier than 2). A sketch of 1) follows below.
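
For illustration, a rough sketch of option 1), where
vhost_vdpa_acked_persist() is a hypothetical stand-in for checking the
backend features userspace has acked:

    static int vhost_vdpa_do_reset(struct vhost_vdpa *v)
    {
            int r = vdpa_reset(v->vdpa);

            /* Userspace never acked IOTLB_PERSIST: emulate the legacy
             * behaviour where vDPA reset also drops the mappings, so old
             * userspace that re-programs the IOTLB after reset keeps
             * working. (A complete version would loop over all ASIDs.) */
            if (!vhost_vdpa_acked_persist(v))
                    vhost_vdpa_reset_map(v, 0);

            return r;
    }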

> >> For two reasons:
> >>
> >> 1) backend features need to be acked by userspace; this is by design
> >> 2) keeping the odd behaviour seems safer, as we can't audit
> >> every userspace program
> >>
> > The old behavior (without flag ack) cannot be trusted already, as:

Possibly, but the point is to avoid breaking userspace, no matter how
weird the behaviour we've had.

> > * Devices using the platform IOMMU (in other words, implementing
> > neither .set_map nor .dma_map) do not unmap memory at virtio reset.
> > * Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
> > reset IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
> > called from mlx5_vdpa_reset). With vdpa_sim patch removing the reset,
> > now all backends work the same as far as I know., which was (and is)
> > the way devices using the platform IOMMU works.
> >
> > The difference in behavior did not matter as QEMU unmaps all the
> > memory, unregistering the memory listener at vhost_vdpa_dev_start(...,
> > started = false).

Re: [PATCH net-next v1 13/19] virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer

2023-10-16 Thread Xuan Zhuo
On Mon, 16 Oct 2023 16:44:34 -0700, Jakub Kicinski  wrote:
> On Mon, 16 Oct 2023 20:00:27 +0800 Xuan Zhuo wrote:
> > @@ -305,9 +311,15 @@ static inline void virtnet_free_old_xmit(struct 
> > virtnet_sq *sq, bool in_napi,
> >
> > stats->bytes += xdp_get_frame_len(frame);
> > xdp_return_frame(frame);
> > +   } else {
> > +   stats->bytes += virtnet_ptr_to_xsk(ptr);
> > +   ++xsknum;
> > }
> > stats->packets++;
> > }
> > +
> > +   if (xsknum)
> > +   xsk_tx_completed(sq->xsk.pool, xsknum);
> >  }
>
> sparse complains:
>
> drivers/net/virtio/virtio_net.h:322:41: warning: incorrect type in argument 1 
> (different address spaces)
> drivers/net/virtio/virtio_net.h:322:41:expected struct xsk_buff_pool *pool
> drivers/net/virtio/virtio_net.h:322:41:got struct xsk_buff_pool
> [noderef] __rcu *pool
>
> please build test with W=1 C=1

OK. I will add C=1 to my script.

Thanks.
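
For reference, a sketch of one way to silence that sparse warning: keep
the __rcu annotation and dereference the pool under RCU (this assumes the
pool pointer is published with rcu_assign_pointer() elsewhere):

    struct xsk_buff_pool *pool;

    rcu_read_lock();
    pool = rcu_dereference(sq->xsk.pool);
    if (pool && xsknum)
            xsk_tx_completed(pool, xsknum);
    rcu_read_unlock();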


> --
> pw-bot: cr


[PATCH v2 4/4] vdpa/mlx5: implement .reset_map driver op

2023-10-16 Thread Si-Wei Liu
Since commit 6f5312f80183 ("vdpa/mlx5: Add support for running with
virtio_vdpa"), mlx5_vdpa starts with preallocate 1:1 DMA MR at device
creation time. This 1:1 DMA MR will be implicitly destroyed while
the first .set_map call is invoked, in which case callers like
vhost-vdpa will start to set up custom mappings. When the .reset
callback is invoked, the custom mappings will be cleared and the 1:1
DMA MR will be re-created.

In order to reduce excessive memory mapping cost in live migration,
it is desirable to decouple the vhost-vdpa IOTLB abstraction from
the virtio device life cycle, i.e. mappings can be kept around intact
across virtio device reset. Leverage the .reset_map callback, which
is meant to destroy the regular MR on the given ASID and recreate the
initial DMA mapping. That way, the device .reset op can run free from
having to maintain and clean up memory mappings by itself.

The cvq mapping also needs to be cleared if it is in the given ASID.

Co-developed-by: Dragos Tatulea 
Signed-off-by: Dragos Tatulea 
Signed-off-by: Si-Wei Liu 
Acked-by: Eugenio Pérez 
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c| 17 +
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 18 +-
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index db988ced5a5d..84547d998bcf 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -127,6 +127,7 @@ int mlx5_vdpa_update_cvq_iotlb(struct mlx5_vdpa_dev *mvdev,
struct vhost_iotlb *iotlb,
unsigned int asid);
 int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev);
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid);
 
 #define mlx5_vdpa_warn(__dev, format, ...) 
\
dev_warn((__dev)->mdev->device, "%s:%d:(pid %d) warning: " format, 
__func__, __LINE__, \
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 66530e28f327..2197c46e563a 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -645,3 +645,20 @@ int mlx5_vdpa_create_dma_mr(struct mlx5_vdpa_dev *mvdev)
 
return mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, 0);
 }
+
+int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, unsigned int asid)
+{
+   if (asid >= MLX5_VDPA_NUM_AS)
+   return -EINVAL;
+
+   mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[asid]);
+
+   if (asid == 0 && MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
+   if (mlx5_vdpa_create_dma_mr(mvdev))
+   mlx5_vdpa_warn(mvdev, "create DMA MR failed\n");
+   } else {
+   mlx5_vdpa_update_cvq_iotlb(mvdev, NULL, asid);
+   }
+
+   return 0;
+}
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 6abe02310f2b..928e71bc5571 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -2838,7 +2838,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
unregister_link_notifier(ndev);
teardown_driver(ndev);
clear_vqs_ready(ndev);
-   mlx5_vdpa_destroy_mr_resources(&ndev->mvdev);
ndev->mvdev.status = 0;
ndev->mvdev.suspended = false;
ndev->cur_num_vqs = 0;
@@ -2849,10 +2848,6 @@ static int mlx5_vdpa_reset(struct vdpa_device *vdev)
init_group_to_asid_map(mvdev);
++mvdev->generation;
 
-   if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
-   if (mlx5_vdpa_create_dma_mr(mvdev))
-   mlx5_vdpa_warn(mvdev, "create MR failed\n");
-   }
	up_write(&ndev->reslock);
 
return 0;
@@ -2932,6 +2927,18 @@ static int mlx5_vdpa_set_map(struct vdpa_device *vdev, 
unsigned int asid,
return err;
 }
 
+static int mlx5_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid)
+{
+   struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
+   struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
+   int err;
+
+   down_write(&ndev->reslock);
+   err = mlx5_vdpa_reset_mr(mvdev, asid);
+   up_write(&ndev->reslock);
+   return err;
+}
+
 static struct device *mlx5_get_vq_dma_dev(struct vdpa_device *vdev, u16 idx)
 {
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
@@ -3199,6 +3206,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = {
.set_config = mlx5_vdpa_set_config,
.get_generation = mlx5_vdpa_get_generation,
.set_map = mlx5_vdpa_set_map,
+   .reset_map = mlx5_vdpa_reset_map,
.set_group_asid = mlx5_set_group_asid,
.get_vq_dma_dev = mlx5_get_vq_dma_dev,
.free = mlx5_vdpa_free,
-- 
2.39.3


[PATCH v2 0/4] vdpa: decouple reset of iotlb mapping from device reset

2023-10-16 Thread Si-Wei Liu
In order to reduce needlessly high setup and teardown cost
of iotlb mapping during live migration, it's crucial to
decouple the vhost-vdpa iotlb abstraction from the virtio
device life cycle, i.e. iotlb mappings should be left
intact across virtio device reset [1]. For it to work, the
on-chip IOMMU parent device could implement a separate
.reset_map() operation callback to restore 1:1 DMA mapping
without having to resort to the .reset() callback, the
latter of which is mainly used to reset virtio device state.
This new .reset_map() callback will be invoked only before
the vhost-vdpa driver is to be removed and detached from
the vdpa bus, such that other vdpa bus drivers, e.g. 
virtio-vdpa, can start with 1:1 DMA mapping when they
are attached. For the context, those on-chip IOMMU parent
devices, create the 1:1 DMA mapping at vdpa device creation,
and they would implicitly destroy the 1:1 mapping when
the first .set_map or .dma_map callback is invoked.

This patchset is based off of the descriptor group v3 series
from Dragos. [2]

[1] Reducing vdpa migration downtime because of memory pin / maps
https://www.mail-archive.com/qemu-devel@nongnu.org/msg953755.html
[2] [PATCH vhost v3 00/16] vdpa: Add support for vq descriptor mappings
https://lore.kernel.org/lkml/20231009112401.1060447-1-dtatu...@nvidia.com/

---
v2:
- improved commit message to clarify the intended scope of the .reset_map API
- improved commit messages to clarify no breakage on older userspace

v1:
- rewrote commit messages to include more detailed description and background
- reworded "on-chip IOMMU" to "vendor specific IOMMU implementation"
- included parent device backend features in the persistent iotlb precondition
- reimplement mlx5_vdpa patch on top of descriptor group series

RFC v3:
- fix missing return due to merge error in patch #4

RFC v2:
- rebased on top of the "[PATCH RFC v2 0/3] vdpa: dedicated descriptor table 
group" series:
  
https://lore.kernel.org/virtualization/1694248959-13369-1-git-send-email-si-wei@oracle.com/

---

Si-Wei Liu (4):
  vdpa: introduce .reset_map operation callback
  vhost-vdpa: reset vendor specific mapping to initial state in .release
  vhost-vdpa: introduce IOTLB_PERSIST backend feature bit
  vdpa/mlx5: implement .reset_map driver op

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  1 +
 drivers/vdpa/mlx5/core/mr.c| 17 
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 18 -
 drivers/vhost/vdpa.c   | 31 ++
 include/linux/vdpa.h   | 10 ++
 include/uapi/linux/vhost_types.h   |  2 ++
 6 files changed, 74 insertions(+), 5 deletions(-)

-- 
2.39.3



[PATCH v2 1/4] vdpa: introduce .reset_map operation callback

2023-10-16 Thread Si-Wei Liu
A device-specific IOMMU parent driver that wishes to see mappings
decoupled from the virtio or vdpa device life cycle (device reset) can
use it to restore the memory mapping in the device IOMMU to the initial
or default state. The reset of mapping is done on a per address space basis.

The reason why a separate .reset_map op is introduced is because this
allows a simple on-chip IOMMU model without exposing too much device
implementation details to the upper vdpa layer. The .dma_map/unmap or
.set_map driver API is meant to be used to manipulate the IOTLB mappings,
and has been abstracted in a way similar to how a real IOMMU device maps
or unmaps pages for certain memory ranges. However, apart from this there
also exists other mapping needs, in which case 1:1 passthrough mapping
has to be used by other users (read virtio-vdpa). To ease parent/vendor
driver implementation and to avoid abusing DMA ops in an unexpected way,
these on-chip IOMMU devices can start in 1:1 passthrough mapping mode
at the time of creation. Then the .reset_map op can be used
to switch iotlb back to this initial state without having to expose a
complex two-dimensional IOMMU device model.

The .reset_map op is not a MUST for every parent that implements the
.dma_map or .set_map API, because there could be software vDPA devices
(which have use_va=true) that don't have to pin kernel memory, so they
don't care much about high mapping cost during device reset. And those
software devices may have implemented their own DMA ops, so they don't
have to use .reset_map to achieve a simple IOMMU device model for 1:1
passthrough mapping, either.

Signed-off-by: Si-Wei Liu 
Acked-by: Eugenio Pérez 
Acked-by: Jason Wang 
---
 include/linux/vdpa.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/vdpa.h b/include/linux/vdpa.h
index d376309b99cf..26ae6ae1eac3 100644
--- a/include/linux/vdpa.h
+++ b/include/linux/vdpa.h
@@ -327,6 +327,15 @@ struct vdpa_map_file {
  * @iova: iova to be unmapped
  * @size: size of the area
  * Returns integer: success (0) or error (< 0)
+ * @reset_map: Reset device memory mapping to the default
+ * state (optional)
+ * Needed for devices that are using device
+ * specific DMA translation and prefer mapping
+ * to be decoupled from the virtio life cycle,
+ * i.e. device .reset op does not reset mapping
+ * @vdev: vdpa device
+ * @asid: address space identifier
+ * Returns integer: success (0) or error (< 0)
  * @get_vq_dma_dev:Get the dma device for a specific
  * virtqueue (optional)
  * @vdev: vdpa device
@@ -405,6 +414,7 @@ struct vdpa_config_ops {
   u64 iova, u64 size, u64 pa, u32 perm, void *opaque);
int (*dma_unmap)(struct vdpa_device *vdev, unsigned int asid,
 u64 iova, u64 size);
+   int (*reset_map)(struct vdpa_device *vdev, unsigned int asid);
int (*set_group_asid)(struct vdpa_device *vdev, unsigned int group,
  unsigned int asid);
struct device *(*get_vq_dma_dev)(struct vdpa_device *vdev, u16 idx);
-- 
2.39.3
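
For illustration, a minimal sketch of how a parent driver might wire up
the new op (my_vdpa, my_iommu_clear() and my_iommu_map_identity() are
hypothetical names, not an existing driver):

    static int my_vdpa_reset_map(struct vdpa_device *vdev, unsigned int asid)
    {
            struct my_vdpa *p = container_of(vdev, struct my_vdpa, vdpa);

            /* drop whatever .set_map/.dma_map installed for this ASID... */
            my_iommu_clear(p, asid);
            /* ...and restore the 1:1 passthrough mapping created at
             * device creation time */
            return my_iommu_map_identity(p, asid);
    }

    static const struct vdpa_config_ops my_vdpa_ops = {
            .reset_map = my_vdpa_reset_map,
            /* ... .set_map / .dma_map and the rest of the ops ... */
    };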


[PATCH v2 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

2023-10-16 Thread Si-Wei Liu
Devices with an on-chip IOMMU or vendor specific IOTLB implementation
may need to restore the iotlb mapping to the initial or default state
using the .reset_map op, as it's desirable for some parent devices
to solely manipulate mappings on their own, independent of virtio device
state. For instance, device reset does not cause the mapping to go away
on such an IOTLB model in need of persistent mapping. Before vhost-vdpa
goes away, give them a chance to reset the iotlb back to the initial
state in vhost_vdpa_cleanup().

Signed-off-by: Si-Wei Liu 
Acked-by: Eugenio Pérez 
---
 drivers/vhost/vdpa.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 851535f57b95..a3f8160c9807 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -131,6 +131,15 @@ static struct vhost_vdpa_as 
*vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
return vhost_vdpa_alloc_as(v, asid);
 }
 
+static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
+{
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   if (ops->reset_map)
+   ops->reset_map(vdpa, asid);
+}
+
 static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
 {
struct vhost_vdpa_as *as = asid_to_as(v, asid);
@@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 
asid)
 
	hlist_del(&as->hash_link);
	vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
+   /*
+* Devices with vendor specific IOMMU may need to restore
+* iotlb to the initial or default state which is not done
+* through device reset, as the IOTLB mapping manipulation
+* could be decoupled from the virtio device life cycle.
+*/
+   vhost_vdpa_reset_map(v, asid);
kfree(as);
 
return 0;
-- 
2.39.3


[PATCH v2 3/4] vhost-vdpa: introduce IOTLB_PERSIST backend feature bit

2023-10-16 Thread Si-Wei Liu
Userspace needs this feature flag to distinguish if vhost-vdpa
iotlb in the kernel can be trusted to persist IOTLB mapping
across vDPA reset. Without it, userspace has no way to tell apart
if it's running on an older kernel, which could silently drop
all iotlb mapping across vDPA reset, especially with broken
parent driver implementation for the .reset driver op. The broken
driver may incorrectly drop all mappings of its own as part of
.reset, which inadvertently ends up with corrupted mapping state
between vhost-vdpa userspace and the kernel. As a workaround, to
make the mapping behaviour predictable across reset, userspace
has to proactively remove all mappings before vDPA reset, and
then restore all the mappings afterwards. This workaround is done
unconditionally on top of all parent drivers today, due to the
parent driver implementation issue and no means to differentiate.
This workaround had been utilized in QEMU since day one when the
corresponding vhost-vdpa userspace backend came to the world.

There are 3 cases in which the backend may claim this feature bit:

- parent device that has to work with platform IOMMU
- parent device with on-chip IOMMU that has the expected
  .reset_map support in driver
- parent device with vendor specific IOMMU implementation with
  persistent IOTLB mapping already that has to specifically
  declare this backend feature

The reason why .reset_map is being one of the pre-condition for
persistent iotlb is because without it, vhost-vdpa can't switch
back iotlb to the initial state later on, especially for the
on-chip IOMMU case which starts with identity mapping at device
creation. virtio-vdpa requires on-chip IOMMU to perform 1:1
passthrough translation from PA to IOVA as-is to begin with, and
.reset_map is the only means to turn back iotlb to the identity
mapping mode after vhost-vdpa is gone.

The difference in behavior did not matter as QEMU unmaps all the
memory, unregistering the memory listener at vhost_vdpa_dev_start(
started = false), but the backend acknowledging this feature flag
allows QEMU to make sure it is safe to skip this unmap & map in the
case of vhost stop & start cycle.

In that sense, this feature flag is actually a signal for userspace
to know that the driver bug has been solved. Not offering it
indicates that userspace cannot trust the kernel will retain the
maps.

Signed-off-by: Si-Wei Liu 
Acked-by: Eugenio Pérez 
---
 drivers/vhost/vdpa.c | 15 +++
 include/uapi/linux/vhost_types.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index a3f8160c9807..9202986a7d81 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -438,6 +438,15 @@ static u64 vhost_vdpa_get_backend_features(const struct 
vhost_vdpa *v)
return ops->get_backend_features(vdpa);
 }
 
+static bool vhost_vdpa_has_persistent_map(const struct vhost_vdpa *v)
+{
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   return (!ops->set_map && !ops->dma_map) || ops->reset_map ||
+  vhost_vdpa_get_backend_features(v) & 
BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
+}
+
 static long vhost_vdpa_set_features(struct vhost_vdpa *v, u64 __user *featurep)
 {
struct vdpa_device *vdpa = v->vdpa;
@@ -725,6 +734,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
return -EFAULT;
if (features & ~(VHOST_VDPA_BACKEND_FEATURES |
 BIT_ULL(VHOST_BACKEND_F_DESC_ASID) |
+BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 BIT_ULL(VHOST_BACKEND_F_RESUME) |
 
BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
@@ -741,6 +751,9 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
if ((features & BIT_ULL(VHOST_BACKEND_F_DESC_ASID)) &&
 !vhost_vdpa_has_desc_group(v))
return -EOPNOTSUPP;
+   if ((features & BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST)) &&
+!vhost_vdpa_has_persistent_map(v))
+   return -EOPNOTSUPP;
	vhost_set_backend_features(&v->vdev, features);
return 0;
}
@@ -796,6 +809,8 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
features |= BIT_ULL(VHOST_BACKEND_F_RESUME);
if (vhost_vdpa_has_desc_group(v))
features |= BIT_ULL(VHOST_BACKEND_F_DESC_ASID);
+   if (vhost_vdpa_has_persistent_map(v))
+   features |= BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST);
features |= vhost_vdpa_get_backend_features(v);
	if (copy_to_user(featurep, &features, sizeof(features)))
r = -EFAULT;
diff --git 
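
For context, a sketch of how userspace might negotiate the new bit over
the vhost-vdpa fd (a minimal example; real code would preserve the other
backend feature bits it wants when acking):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /* Returns 1 if mappings persist across vDPA reset, 0 if the old
     * unmap-and-remap workaround is still required, -1 on error. */
    static int negotiate_iotlb_persist(int vdpa_fd)
    {
            uint64_t features;

            if (ioctl(vdpa_fd, VHOST_GET_BACKEND_FEATURES, &features))
                    return -1;
            if (!(features & (1ULL << VHOST_BACKEND_F_IOTLB_PERSIST)))
                    return 0;
            features = 1ULL << VHOST_BACKEND_F_IOTLB_PERSIST;
            if (ioctl(vdpa_fd, VHOST_SET_BACKEND_FEATURES, &features))
                    return -1;
            return 1;
    }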

Re: [RFC PATCH v2 1/7] bpf: Introduce BPF_PROG_TYPE_VNET_HASH

2023-10-16 Thread Willem de Bruijn
On Mon, Oct 16, 2023 at 7:53 PM Alexei Starovoitov wrote:
>
> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki  
> wrote:
> >
> > On 2023/10/16 1:07, Alexei Starovoitov wrote:
> > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki  
> > > wrote:
> > >>
> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> index 0448700890f7..298634556fab 100644
> > >> --- a/include/uapi/linux/bpf.h
> > >> +++ b/include/uapi/linux/bpf.h
> > >> @@ -988,6 +988,7 @@ enum bpf_prog_type {
> > >>  BPF_PROG_TYPE_SK_LOOKUP,
> > >>  BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> > >>  BPF_PROG_TYPE_NETFILTER,
> > >> +   BPF_PROG_TYPE_VNET_HASH,
> > >
> > > Sorry, we do not add new stable program types anymore.
> > >
> > >> @@ -6111,6 +6112,10 @@ struct __sk_buff {
> > >>  __u8  tstamp_type;
> > >>  __u32 :24;  /* Padding, future use. */
> > >>  __u64 hwtstamp;
> > >> +
> > >> +   __u32 vnet_hash_value;
> > >> +   __u16 vnet_hash_report;
> > >> +   __u16 vnet_rss_queue;
> > >>   };
> > >
> > > we also do not add anything to uapi __sk_buff.
> > >
> > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
> > >> +   .get_func_proto = sk_filter_func_proto,
> > >> +   .is_valid_access= sk_filter_is_valid_access,
> > >> +   .convert_ctx_access = bpf_convert_ctx_access,
> > >> +   .gen_ld_abs = bpf_gen_ld_abs,
> > >> +};
> > >
> > > and we don't do ctx rewrites like this either.
> > >
> > > Please see how hid-bpf and cgroup rstat are hooking up bpf
> > > in _unstable_ way.
> >
> > Can you describe what "stable" and "unstable" mean here? I'm new to BPF
> > and I'm worried that it may refer to interface stability.
> >
> > Let me describe the context. QEMU bundles an eBPF program that is used
> > for the "eBPF steering program" feature of tun. Now I'm proposing to
> > extend the feature to allow returning some values to userspace and
> > vhost_net. As such, the extension needs to be done in a way that ensures
> > interface stability.
>
> bpf is not an option then.
> we do not add stable bpf program types or hooks any more.
> If a kernel subsystem wants to use bpf it needs to accept the fact
> that such bpf extensibility will be unstable and subsystem maintainers
> can decide to remove such bpf support in the future.

Based on hooks for tracepoints and kfuncs, correct?

Perhaps the existing stable flow dissector type is extensible to
optionally compute the hash and report the hash and hash type. Otherwise
we should probably revisit the previous version of this series.


Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

2023-10-16 Thread Si-Wei Liu



On 10/16/2023 4:28 AM, Eugenio Perez Martin wrote:

On Mon, Oct 16, 2023 at 8:33 AM Jason Wang  wrote:

On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu  wrote:



On 10/12/2023 8:01 PM, Jason Wang wrote:

On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu  wrote:

Devices with an on-chip IOMMU or a vendor specific IOTLB implementation
may need to restore the iotlb mapping to the initial or default state
using the .reset_map op, as it's desirable for some parent devices
to solely manipulate mappings on their own, independent of virtio device
state. For instance, device reset does not cause the mapping to go away
on such an IOTLB model in need of persistent mapping. Before vhost-vdpa
goes away, give them a chance to reset the iotlb back to the initial
state in vhost_vdpa_cleanup().

Signed-off-by: Si-Wei Liu 
---
   drivers/vhost/vdpa.c | 16 
   1 file changed, 16 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 851535f..a3f8160 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -131,6 +131,15 @@ static struct vhost_vdpa_as 
*vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
  return vhost_vdpa_alloc_as(v, asid);
   }

+static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
+{
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   if (ops->reset_map)
+   ops->reset_map(vdpa, asid);
+}
+
   static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
   {
  struct vhost_vdpa_as *as = asid_to_as(v, asid);
@@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 
asid)

  hlist_del(&as->hash_link);
  vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
+   /*
+* Devices with vendor specific IOMMU may need to restore
+* iotlb to the initial or default state which is not done
+* through device reset, as the IOTLB mapping manipulation
+* could be decoupled from the virtio device life cycle.
+*/

Should we do this according to whether IOTLB_PERSIST is set?

Well, in theory it seems so, but it's actually an unnecessary code change,
as that is how a vDPA parent behind a platform IOMMU works today, and
userspace doesn't break as of today. :)

Well, this is a question I've asked before. You have explained
that one of the reasons we don't break userspace is that they may
couple IOTLB reset with vDPA reset as well. One example is QEMU.


As explained in previous threads [1][2], when IOTLB_PERSIST is not set
it doesn't necessarily mean the iotlb will definitely be destroyed
across reset (think about the platform IOMMU case), so userspace today
is already tolerant enough of either a good or a bad IOMMU. Not checking
whether IOTLB_PERSIST is set is intentional here; there's no point in
emulating bad IOMMU behavior even for older userspace (improper
emulation would result in even worse performance).

For two reasons:

1) backend features need to be acked by userspace; this is by design
2) keeping the odd behaviour seems safer, as we can't audit
every userspace program


The old behavior (without the flag ack) cannot be trusted already, as:
* Devices using the platform IOMMU (in other words, implementing
neither .set_map nor .dma_map) do not unmap memory at virtio reset.
* Devices that implement .set_map or .dma_map (vdpa_sim, mlx5) do
reset the IOTLB, but in their parent ops (vdpasim_do_reset, prune_iotlb
called from mlx5_vdpa_reset). With the vdpa_sim patch removing the reset,
all backends now work the same as far as I know, which was (and is)
the way devices using the platform IOMMU work.

The difference in behavior did not matter, as QEMU unmaps all the
memory when unregistering the memory listener at vhost_vdpa_dev_start(...,
started = false),
Exactly. It's not just QEMU: any (older) userspace that manipulates
mappings through the vhost-vdpa iotlb interface has to unmap all
mappings to work around the vdpa parent driver bug. If it doesn't do
an explicit unmap, it causes state inconsistency between vhost-vdpa
and the parent driver; then old mappings can't be restored, and new
mappings can't reliably be added to the iotlb after vDPA reset. There's
no point in preserving this broken and inconsistent behavior between
vhost-vdpa and the parent driver, as userspace doesn't care at all!



but the backend acknowledging this feature flag
allows QEMU to make sure it is safe to skip this unmap & map across a
vhost stop & start cycle.

In that sense, this feature flag is actually a signal for userspace to
know that the bug has been solved.
Right, I couldn't say it better myself, thanks! The feature flag is
more of an unusual means of indicating that the kernel bug has been
fixed, rather than introducing a new feature or new kernel behavior
that changes userspace's expectations.



Not offering it indicates that
userspace cannot trust the kernel will retain the maps.

Si-Wei or Dragos, please correct me if 

Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

2023-10-16 Thread Si-Wei Liu



On 10/15/2023 11:32 PM, Jason Wang wrote:

On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu  wrote:



On 10/12/2023 8:01 PM, Jason Wang wrote:

On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu  wrote:

Devices with an on-chip IOMMU or a vendor specific IOTLB implementation
may need to restore the iotlb mapping to the initial or default state
using the .reset_map op, as it's desirable for some parent devices
to solely manipulate mappings on their own, independent of virtio device
state. For instance, device reset does not cause the mapping to go away
on such an IOTLB model in need of persistent mapping. Before vhost-vdpa
goes away, give them a chance to reset the iotlb back to the initial
state in vhost_vdpa_cleanup().

Signed-off-by: Si-Wei Liu 
---
   drivers/vhost/vdpa.c | 16 
   1 file changed, 16 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 851535f..a3f8160 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -131,6 +131,15 @@ static struct vhost_vdpa_as 
*vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
  return vhost_vdpa_alloc_as(v, asid);
   }

+static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
+{
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+
+   if (ops->reset_map)
+   ops->reset_map(vdpa, asid);
+}
+
   static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
   {
  struct vhost_vdpa_as *as = asid_to_as(v, asid);
@@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 
asid)

  hlist_del(&as->hash_link);
  vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
+   /*
+* Devices with vendor specific IOMMU may need to restore
+* iotlb to the initial or default state which is not done
+* through device reset, as the IOTLB mapping manipulation
+* could be decoupled from the virtio device life cycle.
+*/

Should we do this according to whether IOTLB_PERSIST is set?

Well, in theory it seems so, but it's actually an unnecessary code change,
as that is how a vDPA parent behind a platform IOMMU works today, and
userspace doesn't break as of today. :)

Well, this is a question I've asked before. You have explained
that one of the reasons we don't break userspace is that they may
couple IOTLB reset with vDPA reset as well. One example is QEMU.
Nope, it was the opposite. Maybe it was not clear enough, let me try
once more - userspace CANNOT decouple IOTLB reset from vDPA reset today.
This is because a bug/discrepancy in mlx5_vdpa and vdpa_sim already
breaks userspace's expectation, keeping the vhost-vdpa mapping interface
from behaving as it promised and should have done. Only when the
IOTLB_PERSIST flag is seen can userspace trust the vhost-vdpa kernel
interface to *reliably* decouple IOTLB reset from vDPA reset. Without
seeing this flag, no matter how the code in QEMU was written, today's
older userspace could never safely assume the mappings will *definitely*
be cleared by vDPA reset. If any userspace implementation wants
consistent behavior across all vDPA parent devices, it still has to
*explicitly* clear all existing mappings on its own, by sending a bunch
of unmap (iotlb invalidate) requests to the vhost-vdpa kernel before
resetting the vDPA backend.


In brief, userspace is already broken by the kernel implementation today,
and new userspace needs a device flag to know for sure that the kernel
bug has been fixed; older userspace doesn't care about preserving the
broken kernel behavior at all, regardless of whether or not it wants to
decouple IOTLB reset from vDPA reset.





As explained in previous threads [1][2], when IOTLB_PERSIST is not set
it doesn't necessarily mean the iotlb will definitely be destroyed
across reset (think about the platform IOMMU case), so userspace today
is already tolerant enough of either a good or a bad IOMMU. Not checking
whether IOTLB_PERSIST is set is intentional here; there's no point in
emulating bad IOMMU behavior even for older userspace (improper
emulation would result in even worse performance).

For two reasons:

1) backend features need to be acked by userspace; this is by design
There's no breakage on this part. The IOTLB_PERSIST backend feature won't
be set if userspace doesn't ack it.

2) keeping the odd behaviour seems safer, as we can't audit
every userspace program
We definitely don't have to audit every userspace program, but I cannot
think of a case where a sane userspace program would be broken. Can you
elaborate on one or two potential userspace usages that may break because
of this? As said, the platform IOMMU already does it this way.


Regards,
-Siwei


Thanks


I think
the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
certainty that persistent iotlb mappings won't get lost across vDPA reset.

Thanks,
-Siwei

[1]
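As an aside, here is a minimal sketch of how a new userspace could probe
and ack the flag before deciding to skip the unmap/remap cycle across
reset. VHOST_BACKEND_F_IOTLB_PERSIST is the bit proposed by this series
(not yet in a released uapi), and real code would ack the whole
negotiated feature set rather than a single bit:

	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	/* returns 1 if the kernel guarantees iotlb persistency across reset */
	static int ack_iotlb_persist(int vhost_vdpa_fd)
	{
		uint64_t features;

		if (ioctl(vhost_vdpa_fd, VHOST_GET_BACKEND_FEATURES, &features))
			return 0;
		if (!(features & (1ULL << VHOST_BACKEND_F_IOTLB_PERSIST)))
			return 0; /* old kernel: keep the explicit unmap workaround */

		features = 1ULL << VHOST_BACKEND_F_IOTLB_PERSIST;
		return !ioctl(vhost_vdpa_fd, VHOST_SET_BACKEND_FEATURES, &features);
	}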

Re: [RFC PATCH] ALSA: virtio: use copy and fill_silence callbacks

2023-10-16 Thread Stefan Hajnoczi
On Thu, Oct 12, 2023 at 05:10:50PM +0200, Matias Ezequiel Vara Larsen wrote:
> This commit replaces the mmap mechanism with the copy() and
> fill_silence() callbacks for both capturing and playback in the
> virtio-sound driver. This change is required to prevent updating the
> content of a buffer that is already in the available ring.
> 
> The current mechanism splits a dma buffer into descriptors that are
> exposed to the device. This dma buffer is shared with the user
> application. When the device consumes a buffer, the driver moves the
> request from the used ring to the available ring.
> 
> The driver exposes the buffer to the device without knowing whether the
> content has been updated by the user. Section 2.8.21.1 of the
> virtio spec states: "The device MAY access the descriptor chains
> the driver created and the memory they refer to immediately". If the
> device picks up buffers from the available ring just after it is
> notified, the content may be stale.
> 
> By providing the copy() callback, the driver first updates the content
> of the buffer and only then exposes the buffer to the device by
> enqueuing it in the available ring. Thus, the device always picks up a
> buffer that has been updated.
> 
> For capturing, the driver starts by exposing all the available buffers
> to the device. After the device updates the content of a buffer, it
> enqueues it in the used ring. Only after the copy() for capturing is
> issued does the driver re-enqueue the buffer in the available ring.
> 
> Note that the copy() function assumes that the user always writes a
> full period. Testing suggests this is true, but I may be wrong. This
> RFC aims to clarify that.

This sounds like an ALSA driver API question.

Jaroslav and Takashi: Any thoughts?

Stefan

> 
> Signed-off-by: Matias Ezequiel Vara Larsen 
> ---
>  sound/virtio/virtio_pcm.c | 11 ++--
>  sound/virtio/virtio_pcm.h |  9 +++-
>  sound/virtio/virtio_pcm_msg.c | 50 ---
>  sound/virtio/virtio_pcm_ops.c | 94 +++
>  4 files changed, 137 insertions(+), 27 deletions(-)
> 
> diff --git a/sound/virtio/virtio_pcm.c b/sound/virtio/virtio_pcm.c
> index c10d91fff2fb..bfe982952303 100644
> --- a/sound/virtio/virtio_pcm.c
> +++ b/sound/virtio/virtio_pcm.c
> @@ -104,8 +104,6 @@ static int virtsnd_pcm_build_hw(struct 
> virtio_pcm_substream *vss,
>* only message-based transport.
>*/
>   vss->hw.info =
> - SNDRV_PCM_INFO_MMAP |
> - SNDRV_PCM_INFO_MMAP_VALID |
>   SNDRV_PCM_INFO_BATCH |
>   SNDRV_PCM_INFO_BLOCK_TRANSFER |
>   SNDRV_PCM_INFO_INTERLEAVED |
> @@ -471,12 +469,11 @@ int virtsnd_pcm_build_devs(struct virtio_snd *snd)
>   for (kss = ks->substream; kss; kss = kss->next)
>   vs->substreams[kss->number]->substream = kss;
>  
> - snd_pcm_set_ops(vpcm->pcm, i, &virtsnd_pcm_ops);
> + if (i == SNDRV_PCM_STREAM_CAPTURE)
> + snd_pcm_set_ops(vpcm->pcm, i, &virtsnd_pcm_capture_ops);
> + else
> + snd_pcm_set_ops(vpcm->pcm, i, &virtsnd_pcm_playback_ops);
>   }
> -
> - snd_pcm_set_managed_buffer_all(vpcm->pcm,
> -SNDRV_DMA_TYPE_VMALLOC, NULL,
> -0, 0);
>   }
>  
>   return 0;
> diff --git a/sound/virtio/virtio_pcm.h b/sound/virtio/virtio_pcm.h
> index 062eb8e8f2cf..1c1106ec971f 100644
> --- a/sound/virtio/virtio_pcm.h
> +++ b/sound/virtio/virtio_pcm.h
> @@ -50,6 +50,8 @@ struct virtio_pcm_substream {
>   struct work_struct elapsed_period;
>   spinlock_t lock;
>   size_t buffer_bytes;
> + u8 *buffer;
> + size_t buffer_sz;
>   size_t hw_ptr;
>   bool xfer_enabled;
>   bool xfer_xrun;
> @@ -90,7 +92,8 @@ struct virtio_pcm {
>   struct virtio_pcm_stream streams[SNDRV_PCM_STREAM_LAST + 1];
>  };
>  
> -extern const struct snd_pcm_ops virtsnd_pcm_ops;
> +extern const struct snd_pcm_ops virtsnd_pcm_playback_ops;
> +extern const struct snd_pcm_ops virtsnd_pcm_capture_ops;
>  
>  int virtsnd_pcm_validate(struct virtio_device *vdev);
>  
> @@ -117,7 +120,9 @@ int virtsnd_pcm_msg_alloc(struct virtio_pcm_substream 
> *vss,
>  
>  void virtsnd_pcm_msg_free(struct virtio_pcm_substream *vss);
>  
> -int virtsnd_pcm_msg_send(struct virtio_pcm_substream *vss);
> +int virtsnd_pcm_msg_send(struct virtio_pcm_substream *vss, bool single);
> +
> +int virtsnd_pcm_msg_send_locked(struct virtio_pcm_substream *vss, bool 
> single);
>  
>  unsigned int virtsnd_pcm_msg_pending_num(struct virtio_pcm_substream *vss);
>  
> diff --git a/sound/virtio/virtio_pcm_msg.c b/sound/virtio/virtio_pcm_msg.c
> index aca2dc1989ba..9a5f9814cb62 100644
> --- a/sound/virtio/virtio_pcm_msg.c
> +++ b/sound/virtio/virtio_pcm_msg.c
> @@ -132,7 +132,6 @@ static void 
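As a sketch of what the commit message describes, the playback copy()
callback boils down to roughly the following (the exact snd_pcm_ops copy
signature depends on the kernel version, the vss lookup helper is
hypothetical, and virtsnd_pcm_msg_send_locked() is the helper this RFC
adds):

	/* sketch only: copy first, expose to the device second */
	static int virtsnd_pcm_playback_copy(struct snd_pcm_substream *substream,
					     int channel, unsigned long pos,
					     struct iov_iter *iter,
					     unsigned long bytes)
	{
		/* hypothetical lookup of the driver's substream context */
		struct virtio_pcm_substream *vss = virtsnd_pcm_substream(substream);

		/* 1) land the user data in the driver-owned buffer first... */
		if (copy_from_iter(vss->buffer + pos, bytes, iter) != bytes)
			return -EFAULT;

		/* 2) ...and only then enqueue it in the available ring, so
		 * the device can never observe a half-written period
		 */
		return virtsnd_pcm_msg_send_locked(vss, true);
	}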

Re: [PATCH v2 3/4] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2

2023-10-16 Thread Juergen Gross via Virtualization

On 16.10.23 16:29, Peter Zijlstra wrote:

On Mon, Oct 16, 2023 at 02:39:32PM +0200, Juergen Gross wrote:

Instead of stacking alternative and paravirt patching, use the new
ALT_FLAG_CALL flag to switch those mixed calls to pure alternative
handling.

This eliminates the need to be careful regarding the sequence of
alternative and paravirt patching.

For call depth tracking callthunks_setup() needs to be adapted to patch
calls at alternative patching sites instead of paravirt calls.

Signed-off-by: Juergen Gross 


I cannot help but feel this would've been better as two patches, one
introducing ALT_NOT_XEN and then a second with the rest.


In case I need to respin I'll split it up.



Regardless,

Acked-by: Peter Zijlstra (Intel) 


Thanks,


Juergen



Re: [PATCH v2 4/4] x86/paravirt: remove no longer needed paravirt patching code

2023-10-16 Thread Peter Zijlstra
On Mon, Oct 16, 2023 at 02:39:33PM +0200, Juergen Gross wrote:
> Now that paravirt is using the alternatives patching infrastructure,
> remove the paravirt patching code.
> 
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/include/asm/paravirt.h   | 18 
>  arch/x86/include/asm/paravirt_types.h | 40 
>  arch/x86/include/asm/text-patching.h  | 12 -
>  arch/x86/kernel/alternative.c | 66 +--
>  arch/x86/kernel/paravirt.c| 30 
>  arch/x86/kernel/vmlinux.lds.S | 13 --
>  arch/x86/tools/relocs.c   |  2 +-
>  7 files changed, 3 insertions(+), 178 deletions(-)

More - more better! :-)

Acked-by: Peter Zijlstra (Intel) 




[PATCH v2 4/4] x86/paravirt: remove no longer needed paravirt patching code

2023-10-16 Thread Juergen Gross via Virtualization
Now that paravirt is using the alternatives patching infrastructure,
remove the paravirt patching code.

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/paravirt.h   | 18 
 arch/x86/include/asm/paravirt_types.h | 40 
 arch/x86/include/asm/text-patching.h  | 12 -
 arch/x86/kernel/alternative.c | 66 +--
 arch/x86/kernel/paravirt.c| 30 
 arch/x86/kernel/vmlinux.lds.S | 13 --
 arch/x86/tools/relocs.c   |  2 +-
 7 files changed, 3 insertions(+), 178 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9c6c5cfa9fe2..f09acce9432c 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -725,31 +725,13 @@ void native_pv_lock_init(void) __init;
 
 #else  /* __ASSEMBLY__ */
 
-#define _PVSITE(ptype, ops, word, algn)\
-771:;  \
-   ops;\
-772:;  \
-   .pushsection .parainstructions,"a"; \
-.align algn;   \
-word 771b; \
-.byte ptype;   \
-.byte 772b-771b;   \
-_ASM_ALIGN;\
-   .popsection
-
-
 #ifdef CONFIG_X86_64
 #ifdef CONFIG_PARAVIRT_XXL
 #ifdef CONFIG_DEBUG_ENTRY
 
-#define PARA_PATCH(off)((off) / 8)
-#define PARA_SITE(ptype, ops)  _PVSITE(ptype, ops, .quad, 8)
 #define PARA_INDIRECT(addr)*addr(%rip)
 
 .macro PARA_IRQ_save_fl
-   PARA_SITE(PARA_PATCH(PV_IRQ_save_fl),
- ANNOTATE_RETPOLINE_SAFE;
- call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);)
ANNOTATE_RETPOLINE_SAFE;
call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl);
 .endm
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 323dca625eea..756cb75d22b5 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -2,15 +2,6 @@
 #ifndef _ASM_X86_PARAVIRT_TYPES_H
 #define _ASM_X86_PARAVIRT_TYPES_H
 
-#ifndef __ASSEMBLY__
-/* These all sit in the .parainstructions section to tell us what to patch. */
-struct paravirt_patch_site {
-   u8 *instr;  /* original instructions */
-   u8 type;/* type of this instruction */
-   u8 len; /* length of original instruction */
-};
-#endif
-
 #ifdef CONFIG_PARAVIRT
 
 #ifndef __ASSEMBLY__
@@ -250,34 +241,6 @@ struct paravirt_patch_template {
 extern struct pv_info pv_info;
 extern struct paravirt_patch_template pv_ops;
 
-#define PARAVIRT_PATCH(x)  \
-   (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
-
-#define paravirt_type(op)  \
-   [paravirt_typenum] "i" (PARAVIRT_PATCH(op)),\
-   [paravirt_opptr] "m" (pv_ops.op)
-/*
- * Generate some code, and mark it as patchable by the
- * apply_paravirt() alternate instruction patcher.
- */
-#define _paravirt_alt(insn_string, type)   \
-   "771:\n\t" insn_string "\n" "772:\n"\
-   ".pushsection .parainstructions,\"a\"\n"\
-   _ASM_ALIGN "\n" \
-   _ASM_PTR " 771b\n"  \
-   "  .byte " type "\n"\
-   "  .byte 772b-771b\n"   \
-   _ASM_ALIGN "\n" \
-   ".popsection\n"
-
-/* Generate patchable code, with the default asm parameters. */
-#define paravirt_alt(insn_string)  \
-   _paravirt_alt(insn_string, "%c[paravirt_typenum]")
-
-/* Simple instruction patching code. */
-#define NATIVE_LABEL(a,x,b) "\n\t.globl " a #x "_" #b "\n" a #x "_" #b ":\n\t"
-
-unsigned int paravirt_patch(u8 type, void *insn_buff, unsigned long addr, 
unsigned int len);
 #define paravirt_ptr(op)   [paravirt_opptr] "m" (pv_ops.op)
 
 int paravirt_disable_iospace(void);
@@ -545,9 +508,6 @@ unsigned long pv_native_read_cr2(void);
 
 #define paravirt_nop   ((void *)x86_nop)
 
-extern struct paravirt_patch_site __parainstructions[],
-   __parainstructions_end[];
-
 #endif /* __ASSEMBLY__ */
 
#define ALT_NOT_XEN	ALT_NOT(X86_FEATURE_XENPV)
diff --git a/arch/x86/include/asm/text-patching.h 
b/arch/x86/include/asm/text-patching.h
index 29832c338cdc..0b70653a98c1 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -6,18 +6,6 @@
 #include 
 #include 
 
-struct paravirt_patch_site;
-#ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch_site *start,
-   struct paravirt_patch_site *end);
-#else
-static inline void apply_paravirt(struct paravirt_patch_site *start,
- struct 

[PATCH v2 3/4] x86/paravirt: switch mixed paravirt/alternative calls to alternative_2

2023-10-16 Thread Juergen Gross via Virtualization
Instead of stacking alternative and paravirt patching, use the new
ALT_FLAG_CALL flag to switch those mixed calls to pure alternative
handling.

This eliminates the need to be careful regarding the sequence of
alternative and paravirt patching.

For call depth tracking callthunks_setup() needs to be adapted to patch
calls at alternative patching sites instead of paravirt calls.

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/alternative.h|  5 +--
 arch/x86/include/asm/paravirt.h   | 49 +--
 arch/x86/include/asm/paravirt_types.h | 29 +++-
 arch/x86/kernel/callthunks.c  | 17 +-
 arch/x86/kernel/module.c  | 20 +++
 5 files changed, 51 insertions(+), 69 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index dd63b96577e9..f6c0ff678e2e 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -89,6 +89,8 @@ struct alt_instr {
u8  replacementlen; /* length of new instruction */
 } __packed;
 
+extern struct alt_instr __alt_instructions[], __alt_instructions_end[];
+
 /*
  * Debug flag that can be tested to see whether alternative
  * instructions were patched in already:
@@ -104,11 +106,10 @@ extern void apply_fineibt(s32 *start_retpoline, s32 
*end_retpoine,
  s32 *start_cfi, s32 *end_cfi);
 
 struct module;
-struct paravirt_patch_site;
 
 struct callthunk_sites {
s32 *call_start, *call_end;
-   struct paravirt_patch_site  *pv_start, *pv_end;
+   struct alt_instr*alt_start, *alt_end;
 };
 
 #ifdef CONFIG_CALL_THUNKS
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ed5c7342f2ef..9c6c5cfa9fe2 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -142,8 +142,7 @@ static inline void write_cr0(unsigned long x)
 static __always_inline unsigned long read_cr2(void)
 {
return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2,
-   "mov %%cr2, %%rax;",
-   ALT_NOT(X86_FEATURE_XENPV));
+   "mov %%cr2, %%rax;", ALT_NOT_XEN);
 }
 
 static __always_inline void write_cr2(unsigned long x)
@@ -154,13 +153,12 @@ static __always_inline void write_cr2(unsigned long x)
 static inline unsigned long __read_cr3(void)
 {
return PVOP_ALT_CALL0(unsigned long, mmu.read_cr3,
- "mov %%cr3, %%rax;", ALT_NOT(X86_FEATURE_XENPV));
+ "mov %%cr3, %%rax;", ALT_NOT_XEN);
 }
 
 static inline void write_cr3(unsigned long x)
 {
-   PVOP_ALT_VCALL1(mmu.write_cr3, x,
-   "mov %%rdi, %%cr3", ALT_NOT(X86_FEATURE_XENPV));
+   PVOP_ALT_VCALL1(mmu.write_cr3, x, "mov %%rdi, %%cr3", ALT_NOT_XEN);
 }
 
 static inline void __write_cr4(unsigned long x)
@@ -182,7 +180,7 @@ extern noinstr void pv_native_wbinvd(void);
 
 static __always_inline void wbinvd(void)
 {
-   PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT(X86_FEATURE_XENPV));
+   PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT_XEN);
 }
 
 static inline u64 paravirt_read_msr(unsigned msr)
@@ -390,27 +388,25 @@ static inline void paravirt_release_p4d(unsigned long pfn)
 static inline pte_t __pte(pteval_t val)
 {
return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
- "mov %%rdi, %%rax",
- ALT_NOT(X86_FEATURE_XENPV)) };
+ "mov %%rdi, %%rax", ALT_NOT_XEN) };
 }
 
 static inline pteval_t pte_val(pte_t pte)
 {
return PVOP_ALT_CALLEE1(pteval_t, mmu.pte_val, pte.pte,
-   "mov %%rdi, %%rax", ALT_NOT(X86_FEATURE_XENPV));
+   "mov %%rdi, %%rax", ALT_NOT_XEN);
 }
 
 static inline pgd_t __pgd(pgdval_t val)
 {
return (pgd_t) { PVOP_ALT_CALLEE1(pgdval_t, mmu.make_pgd, val,
- "mov %%rdi, %%rax",
- ALT_NOT(X86_FEATURE_XENPV)) };
+ "mov %%rdi, %%rax", ALT_NOT_XEN) };
 }
 
 static inline pgdval_t pgd_val(pgd_t pgd)
 {
return PVOP_ALT_CALLEE1(pgdval_t, mmu.pgd_val, pgd.pgd,
-   "mov %%rdi, %%rax", ALT_NOT(X86_FEATURE_XENPV));
+   "mov %%rdi, %%rax", ALT_NOT_XEN);
 }
 
 #define  __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
@@ -444,14 +440,13 @@ static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 static inline pmd_t __pmd(pmdval_t val)
 {
return (pmd_t) { PVOP_ALT_CALLEE1(pmdval_t, mmu.make_pmd, val,
- "mov %%rdi, %%rax",
- ALT_NOT(X86_FEATURE_XENPV)) };
+ "mov %%rdi, %%rax", ALT_NOT_XEN) };
 }
 
 static 

[PATCH v2 1/4] x86/paravirt: move some functions and defines to alternative

2023-10-16 Thread Juergen Gross via Virtualization
As a preparation for replacing paravirt patching completely by
alternative patching, move some backend functions and #defines to
alternative code and header.

Signed-off-by: Juergen Gross 
Acked-by: Peter Zijlstra (Intel) 
---
 arch/x86/include/asm/alternative.h| 16 
 arch/x86/include/asm/paravirt.h   | 12 -
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/qspinlock_paravirt.h |  4 +--
 arch/x86/kernel/alternative.c | 10 
 arch/x86/kernel/kvm.c |  4 +--
 arch/x86/kernel/paravirt.c| 30 +++
 arch/x86/xen/irq.c|  2 +-
 8 files changed, 41 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h 
b/arch/x86/include/asm/alternative.h
index 9c4da699e11a..67e50bd40bb8 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -330,6 +330,22 @@ static inline int alternatives_text_reserved(void *start, 
void *end)
  */
 #define ASM_NO_INPUT_CLOBBER(clbr...) "i" (0) : clbr
 
+/* Macro for creating assembler functions avoiding any C magic. */
+#define DEFINE_ASM_FUNC(func, instr, sec)  \
+   asm (".pushsection " #sec ", \"ax\"\n"  \
+".global " #func "\n\t"\
+".type " #func ", @function\n\t"   \
+ASM_FUNC_ALIGN "\n"\
+#func ":\n\t"  \
+ASM_ENDBR  \
+instr "\n\t"   \
+ASM_RET\
+".size " #func ", . - " #func "\n\t"   \
+".popsection")
+
+void x86_BUG(void);
+void x86_nop(void);
+
 #else /* __ASSEMBLY__ */
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6c8ff12140ae..ed5c7342f2ef 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -726,18 +726,6 @@ static __always_inline unsigned long 
arch_local_irq_save(void)
 #undef PVOP_VCALL4
 #undef PVOP_CALL4
 
-#define DEFINE_PARAVIRT_ASM(func, instr, sec)  \
-   asm (".pushsection " #sec ", \"ax\"\n"  \
-".global " #func "\n\t"\
-".type " #func ", @function\n\t"   \
-ASM_FUNC_ALIGN "\n"\
-#func ":\n\t"  \
-ASM_ENDBR  \
-instr "\n\t"   \
-ASM_RET\
-".size " #func ", . - " #func "\n\t"   \
-".popsection")
-
 extern void default_banner(void);
 void native_pv_lock_init(void) __init;
 
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 772d03487520..9f84dc074f88 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -542,8 +542,6 @@ int paravirt_disable_iospace(void);
__PVOP_VCALL(op, PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2),\
 PVOP_CALL_ARG3(arg3), PVOP_CALL_ARG4(arg4))
 
-void _paravirt_nop(void);
-void paravirt_BUG(void);
 unsigned long paravirt_ret0(void);
 #ifdef CONFIG_PARAVIRT_XXL
 u64 _paravirt_ident_64(u64);
@@ -553,7 +551,7 @@ void pv_native_irq_enable(void);
 unsigned long pv_native_read_cr2(void);
 #endif
 
-#define paravirt_nop   ((void *)_paravirt_nop)
+#define paravirt_nop   ((void *)x86_nop)
 
 extern struct paravirt_patch_site __parainstructions[],
__parainstructions_end[];
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h 
b/arch/x86/include/asm/qspinlock_paravirt.h
index 85b6e3609cb9..ef9697f20129 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -56,8 +56,8 @@ __PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath, 
".spinlock.text");
"pop%rdx\n\t"   \
FRAME_END
 
-DEFINE_PARAVIRT_ASM(__raw_callee_save___pv_queued_spin_unlock,
-   PV_UNLOCK_ASM, .spinlock.text);
+DEFINE_ASM_FUNC(__raw_callee_save___pv_queued_spin_unlock,
+   PV_UNLOCK_ASM, .spinlock.text);
 
 #else /* CONFIG_64BIT */
 
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 73be3931e4f0..1c8dd8e05f3f 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -385,6 +385,16 @@ apply_relocation(u8 *buf, size_t len, u8 *dest, u8 *src, 
size_t src_len)
}
 }
 
+/* Low-level backend functions usable from alternative code replacements. */
+DEFINE_ASM_FUNC(x86_nop, "", .entry.text);
+EXPORT_SYMBOL_GPL(x86_nop);
+
+noinstr void x86_BUG(void)
+{
+   BUG();
+}
+EXPORT_SYMBOL_GPL(x86_BUG);
+
 /*
  * Replace instructions with better alternatives for 

[PATCH v2 0/4] x86/paravirt: Get rid of paravirt patching

2023-10-16 Thread Juergen Gross via Virtualization
This is a small series getting rid of paravirt patching by switching
completely to alternative patching for the same functionality.

The basic idea is to add the capability to switch from indirect to
direct calls via a special alternative patching option.

This removes _some_ of the paravirt macro maze, but most of it needs
to stay due to the need to hide the call instructions from the
compiler in order to avoid needless register saves/restores.

What is going away is the nasty stacking of alternative and paravirt
patching and (of course) the special .parainstructions linker section.

I have tested the series on bare metal and as a Xen PV domain; it
still works.

Note that objtool might need some changes to cope with the new
indirect call patching mechanism. Additionally some paravirt handling
can probably be removed from it.

Changes in V2:
- split last patch into 2
- rebase of patch 2 as suggested by Peter
- addressed Peter's comments for patch 3

Juergen Gross (4):
  x86/paravirt: move some functions and defines to alternative
  x86/alternative: add indirect call patching
  x86/paravirt: switch mixed paravirt/alternative calls to alternative_2
  x86/paravirt: remove no longer needed paravirt patching code

 arch/x86/include/asm/alternative.h|  26 -
 arch/x86/include/asm/paravirt.h   |  79 +--
 arch/x86/include/asm/paravirt_types.h |  73 +++---
 arch/x86/include/asm/qspinlock_paravirt.h |   4 +-
 arch/x86/include/asm/text-patching.h  |  12 ---
 arch/x86/kernel/alternative.c | 116 ++
 arch/x86/kernel/callthunks.c  |  17 ++--
 arch/x86/kernel/kvm.c |   4 +-
 arch/x86/kernel/module.c  |  20 +---
 arch/x86/kernel/paravirt.c|  54 ++
 arch/x86/kernel/vmlinux.lds.S |  13 ---
 arch/x86/tools/relocs.c   |   2 +-
 arch/x86/xen/irq.c|   2 +-
 13 files changed, 137 insertions(+), 285 deletions(-)

-- 
2.35.3
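For illustration: the DEFINE_ASM_FUNC macro moved in patch 1 emits a bare
assembler function with no compiler-generated prologue, and x86_nop in
patch 1 is exactly DEFINE_ASM_FUNC(x86_nop, "", .entry.text). A
hypothetical sibling returning zero would be:

	/* hypothetical example, not part of the series */
	DEFINE_ASM_FUNC(x86_ret0_example, "xor %eax, %eax", .text);

Such bare functions are what the alternative patcher can then call
directly once the indirect pv_ops call sites are rewritten.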



[PATCH net-next v1 19/19] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY

2023-10-16 Thread Xuan Zhuo
Now that we support AF_XDP (xsk), add NETDEV_XDP_ACT_XSK_ZEROCOPY to
xdp_features.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index ac62d0955c13..c66d8509330e 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -4329,7 +4329,8 @@ static int virtnet_probe(struct virtio_device *vdev)
dev->hw_features |= NETIF_F_GRO_HW;
 
dev->vlan_features = dev->features;
-   dev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT;
+   dev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
+   NETDEV_XDP_ACT_XSK_ZEROCOPY;
 
/* MTU range: 68 - 65535 */
dev->min_mtu = MIN_MTU;
-- 
2.32.0.3.g01195cf9f



[PATCH net-next v1 13/19] virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer

2023-10-16 Thread Xuan Zhuo
virtnet_free_old_xmit() distinguishes the three pointer types (skb, xdp
frame, xsk buffer) by the low two type bits.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/virtio_net.h | 16 ++--
 drivers/net/virtio/xsk.h|  5 +
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index 7c72a8bb1813..d4e620a084f4 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -243,6 +243,11 @@ struct virtnet_info {
 
 #include "xsk.h"
 
+static inline bool virtnet_is_skb_ptr(void *ptr)
+{
+   return !((unsigned long)ptr & VIRTIO_XMIT_DATA_MASK);
+}
+
 static inline bool virtnet_is_xdp_frame(void *ptr)
 {
return (unsigned long)ptr & VIRTIO_XDP_FLAG;
@@ -279,11 +284,12 @@ static inline void *virtnet_sq_unmap(struct virtnet_sq 
*sq, void *data)
 static inline void virtnet_free_old_xmit(struct virtnet_sq *sq, bool in_napi,
 struct virtnet_sq_stats *stats)
 {
+   unsigned int xsknum = 0;
unsigned int len;
void *ptr;
 
while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   if (!virtnet_is_xdp_frame(ptr)) {
+   if (virtnet_is_skb_ptr(ptr)) {
struct sk_buff *skb;
 
if (sq->do_dma)
@@ -295,7 +301,7 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq 
*sq, bool in_napi,
 
stats->bytes += skb->len;
napi_consume_skb(skb, in_napi);
-   } else {
+   } else if (virtnet_is_xdp_frame(ptr)) {
struct xdp_frame *frame;
 
if (sq->do_dma)
@@ -305,9 +311,15 @@ static inline void virtnet_free_old_xmit(struct virtnet_sq 
*sq, bool in_napi,
 
stats->bytes += xdp_get_frame_len(frame);
xdp_return_frame(frame);
+   } else {
+   stats->bytes += virtnet_ptr_to_xsk(ptr);
+   ++xsknum;
}
stats->packets++;
}
+
+   if (xsknum)
+   xsk_tx_completed(sq->xsk.pool, xsknum);
 }
 
 static inline void virtnet_vq_napi_schedule(struct napi_struct *napi,
diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
index 1bd19dcda649..7ebc9bda7aee 100644
--- a/drivers/net/virtio/xsk.h
+++ b/drivers/net/virtio/xsk.h
@@ -14,6 +14,11 @@ static inline void *virtnet_xsk_to_ptr(u32 len)
return (void *)(p | VIRTIO_XSK_FLAG);
 }
 
+static inline u32 virtnet_ptr_to_xsk(void *ptr)
+{
+   return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
+}
+
 int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
 bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
  int budget);
-- 
2.32.0.3.g01195cf9f
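To make the tagging scheme concrete, here is a self-contained userspace
demo of the same low-bit trick; the shift width of 2 is an assumption
standing in for the driver's VIRTIO_XSK_FLAG_OFFSET:

	#include <assert.h>
	#include <stdint.h>

	#define XDP_FLAG      0x1UL   /* token is an xdp_frame pointer */
	#define XSK_FLAG      0x2UL   /* token carries an xsk frame length */
	#define TYPE_MASK     (XDP_FLAG | XSK_FLAG)
	#define XSK_LEN_SHIFT 2       /* length lives above the tag bits */

	static void *xsk_len_to_token(uint32_t len)
	{
		return (void *)(((unsigned long)len << XSK_LEN_SHIFT) | XSK_FLAG);
	}

	static uint32_t token_to_xsk_len(void *token)
	{
		return (unsigned long)token >> XSK_LEN_SHIFT;
	}

	int main(void)
	{
		void *t = xsk_len_to_token(1500);

		/* low bits say "xsk", not skb (00) and not xdp frame (x1) */
		assert(((unsigned long)t & TYPE_MASK) == XSK_FLAG);
		assert(token_to_xsk_len(t) == 1500);
		return 0;
	}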



[PATCH net-next v1 16/19] virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer

2023-10-16 Thread Xuan Zhuo
Implement the logic of xsk rx. If XDP determines that a packet is not
destined for XSK, we need one copy to generate an SKB. If it is for XSK,
the packet is received through the zero copy path.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   |  14 ++--
 drivers/net/virtio/virtio_net.h |   4 ++
 drivers/net/virtio/xsk.c| 120 
 drivers/net/virtio/xsk.h|   4 ++
 4 files changed, 137 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 0e740447b142..003dd67ab707 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -822,10 +822,10 @@ static void put_xdp_frags(struct xdp_buff *xdp)
}
 }
 
-static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
-  struct net_device *dev,
-  unsigned int *xdp_xmit,
-  struct virtnet_rq_stats *stats)
+int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
+   struct net_device *dev,
+   unsigned int *xdp_xmit,
+   struct virtnet_rq_stats *stats)
 {
struct xdp_frame *xdpf;
int err;
@@ -1589,13 +1589,17 @@ static void receive_buf(struct virtnet_info *vi, struct 
virtnet_rq *rq,
return;
}
 
-   if (vi->mergeable_rx_bufs)
+   rcu_read_lock();
+   if (rcu_dereference(rq->xsk.pool))
+   skb = virtnet_receive_xsk(dev, vi, rq, buf, len, xdp_xmit, 
stats);
+   else if (vi->mergeable_rx_bufs)
skb = receive_mergeable(dev, vi, rq, buf, ctx, len, xdp_xmit,
stats);
else if (vi->big_packets)
skb = receive_big(dev, vi, rq, buf, len, stats);
else
skb = receive_small(dev, vi, rq, buf, ctx, len, xdp_xmit, 
stats);
+   rcu_read_unlock();
 
if (unlikely(!skb))
return;
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index 6e71622fca45..fd7f34703c9b 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -346,6 +346,10 @@ static inline bool virtnet_is_xdp_raw_buffer_queue(struct 
virtnet_info *vi, int
return false;
 }
 
+int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff *xdp,
+   struct net_device *dev,
+   unsigned int *xdp_xmit,
+   struct virtnet_rq_stats *stats);
 void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq);
 void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq);
 void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq);
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 841fb078882a..f1c64414fac9 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -13,6 +13,18 @@ static void sg_fill_dma(struct scatterlist *sg, dma_addr_t 
addr, u32 len)
sg->length = len;
 }
 
+static unsigned int virtnet_receive_buf_num(struct virtnet_info *vi, char *buf)
+{
+   struct virtio_net_hdr_mrg_rxbuf *hdr;
+
+   if (vi->mergeable_rx_bufs) {
+   hdr = (struct virtio_net_hdr_mrg_rxbuf *)buf;
+   return virtio16_to_cpu(vi->vdev, hdr->num_buffers);
+   }
+
+   return 1;
+}
+
 static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
 {
struct virtnet_info *vi = sq->vq->vdev->priv;
@@ -37,6 +49,114 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
netif_stop_subqueue(dev, qnum);
 }
 
+static void merge_drop_follow_xdp(struct net_device *dev,
+ struct virtnet_rq *rq,
+ u32 num_buf,
+ struct virtnet_rq_stats *stats)
+{
+   struct xdp_buff *xdp;
+   u32 len;
+
+   while (num_buf-- > 1) {
+   	xdp = virtqueue_get_buf(rq->vq, &len);
+   if (unlikely(!xdp)) {
+   pr_debug("%s: rx error: %d buffers missing\n",
+dev->name, num_buf);
+   dev->stats.rx_length_errors++;
+   break;
+   }
+   stats->bytes += len;
+   xsk_buff_free(xdp);
+   }
+}
+
+static struct sk_buff *construct_skb(struct virtnet_rq *rq,
+struct xdp_buff *xdp)
+{
+   unsigned int metasize = xdp->data - xdp->data_meta;
+   struct sk_buff *skb;
+   unsigned int size;
+
+   size = xdp->data_end - xdp->data_hard_start;
+   skb = napi_alloc_skb(&rq->napi, size);
+   if (unlikely(!skb))
+   return NULL;
+
+   skb_reserve(skb, xdp->data_meta - xdp->data_hard_start);
+
+   size = xdp->data_end - xdp->data_meta;
+   memcpy(__skb_put(skb, size), xdp->data_meta, size);
+
+   if (metasize) {
+   

[PATCH net-next v1 18/19] virtio_net: update tx timeout record

2023-10-16 Thread Xuan Zhuo
If the send queue sent some packets, update the tx timeout
record to prevent a spurious tx timeout.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/xsk.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index f1c64414fac9..5d3de505c56c 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -274,6 +274,16 @@ bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct 
xsk_buff_pool *pool,
 
virtnet_xsk_check_queue(sq);
 
+   if (stats.packets) {
+   struct netdev_queue *txq;
+   struct virtnet_info *vi;
+
+   vi = sq->vq->vdev->priv;
+
+   txq = netdev_get_tx_queue(vi->dev, sq - vi->sq);
+   txq_trans_cond_update(txq);
+   }
+
u64_stats_update_begin(&sq->stats.syncp);
sq->stats.packets += stats.packets;
sq->stats.bytes += stats.bytes;
-- 
2.32.0.3.g01195cf9f
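For context, the watchdog this update placates fires on roughly the
condition below (a simplified sketch; the real logic lives in
dev_watchdog()). txq_trans_cond_update() refreshes trans_start, so the
completions processed here count as forward progress:

	/* conceptual sketch, not the exact kernel code */
	static bool tx_queue_looks_hung(struct netdev_queue *txq,
					unsigned long watchdog_timeo)
	{
		return netif_xmit_stopped(txq) &&
		       time_after(jiffies,
				  READ_ONCE(txq->trans_start) + watchdog_timeo);
	}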



[PATCH net-next v1 14/19] virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer

2023-10-16 Thread Xuan Zhuo
Make virtnet_sq_free_unused_buf() check for xsk buffers.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 1a21352e..58bb38f9b453 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -3876,10 +3876,12 @@ static void free_receive_page_frags(struct virtnet_info 
*vi)
 
 void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
 {
-   if (!virtnet_is_xdp_frame(buf))
+   if (virtnet_is_skb_ptr(buf))
dev_kfree_skb(buf);
-   else
+   else if (virtnet_is_xdp_frame(buf))
xdp_return_frame(virtnet_ptr_to_xdp(buf));
+
+   /* xsk buffers do not need to be handled here. */
 }
 
 void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
-- 
2.32.0.3.g01195cf9f



[PATCH net-next v1 15/19] virtio_net: xsk: rx: introduce add_recvbuf_xsk()

2023-10-16 Thread Xuan Zhuo
Implement the logic of filling the rx vq with XSK buffers.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   | 13 +++
 drivers/net/virtio/virtio_net.h |  5 +++
 drivers/net/virtio/xsk.c| 66 -
 drivers/net/virtio/xsk.h|  2 +
 4 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 58bb38f9b453..0e740447b142 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -1787,9 +1787,20 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
 static bool try_fill_recv(struct virtnet_info *vi, struct virtnet_rq *rq,
  gfp_t gfp)
 {
+   struct xsk_buff_pool *pool;
int err;
bool oom;
 
+   rcu_read_lock();
+   pool = rcu_dereference(rq->xsk.pool);
+   if (pool) {
+   err = virtnet_add_recvbuf_xsk(vi, rq, pool, gfp);
+   oom = err == -ENOMEM;
+   rcu_read_unlock();
+   goto kick;
+   }
+   rcu_read_unlock();
+
do {
if (vi->mergeable_rx_bufs)
err = add_recvbuf_mergeable(vi, rq, gfp);
@@ -1802,6 +1813,8 @@ static bool try_fill_recv(struct virtnet_info *vi, struct 
virtnet_rq *rq,
if (err)
break;
} while (rq->vq->num_free);
+
+kick:
if (virtqueue_kick_prepare(rq->vq) && virtqueue_notify(rq->vq)) {
unsigned long flags;
 
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index d4e620a084f4..6e71622fca45 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -156,6 +156,11 @@ struct virtnet_rq {
 
/* xdp rxq used by xsk */
struct xdp_rxq_info xdp_rxq;
+
+   struct xdp_buff **xsk_buffs;
+   u32 nxt_idx;
+   u32 num;
+   u32 size;
} xsk;
 };
 
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 973e783260c3..841fb078882a 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -37,6 +37,58 @@ static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
netif_stop_subqueue(dev, qnum);
 }
 
+static int virtnet_add_recvbuf_batch(struct virtnet_info *vi, struct 
virtnet_rq *rq,
+struct xsk_buff_pool *pool, gfp_t gfp)
+{
+   struct xdp_buff **xsk_buffs;
+   dma_addr_t addr;
+   u32 len, i;
+   int err = 0;
+
+   xsk_buffs = rq->xsk.xsk_buffs;
+
+   if (rq->xsk.nxt_idx >= rq->xsk.num) {
+   rq->xsk.num = xsk_buff_alloc_batch(pool, xsk_buffs, 
rq->xsk.size);
+   if (!rq->xsk.num)
+   return -ENOMEM;
+   rq->xsk.nxt_idx = 0;
+   }
+
+   while (rq->xsk.nxt_idx < rq->xsk.num) {
+   i = rq->xsk.nxt_idx;
+
+   /* use the part of XDP_PACKET_HEADROOM as the virtnet hdr space 
*/
+   addr = xsk_buff_xdp_get_dma(xsk_buffs[i]) - vi->hdr_len;
+   len = xsk_pool_get_rx_frame_size(pool) + vi->hdr_len;
+
+   sg_init_table(rq->sg, 1);
+   sg_fill_dma(rq->sg, addr, len);
+
+   err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], gfp);
+   if (err)
+   return err;
+
+   rq->xsk.nxt_idx++;
+   }
+
+   return 0;
+}
+
+int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct virtnet_rq *rq,
+   struct xsk_buff_pool *pool, gfp_t gfp)
+{
+   int err;
+
+   do {
+   err = virtnet_add_recvbuf_batch(vi, rq, pool, gfp);
+   if (err)
+   return err;
+
+   } while (rq->vq->num_free);
+
+   return 0;
+}
+
 static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
struct xsk_buff_pool *pool,
struct xdp_desc *desc)
@@ -244,7 +296,7 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
struct virtnet_sq *sq;
struct device *dma_dev;
dma_addr_t hdr_dma;
-   int err;
+   int err, size;
 
/* In big_packets mode, xdp cannot work, so there is no need to
 * initialize xsk of rq.
@@ -276,6 +328,16 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
if (!dma_dev)
return -EPERM;
 
+   size = virtqueue_get_vring_size(rq->vq);
+
+   rq->xsk.xsk_buffs = kcalloc(size, sizeof(*rq->xsk.xsk_buffs), 
GFP_KERNEL);
+   if (!rq->xsk.xsk_buffs)
+   return -ENOMEM;
+
+   rq->xsk.size = size;
+   rq->xsk.nxt_idx = 0;
+   rq->xsk.num = 0;
+
hdr_dma = dma_map_single(dma_dev, &xsk_hdr, vi->hdr_len, DMA_TO_DEVICE);
if (dma_mapping_error(dma_dev, hdr_dma))
return -ENOMEM;
@@ -338,6 +400,8 @@ static int virtnet_xsk_pool_disable(struct net_device *dev, 
u16 qid)
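For readers unfamiliar with the pool API: xsk_buff_alloc_batch() fills an
array with up to n buffers and returns how many it could actually
allocate, which is why the code above carries nxt_idx/num across calls.
The consumption pattern in isolation (a sketch):

	/* sketch: hand out one buffer, refilling the batch when empty */
	static struct xdp_buff *next_xsk_buff(struct virtnet_rq *rq,
					      struct xsk_buff_pool *pool)
	{
		if (rq->xsk.nxt_idx >= rq->xsk.num) {
			rq->xsk.num = xsk_buff_alloc_batch(pool,
							   rq->xsk.xsk_buffs,
							   rq->xsk.size);
			if (!rq->xsk.num)
				return NULL;	/* caller reports -ENOMEM */
			rq->xsk.nxt_idx = 0;
		}
		return rq->xsk.xsk_buffs[rq->xsk.nxt_idx++];
	}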

[PATCH net-next v1 12/19] virtio_net: xsk: tx: support wakeup

2023-10-16 Thread Xuan Zhuo
xsk wakeup is used by the xsk framework or the user to trigger the xsk
xmit logic.

Virtio-net does not support actively generating an interrupt from the
driver, so it tries to trigger tx NAPI on the CPU that last ran the tx
interrupt.

This is done for cache locality: an interrupt is generally pinned to one
CPU, so it is better to run TX NAPI on that same CPU.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   |  3 ++
 drivers/net/virtio/virtio_net.h |  8 +
 drivers/net/virtio/xsk.c| 57 +
 drivers/net/virtio/xsk.h|  1 +
 4 files changed, 69 insertions(+)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index a08429bef61f..1a21352e 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -2066,6 +2066,8 @@ static int virtnet_poll_tx(struct napi_struct *napi, int 
budget)
return 0;
}
 
+   sq->xsk.last_cpu = smp_processor_id();
+
txq = netdev_get_tx_queue(vi->dev, index);
__netif_tx_lock(txq, raw_smp_processor_id());
virtqueue_disable_cb(sq->vq);
@@ -3770,6 +3772,7 @@ static const struct net_device_ops virtnet_netdev = {
.ndo_vlan_rx_kill_vid = virtnet_vlan_rx_kill_vid,
.ndo_bpf= virtnet_xdp,
.ndo_xdp_xmit   = virtnet_xdp_xmit,
+   .ndo_xsk_wakeup = virtnet_xsk_wakeup,
.ndo_features_check = passthru_features_check,
.ndo_get_phys_port_name = virtnet_get_phys_port_name,
.ndo_set_features   = virtnet_set_features,
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index 3bbb1f5baad5..7c72a8bb1813 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -101,6 +101,14 @@ struct virtnet_sq {
struct xsk_buff_pool __rcu *pool;
 
dma_addr_t hdr_dma_address;
+
+   u32 last_cpu;
+   struct __call_single_data csd;
+
+   /* The lock to prevent the repeat of calling
+* smp_call_function_single_async().
+*/
+   spinlock_t ipi_lock;
} xsk;
 };
 
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 0e775a9d270f..973e783260c3 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -115,6 +115,60 @@ bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct 
xsk_buff_pool *pool,
return sent == budget;
 }
 
+static void virtnet_remote_napi_schedule(void *info)
+{
+   struct virtnet_sq *sq = info;
+
+   virtnet_vq_napi_schedule(&sq->napi, sq->vq);
+}
+
+static void virtnet_remote_raise_napi(struct virtnet_sq *sq)
+{
+   u32 last_cpu, cur_cpu;
+
+   last_cpu = sq->xsk.last_cpu;
+   cur_cpu = get_cpu();
+
+   /* On a remote cpu, softirq runs automatically when the IPI irq
+* exits. On the local cpu, smp_call_xxx will not trigger an IPI
+* interrupt, so softirq cannot be triggered automatically; call
+* local_bh_enable() afterwards to trigger softirq processing.
+*/
+   if (last_cpu == cur_cpu) {
+   	local_bh_disable();
+   	virtnet_vq_napi_schedule(&sq->napi, sq->vq);
+   	local_bh_enable();
+   } else {
+   	if (spin_trylock(&sq->xsk.ipi_lock)) {
+   		smp_call_function_single_async(last_cpu, &sq->xsk.csd);
+   		spin_unlock(&sq->xsk.ipi_lock);
+   	}
+   }
+
+   put_cpu();
+}
+
+int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   struct virtnet_sq *sq;
+
+   if (!netif_running(dev))
+   return -ENETDOWN;
+
+   if (qid >= vi->curr_queue_pairs)
+   return -EINVAL;
+
+   sq = &vi->sq[qid];
+
+   if (napi_if_scheduled_mark_missed(&sq->napi))
+   return 0;
+
+   virtnet_remote_raise_napi(sq);
+
+   return 0;
+}
+
 static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq 
*rq,
struct xsk_buff_pool *pool)
 {
@@ -240,6 +294,9 @@ static int virtnet_xsk_pool_enable(struct net_device *dev,
 
sq->xsk.hdr_dma_address = hdr_dma;
 
+   INIT_CSD(&sq->xsk.csd, virtnet_remote_napi_schedule, sq);
+   spin_lock_init(&sq->xsk.ipi_lock);
+
return 0;
 
 err_sq:
diff --git a/drivers/net/virtio/xsk.h b/drivers/net/virtio/xsk.h
index 73ca8cd5308b..1bd19dcda649 100644
--- a/drivers/net/virtio/xsk.h
+++ b/drivers/net/virtio/xsk.h
@@ -17,4 +17,5 @@ static inline void *virtnet_xsk_to_ptr(u32 len)
 int virtnet_xsk_pool_setup(struct net_device *dev, struct netdev_bpf *xdp);
 bool virtnet_xsk_xmit(struct virtnet_sq *sq, struct xsk_buff_pool *pool,
  int budget);
+int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag);
 #endif
-- 
2.32.0.3.g01195cf9f
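The remote wakeup above is the generic call_single_data pattern, shown
here in isolation (a sketch; the caller must not reuse the csd while an
IPI is still in flight, which is what the ipi_lock trylock guards):

	struct kick_ctx {
		call_single_data_t csd;
	};

	static void kick_cb(void *info)
	{
		/* runs on the target CPU out of the IPI; schedule NAPI here */
	}

	static void kick_init(struct kick_ctx *k)
	{
		INIT_CSD(&k->csd, kick_cb, k);
	}

	static void kick_cpu(struct kick_ctx *k, int cpu)
	{
		smp_call_function_single_async(cpu, &k->csd);
	}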


[PATCH net-next v1 11/19] virtio_net: xsk: tx: support tx

2023-10-16 Thread Xuan Zhuo
The driver's tx NAPI is central to XSK: it is responsible for
obtaining data from the XSK tx queue and sending it out.

So to start transmission, we first need to trigger tx NAPI.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   |  18 +-
 drivers/net/virtio/virtio_net.h |   3 +-
 drivers/net/virtio/xsk.c| 108 
 drivers/net/virtio/xsk.h|  13 
 4 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index b320770e5f4e..a08429bef61f 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -2054,7 +2054,9 @@ static int virtnet_poll_tx(struct napi_struct *napi, int 
budget)
struct virtnet_sq *sq = container_of(napi, struct virtnet_sq, napi);
struct virtnet_info *vi = sq->vq->vdev->priv;
unsigned int index = vq2txq(sq->vq);
+   struct xsk_buff_pool *pool;
struct netdev_queue *txq;
+   int busy = 0;
int opaque;
bool done;
 
@@ -2067,11 +2069,25 @@ static int virtnet_poll_tx(struct napi_struct *napi, 
int budget)
txq = netdev_get_tx_queue(vi->dev, index);
__netif_tx_lock(txq, raw_smp_processor_id());
virtqueue_disable_cb(sq->vq);
-   free_old_xmit(sq, true);
+
+   rcu_read_lock();
+   pool = rcu_dereference(sq->xsk.pool);
+   if (pool) {
+   busy |= virtnet_xsk_xmit(sq, pool, budget);
+   rcu_read_unlock();
+   } else {
+   rcu_read_unlock();
+   free_old_xmit(sq, true);
+   }
 
if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
netif_tx_wake_queue(txq);
 
+   if (busy) {
+   __netif_tx_unlock(txq);
+   return budget;
+   }
+
opaque = virtqueue_enable_cb_prepare(sq->vq);
 
done = napi_complete_done(napi, 0);
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index 9e69b6c5921b..3bbb1f5baad5 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -9,7 +9,8 @@
 #include 
 
 #define VIRTIO_XDP_FLAG	BIT(0)
-#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
+#define VIRTIO_XSK_FLAG	BIT(1)
+#define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG | VIRTIO_XSK_FLAG)
 
 /* RX packet size EWMA. The average packet size is used to determine the packet
  * buffer size when refilling RX rings. As the entire RX ring may be refilled
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
index 01962a3f..0e775a9d270f 100644
--- a/drivers/net/virtio/xsk.c
+++ b/drivers/net/virtio/xsk.c
@@ -7,6 +7,114 @@
 
 static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
 
+static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
+{
+   sg->dma_address = addr;
+   sg->length = len;
+}
+
+static void virtnet_xsk_check_queue(struct virtnet_sq *sq)
+{
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct net_device *dev = vi->dev;
+   int qnum = sq - vi->sq;
+
+   /* A raw buffer queue does not check whether the queue is stopped
+* when sending, so there is no need to check the state of a raw
+* buffer queue here.
+*/
+   if (virtnet_is_xdp_raw_buffer_queue(vi, qnum))
+   	return;
+
+   /* If this sq is not the exclusive queue of the current cpu,
+* then it may be used by start_xmit, so check whether it is
+* running out of space.
+*
+* Stop the queue to avoid accepting packets that we would
+* then be unable to transmit, and wait for the tx interrupt.
+*/
+   if (sq->vq->num_free < 2 + MAX_SKB_FRAGS)
+   	netif_stop_subqueue(dev, qnum);
+}
+
+static int virtnet_xsk_xmit_one(struct virtnet_sq *sq,
+   struct xsk_buff_pool *pool,
+   struct xdp_desc *desc)
+{
+   struct virtnet_info *vi;
+   dma_addr_t addr;
+
+   vi = sq->vq->vdev->priv;
+
+   addr = xsk_buff_raw_get_dma(pool, desc->addr);
+   xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len);
+
+   sg_init_table(sq->sg, 2);
+
+   sg_fill_dma(sq->sg, sq->xsk.hdr_dma_address, vi->hdr_len);
+   sg_fill_dma(sq->sg + 1, addr, desc->len);
+
+   return virtqueue_add_outbuf(sq->vq, sq->sg, 2,
+   virtnet_xsk_to_ptr(desc->len), GFP_ATOMIC);
+}
+
+static int virtnet_xsk_xmit_batch(struct virtnet_sq *sq,
+ struct xsk_buff_pool *pool,
+ unsigned int budget,
+ struct virtnet_sq_stats *stats)
+{
+   struct xdp_desc *descs = pool->tx_descs;
+   u32 nb_pkts, max_pkts, i;
+   bool kick = false;
+   int err;
+
+   max_pkts = min_t(u32, budget, sq->vq->num_free / 2);
+
+   nb_pkts = xsk_tx_peek_release_desc_batch(pool, max_pkts);
+   if (!nb_pkts)
+   return 0;
+
+   for 
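
(The message is truncated here. For orientation only, the rest of
virtnet_xsk_xmit_batch() plausibly finishes along the following lines,
using nothing beyond the helpers shown above; this is a sketch, not the
missing patch text.)

	for (i = 0; i < nb_pkts; i++) {
		err = virtnet_xsk_xmit_one(sq, pool, &descs[i]);
		if (unlikely(err)) {
			/* Ring is full: complete the descriptors we could
			 * not queue so the pool can reuse them.
			 */
			xsk_tx_completed(pool, nb_pkts - i);
			break;
		}

		kick = true;
	}

	stats->packets += i;

	if (kick && virtqueue_kick_prepare(sq->vq)) {
		virtqueue_notify(sq->vq);
		stats->kicks++;
	}

	/* Report "busy" so the NAPI caller keeps polling while the whole
	 * budget is being consumed.
	 */
	return nb_pkts == max_pkts;
}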

[PATCH net-next v1 17/19] virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer

2023-10-16 Thread Xuan Zhuo
Since this function can also be called in other circumstances (e.g. the
freeze path), we must check inside it whether the buffer belongs to xsk;
the caller cannot make that judgment.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 003dd67ab707..ac62d0955c13 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -3904,8 +3904,21 @@ void virtnet_sq_free_unused_buf(struct virtqueue *vq, 
void *buf)
 void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
 {
struct virtnet_info *vi = vq->vdev->priv;
+   struct xsk_buff_pool *pool;
int i = vq2rxq(vq);
 
+   rcu_read_lock();
+   pool = rcu_dereference(vi->rq[i].xsk.pool);
+   if (pool) {
+   struct xdp_buff *xdp;
+
+   xdp = (struct xdp_buff *)buf;
+   xsk_buff_free(xdp);
+   rcu_read_unlock();
+   return;
+   }
+   rcu_read_unlock();
+
if (vi->mergeable_rx_bufs)
put_page(virt_to_head_page(buf));
else if (vi->big_packets)
-- 
2.32.0.3.g01195cf9f
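
For context, the "other circumstances" above is the PM freeze path: the
virtqueues are torn down while an xsk pool may still be bound. Roughly,
the call chain looks like this (sketched from mainline virtio-net, so
treat the exact names as an approximation):

    virtnet_freeze()
      -> remove_vq_common()
           -> free_unused_bufs()
                -> virtnet_rq_free_unused_buf(vq, buf)  /* must detect xsk */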



[PATCH net-next v1 08/19] virtio_net: sq support premapped mode

2023-10-16 Thread Xuan Zhuo
When xsk is enabled, xsk tx shares the send queue with the kernel.
Since xsk requires the send queue to work in premapped mode, the send
queue must support premapped mode.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   | 108 
 drivers/net/virtio/virtio_net.h |  54 +++-
 2 files changed, 149 insertions(+), 13 deletions(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 8da84ea9bcbe..02d27101fef1 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -514,20 +514,104 @@ static void *virtnet_rq_alloc(struct virtnet_rq *rq, u32 
size, gfp_t gfp)
return buf;
 }
 
-static void virtnet_rq_set_premapped(struct virtnet_info *vi)
+static int virtnet_sq_set_premapped(struct virtnet_sq *sq)
 {
-   int i;
+   struct virtnet_sq_dma *d;
+   int err, size, i;
 
-   /* disable for big mode */
-   if (!vi->mergeable_rx_bufs && vi->big_packets)
-   return;
+   size = virtqueue_get_vring_size(sq->vq);
+
+   size += MAX_SKB_FRAGS + 2;
+
+   sq->dmainfo.head = kcalloc(size, sizeof(*sq->dmainfo.head), GFP_KERNEL);
+   if (!sq->dmainfo.head)
+   return -ENOMEM;
+
+   err = virtqueue_set_dma_premapped(sq->vq);
+   if (err) {
+   kfree(sq->dmainfo.head);
+   return err;
+   }
+
+   sq->dmainfo.free = NULL;
+
+   sq->do_dma = true;
+
+   for (i = 0; i < size; ++i) {
+   d = &sq->dmainfo.head[i];
+
+   d->next = sq->dmainfo.free;
+   sq->dmainfo.free = d;
+   }
+
+   return 0;
+}
+
+static void virtnet_set_premapped(struct virtnet_info *vi)
+{
+   int i;
 
for (i = 0; i < vi->max_queue_pairs; i++) {
-   if (virtqueue_set_dma_premapped(vi->rq[i].vq))
+   if (!virtnet_sq_set_premapped(&vi->sq[i]))
+   vi->sq[i].do_dma = true;
+
+   /* disable for big mode */
+   if (!vi->mergeable_rx_bufs && vi->big_packets)
continue;
 
-   vi->rq[i].do_dma = true;
+   if (!virtqueue_set_dma_premapped(vi->rq[i].vq))
+   vi->rq[i].do_dma = true;
+   }
+}
+
+static struct virtnet_sq_dma *virtnet_sq_map_sg(struct virtnet_sq *sq, int 
nents, void *data)
+{
+   struct virtnet_sq_dma *d, *head;
+   struct scatterlist *sg;
+   int i;
+
+   head = NULL;
+
+   for_each_sg(sq->sg, sg, nents, i) {
+   sg->dma_address = virtqueue_dma_map_single_attrs(sq->vq, sg_virt(sg),
+                                                     sg->length,
+                                                     DMA_TO_DEVICE, 0);
+   if (virtqueue_dma_mapping_error(sq->vq, sg->dma_address))
+   goto err;
+
+   d = sq->dmainfo.free;
+   sq->dmainfo.free = d->next;
+
+   d->addr = sg->dma_address;
+   d->len = sg->length;
+
+   d->next = head;
+   head = d;
+   }
+
+   head->data = data;
+
+   return (void *)((unsigned long)head | ((unsigned long)data & VIRTIO_XMIT_DATA_MASK));
+err:
+   virtnet_sq_unmap(sq, head);
+   return NULL;
+}
+
+static int virtnet_add_outbuf(struct virtnet_sq *sq, u32 num, void *data)
+{
+   int ret;
+
+   if (sq->do_dma) {
+   data = virtnet_sq_map_sg(sq, num, data);
+   if (!data)
+   return -ENOMEM;
}
+
+   ret = virtqueue_add_outbuf(sq->vq, sq->sg, num, data, GFP_ATOMIC);
+   if (ret && sq->do_dma)
+   virtnet_sq_unmap(sq, data);
+
+   return ret;
 }
 
 static void free_old_xmit(struct virtnet_sq *sq, bool in_napi)
@@ -623,8 +707,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
skb_frag_size(frag), skb_frag_off(frag));
}
 
-   err = virtqueue_add_outbuf(sq->vq, sq->sg, nr_frags + 1,
-  xdp_to_ptr(xdpf), GFP_ATOMIC);
+   err = virtnet_add_outbuf(sq, nr_frags + 1, xdp_to_ptr(xdpf));
if (unlikely(err))
return -ENOSPC; /* Caller handle free/refcnt */
 
@@ -2060,7 +2143,8 @@ static int xmit_skb(struct virtnet_sq *sq, struct sk_buff 
*skb)
return num_sg;
num_sg++;
}
-   return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
+
+   return virtnet_add_outbuf(sq, num_sg, skb);
 }
 
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -3717,6 +3801,8 @@ static void virtnet_free_queues(struct virtnet_info *vi)
for (i = 0; i < vi->max_queue_pairs; i++) {
__netif_napi_del(&vi->rq[i].napi);
__netif_napi_del(&vi->sq[i].napi);
+
+   kfree(vi->sq[i].dmainfo.head);
}
 
/* We called __netif_napi_del(),
@@ -3974,7 +4060,7 @@ static int 
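
(virtnet_sq_unmap() is referenced above, but its definition falls outside
the quoted hunks. A sketch that is consistent with the free-list
bookkeeping shown here, assuming the premapped virtqueue DMA API;
illustrative only:)

static void *virtnet_sq_unmap(struct virtnet_sq *sq, void *data)
{
	struct virtnet_sq_dma *next, *head;

	/* Strip the xmit-type tag bits to recover the chain head. */
	head = (void *)((unsigned long)data & ~VIRTIO_XMIT_DATA_MASK);

	data = head->data;

	while (head) {
		virtqueue_dma_unmap_single_attrs(sq->vq, head->addr,
						 head->len, DMA_TO_DEVICE, 0);

		next = head->next;

		/* Return the node to the per-sq free list. */
		head->next = sq->dmainfo.free;
		sq->dmainfo.free = head;

		head = next;
	}

	return data;
}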

[PATCH net-next v1 09/19] virtio_net: xsk: bind/unbind xsk

2023-10-16 Thread Xuan Zhuo
This patch implements the logic to bind/unbind an xsk pool to/from the sq and rq.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/Makefile |   2 +-
 drivers/net/virtio/main.c   |  10 +-
 drivers/net/virtio/virtio_net.h |  18 
 drivers/net/virtio/xsk.c| 186 
 drivers/net/virtio/xsk.h|   7 ++
 5 files changed, 216 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/virtio/xsk.c
 create mode 100644 drivers/net/virtio/xsk.h

diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
index 15ed7c97fd4f..8c2a884d2dba 100644
--- a/drivers/net/virtio/Makefile
+++ b/drivers/net/virtio/Makefile
@@ -5,4 +5,4 @@
 
 obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
 
-virtio_net-y := main.o
+virtio_net-y := main.o xsk.o
diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 02d27101fef1..38733a782f12 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -139,9 +138,6 @@ struct virtio_net_common_hdr {
};
 };
 
-static void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf);
-static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
-
 static void *xdp_to_ptr(struct xdp_frame *ptr)
 {
return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
@@ -3664,6 +3660,8 @@ static int virtnet_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
switch (xdp->command) {
case XDP_SETUP_PROG:
return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
+   case XDP_SETUP_XSK_POOL:
+   return virtnet_xsk_pool_setup(dev, xdp);
default:
return -EINVAL;
}
@@ -3849,7 +3847,7 @@ static void free_receive_page_frags(struct virtnet_info 
*vi)
}
 }
 
-static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
+void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf)
 {
if (!virtnet_is_xdp_frame(buf))
dev_kfree_skb(buf);
@@ -3857,7 +3855,7 @@ static void virtnet_sq_free_unused_buf(struct virtqueue 
*vq, void *buf)
xdp_return_frame(virtnet_ptr_to_xdp(buf));
 }
 
-static void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
+void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf)
 {
struct virtnet_info *vi = vq->vdev->priv;
int i = vq2rxq(vq);
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index cc742756e19a..9e69b6c5921b 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -5,6 +5,8 @@
 
 #include 
 #include 
+#include 
+#include 
 
 #define VIRTIO_XDP_FLAG	BIT(0)
 #define VIRTIO_XMIT_DATA_MASK (VIRTIO_XDP_FLAG)
@@ -94,6 +96,11 @@ struct virtnet_sq {
bool do_dma;
 
struct virtnet_sq_dma_head dmainfo;
+   struct {
+   struct xsk_buff_pool __rcu *pool;
+
+   dma_addr_t hdr_dma_address;
+   } xsk;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -134,6 +141,13 @@ struct virtnet_rq {
 
/* Do dma by self */
bool do_dma;
+
+   struct {
+   struct xsk_buff_pool __rcu *pool;
+
+   /* xdp rxq used by xsk */
+   struct xdp_rxq_info xdp_rxq;
+   } xsk;
 };
 
 struct virtnet_info {
@@ -218,6 +232,8 @@ struct virtnet_info {
struct failover *failover;
 };
 
+#include "xsk.h"
+
 static inline bool virtnet_is_xdp_frame(void *ptr)
 {
return (unsigned long)ptr & VIRTIO_XDP_FLAG;
@@ -308,4 +324,6 @@ void virtnet_rx_pause(struct virtnet_info *vi, struct 
virtnet_rq *rq);
 void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq);
 void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq);
 void virtnet_tx_resume(struct virtnet_info *vi, struct virtnet_sq *sq);
+void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
+void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf);
 #endif
diff --git a/drivers/net/virtio/xsk.c b/drivers/net/virtio/xsk.c
new file mode 100644
index ..01962a3f
--- /dev/null
+++ b/drivers/net/virtio/xsk.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * virtio-net xsk
+ */
+
+#include "virtio_net.h"
+
+static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
+
+static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct virtnet_rq 
*rq,
+   struct xsk_buff_pool *pool)
+{
+   int err, qindex;
+
+   qindex = rq - vi->rq;
+
+   if (pool) {
+   err = xdp_rxq_info_reg(&rq->xsk.xdp_rxq, vi->dev, qindex, rq->napi.napi_id);
+   if (err < 0)
+   return err;
+
+   err = xdp_rxq_info_reg_mem_model(&rq->xsk.xdp_rxq,
+MEM_TYPE_XSK_BUFF_POOL, NULL);
+   if (err < 0) {
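
(This message is truncated as well. The overall bind flow the patch
implements can be pictured as the following sketch, reusing the
pause/resume helpers from patches 6 and 7; the names and ordering are
assumptions, not the literal patch code:)

static int virtnet_xsk_pool_enable(struct net_device *dev,
				   struct xsk_buff_pool *pool, u16 qid)
{
	struct virtnet_info *vi = netdev_priv(dev);
	struct virtnet_rq *rq;
	struct virtnet_sq *sq;
	int err;

	if (qid >= vi->curr_queue_pairs)
		return -EINVAL;

	rq = &vi->rq[qid];
	sq = &vi->sq[qid];

	/* DMA-map the umem for the device backing the virtqueues. */
	err = xsk_pool_dma_map(pool, virtqueue_dma_dev(sq->vq), 0);
	if (err)
		return err;

	/* Park both rings while the pool pointers are published. */
	virtnet_rx_pause(vi, rq);
	virtnet_tx_pause(vi, sq);

	rcu_assign_pointer(rq->xsk.pool, pool);
	rcu_assign_pointer(sq->xsk.pool, pool);

	virtnet_tx_resume(vi, sq);
	virtnet_rx_resume(vi, rq);

	return 0;
}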
+   

[PATCH net-next v1 06/19] virtio_net: separate virtnet_rx_resize()

2023-10-16 Thread Xuan Zhuo
This patch separates two sub-functions from virtnet_rx_resize():

* virtnet_rx_pause
* virtnet_rx_resume

The subsequent rx reset for xsk can then share these two functions.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   | 29 +
 drivers/net/virtio/virtio_net.h |  3 +++
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index ba38b6078e1d..e6b262341619 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -2120,26 +2120,39 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
struct net_device *dev)
return NETDEV_TX_OK;
 }
 
-static int virtnet_rx_resize(struct virtnet_info *vi,
-struct virtnet_rq *rq, u32 ring_num)
+void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq)
 {
bool running = netif_running(vi->dev);
-   int err, qindex;
-
-   qindex = rq - vi->rq;
 
if (running)
napi_disable(&rq->napi);
+}
 
-   err = virtqueue_resize(rq->vq, ring_num, virtnet_rq_free_unused_buf);
-   if (err)
-   netdev_err(vi->dev, "resize rx fail: rx queue index: %d err: %d\n", qindex, err);
+void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq)
+{
+   bool running = netif_running(vi->dev);
 
if (!try_fill_recv(vi, rq, GFP_KERNEL))
schedule_delayed_work(&rq->refill, 0);
 
if (running)
virtnet_napi_enable(rq->vq, &rq->napi);
+}
+
+static int virtnet_rx_resize(struct virtnet_info *vi,
+struct virtnet_rq *rq, u32 ring_num)
+{
+   int err, qindex;
+
+   qindex = rq - vi->rq;
+
+   virtnet_rx_pause(vi, rq);
+
+   err = virtqueue_resize(rq->vq, ring_num, virtnet_rq_free_unused_buf);
+   if (err)
+   netdev_err(vi->dev, "resize rx fail: rx queue index: %d err: %d\n", qindex, err);
+
+   virtnet_rx_resume(vi, rq);
return err;
 }
 
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index 282504d6639a..70eea23adba6 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -253,4 +253,7 @@ static inline bool virtnet_is_xdp_raw_buffer_queue(struct 
virtnet_info *vi, int
else
return false;
 }
+
+void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq);
+void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq);
 #endif
-- 
2.32.0.3.g01195cf9f
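
The "reset rx for xsk" mentioned above can then be built directly on
these helpers. A sketch, assuming the virtqueue_reset() API from the
vq-reset work named in the cover letter:

static int virtnet_rx_reset(struct virtnet_info *vi, struct virtnet_rq *rq)
{
	int qindex = rq - vi->rq;
	int err;

	virtnet_rx_pause(vi, rq);

	/* Recycle every queued buffer and return the vq to a fresh state. */
	err = virtqueue_reset(rq->vq, virtnet_rq_free_unused_buf);
	if (err)
		netdev_err(vi->dev, "reset rx fail: rx queue index: %d err: %d\n",
			   qindex, err);

	virtnet_rx_resume(vi, rq);

	return err;
}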



[PATCH net-next v1 10/19] virtio_net: xsk: prevent disable tx napi

2023-10-16 Thread Xuan Zhuo
Since xsk's TX queue is consumed by TX NAPI, if an sq is bound to xsk we
must prevent tx napi from being disabled.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 38733a782f12..b320770e5f4e 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -3203,7 +3203,7 @@ static int virtnet_set_coalesce(struct net_device *dev,
struct netlink_ext_ack *extack)
 {
struct virtnet_info *vi = netdev_priv(dev);
-   int ret, queue_number, napi_weight;
+   int ret, queue_number, napi_weight, i;
bool update_napi = false;
 
/* Can't change NAPI weight if the link is up */
@@ -3232,6 +3232,14 @@ static int virtnet_set_coalesce(struct net_device *dev,
return ret;
 
if (update_napi) {
+   /* xsk xmit depends on the tx napi. So if xsk is active,
+* prevent modifications to tx napi.
+*/
+   for (i = queue_number; i < vi->max_queue_pairs; i++) {
+   if (rtnl_dereference(vi->sq[i].xsk.pool))
+   return -EBUSY;
+   }
+
for (; queue_number < vi->max_queue_pairs; queue_number++)
vi->sq[queue_number].napi.weight = napi_weight;
}
-- 
2.32.0.3.g01195cf9f



[PATCH net-next v1 04/19] virtio_net: move to virtio_net.h

2023-10-16 Thread Xuan Zhuo
Move some structure definitions and inline functions into the
virtio_net.h file.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   | 252 +--
 drivers/net/virtio/virtio_net.h | 256 
 2 files changed, 258 insertions(+), 250 deletions(-)
 create mode 100644 drivers/net/virtio/virtio_net.h

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index 6cf77b6acdab..d8b6c0d86f29 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -6,7 +6,6 @@
 //#define DEBUG
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -16,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -24,6 +22,8 @@
 #include 
 #include 
 
+#include "virtio_net.h"
+
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
@@ -45,15 +45,6 @@ module_param(napi_tx, bool, 0644);
 #define VIRTIO_XDP_TX  BIT(0)
 #define VIRTIO_XDP_REDIR   BIT(1)
 
-#define VIRTIO_XDP_FLAG	BIT(0)
-
-/* RX packet size EWMA. The average packet size is used to determine the packet
- * buffer size when refilling RX rings. As the entire RX ring may be refilled
- * at once, the weight is chosen so that the EWMA will be insensitive to short-
- * term, transient changes in packet size.
- */
-DECLARE_EWMA(pkt_len, 0, 64)
-
 #define VIRTNET_DRIVER_VERSION "1.0.0"
 
 static const unsigned long guest_offloads[] = {
@@ -74,36 +65,6 @@ static const unsigned long guest_offloads[] = {
(1ULL << VIRTIO_NET_F_GUEST_USO4) | \
(1ULL << VIRTIO_NET_F_GUEST_USO6))
 
-struct virtnet_stat_desc {
-   char desc[ETH_GSTRING_LEN];
-   size_t offset;
-};
-
-struct virtnet_sq_stats {
-   struct u64_stats_sync syncp;
-   u64 packets;
-   u64 bytes;
-   u64 xdp_tx;
-   u64 xdp_tx_drops;
-   u64 kicks;
-   u64 tx_timeouts;
-};
-
-struct virtnet_rq_stats {
-   struct u64_stats_sync syncp;
-   u64 packets;
-   u64 bytes;
-   u64 drops;
-   u64 xdp_packets;
-   u64 xdp_tx;
-   u64 xdp_redirects;
-   u64 xdp_drops;
-   u64 kicks;
-};
-
-#define VIRTNET_SQ_STAT(m) offsetof(struct virtnet_sq_stats, m)
-#define VIRTNET_RQ_STAT(m) offsetof(struct virtnet_rq_stats, m)
-
 static const struct virtnet_stat_desc virtnet_sq_stats_desc[] = {
{ "packets",VIRTNET_SQ_STAT(packets) },
{ "bytes",  VIRTNET_SQ_STAT(bytes) },
@@ -127,80 +88,6 @@ static const struct virtnet_stat_desc 
virtnet_rq_stats_desc[] = {
 #define VIRTNET_SQ_STATS_LEN   ARRAY_SIZE(virtnet_sq_stats_desc)
 #define VIRTNET_RQ_STATS_LEN   ARRAY_SIZE(virtnet_rq_stats_desc)
 
-struct virtnet_interrupt_coalesce {
-   u32 max_packets;
-   u32 max_usecs;
-};
-
-/* The dma information of pages allocated at a time. */
-struct virtnet_rq_dma {
-   dma_addr_t addr;
-   u32 ref;
-   u16 len;
-   u16 need_sync;
-};
-
-/* Internal representation of a send virtqueue */
-struct send_queue {
-   /* Virtqueue associated with this send _queue */
-   struct virtqueue *vq;
-
-   /* TX: fragments + linear part + virtio header */
-   struct scatterlist sg[MAX_SKB_FRAGS + 2];
-
-   /* Name of the send queue: output.$index */
-   char name[16];
-
-   struct virtnet_sq_stats stats;
-
-   struct virtnet_interrupt_coalesce intr_coal;
-
-   struct napi_struct napi;
-
-   /* Record whether sq is in reset state. */
-   bool reset;
-};
-
-/* Internal representation of a receive virtqueue */
-struct receive_queue {
-   /* Virtqueue associated with this receive_queue */
-   struct virtqueue *vq;
-
-   struct napi_struct napi;
-
-   struct bpf_prog __rcu *xdp_prog;
-
-   struct virtnet_rq_stats stats;
-
-   struct virtnet_interrupt_coalesce intr_coal;
-
-   /* Chain pages by the private ptr. */
-   struct page *pages;
-
-   /* Average packet length for mergeable receive buffers. */
-   struct ewma_pkt_len mrg_avg_pkt_len;
-
-   /* Page frag for packet buffer allocation. */
-   struct page_frag alloc_frag;
-
-   /* RX: fragments + linear part + virtio header */
-   struct scatterlist sg[MAX_SKB_FRAGS + 2];
-
-   /* Min single buffer size for mergeable buffers case. */
-   unsigned int min_buf_len;
-
-   /* Name of this receive queue: input.$index */
-   char name[16];
-
-   struct xdp_rxq_info xdp_rxq;
-
-   /* Record the last dma info to free after new pages is allocated. */
-   struct virtnet_rq_dma *last_dma;
-
-   /* Do dma by self */
-   bool do_dma;
-};
-
 /* This structure can contain rss message with maximum settings for 
indirection table and keysize
  * Note, that default structure that describes RSS configuration 
virtio_net_rss_config
  * contains same info but can't handle table values.
@@ -234,88 +121,6 @@ struct control_buf {

[PATCH net-next v1 03/19] virtio_net: independent directory

2023-10-16 Thread Xuan Zhuo
Create a separate directory for virtio-net. AF_XDP support will be added
later and will bring a separate xsk.c file, so virtio-net should get its
own directory.

Signed-off-by: Xuan Zhuo 
---
 MAINTAINERS |  2 +-
 drivers/net/Kconfig |  8 +---
 drivers/net/Makefile|  2 +-
 drivers/net/virtio/Kconfig  | 13 +
 drivers/net/virtio/Makefile |  8 
 drivers/net/{virtio_net.c => virtio/main.c} |  0
 6 files changed, 24 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/virtio/Kconfig
 create mode 100644 drivers/net/virtio/Makefile
 rename drivers/net/{virtio_net.c => virtio/main.c} (100%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9c186c214c54..e4fbcbc100e3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22768,7 +22768,7 @@ F:  Documentation/devicetree/bindings/virtio/
 F: Documentation/driver-api/virtio/
 F: drivers/block/virtio_blk.c
 F: drivers/crypto/virtio/
-F: drivers/net/virtio_net.c
+F: drivers/net/virtio/
 F: drivers/vdpa/
 F: drivers/virtio/
 F: include/linux/vdpa.h
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 44eeb5d61ba9..54ee6fa4f4a6 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -430,13 +430,7 @@ config VETH
  When one end receives the packet it appears on its pair and vice
  versa.
 
-config VIRTIO_NET
-   tristate "Virtio network driver"
-   depends on VIRTIO
-   select NET_FAILOVER
-   help
- This is the virtual network driver for virtio.  It can be used with
- QEMU based VMMs (like KVM or Xen).  Say Y or M.
+source "drivers/net/virtio/Kconfig"
 
 config NLMON
tristate "Virtual netlink monitoring device"
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e26f98f897c5..47537dd0f120 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -31,7 +31,7 @@ obj-$(CONFIG_NET_TEAM) += team/
 obj-$(CONFIG_TUN) += tun.o
 obj-$(CONFIG_TAP) += tap.o
 obj-$(CONFIG_VETH) += veth.o
-obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+obj-$(CONFIG_VIRTIO_NET) += virtio/
 obj-$(CONFIG_VXLAN) += vxlan/
 obj-$(CONFIG_GENEVE) += geneve.o
 obj-$(CONFIG_BAREUDP) += bareudp.o
diff --git a/drivers/net/virtio/Kconfig b/drivers/net/virtio/Kconfig
new file mode 100644
index ..d8ccb3ac49df
--- /dev/null
+++ b/drivers/net/virtio/Kconfig
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# virtio-net device configuration
+#
+config VIRTIO_NET
+   tristate "Virtio network driver"
+   depends on VIRTIO
+   select NET_FAILOVER
+   help
+ This is the virtual network driver for virtio.  It can be used with
+ QEMU based VMMs (like KVM or Xen).
+
+ Say Y or M.
diff --git a/drivers/net/virtio/Makefile b/drivers/net/virtio/Makefile
new file mode 100644
index ..15ed7c97fd4f
--- /dev/null
+++ b/drivers/net/virtio/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the virtio network device drivers.
+#
+
+obj-$(CONFIG_VIRTIO_NET) += virtio_net.o
+
+virtio_net-y := main.o
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio/main.c
similarity index 100%
rename from drivers/net/virtio_net.c
rename to drivers/net/virtio/main.c
-- 
2.32.0.3.g01195cf9f



[PATCH net-next v1 02/19] virtio_net: unify the code for recycling the xmit ptr

2023-10-16 Thread Xuan Zhuo
There are two nearly identical, independent implementations. This is
inconvenient when adding new types later, so extract a common function
from this code and call it uniformly to recycle the old xmit pointers.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio_net.c | 76 +---
 1 file changed, 33 insertions(+), 43 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3d87386d8220..6cf77b6acdab 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -352,6 +352,30 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
 }
 
+static void __free_old_xmit(struct send_queue *sq, bool in_napi,
+   struct virtnet_sq_stats *stats)
+{
+   unsigned int len;
+   void *ptr;
+
+   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+   if (!is_xdp_frame(ptr)) {
+   struct sk_buff *skb = ptr;
+
+   pr_debug("Sent skb %p\n", skb);
+
+   stats->bytes += skb->len;
+   napi_consume_skb(skb, in_napi);
+   } else {
+   struct xdp_frame *frame = ptr_to_xdp(ptr);
+
+   stats->bytes += xdp_get_frame_len(frame);
+   xdp_return_frame(frame);
+   }
+   stats->packets++;
+   }
+}
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -746,37 +770,19 @@ static void virtnet_rq_set_premapped(struct virtnet_info 
*vi)
 
 static void free_old_xmit(struct send_queue *sq, bool in_napi)
 {
-   unsigned int len;
-   unsigned int packets = 0;
-   unsigned int bytes = 0;
-   void *ptr;
+   struct virtnet_sq_stats stats = {};
 
-   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   if (likely(!is_xdp_frame(ptr))) {
-   struct sk_buff *skb = ptr;
-
-   pr_debug("Sent skb %p\n", skb);
-
-   bytes += skb->len;
-   napi_consume_skb(skb, in_napi);
-   } else {
-   struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-   bytes += xdp_get_frame_len(frame);
-   xdp_return_frame(frame);
-   }
-   packets++;
-   }
+   __free_old_xmit(sq, in_napi, &stats);
 
/* Avoid overhead when no packets have been processed
 * happens when called speculatively from start_xmit.
 */
-   if (!packets)
+   if (!stats.packets)
return;
 
u64_stats_update_begin(&sq->stats.syncp);
-   sq->stats.bytes += bytes;
-   sq->stats.packets += packets;
+   sq->stats.bytes += stats.bytes;
+   sq->stats.packets += stats.packets;
u64_stats_update_end(&sq->stats.syncp);
 }
 
@@ -915,15 +921,12 @@ static int virtnet_xdp_xmit(struct net_device *dev,
int n, struct xdp_frame **frames, u32 flags)
 {
struct virtnet_info *vi = netdev_priv(dev);
+   struct virtnet_sq_stats stats = {};
struct receive_queue *rq = vi->rq;
struct bpf_prog *xdp_prog;
struct send_queue *sq;
-   unsigned int len;
-   int packets = 0;
-   int bytes = 0;
int nxmit = 0;
int kicks = 0;
-   void *ptr;
int ret;
int i;
 
@@ -942,20 +945,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
}
 
/* Free up any pending old buffers before queueing new ones. */
-   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   if (likely(is_xdp_frame(ptr))) {
-   struct xdp_frame *frame = ptr_to_xdp(ptr);
-
-   bytes += xdp_get_frame_len(frame);
-   xdp_return_frame(frame);
-   } else {
-   struct sk_buff *skb = ptr;
-
-   bytes += skb->len;
-   napi_consume_skb(skb, false);
-   }
-   packets++;
-   }
+   __free_old_xmit(sq, false, &stats);
 
for (i = 0; i < n; i++) {
struct xdp_frame *xdpf = frames[i];
@@ -975,8 +965,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
}
 out:
u64_stats_update_begin(&sq->stats.syncp);
-   sq->stats.bytes += bytes;
-   sq->stats.packets += packets;
+   sq->stats.bytes += stats.bytes;
+   sq->stats.packets += stats.packets;
sq->stats.xdp_tx += n;
sq->stats.xdp_tx_drops += n - nxmit;
sq->stats.kicks += kicks;
-- 
2.32.0.3.g01195cf9f



[PATCH net-next v1 07/19] virtio_net: separate virtnet_tx_resize()

2023-10-16 Thread Xuan Zhuo
This patch separates two sub-functions from virtnet_tx_resize():

* virtnet_tx_pause
* virtnet_tx_resume

Then the subsequent virtnet_tx_reset() can share these two functions.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   | 35 +++--
 drivers/net/virtio/virtio_net.h |  2 ++
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index e6b262341619..8da84ea9bcbe 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -2156,12 +2156,11 @@ static int virtnet_rx_resize(struct virtnet_info *vi,
return err;
 }
 
-static int virtnet_tx_resize(struct virtnet_info *vi,
-struct virtnet_sq *sq, u32 ring_num)
+void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq)
 {
bool running = netif_running(vi->dev);
struct netdev_queue *txq;
-   int err, qindex;
+   int qindex;
 
qindex = sq - vi->sq;
 
@@ -2182,10 +2181,17 @@ static int virtnet_tx_resize(struct virtnet_info *vi,
netif_stop_subqueue(vi->dev, qindex);
 
__netif_tx_unlock_bh(txq);
+}
 
-   err = virtqueue_resize(sq->vq, ring_num, virtnet_sq_free_unused_buf);
-   if (err)
-   netdev_err(vi->dev, "resize tx fail: tx queue index: %d err: %d\n", qindex, err);
+void virtnet_tx_resume(struct virtnet_info *vi, struct virtnet_sq *sq)
+{
+   bool running = netif_running(vi->dev);
+   struct netdev_queue *txq;
+   int qindex;
+
+   qindex = sq - vi->sq;
+
+   txq = netdev_get_tx_queue(vi->dev, qindex);
 
__netif_tx_lock_bh(txq);
sq->reset = false;
@@ -2194,6 +2200,23 @@ static int virtnet_tx_resize(struct virtnet_info *vi,
 
if (running)
virtnet_napi_tx_enable(vi, sq->vq, >napi);
+}
+
+static int virtnet_tx_resize(struct virtnet_info *vi, struct virtnet_sq *sq,
+u32 ring_num)
+{
+   int qindex, err;
+
+   qindex = sq - vi->sq;
+
+   virtnet_tx_pause(vi, sq);
+
+   err = virtqueue_resize(sq->vq, ring_num, virtnet_sq_free_unused_buf);
+   if (err)
+   netdev_err(vi->dev, "resize tx fail: tx queue index: %d err: %d\n", qindex, err);
+
+   virtnet_tx_resume(vi, sq);
+
return err;
 }
 
diff --git a/drivers/net/virtio/virtio_net.h b/drivers/net/virtio/virtio_net.h
index 70eea23adba6..2f930af35364 100644
--- a/drivers/net/virtio/virtio_net.h
+++ b/drivers/net/virtio/virtio_net.h
@@ -256,4 +256,6 @@ static inline bool virtnet_is_xdp_raw_buffer_queue(struct 
virtnet_info *vi, int
 
 void virtnet_rx_pause(struct virtnet_info *vi, struct virtnet_rq *rq);
 void virtnet_rx_resume(struct virtnet_info *vi, struct virtnet_rq *rq);
+void virtnet_tx_pause(struct virtnet_info *vi, struct virtnet_sq *sq);
+void virtnet_tx_resume(struct virtnet_info *vi, struct virtnet_sq *sq);
 #endif
-- 
2.32.0.3.g01195cf9f
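
Symmetrically with the rx case, the virtnet_tx_reset() the message
refers to reduces to a thin wrapper; a sketch under the same
virtqueue_reset() assumption:

static int virtnet_tx_reset(struct virtnet_info *vi, struct virtnet_sq *sq)
{
	int qindex = sq - vi->sq;
	int err;

	virtnet_tx_pause(vi, sq);

	err = virtqueue_reset(sq->vq, virtnet_sq_free_unused_buf);
	if (err)
		netdev_err(vi->dev, "reset tx fail: tx queue index: %d err: %d\n",
			   qindex, err);

	virtnet_tx_resume(vi, sq);

	return err;
}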



[PATCH net-next v1 00/19] virtio-net: support AF_XDP zero copy

2023-10-16 Thread Xuan Zhuo
## AF_XDP

The XDP socket (AF_XDP) is an excellent kernel-bypass networking framework.
The zero-copy feature of xsk (XDP socket) needs to be supported by the
driver, and its performance is very good. mlx5 and Intel ixgbe already
support this feature. This patch set allows virtio-net to support xsk's
zero-copy xmit feature.

At present, we have completed some preparation:

1. vq-reset (virtio spec and kernel code)
2. virtio-core premapped dma
3. virtio-net xdp refactor

So it is time for Virtio-Net to complete the support for the XDP Socket
Zerocopy.

Virtio-net cannot increase the queue count at will, so xsk shares queues
with the kernel.

On the other hand, virtio-net does not support generating an interrupt from
the driver manually, so when we wake up tx xmit we use a trick: if the CPU
that last ran TX NAPI is a different CPU, use an IPI to wake up NAPI on that
remote CPU; if it is the local CPU, wake up napi directly.
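
A minimal sketch of that wakeup decision, with names assumed from this
series (the real code is in the "xsk: tx: support wakeup" patch):

static int virtnet_xsk_wakeup(struct net_device *dev, u32 qid, u32 flag)
{
	struct virtnet_info *vi = netdev_priv(dev);
	struct virtnet_sq *sq;

	if (!netif_running(dev))
		return -ENETDOWN;

	if (qid >= vi->curr_queue_pairs)
		return -EINVAL;

	sq = &vi->sq[qid];

	/* If TX NAPI is already scheduled (possibly on another CPU), just
	 * mark it missed and the running NAPI will poll again.
	 */
	if (napi_if_scheduled_mark_missed(&sq->napi))
		return 0;

	local_bh_disable();
	/* Scheduling TX NAPI covers the local-CPU case; the remote-CPU
	 * IPI described above is handled inside the series' scheduling
	 * helper.
	 */
	__napi_schedule(&sq->napi);
	local_bh_enable();

	return 0;
}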

This patch set includes some refactoring of virtio-net to let it support
AF_XDP.

## performance

ENV: Qemu with vhost-user(polling mode).

Sockperf: https://github.com/Mellanox/sockperf
I use this tool to send udp packets via kernel syscalls.

xmit command: sockperf tp -i 10.0.3.1 -t 1000

I wrote a tool that sends or receives udp packets via AF_XDP.

  | Guest APP CPU |Guest Softirq CPU | UDP PPS
--|---|--|
xmit by syscall   |   100%|  |   676,915
xmit by xsk   |   59.1%   |   100%   | 5,447,168
recv by syscall   |   60% |   100%   |   932,288
recv by xsk   |   35.7%   |   100%   | 3,343,168

## maintain

I am currently a reviewer for virtio-net. I commit to maintain AF_XDP support in
virtio-net.

Please review.

Thanks.

v1:
1. remove two virtio commits. Push this patchset to net-next
2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: support 
tx
3. fix some warnings

Xuan Zhuo (19):
  virtio_net: rename free_old_xmit_skbs to free_old_xmit
  virtio_net: unify the code for recycling the xmit ptr
  virtio_net: independent directory
  virtio_net: move to virtio_net.h
  virtio_net: add prefix virtnet to all struct/api inside virtio_net.h
  virtio_net: separate virtnet_rx_resize()
  virtio_net: separate virtnet_tx_resize()
  virtio_net: sq support premapped mode
  virtio_net: xsk: bind/unbind xsk
  virtio_net: xsk: prevent disable tx napi
  virtio_net: xsk: tx: support tx
  virtio_net: xsk: tx: support wakeup
  virtio_net: xsk: tx: virtnet_free_old_xmit() distinguishes xsk buffer
  virtio_net: xsk: tx: virtnet_sq_free_unused_buf() check xsk buffer
  virtio_net: xsk: rx: introduce add_recvbuf_xsk()
  virtio_net: xsk: rx: introduce receive_xsk() to recv xsk buffer
  virtio_net: xsk: rx: virtnet_rq_free_unused_buf() check xsk buffer
  virtio_net: update tx timeout record
  virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY

 MAINTAINERS |   2 +-
 drivers/net/Kconfig |   8 +-
 drivers/net/Makefile|   2 +-
 drivers/net/virtio/Kconfig  |  13 +
 drivers/net/virtio/Makefile |   8 +
 drivers/net/{virtio_net.c => virtio/main.c} | 652 +---
 drivers/net/virtio/virtio_net.h | 359 +++
 drivers/net/virtio/xsk.c| 545 
 drivers/net/virtio/xsk.h|  32 +
 9 files changed, 1247 insertions(+), 374 deletions(-)
 create mode 100644 drivers/net/virtio/Kconfig
 create mode 100644 drivers/net/virtio/Makefile
 rename drivers/net/{virtio_net.c => virtio/main.c} (91%)
 create mode 100644 drivers/net/virtio/virtio_net.h
 create mode 100644 drivers/net/virtio/xsk.c
 create mode 100644 drivers/net/virtio/xsk.h

--
2.32.0.3.g01195cf9f



[PATCH net-next v1 05/19] virtio_net: add prefix virtnet to all struct/api inside virtio_net.h

2023-10-16 Thread Xuan Zhuo
We moved some structures and APIs to the header file, but these
structures and APIs were not prefixed with virtnet. This patch adds the
virtnet prefix to them.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio/main.c   | 122 
 drivers/net/virtio/virtio_net.h |  30 
 2 files changed, 76 insertions(+), 76 deletions(-)

diff --git a/drivers/net/virtio/main.c b/drivers/net/virtio/main.c
index d8b6c0d86f29..ba38b6078e1d 100644
--- a/drivers/net/virtio/main.c
+++ b/drivers/net/virtio/main.c
@@ -180,7 +180,7 @@ skb_vnet_common_hdr(struct sk_buff *skb)
  * private is used to chain pages for big packets, put the whole
  * most recent used list in the beginning for reuse
  */
-static void give_pages(struct receive_queue *rq, struct page *page)
+static void give_pages(struct virtnet_rq *rq, struct page *page)
 {
struct page *end;
 
@@ -190,7 +190,7 @@ static void give_pages(struct receive_queue *rq, struct 
page *page)
rq->pages = page;
 }
 
-static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
+static struct page *get_a_page(struct virtnet_rq *rq, gfp_t gfp_mask)
 {
struct page *p = rq->pages;
 
@@ -225,7 +225,7 @@ static void virtqueue_napi_complete(struct napi_struct 
*napi,
opaque = virtqueue_enable_cb_prepare(vq);
if (napi_complete_done(napi, processed)) {
if (unlikely(virtqueue_poll(vq, opaque)))
-   virtqueue_napi_schedule(napi, vq);
+   virtnet_vq_napi_schedule(napi, vq);
} else {
virtqueue_disable_cb(vq);
}
@@ -240,7 +240,7 @@ static void skb_xmit_done(struct virtqueue *vq)
virtqueue_disable_cb(vq);
 
if (napi->weight)
-   virtqueue_napi_schedule(napi, vq);
+   virtnet_vq_napi_schedule(napi, vq);
else
/* We were probably waiting for more output buffers. */
netif_wake_subqueue(vi->dev, vq2txq(vq));
@@ -281,7 +281,7 @@ static struct sk_buff *virtnet_build_skb(void *buf, 
unsigned int buflen,
 
 /* Called from bottom half context */
 static struct sk_buff *page_to_skb(struct virtnet_info *vi,
-  struct receive_queue *rq,
+  struct virtnet_rq *rq,
   struct page *page, unsigned int offset,
   unsigned int len, unsigned int truesize,
   unsigned int headroom)
@@ -380,7 +380,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
return skb;
 }
 
-static void virtnet_rq_unmap(struct receive_queue *rq, void *buf, u32 len)
+static void virtnet_rq_unmap(struct virtnet_rq *rq, void *buf, u32 len)
 {
struct page *page = virt_to_head_page(buf);
struct virtnet_rq_dma *dma;
@@ -409,7 +409,7 @@ static void virtnet_rq_unmap(struct receive_queue *rq, void 
*buf, u32 len)
put_page(page);
 }
 
-static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len, void **ctx)
+static void *virtnet_rq_get_buf(struct virtnet_rq *rq, u32 *len, void **ctx)
 {
void *buf;
 
@@ -420,7 +420,7 @@ static void *virtnet_rq_get_buf(struct receive_queue *rq, 
u32 *len, void **ctx)
return buf;
 }
 
-static void *virtnet_rq_detach_unused_buf(struct receive_queue *rq)
+static void *virtnet_rq_detach_unused_buf(struct virtnet_rq *rq)
 {
void *buf;
 
@@ -431,7 +431,7 @@ static void *virtnet_rq_detach_unused_buf(struct 
receive_queue *rq)
return buf;
 }
 
-static void virtnet_rq_init_one_sg(struct receive_queue *rq, void *buf, u32 
len)
+static void virtnet_rq_init_one_sg(struct virtnet_rq *rq, void *buf, u32 len)
 {
struct virtnet_rq_dma *dma;
dma_addr_t addr;
@@ -456,7 +456,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue 
*rq, void *buf, u32 len)
rq->sg[0].length = len;
 }
 
-static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
+static void *virtnet_rq_alloc(struct virtnet_rq *rq, u32 size, gfp_t gfp)
 {
struct page_frag *alloc_frag = &rq->alloc_frag;
struct virtnet_rq_dma *dma;
@@ -530,11 +530,11 @@ static void virtnet_rq_set_premapped(struct virtnet_info 
*vi)
}
 }
 
-static void free_old_xmit(struct send_queue *sq, bool in_napi)
+static void free_old_xmit(struct virtnet_sq *sq, bool in_napi)
 {
struct virtnet_sq_stats stats = {};
 
-   __free_old_xmit(sq, in_napi, &stats);
+   virtnet_free_old_xmit(sq, in_napi, &stats);
 
/* Avoid overhead when no packets have been processed
 * happens when called speculatively from start_xmit.
@@ -550,7 +550,7 @@ static void free_old_xmit(struct send_queue *sq, bool 
in_napi)
 
 static void check_sq_full_and_disable(struct virtnet_info *vi,
  struct net_device *dev,
- struct send_queue *sq)
+ struct virtnet_sq *sq)
 {
  

[PATCH net-next v1 01/19] virtio_net: rename free_old_xmit_skbs to free_old_xmit

2023-10-16 Thread Xuan Zhuo
Since free_old_xmit_skbs() deals not only with skbs but also with xdp
frames and, later, xsk buffers, rename the function to free_old_xmit().

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio_net.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fe7f314d65c9..3d87386d8220 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -744,7 +744,7 @@ static void virtnet_rq_set_premapped(struct virtnet_info 
*vi)
}
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
+static void free_old_xmit(struct send_queue *sq, bool in_napi)
 {
unsigned int len;
unsigned int packets = 0;
@@ -816,7 +816,7 @@ static void check_sq_full_and_disable(struct virtnet_info 
*vi,
virtqueue_napi_schedule(&sq->napi, sq->vq);
} else if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
/* More just got used, free them then recheck. */
-   free_old_xmit_skbs(sq, false);
+   free_old_xmit(sq, false);
if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
netif_start_subqueue(dev, qnum);
virtqueue_disable_cb(sq->vq);
@@ -2124,7 +2124,7 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
 
do {
virtqueue_disable_cb(sq->vq);
-   free_old_xmit_skbs(sq, true);
+   free_old_xmit(sq, true);
} while (unlikely(!virtqueue_enable_cb_delayed(sq->vq)));
 
if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
@@ -2246,7 +2246,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, int 
budget)
txq = netdev_get_tx_queue(vi->dev, index);
__netif_tx_lock(txq, raw_smp_processor_id());
virtqueue_disable_cb(sq->vq);
-   free_old_xmit_skbs(sq, true);
+   free_old_xmit(sq, true);
 
if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
netif_tx_wake_queue(txq);
@@ -2336,7 +2336,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct 
net_device *dev)
if (use_napi)
virtqueue_disable_cb(sq->vq);
 
-   free_old_xmit_skbs(sq, false);
+   free_old_xmit(sq, false);
 
} while (use_napi && kick &&
   unlikely(!virtqueue_enable_cb_delayed(sq->vq)));
-- 
2.32.0.3.g01195cf9f



Re: [PATCH vfio 10/11] vfio/virtio: Expose admin commands over virtio device

2023-10-16 Thread Zhu, Lingshan




On 10/16/2023 4:52 PM, Michael S. Tsirkin wrote:

On Mon, Oct 16, 2023 at 04:33:10PM +0800, Zhu, Lingshan wrote:


On 10/13/2023 9:50 PM, Michael S. Tsirkin wrote:

On Fri, Oct 13, 2023 at 06:28:34PM +0800, Zhu, Lingshan wrote:

On 10/12/2023 9:27 PM, Jason Gunthorpe wrote:

  On Thu, Oct 12, 2023 at 06:29:47PM +0800, Zhu, Lingshan wrote:


  sorry for the late reply, we have discussed this for weeks in virtio 
mailing
  list. I have proposed a live migration solution which is a config 
space solution.

  I'm sorry that can't be a serious proposal - config space can't do
  DMA, it is not suitable.

config space only controls the live migration process and config the related
facilities.
We don't use config space to transfer data.

The new added registers work like queue_enable or features.

For example, we use DMA to report dirty pages and MMIO to fetch the dirty data.

I remember in another thread you said:"you can't use DMA for any migration
flows"

And I agree to that statement, so we use config space registers to control the
flow.

Thanks,
Zhu Lingshan


  Jason


If you are using dma then I don't see what's wrong with admin vq.
dma is all it does.

dma != admin vq,

Well they share the same issue that they don't work for nesting
because DMA can not be intercepted.
(hope this is not spam to the virtualization list; I will try to keep
this short)
We only use dma for host memory access, e.g., the dirty page bitmap,
which does not need to be intercepted.



and I think we have discussed many details in pros and cons
in admin vq live migration proposal in virtio-comment.
I am not sure we should span the discussions here, repeat them over again.

Thanks

Yea let's not.





Re: [PATCH vfio 10/11] vfio/virtio: Expose admin commands over virtio device

2023-10-16 Thread Michael S. Tsirkin
On Mon, Oct 16, 2023 at 04:33:10PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 10/13/2023 9:50 PM, Michael S. Tsirkin wrote:
> > On Fri, Oct 13, 2023 at 06:28:34PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 10/12/2023 9:27 PM, Jason Gunthorpe wrote:
> > > 
> > >  On Thu, Oct 12, 2023 at 06:29:47PM +0800, Zhu, Lingshan wrote:
> > > 
> > > 
> > >  sorry for the late reply, we have discussed this for weeks in 
> > > virtio mailing
> > >  list. I have proposed a live migration solution which is a 
> > > config space solution.
> > > 
> > >  I'm sorry that can't be a serious proposal - config space can't do
> > >  DMA, it is not suitable.
> > > 
> > > config space only controls the live migration process and config the 
> > > related
> > > facilities.
> > > We don't use config space to transfer data.
> > > 
> > > The new added registers work like queue_enable or features.
> > > 
> > > For example, we use DMA to report dirty pages and MMIO to fetch the dirty 
> > > data.
> > > 
> > > I remember in another thread you said:"you can't use DMA for any migration
> > > flows"
> > > 
> > > And I agree to that statement, so we use config space registers to 
> > > control the
> > > flow.
> > > 
> > > Thanks,
> > > Zhu Lingshan
> > > 
> > > 
> > >  Jason
> > > 
> > If you are using dma then I don't see what's wrong with admin vq.
> > dma is all it does.
> dma != admin vq,

Well they share the same issue that they don't work for nesting
because DMA can not be intercepted.

> and I think we have discussed many details in pros and cons
> in admin vq live migration proposal in virtio-comment.
> I am not sure we should span the discussions here, repeat them over again.
> 
> Thanks
> > 

Yea let's not.

-- 
MST



Re: [PATCH vfio 10/11] vfio/virtio: Expose admin commands over virtio device

2023-10-16 Thread Zhu, Lingshan




On 10/13/2023 9:50 PM, Michael S. Tsirkin wrote:

On Fri, Oct 13, 2023 at 06:28:34PM +0800, Zhu, Lingshan wrote:


On 10/12/2023 9:27 PM, Jason Gunthorpe wrote:

 On Thu, Oct 12, 2023 at 06:29:47PM +0800, Zhu, Lingshan wrote:


 sorry for the late reply, we have discussed this for weeks in virtio 
mailing
 list. I have proposed a live migration solution which is a config 
space solution.

 I'm sorry that can't be a serious proposal - config space can't do
 DMA, it is not suitable.

config space only controls the live migration process and config the related
facilities.
We don't use config space to transfer data.

The new added registers work like queue_enable or features.

For example, we use DMA to report dirty pages and MMIO to fetch the dirty data.

I remember in another thread you said:"you can't use DMA for any migration
flows"

And I agree to that statement, so we use config space registers to control the
flow.

Thanks,
Zhu Lingshan


 Jason


If you are using dma then I don't see what's wrong with admin vq.
dma is all it does.

dma != admin vq,

and I think we have discussed many details in pros and cons
in admin vq live migration proposal in virtio-comment.
I am not sure we should span the discussions here, repeat them over again.

Thanks






Re: [PATCH net-next 2/5] virtio-net: separate rx/tx coalescing moderation cmds

2023-10-16 Thread Michael S. Tsirkin
On Mon, Oct 16, 2023 at 03:45:38PM +0800, Heng Qi wrote:
> 
> 
> > On 2023/10/14 09:11, Jakub Kicinski wrote:
> > On Thu, 12 Oct 2023 15:44:06 +0800 Heng Qi wrote:
> > > +
> > > +static int virtnet_send_rx_notf_coal_cmds(struct virtnet_info *vi,
> > > +   struct ethtool_coalesce *ec)
> > > +{
> > > + struct scatterlist sgs_rx;
> > > +
> > ../drivers/net/virtio_net.c: In function ‘virtnet_send_rx_notf_coal_cmds’:
> > ../drivers/net/virtio_net.c:3306:14: error: ‘i’ undeclared (first use in 
> > this function); did you mean ‘vi’?
> >   3306 | for (i = 0; i < vi->max_queue_pairs; i++) {
> >|  ^
> >|  vi
> 
> Will fix in the next version.
> 
> Thanks!

OK, however pls do test individual patches as well as the whole
patchset.

-- 
MST


Re: [PATCH][next] iommu/virtio: Add __counted_by for struct viommu_request and use struct_size()

2023-10-16 Thread Joerg Roedel
On Mon, Oct 09, 2023 at 12:24:27PM -0600, Gustavo A. R. Silva wrote:
>  drivers/iommu/virtio-iommu.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Applied, thanks.



Re: [RESEND PATCH v2] vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE

2023-10-16 Thread Eric Auger
Hi,

On 9/27/23 16:05, Eric Auger wrote:
> Commit e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb
> entries") forbade vhost iotlb msgs with null size to prevent entries
> with size = start = 0 and last = ULONG_MAX from ending up in the iotlb.
>
> Then commit 95932ab2ea07 ("vhost: allow batching hint without size")
> applied the check only to the VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE
> message types, to fix a regression observed with the batching hint.
>
> Still, the introduction of that check introduced a regression for
> some users attempting to invalidate the whole ULONG_MAX range by
> setting the size to 0. This is the case with qemu/smmuv3/vhost
> integration, which does not work anymore. It looks safe to partially
> revert the original commit and allow VHOST_IOTLB_INVALIDATE messages
> with null size. vhost_iotlb_del_range() will compute a correct end
> iova. Same for vhost_vdpa_iotlb_unmap().
>
> Signed-off-by: Eric Auger 
> Fixes: e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb entries")
> Cc: sta...@vger.kernel.org # v5.17+
> Acked-by: Jason Wang 
Gentle Ping.

Thanks

Eric
> ---
>  drivers/vhost/vhost.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c71d573f1c94..e0c181ad17e3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1458,9 +1458,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>   goto done;
>   }
>  
> - if ((msg.type == VHOST_IOTLB_UPDATE ||
> -  msg.type == VHOST_IOTLB_INVALIDATE) &&
> -  msg.size == 0) {
> + if (msg.type == VHOST_IOTLB_UPDATE && msg.size == 0) {
>   ret = -EINVAL;
>   goto done;
>   }



Re: [PATCH 2/4] vhost-vdpa: reset vendor specific mapping to initial state in .release

2023-10-16 Thread Jason Wang
On Fri, Oct 13, 2023 at 3:36 PM Si-Wei Liu  wrote:
>
>
>
> On 10/12/2023 8:01 PM, Jason Wang wrote:
> > On Tue, Oct 10, 2023 at 5:05 PM Si-Wei Liu  wrote:
> >> Devices with on-chip IOMMU or vendor specific IOTLB implementation
> >> may need to restore iotlb mapping to the initial or default state
> >> using the .reset_map op, as it's desirable for some parent devices
> >> to solely manipulate mappings by its own, independent of virtio device
> >> state. For instance, device reset does not cause mapping go away on
> >> such IOTLB model in need of persistent mapping. Before vhost-vdpa
> >> is going away, give them a chance to reset iotlb back to the initial
> >> state in vhost_vdpa_cleanup().
> >>
> >> Signed-off-by: Si-Wei Liu 
> >> ---
> >>   drivers/vhost/vdpa.c | 16 
> >>   1 file changed, 16 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 851535f..a3f8160 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -131,6 +131,15 @@ static struct vhost_vdpa_as 
> >> *vhost_vdpa_find_alloc_as(struct vhost_vdpa *v,
> >>  return vhost_vdpa_alloc_as(v, asid);
> >>   }
> >>
> >> +static void vhost_vdpa_reset_map(struct vhost_vdpa *v, u32 asid)
> >> +{
> >> +   struct vdpa_device *vdpa = v->vdpa;
> >> +   const struct vdpa_config_ops *ops = vdpa->config;
> >> +
> >> +   if (ops->reset_map)
> >> +   ops->reset_map(vdpa, asid);
> >> +}
> >> +
> >>   static int vhost_vdpa_remove_as(struct vhost_vdpa *v, u32 asid)
> >>   {
> >>  struct vhost_vdpa_as *as = asid_to_as(v, asid);
> >> @@ -140,6 +149,13 @@ static int vhost_vdpa_remove_as(struct vhost_vdpa *v, 
> >> u32 asid)
> >>
> >>  hlist_del(&as->hash_link);
> >>  vhost_vdpa_iotlb_unmap(v, &as->iotlb, 0ULL, 0ULL - 1, asid);
> >> +   /*
> >> +* Devices with vendor specific IOMMU may need to restore
> >> +* iotlb to the initial or default state which is not done
> >> +* through device reset, as the IOTLB mapping manipulation
> >> +* could be decoupled from the virtio device life cycle.
> >> +*/
> > Should we do this according to whether IOTLB_PRESIST is set?
> Well, in theory this seems like so but it's unnecessary code change
> actually, as that is the way how vDPA parent behind platform IOMMU works
> today, and userspace doesn't break as of today. :)

Well, this is a question I have asked before. You explained that one of
the reasons we don't break userspace is that it may couple IOTLB reset
with vDPA reset as well. One example is QEMU.

>
> As explained in previous threads [1][2], when IOTLB_PERSIST is not set
> it doesn't necessarily mean the iotlb will be destroyed across reset
> (think of the platform IOMMU case), so userspace today already tolerates
> both good and bad IOMMUs. Not checking whether IOTLB_PERSIST is set is
> intentional here; there's no point in emulating bad IOMMU behavior even
> for older userspace (doing that emulation improperly would result in
> even worse performance).

For two reasons:

1) backend features need to be acked by userspace; this is by design
2) keeping the odd behaviour seems safer, as we can't audit every
userspace program

Thanks

> I think
> the purpose of the IOTLB_PERSIST flag is just to give userspace 100%
> certainty of persistent iotlb mapping not getting lost across vdpa reset.
>
> Thanks,
> -Siwei
>
> [1]
> https://lore.kernel.org/virtualization/9f118fc9-4f6f-dd67-a291-be78152e4...@oracle.com/
> [2]
> https://lore.kernel.org/virtualization/3364adfd-1eb7-8bce-41f9-bfe5473f1...@oracle.com/
> >   Otherwise
> > we may break old userspace.
> >
> > Thanks
> >
> >> +   vhost_vdpa_reset_map(v, asid);
> >>  kfree(as);
> >>
> >>  return 0;
> >> --
> >> 1.8.3.1
> >>
>
