Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr
On Mon, Dec 30, 2024 at 5:51 PM Chia-Yu Chang (Nokia) wrote: > > >From: Jason Wang > >Sent: Monday, December 30, 2024 8:52 AM > >To: Chia-Yu Chang (Nokia) > >Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; > >eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; > >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; > >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; > >shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; > >sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; > >xuanz...@linux.alibaba.com; epere...@redhat.com; > >virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; Koen > >De Schepper (Nokia) ; > >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; > >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; > >jason_living...@comcast.com; vidhi_g...@apple.com > >Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in > >virtio_net_hdr > > > >[You don't often get email from jasow...@redhat.com. Learn why this is > >important at https://aka.ms/LearnAboutSenderIdentification ] > > > >CAUTION: This is an external email. Please be very careful when clicking > >links or opening attachments. See the URL nok.it/ext for additional > >information. > > > > > > > >On Sat, Dec 28, 2024 at 3:13 AM wrote: > >> > >> From: Chia-Yu Chang > >> > >> Unlike RFC 3168 ECN, accurate ECN uses the CWR flag as part of the ACE > >> field to count new packets with CE mark; however, it will be corrupted > >> by the RFC 3168 ECN-aware TSO. Therefore, fallback shall be applied by > >> seting NETIF_F_GSO_ACCECN to ensure that the CWR flag should not be > >> changed within a super-skb. > >> > >> To apply the aforementieond new AccECN GSO for virtio, new featue bits > >> for host and guest are added for feature negotiation between driver > >> and device. And the translation of Accurate ECN GSO flag between > >> virtio_net_hdr and skb header for NETIF_F_GSO_ACCECN is also added to > >> avoid CWR flag corruption due to RFC3168 ECN TSO. > >> > >> Signed-off-by: Chia-Yu Chang > >> --- > >> drivers/net/virtio_net.c| 14 +++--- > >> drivers/vdpa/pds/debugfs.c | 6 ++ > >> include/linux/virtio_net.h | 16 ++-- > >> include/uapi/linux/virtio_net.h | 5 + > >> 4 files changed, 32 insertions(+), 9 deletions(-) > > > >Is there a link to the spec patch? It needs to be accepted first. > > > >Thanks > > Hi Jason, > > Thanks for the feedback, I found the virtio-spec in github: > https://github.com/oasis-tcs/virtio-spec but not able to find the procedure > to propose. > Could you help to share the procedure to propose spec patch? Thanks. Sorry for the late reply. You can start by subscribing to virtio-dev and send a spec patch there. Thanks > > -- > Chia-Yu
Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr
I tested this series of patches v6 with virtio-net regression tests, everything works fine. Tested-by: Lei Yang On Wed, Jan 8, 2025 at 8:33 PM Michael S. Tsirkin wrote: > > On Mon, Dec 30, 2024 at 09:50:59AM +, Chia-Yu Chang (Nokia) wrote: > > >From: Jason Wang > > >Sent: Monday, December 30, 2024 8:52 AM > > >To: Chia-Yu Chang (Nokia) > > >Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; > > >eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; > > >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; > > >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; > > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; > > >shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; > > >sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; > > >xuanz...@linux.alibaba.com; epere...@redhat.com; > > >virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; > > >Koen De Schepper (Nokia) ; > > >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; > > >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; > > >jason_living...@comcast.com; vidhi_g...@apple.com > > >Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in > > >virtio_net_hdr > > > > > >[You don't often get email from jasow...@redhat.com. Learn why this is > > >important at https://aka.ms/LearnAboutSenderIdentification ] > > > > > >CAUTION: This is an external email. Please be very careful when clicking > > >links or opening attachments. See the URL nok.it/ext for additional > > >information. > > > > > > > > > > > >On Sat, Dec 28, 2024 at 3:13 AM wrote: > > >> > > >> From: Chia-Yu Chang > > >> > > >> Unlike RFC 3168 ECN, accurate ECN uses the CWR flag as part of the ACE > > >> field to count new packets with CE mark; however, it will be corrupted > > >> by the RFC 3168 ECN-aware TSO. Therefore, fallback shall be applied by > > >> seting NETIF_F_GSO_ACCECN to ensure that the CWR flag should not be > > >> changed within a super-skb. > > >> > > >> To apply the aforementieond new AccECN GSO for virtio, new featue bits > > >> for host and guest are added for feature negotiation between driver > > >> and device. And the translation of Accurate ECN GSO flag between > > >> virtio_net_hdr and skb header for NETIF_F_GSO_ACCECN is also added to > > >> avoid CWR flag corruption due to RFC3168 ECN TSO. > > >> > > >> Signed-off-by: Chia-Yu Chang > > >> --- > > >> drivers/net/virtio_net.c| 14 +++--- > > >> drivers/vdpa/pds/debugfs.c | 6 ++ > > >> include/linux/virtio_net.h | 16 ++-- > > >> include/uapi/linux/virtio_net.h | 5 + > > >> 4 files changed, 32 insertions(+), 9 deletions(-) > > > > > >Is there a link to the spec patch? It needs to be accepted first. > > > > > >Thanks > > > > Hi Jason, > > > > Thanks for the feedback, I found the virtio-spec in github: > > https://github.com/oasis-tcs/virtio-spec but not able to find the procedure > > to propose. > > Could you help to share the procedure to propose spec patch? Thanks. > > > You post it on virtio-comment for discussion. Github issues are then used > for voting and to track acceptance. > https://github.com/oasis-tcs/virtio-spec/blob/master/README.md#use-of-github-issues > > > > -- > > Chia-Yu > >
Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr
On Mon, Dec 30, 2024 at 09:50:59AM +, Chia-Yu Chang (Nokia) wrote: > >From: Jason Wang > >Sent: Monday, December 30, 2024 8:52 AM > >To: Chia-Yu Chang (Nokia) > >Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; > >eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; > >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; > >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; > >shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; > >sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; > >xuanz...@linux.alibaba.com; epere...@redhat.com; > >virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; Koen > >De Schepper (Nokia) ; > >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; > >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; > >jason_living...@comcast.com; vidhi_g...@apple.com > >Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in > >virtio_net_hdr > > > >[You don't often get email from jasow...@redhat.com. Learn why this is > >important at https://aka.ms/LearnAboutSenderIdentification ] > > > >CAUTION: This is an external email. Please be very careful when clicking > >links or opening attachments. See the URL nok.it/ext for additional > >information. > > > > > > > >On Sat, Dec 28, 2024 at 3:13 AM wrote: > >> > >> From: Chia-Yu Chang > >> > >> Unlike RFC 3168 ECN, accurate ECN uses the CWR flag as part of the ACE > >> field to count new packets with CE mark; however, it will be corrupted > >> by the RFC 3168 ECN-aware TSO. Therefore, fallback shall be applied by > >> seting NETIF_F_GSO_ACCECN to ensure that the CWR flag should not be > >> changed within a super-skb. > >> > >> To apply the aforementieond new AccECN GSO for virtio, new featue bits > >> for host and guest are added for feature negotiation between driver > >> and device. And the translation of Accurate ECN GSO flag between > >> virtio_net_hdr and skb header for NETIF_F_GSO_ACCECN is also added to > >> avoid CWR flag corruption due to RFC3168 ECN TSO. > >> > >> Signed-off-by: Chia-Yu Chang > >> --- > >> drivers/net/virtio_net.c| 14 +++--- > >> drivers/vdpa/pds/debugfs.c | 6 ++ > >> include/linux/virtio_net.h | 16 ++-- > >> include/uapi/linux/virtio_net.h | 5 + > >> 4 files changed, 32 insertions(+), 9 deletions(-) > > > >Is there a link to the spec patch? It needs to be accepted first. > > > >Thanks > > Hi Jason, > > Thanks for the feedback, I found the virtio-spec in github: > https://github.com/oasis-tcs/virtio-spec but not able to find the procedure > to propose. > Could you help to share the procedure to propose spec patch? Thanks. You post it on virtio-comment for discussion. Github issues are then used for voting and to track acceptance. https://github.com/oasis-tcs/virtio-spec/blob/master/README.md#use-of-github-issues > -- > Chia-Yu
RE: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr
>From: Jason Wang >Sent: Monday, December 30, 2024 8:52 AM >To: Chia-Yu Chang (Nokia) >Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; >eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; >netfilter-de...@vger.kernel.org; coret...@netfilter.org; >shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; >sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; >xuanz...@linux.alibaba.com; epere...@redhat.com; >virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; Koen De >Schepper (Nokia) ; >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; >jason_living...@comcast.com; vidhi_g...@apple.com >Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in >virtio_net_hdr > >[You don't often get email from jasow...@redhat.com. Learn why this is >important at https://aka.ms/LearnAboutSenderIdentification ] > >CAUTION: This is an external email. Please be very careful when clicking links >or opening attachments. See the URL nok.it/ext for additional information. > > > >On Sat, Dec 28, 2024 at 3:13 AM wrote: >> >> From: Chia-Yu Chang >> >> Unlike RFC 3168 ECN, accurate ECN uses the CWR flag as part of the ACE >> field to count new packets with CE mark; however, it will be corrupted >> by the RFC 3168 ECN-aware TSO. Therefore, fallback shall be applied by >> seting NETIF_F_GSO_ACCECN to ensure that the CWR flag should not be >> changed within a super-skb. >> >> To apply the aforementieond new AccECN GSO for virtio, new featue bits >> for host and guest are added for feature negotiation between driver >> and device. And the translation of Accurate ECN GSO flag between >> virtio_net_hdr and skb header for NETIF_F_GSO_ACCECN is also added to >> avoid CWR flag corruption due to RFC3168 ECN TSO. >> >> Signed-off-by: Chia-Yu Chang >> --- >> drivers/net/virtio_net.c| 14 +++--- >> drivers/vdpa/pds/debugfs.c | 6 ++ >> include/linux/virtio_net.h | 16 ++-- >> include/uapi/linux/virtio_net.h | 5 + >> 4 files changed, 32 insertions(+), 9 deletions(-) > >Is there a link to the spec patch? It needs to be accepted first. > >Thanks Hi Jason, Thanks for the feedback, I found the virtio-spec in github: https://github.com/oasis-tcs/virtio-spec but not able to find the procedure to propose. Could you help to share the procedure to propose spec patch? Thanks. -- Chia-Yu
Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr
On Sat, Dec 28, 2024 at 3:13 AM wrote:
>
> From: Chia-Yu Chang
>
> Unlike RFC 3168 ECN, accurate ECN uses the CWR flag as part of the ACE
> field to count new packets with CE mark; however, it will be corrupted
> by the RFC 3168 ECN-aware TSO. Therefore, fallback shall be applied by
> setting NETIF_F_GSO_ACCECN to ensure that the CWR flag is not
> changed within a super-skb.
>
> To apply the aforementioned new AccECN GSO for virtio, new feature bits
> for host and guest are added for feature negotiation between driver and
> device. And the translation of the Accurate ECN GSO flag between
> virtio_net_hdr and the skb header for NETIF_F_GSO_ACCECN is also added to
> avoid CWR flag corruption due to RFC 3168 ECN TSO.
>
> Signed-off-by: Chia-Yu Chang
> ---
>  drivers/net/virtio_net.c        | 14 +++---
>  drivers/vdpa/pds/debugfs.c      |  6 ++
>  include/linux/virtio_net.h      | 16 ++--
>  include/uapi/linux/virtio_net.h |  5 +
>  4 files changed, 32 insertions(+), 9 deletions(-)

Is there a link to the spec patch? It needs to be accepted first.

Thanks
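For readers less familiar with AccECN, a small illustration of why the fallback above is needed. This helper is not part of the patch and only restates the draft's definition: the ACE field reuses the AE, CWR and ECE TCP flags as a single 3-bit counter of CE-marked packets, so RFC 3168 style TSO that rewrites CWR per segment silently changes that counter.

/* Illustration only, not from the patch: the AccECN ACE field is the
 * 3-bit value formed by the AE, CWR and ECE flags (AE most significant
 * per the AccECN draft), counting CE-marked packets modulo 8.
 * RFC 3168 ECN-aware TSO clears CWR on all but the first segment of a
 * super-skb, which corrupts this counter, hence the NETIF_F_GSO_ACCECN
 * fallback described above.
 */
static inline unsigned int tcp_ace_field(bool ae, bool cwr, bool ece)
{
	return ((unsigned int)ae << 2) | ((unsigned int)cwr << 1) | ece;
}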
Re: [PATCH v6 net-next 00/14] AccECN protocol preparation patch series
On Fri, 27 Dec 2024 20:11:57 +0100 chia-yu.ch...@nokia-bell-labs.com wrote:
> Hello

Hello. Someone may still review, but:

## Form letter - winter-break

Networking development is suspended for winter holidays, until Jan 2nd.
We are currently accepting bug fixes only, see the announcements at:

https://lore.kernel.org/20241211164022.6a075...@kernel.org
https://lore.kernel.org/20241220182851.7acb6...@kernel.org

RFC patches sent for review only are welcome at any time.
--
pw-bot: defer
pv-bot: closed
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On 12/20/24 11:49, Stefano Garzarella wrote:
> ...
> Note that non-NULL -> NULL should only occur before a connection is
> established, so before any data is passed. Is this a problem for BPF?

Please take a look at vsock_bpf_update_proto(). The condition is to have a
transport assigned. BPF assumes the transport will stay valid. And currently
that's a wrong assumption: the transport can transition from non-NULL to
NULL (due to a failed reconnect). That's why we hit a null-ptr-deref via
vsock_bpf_recvmsg().

That said, I sure hope someone BPF-competent is reading this :)
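As a rough illustration of the BPF-side symptom (not a submitted fix; the name and body below only loosely mirror net/vmw_vsock/vsock_bpf.c), the guard that __vsock_connectible_recvmsg() already performs could sit at the top of vsock_bpf_recvmsg():

/* Untested sketch: bail out early when the socket has no transport, the
 * same way __vsock_connectible_recvmsg() does, so a failed reconnect
 * cannot lead vsock_connectible_has_data() to a NULL transport.
 */
static int vsock_bpf_recvmsg_sketch(struct sock *sk, struct msghdr *msg,
				    size_t len, int flags, int *addr_len)
{
	struct vsock_sock *vsk = vsock_sk(sk);
	int err = 0;

	lock_sock(sk);
	if (!vsk->transport)
		err = sock_flag(sk, SOCK_DONE) ? 0 : -ENOTCONN;
	release_sock(sk);
	if (err)
		return err;

	/* ... the existing psock lookup and copy logic would follow here ... */
	return err;
}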
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Thu, Dec 19, 2024 at 05:09:42PM +0100, Michal Luczaj wrote: On 12/19/24 16:12, Stefano Garzarella wrote: On Thu, 19 Dec 2024 at 16:05, Michal Luczaj wrote: On 12/19/24 15:48, Stefano Garzarella wrote: On Thu, 19 Dec 2024 at 15:36, Michal Luczaj wrote: On 12/19/24 09:19, Stefano Garzarella wrote: ... I think the best thing though is to better understand how to handle deassign, rather than checking everywhere that it's not null, also because in some cases (like the one in virtio-vsock), it's also important that the transport is the same. My vote would be to apply your virtio_transport_recv_pkt() patch *and* make it impossible-by-design to switch ->transport from non-NULL to NULL in vsock_assign_transport(). I don't know if that's enough, in this case the problem is that some response packets are intended for a socket, where the transport has changed. So whether it's null or assigned but different, it's still a problem we have to handle. So making it impossible for the transport to be null, but allowing it to be different (we can't prevent it from changing), doesn't solve the problem for us, it only shifts it. Got it. I assumed this issue would be solved by `vsk->transport != &t->transport` in the critical place(s). (Note that BPF doesn't care if transport has changed; BPF just expects to have _a_ transport.) If I'm not mistaken, that would require rewriting vsock_assign_transport() so that a new transport is assigned only once fully initialized, otherwise keep the old one (still unhurt and functional) and return error. Because failing connect() should not change anything under the hood, right? Nope, connect should be able to change the transport. Because a user can do an initial connect() that requires a specific transport, this one fails maybe because there's no peer with that cid. Then the user can redo the connect() to a different cid that requires a different transport. But the initial connect() failing does not change anything under the hood (transport should/could stay NULL). Nope, isn't null, it's assigned to a transport, because for example it has to send a packet to connect to the remote CID and wait back for a response that for example says the CID doesn't exist. Ahh, I think I get it. So the initial connect() passed vsock_assign_transport() successfully and then failed deeper in vsock_connect(), right? That's fine. Let the socket have a useless transport (a valid pointer nevertheless). Just to be clear, it's not useless, since it's used to make the connection. We know that it's useless just when we report that the connection failed, so maybe we should de-assign it when we set `sock->state = SS_UNCONNECTED`. Sure, upcoming connect() can assign a new (possibly useless just as well) transport, but there's no reason to allow ->transport becoming NULL. I'm not sure about this, in the end in the connection failure case, when we set `sock->state = SS_UNCONNECTED`, we're returning the socket to a pre-connect state, so it might make sense to also reset the transport to NULL, so that we have exactly the same conditions. And a pre-connect socket (where ->transport==NULL) is not an issue, because BPF won't let it in any sockmap, so vsock_bpf_recvmsg() won't be reachable. Anywa, thanks for explaining, Michal PS. Or ignore the above and remove the socket from the sockmap at every reconnect? Possible unhash abuse: I should take a closer look at unhash, but it might make sense! 
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 5cf8109f672a..8a65153ee186 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -483,6 +483,10 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) if (vsk->transport == new_transport) return 0; + const struct proto *prot = READ_ONCE(sk->sk_prot); + if (prot->unhash) + prot->unhash(sk); + /* transport->release() must be called with sock lock acquired. * This path can only be taken during vsock_connect(), where we * have already held the sock lock. In the other cases, this diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c index 4aa6e74ec295..80deb4d70aea 100644 --- a/net/vmw_vsock/vsock_bpf.c +++ b/net/vmw_vsock/vsock_bpf.c @@ -119,6 +119,7 @@ static void vsock_bpf_rebuild_protos(struct proto *prot, const struct proto *bas *prot= *base; prot->close = sock_map_close; prot->recvmsg = vsock_bpf_recvmsg; + prot->unhash = sock_map_unhash; prot->sock_is_readable = sk_msg_is_readable; } Then a successful re-connect assigns the transport (NULL -> non-NULL). And it's all good because all I wan
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On 12/19/24 16:12, Stefano Garzarella wrote: > On Thu, 19 Dec 2024 at 16:05, Michal Luczaj wrote: >> >> On 12/19/24 15:48, Stefano Garzarella wrote: >>> On Thu, 19 Dec 2024 at 15:36, Michal Luczaj wrote: >>>> >>>> On 12/19/24 09:19, Stefano Garzarella wrote: >>>>> ... >>>>> I think the best thing though is to better understand how to handle >>>>> deassign, rather than checking everywhere that it's not null, also >>>>> because in some cases (like the one in virtio-vsock), it's also >>>>> important that the transport is the same. >>>> >>>> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make >>>> it impossible-by-design to switch ->transport from non-NULL to NULL in >>>> vsock_assign_transport(). >>> >>> I don't know if that's enough, in this case the problem is that some >>> response packets are intended for a socket, where the transport has >>> changed. So whether it's null or assigned but different, it's still a >>> problem we have to handle. >>> >>> So making it impossible for the transport to be null, but allowing it >>> to be different (we can't prevent it from changing), doesn't solve the >>> problem for us, it only shifts it. >> >> Got it. I assumed this issue would be solved by `vsk->transport != >> &t->transport` in the critical place(s). >> >> (Note that BPF doesn't care if transport has changed; BPF just expects to >> have _a_ transport.) >> >>>> If I'm not mistaken, that would require rewriting vsock_assign_transport() >>>> so that a new transport is assigned only once fully initialized, otherwise >>>> keep the old one (still unhurt and functional) and return error. Because >>>> failing connect() should not change anything under the hood, right? >>>> >>> >>> Nope, connect should be able to change the transport. >>> >>> Because a user can do an initial connect() that requires a specific >>> transport, this one fails maybe because there's no peer with that cid. >>> Then the user can redo the connect() to a different cid that requires >>> a different transport. >> >> But the initial connect() failing does not change anything under the hood >> (transport should/could stay NULL). > > Nope, isn't null, it's assigned to a transport, because for example it > has to send a packet to connect to the remote CID and wait back for a > response that for example says the CID doesn't exist. Ahh, I think I get it. So the initial connect() passed vsock_assign_transport() successfully and then failed deeper in vsock_connect(), right? That's fine. Let the socket have a useless transport (a valid pointer nevertheless). Sure, upcoming connect() can assign a new (possibly useless just as well) transport, but there's no reason to allow ->transport becoming NULL. And a pre-connect socket (where ->transport==NULL) is not an issue, because BPF won't let it in any sockmap, so vsock_bpf_recvmsg() won't be reachable. Anywa, thanks for explaining, Michal PS. Or ignore the above and remove the socket from the sockmap at every reconnect? Possible unhash abuse: diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 5cf8109f672a..8a65153ee186 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -483,6 +483,10 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) if (vsk->transport == new_transport) return 0; + const struct proto *prot = READ_ONCE(sk->sk_prot); + if (prot->unhash) + prot->unhash(sk); + /* transport->release() must be called with sock lock acquired. 
* This path can only be taken during vsock_connect(), where we * have already held the sock lock. In the other cases, this diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c index 4aa6e74ec295..80deb4d70aea 100644 --- a/net/vmw_vsock/vsock_bpf.c +++ b/net/vmw_vsock/vsock_bpf.c @@ -119,6 +119,7 @@ static void vsock_bpf_rebuild_protos(struct proto *prot, const struct proto *bas *prot= *base; prot->close = sock_map_close; prot->recvmsg = vsock_bpf_recvmsg; + prot->unhash = sock_map_unhash; prot->sock_is_readable = sk_msg_is_readable; } >> Then a successful re-connect assigns >> the transport (NULL -> non-NULL). And it's all good because all I wanted to >> avoid (because of BPF) was non-NULL -> NULL. Anyway, that's my possibly >> shallow understanding :)
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Thu, 19 Dec 2024 at 16:05, Michal Luczaj wrote: > > On 12/19/24 15:48, Stefano Garzarella wrote: > > On Thu, 19 Dec 2024 at 15:36, Michal Luczaj wrote: > >> > >> On 12/19/24 09:19, Stefano Garzarella wrote: > >>> ... > >>> I think the best thing though is to better understand how to handle > >>> deassign, rather than checking everywhere that it's not null, also > >>> because in some cases (like the one in virtio-vsock), it's also > >>> important that the transport is the same. > >> > >> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make > >> it impossible-by-design to switch ->transport from non-NULL to NULL in > >> vsock_assign_transport(). > > > > I don't know if that's enough, in this case the problem is that some > > response packets are intended for a socket, where the transport has > > changed. So whether it's null or assigned but different, it's still a > > problem we have to handle. > > > > So making it impossible for the transport to be null, but allowing it > > to be different (we can't prevent it from changing), doesn't solve the > > problem for us, it only shifts it. > > Got it. I assumed this issue would be solved by `vsk->transport != > &t->transport` in the critical place(s). > > (Note that BPF doesn't care if transport has changed; BPF just expects to > have _a_ transport.) > > >> If I'm not mistaken, that would require rewriting vsock_assign_transport() > >> so that a new transport is assigned only once fully initialized, otherwise > >> keep the old one (still unhurt and functional) and return error. Because > >> failing connect() should not change anything under the hood, right? > >> > > > > Nope, connect should be able to change the transport. > > > > Because a user can do an initial connect() that requires a specific > > transport, this one fails maybe because there's no peer with that cid. > > Then the user can redo the connect() to a different cid that requires > > a different transport. > > But the initial connect() failing does not change anything under the hood > (transport should/could stay NULL). Nope, isn't null, it's assigned to a transport, because for example it has to send a packet to connect to the remote CID and wait back for a response that for example says the CID doesn't exist. > Then a successful re-connect assigns > the transport (NULL -> non-NULL). And it's all good because all I wanted to > avoid (because of BPF) was non-NULL -> NULL. Anyway, that's my possibly > shallow understanding :) >
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On 12/19/24 15:48, Stefano Garzarella wrote: > On Thu, 19 Dec 2024 at 15:36, Michal Luczaj wrote: >> >> On 12/19/24 09:19, Stefano Garzarella wrote: >>> ... >>> I think the best thing though is to better understand how to handle >>> deassign, rather than checking everywhere that it's not null, also >>> because in some cases (like the one in virtio-vsock), it's also >>> important that the transport is the same. >> >> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make >> it impossible-by-design to switch ->transport from non-NULL to NULL in >> vsock_assign_transport(). > > I don't know if that's enough, in this case the problem is that some > response packets are intended for a socket, where the transport has > changed. So whether it's null or assigned but different, it's still a > problem we have to handle. > > So making it impossible for the transport to be null, but allowing it > to be different (we can't prevent it from changing), doesn't solve the > problem for us, it only shifts it. Got it. I assumed this issue would be solved by `vsk->transport != &t->transport` in the critical place(s). (Note that BPF doesn't care if transport has changed; BPF just expects to have _a_ transport.) >> If I'm not mistaken, that would require rewriting vsock_assign_transport() >> so that a new transport is assigned only once fully initialized, otherwise >> keep the old one (still unhurt and functional) and return error. Because >> failing connect() should not change anything under the hood, right? >> > > Nope, connect should be able to change the transport. > > Because a user can do an initial connect() that requires a specific > transport, this one fails maybe because there's no peer with that cid. > Then the user can redo the connect() to a different cid that requires > a different transport. But the initial connect() failing does not change anything under the hood (transport should/could stay NULL). Then a successful re-connect assigns the transport (NULL -> non-NULL). And it's all good because all I wanted to avoid (because of BPF) was non-NULL -> NULL. Anyway, that's my possibly shallow understanding :)
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Thu, 19 Dec 2024 at 15:36, Michal Luczaj wrote:
>
> On 12/19/24 09:19, Stefano Garzarella wrote:
> > ...
> > I think the best thing though is to better understand how to handle
> > deassign, rather than checking everywhere that it's not null, also
> > because in some cases (like the one in virtio-vsock), it's also
> > important that the transport is the same.
>
> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
> it impossible-by-design to switch ->transport from non-NULL to NULL in
> vsock_assign_transport().

I don't know if that's enough; in this case the problem is that some
response packets are intended for a socket where the transport has
changed. So whether it's null or assigned but different, it's still a
problem we have to handle.

So making it impossible for the transport to be null, but allowing it
to be different (we can't prevent it from changing), doesn't solve the
problem for us, it only shifts it.

> If I'm not mistaken, that would require rewriting vsock_assign_transport()
> so that a new transport is assigned only once fully initialized, otherwise
> keep the old one (still unhurt and functional) and return error. Because
> failing connect() should not change anything under the hood, right?
>

Nope, connect() should be able to change the transport.

Because a user can do an initial connect() that requires a specific
transport; this one fails maybe because there's no peer with that cid.
Then the user can redo the connect() to a different cid that requires
a different transport.

Stefano
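For context, the transport is picked from the remote CID, which is why the two connect() calls above can legitimately end up on different transports. A simplified sketch, loosely modelled on vsock_assign_transport() (namespace and flag handling omitted; not the exact upstream logic):

/* Simplified, for illustration: the remote CID decides the transport.
 * transport_local / transport_g2h / transport_h2g are the registered
 * loopback, guest-to-host and host-to-guest transports.
 */
static const struct vsock_transport *
vsock_pick_transport_sketch(u32 remote_cid)
{
	if (remote_cid == VMADDR_CID_LOCAL)
		return transport_local;		/* vsock_loopback */

	if (remote_cid <= VMADDR_CID_HOST)
		return transport_g2h;		/* we are a guest talking to the host */

	return transport_h2g;			/* we are the host talking to a guest */
}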
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On 12/19/24 09:19, Stefano Garzarella wrote:
> ...
> I think the best thing though is to better understand how to handle
> deassign, rather than checking everywhere that it's not null, also
> because in some cases (like the one in virtio-vsock), it's also
> important that the transport is the same.

My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
it impossible-by-design to switch ->transport from non-NULL to NULL in
vsock_assign_transport().

If I'm not mistaken, that would require rewriting vsock_assign_transport()
so that a new transport is assigned only once fully initialized, otherwise
keep the old one (still unhurt and functional) and return an error. Because
a failing connect() should not change anything under the hood, right?
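In pseudo-patch form, the rewrite suggested above could look roughly like this. It is an untested sketch, not the upstream code; locking, namespace checks and the g2h/h2g/local transport selection are left out, and the function name is illustrative:

/* Untested sketch: fully set up the new transport before the old one is
 * released, so a failed connect() leaves vsk->transport exactly as it was
 * (it never transitions from non-NULL to NULL).
 */
static int vsock_assign_transport_sketch(struct vsock_sock *vsk,
					 struct vsock_sock *psk,
					 const struct vsock_transport *new_transport)
{
	int ret;

	if (vsk->transport == new_transport)
		return 0;

	if (!new_transport || !try_module_get(new_transport->module))
		return -ENODEV;		/* old transport left untouched */

	ret = new_transport->init(vsk, psk);
	if (ret) {
		module_put(new_transport->module);
		return ret;		/* old transport still valid and usable */
	}

	/* Only now drop the previous transport, if any. */
	if (vsk->transport) {
		vsk->transport->release(vsk);
		vsock_deassign_transport(vsk);
	}

	vsk->transport = new_transport;
	return 0;
}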
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Wed, Dec 18, 2024 at 08:37:38PM -0500, Hyunwoo Kim wrote: On Thu, Dec 19, 2024 at 01:25:34AM +0100, Michal Luczaj wrote: On 12/18/24 16:51, Hyunwoo Kim wrote: > On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote: >> On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote: >>> On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote: At least for vsock_loopback.c, this change doesn’t seem to introduce any particular issues. >>> >>> But was it working for you? because the check was wrong, this one should >>> work, but still, I didn't have time to test it properly, I'll do later. >>> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c >>> index 9acc13ab3f82..ddecf6e430d6 100644 >>> --- a/net/vmw_vsock/virtio_transport_common.c >>> +++ b/net/vmw_vsock/virtio_transport_common.c >>> @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport *t, >>>lock_sock(sk); >>> - /* Check if sk has been closed before lock_sock */ >>> - if (sock_flag(sk, SOCK_DONE)) { >>> + /* Check if sk has been closed or assigned to another transport before >>> +* lock_sock >>> +*/ >>> + if (sock_flag(sk, SOCK_DONE) || vsk->transport != &t->transport) { >>>(void)virtio_transport_reset_no_sock(t, skb); >>>release_sock(sk); >>>sock_put(sk); Hi, I got curious about this race, my 2 cents: Your patch seems to fix the reported issue, but there's also a variant (as in: transport going null unexpectedly) involving BPF: I think it is a different problem, to be solved in vsock_bpf.c With something like this (untested): diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c index 4aa6e74ec295..8c2322dc2af7 100644 --- a/net/vmw_vsock/vsock_bpf.c +++ b/net/vmw_vsock/vsock_bpf.c @@ -25,10 +25,8 @@ static struct proto vsock_bpf_prot; static bool vsock_has_data(struct sock *sk, struct sk_psock *psock) { struct vsock_sock *vsk = vsock_sk(sk); - s64 ret; - ret = vsock_connectible_has_data(vsk); - if (ret > 0) + if (vsk->transport && vsock_connectible_has_data(vsk) > 0) return true; return vsock_sk_has_data(sk, psock); Note: we should check better this, because sometime we call it without lock_sock, also thi Yes. It seems that calling connect() twice causes the transport to become NULL, leading to null-ptr-deref in any flow that tries to access that transport. We already expect vsk->transport to be null in several parts, but in some we assume it is called when this is valid. So we should check better what we do when we deassign a transport. And that null-ptr-deref occurs because, unlike __vsock_stream_recvmsg, vsock_bpf_recvmsg does not check vsock->transport: Right. So, thanks for the report, I'll try to see if I can make a patch with everything before tomorrow, because then I'm gone for 2 weeks. Otherwise we'll see as soon as I get back or if you have time in the meantime, any solution is welcome. I think the best thing though is to better understand how to handle deassign, rather than checking everywhere that it's not null, also because in some cases (like the one in virtio-vsock), it's also important that the transport is the same. Thanks, Stefano ``` int __vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, int flags) { ... lock_sock(sk); transport = vsk->transport; if (!transport || sk->sk_state != TCP_ESTABLISHED) { /* Recvmsg is supposed to return 0 if a peer performs an * orderly shutdown. 
Differentiate between that case and when a * peer has not connected or a local shutdown occurred with the * SOCK_DONE flag. */ if (sock_flag(sk, SOCK_DONE)) err = 0; else err = -ENOTCONN; goto out; } ``` /* $ gcc vsock-transport.c && sudo ./a.out BUG: kernel NULL pointer dereference, address: 00a0 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0 Oops: Oops: [#1] PREEMPT SMP NOPTI CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+ RIP: 0010:vsock_connectible_has_data+0x1f/0x40 Call Trace: vsock_bpf_recvmsg+0xca/0x5e0 sock_recvmsg+0xb9/0xc0 __sys_recvfrom+0xb3/0x130 __x64_sys_recvfrom+0x20/0x30 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e */ #include #include #include #include #include #include #include #include static void die(const char *msg) { perror(msg); exit(-1); } static int create_sockmap(void) { union bpf_attr attr = { .map_type = BPF_MAP_TYPE_SOCKMAP, .key_size = sizeof(int),
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Thu, Dec 19, 2024 at 01:25:34AM +0100, Michal Luczaj wrote: > On 12/18/24 16:51, Hyunwoo Kim wrote: > > On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote: > >> On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote: > >>> On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote: > At least for vsock_loopback.c, this change doesn’t seem to introduce any > particular issues. > >>> > >>> But was it working for you? because the check was wrong, this one should > >>> work, but still, I didn't have time to test it properly, I'll do later. > >>> > >>> diff --git a/net/vmw_vsock/virtio_transport_common.c > >>> b/net/vmw_vsock/virtio_transport_common.c > >>> index 9acc13ab3f82..ddecf6e430d6 100644 > >>> --- a/net/vmw_vsock/virtio_transport_common.c > >>> +++ b/net/vmw_vsock/virtio_transport_common.c > >>> @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct > >>> virtio_transport *t, > >>>lock_sock(sk); > >>> - /* Check if sk has been closed before lock_sock */ > >>> - if (sock_flag(sk, SOCK_DONE)) { > >>> + /* Check if sk has been closed or assigned to another transport > >>> before > >>> +* lock_sock > >>> +*/ > >>> + if (sock_flag(sk, SOCK_DONE) || vsk->transport != &t->transport) { > >>>(void)virtio_transport_reset_no_sock(t, skb); > >>>release_sock(sk); > >>>sock_put(sk); > > Hi, I got curious about this race, my 2 cents: > > Your patch seems to fix the reported issue, but there's also a variant (as > in: transport going null unexpectedly) involving BPF: Yes. It seems that calling connect() twice causes the transport to become NULL, leading to null-ptr-deref in any flow that tries to access that transport. And that null-ptr-deref occurs because, unlike __vsock_stream_recvmsg, vsock_bpf_recvmsg does not check vsock->transport: ``` int __vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, int flags) { ... lock_sock(sk); transport = vsk->transport; if (!transport || sk->sk_state != TCP_ESTABLISHED) { /* Recvmsg is supposed to return 0 if a peer performs an * orderly shutdown. Differentiate between that case and when a * peer has not connected or a local shutdown occurred with the * SOCK_DONE flag. 
*/ if (sock_flag(sk, SOCK_DONE)) err = 0; else err = -ENOTCONN; goto out; } ``` > > /* > $ gcc vsock-transport.c && sudo ./a.out > > BUG: kernel NULL pointer dereference, address: 00a0 > #PF: supervisor read access in kernel mode > #PF: error_code(0x) - not-present page > PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0 > Oops: Oops: [#1] PREEMPT SMP NOPTI > CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+ > RIP: 0010:vsock_connectible_has_data+0x1f/0x40 > Call Trace: > vsock_bpf_recvmsg+0xca/0x5e0 > sock_recvmsg+0xb9/0xc0 > __sys_recvfrom+0xb3/0x130 > __x64_sys_recvfrom+0x20/0x30 > do_syscall_64+0x93/0x180 > entry_SYSCALL_64_after_hwframe+0x76/0x7e > */ > > #include > #include > #include > #include > #include > #include > #include > #include > > static void die(const char *msg) > { > perror(msg); > exit(-1); > } > > static int create_sockmap(void) > { > union bpf_attr attr = { > .map_type = BPF_MAP_TYPE_SOCKMAP, > .key_size = sizeof(int), > .value_size = sizeof(int), > .max_entries = 1 > }; > int map; > > map = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); > if (map < 0) > die("create_sockmap"); > > return map; > } > > static void map_update_elem(int fd, int key, int value) > { > union bpf_attr attr = { > .map_fd = fd, > .key = (uint64_t)&key, > .value = (uint64_t)&value, > .flags = BPF_ANY > }; > > if (syscall(SYS_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr))) > die("map_update_elem"); > } > > int main(void) > { > struct sockaddr_vm addr = { > .svm_family = AF_VSOCK, > .svm_port = VMADDR_PORT_ANY, > .svm_cid = VMADDR_CID_LOCAL > }; > socklen_t alen = sizeof(addr); > int map, s; > > map = create_sockmap(); > > s = socket(AF_VSOCK, SOCK_SEQPACKET, 0); > if (s < 0) > die("socket"); > > if (!connect(s, (struct sockaddr *)&addr, alen)) > die("connect #1"); > perror("ok, connect #1 failed; transport set"); > > map_update_elem(map, 0, s); > > addr.svm_cid = 42; > if (!connect(s, (struct sockaddr *)&addr, alen)) > die("connect #2"); > perror("ok, connect #2 failed;
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On 12/18/24 16:51, Hyunwoo Kim wrote: > On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote: >> On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote: >>> On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote: At least for vsock_loopback.c, this change doesn’t seem to introduce any particular issues. >>> >>> But was it working for you? because the check was wrong, this one should >>> work, but still, I didn't have time to test it properly, I'll do later. >>> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c >>> b/net/vmw_vsock/virtio_transport_common.c >>> index 9acc13ab3f82..ddecf6e430d6 100644 >>> --- a/net/vmw_vsock/virtio_transport_common.c >>> +++ b/net/vmw_vsock/virtio_transport_common.c >>> @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct >>> virtio_transport *t, >>>lock_sock(sk); >>> - /* Check if sk has been closed before lock_sock */ >>> - if (sock_flag(sk, SOCK_DONE)) { >>> + /* Check if sk has been closed or assigned to another transport >>> before >>> +* lock_sock >>> +*/ >>> + if (sock_flag(sk, SOCK_DONE) || vsk->transport != &t->transport) { >>>(void)virtio_transport_reset_no_sock(t, skb); >>>release_sock(sk); >>>sock_put(sk); Hi, I got curious about this race, my 2 cents: Your patch seems to fix the reported issue, but there's also a variant (as in: transport going null unexpectedly) involving BPF: /* $ gcc vsock-transport.c && sudo ./a.out BUG: kernel NULL pointer dereference, address: 00a0 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0 Oops: Oops: [#1] PREEMPT SMP NOPTI CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+ RIP: 0010:vsock_connectible_has_data+0x1f/0x40 Call Trace: vsock_bpf_recvmsg+0xca/0x5e0 sock_recvmsg+0xb9/0xc0 __sys_recvfrom+0xb3/0x130 __x64_sys_recvfrom+0x20/0x30 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e */ #include #include #include #include #include #include #include #include static void die(const char *msg) { perror(msg); exit(-1); } static int create_sockmap(void) { union bpf_attr attr = { .map_type = BPF_MAP_TYPE_SOCKMAP, .key_size = sizeof(int), .value_size = sizeof(int), .max_entries = 1 }; int map; map = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr)); if (map < 0) die("create_sockmap"); return map; } static void map_update_elem(int fd, int key, int value) { union bpf_attr attr = { .map_fd = fd, .key = (uint64_t)&key, .value = (uint64_t)&value, .flags = BPF_ANY }; if (syscall(SYS_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr))) die("map_update_elem"); } int main(void) { struct sockaddr_vm addr = { .svm_family = AF_VSOCK, .svm_port = VMADDR_PORT_ANY, .svm_cid = VMADDR_CID_LOCAL }; socklen_t alen = sizeof(addr); int map, s; map = create_sockmap(); s = socket(AF_VSOCK, SOCK_SEQPACKET, 0); if (s < 0) die("socket"); if (!connect(s, (struct sockaddr *)&addr, alen)) die("connect #1"); perror("ok, connect #1 failed; transport set"); map_update_elem(map, 0, s); addr.svm_cid = 42; if (!connect(s, (struct sockaddr *)&addr, alen)) die("connect #2"); perror("ok, connect #2 failed; transport unset"); recv(s, NULL, 0, 0); return 0; }
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote: > On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote: > > On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote: > > > On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote: > > > > On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote: > > > > > When calling connect to change the CID of a vsock, the loopback > > > > > worker for the VIRTIO_VSOCK_OP_RST command is invoked. > > > > > During this process, vsock_stream_has_data() calls > > > > > vsk->transport->stream_has_data(). > > > > > However, a null-ptr-deref occurs because vsk->transport was set > > > > > to NULL in vsock_deassign_transport(). > > > > > > > > > > cpu0 > > > > > cpu1 > > > > > > > > > > > > > > > socket(A) > > > > > > > > > > bind(A, > > > > > VMADDR_CID_LOCAL) > > > > > > > > > > vsock_bind() > > > > > > > > > > > > > > > listen(A) > > > > > > > > > > vsock_listen() > > > > > socket(B) > > > > > > > > > > connect(B, VMADDR_CID_LOCAL) > > > > > > > > > > connect(B, VMADDR_CID_HYPERVISOR) > > > > >vsock_connect(B) > > > > > lock_sock(sk); > > It shouldn't go on here anyway, because there's this check in > vsock_connect(): > > switch (sock->state) { > case SS_CONNECTED: > err = -EISCONN; > goto out; By using a kill signal, set sock->state to SS_UNCONNECTED and sk->sk_state to TCP_CLOSING before the second connect() is called, causing the worker to execute without the SOCK_DONE flag being set. > > > Indeed if I try, I have this behaviour: > > shell1# python3 > import socket > s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) > s.bind((1,1234)) > s.listen() > > shell2# python3 > import socket > s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) > s.connect((1, 1234)) > s.connect((2, 1234)) > Traceback (most recent call last): > File "", line 1, in > OSError: [Errno 106] Transport endpoint is already connected > > > Where 106 is exactly EISCONN. > So, do you have a better reproducer for that? The full scenario is as follows: ``` cpu0 cpu1 socket(A) bind(A, {cid: VMADDR_CID_LOCAL, port: 1024}) vsock_bind() listen(A) vsock_listen() socket(B) connect(B, {cid: VMADDR_CID_LOCAL, port: 1024}) vsock_connect() lock_sock(sk); virtio_transport_connect() virtio_transport_connect() virtio_transport_send_pkt_info() vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_REQUEST) queue_work(vsock_loopback_work) sk->sk_state = TCP_SYN_SENT; release_sock(sk); vsock_loopback_work() virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_REQUEST) sk = vsock_find_bound_socket(&dst); virtio_transport_recv_listen(sk, skb) child = vsock_create_connected(sk); vsock_assign_transport() vvs = kzalloc(sizeof(*vvs), GFP_KERNEL); vsock_insert_connected(vchild); list_add(&vsk->connected_table, list); virtio_transport_send_response(vchild, skb); virtio_transport_send_pkt_info() vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RESPONSE) queue_work(vsock_loopback_work) vsock_loopback_work() virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RESPONSE)
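A very rough userspace sketch of the signal trick described above (not the reporter's actual PoC; hitting the race window depends on timing, and the port/CID values are only examples):

/* Interrupt the first connect() with a signal so vsock_connect() bails out
 * with sock->state = SS_UNCONNECTED, then immediately connect() again to a
 * CID served by another transport, forcing a transport reassignment while
 * the loopback worker may still be processing packets for the old one.
 */
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

static void noop(int sig) { (void)sig; }

static int interrupted_reconnect(int s, unsigned int port)
{
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
		.svm_port = port,
		.svm_cid = VMADDR_CID_LOCAL,
	};
	struct sigaction sa = { .sa_handler = noop };	/* no SA_RESTART */

	sigaction(SIGALRM, &sa, NULL);
	ualarm(1000, 0);			/* fire while connect() is still waiting */
	connect(s, (struct sockaddr *)&addr, sizeof(addr));

	addr.svm_cid = VMADDR_CID_HYPERVISOR;	/* resolves to a different transport */
	return connect(s, (struct sockaddr *)&addr, sizeof(addr));
}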
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote: On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote: On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote: On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote: When calling connect to change the CID of a vsock, the loopback worker for the VIRTIO_VSOCK_OP_RST command is invoked. During this process, vsock_stream_has_data() calls vsk->transport->stream_has_data(). However, a null-ptr-deref occurs because vsk->transport was set to NULL in vsock_deassign_transport(). cpu0 cpu1 socket(A) bind(A, VMADDR_CID_LOCAL) vsock_bind() listen(A) vsock_listen() socket(B) connect(B, VMADDR_CID_LOCAL) connect(B, VMADDR_CID_HYPERVISOR) vsock_connect(B) lock_sock(sk); It shouldn't go on here anyway, because there's this check in vsock_connect(): switch (sock->state) { case SS_CONNECTED: err = -EISCONN; goto out; Indeed if I try, I have this behaviour: shell1# python3 import socket s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) s.bind((1,1234)) s.listen() shell2# python3 import socket s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM) s.connect((1, 1234)) s.connect((2, 1234)) Traceback (most recent call last): File "", line 1, in OSError: [Errno 106] Transport endpoint is already connected Where 106 is exactly EISCONN. So, do you have a better reproducer for that? Would be nice to add a test in tools/testing/vsock/vsock_test.c Thanks, Stefano vsock_assign_transport() virtio_transport_release() virtio_transport_close() virtio_transport_shutdown() virtio_transport_send_pkt_info() vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) queue_work(vsock_loopback_work) vsock_deassign_transport() vsk->transport = NULL; vsock_loopback_work() virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) virtio_transport_recv_connected() virtio_transport_reset() virtio_transport_send_pkt_info() vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST) queue_work(vsock_loopback_work) vsock_loopback_work() virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST) virtio_transport_recv_disconnecting() virtio_transport_do_close() vsock_stream_has_data() vsk->transport->stream_has_data(vsk);// null-ptr-deref To resolve this issue, add a check for vsk->transport, similar to functions like vsock_send_shutdown(). Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct vsock_sock") Signed-off-by: Hyunwoo Kim Signed-off-by: Wongi Lee --- net/vmw_vsock/af_vsock.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 5cf8109f672a..a0c008626798 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected); s64 vsock_stream_has_data(struct vsock_sock *vsk) { + if (!vsk->transport) + return 0; + I understand that this alleviates the problem, but IMO it is not the right solution. We should understand why we're still processing the packet in the context of this socket if it's no longer assigned to the right transport. Got it. I agree with you. Maybe we can try to improve virtio_transport_recv_pkt() and check if the vsk->transport is what we expect, I mean something like this (untested): diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 9acc13ab3f82..18b91149a62e 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virti
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote: On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote: On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote: > When calling connect to change the CID of a vsock, the loopback > worker for the VIRTIO_VSOCK_OP_RST command is invoked. > During this process, vsock_stream_has_data() calls > vsk->transport->stream_has_data(). > However, a null-ptr-deref occurs because vsk->transport was set > to NULL in vsock_deassign_transport(). > > cpu0 cpu1 > > socket(A) > > bind(A, VMADDR_CID_LOCAL) > vsock_bind() > > listen(A) > vsock_listen() > socket(B) > > connect(B, VMADDR_CID_LOCAL) > > connect(B, VMADDR_CID_HYPERVISOR) >vsock_connect(B) > lock_sock(sk); > vsock_assign_transport() >virtio_transport_release() > virtio_transport_close() >virtio_transport_shutdown() > virtio_transport_send_pkt_info() >vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) > queue_work(vsock_loopback_work) >vsock_deassign_transport() > vsk->transport = NULL; > vsock_loopback_work() > virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) > virtio_transport_recv_connected() > virtio_transport_reset() > virtio_transport_send_pkt_info() > vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST) > queue_work(vsock_loopback_work) > > vsock_loopback_work() > virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST) > virtio_transport_recv_disconnecting() > virtio_transport_do_close() > vsock_stream_has_data() > vsk->transport->stream_has_data(vsk);// null-ptr-deref > > To resolve this issue, add a check for vsk->transport, similar to > functions like vsock_send_shutdown(). > > Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct vsock_sock") > Signed-off-by: Hyunwoo Kim > Signed-off-by: Wongi Lee > --- > net/vmw_vsock/af_vsock.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c > index 5cf8109f672a..a0c008626798 100644 > --- a/net/vmw_vsock/af_vsock.c > +++ b/net/vmw_vsock/af_vsock.c > @@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected); > > s64 vsock_stream_has_data(struct vsock_sock *vsk) > { > + if (!vsk->transport) > + return 0; > + I understand that this alleviates the problem, but IMO it is not the right solution. We should understand why we're still processing the packet in the context of this socket if it's no longer assigned to the right transport. Got it. I agree with you. Maybe we can try to improve virtio_transport_recv_pkt() and check if the vsk->transport is what we expect, I mean something like this (untested): diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 9acc13ab3f82..18b91149a62e 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport *t, lock_sock(sk); - /* Check if sk has been closed before lock_sock */ - if (sock_flag(sk, SOCK_DONE)) { + /* Check if sk has been closed or assigned to another transport before +* lock_sock +*/ + if (sock_flag(sk, SOCK_DONE) || vsk->transport != t) { (void)virtio_transport_reset_no_sock(t, skb); release_sock(sk); sock_put(sk); BTW I'm not sure it is the best solution, we have to check that we do not introduce strange cases, but IMHO we have to solve the problem earlier in virtio_transport_recv_pkt(). At least for vsock_loopback.c, this change doesn’t seem to introduce any particular issues. But was it working for yo
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote: > On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote: > > When calling connect to change the CID of a vsock, the loopback > > worker for the VIRTIO_VSOCK_OP_RST command is invoked. > > During this process, vsock_stream_has_data() calls > > vsk->transport->stream_has_data(). > > However, a null-ptr-deref occurs because vsk->transport was set > > to NULL in vsock_deassign_transport(). > > > > cpu0 > > cpu1 > > > > socket(A) > > > > bind(A, > > VMADDR_CID_LOCAL) > > vsock_bind() > > > > listen(A) > > > > vsock_listen() > > socket(B) > > > > connect(B, VMADDR_CID_LOCAL) > > > > connect(B, VMADDR_CID_HYPERVISOR) > >vsock_connect(B) > > lock_sock(sk); > > vsock_assign_transport() > >virtio_transport_release() > > virtio_transport_close() > >virtio_transport_shutdown() > > virtio_transport_send_pkt_info() > >vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) > > queue_work(vsock_loopback_work) > >vsock_deassign_transport() > > vsk->transport = NULL; > > > > vsock_loopback_work() > > > > virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) > > > > virtio_transport_recv_connected() > > > > virtio_transport_reset() > > > > virtio_transport_send_pkt_info() > > > > vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST) > > > > queue_work(vsock_loopback_work) > > > > > > vsock_loopback_work() > > > > virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST) > > > > virtio_transport_recv_disconnecting() > > > > virtio_transport_do_close() > > > > vsock_stream_has_data() > > > > vsk->transport->stream_has_data(vsk);// null-ptr-deref > > > > To resolve this issue, add a check for vsk->transport, similar to > > functions like vsock_send_shutdown(). > > > > Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct > > vsock_sock") > > Signed-off-by: Hyunwoo Kim > > Signed-off-by: Wongi Lee > > --- > > net/vmw_vsock/af_vsock.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c > > index 5cf8109f672a..a0c008626798 100644 > > --- a/net/vmw_vsock/af_vsock.c > > +++ b/net/vmw_vsock/af_vsock.c > > @@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected); > > > > s64 vsock_stream_has_data(struct vsock_sock *vsk) > > { > > + if (!vsk->transport) > > + return 0; > > + > > I understand that this alleviates the problem, but IMO it is not the right > solution. We should understand why we're still processing the packet in the > context of this socket if it's no longer assigned to the right transport. Got it. I agree with you. > > Maybe we can try to improve virtio_transport_recv_pkt() and check if the > vsk->transport is what we expect, I mean something like this (untested): > > diff --git a/net/vmw_vsock/virtio_transport_common.c > b/net/vmw_vsock/virtio_transport_common.c > index 9acc13ab3f82..18b91149a62e 100644 > --- a/net/vmw_vsock/virtio_transport_common.c > +++ b/net/vmw_vsock/virtio_transport_common.c > @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport > *t, > > lock_sock(sk); > > - /* Check if sk has been closed before lock_sock */ > - if (sock_flag(sk, SOCK_DONE)) { > + /* Check if sk has been closed or assigned to another transport before > +* lock_sock > +*/ > + if (sock_flag(sk, SOCK_DONE) || vsk->transport != t) { > (void)virtio_transport_reset_no_sock(t, skb); > release_sock(sk); > sock_put(sk); > > BTW I'm not sure it is the best solution, we have to check that we do not > intro
Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data
On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote: When calling connect to change the CID of a vsock, the loopback worker for the VIRTIO_VSOCK_OP_RST command is invoked. During this process, vsock_stream_has_data() calls vsk->transport->stream_has_data(). However, a null-ptr-deref occurs because vsk->transport was set to NULL in vsock_deassign_transport(). cpu0 cpu1 socket(A) bind(A, VMADDR_CID_LOCAL) vsock_bind() listen(A) vsock_listen() socket(B) connect(B, VMADDR_CID_LOCAL) connect(B, VMADDR_CID_HYPERVISOR) vsock_connect(B) lock_sock(sk); vsock_assign_transport() virtio_transport_release() virtio_transport_close() virtio_transport_shutdown() virtio_transport_send_pkt_info() vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) queue_work(vsock_loopback_work) vsock_deassign_transport() vsk->transport = NULL; vsock_loopback_work() virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN) virtio_transport_recv_connected() virtio_transport_reset() virtio_transport_send_pkt_info() vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST) queue_work(vsock_loopback_work) vsock_loopback_work() virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST) virtio_transport_recv_disconnecting() virtio_transport_do_close() vsock_stream_has_data() vsk->transport->stream_has_data(vsk);// null-ptr-deref To resolve this issue, add a check for vsk->transport, similar to functions like vsock_send_shutdown(). Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct vsock_sock") Signed-off-by: Hyunwoo Kim Signed-off-by: Wongi Lee --- net/vmw_vsock/af_vsock.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 5cf8109f672a..a0c008626798 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected); s64 vsock_stream_has_data(struct vsock_sock *vsk) { + if (!vsk->transport) + return 0; + I understand that this alleviates the problem, but IMO it is not the right solution. We should understand why we're still processing the packet in the context of this socket if it's no longer assigned to the right transport. Maybe we can try to improve virtio_transport_recv_pkt() and check if the vsk->transport is what we expect, I mean something like this (untested): diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 9acc13ab3f82..18b91149a62e 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport *t, lock_sock(sk); - /* Check if sk has been closed before lock_sock */ - if (sock_flag(sk, SOCK_DONE)) { + /* Check if sk has been closed or assigned to another transport before +* lock_sock +*/ + if (sock_flag(sk, SOCK_DONE) || vsk->transport != t) { (void)virtio_transport_reset_no_sock(t, skb); release_sock(sk); sock_put(sk); BTW I'm not sure it is the best solution, we have to check that we do not introduce strange cases, but IMHO we have to solve the problem earlier in virtio_transport_recv_pkt(). Thanks, Stefano return vsk->transport->stream_has_data(vsk); } EXPORT_SYMBOL_GPL(vsock_stream_has_data); -- 2.34.1
Re: [PATCH net v2 0/3] virtio/vsock: Fix memory leaks
On 11/19/24 11:31, Stefano Garzarella wrote: > Hi Michal, > > On Thu, Nov 07, 2024 at 09:46:11PM +0100, Michal Luczaj wrote: >> Short series fixing some memory leaks that I've stumbled upon while >> toying >> with the selftests. > > Are these tests already upstream? > I would like to add them to my suite, can you tell me how to run them? CC got accidentally stripped, sorry. Long story short: https://lore.kernel.org/netdev/20241206-test-vsock-leaks-v1-0-c31e8c875...@rbox.co/ Thanks, Michal
Re: [PATCH net v2 0/3] virtio/vsock: Fix memory leaks
Hi Michal, On Thu, Nov 07, 2024 at 09:46:11PM +0100, Michal Luczaj wrote: Short series fixing some memory leaks that I've stumbled upon while toying with the selftests. Are these tests already upstream? I would like to add them to my suite, can you tell me how to run them? Thanks, Stefano Signed-off-by: Michal Luczaj --- Changes in v2: - Remove the refactoring patch from the series [Stefano] - PATCH 2: Drop "virtio" from the commit title [Stefano] - Collect Reviewed-by [Stefano] - Link to v1: https://lore.kernel.org/r/20241106-vsock-mem-leaks-v1-0-8f4ffc309...@rbox.co --- Michal Luczaj (3): virtio/vsock: Fix accept_queue memory leak vsock: Fix sk_error_queue memory leak virtio/vsock: Improve MSG_ZEROCOPY error handling net/vmw_vsock/af_vsock.c| 3 +++ net/vmw_vsock/virtio_transport_common.c | 9 + 2 files changed, 12 insertions(+) --- base-commit: 71712cf519faeed529549a79559c06c7fc250a15 change-id: 20241106-vsock-mem-leaks-9b63e912560a Best regards, -- Michal Luczaj
Re: [PATCH net-next v4 00/13] virtio-net: support AF_XDP zero copy (tx)
Hello: This series was applied to netdev/net-next.git (main) by Jakub Kicinski : On Tue, 12 Nov 2024 09:29:15 +0800 you wrote: > v4: > 1. rebase net-next > 2. update the kdoc for the new APIs > > v3: > 1. use sg_dma_address/length api to set the premapped sg > 2. remove 'premapped' parameter from the new APIs > 3. tweak the comment of commit #2,#3 > > [...] Here is the summary with links: - [net-next,v4,01/13] virtio_ring: introduce vring_need_unmap_buffer https://git.kernel.org/netdev/net-next/c/9f19c084057a - [net-next,v4,02/13] virtio_ring: split: record extras for indirect buffers https://git.kernel.org/netdev/net-next/c/bc2b4c3401c6 - [net-next,v4,03/13] virtio_ring: packed: record extras for indirect buffers https://git.kernel.org/netdev/net-next/c/aaa789843a93 - [net-next,v4,04/13] virtio_ring: perform premapped operations based on per-buffer https://git.kernel.org/netdev/net-next/c/c7e1b422afac - [net-next,v4,05/13] virtio_ring: introduce add api for premapped https://git.kernel.org/netdev/net-next/c/3ef66af31fea - [net-next,v4,06/13] virtio-net: rq submits premapped per-buffer https://git.kernel.org/netdev/net-next/c/31f3cd4e5756 - [net-next,v4,07/13] virtio_ring: remove API virtqueue_set_dma_premapped https://git.kernel.org/netdev/net-next/c/880ebcbe0663 - [net-next,v4,08/13] virtio_net: refactor the xmit type https://git.kernel.org/netdev/net-next/c/7db956707f5f - [net-next,v4,09/13] virtio_net: xsk: bind/unbind xsk for tx https://git.kernel.org/netdev/net-next/c/21a4e3ce6dc7 - [net-next,v4,10/13] virtio_net: xsk: prevent disable tx napi https://git.kernel.org/netdev/net-next/c/1df5116a41a8 - [net-next,v4,11/13] virtio_net: xsk: tx: support xmit xsk buffer https://git.kernel.org/netdev/net-next/c/89f86675cb03 - [net-next,v4,12/13] virtio_net: update tx timeout record https://git.kernel.org/netdev/net-next/c/e2c5c57f1af8 - [net-next,v4,13/13] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY https://git.kernel.org/netdev/net-next/c/37e0ca657a3d You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
>From: Jason Wang >Sent: Wednesday, November 13, 2024 2:43 AM >To: Ilpo Järvinen >Cc: Chia-Yu Chang (Nokia) ; Michael S. >Tsirkin ; virtualizat...@lists.linux.dev; Eric Dumazet >; netdev@vger.kernel.org; dsah...@gmail.com; >da...@davemloft.net; dsah...@kernel.org; pab...@redhat.com; >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; >netfilter-de...@vger.kernel.org; coret...@netfilter.org; ncardw...@google.com; >Koen De Schepper (Nokia) ; >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; >jason_living...@comcast.com; vidhi_g...@apple.com >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & >better AccECN handling > >[You don't often get email from jasow...@redhat.com. Learn why this is >important at https://aka.ms/LearnAboutSenderIdentification ] > >CAUTION: This is an external email. Please be very careful when clicking links >or opening attachments. See the URL nok.it/ext for additional information. > > > >On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen wrote: >> >> Adding a few virtio people. Please see the virtio spec/flag question >> below. >> >> On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote: >> >> > >From: Ilpo Järvinen >> > >Sent: Thursday, November 7, 2024 8:28 PM >> > >To: Eric Dumazet >> > >Cc: Chia-Yu Chang (Nokia) ; >> > >netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; >> > >dsah...@kernel.org; pab...@redhat.com; joel.grana...@kernel.org; >> > >k...@kernel.org; andrew+net...@lunn.ch; ho...@kernel.org; >> > >pa...@netfilter.org; kad...@netfilter.org; >> > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; >> > >ncardw...@google.com; Koen De Schepper (Nokia) >> > >; g.wh...@cablelabs.com; >> > >ingemar.s.johans...@ericsson.com; mirja.kuehlew...@ericsson.com; >> > >chesh...@apple.com; rs.i...@gmx.at; jason_living...@comcast.com; >> > >vidhi_g...@apple.com >> > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field >> > >corruption & better AccECN handling >> > > >> > > >> > >CAUTION: This is an external email. Please be very careful when clicking >> > >links or opening attachments. See the URL nok.it/ext for additional >> > >information. >> > > >> > > >> > > >> > >On Thu, 7 Nov 2024, Eric Dumazet wrote: >> > > >> > >>On Tue, Nov 5, 2024 at 11:07 AM >> > >>wrote: >> > >>> >> > >>> From: Ilpo Järvinen >> > >>> >> > >>> There are important differences in how the CWR field behaves in >> > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE >> > >>> counter and its changes are important so adjust the flags >> > >>> changed mask accordingly. >> > >>> >> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid >> > >>> corrupting CWR flag somewhere. 
>> > >>> >> > >>> Signed-off-by: Ilpo Järvinen >> > >>> Signed-off-by: Chia-Yu Chang >> > >>> --- >> > >>> net/ipv4/tcp_offload.c | 4 ++-- >> > >>> 1 file changed, 2 insertions(+), 2 deletions(-) >> > >>> >> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c >> > >>> index >> > >>> 0b05f30e9e5f..f59762d88c38 100644 >> > >>> --- a/net/ipv4/tcp_offload.c >> > >>> +++ b/net/ipv4/tcp_offload.c >> > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head >> > >>> *head, struct sk_buff *skb, >> > >>> th2 = tcp_hdr(p); >> > >>> flush = (__force int)(flags & TCP_FLAG_CWR); >> > >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) & >> > >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); >> > >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH)); >> > >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq); >> > >>> for (i = sizeof(*th); i < thlen; i += 4) >> > >>> flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 >> > >>> +405,7
Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen wrote: > > Adding a few virtio people. Please see the virtio spec/flag question > below. > > On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote: > > > >From: Ilpo Järvinen > > >Sent: Thursday, November 7, 2024 8:28 PM > > >To: Eric Dumazet > > >Cc: Chia-Yu Chang (Nokia) ; > > >netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; > > >dsah...@kernel.org; pab...@redhat.com; joel.grana...@kernel.org; > > >k...@kernel.org; andrew+net...@lunn.ch; ho...@kernel.org; > > >pa...@netfilter.org; kad...@netfilter.org; > > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; > > >ncardw...@google.com; Koen De Schepper (Nokia) > > >; g.wh...@cablelabs.com; > > >ingemar.s.johans...@ericsson.com; mirja.kuehlew...@ericsson.com; > > >chesh...@apple.com; rs.i...@gmx.at; jason_living...@comcast.com; > > >vidhi_g...@apple.com > > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & > > >better AccECN handling > > > > > > > > >CAUTION: This is an external email. Please be very careful when clicking > > >links or opening attachments. See the URL nok.it/ext for additional > > >information. > > > > > > > > > > > >On Thu, 7 Nov 2024, Eric Dumazet wrote: > > > > > >>On Tue, Nov 5, 2024 at 11:07 AM wrote: > > >>> > > >>> From: Ilpo Järvinen > > >>> > > >>> There are important differences in how the CWR field behaves in > > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter > > >>> and its changes are important so adjust the flags changed mask > > >>> accordingly. > > >>> > > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid > > >>> corrupting CWR flag somewhere. > > >>> > > >>> Signed-off-by: Ilpo Järvinen > > >>> Signed-off-by: Chia-Yu Chang > > >>> --- > > >>> net/ipv4/tcp_offload.c | 4 ++-- > > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > > >>> > > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index > > >>> 0b05f30e9e5f..f59762d88c38 100644 > > >>> --- a/net/ipv4/tcp_offload.c > > >>> +++ b/net/ipv4/tcp_offload.c > > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head > > >>> *head, struct sk_buff *skb, > > >>> th2 = tcp_hdr(p); > > >>> flush = (__force int)(flags & TCP_FLAG_CWR); > > >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) & > > >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); > > >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH)); > > >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq); > > >>> for (i = sizeof(*th); i < thlen; i += 4) > > >>> flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 +405,7 > > >>> @@ void tcp_gro_complete(struct sk_buff *skb) > > >>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count; > > >>> > > >>> if (th->cwr) > > >>> - shinfo->gso_type |= SKB_GSO_TCP_ECN; > > >>> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN; > > >>> } > > >>> EXPORT_SYMBOL(tcp_gro_complete); > > >>> > > >> > > >> I do not really understand this patch. How a GRO engine can know which > > >> ECN variant the peers are using ? > > > > > >Hi Eric, > > > > > >Thanks for taking a look. > > > > > >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can > > >result in header change that corrupts ACE field. Thus, GRO has to assume > > >the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks > > >the connection and knows the bits are used for RFC3168 which is something > > >nobody is going to do). 
> > > > > >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or > > >NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE > > >field value. > > > > Hi Eric and Ilpo, > > > > From my understanding of another email thread (patch 08/13), it seems the > > conclusions that SKB_GSO_TCP_ACC
RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling
Adding a few virtio people. Please see the virtio spec/flag question below. On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote: > >From: Ilpo Järvinen > >Sent: Thursday, November 7, 2024 8:28 PM > >To: Eric Dumazet > >Cc: Chia-Yu Chang (Nokia) ; > >netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; > >dsah...@kernel.org; pab...@redhat.com; joel.grana...@kernel.org; > >k...@kernel.org; andrew+net...@lunn.ch; ho...@kernel.org; > >pa...@netfilter.org; kad...@netfilter.org; netfilter-de...@vger.kernel.org; > >coret...@netfilter.org; ncardw...@google.com; Koen De Schepper (Nokia) > >; g.wh...@cablelabs.com; > >ingemar.s.johans...@ericsson.com; mirja.kuehlew...@ericsson.com; > >chesh...@apple.com; rs.i...@gmx.at; jason_living...@comcast.com; > >vidhi_g...@apple.com > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & > >better AccECN handling > > > > > >CAUTION: This is an external email. Please be very careful when clicking > >links or opening attachments. See the URL nok.it/ext for additional > >information. > > > > > > > >On Thu, 7 Nov 2024, Eric Dumazet wrote: > > > >>On Tue, Nov 5, 2024 at 11:07 AM wrote: > >>> > >>> From: Ilpo Järvinen > >>> > >>> There are important differences in how the CWR field behaves in > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter > >>> and its changes are important so adjust the flags changed mask > >>> accordingly. > >>> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid > >>> corrupting CWR flag somewhere. > >>> > >>> Signed-off-by: Ilpo Järvinen > >>> Signed-off-by: Chia-Yu Chang > >>> --- > >>> net/ipv4/tcp_offload.c | 4 ++-- > >>> 1 file changed, 2 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index > >>> 0b05f30e9e5f..f59762d88c38 100644 > >>> --- a/net/ipv4/tcp_offload.c > >>> +++ b/net/ipv4/tcp_offload.c > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head > >>> *head, struct sk_buff *skb, > >>> th2 = tcp_hdr(p); > >>> flush = (__force int)(flags & TCP_FLAG_CWR); > >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) & > >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); > >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH)); > >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq); > >>> for (i = sizeof(*th); i < thlen; i += 4) > >>> flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 +405,7 > >>> @@ void tcp_gro_complete(struct sk_buff *skb) > >>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count; > >>> > >>> if (th->cwr) > >>> - shinfo->gso_type |= SKB_GSO_TCP_ECN; > >>> + shinfo->gso_type |= SKB_GSO_TCP_ACCECN; > >>> } > >>> EXPORT_SYMBOL(tcp_gro_complete); > >>> > >> > >> I do not really understand this patch. How a GRO engine can know which > >> ECN variant the peers are using ? > > > >Hi Eric, > > > >Thanks for taking a look. > > > >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result > >in header change that corrupts ACE field. Thus, GRO has to assume the > >RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the > >connection and knows the bits are used for RFC3168 which is something nobody > >is going to do). > > > >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or > >NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE > >field value. > > Hi Eric and Ilpo, > > From my understanding of another email thread (patch 08/13), it seems the > conclusions that SKB_GSO_TCP_ACCECN will still be needed. 
> > > > >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the > >same CWR flag should be repeated for all segments. In a sense, it's simpler > >than RFC3168 offloading. > > > >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ? > >> > >> git grep -n SKB_GSO_TCP_ECN > >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888: > >> skb_shinfo(skb)->gso_type |=
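For background, a rough sketch of the constraint under discussion (the ACE layout is taken from the AccECN draft, not from this patch set; the bit names below are illustrative, not kernel struct fields):

    /* Per the AccECN draft, the ACE field is the 3-bit value formed by the
     * AE (formerly reserved/NS), CWR and ECE flags; on ACKs it counts
     * CE-marked packets modulo 8. */
    ace = (ae_bit << 2) | (cwr_bit << 1) | ece_bit;

    /* RFC 3168-style TSO keeps CWR only on the first segment and clears it
     * on the rest, silently changing this counter. SKB_GSO_TCP_ACCECN
     * therefore requires segmentation to replicate the CWR bit unchanged
     * across every segment of the super-skb. */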
Re: [PATCH net v2 0/3] virtio/vsock: Fix memory leaks
Hello: This series was applied to netdev/net.git (main) by Paolo Abeni : On Thu, 07 Nov 2024 21:46:11 +0100 you wrote: > Short series fixing some memory leaks that I've stumbled upon while toying > with the selftests. > > Signed-off-by: Michal Luczaj > --- > Changes in v2: > - Remove the refactoring patch from the series [Stefano] > - PATCH 2: Drop "virtio" from the commit title [Stefano] > - Collect Reviewed-by [Stefano] > - Link to v1: > https://lore.kernel.org/r/20241106-vsock-mem-leaks-v1-0-8f4ffc309...@rbox.co > > [...] Here is the summary with links: - [net,v2,1/3] virtio/vsock: Fix accept_queue memory leak https://git.kernel.org/netdev/net/c/d7b0ff5a8667 - [net,v2,2/3] vsock: Fix sk_error_queue memory leak https://git.kernel.org/netdev/net/c/fbf7085b3ad1 - [net,v2,3/3] virtio/vsock: Improve MSG_ZEROCOPY error handling https://git.kernel.org/netdev/net/c/60cf6206a1f5 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH net-next v3 05/13] virtio_ring: introduce add api for premapped
On Thu, 7 Nov 2024 16:54:56 +0800 Xuan Zhuo wrote: > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). Looks like we need a rebase, doesn't apply any more. When you rebase please correct the kdoc, looks like the official format wants a ":" after the "Returns"
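For reference, the kernel-doc shape being asked for is roughly the following (illustrative text, not the actual patch):

    /**
     * virtqueue_add_inbuf_premapped - add a premapped inbuf to the virtqueue
     * @vq: the struct virtqueue we're talking about
     * @sg: a premapped scatterlist
     * @num: the number of entries in @sg
     * @data: the token identifying the buffer
     * @ctx: extra context for the token
     * @gfp: how to do memory allocations (if necessary)
     *
     * Return: zero or a negative error (i.e. -ENOSPC, -ENOMEM, -EIO).
     */

Without the trailing colon, kernel-doc folds the return text into the general description instead of emitting a dedicated Return section.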
Re: [PATCH net v2 2/3] vsock: Fix sk_error_queue memory leak
On 07.11.2024 23:46, Michal Luczaj wrote: > Kernel queues MSG_ZEROCOPY completion notifications on the error queue. > Where they remain, until explicitly recv()ed. To prevent memory leaks, > clean up the queue when the socket is destroyed. > > unreferenced object 0x8881028beb00 (size 224): > comm "vsock_test", pid 1218, jiffies 4294694897 > hex dump (first 32 bytes): > 90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff ..!...!. > 00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff ..!. > backtrace (crc 6c7031ca): > [] kmem_cache_alloc_node_noprof+0x2f7/0x370 > [] __alloc_skb+0x132/0x180 > [] sock_omalloc+0x4b/0x80 > [] msg_zerocopy_realloc+0x9e/0x240 > [] virtio_transport_send_pkt_info+0x412/0x4c0 > [] virtio_transport_stream_enqueue+0x43/0x50 > [] vsock_connectible_sendmsg+0x373/0x450 > [] sys_sendmsg+0x365/0x3a0 > [] ___sys_sendmsg+0x84/0xd0 > [] __sys_sendmsg+0x47/0x80 > [] do_syscall_64+0x93/0x180 > [] entry_SYSCALL_64_after_hwframe+0x76/0x7e > > Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") > Signed-off-by: Michal Luczaj > --- > net/vmw_vsock/af_vsock.c | 3 +++ > 1 file changed, 3 insertions(+) Acked-by: Arseniy Krasnov > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c > index > 35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6 > 100644 > --- a/net/vmw_vsock/af_vsock.c > +++ b/net/vmw_vsock/af_vsock.c > @@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk) > { > struct vsock_sock *vsk = vsock_sk(sk); > > + /* Flush MSG_ZEROCOPY leftovers. */ > + __skb_queue_purge(&sk->sk_error_queue); > + > vsock_deassign_transport(vsk); > > /* When clearing these addresses, there's no need to set the family and >
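For context, "explicitly recv()ed" refers to draining the error queue from userspace, roughly as below (an illustrative sketch following the generic MSG_ZEROCOPY notification scheme described in Documentation/networking/msg_zerocopy.rst; not taken from the vsock tests):

    #include <sys/socket.h>
    #include <linux/errqueue.h>

    static void drain_zerocopy_completions(int fd)
    {
            for (;;) {
                    char control[128];
                    struct msghdr msg = {
                            .msg_control = control,
                            .msg_controllen = sizeof(control),
                    };
                    struct cmsghdr *cm;
                    struct sock_extended_err *serr;

                    /* Reads of the error queue never block; EAGAIN means
                     * the queue is drained. */
                    if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                            break;

                    cm = CMSG_FIRSTHDR(&msg);
                    if (!cm)
                            continue;
                    serr = (struct sock_extended_err *)CMSG_DATA(cm);
                    if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
                            ; /* serr->ee_info..serr->ee_data is the range
                               * of completed zerocopy sends */
            }
    }

Sockets that never do this are exactly the case the purge in vsock_sk_destruct() has to cover.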
Re: [PATCH net v2 3/3] virtio/vsock: Improve MSG_ZEROCOPY error handling
On 07.11.2024 23:46, Michal Luczaj wrote: > Add a missing kfree_skb() to prevent memory leaks. > > Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") > Reviewed-by: Stefano Garzarella > Signed-off-by: Michal Luczaj > --- > net/vmw_vsock/virtio_transport_common.c | 1 + > 1 file changed, 1 insertion(+) Acked-by: Arseniy Krasnov > > diff --git a/net/vmw_vsock/virtio_transport_common.c > b/net/vmw_vsock/virtio_transport_common.c > index > cd075f608d4f6f48f894543e5e9c966d3e5f22df..e2e6a30b759bdc6371bb0d63ee2e77c0ba148fd2 > 100644 > --- a/net/vmw_vsock/virtio_transport_common.c > +++ b/net/vmw_vsock/virtio_transport_common.c > @@ -400,6 +400,7 @@ static int virtio_transport_send_pkt_info(struct > vsock_sock *vsk, > if (virtio_transport_init_zcopy_skb(vsk, skb, > info->msg, > can_zcopy)) { > + kfree_skb(skb); > ret = -ENOMEM; > break; > } >
Re: [PATCH net 3/4] virtio/vsock: Improve MSG_ZEROCOPY error handling
On 06.11.2024 20:51, Michal Luczaj wrote: > Add a missing kfree_skb() to prevent memory leaks. > > Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") > Signed-off-by: Michal Luczaj > --- > net/vmw_vsock/virtio_transport_common.c | 1 + > 1 file changed, 1 insertion(+) Acked-by: Arseniy Krasnov > > diff --git a/net/vmw_vsock/virtio_transport_common.c > b/net/vmw_vsock/virtio_transport_common.c > index > cd075f608d4f6f48f894543e5e9c966d3e5f22df..e2e6a30b759bdc6371bb0d63ee2e77c0ba148fd2 > 100644 > --- a/net/vmw_vsock/virtio_transport_common.c > +++ b/net/vmw_vsock/virtio_transport_common.c > @@ -400,6 +400,7 @@ static int virtio_transport_send_pkt_info(struct > vsock_sock *vsk, > if (virtio_transport_init_zcopy_skb(vsk, skb, > info->msg, > can_zcopy)) { > + kfree_skb(skb); > ret = -ENOMEM; > break; > } >
Re: [PATCH net-next v3 03/13] virtio_ring: packed: record extras for indirect buffers
On Mon, Nov 11, 2024 at 11:53 AM Xuan Zhuo wrote: > > On Thu, 7 Nov 2024 16:54:54 +0800, Xuan Zhuo > wrote: > > The subsequent commit needs to know whether every indirect buffer is > > premapped or not. So we need to introduce an extra struct for every > > indirect buffer to record this info. > > > > Signed-off-by: Xuan Zhuo > > > Hi, Jason > > This also needs a review. Sorry, I missed this. Acked-by: Jason Wang Thanks

Re: [PATCH net-next v3 03/13] virtio_ring: packed: record extras for indirect buffers
On Thu, 7 Nov 2024 16:54:54 +0800, Xuan Zhuo wrote: > The subsequent commit needs to know whether every indirect buffer is > premapped or not. So we need to introduce an extra struct for every > indirect buffer to record this info. > > Signed-off-by: Xuan Zhuo Hi, Jason This also needs a review. Thanks. > --- > drivers/virtio/virtio_ring.c | 60 +--- > 1 file changed, 36 insertions(+), 24 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 405d5a348795..cfe70c40f630 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -78,7 +78,11 @@ struct vring_desc_state_split { > > struct vring_desc_state_packed { > void *data; /* Data for callback. */ > - struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */ > + > + /* Indirect desc table and extra table, if any. These two will be > + * allocated together. So we won't stress more to the memory allocator. > + */ > + struct vring_packed_desc *indir_desc; > u16 num;/* Descriptor list length. */ > u16 last; /* The last desc state in a list. */ > }; > @@ -1238,27 +1242,12 @@ static void vring_unmap_extra_packed(const struct > vring_virtqueue *vq, > } > } > > -static void vring_unmap_desc_packed(const struct vring_virtqueue *vq, > - const struct vring_packed_desc *desc) > -{ > - u16 flags; > - > - if (!vring_need_unmap_buffer(vq)) > - return; > - > - flags = le16_to_cpu(desc->flags); > - > - dma_unmap_page(vring_dma_dev(vq), > -le64_to_cpu(desc->addr), > -le32_to_cpu(desc->len), > -(flags & VRING_DESC_F_WRITE) ? > -DMA_FROM_DEVICE : DMA_TO_DEVICE); > -} > - > static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg, > gfp_t gfp) > { > + struct vring_desc_extra *extra; > struct vring_packed_desc *desc; > + int i, size; > > /* >* We require lowmem mappings for the descriptors because > @@ -1267,7 +1256,16 @@ static struct vring_packed_desc > *alloc_indirect_packed(unsigned int total_sg, >*/ > gfp &= ~__GFP_HIGHMEM; > > - desc = kmalloc_array(total_sg, sizeof(struct vring_packed_desc), gfp); > + size = (sizeof(*desc) + sizeof(*extra)) * total_sg; > + > + desc = kmalloc(size, gfp); > + if (!desc) > + return NULL; > + > + extra = (struct vring_desc_extra *)&desc[total_sg]; > + > + for (i = 0; i < total_sg; i++) > + extra[i].next = i + 1; > > return desc; > } > @@ -1280,6 +1278,7 @@ static int virtqueue_add_indirect_packed(struct > vring_virtqueue *vq, >void *data, >gfp_t gfp) > { > + struct vring_desc_extra *extra; > struct vring_packed_desc *desc; > struct scatterlist *sg; > unsigned int i, n, err_idx; > @@ -1291,6 +1290,8 @@ static int virtqueue_add_indirect_packed(struct > vring_virtqueue *vq, > if (!desc) > return -ENOMEM; > > + extra = (struct vring_desc_extra *)&desc[total_sg]; > + > if (unlikely(vq->vq.num_free < 1)) { > pr_debug("Can't add buf len 1 - avail = 0\n"); > kfree(desc); > @@ -1312,6 +1313,13 @@ static int virtqueue_add_indirect_packed(struct > vring_virtqueue *vq, > 0 : VRING_DESC_F_WRITE); > desc[i].addr = cpu_to_le64(addr); > desc[i].len = cpu_to_le32(sg->length); > + > + if (unlikely(vq->use_dma_api)) { > + extra[i].addr = addr; > + extra[i].len = sg->length; > + extra[i].flags = n < out_sgs ? 
0 : > VRING_DESC_F_WRITE; > + } > + > i++; > } > } > @@ -1381,7 +1389,7 @@ static int virtqueue_add_indirect_packed(struct > vring_virtqueue *vq, > err_idx = i; > > for (i = 0; i < err_idx; i++) > - vring_unmap_desc_packed(vq, &desc[i]); > + vring_unmap_extra_packed(vq, &extra[i]); > > free_desc: > kfree(desc); > @@ -1617,7 +1625,8 @@ static void detach_buf_packed(struct vring_virtqueue > *vq, > } > > if (vq->indirect) { > - u32 len; > + struct vring_desc_extra *extra; > + u32 len, num; > > /* Free the indirect table, if any, now that it's unmapped. */ > desc = state->indir_desc; > @@ -1626,9 +1635,12 @@ static void detach_buf_packed(struct vring_virtqueue > *vq, > > if (vring_need_unmap_buffer(vq)) { > len = vq->packed.desc_extra[id].len;
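For reference, the memory layout the allocation in alloc_indirect_packed() produces is a single kmalloc'd block, sketched below (derived directly from the hunk above; descriptors come first, so the device-visible indirect table stays contiguous and the driver-only extras sit behind it, which also answers the "should the descriptor table come first" question raised later in the thread):

    /* One allocation of (sizeof(*desc) + sizeof(*extra)) * total_sg bytes:
     *
     *   desc[0] ... desc[total_sg - 1] | extra[0] ... extra[total_sg - 1]
     *
     * extra = (struct vring_desc_extra *)&desc[total_sg];
     */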
Re: [PATCH net-next v3 06/13] virtio-net: rq submits premapped per-buffer
On Thu, Nov 7, 2024 at 4:55 PM Xuan Zhuo wrote: > > virtio-net rq submits premapped per-buffer by setting sg page to NULL; > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio_net.c | 22 +++--- > 1 file changed, 11 insertions(+), 11 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 4b27ded8fc16..862beacef5d7 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -539,6 +539,12 @@ static struct sk_buff *ptr_to_skb(void *ptr) > return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG); > } > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > +{ > + sg_dma_address(sg) = addr; > + sg_dma_len(sg) = len; > +} > + In the future, we need to consider hiding those in the core. Anyhow Acked-by: Jason Wang Thanks
Re: [PATCH net-next v3 05/13] virtio_ring: introduce add api for premapped
On Thu, Nov 7, 2024 at 4:55 PM Xuan Zhuo wrote: > > Two APIs are introduced to submit premapped per-buffers. > > int virtqueue_add_inbuf_premapped(struct virtqueue *vq, > struct scatterlist *sg, unsigned int num, > void *data, > void *ctx, > gfp_t gfp); > > int virtqueue_add_outbuf_premapped(struct virtqueue *vq, > struct scatterlist *sg, unsigned int num, > void *data, > gfp_t gfp); > > Signed-off-by: Xuan Zhuo Acked-by: Jason Wang Thanks
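A rough usage sketch for the inbuf variant, based only on the prototype above (dev, buf, len and the error handling are placeholders; how the address and length are carried in the sg mirrors the virtio-net helper discussed elsewhere in this series, and is an assumption rather than documented API behaviour):

    /* Sketch: the driver owns the DMA mapping and submits an already
     * mapped buffer; the sg page stays NULL to mark it as premapped. */
    struct scatterlist sg;
    dma_addr_t addr;
    int err;

    addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, addr))
            return -ENOMEM;

    sg_init_table(&sg, 1);
    sg_dma_address(&sg) = addr;
    sg.length = len;        /* length the device may write */

    err = virtqueue_add_inbuf_premapped(vq, &sg, 1, buf, NULL, GFP_ATOMIC);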
Re: [PATCH net-next v3 02/13] virtio_ring: split: record extras for indirect buffers
On Thu, Nov 7, 2024 at 4:55 PM Xuan Zhuo wrote: > > The subsequent commit needs to know whether every indirect buffer is > premapped or not. So we need to introduce an extra struct for every > indirect buffer to record this info. > > Signed-off-by: Xuan Zhuo Acked-by: Jason Wang Thanks
Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
On Wed, Nov 6, 2024 at 3:42 PM Michael S. Tsirkin wrote: > > On Wed, Nov 06, 2024 at 09:44:39AM +0800, Jason Wang wrote: > > > > > while (vq->split.vring.desc[i].flags & nextflag) { > > > > > - vring_unmap_one_split(vq, i); > > > > > + vring_unmap_one_split(vq, &extra[i]); > > > > > > > > Not sure if I've asked this before. But this part seems to deserve an > > > > independent fix for -stable. > > > > > > What fix? > > > > I meant for hardening we need to check the flags stored in the extra > > instead of the descriptor itself as it could be mangled by the device. > > > > Thanks > > Good point. Jason, want to cook up a patch? Will do. Thanks > > -- > MST >
Re: [PATCH net 4/4] virtio/vsock: Put vsock_connected_sockets_vsk() to use
On Thu, Nov 07, 2024 at 10:04:03PM +0100, Michal Luczaj wrote: On 11/7/24 11:22, Stefano Garzarella wrote: On Wed, Nov 06, 2024 at 06:51:21PM +0100, Michal Luczaj wrote: Macro vsock_connected_sockets_vsk() has been unused since its introduction. Instead of removing it, utilise it in vsock_insert_connected() where it's been open-coded. No functional change intended. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") This is not a fix, so please remove the Fixes tag, we don't need to backport this patch in stable branches. Also in this case this is not related at all with virtio transport, so please remove `virtio` from the commit title. In addition maybe you can remove this patch from this series, and send it to net-next. ... Right, I get it. Just to be clear: are such small (and non-functional) cleanups welcomed coming by themselves? Good question, in this case I think it's a good cleanup, since we have the function already there, why not use it. So I don't see a problem with that. If you find others cleanups, it's better to pack in a single series, otherwise let's send just it. Thanks for the fixes and cleanups! Stefano
Re: [PATCH net v2 2/3] vsock: Fix sk_error_queue memory leak
On Thu, Nov 07, 2024 at 09:46:13PM +0100, Michal Luczaj wrote: Kernel queues MSG_ZEROCOPY completion notifications on the error queue. Where they remain, until explicitly recv()ed. To prevent memory leaks, clean up the queue when the socket is destroyed. unreferenced object 0x8881028beb00 (size 224): comm "vsock_test", pid 1218, jiffies 4294694897 hex dump (first 32 bytes): 90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff ..!...!. 00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff ..!. backtrace (crc 6c7031ca): [] kmem_cache_alloc_node_noprof+0x2f7/0x370 [] __alloc_skb+0x132/0x180 [] sock_omalloc+0x4b/0x80 [] msg_zerocopy_realloc+0x9e/0x240 [] virtio_transport_send_pkt_info+0x412/0x4c0 [] virtio_transport_stream_enqueue+0x43/0x50 [] vsock_connectible_sendmsg+0x373/0x450 [] sys_sendmsg+0x365/0x3a0 [] ___sys_sendmsg+0x84/0xd0 [] __sys_sendmsg+0x47/0x80 [] do_syscall_64+0x93/0x180 [] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") Signed-off-by: Michal Luczaj --- net/vmw_vsock/af_vsock.c | 3 +++ 1 file changed, 3 insertions(+) Reviewed-by: Stefano Garzarella diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk) { struct vsock_sock *vsk = vsock_sk(sk); + /* Flush MSG_ZEROCOPY leftovers. */ + __skb_queue_purge(&sk->sk_error_queue); + vsock_deassign_transport(vsk); /* When clearing these addresses, there's no need to set the family and -- 2.46.2
Re: [PATCH net 4/4] virtio/vsock: Put vsock_connected_sockets_vsk() to use
On 11/7/24 11:22, Stefano Garzarella wrote: > On Wed, Nov 06, 2024 at 06:51:21PM +0100, Michal Luczaj wrote: >> Macro vsock_connected_sockets_vsk() has been unused since its introduction. >> Instead of removing it, utilise it in vsock_insert_connected() where it's >> been open-coded. >> >> No functional change intended. >> >> Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") > > This is not a fix, so please remove the Fixes tag, we don't need to > backport this patch in stable branches. > > Also in this case this is not related at all with virtio transport, so > please remove `virtio` from the commit title. > > In addition maybe you can remove this patch from this series, and send > it to net-next. > ... Right, I get it. Just to be clear: are such small (and non-functional) cleanups welcomed coming by themselves? Thanks, Michal
Re: [PATCH net 2/4] virtio/vsock: Fix sk_error_queue memory leak
On 11/7/24 11:17, Stefano Garzarella wrote: > On Wed, Nov 06, 2024 at 06:51:19PM +0100, Michal Luczaj wrote: >> Kernel queues MSG_ZEROCOPY completion notifications on the error queue. >> Where they remain, until explicitly recv()ed. To prevent memory leaks, >> clean up the queue when the socket is destroyed. >> >> unreferenced object 0x8881028beb00 (size 224): >> comm "vsock_test", pid 1218, jiffies 4294694897 >> hex dump (first 32 bytes): >>90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff ..!...!. >>00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff ..!. >> backtrace (crc 6c7031ca): >>[] kmem_cache_alloc_node_noprof+0x2f7/0x370 >>[] __alloc_skb+0x132/0x180 >>[] sock_omalloc+0x4b/0x80 >>[] msg_zerocopy_realloc+0x9e/0x240 >>[] virtio_transport_send_pkt_info+0x412/0x4c0 >>[] virtio_transport_stream_enqueue+0x43/0x50 >>[] vsock_connectible_sendmsg+0x373/0x450 >>[] sys_sendmsg+0x365/0x3a0 >>[] ___sys_sendmsg+0x84/0xd0 >>[] __sys_sendmsg+0x47/0x80 >>[] do_syscall_64+0x93/0x180 >>[] entry_SYSCALL_64_after_hwframe+0x76/0x7e >> >> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") >> Signed-off-by: Michal Luczaj >> --- >> net/vmw_vsock/af_vsock.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c >> index >> 35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6 >> 100644 >> --- a/net/vmw_vsock/af_vsock.c >> +++ b/net/vmw_vsock/af_vsock.c >> @@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk) >> { >> struct vsock_sock *vsk = vsock_sk(sk); >> >> +/* Flush MSG_ZEROCOPY leftovers. */ >> +__skb_queue_purge(&sk->sk_error_queue); >> + > > It is true that for now this is supported only in the virtio transport, > but it's more related to the core, so please remove `virtio` from the > commit title. > > The rest LGTM. > ... OK, done. Here's v2 of the series: https://lore.kernel.org/netdev/20241107-vsock-mem-leaks-v2-0-4e21bfcfc...@rbox.co/ Thanks for the reviews, Michal
Re: [PATCH net 4/4] virtio/vsock: Put vsock_connected_sockets_vsk() to use
On Wed, Nov 06, 2024 at 06:51:21PM +0100, Michal Luczaj wrote: Macro vsock_connected_sockets_vsk() has been unused since its introduction. Instead of removing it, utilise it in vsock_insert_connected() where it's been open-coded. No functional change intended. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") This is not a fix, so please remove the Fixes tag, we don't need to backport this patch in stable branches. Also in this case this is not related at all with virtio transport, so please remove `virtio` from the commit title. In addition maybe you can remove this patch from this series, and send it to net-next. Thanks, Stefano Signed-off-by: Michal Luczaj --- net/vmw_vsock/af_vsock.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index dfd29160fe11c4675f872c1ee123d65b2da0dae6..c3a37c3d4bf3c8117fbc8bd020da8dc1c9212732 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -275,8 +275,7 @@ static void vsock_insert_unbound(struct vsock_sock *vsk) void vsock_insert_connected(struct vsock_sock *vsk) { - struct list_head *list = vsock_connected_sockets( - &vsk->remote_addr, &vsk->local_addr); + struct list_head *list = vsock_connected_sockets_vsk(vsk); spin_lock_bh(&vsock_table_lock); __vsock_insert_connected(list, vsk); -- 2.46.2
Re: [PATCH net 3/4] virtio/vsock: Improve MSG_ZEROCOPY error handling
On Wed, Nov 06, 2024 at 06:51:20PM +0100, Michal Luczaj wrote: Add a missing kfree_skb() to prevent memory leaks. Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") Signed-off-by: Michal Luczaj --- net/vmw_vsock/virtio_transport_common.c | 1 + 1 file changed, 1 insertion(+) Reviewed-by: Stefano Garzarella diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index cd075f608d4f6f48f894543e5e9c966d3e5f22df..e2e6a30b759bdc6371bb0d63ee2e77c0ba148fd2 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -400,6 +400,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, if (virtio_transport_init_zcopy_skb(vsk, skb, info->msg, can_zcopy)) { + kfree_skb(skb); ret = -ENOMEM; break; } -- 2.46.2
Re: [PATCH net 2/4] virtio/vsock: Fix sk_error_queue memory leak
On Wed, Nov 06, 2024 at 06:51:19PM +0100, Michal Luczaj wrote: Kernel queues MSG_ZEROCOPY completion notifications on the error queue. Where they remain, until explicitly recv()ed. To prevent memory leaks, clean up the queue when the socket is destroyed. unreferenced object 0x8881028beb00 (size 224): comm "vsock_test", pid 1218, jiffies 4294694897 hex dump (first 32 bytes): 90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff ..!...!. 00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff ..!. backtrace (crc 6c7031ca): [] kmem_cache_alloc_node_noprof+0x2f7/0x370 [] __alloc_skb+0x132/0x180 [] sock_omalloc+0x4b/0x80 [] msg_zerocopy_realloc+0x9e/0x240 [] virtio_transport_send_pkt_info+0x412/0x4c0 [] virtio_transport_stream_enqueue+0x43/0x50 [] vsock_connectible_sendmsg+0x373/0x450 [] sys_sendmsg+0x365/0x3a0 [] ___sys_sendmsg+0x84/0xd0 [] __sys_sendmsg+0x47/0x80 [] do_syscall_64+0x93/0x180 [] entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support") Signed-off-by: Michal Luczaj --- net/vmw_vsock/af_vsock.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk) { struct vsock_sock *vsk = vsock_sk(sk); + /* Flush MSG_ZEROCOPY leftovers. */ + __skb_queue_purge(&sk->sk_error_queue); + It is true that for now this is supported only in the virtio transport, but it's more related to the core, so please remove `virtio` from the commit title. The rest LGTM. Thanks, Stefano vsock_deassign_transport(vsk); /* When clearing these addresses, there's no need to set the family and -- 2.46.2
Re: [PATCH net 1/4] virtio/vsock: Fix accept_queue memory leak
On Wed, Nov 06, 2024 at 06:51:18PM +0100, Michal Luczaj wrote: As the final stages of socket destruction may be delayed, it is possible that virtio_transport_recv_listen() will be called after the accept_queue has been flushed, but before the SOCK_DONE flag has been set. As a result, sockets enqueued after the flush would remain unremoved, leading to a memory leak. vsock_release __vsock_release lock virtio_transport_release virtio_transport_close schedule_delayed_work(close_work) sk_shutdown = SHUTDOWN_MASK (!) flush accept_queue release virtio_transport_recv_pkt vsock_find_bound_socket lock if flag(SOCK_DONE) return virtio_transport_recv_listen child = vsock_create_connected (!) vsock_enqueue_accept(child) release close_work lock virtio_transport_do_close set_flag(SOCK_DONE) virtio_transport_remove_sock vsock_remove_sock vsock_remove_bound release Introduce a sk_shutdown check to disallow vsock_enqueue_accept() during socket destruction. unreferenced object 0x888109e3f800 (size 2040): comm "kworker/5:2", pid 371, jiffies 4294940105 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 28 00 0b 40 00 00 00 00 00 00 00 00 00 00 00 00 (..@ backtrace (crc 9e5f4e84): [] kmem_cache_alloc_noprof+0x2c1/0x360 [] sk_prot_alloc+0x30/0x120 [] sk_alloc+0x2c/0x4b0 [] __vsock_create.constprop.0+0x2a/0x310 [] virtio_transport_recv_pkt+0x4dc/0x9a0 [] vsock_loopback_work+0xfd/0x140 [] process_one_work+0x20c/0x570 [] worker_thread+0x1bf/0x3a0 [] kthread+0xdd/0x110 [] ret_from_fork+0x2d/0x50 [] ret_from_fork_asm+0x1a/0x30 Fixes: 3fe356d58efa ("vsock/virtio: discard packets only when socket is really closed") Signed-off-by: Michal Luczaj --- net/vmw_vsock/virtio_transport_common.c | 8 1 file changed, 8 insertions(+) Reviewed-by: Stefano Garzarella diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index ccbd2bc0d2109aea4f19e79a0438f85893e1d89c..cd075f608d4f6f48f894543e5e9c966d3e5f22df 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -1512,6 +1512,14 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb, return -ENOMEM; } + /* __vsock_release() might have already flushed accept_queue. +* Subsequent enqueues would lead to a memory leak. +*/ + if (sk->sk_shutdown == SHUTDOWN_MASK) { + virtio_transport_reset_no_sock(t, skb); + return -ESHUTDOWN; + } + child = vsock_create_connected(sk); if (!child) { virtio_transport_reset_no_sock(t, skb); -- 2.46.2
Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default
On Wed, 6 Nov 2024 04:16:07 -0500 Michael S. Tsirkin wrote: > I thought it can but can't remember why now. OK, nm then, thanks! FWIW (I think there was confusion in earlier discussions) we do merge net into net-next once a week. So we can always apply stuff to net, and then the dependent patches to net-next within a week. Just for future reference, this patch IIUC we just leave be.
Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default
On Wed, Nov 06, 2024 at 04:46:23PM +0800, Xuan Zhuo wrote: > On Wed, 6 Nov 2024 02:38:40 -0500, "Michael S. Tsirkin" > wrote: > > On Mon, Nov 04, 2024 at 06:46:41PM -0800, Jakub Kicinski wrote: > > > On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote: > > > > In the last linux version, we disabled this feature to fix the > > > > regress[1]. > > > > > > > > The patch set is try to fix the problem and re-enable it. > > > > > > > > More info: > > > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > > > > > Sorry to ping, Michael, Jason we're waiting to hear from you on > > > this one. > > > > Can patch 1 be applied on net as well? Or I can take it through > > my tree. It's a bugfix, just for an uncommon configuration. > > > Why? That can not be triggered in net branch. > > Thanks I thought it can but can't remember why now. OK, nm then, thanks!
Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default
On Wed, 6 Nov 2024 02:38:40 -0500, "Michael S. Tsirkin" wrote: > On Mon, Nov 04, 2024 at 06:46:41PM -0800, Jakub Kicinski wrote: > > On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote: > > > In the last linux version, we disabled this feature to fix the > > > regress[1]. > > > > > > The patch set is try to fix the problem and re-enable it. > > > > > > More info: > > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > > > Sorry to ping, Michael, Jason we're waiting to hear from you on > > this one. > > Can patch 1 be applied on net as well? Or I can take it through > my tree. It's a bugfix, just for an uncommon configuration. Why? That can not be triggered in net branch. Thanks
Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
On Wed, Nov 06, 2024 at 09:44:39AM +0800, Jason Wang wrote: > > > > while (vq->split.vring.desc[i].flags & nextflag) { > > > > - vring_unmap_one_split(vq, i); > > > > + vring_unmap_one_split(vq, &extra[i]); > > > > > > Not sure if I've asked this before. But this part seems to deserve an > > > independent fix for -stable. > > > > What fix? > > I meant for hardening we need to check the flags stored in the extra > instead of the descriptor itself as it could be mangled by the device. > > Thanks Good point. Jason, want to cook up a patch? -- MST
Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default
On Mon, Nov 04, 2024 at 06:46:41PM -0800, Jakub Kicinski wrote: > On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote: > > In the last linux version, we disabled this feature to fix the > > regress[1]. > > > > The patch set is try to fix the problem and re-enable it. > > > > More info: > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > Sorry to ping, Michael, Jason we're waiting to hear from you on > this one. Can patch 1 be applied on net as well? Or I can take it through my tree. It's a bugfix, just for an uncommon configuration.
Re: [PATCH net-next v1 1/4] virtio-net: fix overflow inside virtnet_rq_alloc
On Tue, Oct 29, 2024 at 04:46:12PM +0800, Xuan Zhuo wrote: > When the frag just got a page, then may lead to regression on VM. > Specially if the sysctl net.core.high_order_alloc_disable value is 1, > then the frag always get a page when do refill. > > Which could see reliable crashes or scp failure (scp a file 100M in size > to VM). > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning > of a new frag. When the frag size is larger than PAGE_SIZE, > everything is fine. However, if the frag is only one page and the > total size of the buffer and virtnet_rq_dma is larger than one page, an > overflow may occur. > > The commit f9dac92ba908 ("virtio_ring: enable premapped mode whatever > use_dma_api") introduced this problem. And we reverted some commits to > fix this in last linux version. Now we try to enable it and fix this > bug directly. > > Here, when the frag size is not enough, we reduce the buffer len to fix > this problem. > > Reported-by: "Si-Wei Liu" > Tested-by: Darren Kenny > Signed-off-by: Xuan Zhuo This one belongs in net I feel. However, patches 2 and on depend on it not to cause regressions. So I suggest merging it on both branches, git will figure out the conflict I think. > --- > drivers/net/virtio_net.c | 12 +--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 792e9eadbfc3..d50c1940eb23 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -926,9 +926,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, > u32 size, gfp_t gfp) > void *buf, *head; > dma_addr_t addr; > > - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp))) > - return NULL; > - > head = page_address(alloc_frag->page); > > if (rq->do_dma) { > @@ -2423,6 +2420,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, > struct receive_queue *rq, > len = SKB_DATA_ALIGN(len) + > SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > > + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp))) > + return -ENOMEM; > + > buf = virtnet_rq_alloc(rq, len, gfp); > if (unlikely(!buf)) > return -ENOMEM; > @@ -2525,6 +2525,12 @@ static int add_recvbuf_mergeable(struct virtnet_info > *vi, >*/ > len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room); > > + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp))) > + return -ENOMEM; > + > + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > > alloc_frag->size) > + len -= sizeof(struct virtnet_rq_dma); > + > buf = virtnet_rq_alloc(rq, len + room, gfp); > if (unlikely(!buf)) > return -ENOMEM; > -- > 2.32.0.3.g01195cf9f
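As a worked example of the overflow (the page size of 4096 is assumed for illustration; the 16-byte size of struct virtnet_rq_dma is taken from the commit message): after a refill, the 16-byte struct virtnet_rq_dma sits at offset 0 of the fresh one-page frag, leaving 4080 bytes for data. If len + room works out to 4096, the submitted buffer extends 16 bytes past the end of the page, which is exactly the case the new "len -= sizeof(struct virtnet_rq_dma)" adjustment in add_recvbuf_mergeable() avoids when the frag is a single page.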
Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
On Wed, 6 Nov 2024 09:44:39 +0800, Jason Wang wrote: > On Tue, Nov 5, 2024 at 2:53 PM Xuan Zhuo wrote: > > > > On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang wrote: > > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo > > > wrote: > > > > > > > > The subsequent commit needs to know whether every indirect buffer is > > > > premapped or not. So we need to introduce an extra struct for every > > > > indirect buffer to record this info. > > > > > > > > Signed-off-by: Xuan Zhuo > > > > --- > > > > drivers/virtio/virtio_ring.c | 112 --- > > > > 1 file changed, 52 insertions(+), 60 deletions(-) > > > > > > Do we have a performance impact for this patch? > > > > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > > > index 97590c201aa2..dca093744fe1 100644 > > > > --- a/drivers/virtio/virtio_ring.c > > > > +++ b/drivers/virtio/virtio_ring.c > > > > @@ -69,7 +69,11 @@ > > > > > > > > struct vring_desc_state_split { > > > > void *data; /* Data for callback. */ > > > > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. > > > > */ > > > > + > > > > + /* Indirect extra table and desc table, if any. These two will > > > > be > > > > +* allocated together. So we won't stress more to the memory > > > > allocator. > > > > +*/ > > > > + struct vring_desc *indir_desc; > > > > > > So it looks like we put a descriptor table after the extra table. Can > > > this lead to more crossing page mappings for the indirect descriptors? > > > > > > If yes, it seems expensive so we probably need to make the descriptor > > > table come first. > > > > No, the descriptors are before extra table. > > Well, you need then tweak the above comment, it said > > "Indirect extra table and desc table". > > > So, there is not performance impact. > > > > > > > > > > > }; > > > > > > [...] > > > > > while (vq->split.vring.desc[i].flags & nextflag) { > > > > - vring_unmap_one_split(vq, i); > > > > + vring_unmap_one_split(vq, &extra[i]); > > > > > > Not sure if I've asked this before. But this part seems to deserve an > > > independent fix for -stable. > > > > What fix? > > I meant for hardening we need to check the flags stored in the extra > instead of the descriptor itself as it could be mangled by the device. I see, we can do it in future. Thanks. > > Thanks >
Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
On Wed, 6 Nov 2024 09:56:55 +0800, Jason Wang wrote: > On Tue, Nov 5, 2024 at 3:23 PM Xuan Zhuo wrote: > > > > On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang wrote: > > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo > > > wrote: > > > > > > > > virtio-net rq submits premapped per-buffer by setting sg page to NULL; > > > > > > > > Signed-off-by: Xuan Zhuo > > > > --- > > > > drivers/net/virtio_net.c | 24 +--- > > > > 1 file changed, 13 insertions(+), 11 deletions(-) > > > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > > > index 792e9eadbfc3..09757fa408bd 100644 > > > > --- a/drivers/net/virtio_net.c > > > > +++ b/drivers/net/virtio_net.c > > > > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr) > > > > return (struct sk_buff *)((unsigned long)ptr & > > > > ~VIRTIO_ORPHAN_FLAG); > > > > } > > > > > > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 > > > > len) > > > > +{ > > > > + sg->dma_address = addr; > > > > + sg->length = len; > > > > > > This may work but I think it's better to reuse existing dma sg helpers > > > like: > > > > > > sg_dma_address(sg) = addr; > > > sg_dma_length(sg) = len; > > > > > > And we probably need to fix the virtio core which only uses > > > sg_dma_address() but not sg_dma_length(). > > > > > > This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is > > > set. > > > > > > I don't think so. > > > > For no-premapped mode, we pass the sg as no-dma sg to virtio core, > > so the virtio core uses the sg->length directly. > > This is fine. > > > If virtio core do dma map for sg, we do not use the dma_mag_sg_attrs(), > > so we must use sg->length directly. > > I meant it's a hack. It may work now but will be a bug in the future. > > For example, I'm playing a prototype to do pre mapping for virtio-blk, > the idea is to move the expensive DMA mappings in the case of swiotlb > etc to be done outside the pre virtqueue lock. In that case, the > driver may want to use dma_map_sg() instead of dma_map_page(). > > I'd suppose we will finally go with the way where DMA mappings needs > to be handled by the driver, and dma_map_sg() is faster than per sg > dma_map_page() anyhow. > > > > > In this case, for the driver, we can not use sg_dma_length(), > > if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set > > sg->dma_length, > > but virtio core use sg->length. > > Well, we just need a minor tweak to get the length from > vring_map_one_sg(), then everything should be fine? > > if (sg_is_premapped) { > *addr = sg_dma_address(sg); > *len = sg_dma_len(sg); > } For now, let us start from: if (sg_is_premapped) { *addr = sg_dma_address(sg); sg->length = sg_dma_len(sg); } Then virtio core needs to be refactor to use dma_map_sg() in future. Thanks. > > > > > For sg->dma_address, it is ok for me to use sg_dma_address or not. > > But for consistency to sg->length, I use the sg->dma_address directly. > > > > I noticed this is special, so I put them into an independent function. > > > > Thanks. > > Actually, the code like sg_fill_dma() calls for a virtqueue dma > mapping helper, I think we've agreed that core needs to hide DMA > details from the driver. That is something like > virtqueue_dma_map_sg() etc. > > Thanks > > > > > > > > > Others look good. 
> > > > > > Thanks > > > > > > > +} > > > > + > > > > static void __free_old_xmit(struct send_queue *sq, struct netdev_queue > > > > *txq, > > > > bool in_napi, struct virtnet_sq_free_stats > > > > *stats) > > > > { > > > > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct > > > > receive_queue *rq, void *buf, u32 len) > > > > addr = dma->addr - sizeof(*dma) + offset; > > > > > > > > sg_init_table(rq->sg, 1); > > > > - rq->sg[0].dma_address = addr; > > > > - rq->sg[0].length = len; > > > > + sg_fill_dma(rq->sg, addr, len); > > > > } > > > > > > > > static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, > > > > gfp_t gfp) > > > > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct > > > > virtnet_info *vi, > > > > } > > > > } > > > > > > > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 > > > > len) > > > > -{ > > > > - sg->dma_address = addr; > > > > - sg->length = len; > > > > -} > > > > - > > > > static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi, > > > >struct receive_queue *rq, void *buf, > > > > u32 len) > > > > { > > > > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct > > > > virtnet_info *vi, struct receive_queue > > > > sg_init_table(rq->sg, 1); > > > > sg_fill_dma(rq->sg, addr, len); > > > > > > > > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, > > > > xsk_buffs[i], gfp); > > > > + err = virtqueue_ad
Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
On Tue, Nov 5, 2024 at 3:23 PM Xuan Zhuo wrote: > > On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang wrote: > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo > > wrote: > > > > > > virtio-net rq submits premapped per-buffer by setting sg page to NULL; > > > > > > Signed-off-by: Xuan Zhuo > > > --- > > > drivers/net/virtio_net.c | 24 +--- > > > 1 file changed, 13 insertions(+), 11 deletions(-) > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > > index 792e9eadbfc3..09757fa408bd 100644 > > > --- a/drivers/net/virtio_net.c > > > +++ b/drivers/net/virtio_net.c > > > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr) > > > return (struct sk_buff *)((unsigned long)ptr & > > > ~VIRTIO_ORPHAN_FLAG); > > > } > > > > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > > > +{ > > > + sg->dma_address = addr; > > > + sg->length = len; > > > > This may work but I think it's better to reuse existing dma sg helpers like: > > > > sg_dma_address(sg) = addr; > > sg_dma_length(sg) = len; > > > > And we probably need to fix the virtio core which only uses > > sg_dma_address() but not sg_dma_length(). > > > > This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set. > > > I don't think so. > > For no-premapped mode, we pass the sg as no-dma sg to virtio core, > so the virtio core uses the sg->length directly. This is fine. > If virtio core do dma map for sg, we do not use the dma_mag_sg_attrs(), > so we must use sg->length directly. I meant it's a hack. It may work now but will be a bug in the future. For example, I'm playing a prototype to do pre mapping for virtio-blk, the idea is to move the expensive DMA mappings in the case of swiotlb etc to be done outside the pre virtqueue lock. In that case, the driver may want to use dma_map_sg() instead of dma_map_page(). I'd suppose we will finally go with the way where DMA mappings needs to be handled by the driver, and dma_map_sg() is faster than per sg dma_map_page() anyhow. > > In this case, for the driver, we can not use sg_dma_length(), > if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set sg->dma_length, > but virtio core use sg->length. Well, we just need a minor tweak to get the length from vring_map_one_sg(), then everything should be fine? if (sg_is_premapped) { *addr = sg_dma_address(sg); *len = sg_dma_len(sg); } > > For sg->dma_address, it is ok for me to use sg_dma_address or not. > But for consistency to sg->length, I use the sg->dma_address directly. > > I noticed this is special, so I put them into an independent function. > > Thanks. Actually, the code like sg_fill_dma() calls for a virtqueue dma mapping helper, I think we've agreed that core needs to hide DMA details from the driver. That is something like virtqueue_dma_map_sg() etc. Thanks > > > > > Others look good. 
> > > > Thanks > > > > > +} > > > + > > > static void __free_old_xmit(struct send_queue *sq, struct netdev_queue > > > *txq, > > > bool in_napi, struct virtnet_sq_free_stats > > > *stats) > > > { > > > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct > > > receive_queue *rq, void *buf, u32 len) > > > addr = dma->addr - sizeof(*dma) + offset; > > > > > > sg_init_table(rq->sg, 1); > > > - rq->sg[0].dma_address = addr; > > > - rq->sg[0].length = len; > > > + sg_fill_dma(rq->sg, addr, len); > > > } > > > > > > static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t > > > gfp) > > > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct > > > virtnet_info *vi, > > > } > > > } > > > > > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > > > -{ > > > - sg->dma_address = addr; > > > - sg->length = len; > > > -} > > > - > > > static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi, > > >struct receive_queue *rq, void *buf, > > > u32 len) > > > { > > > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct > > > virtnet_info *vi, struct receive_queue > > > sg_init_table(rq->sg, 1); > > > sg_fill_dma(rq->sg, addr, len); > > > > > > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, > > > xsk_buffs[i], gfp); > > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, > > > xsk_buffs[i], > > > + NULL, true, gfp); > > > if (err) > > > goto err; > > > } > > > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info > > > *vi, struct receive_queue *rq, > > > > > > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN); > > > > > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp); > > > + err = virtqueue_add_inbuf_premapped(rq-
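The distinction at the heart of this exchange comes from the scatterlist helpers (paraphrased from include/linux/scatterlist.h):

    #ifdef CONFIG_NEED_SG_DMA_LENGTH
    #define sg_dma_len(sg)          ((sg)->dma_length)      /* separate field */
    #else
    #define sg_dma_len(sg)          ((sg)->length)          /* aliases sg->length */
    #endif

So if a driver fills in sg_dma_len() while the virtio core keeps reading sg->length, the two silently diverge on configurations where CONFIG_NEED_SG_DMA_LENGTH is set, which is the future bug Jason is warning about.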
Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
On Tue, Nov 5, 2024 at 2:53 PM Xuan Zhuo wrote: > > On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang wrote: > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo > > wrote: > > > > > > The subsequent commit needs to know whether every indirect buffer is > > > premapped or not. So we need to introduce an extra struct for every > > > indirect buffer to record this info. > > > > > > Signed-off-by: Xuan Zhuo > > > --- > > > drivers/virtio/virtio_ring.c | 112 --- > > > 1 file changed, 52 insertions(+), 60 deletions(-) > > > > Do we have a performance impact for this patch? > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > > index 97590c201aa2..dca093744fe1 100644 > > > --- a/drivers/virtio/virtio_ring.c > > > +++ b/drivers/virtio/virtio_ring.c > > > @@ -69,7 +69,11 @@ > > > > > > struct vring_desc_state_split { > > > void *data; /* Data for callback. */ > > > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ > > > + > > > + /* Indirect extra table and desc table, if any. These two will be > > > +* allocated together. So we won't stress more to the memory > > > allocator. > > > +*/ > > > + struct vring_desc *indir_desc; > > > > So it looks like we put a descriptor table after the extra table. Can > > this lead to more crossing page mappings for the indirect descriptors? > > > > If yes, it seems expensive so we probably need to make the descriptor > > table come first. > > No, the descriptors are before extra table. Well, you need then tweak the above comment, it said "Indirect extra table and desc table". > So, there is not performance impact. > > > > > > > }; > > > [...] > > > while (vq->split.vring.desc[i].flags & nextflag) { > > > - vring_unmap_one_split(vq, i); > > > + vring_unmap_one_split(vq, &extra[i]); > > > > Not sure if I've asked this before. But this part seems to deserve an > > independent fix for -stable. > > What fix? I meant for hardening we need to check the flags stored in the extra instead of the descriptor itself as it could be mangled by the device. Thanks
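For reference, the hardening Jason mentions would make the detach loop walk the chain using the driver-private extra array instead of re-reading desc[i].flags from the ring, which the device can rewrite. Rough shape only, not the actual detach_buf_split() code:

    i = head;
    /* Trust only driver-private state: a device that mangles
     * desc[i].flags can no longer extend or cut the unmap loop.
     */
    while (extra[i].flags & VRING_DESC_F_NEXT) {
            vring_unmap_one_split(vq, &extra[i]);
            i = extra[i].next;
            vq->vq.num_free++;
    }
    vring_unmap_one_split(vq, &extra[i]);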
Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default
Hello: This series was applied to netdev/net-next.git (main) by Paolo Abeni : On Tue, 29 Oct 2024 16:46:11 +0800 you wrote: > v1: > 1. fix some small problems > 2. remove commit "virtio_net: introduce vi->mode" > > In the last linux version, we disabled this feature to fix the > regress[1]. > > [...] Here is the summary with links: - [net-next,v1,1/4] virtio-net: fix overflow inside virtnet_rq_alloc https://git.kernel.org/netdev/net-next/c/6aacd1484468 - [net-next,v1,2/4] virtio_net: big mode skip the unmap check https://git.kernel.org/netdev/net-next/c/a33f3df85075 - [net-next,v1,3/4] virtio_net: enable premapped mode for merge and small by default https://git.kernel.org/netdev/net-next/c/47008bb51c3e - [net-next,v1,4/4] virtio_net: rx remove premapped failover code https://git.kernel.org/netdev/net-next/c/fb22437c1ba3 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH net-next v1 4/4] virtio_net: rx remove premapped failover code
On Tue, Oct 29, 2024 at 4:47 PM Xuan Zhuo wrote: > > Now, the premapped mode can be enabled unconditionally. > > So we can remove the failover code for merge and small mode. > > The virtnet_rq_xxx() helper would be only used if the mode is using pre > mapping. A check is added to prevent misusing of these API. > > Tested-by: Darren Kenny > Signed-off-by: Xuan Zhuo Acked-by: Jason Wang Thanks
Re: [PATCH net-next v1 1/4] virtio-net: fix overflow inside virtnet_rq_alloc
On Tue, Oct 29, 2024 at 4:47 PM Xuan Zhuo wrote: > > When the frag just got a page, then may lead to regression on VM. > Specially if the sysctl net.core.high_order_alloc_disable value is 1, > then the frag always get a page when do refill. > > Which could see reliable crashes or scp failure (scp a file 100M in size > to VM). > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning > of a new frag. When the frag size is larger than PAGE_SIZE, > everything is fine. However, if the frag is only one page and the > total size of the buffer and virtnet_rq_dma is larger than one page, an > overflow may occur. > > The commit f9dac92ba908 ("virtio_ring: enable premapped mode whatever > use_dma_api") introduced this problem. And we reverted some commits to > fix this in last linux version. Now we try to enable it and fix this > bug directly. > > Here, when the frag size is not enough, we reduce the buffer len to fix > this problem. > > Reported-by: "Si-Wei Liu" > Tested-by: Darren Kenny > Signed-off-by: Xuan Zhuo Acked-by: Jason Wang Thanks
Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang wrote: > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo wrote: > > > > The subsequent commit needs to know whether every indirect buffer is > > premapped or not. So we need to introduce an extra struct for every > > indirect buffer to record this info. > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/virtio/virtio_ring.c | 112 --- > > 1 file changed, 52 insertions(+), 60 deletions(-) > > Do we have a performance impact for this patch? > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > index 97590c201aa2..dca093744fe1 100644 > > --- a/drivers/virtio/virtio_ring.c > > +++ b/drivers/virtio/virtio_ring.c > > @@ -69,7 +69,11 @@ > > > > struct vring_desc_state_split { > > void *data; /* Data for callback. */ > > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ > > + > > + /* Indirect extra table and desc table, if any. These two will be > > +* allocated together. So we won't stress more to the memory > > allocator. > > +*/ > > + struct vring_desc *indir_desc; > > So it looks like we put a descriptor table after the extra table. Can > this lead to more crossing page mappings for the indirect descriptors? > > If yes, it seems expensive so we probably need to make the descriptor > table come first. No, the descriptors are before extra table. So, there is not performance impact. > > > }; > > > > struct vring_desc_state_packed { > > @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue > > *vq, u32 num) > > * Split ring specific functions - *_split(). > > */ > > > > -static void vring_unmap_one_split_indirect(const struct vring_virtqueue > > *vq, > > - const struct vring_desc *desc) > > -{ > > - u16 flags; > > - > > - if (!vring_need_unmap_buffer(vq)) > > - return; > > - > > - flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); > > - > > - dma_unmap_page(vring_dma_dev(vq), > > - virtio64_to_cpu(vq->vq.vdev, desc->addr), > > - virtio32_to_cpu(vq->vq.vdev, desc->len), > > - (flags & VRING_DESC_F_WRITE) ? > > - DMA_FROM_DEVICE : DMA_TO_DEVICE); > > -} > > - > > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, > > - unsigned int i) > > + struct vring_desc_extra *extra) > > { > > - struct vring_desc_extra *extra = vq->split.desc_extra; > > u16 flags; > > > > - flags = extra[i].flags; > > + flags = extra->flags; > > > > if (flags & VRING_DESC_F_INDIRECT) { > > if (!vq->use_dma_api) > > goto out; > > > > dma_unmap_single(vring_dma_dev(vq), > > -extra[i].addr, > > -extra[i].len, > > +extra->addr, > > +extra->len, > > (flags & VRING_DESC_F_WRITE) ? > > DMA_FROM_DEVICE : DMA_TO_DEVICE); > > } else { > > @@ -479,22 +465,23 @@ static unsigned int vring_unmap_one_split(const > > struct vring_virtqueue *vq, > > goto out; > > > > dma_unmap_page(vring_dma_dev(vq), > > - extra[i].addr, > > - extra[i].len, > > + extra->addr, > > + extra->len, > >(flags & VRING_DESC_F_WRITE) ? 
> >DMA_FROM_DEVICE : DMA_TO_DEVICE); > > } > > > > out: > > - return extra[i].next; > > + return extra->next; > > } > > > > static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, > >unsigned int total_sg, > >gfp_t gfp) > > { > > + struct vring_desc_extra *extra; > > struct vring_desc *desc; > > - unsigned int i; > > + unsigned int i, size; > > > > /* > > * We require lowmem mappings for the descriptors because > > @@ -503,40 +490,41 @@ static struct vring_desc *alloc_indirect_split(struct > > virtqueue *_vq, > > */ > > gfp &= ~__GFP_HIGHMEM; > > > > - desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp); > > + size = sizeof(*desc) * total_sg + sizeof(*extra) * total_sg; > > + > > + desc = kmalloc(size, gfp); > > if (!desc) > > return NULL; > > > > + extra = (struct vring_desc_extra *)&desc[total_sg]; > > + > > for (i = 0; i < total_sg; i++) > > - desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1); >
Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang wrote: > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo wrote: > > > > virtio-net rq submits premapped per-buffer by setting sg page to NULL; > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/net/virtio_net.c | 24 +--- > > 1 file changed, 13 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index 792e9eadbfc3..09757fa408bd 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr) > > return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG); > > } > > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > > +{ > > + sg->dma_address = addr; > > + sg->length = len; > > This may work but I think it's better to reuse existing dma sg helpers like: > > sg_dma_address(sg) = addr; > sg_dma_length(sg) = len; > > And we probably need to fix the virtio core which only uses > sg_dma_address() but not sg_dma_length(). > > This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set. I don't think so. For no-premapped mode, we pass the sg as no-dma sg to virtio core, so the virtio core uses the sg->length directly. If virtio core do dma map for sg, we do not use the dma_mag_sg_attrs(), so we must use sg->length directly. In this case, for the driver, we can not use sg_dma_length(), if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set sg->dma_length, but virtio core use sg->length. For sg->dma_address, it is ok for me to use sg_dma_address or not. But for consistency to sg->length, I use the sg->dma_address directly. I noticed this is special, so I put them into an independent function. Thanks. > > Others look good. > > Thanks > > > +} > > + > > static void __free_old_xmit(struct send_queue *sq, struct netdev_queue > > *txq, > > bool in_napi, struct virtnet_sq_free_stats > > *stats) > > { > > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue > > *rq, void *buf, u32 len) > > addr = dma->addr - sizeof(*dma) + offset; > > > > sg_init_table(rq->sg, 1); > > - rq->sg[0].dma_address = addr; > > - rq->sg[0].length = len; > > + sg_fill_dma(rq->sg, addr, len); > > } > > > > static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t > > gfp) > > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct > > virtnet_info *vi, > > } > > } > > > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > > -{ > > - sg->dma_address = addr; > > - sg->length = len; > > -} > > - > > static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi, > >struct receive_queue *rq, void *buf, u32 > > len) > > { > > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct > > virtnet_info *vi, struct receive_queue > > sg_init_table(rq->sg, 1); > > sg_fill_dma(rq->sg, addr, len); > > > > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], > > gfp); > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, > > xsk_buffs[i], > > + NULL, true, gfp); > > if (err) > > goto err; > > } > > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, > > struct receive_queue *rq, > > > > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN); > > > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp); > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, > > + rq->do_dma, gfp); > > if (err < 0) { > > if (rq->do_dma) > > virtnet_rq_unmap(rq, buf, 0); > > @@ 
-2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info > > *vi, > > virtnet_rq_init_one_sg(rq, buf, len); > > > > ctx = mergeable_len_to_ctx(len + room, headroom); > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp); > > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, > > + rq->do_dma, gfp); > > if (err < 0) { > > if (rq->do_dma) > > virtnet_rq_unmap(rq, buf, 0); > > -- > > 2.32.0.3.g01195cf9f > > >
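The layout behind this argument, roughly (after include/linux/scatterlist.h, optional fields omitted): with CONFIG_NEED_SG_DMA_LENGTH=y, sg_dma_len() aliases a separate dma_length field, while the virtio core currently consumes sg->length.

    struct scatterlist {
            unsigned long   page_link;
            unsigned int    offset;
            unsigned int    length;         /* what the virtio core reads today */
            dma_addr_t      dma_address;
    #ifdef CONFIG_NEED_SG_DMA_LENGTH
            unsigned int    dma_length;     /* what sg_dma_len() maps to here */
    #endif
            /* other optional fields omitted */
    };

    /* So on such configs:
     *      sg_dma_len(sg) = len;   writes sg->dma_length,
     * and sg->length stays stale unless the core is also switched over
     * to sg_dma_len() for premapped buffers.
     */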
Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default
Hi Jakub: On Tue, Nov 5, 2024 at 10:46 AM Jakub Kicinski wrote: > > On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote: > > In the last linux version, we disabled this feature to fix the > > regress[1]. > > > > The patch set is try to fix the problem and re-enable it. > > > > More info: > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > Sorry to ping, Michael, Jason we're waiting to hear from you on > this one. > Will review this today. Thanks
Re: [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo wrote: > > Now, this API is useless. remove it. > > Signed-off-by: Xuan Zhuo Acked-by: Jason Wang Thanks
Re: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo wrote: > > Two APIs are introduced to submit premapped per-buffers. > > int virtqueue_add_inbuf_premapped(struct virtqueue *vq, > struct scatterlist *sg, unsigned int num, > void *data, > void *ctx, > bool premapped, > gfp_t gfp); > > int virtqueue_add_outbuf_premapped(struct virtqueue *vq, > struct scatterlist *sg, unsigned int num, > void *data, > bool premapped, > gfp_t gfp); > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 48 > include/linux/virtio.h | 13 ++ > 2 files changed, 61 insertions(+) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index a89295b79e66..525308d82728 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -2272,6 +2272,29 @@ int virtqueue_add_outbuf(struct virtqueue *vq, > } > EXPORT_SYMBOL_GPL(virtqueue_add_outbuf); > > +/** > + * virtqueue_add_outbuf_premapped - expose output buffers to other end > + * @vq: the struct virtqueue we're talking about. > + * @sg: scatterlist (must be well-formed and terminated!) > + * @num: the number of entries in @sg readable by other side > + * @data: the token identifying the buffer. > + * @gfp: how to do memory allocations (if necessary). > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_outbuf_premapped(struct virtqueue *vq, > + struct scatterlist *sg, unsigned int num, > + void *data, > + bool premapped, We don't need this parameter consider: 1) we've already had virtqueue_add_outbuf() which implies the buf has been mapped 2) no explanation for "premapped" in the function doc Thanks > + gfp_t gfp) > +{ > + return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, premapped, gfp); > +} > +EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_premapped); > + > /** > * virtqueue_add_inbuf - expose input buffers to other end > * @vq: the struct virtqueue we're talking about. > @@ -2318,6 +2341,31 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq, > } > EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx); > > +/** > + * virtqueue_add_inbuf_premapped - expose input buffers to other end > + * @vq: the struct virtqueue we're talking about. > + * @sg: scatterlist (must be well-formed and terminated!) > + * @num: the number of entries in @sg writable by other side > + * @data: the token identifying the buffer. > + * @ctx: extra context for the token > + * @gfp: how to do memory allocations (if necessary). > + * > + * Caller must ensure we don't call this with other virtqueue operations > + * at the same time (except where noted). > + * > + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). > + */ > +int virtqueue_add_inbuf_premapped(struct virtqueue *vq, > + struct scatterlist *sg, unsigned int num, > + void *data, > + void *ctx, > + bool premapped, > + gfp_t gfp) > +{ > + return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, premapped, gfp); > +} > +EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_premapped); > + > /** > * virtqueue_dma_dev - get the dma dev > * @_vq: the struct virtqueue we're talking about. 
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h > index 306137a15d07..19afa49b92d0 100644 > --- a/include/linux/virtio.h > +++ b/include/linux/virtio.h > @@ -56,6 +56,19 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq, > void *ctx, > gfp_t gfp); > > +int virtqueue_add_inbuf_premapped(struct virtqueue *vq, > + struct scatterlist *sg, unsigned int num, > + void *data, > + void *ctx, > + bool premapped, > + gfp_t gfp); > + > +int virtqueue_add_outbuf_premapped(struct virtqueue *vq, > + struct scatterlist *sg, unsigned int num, > + void *data, > + bool premapped, > + gfp_t gfp); > + > int virtqueue_add_sgs(struct virtqueue *vq, > struct scatterlist *sgs[], > unsigned int out_sgs, > -- > 2.32.0.3.g01195cf9f >
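A short usage sketch of the new API, mirroring how the virtio-net patches in this series call it for premapped receive buffers. The mapping helper shown and the local names (vq, buf, len) are placeholders; only the virtqueue_add_inbuf_premapped() call itself comes from this patch.

    struct scatterlist sg;
    dma_addr_t addr;
    int err;

    /* Driver maps the buffer itself; any DMA mapping path works, e.g.
     * the existing virtqueue_dma_map_single_attrs() helper.
     */
    addr = virtqueue_dma_map_single_attrs(vq, buf, len, DMA_FROM_DEVICE, 0);
    if (virtqueue_dma_mapping_error(vq, addr))
            return -ENOMEM;

    sg_init_table(&sg, 1);
    sg.dma_address = addr;  /* premapped address */
    sg.length = len;        /* length the core will use */

    err = virtqueue_add_inbuf_premapped(vq, &sg, 1, buf, NULL,
                                        true /* premapped */, GFP_ATOMIC);
    if (err)
            virtqueue_dma_unmap_single_attrs(vq, addr, len,
                                             DMA_FROM_DEVICE, 0);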
Re: [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo wrote: > > The current configuration sets the virtqueue (vq) to premapped mode, > implying that all buffers submitted to this queue must be mapped ahead > of time. This presents a challenge for the virtnet send queue (sq): the > virtnet driver would be required to keep track of dma information for vq > size * 17, Even worse as MAX_SKB_FRAGS could be even larger. > which can be substantial. However, if the premapped mode were > applied on a per-buffer basis, the complexity would be greatly reduced. > With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel > skb buffers could remain unmapped. > > And consider that some sgs are not generated by the virtio driver, > that may be passed from the block stack. So we can not change the > sgs, new APIs are the better way. I had some new thoughts on this. Pre-mapping makes a lot of sense as it allows the us to move the expensive DMA mapping operations (swiotlb, IOMMU or VDUSE) out of the per-vq lock. I wonder what would we do if someday we want to convert the skb TX to be premapped (or even all virtio drivers). Considering we've already used skb_to_sgvec() in start_xmit() it looks like we need to allocate sg[queue_num][MAX_SKB_FRAGS + 2] and store the dma mapping there. Then we probably don't even need to duplicate the dma mapping information in the core. It means it's the driver's charge to store the dma information via sg, so we can simply use dma_map_sg() in add_sgs() which allows various optimizations in IOMMU layers. > > So we pass the new argument 'premapped' to indicate the buffers > submitted to virtio are premapped in advance. Additionally, > DMA unmap operations for these buffers will be bypassed. > > Suggested-by: Jason Wang > Signed-off-by: Xuan Zhuo Anyhow Acked-by: Jason Wang Thanks > --- > drivers/virtio/virtio_ring.c | 79 ++-- > 1 file changed, 40 insertions(+), 39 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 628e01af1c9a..a89295b79e66 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -300,9 +300,10 @@ static bool vring_use_dma_api(const struct virtio_device > *vdev) > return false; > } > > -static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring) > +static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring, > + const struct vring_desc_extra *extra) > { > - return vring->use_dma_api && !vring->premapped; > + return vring->use_dma_api && (extra->addr != DMA_MAPPING_ERROR); > } > > size_t virtio_max_dma_size(const struct virtio_device *vdev) > @@ -372,9 +373,10 @@ static struct device *vring_dma_dev(const struct > vring_virtqueue *vq) > > /* Map one sg entry. */ > static int vring_map_one_sg(const struct vring_virtqueue *vq, struct > scatterlist *sg, > - enum dma_data_direction direction, dma_addr_t > *addr) > + enum dma_data_direction direction, dma_addr_t > *addr, > + bool premapped) > { > - if (vq->premapped) { > + if (premapped) { > *addr = sg_dma_address(sg); > return 0; > } > @@ -465,7 +467,7 @@ static unsigned int vring_unmap_one_split(const struct > vring_virtqueue *vq, > (flags & VRING_DESC_F_WRITE) ? 
> DMA_FROM_DEVICE : DMA_TO_DEVICE); > } else { > - if (!vring_need_unmap_buffer(vq)) > + if (!vring_need_unmap_buffer(vq, extra)) > goto out; > > dma_unmap_page(vring_dma_dev(vq), > @@ -514,7 +516,7 @@ static inline unsigned int > virtqueue_add_desc_split(struct virtqueue *vq, > unsigned int i, > dma_addr_t addr, > unsigned int len, > - u16 flags) > + u16 flags, bool premapped) > { > u16 next; > > @@ -522,7 +524,7 @@ static inline unsigned int > virtqueue_add_desc_split(struct virtqueue *vq, > desc[i].addr = cpu_to_virtio64(vq->vdev, addr); > desc[i].len = cpu_to_virtio32(vq->vdev, len); > > - extra[i].addr = addr; > + extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr; > extra[i].len = len; > extra[i].flags = flags; > > @@ -540,6 +542,7 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > unsigned int in_sgs, > void *data, > void *ctx, > + bool premapped, > gfp_t gfp) > { > s
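Very roughly, the direction Jason sketches above for a premapped TX path: the mapping is done with dma_map_sg() before the virtqueue lock is taken. Nothing below is in the posted series; sq->sg, hdr, hdr_len, dma_dev and the error handling are made-up placeholders.

    struct scatterlist *sg = sq->sg;        /* MAX_SKB_FRAGS + 2 entries */
    int num, nents, mapped, err;

    sg_init_table(sg, MAX_SKB_FRAGS + 2);
    sg_set_buf(sg, hdr, hdr_len);           /* virtio_net_hdr first */
    num = skb_to_sgvec(skb, sg + 1, 0, skb->len);
    if (unlikely(num < 0))
            return num;
    nents = num + 1;

    /* The expensive part (swiotlb/IOMMU/VDUSE) happens outside the vq
     * lock, and dma_map_sg() lets the IOMMU layer coalesce entries.
     */
    mapped = dma_map_sg(dma_dev, sg, nents, DMA_TO_DEVICE);
    if (unlikely(!mapped))
            return -ENOMEM;

    /* Under the vq lock the core only builds descriptors.  This is also
     * why the core would then need to read sg_dma_len(), not sg->length.
     */
    err = virtqueue_add_outbuf_premapped(sq->vq, sg, mapped, skb,
                                         true /* premapped */, GFP_ATOMIC);
    if (unlikely(err))
            dma_unmap_sg(dma_dev, sg, nents, DMA_TO_DEVICE);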
Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo wrote: > > The subsequent commit needs to know whether every indirect buffer is > premapped or not. So we need to introduce an extra struct for every > indirect buffer to record this info. > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 112 --- > 1 file changed, 52 insertions(+), 60 deletions(-) Do we have a performance impact for this patch? > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 97590c201aa2..dca093744fe1 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -69,7 +69,11 @@ > > struct vring_desc_state_split { > void *data; /* Data for callback. */ > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ > + > + /* Indirect extra table and desc table, if any. These two will be > +* allocated together. So we won't stress more to the memory > allocator. > +*/ > + struct vring_desc *indir_desc; So it looks like we put a descriptor table after the extra table. Can this lead to more crossing page mappings for the indirect descriptors? If yes, it seems expensive so we probably need to make the descriptor table come first. > }; > > struct vring_desc_state_packed { > @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue *vq, > u32 num) > * Split ring specific functions - *_split(). > */ > > -static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq, > - const struct vring_desc *desc) > -{ > - u16 flags; > - > - if (!vring_need_unmap_buffer(vq)) > - return; > - > - flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); > - > - dma_unmap_page(vring_dma_dev(vq), > - virtio64_to_cpu(vq->vq.vdev, desc->addr), > - virtio32_to_cpu(vq->vq.vdev, desc->len), > - (flags & VRING_DESC_F_WRITE) ? > - DMA_FROM_DEVICE : DMA_TO_DEVICE); > -} > - > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, > - unsigned int i) > + struct vring_desc_extra *extra) > { > - struct vring_desc_extra *extra = vq->split.desc_extra; > u16 flags; > > - flags = extra[i].flags; > + flags = extra->flags; > > if (flags & VRING_DESC_F_INDIRECT) { > if (!vq->use_dma_api) > goto out; > > dma_unmap_single(vring_dma_dev(vq), > -extra[i].addr, > -extra[i].len, > +extra->addr, > +extra->len, > (flags & VRING_DESC_F_WRITE) ? > DMA_FROM_DEVICE : DMA_TO_DEVICE); > } else { > @@ -479,22 +465,23 @@ static unsigned int vring_unmap_one_split(const struct > vring_virtqueue *vq, > goto out; > > dma_unmap_page(vring_dma_dev(vq), > - extra[i].addr, > - extra[i].len, > + extra->addr, > + extra->len, >(flags & VRING_DESC_F_WRITE) ? 
>DMA_FROM_DEVICE : DMA_TO_DEVICE); > } > > out: > - return extra[i].next; > + return extra->next; > } > > static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, >unsigned int total_sg, >gfp_t gfp) > { > + struct vring_desc_extra *extra; > struct vring_desc *desc; > - unsigned int i; > + unsigned int i, size; > > /* > * We require lowmem mappings for the descriptors because > @@ -503,40 +490,41 @@ static struct vring_desc *alloc_indirect_split(struct > virtqueue *_vq, > */ > gfp &= ~__GFP_HIGHMEM; > > - desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp); > + size = sizeof(*desc) * total_sg + sizeof(*extra) * total_sg; > + > + desc = kmalloc(size, gfp); > if (!desc) > return NULL; > > + extra = (struct vring_desc_extra *)&desc[total_sg]; > + > for (i = 0; i < total_sg; i++) > - desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1); > + extra[i].next = i + 1; > + > return desc; > } > > static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq, > struct vring_desc *desc, > + struct vring_desc_extra > *extra, > unsigned int i, >
Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo wrote: > > virtio-net rq submits premapped per-buffer by setting sg page to NULL; > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio_net.c | 24 +--- > 1 file changed, 13 insertions(+), 11 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 792e9eadbfc3..09757fa408bd 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr) > return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG); > } > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > +{ > + sg->dma_address = addr; > + sg->length = len; This may work but I think it's better to reuse existing dma sg helpers like: sg_dma_address(sg) = addr; sg_dma_length(sg) = len; And we probably need to fix the virtio core which only uses sg_dma_address() but not sg_dma_length(). This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set. Others look good. Thanks > +} > + > static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq, > bool in_napi, struct virtnet_sq_free_stats *stats) > { > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue > *rq, void *buf, u32 len) > addr = dma->addr - sizeof(*dma) + offset; > > sg_init_table(rq->sg, 1); > - rq->sg[0].dma_address = addr; > - rq->sg[0].length = len; > + sg_fill_dma(rq->sg, addr, len); > } > > static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp) > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct > virtnet_info *vi, > } > } > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > -{ > - sg->dma_address = addr; > - sg->length = len; > -} > - > static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi, >struct receive_queue *rq, void *buf, u32 > len) > { > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info > *vi, struct receive_queue > sg_init_table(rq->sg, 1); > sg_fill_dma(rq->sg, addr, len); > > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], > gfp); > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, > xsk_buffs[i], > + NULL, true, gfp); > if (err) > goto err; > } > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, > struct receive_queue *rq, > > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN); > > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp); > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, > + rq->do_dma, gfp); > if (err < 0) { > if (rq->do_dma) > virtnet_rq_unmap(rq, buf, 0); > @@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info > *vi, > virtnet_rq_init_one_sg(rq, buf, len); > > ctx = mergeable_len_to_ctx(len + room, headroom); > - err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp); > + err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx, > + rq->do_dma, gfp); > if (err < 0) { > if (rq->do_dma) > virtnet_rq_unmap(rq, buf, 0); > -- > 2.32.0.3.g01195cf9f >
Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default
On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote: > In the last linux version, we disabled this feature to fix the > regress[1]. > > The patch set is try to fix the problem and re-enable it. > > More info: > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com Sorry to ping, Michael, Jason we're waiting to hear from you on this one.
Re: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped
Hi Xuan, kernel test robot noticed the following build warnings: [auto build test WARNING on net-next/main] url: https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/virtio_ring-introduce-vring_need_unmap_buffer/20241030-162739 base: net-next/main patch link: https://lore.kernel.org/r/20241030082453.97310-6-xuanzhuo%40linux.alibaba.com patch subject: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped config: x86_64-buildonly-randconfig-003-20241031 (https://download.01.org/0day-ci/archive/20241031/202410310925.lucycrtj-...@intel.com/config) compiler: clang version 19.1.2 (https://github.com/llvm/llvm-project 7ba7d8e2f7b6445b60679da826210cdde29eaf8b) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241031/202410310925.lucycrtj-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202410310925.lucycrtj-...@intel.com/ All warnings (new ones prefixed by >>): >> drivers/virtio/virtio_ring.c:2293: warning: Function parameter or struct >> member 'premapped' not described in 'virtqueue_add_outbuf_premapped' >> drivers/virtio/virtio_ring.c:2364: warning: Function parameter or struct >> member 'premapped' not described in 'virtqueue_add_inbuf_premapped' vim +2293 drivers/virtio/virtio_ring.c 2274 2275 /** 2276 * virtqueue_add_outbuf_premapped - expose output buffers to other end 2277 * @vq: the struct virtqueue we're talking about. 2278 * @sg: scatterlist (must be well-formed and terminated!) 2279 * @num: the number of entries in @sg readable by other side 2280 * @data: the token identifying the buffer. 2281 * @gfp: how to do memory allocations (if necessary). 2282 * 2283 * Caller must ensure we don't call this with other virtqueue operations 2284 * at the same time (except where noted). 2285 * 2286 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). 2287 */ 2288 int virtqueue_add_outbuf_premapped(struct virtqueue *vq, 2289 struct scatterlist *sg, unsigned int num, 2290 void *data, 2291 bool premapped, 2292 gfp_t gfp) > 2293 { 2294 return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, premapped, gfp); 2295 } 2296 EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_premapped); 2297 2298 /** 2299 * virtqueue_add_inbuf - expose input buffers to other end 2300 * @vq: the struct virtqueue we're talking about. 2301 * @sg: scatterlist (must be well-formed and terminated!) 2302 * @num: the number of entries in @sg writable by other side 2303 * @data: the token identifying the buffer. 2304 * @gfp: how to do memory allocations (if necessary). 2305 * 2306 * Caller must ensure we don't call this with other virtqueue operations 2307 * at the same time (except where noted). 2308 * 2309 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). 2310 */ 2311 int virtqueue_add_inbuf(struct virtqueue *vq, 2312 struct scatterlist *sg, unsigned int num, 2313 void *data, 2314 gfp_t gfp) 2315 { 2316 return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, false, gfp); 2317 } 2318 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf); 2319 2320 /** 2321 * virtqueue_add_inbuf_ctx - expose input buffers to other end 2322 * @vq: the struct virtqueue we're talking about. 2323 * @sg: scatterlist (must be well-formed and terminated!) 2324 * @num: the number of entries in @sg writable by other side 2325 * @data: the token identifying the buffer. 
2326 * @ctx: extra context for the token 2327 * @gfp: how to do memory allocations (if necessary). 2328 * 2329 * Caller must ensure we don't call this with other virtqueue operations 2330 * at the same time (except where noted). 2331 * 2332 * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO). 2333 */ 2334 int virtqueue_add_inbuf_ctx(struct virtqueue *vq, 2335 struct scatterlist *sg, unsigned int num, 2336 void *data, 2337 void *ctx, 2338 gfp_t gfp) 2339 { 2340 return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, false, gfp); 2341 } 2342 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx); 2343 2344 /** 2345 * virtqueue_add_inbuf_premapped - expose input buffers to other end 2346 * @vq: the struct virtqueue we're talking about. 2347 * @sg: scatterlist (must be well-formed and terminated!) 2348 * @num: the number of entries in @sg writable by other side 2349 * @data: the token
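The robot warning only asks for the missing parameter description; the fix is to add a @premapped line to both kernel-doc blocks, along these lines (wording is a suggestion, not from the series):

     * @data: the token identifying the buffer.
     * @ctx: extra context for the token
     * @premapped: the buffers in @sg are already DMA-mapped by the driver;
     *             the virtio core must not map or unmap them itself.
     * @gfp: how to do memory allocations (if necessary).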
Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc
On Fri, Oct 25, 2024 at 10:35:53AM +0800, Xuan Zhuo wrote: > On Thu, 17 Oct 2024 15:42:59 +0200, Paolo Abeni wrote: > > > > > > On 10/14/24 05:12, Xuan Zhuo wrote: > > > When the frag just got a page, then may lead to regression on VM. > > > Specially if the sysctl net.core.high_order_alloc_disable value is 1, > > > then the frag always get a page when do refill. > > > > > > Which could see reliable crashes or scp failure (scp a file 100M in size > > > to VM): > > > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning > > > of a new frag. When the frag size is larger than PAGE_SIZE, > > > everything is fine. However, if the frag is only one page and the > > > total size of the buffer and virtnet_rq_dma is larger than one page, an > > > overflow may occur. > > > > > > Here, when the frag size is not enough, we reduce the buffer len to fix > > > this problem. > > > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever > > > use_dma_api") > > > Reported-by: "Si-Wei Liu" > > > Signed-off-by: Xuan Zhuo > > > > This looks like a fix that should target the net tree, but the following > > patches looks like net-next material. Any special reason to bundle them > > together? > > Sorry, I forgot to add net-next as a target tree. > > This may look like a fix. But the feature was disabled in the last Linux > version. So the bug cannot be triggered, so we don't need to push to the net > tree. I think it would be useful to be clear in the commit message, use of tags, and target tree regarding fixes and non-fixes. Please describe in the commit message why this is not fixing a bug, as you have described above. And please do not include Fixes tags in patches that are not bug fixes, which seems to be the case here. If you want to refer to the patch that introduced the problem, you can use the following syntax, in the commit message, before the tags. Unlike Fixes tags, this may be line wrapped. This problem is not a bug fix for net because... It was was introduced by commit f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api"). Reported-by: ... ...
Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc
On Fri, 18 Oct 2024 15:41:41 +0800, Jason Wang wrote: > On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo wrote: > > > > When the frag just got a page, then may lead to regression on VM. > > Specially if the sysctl net.core.high_order_alloc_disable value is 1, > > then the frag always get a page when do refill. > > > > Which could see reliable crashes or scp failure (scp a file 100M in size > > to VM): > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning > > of a new frag. When the frag size is larger than PAGE_SIZE, > > everything is fine. However, if the frag is only one page and the > > total size of the buffer and virtnet_rq_dma is larger than one page, an > > overflow may occur. > > > > Here, when the frag size is not enough, we reduce the buffer len to fix > > this problem. > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever > > use_dma_api") > > Reported-by: "Si-Wei Liu" > > Signed-off-by: Xuan Zhuo > > Though this may fix the problem and we need it now, I would like to > have some tweaks in the future. Basically because of the security > implication for sharing driver metadata with the device (most IOMMU > can only do protection at the granule at the page). So we may end up > with device-triggerable behaviour. For safety, we should use driver > private memory for DMA metadata. > > > --- > > drivers/net/virtio_net.c | 12 +--- > > 1 file changed, 9 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index f8131f92a392..59a99bbaf852 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -926,9 +926,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, > > u32 size, gfp_t gfp) > > void *buf, *head; > > dma_addr_t addr; > > > > - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp))) > > - return NULL; > > - > > head = page_address(alloc_frag->page); > > > > if (rq->do_dma) { > > @@ -2423,6 +2420,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, > > struct receive_queue *rq, > > len = SKB_DATA_ALIGN(len) + > > SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > > > > + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp))) > > + return -ENOMEM; > > + > > buf = virtnet_rq_alloc(rq, len, gfp); > > if (unlikely(!buf)) > > return -ENOMEM; > > @@ -2525,6 +2525,12 @@ static int add_recvbuf_mergeable(struct virtnet_info > > *vi, > > */ > > len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room); > > > > + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp))) > > + return -ENOMEM; > > This makes me think of another question, how could we guarantee len + > room is less or equal to PAGE_SIZE. Especially considering we need > extra head and tailroom for XDP? If we can't it is a bug: get_mergeable_buf_len() do this. > > """ > /** > * skb_page_frag_refill - check that a page_frag contains enough room > * @sz: minimum size of the fragment we want to get > * @pfrag: pointer to page_frag > * @gfp: priority for memory allocation > * > * Note: While this allocator tries to use high order pages, there is > * no guarantee that allocations succeed. Therefore, @sz MUST be > * less or equal than PAGE_SIZE. > */ > """ > > > + > > + if (!alloc_frag->offset && len + room + sizeof(struct > > virtnet_rq_dma) > alloc_frag->size) > > + len -= sizeof(struct virtnet_rq_dma); > > Any reason we need to check alloc_frag->offset? We just need to check when the page of the alloc frag is new. In this case, we need to allocate space to store dma meta. 
If the offset > 0, then we do not need to allocate extra space, so it is safe. > > > + > > buf = virtnet_rq_alloc(rq, len + room, gfp); Btw, as pointed out in previous review, we should have a consistent API: > > 1) hide the alloc_frag like virtnet_rq_alloc() > > or > > 2) pass the alloc_frag to virtnet_rq_alloc() Now we need to check len+room before calling skb_page_frag_refill() so we must move skb_page_frag_refill() outside virtnet_rq_alloc(). Thanks > > > if (unlikely(!buf)) > > return -ENOMEM; > > -- > > 2.32.0.3.g01195cf9f > > > > Thanks >
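To make the overflow arithmetic concrete (assuming a 4 KiB page; the 16-byte size of struct virtnet_rq_dma is from the changelog):

    /* Merge mode, brand-new one-page frag (alloc_frag->offset == 0):
     *
     *      alloc_frag->size              = 4096
     *      len + room                    = 4096  (up to a full page)
     *      sizeof(struct virtnet_rq_dma) =   16  (stored at frag start)
     *
     * 4096 + 16 > 4096, so the old code overflowed the frag.  The fix
     * shrinks the buffer instead of the metadata:
     */
    if (!alloc_frag->offset &&
        len + room + sizeof(struct virtnet_rq_dma) > alloc_frag->size)
            len -= sizeof(struct virtnet_rq_dma);   /* buffer: 4096 -> 4080 */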
Re: [PATCH 4/5] virtio_net: enable premapped mode for merge and small by default
On Fri, 18 Oct 2024 16:00:07 +0800, Jason Wang wrote: > On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo wrote: > > > > Currently, the virtio core will perform a dma operation for each > > buffer. Although, the same page may be operated multiple times. > > > > In premapped mod, we can perform only one dma operation for the pages of > > the alloc frag. This is beneficial for the iommu device. > > > > kernel command line: intel_iommu=on iommu.passthrough=0 > > > >| strict=0 | strict=1 > > Before | 775496pps | 428614pps > > After | 1109316pps | 742853pps > > > > In the 6.11, we disabled this feature because a regress [1]. > > > > Now, we fix the problem and re-enable it. > > > > [1]: > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/net/virtio_net.c | 17 + > > 1 file changed, 17 insertions(+) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index cd90e77881df..8cf24b7b58bd 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -6133,6 +6133,21 @@ static int virtnet_alloc_queues(struct virtnet_info > > *vi) > > return -ENOMEM; > > } > > > > +static void virtnet_rq_set_premapped(struct virtnet_info *vi) > > +{ > > + int i; > > + > > + /* disable for big mode */ > > + if (vi->mode == VIRTNET_MODE_BIG) > > + return; > > Nitpick: I would like such a check to be done at the caller. I am ok, if you like. Thanks. > > But anyhow the patch looks good > > Acked-by: Jason Wang > > Thanks > > > + > > + for (i = 0; i < vi->max_queue_pairs; i++) { > > + /* error should never happen */ > > + BUG_ON(virtqueue_set_dma_premapped(vi->rq[i].vq)); > > + vi->rq[i].do_dma = true; > > + } > > +} > > + > > static int init_vqs(struct virtnet_info *vi) > > { > > int ret; > > @@ -6146,6 +6161,8 @@ static int init_vqs(struct virtnet_info *vi) > > if (ret) > > goto err_free; > > > > + virtnet_rq_set_premapped(vi); > > + > > cpus_read_lock(); > > virtnet_set_affinity(vi); > > cpus_read_unlock(); > > -- > > 2.32.0.3.g01195cf9f > > >
Re: [PATCH 2/5] virtio_net: introduce vi->mode
On Fri, 18 Oct 2024 15:48:38 +0800, Jason Wang wrote: > On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo wrote: > > > > Now, if we want to judge the rx work mode, we have to use such codes: > > > > 1. merge mode: vi->mergeable_rx_bufs > > 2. big mode: vi->big_packets && !vi->mergeable_rx_bufs > > 3. small: !vi->big_packets && !vi->mergeable_rx_bufs > > > > This is inconvenient and abstract, and we also have this use case: > > > > if (vi->mergeable_rx_bufs) > > > > else if (vi->big_packets) > > > > else > > > > For this case, I think switch-case is the better choice. > > > > So here I introduce vi->mode to record the virtio-net work mode. > > That is helpful to judge the work mode and choose the branches. > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/net/virtio_net.c | 61 +++- > > 1 file changed, 47 insertions(+), 14 deletions(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index 59a99bbaf852..14809b614d62 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -385,6 +385,12 @@ struct control_buf { > > virtio_net_ctrl_ack status; > > }; > > > > +enum virtnet_mode { > > + VIRTNET_MODE_SMALL, > > + VIRTNET_MODE_MERGE, > > + VIRTNET_MODE_BIG > > +}; > > I'm not sure if this can ease or not. > > [...] > > > + if (vi->mergeable_rx_bufs) > > + vi->mode = VIRTNET_MODE_MERGE; > > + else if (vi->big_packets) > > + vi->mode = VIRTNET_MODE_BIG; > > Maybe we can just say big_packets doesn't mean big mode. > > > + else > > + vi->mode = VIRTNET_MODE_SMALL; > > + > > if (vi->any_header_sg) > > dev->needed_headroom = vi->hdr_len; > > Anyhow this seems not a fix so it should be a separate series than patch 1? I think this series is not about fixing the problem, the feature was disabled in the last Linux version. This series is about turning the pre-mapped mode of rx back on. This commit tries to make things easier. The current code is very unintuitive when we try to switch or check the virtio-net rx mode Thanks. > > Thanks > > > > > -- > > 2.32.0.3.g01195cf9f > > >
Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc
On Thu, 17 Oct 2024 15:42:59 +0200, Paolo Abeni wrote: > > > On 10/14/24 05:12, Xuan Zhuo wrote: > > When the frag just got a page, then may lead to regression on VM. > > Specially if the sysctl net.core.high_order_alloc_disable value is 1, > > then the frag always get a page when do refill. > > > > Which could see reliable crashes or scp failure (scp a file 100M in size > > to VM): > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning > > of a new frag. When the frag size is larger than PAGE_SIZE, > > everything is fine. However, if the frag is only one page and the > > total size of the buffer and virtnet_rq_dma is larger than one page, an > > overflow may occur. > > > > Here, when the frag size is not enough, we reduce the buffer len to fix > > this problem. > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever > > use_dma_api") > > Reported-by: "Si-Wei Liu" > > Signed-off-by: Xuan Zhuo > > This looks like a fix that should target the net tree, but the following > patches looks like net-next material. Any special reason to bundle them > together? Sorry, I forgot to add net-next as a target tree. This may look like a fix. But the feature was disabled in the last Linux version. So the bug cannot be triggered, so we don't need to push to the net tree. Thanks. > > Also, please explicitly include the the target tree in the subj on next > submissions, thanks! > > Paolo > >
Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc
On Thu, Oct 17, 2024 at 03:42:59PM +0200, Paolo Abeni wrote: > > > On 10/14/24 05:12, Xuan Zhuo wrote: > > When the frag just got a page, then may lead to regression on VM. > > Specially if the sysctl net.core.high_order_alloc_disable value is 1, > > then the frag always get a page when do refill. > > > > Which could see reliable crashes or scp failure (scp a file 100M in size > > to VM): > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning > > of a new frag. When the frag size is larger than PAGE_SIZE, > > everything is fine. However, if the frag is only one page and the > > total size of the buffer and virtnet_rq_dma is larger than one page, an > > overflow may occur. > > > > Here, when the frag size is not enough, we reduce the buffer len to fix > > this problem. > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever > > use_dma_api") > > Reported-by: "Si-Wei Liu" > > Signed-off-by: Xuan Zhuo > > This looks like a fix that should target the net tree, but the following > patches looks like net-next material. Any special reason to bundle them > together? > > Also, please explicitly include the the target tree in the subj on next > submissions, thanks! > > Paolo The issue is that net-next is not rebased on net, so if patches 2-5 land in next and 1 only in net, next will be broken. I think the simplest fix is just to have 1 on net-next, backport it to net too, and git will figure it out. Right? -- MST
Re: [PATCH 0/5] virtio_net: enable premapped mode by default
On Fri, Oct 18, 2024 at 03:59:14PM +0100, Darren Kenny wrote: > Hi Michael / Xuan Zhuo, > > On Wednesday, 2024-10-16 at 08:55:21 +01, Darren Kenny wrote: > > Hi Michael, > > > > On Monday, 2024-10-14 at 00:57:41 -04, Michael S. Tsirkin wrote: > >> On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote: > >>> In the last linux version, we disabled this feature to fix the > >>> regress[1]. > >>> > >>> The patch set is try to fix the problem and re-enable it. > >>> > >>> More info: > >>> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > >>> > >>> Thanks. > >>> > >>> [1]: > >>> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com > >> > >> Darren, you previously reported crashes with a patch very similar to 1/5. > >> Can you please test this patchset and report whether they > >> are still observed? > >> If yes, any data on how to reproduce would be very benefitial for Xuan > >> Zhuo. > >> > > > > I aim to get to this in the next week, but I don't currently have > > access to a system to test it, it will take a few days at least before I > > can get one. > > I finally a managed to get access to a system to test this on, and it > looks like things are working with this patch-set. So... > > Tested-by: Darren Kenny > > Thanks, > > Darren. Thanks a lot Darren! -- MST
Re: [PATCH 0/5] virtio_net: enable premapped mode by default
Hi Michael / Xuan Zhuo, On Wednesday, 2024-10-16 at 08:55:21 +01, Darren Kenny wrote: > Hi Michael, > > On Monday, 2024-10-14 at 00:57:41 -04, Michael S. Tsirkin wrote: >> On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote: >>> In the last linux version, we disabled this feature to fix the >>> regress[1]. >>> >>> The patch set is try to fix the problem and re-enable it. >>> >>> More info: >>> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com >>> >>> Thanks. >>> >>> [1]: >>> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com >> >> Darren, you previously reported crashes with a patch very similar to 1/5. >> Can you please test this patchset and report whether they >> are still observed? >> If yes, any data on how to reproduce would be very benefitial for Xuan >> Zhuo. >> > > I aim to get to this in the next week, but I don't currently have > access to a system to test it, it will take a few days at least before I > can get one. I finally managed to get access to a system to test this on, and it looks like things are working with this patch-set. So... Tested-by: Darren Kenny Thanks, Darren.
Re: [PATCH 4/5] virtio_net: enable premapped mode for merge and small by default
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo wrote: > > Currently, the virtio core will perform a dma operation for each > buffer. Although, the same page may be operated multiple times. > > In premapped mod, we can perform only one dma operation for the pages of > the alloc frag. This is beneficial for the iommu device. > > kernel command line: intel_iommu=on iommu.passthrough=0 > >| strict=0 | strict=1 > Before | 775496pps | 428614pps > After | 1109316pps | 742853pps > > In the 6.11, we disabled this feature because a regress [1]. > > Now, we fix the problem and re-enable it. > > [1]: > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio_net.c | 17 + > 1 file changed, 17 insertions(+) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index cd90e77881df..8cf24b7b58bd 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -6133,6 +6133,21 @@ static int virtnet_alloc_queues(struct virtnet_info > *vi) > return -ENOMEM; > } > > +static void virtnet_rq_set_premapped(struct virtnet_info *vi) > +{ > + int i; > + > + /* disable for big mode */ > + if (vi->mode == VIRTNET_MODE_BIG) > + return; Nitpick: I would like such a check to be done at the caller. But anyhow the patch looks good Acked-by: Jason Wang Thanks > + > + for (i = 0; i < vi->max_queue_pairs; i++) { > + /* error should never happen */ > + BUG_ON(virtqueue_set_dma_premapped(vi->rq[i].vq)); > + vi->rq[i].do_dma = true; > + } > +} > + > static int init_vqs(struct virtnet_info *vi) > { > int ret; > @@ -6146,6 +6161,8 @@ static int init_vqs(struct virtnet_info *vi) > if (ret) > goto err_free; > > + virtnet_rq_set_premapped(vi); > + > cpus_read_lock(); > virtnet_set_affinity(vi); > cpus_read_unlock(); > -- > 2.32.0.3.g01195cf9f >
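The nitpick amounts to moving the mode check to the caller, something like this in init_vqs() (sketch only):

    /* The caller decides that big mode keeps using unmapped buffers,
     * so virtnet_rq_set_premapped() itself stays mode-agnostic.
     */
    if (vi->mode != VIRTNET_MODE_BIG)
            virtnet_rq_set_premapped(vi);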
Re: [PATCH 2/5] virtio_net: introduce vi->mode
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo wrote: > > Now, if we want to judge the rx work mode, we have to use such codes: > > 1. merge mode: vi->mergeable_rx_bufs > 2. big mode: vi->big_packets && !vi->mergeable_rx_bufs > 3. small: !vi->big_packets && !vi->mergeable_rx_bufs > > This is inconvenient and abstract, and we also have this use case: > > if (vi->mergeable_rx_bufs) > > else if (vi->big_packets) > > else > > For this case, I think switch-case is the better choice. > > So here I introduce vi->mode to record the virtio-net work mode. > That is helpful to judge the work mode and choose the branches. > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio_net.c | 61 +++- > 1 file changed, 47 insertions(+), 14 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 59a99bbaf852..14809b614d62 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -385,6 +385,12 @@ struct control_buf { > virtio_net_ctrl_ack status; > }; > > +enum virtnet_mode { > + VIRTNET_MODE_SMALL, > + VIRTNET_MODE_MERGE, > + VIRTNET_MODE_BIG > +}; I'm not sure if this can ease or not. [...] > + if (vi->mergeable_rx_bufs) > + vi->mode = VIRTNET_MODE_MERGE; > + else if (vi->big_packets) > + vi->mode = VIRTNET_MODE_BIG; Maybe we can just say big_packets doesn't mean big mode. > + else > + vi->mode = VIRTNET_MODE_SMALL; > + > if (vi->any_header_sg) > dev->needed_headroom = vi->hdr_len; Anyhow this seems not a fix so it should be a separate series than patch 1? Thanks > > -- > 2.32.0.3.g01195cf9f >
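For illustration, the kind of dispatch the enum is meant to enable, mirroring the if/else chain it would replace in the receive refill path (the case bodies are the existing helpers, shown here only as a sketch):

    switch (vi->mode) {
    case VIRTNET_MODE_MERGE:
            err = add_recvbuf_mergeable(vi, rq, gfp);
            break;
    case VIRTNET_MODE_BIG:
            err = add_recvbuf_big(vi, rq, gfp);
            break;
    case VIRTNET_MODE_SMALL:
            err = add_recvbuf_small(vi, rq, gfp);
            break;
    }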
Re: [PATCH 5/5] virtio_net: rx remove premapped failover code
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo wrote: > > Now, the premapped mode can be enabled unconditionally. > > So we can remove the failover code for merge and small mode. > > Signed-off-by: Xuan Zhuo Let's be more verbose here. For example, the virtnet_rq_xxx() helper would be only used if the mode is using pre mapping. And it might be worth adding a check to prevent misusing of the API. Others look good to me. Thanks
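One possible reading of the suggested misuse check, purely as a sketch; the exact helper and placement are up to the series:

    static void *virtnet_rq_get_buf(struct receive_queue *rq, u32 *len,
                                    void **ctx)
    {
            void *buf;

            /* The virtnet_rq_xxx() helpers only make sense once the rq
             * runs in premapped mode; catch callers that get this wrong.
             */
            if (WARN_ON_ONCE(!rq->do_dma))
                    return NULL;

            buf = virtqueue_get_buf_ctx(rq->vq, len, ctx);
            if (buf)
                    virtnet_rq_unmap(rq, buf, *len);

            return buf;
    }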
Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo wrote: > > When the frag just got a page, then may lead to regression on VM. > Specially if the sysctl net.core.high_order_alloc_disable value is 1, > then the frag always get a page when do refill. > > Which could see reliable crashes or scp failure (scp a file 100M in size > to VM): > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning > of a new frag. When the frag size is larger than PAGE_SIZE, > everything is fine. However, if the frag is only one page and the > total size of the buffer and virtnet_rq_dma is larger than one page, an > overflow may occur. > > Here, when the frag size is not enough, we reduce the buffer len to fix > this problem. > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever > use_dma_api") > Reported-by: "Si-Wei Liu" > Signed-off-by: Xuan Zhuo Though this may fix the problem and we need it now, I would like to have some tweaks in the future. Basically because of the security implication for sharing driver metadata with the device (most IOMMU can only do protection at the granule at the page). So we may end up with device-triggerable behaviour. For safety, we should use driver private memory for DMA metadata. > --- > drivers/net/virtio_net.c | 12 +--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index f8131f92a392..59a99bbaf852 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -926,9 +926,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, > u32 size, gfp_t gfp) > void *buf, *head; > dma_addr_t addr; > > - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp))) > - return NULL; > - > head = page_address(alloc_frag->page); > > if (rq->do_dma) { > @@ -2423,6 +2420,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, > struct receive_queue *rq, > len = SKB_DATA_ALIGN(len) + > SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > > + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp))) > + return -ENOMEM; > + > buf = virtnet_rq_alloc(rq, len, gfp); > if (unlikely(!buf)) > return -ENOMEM; > @@ -2525,6 +2525,12 @@ static int add_recvbuf_mergeable(struct virtnet_info > *vi, > */ > len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room); > > + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp))) > + return -ENOMEM; This makes me think of another question, how could we guarantee len + room is less or equal to PAGE_SIZE. Especially considering we need extra head and tailroom for XDP? If we can't it is a bug: """ /** * skb_page_frag_refill - check that a page_frag contains enough room * @sz: minimum size of the fragment we want to get * @pfrag: pointer to page_frag * @gfp: priority for memory allocation * * Note: While this allocator tries to use high order pages, there is * no guarantee that allocations succeed. Therefore, @sz MUST be * less or equal than PAGE_SIZE. */ """ > + > + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > > alloc_frag->size) > + len -= sizeof(struct virtnet_rq_dma); Any reason we need to check alloc_frag->offset? > + > buf = virtnet_rq_alloc(rq, len + room, gfp); Btw, as pointed out in previous review, we should have a consistent API: 1) hide the alloc_frag like virtnet_rq_alloc() or 2) pass the alloc_frag to virtnet_rq_alloc() > if (unlikely(!buf)) > return -ENOMEM; > -- > 2.32.0.3.g01195cf9f > Thanks
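On the len + room <= PAGE_SIZE question: the guarantee Xuan points to in his reply lives in get_mergeable_buf_len(). Roughly, paraphrased from drivers/net/virtio_net.c rather than quoted from this series, it already accounts for the room:

    /* Roughly: when XDP headroom/tailroom is in play, hand back at most
     * PAGE_SIZE - room; otherwise clamp the EWMA estimate to a page.
     */
    static unsigned int get_mergeable_buf_len(struct receive_queue *rq,
                                              struct ewma_pkt_len *avg_pkt_len,
                                              unsigned int room)
    {
            struct virtnet_info *vi = rq->vq->vdev->priv;
            const size_t hdr_len = vi->hdr_len;
            unsigned int len;

            if (room)
                    return PAGE_SIZE - room;

            len = hdr_len + clamp_t(unsigned int,
                                    ewma_pkt_len_read(avg_pkt_len),
                                    rq->min_buf_len, PAGE_SIZE - hdr_len);

            return ALIGN(len, L1_CACHE_BYTES);
    }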
Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc
On 10/14/24 05:12, Xuan Zhuo wrote: When the frag just got a page, then may lead to regression on VM. Specially if the sysctl net.core.high_order_alloc_disable value is 1, then the frag always get a page when do refill. Which could see reliable crashes or scp failure (scp a file 100M in size to VM): The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning of a new frag. When the frag size is larger than PAGE_SIZE, everything is fine. However, if the frag is only one page and the total size of the buffer and virtnet_rq_dma is larger than one page, an overflow may occur. Here, when the frag size is not enough, we reduce the buffer len to fix this problem. Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api") Reported-by: "Si-Wei Liu" Signed-off-by: Xuan Zhuo This looks like a fix that should target the net tree, but the following patches look like net-next material. Any special reason to bundle them together? Also, please explicitly include the target tree in the subject on next submissions, thanks! Paolo
Re: [PATCH 0/5] virtio_net: enable premapped mode by default
Hi Michael, On Monday, 2024-10-14 at 00:57:41 -04, Michael S. Tsirkin wrote: > On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote: >> In the last linux version, we disabled this feature to fix the >> regress[1]. >> >> The patch set is try to fix the problem and re-enable it. >> >> More info: >> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com >> >> Thanks. >> >> [1]: >> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com > > Darren, you previously reported crashes with a patch very similar to 1/5. > Can you please test this patchset and report whether they > are still observed? > If yes, any data on how to reproduce would be very benefitial for Xuan > Zhuo. > I aim to get to this in the next week, but I don't currently have access to a system to test it, it will take a few days at least before I can get one. Thanks, Darren. > >> Xuan Zhuo (5): >> virtio-net: fix overflow inside virtnet_rq_alloc >> virtio_net: introduce vi->mode >> virtio_net: big mode skip the unmap check >> virtio_net: enable premapped mode for merge and small by default >> virtio_net: rx remove premapped failover code >> >> drivers/net/virtio_net.c | 168 --- >> 1 file changed, 105 insertions(+), 63 deletions(-) >> >> -- >> 2.32.0.3.g01195cf9f
Re: [PATCH 0/5] virtio_net: enable premapped mode by default
On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote: > In the last linux version, we disabled this feature to fix the > regress[1]. > > The patch set is try to fix the problem and re-enable it. > > More info: > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > Thanks. > > [1]: > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com Darren, you previously reported crashes with a patch very similar to 1/5. Can you please test this patchset and report whether they are still observed? If yes, any data on how to reproduce would be very benefitial for Xuan Zhuo. > Xuan Zhuo (5): > virtio-net: fix overflow inside virtnet_rq_alloc > virtio_net: introduce vi->mode > virtio_net: big mode skip the unmap check > virtio_net: enable premapped mode for merge and small by default > virtio_net: rx remove premapped failover code > > drivers/net/virtio_net.c | 168 --- > 1 file changed, 105 insertions(+), 63 deletions(-) > > -- > 2.32.0.3.g01195cf9f
Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"
On Wed, Oct 09, 2024 at 06:00:47PM +0800, Xuan Zhuo wrote: > On Wed, 9 Oct 2024 05:29:35 -0400, "Michael S. Tsirkin" > wrote: > > On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote: > > > Regression: > > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com > > > > > > I still think that the patch can fix the problem, I hope Darren can > > > re-test it > > > or give me more info. > > > > > > > > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > > > > > If that can not work or Darren can not reply in time, Michael you can try > > > this > > > patch set. > > > > > > Thanks. > > > > It's been a month - were you going to post patches fixing bugs in premap > > and then re-enabling this? > > [1] was tried to fix the bug in premapped mode. > > We all wait the re-test from Darren. > > But Jason and Takero have tested it. > > If you do not want to wait Darren, we can discuss [1] or add it to your next > branch for more test. > > Thanks. > > [1]: > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > > > > > Thanks! > > > > > > > Xuan Zhuo (3): > > > Revert "virtio_net: rx remove premapped failover code" > > > Revert "virtio_net: big mode skip the unmap check" > > > virtio_net: disable premapped mode by default > > > > > > drivers/net/virtio_net.c | 95 +++- > > > 1 file changed, 46 insertions(+), 49 deletions(-) > > > > > > -- > > > 2.32.0.3.g01195cf9f > > > > > I suggest you post a patchset you want me to apply on top of master. -- MST
Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"
On Wed, 9 Oct 2024 05:29:35 -0400, "Michael S. Tsirkin" wrote: > On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote: > > Regression: > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com > > > > I still think that the patch can fix the problem, I hope Darren can re-test > > it > > or give me more info. > > > > > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > > > If that can not work or Darren can not reply in time, Michael you can try > > this > > patch set. > > > > Thanks. > > It's been a month - were you going to post patches fixing bugs in premap > and then re-enabling this? [1] was tried to fix the bug in premapped mode. We all wait the re-test from Darren. But Jason and Takero have tested it. If you do not want to wait Darren, we can discuss [1] or add it to your next branch for more test. Thanks. [1]: http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > Thanks! > > > > Xuan Zhuo (3): > > Revert "virtio_net: rx remove premapped failover code" > > Revert "virtio_net: big mode skip the unmap check" > > virtio_net: disable premapped mode by default > > > > drivers/net/virtio_net.c | 95 +++- > > 1 file changed, 46 insertions(+), 49 deletions(-) > > > > -- > > 2.32.0.3.g01195cf9f > > >
Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"
On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote: > Regression: > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com > > I still think that the patch can fix the problem, I hope Darren can re-test it > or give me more info. > > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com > > If that can not work or Darren can not reply in time, Michael you can try this > patch set. > > Thanks. It's been a month - were you going to post patches fixing bugs in premap and then re-enabling this? Thanks! > Xuan Zhuo (3): > Revert "virtio_net: rx remove premapped failover code" > Revert "virtio_net: big mode skip the unmap check" > virtio_net: disable premapped mode by default > > drivers/net/virtio_net.c | 95 +++- > 1 file changed, 46 insertions(+), 49 deletions(-) > > -- > 2.32.0.3.g01195cf9f >
Re: [RFC net-next v1 10/12] virtio_net: xsk: tx: support xmit xsk buffer
On Tue, 24 Sep 2024 15:35:08 +0800, Jason Wang wrote: > On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > > > The driver's tx napi is very important for XSK. It is responsible for > > obtaining data from the XSK queue and sending it out. > > > > At the beginning, we need to trigger tx napi. > > > > virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk > > buffer) by the last bits of the pointer. > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/net/virtio_net.c | 176 --- > > 1 file changed, 166 insertions(+), 10 deletions(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index 3ad4c6e3ef18..1a870f1df910 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -83,6 +83,7 @@ struct virtnet_sq_free_stats { > > u64 bytes; > > u64 napi_packets; > > u64 napi_bytes; > > + u64 xsk; > > }; > > > > struct virtnet_sq_stats { > > @@ -514,16 +515,20 @@ static struct sk_buff *virtnet_skb_append_frag(struct > > sk_buff *head_skb, > >struct sk_buff *curr_skb, > >struct page *page, void *buf, > >int len, int truesize); > > +static void virtnet_xsk_completed(struct send_queue *sq, int num); > > > > enum virtnet_xmit_type { > > VIRTNET_XMIT_TYPE_SKB, > > VIRTNET_XMIT_TYPE_SKB_ORPHAN, > > VIRTNET_XMIT_TYPE_XDP, > > + VIRTNET_XMIT_TYPE_XSK, > > }; > > > > /* We use the last two bits of the pointer to distinguish the xmit type. */ > > #define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1)) > > > > +#define VIRTIO_XSK_FLAG_OFFSET 4 > > Any reason this is not 2? There's no particular reason for this, any value greater than 2 will work. > > > + > > static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr) > > { > > unsigned long p = (unsigned long)*ptr; > > @@ -546,6 +551,11 @@ static int virtnet_add_outbuf(struct send_queue *sq, > > int num, void *data, > > GFP_ATOMIC); > > } > > > > +static u32 virtnet_ptr_to_xsk_buff_len(void *ptr) > > +{ > > + return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET; > > +} > > + > > static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > > { > > sg_assign_page(sg, NULL); > > @@ -587,11 +597,27 @@ static void __free_old_xmit(struct send_queue *sq, > > struct netdev_queue *txq, > > stats->bytes += xdp_get_frame_len(frame); > > xdp_return_frame(frame); > > break; > > + > > + case VIRTNET_XMIT_TYPE_XSK: > > + stats->bytes += virtnet_ptr_to_xsk_buff_len(ptr); > > + stats->xsk++; > > + break; > > } > > } > > netdev_tx_completed_queue(txq, stats->napi_packets, > > stats->napi_bytes); > > Not related to this patch, but this seems unnecessary to AF_XDP. YES. netdev_tx_completed_queue will check napi_bytes firstly. So I do not think we need to do anything for this. > > > } > > > > +static void virtnet_free_old_xmit(struct send_queue *sq, > > + struct netdev_queue *txq, > > + bool in_napi, > > + struct virtnet_sq_free_stats *stats) > > +{ > > + __free_old_xmit(sq, txq, in_napi, stats); > > + > > + if (stats->xsk) > > + virtnet_xsk_completed(sq, stats->xsk); > > +} > > + > > /* Converting between virtqueue no. and kernel tx/rx queue no. > > * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq > > */ > > @@ -1019,7 +1045,7 @@ static void free_old_xmit(struct send_queue *sq, > > struct netdev_queue *txq, > > { > > struct virtnet_sq_free_stats stats = {0}; > > > > - __free_old_xmit(sq, txq, in_napi, &stats); > > + virtnet_free_old_xmit(sq, txq, in_napi, &stats); > > > > /* Avoid overhead when no packets have been processed > > * happens when called speculatively from start_xmit. 
> > @@ -1380,6 +1406,111 @@ static int virtnet_add_recvbuf_xsk(struct > > virtnet_info *vi, struct receive_queue > > return err; > > } > > > > +static void *virtnet_xsk_to_ptr(u32 len) > > +{ > > + unsigned long p; > > + > > + p = len << VIRTIO_XSK_FLAG_OFFSET; > > + > > + return virtnet_xmit_ptr_mix((void *)p, VIRTNET_XMIT_TYPE_XSK); > > +} > > + > > +static int virtnet_xsk_xmit_one(struct send_queue *sq, > > + struct xsk_buff_pool *pool, > > + struct xdp_desc *desc) > > +{ > > + struct virtnet_info *vi; > > + dma_addr_t addr; > > + > > + vi = sq->vq->vdev->priv; > > + > > + addr = xsk_buff_raw_get_dma(pool, desc->addr); > > + xsk_buff_raw_dma_sync_for_dev
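A worked example of the length-in-pointer token discussed in this sub-thread (values only; the helpers are the ones quoted in the hunks above):

	u32 len;
	enum virtnet_xmit_type type;
	void *token;

	/* encode: len = 1500 (0x5dc), type = VIRTNET_XMIT_TYPE_XSK (0b11) */
	token = virtnet_xsk_to_ptr(1500);		/* (1500 << 4) | 0b11 = 0x5dc3 */

	/* decode on completion */
	type = virtnet_xmit_ptr_strip(&token);		/* low bits 0b11 -> VIRTNET_XMIT_TYPE_XSK */
	len = virtnet_ptr_to_xsk_buff_len(token);	/* 0x5dc0 >> 4 = 1500 */

Storing the length in the otherwise-unused high bits is what lets __free_old_xmit() credit bytes and count completed XSK frames without allocating any per-frame metadata.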
Re: [RFC net-next v1 07/12] virtio_net: refactor the xmit type
On Tue, 24 Sep 2024 15:35:03 +0800, Jason Wang wrote: > On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > > > Because the af-xdp will introduce a new xmit type, so I refactor the > > xmit type mechanism first. > > > > In general, pointers are aligned to 4 or 8 bytes. > > I think this needs some clarification, the alignment seems to depend > on the lowest common multiple between the alignments of all struct > members. So we know both xdp_frame and sk_buff are at least 4 bytes > aligned. > > If we want to reuse the lowest bit of pointers in AF_XDP, the > alignment of the data structure should be at least 4 bytes. YES, for AF_XDP. See more in #10. > > > If it is aligned to 4 > > bytes, then only two bits are free for a pointer. But there are 4 types > > here, so we can't use bits to distinguish them. And 2 bits is enough for > > 4 types: > > > > 00 for skb > > 01 for SKB_ORPHAN > > 10 for XDP > > 11 for af-xdp tx > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/net/virtio_net.c | 90 +++- > > 1 file changed, 51 insertions(+), 39 deletions(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index 630e5b21ad69..41a5ea9b788d 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -45,9 +45,6 @@ module_param(napi_tx, bool, 0644); > > #define VIRTIO_XDP_TX BIT(0) > > #define VIRTIO_XDP_REDIR BIT(1) > > > > -#define VIRTIO_XDP_FLAGBIT(0) > > -#define VIRTIO_ORPHAN_FLAG BIT(1) > > - > > /* RX packet size EWMA. The average packet size is used to determine the > > packet > > * buffer size when refilling RX rings. As the entire RX ring may be > > refilled > > * at once, the weight is chosen so that the EWMA will be insensitive to > > short- > > @@ -512,34 +509,35 @@ static struct sk_buff *virtnet_skb_append_frag(struct > > sk_buff *head_skb, > >struct page *page, void *buf, > >int len, int truesize); > > > > -static bool is_xdp_frame(void *ptr) > > -{ > > - return (unsigned long)ptr & VIRTIO_XDP_FLAG; > > -} > > +enum virtnet_xmit_type { > > + VIRTNET_XMIT_TYPE_SKB, > > + VIRTNET_XMIT_TYPE_SKB_ORPHAN, > > + VIRTNET_XMIT_TYPE_XDP, > > +}; > > > > -static void *xdp_to_ptr(struct xdp_frame *ptr) > > -{ > > - return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG); > > -} > > +/* We use the last two bits of the pointer to distinguish the xmit type. */ > > +#define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1)) > > > > -static struct xdp_frame *ptr_to_xdp(void *ptr) > > +static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr) > > Nit: not a native speaker but I think something like pack/unpack might > be better. Will fix. Thanks > > With those changes. > > Acked-by: Jason Wang > > Thanks >
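For reference while reading the hunk above, a sketch of the pack/unpack pair as it appears across this series (the names come from the patches; the bodies are reconstructed from the quoted uses, so details may differ):

enum virtnet_xmit_type {
	VIRTNET_XMIT_TYPE_SKB,		/* 00 */
	VIRTNET_XMIT_TYPE_SKB_ORPHAN,	/* 01 */
	VIRTNET_XMIT_TYPE_XDP,		/* 10 */
	VIRTNET_XMIT_TYPE_XSK,		/* 11, added later in the series */
};

/* We use the last two bits of the pointer to distinguish the xmit type. */
#define VIRTNET_XMIT_TYPE_MASK	(BIT(0) | BIT(1))

/* "pack": tag a pointer before it goes into the virtqueue; valid only
 * because sk_buff and xdp_frame are at least 4-byte aligned. */
static void *virtnet_xmit_ptr_mix(void *ptr, enum virtnet_xmit_type type)
{
	return (void *)((unsigned long)ptr | type);
}

/* "unpack": recover the type from a completed token and strip the tag so
 * the pointer can be dereferenced again. */
static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr)
{
	unsigned long p = (unsigned long)*ptr;

	*ptr = (void *)(p & ~VIRTNET_XMIT_TYPE_MASK);

	return p & VIRTNET_XMIT_TYPE_MASK;
}

Jason's naming suggestion above would rename the pair to something like virtnet_xmit_ptr_pack()/virtnet_xmit_ptr_unpack(); the mechanics stay the same.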
Re: [RFC net-next v1 08/12] virtio_net: xsk: bind/unbind xsk for tx
On Tue, 24 Sep 2024 15:35:05 +0800, Jason Wang wrote: > On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > > > This patch implement the logic of bind/unbind xsk pool to sq and rq. > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/net/virtio_net.c | 53 > > 1 file changed, 53 insertions(+) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index 41a5ea9b788d..7c379614fd22 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -295,6 +295,10 @@ struct send_queue { > > > > /* Record whether sq is in reset state. */ > > bool reset; > > + > > + struct xsk_buff_pool *xsk_pool; > > + > > + dma_addr_t xsk_hdr_dma_addr; > > }; > > > > /* Internal representation of a receive virtqueue */ > > @@ -497,6 +501,8 @@ struct virtio_net_common_hdr { > > }; > > }; > > > > +static struct virtio_net_common_hdr xsk_hdr; > > + > > static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf); > > static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff > > *xdp, > >struct net_device *dev, > > @@ -5488,6 +5494,29 @@ static int virtnet_rq_bind_xsk_pool(struct > > virtnet_info *vi, struct receive_queu > > return err; > > } > > > > +static int virtnet_sq_bind_xsk_pool(struct virtnet_info *vi, > > + struct send_queue *sq, > > + struct xsk_buff_pool *pool) > > +{ > > + int err, qindex; > > + > > + qindex = sq - vi->sq; > > + > > + virtnet_tx_pause(vi, sq); > > + > > + err = virtqueue_reset(sq->vq, virtnet_sq_free_unused_buf); > > + if (err) { > > + netdev_err(vi->dev, "reset tx fail: tx queue index: %d err: > > %d\n", qindex, err); > > + pool = NULL; > > + } > > + > > + sq->xsk_pool = pool; > > + > > + virtnet_tx_resume(vi, sq); > > + > > + return err; > > +} > > + > > static int virtnet_xsk_pool_enable(struct net_device *dev, > >struct xsk_buff_pool *pool, > >u16 qid) > > @@ -5496,6 +5525,7 @@ static int virtnet_xsk_pool_enable(struct net_device > > *dev, > > struct receive_queue *rq; > > struct device *dma_dev; > > struct send_queue *sq; > > + dma_addr_t hdr_dma; > > int err, size; > > > > if (vi->hdr_len > xsk_pool_get_headroom(pool)) > > @@ -5533,6 +5563,11 @@ static int virtnet_xsk_pool_enable(struct net_device > > *dev, > > if (!rq->xsk_buffs) > > return -ENOMEM; > > > > + hdr_dma = virtqueue_dma_map_single_attrs(sq->vq, &xsk_hdr, > > vi->hdr_len, > > +DMA_TO_DEVICE, 0); > > + if (virtqueue_dma_mapping_error(sq->vq, hdr_dma)) > > + return -ENOMEM; > > + > > err = xsk_pool_dma_map(pool, dma_dev, 0); > > if (err) > > goto err_xsk_map; > > @@ -5541,11 +5576,24 @@ static int virtnet_xsk_pool_enable(struct > > net_device *dev, > > if (err) > > goto err_rq; > > > > + err = virtnet_sq_bind_xsk_pool(vi, sq, pool); > > + if (err) > > + goto err_sq; > > + > > + /* Now, we do not support tx offset, so all the tx virtnet hdr is > > zero. > > What did you mean by "tx offset" here? (Or I don't see the connection > with vnet hdr). Sorry, should be tx offload(such as tx csum). Will fix. Thanks. > > Anyhow the patch looks good. > > Acked-by: Jason Wang > > Thanks >
Re: [RFC net-next v1 02/12] virtio_ring: split: record extras for indirect buffers
Will fix for all commets in next version. Thanks. On Tue, 24 Sep 2024 15:34:57 +0800, Jason Wang wrote: > On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > > > The subsequent commit needs to know whether every indirect buffer is > > premapped or not. So we need to introduce an extra struct for every > > indirect buffer to record this info. > > > > Signed-off-by: Xuan Zhuo > > --- > > drivers/virtio/virtio_ring.c | 126 +-- > > 1 file changed, 61 insertions(+), 65 deletions(-) > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > index 228e9fbcba3f..62901bee97c0 100644 > > --- a/drivers/virtio/virtio_ring.c > > +++ b/drivers/virtio/virtio_ring.c > > @@ -67,9 +67,20 @@ > > #define LAST_ADD_TIME_INVALID(vq) > > #endif > > > > +struct vring_desc_extra { > > + dma_addr_t addr;/* Descriptor DMA addr. */ > > + u32 len;/* Descriptor length. */ > > + u16 flags; /* Descriptor flags. */ > > + u16 next; /* The next desc state in a list. */ > > +}; > > + > > struct vring_desc_state_split { > > void *data; /* Data for callback. */ > > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ > > + > > + /* Indirect extra table and desc table, if any. These two will be > > +* allocated together. So we won't stress more to the memory > > allocator. > > +*/ > > + struct vring_desc_extra *indir; > > }; > > > > struct vring_desc_state_packed { > > @@ -79,13 +90,6 @@ struct vring_desc_state_packed { > > u16 last; /* The last desc state in a list. */ > > }; > > > > -struct vring_desc_extra { > > - dma_addr_t addr;/* Descriptor DMA addr. */ > > - u32 len;/* Descriptor length. */ > > - u16 flags; /* Descriptor flags. */ > > - u16 next; /* The next desc state in a list. */ > > -}; > > - > > struct vring_virtqueue_split { > > /* Actual memory layout for this queue. */ > > struct vring vring; > > @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue > > *vq, u32 num) > > * Split ring specific functions - *_split(). > > */ > > > > -static void vring_unmap_one_split_indirect(const struct vring_virtqueue > > *vq, > > - const struct vring_desc *desc) > > -{ > > - u16 flags; > > - > > - if (!vring_need_unmap_buffer(vq)) > > - return; > > - > > - flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); > > - > > - dma_unmap_page(vring_dma_dev(vq), > > - virtio64_to_cpu(vq->vq.vdev, desc->addr), > > - virtio32_to_cpu(vq->vq.vdev, desc->len), > > - (flags & VRING_DESC_F_WRITE) ? > > - DMA_FROM_DEVICE : DMA_TO_DEVICE); > > -} > > - > > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, > > - unsigned int i) > > + struct vring_desc_extra *extra) > > { > > - struct vring_desc_extra *extra = vq->split.desc_extra; > > u16 flags; > > > > - flags = extra[i].flags; > > + flags = extra->flags; > > > > if (flags & VRING_DESC_F_INDIRECT) { > > if (!vq->use_dma_api) > > goto out; > > > > dma_unmap_single(vring_dma_dev(vq), > > -extra[i].addr, > > -extra[i].len, > > +extra->addr, > > +extra->len, > > (flags & VRING_DESC_F_WRITE) ? > > DMA_FROM_DEVICE : DMA_TO_DEVICE); > > } else { > > @@ -479,20 +465,22 @@ static unsigned int vring_unmap_one_split(const > > struct vring_virtqueue *vq, > > goto out; > > > > dma_unmap_page(vring_dma_dev(vq), > > - extra[i].addr, > > - extra[i].len, > > + extra->addr, > > + extra->len, > >(flags & VRING_DESC_F_WRITE) ? 
> >DMA_FROM_DEVICE : DMA_TO_DEVICE); > > } > > > > out: > > - return extra[i].next; > > + return extra->next; > > } > > > > static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, > >unsigned int total_sg, > > + struct vring_desc_extra > > **pextra, > > This kind of API seems a little bit strange. In the caller we had: > > if (desc) { > ... > } else { > extra = vq->split.desc_extra; > } >
Re: [RFC net-next v1 04/12] virtio_ring: perform premapped operations based on per-buffer
On Tue, 24 Sep 2024 15:35:01 +0800, Jason Wang wrote: > On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > > > The current configuration sets the virtqueue (vq) to premapped mode, > > implying that all buffers submitted to this queue must be mapped ahead > > of time. This presents a challenge for the virtnet send queue (sq): the > > virtnet driver would be required to keep track of dma information for vq > > size * 17, which can be substantial. However, if the premapped mode were > > applied on a per-buffer basis, the complexity would be greatly reduced. > > With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel > > skb buffers could remain unmapped. > > > > We can distinguish them by sg_page(sg), When sg_page(sg) is NULL, this > > indicates that the driver has performed DMA mapping in advance, allowing > > the Virtio core to directly utilize sg_dma_address(sg) without > > conducting any internal DMA mapping. Additionally, DMA unmap operations > > for this buffer will be bypassed. > > So I think we still need some explanation here. I think this works for > virtio-net as the sgs are initialized by the virtio-net device itself. > > But it seems not the case for all the others where the sgs were passed > from the uppyer subsystem. For example in __virtscsi_add_cmd(), we > had: > > if (sc && sc->sc_data_direction != DMA_NONE) { > if (sc->sc_data_direction != DMA_FROM_DEVICE) > out = &sc->sdb.table; > if (sc->sc_data_direction != DMA_TO_DEVICE) > in = &sc->sdb.table; > } > > /* Request header. */ > sg_init_one(&req, &cmd->req, req_size); > sgs[out_num++] = &req; > > /* Data-out buffer. */ > if (out) { > /* Place WRITE protection SGLs before Data OUT payload */ > if (scsi_prot_sg_count(sc)) > sgs[out_num++] = scsi_prot_sglist(sc); > sgs[out_num++] = out->sgl; > } With this in mind, I think the new api is a suitable approach to avoid changing sg. Thanks. > > Thanks >
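To summarize the per-buffer convention under discussion in one place (the driver-side helper is quoted from the series elsewhere in this thread, and the core-side macro is quoted in a later reply; note that the answer above concludes a dedicated API is preferable to changing the sg, so later revisions are expected to differ):

/* Driver side: an sg is marked "already DMA-mapped" by leaving its page NULL
 * and filling in the DMA address/length directly, e.g. for AF_XDP frames
 * whose pool was mapped once at bind time. */
static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
{
	sg_assign_page(sg, NULL);
	sg_dma_address(sg) = addr;
	sg_dma_len(sg) = len;
}

/* Core side: when this is true, the virtio core uses sg_dma_address(sg) and
 * sg_dma_len(sg) as-is and skips both the DMA map and the later unmap. */
#define sg_is_premapped(sg) (!sg_page(sg))

Buffers handed down from other subsystems (the virtio-scsi example above) keep their pages set, so they continue to be mapped and unmapped by the core exactly as before.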
Re: [RFC net-next v1 10/12] virtio_net: xsk: tx: support xmit xsk buffer
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > The driver's tx napi is very important for XSK. It is responsible for > obtaining data from the XSK queue and sending it out. > > At the beginning, we need to trigger tx napi. > > virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk > buffer) by the last bits of the pointer. > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio_net.c | 176 --- > 1 file changed, 166 insertions(+), 10 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 3ad4c6e3ef18..1a870f1df910 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -83,6 +83,7 @@ struct virtnet_sq_free_stats { > u64 bytes; > u64 napi_packets; > u64 napi_bytes; > + u64 xsk; > }; > > struct virtnet_sq_stats { > @@ -514,16 +515,20 @@ static struct sk_buff *virtnet_skb_append_frag(struct > sk_buff *head_skb, >struct sk_buff *curr_skb, >struct page *page, void *buf, >int len, int truesize); > +static void virtnet_xsk_completed(struct send_queue *sq, int num); > > enum virtnet_xmit_type { > VIRTNET_XMIT_TYPE_SKB, > VIRTNET_XMIT_TYPE_SKB_ORPHAN, > VIRTNET_XMIT_TYPE_XDP, > + VIRTNET_XMIT_TYPE_XSK, > }; > > /* We use the last two bits of the pointer to distinguish the xmit type. */ > #define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1)) > > +#define VIRTIO_XSK_FLAG_OFFSET 4 Any reason this is not 2? > + > static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr) > { > unsigned long p = (unsigned long)*ptr; > @@ -546,6 +551,11 @@ static int virtnet_add_outbuf(struct send_queue *sq, int > num, void *data, > GFP_ATOMIC); > } > > +static u32 virtnet_ptr_to_xsk_buff_len(void *ptr) > +{ > + return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET; > +} > + > static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > { > sg_assign_page(sg, NULL); > @@ -587,11 +597,27 @@ static void __free_old_xmit(struct send_queue *sq, > struct netdev_queue *txq, > stats->bytes += xdp_get_frame_len(frame); > xdp_return_frame(frame); > break; > + > + case VIRTNET_XMIT_TYPE_XSK: > + stats->bytes += virtnet_ptr_to_xsk_buff_len(ptr); > + stats->xsk++; > + break; > } > } > netdev_tx_completed_queue(txq, stats->napi_packets, > stats->napi_bytes); Not related to this patch, but this seems unnecessary to AF_XDP. > } > > +static void virtnet_free_old_xmit(struct send_queue *sq, > + struct netdev_queue *txq, > + bool in_napi, > + struct virtnet_sq_free_stats *stats) > +{ > + __free_old_xmit(sq, txq, in_napi, stats); > + > + if (stats->xsk) > + virtnet_xsk_completed(sq, stats->xsk); > +} > + > /* Converting between virtqueue no. and kernel tx/rx queue no. > * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq > */ > @@ -1019,7 +1045,7 @@ static void free_old_xmit(struct send_queue *sq, struct > netdev_queue *txq, > { > struct virtnet_sq_free_stats stats = {0}; > > - __free_old_xmit(sq, txq, in_napi, &stats); > + virtnet_free_old_xmit(sq, txq, in_napi, &stats); > > /* Avoid overhead when no packets have been processed > * happens when called speculatively from start_xmit. 
> @@ -1380,6 +1406,111 @@ static int virtnet_add_recvbuf_xsk(struct > virtnet_info *vi, struct receive_queue > return err; > } > > +static void *virtnet_xsk_to_ptr(u32 len) > +{ > + unsigned long p; > + > + p = len << VIRTIO_XSK_FLAG_OFFSET; > + > + return virtnet_xmit_ptr_mix((void *)p, VIRTNET_XMIT_TYPE_XSK); > +} > + > +static int virtnet_xsk_xmit_one(struct send_queue *sq, > + struct xsk_buff_pool *pool, > + struct xdp_desc *desc) > +{ > + struct virtnet_info *vi; > + dma_addr_t addr; > + > + vi = sq->vq->vdev->priv; > + > + addr = xsk_buff_raw_get_dma(pool, desc->addr); > + xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len); > + > + sg_init_table(sq->sg, 2); > + > + sg_fill_dma(sq->sg, sq->xsk_hdr_dma_addr, vi->hdr_len); > + sg_fill_dma(sq->sg + 1, addr, desc->len); > + > + return virtqueue_add_outbuf(sq->vq, sq->sg, 2, > + virtnet_xsk_to_ptr(desc->len), > GFP_ATOMIC); > +} > + > +static int virtnet_xsk_xmit_batch(struct send_queue *sq, > + struct xsk_buff_pool *pool, > + unsign
Re: [RFC net-next v1 04/12] virtio_ring: perform premapped operations based on per-buffer
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > The current configuration sets the virtqueue (vq) to premapped mode, > implying that all buffers submitted to this queue must be mapped ahead > of time. This presents a challenge for the virtnet send queue (sq): the > virtnet driver would be required to keep track of dma information for vq > size * 17, which can be substantial. However, if the premapped mode were > applied on a per-buffer basis, the complexity would be greatly reduced. > With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel > skb buffers could remain unmapped. > > We can distinguish them by sg_page(sg), When sg_page(sg) is NULL, this > indicates that the driver has performed DMA mapping in advance, allowing > the Virtio core to directly utilize sg_dma_address(sg) without > conducting any internal DMA mapping. Additionally, DMA unmap operations > for this buffer will be bypassed. So I think we still need some explanation here. I think this works for virtio-net as the sgs are initialized by the virtio-net device itself. But it seems not the case for all the others where the sgs were passed from the uppyer subsystem. For example in __virtscsi_add_cmd(), we had: if (sc && sc->sc_data_direction != DMA_NONE) { if (sc->sc_data_direction != DMA_FROM_DEVICE) out = &sc->sdb.table; if (sc->sc_data_direction != DMA_TO_DEVICE) in = &sc->sdb.table; } /* Request header. */ sg_init_one(&req, &cmd->req, req_size); sgs[out_num++] = &req; /* Data-out buffer. */ if (out) { /* Place WRITE protection SGLs before Data OUT payload */ if (scsi_prot_sg_count(sc)) sgs[out_num++] = scsi_prot_sglist(sc); sgs[out_num++] = out->sgl; } Thanks
Re: [RFC net-next v1 08/12] virtio_net: xsk: bind/unbind xsk for tx
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > This patch implement the logic of bind/unbind xsk pool to sq and rq. > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio_net.c | 53 > 1 file changed, 53 insertions(+) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 41a5ea9b788d..7c379614fd22 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -295,6 +295,10 @@ struct send_queue { > > /* Record whether sq is in reset state. */ > bool reset; > + > + struct xsk_buff_pool *xsk_pool; > + > + dma_addr_t xsk_hdr_dma_addr; > }; > > /* Internal representation of a receive virtqueue */ > @@ -497,6 +501,8 @@ struct virtio_net_common_hdr { > }; > }; > > +static struct virtio_net_common_hdr xsk_hdr; > + > static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf); > static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff > *xdp, >struct net_device *dev, > @@ -5488,6 +5494,29 @@ static int virtnet_rq_bind_xsk_pool(struct > virtnet_info *vi, struct receive_queu > return err; > } > > +static int virtnet_sq_bind_xsk_pool(struct virtnet_info *vi, > + struct send_queue *sq, > + struct xsk_buff_pool *pool) > +{ > + int err, qindex; > + > + qindex = sq - vi->sq; > + > + virtnet_tx_pause(vi, sq); > + > + err = virtqueue_reset(sq->vq, virtnet_sq_free_unused_buf); > + if (err) { > + netdev_err(vi->dev, "reset tx fail: tx queue index: %d err: > %d\n", qindex, err); > + pool = NULL; > + } > + > + sq->xsk_pool = pool; > + > + virtnet_tx_resume(vi, sq); > + > + return err; > +} > + > static int virtnet_xsk_pool_enable(struct net_device *dev, >struct xsk_buff_pool *pool, >u16 qid) > @@ -5496,6 +5525,7 @@ static int virtnet_xsk_pool_enable(struct net_device > *dev, > struct receive_queue *rq; > struct device *dma_dev; > struct send_queue *sq; > + dma_addr_t hdr_dma; > int err, size; > > if (vi->hdr_len > xsk_pool_get_headroom(pool)) > @@ -5533,6 +5563,11 @@ static int virtnet_xsk_pool_enable(struct net_device > *dev, > if (!rq->xsk_buffs) > return -ENOMEM; > > + hdr_dma = virtqueue_dma_map_single_attrs(sq->vq, &xsk_hdr, > vi->hdr_len, > +DMA_TO_DEVICE, 0); > + if (virtqueue_dma_mapping_error(sq->vq, hdr_dma)) > + return -ENOMEM; > + > err = xsk_pool_dma_map(pool, dma_dev, 0); > if (err) > goto err_xsk_map; > @@ -5541,11 +5576,24 @@ static int virtnet_xsk_pool_enable(struct net_device > *dev, > if (err) > goto err_rq; > > + err = virtnet_sq_bind_xsk_pool(vi, sq, pool); > + if (err) > + goto err_sq; > + > + /* Now, we do not support tx offset, so all the tx virtnet hdr is > zero. What did you mean by "tx offset" here? (Or I don't see the connection with vnet hdr). Anyhow the patch looks good. Acked-by: Jason Wang Thanks
Re: [RFC net-next v1 07/12] virtio_net: refactor the xmit type
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > Because the af-xdp will introduce a new xmit type, so I refactor the > xmit type mechanism first. > > In general, pointers are aligned to 4 or 8 bytes. I think this needs some clarification, the alignment seems to depend on the lowest common multiple between the alignments of all struct members. So we know both xdp_frame and sk_buff are at least 4 bytes aligned. If we want to reuse the lowest bit of pointers in AF_XDP, the alignment of the data structure should be at least 4 bytes. > If it is aligned to 4 > bytes, then only two bits are free for a pointer. But there are 4 types > here, so we can't use bits to distinguish them. And 2 bits is enough for > 4 types: > > 00 for skb > 01 for SKB_ORPHAN > 10 for XDP > 11 for af-xdp tx > > Signed-off-by: Xuan Zhuo > --- > drivers/net/virtio_net.c | 90 +++- > 1 file changed, 51 insertions(+), 39 deletions(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 630e5b21ad69..41a5ea9b788d 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -45,9 +45,6 @@ module_param(napi_tx, bool, 0644); > #define VIRTIO_XDP_TX BIT(0) > #define VIRTIO_XDP_REDIR BIT(1) > > -#define VIRTIO_XDP_FLAGBIT(0) > -#define VIRTIO_ORPHAN_FLAG BIT(1) > - > /* RX packet size EWMA. The average packet size is used to determine the > packet > * buffer size when refilling RX rings. As the entire RX ring may be refilled > * at once, the weight is chosen so that the EWMA will be insensitive to > short- > @@ -512,34 +509,35 @@ static struct sk_buff *virtnet_skb_append_frag(struct > sk_buff *head_skb, >struct page *page, void *buf, >int len, int truesize); > > -static bool is_xdp_frame(void *ptr) > -{ > - return (unsigned long)ptr & VIRTIO_XDP_FLAG; > -} > +enum virtnet_xmit_type { > + VIRTNET_XMIT_TYPE_SKB, > + VIRTNET_XMIT_TYPE_SKB_ORPHAN, > + VIRTNET_XMIT_TYPE_XDP, > +}; > > -static void *xdp_to_ptr(struct xdp_frame *ptr) > -{ > - return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG); > -} > +/* We use the last two bits of the pointer to distinguish the xmit type. */ > +#define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1)) > > -static struct xdp_frame *ptr_to_xdp(void *ptr) > +static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr) Nit: not a native speaker but I think something like pack/unpack might be better. With those changes. Acked-by: Jason Wang Thanks
Re: [RFC net-next v1 02/12] virtio_ring: split: record extras for indirect buffers
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo wrote: > > The subsequent commit needs to know whether every indirect buffer is > premapped or not. So we need to introduce an extra struct for every > indirect buffer to record this info. > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 126 +-- > 1 file changed, 61 insertions(+), 65 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 228e9fbcba3f..62901bee97c0 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -67,9 +67,20 @@ > #define LAST_ADD_TIME_INVALID(vq) > #endif > > +struct vring_desc_extra { > + dma_addr_t addr;/* Descriptor DMA addr. */ > + u32 len;/* Descriptor length. */ > + u16 flags; /* Descriptor flags. */ > + u16 next; /* The next desc state in a list. */ > +}; > + > struct vring_desc_state_split { > void *data; /* Data for callback. */ > - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ > + > + /* Indirect extra table and desc table, if any. These two will be > +* allocated together. So we won't stress more to the memory > allocator. > +*/ > + struct vring_desc_extra *indir; > }; > > struct vring_desc_state_packed { > @@ -79,13 +90,6 @@ struct vring_desc_state_packed { > u16 last; /* The last desc state in a list. */ > }; > > -struct vring_desc_extra { > - dma_addr_t addr;/* Descriptor DMA addr. */ > - u32 len;/* Descriptor length. */ > - u16 flags; /* Descriptor flags. */ > - u16 next; /* The next desc state in a list. */ > -}; > - > struct vring_virtqueue_split { > /* Actual memory layout for this queue. */ > struct vring vring; > @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue *vq, > u32 num) > * Split ring specific functions - *_split(). > */ > > -static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq, > - const struct vring_desc *desc) > -{ > - u16 flags; > - > - if (!vring_need_unmap_buffer(vq)) > - return; > - > - flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); > - > - dma_unmap_page(vring_dma_dev(vq), > - virtio64_to_cpu(vq->vq.vdev, desc->addr), > - virtio32_to_cpu(vq->vq.vdev, desc->len), > - (flags & VRING_DESC_F_WRITE) ? > - DMA_FROM_DEVICE : DMA_TO_DEVICE); > -} > - > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, > - unsigned int i) > + struct vring_desc_extra *extra) > { > - struct vring_desc_extra *extra = vq->split.desc_extra; > u16 flags; > > - flags = extra[i].flags; > + flags = extra->flags; > > if (flags & VRING_DESC_F_INDIRECT) { > if (!vq->use_dma_api) > goto out; > > dma_unmap_single(vring_dma_dev(vq), > -extra[i].addr, > -extra[i].len, > +extra->addr, > +extra->len, > (flags & VRING_DESC_F_WRITE) ? > DMA_FROM_DEVICE : DMA_TO_DEVICE); > } else { > @@ -479,20 +465,22 @@ static unsigned int vring_unmap_one_split(const struct > vring_virtqueue *vq, > goto out; > > dma_unmap_page(vring_dma_dev(vq), > - extra[i].addr, > - extra[i].len, > + extra->addr, > + extra->len, >(flags & VRING_DESC_F_WRITE) ? >DMA_FROM_DEVICE : DMA_TO_DEVICE); > } > > out: > - return extra[i].next; > + return extra->next; > } > > static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq, >unsigned int total_sg, > + struct vring_desc_extra > **pextra, This kind of API seems a little bit strange. In the caller we had: if (desc) { ... } else { extra = vq->split.desc_extra; } I'd move the assignment of the extra there to make the code more readable. 
>gfp_t gfp) > { > + struct vring_desc_extra *extra; > struct vring_desc *desc; > unsigned int i; > > @@ -503,40 +491,45 @@ static struct vring_desc *alloc_indirect_split(struct > virtqueue *_vq, > */ >
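Since the hunk is cut off above, here is a sketch of the "allocated together" layout the commit message describes, using the signature from the quoted diff (the exact body in the patch may differ):

static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
					       unsigned int total_sg,
					       struct vring_desc_extra **pextra,
					       gfp_t gfp)
{
	struct vring_desc *desc;

	/* One allocation: total_sg descriptors followed by total_sg extra
	 * entries, so indirect buffers get per-descriptor DMA bookkeeping
	 * without a second trip to the memory allocator. */
	desc = kmalloc_array(total_sg,
			     sizeof(*desc) + sizeof(struct vring_desc_extra),
			     gfp);
	if (!desc)
		return NULL;

	*pextra = (struct vring_desc_extra *)&desc[total_sg];

	return desc;
}

Because vring_desc and vring_desc_extra are both 16 bytes here, &desc[total_sg] lands exactly at the start of the extra table within the single allocation.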
Re: [PATCH net-next 04/13] virtio_ring: perform premapped operations based on per-buffer
On Thu, Sep 12, 2024 at 3:43 PM Xuan Zhuo wrote: > > On Wed, 11 Sep 2024 11:54:25 +0800, Jason Wang wrote: > > On Tue, Aug 20, 2024 at 3:33 PM Xuan Zhuo > > wrote: > > > > > > The current configuration sets the virtqueue (vq) to premapped mode, > > > implying that all buffers submitted to this queue must be mapped ahead > > > of time. This presents a challenge for the virtnet send queue (sq): the > > > virtnet driver would be required to keep track of dma information for vq > > > size * 17, which can be substantial. However, if the premapped mode were > > > applied on a per-buffer basis, the complexity would be greatly reduced. > > > With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel > > > skb buffers could remain unmapped. > > > > Is this only applied to TX or both TX and RX. > > > For rx, if you mean per-buffer dma buffer, I think it is yes, > rx can reuse this. If you mean should we do premapped for the > normal rx buffers, I think we should, that can reduce the > dma map operations. > > > > > > > > > > We can distinguish them by sg_page(sg), When sg_page(sg) is NULL, this > > > indicates that the driver has performed DMA mapping in advance, allowing > > > the Virtio core to directly utilize sg_dma_address(sg) without > > > conducting any internal DMA mapping. > > > > This seems conflict with the code below? > > > > #define sg_is_premapped(sg) (!sg_page(sg)) > > Sorry, I do not follow you. > > The key point is that sg->page is set by the driver. Ok, I forgot that, but let's document this assumption in the changelog. > > So I mean if the driver sets sg->page = NULL, then for this sg, > the virtio core can skip dma mapping. If the driver sets > sg->page to the page of the buffer, then the virtio core should > do dma mapping for this sg. > Ok, let's describe this in the changelog. Thanks
Re: [PATCH net-next 07/13] virtio_net: refactor the xmit type
On Thu, Sep 12, 2024 at 3:54 PM Xuan Zhuo wrote: > > On Wed, 11 Sep 2024 12:04:16 +0800, Jason Wang wrote: > > On Tue, Aug 20, 2024 at 3:33 PM Xuan Zhuo > > wrote: > > > > > > Because the af-xdp will introduce a new xmit type, so I refactor the > > > xmit type mechanism first. > > > > > > We use the last two bits of the pointer to distinguish the xmit type, > > > so we can distinguish four xmit types. Now we have three types: skb, > > > orphan and xdp. > > > > And if I was not wrong, we do not anymore use bitmasks. If yes, let's > > explain the reason here. > > In general, pointers are aligned to 4 or 8 bytes. If it is aligned to 4 bytes, > then only two bits are free for a pointer. So we can only use two bits. > > But there are 4 types here, so we can't use bits to distinguish them. > > b00 for skb > b01 for SKB_ORPHAN > b10 for XDP > b11 for af-xdp tx Let's add these in the changelog. Thanks
Re: [PATCH net-next 10/13] virtio_net: xsk: tx: support xmit xsk buffer
On Thu, Sep 12, 2024 at 4:50 PM Xuan Zhuo wrote: > > On Wed, 11 Sep 2024 12:31:32 +0800, Jason Wang wrote: > > On Tue, Aug 20, 2024 at 3:33 PM Xuan Zhuo > > wrote: > > > > > > The driver's tx napi is very important for XSK. It is responsible for > > > obtaining data from the XSK queue and sending it out. > > > > > > At the beginning, we need to trigger tx napi. > > > > > > Signed-off-by: Xuan Zhuo > > > --- > > > drivers/net/virtio_net.c | 127 ++- > > > 1 file changed, 125 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > > index 221681926d23..3743694d3c3b 100644 > > > --- a/drivers/net/virtio_net.c > > > +++ b/drivers/net/virtio_net.c > > > @@ -516,10 +516,13 @@ enum virtnet_xmit_type { > > > VIRTNET_XMIT_TYPE_SKB, > > > VIRTNET_XMIT_TYPE_ORPHAN, > > > VIRTNET_XMIT_TYPE_XDP, > > > + VIRTNET_XMIT_TYPE_XSK, > > > }; > > > > > > #define VIRTNET_XMIT_TYPE_MASK (VIRTNET_XMIT_TYPE_SKB | > > > VIRTNET_XMIT_TYPE_ORPHAN \ > > > - | VIRTNET_XMIT_TYPE_XDP) > > > + | VIRTNET_XMIT_TYPE_XDP | > > > VIRTNET_XMIT_TYPE_XSK) > > > + > > > +#define VIRTIO_XSK_FLAG_OFFSET 4 > > > > > > static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr) > > > { > > > @@ -543,6 +546,11 @@ static int virtnet_add_outbuf(struct send_queue *sq, > > > int num, void *data, > > > GFP_ATOMIC); > > > } > > > > > > +static u32 virtnet_ptr_to_xsk(void *ptr) > > > +{ > > > + return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET; > > > +} > > > + > > > > This needs a better name, otherwise readers might be confused. > > > > E.g something like virtnet_ptr_to_xsk_buff_len()? > > > > > static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len) > > > { > > > sg_assign_page(sg, NULL); > > > @@ -584,6 +592,10 @@ static void __free_old_xmit(struct send_queue *sq, > > > struct netdev_queue *txq, > > > stats->bytes += xdp_get_frame_len(frame); > > > xdp_return_frame(frame); > > > break; > > > + > > > + case VIRTNET_XMIT_TYPE_XSK: > > > + stats->bytes += virtnet_ptr_to_xsk(ptr); > > > + break; > > > > Do we miss xsk_tx_completed() here? 
> > > > > } > > > } > > > netdev_tx_completed_queue(txq, stats->napi_packets, > > > stats->napi_bytes); > > > @@ -1393,6 +1405,97 @@ static int virtnet_xsk_wakeup(struct net_device > > > *dev, u32 qid, u32 flag) > > > return 0; > > > } > > > > > > +static void *virtnet_xsk_to_ptr(u32 len) > > > +{ > > > + unsigned long p; > > > + > > > + p = len << VIRTIO_XSK_FLAG_OFFSET; > > > + > > > + return virtnet_xmit_ptr_mix((void *)p, VIRTNET_XMIT_TYPE_XSK); > > > +} > > > + > > > +static int virtnet_xsk_xmit_one(struct send_queue *sq, > > > + struct xsk_buff_pool *pool, > > > + struct xdp_desc *desc) > > > +{ > > > + struct virtnet_info *vi; > > > + dma_addr_t addr; > > > + > > > + vi = sq->vq->vdev->priv; > > > + > > > + addr = xsk_buff_raw_get_dma(pool, desc->addr); > > > + xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len); > > > + > > > + sg_init_table(sq->sg, 2); > > > + > > > + sg_fill_dma(sq->sg, sq->xsk_hdr_dma_addr, vi->hdr_len); > > > + sg_fill_dma(sq->sg + 1, addr, desc->len); > > > + > > > + return virtqueue_add_outbuf(sq->vq, sq->sg, 2, > > > + virtnet_xsk_to_ptr(desc->len), > > > GFP_ATOMIC); > > > +} > > > + > > > +static int virtnet_xsk_xmit_batch(struct send_queue *sq, > > > + struct xsk_buff_pool *pool, > > > + unsigned int budget, > > > + u64 *kicks) > > > +{ > > > + struct xdp_desc *descs = pool->tx_descs; > > > + bool kick = false; > > > + u32 nb_pkts, i; > > > + int err; > > > + > > > + budget = min_t(u32, budget, sq->vq->num_free); > > > + > > > + nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget); > > > + if (!nb_pkts) > > > + return 0; > > > + > > > + for (i = 0; i < nb_pkts; i++) { > > > + err = virtnet_xsk_xmit_one(sq, pool, &descs[i]); > > > + if (unlikely(err)) { > > > + xsk_tx_completed(sq->xsk_pool, nb_pkts - i); > > > > Should we kick in this condition? > > > > > + break; > > > + } > > > + > > > + kick = true; > > > + } > > > + > > > + if (kick && virtqueue_kick_prepare(sq->vq) && > > > virtqueue_notify(sq->vq)) > > > > Can we simply use virtqueue_kick() here? > > > > > + (*kicks)++; > > > + > > > + retu
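For context on the virtqueue_kick() question above, the core helper is (roughly) the combination of the two calls used in the patch:

/* drivers/virtio/virtio_ring.c, paraphrased */
bool virtqueue_kick(struct virtqueue *vq)
{
	if (virtqueue_kick_prepare(vq))
		return virtqueue_notify(vq);
	return true;
}

One subtlety: the open-coded form lets the driver bump the kicks counter only when a notification was actually sent to the device, which the boolean return of virtqueue_kick() alone cannot distinguish from a suppressed notification.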