Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr

2025-01-09 Thread Jason Wang
On Mon, Dec 30, 2024 at 5:51 PM Chia-Yu Chang (Nokia)
 wrote:
>
> >From: Jason Wang 
> >Sent: Monday, December 30, 2024 8:52 AM
> >To: Chia-Yu Chang (Nokia) 
> >Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; 
> >eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; 
> >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; 
> >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; 
> >netfilter-de...@vger.kernel.org; coret...@netfilter.org; 
> >shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; 
> >sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; 
> >xuanz...@linux.alibaba.com; epere...@redhat.com; 
> >virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; Koen 
> >De Schepper (Nokia) ; 
> >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; 
> >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; 
> >jason_living...@comcast.com; vidhi_g...@apple.com
> >Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in 
> >virtio_net_hdr
> >
> >
> >On Sat, Dec 28, 2024 at 3:13 AM  wrote:
> >>
> >> From: Chia-Yu Chang 
> >>
> >> Unlike RFC 3168 ECN, Accurate ECN uses the CWR flag as part of the ACE
> >> field to count new packets with a CE mark; however, it will be corrupted
> >> by RFC 3168 ECN-aware TSO. Therefore, a fallback shall be applied by
> >> setting NETIF_F_GSO_ACCECN to ensure that the CWR flag is not changed
> >> within a super-skb.
> >>
> >> To apply the aforementioned new AccECN GSO for virtio, new feature bits
> >> for host and guest are added for feature negotiation between driver
> >> and device. The translation of the Accurate ECN GSO flag between
> >> virtio_net_hdr and the skb header for NETIF_F_GSO_ACCECN is also added
> >> to avoid CWR flag corruption due to RFC 3168 ECN TSO.
> >>
> >> Signed-off-by: Chia-Yu Chang 
> >> ---
> >>  drivers/net/virtio_net.c| 14 +++---
> >>  drivers/vdpa/pds/debugfs.c  |  6 ++
> >>  include/linux/virtio_net.h  | 16 ++--
> >>  include/uapi/linux/virtio_net.h |  5 +
> >>  4 files changed, 32 insertions(+), 9 deletions(-)
> >
> >Is there a link to the spec patch? It needs to be accepted first.
> >
> >Thanks
>
> Hi Jason,
>
> Thanks for the feedback. I found the virtio spec on GitHub
> (https://github.com/oasis-tcs/virtio-spec) but was not able to find the
> procedure for proposing changes.
> Could you share the procedure for proposing a spec patch? Thanks.

Sorry for the late reply.

You can start by subscribing to virtio-dev and sending a spec patch there.

Thanks

>
> --
> Chia-Yu




Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr

2025-01-09 Thread Lei Yang
I tested this v6 patch series with virtio-net regression tests;
everything works fine.

Tested-by: Lei Yang 

On Wed, Jan 8, 2025 at 8:33 PM Michael S. Tsirkin  wrote:
>
> On Mon, Dec 30, 2024 at 09:50:59AM +, Chia-Yu Chang (Nokia) wrote:
> > >From: Jason Wang 
> > >Sent: Monday, December 30, 2024 8:52 AM
> > >To: Chia-Yu Chang (Nokia) 
> > >Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; 
> > >eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; 
> > >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; 
> > >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; 
> > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; 
> > >shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; 
> > >sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; 
> > >xuanz...@linux.alibaba.com; epere...@redhat.com; 
> > >virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; 
> > >Koen De Schepper (Nokia) ; 
> > >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; 
> > >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; 
> > >jason_living...@comcast.com; vidhi_g...@apple.com
> > >Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in 
> > >virtio_net_hdr
> > >
> > >
> > >On Sat, Dec 28, 2024 at 3:13 AM  wrote:
> > >>
> > >> From: Chia-Yu Chang 
> > >>
> > >> Unlike RFC 3168 ECN, Accurate ECN uses the CWR flag as part of the ACE
> > >> field to count new packets with a CE mark; however, it will be corrupted
> > >> by RFC 3168 ECN-aware TSO. Therefore, a fallback shall be applied by
> > >> setting NETIF_F_GSO_ACCECN to ensure that the CWR flag is not changed
> > >> within a super-skb.
> > >>
> > >> To apply the aforementioned new AccECN GSO for virtio, new feature bits
> > >> for host and guest are added for feature negotiation between driver
> > >> and device. The translation of the Accurate ECN GSO flag between
> > >> virtio_net_hdr and the skb header for NETIF_F_GSO_ACCECN is also added
> > >> to avoid CWR flag corruption due to RFC 3168 ECN TSO.
> > >>
> > >> Signed-off-by: Chia-Yu Chang 
> > >> ---
> > >>  drivers/net/virtio_net.c| 14 +++---
> > >>  drivers/vdpa/pds/debugfs.c  |  6 ++
> > >>  include/linux/virtio_net.h  | 16 ++--
> > >>  include/uapi/linux/virtio_net.h |  5 +
> > >>  4 files changed, 32 insertions(+), 9 deletions(-)
> > >
> > >Is there a link to the spec patch? It needs to be accepted first.
> > >
> > >Thanks
> >
> > Hi Jason,
> >
> > Thanks for the feedback. I found the virtio spec on GitHub
> > (https://github.com/oasis-tcs/virtio-spec) but was not able to find the
> > procedure for proposing changes.
> > Could you share the procedure for proposing a spec patch? Thanks.
>
>
> You post it on virtio-comment for discussion. Github issues are then used
> for voting and to track acceptance.
> https://github.com/oasis-tcs/virtio-spec/blob/master/README.md#use-of-github-issues
>
>
> > --
> > Chia-Yu
>
>




Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr

2025-01-08 Thread Michael S. Tsirkin
On Mon, Dec 30, 2024 at 09:50:59AM +, Chia-Yu Chang (Nokia) wrote:
> >From: Jason Wang  
> >Sent: Monday, December 30, 2024 8:52 AM
> >To: Chia-Yu Chang (Nokia) 
> >Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; 
> >eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; 
> >joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; 
> >ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; 
> >netfilter-de...@vger.kernel.org; coret...@netfilter.org; 
> >shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; 
> >sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; 
> >xuanz...@linux.alibaba.com; epere...@redhat.com; 
> >virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; Koen 
> >De Schepper (Nokia) ; 
> >g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; 
> >mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; 
> >jason_living...@comcast.com; vidhi_g...@apple.com
> >Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in 
> >virtio_net_hdr
> >
> >
> >On Sat, Dec 28, 2024 at 3:13 AM  wrote:
> >>
> >> From: Chia-Yu Chang 
> >>
> >> Unlike RFC 3168 ECN, Accurate ECN uses the CWR flag as part of the ACE
> >> field to count new packets with a CE mark; however, it will be corrupted
> >> by RFC 3168 ECN-aware TSO. Therefore, a fallback shall be applied by
> >> setting NETIF_F_GSO_ACCECN to ensure that the CWR flag is not changed
> >> within a super-skb.
> >>
> >> To apply the aforementioned new AccECN GSO for virtio, new feature bits
> >> for host and guest are added for feature negotiation between driver
> >> and device. The translation of the Accurate ECN GSO flag between
> >> virtio_net_hdr and the skb header for NETIF_F_GSO_ACCECN is also added
> >> to avoid CWR flag corruption due to RFC 3168 ECN TSO.
> >>
> >> Signed-off-by: Chia-Yu Chang 
> >> ---
> >>  drivers/net/virtio_net.c| 14 +++---
> >>  drivers/vdpa/pds/debugfs.c  |  6 ++
> >>  include/linux/virtio_net.h  | 16 ++--
> >>  include/uapi/linux/virtio_net.h |  5 +
> >>  4 files changed, 32 insertions(+), 9 deletions(-)
> >
> >Is there a link to the spec patch? It needs to be accepted first.
> >
> >Thanks
> 
> Hi Jason,
> 
> Thanks for the feedback. I found the virtio spec on GitHub
> (https://github.com/oasis-tcs/virtio-spec) but was not able to find the
> procedure for proposing changes.
> Could you share the procedure for proposing a spec patch? Thanks.


You post it on virtio-comment for discussion. Github issues are then used
for voting and to track acceptance.
https://github.com/oasis-tcs/virtio-spec/blob/master/README.md#use-of-github-issues


> --
> Chia-Yu




RE: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr

2024-12-30 Thread Chia-Yu Chang (Nokia)
>From: Jason Wang  
>Sent: Monday, December 30, 2024 8:52 AM
>To: Chia-Yu Chang (Nokia) 
>Cc: netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; 
>eduma...@google.com; dsah...@kernel.org; pab...@redhat.com; 
>joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; 
>ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; 
>netfilter-de...@vger.kernel.org; coret...@netfilter.org; 
>shenjia...@huawei.com; salil.me...@huawei.com; shaoji...@huawei.com; 
>sae...@nvidia.com; tar...@nvidia.com; m...@redhat.com; 
>xuanz...@linux.alibaba.com; epere...@redhat.com; 
>virtualizat...@lists.linux.dev; i...@kernel.org; ncardw...@google.com; Koen De 
>Schepper (Nokia) ; 
>g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; 
>mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; 
>jason_living...@comcast.com; vidhi_g...@apple.com
>Subject: Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in 
>virtio_net_hdr
>
>
>On Sat, Dec 28, 2024 at 3:13 AM  wrote:
>>
>> From: Chia-Yu Chang 
>>
>> Unlike RFC 3168 ECN, Accurate ECN uses the CWR flag as part of the ACE
>> field to count new packets with a CE mark; however, it will be corrupted
>> by RFC 3168 ECN-aware TSO. Therefore, a fallback shall be applied by
>> setting NETIF_F_GSO_ACCECN to ensure that the CWR flag is not changed
>> within a super-skb.
>>
>> To apply the aforementioned new AccECN GSO for virtio, new feature bits
>> for host and guest are added for feature negotiation between driver
>> and device. The translation of the Accurate ECN GSO flag between
>> virtio_net_hdr and the skb header for NETIF_F_GSO_ACCECN is also added
>> to avoid CWR flag corruption due to RFC 3168 ECN TSO.
>>
>> Signed-off-by: Chia-Yu Chang 
>> ---
>>  drivers/net/virtio_net.c| 14 +++---
>>  drivers/vdpa/pds/debugfs.c  |  6 ++
>>  include/linux/virtio_net.h  | 16 ++--
>>  include/uapi/linux/virtio_net.h |  5 +
>>  4 files changed, 32 insertions(+), 9 deletions(-)
>
>Is there a link to the spec patch? It needs to be accepted first.
>
>Thanks

Hi Jason,

Thanks for the feedback. I found the virtio spec on GitHub
(https://github.com/oasis-tcs/virtio-spec) but was not able to find the
procedure for proposing changes.
Could you share the procedure for proposing a spec patch? Thanks.

--
Chia-Yu


Re: [PATCH v6 net-next 11/14] virtio_net: Accurate ECN flag in virtio_net_hdr

2024-12-29 Thread Jason Wang
On Sat, Dec 28, 2024 at 3:13 AM  wrote:
>
> From: Chia-Yu Chang 
>
> Unlike RFC 3168 ECN, Accurate ECN uses the CWR flag as part of the ACE
> field to count new packets with a CE mark; however, it will be corrupted
> by RFC 3168 ECN-aware TSO. Therefore, a fallback shall be applied by
> setting NETIF_F_GSO_ACCECN to ensure that the CWR flag is not changed
> within a super-skb.
>
> To apply the aforementioned new AccECN GSO for virtio, new feature bits
> for host and guest are added for feature negotiation between driver
> and device. The translation of the Accurate ECN GSO flag between
> virtio_net_hdr and the skb header for NETIF_F_GSO_ACCECN is also added
> to avoid CWR flag corruption due to RFC 3168 ECN TSO.
>
> Signed-off-by: Chia-Yu Chang 
> ---
>  drivers/net/virtio_net.c| 14 +++---
>  drivers/vdpa/pds/debugfs.c  |  6 ++
>  include/linux/virtio_net.h  | 16 ++--
>  include/uapi/linux/virtio_net.h |  5 +
>  4 files changed, 32 insertions(+), 9 deletions(-)

Is there a link to the spec patch? It needs to be accepted first.

Thanks
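
For readers who want to see the shape of the translation the commit message
describes, here is a minimal, self-contained sketch of mapping an AccECN GSO
type between virtio_net_hdr and skb GSO flags. All constant names and values
below (VIRTIO_NET_HDR_GSO_ACCECN, SKB_GSO_TCP_ACCECN, the bit values) are
illustrative assumptions, not taken from the actual patch.

```
#include <stdint.h>
#include <stdio.h>

/* Assumed values for illustration only; the real definitions belong in
 * include/uapi/linux/virtio_net.h and the kernel GSO/netdev feature headers.
 */
#define VIRTIO_NET_HDR_GSO_TCPV4   0x01	/* existing GSO type */
#define VIRTIO_NET_HDR_GSO_ACCECN  0x40	/* assumed new AccECN GSO bit */
#define VIRTIO_NET_HDR_GSO_ECN     0x80	/* RFC 3168 ECN-aware TSO */

#define SKB_GSO_TCPV4              (1u << 0)	/* illustrative */
#define SKB_GSO_TCP_ECN            (1u << 1)	/* illustrative */
#define SKB_GSO_TCP_ACCECN         (1u << 2)	/* illustrative */

/* virtio_net_hdr -> skb direction: an AccECN super-skb must not go through
 * RFC 3168 ECN-aware TSO, which would rewrite the CWR bit per segment.
 */
static uint32_t gso_type_to_skb_flags(uint8_t gso_type)
{
	uint32_t flags = 0;

	if ((gso_type & 0x3f) == VIRTIO_NET_HDR_GSO_TCPV4)
		flags |= SKB_GSO_TCPV4;

	if (gso_type & VIRTIO_NET_HDR_GSO_ACCECN)
		flags |= SKB_GSO_TCP_ACCECN;	/* CWR carries the ACE counter */
	else if (gso_type & VIRTIO_NET_HDR_GSO_ECN)
		flags |= SKB_GSO_TCP_ECN;	/* legacy RFC 3168 behaviour */

	return flags;
}

int main(void)
{
	uint8_t gso = VIRTIO_NET_HDR_GSO_TCPV4 | VIRTIO_NET_HDR_GSO_ACCECN;

	printf("skb gso flags: %#x\n", gso_type_to_skb_flags(gso));
	return 0;
}
```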




Re: [PATCH v6 net-next 00/14] AccECN protocol preparation patch series

2024-12-27 Thread Jakub Kicinski
On Fri, 27 Dec 2024 20:11:57 +0100 chia-yu.ch...@nokia-bell-labs.com
wrote:
> Hello

Hello.

Someone may still review, but:

## Form letter - winter-break

Networking development is suspended for winter holidays, until Jan 2nd.
We are currently accepting bug fixes only, see the announcements at:

https://lore.kernel.org/20241211164022.6a075...@kernel.org
https://lore.kernel.org/20241220182851.7acb6...@kernel.org

RFC patches sent for review only are welcome at any time.
-- 
pw-bot: defer



Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-20 Thread Michal Luczaj
On 12/20/24 11:49, Stefano Garzarella wrote:
> ...
> Note that non-NULL -> NULL should only occur before a connection is 
> established, so before any data is passed. Is this a problem for BPF?

Please take a look at vsock_bpf_update_proto(). The condition is to have a
transport assigned. BPF assumes transport will stay valid.

And currently that's a wrong assumption: transport can transition from
non-NULL to NULL (due to a failed reconnect). That's why we hit null ptr
deref via vsock_bpf_recvmsg().

That said, I sure hope someone BPF-competent is reading this :)




Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-20 Thread Stefano Garzarella

On Thu, Dec 19, 2024 at 05:09:42PM +0100, Michal Luczaj wrote:

On 12/19/24 16:12, Stefano Garzarella wrote:

On Thu, 19 Dec 2024 at 16:05, Michal Luczaj  wrote:


On 12/19/24 15:48, Stefano Garzarella wrote:

On Thu, 19 Dec 2024 at 15:36, Michal Luczaj  wrote:


On 12/19/24 09:19, Stefano Garzarella wrote:

...
I think the best thing though is to better understand how to handle
deassign, rather than checking everywhere that it's not null, also
because in some cases (like the one in virtio-vsock), it's also
important that the transport is the same.


My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
it impossible-by-design to switch ->transport from non-NULL to NULL in
vsock_assign_transport().


I don't know if that's enough, in this case the problem is that some
response packets are intended for a socket, where the transport has
changed. So whether it's null or assigned but different, it's still a
problem we have to handle.

So making it impossible for the transport to be null, but allowing it
to be different (we can't prevent it from changing), doesn't solve the
problem for us, it only shifts it.


Got it. I assumed this issue would be solved by `vsk->transport !=
&t->transport` in the critical place(s).

(Note that BPF doesn't care if transport has changed; BPF just expects to
have _a_ transport.)


If I'm not mistaken, that would require rewriting vsock_assign_transport()
so that a new transport is assigned only once fully initialized, otherwise
keep the old one (still unhurt and functional) and return error. Because
failing connect() should not change anything under the hood, right?



Nope, connect should be able to change the transport.

Because a user can do an initial connect() that requires a specific
transport, this one fails maybe because there's no peer with that cid.
Then the user can redo the connect() to a different cid that requires
a different transport.


But the initial connect() failing does not change anything under the hood
(transport should/could stay NULL).


Nope, isn't null, it's assigned to a transport, because for example it
has to send a packet to connect to the remote CID and wait back for a
response that for example says the CID doesn't exist.


Ahh, I think I get it. So the initial connect() passed
vsock_assign_transport() successfully and then failed deeper in
vsock_connect(), right? That's fine. Let the socket have a useless
transport (a valid pointer nevertheless). 


Just to be clear, it's not useless, since it's used to make the 
connection. We know that it's useless just when we report that the 
connection failed, so maybe we should de-assign it when we set 
`sock->state = SS_UNCONNECTED`.



Sure, upcoming connect() can
assign a new (possibly useless just as well) transport, but there's no
reason to allow ->transport becoming NULL.


I'm not sure about this, in the end in the connection failure case, when 
we set `sock->state = SS_UNCONNECTED`, we're returning the socket to a 
pre-connect state, so it might make sense to also reset the transport to 
NULL, so that we have exactly the same conditions.




And a pre-connect socket (where ->transport==NULL) is not an issue, because
BPF won't let it in any sockmap, so vsock_bpf_recvmsg() won't be reachable.

Anyway, thanks for explaining,
Michal

PS. Or ignore the above and remove the socket from the sockmap at every
reconnect? Possible unhash abuse:


I should take a closer look at unhash, but it might make sense!



diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5cf8109f672a..8a65153ee186 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -483,6 +483,10 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct 
vsock_sock *psk)
if (vsk->transport == new_transport)
return 0;

+   const struct proto *prot = READ_ONCE(sk->sk_prot);
+   if (prot->unhash)
+   prot->unhash(sk);
+
/* transport->release() must be called with sock lock acquired.
 * This path can only be taken during vsock_connect(), where we
 * have already held the sock lock. In the other cases, this
diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c
index 4aa6e74ec295..80deb4d70aea 100644
--- a/net/vmw_vsock/vsock_bpf.c
+++ b/net/vmw_vsock/vsock_bpf.c
@@ -119,6 +119,7 @@ static void vsock_bpf_rebuild_protos(struct proto *prot, 
const struct proto *bas
*prot= *base;
prot->close  = sock_map_close;
    prot->recvmsg = vsock_bpf_recvmsg;
+   prot->unhash = sock_map_unhash;
prot->sock_is_readable = sk_msg_is_readable;
}


Then a successful re-connect assigns
the transport (NULL -> non-NULL). And it's all good because all I wanted to
avoid (because of BPF) was non-NULL -> NULL.
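
As a rough illustration of the idea floated above (dropping the transport at
the same point where a failed connect() sets `sock->state = SS_UNCONNECTED`),
the change could look like the sketch below. This is only an untested sketch
of the suggestion; the exact error path in vsock_connect() and whether
vsock_deassign_transport() can be called directly there are assumptions.

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ vsock_connect(): error path (illustrative hunk, no real line numbers) @@
 	sock->state = SS_UNCONNECTED;
+	/* Sketch: return the socket to a true pre-connect state so nothing
+	 * (e.g. vsock_bpf_recvmsg()) can later trip over a stale transport.
+	 */
+	vsock_deassign_transport(vsk);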

Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-19 Thread Michal Luczaj
On 12/19/24 16:12, Stefano Garzarella wrote:
> On Thu, 19 Dec 2024 at 16:05, Michal Luczaj  wrote:
>>
>> On 12/19/24 15:48, Stefano Garzarella wrote:
>>> On Thu, 19 Dec 2024 at 15:36, Michal Luczaj  wrote:
>>>>
>>>> On 12/19/24 09:19, Stefano Garzarella wrote:
>>>>> ...
>>>>> I think the best thing though is to better understand how to handle
>>>>> deassign, rather than checking everywhere that it's not null, also
>>>>> because in some cases (like the one in virtio-vsock), it's also
>>>>> important that the transport is the same.
>>>>
>>>> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
>>>> it impossible-by-design to switch ->transport from non-NULL to NULL in
>>>> vsock_assign_transport().
>>>
>>> I don't know if that's enough, in this case the problem is that some
>>> response packets are intended for a socket, where the transport has
>>> changed. So whether it's null or assigned but different, it's still a
>>> problem we have to handle.
>>>
>>> So making it impossible for the transport to be null, but allowing it
>>> to be different (we can't prevent it from changing), doesn't solve the
>>> problem for us, it only shifts it.
>>
>> Got it. I assumed this issue would be solved by `vsk->transport !=
>> &t->transport` in the critical place(s).
>>
>> (Note that BPF doesn't care if transport has changed; BPF just expects to
>> have _a_ transport.)
>>
>>>> If I'm not mistaken, that would require rewriting vsock_assign_transport()
>>>> so that a new transport is assigned only once fully initialized, otherwise
>>>> keep the old one (still unhurt and functional) and return error. Because
>>>> failing connect() should not change anything under the hood, right?
>>>>
>>>
>>> Nope, connect should be able to change the transport.
>>>
>>> Because a user can do an initial connect() that requires a specific
>>> transport, this one fails maybe because there's no peer with that cid.
>>> Then the user can redo the connect() to a different cid that requires
>>> a different transport.
>>
>> But the initial connect() failing does not change anything under the hood
>> (transport should/could stay NULL).
> 
> Nope, isn't null, it's assigned to a transport, because for example it
> has to send a packet to connect to the remote CID and wait back for a
> response that for example says the CID doesn't exist.

Ahh, I think I get it. So the initial connect() passed
vsock_assign_transport() successfully and then failed deeper in
vsock_connect(), right? That's fine. Let the socket have a useless
transport (a valid pointer nevertheless). Sure, upcoming connect() can
assign a new (possibly useless just as well) transport, but there's no
reason to allow ->transport becoming NULL.

And a pre-connect socket (where ->transport==NULL) is not an issue, because
BPF won't let it in any sockmap, so vsock_bpf_recvmsg() won't be reachable.

Anyway, thanks for explaining,
Michal

PS. Or ignore the above and remove the socket from the sockmap at every
reconnect? Possible unhash abuse:

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5cf8109f672a..8a65153ee186 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -483,6 +483,10 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct 
vsock_sock *psk)
if (vsk->transport == new_transport)
return 0;
 
+   const struct proto *prot = READ_ONCE(sk->sk_prot);
+   if (prot->unhash)
+   prot->unhash(sk);
+
/* transport->release() must be called with sock lock acquired.
 * This path can only be taken during vsock_connect(), where we
     * have already held the sock lock. In the other cases, this
diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c
index 4aa6e74ec295..80deb4d70aea 100644
--- a/net/vmw_vsock/vsock_bpf.c
+++ b/net/vmw_vsock/vsock_bpf.c
@@ -119,6 +119,7 @@ static void vsock_bpf_rebuild_protos(struct proto *prot, 
const struct proto *bas
*prot= *base;
prot->close  = sock_map_close;
prot->recvmsg = vsock_bpf_recvmsg;
+   prot->unhash = sock_map_unhash;
prot->sock_is_readable = sk_msg_is_readable;
 }

>> Then a successful re-connect assigns
>> the transport (NULL -> non-NULL). And it's all good because all I wanted to
>> avoid (because of BPF) was non-NULL -> NULL. Anyway, that's my possibly
>> shallow understanding :)




Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-19 Thread Stefano Garzarella
On Thu, 19 Dec 2024 at 16:05, Michal Luczaj  wrote:
>
> On 12/19/24 15:48, Stefano Garzarella wrote:
> > On Thu, 19 Dec 2024 at 15:36, Michal Luczaj  wrote:
> >>
> >> On 12/19/24 09:19, Stefano Garzarella wrote:
> >>> ...
> >>> I think the best thing though is to better understand how to handle
> >>> deassign, rather than checking everywhere that it's not null, also
> >>> because in some cases (like the one in virtio-vsock), it's also
> >>> important that the transport is the same.
> >>
> >> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
> >> it impossible-by-design to switch ->transport from non-NULL to NULL in
> >> vsock_assign_transport().
> >
> > I don't know if that's enough, in this case the problem is that some
> > response packets are intended for a socket, where the transport has
> > changed. So whether it's null or assigned but different, it's still a
> > problem we have to handle.
> >
> > So making it impossible for the transport to be null, but allowing it
> > to be different (we can't prevent it from changing), doesn't solve the
> > problem for us, it only shifts it.
>
> Got it. I assumed this issue would be solved by `vsk->transport !=
> &t->transport` in the critical place(s).
>
> (Note that BPF doesn't care if transport has changed; BPF just expects to
> have _a_ transport.)
>
> >> If I'm not mistaken, that would require rewriting vsock_assign_transport()
> >> so that a new transport is assigned only once fully initialized, otherwise
> >> keep the old one (still unhurt and functional) and return error. Because
> >> failing connect() should not change anything under the hood, right?
> >>
> >
> > Nope, connect should be able to change the transport.
> >
> > Because a user can do an initial connect() that requires a specific
> > transport, this one fails maybe because there's no peer with that cid.
> > Then the user can redo the connect() to a different cid that requires
> > a different transport.
>
> But the initial connect() failing does not change anything under the hood
> (transport should/could stay NULL).

Nope, isn't null, it's assigned to a transport, because for example it
has to send a packet to connect to the remote CID and wait back for a
response that for example says the CID doesn't exist.

> Then a successful re-connect assigns
> the transport (NULL -> non-NULL). And it's all good because all I wanted to
> avoid (because of BPF) was non-NULL -> NULL. Anyway, that's my possibly
> shallow understanding :)
>




Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-19 Thread Michal Luczaj
On 12/19/24 15:48, Stefano Garzarella wrote:
> On Thu, 19 Dec 2024 at 15:36, Michal Luczaj  wrote:
>>
>> On 12/19/24 09:19, Stefano Garzarella wrote:
>>> ...
>>> I think the best thing though is to better understand how to handle
>>> deassign, rather than checking everywhere that it's not null, also
>>> because in some cases (like the one in virtio-vsock), it's also
>>> important that the transport is the same.
>>
>> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
>> it impossible-by-design to switch ->transport from non-NULL to NULL in
>> vsock_assign_transport().
> 
> I don't know if that's enough, in this case the problem is that some
> response packets are intended for a socket, where the transport has
> changed. So whether it's null or assigned but different, it's still a
> problem we have to handle.
> 
> So making it impossible for the transport to be null, but allowing it
> to be different (we can't prevent it from changing), doesn't solve the
> problem for us, it only shifts it.

Got it. I assumed this issue would be solved by `vsk->transport !=
&t->transport` in the critical place(s).

(Note that BPF doesn't care if transport has changed; BPF just expects to
have _a_ transport.)

>> If I'm not mistaken, that would require rewriting vsock_assign_transport()
>> so that a new transport is assigned only once fully initialized, otherwise
>> keep the old one (still unhurt and functional) and return error. Because
>> failing connect() should not change anything under the hood, right?
>>
> 
> Nope, connect should be able to change the transport.
> 
> Because a user can do an initial connect() that requires a specific
> transport, this one fails maybe because there's no peer with that cid.
> Then the user can redo the connect() to a different cid that requires
> a different transport.

But the initial connect() failing does not change anything under the hood
(transport should/could stay NULL). Then a successful re-connect assigns
the transport (NULL -> non-NULL). And it's all good because all I wanted to
avoid (because of BPF) was non-NULL -> NULL. Anyway, that's my possibly
shallow understanding :)




Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-19 Thread Stefano Garzarella
On Thu, 19 Dec 2024 at 15:36, Michal Luczaj  wrote:
>
> On 12/19/24 09:19, Stefano Garzarella wrote:
> > ...
> > I think the best thing though is to better understand how to handle
> > deassign, rather than checking everywhere that it's not null, also
> > because in some cases (like the one in virtio-vsock), it's also
> > important that the transport is the same.
>
> My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
> it impossible-by-design to switch ->transport from non-NULL to NULL in
> vsock_assign_transport().

I don't know if that's enough, in this case the problem is that some
response packets are intended for a socket, where the transport has
changed. So whether it's null or assigned but different, it's still a
problem we have to handle.

So making it impossible for the transport to be null, but allowing it
to be different (we can't prevent it from changing), doesn't solve the
problem for us, it only shifts it.

>
> If I'm not mistaken, that would require rewriting vsock_assign_transport()
> so that a new transport is assigned only once fully initialized, otherwise
> keep the old one (still unhurt and functional) and return error. Because
> failing connect() should not change anything under the hood, right?
>

Nope, connect should be able to change the transport.

Because a user can do an initial connect() that requires a specific
transport, this one fails maybe because there's no peer with that cid.
Then the user can redo the connect() to a different cid that requires
a different transport.

Stefano




Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-19 Thread Michal Luczaj
On 12/19/24 09:19, Stefano Garzarella wrote:
> ...
> I think the best thing though is to better understand how to handle 
> deassign, rather than checking everywhere that it's not null, also 
> because in some cases (like the one in virtio-vsock), it's also 
> important that the transport is the same.

My vote would be to apply your virtio_transport_recv_pkt() patch *and* make
it impossible-by-design to switch ->transport from non-NULL to NULL in
vsock_assign_transport().

If I'm not mistaken, that would require rewriting vsock_assign_transport()
so that a new transport is assigned only once fully initialized, otherwise
keep the old one (still unhurt and functional) and return error. Because
failing connect() should not change anything under the hood, right?




Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-19 Thread Stefano Garzarella

On Wed, Dec 18, 2024 at 08:37:38PM -0500, Hyunwoo Kim wrote:

On Thu, Dec 19, 2024 at 01:25:34AM +0100, Michal Luczaj wrote:

On 12/18/24 16:51, Hyunwoo Kim wrote:
> On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote:
>> On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote:
>>> On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote:
 At least for vsock_loopback.c, this change doesn’t seem to introduce any
 particular issues.
>>>
>>> But was it working for you? because the check was wrong, this one should
>>> work, but still, I didn't have time to test it properly, I'll do later.
>>>
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
>>> index 9acc13ab3f82..ddecf6e430d6 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct 
virtio_transport *t,
>>>lock_sock(sk);
>>> -   /* Check if sk has been closed before lock_sock */
>>> -   if (sock_flag(sk, SOCK_DONE)) {
>>> +   /* Check if sk has been closed or assigned to another transport 
before
>>> +* lock_sock
>>> +*/
>>> +   if (sock_flag(sk, SOCK_DONE) || vsk->transport != &t->transport) {
>>>(void)virtio_transport_reset_no_sock(t, skb);
>>>release_sock(sk);
>>>sock_put(sk);

Hi, I got curious about this race, my 2 cents:

Your patch seems to fix the reported issue, but there's also a variant (as
in: transport going null unexpectedly) involving BPF:


I think it is a different problem, to be solved in vsock_bpf.c

With something like this (untested):

diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c
index 4aa6e74ec295..8c2322dc2af7 100644
--- a/net/vmw_vsock/vsock_bpf.c
+++ b/net/vmw_vsock/vsock_bpf.c
@@ -25,10 +25,8 @@ static struct proto vsock_bpf_prot;
 static bool vsock_has_data(struct sock *sk, struct sk_psock *psock)
 {
struct vsock_sock *vsk = vsock_sk(sk);
-   s64 ret;

-   ret = vsock_connectible_has_data(vsk);
-   if (ret > 0)
+   if (vsk->transport && vsock_connectible_has_data(vsk) > 0)
return true;

return vsock_sk_has_data(sk, psock);

Note: we should check this better, because sometimes we call it without
lock_sock, also thi



Yes. It seems that calling connect() twice causes the transport to become
NULL, leading to null-ptr-deref in any flow that tries to access that
transport.


We already expect vsk->transport to be null in several parts, but in 
some we assume it is called when this is valid. So we should check 
better what we do when we deassign a transport.




And that null-ptr-deref occurs because, unlike __vsock_stream_recvmsg,
vsock_bpf_recvmsg does not check vsock->transport:


Right.

So, thanks for the report, I'll try to see if I can make a patch with 
everything before tomorrow, because then I'm gone for 2 weeks.


Otherwise we'll see as soon as I get back or if you have time in the 
meantime, any solution is welcome.


I think the best thing though is to better understand how to handle 
deassign, rather than checking everywhere that it's not null, also 
because in some cases (like the one in virtio-vsock), it's also 
important that the transport is the same.


Thanks,
Stefano


```
int
__vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
   int flags)
{
...

   lock_sock(sk);

   transport = vsk->transport;

   if (!transport || sk->sk_state != TCP_ESTABLISHED) {
   /* Recvmsg is supposed to return 0 if a peer performs an
* orderly shutdown. Differentiate between that case and when a
* peer has not connected or a local shutdown occurred with the
* SOCK_DONE flag.
*/
   if (sock_flag(sk, SOCK_DONE))
   err = 0;
   else
   err = -ENOTCONN;

   goto out;
   }
```



/*
$ gcc vsock-transport.c && sudo ./a.out

BUG: kernel NULL pointer dereference, address: 00000000000000a0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+
RIP: 0010:vsock_connectible_has_data+0x1f/0x40
Call Trace:
 vsock_bpf_recvmsg+0xca/0x5e0
 sock_recvmsg+0xb9/0xc0
 __sys_recvfrom+0xb3/0x130
 __x64_sys_recvfrom+0x20/0x30
 do_syscall_64+0x93/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
*/

#include <linux/bpf.h>
#include <linux/vm_sockets.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

static void die(const char *msg)
{
perror(msg);
exit(-1);
}

static int create_sockmap(void)
{
union bpf_attr attr = {
.map_type = BPF_MAP_TYPE_SOCKMAP,
.key_size = sizeof(int),
   

Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-18 Thread Hyunwoo Kim
On Thu, Dec 19, 2024 at 01:25:34AM +0100, Michal Luczaj wrote:
> On 12/18/24 16:51, Hyunwoo Kim wrote:
> > On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote:
> >> On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote:
> >>> On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote:
>  At least for vsock_loopback.c, this change doesn’t seem to introduce any
>  particular issues.
> >>>
> >>> But was it working for you? because the check was wrong, this one should
> >>> work, but still, I didn't have time to test it properly, I'll do later.
> >>>
> >>> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> >>> b/net/vmw_vsock/virtio_transport_common.c
> >>> index 9acc13ab3f82..ddecf6e430d6 100644
> >>> --- a/net/vmw_vsock/virtio_transport_common.c
> >>> +++ b/net/vmw_vsock/virtio_transport_common.c
> >>> @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct 
> >>> virtio_transport *t,
> >>>lock_sock(sk);
> >>> -   /* Check if sk has been closed before lock_sock */
> >>> -   if (sock_flag(sk, SOCK_DONE)) {
> >>> +   /* Check if sk has been closed or assigned to another transport 
> >>> before
> >>> +* lock_sock
> >>> +*/
> >>> +   if (sock_flag(sk, SOCK_DONE) || vsk->transport != &t->transport) {
> >>>(void)virtio_transport_reset_no_sock(t, skb);
> >>>release_sock(sk);
> >>>sock_put(sk);
> 
> Hi, I got curious about this race, my 2 cents:
> 
> Your patch seems to fix the reported issue, but there's also a variant (as
> in: transport going null unexpectedly) involving BPF:

Yes. It seems that calling connect() twice causes the transport to become 
NULL, leading to null-ptr-deref in any flow that tries to access that 
transport.

And that null-ptr-deref occurs because, unlike __vsock_stream_recvmsg, 
vsock_bpf_recvmsg does not check vsock->transport:
```
int
__vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
int flags)
{
...

lock_sock(sk);

transport = vsk->transport;

if (!transport || sk->sk_state != TCP_ESTABLISHED) {
/* Recvmsg is supposed to return 0 if a peer performs an
 * orderly shutdown. Differentiate between that case and when a
 * peer has not connected or a local shutdown occurred with the
 * SOCK_DONE flag.
 */
if (sock_flag(sk, SOCK_DONE))
err = 0;
else
err = -ENOTCONN;

goto out;
}
```
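
(For comparison, a guard on the BPF path could look something like the sketch
below; this is only an illustration mirroring the check quoted above, with the
hunk placement inside vsock_bpf_recvmsg() assumed, and it is not the fix
discussed in this thread.)

diff --git a/net/vmw_vsock/vsock_bpf.c b/net/vmw_vsock/vsock_bpf.c
--- a/net/vmw_vsock/vsock_bpf.c
+++ b/net/vmw_vsock/vsock_bpf.c
@@ vsock_bpf_recvmsg(): illustrative hunk, no real line numbers @@
 	lock_sock(sk);
+	/* Mirror __vsock_connectible_recvmsg(): a socket whose transport was
+	 * dropped by a failed re-connect is not connected any more.
+	 */
+	if (!vsock_sk(sk)->transport) {
+		release_sock(sk);
+		return -ENOTCONN;
+	}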

> 
> /*
> $ gcc vsock-transport.c && sudo ./a.out
> 
> BUG: kernel NULL pointer dereference, address: 00000000000000a0
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0
> Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+
> RIP: 0010:vsock_connectible_has_data+0x1f/0x40
> Call Trace:
>  vsock_bpf_recvmsg+0xca/0x5e0
>  sock_recvmsg+0xb9/0xc0
>  __sys_recvfrom+0xb3/0x130
>  __x64_sys_recvfrom+0x20/0x30
>  do_syscall_64+0x93/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> */
> 
> #include <linux/bpf.h>
> #include <linux/vm_sockets.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/socket.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> 
> static void die(const char *msg)
> {
>   perror(msg);
>   exit(-1);
> }
> 
> static int create_sockmap(void)
> {
>   union bpf_attr attr = {
>   .map_type = BPF_MAP_TYPE_SOCKMAP,
>   .key_size = sizeof(int),
>   .value_size = sizeof(int),
>   .max_entries = 1
>   };
>   int map;
> 
>   map = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
>   if (map < 0)
>   die("create_sockmap");
> 
>   return map;
> }
> 
> static void map_update_elem(int fd, int key, int value)
> {
>   union bpf_attr attr = {
>   .map_fd = fd,
>   .key = (uint64_t)&key,
>   .value = (uint64_t)&value,
>   .flags = BPF_ANY
>   };
> 
>   if (syscall(SYS_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)))
>   die("map_update_elem");
> }
> 
> int main(void)
> {
>   struct sockaddr_vm addr = {
>   .svm_family = AF_VSOCK,
>   .svm_port = VMADDR_PORT_ANY,
>   .svm_cid = VMADDR_CID_LOCAL
>   };
>   socklen_t alen = sizeof(addr);
>   int map, s;
> 
>   map = create_sockmap();
> 
>   s = socket(AF_VSOCK, SOCK_SEQPACKET, 0);
>   if (s < 0)
>   die("socket");
> 
>   if (!connect(s, (struct sockaddr *)&addr, alen))
>   die("connect #1");
>   perror("ok, connect #1 failed; transport set");
> 
>   map_update_elem(map, 0, s);
> 
>   addr.svm_cid = 42;
>   if (!connect(s, (struct sockaddr *)&addr, alen))
>   die("connect #2");
>   perror("ok, connect #2 failed; transport unset");
>
>   recv(s, NULL, 0, 0);
>   return 0;
> }

Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-18 Thread Michal Luczaj
On 12/18/24 16:51, Hyunwoo Kim wrote:
> On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote:
>> On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote:
>>> On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote:
 At least for vsock_loopback.c, this change doesn’t seem to introduce any
 particular issues.
>>>
>>> But was it working for you? because the check was wrong, this one should
>>> work, but still, I didn't have time to test it properly, I'll do later.
>>>
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c 
>>> b/net/vmw_vsock/virtio_transport_common.c
>>> index 9acc13ab3f82..ddecf6e430d6 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct 
>>> virtio_transport *t,
>>>lock_sock(sk);
>>> -   /* Check if sk has been closed before lock_sock */
>>> -   if (sock_flag(sk, SOCK_DONE)) {
>>> +   /* Check if sk has been closed or assigned to another transport 
>>> before
>>> +* lock_sock
>>> +*/
>>> +   if (sock_flag(sk, SOCK_DONE) || vsk->transport != &t->transport) {
>>>(void)virtio_transport_reset_no_sock(t, skb);
>>>release_sock(sk);
>>>sock_put(sk);

Hi, I got curious about this race, my 2 cents:

Your patch seems to fix the reported issue, but there's also a variant (as
in: transport going null unexpectedly) involving BPF:

/*
$ gcc vsock-transport.c && sudo ./a.out

BUG: kernel NULL pointer dereference, address: 00000000000000a0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 12faf8067 P4D 12faf8067 PUD 113670067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 15 UID: 0 PID: 1198 Comm: a.out Not tainted 6.13.0-rc2+
RIP: 0010:vsock_connectible_has_data+0x1f/0x40
Call Trace:
 vsock_bpf_recvmsg+0xca/0x5e0
 sock_recvmsg+0xb9/0xc0
 __sys_recvfrom+0xb3/0x130
 __x64_sys_recvfrom+0x20/0x30
 do_syscall_64+0x93/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
*/

#include <linux/bpf.h>
#include <linux/vm_sockets.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

static void die(const char *msg)
{
perror(msg);
exit(-1);
}

static int create_sockmap(void)
{
union bpf_attr attr = {
.map_type = BPF_MAP_TYPE_SOCKMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1
};
int map;

map = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
if (map < 0)
die("create_sockmap");

return map;
}

static void map_update_elem(int fd, int key, int value)
{
union bpf_attr attr = {
.map_fd = fd,
.key = (uint64_t)&key,
.value = (uint64_t)&value,
.flags = BPF_ANY
};

if (syscall(SYS_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)))
die("map_update_elem");
}

int main(void)
{
struct sockaddr_vm addr = {
.svm_family = AF_VSOCK,
.svm_port = VMADDR_PORT_ANY,
.svm_cid = VMADDR_CID_LOCAL
};
socklen_t alen = sizeof(addr);
int map, s;

map = create_sockmap();

s = socket(AF_VSOCK, SOCK_SEQPACKET, 0);
if (s < 0)
die("socket");

if (!connect(s, (struct sockaddr *)&addr, alen))
die("connect #1");
perror("ok, connect #1 failed; transport set");

map_update_elem(map, 0, s);

addr.svm_cid = 42;
if (!connect(s, (struct sockaddr *)&addr, alen))
die("connect #2");
perror("ok, connect #2 failed; transport unset");

recv(s, NULL, 0, 0);
return 0;
}




Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-18 Thread Hyunwoo Kim
On Wed, Dec 18, 2024 at 04:31:03PM +0100, Stefano Garzarella wrote:
> On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote:
> > On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote:
> > > On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote:
> > > > On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote:
> > > > > When calling connect to change the CID of a vsock, the loopback
> > > > > worker for the VIRTIO_VSOCK_OP_RST command is invoked.
> > > > > During this process, vsock_stream_has_data() calls
> > > > > vsk->transport->stream_has_data().
> > > > > However, a null-ptr-deref occurs because vsk->transport was set
> > > > > to NULL in vsock_deassign_transport().
> > > > > 
> > > > > cpu0  
> > > > > cpu1
> > > > > 
> > > > >   
> > > > > socket(A)
> > > > > 
> > > > >   bind(A, 
> > > > > VMADDR_CID_LOCAL)
> > > > > 
> > > > > vsock_bind()
> > > > > 
> > > > >   
> > > > > listen(A)
> > > > > 
> > > > > vsock_listen()
> > > > >  socket(B)
> > > > > 
> > > > >  connect(B, VMADDR_CID_LOCAL)
> > > > > 
> > > > >  connect(B, VMADDR_CID_HYPERVISOR)
> > > > >vsock_connect(B)
> > > > >  lock_sock(sk);
> 
> It shouldn't go on here anyway, because there's this check in
> vsock_connect():
> 
>   switch (sock->state) {
>   case SS_CONNECTED:
>   err = -EISCONN;
>   goto out;

By using a kill signal, sock->state is set to SS_UNCONNECTED and sk->sk_state
to TCP_CLOSING before the second connect() is called, so the worker executes
without the SOCK_DONE flag being set.

> 
> 
> Indeed if I try, I have this behaviour:
> 
> shell1# python3
> import socket
> s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
> s.bind((1,1234))
> s.listen()
> 
> shell2# python3
> import socket
> s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
> s.connect((1, 1234))
> s.connect((2, 1234))
> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
> OSError: [Errno 106] Transport endpoint is already connected
> 
> 
> Where 106 is exactly EISCONN.
> So, do you have a better reproducer for that?

The full scenario is as follows:
```
cpu0                                        cpu1

                                            socket(A)
                                            bind(A, {cid: VMADDR_CID_LOCAL, port: 1024})
                                              vsock_bind()
                                            listen(A)
                                              vsock_listen()
socket(B)

connect(B, {cid: VMADDR_CID_LOCAL, port: 1024})
  vsock_connect()
    lock_sock(sk);
    virtio_transport_connect()
      virtio_transport_send_pkt_info()
        vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_REQUEST)
          queue_work(vsock_loopback_work)
    sk->sk_state = TCP_SYN_SENT;
    release_sock(sk);
                                            vsock_loopback_work()
                                              virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_REQUEST)
                                                sk = vsock_find_bound_socket(&dst);
                                                virtio_transport_recv_listen(sk, skb)
                                                  child = vsock_create_connected(sk);
                                                  vsock_assign_transport()
                                                    vvs = kzalloc(sizeof(*vvs), GFP_KERNEL);
                                                  vsock_insert_connected(vchild);
                                                    list_add(&vsk->connected_table, list);
                                                  virtio_transport_send_response(vchild, skb);
                                                    virtio_transport_send_pkt_info()
                                                      vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RESPONSE)
                                                        queue_work(vsock_loopback_work)
                                            vsock_loopback_work()
                                              virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RESPONSE)
```

Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-18 Thread Stefano Garzarella

On Wed, Dec 18, 2024 at 03:40:40PM +0100, Stefano Garzarella wrote:

On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote:

On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote:

On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote:

When calling connect to change the CID of a vsock, the loopback
worker for the VIRTIO_VSOCK_OP_RST command is invoked.
During this process, vsock_stream_has_data() calls
vsk->transport->stream_has_data().
However, a null-ptr-deref occurs because vsk->transport was set
to NULL in vsock_deassign_transport().

cpu0  
cpu1

  socket(A)

  bind(A, 
VMADDR_CID_LOCAL)
vsock_bind()

  listen(A)
vsock_listen()
 socket(B)

 connect(B, VMADDR_CID_LOCAL)

 connect(B, VMADDR_CID_HYPERVISOR)
   vsock_connect(B)
 lock_sock(sk);


It shouldn't go on here anyway, because there's this check in 
vsock_connect():


switch (sock->state) {
case SS_CONNECTED:
err = -EISCONN;
goto out;


Indeed if I try, I have this behaviour:

shell1# python3
import socket
s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
s.bind((1,1234))
s.listen()

shell2# python3
import socket
s = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
s.connect((1, 1234))
s.connect((2, 1234))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 106] Transport endpoint is already connected


Where 106 is exactly EISCONN.
So, do you have a better reproducer for that?

Would be nice to add a test in tools/testing/vsock/vsock_test.c

Thanks,
Stefano


 vsock_assign_transport()
   virtio_transport_release()
 virtio_transport_close()
   virtio_transport_shutdown()
 virtio_transport_send_pkt_info()
   vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
 queue_work(vsock_loopback_work)
   vsock_deassign_transport()
 vsk->transport = NULL;
  
vsock_loopback_work()

virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
  
virtio_transport_recv_connected()

virtio_transport_reset()
  
virtio_transport_send_pkt_info()

vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST)
  
queue_work(vsock_loopback_work)

  
vsock_loopback_work()

virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST)
   
virtio_transport_recv_disconnecting()
 
virtio_transport_do_close()
   
vsock_stream_has_data()
 
vsk->transport->stream_has_data(vsk);// null-ptr-deref

To resolve this issue, add a check for vsk->transport, similar to
functions like vsock_send_shutdown().

Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct vsock_sock")
Signed-off-by: Hyunwoo Kim 
Signed-off-by: Wongi Lee 
---
net/vmw_vsock/af_vsock.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5cf8109f672a..a0c008626798 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected);

s64 vsock_stream_has_data(struct vsock_sock *vsk)
{
+   if (!vsk->transport)
+   return 0;
+


I understand that this alleviates the problem, but IMO it is not the right
solution. We should understand why we're still processing the packet in the
context of this socket if it's no longer assigned to the right transport.


Got it. I agree with you.



Maybe we can try to improve virtio_transport_recv_pkt() and check if the
vsk->transport is what we expect, I mean something like this (untested):

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 9acc13ab3f82..18b91149a62e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,

Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-18 Thread Stefano Garzarella

On Wed, Dec 18, 2024 at 09:19:08AM -0500, Hyunwoo Kim wrote:

On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote:

On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote:
> When calling connect to change the CID of a vsock, the loopback
> worker for the VIRTIO_VSOCK_OP_RST command is invoked.
> During this process, vsock_stream_has_data() calls
> vsk->transport->stream_has_data().
> However, a null-ptr-deref occurs because vsk->transport was set
> to NULL in vsock_deassign_transport().
>
> cpu0  
cpu1
>
>   socket(A)
>
>   bind(A, 
VMADDR_CID_LOCAL)
> vsock_bind()
>
>   listen(A)
> vsock_listen()
>  socket(B)
>
>  connect(B, VMADDR_CID_LOCAL)
>
>  connect(B, VMADDR_CID_HYPERVISOR)
>vsock_connect(B)
>  lock_sock(sk);
>  vsock_assign_transport()
>virtio_transport_release()
>  virtio_transport_close()
>virtio_transport_shutdown()
>  virtio_transport_send_pkt_info()
>vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
>  queue_work(vsock_loopback_work)
>vsock_deassign_transport()
>  vsk->transport = NULL;
>   
vsock_loopback_work()
> 
virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
>   
virtio_transport_recv_connected()
> 
virtio_transport_reset()
>   
virtio_transport_send_pkt_info()
> 
vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST)
>   
queue_work(vsock_loopback_work)
>
>   
vsock_loopback_work()
> 
virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST)
>   
virtio_transport_recv_disconnecting()
> 
virtio_transport_do_close()
>   
vsock_stream_has_data()
> 
vsk->transport->stream_has_data(vsk);// null-ptr-deref
>
> To resolve this issue, add a check for vsk->transport, similar to
> functions like vsock_send_shutdown().
>
> Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct vsock_sock")
> Signed-off-by: Hyunwoo Kim 
> Signed-off-by: Wongi Lee 
> ---
> net/vmw_vsock/af_vsock.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 5cf8109f672a..a0c008626798 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected);
>
> s64 vsock_stream_has_data(struct vsock_sock *vsk)
> {
> +  if (!vsk->transport)
> +  return 0;
> +

I understand that this alleviates the problem, but IMO it is not the right
solution. We should understand why we're still processing the packet in the
context of this socket if it's no longer assigned to the right transport.


Got it. I agree with you.



Maybe we can try to improve virtio_transport_recv_pkt() and check if the
vsk->transport is what we expect, I mean something like this (untested):

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 9acc13ab3f82..18b91149a62e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport 
*t,

lock_sock(sk);

-   /* Check if sk has been closed before lock_sock */
-   if (sock_flag(sk, SOCK_DONE)) {
+   /* Check if sk has been closed or assigned to another transport before
+* lock_sock
+*/
+   if (sock_flag(sk, SOCK_DONE) || vsk->transport != t) {
(void)virtio_transport_reset_no_sock(t, skb);
release_sock(sk);
sock_put(sk);

BTW I'm not sure it is the best solution, we have to check that we do not
introduce strange cases, but IMHO we have to solve the problem earlier in
virtio_transport_recv_pkt().


At least for vsock_loopback.c, this change doesn’t seem to introduce any
particular issues.


But was it working for you? because the check was wrong, this one should
work, but still, I didn't have time to test it properly, I'll do later.

Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-18 Thread Hyunwoo Kim
On Wed, Dec 18, 2024 at 02:40:49PM +0100, Stefano Garzarella wrote:
> On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote:
> > When calling connect to change the CID of a vsock, the loopback
> > worker for the VIRTIO_VSOCK_OP_RST command is invoked.
> > During this process, vsock_stream_has_data() calls
> > vsk->transport->stream_has_data().
> > However, a null-ptr-deref occurs because vsk->transport was set
> > to NULL in vsock_deassign_transport().
> > 
> > cpu0
> >   cpu1
> > 
> >   socket(A)
> > 
> >   bind(A, 
> > VMADDR_CID_LOCAL)
> > vsock_bind()
> > 
> >   listen(A)
> > 
> > vsock_listen()
> >  socket(B)
> > 
> >  connect(B, VMADDR_CID_LOCAL)
> > 
> >  connect(B, VMADDR_CID_HYPERVISOR)
> >vsock_connect(B)
> >  lock_sock(sk);
> >  vsock_assign_transport()
> >virtio_transport_release()
> >  virtio_transport_close()
> >virtio_transport_shutdown()
> >  virtio_transport_send_pkt_info()
> >vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
> >  queue_work(vsock_loopback_work)
> >vsock_deassign_transport()
> >  vsk->transport = NULL;
> >   
> > vsock_loopback_work()
> > 
> > virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
> >   
> > virtio_transport_recv_connected()
> > 
> > virtio_transport_reset()
> >   
> > virtio_transport_send_pkt_info()
> > 
> > vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST)
> >   
> > queue_work(vsock_loopback_work)
> > 
> >   
> > vsock_loopback_work()
> > 
> > virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST)
> >
> > virtio_transport_recv_disconnecting()
> >  
> > virtio_transport_do_close()
> >
> > vsock_stream_has_data()
> >  
> > vsk->transport->stream_has_data(vsk);// null-ptr-deref
> > 
> > To resolve this issue, add a check for vsk->transport, similar to
> > functions like vsock_send_shutdown().
> > 
> > Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct 
> > vsock_sock")
> > Signed-off-by: Hyunwoo Kim 
> > Signed-off-by: Wongi Lee 
> > ---
> > net/vmw_vsock/af_vsock.c | 3 +++
> > 1 file changed, 3 insertions(+)
> > 
> > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> > index 5cf8109f672a..a0c008626798 100644
> > --- a/net/vmw_vsock/af_vsock.c
> > +++ b/net/vmw_vsock/af_vsock.c
> > @@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected);
> > 
> > s64 vsock_stream_has_data(struct vsock_sock *vsk)
> > {
> > +   if (!vsk->transport)
> > +   return 0;
> > +
> 
> I understand that this alleviates the problem, but IMO it is not the right
> solution. We should understand why we're still processing the packet in the
> context of this socket if it's no longer assigned to the right transport.

Got it. I agree with you.

> 
> Maybe we can try to improve virtio_transport_recv_pkt() and check if the
> vsk->transport is what we expect, I mean something like this (untested):
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 9acc13ab3f82..18b91149a62e 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
>  
>  	lock_sock(sk);
>  
> -	/* Check if sk has been closed before lock_sock */
> -	if (sock_flag(sk, SOCK_DONE)) {
> +	/* Check if sk has been closed or assigned to another transport before
> +	 * lock_sock
> +	 */
> +	if (sock_flag(sk, SOCK_DONE) || vsk->transport != t) {
>  		(void)virtio_transport_reset_no_sock(t, skb);
>  		release_sock(sk);
>  		sock_put(sk);
> 
> BTW I'm not sure it is the best solution, we have to check that we do not
> introduce strange cases, but IMHO we have to solve the problem earlier in
> virtio_transport_recv_pkt().

Re: [PATCH] vsock/virtio: Fix null-ptr-deref in vsock_stream_has_data

2024-12-18 Thread Stefano Garzarella

On Wed, Dec 18, 2024 at 07:25:07AM -0500, Hyunwoo Kim wrote:

When calling connect to change the CID of a vsock, the loopback
worker for the VIRTIO_VSOCK_OP_RST command is invoked.
During this process, vsock_stream_has_data() calls
vsk->transport->stream_has_data().
However, a null-ptr-deref occurs because vsk->transport was set
to NULL in vsock_deassign_transport().

 cpu0                               cpu1

                                    socket(A)

                                    bind(A, VMADDR_CID_LOCAL)
                                      vsock_bind()

                                    listen(A)
                                      vsock_listen()
 socket(B)

 connect(B, VMADDR_CID_LOCAL)

 connect(B, VMADDR_CID_HYPERVISOR)
   vsock_connect(B)
     lock_sock(sk);
     vsock_assign_transport()
       virtio_transport_release()
         virtio_transport_close()
           virtio_transport_shutdown()
             virtio_transport_send_pkt_info()
               vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
                 queue_work(vsock_loopback_work)
       vsock_deassign_transport()
         vsk->transport = NULL;
                                    vsock_loopback_work()
                                      virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_SHUTDOWN)
                                        virtio_transport_recv_connected()
                                          virtio_transport_reset()
                                            virtio_transport_send_pkt_info()
                                              vsock_loopback_send_pkt(VIRTIO_VSOCK_OP_RST)
                                                queue_work(vsock_loopback_work)

                                    vsock_loopback_work()
                                      virtio_transport_recv_pkt(VIRTIO_VSOCK_OP_RST)
                                        virtio_transport_recv_disconnecting()
                                          virtio_transport_do_close()
                                            vsock_stream_has_data()
                                              vsk->transport->stream_has_data(vsk); // null-ptr-deref

To resolve this issue, add a check for vsk->transport, similar to
functions like vsock_send_shutdown().

Fixes: fe502c4a38d9 ("vsock: add 'transport' member in the struct vsock_sock")
Signed-off-by: Hyunwoo Kim 
Signed-off-by: Wongi Lee 
---
net/vmw_vsock/af_vsock.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 5cf8109f672a..a0c008626798 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -870,6 +870,9 @@ EXPORT_SYMBOL_GPL(vsock_create_connected);

s64 vsock_stream_has_data(struct vsock_sock *vsk)
{
+   if (!vsk->transport)
+   return 0;
+


I understand that this alleviates the problem, but IMO it is not the 
right solution. We should understand why we're still processing the 
packet in the context of this socket if it's no longer assigned to the 
right transport.


Maybe we can try to improve virtio_transport_recv_pkt() and check if the 
vsk->transport is what we expect, I mean something like this (untested):


diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 9acc13ab3f82..18b91149a62e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1628,8 +1628,10 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
 
 	lock_sock(sk);
 
-	/* Check if sk has been closed before lock_sock */
-	if (sock_flag(sk, SOCK_DONE)) {
+	/* Check if sk has been closed or assigned to another transport before
+	 * lock_sock
+	 */
+	if (sock_flag(sk, SOCK_DONE) || vsk->transport != t) {
 		(void)virtio_transport_reset_no_sock(t, skb);
 		release_sock(sk);
 		sock_put(sk);

BTW I'm not sure it is the best solution, we have to check that we do 
not introduce strange cases, but IMHO we have to solve the problem 
earlier in virtio_transport_recv_pkt().


Thanks,
Stefano


return vsk->transport->stream_has_data(vsk);
}
EXPORT_SYMBOL_GPL(vsock_stream_has_data);
--
2.34.1






Re: [PATCH net v2 0/3] virtio/vsock: Fix memory leaks

2024-12-06 Thread Michal Luczaj
On 11/19/24 11:31, Stefano Garzarella wrote:
> Hi Michal,
> 
> On Thu, Nov 07, 2024 at 09:46:11PM +0100, Michal Luczaj wrote:
>> Short series fixing some memory leaks that I've stumbled upon while 
>> toying
>> with the selftests.
> 
> Are these tests already upstream?
> I would like to add them to my suite, can you tell me how to run them?

CC got accidentally stripped, sorry.
Long story short: 
https://lore.kernel.org/netdev/20241206-test-vsock-leaks-v1-0-c31e8c875...@rbox.co/

Thanks,
Michal




Re: [PATCH net v2 0/3] virtio/vsock: Fix memory leaks

2024-11-19 Thread Stefano Garzarella

Hi Michal,

On Thu, Nov 07, 2024 at 09:46:11PM +0100, Michal Luczaj wrote:
Short series fixing some memory leaks that I've stumbled upon while 
toying

with the selftests.


Are these tests already upstream?
I would like to add them to my suite, can you tell me how to run them?

Thanks,
Stefano



Signed-off-by: Michal Luczaj 
---
Changes in v2:
- Remove the refactoring patch from the series [Stefano]
- PATCH 2: Drop "virtio" from the commit title [Stefano]
- Collect Reviewed-by [Stefano]
- Link to v1: 
https://lore.kernel.org/r/20241106-vsock-mem-leaks-v1-0-8f4ffc309...@rbox.co

---
Michal Luczaj (3):
 virtio/vsock: Fix accept_queue memory leak
 vsock: Fix sk_error_queue memory leak
 virtio/vsock: Improve MSG_ZEROCOPY error handling

net/vmw_vsock/af_vsock.c| 3 +++
net/vmw_vsock/virtio_transport_common.c | 9 +
2 files changed, 12 insertions(+)
---
base-commit: 71712cf519faeed529549a79559c06c7fc250a15
change-id: 20241106-vsock-mem-leaks-9b63e912560a

Best regards,
--
Michal Luczaj 






Re: [PATCH net-next v4 00/13] virtio-net: support AF_XDP zero copy (tx)

2024-11-15 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski :

On Tue, 12 Nov 2024 09:29:15 +0800 you wrote:
> v4:
> 1. rebase net-next
> 2. update the kdoc for the new APIs
> 
> v3:
> 1. use sg_dma_address/length api to set the premapped sg
> 2. remove 'premapped' parameter from the new APIs
> 3. tweak the comment of commit #2,#3
> 
> [...]

Here is the summary with links:
  - [net-next,v4,01/13] virtio_ring: introduce vring_need_unmap_buffer
https://git.kernel.org/netdev/net-next/c/9f19c084057a
  - [net-next,v4,02/13] virtio_ring: split: record extras for indirect buffers
https://git.kernel.org/netdev/net-next/c/bc2b4c3401c6
  - [net-next,v4,03/13] virtio_ring: packed: record extras for indirect buffers
https://git.kernel.org/netdev/net-next/c/aaa789843a93
  - [net-next,v4,04/13] virtio_ring: perform premapped operations based on 
per-buffer
https://git.kernel.org/netdev/net-next/c/c7e1b422afac
  - [net-next,v4,05/13] virtio_ring: introduce add api for premapped
https://git.kernel.org/netdev/net-next/c/3ef66af31fea
  - [net-next,v4,06/13] virtio-net: rq submits premapped per-buffer
https://git.kernel.org/netdev/net-next/c/31f3cd4e5756
  - [net-next,v4,07/13] virtio_ring: remove API virtqueue_set_dma_premapped
https://git.kernel.org/netdev/net-next/c/880ebcbe0663
  - [net-next,v4,08/13] virtio_net: refactor the xmit type
https://git.kernel.org/netdev/net-next/c/7db956707f5f
  - [net-next,v4,09/13] virtio_net: xsk: bind/unbind xsk for tx
https://git.kernel.org/netdev/net-next/c/21a4e3ce6dc7
  - [net-next,v4,10/13] virtio_net: xsk: prevent disable tx napi
https://git.kernel.org/netdev/net-next/c/1df5116a41a8
  - [net-next,v4,11/13] virtio_net: xsk: tx: support xmit xsk buffer
https://git.kernel.org/netdev/net-next/c/89f86675cb03
  - [net-next,v4,12/13] virtio_net: update tx timeout record
https://git.kernel.org/netdev/net-next/c/e2c5c57f1af8
  - [net-next,v4,13/13] virtio_net: xdp_features add NETDEV_XDP_ACT_XSK_ZEROCOPY
https://git.kernel.org/netdev/net-next/c/37e0ca657a3d

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling

2024-11-13 Thread Chia-Yu Chang (Nokia)
>From: Jason Wang  
>Sent: Wednesday, November 13, 2024 2:43 AM
>To: Ilpo Järvinen 
>Cc: Chia-Yu Chang (Nokia) ; Michael S. 
>Tsirkin ; virtualizat...@lists.linux.dev; Eric Dumazet 
>; netdev@vger.kernel.org; dsah...@gmail.com; 
>da...@davemloft.net; dsah...@kernel.org; pab...@redhat.com; 
>joel.grana...@kernel.org; k...@kernel.org; andrew+net...@lunn.ch; 
>ho...@kernel.org; pa...@netfilter.org; kad...@netfilter.org; 
>netfilter-de...@vger.kernel.org; coret...@netfilter.org; ncardw...@google.com; 
>Koen De Schepper (Nokia) ; 
>g.wh...@cablelabs.com; ingemar.s.johans...@ericsson.com; 
>mirja.kuehlew...@ericsson.com; chesh...@apple.com; rs.i...@gmx.at; 
>jason_living...@comcast.com; vidhi_g...@apple.com
>Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & 
>better AccECN handling
>
>On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen  wrote:
>>
>> Adding a few virtio people. Please see the virtio spec/flag question 
>> below.
>>
>> On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:
>>
>> > >From: Ilpo Järvinen 
>> > >Sent: Thursday, November 7, 2024 8:28 PM
>> > >To: Eric Dumazet 
>> > >Cc: Chia-Yu Chang (Nokia) ; 
>> > >netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; 
>> > >dsah...@kernel.org; pab...@redhat.com; joel.grana...@kernel.org; 
>> > >k...@kernel.org; andrew+net...@lunn.ch; ho...@kernel.org; 
>> > >pa...@netfilter.org; kad...@netfilter.org; 
>> > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; 
>> > >ncardw...@google.com; Koen De Schepper (Nokia) 
>> > >; g.wh...@cablelabs.com; 
>> > >ingemar.s.johans...@ericsson.com; mirja.kuehlew...@ericsson.com; 
>> > >chesh...@apple.com; rs.i...@gmx.at; jason_living...@comcast.com; 
>> > >vidhi_g...@apple.com
>> > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field 
>> > >corruption & better AccECN handling
>> > >
>> > >
>> > >On Thu, 7 Nov 2024, Eric Dumazet wrote:
>> > >
>> > >>On Tue, Nov 5, 2024 at 11:07 AM  
>> > >>wrote:
>> > >>>
>> > >>> From: Ilpo Järvinen 
>> > >>>
>> > >>> There are important differences in how the CWR field behaves in
>> > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE 
>> > >>> counter and its changes are important so adjust the flags 
>> > >>> changed mask accordingly.
>> > >>>
>> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid 
>> > >>> corrupting CWR flag somewhere.
>> > >>>
>> > >>> Signed-off-by: Ilpo Järvinen 
>> > >>> Signed-off-by: Chia-Yu Chang 
>> > >>> ---
>> > >>>  net/ipv4/tcp_offload.c | 4 ++--
>> > >>>  1 file changed, 2 insertions(+), 2 deletions(-)
>> > >>>
>> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c 
>> > >>> index
>> > >>> 0b05f30e9e5f..f59762d88c38 100644
>> > >>> --- a/net/ipv4/tcp_offload.c
>> > >>> +++ b/net/ipv4/tcp_offload.c
>> > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head 
>> > >>> *head, struct sk_buff *skb,
>> > >>> th2 = tcp_hdr(p);
>> > >>> flush = (__force int)(flags & TCP_FLAG_CWR);
>> > >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
>> > >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
>> > >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
>> > >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
>> > >>> for (i = sizeof(*th); i < thlen; i += 4)
>> > >>> flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 
>> > >>> +405,7

Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling

2024-11-12 Thread Jason Wang
On Wed, Nov 13, 2024 at 5:19 AM Ilpo Järvinen  wrote:
>
> Adding a few virtio people. Please see the virtio spec/flag question
> below.
>
> On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:
>
> > >From: Ilpo Järvinen 
> > >Sent: Thursday, November 7, 2024 8:28 PM
> > >To: Eric Dumazet 
> > >Cc: Chia-Yu Chang (Nokia) ; 
> > >netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; 
> > >dsah...@kernel.org; pab...@redhat.com; joel.grana...@kernel.org; 
> > >k...@kernel.org; andrew+net...@lunn.ch; ho...@kernel.org; 
> > >pa...@netfilter.org; kad...@netfilter.org; 
> > >netfilter-de...@vger.kernel.org; coret...@netfilter.org; 
> > >ncardw...@google.com; Koen De Schepper (Nokia) 
> > >; g.wh...@cablelabs.com; 
> > >ingemar.s.johans...@ericsson.com; mirja.kuehlew...@ericsson.com; 
> > >chesh...@apple.com; rs.i...@gmx.at; jason_living...@comcast.com; 
> > >vidhi_g...@apple.com
> > >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & 
> > >better AccECN handling
> > >
> > >
> > >On Thu, 7 Nov 2024, Eric Dumazet wrote:
> > >
> > >>On Tue, Nov 5, 2024 at 11:07 AM  wrote:
> > >>>
> > >>> From: Ilpo Järvinen 
> > >>>
> > >>> There are important differences in how the CWR field behaves in
> > >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter
> > >>> and its changes are important so adjust the flags changed mask
> > >>> accordingly.
> > >>>
> > >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid
> > >>> corrupting CWR flag somewhere.
> > >>>
> > >>> Signed-off-by: Ilpo Järvinen 
> > >>> Signed-off-by: Chia-Yu Chang 
> > >>> ---
> > >>>  net/ipv4/tcp_offload.c | 4 ++--
> > >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index
> > >>> 0b05f30e9e5f..f59762d88c38 100644
> > >>> --- a/net/ipv4/tcp_offload.c
> > >>> +++ b/net/ipv4/tcp_offload.c
> > >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head 
> > >>> *head, struct sk_buff *skb,
> > >>> th2 = tcp_hdr(p);
> > >>> flush = (__force int)(flags & TCP_FLAG_CWR);
> > >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> > >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> > >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> > >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> > >>> for (i = sizeof(*th); i < thlen; i += 4)
> > >>> flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 +405,7
> > >>> @@ void tcp_gro_complete(struct sk_buff *skb)
> > >>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> > >>>
> > >>> if (th->cwr)
> > >>> -   shinfo->gso_type |= SKB_GSO_TCP_ECN;
> > >>> +   shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> > >>>  }
> > >>>  EXPORT_SYMBOL(tcp_gro_complete);
> > >>>
> > >>
> > >> I do not really understand this patch. How a GRO engine can know which
> > >> ECN variant the peers are using ?
> > >
> > >Hi Eric,
> > >
> > >Thanks for taking a look.
> > >
> > >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can 
> > >result in header change that corrupts ACE field. Thus, GRO has to assume 
> > >the RFC3168 CWR offloading trick cannot be used anymore (unless it tracks 
> > >the connection and knows the bits are used for RFC3168 which is something 
> > >nobody is going to do).
> > >
> > >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or 
> > >NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE 
> > >field value.
> >
> > Hi Eric and Ilpo,
> >
> > From my understanding of another email thread (patch 08/13), it seems the 
> > conclusions that SKB_GSO_TCP_ACC
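
For readers following the above, a minimal sketch of the ACE field being
protected (illustrative only, not code from this series; it assumes the
AccECN patches' extended tcphdr, which exposes an "ae" bit next to "cwr"
and "ece"):

	/* With Accurate ECN, AE/CWR/ECE are no longer independent flags
	 * but a 3-bit counter of CE-marked packets (the ACE field), so an
	 * offload that clears or replicates CWR on its own (the RFC 3168
	 * trick) corrupts the counter value.
	 */
	static inline u8 ace_field_sketch(const struct tcphdr *th)
	{
		return (th->ae << 2) | (th->cwr << 1) | th->ece;
	}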

RE: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & better AccECN handling

2024-11-12 Thread Ilpo Järvinen
Adding a few virtio people. Please see the virtio spec/flag question 
below.

On Tue, 12 Nov 2024, Chia-Yu Chang (Nokia) wrote:

> >From: Ilpo Järvinen  
> >Sent: Thursday, November 7, 2024 8:28 PM
> >To: Eric Dumazet 
> >Cc: Chia-Yu Chang (Nokia) ; 
> >netdev@vger.kernel.org; dsah...@gmail.com; da...@davemloft.net; 
> >dsah...@kernel.org; pab...@redhat.com; joel.grana...@kernel.org; 
> >k...@kernel.org; andrew+net...@lunn.ch; ho...@kernel.org; 
> >pa...@netfilter.org; kad...@netfilter.org; netfilter-de...@vger.kernel.org; 
> >coret...@netfilter.org; ncardw...@google.com; Koen De Schepper (Nokia) 
> >; g.wh...@cablelabs.com; 
> >ingemar.s.johans...@ericsson.com; mirja.kuehlew...@ericsson.com; 
> >chesh...@apple.com; rs.i...@gmx.at; jason_living...@comcast.com; 
> >vidhi_g...@apple.com
> >Subject: Re: [PATCH v5 net-next 09/13] gro: prevent ACE field corruption & 
> >better AccECN handling
> >
> >
> >On Thu, 7 Nov 2024, Eric Dumazet wrote:
> >
> >>On Tue, Nov 5, 2024 at 11:07 AM  wrote:
> >>>
> >>> From: Ilpo Järvinen 
> >>>
> >>> There are important differences in how the CWR field behaves in 
> >>> RFC3168 and AccECN. With AccECN, CWR flag is part of the ACE counter 
> >>> and its changes are important so adjust the flags changed mask 
> >>> accordingly.
> >>>
> >>> Also, if CWR is there, set the Accurate ECN GSO flag to avoid 
> >>> corrupting CWR flag somewhere.
> >>>
> >>> Signed-off-by: Ilpo Järvinen 
> >>> Signed-off-by: Chia-Yu Chang 
> >>> ---
> >>>  net/ipv4/tcp_offload.c | 4 ++--
> >>>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index 
> >>> 0b05f30e9e5f..f59762d88c38 100644
> >>> --- a/net/ipv4/tcp_offload.c
> >>> +++ b/net/ipv4/tcp_offload.c
> >>> @@ -329,7 +329,7 @@ struct sk_buff *tcp_gro_receive(struct list_head 
> >>> *head, struct sk_buff *skb,
> >>> th2 = tcp_hdr(p);
> >>> flush = (__force int)(flags & TCP_FLAG_CWR);
> >>> flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
> >>> - ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> >>> + ~(TCP_FLAG_FIN | TCP_FLAG_PSH));
> >>> flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> >>> for (i = sizeof(*th); i < thlen; i += 4)
> >>> flush |= *(u32 *)((u8 *)th + i) ^ @@ -405,7 +405,7 
> >>> @@ void tcp_gro_complete(struct sk_buff *skb)
> >>> shinfo->gso_segs = NAPI_GRO_CB(skb)->count;
> >>>
> >>> if (th->cwr)
> >>> -   shinfo->gso_type |= SKB_GSO_TCP_ECN;
> >>> +   shinfo->gso_type |= SKB_GSO_TCP_ACCECN;
> >>>  }
> >>>  EXPORT_SYMBOL(tcp_gro_complete);
> >>>
> >>
> >> I do not really understand this patch. How a GRO engine can know which 
> >> ECN variant the peers are using ?
> >
> >Hi Eric,
> >
> >Thanks for taking a look.
> >
> >GRO doesn't know. Setting SKB_GSO_TCP_ECN in case of not knowing can result 
> >in header change that corrupts ACE field. Thus, GRO has to assume the 
> >RFC3168 CWR offloading trick cannot be used anymore (unless it tracks the 
> >connection and knows the bits are used for RFC3168 which is something nobody 
> >is going to do).
> >
> >The main point of SKB_GSO_TCP_ACCECN is to prevent SKB_GSO_TCP_ECN or 
> >NETIF_F_TSO_ECN offloading to be used for the skb as it would corrupt ACE 
> >field value.
> 
> Hi Eric and Ilpo,
> 
> From my understanding of another email thread (patch 08/13), it seems the 
> conclusions that SKB_GSO_TCP_ACCECN will still be needed.
> 
> >
> >SKB_GSO_TCP_ACCECN doesn't allow CWR bits change within a super-skb but the 
> >same CWR flag should be repeated for all segments. In a sense, it's simpler 
> >than RFC3168 offloading.
> >
> >> SKB_GSO_TCP_ECN is also used from other points, what is your plan ?
> >>
> >> git grep -n SKB_GSO_TCP_ECN
> >> drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:3888:
> >> skb_shinfo(skb)->gso_type |= 

Re: [PATCH net v2 0/3] virtio/vsock: Fix memory leaks

2024-11-12 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net.git (main)
by Paolo Abeni :

On Thu, 07 Nov 2024 21:46:11 +0100 you wrote:
> Short series fixing some memory leaks that I've stumbled upon while toying
> with the selftests.
> 
> Signed-off-by: Michal Luczaj 
> ---
> Changes in v2:
> - Remove the refactoring patch from the series [Stefano]
> - PATCH 2: Drop "virtio" from the commit title [Stefano]
> - Collect Reviewed-by [Stefano]
> - Link to v1: 
> https://lore.kernel.org/r/20241106-vsock-mem-leaks-v1-0-8f4ffc309...@rbox.co
> 
> [...]

Here is the summary with links:
  - [net,v2,1/3] virtio/vsock: Fix accept_queue memory leak
https://git.kernel.org/netdev/net/c/d7b0ff5a8667
  - [net,v2,2/3] vsock: Fix sk_error_queue memory leak
https://git.kernel.org/netdev/net/c/fbf7085b3ad1
  - [net,v2,3/3] virtio/vsock: Improve MSG_ZEROCOPY error handling
https://git.kernel.org/netdev/net/c/60cf6206a1f5

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH net-next v3 05/13] virtio_ring: introduce add api for premapped

2024-11-11 Thread Jakub Kicinski
On Thu,  7 Nov 2024 16:54:56 +0800 Xuan Zhuo wrote:
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).

> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).

Looks like we need a rebase, doesn't apply any more.

When you rebase please correct the kdoc, looks like the official format
wants a ":" after the "Returns"



Re: [PATCH net v2 2/3] vsock: Fix sk_error_queue memory leak

2024-11-11 Thread Arseniy Krasnov



On 07.11.2024 23:46, Michal Luczaj wrote:
> Kernel queues MSG_ZEROCOPY completion notifications on the error queue.
> Where they remain, until explicitly recv()ed. To prevent memory leaks,
> clean up the queue when the socket is destroyed.
> 
> unreferenced object 0x8881028beb00 (size 224):
>   comm "vsock_test", pid 1218, jiffies 4294694897
>   hex dump (first 32 bytes):
> 90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff  ..!...!.
> 00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff  ..!.
>   backtrace (crc 6c7031ca):
> [] kmem_cache_alloc_node_noprof+0x2f7/0x370
> [] __alloc_skb+0x132/0x180
> [] sock_omalloc+0x4b/0x80
> [] msg_zerocopy_realloc+0x9e/0x240
> [] virtio_transport_send_pkt_info+0x412/0x4c0
> [] virtio_transport_stream_enqueue+0x43/0x50
> [] vsock_connectible_sendmsg+0x373/0x450
> [] sys_sendmsg+0x365/0x3a0
> [] ___sys_sendmsg+0x84/0xd0
> [] __sys_sendmsg+0x47/0x80
> [] do_syscall_64+0x93/0x180
> [] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Signed-off-by: Michal Luczaj 
> ---
>  net/vmw_vsock/af_vsock.c | 3 +++
>  1 file changed, 3 insertions(+)

Acked-by: Arseniy Krasnov 

> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 
> 35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6
>  100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk)
>  {
>   struct vsock_sock *vsk = vsock_sk(sk);
>  
> + /* Flush MSG_ZEROCOPY leftovers. */
> + __skb_queue_purge(&sk->sk_error_queue);
> +
>   vsock_deassign_transport(vsk);
>  
>   /* When clearing these addresses, there's no need to set the family and
> 



Re: [PATCH net v2 3/3] virtio/vsock: Improve MSG_ZEROCOPY error handling

2024-11-11 Thread Arseniy Krasnov



On 07.11.2024 23:46, Michal Luczaj wrote:
> Add a missing kfree_skb() to prevent memory leaks.
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Reviewed-by: Stefano Garzarella 
> Signed-off-by: Michal Luczaj 
> ---
>  net/vmw_vsock/virtio_transport_common.c | 1 +
>  1 file changed, 1 insertion(+)

Acked-by: Arseniy Krasnov 

> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index 
> cd075f608d4f6f48f894543e5e9c966d3e5f22df..e2e6a30b759bdc6371bb0d63ee2e77c0ba148fd2
>  100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -400,6 +400,7 @@ static int virtio_transport_send_pkt_info(struct 
> vsock_sock *vsk,
>   if (virtio_transport_init_zcopy_skb(vsk, skb,
>   info->msg,
>   can_zcopy)) {
> + kfree_skb(skb);
>   ret = -ENOMEM;
>   break;
>   }
> 



Re: [PATCH net 3/4] virtio/vsock: Improve MSG_ZEROCOPY error handling

2024-11-11 Thread Arseniy Krasnov



On 06.11.2024 20:51, Michal Luczaj wrote:
> Add a missing kfree_skb() to prevent memory leaks.
> 
> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
> Signed-off-by: Michal Luczaj 
> ---
>  net/vmw_vsock/virtio_transport_common.c | 1 +
>  1 file changed, 1 insertion(+)

Acked-by: Arseniy Krasnov 

> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index 
> cd075f608d4f6f48f894543e5e9c966d3e5f22df..e2e6a30b759bdc6371bb0d63ee2e77c0ba148fd2
>  100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -400,6 +400,7 @@ static int virtio_transport_send_pkt_info(struct 
> vsock_sock *vsk,
>   if (virtio_transport_init_zcopy_skb(vsk, skb,
>   info->msg,
>   can_zcopy)) {
> + kfree_skb(skb);
>   ret = -ENOMEM;
>   break;
>   }
> 



Re: [PATCH net-next v3 03/13] virtio_ring: packed: record extras for indirect buffers

2024-11-10 Thread Jason Wang
On Mon, Nov 11, 2024 at 11:53 AM Xuan Zhuo  wrote:
>
> On Thu,  7 Nov 2024 16:54:54 +0800, Xuan Zhuo  
> wrote:
> > The subsequent commit needs to know whether every indirect buffer is
> > premapped or not. So we need to introduce an extra struct for every
> > indirect buffer to record this info.
> >
> > Signed-off-by: Xuan Zhuo 
>
>
> Hi, Jason
>
> This also needs a review.

Sorry, I miss this.

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v3 03/13] virtio_ring: packed: record extras for indirect buffers

2024-11-10 Thread Xuan Zhuo
On Thu,  7 Nov 2024 16:54:54 +0800, Xuan Zhuo  
wrote:
> The subsequent commit needs to know whether every indirect buffer is
> premapped or not. So we need to introduce an extra struct for every
> indirect buffer to record this info.
>
> Signed-off-by: Xuan Zhuo 


Hi, Jason

This also needs a review.

Thanks.


> ---
>  drivers/virtio/virtio_ring.c | 60 +---
>  1 file changed, 36 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 405d5a348795..cfe70c40f630 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -78,7 +78,11 @@ struct vring_desc_state_split {
>
>  struct vring_desc_state_packed {
>   void *data; /* Data for callback. */
> - struct vring_packed_desc *indir_desc; /* Indirect descriptor, if any. */
> +
> + /* Indirect desc table and extra table, if any. These two will be
> +  * allocated together. So we won't stress more to the memory allocator.
> +  */
> + struct vring_packed_desc *indir_desc;
>   u16 num;/* Descriptor list length. */
>   u16 last;   /* The last desc state in a list. */
>  };
> @@ -1238,27 +1242,12 @@ static void vring_unmap_extra_packed(const struct 
> vring_virtqueue *vq,
>   }
>  }
>
> -static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
> - const struct vring_packed_desc *desc)
> -{
> - u16 flags;
> -
> - if (!vring_need_unmap_buffer(vq))
> - return;
> -
> - flags = le16_to_cpu(desc->flags);
> -
> - dma_unmap_page(vring_dma_dev(vq),
> -le64_to_cpu(desc->addr),
> -le32_to_cpu(desc->len),
> -(flags & VRING_DESC_F_WRITE) ?
> -DMA_FROM_DEVICE : DMA_TO_DEVICE);
> -}
> -
>  static struct vring_packed_desc *alloc_indirect_packed(unsigned int total_sg,
>  gfp_t gfp)
>  {
> + struct vring_desc_extra *extra;
>   struct vring_packed_desc *desc;
> + int i, size;
>
>   /*
>* We require lowmem mappings for the descriptors because
> @@ -1267,7 +1256,16 @@ static struct vring_packed_desc 
> *alloc_indirect_packed(unsigned int total_sg,
>*/
>   gfp &= ~__GFP_HIGHMEM;
>
> - desc = kmalloc_array(total_sg, sizeof(struct vring_packed_desc), gfp);
> + size = (sizeof(*desc) + sizeof(*extra)) * total_sg;
> +
> + desc = kmalloc(size, gfp);
> + if (!desc)
> + return NULL;
> +
> + extra = (struct vring_desc_extra *)&desc[total_sg];
> +
> + for (i = 0; i < total_sg; i++)
> + extra[i].next = i + 1;
>
>   return desc;
>  }
> @@ -1280,6 +1278,7 @@ static int virtqueue_add_indirect_packed(struct 
> vring_virtqueue *vq,
>void *data,
>gfp_t gfp)
>  {
> + struct vring_desc_extra *extra;
>   struct vring_packed_desc *desc;
>   struct scatterlist *sg;
>   unsigned int i, n, err_idx;
> @@ -1291,6 +1290,8 @@ static int virtqueue_add_indirect_packed(struct 
> vring_virtqueue *vq,
>   if (!desc)
>   return -ENOMEM;
>
> + extra = (struct vring_desc_extra *)&desc[total_sg];
> +
>   if (unlikely(vq->vq.num_free < 1)) {
>   pr_debug("Can't add buf len 1 - avail = 0\n");
>   kfree(desc);
> @@ -1312,6 +1313,13 @@ static int virtqueue_add_indirect_packed(struct 
> vring_virtqueue *vq,
>   0 : VRING_DESC_F_WRITE);
>   desc[i].addr = cpu_to_le64(addr);
>   desc[i].len = cpu_to_le32(sg->length);
> +
> + if (unlikely(vq->use_dma_api)) {
> + extra[i].addr = addr;
> + extra[i].len = sg->length;
> + extra[i].flags = n < out_sgs ?  0 : 
> VRING_DESC_F_WRITE;
> + }
> +
>   i++;
>   }
>   }
> @@ -1381,7 +1389,7 @@ static int virtqueue_add_indirect_packed(struct 
> vring_virtqueue *vq,
>   err_idx = i;
>
>   for (i = 0; i < err_idx; i++)
> - vring_unmap_desc_packed(vq, &desc[i]);
> + vring_unmap_extra_packed(vq, &extra[i]);
>
>  free_desc:
>   kfree(desc);
> @@ -1617,7 +1625,8 @@ static void detach_buf_packed(struct vring_virtqueue 
> *vq,
>   }
>
>   if (vq->indirect) {
> - u32 len;
> + struct vring_desc_extra *extra;
> + u32 len, num;
>
>   /* Free the indirect table, if any, now that it's unmapped. */
>   desc = state->indir_desc;
> @@ -1626,9 +1635,12 @@ static void detach_buf_packed(struct vring_virtqueue 
> *vq,
>
>   if (vring_need_unmap_buffer(vq)) {
>   len = vq->packed.desc_extra[id].len;

Re: [PATCH net-next v3 06/13] virtio-net: rq submits premapped per-buffer

2024-11-10 Thread Jason Wang
On Thu, Nov 7, 2024 at 4:55 PM Xuan Zhuo  wrote:
>
> virtio-net rq submits premapped per-buffer by setting sg page to NULL;
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4b27ded8fc16..862beacef5d7 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -539,6 +539,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
>  }
>
> +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> +{
> +   sg_dma_address(sg) = addr;
> +   sg_dma_len(sg) = len;
> +}
> +

In the future, we need to consider hiding those in the core.

Anyhow

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v3 05/13] virtio_ring: introduce add api for premapped

2024-11-10 Thread Jason Wang
On Thu, Nov 7, 2024 at 4:55 PM Xuan Zhuo  wrote:
>
> Two APIs are introduced to submit premapped per-buffers.
>
> int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
>  struct scatterlist *sg, unsigned int num,
>  void *data,
>  void *ctx,
>  gfp_t gfp);
>
> int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
>   struct scatterlist *sg, unsigned int num,
>   void *data,
>   gfp_t gfp);
>
> Signed-off-by: Xuan Zhuo 

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v3 02/13] virtio_ring: split: record extras for indirect buffers

2024-11-10 Thread Jason Wang
On Thu, Nov 7, 2024 at 4:55 PM Xuan Zhuo  wrote:
>
> The subsequent commit needs to know whether every indirect buffer is
> premapped or not. So we need to introduce an extra struct for every
> indirect buffer to record this info.
>
> Signed-off-by: Xuan Zhuo 

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers

2024-11-10 Thread Jason Wang
On Wed, Nov 6, 2024 at 3:42 PM Michael S. Tsirkin  wrote:
>
> On Wed, Nov 06, 2024 at 09:44:39AM +0800, Jason Wang wrote:
> > > > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > > > -   vring_unmap_one_split(vq, i);
> > > > > +   vring_unmap_one_split(vq, &extra[i]);
> > > >
> > > > Not sure if I've asked this before. But this part seems to deserve an
> > > > independent fix for -stable.
> > >
> > > What fix?
> >
> > I meant for hardening we need to check the flags stored in the extra
> > instead of the descriptor itself as it could be mangled by the device.
> >
> > Thanks
>
> Good point. Jason, want to cook up a patch?

Will do.

Thanks

>
> --
> MST
>




Re: [PATCH net 4/4] virtio/vsock: Put vsock_connected_sockets_vsk() to use

2024-11-08 Thread Stefano Garzarella

On Thu, Nov 07, 2024 at 10:04:03PM +0100, Michal Luczaj wrote:

On 11/7/24 11:22, Stefano Garzarella wrote:

On Wed, Nov 06, 2024 at 06:51:21PM +0100, Michal Luczaj wrote:

Macro vsock_connected_sockets_vsk() has been unused since its introduction.
Instead of removing it, utilise it in vsock_insert_connected() where it's
been open-coded.

No functional change intended.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")


This is not a fix, so please remove the Fixes tag, we don't need to
backport this patch in stable branches.

Also in this case this is not related at all with virtio transport, so
please remove `virtio` from the commit title.

In addition maybe you can remove this patch from this series, and send
it to net-next.
...


Right, I get it. Just to be clear: are such small (and non-functional)
cleanups welcomed coming by themselves?


Good question, in this case I think it's a good cleanup, since we have 
the function already there, why not use it. So I don't see a problem 
with that.


If you find others cleanups, it's better to pack in a single series, 
otherwise let's send just it.


Thanks for the fixes and cleanups!

Stefano




Re: [PATCH net v2 2/3] vsock: Fix sk_error_queue memory leak

2024-11-08 Thread Stefano Garzarella

On Thu, Nov 07, 2024 at 09:46:13PM +0100, Michal Luczaj wrote:

Kernel queues MSG_ZEROCOPY completion notifications on the error queue.
Where they remain, until explicitly recv()ed. To prevent memory leaks,
clean up the queue when the socket is destroyed.

unreferenced object 0x8881028beb00 (size 224):
 comm "vsock_test", pid 1218, jiffies 4294694897
 hex dump (first 32 bytes):
   90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff  ..!...!.
   00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff  ..!.
 backtrace (crc 6c7031ca):
   [] kmem_cache_alloc_node_noprof+0x2f7/0x370
   [] __alloc_skb+0x132/0x180
   [] sock_omalloc+0x4b/0x80
   [] msg_zerocopy_realloc+0x9e/0x240
   [] virtio_transport_send_pkt_info+0x412/0x4c0
   [] virtio_transport_stream_enqueue+0x43/0x50
   [] vsock_connectible_sendmsg+0x373/0x450
   [] sys_sendmsg+0x365/0x3a0
   [] ___sys_sendmsg+0x84/0xd0
   [] __sys_sendmsg+0x47/0x80
   [] do_syscall_64+0x93/0x180
   [] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Signed-off-by: Michal Luczaj 
---
net/vmw_vsock/af_vsock.c | 3 +++
1 file changed, 3 insertions(+)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 
35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6
 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk)
{
struct vsock_sock *vsk = vsock_sk(sk);

+   /* Flush MSG_ZEROCOPY leftovers. */
+   __skb_queue_purge(&sk->sk_error_queue);
+
vsock_deassign_transport(vsk);

/* When clearing these addresses, there's no need to set the family and

--
2.46.2
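
As background for why the purge is needed: completion notifications only
ever leave sk_error_queue when userspace drains them. A userspace sketch of
that drain loop (illustrative only, not part of this series; it relies on
the generic MSG_ZEROCOPY recvmsg(MSG_ERRQUEUE) interface):

	#include <stdio.h>
	#include <sys/socket.h>
	#include <linux/errqueue.h>

	/* Drain pending MSG_ZEROCOPY completions from the error queue.
	 * recvmsg(..., MSG_ERRQUEUE) does not block; it fails with EAGAIN
	 * once the queue is empty.
	 */
	static void drain_zerocopy_completions(int fd)
	{
		char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
		struct msghdr msg = { 0 };

		for (;;) {
			struct cmsghdr *cm;

			msg.msg_control = control;
			msg.msg_controllen = sizeof(control);

			if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
				break;	/* queue empty or error */

			for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
				struct sock_extended_err *serr;

				serr = (struct sock_extended_err *)CMSG_DATA(cm);
				if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
					printf("completed sends %u..%u\n",
					       serr->ee_info, serr->ee_data);
			}
		}
	}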






Re: [PATCH net 4/4] virtio/vsock: Put vsock_connected_sockets_vsk() to use

2024-11-07 Thread Michal Luczaj
On 11/7/24 11:22, Stefano Garzarella wrote:
> On Wed, Nov 06, 2024 at 06:51:21PM +0100, Michal Luczaj wrote:
>> Macro vsock_connected_sockets_vsk() has been unused since its introduction.
>> Instead of removing it, utilise it in vsock_insert_connected() where it's
>> been open-coded.
>>
>> No functional change intended.
>>
>> Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
> 
> This is not a fix, so please remove the Fixes tag, we don't need to 
> backport this patch in stable branches.
> 
> Also in this case this is not related at all with virtio transport, so 
> please remove `virtio` from the commit title.
> 
> In addition maybe you can remove this patch from this series, and send 
> it to net-next.
> ...

Right, I get it. Just to be clear: are such small (and non-functional)
cleanups welcomed coming by themselves?

Thanks,
Michal




Re: [PATCH net 2/4] virtio/vsock: Fix sk_error_queue memory leak

2024-11-07 Thread Michal Luczaj
On 11/7/24 11:17, Stefano Garzarella wrote:
> On Wed, Nov 06, 2024 at 06:51:19PM +0100, Michal Luczaj wrote:
>> Kernel queues MSG_ZEROCOPY completion notifications on the error queue.
>> Where they remain, until explicitly recv()ed. To prevent memory leaks,
>> clean up the queue when the socket is destroyed.
>>
>> unreferenced object 0x8881028beb00 (size 224):
>>  comm "vsock_test", pid 1218, jiffies 4294694897
>>  hex dump (first 32 bytes):
>>90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff  ..!...!.
>>00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff  ..!.
>>  backtrace (crc 6c7031ca):
>>[] kmem_cache_alloc_node_noprof+0x2f7/0x370
>>[] __alloc_skb+0x132/0x180
>>[] sock_omalloc+0x4b/0x80
>>[] msg_zerocopy_realloc+0x9e/0x240
>>[] virtio_transport_send_pkt_info+0x412/0x4c0
>>[] virtio_transport_stream_enqueue+0x43/0x50
>>[] vsock_connectible_sendmsg+0x373/0x450
>>[] sys_sendmsg+0x365/0x3a0
>>[] ___sys_sendmsg+0x84/0xd0
>>[] __sys_sendmsg+0x47/0x80
>>[] do_syscall_64+0x93/0x180
>>[] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>
>> Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
>> Signed-off-by: Michal Luczaj 
>> ---
>> net/vmw_vsock/af_vsock.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index 
>> 35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6
>>  100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk)
>> {
>>  struct vsock_sock *vsk = vsock_sk(sk);
>>
>> +/* Flush MSG_ZEROCOPY leftovers. */
>> +__skb_queue_purge(&sk->sk_error_queue);
>> +
> 
> It is true that for now this is supported only in the virtio transport, 
> but it's more related to the core, so please remove `virtio` from the 
> commit title.
> 
> The rest LGTM.
> ...

OK, done. Here's v2 of the series:
https://lore.kernel.org/netdev/20241107-vsock-mem-leaks-v2-0-4e21bfcfc...@rbox.co/

Thanks for the reviews,
Michal




Re: [PATCH net 4/4] virtio/vsock: Put vsock_connected_sockets_vsk() to use

2024-11-07 Thread Stefano Garzarella

On Wed, Nov 06, 2024 at 06:51:21PM +0100, Michal Luczaj wrote:

Macro vsock_connected_sockets_vsk() has been unused since its introduction.
Instead of removing it, utilise it in vsock_insert_connected() where it's
been open-coded.

No functional change intended.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")


This is not a fix, so please remove the Fixes tag, we don't need to 
backport this patch in stable branches.


Also in this case this is not related at all with virtio transport, so 
please remove `virtio` from the commit title.


In addition maybe you can remove this patch from this series, and send 
it to net-next.


Thanks,
Stefano


Signed-off-by: Michal Luczaj 
---
net/vmw_vsock/af_vsock.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 
dfd29160fe11c4675f872c1ee123d65b2da0dae6..c3a37c3d4bf3c8117fbc8bd020da8dc1c9212732
 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -275,8 +275,7 @@ static void vsock_insert_unbound(struct vsock_sock *vsk)

void vsock_insert_connected(struct vsock_sock *vsk)
{
-   struct list_head *list = vsock_connected_sockets(
-   &vsk->remote_addr, &vsk->local_addr);
+   struct list_head *list = vsock_connected_sockets_vsk(vsk);

spin_lock_bh(&vsock_table_lock);
__vsock_insert_connected(list, vsk);

--
2.46.2






Re: [PATCH net 3/4] virtio/vsock: Improve MSG_ZEROCOPY error handling

2024-11-07 Thread Stefano Garzarella

On Wed, Nov 06, 2024 at 06:51:20PM +0100, Michal Luczaj wrote:

Add a missing kfree_skb() to prevent memory leaks.

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Signed-off-by: Michal Luczaj 
---
net/vmw_vsock/virtio_transport_common.c | 1 +
1 file changed, 1 insertion(+)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 
cd075f608d4f6f48f894543e5e9c966d3e5f22df..e2e6a30b759bdc6371bb0d63ee2e77c0ba148fd2
 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -400,6 +400,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock 
*vsk,
if (virtio_transport_init_zcopy_skb(vsk, skb,
info->msg,
can_zcopy)) {
+   kfree_skb(skb);
ret = -ENOMEM;
break;
}

--
2.46.2






Re: [PATCH net 2/4] virtio/vsock: Fix sk_error_queue memory leak

2024-11-07 Thread Stefano Garzarella

On Wed, Nov 06, 2024 at 06:51:19PM +0100, Michal Luczaj wrote:

Kernel queues MSG_ZEROCOPY completion notifications on the error queue.
Where they remain, until explicitly recv()ed. To prevent memory leaks,
clean up the queue when the socket is destroyed.

unreferenced object 0x8881028beb00 (size 224):
 comm "vsock_test", pid 1218, jiffies 4294694897
 hex dump (first 32 bytes):
   90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff  ..!...!.
   00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff  ..!.
 backtrace (crc 6c7031ca):
   [] kmem_cache_alloc_node_noprof+0x2f7/0x370
   [] __alloc_skb+0x132/0x180
   [] sock_omalloc+0x4b/0x80
   [] msg_zerocopy_realloc+0x9e/0x240
   [] virtio_transport_send_pkt_info+0x412/0x4c0
   [] virtio_transport_stream_enqueue+0x43/0x50
   [] vsock_connectible_sendmsg+0x373/0x450
   [] sys_sendmsg+0x365/0x3a0
   [] ___sys_sendmsg+0x84/0xd0
   [] __sys_sendmsg+0x47/0x80
   [] do_syscall_64+0x93/0x180
   [] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Signed-off-by: Michal Luczaj 
---
net/vmw_vsock/af_vsock.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 
35681adedd9aaec3565495158f5342b8aa76c9bc..dfd29160fe11c4675f872c1ee123d65b2da0dae6
 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -836,6 +836,9 @@ static void vsock_sk_destruct(struct sock *sk)
{
struct vsock_sock *vsk = vsock_sk(sk);

+   /* Flush MSG_ZEROCOPY leftovers. */
+   __skb_queue_purge(&sk->sk_error_queue);
+


It is true that for now this is supported only in the virtio transport, 
but it's more related to the core, so please remove `virtio` from the 
commit title.


The rest LGTM.

Thanks,
Stefano


vsock_deassign_transport(vsk);

/* When clearing these addresses, there's no need to set the family and

--
2.46.2






Re: [PATCH net 1/4] virtio/vsock: Fix accept_queue memory leak

2024-11-07 Thread Stefano Garzarella

On Wed, Nov 06, 2024 at 06:51:18PM +0100, Michal Luczaj wrote:

As the final stages of socket destruction may be delayed, it is possible
that virtio_transport_recv_listen() will be called after the accept_queue
has been flushed, but before the SOCK_DONE flag has been set. As a result,
sockets enqueued after the flush would remain unremoved, leading to a
memory leak.

vsock_release
  __vsock_release
    lock
    virtio_transport_release
      virtio_transport_close
        schedule_delayed_work(close_work)
    sk_shutdown = SHUTDOWN_MASK
(!) flush accept_queue
    release
                                 virtio_transport_recv_pkt
                                   vsock_find_bound_socket
                                   lock
                                   if flag(SOCK_DONE) return
                                   virtio_transport_recv_listen
                                     child = vsock_create_connected
(!)                                  vsock_enqueue_accept(child)
                                   release
close_work
  lock
  virtio_transport_do_close
    set_flag(SOCK_DONE)
    virtio_transport_remove_sock
      vsock_remove_sock
        vsock_remove_bound
  release

Introduce a sk_shutdown check to disallow vsock_enqueue_accept() during
socket destruction.

unreferenced object 0x888109e3f800 (size 2040):
 comm "kworker/5:2", pid 371, jiffies 4294940105
 hex dump (first 32 bytes):
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
   28 00 0b 40 00 00 00 00 00 00 00 00 00 00 00 00  (..@
 backtrace (crc 9e5f4e84):
   [] kmem_cache_alloc_noprof+0x2c1/0x360
   [] sk_prot_alloc+0x30/0x120
   [] sk_alloc+0x2c/0x4b0
   [] __vsock_create.constprop.0+0x2a/0x310
   [] virtio_transport_recv_pkt+0x4dc/0x9a0
   [] vsock_loopback_work+0xfd/0x140
   [] process_one_work+0x20c/0x570
   [] worker_thread+0x1bf/0x3a0
   [] kthread+0xdd/0x110
   [] ret_from_fork+0x2d/0x50
   [] ret_from_fork_asm+0x1a/0x30

Fixes: 3fe356d58efa ("vsock/virtio: discard packets only when socket is really 
closed")
Signed-off-by: Michal Luczaj 
---
net/vmw_vsock/virtio_transport_common.c | 8 
1 file changed, 8 insertions(+)


Reviewed-by: Stefano Garzarella 



diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 
ccbd2bc0d2109aea4f19e79a0438f85893e1d89c..cd075f608d4f6f48f894543e5e9c966d3e5f22df
 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1512,6 +1512,14 @@ virtio_transport_recv_listen(struct sock *sk, struct 
sk_buff *skb,
return -ENOMEM;
}

+   /* __vsock_release() might have already flushed accept_queue.
+* Subsequent enqueues would lead to a memory leak.
+*/
+   if (sk->sk_shutdown == SHUTDOWN_MASK) {
+   virtio_transport_reset_no_sock(t, skb);
+   return -ESHUTDOWN;
+   }
+
child = vsock_create_connected(sk);
if (!child) {
virtio_transport_reset_no_sock(t, skb);

--
2.46.2






Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default

2024-11-06 Thread Jakub Kicinski
On Wed, 6 Nov 2024 04:16:07 -0500 Michael S. Tsirkin wrote:
> I thought it can but can't remember why now. OK, nm then, thanks!

FWIW (I think there was confusion in earlier discussions) we do merge
net into net-next once a week. So we can always apply stuff to net,
and then the dependent patches to net-next within a week. Just for future
reference, this patch IIUC we just leave be.



Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default

2024-11-06 Thread Michael S. Tsirkin
On Wed, Nov 06, 2024 at 04:46:23PM +0800, Xuan Zhuo wrote:
> On Wed, 6 Nov 2024 02:38:40 -0500, "Michael S. Tsirkin"  
> wrote:
> > On Mon, Nov 04, 2024 at 06:46:41PM -0800, Jakub Kicinski wrote:
> > > On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote:
> > > > In the last linux version, we disabled this feature to fix the
> > > > regression[1].
> > > >
> > > > The patch set tries to fix the problem and re-enable it.
> > > >
> > > > More info: 
> > > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> > >
> > > Sorry to ping, Michael, Jason we're waiting to hear from you on
> > > this one.
> >
> > Can patch 1 be applied on net as well? Or I can take it through
> > my tree. It's a bugfix, just for an uncommon configuration.
> 
> 
> Why? That can not be triggered in net branch.
> 
> Thanks

I thought it can but can't remember why now. OK, nm then, thanks!




Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default

2024-11-06 Thread Xuan Zhuo
On Wed, 6 Nov 2024 02:38:40 -0500, "Michael S. Tsirkin"  wrote:
> On Mon, Nov 04, 2024 at 06:46:41PM -0800, Jakub Kicinski wrote:
> > On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote:
> > > In the last linux version, we disabled this feature to fix the
> > > regression[1].
> > >
> > > The patch set tries to fix the problem and re-enable it.
> > >
> > > More info: 
> > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> >
> > Sorry to ping, Michael, Jason we're waiting to hear from you on
> > this one.
>
> Can patch 1 be applied on net as well? Or I can take it through
> my tree. It's a bugfix, just for an uncommon configuration.


Why? That can not be triggered in net branch.

Thanks



Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers

2024-11-05 Thread Michael S. Tsirkin
On Wed, Nov 06, 2024 at 09:44:39AM +0800, Jason Wang wrote:
> > > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > > -   vring_unmap_one_split(vq, i);
> > > > +   vring_unmap_one_split(vq, &extra[i]);
> > >
> > > Not sure if I've asked this before. But this part seems to deserve an
> > > independent fix for -stable.
> >
> > What fix?
> 
> I meant for hardening we need to check the flags stored in the extra
> instead of the descriptor itself as it could be mangled by the device.
> 
> Thanks

Good point. Jason, want to cook up a patch?

-- 
MST
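
For context, a rough sketch of the hardening idea mentioned above (an
assumption of what such a change could look like, not an actual patch):
when detaching a chain, walk it using the driver-private extra[] copies of
flags/next, so a device that rewrites desc[].flags or desc[].next cannot
steer the unmap loop:

	/* sketch only; addr/len/flags/next are the driver-owned copies
	 * kept in struct vring_desc_extra by the series above
	 */
	for (;;) {
		vring_unmap_one_split(vq, &extra[i]);
		if (!(extra[i].flags & VRING_DESC_F_NEXT))
			break;
		i = extra[i].next;
	}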




Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default

2024-11-05 Thread Michael S. Tsirkin
On Mon, Nov 04, 2024 at 06:46:41PM -0800, Jakub Kicinski wrote:
> On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote:
> > In the last linux version, we disabled this feature to fix the
> > regression[1].
> > 
> > The patch set tries to fix the problem and re-enable it.
> > 
> > More info: 
> > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> 
> Sorry to ping, Michael, Jason we're waiting to hear from you on 
> this one.

Can patch 1 be applied on net as well? Or I can take it through
my tree. It's a bugfix, just for an uncommon configuration.




Re: [PATCH net-next v1 1/4] virtio-net: fix overflow inside virtnet_rq_alloc

2024-11-05 Thread Michael S. Tsirkin
On Tue, Oct 29, 2024 at 04:46:12PM +0800, Xuan Zhuo wrote:
> When the frag has just been refilled with a single page, this may lead
> to a regression on a VM. Especially if the sysctl
> net.core.high_order_alloc_disable value is 1, the frag always gets a
> single page on refill.
> 
> This can show up as reliable crashes or scp failures (scp of a 100M
> file to the VM).
> 
> The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> of a new frag. When the frag size is larger than PAGE_SIZE,
> everything is fine. However, if the frag is only one page and the
> total size of the buffer and virtnet_rq_dma is larger than one page, an
> overflow may occur.
> 
> The commit f9dac92ba908 ("virtio_ring: enable premapped mode whatever
> use_dma_api") introduced this problem. And we reverted some commits to
> fix this in last linux version. Now we try to enable it and fix this
> bug directly.
> 
> Here, when the frag size is not enough, we reduce the buffer len to fix
> this problem.
> 
> Reported-by: "Si-Wei Liu" 
> Tested-by: Darren Kenny 
> Signed-off-by: Xuan Zhuo 


This one belongs in net I feel. However, patches 2 and on
depend on it not to cause regressions. So I suggest merging it
on both branches, git will figure out the conflict I think.


> ---
>  drivers/net/virtio_net.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 792e9eadbfc3..d50c1940eb23 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -926,9 +926,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> u32 size, gfp_t gfp)
>   void *buf, *head;
>   dma_addr_t addr;
>  
> - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> - return NULL;
> -
>   head = page_address(alloc_frag->page);
>  
>   if (rq->do_dma) {
> @@ -2423,6 +2420,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> struct receive_queue *rq,
>   len = SKB_DATA_ALIGN(len) +
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> + return -ENOMEM;
> +
>   buf = virtnet_rq_alloc(rq, len, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> @@ -2525,6 +2525,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi,
>*/
>   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>  
> + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> + return -ENOMEM;
> +
> + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
> alloc_frag->size)
> + len -= sizeof(struct virtnet_rq_dma);
> +
>   buf = virtnet_rq_alloc(rq, len + room, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers

2024-11-05 Thread Xuan Zhuo
On Wed, 6 Nov 2024 09:44:39 +0800, Jason Wang  wrote:
> On Tue, Nov 5, 2024 at 2:53 PM Xuan Zhuo  wrote:
> >
> > On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang  wrote:
> > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  
> > > wrote:
> > > >
> > > > The subsequent commit needs to know whether every indirect buffer is
> > > > premapped or not. So we need to introduce an extra struct for every
> > > > indirect buffer to record this info.
> > > >
> > > > Signed-off-by: Xuan Zhuo 
> > > > ---
> > > >  drivers/virtio/virtio_ring.c | 112 ---
> > > >  1 file changed, 52 insertions(+), 60 deletions(-)
> > >
> > > Do we have a performance impact for this patch?
> > >
> > > >
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 97590c201aa2..dca093744fe1 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -69,7 +69,11 @@
> > > >
> > > >  struct vring_desc_state_split {
> > > > void *data; /* Data for callback. */
> > > > -   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. 
> > > > */
> > > > +
> > > > +   /* Indirect extra table and desc table, if any. These two will 
> > > > be
> > > > +* allocated together. So we won't stress more to the memory 
> > > > allocator.
> > > > +*/
> > > > +   struct vring_desc *indir_desc;
> > >
> > > So it looks like we put a descriptor table after the extra table. Can
> > > this lead to more crossing page mappings for the indirect descriptors?
> > >
> > > If yes, it seems expensive so we probably need to make the descriptor
> > > table come first.
> >
> > No, the descriptors are before extra table.
>
> Well, you need then tweak the above comment, it said
>
> "Indirect extra table and desc table".
>
> > So, there is not performance impact.
> >
> >
> > >
> > > >  };
> > > >
>
> [...]
>
> > > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > > -   vring_unmap_one_split(vq, i);
> > > > +   vring_unmap_one_split(vq, &extra[i]);
> > >
> > > Not sure if I've asked this before. But this part seems to deserve an
> > > independent fix for -stable.
> >
> > What fix?
>
> I meant for hardening we need to check the flags stored in the extra
> instead of the descriptor itself as it could be mangled by the device.

I see, we can do it in future.

Thanks.


>
> Thanks
>
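
To visualise the single-allocation layout under discussion (a sketch based
on the alloc_indirect_packed() hunk quoted earlier in this digest; the
split-ring variant follows the same idea, descriptors first with the extra
array right behind them):

	desc = kmalloc((sizeof(*desc) + sizeof(*extra)) * total_sg, gfp);

	+------------------------------+------------------------------+
	| desc[0] ... desc[N-1]        | extra[0] ... extra[N-1]      |
	+------------------------------+------------------------------+
	^ desc                          ^ extra = (struct vring_desc_extra *)&desc[N]

	(N == total_sg; because the descriptors come first, the device's
	 indirect descriptor reads are not pushed across an extra page
	 boundary by the appended extra[] area.)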



Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer

2024-11-05 Thread Xuan Zhuo
On Wed, 6 Nov 2024 09:56:55 +0800, Jason Wang  wrote:
> On Tue, Nov 5, 2024 at 3:23 PM Xuan Zhuo  wrote:
> >
> > On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang  wrote:
> > > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  
> > > wrote:
> > > >
> > > > virtio-net rq submits premapped per-buffer by setting sg page to NULL;
> > > >
> > > > Signed-off-by: Xuan Zhuo 
> > > > ---
> > > >  drivers/net/virtio_net.c | 24 +---
> > > >  1 file changed, 13 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index 792e9eadbfc3..09757fa408bd 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> > > > return (struct sk_buff *)((unsigned long)ptr & 
> > > > ~VIRTIO_ORPHAN_FLAG);
> > > >  }
> > > >
> > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 
> > > > len)
> > > > +{
> > > > +   sg->dma_address = addr;
> > > > +   sg->length = len;
> > >
> > > This may work but I think it's better to reuse existing dma sg helpers 
> > > like:
> > >
> > > sg_dma_address(sg) = addr;
> > > sg_dma_length(sg) = len;
> > >
> > > And we probably need to fix the virtio core which only uses
> > > sg_dma_address() but not sg_dma_length().
> > >
> > > This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is 
> > > set.
> >
> >
> > I don't think so.
> >
> > For no-premapped mode, we pass the sg as no-dma sg to virtio core,
> > so the virtio core uses the sg->length directly.
>
> This is fine.
>
> > If the virtio core does the DMA map for the sg, we do not use dma_map_sg_attrs(),
> > so we must use sg->length directly.
>
> I meant it's a hack. It may work now but will be a bug in the future.
>
> For example, I'm playing with a prototype to do premapping for virtio-blk;
> the idea is to move the expensive DMA mappings in the case of swiotlb
> etc. to be done outside the per-virtqueue lock. In that case, the
> driver may want to use dma_map_sg() instead of dma_map_page().
>
> I'd suppose we will finally go with the approach where DMA mappings need
> to be handled by the driver, and dma_map_sg() is faster than per-sg
> dma_map_page() anyhow.
>
> >
> > In this case, for the driver, we cannot use sg_dma_length():
> > if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set
> > sg->dma_length,
> > but the virtio core uses sg->length.
>
> Well, we just need a minor tweak to get the length from
> vring_map_one_sg(), then everything should be fine?
>
> if (sg_is_premapped) {
>   *addr = sg_dma_address(sg);
>   *len = sg_dma_len(sg);
> }

For now, let us start from:

 if (sg_is_premapped) {
   *addr = sg_dma_address(sg);
   sg->length = sg_dma_len(sg);
 }

Then the virtio core needs to be refactored to use dma_map_sg() in the future.

Thanks.
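
For the record, a minimal sketch of what that tweak could look like, assuming a
hypothetical extra "len" out-parameter on vring_map_one_sg() (illustrative only,
not code from this series):

/* Illustrative only: the "len" out-parameter is an assumption. */
static int vring_map_one_sg(const struct vring_virtqueue *vq,
			    struct scatterlist *sg,
			    enum dma_data_direction direction,
			    dma_addr_t *addr, u32 *len,
			    bool premapped)
{
	if (premapped) {
		/* Driver already mapped the buffer: take both the address
		 * and the length from the DMA fields of the scatterlist.
		 */
		*addr = sg_dma_address(sg);
		*len = sg_dma_len(sg);
		return 0;
	}

	/* Otherwise map here and report the CPU length. */
	*addr = dma_map_page(vring_dma_dev(vq), sg_page(sg), sg->offset,
			     sg->length, direction);
	if (dma_mapping_error(vring_dma_dev(vq), *addr))
		return -ENOMEM;

	*len = sg->length;
	return 0;
}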


>
> >
> > For sg->dma_address, it is OK for me to use sg_dma_address() or not.
> > But for consistency with sg->length, I use sg->dma_address directly.
> >
> > I noticed this is special, so I put them into an independent function.
> >
> > Thanks.
>
> Actually, code like sg_fill_dma() calls for a virtqueue DMA mapping
> helper; I think we've agreed that the core needs to hide DMA details
> from the driver.  That is, something like virtqueue_dma_map_sg() etc.
>
> Thanks
>
> >
> > >
> > > Others look good.
> > >
> > > Thanks
> > >
> > > > +}
> > > > +
> > > >  static void __free_old_xmit(struct send_queue *sq, struct netdev_queue 
> > > > *txq,
> > > > bool in_napi, struct virtnet_sq_free_stats 
> > > > *stats)
> > > >  {
> > > > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct 
> > > > receive_queue *rq, void *buf, u32 len)
> > > > addr = dma->addr - sizeof(*dma) + offset;
> > > >
> > > > sg_init_table(rq->sg, 1);
> > > > -   rq->sg[0].dma_address = addr;
> > > > -   rq->sg[0].length = len;
> > > > +   sg_fill_dma(rq->sg, addr, len);
> > > >  }
> > > >
> > > >  static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, 
> > > > gfp_t gfp)
> > > > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct 
> > > > virtnet_info *vi,
> > > > }
> > > >  }
> > > >
> > > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 
> > > > len)
> > > > -{
> > > > -   sg->dma_address = addr;
> > > > -   sg->length = len;
> > > > -}
> > > > -
> > > >  static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
> > > >struct receive_queue *rq, void *buf, 
> > > > u32 len)
> > > >  {
> > > > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct 
> > > > virtnet_info *vi, struct receive_queue
> > > > sg_init_table(rq->sg, 1);
> > > > sg_fill_dma(rq->sg, addr, len);
> > > >
> > > > -   err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, 
> > > > xsk_buffs[i], gfp);
> > > > +   err = virtqueue_ad

Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer

2024-11-05 Thread Jason Wang
On Tue, Nov 5, 2024 at 3:23 PM Xuan Zhuo  wrote:
>
> On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang  wrote:
> > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  
> > wrote:
> > >
> > > virtio-net rq submits premapped per-buffer by setting sg page to NULL;
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > ---
> > >  drivers/net/virtio_net.c | 24 +---
> > >  1 file changed, 13 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 792e9eadbfc3..09757fa408bd 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> > > return (struct sk_buff *)((unsigned long)ptr & 
> > > ~VIRTIO_ORPHAN_FLAG);
> > >  }
> > >
> > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > +{
> > > +   sg->dma_address = addr;
> > > +   sg->length = len;
> >
> > This may work but I think it's better to reuse existing dma sg helpers like:
> >
> > sg_dma_address(sg) = addr;
> > sg_dma_length(sg) = len;
> >
> > And we probably need to fix the virtio core which only uses
> > sg_dma_address() but not sg_dma_length().
> >
> > This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set.
>
>
> I don't think so.
>
> For no-premapped mode, we pass the sg as no-dma sg to virtio core,
> so the virtio core uses the sg->length directly.

This is fine.

> If the virtio core does the DMA map for the sg, we do not use dma_map_sg_attrs(),
> so we must use sg->length directly.

I meant it's a hack. It may work now but will be a bug in the future.

For example, I'm playing with a prototype to do premapping for virtio-blk;
the idea is to move the expensive DMA mappings in the case of swiotlb
etc. to be done outside the per-virtqueue lock. In that case, the
driver may want to use dma_map_sg() instead of dma_map_page().

I'd suppose we will finally go with the approach where DMA mappings need
to be handled by the driver, and dma_map_sg() is faster than per-sg
dma_map_page() anyhow.

>
> In this case, for the driver, we cannot use sg_dma_length():
> if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set sg->dma_length,
> but the virtio core uses sg->length.

Well, we just need a minor tweak to get the length from
vring_map_one_sg(), then everything should be fine?

if (sg_is_premapped) {
  *addr = sg_dma_address(sg);
  *len = sg_dma_len(sg);
}

>
> For sg->dma_address, it is OK for me to use sg_dma_address() or not.
> But for consistency with sg->length, I use sg->dma_address directly.
>
> I noticed this is special, so I put them into an independent function.
>
> Thanks.

Actually, code like sg_fill_dma() calls for a virtqueue DMA mapping
helper; I think we've agreed that the core needs to hide DMA details
from the driver.  That is, something like virtqueue_dma_map_sg() etc.

Thanks

>
> >
> > Others look good.
> >
> > Thanks
> >
> > > +}
> > > +
> > >  static void __free_old_xmit(struct send_queue *sq, struct netdev_queue 
> > > *txq,
> > > bool in_napi, struct virtnet_sq_free_stats 
> > > *stats)
> > >  {
> > > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct 
> > > receive_queue *rq, void *buf, u32 len)
> > > addr = dma->addr - sizeof(*dma) + offset;
> > >
> > > sg_init_table(rq->sg, 1);
> > > -   rq->sg[0].dma_address = addr;
> > > -   rq->sg[0].length = len;
> > > +   sg_fill_dma(rq->sg, addr, len);
> > >  }
> > >
> > >  static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t 
> > > gfp)
> > > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct 
> > > virtnet_info *vi,
> > > }
> > >  }
> > >
> > > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > -{
> > > -   sg->dma_address = addr;
> > > -   sg->length = len;
> > > -}
> > > -
> > >  static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
> > >struct receive_queue *rq, void *buf, 
> > > u32 len)
> > >  {
> > > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct 
> > > virtnet_info *vi, struct receive_queue
> > > sg_init_table(rq->sg, 1);
> > > sg_fill_dma(rq->sg, addr, len);
> > >
> > > -   err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, 
> > > xsk_buffs[i], gfp);
> > > +   err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, 
> > > xsk_buffs[i],
> > > +   NULL, true, gfp);
> > > if (err)
> > > goto err;
> > > }
> > > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info 
> > > *vi, struct receive_queue *rq,
> > >
> > > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
> > >
> > > -   err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > > +   err = virtqueue_add_inbuf_premapped(rq-

Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers

2024-11-05 Thread Jason Wang
On Tue, Nov 5, 2024 at 2:53 PM Xuan Zhuo  wrote:
>
> On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang  wrote:
> > On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  
> > wrote:
> > >
> > > The subsequent commit needs to know whether every indirect buffer is
> > > premapped or not. So we need to introduce an extra struct for every
> > > indirect buffer to record this info.
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > ---
> > >  drivers/virtio/virtio_ring.c | 112 ---
> > >  1 file changed, 52 insertions(+), 60 deletions(-)
> >
> > Do we have a performance impact for this patch?
> >
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 97590c201aa2..dca093744fe1 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -69,7 +69,11 @@
> > >
> > >  struct vring_desc_state_split {
> > > void *data; /* Data for callback. */
> > > -   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> > > +
> > > +   /* Indirect extra table and desc table, if any. These two will be
> > > +* allocated together. So we won't stress more to the memory 
> > > allocator.
> > > +*/
> > > +   struct vring_desc *indir_desc;
> >
> > So it looks like we put a descriptor table after the extra table. Can
> > this lead to more crossing page mappings for the indirect descriptors?
> >
> > If yes, it seems expensive so we probably need to make the descriptor
> > table come first.
>
> No, the descriptors come before the extra table.

Well, then you need to tweak the above comment; it says

"Indirect extra table and desc table".

> So there is no performance impact.
>
>
> >
> > >  };
> > >

[...]

> > > while (vq->split.vring.desc[i].flags & nextflag) {
> > > -   vring_unmap_one_split(vq, i);
> > > +   vring_unmap_one_split(vq, &extra[i]);
> >
> > Not sure if I've asked this before. But this part seems to deserve an
> > independent fix for -stable.
>
> What fix?

I meant that for hardening we need to check the flags stored in the extra
struct instead of the descriptor itself, as the latter could be mangled by
the device.

Thanks
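
For illustration, a rough sketch of that hardening direction (field names follow
the snippets above; the free-list bookkeeping is omitted and the loop shape is an
assumption, not code from this series):

static void detach_buf_split_hardened(struct vring_virtqueue *vq,
				      unsigned int head)
{
	struct vring_desc_extra *extra = vq->split.desc_extra;
	unsigned int i = head;

	/* Walk the chain using only driver-private state: both the
	 * "next" index and the flags come from the extra array, never
	 * from vq->split.vring.desc[], which the device can rewrite.
	 */
	while (extra[i].flags & VRING_DESC_F_NEXT) {
		vring_unmap_one_split(vq, &extra[i]);
		i = extra[i].next;
	}
	vring_unmap_one_split(vq, &extra[i]);
}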




Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default

2024-11-05 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by Paolo Abeni :

On Tue, 29 Oct 2024 16:46:11 +0800 you wrote:
> v1:
> 1. fix some small problems
> 2. remove commit "virtio_net: introduce vi->mode"
> 
> In the last linux version, we disabled this feature to fix the
> regress[1].
> 
> [...]

Here is the summary with links:
  - [net-next,v1,1/4] virtio-net: fix overflow inside virtnet_rq_alloc
https://git.kernel.org/netdev/net-next/c/6aacd1484468
  - [net-next,v1,2/4] virtio_net: big mode skip the unmap check
https://git.kernel.org/netdev/net-next/c/a33f3df85075
  - [net-next,v1,3/4] virtio_net: enable premapped mode for merge and small by 
default
https://git.kernel.org/netdev/net-next/c/47008bb51c3e
  - [net-next,v1,4/4] virtio_net: rx remove premapped failover code
https://git.kernel.org/netdev/net-next/c/fb22437c1ba3

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH net-next v1 4/4] virtio_net: rx remove premapped failover code

2024-11-05 Thread Jason Wang
On Tue, Oct 29, 2024 at 4:47 PM Xuan Zhuo  wrote:
>
> Now, the premapped mode can be enabled unconditionally.
>
> So we can remove the failover code for merge and small mode.
>
> The virtnet_rq_xxx() helper would be only used if the mode is using pre
> mapping. A check is added to prevent misusing of these API.
>
> Tested-by: Darren Kenny 
> Signed-off-by: Xuan Zhuo 

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v1 1/4] virtio-net: fix overflow inside virtnet_rq_alloc

2024-11-05 Thread Jason Wang
On Tue, Oct 29, 2024 at 4:47 PM Xuan Zhuo  wrote:
>
> When the frag just got a page, then may lead to regression on VM.
> Specially if the sysctl net.core.high_order_alloc_disable value is 1,
> then the frag always get a page when do refill.
>
> Which could see reliable crashes or scp failure (scp a file 100M in size
> to VM).
>
> The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> of a new frag. When the frag size is larger than PAGE_SIZE,
> everything is fine. However, if the frag is only one page and the
> total size of the buffer and virtnet_rq_dma is larger than one page, an
> overflow may occur.
>
> The commit f9dac92ba908 ("virtio_ring: enable premapped mode whatever
> use_dma_api") introduced this problem. And we reverted some commits to
> fix this in last linux version. Now we try to enable it and fix this
> bug directly.
>
> Here, when the frag size is not enough, we reduce the buffer len to fix
> this problem.
>
> Reported-by: "Si-Wei Liu" 
> Tested-by: Darren Kenny 
> Signed-off-by: Xuan Zhuo 

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers

2024-11-05 Thread Xuan Zhuo
On Tue, 5 Nov 2024 11:42:09 +0800, Jason Wang  wrote:
> On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  wrote:
> >
> > The subsequent commit needs to know whether every indirect buffer is
> > premapped or not. So we need to introduce an extra struct for every
> > indirect buffer to record this info.
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/virtio/virtio_ring.c | 112 ---
> >  1 file changed, 52 insertions(+), 60 deletions(-)
>
> Do we have a performance impact for this patch?
>
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 97590c201aa2..dca093744fe1 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -69,7 +69,11 @@
> >
> >  struct vring_desc_state_split {
> > void *data; /* Data for callback. */
> > -   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> > +
> > +   /* Indirect extra table and desc table, if any. These two will be
> > +* allocated together. So we won't stress more to the memory 
> > allocator.
> > +*/
> > +   struct vring_desc *indir_desc;
>
> So it looks like we put a descriptor table after the extra table. Can
> this lead to more crossing page mappings for the indirect descriptors?
>
> If yes, it seems expensive so we probably need to make the descriptor
> table come first.

No, the descriptors come before the extra table.
So there is no performance impact.
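
For reference, a minimal sketch of the single-allocation layout (the helper name
is made up for illustration):

/*
 * One kmalloc() holds both tables, descriptors first:
 *
 *   desc[0] ... desc[total_sg - 1] | extra[0] ... extra[total_sg - 1]
 *
 * so the device-visible descriptor table starts at the beginning of the
 * buffer and the driver-private extra entries follow it.
 */
static inline struct vring_desc_extra *
indir_extra(struct vring_desc *desc, unsigned int total_sg)
{
	return (struct vring_desc_extra *)&desc[total_sg];
}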


>
> >  };
> >
> >  struct vring_desc_state_packed {
> > @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue 
> > *vq, u32 num)
> >   * Split ring specific functions - *_split().
> >   */
> >
> > -static void vring_unmap_one_split_indirect(const struct vring_virtqueue 
> > *vq,
> > -  const struct vring_desc *desc)
> > -{
> > -   u16 flags;
> > -
> > -   if (!vring_need_unmap_buffer(vq))
> > -   return;
> > -
> > -   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> > -
> > -   dma_unmap_page(vring_dma_dev(vq),
> > -  virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > -  virtio32_to_cpu(vq->vq.vdev, desc->len),
> > -  (flags & VRING_DESC_F_WRITE) ?
> > -  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > -}
> > -
> >  static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> > - unsigned int i)
> > + struct vring_desc_extra *extra)
> >  {
> > -   struct vring_desc_extra *extra = vq->split.desc_extra;
> > u16 flags;
> >
> > -   flags = extra[i].flags;
> > +   flags = extra->flags;
> >
> > if (flags & VRING_DESC_F_INDIRECT) {
> > if (!vq->use_dma_api)
> > goto out;
> >
> > dma_unmap_single(vring_dma_dev(vq),
> > -extra[i].addr,
> > -extra[i].len,
> > +extra->addr,
> > +extra->len,
> >  (flags & VRING_DESC_F_WRITE) ?
> >  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > } else {
> > @@ -479,22 +465,23 @@ static unsigned int vring_unmap_one_split(const 
> > struct vring_virtqueue *vq,
> > goto out;
> >
> > dma_unmap_page(vring_dma_dev(vq),
> > -  extra[i].addr,
> > -  extra[i].len,
> > +  extra->addr,
> > +  extra->len,
> >(flags & VRING_DESC_F_WRITE) ?
> >DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > }
> >
> >  out:
> > -   return extra[i].next;
> > +   return extra->next;
> >  }
> >
> >  static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> >unsigned int total_sg,
> >gfp_t gfp)
> >  {
> > +   struct vring_desc_extra *extra;
> > struct vring_desc *desc;
> > -   unsigned int i;
> > +   unsigned int i, size;
> >
> > /*
> >  * We require lowmem mappings for the descriptors because
> > @@ -503,40 +490,41 @@ static struct vring_desc *alloc_indirect_split(struct 
> > virtqueue *_vq,
> >  */
> > gfp &= ~__GFP_HIGHMEM;
> >
> > -   desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
> > +   size = sizeof(*desc) * total_sg + sizeof(*extra) * total_sg;
> > +
> > +   desc = kmalloc(size, gfp);
> > if (!desc)
> > return NULL;
> >
> > +   extra = (struct vring_desc_extra *)&desc[total_sg];
> > +
> > for (i = 0; i < total_sg; i++)
> > -   desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
>

Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer

2024-11-04 Thread Xuan Zhuo
On Tue, 5 Nov 2024 11:23:50 +0800, Jason Wang  wrote:
> On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  wrote:
> >
> > virtio-net rq submits premapped per-buffer by setting sg page to NULL;
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/net/virtio_net.c | 24 +---
> >  1 file changed, 13 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 792e9eadbfc3..09757fa408bd 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> > return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
> >  }
> >
> > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > +{
> > +   sg->dma_address = addr;
> > +   sg->length = len;
>
> This may work but I think it's better to reuse existing dma sg helpers like:
>
> sg_dma_address(sg) = addr;
> sg_dma_length(sg) = len;
>
> And we probably need to fix the virtio core which only uses
> sg_dma_address() but not sg_dma_length().
>
> This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set.


I don't think so.

For no-premapped mode, we pass the sg as no-dma sg to virtio core,
so the virtio core uses the sg->length directly.
If the virtio core does the DMA map for the sg, we do not use dma_map_sg_attrs(),
so we must use sg->length directly.

In this case, for the driver, we cannot use sg_dma_length():
if CONFIG_NEED_SG_DMA_LENGTH is set, sg_dma_length() will set sg->dma_length,
but the virtio core uses sg->length.

For sg->dma_address, it is OK for me to use sg_dma_address() or not.
But for consistency with sg->length, I use sg->dma_address directly.

I noticed this is special, so I put them into an independent function.

Thanks.
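
For context, a condensed reminder of the scatterlist helpers (paraphrased from
include/linux/scatterlist.h; double-check against the tree you build with):

#define sg_dma_address(sg)	((sg)->dma_address)

#ifdef CONFIG_NEED_SG_DMA_LENGTH
#define sg_dma_len(sg)		((sg)->dma_length)	/* separate field */
#else
#define sg_dma_len(sg)		((sg)->length)
#endif

/*
 * The virtio core reads sg->length for premapped buffers, so writing
 * sg_dma_len(sg) on a CONFIG_NEED_SG_DMA_LENGTH kernel would update
 * sg->dma_length and leave sg->length stale - hence sg_fill_dma()
 * assigns sg->length directly.
 */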

>
> Others look good.
>
> Thanks
>
> > +}
> > +
> >  static void __free_old_xmit(struct send_queue *sq, struct netdev_queue 
> > *txq,
> > bool in_napi, struct virtnet_sq_free_stats 
> > *stats)
> >  {
> > @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue 
> > *rq, void *buf, u32 len)
> > addr = dma->addr - sizeof(*dma) + offset;
> >
> > sg_init_table(rq->sg, 1);
> > -   rq->sg[0].dma_address = addr;
> > -   rq->sg[0].length = len;
> > +   sg_fill_dma(rq->sg, addr, len);
> >  }
> >
> >  static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t 
> > gfp)
> > @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct 
> > virtnet_info *vi,
> > }
> >  }
> >
> > -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > -{
> > -   sg->dma_address = addr;
> > -   sg->length = len;
> > -}
> > -
> >  static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
> >struct receive_queue *rq, void *buf, u32 
> > len)
> >  {
> > @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct 
> > virtnet_info *vi, struct receive_queue
> > sg_init_table(rq->sg, 1);
> > sg_fill_dma(rq->sg, addr, len);
> >
> > -   err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], 
> > gfp);
> > +   err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, 
> > xsk_buffs[i],
> > +   NULL, true, gfp);
> > if (err)
> > goto err;
> > }
> > @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> > struct receive_queue *rq,
> >
> > virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
> >
> > -   err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > +   err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > +   rq->do_dma, gfp);
> > if (err < 0) {
> > if (rq->do_dma)
> > virtnet_rq_unmap(rq, buf, 0);
> > @@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> > *vi,
> > virtnet_rq_init_one_sg(rq, buf, len);
> >
> > ctx = mergeable_len_to_ctx(len + room, headroom);
> > -   err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> > +   err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> > +   rq->do_dma, gfp);
> > if (err < 0) {
> > if (rq->do_dma)
> > virtnet_rq_unmap(rq, buf, 0);
> > --
> > 2.32.0.3.g01195cf9f
> >
>



Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default

2024-11-04 Thread Jason Wang
Hi Jakub:

On Tue, Nov 5, 2024 at 10:46 AM Jakub Kicinski  wrote:
>
> On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote:
> > In the last linux version, we disabled this feature to fix the
> > regress[1].
> >
> > The patch set is try to fix the problem and re-enable it.
> >
> > More info: 
> > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
>
> Sorry to ping, Michael, Jason we're waiting to hear from you on
> this one.
>

Will review this today.

Thanks




Re: [PATCH net-next v2 07/13] virtio_ring: remove API virtqueue_set_dma_premapped

2024-11-04 Thread Jason Wang
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  wrote:
>
> Now, this API is useless. remove it.
>
> Signed-off-by: Xuan Zhuo 

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped

2024-11-04 Thread Jason Wang
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  wrote:
>
> Two APIs are introduced to submit premapped per-buffers.
>
> int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
>  struct scatterlist *sg, unsigned int num,
>  void *data,
>  void *ctx,
>  bool premapped,
>  gfp_t gfp);
>
> int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
>   struct scatterlist *sg, unsigned int num,
>   void *data,
>   bool premapped,
>   gfp_t gfp);
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 48 
>  include/linux/virtio.h   | 13 ++
>  2 files changed, 61 insertions(+)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index a89295b79e66..525308d82728 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -2272,6 +2272,29 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>
> +/**
> + * virtqueue_add_outbuf_premapped - expose output buffers to other end
> + * @vq: the struct virtqueue we're talking about.
> + * @sg: scatterlist (must be well-formed and terminated!)
> + * @num: the number of entries in @sg readable by other side
> + * @data: the token identifying the buffer.
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
> +  struct scatterlist *sg, unsigned int num,
> +  void *data,
> +  bool premapped,

We don't need this parameter; consider:

1) we already have virtqueue_add_outbuf(), which implies the buf has
been mapped
2) there is no explanation for "premapped" in the function doc

Thanks

> +  gfp_t gfp)
> +{
> +   return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, premapped, gfp);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_premapped);
> +
>  /**
>   * virtqueue_add_inbuf - expose input buffers to other end
>   * @vq: the struct virtqueue we're talking about.
> @@ -2318,6 +2341,31 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
>
> +/**
> + * virtqueue_add_inbuf_premapped - expose input buffers to other end
> + * @vq: the struct virtqueue we're talking about.
> + * @sg: scatterlist (must be well-formed and terminated!)
> + * @num: the number of entries in @sg writable by other side
> + * @data: the token identifying the buffer.
> + * @ctx: extra context for the token
> + * @gfp: how to do memory allocations (if necessary).
> + *
> + * Caller must ensure we don't call this with other virtqueue operations
> + * at the same time (except where noted).
> + *
> + * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
> + */
> +int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
> + struct scatterlist *sg, unsigned int num,
> + void *data,
> + void *ctx,
> + bool premapped,
> + gfp_t gfp)
> +{
> +   return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, premapped, gfp);
> +}
> +EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_premapped);
> +
>  /**
>   * virtqueue_dma_dev - get the dma dev
>   * @_vq: the struct virtqueue we're talking about.
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 306137a15d07..19afa49b92d0 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -56,6 +56,19 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
> void *ctx,
> gfp_t gfp);
>
> +int virtqueue_add_inbuf_premapped(struct virtqueue *vq,
> + struct scatterlist *sg, unsigned int num,
> + void *data,
> + void *ctx,
> + bool premapped,
> + gfp_t gfp);
> +
> +int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
> +  struct scatterlist *sg, unsigned int num,
> +  void *data,
> +  bool premapped,
> +  gfp_t gfp);
> +
>  int virtqueue_add_sgs(struct virtqueue *vq,
>   struct scatterlist *sgs[],
>   unsigned int out_sgs,
> --
> 2.32.0.3.g01195cf9f
>




Re: [PATCH net-next v2 04/13] virtio_ring: perform premapped operations based on per-buffer

2024-11-04 Thread Jason Wang
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  wrote:
>
> The current configuration sets the virtqueue (vq) to premapped mode,
> implying that all buffers submitted to this queue must be mapped ahead
> of time. This presents a challenge for the virtnet send queue (sq): the
> virtnet driver would be required to keep track of dma information for vq
> size * 17,

Even worse as MAX_SKB_FRAGS could be even larger.

> which can be substantial. However, if the premapped mode were
> applied on a per-buffer basis, the complexity would be greatly reduced.
> With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
> skb buffers could remain unmapped.
>
> And consider that some sgs are not generated by the virtio driver,
> that may be passed from the block stack. So we can not change the
> sgs, new APIs are the better way.

I had some new thoughts on this.

Pre-mapping makes a lot of sense as it allows us to move the
expensive DMA mapping operations (swiotlb, IOMMU or VDUSE) out of the
per-vq lock. I wonder what we would do if someday we want to convert
the skb TX path to be premapped (or even all virtio drivers).

Considering we already use skb_to_sgvec() in start_xmit(), it looks
like we need to allocate sg[queue_num][MAX_SKB_FRAGS + 2] and store
the dma mapping there. Then we probably don't even need to duplicate
the dma mapping information in the core.

It means it's the driver's job to store the dma information via the sg,
so we can simply use dma_map_sg() in add_sgs(), which allows various
optimizations in the IOMMU layers.
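
A rough sketch of that direction, purely illustrative (the helper, its arguments
and the error handling are assumptions, not part of this series):

/* Premap an skb's data with a single dma_map_sg() call, outside the
 * virtqueue lock.  sq->sg is assumed to be the per-send-queue
 * scatterlist of MAX_SKB_FRAGS + 2 entries mentioned above.
 */
static int virtnet_premap_skb(struct device *dma_dev, struct send_queue *sq,
			      struct sk_buff *skb)
{
	int nents;

	sg_init_table(sq->sg, MAX_SKB_FRAGS + 2);
	nents = skb_to_sgvec(skb, sq->sg, 0, skb->len);
	if (nents < 0)
		return nents;

	/* One call maps the whole chain and lets the IOMMU layer merge
	 * and optimize the entries.
	 */
	if (!dma_map_sg(dma_dev, sq->sg, nents, DMA_TO_DEVICE))
		return -ENOMEM;

	/* sq->sg now carries the dma_address/dma_len pairs and could be
	 * submitted as a premapped buffer.
	 */
	return nents;
}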

>
> So we pass the new argument 'premapped' to indicate the buffers
> submitted to virtio are premapped in advance. Additionally,
> DMA unmap operations for these buffers will be bypassed.
>
> Suggested-by: Jason Wang 
> Signed-off-by: Xuan Zhuo 

Anyhow

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/virtio/virtio_ring.c | 79 ++--
>  1 file changed, 40 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 628e01af1c9a..a89295b79e66 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -300,9 +300,10 @@ static bool vring_use_dma_api(const struct virtio_device 
> *vdev)
> return false;
>  }
>
> -static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring)
> +static bool vring_need_unmap_buffer(const struct vring_virtqueue *vring,
> +   const struct vring_desc_extra *extra)
>  {
> -   return vring->use_dma_api && !vring->premapped;
> +   return vring->use_dma_api && (extra->addr != DMA_MAPPING_ERROR);
>  }
>
>  size_t virtio_max_dma_size(const struct virtio_device *vdev)
> @@ -372,9 +373,10 @@ static struct device *vring_dma_dev(const struct 
> vring_virtqueue *vq)
>
>  /* Map one sg entry. */
>  static int vring_map_one_sg(const struct vring_virtqueue *vq, struct 
> scatterlist *sg,
> -   enum dma_data_direction direction, dma_addr_t 
> *addr)
> +   enum dma_data_direction direction, dma_addr_t 
> *addr,
> +   bool premapped)
>  {
> -   if (vq->premapped) {
> +   if (premapped) {
> *addr = sg_dma_address(sg);
> return 0;
> }
> @@ -465,7 +467,7 @@ static unsigned int vring_unmap_one_split(const struct 
> vring_virtqueue *vq,
>  (flags & VRING_DESC_F_WRITE) ?
>  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> } else {
> -   if (!vring_need_unmap_buffer(vq))
> +   if (!vring_need_unmap_buffer(vq, extra))
> goto out;
>
> dma_unmap_page(vring_dma_dev(vq),
> @@ -514,7 +516,7 @@ static inline unsigned int 
> virtqueue_add_desc_split(struct virtqueue *vq,
> unsigned int i,
> dma_addr_t addr,
> unsigned int len,
> -   u16 flags)
> +   u16 flags, bool premapped)
>  {
> u16 next;
>
> @@ -522,7 +524,7 @@ static inline unsigned int 
> virtqueue_add_desc_split(struct virtqueue *vq,
> desc[i].addr = cpu_to_virtio64(vq->vdev, addr);
> desc[i].len = cpu_to_virtio32(vq->vdev, len);
>
> -   extra[i].addr = addr;
> +   extra[i].addr = premapped ? DMA_MAPPING_ERROR : addr;
> extra[i].len = len;
> extra[i].flags = flags;
>
> @@ -540,6 +542,7 @@ static inline int virtqueue_add_split(struct virtqueue 
> *_vq,
>   unsigned int in_sgs,
>   void *data,
>   void *ctx,
> + bool premapped,
>   gfp_t gfp)
>  {
> s

Re: [PATCH net-next v2 02/13] virtio_ring: split: record extras for indirect buffers

2024-11-04 Thread Jason Wang
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  wrote:
>
> The subsequent commit needs to know whether every indirect buffer is
> premapped or not. So we need to introduce an extra struct for every
> indirect buffer to record this info.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 112 ---
>  1 file changed, 52 insertions(+), 60 deletions(-)

Do we have a performance impact for this patch?

>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 97590c201aa2..dca093744fe1 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -69,7 +69,11 @@
>
>  struct vring_desc_state_split {
> void *data; /* Data for callback. */
> -   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> +
> +   /* Indirect extra table and desc table, if any. These two will be
> +* allocated together. So we won't stress more to the memory 
> allocator.
> +*/
> +   struct vring_desc *indir_desc;

So it looks like we put a descriptor table after the extra table. Can
this lead to more crossing page mappings for the indirect descriptors?

If yes, it seems expensive so we probably need to make the descriptor
table come first.

>  };
>
>  struct vring_desc_state_packed {
> @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue *vq, 
> u32 num)
>   * Split ring specific functions - *_split().
>   */
>
> -static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
> -  const struct vring_desc *desc)
> -{
> -   u16 flags;
> -
> -   if (!vring_need_unmap_buffer(vq))
> -   return;
> -
> -   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> -
> -   dma_unmap_page(vring_dma_dev(vq),
> -  virtio64_to_cpu(vq->vq.vdev, desc->addr),
> -  virtio32_to_cpu(vq->vq.vdev, desc->len),
> -  (flags & VRING_DESC_F_WRITE) ?
> -  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> -}
> -
>  static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> - unsigned int i)
> + struct vring_desc_extra *extra)
>  {
> -   struct vring_desc_extra *extra = vq->split.desc_extra;
> u16 flags;
>
> -   flags = extra[i].flags;
> +   flags = extra->flags;
>
> if (flags & VRING_DESC_F_INDIRECT) {
> if (!vq->use_dma_api)
> goto out;
>
> dma_unmap_single(vring_dma_dev(vq),
> -extra[i].addr,
> -extra[i].len,
> +extra->addr,
> +extra->len,
>  (flags & VRING_DESC_F_WRITE) ?
>  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> } else {
> @@ -479,22 +465,23 @@ static unsigned int vring_unmap_one_split(const struct 
> vring_virtqueue *vq,
> goto out;
>
> dma_unmap_page(vring_dma_dev(vq),
> -  extra[i].addr,
> -  extra[i].len,
> +  extra->addr,
> +  extra->len,
>(flags & VRING_DESC_F_WRITE) ?
>DMA_FROM_DEVICE : DMA_TO_DEVICE);
> }
>
>  out:
> -   return extra[i].next;
> +   return extra->next;
>  }
>
>  static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
>unsigned int total_sg,
>gfp_t gfp)
>  {
> +   struct vring_desc_extra *extra;
> struct vring_desc *desc;
> -   unsigned int i;
> +   unsigned int i, size;
>
> /*
>  * We require lowmem mappings for the descriptors because
> @@ -503,40 +490,41 @@ static struct vring_desc *alloc_indirect_split(struct 
> virtqueue *_vq,
>  */
> gfp &= ~__GFP_HIGHMEM;
>
> -   desc = kmalloc_array(total_sg, sizeof(struct vring_desc), gfp);
> +   size = sizeof(*desc) * total_sg + sizeof(*extra) * total_sg;
> +
> +   desc = kmalloc(size, gfp);
> if (!desc)
> return NULL;
>
> +   extra = (struct vring_desc_extra *)&desc[total_sg];
> +
> for (i = 0; i < total_sg; i++)
> -   desc[i].next = cpu_to_virtio16(_vq->vdev, i + 1);
> +   extra[i].next = i + 1;
> +
> return desc;
>  }
>
>  static inline unsigned int virtqueue_add_desc_split(struct virtqueue *vq,
> struct vring_desc *desc,
> +   struct vring_desc_extra 
> *extra,
> unsigned int i,
>

Re: [PATCH net-next v2 06/13] virtio-net: rq submits premapped per-buffer

2024-11-04 Thread Jason Wang
On Wed, Oct 30, 2024 at 4:25 PM Xuan Zhuo  wrote:
>
> virtio-net rq submits premapped per-buffer by setting sg page to NULL;
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 24 +---
>  1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 792e9eadbfc3..09757fa408bd 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -542,6 +542,12 @@ static struct sk_buff *ptr_to_skb(void *ptr)
> return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
>  }
>
> +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> +{
> +   sg->dma_address = addr;
> +   sg->length = len;

This may work but I think it's better to reuse existing dma sg helpers like:

sg_dma_address(sg) = addr;
sg_dma_length(sg) = len;

And we probably need to fix the virtio core which only uses
sg_dma_address() but not sg_dma_length().

This helps us to avoid future issues when CONFIG_NEED_SG_DMA_LENGTH is set.

Others look good.

Thanks

> +}
> +
>  static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> bool in_napi, struct virtnet_sq_free_stats *stats)
>  {
> @@ -915,8 +921,7 @@ static void virtnet_rq_init_one_sg(struct receive_queue 
> *rq, void *buf, u32 len)
> addr = dma->addr - sizeof(*dma) + offset;
>
> sg_init_table(rq->sg, 1);
> -   rq->sg[0].dma_address = addr;
> -   rq->sg[0].length = len;
> +   sg_fill_dma(rq->sg, addr, len);
>  }
>
>  static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
> @@ -1068,12 +1073,6 @@ static void check_sq_full_and_disable(struct 
> virtnet_info *vi,
> }
>  }
>
> -static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> -{
> -   sg->dma_address = addr;
> -   sg->length = len;
> -}
> -
>  static struct xdp_buff *buf_to_xdp(struct virtnet_info *vi,
>struct receive_queue *rq, void *buf, u32 
> len)
>  {
> @@ -1354,7 +1353,8 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info 
> *vi, struct receive_queue
> sg_init_table(rq->sg, 1);
> sg_fill_dma(rq->sg, addr, len);
>
> -   err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, xsk_buffs[i], 
> gfp);
> +   err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, 
> xsk_buffs[i],
> +   NULL, true, gfp);
> if (err)
> goto err;
> }
> @@ -2431,7 +2431,8 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> struct receive_queue *rq,
>
> virtnet_rq_init_one_sg(rq, buf, vi->hdr_len + GOOD_PACKET_LEN);
>
> -   err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> +   err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> +   rq->do_dma, gfp);
> if (err < 0) {
> if (rq->do_dma)
> virtnet_rq_unmap(rq, buf, 0);
> @@ -2546,7 +2547,8 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi,
> virtnet_rq_init_one_sg(rq, buf, len);
>
> ctx = mergeable_len_to_ctx(len + room, headroom);
> -   err = virtqueue_add_inbuf_ctx(rq->vq, rq->sg, 1, buf, ctx, gfp);
> +   err = virtqueue_add_inbuf_premapped(rq->vq, rq->sg, 1, buf, ctx,
> +   rq->do_dma, gfp);
> if (err < 0) {
> if (rq->do_dma)
> virtnet_rq_unmap(rq, buf, 0);
> --
> 2.32.0.3.g01195cf9f
>




Re: [PATCH net-next v1 0/4] virtio_net: enable premapped mode by default

2024-11-04 Thread Jakub Kicinski
On Tue, 29 Oct 2024 16:46:11 +0800 Xuan Zhuo wrote:
> In the last linux version, we disabled this feature to fix the
> regress[1].
> 
> The patch set is try to fix the problem and re-enable it.
> 
> More info: 
> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com

Sorry to ping, Michael, Jason we're waiting to hear from you on 
this one.



Re: [PATCH net-next v2 05/13] virtio_ring: introduce add api for premapped

2024-10-30 Thread kernel test robot
Hi Xuan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:
https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/virtio_ring-introduce-vring_need_unmap_buffer/20241030-162739
base:   net-next/main
patch link:
https://lore.kernel.org/r/20241030082453.97310-6-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next v2 05/13] virtio_ring: introduce add api for 
premapped
config: x86_64-buildonly-randconfig-003-20241031 
(https://download.01.org/0day-ci/archive/20241031/202410310925.lucycrtj-...@intel.com/config)
compiler: clang version 19.1.2 (https://github.com/llvm/llvm-project 
7ba7d8e2f7b6445b60679da826210cdde29eaf8b)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20241031/202410310925.lucycrtj-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202410310925.lucycrtj-...@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/virtio/virtio_ring.c:2293: warning: Function parameter or struct 
>> member 'premapped' not described in 'virtqueue_add_outbuf_premapped'
>> drivers/virtio/virtio_ring.c:2364: warning: Function parameter or struct 
>> member 'premapped' not described in 'virtqueue_add_inbuf_premapped'
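
Presumably both warnings go away once the new parameter is documented in the two
kernel-doc blocks, e.g. (suggested wording only, not taken from the patch):

/**
 * virtqueue_add_inbuf_premapped - expose input buffers to other end
 * @vq: the struct virtqueue we're talking about.
 * @sg: scatterlist (must be well-formed and terminated!)
 * @num: the number of entries in @sg writable by other side
 * @data: the token identifying the buffer.
 * @ctx: extra context for the token
 * @premapped: true if the caller has already DMA-mapped the buffers in @sg
 * @gfp: how to do memory allocations (if necessary).
 */

The outbuf variant would need the same @premapped line.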


vim +2293 drivers/virtio/virtio_ring.c

  2274  
  2275  /**
  2276   * virtqueue_add_outbuf_premapped - expose output buffers to other end
  2277   * @vq: the struct virtqueue we're talking about.
  2278   * @sg: scatterlist (must be well-formed and terminated!)
  2279   * @num: the number of entries in @sg readable by other side
  2280   * @data: the token identifying the buffer.
  2281   * @gfp: how to do memory allocations (if necessary).
  2282   *
  2283   * Caller must ensure we don't call this with other virtqueue operations
  2284   * at the same time (except where noted).
  2285   *
  2286   * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
  2287   */
  2288  int virtqueue_add_outbuf_premapped(struct virtqueue *vq,
  2289 struct scatterlist *sg, unsigned int 
num,
  2290 void *data,
  2291 bool premapped,
  2292 gfp_t gfp)
> 2293  {
  2294  return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, premapped, 
gfp);
  2295  }
  2296  EXPORT_SYMBOL_GPL(virtqueue_add_outbuf_premapped);
  2297  
  2298  /**
  2299   * virtqueue_add_inbuf - expose input buffers to other end
  2300   * @vq: the struct virtqueue we're talking about.
  2301   * @sg: scatterlist (must be well-formed and terminated!)
  2302   * @num: the number of entries in @sg writable by other side
  2303   * @data: the token identifying the buffer.
  2304   * @gfp: how to do memory allocations (if necessary).
  2305   *
  2306   * Caller must ensure we don't call this with other virtqueue operations
  2307   * at the same time (except where noted).
  2308   *
  2309   * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
  2310   */
  2311  int virtqueue_add_inbuf(struct virtqueue *vq,
  2312  struct scatterlist *sg, unsigned int num,
  2313  void *data,
  2314  gfp_t gfp)
  2315  {
  2316  return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, false, 
gfp);
  2317  }
  2318  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
  2319  
  2320  /**
  2321   * virtqueue_add_inbuf_ctx - expose input buffers to other end
  2322   * @vq: the struct virtqueue we're talking about.
  2323   * @sg: scatterlist (must be well-formed and terminated!)
  2324   * @num: the number of entries in @sg writable by other side
  2325   * @data: the token identifying the buffer.
  2326   * @ctx: extra context for the token
  2327   * @gfp: how to do memory allocations (if necessary).
  2328   *
  2329   * Caller must ensure we don't call this with other virtqueue operations
  2330   * at the same time (except where noted).
  2331   *
  2332   * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
  2333   */
  2334  int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
  2335  struct scatterlist *sg, unsigned int num,
  2336  void *data,
  2337  void *ctx,
  2338  gfp_t gfp)
  2339  {
  2340  return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, false, gfp);
  2341  }
  2342  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
  2343  
  2344  /**
  2345   * virtqueue_add_inbuf_premapped - expose input buffers to other end
  2346   * @vq: the struct virtqueue we're talking about.
  2347   * @sg: scatterlist (must be well-formed and terminated!)
  2348   * @num: the number of entries in @sg writable by other side
  2349   * @data: the token

Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc

2024-10-25 Thread Simon Horman
On Fri, Oct 25, 2024 at 10:35:53AM +0800, Xuan Zhuo wrote:
> On Thu, 17 Oct 2024 15:42:59 +0200, Paolo Abeni  wrote:
> >
> >
> > On 10/14/24 05:12, Xuan Zhuo wrote:
> > > When the frag just got a page, then may lead to regression on VM.
> > > Specially if the sysctl net.core.high_order_alloc_disable value is 1,
> > > then the frag always get a page when do refill.
> > >
> > > Which could see reliable crashes or scp failure (scp a file 100M in size
> > > to VM):
> > >
> > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> > > of a new frag. When the frag size is larger than PAGE_SIZE,
> > > everything is fine. However, if the frag is only one page and the
> > > total size of the buffer and virtnet_rq_dma is larger than one page, an
> > > overflow may occur.
> > >
> > > Here, when the frag size is not enough, we reduce the buffer len to fix
> > > this problem.
> > >
> > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > > use_dma_api")
> > > Reported-by: "Si-Wei Liu" 
> > > Signed-off-by: Xuan Zhuo 
> >
> > This looks like a fix that should target the net tree, but the following
> > patches looks like net-next material. Any special reason to bundle them
> > together?
> 
> Sorry, I forgot to add net-next as a target tree.
> 
> This may look like a fix. But the feature was disabled in the last Linux
> version. So the bug cannot be triggered, so we don't need to push to the net
> tree.

I think it would be useful to be clear in the commit message, use of tags,
and target tree regarding fixes and non-fixes.

Please describe in the commit message why this is not fixing a bug, as you
have described above. And please do not include Fixes tags in patches that
are not bug fixes, which seems to be the case here.

If you want to refer to the patch that introduced the problem, you can use
the following syntax, in the commit message, before the tags. Unlike Fixes
tags, this may be line wrapped.

  This problem is not a bug fix for net because... It was introduced by
  commit f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
use_dma_api").

  Reported-by: ...
  ...



Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc

2024-10-24 Thread Xuan Zhuo
On Fri, 18 Oct 2024 15:41:41 +0800, Jason Wang  wrote:
> On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo  wrote:
> >
> > When the frag just got a page, then may lead to regression on VM.
> > Specially if the sysctl net.core.high_order_alloc_disable value is 1,
> > then the frag always get a page when do refill.
> >
> > Which could see reliable crashes or scp failure (scp a file 100M in size
> > to VM):
> >
> > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> > of a new frag. When the frag size is larger than PAGE_SIZE,
> > everything is fine. However, if the frag is only one page and the
> > total size of the buffer and virtnet_rq_dma is larger than one page, an
> > overflow may occur.
> >
> > Here, when the frag size is not enough, we reduce the buffer len to fix
> > this problem.
> >
> > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > use_dma_api")
> > Reported-by: "Si-Wei Liu" 
> > Signed-off-by: Xuan Zhuo 
>
> Though this may fix the problem and we need it now, I would like to
> have some tweaks in the future, basically because of the security
> implication of sharing driver metadata with the device (most IOMMUs
> can only do protection at page granularity). So we may end up
> with device-triggerable behaviour. For safety, we should use
> driver-private memory for DMA metadata.
>
> > ---
> >  drivers/net/virtio_net.c | 12 +---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index f8131f92a392..59a99bbaf852 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -926,9 +926,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> > u32 size, gfp_t gfp)
> > void *buf, *head;
> > dma_addr_t addr;
> >
> > -   if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> > -   return NULL;
> > -
> > head = page_address(alloc_frag->page);
> >
> > if (rq->do_dma) {
> > @@ -2423,6 +2420,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> > struct receive_queue *rq,
> > len = SKB_DATA_ALIGN(len) +
> >   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >
> > +   if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> > +   return -ENOMEM;
> > +
> > buf = virtnet_rq_alloc(rq, len, gfp);
> > if (unlikely(!buf))
> > return -ENOMEM;
> > @@ -2525,6 +2525,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> > *vi,
> >  */
> > len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
> >
> > +   if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> > +   return -ENOMEM;
>
> This makes me think of another question: how could we guarantee len +
> room is less than or equal to PAGE_SIZE, especially considering we need
> extra head and tailroom for XDP? If we can't, it is a bug:

get_mergeable_buf_len() does this.

>
> """
> /**
>  * skb_page_frag_refill - check that a page_frag contains enough room
>  * @sz: minimum size of the fragment we want to get
>  * @pfrag: pointer to page_frag
>  * @gfp: priority for memory allocation
>  *
>  * Note: While this allocator tries to use high order pages, there is
>  * no guarantee that allocations succeed. Therefore, @sz MUST be
>  * less or equal than PAGE_SIZE.
>  */
> """
>
> > +
> > +   if (!alloc_frag->offset && len + room + sizeof(struct 
> > virtnet_rq_dma) > alloc_frag->size)
> > +   len -= sizeof(struct virtnet_rq_dma);
>
> Any reason we need to check alloc_frag->offset?

We just need to check whether the page of the alloc frag is new.
In that case, we need to reserve space to store the DMA metadata.
If the offset is > 0, then we do not need to reserve extra space,
so it is safe.
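
To spell that out, an annotated restatement of the check from the diff above
(the comment is added here for illustration):

/* alloc_frag->offset == 0 means this is a fresh page: the first
 * sizeof(struct virtnet_rq_dma) bytes of the frag will hold the DMA
 * metadata, so shrink the buffer to keep everything within the page.
 * On a partially used frag (offset > 0) that space was already taken
 * into account when the page was first used, so nothing to do.
 */
if (!alloc_frag->offset &&
    len + room + sizeof(struct virtnet_rq_dma) > alloc_frag->size)
	len -= sizeof(struct virtnet_rq_dma);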


>
> > +
> > buf = virtnet_rq_alloc(rq, len + room, gfp);
>
> Btw, as pointed out in previous review, we should have a consistent API:
>
> 1) hide the alloc_frag like virtnet_rq_alloc()
>
> or
>
> 2) pass the alloc_frag to virtnet_rq_alloc()


Now we need to check len + room before calling skb_page_frag_refill(),
so we must move skb_page_frag_refill() outside virtnet_rq_alloc().

Thanks

>
> > if (unlikely(!buf))
> > return -ENOMEM;
> > --
> > 2.32.0.3.g01195cf9f
> >
>
> Thanks
>



Re: [PATCH 4/5] virtio_net: enable premapped mode for merge and small by default

2024-10-24 Thread Xuan Zhuo
On Fri, 18 Oct 2024 16:00:07 +0800, Jason Wang  wrote:
> On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo  wrote:
> >
> > Currently, the virtio core will perform a dma operation for each
> > buffer. Although, the same page may be operated multiple times.
> >
> > In premapped mod, we can perform only one dma operation for the pages of
> > the alloc frag. This is beneficial for the iommu device.
> >
> > kernel command line: intel_iommu=on iommu.passthrough=0
> >
> >|  strict=0  | strict=1
> > Before |  775496pps | 428614pps
> > After  | 1109316pps | 742853pps
> >
> > In the 6.11, we disabled this feature because a regress [1].
> >
> > Now, we fix the problem and re-enable it.
> >
> > [1]: 
> > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/net/virtio_net.c | 17 +
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index cd90e77881df..8cf24b7b58bd 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -6133,6 +6133,21 @@ static int virtnet_alloc_queues(struct virtnet_info 
> > *vi)
> > return -ENOMEM;
> >  }
> >
> > +static void virtnet_rq_set_premapped(struct virtnet_info *vi)
> > +{
> > +   int i;
> > +
> > +   /* disable for big mode */
> > +   if (vi->mode == VIRTNET_MODE_BIG)
> > +   return;
>
> Nitpick: I would like such a check to be done at the caller.

I am ok, if you like.

Thanks.


>
> But anyhow the patch looks good
>
> Acked-by: Jason Wang 
>
> Thanks
>
> > +
> > +   for (i = 0; i < vi->max_queue_pairs; i++) {
> > +   /* error should never happen */
> > +   BUG_ON(virtqueue_set_dma_premapped(vi->rq[i].vq));
> > +   vi->rq[i].do_dma = true;
> > +   }
> > +}
> > +
> >  static int init_vqs(struct virtnet_info *vi)
> >  {
> > int ret;
> > @@ -6146,6 +6161,8 @@ static int init_vqs(struct virtnet_info *vi)
> > if (ret)
> > goto err_free;
> >
> > +   virtnet_rq_set_premapped(vi);
> > +
> > cpus_read_lock();
> > virtnet_set_affinity(vi);
> > cpus_read_unlock();
> > --
> > 2.32.0.3.g01195cf9f
> >
>



Re: [PATCH 2/5] virtio_net: introduce vi->mode

2024-10-24 Thread Xuan Zhuo
On Fri, 18 Oct 2024 15:48:38 +0800, Jason Wang  wrote:
> On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo  wrote:
> >
> > Now, if we want to judge the rx work mode, we have to use such codes:
> >
> > 1. merge mode: vi->mergeable_rx_bufs
> > 2. big mode:   vi->big_packets && !vi->mergeable_rx_bufs
> > 3. small: !vi->big_packets && !vi->mergeable_rx_bufs
> >
> > This is inconvenient and abstract, and we also have this use case:
> >
> > if (vi->mergeable_rx_bufs)
> > 
> > else if (vi->big_packets)
> > 
> > else
> >
> > For this case, I think switch-case is the better choice.
> >
> > So here I introduce vi->mode to record the virtio-net work mode.
> > That is helpful to judge the work mode and choose the branches.
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/net/virtio_net.c | 61 +++-
> >  1 file changed, 47 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 59a99bbaf852..14809b614d62 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -385,6 +385,12 @@ struct control_buf {
> > virtio_net_ctrl_ack status;
> >  };
> >
> > +enum virtnet_mode {
> > +   VIRTNET_MODE_SMALL,
> > +   VIRTNET_MODE_MERGE,
> > +   VIRTNET_MODE_BIG
> > +};
>
> I'm not sure whether this makes things easier or not.
>
> [...]
>
> > +   if (vi->mergeable_rx_bufs)
> > +   vi->mode = VIRTNET_MODE_MERGE;
> > +   else if (vi->big_packets)
> > +   vi->mode = VIRTNET_MODE_BIG;
>
> Maybe we can just say big_packets doesn't mean big mode.
>
> > +   else
> > +   vi->mode = VIRTNET_MODE_SMALL;
> > +
> > if (vi->any_header_sg)
> > dev->needed_headroom = vi->hdr_len;
>
> Anyhow, this doesn't seem to be a fix, so it should be a separate series from patch 1?



I think this series is not about fixing the problem; the feature was disabled in
the last Linux version. This series is about turning the premapped mode of rx
back on.

This commit tries to make things easier. The current code is very unintuitive
when we try to switch on or check the virtio-net rx mode.

Thanks.
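
As a concrete illustration of the readability argument, the kind of branch the
enum enables (a made-up example, not code from the series):

switch (vi->mode) {
case VIRTNET_MODE_MERGE:
	/* mergeable buffers: per-buffer length from the EWMA estimator */
	break;
case VIRTNET_MODE_BIG:
	/* big mode: page-sized buffers chained via skb frags */
	break;
case VIRTNET_MODE_SMALL:
	/* small mode: one buffer per packet */
	break;
}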


>
> Thanks
>
> >
> > --
> > 2.32.0.3.g01195cf9f
> >
>



Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc

2024-10-24 Thread Xuan Zhuo
On Thu, 17 Oct 2024 15:42:59 +0200, Paolo Abeni  wrote:
>
>
> On 10/14/24 05:12, Xuan Zhuo wrote:
> > When the frag just got a page, then may lead to regression on VM.
> > Specially if the sysctl net.core.high_order_alloc_disable value is 1,
> > then the frag always get a page when do refill.
> >
> > Which could see reliable crashes or scp failure (scp a file 100M in size
> > to VM):
> >
> > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> > of a new frag. When the frag size is larger than PAGE_SIZE,
> > everything is fine. However, if the frag is only one page and the
> > total size of the buffer and virtnet_rq_dma is larger than one page, an
> > overflow may occur.
> >
> > Here, when the frag size is not enough, we reduce the buffer len to fix
> > this problem.
> >
> > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > use_dma_api")
> > Reported-by: "Si-Wei Liu" 
> > Signed-off-by: Xuan Zhuo 
>
> This looks like a fix that should target the net tree, but the following
> patches looks like net-next material. Any special reason to bundle them
> together?

Sorry, I forgot to add net-next as a target tree.

This may look like a fix, but the feature was disabled in the last Linux
version, so the bug cannot be triggered and we don't need to push this to
the net tree.

Thanks.

>
> Also, please explicitly include the the target tree in the subj on next
> submissions, thanks!
>
> Paolo
>
>



Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc

2024-10-19 Thread Michael S. Tsirkin
On Thu, Oct 17, 2024 at 03:42:59PM +0200, Paolo Abeni wrote:
> 
> 
> On 10/14/24 05:12, Xuan Zhuo wrote:
> > When the frag just got a page, then may lead to regression on VM.
> > Specially if the sysctl net.core.high_order_alloc_disable value is 1,
> > then the frag always get a page when do refill.
> > 
> > Which could see reliable crashes or scp failure (scp a file 100M in size
> > to VM):
> > 
> > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> > of a new frag. When the frag size is larger than PAGE_SIZE,
> > everything is fine. However, if the frag is only one page and the
> > total size of the buffer and virtnet_rq_dma is larger than one page, an
> > overflow may occur.
> > 
> > Here, when the frag size is not enough, we reduce the buffer len to fix
> > this problem.
> > 
> > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > use_dma_api")
> > Reported-by: "Si-Wei Liu" 
> > Signed-off-by: Xuan Zhuo 
> 
> This looks like a fix that should target the net tree, but the following
> patches looks like net-next material. Any special reason to bundle them
> together?
> 
> Also, please explicitly include the the target tree in the subj on next
> submissions, thanks!
> 
> Paolo

The issue is that net-next is not rebased on net, so if patches 2-5 land
in next and 1 only in net, next will be broken. I think the simplest fix
is just to have 1 on net-next, backport it to net too, and git will
figure it out. Right?

-- 
MST




Re: [PATCH 0/5] virtio_net: enable premapped mode by default

2024-10-19 Thread Michael S. Tsirkin
On Fri, Oct 18, 2024 at 03:59:14PM +0100, Darren Kenny wrote:
> Hi Michael / Xuan Zhuo,
> 
> On Wednesday, 2024-10-16 at 08:55:21 +01, Darren Kenny wrote:
> > Hi Michael,
> >
> > On Monday, 2024-10-14 at 00:57:41 -04, Michael S. Tsirkin wrote:
> >> On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote:
> >>> In the last linux version, we disabled this feature to fix the
> >>> regress[1].
> >>> 
> >>> The patch set is try to fix the problem and re-enable it.
> >>> 
> >>> More info: 
> >>> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> >>> 
> >>> Thanks.
> >>> 
> >>> [1]: 
> >>> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> >>
> >> Darren, you previously reported crashes with a patch very similar to 1/5.
> >> Can you please test this patchset and report whether they
> >> are still observed?
> >> If yes, any data on how to reproduce would be very benefitial for Xuan
> >> Zhuo.
> >>
> >
> > I aim to get to this in the next week, but I don't currently have
> > access to a system to test it, it will take a few days at least before I
> > can get one.
> 
> I finally a managed to get access to a system to test this on, and it
> looks like things are working with this patch-set. So...
> 
> Tested-by: Darren Kenny 
> 
> Thanks,
> 
> Darren.

Thanks a lot Darren!

-- 
MST




Re: [PATCH 0/5] virtio_net: enable premapped mode by default

2024-10-18 Thread Darren Kenny
Hi Michael / Xuan Zhuo,

On Wednesday, 2024-10-16 at 08:55:21 +01, Darren Kenny wrote:
> Hi Michael,
>
> On Monday, 2024-10-14 at 00:57:41 -04, Michael S. Tsirkin wrote:
>> On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote:
>>> In the last linux version, we disabled this feature to fix the
>>> regress[1].
>>> 
>>> The patch set is try to fix the problem and re-enable it.
>>> 
>>> More info: 
>>> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
>>> 
>>> Thanks.
>>> 
>>> [1]: 
>>> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
>>
>> Darren, you previously reported crashes with a patch very similar to 1/5.
>> Can you please test this patchset and report whether they
>> are still observed?
>> If yes, any data on how to reproduce would be very benefitial for Xuan
>> Zhuo.
>>
>
> I aim to get to this in the next week, but I don't currently have
> access to a system to test it, it will take a few days at least before I
> can get one.

I finally managed to get access to a system to test this on, and it
looks like things are working with this patch-set. So...

Tested-by: Darren Kenny 

Thanks,

Darren.



Re: [PATCH 4/5] virtio_net: enable premapped mode for merge and small by default

2024-10-18 Thread Jason Wang
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo  wrote:
>
> Currently, the virtio core will perform a dma operation for each
> buffer. Although, the same page may be operated multiple times.
>
> In premapped mod, we can perform only one dma operation for the pages of
> the alloc frag. This is beneficial for the iommu device.
>
> kernel command line: intel_iommu=on iommu.passthrough=0
>
>|  strict=0  | strict=1
> Before |  775496pps | 428614pps
> After  | 1109316pps | 742853pps
>
> In the 6.11, we disabled this feature because a regress [1].
>
> Now, we fix the problem and re-enable it.
>
> [1]: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 17 +
>  1 file changed, 17 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index cd90e77881df..8cf24b7b58bd 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -6133,6 +6133,21 @@ static int virtnet_alloc_queues(struct virtnet_info 
> *vi)
> return -ENOMEM;
>  }
>
> +static void virtnet_rq_set_premapped(struct virtnet_info *vi)
> +{
> +   int i;
> +
> +   /* disable for big mode */
> +   if (vi->mode == VIRTNET_MODE_BIG)
> +   return;

Nitpick: I would like such a check to be done at the caller.

But anyhow the patch looks good

Acked-by: Jason Wang 

Thanks

> +
> +   for (i = 0; i < vi->max_queue_pairs; i++) {
> +   /* error should never happen */
> +   BUG_ON(virtqueue_set_dma_premapped(vi->rq[i].vq));
> +   vi->rq[i].do_dma = true;
> +   }
> +}
> +
>  static int init_vqs(struct virtnet_info *vi)
>  {
> int ret;
> @@ -6146,6 +6161,8 @@ static int init_vqs(struct virtnet_info *vi)
> if (ret)
> goto err_free;
>
> +   virtnet_rq_set_premapped(vi);
> +
> cpus_read_lock();
> virtnet_set_affinity(vi);
> cpus_read_unlock();
> --
> 2.32.0.3.g01195cf9f
>




Re: [PATCH 2/5] virtio_net: introduce vi->mode

2024-10-18 Thread Jason Wang
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo  wrote:
>
> Now, if we want to judge the rx work mode, we have to use such codes:
>
> 1. merge mode: vi->mergeable_rx_bufs
> 2. big mode:   vi->big_packets && !vi->mergeable_rx_bufs
> 3. small: !vi->big_packets && !vi->mergeable_rx_bufs
>
> This is inconvenient and abstract, and we also have this use case:
>
> if (vi->mergeable_rx_bufs)
> 
> else if (vi->big_packets)
> 
> else
>
> For this case, I think switch-case is the better choice.
>
> So here I introduce vi->mode to record the virtio-net work mode.
> That is helpful to judge the work mode and choose the branches.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 61 +++-
>  1 file changed, 47 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 59a99bbaf852..14809b614d62 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -385,6 +385,12 @@ struct control_buf {
> virtio_net_ctrl_ack status;
>  };
>
> +enum virtnet_mode {
> +   VIRTNET_MODE_SMALL,
> +   VIRTNET_MODE_MERGE,
> +   VIRTNET_MODE_BIG
> +};

I'm not sure whether this makes things easier or not.

[...]

> +   if (vi->mergeable_rx_bufs)
> +   vi->mode = VIRTNET_MODE_MERGE;
> +   else if (vi->big_packets)
> +   vi->mode = VIRTNET_MODE_BIG;

Maybe we can just say big_packets doesn't mean big mode.

> +   else
> +   vi->mode = VIRTNET_MODE_SMALL;
> +
> if (vi->any_header_sg)
> dev->needed_headroom = vi->hdr_len;

Anyhow, this seems not to be a fix, so it should be a separate series from patch 1?

Thanks

>
> --
> 2.32.0.3.g01195cf9f
>




Re: [PATCH 5/5] virtio_net: rx remove premapped failover code

2024-10-18 Thread Jason Wang
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo  wrote:
>
> Now, the premapped mode can be enabled unconditionally.
>
> So we can remove the failover code for merge and small mode.
>
> Signed-off-by: Xuan Zhuo 

Let's be more verbose here. For example, the virtnet_rq_xxx() helpers
would only be used when the queue is in premapped mode.

And it might be worth adding a check to prevent misuse of the API.
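
A minimal sketch of such a guard (illustrative only; it reuses the rq->do_dma
flag from patch 4 and elides the body of the helper):

        static void *virtnet_rq_alloc(struct receive_queue *rq, u32 size, gfp_t gfp)
        {
                /* The virtnet_rq_*() helpers assume a premapped rq. */
                if (WARN_ON_ONCE(!rq->do_dma))
                        return NULL;

                /* ... existing premapped allocation path ... */
        }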

Others look good to me.

Thanks




Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc

2024-10-18 Thread Jason Wang
On Mon, Oct 14, 2024 at 11:12 AM Xuan Zhuo  wrote:
>
> When the frag just got a page, then may lead to regression on VM.
> Specially if the sysctl net.core.high_order_alloc_disable value is 1,
> then the frag always get a page when do refill.
>
> Which could see reliable crashes or scp failure (scp a file 100M in size
> to VM):
>
> The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> of a new frag. When the frag size is larger than PAGE_SIZE,
> everything is fine. However, if the frag is only one page and the
> total size of the buffer and virtnet_rq_dma is larger than one page, an
> overflow may occur.
>
> Here, when the frag size is not enough, we reduce the buffer len to fix
> this problem.
>
> Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> use_dma_api")
> Reported-by: "Si-Wei Liu" 
> Signed-off-by: Xuan Zhuo 

Though this may fix the problem and we need it now, I would like to
have some tweaks in the future, basically because of the security
implication of sharing driver metadata with the device (most IOMMUs
can only do protection at the granularity of a page). So we may end up
with device-triggerable behaviour. For safety, we should use driver
private memory for DMA metadata.

> ---
>  drivers/net/virtio_net.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f8131f92a392..59a99bbaf852 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -926,9 +926,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> u32 size, gfp_t gfp)
> void *buf, *head;
> dma_addr_t addr;
>
> -   if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> -   return NULL;
> -
> head = page_address(alloc_frag->page);
>
> if (rq->do_dma) {
> @@ -2423,6 +2420,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> struct receive_queue *rq,
> len = SKB_DATA_ALIGN(len) +
>   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>
> +   if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> +   return -ENOMEM;
> +
> buf = virtnet_rq_alloc(rq, len, gfp);
> if (unlikely(!buf))
> return -ENOMEM;
> @@ -2525,6 +2525,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi,
>  */
> len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>
> +   if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> +   return -ENOMEM;

This makes me think of another question: how can we guarantee that len +
room is less than or equal to PAGE_SIZE, especially considering we need
extra headroom and tailroom for XDP? If we can't, it is a bug:

"""
/**
 * skb_page_frag_refill - check that a page_frag contains enough room
 * @sz: minimum size of the fragment we want to get
 * @pfrag: pointer to page_frag
 * @gfp: priority for memory allocation
 *
 * Note: While this allocator tries to use high order pages, there is
 * no guarantee that allocations succeed. Therefore, @sz MUST be
 * less or equal than PAGE_SIZE.
 */
"""

> +
> +   if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) 
> > alloc_frag->size)
> +   len -= sizeof(struct virtnet_rq_dma);

Any reason we need to check alloc_frag->offset?

> +
> buf = virtnet_rq_alloc(rq, len + room, gfp);

Btw, as pointed out in a previous review, we should have a consistent API:

1) hide the alloc_frag like virtnet_rq_alloc()

or

2) pass the alloc_frag to virtnet_rq_alloc()

> if (unlikely(!buf))
> return -ENOMEM;
> --
> 2.32.0.3.g01195cf9f
>

Thanks




Re: [PATCH 1/5] virtio-net: fix overflow inside virtnet_rq_alloc

2024-10-17 Thread Paolo Abeni




On 10/14/24 05:12, Xuan Zhuo wrote:

When the frag just got a page, then may lead to regression on VM.
Specially if the sysctl net.core.high_order_alloc_disable value is 1,
then the frag always get a page when do refill.

Which could see reliable crashes or scp failure (scp a file 100M in size
to VM):

The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
of a new frag. When the frag size is larger than PAGE_SIZE,
everything is fine. However, if the frag is only one page and the
total size of the buffer and virtnet_rq_dma is larger than one page, an
overflow may occur.

Here, when the frag size is not enough, we reduce the buffer len to fix
this problem.
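
(For concreteness, a rough worked example: assume PAGE_SIZE is 4096 and, as
noted above, the virtnet_rq_dma metadata takes 16 bytes at the start of a
fresh frag. If the refill hands back a single 4096-byte page and the requested
buffer is, say, 4088 bytes, then 16 + 4088 > 4096 and the buffer runs past the
end of the page. Trimming the buffer length by the metadata size keeps both
within the page.)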

Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api")
Reported-by: "Si-Wei Liu" 
Signed-off-by: Xuan Zhuo 


This looks like a fix that should target the net tree, but the following 
patches look like net-next material. Any special reason to bundle them 
together?


Also, please explicitly include the target tree in the subject on next 
submissions, thanks!


Paolo




Re: [PATCH 0/5] virtio_net: enable premapped mode by default

2024-10-16 Thread Darren Kenny
Hi Michael,

On Monday, 2024-10-14 at 00:57:41 -04, Michael S. Tsirkin wrote:
> On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote:
>> In the last linux version, we disabled this feature to fix the
>> regress[1].
>> 
>> The patch set is try to fix the problem and re-enable it.
>> 
>> More info: 
>> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
>> 
>> Thanks.
>> 
>> [1]: 
>> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
>
> Darren, you previously reported crashes with a patch very similar to 1/5.
> Can you please test this patchset and report whether they
> are still observed?
> If yes, any data on how to reproduce would be very benefitial for Xuan
> Zhuo.
>

I aim to get to this in the next week, but I don't currently have
access to a system to test it; it will take a few days at least before I
can get one.

Thanks,

Darren.


>
>> Xuan Zhuo (5):
>>   virtio-net: fix overflow inside virtnet_rq_alloc
>>   virtio_net: introduce vi->mode
>>   virtio_net: big mode skip the unmap check
>>   virtio_net: enable premapped mode for merge and small by default
>>   virtio_net: rx remove premapped failover code
>> 
>>  drivers/net/virtio_net.c | 168 ---
>>  1 file changed, 105 insertions(+), 63 deletions(-)
>> 
>> --
>> 2.32.0.3.g01195cf9f



Re: [PATCH 0/5] virtio_net: enable premapped mode by default

2024-10-13 Thread Michael S. Tsirkin
On Mon, Oct 14, 2024 at 11:12:29AM +0800, Xuan Zhuo wrote:
> In the last linux version, we disabled this feature to fix the
> regress[1].
> 
> The patch set is try to fix the problem and re-enable it.
> 
> More info: 
> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> 
> Thanks.
> 
> [1]: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com

Darren, you previously reported crashes with a patch very similar to 1/5.
Can you please test this patchset and report whether they
are still observed?
If yes, any data on how to reproduce would be very beneficial for Xuan
Zhuo.


> Xuan Zhuo (5):
>   virtio-net: fix overflow inside virtnet_rq_alloc
>   virtio_net: introduce vi->mode
>   virtio_net: big mode skip the unmap check
>   virtio_net: enable premapped mode for merge and small by default
>   virtio_net: rx remove premapped failover code
> 
>  drivers/net/virtio_net.c | 168 ---
>  1 file changed, 105 insertions(+), 63 deletions(-)
> 
> --
> 2.32.0.3.g01195cf9f




Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"

2024-10-09 Thread Michael S. Tsirkin
On Wed, Oct 09, 2024 at 06:00:47PM +0800, Xuan Zhuo wrote:
> On Wed, 9 Oct 2024 05:29:35 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote:
> > > Regression: 
> > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> > >
> > > I still think that the patch can fix the problem, I hope Darren can 
> > > re-test it
> > > or give me more info.
> > >
> > > 
> > > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> > >
> > > If that can not work or Darren can not reply in time, Michael you can try 
> > > this
> > > patch set.
> > >
> > > Thanks.
> >
> > It's been a month - were you going to post patches fixing bugs in premap
> > and then re-enabling this?
> 
> [1] was tried to fix the bug in premapped mode.
> 
> We all wait the re-test from Darren.
> 
> But Jason and Takero have tested it.
> 
> If you do not want to wait Darren, we can discuss [1] or add it to your next
> branch for more test.
> 
> Thanks.
> 
> [1]: 
> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> 
> 
> >
> > Thanks!
> >
> >
> > > Xuan Zhuo (3):
> > >   Revert "virtio_net: rx remove premapped failover code"
> > >   Revert "virtio_net: big mode skip the unmap check"
> > >   virtio_net: disable premapped mode by default
> > >
> > >  drivers/net/virtio_net.c | 95 +++-
> > >  1 file changed, 46 insertions(+), 49 deletions(-)
> > >
> > > --
> > > 2.32.0.3.g01195cf9f
> > >
> >

I suggest you post a patchset you want me to apply on top of master.

-- 
MST




Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"

2024-10-09 Thread Xuan Zhuo
On Wed, 9 Oct 2024 05:29:35 -0400, "Michael S. Tsirkin"  wrote:
> On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote:
> > Regression: 
> > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> >
> > I still think that the patch can fix the problem, I hope Darren can re-test 
> > it
> > or give me more info.
> >
> > 
> > http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> >
> > If that can not work or Darren can not reply in time, Michael you can try 
> > this
> > patch set.
> >
> > Thanks.
>
> It's been a month - were you going to post patches fixing bugs in premap
> and then re-enabling this?

[1] tried to fix the bug in premapped mode.

We are all waiting for the re-test from Darren.

But Jason and Takero have tested it.

If you do not want to wait for Darren, we can discuss [1] or add it to your next
branch for more testing.

Thanks.

[1]: 
http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com


>
> Thanks!
>
>
> > Xuan Zhuo (3):
> >   Revert "virtio_net: rx remove premapped failover code"
> >   Revert "virtio_net: big mode skip the unmap check"
> >   virtio_net: disable premapped mode by default
> >
> >  drivers/net/virtio_net.c | 95 +++-
> >  1 file changed, 46 insertions(+), 49 deletions(-)
> >
> > --
> > 2.32.0.3.g01195cf9f
> >
>



Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"

2024-10-09 Thread Michael S. Tsirkin
On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote:
> Regression: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> 
> I still think that the patch can fix the problem, I hope Darren can re-test it
> or give me more info.
> 
> 
> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> 
> If that can not work or Darren can not reply in time, Michael you can try this
> patch set.
> 
> Thanks.

It's been a month - were you going to post patches fixing bugs in premap
and then re-enabling this?

Thanks!


> Xuan Zhuo (3):
>   Revert "virtio_net: rx remove premapped failover code"
>   Revert "virtio_net: big mode skip the unmap check"
>   virtio_net: disable premapped mode by default
> 
>  drivers/net/virtio_net.c | 95 +++-
>  1 file changed, 46 insertions(+), 49 deletions(-)
> 
> --
> 2.32.0.3.g01195cf9f
> 




Re: [RFC net-next v1 10/12] virtio_net: xsk: tx: support xmit xsk buffer

2024-09-24 Thread Xuan Zhuo
On Tue, 24 Sep 2024 15:35:08 +0800, Jason Wang  wrote:
> On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
> >
> > The driver's tx napi is very important for XSK. It is responsible for
> > obtaining data from the XSK queue and sending it out.
> >
> > At the beginning, we need to trigger tx napi.
> >
> > virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk
> > buffer) by the last bits of the pointer.
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/net/virtio_net.c | 176 ---
> >  1 file changed, 166 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 3ad4c6e3ef18..1a870f1df910 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -83,6 +83,7 @@ struct virtnet_sq_free_stats {
> > u64 bytes;
> > u64 napi_packets;
> > u64 napi_bytes;
> > +   u64 xsk;
> >  };
> >
> >  struct virtnet_sq_stats {
> > @@ -514,16 +515,20 @@ static struct sk_buff *virtnet_skb_append_frag(struct 
> > sk_buff *head_skb,
> >struct sk_buff *curr_skb,
> >struct page *page, void *buf,
> >int len, int truesize);
> > +static void virtnet_xsk_completed(struct send_queue *sq, int num);
> >
> >  enum virtnet_xmit_type {
> > VIRTNET_XMIT_TYPE_SKB,
> > VIRTNET_XMIT_TYPE_SKB_ORPHAN,
> > VIRTNET_XMIT_TYPE_XDP,
> > +   VIRTNET_XMIT_TYPE_XSK,
> >  };
> >
> >  /* We use the last two bits of the pointer to distinguish the xmit type. */
> >  #define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1))
> >
> > +#define VIRTIO_XSK_FLAG_OFFSET 4
>
> Any reason this is not 2?

There's no particular reason for this; any offset that leaves the two type bits free would work.
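
A quick sketch of the round trip with OFFSET == 4 (illustrative numbers, not
taken from the patch):

        /* encode: p = (len << 4) | VIRTNET_XMIT_TYPE_XSK
         * decode: type = p & VIRTNET_XMIT_TYPE_MASK; len = p >> 4;
         *
         * e.g. len = 1500 -> p = 0x5dc3, and p >> 4 gives 1500 back.
         * Any shift of at least 2 keeps the length clear of the type bits.
         */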


>
> > +
> >  static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr)
> >  {
> > unsigned long p = (unsigned long)*ptr;
> > @@ -546,6 +551,11 @@ static int virtnet_add_outbuf(struct send_queue *sq, 
> > int num, void *data,
> > GFP_ATOMIC);
> >  }
> >
> > +static u32 virtnet_ptr_to_xsk_buff_len(void *ptr)
> > +{
> > +   return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
> > +}
> > +
> >  static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> >  {
> > sg_assign_page(sg, NULL);
> > @@ -587,11 +597,27 @@ static void __free_old_xmit(struct send_queue *sq, 
> > struct netdev_queue *txq,
> > stats->bytes += xdp_get_frame_len(frame);
> > xdp_return_frame(frame);
> > break;
> > +
> > +   case VIRTNET_XMIT_TYPE_XSK:
> > +   stats->bytes += virtnet_ptr_to_xsk_buff_len(ptr);
> > +   stats->xsk++;
> > +   break;
> > }
> > }
> > netdev_tx_completed_queue(txq, stats->napi_packets, 
> > stats->napi_bytes);
>
> Not related to this patch, but this seems unnecessary to AF_XDP.

YES.

netdev_tx_completed_queue() checks napi_bytes first,
so I do not think we need to do anything for this.

>
> >  }
> >
> > +static void virtnet_free_old_xmit(struct send_queue *sq,
> > + struct netdev_queue *txq,
> > + bool in_napi,
> > + struct virtnet_sq_free_stats *stats)
> > +{
> > +   __free_old_xmit(sq, txq, in_napi, stats);
> > +
> > +   if (stats->xsk)
> > +   virtnet_xsk_completed(sq, stats->xsk);
> > +}
> > +
> >  /* Converting between virtqueue no. and kernel tx/rx queue no.
> >   * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
> >   */
> > @@ -1019,7 +1045,7 @@ static void free_old_xmit(struct send_queue *sq, 
> > struct netdev_queue *txq,
> >  {
> > struct virtnet_sq_free_stats stats = {0};
> >
> > -   __free_old_xmit(sq, txq, in_napi, &stats);
> > +   virtnet_free_old_xmit(sq, txq, in_napi, &stats);
> >
> > /* Avoid overhead when no packets have been processed
> >  * happens when called speculatively from start_xmit.
> > @@ -1380,6 +1406,111 @@ static int virtnet_add_recvbuf_xsk(struct 
> > virtnet_info *vi, struct receive_queue
> > return err;
> >  }
> >
> > +static void *virtnet_xsk_to_ptr(u32 len)
> > +{
> > +   unsigned long p;
> > +
> > +   p = len << VIRTIO_XSK_FLAG_OFFSET;
> > +
> > +   return virtnet_xmit_ptr_mix((void *)p, VIRTNET_XMIT_TYPE_XSK);
> > +}
> > +
> > +static int virtnet_xsk_xmit_one(struct send_queue *sq,
> > +   struct xsk_buff_pool *pool,
> > +   struct xdp_desc *desc)
> > +{
> > +   struct virtnet_info *vi;
> > +   dma_addr_t addr;
> > +
> > +   vi = sq->vq->vdev->priv;
> > +
> > +   addr = xsk_buff_raw_get_dma(pool, desc->addr);
> > +   xsk_buff_raw_dma_sync_for_dev

Re: [RFC net-next v1 07/12] virtio_net: refactor the xmit type

2024-09-24 Thread Xuan Zhuo
On Tue, 24 Sep 2024 15:35:03 +0800, Jason Wang  wrote:
> On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
> >
> > Because the af-xdp will introduce a new xmit type, so I refactor the
> > xmit type mechanism first.
> >
> > In general, pointers are aligned to 4 or 8 bytes.
>
> I think this needs some clarification, the alignment seems to depend
> on the lowest common multiple between the alignments of all struct
> members. So we know both xdp_frame and sk_buff are at least 4 bytes
> aligned.
>
> If we want to reuse the lowest bit of pointers in AF_XDP, the
> alignment of the data structure should be at least 4 bytes.

YES, for AF_XDP. See more in #10.


>
> > If it is aligned to 4
> > bytes, then only two bits are free for a pointer. But there are 4 types
> > here, so we can't use bits to distinguish them. And 2 bits is enough for
> > 4 types:
> >
> > 00 for skb
> > 01 for SKB_ORPHAN
> > 10 for XDP
> > 11 for af-xdp tx
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/net/virtio_net.c | 90 +++-
> >  1 file changed, 51 insertions(+), 39 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 630e5b21ad69..41a5ea9b788d 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -45,9 +45,6 @@ module_param(napi_tx, bool, 0644);
> >  #define VIRTIO_XDP_TX  BIT(0)
> >  #define VIRTIO_XDP_REDIR   BIT(1)
> >
> > -#define VIRTIO_XDP_FLAGBIT(0)
> > -#define VIRTIO_ORPHAN_FLAG BIT(1)
> > -
> >  /* RX packet size EWMA. The average packet size is used to determine the 
> > packet
> >   * buffer size when refilling RX rings. As the entire RX ring may be 
> > refilled
> >   * at once, the weight is chosen so that the EWMA will be insensitive to 
> > short-
> > @@ -512,34 +509,35 @@ static struct sk_buff *virtnet_skb_append_frag(struct 
> > sk_buff *head_skb,
> >struct page *page, void *buf,
> >int len, int truesize);
> >
> > -static bool is_xdp_frame(void *ptr)
> > -{
> > -   return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> > -}
> > +enum virtnet_xmit_type {
> > +   VIRTNET_XMIT_TYPE_SKB,
> > +   VIRTNET_XMIT_TYPE_SKB_ORPHAN,
> > +   VIRTNET_XMIT_TYPE_XDP,
> > +};
> >
> > -static void *xdp_to_ptr(struct xdp_frame *ptr)
> > -{
> > -   return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
> > -}
> > +/* We use the last two bits of the pointer to distinguish the xmit type. */
> > +#define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1))
> >
> > -static struct xdp_frame *ptr_to_xdp(void *ptr)
> > +static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr)
>
> Nit: not a native speaker but I think something like pack/unpack might
> be better.


Will fix.

Thanks

>
> With those changes.
>
> Acked-by: Jason Wang 
>
> Thanks
>



Re: [RFC net-next v1 08/12] virtio_net: xsk: bind/unbind xsk for tx

2024-09-24 Thread Xuan Zhuo
On Tue, 24 Sep 2024 15:35:05 +0800, Jason Wang  wrote:
> On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
> >
> > This patch implement the logic of bind/unbind xsk pool to sq and rq.
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/net/virtio_net.c | 53 
> >  1 file changed, 53 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 41a5ea9b788d..7c379614fd22 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -295,6 +295,10 @@ struct send_queue {
> >
> > /* Record whether sq is in reset state. */
> > bool reset;
> > +
> > +   struct xsk_buff_pool *xsk_pool;
> > +
> > +   dma_addr_t xsk_hdr_dma_addr;
> >  };
> >
> >  /* Internal representation of a receive virtqueue */
> > @@ -497,6 +501,8 @@ struct virtio_net_common_hdr {
> > };
> >  };
> >
> > +static struct virtio_net_common_hdr xsk_hdr;
> > +
> >  static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
> >  static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff 
> > *xdp,
> >struct net_device *dev,
> > @@ -5488,6 +5494,29 @@ static int virtnet_rq_bind_xsk_pool(struct 
> > virtnet_info *vi, struct receive_queu
> > return err;
> >  }
> >
> > +static int virtnet_sq_bind_xsk_pool(struct virtnet_info *vi,
> > +   struct send_queue *sq,
> > +   struct xsk_buff_pool *pool)
> > +{
> > +   int err, qindex;
> > +
> > +   qindex = sq - vi->sq;
> > +
> > +   virtnet_tx_pause(vi, sq);
> > +
> > +   err = virtqueue_reset(sq->vq, virtnet_sq_free_unused_buf);
> > +   if (err) {
> > +   netdev_err(vi->dev, "reset tx fail: tx queue index: %d err: 
> > %d\n", qindex, err);
> > +   pool = NULL;
> > +   }
> > +
> > +   sq->xsk_pool = pool;
> > +
> > +   virtnet_tx_resume(vi, sq);
> > +
> > +   return err;
> > +}
> > +
> >  static int virtnet_xsk_pool_enable(struct net_device *dev,
> >struct xsk_buff_pool *pool,
> >u16 qid)
> > @@ -5496,6 +5525,7 @@ static int virtnet_xsk_pool_enable(struct net_device 
> > *dev,
> > struct receive_queue *rq;
> > struct device *dma_dev;
> > struct send_queue *sq;
> > +   dma_addr_t hdr_dma;
> > int err, size;
> >
> > if (vi->hdr_len > xsk_pool_get_headroom(pool))
> > @@ -5533,6 +5563,11 @@ static int virtnet_xsk_pool_enable(struct net_device 
> > *dev,
> > if (!rq->xsk_buffs)
> > return -ENOMEM;
> >
> > +   hdr_dma = virtqueue_dma_map_single_attrs(sq->vq, &xsk_hdr, 
> > vi->hdr_len,
> > +DMA_TO_DEVICE, 0);
> > +   if (virtqueue_dma_mapping_error(sq->vq, hdr_dma))
> > +   return -ENOMEM;
> > +
> > err = xsk_pool_dma_map(pool, dma_dev, 0);
> > if (err)
> > goto err_xsk_map;
> > @@ -5541,11 +5576,24 @@ static int virtnet_xsk_pool_enable(struct 
> > net_device *dev,
> > if (err)
> > goto err_rq;
> >
> > +   err = virtnet_sq_bind_xsk_pool(vi, sq, pool);
> > +   if (err)
> > +   goto err_sq;
> > +
> > +   /* Now, we do not support tx offset, so all the tx virtnet hdr is 
> > zero.
>
> What did you mean by "tx offset" here? (Or I don't see the connection
> with vnet hdr).

Sorry, should be tx offload(such as tx csum).

Will fix.

Thanks.


>
> Anyhow the patch looks good.
>
> Acked-by: Jason Wang 
>
> Thanks
>



Re: [RFC net-next v1 02/12] virtio_ring: split: record extras for indirect buffers

2024-09-24 Thread Xuan Zhuo
Will fix all comments in the next version.

Thanks.

On Tue, 24 Sep 2024 15:34:57 +0800, Jason Wang  wrote:
> On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
> >
> > The subsequent commit needs to know whether every indirect buffer is
> > premapped or not. So we need to introduce an extra struct for every
> > indirect buffer to record this info.
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/virtio/virtio_ring.c | 126 +--
> >  1 file changed, 61 insertions(+), 65 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 228e9fbcba3f..62901bee97c0 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -67,9 +67,20 @@
> >  #define LAST_ADD_TIME_INVALID(vq)
> >  #endif
> >
> > +struct vring_desc_extra {
> > +   dma_addr_t addr;/* Descriptor DMA addr. */
> > +   u32 len;/* Descriptor length. */
> > +   u16 flags;  /* Descriptor flags. */
> > +   u16 next;   /* The next desc state in a list. */
> > +};
> > +
> >  struct vring_desc_state_split {
> > void *data; /* Data for callback. */
> > -   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> > +
> > +   /* Indirect extra table and desc table, if any. These two will be
> > +* allocated together. So we won't stress more to the memory 
> > allocator.
> > +*/
> > +   struct vring_desc_extra *indir;
> >  };
> >
> >  struct vring_desc_state_packed {
> > @@ -79,13 +90,6 @@ struct vring_desc_state_packed {
> > u16 last;   /* The last desc state in a list. */
> >  };
> >
> > -struct vring_desc_extra {
> > -   dma_addr_t addr;/* Descriptor DMA addr. */
> > -   u32 len;/* Descriptor length. */
> > -   u16 flags;  /* Descriptor flags. */
> > -   u16 next;   /* The next desc state in a list. */
> > -};
> > -
> >  struct vring_virtqueue_split {
> > /* Actual memory layout for this queue. */
> > struct vring vring;
> > @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue 
> > *vq, u32 num)
> >   * Split ring specific functions - *_split().
> >   */
> >
> > -static void vring_unmap_one_split_indirect(const struct vring_virtqueue 
> > *vq,
> > -  const struct vring_desc *desc)
> > -{
> > -   u16 flags;
> > -
> > -   if (!vring_need_unmap_buffer(vq))
> > -   return;
> > -
> > -   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> > -
> > -   dma_unmap_page(vring_dma_dev(vq),
> > -  virtio64_to_cpu(vq->vq.vdev, desc->addr),
> > -  virtio32_to_cpu(vq->vq.vdev, desc->len),
> > -  (flags & VRING_DESC_F_WRITE) ?
> > -  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > -}
> > -
> >  static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> > - unsigned int i)
> > + struct vring_desc_extra *extra)
> >  {
> > -   struct vring_desc_extra *extra = vq->split.desc_extra;
> > u16 flags;
> >
> > -   flags = extra[i].flags;
> > +   flags = extra->flags;
> >
> > if (flags & VRING_DESC_F_INDIRECT) {
> > if (!vq->use_dma_api)
> > goto out;
> >
> > dma_unmap_single(vring_dma_dev(vq),
> > -extra[i].addr,
> > -extra[i].len,
> > +extra->addr,
> > +extra->len,
> >  (flags & VRING_DESC_F_WRITE) ?
> >  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > } else {
> > @@ -479,20 +465,22 @@ static unsigned int vring_unmap_one_split(const 
> > struct vring_virtqueue *vq,
> > goto out;
> >
> > dma_unmap_page(vring_dma_dev(vq),
> > -  extra[i].addr,
> > -  extra[i].len,
> > +  extra->addr,
> > +  extra->len,
> >(flags & VRING_DESC_F_WRITE) ?
> >DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > }
> >
> >  out:
> > -   return extra[i].next;
> > +   return extra->next;
> >  }
> >
> >  static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
> >unsigned int total_sg,
> > +  struct vring_desc_extra 
> > **pextra,
>
> This kind of API seems a little bit strange. In the caller we had:
>
> if (desc) {
> ...
> } else {
> extra = vq->split.desc_extra;
> }
>

Re: [RFC net-next v1 04/12] virtio_ring: perform premapped operations based on per-buffer

2024-09-24 Thread Xuan Zhuo
On Tue, 24 Sep 2024 15:35:01 +0800, Jason Wang  wrote:
> On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
> >
> > The current configuration sets the virtqueue (vq) to premapped mode,
> > implying that all buffers submitted to this queue must be mapped ahead
> > of time. This presents a challenge for the virtnet send queue (sq): the
> > virtnet driver would be required to keep track of dma information for vq
> > size * 17, which can be substantial. However, if the premapped mode were
> > applied on a per-buffer basis, the complexity would be greatly reduced.
> > With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
> > skb buffers could remain unmapped.
> >
> > We can distinguish them by sg_page(sg), When sg_page(sg) is NULL, this
> > indicates that the driver has performed DMA mapping in advance, allowing
> > the Virtio core to directly utilize sg_dma_address(sg) without
> > conducting any internal DMA mapping. Additionally, DMA unmap operations
> > for this buffer will be bypassed.
>
> So I think we still need some explanation here. I think this works for
> virtio-net as the sgs are initialized by the virtio-net device itself.
>
> But it seems not the case for all the others where the sgs were passed
> from the uppyer subsystem. For example in __virtscsi_add_cmd(), we
> had:
>
> if (sc && sc->sc_data_direction != DMA_NONE) {
> if (sc->sc_data_direction != DMA_FROM_DEVICE)
> out = &sc->sdb.table;
> if (sc->sc_data_direction != DMA_TO_DEVICE)
> in = &sc->sdb.table;
> }
>
> /* Request header.  */
> sg_init_one(&req, &cmd->req, req_size);
> sgs[out_num++] = &req;
>
> /* Data-out buffer.  */
> if (out) {
> /* Place WRITE protection SGLs before Data OUT payload */
> if (scsi_prot_sg_count(sc))
> sgs[out_num++] = scsi_prot_sglist(sc);
> sgs[out_num++] = out->sgl;
> }


With this in mind, I think the new API is a suitable approach to avoid changing
the sg.
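
Roughly, the driver-side convention described above looks like this (a sketch
only; sg_fill_dma() is the real helper in virtio_net.c, but the field
assignments shown here are assumptions):

        /* premapped buffer: no page attached, the driver supplies the DMA
         * address, so the virtio core skips its own mapping/unmapping.
         */
        sg_assign_page(sg, NULL);
        sg->dma_address = addr;
        sg->length = len;

        /* unmapped buffer: a normal sg entry, the virtio core maps it itself. */
        sg_set_buf(sg, buf, len);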

Thanks.

>
> Thanks
>



Re: [RFC net-next v1 10/12] virtio_net: xsk: tx: support xmit xsk buffer

2024-09-24 Thread Jason Wang
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
>
> The driver's tx napi is very important for XSK. It is responsible for
> obtaining data from the XSK queue and sending it out.
>
> At the beginning, we need to trigger tx napi.
>
> virtnet_free_old_xmit distinguishes three type ptr(skb, xdp frame, xsk
> buffer) by the last bits of the pointer.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 176 ---
>  1 file changed, 166 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3ad4c6e3ef18..1a870f1df910 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -83,6 +83,7 @@ struct virtnet_sq_free_stats {
> u64 bytes;
> u64 napi_packets;
> u64 napi_bytes;
> +   u64 xsk;
>  };
>
>  struct virtnet_sq_stats {
> @@ -514,16 +515,20 @@ static struct sk_buff *virtnet_skb_append_frag(struct 
> sk_buff *head_skb,
>struct sk_buff *curr_skb,
>struct page *page, void *buf,
>int len, int truesize);
> +static void virtnet_xsk_completed(struct send_queue *sq, int num);
>
>  enum virtnet_xmit_type {
> VIRTNET_XMIT_TYPE_SKB,
> VIRTNET_XMIT_TYPE_SKB_ORPHAN,
> VIRTNET_XMIT_TYPE_XDP,
> +   VIRTNET_XMIT_TYPE_XSK,
>  };
>
>  /* We use the last two bits of the pointer to distinguish the xmit type. */
>  #define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1))
>
> +#define VIRTIO_XSK_FLAG_OFFSET 4

Any reason this is not 2?

> +
>  static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr)
>  {
> unsigned long p = (unsigned long)*ptr;
> @@ -546,6 +551,11 @@ static int virtnet_add_outbuf(struct send_queue *sq, int 
> num, void *data,
> GFP_ATOMIC);
>  }
>
> +static u32 virtnet_ptr_to_xsk_buff_len(void *ptr)
> +{
> +   return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
> +}
> +
>  static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
>  {
> sg_assign_page(sg, NULL);
> @@ -587,11 +597,27 @@ static void __free_old_xmit(struct send_queue *sq, 
> struct netdev_queue *txq,
> stats->bytes += xdp_get_frame_len(frame);
> xdp_return_frame(frame);
> break;
> +
> +   case VIRTNET_XMIT_TYPE_XSK:
> +   stats->bytes += virtnet_ptr_to_xsk_buff_len(ptr);
> +   stats->xsk++;
> +   break;
> }
> }
> netdev_tx_completed_queue(txq, stats->napi_packets, 
> stats->napi_bytes);

Not related to this patch, but this seems unnecessary to AF_XDP.

>  }
>
> +static void virtnet_free_old_xmit(struct send_queue *sq,
> + struct netdev_queue *txq,
> + bool in_napi,
> + struct virtnet_sq_free_stats *stats)
> +{
> +   __free_old_xmit(sq, txq, in_napi, stats);
> +
> +   if (stats->xsk)
> +   virtnet_xsk_completed(sq, stats->xsk);
> +}
> +
>  /* Converting between virtqueue no. and kernel tx/rx queue no.
>   * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
>   */
> @@ -1019,7 +1045,7 @@ static void free_old_xmit(struct send_queue *sq, struct 
> netdev_queue *txq,
>  {
> struct virtnet_sq_free_stats stats = {0};
>
> -   __free_old_xmit(sq, txq, in_napi, &stats);
> +   virtnet_free_old_xmit(sq, txq, in_napi, &stats);
>
> /* Avoid overhead when no packets have been processed
>  * happens when called speculatively from start_xmit.
> @@ -1380,6 +1406,111 @@ static int virtnet_add_recvbuf_xsk(struct 
> virtnet_info *vi, struct receive_queue
> return err;
>  }
>
> +static void *virtnet_xsk_to_ptr(u32 len)
> +{
> +   unsigned long p;
> +
> +   p = len << VIRTIO_XSK_FLAG_OFFSET;
> +
> +   return virtnet_xmit_ptr_mix((void *)p, VIRTNET_XMIT_TYPE_XSK);
> +}
> +
> +static int virtnet_xsk_xmit_one(struct send_queue *sq,
> +   struct xsk_buff_pool *pool,
> +   struct xdp_desc *desc)
> +{
> +   struct virtnet_info *vi;
> +   dma_addr_t addr;
> +
> +   vi = sq->vq->vdev->priv;
> +
> +   addr = xsk_buff_raw_get_dma(pool, desc->addr);
> +   xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len);
> +
> +   sg_init_table(sq->sg, 2);
> +
> +   sg_fill_dma(sq->sg, sq->xsk_hdr_dma_addr, vi->hdr_len);
> +   sg_fill_dma(sq->sg + 1, addr, desc->len);
> +
> +   return virtqueue_add_outbuf(sq->vq, sq->sg, 2,
> +   virtnet_xsk_to_ptr(desc->len), 
> GFP_ATOMIC);
> +}
> +
> +static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> + struct xsk_buff_pool *pool,
> + unsign

Re: [RFC net-next v1 04/12] virtio_ring: perform premapped operations based on per-buffer

2024-09-24 Thread Jason Wang
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
>
> The current configuration sets the virtqueue (vq) to premapped mode,
> implying that all buffers submitted to this queue must be mapped ahead
> of time. This presents a challenge for the virtnet send queue (sq): the
> virtnet driver would be required to keep track of dma information for vq
> size * 17, which can be substantial. However, if the premapped mode were
> applied on a per-buffer basis, the complexity would be greatly reduced.
> With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
> skb buffers could remain unmapped.
>
> We can distinguish them by sg_page(sg), When sg_page(sg) is NULL, this
> indicates that the driver has performed DMA mapping in advance, allowing
> the Virtio core to directly utilize sg_dma_address(sg) without
> conducting any internal DMA mapping. Additionally, DMA unmap operations
> for this buffer will be bypassed.

So I think we still need some explanation here. I think this works for
virtio-net as the sgs are initialized by the virtio-net device itself.

But it seems not to be the case for all the others, where the sgs were passed
from the upper subsystem. For example, in __virtscsi_add_cmd() we
had:

if (sc && sc->sc_data_direction != DMA_NONE) {
if (sc->sc_data_direction != DMA_FROM_DEVICE)
out = &sc->sdb.table;
if (sc->sc_data_direction != DMA_TO_DEVICE)
in = &sc->sdb.table;
}

/* Request header.  */
sg_init_one(&req, &cmd->req, req_size);
sgs[out_num++] = &req;

/* Data-out buffer.  */
if (out) {
/* Place WRITE protection SGLs before Data OUT payload */
if (scsi_prot_sg_count(sc))
sgs[out_num++] = scsi_prot_sglist(sc);
sgs[out_num++] = out->sgl;
}

Thanks




Re: [RFC net-next v1 08/12] virtio_net: xsk: bind/unbind xsk for tx

2024-09-24 Thread Jason Wang
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
>
> This patch implement the logic of bind/unbind xsk pool to sq and rq.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 53 
>  1 file changed, 53 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 41a5ea9b788d..7c379614fd22 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -295,6 +295,10 @@ struct send_queue {
>
> /* Record whether sq is in reset state. */
> bool reset;
> +
> +   struct xsk_buff_pool *xsk_pool;
> +
> +   dma_addr_t xsk_hdr_dma_addr;
>  };
>
>  /* Internal representation of a receive virtqueue */
> @@ -497,6 +501,8 @@ struct virtio_net_common_hdr {
> };
>  };
>
> +static struct virtio_net_common_hdr xsk_hdr;
> +
>  static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
>  static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff 
> *xdp,
>struct net_device *dev,
> @@ -5488,6 +5494,29 @@ static int virtnet_rq_bind_xsk_pool(struct 
> virtnet_info *vi, struct receive_queu
> return err;
>  }
>
> +static int virtnet_sq_bind_xsk_pool(struct virtnet_info *vi,
> +   struct send_queue *sq,
> +   struct xsk_buff_pool *pool)
> +{
> +   int err, qindex;
> +
> +   qindex = sq - vi->sq;
> +
> +   virtnet_tx_pause(vi, sq);
> +
> +   err = virtqueue_reset(sq->vq, virtnet_sq_free_unused_buf);
> +   if (err) {
> +   netdev_err(vi->dev, "reset tx fail: tx queue index: %d err: 
> %d\n", qindex, err);
> +   pool = NULL;
> +   }
> +
> +   sq->xsk_pool = pool;
> +
> +   virtnet_tx_resume(vi, sq);
> +
> +   return err;
> +}
> +
>  static int virtnet_xsk_pool_enable(struct net_device *dev,
>struct xsk_buff_pool *pool,
>u16 qid)
> @@ -5496,6 +5525,7 @@ static int virtnet_xsk_pool_enable(struct net_device 
> *dev,
> struct receive_queue *rq;
> struct device *dma_dev;
> struct send_queue *sq;
> +   dma_addr_t hdr_dma;
> int err, size;
>
> if (vi->hdr_len > xsk_pool_get_headroom(pool))
> @@ -5533,6 +5563,11 @@ static int virtnet_xsk_pool_enable(struct net_device 
> *dev,
> if (!rq->xsk_buffs)
> return -ENOMEM;
>
> +   hdr_dma = virtqueue_dma_map_single_attrs(sq->vq, &xsk_hdr, 
> vi->hdr_len,
> +DMA_TO_DEVICE, 0);
> +   if (virtqueue_dma_mapping_error(sq->vq, hdr_dma))
> +   return -ENOMEM;
> +
> err = xsk_pool_dma_map(pool, dma_dev, 0);
> if (err)
> goto err_xsk_map;
> @@ -5541,11 +5576,24 @@ static int virtnet_xsk_pool_enable(struct net_device 
> *dev,
> if (err)
> goto err_rq;
>
> +   err = virtnet_sq_bind_xsk_pool(vi, sq, pool);
> +   if (err)
> +   goto err_sq;
> +
> +   /* Now, we do not support tx offset, so all the tx virtnet hdr is 
> zero.

What did you mean by "tx offset" here? (Or I don't see the connection
with vnet hdr).

Anyhow the patch looks good.

Acked-by: Jason Wang 

Thanks




Re: [RFC net-next v1 07/12] virtio_net: refactor the xmit type

2024-09-24 Thread Jason Wang
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
>
> Because the af-xdp will introduce a new xmit type, so I refactor the
> xmit type mechanism first.
>
> In general, pointers are aligned to 4 or 8 bytes.

I think this needs some clarification: the alignment seems to depend
on the lowest common multiple of the alignments of all struct
members. So we know both xdp_frame and sk_buff are at least 4 bytes
aligned.

If we want to reuse the lowest bit of pointers in AF_XDP, the
alignment of the data structure should be at least 4 bytes.
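
As an illustration only (not code from the patch), the assumption and the
tagging it enables could be spelled out like this:

        /* With at least 4-byte alignment the two low pointer bits are always
         * zero, so they are free to carry the xmit type.
         */
        static_assert(__alignof__(struct sk_buff) >= 4);
        static_assert(__alignof__(struct xdp_frame) >= 4);

        void *tagged = (void *)((unsigned long)skb | VIRTNET_XMIT_TYPE_SKB_ORPHAN);
        enum virtnet_xmit_type type = (unsigned long)tagged & VIRTNET_XMIT_TYPE_MASK;
        struct sk_buff *orig = (void *)((unsigned long)tagged & ~VIRTNET_XMIT_TYPE_MASK);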

> If it is aligned to 4
> bytes, then only two bits are free for a pointer. But there are 4 types
> here, so we can't use bits to distinguish them. And 2 bits is enough for
> 4 types:
>
> 00 for skb
> 01 for SKB_ORPHAN
> 10 for XDP
> 11 for af-xdp tx
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 90 +++-
>  1 file changed, 51 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 630e5b21ad69..41a5ea9b788d 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -45,9 +45,6 @@ module_param(napi_tx, bool, 0644);
>  #define VIRTIO_XDP_TX  BIT(0)
>  #define VIRTIO_XDP_REDIR   BIT(1)
>
> -#define VIRTIO_XDP_FLAGBIT(0)
> -#define VIRTIO_ORPHAN_FLAG BIT(1)
> -
>  /* RX packet size EWMA. The average packet size is used to determine the 
> packet
>   * buffer size when refilling RX rings. As the entire RX ring may be refilled
>   * at once, the weight is chosen so that the EWMA will be insensitive to 
> short-
> @@ -512,34 +509,35 @@ static struct sk_buff *virtnet_skb_append_frag(struct 
> sk_buff *head_skb,
>struct page *page, void *buf,
>int len, int truesize);
>
> -static bool is_xdp_frame(void *ptr)
> -{
> -   return (unsigned long)ptr & VIRTIO_XDP_FLAG;
> -}
> +enum virtnet_xmit_type {
> +   VIRTNET_XMIT_TYPE_SKB,
> +   VIRTNET_XMIT_TYPE_SKB_ORPHAN,
> +   VIRTNET_XMIT_TYPE_XDP,
> +};
>
> -static void *xdp_to_ptr(struct xdp_frame *ptr)
> -{
> -   return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
> -}
> +/* We use the last two bits of the pointer to distinguish the xmit type. */
> +#define VIRTNET_XMIT_TYPE_MASK (BIT(0) | BIT(1))
>
> -static struct xdp_frame *ptr_to_xdp(void *ptr)
> +static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr)

Nit: not a native speaker but I think something like pack/unpack might
be better.

With those changes.

Acked-by: Jason Wang 

Thanks




Re: [RFC net-next v1 02/12] virtio_ring: split: record extras for indirect buffers

2024-09-24 Thread Jason Wang
On Tue, Sep 24, 2024 at 9:32 AM Xuan Zhuo  wrote:
>
> The subsequent commit needs to know whether every indirect buffer is
> premapped or not. So we need to introduce an extra struct for every
> indirect buffer to record this info.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 126 +--
>  1 file changed, 61 insertions(+), 65 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 228e9fbcba3f..62901bee97c0 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -67,9 +67,20 @@
>  #define LAST_ADD_TIME_INVALID(vq)
>  #endif
>
> +struct vring_desc_extra {
> +   dma_addr_t addr;/* Descriptor DMA addr. */
> +   u32 len;/* Descriptor length. */
> +   u16 flags;  /* Descriptor flags. */
> +   u16 next;   /* The next desc state in a list. */
> +};
> +
>  struct vring_desc_state_split {
> void *data; /* Data for callback. */
> -   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> +
> +   /* Indirect extra table and desc table, if any. These two will be
> +* allocated together. So we won't stress more to the memory 
> allocator.
> +*/
> +   struct vring_desc_extra *indir;
>  };
>
>  struct vring_desc_state_packed {
> @@ -79,13 +90,6 @@ struct vring_desc_state_packed {
> u16 last;   /* The last desc state in a list. */
>  };
>
> -struct vring_desc_extra {
> -   dma_addr_t addr;/* Descriptor DMA addr. */
> -   u32 len;/* Descriptor length. */
> -   u16 flags;  /* Descriptor flags. */
> -   u16 next;   /* The next desc state in a list. */
> -};
> -
>  struct vring_virtqueue_split {
> /* Actual memory layout for this queue. */
> struct vring vring;
> @@ -440,38 +444,20 @@ static void virtqueue_init(struct vring_virtqueue *vq, 
> u32 num)
>   * Split ring specific functions - *_split().
>   */
>
> -static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
> -  const struct vring_desc *desc)
> -{
> -   u16 flags;
> -
> -   if (!vring_need_unmap_buffer(vq))
> -   return;
> -
> -   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> -
> -   dma_unmap_page(vring_dma_dev(vq),
> -  virtio64_to_cpu(vq->vq.vdev, desc->addr),
> -  virtio32_to_cpu(vq->vq.vdev, desc->len),
> -  (flags & VRING_DESC_F_WRITE) ?
> -  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> -}
> -
>  static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> - unsigned int i)
> + struct vring_desc_extra *extra)
>  {
> -   struct vring_desc_extra *extra = vq->split.desc_extra;
> u16 flags;
>
> -   flags = extra[i].flags;
> +   flags = extra->flags;
>
> if (flags & VRING_DESC_F_INDIRECT) {
> if (!vq->use_dma_api)
> goto out;
>
> dma_unmap_single(vring_dma_dev(vq),
> -extra[i].addr,
> -extra[i].len,
> +extra->addr,
> +extra->len,
>  (flags & VRING_DESC_F_WRITE) ?
>  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> } else {
> @@ -479,20 +465,22 @@ static unsigned int vring_unmap_one_split(const struct 
> vring_virtqueue *vq,
> goto out;
>
> dma_unmap_page(vring_dma_dev(vq),
> -  extra[i].addr,
> -  extra[i].len,
> +  extra->addr,
> +  extra->len,
>(flags & VRING_DESC_F_WRITE) ?
>DMA_FROM_DEVICE : DMA_TO_DEVICE);
> }
>
>  out:
> -   return extra[i].next;
> +   return extra->next;
>  }
>
>  static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
>unsigned int total_sg,
> +  struct vring_desc_extra 
> **pextra,

This kind of API seems a little bit strange. In the caller we had:

if (desc) {
...
} else {
extra = vq->split.desc_extra;
}

I'd move the assignment of the extra there to make the code more readable.

>gfp_t gfp)
>  {
> +   struct vring_desc_extra *extra;
> struct vring_desc *desc;
> unsigned int i;
>
> @@ -503,40 +491,45 @@ static struct vring_desc *alloc_indirect_split(struct 
> virtqueue *_vq,
>  */
>

Re: [PATCH net-next 04/13] virtio_ring: perform premapped operations based on per-buffer

2024-09-12 Thread Jason Wang
On Thu, Sep 12, 2024 at 3:43 PM Xuan Zhuo  wrote:
>
> On Wed, 11 Sep 2024 11:54:25 +0800, Jason Wang  wrote:
> > On Tue, Aug 20, 2024 at 3:33 PM Xuan Zhuo  
> > wrote:
> > >
> > > The current configuration sets the virtqueue (vq) to premapped mode,
> > > implying that all buffers submitted to this queue must be mapped ahead
> > > of time. This presents a challenge for the virtnet send queue (sq): the
> > > virtnet driver would be required to keep track of dma information for vq
> > > size * 17, which can be substantial. However, if the premapped mode were
> > > applied on a per-buffer basis, the complexity would be greatly reduced.
> > > With AF_XDP enabled, AF_XDP buffers would become premapped, while kernel
> > > skb buffers could remain unmapped.
> >
> > Is this only applied to TX or both TX and RX.
>
>
> For rx, if you mean per-buffer dma buffer, I think it is yes,
> rx can reuse this. If you mean should we do premapped for the
> normal rx buffers, I think we should, that can reduce the
> dma map operations.
>
>
> >
> > >
> > > We can distinguish them by sg_page(sg), When sg_page(sg) is NULL, this
> > > indicates that the driver has performed DMA mapping in advance, allowing
> > > the Virtio core to directly utilize sg_dma_address(sg) without
> > > conducting any internal DMA mapping.
> >
> > This seems conflict with the code below?
> >
> > #define sg_is_premapped(sg) (!sg_page(sg))
>
> Sorry, I do not get for you.
>
> The key point is that the sg->page is setted by driver.

Ok, I forgot that, but let's document this assumption in the changelog.

>
> So I mean if the driver sets sg->page = NULL, then for this sg,
> the virtio core can skip dma mapping. If the driver sets
> sg->page to the page of the buffer, then the virtio core should
> do dma mapping for this sg.
>

Ok, let's describe this in the changelog.

Thanks




Re: [PATCH net-next 07/13] virtio_net: refactor the xmit type

2024-09-12 Thread Jason Wang
On Thu, Sep 12, 2024 at 3:54 PM Xuan Zhuo  wrote:
>
> On Wed, 11 Sep 2024 12:04:16 +0800, Jason Wang  wrote:
> > On Tue, Aug 20, 2024 at 3:33 PM Xuan Zhuo  
> > wrote:
> > >
> > > Because the af-xdp will introduce a new xmit type, so I refactor the
> > > xmit type mechanism first.
> > >
> > > We use the last two bits of the pointer to distinguish the xmit type,
> > > so we can distinguish four xmit types. Now we have three types: skb,
> > > orphan and xdp.
> >
> > And if I was not wrong, we do not anymore use bitmasks. If yes, let's
> > explain the reason here.
>
> In general, pointers are aligned to 4 or 8 bytes. If it is aligned to 4 bytes,
> then only two bits are free for a pointer. So we can only use two bits.
>
> But there are 4 types here, so we can't use bits to distinguish them.
>
> b00 for skb
> b01 for SKB_ORPHAN
> b10 for XDP
> b11 for af-xdp tx

Let's add these in the changelog.

Thanks




Re: [PATCH net-next 10/13] virtio_net: xsk: tx: support xmit xsk buffer

2024-09-12 Thread Jason Wang
On Thu, Sep 12, 2024 at 4:50 PM Xuan Zhuo  wrote:
>
> On Wed, 11 Sep 2024 12:31:32 +0800, Jason Wang  wrote:
> > On Tue, Aug 20, 2024 at 3:33 PM Xuan Zhuo  
> > wrote:
> > >
> > > The driver's tx napi is very important for XSK. It is responsible for
> > > obtaining data from the XSK queue and sending it out.
> > >
> > > At the beginning, we need to trigger tx napi.
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > ---
> > >  drivers/net/virtio_net.c | 127 ++-
> > >  1 file changed, 125 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 221681926d23..3743694d3c3b 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -516,10 +516,13 @@ enum virtnet_xmit_type {
> > > VIRTNET_XMIT_TYPE_SKB,
> > > VIRTNET_XMIT_TYPE_ORPHAN,
> > > VIRTNET_XMIT_TYPE_XDP,
> > > +   VIRTNET_XMIT_TYPE_XSK,
> > >  };
> > >
> > >  #define VIRTNET_XMIT_TYPE_MASK (VIRTNET_XMIT_TYPE_SKB | 
> > > VIRTNET_XMIT_TYPE_ORPHAN \
> > > -   | VIRTNET_XMIT_TYPE_XDP)
> > > +   | VIRTNET_XMIT_TYPE_XDP | 
> > > VIRTNET_XMIT_TYPE_XSK)
> > > +
> > > +#define VIRTIO_XSK_FLAG_OFFSET 4
> > >
> > >  static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr)
> > >  {
> > > @@ -543,6 +546,11 @@ static int virtnet_add_outbuf(struct send_queue *sq, 
> > > int num, void *data,
> > > GFP_ATOMIC);
> > >  }
> > >
> > > +static u32 virtnet_ptr_to_xsk(void *ptr)
> > > +{
> > > +   return ((unsigned long)ptr) >> VIRTIO_XSK_FLAG_OFFSET;
> > > +}
> > > +
> >
> > This needs a better name, otherwise readers might be confused.
> >
> > E.g something like virtnet_ptr_to_xsk_buff_len()?
> >
> > >  static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > >  {
> > > sg_assign_page(sg, NULL);
> > > @@ -584,6 +592,10 @@ static void __free_old_xmit(struct send_queue *sq, 
> > > struct netdev_queue *txq,
> > > stats->bytes += xdp_get_frame_len(frame);
> > > xdp_return_frame(frame);
> > > break;
> > > +
> > > +   case VIRTNET_XMIT_TYPE_XSK:
> > > +   stats->bytes += virtnet_ptr_to_xsk(ptr);
> > > +   break;
> >
> > Do we miss xsk_tx_completed() here?
> >
> > > }
> > > }
> > > netdev_tx_completed_queue(txq, stats->napi_packets, 
> > > stats->napi_bytes);
> > > @@ -1393,6 +1405,97 @@ static int virtnet_xsk_wakeup(struct net_device 
> > > *dev, u32 qid, u32 flag)
> > > return 0;
> > >  }
> > >
> > > +static void *virtnet_xsk_to_ptr(u32 len)
> > > +{
> > > +   unsigned long p;
> > > +
> > > +   p = len << VIRTIO_XSK_FLAG_OFFSET;
> > > +
> > > +   return virtnet_xmit_ptr_mix((void *)p, VIRTNET_XMIT_TYPE_XSK);
> > > +}
> > > +
> > > +static int virtnet_xsk_xmit_one(struct send_queue *sq,
> > > +   struct xsk_buff_pool *pool,
> > > +   struct xdp_desc *desc)
> > > +{
> > > +   struct virtnet_info *vi;
> > > +   dma_addr_t addr;
> > > +
> > > +   vi = sq->vq->vdev->priv;
> > > +
> > > +   addr = xsk_buff_raw_get_dma(pool, desc->addr);
> > > +   xsk_buff_raw_dma_sync_for_device(pool, addr, desc->len);
> > > +
> > > +   sg_init_table(sq->sg, 2);
> > > +
> > > +   sg_fill_dma(sq->sg, sq->xsk_hdr_dma_addr, vi->hdr_len);
> > > +   sg_fill_dma(sq->sg + 1, addr, desc->len);
> > > +
> > > +   return virtqueue_add_outbuf(sq->vq, sq->sg, 2,
> > > +   virtnet_xsk_to_ptr(desc->len), 
> > > GFP_ATOMIC);
> > > +}
> > > +
> > > +static int virtnet_xsk_xmit_batch(struct send_queue *sq,
> > > + struct xsk_buff_pool *pool,
> > > + unsigned int budget,
> > > + u64 *kicks)
> > > +{
> > > +   struct xdp_desc *descs = pool->tx_descs;
> > > +   bool kick = false;
> > > +   u32 nb_pkts, i;
> > > +   int err;
> > > +
> > > +   budget = min_t(u32, budget, sq->vq->num_free);
> > > +
> > > +   nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
> > > +   if (!nb_pkts)
> > > +   return 0;
> > > +
> > > +   for (i = 0; i < nb_pkts; i++) {
> > > +   err = virtnet_xsk_xmit_one(sq, pool, &descs[i]);
> > > +   if (unlikely(err)) {
> > > +   xsk_tx_completed(sq->xsk_pool, nb_pkts - i);
> >
> > Should we kick in this condition?
> >
> > > +   break;
> > > +   }
> > > +
> > > +   kick = true;
> > > +   }
> > > +
> > > +   if (kick && virtqueue_kick_prepare(sq->vq) && 
> > > virtqueue_notify(sq->vq))
> >
> > Can we simply use virtqueue_kick() here?
> >
> > > +   (*kicks)++;
> > > +
> > > +   retu
