Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Tue, 19 Aug 2025 10:48:37 +0800 Jason Wang wrote: > On Mon, Aug 18, 2025 at 11:39 PM Jakub Kicinski wrote: > > > > On Mon, 18 Aug 2025 09:15:22 +0800 Junnan Wu wrote: > > > > > Yes, you are right. The commit of this fix tag is the first commit I > > > > > found which add function `virtnet_poll_cleantx`. Actually, we are not > > > > > sure whether this issue appears after this commit. > > > > > > > > > > In our side, this issue is found by chance in version 5.15. > > > > > > > > > > It's hard to find the key commit which cause this issue > > > > > for reason that the reproduction of this scenario is too complex. > > > > > > > > I think the problem needs to be more clearly understood, and then it > > > > will be easier to find the fixes tag. At the face of it the patch > > > > makes it look like close() doesn't reliably stop the device, which > > > > is highly odd. > > > > > > Yes, you are right. It is really strange that `close()` acts like > > > that, because current order has worked for long time. But panic call > > > stack in our env shows that the function `virtnet_close` and > > > `netif_device_detach` should have a correct execution order. And it > > > needs more time to find the fixes tag. I wonder that is it must have > > > fixes tag to merge? > > > > > > By the way, you mentioned that "the problem need to be more clearly > > > understood", did you mean the descriptions and sequences in commit > > > message are not easy to understand? Do you have some suggestions > > > about this? > > > > Perhaps Jason gets your explanation and will correct me, but to me it > > seems like the fix is based on trial and error rather than clear > > understanding of the problem. If you understood the problem clearly > > you should be able to find the Fixes tag without a problem.. > > > > +1 > > The code looks fine but the fixes tag needs to be correct. > > Thanks Well, I will do some works to find out the fixes tag. Once there's progress, I will let you know.
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Mon, Aug 18, 2025 at 11:39 PM Jakub Kicinski wrote: > > On Mon, 18 Aug 2025 09:15:22 +0800 Junnan Wu wrote: > > > > Yes, you are right. The commit of this fix tag is the first commit I > > > > found which add function `virtnet_poll_cleantx`. Actually, we are not > > > > sure whether this issue appears after this commit. > > > > > > > > In our side, this issue is found by chance in version 5.15. > > > > > > > > It's hard to find the key commit which cause this issue > > > > for reason that the reproduction of this scenario is too complex. > > > > > > I think the problem needs to be more clearly understood, and then it > > > will be easier to find the fixes tag. At the face of it the patch > > > makes it look like close() doesn't reliably stop the device, which > > > is highly odd. > > > > Yes, you are right. It is really strange that `close()` acts like > > that, because current order has worked for long time. But panic call > > stack in our env shows that the function `virtnet_close` and > > `netif_device_detach` should have a correct execution order. And it > > needs more time to find the fixes tag. I wonder that is it must have > > fixes tag to merge? > > > > By the way, you mentioned that "the problem need to be more clearly > > understood", did you mean the descriptions and sequences in commit > > message are not easy to understand? Do you have some suggestions > > about this? > > Perhaps Jason gets your explanation and will correct me, but to me it > seems like the fix is based on trial and error rather than clear > understanding of the problem. If you understood the problem clearly > you should be able to find the Fixes tag without a problem.. > +1 The code looks fine but the fixes tag needs to be correct. Thanks
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Mon, 18 Aug 2025 09:15:22 +0800 Junnan Wu wrote: > > > Yes, you are right. The commit of this fix tag is the first commit I > > > found which add function `virtnet_poll_cleantx`. Actually, we are not > > > sure whether this issue appears after this commit. > > > > > > In our side, this issue is found by chance in version 5.15. > > > > > > It's hard to find the key commit which cause this issue > > > for reason that the reproduction of this scenario is too complex. > > > > I think the problem needs to be more clearly understood, and then it > > will be easier to find the fixes tag. At the face of it the patch > > makes it look like close() doesn't reliably stop the device, which > > is highly odd. > > Yes, you are right. It is really strange that `close()` acts like > that, because current order has worked for long time. But panic call > stack in our env shows that the function `virtnet_close` and > `netif_device_detach` should have a correct execution order. And it > needs more time to find the fixes tag. I wonder that is it must have > fixes tag to merge? > > By the way, you mentioned that "the problem need to be more clearly > understood", did you mean the descriptions and sequences in commit > message are not easy to understand? Do you have some suggestions > about this? Perhaps Jason gets your explanation and will correct me, but to me it seems like the fix is based on trial and error rather than clear understanding of the problem. If you understood the problem clearly you should be able to find the Fixes tag without a problem..
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Fri, 15 Aug 2025 08:55:03 -0700 Jakub Kicinski wrote > On Fri, 15 Aug 2025 14:06:15 +0800 Junnan Wu wrote: > > On Fri, 15 Aug 2025 13:38:21 +0800 Jason Wang wrote > > > On Fri, Aug 15, 2025 at 10:24 AM Junnan Wu > > > wrote: > > > > Sorry, I basically mean that the tx napi which caused by > > > > userspace will not be scheduled during suspend, others can not be > > > > guaranteed, such as unfinished packets already in tx vq etc. > > > > > > > > But after this patch, once `virtnet_close` completes, > > > > both tx and rq napi will be disabled which guarantee their napi > > > > will not be scheduled in future. And the tx state will be set to > > > > "__QUEUE_STATE_DRV_XOFF" correctly in `netif_device_detach`. > > > > > > Ok, so the commit mentioned by fix tag is incorrect. > > > > Yes, you are right. The commit of this fix tag is the first commit I > > found which add function `virtnet_poll_cleantx`. Actually, we are not > > sure whether this issue appears after this commit. > > > > In our side, this issue is found by chance in version 5.15. > > > > It's hard to find the key commit which cause this issue > > for reason that the reproduction of this scenario is too complex. > > I think the problem needs to be more clearly understood, and then it > will be easier to find the fixes tag. At the face of it the patch > makes it look like close() doesn't reliably stop the device, which > is highly odd. Yes, you are right. It is really strange that `close()` acts like that, because current order has worked for long time. But panic call stack in our env shows that the function `virtnet_close` and `netif_device_detach` should have a correct execution order. And it needs more time to find the fixes tag. I wonder that is it must have fixes tag to merge? By the way, you mentioned that "the problem need to be more clearly understood", did you mean the descriptions and sequences in commit message are not easy to understand? Do you have some suggestions about this? Thanks!
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Fri, 15 Aug 2025 14:06:15 +0800 Junnan Wu wrote: > On Fri, 15 Aug 2025 13:38:21 +0800 Jason Wang wrote > > On Fri, Aug 15, 2025 at 10:24 AM Junnan Wu wrote: > > > > > Sorry, I basically mean that the tx napi which caused by > > > userspace will not be scheduled during suspend, others can not be > > > guaranteed, such as unfinished packets already in tx vq etc. > > > > > > But after this patch, once `virtnet_close` completes, > > > both tx and rq napi will be disabled which guarantee their napi > > > will not be scheduled in future. And the tx state will be set to > > > "__QUEUE_STATE_DRV_XOFF" correctly in `netif_device_detach`. > > > > Ok, so the commit mentioned by fix tag is incorrect. > > Yes, you are right. The commit of this fix tag is the first commit I > found which add function `virtnet_poll_cleantx`. Actually, we are not > sure whether this issue appears after this commit. > > In our side, this issue is found by chance in version 5.15. > > It's hard to find the key commit which cause this issue > for reason that the reproduction of this scenario is too complex. I think the problem needs to be more clearly understood, and then it will be easier to find the fixes tag. At the face of it the patch makes it look like close() doesn't reliably stop the device, which is highly odd.
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Fri, 15 Aug 2025 13:38:21 +0800 Jason Wang wrote > On Fri, Aug 15, 2025 at 10:24 AM Junnan Wu wrote: > > > > Sorry, I basically mean that the tx napi which caused by userspace will not > > be scheduled during suspend, > > others can not be guaranteed, such as unfinished packets already in tx vq > > etc. > > > > But after this patch, once `virtnet_close` completes, > > both tx and rq napi will be disabled which guarantee their napi will not be > > scheduled in future. > > And the tx state will be set to "__QUEUE_STATE_DRV_XOFF" correctly in > > `netif_device_detach`. > > Ok, so the commit mentioned by fix tag is incorrect. Yes, you are right. The commit of this fix tag is the first commit I found which add function `virtnet_poll_cleantx`. Actually, we are not sure whether this issue appears after this commit. In our side, this issue is found by chance in version 5.15. It's hard to find the key commit which cause this issue for reason that the reproduction of this scenario is too complex.
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Fri, Aug 15, 2025 at 10:24 AM Junnan Wu wrote: > > Sorry, I basically mean that the tx napi which caused by userspace will not > be scheduled during suspend, > others can not be guaranteed, such as unfinished packets already in tx vq etc. > > But after this patch, once `virtnet_close` completes, > both tx and rq napi will be disabled which guarantee their napi will not be > scheduled in future. > And the tx state will be set to "__QUEUE_STATE_DRV_XOFF" correctly in > `netif_device_detach`. Ok, so the commit mentioned by fix tag is incorrect. Thanks > > Thanks. >
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
Sorry, I basically mean that the tx napi which caused by userspace will not be scheduled during suspend, others can not be guaranteed, such as unfinished packets already in tx vq etc. But after this patch, once `virtnet_close` completes, both tx and rq napi will be disabled which guarantee their napi will not be scheduled in future. And the tx state will be set to "__QUEUE_STATE_DRV_XOFF" correctly in `netif_device_detach`. Thanks.
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Thu, Aug 14, 2025 at 2:44 PM Junnan Wu wrote:
>
> On Thu, 14 Aug 2025 12:01:18 +0800 Jason Wang wrote:
> > On Thu, Aug 14, 2025 at 10:36 AM Junnan Wu wrote:
> > >
> > > On Wed, 13 Aug 2025 17:23:07 -0700 Jakub Kicinski wrote:
> > > > Sounds like a fix people may want to backport. Could you repost with
> > > > an appropriate Fixes tag added, pointing to the earliest commit where
> > > > the problem can be observed?
> > >
> > > This issue is caused by commit "7b0411ef4aa69c9256d6a2c289d0a2b320414633"
> > > After this patch, during `virtnet_poll`, function `virtnet_poll_cleantx`
> > > will be invoked, which will wakeup tx queue and clear queue state.
> > > If you agree with it, I will repost with this Fixes tag later.
> > >
> > > Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
> >
> > Could you please explain why it is specific to RX NAPI but not TX?
> >
> > Thanks
>
> This issue appears in suspend flow, if a TCP connection in host VM is still
> sending packet before driver suspend is completed, it will tigger RX napi
> schedule,
> Finally "use after free" happens when tcp ack timer is up.
>
> And in suspend flow, the action to send packet is already stopped in guest VM,
> therefore TX napi will not be scheduled.
I basically mean who guarantees the TX NAPI is not scheduled?
Thanks
>
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Thu, 14 Aug 2025 14:49:06 +0800 Jason Wang wrote:
> On Thu, Aug 14, 2025 at 2:44 PM Junnan Wu wrote:
> >
> > On Thu, 14 Aug 2025 12:01:18 +0800 Jason Wang wrote:
> > > On Thu, Aug 14, 2025 at 10:36 AM Junnan Wu
> > > wrote:
> > > >
> > > > On Wed, 13 Aug 2025 17:23:07 -0700 Jakub Kicinski wrote:
> > > > > Sounds like a fix people may want to backport. Could you repost with
> > > > > an appropriate Fixes tag added, pointing to the earliest commit where
> > > > > the problem can be observed?
> > > >
> > > > This issue is caused by commit
> > > > "7b0411ef4aa69c9256d6a2c289d0a2b320414633"
> > > > After this patch, during `virtnet_poll`, function `virtnet_poll_cleantx`
> > > > will be invoked, which will wakeup tx queue and clear queue state.
> > > > If you agree with it, I will repost with this Fixes tag later.
> > > >
> > > > Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
> > >
> > > Could you please explain why it is specific to RX NAPI but not TX?
> > >
> > > Thanks
> >
> > This issue appears in suspend flow, if a TCP connection in host VM is still
> > sending packet before driver suspend is completed, it will tigger RX napi
> > schedule,
> > Finally "use after free" happens when tcp ack timer is up.
> >
> > And in suspend flow, the action to send packet is already stopped in guest
> > VM,
>
> The TX interrupt and NAPI is not disabled yet. Or anything I miss here?
When system suspends, the userspace progress which based on virtio_net
will be freezed firstly, and then driver suspend callback executes.
so though TX interrupt and NAPI is not disabled at that time, it will also not
be scheduled.
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Thu, Aug 14, 2025 at 2:44 PM Junnan Wu wrote:
>
> On Thu, 14 Aug 2025 12:01:18 +0800 Jason Wang wrote:
> > On Thu, Aug 14, 2025 at 10:36 AM Junnan Wu wrote:
> > >
> > > On Wed, 13 Aug 2025 17:23:07 -0700 Jakub Kicinski wrote:
> > > > Sounds like a fix people may want to backport. Could you repost with
> > > > an appropriate Fixes tag added, pointing to the earliest commit where
> > > > the problem can be observed?
> > >
> > > This issue is caused by commit "7b0411ef4aa69c9256d6a2c289d0a2b320414633"
> > > After this patch, during `virtnet_poll`, function `virtnet_poll_cleantx`
> > > will be invoked, which will wakeup tx queue and clear queue state.
> > > If you agree with it, I will repost with this Fixes tag later.
> > >
> > > Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
> >
> > Could you please explain why it is specific to RX NAPI but not TX?
> >
> > Thanks
>
> This issue appears in suspend flow, if a TCP connection in host VM is still
> sending packet before driver suspend is completed, it will tigger RX napi
> schedule,
> Finally "use after free" happens when tcp ack timer is up.
>
> And in suspend flow, the action to send packet is already stopped in guest VM,
The TX interrupt and NAPI is not disabled yet. Or anything I miss here?
> therefore TX napi will not be scheduled.
>
Thanks
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Thu, 14 Aug 2025 12:01:18 +0800 Jason Wang wrote:
> On Thu, Aug 14, 2025 at 10:36 AM Junnan Wu wrote:
> >
> > On Wed, 13 Aug 2025 17:23:07 -0700 Jakub Kicinski wrote:
> > > Sounds like a fix people may want to backport. Could you repost with
> > > an appropriate Fixes tag added, pointing to the earliest commit where
> > > the problem can be observed?
> >
> > This issue is caused by commit "7b0411ef4aa69c9256d6a2c289d0a2b320414633"
> > After this patch, during `virtnet_poll`, function `virtnet_poll_cleantx`
> > will be invoked, which will wakeup tx queue and clear queue state.
> > If you agree with it, I will repost with this Fixes tag later.
> >
> > Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
>
> Could you please explain why it is specific to RX NAPI but not TX?
>
> Thanks
This issue appears in suspend flow, if a TCP connection in host VM is still
sending packet before driver suspend is completed, it will tigger RX napi
schedule,
Finally "use after free" happens when tcp ack timer is up.
And in suspend flow, the action to send packet is already stopped in guest VM,
therefore TX napi will not be scheduled.
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Thu, Aug 14, 2025 at 10:36 AM Junnan Wu wrote:
>
> On Wed, 13 Aug 2025 17:23:07 -0700 Jakub Kicinski wrote:
> > Sounds like a fix people may want to backport. Could you repost with
> > an appropriate Fixes tag added, pointing to the earliest commit where
> > the problem can be observed?
>
> This issue is caused by commit "7b0411ef4aa69c9256d6a2c289d0a2b320414633"
> After this patch, during `virtnet_poll`, function `virtnet_poll_cleantx`
> will be invoked, which will wakeup tx queue and clear queue state.
> If you agree with it, I will repost with this Fixes tag later.
>
> Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
Could you please explain why it is specific to RX NAPI but not TX?
Thanks
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Wed, 13 Aug 2025 17:23:07 -0700 Jakub Kicinski wrote:
> Sounds like a fix people may want to backport. Could you repost with
> an appropriate Fixes tag added, pointing to the earliest commit where
> the problem can be observed?
This issue is caused by commit "7b0411ef4aa69c9256d6a2c289d0a2b320414633"
After this patch, during `virtnet_poll`, function `virtnet_poll_cleantx`
will be invoked, which will wakeup tx queue and clear queue state.
If you agree with it, I will repost with this Fixes tag later.
Fixes: 7b0411ef4aa6 ("virtio-net: clean tx descriptors from rx napi")
Re: [PATCH net] virtio_net: adjust the execution order of function `virtnet_close` during freeze
On Tue, 12 Aug 2025 17:08:17 +0800 Junnan Wu wrote: > "Use after free" issue appears in suspend once race occurs when > napi poll scheduls after `netif_device_detach` and before napi disables. Sounds like a fix people may want to backport. Could you repost with an appropriate Fixes tag added, pointing to the earliest commit where the problem can be observed? -- pw-bot: cr

