Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On 8/13/17 2:56 PM, Wei Wang wrote: >> Looking at my patch to move host routes from loopback to device with the >> address, I have this: >> >> @@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) >> const struct arg_dev_net *adn = arg; >> const struct net_device *dev = adn->dev; >> >> - if ((rt->dst.dev == dev || !dev) && >> + if ((rt->dst.dev == dev || !dev || >> +(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) && >> rt != adn->net->ipv6.ip6_null_entry && >> (rt->rt6i_nsiblings == 0 || >> (dev && netdev_unregistering(dev)) || > > As you explained earlier, after your patch, all entries in the fib6 > tree will have rt->dst.dev be the same as rt->rt6i_idev->dev except > those ones created by p6_rt_cache_alloc() and ip6_rt_pcpu_alloc(). > Then the above newly added check is mainly to catch those cached dst > entries (created by ip6_rt_cached_alloc()). right? > And it is required because __ipv6_ifa_notify() -> ip6_del_rt() won't > take care of those cached dst entries. > > Then I think I should wait for your patches to get merged before > submitting my patch? no. your patch will need to go back to 4.12; my changes will not be appropriate for that.
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
> Looking at my patch to move host routes from loopback to device with the > address, I have this: > > @@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) > const struct arg_dev_net *adn = arg; > const struct net_device *dev = adn->dev; > > - if ((rt->dst.dev == dev || !dev) && > + if ((rt->dst.dev == dev || !dev || > +(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) && > rt != adn->net->ipv6.ip6_null_entry && > (rt->rt6i_nsiblings == 0 || > (dev && netdev_unregistering(dev)) || As you explained earlier, after your patch, all entries in the fib6 tree will have rt->dst.dev be the same as rt->rt6i_idev->dev except those ones created by p6_rt_cache_alloc() and ip6_rt_pcpu_alloc(). Then the above newly added check is mainly to catch those cached dst entries (created by ip6_rt_cached_alloc()). right? And it is required because __ipv6_ifa_notify() -> ip6_del_rt() won't take care of those cached dst entries. Then I think I should wait for your patches to get merged before submitting my patch? Thanks. Wei On Sun, Aug 13, 2017 at 9:24 AM, David Ahern wrote: > On 8/12/17 1:42 PM, Wei Wang wrote: >> Hi Ido, >> - if ((rt->dst.dev == dev || !dev) && + if ((rt->dst.dev == dev || !dev || + rt->rt6i_idev->dev == dev) && >>> >>> Can you please explain why this line is needed? While host routes aren't >>> removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are >>> removed later on in addrconf_ifdown(). >>> >> >> Yes.. Agree. But one difference is that if the route is removed from >> addrconf_ifdown(), dst_dev_put() won't be called to release the >> devices before doing dst_release(). It is OK if dst_release() sees the >> refcnt on dst already drops to 0 and directly destroys the dst. But I >> think it will cause problem if at the time, the dst is still held by >> some other users because then the refcnt on the device going down will >> not get released. >> That's why I think we should remove the dst with either dst->dev == >> going down dev or rt6->rt6i_idev->dev == going down dev from the fib6 >> tree always because there, we always call dst_dev_put() to release the >> device. >> >>> With your patch, if I check the return value of ip6_del_rt() in >>> __ipv6_ifa_notify() I see that -ENONET is returned. Because the host >>> route was already removed by rt6_ifdown(). When the line in question is >>> removed from the patch I don't get the error anymore. >>> >> >> Right. That is expected as the route is already removed from the tree. >> >>> Is it possible that in John's case the host route was correctly removed >>> from the FIB and that the unreleased reference was due to a wrong check >>> in ip6_dst_ifdown() (which you patched correctly AFAICT)? >>> >> >> Yes. possible. But as I explained earlier, I still think we should >> also remove routes with rt6->rt6i_idev->dev == going down dev from the >> tree. > > Looking at my patch to move host routes from loopback to device with the > address, I have this: > > @@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) > const struct arg_dev_net *adn = arg; > const struct net_device *dev = adn->dev; > > - if ((rt->dst.dev == dev || !dev) && > + if ((rt->dst.dev == dev || !dev || > +(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) && > rt != adn->net->ipv6.ip6_null_entry && > (rt->rt6i_nsiblings == 0 || > (dev && netdev_unregistering(dev)) || > >
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On 8/12/17 1:42 PM, Wei Wang wrote: > Hi Ido, > >>> - if ((rt->dst.dev == dev || !dev) && >>> + if ((rt->dst.dev == dev || !dev || >>> + rt->rt6i_idev->dev == dev) && >> >> Can you please explain why this line is needed? While host routes aren't >> removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are >> removed later on in addrconf_ifdown(). >> > > Yes.. Agree. But one difference is that if the route is removed from > addrconf_ifdown(), dst_dev_put() won't be called to release the > devices before doing dst_release(). It is OK if dst_release() sees the > refcnt on dst already drops to 0 and directly destroys the dst. But I > think it will cause problem if at the time, the dst is still held by > some other users because then the refcnt on the device going down will > not get released. > That's why I think we should remove the dst with either dst->dev == > going down dev or rt6->rt6i_idev->dev == going down dev from the fib6 > tree always because there, we always call dst_dev_put() to release the > device. > >> With your patch, if I check the return value of ip6_del_rt() in >> __ipv6_ifa_notify() I see that -ENONET is returned. Because the host >> route was already removed by rt6_ifdown(). When the line in question is >> removed from the patch I don't get the error anymore. >> > > Right. That is expected as the route is already removed from the tree. > >> Is it possible that in John's case the host route was correctly removed >> from the FIB and that the unreleased reference was due to a wrong check >> in ip6_dst_ifdown() (which you patched correctly AFAICT)? >> > > Yes. possible. But as I explained earlier, I still think we should > also remove routes with rt6->rt6i_idev->dev == going down dev from the > tree. Looking at my patch to move host routes from loopback to device with the address, I have this: @@ -2789,7 +2808,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) const struct arg_dev_net *adn = arg; const struct net_device *dev = adn->dev; - if ((rt->dst.dev == dev || !dev) && + if ((rt->dst.dev == dev || !dev || +(netdev_unregistering(dev) && rt->rt6i_idev->dev == dev)) && rt != adn->net->ipv6.ip6_null_entry && (rt->rt6i_nsiblings == 0 || (dev && netdev_unregistering(dev)) ||
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Hi Ido, >> - if ((rt->dst.dev == dev || !dev) && >> + if ((rt->dst.dev == dev || !dev || >> + rt->rt6i_idev->dev == dev) && > > Can you please explain why this line is needed? While host routes aren't > removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are > removed later on in addrconf_ifdown(). > Yes.. Agree. But one difference is that if the route is removed from addrconf_ifdown(), dst_dev_put() won't be called to release the devices before doing dst_release(). It is OK if dst_release() sees the refcnt on dst already drops to 0 and directly destroys the dst. But I think it will cause problem if at the time, the dst is still held by some other users because then the refcnt on the device going down will not get released. That's why I think we should remove the dst with either dst->dev == going down dev or rt6->rt6i_idev->dev == going down dev from the fib6 tree always because there, we always call dst_dev_put() to release the device. > With your patch, if I check the return value of ip6_del_rt() in > __ipv6_ifa_notify() I see that -ENONET is returned. Because the host > route was already removed by rt6_ifdown(). When the line in question is > removed from the patch I don't get the error anymore. > Right. That is expected as the route is already removed from the tree. > Is it possible that in John's case the host route was correctly removed > from the FIB and that the unreleased reference was due to a wrong check > in ip6_dst_ifdown() (which you patched correctly AFAICT)? > Yes. possible. But as I explained earlier, I still think we should also remove routes with rt6->rt6i_idev->dev == going down dev from the tree. Thanks. Wei On Sat, Aug 12, 2017 at 11:01 AM, Ido Schimmel wrote: > Hi Wei, > > On Fri, Aug 11, 2017 at 05:10:02PM -0700, Wei Wang wrote: >> I think we have a potential fix for this issue. >> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is >> possible that rt6->dst.dev points to loopback device while >> rt6->rt6i_idev->dev points to a real device. >> When the real device goes down, the current fib6 clean up code only >> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. >> That leaves unreleased refcnt on the real device if rt6->dst.dev >> points to loopback dev. > > [...] > >> From 2d8861808c2029013f6b6e86120ba6902329145b Mon Sep 17 00:00:00 2001 >> From: Wei Wang >> Date: Fri, 11 Aug 2017 16:36:04 -0700 >> Subject: [PATCH 1/2] potential fix for unregister_netdevice() >> >> Change-Id: I5d5f6f7a7ad0f5dd769f33487db17ff2570d52ea >> --- >> net/ipv6/route.c | 17 - >> 1 file changed, 8 insertions(+), 9 deletions(-) >> >> diff --git a/net/ipv6/route.c b/net/ipv6/route.c >> index 4d30c96a819d..105922903932 100644 >> --- a/net/ipv6/route.c >> +++ b/net/ipv6/route.c >> @@ -417,14 +417,12 @@ static void ip6_dst_ifdown(struct dst_entry *dst, >> struct net_device *dev, >> struct net_device *loopback_dev = >> dev_net(dev)->loopback_dev; >> >> - if (dev != loopback_dev) { >> - if (idev && idev->dev == dev) { >> - struct inet6_dev *loopback_idev = >> - in6_dev_get(loopback_dev); >> - if (loopback_idev) { >> - rt->rt6i_idev = loopback_idev; >> - in6_dev_put(idev); >> - } >> + if (idev && idev->dev != loopback_dev) { >> + struct inet6_dev *loopback_idev = >> + in6_dev_get(loopback_dev); >> + if (loopback_idev) { >> + rt->rt6i_idev = loopback_idev; >> + in6_dev_put(idev); >> } >> } >> } >> @@ -2789,7 +2787,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) >> const struct arg_dev_net *adn = arg; >> const struct net_device *dev = adn->dev; >> >> - if ((rt->dst.dev == dev || !dev) && >> + if ((rt->dst.dev == dev || !dev || >> + rt->rt6i_idev->dev == dev) && > > Can you please explain why this line is needed? While host routes aren't > removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are > removed later on in addrconf_ifdown(). > > With your patch, if I check the return value of ip6_del_rt() in > __ipv6_ifa_notify() I see that -ENONET is returned. Because the host > route was already removed by rt6_ifdown(). When the line in question is > removed from the patch I don't get the error anymore. > > Is it possible that in John's case the host route was correctly removed > from the FIB and that the unreleased reference was due to a wrong check > in ip6_dst_ifdown() (which you patched correctly AFAICT)? > > Thanks > >> rt != adn->net->ipv6.ip6_null_entry && >> (rt->rt6i_nsiblings == 0 || >>(dev && netdev_unregistering(dev)) || >> -- >> 2.14.0.434.g98096fd7a8-goog >>
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Fri, Aug 11, 2017 at 8:37 PM, David Ahern wrote: > On 8/11/17 6:25 PM, Wei Wang wrote: >> By "a patch to fix that" do you mean after your patch, for every rt6, >> rt6->rt6i_idev will be the same as rt6->dst.dev? > > FIB entries should have them the same device with my patch. > > The copies done (ip6_rt_cache_alloc and ip6_rt_pcpu_alloc) will have to > set dst.dev to loopback or VRF device for RTF_LOCAL routes; it's the > only way to get local traffic to work and this is similar to what IPv4 does. Got it. Thanks David. Wei
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
> Looks good so far! I've not hit the issue yet. > Great. I will prepare an official patch then. > Thanks so much for sorting out a fix! > > If its useful: > Tested-by: John Stultz Sure will do. Thanks. Wei On Fri, Aug 11, 2017 at 8:07 PM, John Stultz wrote: > On Fri, Aug 11, 2017 at 5:31 PM, John Stultz wrote: >> On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang wrote: If after Cong's fix, the issue still happens, could you help try the patch attached and collect all logs when you try the reproduce the issue? It would be great to have logs for both success case and the failure case. Thanks so much for your help. >>> >>> I think we have a potential fix for this issue. >>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is >>> possible that rt6->dst.dev points to loopback device while >>> rt6->rt6i_idev->dev points to a real device. >>> When the real device goes down, the current fib6 clean up code only >>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. >>> That leaves unreleased refcnt on the real device if rt6->dst.dev >>> points to loopback dev. >>> >>> The attached potential fix is tested by Martin and made sure it fixes his >>> issue. >>> >>> John, >>> It will be great if you can also give it a try and see if it fixes the >>> issue on your side before I submit an official patch. >> >> So yes, sorry I haven't been able to get back quicker on the other >> patches sent, was mucking about in other work. >> >> So yea, this patch (potential fix for unregister_netdevice()) seems >> to avoid the issue. >> >> I'm going to do some further testing, but its looking good so far. > > Looks good so far! I've not hit the issue yet. > > Thanks so much for sorting out a fix! > > If its useful: > Tested-by: John Stultz > > thanks again > -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Fri, Aug 11, 2017 at 8:07 PM, John Stultz wrote: > On Fri, Aug 11, 2017 at 5:31 PM, John Stultz wrote: >> On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang wrote: If after Cong's fix, the issue still happens, could you help try the patch attached and collect all logs when you try the reproduce the issue? It would be great to have logs for both success case and the failure case. Thanks so much for your help. >>> >>> I think we have a potential fix for this issue. >>> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is >>> possible that rt6->dst.dev points to loopback device while >>> rt6->rt6i_idev->dev points to a real device. >>> When the real device goes down, the current fib6 clean up code only >>> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. >>> That leaves unreleased refcnt on the real device if rt6->dst.dev >>> points to loopback dev. >>> >>> The attached potential fix is tested by Martin and made sure it fixes his >>> issue. >>> >>> John, >>> It will be great if you can also give it a try and see if it fixes the >>> issue on your side before I submit an official patch. >> >> So yes, sorry I haven't been able to get back quicker on the other >> patches sent, was mucking about in other work. >> >> So yea, this patch (potential fix for unregister_netdevice()) seems >> to avoid the issue. >> >> I'm going to do some further testing, but its looking good so far. > > Looks good so far! I've not hit the issue yet. > > Thanks so much for sorting out a fix! > > If its useful: > Tested-by: John Stultz > > thanks again > -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Hi Wei, On Fri, Aug 11, 2017 at 05:10:02PM -0700, Wei Wang wrote: > I think we have a potential fix for this issue. > Martin and I found that when addrconf_dst_alloc() creates a rt6, it is > possible that rt6->dst.dev points to loopback device while > rt6->rt6i_idev->dev points to a real device. > When the real device goes down, the current fib6 clean up code only > checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. > That leaves unreleased refcnt on the real device if rt6->dst.dev > points to loopback dev. [...] > From 2d8861808c2029013f6b6e86120ba6902329145b Mon Sep 17 00:00:00 2001 > From: Wei Wang > Date: Fri, 11 Aug 2017 16:36:04 -0700 > Subject: [PATCH 1/2] potential fix for unregister_netdevice() > > Change-Id: I5d5f6f7a7ad0f5dd769f33487db17ff2570d52ea > --- > net/ipv6/route.c | 17 - > 1 file changed, 8 insertions(+), 9 deletions(-) > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index 4d30c96a819d..105922903932 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -417,14 +417,12 @@ static void ip6_dst_ifdown(struct dst_entry *dst, > struct net_device *dev, > struct net_device *loopback_dev = > dev_net(dev)->loopback_dev; > > - if (dev != loopback_dev) { > - if (idev && idev->dev == dev) { > - struct inet6_dev *loopback_idev = > - in6_dev_get(loopback_dev); > - if (loopback_idev) { > - rt->rt6i_idev = loopback_idev; > - in6_dev_put(idev); > - } > + if (idev && idev->dev != loopback_dev) { > + struct inet6_dev *loopback_idev = > + in6_dev_get(loopback_dev); > + if (loopback_idev) { > + rt->rt6i_idev = loopback_idev; > + in6_dev_put(idev); > } > } > } > @@ -2789,7 +2787,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) > const struct arg_dev_net *adn = arg; > const struct net_device *dev = adn->dev; > > - if ((rt->dst.dev == dev || !dev) && > + if ((rt->dst.dev == dev || !dev || > + rt->rt6i_idev->dev == dev) && Can you please explain why this line is needed? While host routes aren't removed from the FIB by rt6_ifdown() (when dst.dev goes down), they are removed later on in addrconf_ifdown(). With your patch, if I check the return value of ip6_del_rt() in __ipv6_ifa_notify() I see that -ENONET is returned. Because the host route was already removed by rt6_ifdown(). When the line in question is removed from the patch I don't get the error anymore. Is it possible that in John's case the host route was correctly removed from the FIB and that the unreleased reference was due to a wrong check in ip6_dst_ifdown() (which you patched correctly AFAICT)? Thanks > rt != adn->net->ipv6.ip6_null_entry && > (rt->rt6i_nsiblings == 0 || >(dev && netdev_unregistering(dev)) || > -- > 2.14.0.434.g98096fd7a8-goog >
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On 8/11/17 6:25 PM, Wei Wang wrote: > By "a patch to fix that" do you mean after your patch, for every rt6, > rt6->rt6i_idev will be the same as rt6->dst.dev? FIB entries should have them the same device with my patch. The copies done (ip6_rt_cache_alloc and ip6_rt_pcpu_alloc) will have to set dst.dev to loopback or VRF device for RTF_LOCAL routes; it's the only way to get local traffic to work and this is similar to what IPv4 does.
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Fri, Aug 11, 2017 at 5:31 PM, John Stultz wrote: > On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang wrote: >>> If after Cong's fix, the issue still happens, could you help try the >>> patch attached and collect all logs when you try the reproduce the >>> issue? It would be great to have logs for both success case and the >>> failure case. >>> >>> Thanks so much for your help. >>> >> >> I think we have a potential fix for this issue. >> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is >> possible that rt6->dst.dev points to loopback device while >> rt6->rt6i_idev->dev points to a real device. >> When the real device goes down, the current fib6 clean up code only >> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. >> That leaves unreleased refcnt on the real device if rt6->dst.dev >> points to loopback dev. >> >> The attached potential fix is tested by Martin and made sure it fixes his >> issue. >> >> John, >> It will be great if you can also give it a try and see if it fixes the >> issue on your side before I submit an official patch. > > So yes, sorry I haven't been able to get back quicker on the other > patches sent, was mucking about in other work. > > So yea, this patch (potential fix for unregister_netdevice()) seems > to avoid the issue. > > I'm going to do some further testing, but its looking good so far. Looks good so far! I've not hit the issue yet. Thanks so much for sorting out a fix! If its useful: Tested-by: John Stultz thanks again -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
> So yes, sorry I haven't been able to get back quicker on the other > patches sent, was mucking about in other work. > > So yea, this patch (potential fix for unregister_netdevice()) seems > to avoid the issue. > > I'm going to do some further testing, but its looking good so far. > Great. Thanks. > Do you still want feedback on the previous changes? If this patch is good, then I don't really need the previous feedback. Thanks. Wei On Fri, Aug 11, 2017 at 5:31 PM, John Stultz wrote: > On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang wrote: >>> If after Cong's fix, the issue still happens, could you help try the >>> patch attached and collect all logs when you try the reproduce the >>> issue? It would be great to have logs for both success case and the >>> failure case. >>> >>> Thanks so much for your help. >>> >> >> I think we have a potential fix for this issue. >> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is >> possible that rt6->dst.dev points to loopback device while >> rt6->rt6i_idev->dev points to a real device. >> When the real device goes down, the current fib6 clean up code only >> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. >> That leaves unreleased refcnt on the real device if rt6->dst.dev >> points to loopback dev. >> >> The attached potential fix is tested by Martin and made sure it fixes his >> issue. >> >> John, >> It will be great if you can also give it a try and see if it fixes the >> issue on your side before I submit an official patch. > > So yes, sorry I haven't been able to get back quicker on the other > patches sent, was mucking about in other work. > > So yea, this patch (potential fix for unregister_netdevice()) seems > to avoid the issue. > > I'm going to do some further testing, but its looking good so far. > > Do you still want feedback on the previous changes? > > thanks > -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Fri, Aug 11, 2017 at 5:10 PM, Wei Wang wrote: >> If after Cong's fix, the issue still happens, could you help try the >> patch attached and collect all logs when you try the reproduce the >> issue? It would be great to have logs for both success case and the >> failure case. >> >> Thanks so much for your help. >> > > I think we have a potential fix for this issue. > Martin and I found that when addrconf_dst_alloc() creates a rt6, it is > possible that rt6->dst.dev points to loopback device while > rt6->rt6i_idev->dev points to a real device. > When the real device goes down, the current fib6 clean up code only > checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. > That leaves unreleased refcnt on the real device if rt6->dst.dev > points to loopback dev. > > The attached potential fix is tested by Martin and made sure it fixes his > issue. > > John, > It will be great if you can also give it a try and see if it fixes the > issue on your side before I submit an official patch. So yes, sorry I haven't been able to get back quicker on the other patches sent, was mucking about in other work. So yea, this patch (potential fix for unregister_netdevice()) seems to avoid the issue. I'm going to do some further testing, but its looking good so far. Do you still want feedback on the previous changes? thanks -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Fri, Aug 11, 2017 at 5:19 PM, David Ahern wrote: > On 8/11/17 6:10 PM, Wei Wang wrote: >> I think we have a potential fix for this issue. >> Martin and I found that when addrconf_dst_alloc() creates a rt6, it is >> possible that rt6->dst.dev points to loopback device while >> rt6->rt6i_idev->dev points to a real device. >> When the real device goes down, the current fib6 clean up code only >> checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. >> That leaves unreleased refcnt on the real device if rt6->dst.dev >> points to loopback dev. > > Yes, host routes and anycast routes. > > I have a patch to fix that but it is held up on a few VRF test cases > failing. Hopefully I can get that figured out next week. These unrelated > routes against the loopback device have been a source of a number of > problems (e.g. take down 'lo' and all of IPv6 networking stops for that > namespace). Thanks David. By "a patch to fix that" do you mean after your patch, for every rt6, rt6->rt6i_idev will be the same as rt6->dst.dev?
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On 8/11/17 6:10 PM, Wei Wang wrote: > I think we have a potential fix for this issue. > Martin and I found that when addrconf_dst_alloc() creates a rt6, it is > possible that rt6->dst.dev points to loopback device while > rt6->rt6i_idev->dev points to a real device. > When the real device goes down, the current fib6 clean up code only > checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. > That leaves unreleased refcnt on the real device if rt6->dst.dev > points to loopback dev. Yes, host routes and anycast routes. I have a patch to fix that but it is held up on a few VRF test cases failing. Hopefully I can get that figured out next week. These unrelated routes against the loopback device have been a source of a number of problems (e.g. take down 'lo' and all of IPv6 networking stops for that namespace).
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
> If after Cong's fix, the issue still happens, could you help try the > patch attached and collect all logs when you try the reproduce the > issue? It would be great to have logs for both success case and the > failure case. > > Thanks so much for your help. > I think we have a potential fix for this issue. Martin and I found that when addrconf_dst_alloc() creates a rt6, it is possible that rt6->dst.dev points to loopback device while rt6->rt6i_idev->dev points to a real device. When the real device goes down, the current fib6 clean up code only checks for rt6->dst.dev and assumes rt6->rt6i_idev->dev is the same. That leaves unreleased refcnt on the real device if rt6->dst.dev points to loopback dev. The attached potential fix is tested by Martin and made sure it fixes his issue. John, It will be great if you can also give it a try and see if it fixes the issue on your side before I submit an official patch. Thanks very much for the help from everyone. Wei On Fri, Aug 11, 2017 at 10:25 AM, Wei Wang wrote: > On Fri, Aug 11, 2017 at 9:48 AM, Cong Wang wrote: >> Hi, >> >> On Thu, Aug 10, 2017 at 11:12 AM, John Stultz wrote: >>> On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang wrote: Hi John, Is it possible to try the attached patch? >>> >>> Thanks so much for the quick turn around! >>> >>> So I dropped all the reverts you suggested, and applied this one >>> against 4.13-rc4, but I'm still seeing the problematic behavior. >> >> Does the following one-line fix make a difference? >> >> diff --git a/net/ipv6/route.c b/net/ipv6/route.c >> index a640fbcba15d..c145a35763a0 100644 >> --- a/net/ipv6/route.c >> +++ b/net/ipv6/route.c >> @@ -141,7 +141,7 @@ static void rt6_uncached_list_del(struct rt6_info *rt) >> struct uncached_list *ul = rt->rt6i_uncached_list; >> >> spin_lock_bh(&ul->lock); >> - list_del(&rt->rt6i_uncached); >> + list_del_init(&rt->rt6i_uncached); >> spin_unlock_bh(&ul->lock); >> } >> } > > > Thanks a lot Cong for proposing this fix. > > For the last few days, John has been helping me running debug image > and we found out that the leaked dst is probably in addrconf.c. > Martin and I are looking through the code and trying to put more debugs. > > John, > > If after Cong's fix, the issue still happens, could you help try the > patch attached and collect all logs when you try the reproduce the > issue? It would be great to have logs for both success case and the > failure case. > > Thanks so much for your help. > > Wei From 2d8861808c2029013f6b6e86120ba6902329145b Mon Sep 17 00:00:00 2001 From: Wei Wang Date: Fri, 11 Aug 2017 16:36:04 -0700 Subject: [PATCH 1/2] potential fix for unregister_netdevice() Change-Id: I5d5f6f7a7ad0f5dd769f33487db17ff2570d52ea --- net/ipv6/route.c | 17 - 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 4d30c96a819d..105922903932 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -417,14 +417,12 @@ static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev, struct net_device *loopback_dev = dev_net(dev)->loopback_dev; - if (dev != loopback_dev) { - if (idev && idev->dev == dev) { - struct inet6_dev *loopback_idev = -in6_dev_get(loopback_dev); - if (loopback_idev) { -rt->rt6i_idev = loopback_idev; -in6_dev_put(idev); - } + if (idev && idev->dev != loopback_dev) { + struct inet6_dev *loopback_idev = + in6_dev_get(loopback_dev); + if (loopback_idev) { + rt->rt6i_idev = loopback_idev; + in6_dev_put(idev); } } } @@ -2789,7 +2787,8 @@ static int fib6_ifdown(struct rt6_info *rt, void *arg) const struct arg_dev_net *adn = arg; const struct net_device *dev = adn->dev; - if ((rt->dst.dev == dev || !dev) && + if ((rt->dst.dev == dev || !dev || + rt->rt6i_idev->dev == dev) && rt != adn->net->ipv6.ip6_null_entry && (rt->rt6i_nsiblings == 0 || (dev && netdev_unregistering(dev)) || -- 2.14.0.434.g98096fd7a8-goog
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Fri, Aug 11, 2017 at 9:48 AM, Cong Wang wrote: > Hi, > > On Thu, Aug 10, 2017 at 11:12 AM, John Stultz wrote: >> On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang wrote: >>> Hi John, >>> >>> Is it possible to try the attached patch? >> >> Thanks so much for the quick turn around! >> >> So I dropped all the reverts you suggested, and applied this one >> against 4.13-rc4, but I'm still seeing the problematic behavior. > > Does the following one-line fix make a difference? > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index a640fbcba15d..c145a35763a0 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -141,7 +141,7 @@ static void rt6_uncached_list_del(struct rt6_info *rt) > struct uncached_list *ul = rt->rt6i_uncached_list; > > spin_lock_bh(&ul->lock); > - list_del(&rt->rt6i_uncached); > + list_del_init(&rt->rt6i_uncached); > spin_unlock_bh(&ul->lock); > } > } Thanks a lot Cong for proposing this fix. For the last few days, John has been helping me running debug image and we found out that the leaked dst is probably in addrconf.c. Martin and I are looking through the code and trying to put more debugs. John, If after Cong's fix, the issue still happens, could you help try the patch attached and collect all logs when you try the reproduce the issue? It would be great to have logs for both success case and the failure case. Thanks so much for your help. Wei From 0ff591e00eac13888c5ee9c84ee51b286b27389d Mon Sep 17 00:00:00 2001 From: Wei Wang Date: Thu, 10 Aug 2017 21:12:32 -0700 Subject: [PATCH] ipv6: unregister_netdevice debug Change-Id: I97b044b957a146de480e5427212c1ce80bc3dd3c --- include/net/dst.h | 7 +++ include/net/dst_ops.h | 1 + include/net/ip6_fib.h | 10 ++ net/core/dev.c| 16 +++ net/core/dst.c| 10 -- net/ipv6/addrconf.c | 24 +++--- net/ipv6/route.c | 55 --- 7 files changed, 115 insertions(+), 8 deletions(-) diff --git a/include/net/dst.h b/include/net/dst.h index f73611ec4017..48d9f0322492 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -56,6 +56,13 @@ struct dst_entry { #define DST_XFRM_TUNNEL 0x0020 #define DST_XFRM_QUEUE 0x0040 #define DST_METADATA 0x0080 +#define RTF_ICMPV6_DST 0x0100 +#define RTF_ADDRCONF_DST 0x0200 +#define RTF_UNCACHED_DST 0x0400 +#define RTF_CACHE_DST 0x0800 +#define RTF_FIB6_DST 0x1000 +#define RTF_PCPU_DST 0x2000 + short error; diff --git a/include/net/dst_ops.h b/include/net/dst_ops.h index c84b3287e38b..eef7d6b6f3cd 100644 --- a/include/net/dst_ops.h +++ b/include/net/dst_ops.h @@ -39,6 +39,7 @@ struct dst_ops { struct kmem_cache *kmem_cachep; struct percpu_counter pcpuc_entries cacheline_aligned_in_smp; + void (*do_account)(struct dst_entry *dst); }; static inline int dst_entries_get_fast(struct dst_ops *dst) diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h index 1a88008cc6f5..af382d123f63 100644 --- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -214,6 +214,16 @@ struct rt6_statistics { __u32 fib_rt_entries; /* rt entries in table */ __u32 fib_rt_cache; /* cache routes */ __u32 fib_discarded_routes; + __u32 icmpv6_dst; + __u32 addrconf_dst; + __u32 fib6_dst; + __u32 cache_dst; + __u32 uncached_dst; + __u32 pcpu_dst; + __u32 dev_free_fib6; + __u32 dev_free_uncached; + __u32 dev_free_icmpv6; + __u32 dev_free_addrconf; }; #define RTN_TL_ROOT 0x0001 diff --git a/net/core/dev.c b/net/core/dev.c index 8ea6b4b42611..5c6fea78003f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1,3 +1,4 @@ + /* * NET3Protocol independent device support routines. * @@ -7738,8 +7739,23 @@ static void netdev_wait_allrefs(struct net_device *dev) { unsigned long rebroadcast_time, warning_time; int refcnt; + struct net *net = dev_net(dev); linkwatch_forget_dev(dev); + printk("%s: dev_name %s, icmpv6_dst %u, addrconf_dst %u, fib6_dst %u, pcpu_dst %u, cache_dst %u, uncached_list %u, dev_free_fib6 %u, dev_free_uncached %u, dev_free_icmpv6 %u, dev_free_addrconf %u, fib_rt_entries %u\n", __func__, + dev->name, + net->ipv6.rt6_stats->icmpv6_dst, + net->ipv6.rt6_stats->addrconf_dst, + net->ipv6.rt6_stats->fib6_dst, + net->ipv6.rt6_stats->pcpu_dst, + net->ipv6.rt6_stats->cache_dst, + net->ipv6.rt6_stats->uncached_dst, + net->ipv6.rt6_stats->dev_free_fib6, + net->ipv6.rt6_stats->dev_free_uncached, + net->ipv6.rt6_stats->dev_free_icmpv6, + net->ipv6.rt6_stats->dev_free_addrconf, + net->ipv6.rt6_stats->fib_rt_entries); + rebroadcast_time = warning_time = jiffies; refcnt = netdev_refcnt_read(dev); diff --git a/net/core/dst.c b/net/core/dst.c index 00aa972ad1a1..8c20deec398b 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -184,8 +184,11 @@ void dst_release(struct dst_e
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Hi, On Thu, Aug 10, 2017 at 11:12 AM, John Stultz wrote: > On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang wrote: >> Hi John, >> >> Is it possible to try the attached patch? > > Thanks so much for the quick turn around! > > So I dropped all the reverts you suggested, and applied this one > against 4.13-rc4, but I'm still seeing the problematic behavior. Does the following one-line fix make a difference? diff --git a/net/ipv6/route.c b/net/ipv6/route.c index a640fbcba15d..c145a35763a0 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -141,7 +141,7 @@ static void rt6_uncached_list_del(struct rt6_info *rt) struct uncached_list *ul = rt->rt6i_uncached_list; spin_lock_bh(&ul->lock); - list_del(&rt->rt6i_uncached); + list_del_init(&rt->rt6i_uncached); spin_unlock_bh(&ul->lock); } }
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Thu, Aug 10, 2017 at 11:12 AM, John Stultz wrote: > On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang wrote: >> Hi John, >> >> Is it possible to try the attached patch? > > Thanks so much for the quick turn around! > > So I dropped all the reverts you suggested, and applied this one > against 4.13-rc4, but I'm still seeing the problematic behavior. > Thanks for confirming. I have been going through the code and not yet found any leaks. I am also trying to reproduce the issue myself. Martin seems to also see this issue. I will continue investigating. >> I am not sure if it actually fixes the issue. But I think it is worth a try. >> Also, could you get me all the ipv6 routes when you plug in the usb >> using "ip -6 route show"? (If you have multiple routing tables >> configured, could you dump them all?) > > # ip -6 route show > 2601:1c2:1002:83f0::/64 dev eth0 proto kernel metric 256 expires > 345599sec pref medium > fe80::/64 dev eth0 proto kernel metric 256 pref medium > default via fe80::200:caff:fe11:2233 dev eth0 proto ra metric 1024 > expires 1799sec hoplimit 64 pref medium > > > After unplugging the device (and getting the unregister_netdevice errors): > # ip -6 route show > # > Thanks for the logs. > > thanks > -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Wed, Aug 9, 2017 at 10:41 PM, Wei Wang wrote: > Hi John, > > Is it possible to try the attached patch? Thanks so much for the quick turn around! So I dropped all the reverts you suggested, and applied this one against 4.13-rc4, but I'm still seeing the problematic behavior. > I am not sure if it actually fixes the issue. But I think it is worth a try. > Also, could you get me all the ipv6 routes when you plug in the usb > using "ip -6 route show"? (If you have multiple routing tables > configured, could you dump them all?) # ip -6 route show 2601:1c2:1002:83f0::/64 dev eth0 proto kernel metric 256 expires 345599sec pref medium fe80::/64 dev eth0 proto kernel metric 256 pref medium default via fe80::200:caff:fe11:2233 dev eth0 proto ra metric 1024 expires 1799sec hoplimit 64 pref medium After unplugging the device (and getting the unregister_netdevice errors): # ip -6 route show # thanks -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Hi John, Is it possible to try the attached patch? I am not sure if it actually fixes the issue. But I think it is worth a try. Also, could you get me all the ipv6 routes when you plug in the usb using "ip -6 route show"? (If you have multiple routing tables configured, could you dump them all?) Thanks a lot. Wei On Wed, Aug 9, 2017 at 6:36 PM, Wei Wang wrote: > On Wed, Aug 9, 2017 at 6:26 PM, John Stultz wrote: >> On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang wrote: >>> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz wrote: >>>> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang wrote: >>>>> (Cc'ing Wei whose commit was blamed) >>>>> >>>>> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz >>>>> wrote: >>>>>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz >>>>>> wrote: >>>>>>> So, with recent testing with my HiKey board, I've been noticing some >>>>>>> quirky behavior with my USB eth adapter. >>>>>>> >>>>>>> Basically, pluging the usb eth adapter in and then removing it, when >>>>>>> plugging it back in I often find that its not detected, and the system >>>>>>> slowly spits out the following message over and over: >>>>>>> unregister_netdevice: waiting for eth0 to become free. Usage count = 1 >>>>>> >>>>>> The other bit is that after this starts printing, the board will no >>>>>> longer reboot (it hangs continuing to occasionally print the above >>>>>> message), and I have to manually reset the device. >>>>>> >>>>> >>>>> So this warning is not temporarily shown but lasts until a reboot, >>>>> right? If so it is a dst refcnt leak. >>>> >>>> Correct, once I get into the state it lasts until a reboot. >>>> >>>>> How reproducible is it for you? From my reading, it seems always >>>>> reproduced when you unplug and plug your usb eth interface? >>>>> Is there anything else involved? For example, network namespace. >>>> >>>> So with 4.13-rc3/4 I seem to trigger it easily, often with the first >>>> unplug of the USB eth adapter. >>>> >>>> But as I get back closer to 4.12, it seemingly becomes harder to >>>> trigger, but sometimes still happens. >>>> >>>> So far, I've not been able to trigger it with 4.12. >>>> >>>> I don't think network namespaces are involved? Though its out of my >>>> area, so AOSP may be using them these days. Is there a simple way to >>>> check? >>>> >>>> I'll also do another bisection to see if the bad point moves back any >>>> further. >> >> So I went through another bisection around and got 9514528d92d4 ipv6: >> call dst_dev_put() properly as the first bad commit again. >> >>> If you see the problem starts to happen on commit >>> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all >>> the following commits: >>> (from new to old) >>> 1eb04e7c9e63 net: reorder all the dst flags >>> a4c2fd7f7891 net: remove DST_NOCACHE flag >>> b2a9c0ed75a3 net: remove DST_NOGC flag >>> 5b7c9a8ff828 net: remove dst gc related code >>> db916649b5dd ipv6: get rid of icmp6 dst garbage collector >>> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free() >>> ad65a2f05695 ipv6: call dst_hold_safe() properly >>> 9514528d92d4 ipv6: call dst_dev_put() properly >> >> >> And reverting this set off of 4.13-rc4 seems to make the issue go away. >> >> Is there anything I can test to help narrow down the specific problem >> with that patchset? >> > > Thanks John for confirming. > Let me spend some time on the commits and I will let you know if I > have some debug image for you to try. > > Wei > > >> thanks >> -john From 93f2836679c81915b110ff56617f9f5dae2e6927 Mon Sep 17 00:00:00 2001 From: Wei Wang Date: Wed, 9 Aug 2017 22:27:36 -0700 Subject: [PATCH] ipv6: unregister netdev bug fix Change-Id: I30fa739989ac50fbc7f4cbc6a04130005589cc25 --- include/net/ip6_route.h | 1 + net/ipv6/addrconf.c | 10 +++--- net/ipv6/anycast.c | 3 ++- net/ipv6/route.c| 2 +- 4 files changed, 11 insertions(+), 5 deletions(-) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 907d39a42f6b..dec1424ce619 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -94,6 +94,7
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Wed, Aug 9, 2017 at 6:26 PM, John Stultz wrote: > On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang wrote: >> On Wed, Aug 9, 2017 at 4:44 PM, John Stultz wrote: >>> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang wrote: >>>> (Cc'ing Wei whose commit was blamed) >>>> >>>> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz wrote: >>>>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz >>>>> wrote: >>>>>> So, with recent testing with my HiKey board, I've been noticing some >>>>>> quirky behavior with my USB eth adapter. >>>>>> >>>>>> Basically, pluging the usb eth adapter in and then removing it, when >>>>>> plugging it back in I often find that its not detected, and the system >>>>>> slowly spits out the following message over and over: >>>>>> unregister_netdevice: waiting for eth0 to become free. Usage count = 1 >>>>> >>>>> The other bit is that after this starts printing, the board will no >>>>> longer reboot (it hangs continuing to occasionally print the above >>>>> message), and I have to manually reset the device. >>>>> >>>> >>>> So this warning is not temporarily shown but lasts until a reboot, >>>> right? If so it is a dst refcnt leak. >>> >>> Correct, once I get into the state it lasts until a reboot. >>> >>>> How reproducible is it for you? From my reading, it seems always >>>> reproduced when you unplug and plug your usb eth interface? >>>> Is there anything else involved? For example, network namespace. >>> >>> So with 4.13-rc3/4 I seem to trigger it easily, often with the first >>> unplug of the USB eth adapter. >>> >>> But as I get back closer to 4.12, it seemingly becomes harder to >>> trigger, but sometimes still happens. >>> >>> So far, I've not been able to trigger it with 4.12. >>> >>> I don't think network namespaces are involved? Though its out of my >>> area, so AOSP may be using them these days. Is there a simple way to >>> check? >>> >>> I'll also do another bisection to see if the bad point moves back any >>> further. > > So I went through another bisection around and got 9514528d92d4 ipv6: > call dst_dev_put() properly as the first bad commit again. > >> If you see the problem starts to happen on commit >> 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all >> the following commits: >> (from new to old) >> 1eb04e7c9e63 net: reorder all the dst flags >> a4c2fd7f7891 net: remove DST_NOCACHE flag >> b2a9c0ed75a3 net: remove DST_NOGC flag >> 5b7c9a8ff828 net: remove dst gc related code >> db916649b5dd ipv6: get rid of icmp6 dst garbage collector >> 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free() >> ad65a2f05695 ipv6: call dst_hold_safe() properly >> 9514528d92d4 ipv6: call dst_dev_put() properly > > > And reverting this set off of 4.13-rc4 seems to make the issue go away. > > Is there anything I can test to help narrow down the specific problem > with that patchset? > Thanks John for confirming. Let me spend some time on the commits and I will let you know if I have some debug image for you to try. Wei > thanks > -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang wrote: > On Wed, Aug 9, 2017 at 4:44 PM, John Stultz wrote: >> On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang wrote: >>> (Cc'ing Wei whose commit was blamed) >>> >>> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz wrote: >>>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz wrote: >>>>> So, with recent testing with my HiKey board, I've been noticing some >>>>> quirky behavior with my USB eth adapter. >>>>> >>>>> Basically, pluging the usb eth adapter in and then removing it, when >>>>> plugging it back in I often find that its not detected, and the system >>>>> slowly spits out the following message over and over: >>>>> unregister_netdevice: waiting for eth0 to become free. Usage count = 1 >>>> >>>> The other bit is that after this starts printing, the board will no >>>> longer reboot (it hangs continuing to occasionally print the above >>>> message), and I have to manually reset the device. >>>> >>> >>> So this warning is not temporarily shown but lasts until a reboot, >>> right? If so it is a dst refcnt leak. >> >> Correct, once I get into the state it lasts until a reboot. >> >>> How reproducible is it for you? From my reading, it seems always >>> reproduced when you unplug and plug your usb eth interface? >>> Is there anything else involved? For example, network namespace. >> >> So with 4.13-rc3/4 I seem to trigger it easily, often with the first >> unplug of the USB eth adapter. >> >> But as I get back closer to 4.12, it seemingly becomes harder to >> trigger, but sometimes still happens. >> >> So far, I've not been able to trigger it with 4.12. >> >> I don't think network namespaces are involved? Though its out of my >> area, so AOSP may be using them these days. Is there a simple way to >> check? >> >> I'll also do another bisection to see if the bad point moves back any >> further. So I went through another bisection around and got 9514528d92d4 ipv6: call dst_dev_put() properly as the first bad commit again. > If you see the problem starts to happen on commit > 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all > the following commits: > (from new to old) > 1eb04e7c9e63 net: reorder all the dst flags > a4c2fd7f7891 net: remove DST_NOCACHE flag > b2a9c0ed75a3 net: remove DST_NOGC flag > 5b7c9a8ff828 net: remove dst gc related code > db916649b5dd ipv6: get rid of icmp6 dst garbage collector > 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free() > ad65a2f05695 ipv6: call dst_hold_safe() properly > 9514528d92d4 ipv6: call dst_dev_put() properly And reverting this set off of 4.13-rc4 seems to make the issue go away. Is there anything I can test to help narrow down the specific problem with that patchset? thanks -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Wed, Aug 9, 2017 at 5:36 PM, Wei Wang wrote: > > Does your USB adapter get an IPv6 address? Yes, it does. > If you see the problem starts to happen on commit > 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all > the following commits: > (from new to old) > 1eb04e7c9e63 net: reorder all the dst flags > a4c2fd7f7891 net: remove DST_NOCACHE flag > b2a9c0ed75a3 net: remove DST_NOGC flag > 5b7c9a8ff828 net: remove dst gc related code > db916649b5dd ipv6: get rid of icmp6 dst garbage collector > 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free() > ad65a2f05695 ipv6: call dst_hold_safe() properly > 9514528d92d4 ipv6: call dst_dev_put() properly > > and try if it starts to work? I'll give that a shot! Thanks so much for the help! I really appreciate it! -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Wed, Aug 9, 2017 at 4:44 PM, John Stultz wrote: > On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang wrote: >> (Cc'ing Wei whose commit was blamed) >> >> On Mon, Aug 7, 2017 at 2:15 PM, John Stultz wrote: >>> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz wrote: >>>> So, with recent testing with my HiKey board, I've been noticing some >>>> quirky behavior with my USB eth adapter. >>>> >>>> Basically, pluging the usb eth adapter in and then removing it, when >>>> plugging it back in I often find that its not detected, and the system >>>> slowly spits out the following message over and over: >>>> unregister_netdevice: waiting for eth0 to become free. Usage count = 1 >>> >>> The other bit is that after this starts printing, the board will no >>> longer reboot (it hangs continuing to occasionally print the above >>> message), and I have to manually reset the device. >>> >> >> So this warning is not temporarily shown but lasts until a reboot, >> right? If so it is a dst refcnt leak. > > Correct, once I get into the state it lasts until a reboot. > >> How reproducible is it for you? From my reading, it seems always >> reproduced when you unplug and plug your usb eth interface? >> Is there anything else involved? For example, network namespace. > > So with 4.13-rc3/4 I seem to trigger it easily, often with the first > unplug of the USB eth adapter. > > But as I get back closer to 4.12, it seemingly becomes harder to > trigger, but sometimes still happens. > > So far, I've not been able to trigger it with 4.12. > > I don't think network namespaces are involved? Though its out of my > area, so AOSP may be using them these days. Is there a simple way to > check? > > I'll also do another bisection to see if the bad point moves back any further. > > thanks > -john Hi John, Does your USB adapter get an IPv6 address? If you see the problem starts to happen on commit 9514528d92d4cbe086499322370155ed69f5d06c, could you try reverting all the following commits: (from new to old) 1eb04e7c9e63 net: reorder all the dst flags a4c2fd7f7891 net: remove DST_NOCACHE flag b2a9c0ed75a3 net: remove DST_NOGC flag 5b7c9a8ff828 net: remove dst gc related code db916649b5dd ipv6: get rid of icmp6 dst garbage collector 587fea741134 ipv6: mark DST_NOGC and remove the operation of dst_free() ad65a2f05695 ipv6: call dst_hold_safe() properly 9514528d92d4 ipv6: call dst_dev_put() properly and try if it starts to work? By only reverting 9514528d92d4 definitely won't work as all the later commits depend on this one. Thanks a lot. Wei
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Wed, Aug 9, 2017 at 4:34 PM, Cong Wang wrote: > (Cc'ing Wei whose commit was blamed) > > On Mon, Aug 7, 2017 at 2:15 PM, John Stultz wrote: >> On Mon, Aug 7, 2017 at 2:05 PM, John Stultz wrote: >>> So, with recent testing with my HiKey board, I've been noticing some >>> quirky behavior with my USB eth adapter. >>> >>> Basically, pluging the usb eth adapter in and then removing it, when >>> plugging it back in I often find that its not detected, and the system >>> slowly spits out the following message over and over: >>> unregister_netdevice: waiting for eth0 to become free. Usage count = 1 >> >> The other bit is that after this starts printing, the board will no >> longer reboot (it hangs continuing to occasionally print the above >> message), and I have to manually reset the device. >> > > So this warning is not temporarily shown but lasts until a reboot, > right? If so it is a dst refcnt leak. Correct, once I get into the state it lasts until a reboot. > How reproducible is it for you? From my reading, it seems always > reproduced when you unplug and plug your usb eth interface? > Is there anything else involved? For example, network namespace. So with 4.13-rc3/4 I seem to trigger it easily, often with the first unplug of the USB eth adapter. But as I get back closer to 4.12, it seemingly becomes harder to trigger, but sometimes still happens. So far, I've not been able to trigger it with 4.12. I don't think network namespaces are involved? Though its out of my area, so AOSP may be using them these days. Is there a simple way to check? I'll also do another bisection to see if the bad point moves back any further. thanks -john
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
(Cc'ing Wei whose commit was blamed) On Mon, Aug 7, 2017 at 2:15 PM, John Stultz wrote: > On Mon, Aug 7, 2017 at 2:05 PM, John Stultz wrote: >> So, with recent testing with my HiKey board, I've been noticing some >> quirky behavior with my USB eth adapter. >> >> Basically, pluging the usb eth adapter in and then removing it, when >> plugging it back in I often find that its not detected, and the system >> slowly spits out the following message over and over: >> unregister_netdevice: waiting for eth0 to become free. Usage count = 1 > > The other bit is that after this starts printing, the board will no > longer reboot (it hangs continuing to occasionally print the above > message), and I have to manually reset the device. > So this warning is not temporarily shown but lasts until a reboot, right? If so it is a dst refcnt leak. How reproducible is it for you? From my reading, it seems always reproduced when you unplug and plug your usb eth interface? Is there anything else involved? For example, network namespace. Thanks.
Re: unregister_netdevice: waiting for eth0 to become free. Usage count = 1
On Mon, Aug 7, 2017 at 2:05 PM, John Stultz wrote: > So, with recent testing with my HiKey board, I've been noticing some > quirky behavior with my USB eth adapter. > > Basically, pluging the usb eth adapter in and then removing it, when > plugging it back in I often find that its not detected, and the system > slowly spits out the following message over and over: > unregister_netdevice: waiting for eth0 to become free. Usage count = 1 The other bit is that after this starts printing, the board will no longer reboot (it hangs continuing to occasionally print the above message), and I have to manually reset the device. thanks -john
unregister_netdevice: waiting for eth0 to become free. Usage count = 1
So, with recent testing with my HiKey board, I've been noticing some quirky behavior with my USB eth adapter. Basically, pluging the usb eth adapter in and then removing it, when plugging it back in I often find that its not detected, and the system slowly spits out the following message over and over: unregister_netdevice: waiting for eth0 to become free. Usage count = 1 I've tried to go through and bisect it, but apparently the issue isn't always reproducible, as I'm apparently getting lots of false negatives (where I can't always reproduce boot to boot the issue on the same kernel). I've done three bisection passes (always restarting with the "first bad commit" from the previous bisection as the initial bad commit for the following pass), and it does seem to keep moving back. But it seems much easier to trigger with newer kernels then older (and so far I've not seen it with 4.12). Wanted to see if anyone had any ideas what might be going wrong, and how I should further debug this. The last bisect log I generated was: # good: [6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c] Linux 4.12 git bisect good 6f7da290413ba713f0cdd9ff1a2a9bb129ef4f6c # bad: [98fdd857a3bd6a3bf0003d3f68f07c25c85dcde3] net: ethernet: ti: cpsw: move skb timestamp to packet_submit git bisect bad 98fdd857a3bd6a3bf0003d3f68f07c25c85dcde3 # good: [48b6bbef9a1789f0365c1a385879a1fea4460016] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net git bisect good 48b6bbef9a1789f0365c1a385879a1fea4460016 # good: [a2e8bbd2ef5457485f00b6b947bbbfa2778e5b1e] bpf: Fix test_obj_id.c for llvm 5.0 git bisect good a2e8bbd2ef5457485f00b6b947bbbfa2778e5b1e # good: [273889e306256e95ea55d5ebaef99310cf589def] Merge tag 'mlx5-updates-2017-06-16' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux git bisect good 273889e306256e95ea55d5ebaef99310cf589def # bad: [8f46d46715a12f509e13200033a1ed4d6cf335ff] cxgb4: Use Firmware params to get buffer-group map git bisect bad 8f46d46715a12f509e13200033a1ed4d6cf335ff # bad: [f5c306470ed0a8f03ba7017f397da2555b5800d4] Merge tag 'mlx5-updates-2017-06-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux git bisect bad f5c306470ed0a8f03ba7017f397da2555b5800d4 # bad: [e289ef0ded13021db292be9aef134451546e7c60] net: dsa: mv88e6xxx: clarify SMI PHY functions git bisect bad e289ef0ded13021db292be9aef134451546e7c60 # bad: [836d57e5c08e13bb206dcd559d96ee9355e8316e] liquidio: implement vlan filter enable and disable git bisect bad 836d57e5c08e13bb206dcd559d96ee9355e8316e # bad: [ad65a2f05695aced349e308193c6e2a6b1d87112] ipv6: call dst_hold_safe() properly git bisect bad ad65a2f05695aced349e308193c6e2a6b1d87112 # good: [0830106c53900181d336350581119af09e123bf3] ipv4: take dst->__refcnt when caching dst in fib git bisect good 0830106c53900181d336350581119af09e123bf3 # good: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect good b838d5e1c5b6e57b10ec8af2268824041e3ea911 # bad: [9514528d92d4cbe086499322370155ed69f5d06c] ipv6: call dst_dev_put() properly git bisect bad 9514528d92d4cbe086499322370155ed69f5d06c # good: [1cfb71eeb12047bcdbd3e6730ffed66e810a0855] ipv6: take dst->__refcnt for insertion into fib6 tree git bisect good 1cfb71eeb12047bcdbd3e6730ffed66e810a0855 # first bad commit: [9514528d92d4cbe086499322370155ed69f5d06c] ipv6: call dst_dev_put() properly But again, reverting the "ipv6: call dst_dev_put() properly" commit doesn't seem to completely resolve the issue on newer kernels (though it may make it harder to trigger), and I suspect with further bisection passes I might move further back. Ideas? I don't seem to have similar issues with USB mass storage devices, so it seems to be networking specific. thanks -john