On Wed, 2020-09-23 at 17:23 -0700, David Miller wrote: > From: David Miller <da...@davemloft.net> > Date: Wed, 23 Sep 2020 17:21:25 -0700 (PDT) > > > If an async code path tests 'present', gets true, and then the RTNL > > holding synchronous code path puts the device into D3hot > immediately > > afterwards, the async code path will still continue and access the > > chips registers and fault. > > Wait, is the sequence: > > ->ndo_stop() > mark device not present and put into D3hot > triggers linkwatch event > ... > ->ndo_get_stats64() > > ??? >
I assume it is, since normally device drivers do carrier_off() on ndo_stop() 1) One problematic sequence would be (for drivers doing D3hot on ndo_stop()) __dev_close_many() ->ndo_stop() netif_device_detach() //Mark !present; ... D3hot carrier_off()->linkwatch_event() ... // !present && IFF_UP 2) Another problematic scenario which i see is repeated in many drivers: shutdown/suspend() rtnl_lock() netif_device_detach()//Mark !present; stop()->carrier_off()->linkwatch_event() // at this point device is still IFF_UP and !present // due to the early detach above.. rtnl_unlock(); For scenario 1) we can fix by marking IFF_UP at the beginning, but for 2), i think we need to fix the drivers to detach only after stop :( > Then yeah we might have to clear IFF_UP at the beginning of taking > a netdev down.