From: Saeed Mahameed <sa...@kernel.org>
Date: Wed, 23 Sep 2020 15:42:17 -0700

> Maybe we need to clear IFF_UP before calling ops->ndo_stop(dev),
> instead of after on __dev_close_many(). Assuming no driver is checking
> IFF_UP state on its own ndo_stop(), other than this, the order
> shouldn't really matter, since clearing the flag and calling ndo_stop()
> should be considered as one atomic operation.

This is my biggest concern, that some ndo_stop, or some helper called
by ndo_stop, checks IFF_UP or similar.

There is also something else. We have both synchronous and async code
that checks state like IFF_UP and 'present' and makes a decision based
upon that.

If an async code path tests 'present', gets true, and then the
RTNL-holding synchronous code path puts the device into D3hot
immediately afterwards, the async code path will still continue and
access the chip's registers and fault.

I'm saying all of this because the only way this bug makes sense is if
the ->ndo_stop() sequence that marks the device !present and then
clears IFF_UP runs with the RTNL mutex held, and the code path that
tests this state in the linkwatch bits in question does not.
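
To make the window concrete, here is a rough user-space sketch of the
check-then-act race described above. It is not kernel code: struct
fake_dev, fake_dev_close(), async_poll_unlocked() and
async_poll_locked() are hypothetical names invented for illustration, a
pthread mutex stands in for the RTNL mutex, and a -1 return stands in
for the register access fault. The 'window' callback exists only so the
bad interleaving can be reproduced deterministically.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct fake_dev {
	pthread_mutex_t lock;	/* stands in for the RTNL mutex */
	bool present;		/* stands in for the 'present' state */
	bool up;		/* stands in for IFF_UP */
	bool powered;		/* false once the chip is in D3hot */
};

static void fake_dev_init(struct fake_dev *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->present = true;
	d->up = true;
	d->powered = true;
}

/* Synchronous close path: every state change happens under the lock. */
static void fake_dev_close(struct fake_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->present = false;	/* mark !present ... */
	d->powered = false;	/* ... put the chip into D3hot ... */
	d->up = false;		/* ... then clear the "up" flag */
	pthread_mutex_unlock(&d->lock);
}

/* Register read: 0 on success, -1 models the fault on a powered-down chip. */
static int fake_read_reg(const struct fake_dev *d)
{
	return d->powered ? 0 : -1;
}

/*
 * Racy async path: tests 'present' without the lock, then acts on it.
 * 'window' (may be NULL) runs between the test and the access.
 * Returns 1 if the access was skipped, else the result of the read.
 */
static int async_poll_unlocked(struct fake_dev *d,
			       void (*window)(struct fake_dev *))
{
	if (!d->present)
		return 1;
	if (window)
		window(d);	/* the close path slips in here */
	return fake_read_reg(d);
}

/* Same check, but test and access happen under the same lock as close. */
static int async_poll_locked(struct fake_dev *d)
{
	int ret = 1;

	pthread_mutex_lock(&d->lock);
	if (d->present) {
		/* Close cannot run while we hold the lock. */
		assert(d->powered);
		ret = fake_read_reg(d);
	}
	pthread_mutex_unlock(&d->lock);
	return ret;
}
```

Calling async_poll_unlocked(&d, fake_dev_close) on a freshly
initialised device returns -1: the check saw 'present', the close ran
in the window, and the read hit a powered-down chip. The locked variant
either completes the read before the close can start or sees !present
and skips the access. Whether taking the RTNL in the async path is
acceptable for the real linkwatch code is a separate question that this
sketch does not answer.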