On 08/31/2017 11:12 AM, Mason wrote: > On 31/08/2017 19:53, Florian Fainelli wrote: >> On 08/31/2017 10:49 AM, Mason wrote: >>> On 31/08/2017 18:57, Florian Fainelli wrote: >>>> And the race is between phy_detach() setting phydev->attached_dev = NULL >>>> and phy_state_machine() running in PHY_HALTED state and calling >>>> netif_carrier_off(). >>> >>> I must be missing something. >>> (Since a thread cannot race against itself.) >>> >>> phy_disconnect calls phy_stop_machine which >>> 1) stops the work queue from running in a separate thread >>> 2) calls phy_state_machine *synchronously* >>> which runs the PHY_HALTED case with everything well-defined >>> end of phy_stop_machine >>> >>> phy_disconnect only then calls phy_detach() >>> which makes future calls of phy_state_machine perilous. >>> >>> This all happens in the same thread, so I'm not yet >>> seeing where the race happens? >> >> The race is as described in David's earlier email, so let's recap: >> >> Thread 1 Thread 2 >> phy_disconnect() >> phy_stop_interrupts() >> phy_stop_machine() >> phy_state_machine() >> -> queue_delayed_work() >> phy_detach() >> phy_state_machine() >> -> netif_carrier_off() >> >> If phy_detach() finishes earlier than the workqueue had a chance to be >> scheduled and process PHY_HALTED again, then we trigger the NULL pointer >> de-reference. >> >> workqueues are not tasklets, the CPU scheduling them gets no guarantee >> they will run on the same CPU. > > Something does not add up. > > The synchronous call to phy_state_machine() does: > > case PHY_HALTED: > if (phydev->link) { > phydev->link = 0; > netif_carrier_off(phydev->attached_dev); > phy_adjust_link(phydev); > do_suspend = true; > } > > then sets phydev->link = 0; therefore subsequent calls to > phy_state_machin() will be no-op.
Actually you are right, once phydev->link is set to 0 these would become no-ops. Still scratching my head as to what happens for David then... > > Also, queue_delayed_work() is only called in polling mode. > David stated that he's using interrupt mode. Right that's confusing too now. David can you check if you tree has: 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state correctly in phy_stop_machine") -- Florian