On 09/06/2017 11:00 AM, David Daney wrote: > On 08/31/2017 11:29 AM, Florian Fainelli wrote: >> On 08/31/2017 11:12 AM, Mason wrote: >>> On 31/08/2017 19:53, Florian Fainelli wrote: >>>> On 08/31/2017 10:49 AM, Mason wrote: >>>>> On 31/08/2017 18:57, Florian Fainelli wrote: >>>>>> And the race is between phy_detach() setting phydev->attached_dev >>>>>> = NULL >>>>>> and phy_state_machine() running in PHY_HALTED state and calling >>>>>> netif_carrier_off(). >>>>> >>>>> I must be missing something. >>>>> (Since a thread cannot race against itself.) >>>>> >>>>> phy_disconnect calls phy_stop_machine which >>>>> 1) stops the work queue from running in a separate thread >>>>> 2) calls phy_state_machine *synchronously* >>>>> which runs the PHY_HALTED case with everything well-defined >>>>> end of phy_stop_machine >>>>> >>>>> phy_disconnect only then calls phy_detach() >>>>> which makes future calls of phy_state_machine perilous. >>>>> >>>>> This all happens in the same thread, so I'm not yet >>>>> seeing where the race happens? >>>> >>>> The race is as described in David's earlier email, so let's recap: >>>> >>>> Thread 1 Thread 2 >>>> phy_disconnect() >>>> phy_stop_interrupts() >>>> phy_stop_machine() >>>> phy_state_machine() >>>> -> queue_delayed_work() >>>> phy_detach() >>>> phy_state_machine() >>>> -> netif_carrier_off() >>>> >>>> If phy_detach() finishes earlier than the workqueue had a chance to be >>>> scheduled and process PHY_HALTED again, then we trigger the NULL >>>> pointer >>>> de-reference. >>>> >>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee >>>> they will run on the same CPU. >>> >>> Something does not add up. >>> >>> The synchronous call to phy_state_machine() does: >>> >>> case PHY_HALTED: >>> if (phydev->link) { >>> phydev->link = 0; >>> netif_carrier_off(phydev->attached_dev); >>> phy_adjust_link(phydev); >>> do_suspend = true; >>> } >>> >>> then sets phydev->link = 0; therefore subsequent calls to >>> phy_state_machin() will be no-op. >> >> Actually you are right, once phydev->link is set to 0 these would become >> no-ops. Still scratching my head as to what happens for David then... >> >>> >>> Also, queue_delayed_work() is only called in polling mode. >>> David stated that he's using interrupt mode. > > Did you see what I wrote?
Still not following, see below. > > phy_disconnect() calls phy_stop_interrupts() which puts it into polling > mode. So the polling work gets queued unconditionally. What part of phy_stop_interrupts() is responsible for changing phydev->irq to PHY_POLL? free_irq() cannot touch phydev->irq otherwise subsequent request_irq() calls won't work anymore. phy_disable_interrupts() only calls back into the PHY driver to acknowledge and clear interrupts. If we were using a PHY with PHY_POLL, as Marc said, the first synchronous call to phy_state_machine() would have acted on PHY_HALTED and even if we incorrectly keep re-scheduling the state machine from PHY_HALTED to PHY_HALTED the second time around nothing can happen. What are we missing here? > > > >> >> Right that's confusing too now. David can you check if you tree has: >> >> 49d52e8108a21749dc2114b924c907db43358984 ("net: phy: handle state >> correctly in phy_stop_machine") >> > > Yes, I am using the 4.9 stable branch, and that commit was also present. Thanks for checking that. -- Florian