On 06/09/2017 20:00, David Daney wrote: > On 08/31/2017 11:29 AM, Florian Fainelli wrote: >> On 08/31/2017 11:12 AM, Mason wrote: >>> On 31/08/2017 19:53, Florian Fainelli wrote: >>>> On 08/31/2017 10:49 AM, Mason wrote: >>>>> On 31/08/2017 18:57, Florian Fainelli wrote: >>>>>> And the race is between phy_detach() setting phydev->attached_dev = NULL >>>>>> and phy_state_machine() running in PHY_HALTED state and calling >>>>>> netif_carrier_off(). >>>>> >>>>> I must be missing something. >>>>> (Since a thread cannot race against itself.) >>>>> >>>>> phy_disconnect calls phy_stop_machine which >>>>> 1) stops the work queue from running in a separate thread >>>>> 2) calls phy_state_machine *synchronously* >>>>> which runs the PHY_HALTED case with everything well-defined >>>>> end of phy_stop_machine >>>>> >>>>> phy_disconnect only then calls phy_detach() >>>>> which makes future calls of phy_state_machine perilous. >>>>> >>>>> This all happens in the same thread, so I'm not yet >>>>> seeing where the race happens? >>>> >>>> The race is as described in David's earlier email, so let's recap: >>>> >>>> Thread 1 Thread 2 >>>> phy_disconnect() >>>> phy_stop_interrupts() >>>> phy_stop_machine() >>>> phy_state_machine() >>>> -> queue_delayed_work() >>>> phy_detach() >>>> phy_state_machine() >>>> -> netif_carrier_off() >>>> >>>> If phy_detach() finishes earlier than the workqueue had a chance to be >>>> scheduled and process PHY_HALTED again, then we trigger the NULL pointer >>>> de-reference. >>>> >>>> workqueues are not tasklets, the CPU scheduling them gets no guarantee >>>> they will run on the same CPU. >>> >>> Something does not add up. >>> >>> The synchronous call to phy_state_machine() does: >>> >>> case PHY_HALTED: >>> if (phydev->link) { >>> phydev->link = 0; >>> netif_carrier_off(phydev->attached_dev); >>> phy_adjust_link(phydev); >>> do_suspend = true; >>> } >>> >>> then sets phydev->link = 0; therefore subsequent calls to >>> phy_state_machin() will be no-op. >> >> Actually you are right, once phydev->link is set to 0 these would become >> no-ops. Still scratching my head as to what happens for David then... >> >>> >>> Also, queue_delayed_work() is only called in polling mode. >>> David stated that he's using interrupt mode. > > Did you see what I wrote? > > phy_disconnect() calls phy_stop_interrupts() which puts it into polling > mode. So the polling work gets queued unconditionally.
I did address that remark in https://www.mail-archive.com/netdev@vger.kernel.org/msg186336.html int phy_stop_interrupts(struct phy_device *phydev) { int err = phy_disable_interrupts(phydev); if (err) phy_error(phydev); free_irq(phydev->irq, phydev); /* If work indeed has been cancelled, disable_irq() will have * been left unbalanced from phy_interrupt() and enable_irq() * has to be called so that other devices on the line work. */ while (atomic_dec_return(&phydev->irq_disable) >= 0) enable_irq(phydev->irq); return err; } Which part of this function changes phydev->irq to PHY_POLL? Perhaps phydev->drv->config_intr? What PHY are you using? Regards.