Hi Dima, Paul, thank you both for the detailed review and for looping in the PCI folks.
On 10/02/26, Dima Ruinskiy wrote: > The use of usleep_range() means you will get a "scheduling while > atomic" bug each time a register is read in an atomic context. You are absolutely right. I overlooked that igc_rd32() is frequently called from atomic context (e.g. interrupt handlers and other non-sleepable paths), so using usleep_range() or msleep() there is incorrect and could lead to scheduling-while-atomic problems. That means the current implementation is not suitable and would need reworking. > I can understand the read-retry mechanism as a way to cope with > transient failures. We've implemented something similar for PHY > accesses in the e1000e driver. However, full PCIe link retraining feels > like too heavy a tool inside a register read routine. I understand your concern. In hindsight, doing a full link retrain with long waits directly from igc_rd32() is indeed too heavy for this hot path. igc_rd32() sits on a very hot and latency-sensitive path, so any recovery logic there must remain strictly non-sleeping and minimal. If I revisit this, I will focus on a much simpler, retry-only approach similar to the e1000e PHY access retries instead of trying to drive PCIe link retraining from the register read routine. > Maybe we should first understand the scope of the problem we're dealing > with here - does the issue with L0s transitions on I225/I226 affect > specific units/systems? My motivation came from Bug 221048, where users report “PCIe link lost, device now detached” on I225/I226 after a day or two of uptime and mention ASPM/L0s as a likely trigger. However, I currently do not have access to I225/I226 hardware myself, so I cannot characterize how widespread this really is or verify the behaviour across platforms. Given that limitation, I agree that pushing a generic fix into igc_rd32() is premature. On 11/02/26, Paul Menzel wrote: > Just a note to please version your patches. > `git format-patch` has the switch `--reroll-count` or `-v` to do this. Thanks for the pointer; for any future submission I will use `git format-patch -v2` (or higher) and keep unrelated/cosmetic changes in separate patches. > > 1. Immediate retries: 3 attempts with 100-200μs delays > Why were these delays chosen? At this point they are only conservative estimates. The idea was to give the link a very short window to recover from a transient ASPM-related glitch without introducing millisecond level stalls, so I picked a 100–200 μs range per retry and limited it to three attempts. Without measurements on real I225/I226 hardware, I agree that these timings are not well enough justified for production code. > It’d be great if you added the test setup and case. This is a gap on my side, I currently do not have I225/I226 hardware available, so I cannot yet provide a proper, reproducible test setup or results. Given that limitation, it makes sense not to push this change further until I can actually reproduce the issue and validate any proposed fix on real systems. > + /* Give the link some additional time to > + * recover on its own */ > + msleep(100); > That’s quite a long delay. Is that according to some standard? The 100 ms delay was chosen by analogy with PCIe reset and recovery paths where code often waits on the order of tens to hundreds of milliseconds for DLLLA and link training to complete after a reset. The intention was to follow that precedent rather than choose a much shorter timeout that might race with link training, but I understand this is a significant delay and not appropriate in the igc_rd32() path. > The idea looks good, but maybe there is something in Linux’ PCIe core to > achieve something similar? Thank you for the hint. I will look into the PCIe core error-recovery and link-retrain helpers; if a solution ultimately needs a retrain, it should be built on that infrastructure instead of open-coding LNKCTL.RL handling inside the driver. Given the issues raised particularly the atomic-context constraints, the heaviness of link retraining in a register read path, the lack of well-justified timing data, and my current lack of access to I225/I226 hardware, I agree that this version of the patch should be dropped. If I am able to obtain access to suitable hardware and reproduce the ASPM/L0s behaviour, I would like to revisit this with a smaller and better-scoped proposal (likely retry-only or PCIe-core-assisted), along with a documented and reproducible test setup. I would very much appreciate your feedback again at that point. Thank you again for the detailed review and guidance — this discussion has been very helpful for me in understanding the expectations around changes in this area. Best regards, Harshank Matkar ________________________________________ From: Paul Menzel <[email protected]> Sent: 11 February 2026 20:30 To: Harshank Matkar Cc: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Bjorn Helgaas; [email protected] Subject: Re: [Intel-wired-lan] [PATCH] igc: Add PCIe link recovery for I225/I226 [Cc: +Linux PCI folks] Dear Harshank, Thank you for your patch. Just a note to please version your patches. `git format-patch` has the switch `--reroll-count` or `-v` to do this. Am 10.02.26 um 21:34 schrieb Harshank Matkar: > From: Harshank Matkar <[email protected]> > > When ASPM L0s transitions occur on Intel I225/I226 controllers, > transient PCIe link instability can cause register read failures > (0xFFFFFFFF responses). Implement a multi-layer recovery strategy: > > 1. Immediate retries: 3 attempts with 100-200μs delays Why were these delays chosen? > 2. Link retraining: Trigger PCIe link retraining via capabilities > 3. Device detachment: Only as last resort after max attempts > > The recovery mechanism includes rate limiting, maximum attempt > tracking, and device presence validation to prevent false detaches > on transient ASPM glitches while maintaining safety through > bounded retry limits. It’d be great if you added the test setup and case. > Signed-off-by: Harshank Matkar <[email protected]> > --- > drivers/net/ethernet/intel/igc/igc.h | 6 +- > drivers/net/ethernet/intel/igc/igc_main.c | 155 ++++++++++++++++++++-- > 2 files changed, 149 insertions(+), 12 deletions(-) > > diff --git a/drivers/net/ethernet/intel/igc/igc.h > b/drivers/net/ethernet/intel/igc/igc.h > index a427f05814c1..2ef488b279b9 100644 > --- a/drivers/net/ethernet/intel/igc/igc.h > +++ b/drivers/net/ethernet/intel/igc/igc.h > @@ -346,6 +346,10 @@ struct igc_adapter { > struct mutex led_mutex; > struct igc_led_classdev *leds; > bool leds_available; > + > + /* PCIe recovery tracking */ > + unsigned int pcie_recovery_attempts; > + unsigned long last_recovery_time; > }; > > void igc_up(struct igc_adapter *adapter); > @@ -422,7 +426,7 @@ enum igc_rss_type_num { > IGC_RSS_TYPE_MAX = 10, > }; > #define IGC_RSS_TYPE_MAX_TABLE 16 > -#define IGC_RSS_TYPE_MASK GENMASK(3,0) /* 4-bits (3:0) = mask > 0x0F */ > +#define IGC_RSS_TYPE_MASK GENMASK(3, 0) /* 4-bits (3:0) = mask > 0x0F */ > > /* igc_rss_type - Rx descriptor RSS type field */ > static inline u32 igc_rss_type(const union igc_adv_rx_desc *rx_desc) > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c > b/drivers/net/ethernet/intel/igc/igc_main.c > index 89a321a344d2..f0daa3edbb79 100644 > --- a/drivers/net/ethernet/intel/igc/igc_main.c > +++ b/drivers/net/ethernet/intel/igc/igc_main.c > @@ -1411,9 +1411,8 @@ static int igc_tx_map(struct igc_ring *tx_ring, > /* Make sure there is space in the ring for the next send. */ > igc_maybe_stop_tx(tx_ring, DESC_NEEDED); > > - if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more()) { > + if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more()) > writel(i, tx_ring->tail); > - } > > return 0; > dma_error: > @@ -1613,8 +1612,7 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff > *skb, > * otherwise try next time > */ > for (f = 0; f < skb_shinfo(skb)->nr_frags; f++) > - count += TXD_USE_COUNT(skb_frag_size( > - &skb_shinfo(skb)->frags[f])); > + count += > TXD_USE_COUNT(skb_frag_size(&skb_shinfo(skb)->frags[f])); Unrelated. > > if (igc_maybe_stop_tx(tx_ring, count + 5)) { > /* this is a hard error */ > @@ -2524,7 +2522,6 @@ static int __igc_xdp_run_prog(struct igc_adapter > *adapter, > if (xdp_do_redirect(adapter->netdev, xdp, prog) < 0) > goto out_failure; > return IGC_XDP_REDIRECT; > - break; > default: > bpf_warn_invalid_xdp_action(adapter->netdev, prog, act); > fallthrough; > @@ -2791,7 +2788,7 @@ static struct igc_xdp_buff *xsk_buff_to_igc_ctx(struct > xdp_buff *xdp) > * igc_xdp_buff shares its layout with xdp_buff_xsk and private > * igc_xdp_buff fields fall into xdp_buff_xsk->cb > */ > - return (struct igc_xdp_buff *)xdp; > + return (struct igc_xdp_buff *)xdp; Correct, but should be a separate patch. > } > > static int igc_clean_rx_irq_zc(struct igc_q_vector *q_vector, const int > budget) > @@ -3895,9 +3892,8 @@ static int igc_enable_nfc_rule(struct igc_adapter > *adapter, > { > int err; > > - if (rule->flex) { > + if (rule->flex) > return igc_add_flex_filter(adapter, rule); > - } Correct, but unrelated and I think cosmetic changes are not wanted. > > if (rule->filter.match_flags & IGC_FILTER_FLAG_ETHER_TYPE) { > err = igc_add_etype_filter(adapter, rule->filter.etype, > @@ -6984,11 +6980,112 @@ static const struct net_device_ops igc_netdev_ops = { > .ndo_hwtstamp_set = igc_ptp_hwtstamp_set, > }; > > +#define IGC_REGISTER_READ_RETRIES 3 > +#define IGC_PCIE_RECOVERY_MAX_ATTEMPTS 10 > +#define IGC_PCIE_RECOVERY_INTERVAL_MS 1000 > + > +/** > + * igc_pcie_link_recover - Attempt PCIe link recovery > + * @adapter: board private structure > + * > + * Attempts to recover a failed PCIe link by triggering a link retrain. > + * Rate-limited to 1 attempt per second, maximum 10 attempts. > + * > + * Returns true if recovery was successful, false otherwise. > + */ > +static bool igc_pcie_link_recover(struct igc_adapter *adapter) > +{ > + struct pci_dev *pdev = adapter->pdev; > + unsigned long now = jiffies; > + u16 lnksta, lnkctl; > + int ret; > + int i; > + > + /* Rate limiting: no more than 1 attempt per second */ > + if (time_before(now, adapter->last_recovery_time + > + msecs_to_jiffies(IGC_PCIE_RECOVERY_INTERVAL_MS))) > + return false; > + > + /* Maximum attempt limit */ > + if (adapter->pcie_recovery_attempts >= IGC_PCIE_RECOVERY_MAX_ATTEMPTS) { > + netdev_err(adapter->netdev, > + "Exceeded maximum PCIe recovery attempts (%d)\n", > + IGC_PCIE_RECOVERY_MAX_ATTEMPTS); > + return false; > + } > + > + adapter->last_recovery_time = now; > + adapter->pcie_recovery_attempts++; > + > + netdev_warn(adapter->netdev, > + "Attempting PCIe link recovery (attempt %d/%d)\n", > + adapter->pcie_recovery_attempts, > + IGC_PCIE_RECOVERY_MAX_ATTEMPTS); > + > + /* Check if device is still present on the bus */ > + if (!pci_device_is_present(pdev)) { > + netdev_err(adapter->netdev, "Device not present on PCIe bus\n"); > + return false; > + } > + > + /* Check link status */ > + ret = pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta); > + if (ret) { > + netdev_err(adapter->netdev, "Failed to read link status\n"); > + return false; > + } > + > + /* Trigger link retrain if link appears down */ > + if (!(lnksta & PCI_EXP_LNKSTA_DLLLA)) { > + netdev_info(adapter->netdev, > + "Link down, attempting retrain\n"); > + > + /* Read link control register */ > + ret = pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl); > + if (ret == 0) { > + /* Trigger retrain by setting RL bit */ > + pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, > + lnkctl | PCI_EXP_LNKCTL_RL); > + > + /* Wait for retrain to complete (up to 1 second) */ > + for (i = 0; i < 100; i++) { > + usleep_range(10000, 20000); > + ret = pcie_capability_read_word(pdev, > PCI_EXP_LNKSTA, > + &lnksta); > + if (ret == 0 && (lnksta & PCI_EXP_LNKSTA_DLLLA) > && > + !(lnksta & PCI_EXP_LNKSTA_LT)) { > + netdev_info(adapter->netdev, > + "PCIe link recovery > successful\n"); > + return true; > + } > + } > + } > + } > + > + /* Give the link some additional time to recover on its own */ > + msleep(100); That’s quite a long delay. Is that according to some standard? > + > + /* Check if device is responding now */ > + if (pci_device_is_present(pdev)) { > + ret = pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnksta); > + if (ret == 0 && (lnksta & PCI_EXP_LNKSTA_DLLLA)) { > + netdev_info(adapter->netdev, > + "PCIe link recovered after delay\n"); > + return true; > + } > + } > + > + netdev_warn(adapter->netdev, "PCIe link recovery failed\n"); > + return false; > +} > + > u32 igc_rd32(struct igc_hw *hw, u32 reg) > { > struct igc_adapter *igc = container_of(hw, struct igc_adapter, hw); > u8 __iomem *hw_addr = READ_ONCE(hw->hw_addr); > + struct net_device *netdev = igc->netdev; > u32 value = 0; > + int retry; > > if (IGC_REMOVED(hw_addr)) > return ~value; > @@ -6997,13 +7094,49 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg) > > /* reads should not return all F's */ > if (!(~value) && (!reg || !(~readl(hw_addr)))) { > - struct net_device *netdev = igc->netdev; > + /* Layer 1: Immediate retries with short delays (100-200μs) */ > + for (retry = 0; retry < IGC_REGISTER_READ_RETRIES; retry++) { > + usleep_range(100, 200); > + value = readl(&hw_addr[reg]); > + > + /* If we got a valid read, return immediately */ > + if (~value || (reg && ~readl(hw_addr))) { > + netdev_dbg(netdev, > + "Register read recovered after %d > retries\n", > + retry + 1); > + return value; > + } > + } > + > + /* Layer 2: Attempt full PCIe link recovery */ > + netdev_warn(netdev, > + "All immediate retries failed, attempting PCIe link > recovery\n"); > + > + if (igc_pcie_link_recover(igc)) { > + /* Recovery succeeded, re-read the register */ > + hw_addr = READ_ONCE(hw->hw_addr); > + if (hw_addr && !IGC_REMOVED(hw_addr)) { > + value = readl(&hw_addr[reg]); > + > + /* Verify the read is valid */ > + if (~value || (reg && ~readl(hw_addr))) { > + netdev_info(netdev, > + "Register read successful > after link recovery\n"); > + return value; > + } > + } > + } > + > + /* Layer 3: All recovery attempts failed, detach device */ > + netdev_err(netdev, > + "PCIe link lost after %d retries and recovery > attempts, device now detached\n", > + IGC_REGISTER_READ_RETRIES); > > hw->hw_addr = NULL; > netif_device_detach(netdev); > - netdev_err(netdev, "PCIe link lost, device now detached\n"); > + > WARN(pci_device_is_present(igc->pdev), > - "igc: Failed to read reg 0x%x!\n", reg); > + "igc: Failed to read reg 0x%x after all recovery > attempts!\n", reg); > } > > return value; The idea looks good, but maybe there is something in Linux’ PCIe core to achieve something similar? Kind regards, Paul
