Hi All,

On Sat, 11 Oct 2025 08:11:42 -0700 Manivannan Sadhasivam <[email protected]> wrote:
> On Sat, Oct 11, 2025 at 07:25:26AM +0200, Lukas Wunner wrote:
> > [cc += Mani]
> >
> > On Sat, Oct 11, 2025 at 07:12:49AM +0200, Christian Zigotzky wrote:
> > > On 09 October 2025 at 07:37 am, Lukas Wunner wrote:
> > > > On Thu, Oct 09, 2025 at 06:54:58AM +0200, Christian Zigotzky wrote:
> > > > > On 08 October 2025 at 09:51 pm, Bjorn Helgaas wrote:
> > > > > > On Wed, Oct 08, 2025 at 06:35:42PM +0200, Christian Zigotzky wrote:
> > > > > >
> > > > > > > Our PPC boards [1] have boot problems since the
> > > > > > > pci-v6.18-changes. [2]
> > > > > > >
> > > > > > > Without the pci-v6.18-changes, the PPC boards boot without any
> > > > > > > problems.
> > > > > > >
> > > > > > > Boot log with error messages:
> > > > > > > https://github.com/user-attachments/files/22782016/Kernel_6.18_with_PCI_changes.log
> > > > > > >
> > > > > > > Further information:
> > > > > > > https://github.com/chzigotzky/kernels/issues/17
> > > > > > Do you happen to have a similar log from a recent working kernel,
> > > > > > e.g., v6.17, that we could compare with?
> > > > > Thanks for your answer. Here is a similar log from the kernel 6.17.0:
> > > > > https://github.com/user-attachments/files/22789946/Kernel_6.17.0_Cyrus_Plus_board_P5040.log
> > > > >
> > > > These lines are added in v6.18:
> > > >
> > > > pci 0000:01:00.0: ASPM: DT platform, enabling L0s-up L0s-dw L1 ASPM-L1.1 ASPM-L1.2 PCI-PM-L1.1 PCI-PM-L1.2
> > > > pci 0000:01:00.0: ASPM: DT platform, enabling ClockPM
> > > > pci 0001:01:00.0: ASPM: DT platform, enabling L0s-up L0s-dw L1 ASPM-L1.1 ASPM-L1.2 PCI-PM-L1.1 PCI-PM-L1.2
> > > > pci 0001:01:00.0: ASPM: DT platform, enabling ClockPM
> > > > pci 0001:03:00.0: ASPM: DT platform, enabling L0s-up L0s-dw L1 ASPM-L1.1 ASPM-L1.2 PCI-PM-L1.1 PCI-PM-L1.2
> > > > pci 0001:03:00.0: ASPM: DT platform, enabling ClockPM
> > > >
> > > > Possible candidate:
> > > >
> > > > f3ac2ff14834 ("PCI/ASPM: Enable all ClockPM and ASPM states for devicetree platforms")
> > >
> > > After reverting the commit f3ac2ff14834, the kernel boots without any
> > > problems.
> > >
> > > f3ac2ff14834 ("PCI/ASPM: Enable all ClockPM and ASPM states for devicetree
> > > platforms") is the bad commit.
> >
> > Hi Mani, your commit f3ac2ff14834 is causing a regression on certain
> > powerpc machines. Any ideas?
> >
>
> Hi Lukas,
>
> Thanks for looping me in. The referenced commit forcefully enables ASPM on all
> DT platforms as we decided to bite the bullet finally.
>
> Looks like the device (0000:01:00.0) doesn't play nice with ASPM even though it
> advertises ASPM capability.
>
> Christian, could you please test the below change and see if it fixes the
> issue?
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 214ed060ca1b..e006b0560b39 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -2525,6 +2525,15 @@ static void quirk_disable_aspm_l0s_l1(struct pci_dev *dev)
>   */
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ASMEDIA, 0x1080, quirk_disable_aspm_l0s_l1);
>
> +
> +static void quirk_disable_aspm_all(struct pci_dev *dev)
> +{
> +	pci_info(dev, "Disabling ASPM\n");
> +	pci_disable_link_state(dev, PCIE_LINK_STATE_ALL);
> +}
> +
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6738, quirk_disable_aspm_all);
> +
>  /*
>   * Some Pericom PCIe-to-PCI bridges in reverse mode need the PCIe Retrain
>   * Link bit cleared after starting the link retrain process to allow this
>
>
> Going forward, we should be quirking the devices if they behave erratically.
>
> - Mani
>

I also observed issues with commit f3ac2ff14834 ("PCI/ASPM: Enable all
ClockPM and ASPM states for devicetree platforms").

My system is an ARM board (Marvell Armada 3720 DDB):
  https://elixir.bootlin.com/linux/v6.17.1/source/arch/arm64/boot/dts/marvell/armada-3720-db.dts

I use a LAN966x PCI board:
  https://elixir.bootlin.com/linux/v6.17.1/source/drivers/misc/lan966x_pci.c

Usually, when I ping through the PCI board, I get roughly the following
timings:

  # ping 192.168.32.100
  PING 192.168.32.100 (192.168.32.100): 56 data bytes
  64 bytes from 192.168.32.100: seq=0 ttl=64 time=3.328 ms
  64 bytes from 192.168.32.100: seq=1 ttl=64 time=2.636 ms
  64 bytes from 192.168.32.100: seq=2 ttl=64 time=2.928 ms
  64 bytes from 192.168.32.100: seq=3 ttl=64 time=2.649 ms

But with a vanilla v6.18-rc1 kernel, those timings degrade by more than
two orders of magnitude:

  # ping 192.168.32.100
  PING 192.168.32.100 (192.168.32.100): 56 data bytes
  64 bytes from 192.168.32.100: seq=0 ttl=64 time=656.634 ms
  64 bytes from 192.168.32.100: seq=1 ttl=64 time=551.812 ms
  64 bytes from 192.168.32.100: seq=2 ttl=64 time=702.966 ms
  64 bytes from 192.168.32.100: seq=3 ttl=64 time=725.904 ms

Reverting commit f3ac2ff14834 ("PCI/ASPM: Enable all ClockPM and ASPM
states for devicetree platforms") fixes my timing issues.

I also tried the quirk proposed in this discussion (quirk_disable_aspm_all),
and it fixes the timing issue as well.

I used the same PCI board on an x86 system and observed no timing issues.

I am not sure the quirk_disable_aspm_all quirk is the right solution. The
issue could be at the PCIe controller level rather than in the PCIe device.

What would be the best solution? Is something missing on device-tree based
systems for commit f3ac2ff14834 to be applied without regressions?

Best regards,
Hervé
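
P.S. To help tell a controller-side problem from a device-side one, it may be useful to compare the Link Control register on both ends of the link (root port and endpoint) on the Armada board and on the x86 system. Below is a minimal userspace sketch, not part of any proposed patch, that decodes the ASPM and ClockPM bits of a raw LnkCtl value. The bit values match the PCI_EXP_LNKCTL_* definitions in include/uapi/linux/pci_regs.h; the helper names aspm_name() and clkpm_enabled() are hypothetical and only serve this illustration.

```c
#include <stdio.h>

/* PCIe Link Control register bits (same values as PCI_EXP_LNKCTL_ASPM_L0S,
 * PCI_EXP_LNKCTL_ASPM_L1 and PCI_EXP_LNKCTL_CLKREQ_EN in pci_regs.h). */
#define LNKCTL_ASPM_L0S  0x0001u /* L0s entry enabled */
#define LNKCTL_ASPM_L1   0x0002u /* L1 entry enabled */
#define LNKCTL_CLKREQ_EN 0x0100u /* Clock Power Management enabled */

/* Hypothetical helper: name the ASPM states enabled in a LnkCtl value. */
static const char *aspm_name(unsigned int lnkctl)
{
	switch (lnkctl & (LNKCTL_ASPM_L0S | LNKCTL_ASPM_L1)) {
	case 0:
		return "disabled";
	case LNKCTL_ASPM_L0S:
		return "L0s";
	case LNKCTL_ASPM_L1:
		return "L1";
	default:
		return "L0s L1";
	}
}

/* Hypothetical helper: 1 if Clock Power Management is enabled, else 0. */
static int clkpm_enabled(unsigned int lnkctl)
{
	return (lnkctl & LNKCTL_CLKREQ_EN) != 0;
}

/* Print a one-line summary for one end of the link. */
static void print_lnkctl(const char *who, unsigned int lnkctl)
{
	printf("%s: LnkCtl 0x%04x, ASPM %s, ClockPM %s\n", who, lnkctl,
	       aspm_name(lnkctl), clkpm_enabled(lnkctl) ? "on" : "off");
}
```

The raw value can be read with something like "setpci -s <bdf> CAP_EXP+10.w" (Link Control sits at offset 0x10 of the PCI Express capability), or the same information can be taken from the LnkCtl line of "lspci -vv". If the endpoint reports identical settings on both systems and only the Armada side misbehaves, that would point towards the controller rather than the device, although this is only an assumption to be checked.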
