Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Wed, Mar 20, 2024 at 6:31 AM Christian König wrote:
> Can you provide the full output of lspci -. As far as I can see that
> doesn't look so invalid to me.

I've added the relevant pci probing debug output without assign-busses
and the lspci - for a boot with all devices visible.
https://gist.github.com/kkartaltepe/2f01f33c7e7af33cf0d28678e91f50fb

> Well that is just a very very old workaround for a buggy BIOS on 20 year old
> laptops. The last reference I could find for hardware which actually needed
> it is this:
>
> commit 8c4b2cf9af9b4ecc29d4f0ec4ecc8e94dc4432d7
> Author: Bernhard Kaindl
> Date: Sat Feb 18 01:36:55 2006 -0800
>
> [PATCH] PCI: PCI/Cardbus cards hidden, needs pci=assign-busses to fix
>
> So as far as I know nobody had to use that in ages and I wouldn't expect that
> this option actually works correctly on any modern hardware.
>
> Especially not anything PCIe based since it messes up the ACPI to PCIe device
> mappings. That amdgpu doesn't work is just the tip of the iceberg here.
>
> The bus assignment code in the PCI subsystem is made to support hotplug, not
> completely re-number the root hubs from scratch. That is just a hack somebody
> came up with two decades ago to get some Cardbus slots in laptops working.
>
> I'm not sure yet what's going wrong with the Thunderbolt controller, but
> completely re-assigning bus numbers is certainly the wrong approach.

I was referring to the work outlined in
https://ostconf.com/system/attachments/files/000/001/698/original/Sergei_Miroshnichenko_linux_piter_2019_presentation.pdf?1570136708
for nvme enclosures. That may be referencing movable BARs more than the
renumbering that occurs with assign-busses, and it is also on Power with
device trees, which may behave differently, but it mentions assign-busses
as the way to get this same renumbering of buses.

This makes me think at least modern non-x86 devices expect to behave this
way, which may not be relevant to ACPI/x86 systems, but at least this
shared pci code should be solid.

> I'm not sure yet what's going wrong with the Thunderbolt controller, but
> completely re-assigning bus numbers is certainly the wrong approach.

I agree, it is just what is currently available in the kernel. A less
disruptive approach seems needed.

--Kurt Kartaltepe
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On 19.03.24 at 16:04, Kurt Kartaltepe wrote:
> On Tue, Mar 19, 2024 at 2:54 AM Christian König wrote:
> > Well what problems do you run into? The ACPI and BIOS assignments
> > usually work much better than whatever the Linux PCI subsystem comes
> > up with.
>
> Perhaps it's easier to show the lspci output for the BIOS assignment
> and we can agree it's far from helpful
>
> +-04.1-[64-c3]00.0-[65-68]--+-01.0-[66]00.0-[67]00.0 Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
> |                           +-02.0-[67]--
> |                           \-04.0-[68]--
>
> In this case the bios has assigned the upstream port 65-68, for its 3
> downstreams 66,67,68, and then assigned the upstream port of the
> device's own bridge to 67. In this case not only did BIOS produce an
> invalid topology but it also does not provide any space at the first
> upstream or downstream ports which the current PCI implementation
> would require to assign bus numbers if I understand it correctly.

Can you provide the full output of lspci -. As far as I can see that
doesn't look so invalid to me.

> > The PCI subsystem in the Linux kernel for example can't handle back
> > to back resources behind multiple downstream bridges.
> >
> > So when the BIOS fails to assign something it's extremely unlikely
> > that the Linux kernel will do the right thing either.
>
> I'm not sure this is still the case, the PCI subsystem with realloc
> (and assign-busses for x86) deals with enumerating this topology which
> reports multiple bridges just fine.

Well that is just a very very old workaround for a buggy BIOS on 20 year
old laptops. The last reference I could find for hardware which actually
needed it is this:

commit 8c4b2cf9af9b4ecc29d4f0ec4ecc8e94dc4432d7
Author: Bernhard Kaindl
Date: Sat Feb 18 01:36:55 2006 -0800

[PATCH] PCI: PCI/Cardbus cards hidden, needs pci=assign-busses to fix

So as far as I know nobody had to use that in ages and I wouldn't expect
that this option actually works correctly on any modern hardware.

Especially not anything PCIe based since it messes up the ACPI to PCIe
device mappings. That amdgpu doesn't work is just the tip of the iceberg
here.

> The same configuration as above produces this bus numbering (with
> hpbussize=20)
>
> +-04.1-[24-66]00.0-[25-66]--+-01.0-[26-45]00.0-[27-29]--+-01.0-[28]00.0 Intel Corporation DG2 [Arc A750]
> |                           |                           \-04.0-[29]00.0 Intel Corporation DG2 Audio Controller
> |                           +-02.0-[46]00.0 Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
> |                           \-04.0-[47-66]--
>
> The Linux kernel doesn't do the right thing without these features,
> and these are not the default. So you may be right that by default it
> does not recover from this situation well.
>
> Given the bus allocation at the root port I can imagine a more
> aggressive than default but less aggressive than `assign-busses`
> reallocation scheme could deal with both preserving root allocations
> like the APU and renumbering things behind upstream ports. That might
> be a better approach than renumbering even the root bus devices.

The bus assignment code in the PCI subsystem is made to support hotplug,
not completely re-number the root hubs from scratch. That is just a hack
somebody came up with two decades ago to get some Cardbus slots in
laptops working.

I'm not sure yet what's going wrong with the Thunderbolt controller, but
completely re-assigning bus numbers is certainly the wrong approach.

Regards,
Christian.

> > Regards,
> > Christian.
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Tue, Mar 19, 2024 at 2:54 AM Christian König wrote:
>
> Well what problems do you run into? The ACPI and BIOS assignments
> usually work much better than whatever the Linux PCI subsystem comes up
> with.

Perhaps it's easier to show the lspci output for the BIOS assignment and
we can agree it's far from helpful

+-04.1-[64-c3]00.0-[65-68]--+-01.0-[66]00.0-[67]00.0 Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
|                           +-02.0-[67]--
|                           \-04.0-[68]--

In this case the bios has assigned the upstream port 65-68, for its 3
downstreams 66,67,68, and then assigned the upstream port of the device's
own bridge to 67. In this case not only did BIOS produce an invalid
topology but it also does not provide any space at the first upstream or
downstream ports which the current PCI implementation would require to
assign bus numbers if I understand it correctly.

> The PCI subsystem in the Linux kernel for example can't handle back to
> back resources behind multiple downstream bridges.
>
> So when the BIOS fails to assign something it's extremely unlikely that
> the Linux kernel will do the right thing either.

I'm not sure this is still the case. The PCI subsystem with realloc (and
assign-busses for x86) deals with enumerating this topology, which
reports multiple bridges, just fine.

The same configuration as above produces this bus numbering (with
hpbussize=20)

+-04.1-[24-66]00.0-[25-66]--+-01.0-[26-45]00.0-[27-29]--+-01.0-[28]00.0 Intel Corporation DG2 [Arc A750]
|                           |                           \-04.0-[29]00.0 Intel Corporation DG2 Audio Controller
|                           +-02.0-[46]00.0 Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge DD 2018]
|                           \-04.0-[47-66]--

The Linux kernel doesn't do the right thing without these features, and
these are not the default. So you may be right that by default it does
not recover from this situation well.

Given the bus allocation at the root port I can imagine a more aggressive
than default but less aggressive than `assign-busses` reallocation scheme
could deal with both preserving root allocations like the APU and
renumbering things behind upstream ports. That might be a better approach
than renumbering even the root bus devices.

> Regards,
> Christian.
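
To make the "invalid topology" point above concrete, here is a minimal userspace sketch in C that checks a flat list of bridge bus windows for the two problems described: a bridge whose window does not fit inside the window of the bridge it sits behind, and two unrelated bridges claiming the same bus number. The struct, the function names and the table are hypothetical and are not kernel code. The bus numbers are read off the BIOS-assigned lspci tree above, and the root port's primary bus of 0x00 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one PCI-to-PCI bridge's bus numbers. */
struct bridge_window {
    unsigned char primary;     /* bus the bridge itself sits on */
    unsigned char secondary;   /* first bus behind the bridge */
    unsigned char subordinate; /* last bus behind the bridge */
};

/* True if bridge b sits on a bus inside the window of bridge a. */
static bool is_behind(const struct bridge_window *b,
                      const struct bridge_window *a)
{
    return b->primary >= a->secondary && b->primary <= a->subordinate;
}

/* True if inner's window lies completely within outer's window. */
static bool nests_in(const struct bridge_window *inner,
                     const struct bridge_window *outer)
{
    return inner->secondary >= outer->secondary &&
           inner->subordinate <= outer->subordinate;
}

/* True if the two windows share at least one bus number. */
static bool overlaps(const struct bridge_window *a,
                     const struct bridge_window *b)
{
    return a->secondary <= b->subordinate && b->secondary <= a->subordinate;
}

/* Count bridge pairs that either fail to nest or collide as siblings. */
static size_t count_window_conflicts(const struct bridge_window *b, size_t n)
{
    size_t conflicts = 0;

    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (is_behind(&b[j], &b[i])) {
                if (!nests_in(&b[j], &b[i]))
                    conflicts++;
            } else if (is_behind(&b[i], &b[j])) {
                if (!nests_in(&b[i], &b[j]))
                    conflicts++;
            } else if (overlaps(&b[i], &b[j])) {
                conflicts++;
            }
        }
    }
    return conflicts;
}

/* BIOS-assigned windows from the lspci tree; root primary bus is assumed. */
static const struct bridge_window bios_topology[] = {
    { 0x00, 0x64, 0xc3 }, /* 04.1 root port (primary bus assumed) */
    { 0x64, 0x65, 0x68 }, /* upstream port */
    { 0x65, 0x66, 0x66 }, /* downstream port 01.0 */
    { 0x65, 0x67, 0x67 }, /* downstream port 02.0 */
    { 0x65, 0x68, 0x68 }, /* downstream port 04.0 */
    { 0x66, 0x67, 0x67 }, /* bridge on bus 66 in front of the USB controller */
};
```

With this table the checker reports two conflicts, both caused by the last entry: its window [67] does not fit inside the [66] window of downstream port 01.0, and it collides with the [67] window of downstream port 02.0. Whether the kernel treats this exact layout as invalid is the question being discussed in the thread; the sketch only encodes the nesting rule as described here.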
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On 19.03.24 at 02:55, Kurt Kartaltepe wrote:
> On Mon, Mar 18, 2024 at 12:57 PM Alex Deucher wrote:
> > On Mon, Mar 18, 2024 at 3:52 PM Alex Deucher wrote:
> > > ...
> > > Depends on the platform, but recent ones use VFCT. That said, there
> > > should only ever be one IGPU in the system so I think we could just
> > > rely on the VID and DID for APUs in this case and check everything
> > > for dGPUs.
> >
> > Is there a reason why you need this option? Even beyond this, I could
> > envision other problems related to APUs and ACPI if these changed.
> >
> > Alex
>
> So there are multiple factors in play. I am trying to make use of the
> lovely usb4/tb3 controllers on the 7940HS with the reportedly Intel
> Tamales Module 2 pci/pci bridge over the usb4 interface. This provides
> a handy way to expand the pcie bus but configuring ACPI and pcie
> topology isn't generally an option on consumer BIOS (unless you want
> to enlighten me).
>
> This leaves us in the situation where the bios can enumerate devices
> poorly resulting in inaccessible devices due to address conflicts. To
> resolve address conflicts the only option I'm aware of is
> pci=assign-busses. Maybe this could also be configured at runtime, but
> assign-busses seemed nice in some ways.

Well what problems do you run into? The ACPI and BIOS assignments usually
work much better than whatever the Linux PCI subsystem comes up with.

The PCI subsystem in the Linux kernel for example can't handle back to
back resources behind multiple downstream bridges.

So when the BIOS fails to assign something it's extremely unlikely that
the Linux kernel will do the right thing either.

Regards,
Christian.

> I haven't experienced any issues with the APU (graphics, hardware
> encoders/decoders) but I do think assign-busses might be renumbering
> again after suspend/resume/pci rescans but I need to debug further,
> maybe suspend/resume are just broken when ACPI addresses are wrong.
> Obviously the graphics user space (compositors, mesa might be working
> as expected) don't handle the device switching addresses while in use,
> for amdgpu kernel side I haven't inspected deeply yet.
>
> I'm not sure if this is the right approach to solving the problem, and
> given your input I'm considering it may be better, though not
> upstreamable, to implement renumbering only for specified devices like
> this pci bridge or investigate runtime management of the pci bus
> addresses. The current assign-busses implementation is quite the big
> hammer admittedly.
>
> --Kurt Kartaltepe
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Mon, Mar 18, 2024 at 8:42 AM Alex Deucher wrote:
>
> On Mon, Mar 18, 2024 at 10:19 AM Kurt Kartaltepe wrote:
> >
> > On Mon, Mar 18, 2024 at 6:37 AM Alex Deucher wrote:
> > >
> > > On Mon, Mar 18, 2024 at 4:47 AM Kurt Kartaltepe wrote:
> > > >
> > > > These checks prevent using amdgpu with the pci=assign-busses parameter
> > > > which will re-address devices from their acpi values.
> > > >
> > > > Signed-off-by: Kurt Kartaltepe
> > >
> > > This will likely break multi-GPU functionality. The BDF values are
> > > how the sbios/driver differentiates between the VFCT images. If you
> > > have multiple GPUs in the system, the driver won't be able to figure
> > > out which one goes with which GPU and you may end up assigning the
> > > wrong image to the wrong device.
> > >
> > > Alex
> >
> > The vendor and device portions must be correct in the existing
> > kernels, so device type differentiation should already work without
> > BDF values.
> >
> > So does that mean the concern is images are different for devices with
> > the same vendor:device pairs? There are sites out there dedicated to
> > dumping AMD's video roms which seem to suggest all discrete devices
> > would be fine loading the same rom. Is there another platform you are
> > thinking of where devices with the same vendor:device values would
> > need different images?
>
> That is incorrect. The vbios images are board specific. Using the
> wrong image can cause a lot of problems. The vbios exists to handle
> board specific design variations (e.g., the number and type of display
> connectors, the i2c/aux channel mappings, board specific clock and
> voltage settings, etc.). The PCI DID just indicates the chip used on
> the board. The actual board design varies with each AIB vendor (e.g.,
> Sapphire and XFX both make 7900XTX boards, but they can have very
> different configurations).

Thanks for the explanation, that makes sense.

Is my understanding correct that IGPUs (my case) simply won't have a
vbios available through any other mechanism? If so perhaps this isn't
feasible in amdgpu as the BDF information is lost in reassignment.

--Kurt Kartaltepe
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Mon, Mar 18, 2024 at 12:57 PM Alex Deucher wrote:
>
> On Mon, Mar 18, 2024 at 3:52 PM Alex Deucher wrote:
> > ...
> > Depends on the platform, but recent ones use VFCT. That said, there
> > should only ever be one IGPU in the system so I think we could just
> > rely on the VID and DID for APUs in this case and check everything for
> > dGPUs.
>
> Is there a reason why you need this option? Even beyond this, I could
> envision other problems related to APUs and ACPI if these changed.
>
> Alex

So there are multiple factors in play. I am trying to make use of the
lovely usb4/tb3 controllers on the 7940HS with the reportedly Intel
Tamales Module 2 pci/pci bridge over the usb4 interface. This provides a
handy way to expand the pcie bus but configuring ACPI and pcie topology
isn't generally an option on consumer BIOS (unless you want to enlighten
me).

This leaves us in the situation where the bios can enumerate devices
poorly resulting in inaccessible devices due to address conflicts. To
resolve address conflicts the only option I'm aware of is
pci=assign-busses. Maybe this could also be configured at runtime, but
assign-busses seemed nice in some ways.

I haven't experienced any issues with the APU (graphics, hardware
encoders/decoders) but I do think assign-busses might be renumbering
again after suspend/resume/pci rescans but I need to debug further,
maybe suspend/resume are just broken when ACPI addresses are wrong.
Obviously the graphics user space (compositors, mesa might be working as
expected) don't handle the device switching addresses while in use, for
amdgpu kernel side I haven't inspected deeply yet.

I'm not sure if this is the right approach to solving the problem, and
given your input I'm considering it may be better, though not
upstreamable, to implement renumbering only for specified devices like
this pci bridge or investigate runtime management of the pci bus
addresses. The current assign-busses implementation is quite the big
hammer admittedly.

--Kurt Kartaltepe
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Mon, Mar 18, 2024 at 6:37 AM Alex Deucher wrote:
>
> On Mon, Mar 18, 2024 at 4:47 AM Kurt Kartaltepe wrote:
> >
> > These checks prevent using amdgpu with the pci=assign-busses parameter
> > which will re-address devices from their acpi values.
> >
> > Signed-off-by: Kurt Kartaltepe
>
> This will likely break multi-GPU functionality. The BDF values are
> how the sbios/driver differentiates between the VFCT images. If you
> have multiple GPUs in the system, the driver won't be able to figure
> out which one goes with which GPU and you may end up assigning the
> wrong image to the wrong device.
>
> Alex

The vendor and device portions must be correct in the existing kernels,
so device type differentiation should already work without BDF values.

So does that mean the concern is images are different for devices with
the same vendor:device pairs? There are sites out there dedicated to
dumping AMD's video roms which seem to suggest all discrete devices
would be fine loading the same rom. Is there another platform you are
thinking of where devices with the same vendor:device values would need
different images?

(Sorry this is my first patch to the mailing list and I am replying with
gmail, I hope it doesn't break things).

--Kurt Kartaltepe
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Mon, Mar 18, 2024 at 3:52 PM Alex Deucher wrote:
>
> On Mon, Mar 18, 2024 at 12:06 PM Kurt Kartaltepe wrote:
> >
> > On Mon, Mar 18, 2024 at 8:42 AM Alex Deucher wrote:
> > >
> > > On Mon, Mar 18, 2024 at 10:19 AM Kurt Kartaltepe wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 6:37 AM Alex Deucher wrote:
> > > > >
> > > > > On Mon, Mar 18, 2024 at 4:47 AM Kurt Kartaltepe wrote:
> > > > > >
> > > > > > These checks prevent using amdgpu with the pci=assign-busses
> > > > > > parameter
> > > > > > which will re-address devices from their acpi values.
> > > > > >
> > > > > > Signed-off-by: Kurt Kartaltepe
> > > > >
> > > > > This will likely break multi-GPU functionality. The BDF values are
> > > > > how the sbios/driver differentiates between the VFCT images. If you
> > > > > have multiple GPUs in the system, the driver won't be able to figure
> > > > > out which one goes with which GPU and you may end up assigning the
> > > > > wrong image to the wrong device.
> > > > >
> > > > > Alex
> > > >
> > > > The vendor and device portions must be correct in the existing
> > > > kernels, so device type differentiation should already work without
> > > > BDF values.
> > > >
> > > > So does that mean the concern is images are different for devices with
> > > > the same vendor:device pairs? There are sites out there dedicated to
> > > > dumping AMD's video roms which seem to suggest all discrete devices
> > > > would be fine loading the same rom. Is there another platform you are
> > > > thinking of where devices with the same vendor:device values would
> > > > need different images?
> > >
> > > That is incorrect. The vbios images are board specific. Using the
> > > wrong image can cause a lot of problems. The vbios exists to handle
> > > board specific design variations (e.g., the number and type of display
> > > connectors, the i2c/aux channel mappings, board specific clock and
> > > voltage settings, etc.). The PCI DID just indicates the chip used on
> > > the board. The actual board design varies with each AIB vendor (e.g.,
> > > Sapphire and XFX both make 7900XTX boards, but they can have very
> > > different configurations).
> >
> > Thanks for the explanation, that makes sense.
> >
> > Is my understanding correct that IGPUs (my case) simply won't have
> > vbios available in any other mechanism. If so perhaps this isn't
> > feasible in amdgpu as the BDF information is lost in reassignment.
>
> Depends on the platform, but recent ones use VFCT. That said, there
> should only ever be one IGPU in the system so I think we could just
> rely on the VID and DID for APUs in this case and check everything for
> dGPUs.

Is there a reason why you need this option? Even beyond this, I could
envision other problems related to APUs and ACPI if these changed.

Alex
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Mon, Mar 18, 2024 at 12:06 PM Kurt Kartaltepe wrote:
>
> On Mon, Mar 18, 2024 at 8:42 AM Alex Deucher wrote:
> >
> > On Mon, Mar 18, 2024 at 10:19 AM Kurt Kartaltepe wrote:
> > >
> > > On Mon, Mar 18, 2024 at 6:37 AM Alex Deucher wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 4:47 AM Kurt Kartaltepe wrote:
> > > > >
> > > > > These checks prevent using amdgpu with the pci=assign-busses
> > > > > parameter
> > > > > which will re-address devices from their acpi values.
> > > > >
> > > > > Signed-off-by: Kurt Kartaltepe
> > > >
> > > > This will likely break multi-GPU functionality. The BDF values are
> > > > how the sbios/driver differentiates between the VFCT images. If you
> > > > have multiple GPUs in the system, the driver won't be able to figure
> > > > out which one goes with which GPU and you may end up assigning the
> > > > wrong image to the wrong device.
> > > >
> > > > Alex
> > >
> > > The vendor and device portions must be correct in the existing
> > > kernels, so device type differentiation should already work without
> > > BDF values.
> > >
> > > So does that mean the concern is images are different for devices with
> > > the same vendor:device pairs? There are sites out there dedicated to
> > > dumping AMD's video roms which seem to suggest all discrete devices
> > > would be fine loading the same rom. Is there another platform you are
> > > thinking of where devices with the same vendor:device values would
> > > need different images?
> >
> > That is incorrect. The vbios images are board specific. Using the
> > wrong image can cause a lot of problems. The vbios exists to handle
> > board specific design variations (e.g., the number and type of display
> > connectors, the i2c/aux channel mappings, board specific clock and
> > voltage settings, etc.). The PCI DID just indicates the chip used on
> > the board. The actual board design varies with each AIB vendor (e.g.,
> > Sapphire and XFX both make 7900XTX boards, but they can have very
> > different configurations).
>
> Thanks for the explanation, that makes sense.
>
> Is my understanding correct that IGPUs (my case) simply won't have
> vbios available in any other mechanism. If so perhaps this isn't
> feasible in amdgpu as the BDF information is lost in reassignment.

Depends on the platform, but recent ones use VFCT. That said, there
should only ever be one IGPU in the system so I think we could just rely
on the VID and DID for APUs in this case and check everything for dGPUs.

Alex
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Mon, Mar 18, 2024 at 10:19 AM Kurt Kartaltepe wrote:
>
> On Mon, Mar 18, 2024 at 6:37 AM Alex Deucher wrote:
> >
> > On Mon, Mar 18, 2024 at 4:47 AM Kurt Kartaltepe wrote:
> > >
> > > These checks prevent using amdgpu with the pci=assign-busses parameter
> > > which will re-address devices from their acpi values.
> > >
> > > Signed-off-by: Kurt Kartaltepe
> >
> > This will likely break multi-GPU functionality. The BDF values are
> > how the sbios/driver differentiates between the VFCT images. If you
> > have multiple GPUs in the system, the driver won't be able to figure
> > out which one goes with which GPU and you may end up assigning the
> > wrong image to the wrong device.
> >
> > Alex
>
> The vendor and device portions must be correct in the existing
> kernels, so device type differentiation should already work without
> BDF values.
>
> So does that mean the concern is images are different for devices with
> the same vendor:device pairs? There are sites out there dedicated to
> dumping AMD's video roms which seem to suggest all discrete devices
> would be fine loading the same rom. Is there another platform you are
> thinking of where devices with the same vendor:device values would
> need different images?

That is incorrect. The vbios images are board specific. Using the wrong
image can cause a lot of problems. The vbios exists to handle board
specific design variations (e.g., the number and type of display
connectors, the i2c/aux channel mappings, board specific clock and
voltage settings, etc.). The PCI DID just indicates the chip used on the
board. The actual board design varies with each AIB vendor (e.g.,
Sapphire and XFX both make 7900XTX boards, but they can have very
different configurations).

Alex

>
> (Sorry this is my first patch to the mailing list and I am replying
> with gmail, I hope it doesn't break things).
>
> --Kurt Kartaltepe
Re: [PATCH] drm/amdgpu: Remove pci address checks from acpi_vfct_bios
On Mon, Mar 18, 2024 at 4:47 AM Kurt Kartaltepe wrote:
>
> These checks prevent using amdgpu with the pci=assign-busses parameter
> which will re-address devices from their acpi values.
>
> Signed-off-by: Kurt Kartaltepe

This will likely break multi-GPU functionality. The BDF values are how
the sbios/driver differentiates between the VFCT images. If you have
multiple GPUs in the system, the driver won't be able to figure out
which one goes with which GPU and you may end up assigning the wrong
image to the wrong device.

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> index 618e469e3622..932ce13ad232 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_bios.c
> @@ -386,9 +386,6 @@ static bool amdgpu_acpi_vfct_bios(struct amdgpu_device
> *adev)
>  		}
>
>  		if (vhdr->ImageLength &&
> -		    vhdr->PCIBus == adev->pdev->bus->number &&
> -		    vhdr->PCIDevice == PCI_SLOT(adev->pdev->devfn) &&
> -		    vhdr->PCIFunction == PCI_FUNC(adev->pdev->devfn) &&
>  		    vhdr->VendorID == adev->pdev->vendor &&
>  		    vhdr->DeviceID == adev->pdev->device) {
>  			adev->bios = kmemdup(&vbios->VbiosContent,
> --
> 2.44.0
>
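
For readers less familiar with the BDF fields in the removed lines: the kernel packs a device's slot and function into one 8-bit `devfn` value, and `PCI_SLOT()` and `PCI_FUNC()` unpack it again. The following is a small userspace sketch of that encoding. The `sketch_` function names are made up for this example and are not kernel symbols; only the bit layout (5 bits of slot, 3 bits of function) follows the kernel macros.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 5-bit slot (device) number and a 3-bit function number. */
static inline uint8_t sketch_pci_devfn(uint8_t slot, uint8_t func)
{
    return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x07));
}

/* Extract the slot (device) number: the upper 5 bits of devfn. */
static inline uint8_t sketch_pci_slot(uint8_t devfn)
{
    return (uint8_t)((devfn >> 3) & 0x1f);
}

/* Extract the function number: the lower 3 bits of devfn. */
static inline uint8_t sketch_pci_func(uint8_t devfn)
{
    return (uint8_t)(devfn & 0x07);
}
```

As an example, the `04.1` root port seen in the lspci trees elsewhere in this thread is slot 4, function 1, which packs to a `devfn` of 0x21. The bus number is stored separately (in the patch, `adev->pdev->bus->number`), and that is the part `pci=assign-busses` can change relative to what the VFCT table recorded.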