Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On 31.08.2018 at 09:30, Daniel Drake wrote:
> On over 40 Intel-based Asus products, the nvidia GPU becomes unusable
> after S3 suspend/resume. The affected products include multiple
> generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> many errors such as:
>
>     fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04
>           [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
>     DRM: failed to idle channel 0 [DRM]
>
> Similarly, the nvidia proprietary driver also fails after resume
> (black screen, 100% CPU usage in Xorg process). We shipped a sample
> to Nvidia for diagnosis, and their response indicated that it's a
> problem with the parent PCI bridge (on the Intel SoC), not the GPU.
>
> We found a workaround: on resume, rewrite the Intel PCI bridge
> 'Prefetchable Base Upper 32 Bits' register. In the cases that I checked,
> this register has value 0 and we just have to rewrite that value.
>
> It's very strange that rewriting the exact same register value
> makes a difference, but it definitely makes the issue go away.
> It's not just acting as some kind of memory barrier, because rewriting
> other bridge registers does not work around the issue. There's something
> magic in this particular register.
>
> We examined our database of Asus hardware and identified 43 products
> that we believe are affected. Checking the nvidia GPU parent PCI bridge
> on each one, in total 5 Intel PCI bridges need quirking as below.
> The quirk will run on bridges even where no nvidia GPU is connected,
> but it should be harmless, and we at least limit it to only running
> on Asus products.
>
> This fix was tested on all the affected models that we have in hands
> (X542UQ, UX533FD, X530UN, V272UN).

Hello,

this patch helps on my HP Zbook 14u G5, which otherwise fails to resume
the dGPU after suspend. In this case it's a Radeon GPU (Polaris 10). Of
course I had to remove the check for ASUS, but I made no other changes.

With this patch I can successfully run
"DRI_PRIME=1 glxinfo | grep -i renderer" and see the radeon, as well as
"DRI_PRIME=1 glxgears", after resuming from suspend. Attempting that
without the patch makes the system hang for a few seconds, followed by
lots of powerplay errors in dmesg. glxinfo/gears sometimes use the Intel
graphics or show a blank window.

FWIW, this problem was discussed a lot in bug
https://bugs.freedesktop.org/show_bug.cgi?id=105760
(it's closed only because the original crash is solved, but the root
problem is still unfixed). Therefore I am adding Peter Wu and Alex
Deucher, who already attempted to help me out.

I think this supports your other mail where you suggest it should be
done unconditionally.

Thanks for the patch!

Best regards
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau
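The workaround quoted above (rewriting the bridge's 'Prefetchable Base Upper 32 Bits' register with the value it already holds) can be illustrated from userspace with a minimal sketch. It assumes the standard type 1 (bridge) header layout, where that register sits at config offset 0x28, and uses the sysfs `config` file of the bridge. The function name is hypothetical, this is not the patch itself, and nothing in the thread establishes that a userspace write has the same effect as the in-kernel quirk.

```python
import struct

# Offset of "Prefetchable Base Upper 32 Bits" in the standard
# PCI-to-PCI bridge (type 1) configuration header.
PCI_PREF_BASE_UPPER32 = 0x28


def rewrite_pref_base_upper32(config_path):
    """Read the 32-bit register at 0x28 and write the same bytes back.

    config_path is a PCI config space file, for example
    /sys/bus/pci/devices/0000:00:1c.0/config (writing needs root).
    Returns the register value that was read and rewritten.
    """
    with open(config_path, "r+b", buffering=0) as cfg:
        cfg.seek(PCI_PREF_BASE_UPPER32)
        raw = cfg.read(4)
        if len(raw) != 4:
            raise OSError("short read from PCI config space")
        cfg.seek(PCI_PREF_BASE_UPPER32)
        cfg.write(raw)  # same value, as in the described workaround
    return struct.unpack("<I", raw)[0]
```

On the Asus X542UQ discussed in this thread the bridge is 00:1c.0, so the path would be /sys/bus/pci/devices/0000:00:1c.0/config; other machines will differ.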
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On Sat, Sep 1, 2018 at 3:12 AM, Bjorn Helgaas wrote:
> Can we tell whether Windows rewrites this register unconditionally at
> resume-time? If so, it may be more robust for Linux to do the same.
> The whole thing is black magic, which I hate, but if it's our only
> choice, it may be better to have this applied everywhere so we don't
> keep stubbing our toes on new systems that require the quirk.

Checked this with qemu adding a PCI-to-PCI bridge (ioh3420).

$ qemu-system-x86_64 -enable-kvm -M q35,accel=kvm -m 2G -vga qxl \
    -cpu host -hda testimg.img \
    -device ioh3420,id=rp1,bus=pcie.0,addr=1c.0,port=1 \
    -trace events=events.txt

events.txt has:
pci_cfg_read
pci_cfg_write

Logged cfg space accesses during boot:
https://gist.github.com/dsd/135fb255cb2b237567d8ea2d6bfc6917#file-boot-txt

Suspend:
https://gist.github.com/dsd/135fb255cb2b237567d8ea2d6bfc6917#file-suspend-txt

Resume:
https://gist.github.com/dsd/135fb255cb2b237567d8ea2d6bfc6917#file-resume-txt

Notably during resume, the prefetch-related registers get rewritten:
pci_cfg_write ioh3420 28:0 @0x24 <- 0xfeb0fea0
pci_cfg_write ioh3420 28:0 @0x28 <- 0x0
pci_cfg_write ioh3420 28:0 @0x2c <- 0x0

This happened even though there was nothing behind the bridge.

Windows failed to resume in this test (black screen) but the traced
register writes seem indicative enough.

Peter Wu confirms the same results in a similar experiment:
https://marc.info/?l=linux-pci&m=153616336225386&w=2

I'll look into creating a new patch that unconditionally reprograms the
PCI bridge prefetch stuff on resume.

Thanks
Daniel
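To pick the prefetch-related writes out of a longer trace like the ones linked above, a small filter can help. The sketch below assumes the `pci_cfg_write` line format shown in this message (device name, slot:function, `@0x` offset, `<-` value) and the standard type 1 header offsets 0x24, 0x28 and 0x2c. The function and table names are made up for illustration.

```python
import re

# Prefetch-related registers in the standard PCI-to-PCI bridge
# (type 1) configuration header.
PREFETCH_REGS = {
    0x24: "Prefetchable Memory Base/Limit",
    0x28: "Prefetchable Base Upper 32 Bits",
    0x2C: "Prefetchable Limit Upper 32 Bits",
}

# Matches lines such as: pci_cfg_write ioh3420 28:0 @0x24 <- 0xfeb0fea0
TRACE_RE = re.compile(
    r"pci_cfg_write\s+(?P<dev>\S+)\s+(?P<slot>\d+):(?P<func>\d+)\s+"
    r"@0x(?P<off>[0-9a-fA-F]+)\s+<-\s+0x(?P<val>[0-9a-fA-F]+)"
)


def prefetch_writes(trace_text):
    """Return (device, offset, value, register name) for each config
    write in trace_text that targets a prefetch-related register."""
    hits = []
    for m in TRACE_RE.finditer(trace_text):
        off = int(m.group("off"), 16)
        if off in PREFETCH_REGS:
            hits.append(
                (m.group("dev"), off, int(m.group("val"), 16), PREFETCH_REGS[off])
            )
    return hits
```

Applied to the resume trace excerpt quoted in this message, the filter would report the three writes at 0x24, 0x28 and 0x2c and skip writes to any other offset.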
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
Hi Daniel,

I love your patch! Perhaps something to improve:

[auto build test WARNING on pci/next]
[also build test WARNING on v4.19-rc2 next-20180831]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Drake/PCI-add-prefetch-quirk-to-work-around-Asus-Nvidia-suspend-issues/20180901-043245
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: x86_64-randconfig-s5-09031857 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

:: branch date: 3 days ago
:: commit date: 3 days ago

All warnings (new ones prefixed by >>):

   In file included from include/linux/export.h:45:0,
                    from include/linux/linkage.h:7,
                    from include/linux/kernel.h:7,
                    from drivers/pci/quirks.c:16:
   drivers/pci/quirks.c: In function 'quirk_asus_pci_prefetch':
   drivers/pci/quirks.c:5134:6: warning: argument 1 null where non-null expected [-Wnonnull]
     if (strcmp(sys_vendor, "ASUSTeK COMPUTER INC.") != 0)
         ^~~
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) : \
                                 ^~~~
>> drivers/pci/quirks.c:5134:2: note: in expansion of macro 'if'
     if (strcmp(sys_vendor, "ASUSTeK COMPUTER INC.") != 0)
     ^~
   In file included from include/linux/uuid.h:20:0,
                    from include/linux/mod_devicetable.h:13,
                    from include/linux/pci.h:21,
                    from drivers/pci/quirks.c:18:
   include/linux/string.h:44:12: note: in a call to function 'strcmp' declared here
    extern int strcmp(const char *,const char *);
               ^~

# https://github.com/0day-ci/linux/commit/eccd2a8c40e1a705a666e6fe1c52aca3f2130984
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout eccd2a8c40e1a705a666e6fe1c52aca3f2130984
vim +/if +5134 drivers/pci/quirks.c

e7aaf90f9 Bjorn Helgaas 2018-08-15  4983
e7aaf90f9 Bjorn Helgaas 2018-08-15  4984  /*
ad281ecf1 Doug Meyer    2018-05-23  4985   * Microsemi Switchtec NTB uses devfn proxy IDs to move TLPs between
ad281ecf1 Doug Meyer    2018-05-23  4986   * NT endpoints via the internal switch fabric. These IDs replace the
ad281ecf1 Doug Meyer    2018-05-23  4987   * originating requestor ID TLPs which access host memory on peer NTB
ad281ecf1 Doug Meyer    2018-05-23  4988   * ports. Therefore, all proxy IDs must be aliased to the NTB device
ad281ecf1 Doug Meyer    2018-05-23  4989   * to permit access when the IOMMU is turned on.
ad281ecf1 Doug Meyer    2018-05-23  4990   */
ad281ecf1 Doug Meyer    2018-05-23  4991  static void quirk_switchtec_ntb_dma_alias(struct pci_dev *pdev)
ad281ecf1 Doug Meyer    2018-05-23  4992  {
ad281ecf1 Doug Meyer    2018-05-23  4993          void __iomem *mmio;
ad281ecf1 Doug Meyer    2018-05-23  4994          struct ntb_info_regs __iomem *mmio_ntb;
ad281ecf1 Doug Meyer    2018-05-23  4995          struct ntb_ctrl_regs __iomem *mmio_ctrl;
ad281ecf1 Doug Meyer    2018-05-23  4996          struct sys_info_regs __iomem *mmio_sys_info;
ad281ecf1 Doug Meyer    2018-05-23  4997          u64 partition_map;
ad281ecf1 Doug Meyer    2018-05-23  4998          u8 partition;
ad281ecf1 Doug Meyer    2018-05-23  4999          int pp;
ad281ecf1 Doug Meyer    2018-05-23  5000
ad281ecf1 Doug Meyer    2018-05-23  5001          if (pci_enable_device(pdev)) {
ad281ecf1 Doug Meyer    2018-05-23  5002                  pci_err(pdev, "Cannot enable Switchtec device\n");
ad281ecf1 Doug Meyer    2018-05-23  5003                  return;
ad281ecf1 Doug Meyer    2018-05-23  5004          }
ad281ecf1 Doug Meyer    2018-05-23  5005
ad281ecf1 Doug Meyer    2018-05-23  5006          mmio = pci_iomap(pdev, 0, 0);
ad281ecf1 Doug Meyer    2018-05-23  5007          if (mmio == NULL) {
ad281ecf1 Doug Meyer    2018-05-23  5008                  pci_disable_device(pdev);
ad281ecf1 Doug Meyer    2018-05-23  5009                  pci_err(pdev, "Cannot iomap Switchtec device\n");
ad281ecf1 Doug Meyer    2018-05-23  5010                  return;
ad281ecf1 Doug Meyer    2018-05-23  5011          }
ad281ecf1 Doug Meyer    2018-05-23  5012
ad281ecf1 Doug Meyer    2018-05-23  5013          pci_info(pdev, "Setting Switchtec proxy ID aliases\n");
ad281ecf1 Doug Meyer    2018-05-23  5014
ad281ecf1 Doug Meyer    2018-05-23  5015          mmio_ntb = mmio + SWITCHTEC_GAS_NTB_OFFSET;
ad281ecf1 Doug Meyer    2018-05-23  5016          mmio_ctrl = (void __iomem *) mmio_ntb + SWITCHTEC_NTB_REG_CTRL_OFFSET;
ad281ecf1 Doug Meyer    2018-05-23  5017          mmio_sys_info = mmio + SWITCHTEC_GAS_SYS_INFO_OFFSET;
ad281ecf1 Doug Meyer    2018-05-23  5018
ad281ecf1 Doug Meyer    2018-05-23  5019          partition = ioread8(
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On Tue, Sep 04, 2018 at 03:07:52PM +0800, Daniel Drake wrote:
> On Tue, Sep 4, 2018 at 2:43 PM, Mika Westerberg wrote:
> > Yes, can you check if the failing device BAR is included in any of the
> > above entries? If not then it is probably not related.
>
> mtrr again for reference:
> reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
> reg01: base=0x0a0000000 ( 2560MB), size=  512MB, count=1: uncachable
> reg02: base=0x090000000 ( 2304MB), size=  256MB, count=1: uncachable
> reg03: base=0x08c000000 ( 2240MB), size=   64MB, count=1: uncachable
> reg04: base=0x08b800000 ( 2232MB), size=    8MB, count=1: uncachable
>
> The PCI bridge is:
> 00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express
> Root Port (rev f1) (prog-if 00 [Normal decode])
>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>         ParErr- Stepping- SERR- FastB2B- DisINTx+
>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
>     Latency: 0, Cache Line Size: 64 bytes
>     Interrupt: pin A routed to IRQ 122
>     Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>     I/O behind bridge: e000-efff
>     Memory behind bridge: ee000000-ef0fffff
>     Prefetchable memory behind bridge: d0000000-e1ffffff
>
> The memory behind bridge at ee000000 is included in the mtrr region
> reg00 which is 0xc0000000 to 0xffffffff.
> Same for the prefetchable memory behind bridge.

Yeah and it is uncachable so it should be fine.

> The nvidia GPU which becomes unresponsive is:
>
> 01:00.0 3D controller: NVIDIA Corporation GM108M [GeForce 940MX] (rev a2)
>     Subsystem: ASUSTeK Computer Inc. GM108M [GeForce 940MX]
>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>         ParErr- Stepping- SERR- FastB2B- DisINTx+
>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
>     Latency: 0, Cache Line Size: 64 bytes
>     Interrupt: pin A routed to IRQ 133
>     Region 0: Memory at ee000000 (32-bit, non-prefetchable) [size=16M]
>     Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
>     Region 3: Memory at e0000000 (64-bit, prefetchable) [size=32M]
>     Region 5: I/O ports at e000 [size=128]
>     Expansion ROM at ef000000 [disabled] [size=512K]
>
> Region 0, 1, 3 and the expansion ROM are all included in the mtrr region
> reg00.
>
> The magic register that we write to workaround the issue is in PCI
> bridge config space - not in a BAR.

OK, I just wanted to rule out MTRR misconfiguration but I guess it is
not the case here.
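The containment check done by hand above (is each BAR inside an MTRR region, and with which memory type) can be sketched as a small script. It parses the megabyte figures that /proc/mtrr prints in parentheses rather than the hex base, because the hex digits are truncated in the archived copy of this thread. The BAR address used in the usage note, 0xee000000, assumes that the archived `ee00` stood for that value, which is consistent with the 3072MB to 4096MB span of reg00. The function names are hypothetical.

```python
import re

MiB = 1 << 20

# Matches /proc/mtrr lines such as:
#   reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
MTRR_RE = re.compile(
    r"reg(?P<reg>\d+): base=0x[0-9a-fA-F]+ \(\s*(?P<base_mb>\d+)MB\), "
    r"size=\s*(?P<size>\d+)(?P<unit>[KM])B, count=\d+: (?P<type>[\w-]+)"
)


def parse_mtrr(text):
    """Return a list of (reg, start, end, type) with end exclusive."""
    regions = []
    for m in MTRR_RE.finditer(text):
        start = int(m.group("base_mb")) * MiB
        unit = MiB if m.group("unit") == "M" else 1024
        size = int(m.group("size")) * unit
        regions.append((int(m.group("reg")), start, start + size, m.group("type")))
    return regions


def covering_regions(regions, start, length):
    """Return the MTRR regions that fully contain [start, start + length)."""
    end = start + length
    return [r for r in regions if r[1] <= start and end <= r[2]]
```

For example, `covering_regions(parse_mtrr(open("/proc/mtrr").read()), 0xEE000000, 16 * MiB)` on the X542UQ would be expected to return only reg00, matching the conclusion reached in this message.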
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On Tue, Sep 4, 2018 at 2:43 PM, Mika Westerberg wrote:
> Yes, can you check if the failing device BAR is included in any of the
> above entries? If not then it is probably not related.

mtrr again for reference:
reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg01: base=0x0a0000000 ( 2560MB), size=  512MB, count=1: uncachable
reg02: base=0x090000000 ( 2304MB), size=  256MB, count=1: uncachable
reg03: base=0x08c000000 ( 2240MB), size=   64MB, count=1: uncachable
reg04: base=0x08b800000 ( 2232MB), size=    8MB, count=1: uncachable

The PCI bridge is:
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express
Root Port (rev f1) (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
        ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
[...]
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On Tue, Sep 04, 2018 at 09:52:02AM +0800, Daniel Drake wrote:
> # cat /proc/mtrr
> reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
> reg01: base=0x0a0000000 ( 2560MB), size=  512MB, count=1: uncachable
> reg02: base=0x090000000 ( 2304MB), size=  256MB, count=1: uncachable
> reg03: base=0x08c000000 ( 2240MB), size=   64MB, count=1: uncachable
> reg04: base=0x08b800000 ( 2232MB), size=    8MB, count=1: uncachable
>
> # cat /sys/kernel/debug/x86/pat_memtype_list
> PAT memtype list:
> [...]
>
> Is that the info you were looking for?

Yes, can you check if the failing device BAR is included in any of the
above entries? If not then it is probably not related.
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On Mon, Sep 3, 2018 at 8:12 PM, Mika Westerberg wrote:
> We have seen one similar issue with LPSS devices when BIOS assigns
> device BARs above 4G (which is not the case here) and it turned out to
> be misconfigured MTRR register or something like that. It may not be
> related at all but it could be worth a try to dump out MTRR registers of
> one of the affected systems and see if the memory areas are listed there
> (and if the attributes are somehow wrong if found).

From Asus X542UQ:

# cat /proc/mtrr
reg00: base=0x0c0000000 ( 3072MB), size= 1024MB, count=1: uncachable
reg01: base=0x0a0000000 ( 2560MB), size=  512MB, count=1: uncachable
reg02: base=0x090000000 ( 2304MB), size=  256MB, count=1: uncachable
reg03: base=0x08c000000 ( 2240MB), size=   64MB, count=1: uncachable
reg04: base=0x08b800000 ( 2232MB), size=    8MB, count=1: uncachable

# cat /sys/kernel/debug/x86/pat_memtype_list
PAT memtype list:
write-back @ 0x84a23000-0x84a24000
write-back @ 0x8ad34000-0x8ad60000
write-back @ 0x8ad5f000-0x8ad66000
write-back @ 0x8ad5f000-0x8ad60000
write-back @ 0x8ad65000-0x8ad6a000
write-back @ 0x8ad69000-0x8ad6b000
write-back @ 0x8ad6a000-0x8ad6c000
write-back @ 0x8ad6b000-0x8ad6e000
write-back @ 0x8ad9c000-0x8ad9d000
write-back @ 0x8adce000-0x8adcf000
write-back @ 0x8adcf000-0x8add0000
write-back @ 0x8adcf000-0x8add2000
write-back @ 0x8add3000-0x8add4000
write-back @ 0x8ae04000-0x8ae05000
write-back @ 0x8b208000-0x8b209000
write-combining @ 0xc0000000-0xd0000000
write-combining @ 0xd0000000-0xe0000000
write-combining @ 0xe0000000-0xe0040000
write-combining @ 0xe0040000-0xe0050000
write-combining @ 0xe0050000-0xe0051000
write-combining @ 0xe0051000-0xe0151000
write-combining @ 0xe0151000-0xe0191000
write-combining @ 0xe0191000-0xe01a1000
write-combining @ 0xe01a1000-0xe01b1000
write-combining @ 0xe01b1000-0xe01c1000
write-combining @ 0xe01c1000-0xe01c3000
write-combining @ 0xe01c3000-0xe01c5000
write-combining @ 0xe01c5000-0xe01cd000
write-combining @ 0xe01cd000-0xe01d5000
write-combining @ 0xe01d5000-0xe01dd000
write-combining @ 0xe01dd000-0xe01e5000
write-combining @ 0xe01e5000-0xe01ed000
write-combining @ 0xe01ed000-0xe01f5000
write-combining @ 0xe01f5000-0xe01fd000
write-combining @ 0xe01fd000-0xe0205000
write-combining @ 0xe0205000-0xe020d000
write-combining @ 0xe020d000-0xe0215000
uncached-minus @ 0xed000000-0xed200000
write-combining @ 0xed800000-0xee000000
uncached-minus @ 0xee000000-0xef000000
uncached-minus @ 0xef200000-0xef400000
uncached-minus @ 0xef400000-0xef401000
uncached-minus @ 0xef404000-0xef405000
uncached-minus @ 0xef510000-0xef520000
uncached-minus @ 0xef528000-0xef52c000
uncached-minus @ 0xef533000-0xef534000
uncached-minus @ 0xef533000-0xef534000
uncached-minus @ 0xef533000-0xef534000
uncached-minus @ 0xef534000-0xef535000
uncached-minus @ 0xef534000-0xef535000
uncached-minus @ 0xef534000-0xef535000
uncached-minus @ 0xef535000-0xef536000
uncached-minus @ 0xef537000-0xef538000
uncached-minus @ 0xef538000-0xef539000
uncached-minus @ 0xef538000-0xef539000
uncached-minus @ 0xef538000-0xef539000
uncached-minus @ 0xef539000-0xef53a000
uncached-minus @ 0xef539000-0xef53a000
uncached-minus @ 0xef539000-0xef53a000
uncached-minus @ 0xef53a000-0xef53b000
uncached-minus @ 0xf0000000-0xf8000000
uncached-minus @ 0xf00e0000-0xf00e1000
uncached-minus @ 0xf0100000-0xf0101000
uncached-minus @ 0xf0101000-0xf0102000
uncached-minus @ 0xfdac0000-0xfdad0000
uncached-minus @ 0xfdae0000-0xfdaf0000
uncached-minus @ 0xfdaf0000-0xfdb00000
uncached-minus @ 0xfdc43000-0xfdc44000
uncached-minus @ 0xfe000000-0xfe001000
uncached-minus @ 0xfe000000-0xfe001000
uncached-minus @ 0xfed00000-0xfed01000
uncached-minus @ 0xfed15000-0xfed16000
uncached-minus @ 0xfed40000-0xfed41000
uncached-minus @ 0xfed90000-0xfed91000
uncached-minus @ 0xfed91000-0xfed92000

Is that the info you were looking for?

Thanks
Daniel
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On Mon, Sep 03, 2018 at 04:56:32PM +0800, Daniel Drake wrote:
> On Sat, Sep 1, 2018 at 3:12 AM, Bjorn Helgaas wrote:
> > If true, this sounds like some sort of erratum, so it would be good to
> > get some input from Intel, and I cc'd a few Intel folks.
>
> Yes, it would be great to get their input.

We have seen one similar issue with LPSS devices when the BIOS assigns
device BARs above 4G (which is not the case here), and it turned out to
be a misconfigured MTRR register or something like that. It may not be
related at all, but it could be worth a try to dump the MTRR registers
of one of the affected systems and see whether the memory areas are
listed there (and, if so, whether their attributes are somehow wrong).
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
On Sat, Sep 1, 2018 at 3:12 AM, Bjorn Helgaas wrote:
> If true, this sounds like some sort of erratum, so it would be good to
> get some input from Intel, and I cc'd a few Intel folks.

Yes, it would be great to get their input.

> It's interesting that all the systems below are from Asus. That makes
> me think there's some BIOS or SMM connection, e.g., SMM traps the
> register write and does something magic.

Is there a way I can check if there is an SMM trap active for this
address?

> Does this problem happen after a full system suspend/resume, or does
> it happen after runtime suspend of only the GPU? Or runtime suspend
> of only the GPU and the upstream bridge?

Runtime suspend/resume works fine. It only happens after S3 suspend.

> Can we tell whether Windows rewrites this register unconditionally at
> resume-time? If so, it may be more robust for Linux to do the same.
> The whole thing is black magic, which I hate, but if it's our only
> choice, it may be better to have this applied everywhere so we don't
> keep stubbing our toes on new systems that require the quirk.

Any suggestions for how to make this happen? I tried booting Windows in
virt-manager (hoping that I could then spy on PCI config space register
accesses), but I don't see an option for S3 suspend. I'll keep looking
into this.

Thanks
Daniel
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
Hi Daniel,

I love your patch! Perhaps something to improve:

[auto build test WARNING on pci/next]
[also build test WARNING on v4.19-rc1 next-20180831]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Daniel-Drake/PCI-add-prefetch-quirk-to-work-around-Asus-Nvidia-suspend-issues/20180901-043245
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: x86_64-randconfig-x000-201834 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All warnings (new ones prefixed by >>):

   drivers/pci/quirks.c: In function 'quirk_asus_pci_prefetch':
>> drivers/pci/quirks.c:5134:6: warning: argument 1 null where non-null expected [-Wnonnull]
     if (strcmp(sys_vendor, "ASUSTeK COMPUTER INC.") != 0)
         ^~~
   In file included from include/linux/uuid.h:20:0,
                    from include/linux/mod_devicetable.h:13,
                    from include/linux/pci.h:21,
                    from drivers/pci/quirks.c:18:
   include/linux/string.h:44:12: note: in a call to function 'strcmp' declared here
    extern int strcmp(const char *,const char *);
               ^~

vim +5134 drivers/pci/quirks.c

  4983
  4984  /*
  4985   * Microsemi Switchtec NTB uses devfn proxy IDs to move TLPs between
  4986   * NT endpoints via the internal switch fabric. These IDs replace the
  4987   * originating requestor ID TLPs which access host memory on peer NTB
  4988   * ports. Therefore, all proxy IDs must be aliased to the NTB device
  4989   * to permit access when the IOMMU is turned on.
  4990   */
  4991  static void quirk_switchtec_ntb_dma_alias(struct pci_dev *pdev)
  4992  {
  4993          void __iomem *mmio;
  4994          struct ntb_info_regs __iomem *mmio_ntb;
  4995          struct ntb_ctrl_regs __iomem *mmio_ctrl;
  4996          struct sys_info_regs __iomem *mmio_sys_info;
  4997          u64 partition_map;
  4998          u8 partition;
  4999          int pp;
  5000
  5001          if (pci_enable_device(pdev)) {
  5002                  pci_err(pdev, "Cannot enable Switchtec device\n");
  5003                  return;
  5004          }
  5005
  5006          mmio = pci_iomap(pdev, 0, 0);
  5007          if (mmio == NULL) {
  5008                  pci_disable_device(pdev);
  5009                  pci_err(pdev, "Cannot iomap Switchtec device\n");
  5010                  return;
  5011          }
  5012
  5013          pci_info(pdev, "Setting Switchtec proxy ID aliases\n");
  5014
  5015          mmio_ntb = mmio + SWITCHTEC_GAS_NTB_OFFSET;
  5016          mmio_ctrl = (void __iomem *) mmio_ntb + SWITCHTEC_NTB_REG_CTRL_OFFSET;
  5017          mmio_sys_info = mmio + SWITCHTEC_GAS_SYS_INFO_OFFSET;
  5018
  5019          partition = ioread8(&mmio_ntb->partition_id);
  5020
  5021          partition_map = ioread32(&mmio_ntb->ep_map);
  5022          partition_map |= ((u64) ioread32(&mmio_ntb->ep_map + 4)) << 32;
  5023          partition_map &= ~(1ULL << partition);
  5024
  5025          for (pp = 0; pp < (sizeof(partition_map) * 8); pp++) {
  5026                  struct ntb_ctrl_regs __iomem *mmio_peer_ctrl;
  5027                  u32 table_sz = 0;
  5028                  int te;
  5029
  5030                  if (!(partition_map & (1ULL << pp)))
  5031                          continue;
  5032
  5033                  pci_dbg(pdev, "Processing partition %d\n", pp);
  5034
  5035                  mmio_peer_ctrl = &mmio_ctrl[pp];
  5036
  5037                  table_sz = ioread16(&mmio_peer_ctrl->req_id_table_size);
  5038                  if (!table_sz) {
  5039                          pci_warn(pdev, "Partition %d table_sz 0\n", pp);
  5040                          continue;
  5041                  }
  5042
  5043                  if (table_sz > 512) {
  5044                          pci_warn(pdev,
  5045                                   "Invalid Switchtec partition %d table_sz %d\n",
  5046                                   pp, table_sz);
  5047                          continue;
  5048                  }
  5049
  5050                  for (te = 0; te < table_sz; te++) {
  5051                          u32 rid_entry;
  5052                          u8 devfn;
  5053
  5054                          rid_entry = ioread32(&mmio_peer_ctrl->req_id_table[te]);
  5055                          devfn = (rid_entry >> 1) & 0xFF;
  5056                          pci_dbg(pdev,
  5057                                  "Aliasing Partition %d Proxy ID %02x.%d\n",
  5058                                  pp, PCI_SLOT(devfn), PCI_FUNC(devfn));
  5059                          pci_add_dma_alias(pdev, devfn);
  5060                  }
  5061          }
  5062
  5063          pci_iounmap(pdev, mmio);
  5064          pci_disable_device(
Re: [Nouveau] [PATCH] PCI: add prefetch quirk to work around Asus/Nvidia suspend issues
[+cc Intel folks]

On Fri, Aug 31, 2018 at 03:30:57PM +0800, Daniel Drake wrote:
> On over 40 Intel-based Asus products, the nvidia GPU becomes unusable
> after S3 suspend/resume. The affected products include multiple
> generations of nvidia GPUs and Intel SoCs. After resume, nouveau logs
> many errors such as:
>
>     fifo: fault 00 [READ] at 00555000 engine 00 [GR] client 04
>           [HUB/FE] reason 4a [] on channel -1 [007fa91000 unknown]
>     DRM: failed to idle channel 0 [DRM]
>
> Similarly, the nvidia proprietary driver also fails after resume
> (black screen, 100% CPU usage in Xorg process). We shipped a sample
> to Nvidia for diagnosis, and their response indicated that it's a
> problem with the parent PCI bridge (on the Intel SoC), not the GPU.
>
> We found a workaround: on resume, rewrite the Intel PCI bridge
> 'Prefetchable Base Upper 32 Bits' register. In the cases that I checked,
> this register has value 0 and we just have to rewrite that value.
>
> It's very strange that rewriting the exact same register value
> makes a difference, but it definitely makes the issue go away.
> It's not just acting as some kind of memory barrier, because rewriting
> other bridge registers does not work around the issue. There's something
> magic in this particular register.

If true, this sounds like some sort of erratum, so it would be good to
get some input from Intel, and I cc'd a few Intel folks.

It's interesting that all the systems below are from Asus. That makes
me think there's some BIOS or SMM connection, e.g., SMM traps the
register write and does something magic.

Does this problem happen after a full system suspend/resume, or does
it happen after runtime suspend of only the GPU? Or runtime suspend
of only the GPU and the upstream bridge?

Can we tell whether Windows rewrites this register unconditionally at
resume-time? If so, it may be more robust for Linux to do the same.
The whole thing is black magic, which I hate, but if it's our only
choice, it may be better to have this applied everywhere so we don't
keep stubbing our toes on new systems that require the quirk.

> We examined our database of Asus hardware and identified 43 products
> that we believe are affected. Checking the nvidia GPU parent PCI bridge
> on each one, in total 5 Intel PCI bridges need quirking as below.
> The quirk will run on bridges even where no nvidia GPU is connected,
> but it should be harmless, and we at least limit it to only running
> on Asus products.
>
> This fix was tested on all the affected models that we have in hands
> (X542UQ, UX533FD, X530UN, V272UN).
>
> Signed-off-by: Daniel Drake
> ---
>
> Notes:
>     If anyone has ideas for why writing this register makes a difference, or
>     suggestions for other approaches then I'm all ears...
>
>     Here is some basic info of the 43 products believed to be affected:
>     basic DMI data, nvidia GPU PCI info, parent PCI bridge info.

Can you attach the list below to a kernel.org bugzilla and include the
URL in your changelog?

> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: FX502VD
> product_name: FX502VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev ff) (prog-if ff)
>     !!! Unknown header type 7f
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
>
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: FX570UD
> product_name: ASUS Gaming FX570UD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
>     Subsystem: ASUSTeK Computer Inc. Device [1043:1f40]
> 00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:9d10] (rev f1) (prog-if 00 [Normal decode])
>
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL553VD
> product_name: GL553VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
>     Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
>
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL553VD
> product_name: GL553VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
>     Subsystem: ASUSTeK Computer Inc. Device [1043:15e0]
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
>
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL753VD
> product_name: GL753VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1c8d] (rev a1)
>     Subsystem: ASUSTeK Computer Inc. Device [1043:1590]
> 00:01.0 PCI bridge [0604]: Intel Corporation Device [8086:1901] (rev 05) (prog-if 00 [Normal decode])
>
> sys_vendor: ASUSTeK COMPUTER INC.
> board_name: GL753VD
> product_name: GL753VD
> 01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:1