Hi folks!

I've been tracking a problem with a CEDAR card (FirePro 2270 X1
"server") on a POWER server.

The cool thing is that our PCI-E bridge's fancy error handling is
kicking in and freezing all traffic to that device as soon as the error
is detected. By sprinkling some test code around, here is what I have
found:

 - The initial symptom is that the system reports an "EEH" (our error
   handling system) error on the card at boot. Under our normal firmware
   & Linux, that pretty much causes the card to be taken out.

 - With some hand-made debug code, I have got some more details about
   the error: it is a bad DMA write from the card to an address that
   isn't mapped in our IOMMU (i.e. something that wasn't the result of a
   dma_map_sg or dma_alloc_coherent, etc...).

 - The error seems to always happen while ATOM is executing from the
   "atom_enable_crtc" table, called from:

[c000001f7e206fb0] [c000000000440148] .atombios_enable_crtc+0x38/0x50
[c000001f7e207030] [c000000000441784] .atombios_crtc_dpms+0x104/0x1a0
[c000001f7e2070c0] [c000000000441d68] .atombios_crtc_disable+0x28/0x170
[c000001f7e207190] [c0000000003e6ef4] .drm_helper_disable_unused_functions+0x144/0x230
[c000001f7e207230] [c0000000003e69ec] .drm_fb_helper_initial_config+0x5c/0x310
[c000001f7e207340] [c00000000046862c] .radeon_fbdev_init+0xdc/0x190
[c000001f7e2073e0] [c0000000004620c0] .radeon_modeset_init+0x740/0xc90
[c000001f7e2074b0] [c000000000438cfc] .radeon_driver_load_kms+0x14c/0x1a0
[c000001f7e207550] [c0000000003f6e14] .drm_get_pci_dev+0x1c4/0x2e0
[c000001f7e207600] [c00000000079aea8] .radeon_pci_probe+0xc4/0xe8
[c000001f7e207690] [c000000000378d60] .pci_device_probe+0x1a0/0x1c0
[c000001f7e207740] [c0000000004db054] .driver_probe_device+0xe4/0x2c0
[c000001f7e2077e0] [c0000000004db33c] .__driver_attach+0x10c/0x110
[c000001f7e207870] [c0000000004d8b2c] .bus_for_each_dev+0x8c/0xe0
[c000001f7e207920] [c0000000004daaa8] .driver_attach+0x28/0x40
[c000001f7e2079a0] [c0000000004da3d8] .bus_add_driver+0x228/0x300
[c000001f7e207a40] [c0000000004db8b0] .driver_register+0xa0/0x1e0
[c000001f7e207ae0] [c000000000378eb8] .__pci_register_driver+0x48/0x60
[c000001f7e207b60] [c0000000003f709c] .drm_pci_init+0x16c/0x1a0
[c000001f7e207c10] [c0000000009eae5c] .radeon_init+0xb4/0xd0

   The error is detected on an atom_op_jump() that loops forever
   because of the isolation, which means that all MMIOs are returning
   ffffffff's. The actual error might have happened slightly earlier
   (see below).

   From the backtrace, it seems to be trying to *disable* CRTCs (I would
   have understood if it was trying to incorrectly enable one which is
   sourcing pixels from the wrong address...)

 - I've added various delays at all sorts of stages of
   radeon_modeset_init() and radeon_fbdev_init(), and the error still
   appears to be fairly well localized to the execution of that table.
   So it looks like it's not some stray DMA that happens to hit at that
   moment due to some timing, but something specifically triggered by
   that table execution.

 - I've turned on atom_debug right before the call to
   drm_fb_helper_initial_config() in radeon_fbdev_init() and added a
   freeze check between each op, and here's the result. I don't really
   have the brain cycles to try to parse that right now :-) I used to
   back then but heh, it's a long time ago...

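For anyone wanting to picture the "freeze check between each op" idea, here is a rough sketch of it. It is a plain Python simulation, not the kernel code I actually used: FakeDevice, looks_frozen(), run_table() and the op names are all made up for illustration. The only thing taken from the report is the symptom that an isolated device returns all-ones on every MMIO read.

```python
ALL_ONES = 0xFFFFFFFF  # what a 32-bit MMIO read returns once the device is isolated


class FakeDevice:
    """Hypothetical stand-in for a card behind a bridge that freezes it on error."""

    def __init__(self):
        self.regs = {}
        self.frozen = False

    def read32(self, offset):
        # A frozen device never answers, so the read comes back as all-ones.
        if self.frozen:
            return ALL_ONES
        return self.regs.get(offset, 0)

    def write32(self, offset, value):
        # Writes to a frozen device are silently dropped.
        if not self.frozen:
            self.regs[offset] = value & ALL_ONES

    def trip(self):
        # Simulates the bridge detecting a bad DMA and isolating the device.
        self.frozen = True


def looks_frozen(dev, probe_offset=0):
    # All-ones is only a hint (a register may legitimately hold that value);
    # real code would confirm with the platform's error-handling query.
    return dev.read32(probe_offset) == ALL_ONES


def run_table(dev, ops):
    """Run (name, callable) ops in order; return the name of the first op
    after which the device looks frozen, or None if it never does."""
    for name, op in ops:
        op(dev)
        if looks_frozen(dev):
            return name
    return None


# Hypothetical op list; BAD_OP stands for whatever op triggers the freeze.
ops = [
    ("SET_PORT", lambda d: None),
    ("CLEAR_REG", lambda d: d.write32(0x10, 0)),
    ("BAD_OP", lambda d: d.trip()),
    ("EOT", lambda d: None),
]
print(run_table(FakeDevice(), ops))  # prints: BAD_OP
```

Checking after every op rather than at the end of a table is what narrows the failure down to a single op, at the cost of one extra read per op.
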
That's where I'm handing you the hot potato, hoping it will make some
obvious sense :-)

As you can see, it's not doing much before the failure:

>> execute D7E2 (len 24, WS 0, PS 4)
   SET_ATI_PORT @ 0xD7E8
      port: 0 (MM)
   CLEAR_REG @ 0xD7EB
      dst:
   AND_REG @ 0xD7EF
      dst:
      src:
      dst:
   OR_REG @ 0xD7F4
      dst:
      src:
      dst:
   EOT @ 0xD7F9
<<
>> execute D7CA (len 24, WS 0, PS 4)
   SET_ATI_PORT @ 0xD7D0
      port: 0 (MM)
   CLEAR_REG @ 0xD7D3
      dst:
   AND_REG @ 0xD7D7
      dst:
      src:
      dst:
   OR_REG @ 0xD7DC
      dst:
      src:
      dst:
   EOT @ 0xD7E1
<<
>> execute BADE (len 25, WS 0, PS 0)
0001:01:00.0: EEH freeze detected, fstate=3 pcierr=9

[ Here we have detected the freeze. The stuff below is the output of my
  own diagnostic code; I will decode some of it later if it's of use.
  Basically it says the freeze occurred due to a DMA error to an
  unmapped DMA address, though I am not 100% sure of the actual DMA
  address. I think what it gives me is actually the address of the
  IOMMU "PTE" that had the valid bit clear, so I need to do some work to
  turn that back into a page or something ... Unfortunately, the packet
  hasn't been captured in the TLP header capture of the AER function.
]

PHB 1 diagnostic data:
  brdgCtl              = 0x00000002
  portStatusReg        = 0x00000000
  rootCmplxStatus      = 0x00000000
  busAgentStatus       = 0x00000000
  deviceStatus         = 0x0000000f
  slotStatus           = 0x016003c0
  linkStatus           = 0xa0120008
  devCmdStatus         = 0x00100147
  devSecStatus         = 0x00000000
  rootErrorStatus      = 0x00000000
  uncorrErrorStatus    = 0x00000000
  corrErrorStatus      = 0x00000000
  tlpHdr1              = 0x00000000
  tlpHdr2              = 0x00000000
  tlpHdr3              = 0x00000000
  tlpHdr4              = 0x00000000
  sourceId             = 0x00000000
  errorClass           = 0x0000000000000000
  correlator           = 0x0000000000000000
  p7iocPlssr           = 0x0000001c00000000
  p7iocCsr             = 0x0000000000000000
  lemFir               = 0x0200000004000000
  lemErrorMask         = 0x1249a1147f500f2c
  lemWOF               = 0x0000000000000000
  phbErrorStatus       = 0x0000000000000000
  phbFirstErrorStatus  = 0x0000000000000000
  phbErrorLog0         = 0x0000000000000000
  phbErrorLog1         = 0x0000000000000000
  mmioErrorStatus      = 0x0200000000000000
  mmioFirstErrorStatus = 0x0200000000000000
  mmioErrorLog0        = 0x02040070004a3da1
  mmioErrorLog1        = 0x98006e9800000000
  dma0ErrorStatus      = 0x0000000000004000
  dma0FirstErrorStatus = 0x0000000000004000
  dma0ErrorLog0        = 0x160007dbe0000002
  dma0ErrorLog1        = 0x1500000000000002
  dma1ErrorStatus      = 0x0000000000000000
  dma1FirstErrorStatus = 0x0000000000000000
  dma1ErrorLog0        = 0x0000000000000000
  dma1ErrorLog1        = 0x0000000000000000
  PE[  2] PESTA        = 0x8000302500000000
          PESTB        = 0x8000001f6f800000
foo !
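
On turning an IOMMU "PTE" address back into a page: if the value really is the address of the table entry, the arithmetic would look roughly like the sketch below. This is only an illustration under assumptions that are not confirmed above: 8-byte entries, 4 KiB IOMMU pages, a single flat table, and a known table base. The function name and the numbers are made up, and nothing here is taken from the dump.

```python
PTE_BYTES = 8          # assumption: one 64-bit translation entry per IOMMU page
IOMMU_PAGE_SHIFT = 12  # assumption: 4 KiB IOMMU pages


def pte_addr_to_dma_addr(pte_addr, table_base, window_start=0):
    """Map the address of a translation entry back to the first bus address
    of the IOMMU page that entry translates."""
    offset = pte_addr - table_base
    if offset < 0 or offset % PTE_BYTES != 0:
        raise ValueError("address is not an entry of this table")
    index = offset // PTE_BYTES
    return window_start + (index << IOMMU_PAGE_SHIFT)


# Made-up numbers for illustration only.
table_base = 0x1000_0000
pte_addr = table_base + 5 * PTE_BYTES
print(hex(pte_addr_to_dma_addr(pte_addr, table_base)))  # 0x5000
```

Note that this only recovers the faulting address to page granularity; the offset within the page is not recoverable from the entry address alone.
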
------------[ cut here ]------------
WARNING: at arch/powerpc/platforms/powernv/pci.c:312
Modules linked in:
NIP: c00000000003ffe0 LR: c00000000003ffdc CTR: 0000000030009e5c
REGS: c000001f7e206ad0 TRAP: 0700   Not tainted  (3.7.0-rc2-00006-gae38062-dirty)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 28002084  XER: 20000000
SOFTE: 1
CFAR: c0000000007946ec
TASK = c000001f71340000[1] 'swapper/0' THREAD: c000001f7e204000 CPU: 57
GPR00: c00000000003ffdc c000001f7e206d50 c000000000b591a0 0000000000000005
GPR04: 0000000000000000 00000000000002c2 9000000000009032 c0000000008e5e80
GPR08: c000001f7e206d88 0000000080000039 0000000000000000 000000000000e086
GPR12: 0000000028002082 c00000000ff4ab00 c00000000000b410 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 c00000000091c9f0 c000001f7e206e78 c000001f7e206e70
GPR24: 000000000000bade 0000000000000019 0000000000000000 0000000000000002
GPR28: 0000000000000002 c000001f7130a800 c000000000aa4300 c000000000ce5800
NIP [c00000000003ffe0] .pnv_pci_check_eeh+0x130/0x150
LR [c00000000003ffdc] .pnv_pci_check_eeh+0x12c/0x150
Call Trace:
[c000001f7e206d50] [c00000000003ffdc] .pnv_pci_check_eeh+0x12c/0x150 (unreliable)
[c000001f7e206e00] [c00000000044b03c] .atom_execute_table_locked+0x1ac/0x380
[c000001f7e206f10] [c00000000044e724] .atom_execute_table+0x54/0x80
[c000001f7e206fb0] [c0000000004400b8] .atombios_enable_crtc_memreq+0x38/0x50
[c000001f7e207030] [c0000000004417bc] .atombios_crtc_dpms+0x17c/0x1a0
[c000001f7e2070c0] [c000000000441d28] .atombios_crtc_disable+0x28/0x170
[c000001f7e207190] [c0000000003e6eb4] .drm_helper_disable_unused_functions+0x144/0x230
[c000001f7e207230] [c0000000003e69ac] .drm_fb_helper_initial_config+0x5c/0x310
[c000001f7e207340] [c0000000004686ac] .radeon_fbdev_init+0x13c/0x1e0
[c000001f7e2073e0] [c0000000004620e0] .radeon_modeset_init+0x740/0xc90
[c000001f7e2074b0] [c000000000438cbc] .radeon_driver_load_kms+0x14c/0x1a0
[c000001f7e207550] [c0000000003f6dd4] .drm_get_pci_dev+0x1c4/0x2e0
[c000001f7e207600] [c00000000079af18] .radeon_pci_probe+0xc4/0xe8
[c000001f7e207690] [c000000000378d20] .pci_device_probe+0x1a0/0x1c0
[c000001f7e207740] [c0000000004db0c4] .driver_probe_device+0xe4/0x2c0
[c000001f7e2077e0] [c0000000004db3ac] .__driver_attach+0x10c/0x110
[c000001f7e207870] [c0000000004d8b9c] .bus_for_each_dev+0x8c/0xe0
[c000001f7e207920] [c0000000004dab18] .driver_attach+0x28/0x40
[c000001f7e2079a0] [c0000000004da448] .bus_add_driver+0x228/0x300
[c000001f7e207a40] [c0000000004db920] .driver_register+0xa0/0x1e0
[c000001f7e207ae0] [c000000000378e78] .__pci_register_driver+0x48/0x60
[c000001f7e207b60] [c0000000003f705c] .drm_pci_init+0x16c/0x1a0
[c000001f7e207c10] [c0000000009eae5c] .radeon_init+0xb4/0xd0
[c000001f7e207c90] [c00000000000ac04] .do_one_initcall+0x64/0x1e0
[c000001f7e207d50] [c00000000000b62c] .kernel_init+0x21c/0x3e0
[c000001f7e207e30] [c000000000009b0c] .ret_from_kernel_thread+0x5c/0x64

Cheers,
Ben.