Re: Change in PCI behaviour
On 11/23/2010 07:44 AM, Gary Thomas wrote: On 11/22/2010 01:26 PM, Benjamin Herrenschmidt wrote: On Mon, 2010-11-22 at 03:01 -0700, Gary Thomas wrote: I have a bit more information on this. I'm pretty sure that the failures are only happening in my SCSI (SATA actually) code. My board (8347ea) has a PCI bus with a SIL SATA controller. This combo works perfectly in 2.6.28. In 2.6.32, it will run for a while (possibly quite a while), then timeout trying to do a large block write - typically 256 blocks. Once this timeout happens, the SIL controller is stuck and accesses to it will eventually cause the whole system to hang (as above). Was there any major change in how PCI or DMA was handled between 2.6.28 and 2.6.32? Given the ephemeral nature of these failures (multiple runs all eventually fail, but never the same twice), my only hope of fixing it will be to have some ideas what might have changed. Maybe the changes you did to the PCI outbound windows are now breaking DMA ? Make sure the outbound and inbound don't overlap for example and that all RAM is reachable for inbound. Here's what I did to work around this - in my DTS, I set up my PCI as ranges = 0x0200 0x0 0xC400 0xC400 0x0 0x1C00 0x0100 0x0 0x 0xB800 0x0 0x0010; Before, I had it as ranges = 0x0200 0x0 0xC000 0xC000 0x0 0x2000 0x0100 0x0 0x 0xB800 0x0 0x0010; I wasn't sure how to reserve the memory (based on your earlier suggestion), so I just narrowed the window. Note that I did not change the PCI hardware registers (maybe the FSL code does?), so the outbound window should still be the whole 512MB. If this isn't viable, perhaps you could explain a bit more how to reserve such a chunk of memory so that the PCI mappings remain the same. Any ideas on this? I'm a bit lost as to how to reserve the memory like you suggested and what I've tried so far has met little success. 
Thanks again -- Gary Thomas | Consulting for the MLB Associates |Embedded world ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Change in PCI behaviour
On Sat, 2010-12-04 at 05:49 -0700, Gary Thomas wrote: On 11/23/2010 07:44 AM, Gary Thomas wrote: On 11/22/2010 01:26 PM, Benjamin Herrenschmidt wrote: On Mon, 2010-11-22 at 03:01 -0700, Gary Thomas wrote: I have a bit more information on this. I'm pretty sure that the failures are only happening in my SCSI (SATA actually) code. My board (8347ea) has a PCI bus with a SIL SATA controller. This combo works perfectly in 2.6.28. In 2.6.32, it will run for a while (possibly quite a while), then timeout trying to do a large block write - typically 256 blocks. Once this timeout happens, the SIL controller is stuck and accesses to it will eventually cause the whole system to hang (as above). Was there any major change in how PCI or DMA was handled between 2.6.28 and 2.6.32? Given the ephemeral nature of these failures (multiple runs all eventually fail, but never the same twice), my only hope of fixing it will be to have some ideas what might have changed. Maybe the changes you did to the PCI outbound windows are now breaking DMA ? Make sure the outbound and inbound don't overlap for example and that all RAM is reachable for inbound. Here's what I did to work around this - in my DTS, I set up my PCI as ranges = 0x0200 0x0 0xC400 0xC400 0x0 0x1C00 0x0100 0x0 0x 0xB800 0x0 0x0010; Before, I had it as ranges = 0x0200 0x0 0xC000 0xC000 0x0 0x2000 0x0100 0x0 0x 0xB800 0x0 0x0010; I wasn't sure how to reserve the memory (based on your earlier suggestion), so I just narrowed the window. Note that I did not change the PCI hardware registers (maybe the FSL code does?), so the outbound window should still be the whole 512MB. If this isn't viable, perhaps you could explain a bit more how to reserve such a chunk of memory so that the PCI mappings remain the same. Any ideas on this? I'm a bit lost as to how to reserve the memory like you suggested and what I've tried so far has met little success. 
Thanks again

Look at pcibios_reserve_legacy_regions() in arch/powerpc/kernel/pci-common.c; it reserves the legacy IO and VGA regions on host bridges. You can make it reserve whatever your device is allergic to.

Cheers, Ben.
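The effect of such a reservation on BAR placement can be sketched in user space. This is only a toy first-fit model, not the kernel's resource allocator; first_fit() and align_up() are made-up helper names, and the window base, window size and BAR size are illustrative values, since the addresses in this archived thread are truncated.

```python
# Toy model: where does a BAR land when the bottom of the window is reserved?
# All names and numbers here are illustrative assumptions, not kernel code.

MIB = 1 << 20

# Hypothetical host-bridge memory window (inclusive start/end) and BAR size.
WINDOW = (0xC0000000, 0xDFFFFFFF)
BAR_SIZE = 64 * MIB


def align_up(value, alignment):
    """Round value up to a multiple of alignment (alignment is a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def first_fit(window, reserved, size):
    """Return the lowest naturally aligned start address inside window that
    does not touch any reserved (start, end) range, or None if nothing fits."""
    start, end = window
    candidate = align_up(start, size)
    while candidate + size - 1 <= end:
        last = candidate + size - 1
        clash = next((r for r in reserved
                      if candidate <= r[1] and r[0] <= last), None)
        if clash is None:
            return candidate
        # Skip past the clashing reservation and re-align.
        candidate = align_up(clash[1] + 1, size)
    return None


if __name__ == "__main__":
    no_reserve = first_fit(WINDOW, [], BAR_SIZE)
    first_meg = [(WINDOW[0], WINDOW[0] + MIB - 1)]
    with_reserve = first_fit(WINDOW, first_meg, BAR_SIZE)
    print(hex(no_reserve), hex(with_reserve))
```

With nothing reserved, the model hands out the very bottom of the window; once the first megabyte is reserved, the same request moves up to the next naturally aligned slot.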
Re: Change in PCI behaviour
On 11/22/2010 01:26 PM, Benjamin Herrenschmidt wrote:

On Mon, 2010-11-22 at 03:01 -0700, Gary Thomas wrote:

I have a bit more information on this. I'm pretty sure that the failures are only happening in my SCSI (SATA actually) code. My board (8347ea) has a PCI bus with a SIL SATA controller. This combo works perfectly in 2.6.28. In 2.6.32, it will run for a while (possibly quite a while), then time out trying to do a large block write - typically 256 blocks. Once this timeout happens, the SIL controller is stuck and accesses to it will eventually cause the whole system to hang (as above).

Was there any major change in how PCI or DMA was handled between 2.6.28 and 2.6.32? Given the ephemeral nature of these failures (multiple runs all eventually fail, but never the same twice), my only hope of fixing it will be to have some ideas what might have changed.

Maybe the changes you did to the PCI outbound windows are now breaking DMA? Make sure the outbound and inbound don't overlap for example and that all RAM is reachable for inbound.

Here's what I did to work around this - in my DTS, I set up my PCI as

ranges = 0x0200 0x0 0xC400 0xC400 0x0 0x1C00
         0x0100 0x0 0x 0xB800 0x0 0x0010;

Before, I had it as

ranges = 0x0200 0x0 0xC000 0xC000 0x0 0x2000
         0x0100 0x0 0x 0xB800 0x0 0x0010;

I wasn't sure how to reserve the memory (based on your earlier suggestion), so I just narrowed the window. Note that I did not change the PCI hardware registers (maybe the FSL code does?), so the outbound window should still be the whole 512MB.

If this isn't viable, perhaps you could explain a bit more how to reserve such a chunk of memory so that the PCI mappings remain the same.

Thanks again

-- Gary Thomas | Consulting for the MLB Associates |Embedded world
Re: Change in PCI behaviour
On 11/21/2010 10:59 AM, Gary Thomas wrote:

On 11/19/2010 02:46 PM, Benjamin Herrenschmidt wrote:

On Fri, 2010-11-19 at 08:42 -0700, Gary Thomas wrote:

In this case, note that PCI device :00:0c.0 is at 0xc000. This causes problems because it's a truly stupid device that does not work properly at PCI [relative] address 0x. It simply does not respond at that address. Pick anywhere else and it will work fine!

Hrm, we used to have a trick to avoid giving out the first meg of a bus to avoid that sort of thing, I suppose it got lost. The rest is related to the way you map your PCI, I suppose, in your dts. You can switch back to a 1:1 instead of 1:0 mapping I suppose.

One way to achieve the above result would be to, in your platform code, reserve the mem region that corresponds to PCI 0...1M (c000...+1M) before the device resources are assigned/allocated. I thought we had code to do that with the legacy regions somewhere... oh well, no code at hand to check right now.

Thanks, I found a combo of regions in my DTS that fixed this.

That went well and the system is now running, but it's not stable :-( It will crash randomly, generally leaving no trace of what went wrong. I've attached a BDI to it, but mostly all it can tell me is it's dead. The one thing that seems to pop up is it looks like it's jumping into space (aka the wrong place) when doing rfi (this is a guess). I've seen things like the MSR ends up loaded with an address, or similar strangeness.

Were there any system level changes during this period (I know it's some time ago) that might have introduced such an instability? It's tough to scan through the diffs and get a feeling for any little details like this.

Any ideas or hints greatly appreciated, thanks

I have a bit more information on this. I'm pretty sure that the failures are only happening in my SCSI (SATA actually) code. My board (8347ea) has a PCI bus with a SIL SATA controller. This combo works perfectly in 2.6.28. In 2.6.32, it will run for a while (possibly quite a while), then time out trying to do a large block write - typically 256 blocks. Once this timeout happens, the SIL controller is stuck and accesses to it will eventually cause the whole system to hang (as above).

Was there any major change in how PCI or DMA was handled between 2.6.28 and 2.6.32? Given the ephemeral nature of these failures (multiple runs all eventually fail, but never the same twice), my only hope of fixing it will be to have some ideas what might have changed.

Thanks for any ideas

-- Gary Thomas | Consulting for the MLB Associates |Embedded world
Re: Change in PCI behaviour
On Fri, Nov 19, 2010 at 08:42:46AM -0700, Gary Thomas wrote:

In this case, note that PCI device :00:0c.0 is at 0xc000. This causes problems because it's a truly stupid device that does not work properly at PCI [relative] address 0x. It simply does not respond at that address. Pick anywhere else and it will work fine!

Yes, but once upon a time the PCI spec said that setting a base register to 0 should disable the corresponding decoder. I don't know whether this has changed (I actually never had the final PCI spec, only drafts).

However, I once had a device that did not disable base addresses set to zero, and this was described as a bug in its (numerous) errata. This also caused a lot of mayhem since in some versions/configurations it used up to 64kB of PCI I/O space (especially fun on x86...).

Gabriel
Re: Change in PCI behaviour
On Mon, 2010-11-22 at 03:01 -0700, Gary Thomas wrote:

I have a bit more information on this. I'm pretty sure that the failures are only happening in my SCSI (SATA actually) code. My board (8347ea) has a PCI bus with a SIL SATA controller. This combo works perfectly in 2.6.28. In 2.6.32, it will run for a while (possibly quite a while), then time out trying to do a large block write - typically 256 blocks. Once this timeout happens, the SIL controller is stuck and accesses to it will eventually cause the whole system to hang (as above).

Was there any major change in how PCI or DMA was handled between 2.6.28 and 2.6.32? Given the ephemeral nature of these failures (multiple runs all eventually fail, but never the same twice), my only hope of fixing it will be to have some ideas what might have changed.

Maybe the changes you did to the PCI outbound windows are now breaking DMA? Make sure the outbound and inbound don't overlap for example and that all RAM is reachable for inbound.

Cheers, Ben.
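The two checks suggested here (no overlap between outbound and inbound windows in PCI address space, and all RAM reachable through the inbound window) are simple interval arithmetic. Below is a small user-space sketch of that arithmetic. The 256MB of RAM and the 512MB outbound window size come from the thread; the RAM base of 0, the inbound window placement and both outbound bus addresses are assumptions chosen for illustration, and overlaps() and covers() are made-up helper names.

```python
# Interval checks for PCI inbound/outbound windows, in PCI bus address space.
# Window placements below are assumptions for illustration only.

MIB = 1 << 20

RAM_BASE, RAM_SIZE = 0x00000000, 256 * MIB        # RAM base is assumed
INBOUND_BASE, INBOUND_SIZE = 0x00000000, 256 * MIB  # assumed 1:1 inbound map
OUTBOUND_SIZE = 512 * MIB


def overlaps(a_base, a_size, b_base, b_size):
    """True if half-open ranges [a_base, a_base+a_size) and
    [b_base, b_base+b_size) share at least one address."""
    return a_base < b_base + b_size and b_base < a_base + a_size


def covers(win_base, win_size, base, size):
    """True if the window fully contains [base, base+size)."""
    return win_base <= base and base + size <= win_base + win_size


def check(outbound_pci_base):
    """Return (windows_overlap, ram_reachable) for a given outbound placement."""
    clash = overlaps(outbound_pci_base, OUTBOUND_SIZE,
                     INBOUND_BASE, INBOUND_SIZE)
    reachable = covers(INBOUND_BASE, INBOUND_SIZE, RAM_BASE, RAM_SIZE)
    return clash, reachable


if __name__ == "__main__":
    # Outbound window placed at bus address 0 collides with the inbound one.
    print(check(0x00000000))
    # Outbound window placed high on the bus leaves the inbound range alone.
    print(check(0xC0000000))
```

In this model a device DMA to a bus address inside both windows is ambiguous, which is the kind of failure the overlap check is meant to rule out.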
Re: Change in PCI behaviour
On 11/19/2010 02:46 PM, Benjamin Herrenschmidt wrote:

On Fri, 2010-11-19 at 08:42 -0700, Gary Thomas wrote:

In this case, note that PCI device :00:0c.0 is at 0xc000. This causes problems because it's a truly stupid device that does not work properly at PCI [relative] address 0x. It simply does not respond at that address. Pick anywhere else and it will work fine!

Hrm, we used to have a trick to avoid giving out the first meg of a bus to avoid that sort of thing, I suppose it got lost. The rest is related to the way you map your PCI, I suppose, in your dts. You can switch back to a 1:1 instead of 1:0 mapping I suppose.

One way to achieve the above result would be to, in your platform code, reserve the mem region that corresponds to PCI 0...1M (c000...+1M) before the device resources are assigned/allocated. I thought we had code to do that with the legacy regions somewhere... oh well, no code at hand to check right now.

Thanks, I found a combo of regions in my DTS that fixed this.

That went well and the system is now running, but it's not stable :-( It will crash randomly, generally leaving no trace of what went wrong. I've attached a BDI to it, but mostly all it can tell me is it's dead. The one thing that seems to pop up is it looks like it's jumping into space (aka the wrong place) when doing rfi (this is a guess). I've seen things like the MSR ends up loaded with an address, or similar strangeness.

Were there any system level changes during this period (I know it's some time ago) that might have introduced such an instability? It's tough to scan through the diffs and get a feeling for any little details like this.

Any ideas or hints greatly appreciated, thanks

-- Gary Thomas | Consulting for the MLB Associates |Embedded world
Change in PCI behaviour
I'm upgrading from 2.6.28 to 2.6.32 (yes, I know it's not the latest, but it's the best I can do at the moment). There seems to have been a change in how the PCI bus is scanned/assigned which is causing me some hardware problems. My hardware is FSL MPC8347 and the problem is likely specific to the FSL PCI code.

My system has 256MB RAM, fully mapped into the PCI export window. There are an additional 2 devices on the PCI bus. On 2.6.28, I get this layout:

PCI: Probing PCI hardware
pci :00:00.0: reg 10 32bit mmio: [0x00-0x0f]
pci :00:00.0: reg 18 64bit mmio: [0x00-0xfff]
pci :00:0b.0: reg 10 io port: [0x1000-0x1007]
pci :00:0b.0: reg 14 io port: [0x1008-0x100b]
pci :00:0b.0: reg 18 io port: [0x1010-0x1017]
pci :00:0b.0: reg 1c io port: [0x1018-0x101b]
pci :00:0b.0: reg 20 io port: [0x1020-0x102f]
pci :00:0b.0: reg 24 32bit mmio: [0x10-0x1001ff]
pci :00:0b.0: reg 30 32bit mmio: [0x00-0x07]
pci :00:0b.0: supports D1 D2
pci :00:0c.0: reg 10 32bit mmio: [0x400-0x7ff]
PCI: Cannot allocate resource region 5 of device :00:0b.0, will remap
PCI: Cannot allocate resource region 0 of device :00:0c.0, will remap
bus: 00 index 0 io port: [0x00-0xf]
bus: 00 index 1 mmio: [0xc000-0xdfff]

There are no notices, but the key item is that device :00:0c.0 gets mapped at PCI address 0xD000.
On 2.6.32, I get this:

PCI: Probing PCI hardware
pci :00:0b.0: reg 10: [io 0x1000-0x1007]
pci :00:0b.0: reg 14: [io 0x1008-0x100b]
pci :00:0b.0: reg 18: [io 0x1010-0x1017]
pci :00:0b.0: reg 1c: [io 0x1018-0x101b]
pci :00:0b.0: reg 20: [io 0x1020-0x102f]
pci :00:0b.0: reg 24: [mem 0x0010-0x001001ff]
pci :00:0b.0: reg 30: [mem 0x-0x0007 pref]
pci :00:0b.0: supports D1 D2
pci :00:0c.0: reg 10: [mem 0x0400-0x07ff]
PCI: Cannot allocate resource region 5 of device :00:0b.0, will remap
PCI: Cannot allocate resource region 0 of device :00:0c.0, will remap
pci :00:0c.0: BAR 0: assigned [mem 0xc000-0xc3ff]
pci :00:0c.0: BAR 0: set to [mem 0xc000-0xc3ff] (PCI address [0xc000-0xc3ff]
pci :00:0b.0: BAR 6: assigned [mem 0xc400-0xc407 pref]
pci :00:0b.0: BAR 5: assigned [mem 0xc408-0xc40801ff]
pci :00:0b.0: BAR 5: set to [mem 0xc408-0xc40801ff] (PCI address [0xc408-0xc40801ff]
pci_bus :00: resource 0 [io 0x-0xf]
pci_bus :00: resource 1 [mem 0xc000-0xdfff]

In this case, note that PCI device :00:0c.0 is at 0xc000. This causes problems because it's a truly stupid device that does not work properly at PCI [relative] address 0x. It simply does not respond at that address. Pick anywhere else and it will work fine!
On 2.6.28, I get this layout:

# ls -l /sys/bus/pci/devices
lrwxrwxrwx 1 root root 0 Jan 1 1970 :00:00.0 -> ../../../devices/pci:00/:00:00.0
lrwxrwxrwx 1 root root 0 Jan 1 1970 :00:0b.0 -> ../../../devices/pci:00/:00:0b.0
lrwxrwxrwx 1 root root 0 Jan 1 1970 :00:0c.0 -> ../../../devices/pci:00/:00:0c.0
# cat /sys/bus/pci/devices/\:00\:0c.0/resource
0xd000 0xd3ff 0x00020200
0x 0x 0x
0x 0x 0x
0x 0x 0x
0x 0x 0x
0x 0x 0x
0x 0x 0x

On 2.6.32, the final layout looks like this:

# ls -l /sys/bus/pci/devices/
lrwxrwxrwx 1 root root 0 Jan 1 1970 :00:0b.0 -> ../../../devices/pci:00/:00:0b.0
lrwxrwxrwx 1 root root 0 Jan 1 1970 :00:0c.0 -> ../../../devices/pci:00/:00:0c.0
# cat /sys/bus/pci/devices/\:00\:0c.0/resource
0xc000 0xc3ff 0x00020200
0x 0x 0x
0x 0x 0x
0x 0x 0x
0x 0x 0x
0x 0x 0x
0x 0x 0x

Bottom line: how can I get this behaviour back (so as to get my stupid graphics controller working again)??

Thanks for any ideas

-- Gary Thomas | Consulting for the MLB Associates |Embedded world
Re: Change in PCI behaviour
On Fri, 2010-11-19 at 08:42 -0700, Gary Thomas wrote:

In this case, note that PCI device :00:0c.0 is at 0xc000. This causes problems because it's a truly stupid device that does not work properly at PCI [relative] address 0x. It simply does not respond at that address. Pick anywhere else and it will work fine!

Hrm, we used to have a trick to avoid giving out the first meg of a bus to avoid that sort of thing, I suppose it got lost. The rest is related to the way you map your PCI, I suppose, in your dts. You can switch back to a 1:1 instead of 1:0 mapping I suppose.

One way to achieve the above result would be to, in your platform code, reserve the mem region that corresponds to PCI 0...1M (c000...+1M) before the device resources are assigned/allocated. I thought we had code to do that with the legacy regions somewhere... oh well, no code at hand to check right now.

Cheers, Ben.
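The difference between a 1:1 and a 1:0 mapping can be seen by translating a CPU address through a single ranges-style entry. The sketch below is a user-space illustration only: cpu_to_pci() is a made-up helper, each entry is simplified to a (pci_base, cpu_base, size) tuple, and the addresses are illustrative values rather than the ones from this board, which the archived thread shows truncated.

```python
# Translate a CPU physical address to a PCI bus address through a list of
# simplified (pci_base, cpu_base, size) entries. Illustrative values only.

WINDOW_SIZE = 0x20000000  # 512MB, the outbound window size mentioned above

# Hypothetical mappings for the same CPU window.
ONE_TO_ONE = [(0xC0000000, 0xC0000000, WINDOW_SIZE)]   # bus addr == CPU addr
ONE_TO_ZERO = [(0x00000000, 0xC0000000, WINDOW_SIZE)]  # window starts at bus 0


def cpu_to_pci(cpu_addr, ranges):
    """Return the PCI bus address for cpu_addr, or raise ValueError if the
    address is outside every window."""
    for pci_base, cpu_base, size in ranges:
        if cpu_base <= cpu_addr < cpu_base + size:
            return pci_base + (cpu_addr - cpu_base)
    raise ValueError("address %#x is not inside any PCI window" % cpu_addr)


if __name__ == "__main__":
    lowest = 0xC0000000  # first CPU address of the window
    print(hex(cpu_to_pci(lowest, ONE_TO_ONE)))
    print(hex(cpu_to_pci(lowest, ONE_TO_ZERO)))
```

Under the offset mapping, the lowest CPU address in the window corresponds to bus address 0, so a BAR allocated at the bottom of the window ends up at exactly the address such a device cannot decode; reserving the first megabyte or using a 1:1 mapping both avoid that.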