I've localized the boot hang on the DL585 to add_reg_props() in pci_boot.c
(see
http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/pci/pci_boot.c#1166).

The hang always occurs when probing the first PCI-PCI bridge in the system.

On a hunch, I changed this line:

   1166         pci_putw(bus, dev, func, PCI_CONF_COMM,
   1167             cmd_reg & ~(PCI_COMM_IO | PCI_COMM_MAE));

to

   1166         pci_putw(bus, dev, func, PCI_CONF_COMM,
   1167             cmd_reg & ~(PCI_COMM_IO | PCI_COMM_MAE | PCI_COMM_ME));

which additionally clears master enable for the bridge during probing.
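
To make the effect of the extra bit concrete, here is a small standalone
sketch of the mask arithmetic.  The bit values are the standard PCI command
register bits (the same names are used in the Solaris PCI headers); the two
helper functions are hypothetical names of mine and do not exist in
pci_boot.c.

```c
#include <stdint.h>

/* Standard PCI command-register bits. */
#define PCI_COMM_IO   0x0001  /* I/O space enable */
#define PCI_COMM_MAE  0x0002  /* memory space (access) enable */
#define PCI_COMM_ME   0x0004  /* bus master enable */

/* Hypothetical helper: value the original code writes while probing. */
static uint16_t
probe_cmd_original(uint16_t cmd_reg)
{
	return (uint16_t)(cmd_reg & ~(PCI_COMM_IO | PCI_COMM_MAE));
}

/* Hypothetical helper: value the changed code writes while probing. */
static uint16_t
probe_cmd_patched(uint16_t cmd_reg)
{
	return (uint16_t)(cmd_reg &
	    ~(PCI_COMM_IO | PCI_COMM_MAE | PCI_COMM_ME));
}
```

For a command register that starts out as 0x0147, the original code writes
0x0144 (master enable still set) and the changed code writes 0x0140.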

With this fix, I've now successfully rebooted my DL585 over 500 times
without a single hang (not manually; I'm using an rc3.d script).  I've
opened P1 CR 6451513 for this; escalations should note this CR.

I admit I do not know precisely why this works; however, here is my
current thinking:

- Devices under the bridge (according to scanpci) are:
  - ATI Rage XL
  - Compaq Integrated Lights Out Controller
  - Compaq Integrated Lights Out Processor

- The bridge is normally configured to map I/O range 0x4000..0x4fff and
  memory range 0xf5f00000..0xf7afffff through the bridge.

- Disabling I/O-enable and memory-access-enable effectively opens the
  mapped windows from the secondary bus to the primary bus.  Bridge
  mappings are one-way; either the bridge forwards from the primary bus
  to the secondary bus when a mapping exists and is enabled, or the
  bridge forwards from the secondary bus to the primary bus when a
  mapping does not exist or is not enabled.

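- Turning off Master Enable stops the bridge from initiating cycles on the
  primary bus, i.e. from forwarding secondary-bus traffic upstream.  With
  all three bits clear, nothing is forwarded in either direction.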
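
The forwarding rules described in the list above can be written down as a
toy model.  This is only a sketch of my understanding, not something checked
against the hardware: it covers memory cycles only, uses the memory window
quoted above, and treats Master Enable as gating the secondary-to-primary
direction, which is how the PCI-to-PCI bridge specification defines that
bit.  All function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMM_MAE  0x0002  /* memory space (access) enable */
#define PCI_COMM_ME   0x0004  /* bus master enable */

/* Memory window the bridge normally maps (from scanpci on the DL585). */
#define BRIDGE_MEM_BASE   0xf5f00000u
#define BRIDGE_MEM_LIMIT  0xf7afffffu

static bool
in_mem_window(uint32_t addr)
{
	return (addr >= BRIDGE_MEM_BASE && addr <= BRIDGE_MEM_LIMIT);
}

/* Primary -> secondary: only inside the window, and only if enabled. */
static bool
forwards_downstream(uint16_t cmd, uint32_t addr)
{
	return ((cmd & PCI_COMM_MAE) != 0 && in_mem_window(addr));
}

/*
 * Secondary -> primary: needs Master Enable, and the address must not be
 * claimed by an enabled window.  With the window disabled, everything
 * from the secondary side goes upstream.
 */
static bool
forwards_upstream(uint16_t cmd, uint32_t addr)
{
	if ((cmd & PCI_COMM_ME) == 0)
		return (false);
	return ((cmd & PCI_COMM_MAE) == 0 || !in_mem_window(addr));
}
```

In this model, a cycle at 0xf6000000 from the secondary bus stays behind the
bridge in normal operation (MAE and ME set), reaches the primary bus with
the original probing code (only ME set), and is blocked again with the
changed code (both clear).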

So, my hypothesis is that one or both of the Compaq Integrated Lights Out
devices are generating memory or I/O accesses that are normally blocked
from the primary bus by the bridge.  When we turn off the I/O and memory
windows, we allow these cycles to reach the primary bus.  I don't know
exactly how this causes a hang, but the bridge is configured not to issue
master aborts, so a memory cycle that nothing claims may hang indefinitely.

It would not at all surprise me if the two Compaq Integrated Lights Out
devices are periodically talking to each other via memory cycles in the
(normally mapped) range of 0xf5f00000..0xf7afffff, and this would explain
the random nature of the hang.  It doesn't quite explain why a serial console
always works around the hang, though.
