I've localized the boot hang on the DL585 to pci_boot.c:add_reg_props() (see http://cvs.opensolaris.org/source/xref/on/usr/src/uts/i86pc/io/pci/pci_boot.c#1166).
The hang always occurs when probing the first PCI-PCI bridge in the system. On a hunch, I changed this line:

    pci_putw(bus, dev, func, PCI_CONF_COMM,
        cmd_reg & ~(PCI_COMM_IO | PCI_COMM_MAE));

to

        cmd_reg & ~(PCI_COMM_IO | PCI_COMM_MAE | PCI_COMM_ME));

which additionally clears master enable for the bridge during probing. With this fix, I've now successfully rebooted my DL585 over 500 times without a single hang (not manually; I'm using an rc3.d script).

I've opened P1 CR 6451513 for this; escalations should note this CR.

I admit I do not know precisely why this works; however, here's a snippet of my current thoughts on this:

- Devices under the bridge (according to scanpci) are:
    - ATI Rage XL
    - Compaq Integrated Lights Out Controller
    - Compaq Integrated Lights Out Processor
- The bridge is normally configured to map I/O range 0x4000..0x4fff and memory range 0xf5f00000..0xf7afffff through the bridge.
- Disabling I/O-enable and memory-access-enable effectively opens the mapped windows from the secondary bus to the primary bus. Bridge mappings are one-way; either the bridge forwards from the primary bus to the secondary bus when a mapping exists and is enabled, or the bridge forwards from the secondary bus to the primary bus when a mapping does not exist or is not enabled.
- Turning off Master Enable stops the bridge from forwarding in either direction.

So, my hypothesis is that one or both of the Compaq Integrated Lights Out devices is generating memory or I/O accesses that are normally blocked from the primary bus by the bridge. When we turn off the I/O and memory windows, we enable these cycles to reach the primary bus. I don't exactly know how this causes a hang, though the bridge is configured to not issue master aborts - thus, a memory cycle may hang indefinitely.
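To make the mask change and the hypothesis above concrete, here is a minimal C sketch. The three bit values are the standard PCI command register bit positions; the helper names, the `mem_window` struct, and `secondary_cycle_reaches_primary()` are hypothetical and are not taken from pci_boot.c. The forwarding function models only the hypothesis as stated in this post (upstream forwarding of memory cycles), and is not a statement of what the bridge hardware is specified to do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI command register bits (same names as used in pci_boot.c). */
#define PCI_COMM_IO   0x0001  /* I/O space enable */
#define PCI_COMM_MAE  0x0002  /* memory access enable */
#define PCI_COMM_ME   0x0004  /* bus master enable */

/* Value written to PCI_CONF_COMM while probing, before the change. */
static uint16_t probe_cmd_before(uint16_t cmd_reg)
{
    return (uint16_t)(cmd_reg & ~(PCI_COMM_IO | PCI_COMM_MAE));
}

/* Value written while probing, after the change: master enable is cleared too. */
static uint16_t probe_cmd_after(uint16_t cmd_reg)
{
    return (uint16_t)(cmd_reg & ~(PCI_COMM_IO | PCI_COMM_MAE | PCI_COMM_ME));
}

/* Hypothetical description of the bridge's memory window (inclusive bounds). */
struct mem_window {
    uint32_t base;
    uint32_t limit;
};

/*
 * Model of the hypothesis only: a memory cycle that originates on the
 * secondary bus reaches the primary bus when master enable is set and the
 * address is NOT covered by an enabled window.
 */
static bool secondary_cycle_reaches_primary(uint16_t cmd, struct mem_window w,
    uint32_t addr)
{
    bool in_enabled_window =
        (cmd & PCI_COMM_MAE) != 0 && addr >= w.base && addr <= w.limit;

    if ((cmd & PCI_COMM_ME) == 0)
        return false;           /* bridge does not master on the primary bus */
    return !in_enabled_window;  /* window closed or address outside it */
}
```

Under this model, with the window 0xf5f00000..0xf7afffff and a device on the secondary bus issuing a cycle to 0xf6000000, the cycle stays behind the bridge in normal operation, reaches the primary bus with the value from `probe_cmd_before()`, and is stopped again with the value from `probe_cmd_after()`.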
It would not at all surprise me if the two Compaq Integrated Lights Out devices are periodically talking to each other via memory cycles in the (normally mapped) range of 0xf5f00000..0xf7afffff, and this would explain the random nature of the hang. It doesn't quite explain why a serial console always works around the hang, though.

_______________________________________________
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org