This problem may have been more far-reaching than I thought: the
machine is now entirely dead.  Testing shows the motherboard is blown,
not the processor or memory, so I'll get a replacement in and then
report on whether the expander works with that.

Will

On Tue, Sep 15, 2009 at 22:42, Vikram Hegde <[email protected]> wrote:
> Ok, I will investigate.  I may contact you privately to run some tests.
>
> Will Murnane <[email protected]> wrote:
>
>>On Tue, Sep 15, 2009 at 21:35, Vikram Hegde <[email protected]> wrote:
>>> Hi,
>>>
>>> Did you follow this sequence:
>>>
>>> 1. power down machine
>>> 2. install hardware
>>> 3. reboot machine
>>> 4. install drivers/software for expander
>>No driver is necessary, as far as I can tell.  It occupies a PCI
>>express slot, but (I believe) uses it only for power.  The LSI card
>>that drives it (I'm using the mpt driver) has been in the system for a
>>long time now, stable.
>>
>>> 5. *no* reboot
>>> 6. crash
>>The machine actually crashed twice.  The first time I wrote it off as
>>a power glitch despite the UPS (the lights flickered at the same time,
>>and it's a cheap UPS) and the second time I took notes, removed
>>memory, dug up logs and generally started debugging.
>>
>>>
>>> Vikram
>>>
>>> Will Murnane wrote:
>>>>
>>>> Oh, sorry.  I forgot to give details.  Here's the machine in question:
>>>>
>>>> w...@will-fs:~$ uname -a
>>>> SunOS will-fs 5.11 snv_122 i86pc i386 i86pc
>>>> w...@will-fs:~$ isainfo
>>>> amd64 i386
>>>> w...@will-fs:~$ prtdiag
>>>> System Configuration: Intel Corporation S3210SH
>>>> BIOS Configuration: Intel Corporation
>>>> S3200X38.86B.00.00.0046.011420090950 01/14/2009
>>>> BMC Configuration: IPMI 2.0 (KCS: Keyboard Controller Style)
>>>>
>>>> ==== Processor Sockets ====================================
>>>>
>>>> Version                          Location Tag
>>>> -------------------------------- --------------------------
>>>> Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz Intel(R) Genuine processor
>>>>
>>>> ==== Memory Device Sockets ================================
>>>>
>>>> Type        Status Set Device Locator      Bank Locator
>>>> ----------- ------ --- ------------------- ----------------
>>>> DDR2        in use 0   DIMM_A1             CHAN A DIMM 1
>>>> DDR2        empty  0   DIMM_A2             CHAN A DIMM 2
>>>> DDR2        empty  0   DIMM_B1             CHAN B DIMM 1
>>>> DDR2        empty  0   DIMM_B2             CHAN B DIMM 2
>>>>
>>>> ==== On-Board Devices =====================================
>>>> PCIe x1 VGA embedded in ServerEngine(tm) Pilot-II
>>>> Intel 82541PI Ethernet Device
>>>> Intel 82566DM Ethernet Device
>>>>
>>>> ==== Upgradeable Slots ====================================
>>>>
>>>> ID  Status    Type             Description
>>>> --- --------- ---------------- ----------------------------
>>>> 1   available PCI              PCI SLOT 1 PCI 32/33
>>>> 2   available PCI              PCI SLOT 2 PCI 32/33
>>>> 3   available PCI Express      PCIe x4 SLOT 3
>>>> 5   in use    PCI Express      PCIe x8 SLOT 5
>>>> 6   in use    PCI Express      PCIe x16 SLOT 6 / FL RISER
>>>>
>>>> Slot 3, which says "available", is in fact where the SAS card is.
>>>>
>>>> When the box first crashed, I took out one stick of memory (there are
>>>> normally two, for a total of 4GB).  The box hasn't crashed in about
>>>> three hours now.  Once zpool scrub finishes I can pull out the disks
>>>> and put the other stick back in and see what happens.
>>>>
>>>> Will
>>>>
>>>> On Tue, Sep 15, 2009 at 21:12, Vikram Hegde <[email protected]> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> What build are you running?  Several bugs in this area have been fixed in
>>>>> recent builds.
>>>>>
>>>>> Vikram
>>>>>
>>>>> Will Murnane wrote:
>>>>>
>>>>>>
>>>>>> I recently purchased an HP-branded SAS expander, model number
>>>>>> 468406-B21.  Today I installed it, and got a panic ("genunix: [ID
>>>>>> 655072 kern.notice]" at beginning of lines removed):
>>>>>>  panic[cpu1]/thread=ffffff0004883c60:
>>>>>>  Freeing a free IOMMU page: paddr=0x14eb7000
>>>>>>  ffffff0004883100 rootnex:iommu_page_free+cb ()
>>>>>>  ffffff0004883120 rootnex:iommu_free_page+15 ()
>>>>>>  ffffff0004883190 rootnex:iommu_setup_level_table+a4 ()
>>>>>>  ffffff00048831d0 rootnex:iommu_setup_page_table+e1 ()
>>>>>>  ffffff0004883250 rootnex:iommu_map_page_range+6a ()
>>>>>>  ffffff00048832a0 rootnex:iommu_map_dvma+50 ()
>>>>>>  ffffff0004883360 rootnex:intel_iommu_map_sgl+22f ()
>>>>>>  ffffff0004883400 rootnex:rootnex_coredma_bindhdl+11e ()
>>>>>>  ffffff0004883440 rootnex:rootnex_dma_bindhdl+36 ()
>>>>>>  ffffff00048834e0 genunix:ddi_dma_buf_bind_handle+117 ()
>>>>>>  ffffff0004883540 scsi:scsi_dma_buf_bind_attr+48 ()
>>>>>>  ffffff00048835d0 scsi:scsi_init_cache_pkt+2e1 ()
>>>>>>  ffffff0004883650 scsi:scsi_init_pkt+5c ()
>>>>>>  ffffff0004883730 sd:sd_setup_rw_pkt+12a ()
>>>>>>  ffffff00048837a0 sd:sd_initpkt_for_buf+ad ()
>>>>>>  ffffff0004883810 sd:sd_start_cmds+197 ()
>>>>>>  ffffff0004883860 sd:sd_core_iostart+184 ()
>>>>>>  ffffff00048838d0 sd:sd_mapblockaddr_iostart+302 ()
>>>>>>  ffffff0004883910 sd:sd_xbuf_strategy+50 ()
>>>>>>  ffffff0004883960 sd:xbuf_iostart+1e5 ()
>>>>>>  ffffff00048839a0 sd:ddi_xbuf_qstrategy+d3 ()
>>>>>>  ffffff00048839d0 sd:sdstrategy+10b ()
>>>>>>  ffffff0004883a00 genunix:bdev_strategy+75 ()
>>>>>>  ffffff0004883a30 genunix:ldi_strategy+59 ()
>>>>>>  ffffff0004883a70 zfs:vdev_disk_io_start+d0 ()
>>>>>>  ffffff0004883ab0 zfs:zio_vdev_io_start+17d ()
>>>>>>  ffffff0004883ae0 zfs:zio_execute+a0 ()
>>>>>>  ffffff0004883b00 zfs:zio_nowait+42 ()
>>>>>>  ffffff0004883b40 zfs:vdev_queue_io_done+9c ()
>>>>>>  ffffff0004883b70 zfs:zio_vdev_io_done+62 ()
>>>>>>  ffffff0004883ba0 zfs:zio_execute+a0 ()
>>>>>>  ffffff0004883c40 genunix:taskq_thread+1b7 ()
>>>>>>  ffffff0004883c50 unix:thread_start+8 ()
>>>>>>
>>>>>> On the next boot, I saw this set of messages three times:
>>>>>> WARNING: bios issue: rmrr is not in reserved memory range
>>>>>> WARNING: rmrr overlap with physmem [0x24400000 - 0x7fcf0000] for pci8086,34d0
>>>>>> WARNING: rmrr overlap with physmem [0x7fd96000 - 0x7fdfd000] for pci8086,34d0
>>>>>>
>>>>>> and this, several times:
>>>>>> NOTICE: IRQ21 is being shared by drivers with different interrupt levels.
>>>>>> This may result in reduced system performance.
>>>>>>
>>>>>> I found a suggestion for showing the interrupt map:
>>>>>> # echo ::interrupts -d | mdb -k
>>>>>> IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# Driver Name(s)
>>>>>> 1    0x41 5   ISA    Edg Fixed  3   1     0x0/0x1   i8042#1
>>>>>> 3    0xb1 12  ISA    Edg Fixed  3   1     0x0/0x3   asy#3
>>>>>> 4    0xb0 12  ISA    Edg Fixed  2   1     0x0/0x4   asy#2
>>>>>> 6    0x44 5   ISA    Edg Fixed  1   1     0x0/0x6   fdc#1
>>>>>> 9    0x82 9   PCI    Lvl Fixed  1   1     0x0/0x9   acpi_wrapper_isr
>>>>>> 12   0x42 5   ISA    Edg Fixed  0   1     0x0/0xc   i8042#1
>>>>>> 14   0x40 5   ISA    Edg Fixed  2   1     0x0/0xe   ata#2
>>>>>> 16   0x84 9   PCI    Lvl Fixed  3   1     0x0/0x10  nvidia#1
>>>>>> 17   0x85 9   PCI    Lvl Fixed  0   1     0x0/0x11  ehci#2
>>>>>> 18   0x87 9   PCI    Lvl Fixed  2   3     0x0/0x12  e1000g#3, uhci#10, uhci#6
>>>>>> 19   0x89 9   PCI    Lvl Fixed  0   1     0x0/0x13  uhci#9
>>>>>> 21   0x88 9   PCI    Lvl Fixed  3   1     0x0/0x15  pci-ide#2
>>>>>> 23   0x86 9   PCI    Lvl Fixed  1   2     0x0/0x17  uhci#8, ehci#3
>>>>>> 24   0x83 7   PCI    Edg MSI    1   1     -         pcieb#2
>>>>>> 25   0x30 4   PCI    Edg MSI    2   1     -         pcieb#3
>>>>>> 26   0x60 6   PCI    Edg MSI    1   1     -         e1000g#0
>>>>>> 27   0x43 5   PCI    Edg MSI    3   1     -         mpt#0
>>>>>> 128  0x81 8          Edg IPI    all 1     -         iommu_intr_handler
>>>>>> 160  0xa0 0          Edg IPI    all 0     -         poke_cpu
>>>>>> 208  0xd0 14         Edg IPI    all 1     -         kcpc_hw_overflow_intr
>>>>>> 209  0xd1 14         Edg IPI    all 1     -         cbe_fire
>>>>>> 210  0xd3 14         Edg IPI    all 1     -         cbe_fire
>>>>>> 240  0xe0 15         Edg IPI    all 1     -         xc_serv
>>>>>> 241  0xe1 15         Edg IPI    all 1     -         apic_error_intr
>>>>>>
>>>>>> ... which seems to indicate that IRQ21 is in fact not shared.
>>>>>>
>>>>>> Any suggestions where I should start debugging this?  From what I can
>>>>>> find, the rmrr issue is a should-never-happen case; should I raise a
>>>>>> bug with Intel?  This didn't happen before adding the SAS expander, so
>>>>>> I'm reluctant to point fingers at the onboard ethernet controller
>>>>>> (the device that PCI ID belongs to).  Maybe swapping cards around will
>>>>>> help; I'll try that in a bit.
>>>>>>
>>>>>> Will
>>>>>> _______________________________________________
>>>>>> indiana-discuss mailing list
>>>>>> [email protected]
>>>>>> http://mail.opensolaris.org/mailman/listinfo/indiana-discuss
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>