Hiya,

I'm currently evaluating two classes of server which we source through NEC. 
However, the motherboards for these machines are HP. I can routinely panic both 
of these machines using 12.0-A4, as well as 11.1-R with a shoehorned-in
SES/SMARTPQI driver, and 11.2-R with its native SES/SMARTPQI driver. NEC seems 
to think this is a ZFS issue and they may be correct. If so I suspect ARC, 
though as I explain further down, I haven't had a problem on other hardware. 

I've managed to get a core dump on 11.1 and 11.2. On 12.0, when the panic
occurs, I can get a backtrace and force a panic from the debugger, and the
system claims it is writing out a core dump, but on reboot there is no core
dump to be found.
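
In case it matters, dumps are configured the standard way in /etc/rc.conf on
these boxes (the values shown are the stock defaults):

root@hvm2d:~ # grep dump /etc/rc.conf
dumpdev="AUTO"
dumpdir="/var/crash"

So savecore should be picking the dump up off the swap partition at boot,
which is why the missing core on 12.0 puzzles me.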

Machine A: HP ProLiant DL360 Gen10 with a Xeon Bronze 3106, 16 gigs of RAM,
and three hard drives.

Machine B: HP ProLiant DL380 Gen10 with a Xeon Silver 4114, 32 gigs of RAM,
and five hard drives.

I install 12.0-A4 with ZFS on root and 8 gigs of swap; otherwise it's a
standard FreeBSD install. I can panic these machines rather easily within
10-15 minutes by firing up six instances of bonnie++ plus six memtester
processes, three using 2g and three using 4g. I've done this on the 11.x
installs without memtester and gotten panics within 10-15 minutes. Those gave
me core dumps, but the panic error is different from the one under 12.0-A4. I
have also run some tests using UFS2 and did not manage to force a panic.
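
For reference, the load looks roughly like this (the bonnie++ scratch
directories are arbitrary paths on the pool, and bonnie++ and memtester are
the stock ports versions):

root@hvm2d:~ # for i in 1 2 3 4 5 6; do mkdir -p /tmp/bonnie$i; bonnie++ -d /tmp/bonnie$i -u root & done
root@hvm2d:~ # memtester 2G & memtester 2G & memtester 2G &
root@hvm2d:~ # memtester 4G & memtester 4G & memtester 4G &

That keeps the drives busy and puts heavy pressure on memory until something
gives.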

At first I thought the problem was the HPE RAID card, which uses the SMARTPQI
driver, so I put in a recent LSI MegaRAID card using the MRSAS driver, and I
can panic that as well. I've managed to panic Machine B while using either
RAID card to create two mirrors and one hot spare, and also when letting the
RAID cards pass the hard drives through so I could create a raidz of four
drives plus one hot spare. I know many people immediately think "Don't use a
RAID card with ZFS!", but I've done this for years without a problem using
the LSI MegaRAID in a variety of configurations.
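
For concreteness, the pass-through case ends up as a pool shaped like this
(the pool name and device names here are placeholders, not the exact command
I ran):

root@hvm2d:~ # zpool create tank raidz da0 da1 da2 da3 spare da4

The two-mirrors-plus-spare case is built as volumes on the RAID card itself,
so there is no equivalent zpool line for it.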

It really seems to me that the panic occurs once the ARC starts to ramp up
and hits a lot of memory contention. However, I've been running the same test
on a previous-generation NEC server with an LSI MegaRAID using the MRSAS
driver under 11.2-R, and it has been running like clockwork for 11 days. We
use this iteration of server extensively. If this were a problem with ARC, I
would (perhaps presumptuously) expect to see the same problems there. I also
have servers running 11.2-R with ZFS and rather large, very heavily used JBOD
arrays, and I have never had an issue with them.
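
When I say the ARC ramps up, I'm watching it with sysctl while the test runs;
one isolation step I could still try is capping the ARC in /boot/loader.conf
and re-running the test (the 4G figure below is just an example value):

root@hvm2d:~ # sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
root@hvm2d:~ # echo 'vfs.zfs.arc_max="4G"' >> /boot/loader.conf

If the panics stop with the ARC capped well below physical memory, that would
at least point the finger more firmly at ARC.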

The HPE RAID card info, from pciconf -lv:

smartpqi0@pci0:92:0:0:  class=0x010700 card=0x0654103c chip=0x028f9005 rev=0x01 hdr=0x00
    vendor     = 'Adaptec'
    device     = 'Smart Storage PQI 12G SAS/PCIe 3'
    class      = mass storage
    subclass   = SAS

And from dmesg:

root@hvm2d:~ # dmesg | grep smartpq
smartpqi0: <E208i-a SR Gen10> port 0x8000-0x80ff mem 0xe6c00000-0xe6c07fff at device 0.0 on pci9
smartpqi0: using MSI-X interrupts (40 vectors)
da0 at smartpqi0 bus 0 scbus0 target 0 lun 0
da1 at smartpqi0 bus 0 scbus0 target 1 lun 0
ses0 at smartpqi0 bus 0 scbus0 target 69 lun 0
pass3 at smartpqi0 bus 0 scbus0 target 1088 lun 0

However, since I can panic these with either RAID card, I don't suspect the HPE 
RAID card as the culprit.

Here is an image with the scant bt info I got from the last panic:

https://ibb.co/dzFOn9

This thread from Saturday on -stable sounded all too familiar:

https://lists.freebsd.org/pipermail/freebsd-stable/2018-September/089623.html

I'm at a loss, so I've gathered as much info as I can to anticipate questions
and requests for more detail. I'm hoping someone can point me in the right
direction for further troubleshooting, or at least help isolate the problem
to a specific area.

Thanks for your time,

Dave
