Hi,
I posted this before, but I didn’t really get an answer and I’m still looking
for ways to debug this.
I have two servers (HP DL380 Gen8, 192 GB RAM, 6-core CPU).
I’ve outfitted them with HP’s H22x cards (really OEMed LSI 9207-8x HBAs, three
altogether) that I recently cross-flashed to LSI’s latest firmware.
They have:
- 14 600 GB SAS disks in their internal cages (a 2-disk ZFS mirror to boot from
and 2x RAID-Z2)
- another 12 1200 GB SAS disks in an external HP D2700 enclosure (2x RAID-Z2)
- all the RAID-Z2 vdevs together form a single zpool (sketched below)
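To make that concrete, the resulting layout is roughly equivalent to this (pool
name and da numbers are made up for illustration; the real device names differ):

    # 4x 6-disk RAID-Z2 vdevs in one pool:
    # 2 from the internal cages, 2 from the enclosure
    zpool create tank \
        raidz2 da2  da3  da4  da5  da6  da7  \
        raidz2 da8  da9  da10 da11 da12 da13 \
        raidz2 da14 da15 da16 da17 da18 da19 \
        raidz2 da20 da21 da22 da23 da24 da25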
These are NFS servers (master + warm standby) and MySQL slaves.
The client uses the NFS server and previously used both MySQL slaves for SELECT
queries only (via a middleware that seems to be able to sort this out).
When this system was built, we had twelve single-disk RAID0 logical volumes on
HP P4x0 RAID controllers (with 2 GB cache), merged into two RAID-Z2 vdevs for
storage (plus a RAID1 boot drive).
However, the customer outgrew that solution, and because there are no native HP
utilities for managing the RAID cards on FreeBSD, every time a drive dies in
such a setup you have to reboot the server (go into the adapter BIOS, destroy
the RAID volume, recreate it) to make sure ZFS "sees" the new drive.
So, I switched to LSI-type JBOD HBAs.
I had to reinstall both servers from scratch, obviously, and also upgraded to
FreeBSD 10.3 (from 10.1).
I have zfsnap (sysutils/zfsnap) create hourly, daily and weekly snapshots that
I transfer via zxfer from server1 to server2.
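For reference, the relevant cron entries look roughly like this (dataset names
simplified and flags quoted from memory, so treat it as a sketch rather than a
verbatim copy):

    # hourly snapshots on server1, kept for one week
    0  * * * * root /usr/local/sbin/zfsnap -a 1w -r tank
    # replicate new snapshots to server2, pruning stale ones on the target
    15 * * * * root /usr/local/sbin/zxfer -dFkPv -T root@server2 -R tank tank
    # prune expired snapshots in the early morning
    0  3 * * * root /usr/local/sbin/zfsnap -d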
Previously, this only took a couple of seconds, but since I switched to the LSI
HBAs it has been taking noticeably longer, even though the deltas are really
small in most cases.
After 03:00, when zfsnap deletes expired snapshots on the source system,
zxfer-ing the snapshots even takes several minutes.
At that point, it’s really noticeable that the 2nd system completely freezes.
The first system (the one doing the zfs send) is absolutely unaffected and
continues as normal.
I’ve set up a cron job that writes iostat and vmstat output into files
timestamped to the second; a few seconds after the zxfer starts, it stops
generating new files until shortly before the transfer finishes.
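The collector itself is nothing fancy, just a loop along these lines (log
directory and file naming invented for the example):

    #!/bin/sh
    # one iostat/vmstat snapshot per second, one timestamped file each
    while : ; do
        ts=$(date +%Y-%m-%d_%H:%M:%S)
        iostat -x > /var/log/perf/iostat.${ts}
        vmstat    > /var/log/perf/vmstat.${ts}
        sleep 1
    done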
During these hangs, the customer can’t use the 2nd MySQL server and is
understandably unhappy.
I’ve upgraded the 2nd system to 11.0-RELEASE, but it hasn’t helped (it might
actually even be slower).
How can I debug this? What’s the best way forward?
I don’t have any ZFS errors (checksum etc.), so I doubt it’s a cable.
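For the next hang, these are the kinds of things I can run from the console,
assuming it stays responsive (pool name is a placeholder):

    # kernel stacks of all threads, to see what zfs recv is blocked on
    procstat -kk -a > /tmp/stacks.$(date +%s)
    # per-disk latency and queue depth, batched once per second
    gstat -b -I 1s
    # per-vdev bandwidth/IOPS as seen by ZFS
    zpool iostat -v tank 1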
Here’s various system information:
ZFS Subsystem Report				Sat Oct  1 17:50:48 2016
System Information:
Kernel Version: 1100122 (osreldate)
Hardware Platform: amd64
Processor Architecture: amd64
ZFS Storage pool Version: 5000
ZFS Filesystem Version: 5
FreeBSD 11.0-RELEASE #0 r306211: Thu Sep 22 21:43:30 UTC 2016 root
5:50PM up 2 days, 19:50, 1 users, load averages: 0.11, 0.09, 0.08
System Memory:
0.26% 506.77 MiB Active, 11.52% 21.54 GiB Inact
62.29% 116.49 GiB Wired, 0.00% 0 Cache
25.93% 48.49 GiB Free, 0.00% 0 Gap
Real Installed: 192.00 GiB
Real Available: 99.96% 191.93 GiB
Real Managed: 97.44% 187.01 GiB
Logical Total: 192.00 GiB
Logical Used: 63.53% 121.97 GiB
Logical Free: 36.47% 70.03 GiB
Kernel Memory: 2.15 GiB
Data: 98.41% 2.12 GiB
Text: 1.59% 35.02 MiB
Kernel Memory Map: 187.01 GiB
Size: 40.16% 75.10 GiB
Free: 59.84% 111.92 GiB
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 2.16m
Recycle Misses: 0
Mutex Misses: 1.92k
Evict Skips: 54.73k
ARC Size: 64.39% 41.21 GiB
Target Size: (Adaptive) 100.00% 64.00 GiB
Min Size (Hard Limit): 36.33% 23.25 GiB
Max Size (High Water): 2:1 64.00 GiB
ARC Size Breakdown:
Recently Used Cache Size: 89.29% 57.14 GiB
Frequently Used Cache Size: 10.71% 6.86 GiB
ARC Hash Breakdown:
Elements Max: 5.37m
Elements Current