Hi,
I posted this before, but I didn’t really get an answer and I’m still looking
for ways to debug this.
I have two servers (HP DL380 Gen8, 192 GB RAM, 6-core CPU).
I’ve outfitted them with HP’s H22x cards (really OEMed LSI 9207-8x HBAs, three
altogether) that I recently cross-flashed to LSI’s latest firmware.
They have:
- 14 600 GB SAS disks in their internal cages (a 2-disk ZFS mirror to boot from
and 2x RAID-Z2)
- another 12 1200 GB SAS disks in an external HP D2700 enclosure (2x RAID-Z2)
- all the RAID-Z2 vdevs together form a single zpool (sketched below)
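To make that concrete, the resulting layout is roughly equivalent to this (pool
name and da numbers are made up for illustration; the real device names differ):

    # 4x 6-disk RAID-Z2 vdevs in one pool:
    # 2 from the internal cages, 2 from the enclosure
    zpool create tank \
        raidz2 da2  da3  da4  da5  da6  da7  \
        raidz2 da8  da9  da10 da11 da12 da13 \
        raidz2 da14 da15 da16 da17 da18 da19 \
        raidz2 da20 da21 da22 da23 da24 da25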
These are NFS servers (master + warm standby) and MySQL slaves.
The client uses the NFS server and previously used both MySQL slaves for SELECT
queries only (via a middleware that seems to be able to sort this out).
When this system was built, we had twelve single-disk RAID0 logical volumes on
HP P4x0 RAID controllers (with 2 GB cache), merged into two RAID-Z2 vdevs for
storage (plus a RAID1 boot drive).
However, the customer outgrew that solution, and because there are no native HP
utilities for managing the RAID cards on FreeBSD, every time a drive dies in
such a setup you have to reboot the server (go into the adapter BIOS, destroy
the RAID volume, recreate it) to make sure ZFS "sees" the new drive.
So, I switched to LSI-type JBOD HBAs.
I had to reinstall both servers from scratch, obviously, and also upgraded to
FreeBSD 10.3 (from 10.1).
I have zfsnap (sysutils/zfsnap) create hourly, daily and weekly snapshots that
I transfer via zxfer from server1 to server2.
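For reference, the relevant cron entries look roughly like this (dataset names
simplified and flags quoted from memory, so treat it as a sketch rather than a
verbatim copy):

    # hourly snapshots on server1, kept for one week
    0  * * * * root /usr/local/sbin/zfsnap -a 1w -r tank
    # replicate new snapshots to server2, pruning stale ones on the target
    15 * * * * root /usr/local/sbin/zxfer -dFkPv -T root@server2 -R tank tank
    # prune expired snapshots in the early morning
    0  3 * * * root /usr/local/sbin/zfsnap -d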
Previously, this only took a couple of seconds, but since I switched to the LSI
HBAs it has been taking noticeably longer, even though the deltas are really
small in most cases.
After 03:00, when zfsnap deletes expired snapshots on the source system,
zxfer-ing the snapshots even takes several minutes.
At that point, it’s really noticeable that the 2nd system completely freezes.
The first system (the one doing the zfs send) is absolutely unaffected and
continues as normal.
I’ve set up a cron job that writes iostat and vmstat output into files
timestamped to the second; a few seconds after the zxfer starts, it stops
generating new files until shortly before the transfer finishes.
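The collector itself is nothing fancy, just a loop along these lines (log
directory and file naming invented for the example):

    #!/bin/sh
    # one iostat/vmstat snapshot per second, one timestamped file each
    while : ; do
        ts=$(date +%Y-%m-%d_%H:%M:%S)
        iostat -x > /var/log/perf/iostat.${ts}
        vmstat    > /var/log/perf/vmstat.${ts}
        sleep 1
    done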
During these hangs, the customer can’t use the 2nd MySQL server and is
understandably unhappy.
I’ve upgraded the 2nd system to 11.0-RELEASE, but it hasn’t helped (it might
actually even be slower).
How can I debug this? What’s the best way forward?
I don’t have any ZFS errors (checksum etc.), so I doubt it’s a cable.
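For the next hang, these are the kinds of things I can run from the console,
assuming it stays responsive (pool name is a placeholder):

    # kernel stacks of all threads, to see what zfs recv is blocked on
    procstat -kk -a > /tmp/stacks.$(date +%s)
    # per-disk latency and queue depth, batched once per second
    gstat -b -I 1s
    # per-vdev bandwidth/IOPS as seen by ZFS
    zpool iostat -v tank 1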
Here’s various system information:
ZFS Subsystem Report				Sat Oct  1 17:50:48 2016
System Information:
Kernel Version: 1100122 (osreldate)
Hardware Platform: amd64
Processor Architecture: amd64
ZFS Storage pool Version: 5000
ZFS Filesystem Version: 5
FreeBSD 11.0-RELEASE #0 r306211: Thu Sep 22 21:43:30 UTC 2016 root
5:50PM up 2 days, 19:50, 1 users, load averages: 0.11, 0.09, 0.08
System Memory:
0.26% 506.77 MiB Active, 11.52% 21.54 GiB Inact
62.29% 116.49 GiB Wired, 0.00% 0 Cache
25.93% 48.49 GiB Free, 0.00% 0 Gap
Real Installed: 192.00 GiB
Real Available: 99.96% 191.93 GiB
Real Managed: 97.44% 187.01 GiB
Logical Total: 192.00 GiB
Logical Used: 63.53% 121.97 GiB
Logical Free: 36.47% 70.03 GiB
Kernel Memory: 2.15 GiB
Data: 98.41% 2.12 GiB
Text: 1.59% 35.02 MiB
Kernel Memory Map: 187.01 GiB
Size: 40.16% 75.10 GiB
Free: 59.84% 111.92 GiB
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 2.16m
Recycle Misses: 0
Mutex Misses: 1.92k
Evict Skips: 54.73k
ARC Size: 64.39% 41.21 GiB
Target Size: (Adaptive) 100.00% 64.00 GiB
Min Size (Hard Limit): 36.33% 23.25 GiB
Max Size (High Water): 2:1 64.00 GiB
ARC Size Breakdown:
Recently Used Cache Size: 89.29% 57.14 GiB
Frequently Used Cache Size: 10.71% 6.86 GiB
ARC Hash Breakdown:
Elements Max: 5.37m
Elements Current