Hi Pascal, Sorry it took a while to respond this time.
At 11:37 PM 5/26/2011, Pascal BERTON wrote:
Hi, Richard! OK, good news for the NICs, but bad news for you, unfortunately. :) Well, since /dev/sdb is used by this /work directory, we might try to investigate that angle. Some info is still missing, though: when the platform calms down, do you still see a 10:1 ratio between write activity on sdb and the other volumes?
Yes. It pretty much stays the same regardless of the load.
When the platform experiences this short peak, does the 10:1 ratio hit a peak too, maybe rising to 15:1 or 20:1?
Nope.
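By the way, the iostat figures quoted further down are since-boot averages. A rough way to check the ratio during the peak window itself (just a sketch; the 10-second interval and 6-sample count are arbitrary) is interval sampling:

  # Six 10-second samples of per-device throughput in kB/s; compare the
  # kB_wrtn/s column for sdb against sdc to see whether the write ratio
  # climbs past ~10:1 during the peak.
  iostat -dk sdb sdc 10 6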
What I'm suspecting, now that you've checked a good number of things, is this: what if DRBD is doing nothing but what it is asked to, and the problem is coming from your mail application? I mean, if your software starts doing odd things, DRBD will react too, and what you see might only be the visible part of the iceberg... Didn't you install some patches on May 10th? Or did you reconfigure something in your mail subsystem that would explain this increase in /work usage?
No. I _WISH_ we had changed something, but we didn't.
More generally, what is causing these writes to /dev/sdb? What type of activity? (Please don't answer "work"! ;) )
I have no idea. Note that /work is only allotted 50G, less than 10% of the total available on that disk. The rest is allotted to the letter partitions. I don't know of a way to see which letters (or /work) are writing to /dev/sdb as opposed to /dev/sdc. The only thing writing to /work with any regularity is squirrelmail's pref files, and that certainly can't account for this much activity (especially since the write rate doesn't drop when we disable the webmail servers). Still at a loss here; any help appreciated. Thanks. - Richard
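P.S. One idea, sketched from what I know of LVM (the LV name below is hypothetical; only the VG name "lvm" appears in the pvdisplay output): since /dev/sdb is just an LVM PV, the per-volume breakdown shows up on the device-mapper devices (dm-N) rather than on sdb itself, assuming iostat/diskstats on this kernel list them:

  # Map LV names to dm-N: dmsetup prints each mapped device with its
  # (major, minor) pair, and minor N is the dm-N that iostat reports.
  dmsetup ls
  # e.g. a line like "lvm-work  (253, 4)" would mean the "work" LV is dm-4.

  # Then watch per-LV write rates over 10-second windows to see which
  # letter volumes (or work) are doing the sdb writes.
  iostat -dk 10 6 | grep -E '^Device|^dm-'

Per-process tools won't help much here, since the mail servers write over NFS and the writes land via the nfsd kernel threads on the DRBD nodes.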
Best regards, Pascal.

-----Original Message-----
From: drbd-user-boun...@lists.linbit.com [mailto:drbd-user-boun...@lists.linbit.com] On Behalf Of Richard Stockton
Sent: Friday, May 27, 2011 03:43
To: drbd-user@lists.linbit.com
Subject: [DRBD-user] Sudden system load

Thanks to those who helped on the first go-around of this issue. We have a better handle on the indicators, but still no solution.

We use DRBD to handle our 8000 (or so) pop/imap accounts. These exist on 2 separate servers, each with 1/2 the alphabet, and each set to fail over to the other if one goes down. So MDA01 handles accounts beginning with "A" through "L", while MDA02 handles letters "M" through "Z", plus a "/work/" partition. Each server replicates its letters (via drbd) to the other server, and heartbeat can force all the partitions to be handled on a single server if the other goes down. The various mail servers connect to the DRBD machines via NFS3 (rw,udp,intr,noatime,nolock).

We have been using this system for several years now without serious issues, until suddenly, on or about noon on May 10, 2011, the server load went from a normal less than 1.0 to 14+. It does not remain high all day; in fact it seems to run normally for about 10.5 hours and then run higher than normal for the next 13.5 hours, although on Tuesday it never hit the high loads even though activity was at a normal level. Also, it builds to a peak (anywhere from 12 to 28) somewhere in the middle of that 13.5 hours, holds it for only a few minutes (10-15), and then trails off again. The pattern repeats on weekends too, but the load is much lower (3-4).

Each server has almost identical drives allocated for this purpose.

mda01:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               vLYWPe-TsJk-L6cv-Dycp-GBNp-XyfV-KkxzET

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               89695
  Allocated PE          231808
  PV UUID               QVpf66-euRN-je7I-L0oq-Cahk-ezFr-HAa3vx

mda02:root:> /usr/sbin/pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb
  VG Name               lvm
  PV Size               544.50 GB / not usable 4.00 MB
  Allocatable           yes (but full)
  PE Size (KByte)       4096
  Total PE              139391
  Free PE               0
  Allocated PE          139391
  PV UUID               QY8V1P-uCni-Va3b-2Ypl-7wP9-lEtl-ofD05G

  --- Physical volume ---
  PV Name               /dev/sdc1
  VG Name               lvm
  PV Size               1.23 TB / not usable 3.77 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              321503
  Free PE               87135
  Allocated PE          234368
  PV UUID               E09ufG-Qkep-I4hB-3Zda-n7Vy-7zXZ-Nn0Lvi

So, a 1.75 TB virtual disk on each server, 81% allocated. The part that really confuses me is that the 2 500GB drives seem to always have 10 times as many writes going on as the 1.23TB drives.
(Parts of the iostat output removed for readability)

mda01:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda01.adhost.com)   05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.04    2.09   19.41    0.00   78.37

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               3.21        25.87        73.52    36074352   102536956
sda1              0.00         0.00         0.00        2566          60
sda2              0.00         0.00         0.00        2664         448
sda3              3.21        25.86        73.52    36066666   102536448
sdb             336.60       205.27      5675.40   286280151  7915188839
sdc              27.56       148.08       622.15   206517470   867684260
sdc1             27.56       148.08       622.15   206514758   867684260

mda02:root:> iostat | head -13
Linux 2.6.18-128.1.10.el5 (mda02.adhost.com)   05/26/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.10    0.05    1.87   12.33    0.00   85.65

Device:            tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda               2.84        18.23        68.26    25339938    94863896
sda1              0.00         0.00         0.00        2920          56
sda2              0.00         0.00         0.00        2848        1568
sda3              2.84        18.23        68.26    25331994    94862272
sdb             333.45       109.90      5679.15   152727845  7892497866
sdc              29.93       124.20       588.41   172601220   817732660
sdc1             29.93       124.20       588.41   172598796   817732660

We have checked the network I/O and the NICs. There are no errors, no dropped packets, no overruns, etc. The NICs look perfect. We have run rkhunter and chkrootkit on both machines and they found nothing.

RedHat 5.3 (2.6.18-128.1.10.el5)
DRBD 8.3.1
Heartbeat 2.1.4

Again, any ideas about what is happening, and/or additional diagnostics we might run, would be much appreciated. Thank you.
  - Richard
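P.P.S. On the "additional diagnostics" question in my original message, one more sketch (interval, count, and the log file name are arbitrary): extended iostat statistics captured across the peak window would show whether sdb is actually saturating and what the average request size of those writes looks like:

  # Extended per-device stats (await, avgrq-sz, %util) for sdb and sdc,
  # in 30-second windows for one hour; run it across the load peak and
  # keep the output for comparison with a quiet period.
  iostat -xk sdb sdc 30 120 > iostat-peak.log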
_______________________________________________ drbd-user mailing list drbd-user@lists.linbit.com http://lists.linbit.com/mailman/listinfo/drbd-user