I have several successful DRBD clusters in production, including two
RHEL 5.3 servers running drbd 8.3.2. They have been running fine for
more than a year. Today we saw very high iowait (99%) on the primary
node (possibly on the secondary too, but I neglected to look) and users
could not work. We tried to find the source of the iowait but could not,
and ended up rebooting the primary. The standby took over fine and began
a resync, but it was running very slowly, about 8K per second. Resyncing
the 2TB volume was going to take 11+ hours.
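(For anyone wanting to sanity-check an ETA like that: it can be computed from the out-of-sync amount — the oos= figure in /proc/drbd is in KB — and the sync rate in K/sec. A sketch; the figures in the usage line are illustrative, not our cluster's actual values.)

```shell
# Rough resync ETA in hours from out-of-sync KB and a rate in K/sec.
eta_hours() {
    oos_kb=$1; speed_kps=$2
    awk -v o="$oos_kb" -v s="$speed_kps" 'BEGIN { printf "%.1f\n", o / s / 3600 }'
}
# e.g. a full 2TB (2147483648 KB) resync at ~53 MB/s:
# eta_hours 2147483648 53000
```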
So I did...
drbdsetup /dev/drbd0 -r 300M
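(For the record — typed from memory above; in drbd 8.3 the rate lives under the syncer subcommand, and there is a persistent drbd.conf equivalent. A sketch; the resource name r0 is a placeholder.)

```shell
# Runtime (drbd 8.3): the resync rate is a syncer option
drbdsetup /dev/drbd0 syncer -r 300M

# Persistent equivalent in /etc/drbd.conf (config fragment):
#   resource r0 {
#     syncer {
#       rate 300M;
#     }
#   }
# then apply it with: drbdadm adjust r0
```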
The command took more than a minute to return to the shell prompt. Now
when I cat /proc/drbd, I see the speed start at about 90K, then drop to
~80K, ~70K, ~60K, and so on, until it reaches 0 and the status says
"stalled." After 10-30 seconds in the stalled state, it kicks off again
at about 90K, then slowly drops back down and stalls again.
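(A sawtooth like that is easier to see logged over time than by eyeballing cat /proc/drbd. drbd 8.3 prints a "finish: H:MM:SS speed: N (M) K/sec" line during a resync; this pulls out the current speed so it can be logged. The field position is an assumption based on the 8.3 status format.)

```shell
# Print the current resync speed (in K/sec) from /proc/drbd,
# stripping the thousands separator so the value is usable in arithmetic.
sync_speed() {
    grep 'speed:' "${1:-/proc/drbd}" | awk '{ gsub(",", "", $4); print $4 }'
}
# watch it live:
# while sleep 5; do sync_speed; done
```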
Both servers are on the same GigE switch. Running iperf shows that we're
getting almost the full gigabit per second on the replication link.
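(Since the link checks out, the usual next suspect is the backing disk on one of the nodes. A quick sequential-read check of drbd's lower-level device, run on both nodes for comparison — a sketch; /dev/sdb is a placeholder for whatever actually sits under drbd0.)

```shell
# Sequential-read check: dd prints its throughput summary on stderr,
# so redirect it and keep the last line.
read_bench() {
    dd if="$1" of=/dev/null bs=1M count="${2:-1024}" 2>&1 | tail -n 1
}
# Drop the page cache first so the numbers reflect the disk, then run
# on both nodes and compare:
#   echo 3 > /proc/sys/vm/drop_caches
#   read_bench /dev/sdb
```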
Any ideas how I can troubleshoot this before users come in tomorrow
morning?
--
Eric Robinson
Disclaimer - February 16, 2011
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user