Hello
We are running DRBD 8.3.12 in a dual primary system. On top of the 3
DRBD resources we run CLVM, and KVM virtual machines running from these.
Setup of the cluster followed Alteeve's tutorial:
https://alteeve.ca/w/2-Node_Red_Hat_KVM_Cluster_Tutorial
We have 5 virtual machines, 2 of which are Windows Server 2008 (one is
SBS 2011), the others Linux. All run fine, as far as I can tell, most of
the time.
The problem we have occurs when the SBS 2011 guest VM is restarted. This
did not happen when the server was first installed, but it has happened
on the last few reboots.
DRBD/KVM Host 1
Apr 25 21:24:42 oberon kernel: block drbd2: sock was shut down by peer
Apr 25 21:24:42 oberon kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Apr 25 21:24:42 oberon kernel: block drbd2: short read expecting header on sock: r=0
Apr 25 21:24:42 oberon kernel: block drbd2: asender terminated
Apr 25 21:24:42 oberon kernel: block drbd2: Terminating asender thread
(Host 1 is STONITHed at this point)
DRBD/Host 2
Apr 25 21:24:42 titania kernel: block drbd2: PingAck did not arrive in time.
Apr 25 21:24:42 titania kernel: block drbd2: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) susp( 0 -> 1 )
Apr 25 21:24:42 titania kernel: block drbd2: asender terminated
Apr 25 21:24:42 titania kernel: block drbd2: Terminating asender thread
Apr 25 21:24:42 titania kernel: block drbd2: Connection closed
Apr 25 21:24:42 titania kernel: block drbd2: conn( NetworkFailure -> Unconnected )
Host 2 continues, brings up the 2 VMs successfully, etc.
I assume the ping not arriving in time to host 2 causes the socket to
shut down on host 1?
The ping timeout is the default of 5 tenths of a second (500 ms). Why is
it timing out when this guest VM is rebooted?
The 2 host servers each have a dedicated Intel 10 Gigabit AT2 adaptor
for DRBD.
I have a feeling this may have started after the guest Windows VM had
more memory assigned, from about 15 GB to 20 GB, and I wonder if
Windows is writing some large memory dump when rebooting which pushes
DRBD's replication too far?
Simply upping the ping timeout seems like the wrong solution, but is the
only thing I can think of. Any suggestions welcome.
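For reference, the setting in question lives in the net section of the
resource definition. A minimal sketch, assuming the affected resource is
named r2 (a hypothetical name; only the device drbd2 appears in the
logs), with 20 chosen purely as an illustrative value rather than a
recommendation:

```
resource r2 {
  net {
    # ping-timeout is given in tenths of a second.
    # The default is 5 (500 ms); 20 would allow 2 seconds.
    ping-timeout 20;
  }
}
```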
Cheers
Alastair Battrick