[DRBD-user] Strange behavior with non-default ping timeout since 8.3.11

Andreas Hofmeister Wed, 19 Sep 2012 06:43:41 -0700

Hi all,

We use drbd 8.3.11 as a dual-primary in a pacemaker (1.0.x) cluster.

In our setup, we need a somewhat larger ping-timeout (2s) due to interruptions during a firewall restart. That used to work well with 8.3.10, but since 8.3.11 it causes crm resource stop/start sequences to fail.
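
For reference, a setting like this would look roughly as follows in drbd.conf. This is only a sketch: the resource name r0 is a placeholder, and only the two net options relevant to this report are shown. ping-timeout is given in tenths of a second, so 20 corresponds to 2s.

```
resource r0 {
  net {
    allow-two-primaries;   # dual-primary operation
    ping-timeout 20;       # unit is 0.1 s, so this is 2 s
  }
  # remaining sections (on <host> { ... }, disk, syncer, ...) omitted
}
```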

A git bisect showed that this effect occurs since
http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=a0c9e5442e3be2d17772f50e1cf1d714cbddc51d

It seems that Pacemaker executes the sequence drbdadm up + drbdadm primary rather quickly. If the "drbdadm primary" happens while drbd is still waiting for the connection to be established (WFConnection), the resource startup fails: a split-brain is detected, and automatic resolution then fails because by that time both sides are already primary.

The above patch prolongs the time during which the problem may occur: with the old 100ms connection timeout it was rather unlikely to happen, with a 2s timeout it is almost guaranteed.

We were able to reproduce the problem with ping-timeout 20 on a running dual-primary with

  drbdadm down <res>; drbdadm up <res>; drbdadm primary <res>

This sequence however works:

  drbdadm down <res>; drbdadm up <res>; drbdadm wait-connect <res>; \
    drbdadm primary <res>
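
The working sequence can be wrapped in a small helper for repeated testing. This is a minimal sketch, not something taken from the DRBD sources: the function name restart_and_promote is made up, and stopping early when "up" or "wait-connect" fails is an assumption (the one-liner above simply chains the commands with ";").

```shell
#!/bin/sh
# Hypothetical test helper: cycle a resource and promote it only after
# the connection to the peer has been established.
restart_and_promote() {
    res="$1"
    if [ -z "$res" ]; then
        echo "usage: restart_and_promote <res>" >&2
        return 2
    fi
    # Take the resource down; its result is not checked, as in the one-liner.
    drbdadm down "$res"
    drbdadm up "$res" || return 1
    # Block here instead of racing "primary" against WFConnection.
    drbdadm wait-connect "$res" || return 1
    drbdadm primary "$res"
}
```

Usage would be something like "restart_and_promote <res>" on the node being restarted.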

Our test setup was a 3.0.41 kernel running drbd 8.3.13 under KVM.

Putting this

  test "$rc" = "$OCF_SUCCESS" && drbdadm wait-connect $DRBD_RESOURCE

into the drbd_start function of the RA seems to work for us.
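
Spelled out as a function, the idea looks roughly like this. It is only a sketch under assumptions: the helper name wait_connect_if_started is hypothetical and not part of the shipped RA, OCF_SUCCESS and DRBD_RESOURCE are normally provided by the RA environment, and ignoring the result of wait-connect is a choice made here, not something the RA prescribes.

```shell
#!/bin/sh
# Normally set by the OCF shell functions; the default is only for this sketch.
: "${OCF_SUCCESS:=0}"

# Hypothetical helper: $1 is the return code drbd_start has computed so far.
wait_connect_if_started() {
    rc="$1"
    if [ "$rc" = "$OCF_SUCCESS" ]; then
        # Wait for the peer before Pacemaker gets a chance to promote.
        # The result is ignored in this sketch (assumption).
        drbdadm wait-connect "$DRBD_RESOURCE" || true
    fi
    # Always hand back the original start result unchanged.
    return "$rc"
}
```

Written this way, the return code of the start action does not change when rc is not OCF_SUCCESS, which can happen with the one-liner if it ends up as the last statement of the function.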

Ciao
  Andi
