Lars,

Thank you very much for your explanation. In that case, if I got a "connection reset by peer" error, the situation becomes even stranger. I actually have two resources on this cluster, r0 and r1, and I had the problem with r1 only. If it were a communication hiccup, I would expect a problem with both resources simultaneously, but I didn't see one. The split brain was on r1 only. See my config file below; a sketch of the per-resource checks follows it.

global {
  usage-count no;
}
common {
  protocol C;
}

resource r0 {
  device    /dev/drbd1;
  disk      /dev/sdb;
  meta-disk internal;
  net {
    allow-two-primaries;
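    # Automatic split-brain recovery policies: the 0pri/1pri cases can be
    # resolved automatically; with two primaries (after-sb-2pri) DRBD only
    # disconnects, so an unresolved split brain needs manual recovery.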
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ping-timeout 20;
  }
  startup {
    wfc-timeout 100;
    degr-wfc-timeout 60;
    become-primary-on both;
  }
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }

  on infplsm004 {
    address   192.168.10.9:7789;
  }
  on infplsm005 {
    address   192.168.10.10:7789;
  }
}
resource r1 {
  device    /dev/drbd2;
  disk      /dev/sdc;
  meta-disk internal;

  # This is to allow dual primary mode.
  # http://www.drbd.org/users-guide-emb/s-enable-dual-primary.html
  net {
    allow-two-primaries;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    ping-timeout 20;
  }
  startup {
    wfc-timeout 100;
    degr-wfc-timeout 60;
    become-primary-on both;
  }
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }

  on infplsm004 {
    address   192.168.10.9:7790;
  }
  on infplsm005 {
    address   192.168.10.10:7790;
  }
}
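
A sketch of the per-resource checks referenced above; the drbdadm and /proc commands are standard, but the syslog path is an assumption (adjust it to wherever your distribution writes kernel messages):

  # Connection state per resource ("Connected" when healthy)
  drbdadm cstate r0
  drbdadm cstate r1

  # Role, connection and disk state for every configured minor
  cat /proc/drbd

  # Kernel messages for each minor around the time of the incident
  grep 'block drbd1:' /var/log/messages   # r0
  grep 'block drbd2:' /var/log/messages   # r1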

Thank you,
Ivan


On 09/21/2011 10:15 PM, Lars Ellenberg wrote:
On Wed, Sep 21, 2011 at 10:08:42AM +1000, Ivan Pavlenko wrote:
Hi All,

Recently I had a split brain on my cluster. It was not a big issue,
but I still haven't found the reason for this glitch. I got the
following in my log file:
We call it a DRBD resource-internal split brain when you have a period
in time during which both nodes cannot communicate, _and_ both have
been Primary.

Which means: whenever you run dual-primary DRBD and have a hiccup on
the replication link, that causes a DRBD "split brain";
maybe better read that as "potential data-set divergence".
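
For reference, the manual recovery for such an unresolved split brain is the procedure from the DRBD user's guide; this is only a sketch, using resource r1 from the config above and DRBD 8.3 syntax (8.4 and later spell it "drbdadm connect --discard-my-data r1"). Which node plays the split-brain victim, i.e. whose changes since the split get thrown away, is a decision you have to make yourself:

  # On the split-brain victim:
  drbdadm secondary r1
  drbdadm -- --discard-my-data connect r1

  # On the surviving node, if it has already dropped to StandAlone:
  drbdadm connect r1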

Sep 20 18:44:35 infplsm004<kern.info>  kernel: VMCIUtil: Updating
context id from 0x775d2835 to 0x775d2835 on event 0.
Sep 20 18:44:35 infplsm004<kern.err>  kernel: block drbd2:
sock_recvmsg returned -104
Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2: peer(
Primary ->  Unknown ) conn( Connected ->  NetworkFailure ) pdsk(
UpToDate ->  DUnknown )
Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2: asender
terminated
Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2:
Terminating asender thread
Sep 20 18:44:35 infplsm004<kern.err>  kernel: block drbd2: short
read expecting header on sock: r=-512
Sep 20 18:44:35 infplsm004<kern.info>  kernel: block drbd2: Creating
new current UUID
Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2:
Connection closed
Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: conn(
NetworkFailure ->  Unconnected )
Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: receiver
terminated
Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2:
Restarting receiver thread
Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: receiver
(re)started
Sep 20 18:44:36 infplsm004<kern.info>  kernel: block drbd2: conn(
Unconnected ->  WFConnection )
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
Handshake successful: Agreed network protocol version 94
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
WFConnection ->  WFReportParams )
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: Starting
asender thread (from drbd2_receiver [11360])
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
data-integrity-alg:<not-used>
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
drbd_sync_handshake:
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: self
AD9C020C7BA6E149:51B8CD59E67A7227:01C987FB5F84C0D1:30241D96D32A31CF
bits:1 flags:0
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: peer
A2111F74640A099D:51B8CD59E67A7227:01C987FB5F84C0D0:30241D96D32A31CF
bits:0 flags:0
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
uuid_compare()=100 by rule 90
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
command: /sbin/drbdadm initial-split-brain minor-2
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
Sep 20 18:44:38 infplsm004<kern.alert>  kernel: block drbd2:
Split-Brain detected but unresolved, dropping connection!
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
command: /sbin/drbdadm split-brain minor-2
Sep 20 18:44:38 infplsm004<kern.err>  kernel: block drbd2: meta
connection shut down by peer.
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
WFReportParams ->  NetworkFailure )
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: asender
terminated
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
Terminating asender thread
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: helper
command: /sbin/drbdadm split-brain minor-2 exit code 0 (0x0)
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
NetworkFailure ->  Disconnecting )
Sep 20 18:44:38 infplsm004<kern.err>  kernel: block drbd2: error
receiving ReportState, l: 4!
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
Connection closed
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: conn(
Disconnecting ->  StandAlone )
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2: receiver
terminated
Sep 20 18:44:38 infplsm004<kern.info>  kernel: block drbd2:
Terminating receiver thread

I'd like to draw your attention to the first two rows. The DRBD socket
receive returned code -104. What does that mean? Where can I get
information about the error codes?
These are typically normal negative errno codes;
on my box, 104 would be ECONNRESET, Connection reset by peer.
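
For the archives, a quick way to turn such a number into its symbolic name (just a sketch; it assumes a Linux box with Python and the kernel headers installed):

  # Map errno 104 to its name and message
  python -c 'import errno, os; print(errno.errorcode[104] + ": " + os.strerror(104))'

  # Or look it up in the kernel's errno definitions
  grep -w 104 /usr/include/asm-generic/errno*.h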

Thank you in advance,
Ivan

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
