I'm running DRBD 8.3.13 on Debian Wheezy, Linux 3.2.20 and
every now and then my DRBD resources spontaneously switch from
cs:Connected to cs:WFConnection or the various syncing states and back
(according to "watch cat /proc/drbd").

I've sometimes seen "broken pipe" or even "protocol error"(!?) flashing
by briefly.

No luck debugging this so far. I've tried changing network cards, switching between bonding modes, reverting to regular ethX (instead of bonding), various MTU and txqueuelen values, and running with and without resource-only fencing (corosync). Nothing has helped so far - this connection instability just seems to come and go.

Any better debugging ideas? Or maybe this is not a network issue at all?

Excerpt from my DRBD configuration:

        net {
                timeout 20;
                max-epoch-size  8192;
                max-buffers     128k;
                connect-int     2;
                ping-int        2;
                sndbuf-size     10M;
                rcvbuf-size     10M;
                ko-count        5;
                after-sb-0pri   discard-zero-changes;
                after-sb-1pri   discard-secondary;
                ping-timeout    2;
        }

        syncer {
                rate    100M;
                al-extents      3389;
                csums-alg       crc32c;
                verify-alg      crc32c;
        }
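
One thing that may be worth checking in the net section above is the mix of units. The sketch below assumes the units given in the drbd.conf man page for DRBD 8.3 (timeout and ping-timeout in tenths of a second, connect-int and ping-int in whole seconds) and the man page's statement that timeout must be lower than connect-int and ping-int; please verify both against your installed version. The names `net`, `to_seconds` and `timeout_warnings` are made up for this illustration and are not part of DRBD.

```python
# Assumption: units as documented in drbd.conf(5) for DRBD 8.3.
#   timeout, ping-timeout : tenths of a second
#   connect-int, ping-int : seconds
# Values copied from the net section of the posted configuration.
net = {"timeout": 20, "connect-int": 2, "ping-int": 2, "ping-timeout": 2}


def to_seconds(net):
    """Convert the four timing options to seconds."""
    return {
        "timeout": net["timeout"] / 10,
        "connect-int": float(net["connect-int"]),
        "ping-int": float(net["ping-int"]),
        "ping-timeout": net["ping-timeout"] / 10,
    }


def timeout_warnings(net):
    """Return a message for each interval that timeout is not smaller than."""
    s = to_seconds(net)
    warnings = []
    for other in ("connect-int", "ping-int"):
        if not s["timeout"] < s[other]:
            warnings.append(
                f"timeout ({s['timeout']:.1f}s) is not smaller than "
                f"{other} ({s[other]:.1f}s)"
            )
    return warnings


if __name__ == "__main__":
    for name, value in to_seconds(net).items():
        print(f"{name:13s} {value:.1f}s")
    for message in timeout_warnings(net):
        print("WARNING:", message)
```

Under those assumptions, timeout works out to 2.0 s, the same as connect-int and ping-int, so the check reports both. Whether that has anything to do with the disconnects is not established here; it is only a consistency check on the configuration.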


Here's a syslog snippet demonstrating one whole cycle of this behavior:

kernel: [ 9827.966027] block drbd6: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
kernel: [ 9828.199039] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6
crm-unfence-peer.sh[24132]: invoked for drbd-serv-mail
crm-unfence-peer.sh[24132]: WARNING drbd-fencing could not determine the master id of drbd resource drbd-serv-mail
kernel: [ 9828.238394] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6 exit code 1 (0x100)
kernel: [ 9828.298906] block drbd6: bitmap WRITE of 83 pages took 15 jiffies
kernel: [ 9828.503024] block drbd6: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: [ 9831.788745] block drbd6: magic?? on data m: 0xa0816800 c: 5120 l: 0
kernel: [ 9831.788790] block drbd6: peer( Primary -> Unknown ) conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )
kernel: [ 9831.789573] block drbd6: asender terminated
kernel: [ 9831.789576] block drbd6: Terminating drbd6_asender
kernel: [ 9832.041526] block drbd6: Connection closed
kernel: [ 9832.041531] block drbd6: conn( ProtocolError -> Unconnected )
kernel: [ 9832.041535] block drbd6: receiver terminated
kernel: [ 9832.041537] block drbd6: Restarting drbd6_receiver
kernel: [ 9832.041539] block drbd6: receiver (re)started
kernel: [ 9832.041542] block drbd6: conn( Unconnected -> WFConnection )
kernel: [ 9832.457266] block drbd6: Handshake successful: Agreed network protocol version 96
kernel: [ 9832.457276] block drbd6: conn( WFConnection -> WFReportParams )
kernel: [ 9832.457357] block drbd6: Starting asender thread (from drbd6_receiver [29943])
kernel: [ 9832.457733] block drbd6: data-integrity-alg: <not-used>
kernel: [ 9832.457745] block drbd6: drbd_sync_handshake:
kernel: [ 9832.457748] block drbd6: self E8E3BDC352C4C580:0000000000000000:71C7A5DE96C51226:71C6A5DE96C51227 bits:0 flags:0
kernel: [ 9832.457751] block drbd6: peer E915DF859DCA76C9:E8E3BDC352C4C581:71C7A5DE96C51227:71C6A5DE96C51227 bits:12 flags:0
kernel: [ 9832.457754] block drbd6: uuid_compare()=-1 by rule 50
kernel: [ 9832.457758] block drbd6: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) disk( UpToDate -> Outdated ) pdsk( DUnknown -> UpToDate )
kernel: [ 9832.883300] block drbd6: conn( WFBitMapT -> WFSyncUUID )
kernel: [ 9832.987097] block drbd6: updated sync uuid E8E4BDC352C4C580:0000000000000000:71C7A5DE96C51226:71C6A5DE96C51227
kernel: [ 9833.141291] block drbd6: helper command: /sbin/drbdadm before-resync-target minor-6
kernel: [ 9833.158129] block drbd6: helper command: /sbin/drbdadm before-resync-target minor-6 exit code 0 (0x0)
kernel: [ 9833.158135] block drbd6: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
kernel: [ 9833.158141] block drbd6: Began resync as SyncTarget (will sync 52 KB [13 bits set]).
kernel: [ 9833.415551] block drbd6: Resync done (total 1 sec; paused 0 sec; 52 K/sec)
kernel: [ 9833.415554] block drbd6: 23 % had equal checksums, eliminated: 12K; transferred 40K total 52K
kernel: [ 9833.415558] block drbd6: updated UUIDs E915DF859DCA76C8:0000000000000000:E8E4BDC352C4C580:E8E3BDC352C4C581
kernel: [ 9833.415563] block drbd6: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
kernel: [ 9833.575311] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6
crm-unfence-peer.sh[24433]: invoked for drbd-serv-mail
crm-unfence-peer.sh[24433]: WARNING drbd-fencing could not determine the master id of drbd resource drbd-serv-mail
kernel: [ 9833.615746] block drbd6: helper command: /sbin/drbdadm after-resync-target minor-6 exit code 1 (0x100)
kernel: [ 9833.661043] block drbd6: bitmap WRITE of 84 pages took 11 jiffies
kernel: [ 9833.772319] block drbd6: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
kernel: [ 9851.333540] block drbd6: magic?? on data m: 0x80816700 c: 19201 l: 0
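
To see how often and into which state each device leaves cs:Connected over a longer stretch of syslog, the conn( A -> B ) transitions can be pulled out mechanically. The following is only a minimal sketch: the function names and the `SAMPLE` list are hypothetical, the sample entries are copied from the snippet above, and the pattern assumes the kernel message layout shown there.

```python
import re
from collections import Counter

# Matches one kernel entry such as
#   block drbd6: peer( Primary -> Unknown ) conn( Connected -> ProtocolError )
# The (?!block drbd) guard stops a match from running into the next entry
# when several entries ended up on one physical line.
CONN_RE = re.compile(
    r"block (drbd\d+): (?:(?!block drbd).)*?conn\( (\w+) -> (\w+) \)"
)


def conn_transitions(lines):
    """Yield (device, old_state, new_state) for every conn( A -> B ) found."""
    for line in lines:
        for match in CONN_RE.finditer(line):
            yield match.groups()


def drops_from_connected(lines):
    """Count departures from Connected, keyed by (device, new_state)."""
    return Counter(
        (dev, new)
        for dev, old, new in conn_transitions(lines)
        if old == "Connected"
    )


# Entries copied from the syslog snippet above.
SAMPLE = [
    "kernel: [ 9831.788790] block drbd6: peer( Primary -> Unknown ) "
    "conn( Connected -> ProtocolError ) pdsk( UpToDate -> DUnknown )",
    "kernel: [ 9832.041531] block drbd6: conn( ProtocolError -> Unconnected )",
    "kernel: [ 9832.041542] block drbd6: conn( Unconnected -> WFConnection )",
]

if __name__ == "__main__":
    for (dev, new), count in drops_from_connected(SAMPLE).items():
        print(f"{dev}: Connected -> {new} x{count}")
```

Feeding it the lines of a full syslog file instead of `SAMPLE` would show whether the drops are mostly ProtocolError, BrokenPipe, or timeouts, and whether they cluster on particular devices.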

