Reposting the below, as I guess early January wasn't the best time to get any responses. I'd really appreciate any assistance: I'd prefer to avoid rebuilding the VM from scratch (wasted hours rather than lost data), but I'd also like to know how to resolve or avoid this issue in the future, once I have real data being stored.

Thanks,
Adam


I have a small test setup with 2 x diskless linstor-satellite nodes and 4 x diskful linstor-satellite nodes, one of which is also the linstor-controller.


The idea is that the diskless nodes are the compute nodes (Xen, running the VMs whose data is on LINSTOR resources).

I have 2 x test VMs: one was (and still is) working OK (an older Debian Linux VM, crossbowold); the other (a Windows 10 VM, jspiteriVM1) failed while I was attempting to install the Xen PV drivers (not sure whether that is relevant). The other two resources (ns2 and windows-wm) are unused.

There is nothing relevant in the LINSTOR error logs, but the linstor-controller node has this in its kern.log:

Dec 30 10:50:44 castle kernel: [4103630.414725] drbd windows-wm san6.mytest.com.au: sock was shut down by peer
Dec 30 10:50:44 castle kernel: [4103630.414752] drbd windows-wm san6.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
Dec 30 10:50:44 castle kernel: [4103630.414759] drbd windows-wm/0 drbd1001 san6.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Dec 30 10:50:44 castle kernel: [4103630.414807] drbd windows-wm san6.mytest.com.au: ack_receiver terminated
Dec 30 10:50:44 castle kernel: [4103630.414810] drbd windows-wm san6.mytest.com.au: Terminating ack_recv thread
Dec 30 10:50:44 castle kernel: [4103630.445961] drbd windows-wm san6.mytest.com.au: Restarting sender thread
Dec 30 10:50:44 castle kernel: [4103630.479708] drbd windows-wm san6.mytest.com.au: Connection closed
Dec 30 10:50:44 castle kernel: [4103630.479739] drbd windows-wm san6.mytest.com.au: helper command: /sbin/drbdadm disconnected
Dec 30 10:50:44 castle kernel: [4103630.486479] drbd windows-wm san6.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Dec 30 10:50:44 castle kernel: [4103630.486533] drbd windows-wm san6.mytest.com.au: conn( BrokenPipe -> Unconnected )
Dec 30 10:50:44 castle kernel: [4103630.486556] drbd windows-wm san6.mytest.com.au: Restarting receiver thread
Dec 30 10:50:44 castle kernel: [4103630.486566] drbd windows-wm san6.mytest.com.au: conn( Unconnected -> Connecting )
Dec 30 10:50:44 castle kernel: [4103631.006727] drbd windows-wm san6.mytest.com.au: Handshake to peer 2 successful: Agreed network protocol version 117
Dec 30 10:50:44 castle kernel: [4103631.006735] drbd windows-wm san6.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Dec 30 10:50:44 castle kernel: [4103631.006918] drbd windows-wm san6.mytest.com.au: Peer authenticated using 20 bytes HMAC
Dec 30 10:50:44 castle kernel: [4103631.006943] drbd windows-wm san6.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1164])
Dec 30 10:50:44 castle kernel: [4103631.041925] drbd windows-wm/0 drbd1001 san6.mytest.com.au: drbd_sync_handshake:
Dec 30 10:50:44 castle kernel: [4103631.041932] drbd windows-wm/0 drbd1001 san6.mytest.com.au: self CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041937] drbd windows-wm/0 drbd1001 san6.mytest.com.au: peer CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041941] drbd windows-wm/0 drbd1001 san6.mytest.com.au: uuid_compare()=no-sync by rule 38
Dec 30 10:50:44 castle kernel: [4103631.229931] drbd windows-wm: Preparing cluster-wide state change 1880606796 (0->2 499/146)
Dec 30 10:50:44 castle kernel: [4103631.230424] drbd windows-wm: State change 1880606796: primary_nodes=0, weak_nodes=0
Dec 30 10:50:44 castle kernel: [4103631.230429] drbd windows-wm: Committing cluster-wide state change 1880606796 (0ms)
Dec 30 10:50:44 castle kernel: [4103631.230480] drbd windows-wm san6.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Dec 30 10:50:44 castle kernel: [4103631.230486] drbd windows-wm/0 drbd1001 san6.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
Dec 30 10:58:27 castle kernel: [4104093.577650] drbd jspiteriVM1 xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 10:58:27 castle kernel: [4104093.790062] drbd jspiteriVM1/0 drbd1011: bitmap WRITE of 327 pages took 216 ms
Dec 30 10:58:39 castle kernel: [4104106.278699] drbd jspiteriVM1 xen1.mytest.com.au: Preparing remote state change 490644362
Dec 30 10:58:39 castle kernel: [4104106.278984] drbd jspiteriVM1 xen1.mytest.com.au: Committing remote state change 490644362 (primary_nodes=10)
Dec 30 10:58:39 castle kernel: [4104106.278999] drbd jspiteriVM1 xen1.mytest.com.au: peer( Secondary -> Primary )
Dec 30 10:58:40 castle kernel: [4104106.547178] drbd jspiteriVM1/0 drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Dec 30 10:58:40 castle kernel: [4104106.547191] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: repl( PausedSyncT -> SyncTarget ) resync-susp( peer -> no )
Dec 30 10:58:40 castle kernel: [4104106.547198] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Syncer continues.
Dec 30 11:04:29 castle kernel: [4104456.362585] drbd jspiteriVM1 xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 11:04:30 castle kernel: [4104456.388543] drbd jspiteriVM1/0 drbd1011: bitmap WRITE of 1 pages took 24 ms
Dec 30 11:04:30 castle kernel: [4104456.401108] drbd jspiteriVM1/0 drbd1011 san6.mytest.com.au: pdsk( UpToDate -> Outdated )
Dec 30 11:04:30 castle kernel: [4104456.788360] drbd jspiteriVM1/0 drbd1011 san6.mytest.com.au: pdsk( Outdated -> Inconsistent )
Dec 30 11:09:15 castle kernel: [4104742.275721] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:15 castle kernel: [4104742.377977] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Dec 30 11:09:16 castle kernel: [4104742.481920] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=3
Dec 30 11:09:16 castle kernel: [4104742.585933] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=4
Dec 30 11:09:16 castle kernel: [4104742.689909] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.793898] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104742.897895] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.001927] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.105909] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.209908] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:16 castle kernel: [4104743.313927] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.417897] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.521909] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.575764] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.625902] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.729908] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.833894] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104743.937890] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
Dec 30 11:09:17 castle kernel: [4104744.041907] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=5
[this line repeats until Jan 2 02:33, probably when I rebooted the node]

Jan  2 02:33:46 castle kernel: [4333012.494110] drbd jspiteriVM1 san5.mytest.com.au: Restarting sender thread
Jan  2 02:33:46 castle kernel: [4333012.528437] drbd jspiteriVM1 san5.mytest.com.au: Connection closed
Jan  2 02:33:46 castle kernel: [4333012.528447] drbd jspiteriVM1 san5.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 02:33:46 castle kernel: [4333012.530942] drbd jspiteriVM1 san5.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 02:33:46 castle kernel: [4333012.530960] drbd jspiteriVM1 san5.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 02:33:46 castle kernel: [4333012.530970] drbd jspiteriVM1 san5.mytest.com.au: Restarting receiver thread
Jan  2 02:33:46 castle kernel: [4333012.530974] drbd jspiteriVM1 san5.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 02:33:46 castle kernel: [4333013.054060] drbd jspiteriVM1 san5.mytest.com.au: Handshake to peer 1 successful: Agreed network protocol version 117
Jan  2 02:33:46 castle kernel: [4333013.054067] drbd jspiteriVM1 san5.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 02:33:46 castle kernel: [4333013.054426] drbd jspiteriVM1 san5.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 02:33:46 castle kernel: [4333013.054452] drbd jspiteriVM1 san5.mytest.com.au: Starting ack_recv thread (from drbd_r_jspiteri [1046])
Jan  2 02:33:46 castle kernel: [4333013.085933] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: drbd_sync_handshake:
Jan  2 02:33:46 castle kernel: [4333013.085941] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: self 122E90789B3D90E2:122E90789B3D90E3:4D2D1C8F63C38B44:B1B847713A96996E bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085946] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: peer 2B520E804A7D4EAC:0000000000000000:4D2D1C8F63C38B44:B1B847713A96996E bits:21168661 flags:124
Jan  2 02:33:46 castle kernel: [4333013.085952] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: uuid_compare()=target-set-bitmap by rule 60
Jan  2 02:33:46 castle kernel: [4333013.085956] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Setting and writing one bitmap slot, after drbd_sync_handshake
Jan  2 02:33:46 castle kernel: [4333013.226948] drbd jspiteriVM1/0 drbd1011: bitmap WRITE of 1078 pages took 88 ms
Jan  2 02:33:46 castle kernel: [4333013.278401] drbd jspiteriVM1: Preparing cluster-wide state change 3482568163 (0->1 499/146)
Jan  2 02:33:46 castle kernel: [4333013.278980] drbd jspiteriVM1: State change 3482568163: primary_nodes=0, weak_nodes=0
Jan  2 02:33:46 castle kernel: [4333013.278985] drbd jspiteriVM1: Committing cluster-wide state change 3482568163 (0ms)
Jan  2 02:33:46 castle kernel: [4333013.279050] drbd jspiteriVM1 san5.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan  2 02:33:46 castle kernel: [4333013.279055] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: repl( Off -> WFBitMapT )
Jan  2 02:33:46 castle kernel: [4333013.326494] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337300] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
Jan  2 02:33:46 castle kernel: [4333013.337313] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm before-resync-target
Jan  2 02:33:46 castle kernel: [4333013.339475] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm before-resync-target exit code 0
Jan  2 02:33:46 castle kernel: [4333013.339503] drbd jspiteriVM1/0 drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339504] drbd jspiteriVM1/0 drbd1011 san7.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339505] drbd jspiteriVM1/0 drbd1011 san6.mytest.com.au: resync-susp( no -> connection dependency )
Jan  2 02:33:46 castle kernel: [4333013.339507] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: repl( WFBitMapT -> SyncTarget )
Jan  2 02:33:46 castle kernel: [4333013.339552] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Began resync as SyncTarget (will sync 104859732 KB [26214933 bits set]).
Jan  2 02:50:55 castle kernel: [4334042.151194] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Retrying drbd_rs_del_all() later. refcnt=2
Jan  2 02:50:55 castle kernel: [4334042.254225] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: Resync done (total 1028 sec; paused 0 sec; 102000 K/sec)
Jan  2 02:50:55 castle kernel: [4334042.254230] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: expected n_oos:23691797 to be equal to rs_failed:23727152
Jan  2 02:50:55 castle kernel: [4334042.254232] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au:             23727152 failed blocks
Jan  2 02:50:55 castle kernel: [4334042.254245] drbd jspiteriVM1/0 drbd1011 xen1.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254247] drbd jspiteriVM1/0 drbd1011 san7.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254249] drbd jspiteriVM1/0 drbd1011 san6.mytest.com.au: resync-susp( connection dependency -> no )
Jan  2 02:50:55 castle kernel: [4334042.254252] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: pdsk( Outdated -> UpToDate ) repl( SyncTarget -> Established )
Jan  2 02:50:55 castle kernel: [4334042.281495] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm after-resync-target
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: helper command: /sbin/drbdadm after-resync-target exit code 0
Jan  2 02:50:55 castle kernel: [4334042.289879] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: pdsk( UpToDate -> Inconsistent )
Jan  2 10:23:28 castle kernel: [4361194.855074] drbd windows-wm san7.mytest.com.au: sock was shut down by peer
Jan  2 10:23:28 castle kernel: [4361194.855101] drbd windows-wm san7.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
Jan  2 10:23:28 castle kernel: [4361194.855109] drbd windows-wm/0 drbd1001 san7.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Jan  2 10:23:28 castle kernel: [4361194.855161] drbd windows-wm san7.mytest.com.au: ack_receiver terminated
Jan  2 10:23:28 castle kernel: [4361194.855164] drbd windows-wm san7.mytest.com.au: Terminating ack_recv thread
Jan  2 10:23:28 castle kernel: [4361194.882138] drbd windows-wm san7.mytest.com.au: Restarting sender thread
Jan  2 10:23:28 castle kernel: [4361194.961402] drbd windows-wm san7.mytest.com.au: Connection closed
Jan  2 10:23:28 castle kernel: [4361194.961435] drbd windows-wm san7.mytest.com.au: helper command: /sbin/drbdadm disconnected
Jan  2 10:23:28 castle kernel: [4361194.968763] drbd windows-wm san7.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Jan  2 10:23:28 castle kernel: [4361194.968800] drbd windows-wm san7.mytest.com.au: conn( BrokenPipe -> Unconnected )
Jan  2 10:23:28 castle kernel: [4361194.968812] drbd windows-wm san7.mytest.com.au: Restarting receiver thread
Jan  2 10:23:28 castle kernel: [4361194.968816] drbd windows-wm san7.mytest.com.au: conn( Unconnected -> Connecting )
Jan  2 10:23:29 castle kernel: [4361195.486059] drbd windows-wm san7.mytest.com.au: Handshake to peer 3 successful: Agreed network protocol version 117
Jan  2 10:23:29 castle kernel: [4361195.486066] drbd windows-wm san7.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Jan  2 10:23:29 castle kernel: [4361195.486490] drbd windows-wm san7.mytest.com.au: Peer authenticated using 20 bytes HMAC
Jan  2 10:23:29 castle kernel: [4361195.486515] drbd windows-wm san7.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1165])
Jan  2 10:23:29 castle kernel: [4361195.517928] drbd windows-wm/0 drbd1001 san7.mytest.com.au: drbd_sync_handshake:
Jan  2 10:23:29 castle kernel: [4361195.517935] drbd windows-wm/0 drbd1001 san7.mytest.com.au: self CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517940] drbd windows-wm/0 drbd1001 san7.mytest.com.au: peer CC647323743B5AE0:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:120
Jan  2 10:23:29 castle kernel: [4361195.517944] drbd windows-wm/0 drbd1001 san7.mytest.com.au: uuid_compare()=no-sync by rule 38
Jan  2 10:23:29 castle kernel: [4361195.677932] drbd windows-wm: Preparing cluster-wide state change 3667329610 (0->3 499/146)
Jan  2 10:23:29 castle kernel: [4361195.678459] drbd windows-wm: State change 3667329610: primary_nodes=0, weak_nodes=0
Jan  2 10:23:29 castle kernel: [4361195.678466] drbd windows-wm: Committing cluster-wide state change 3667329610 (0ms)
Jan  2 10:23:29 castle kernel: [4361195.678516] drbd windows-wm san7.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Jan  2 10:23:29 castle kernel: [4361195.678522] drbd windows-wm/0 drbd1001 san7.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )

castle:/var/log# linstor resource list
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node   ┊ Port ┊ Usage  ┊ Conns                                                                              ┊             State ┊ CreatedOn           ┊
╞════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ crossbowold  ┊ castle ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-10-07 00:46:23 ┊
┊ crossbowold  ┊ flail  ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊          Diskless ┊ 2021-01-04 05:03:20 ┊
┊ crossbowold  ┊ san5   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-10-07 00:46:23 ┊
┊ crossbowold  ┊ san6   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-10-07 00:46:22 ┊
┊ crossbowold  ┊ san7   ┊ 7010 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-10-07 00:46:21 ┊
┊ crossbowold  ┊ xen1   ┊ 7010 ┊ InUse  ┊ Ok                                                                                 ┊          Diskless ┊ 2020-10-15 00:30:31 ┊
┊ jspiteriVM1  ┊ castle ┊ 7011 ┊ Unused ┊ StandAlone(san6.mytest.com.au,san7.mytest.com.au)                                  ┊ SyncTarget(0.00%) ┊ 2020-10-14 22:15:00 ┊
┊ jspiteriVM1  ┊ san5   ┊ 7011 ┊ Unused ┊ Connecting(san7.mytest.com.au)                                                     ┊      Inconsistent ┊ 2020-10-14 22:14:59 ┊
┊ jspiteriVM1  ┊ san6   ┊ 7011 ┊ Unused ┊ Connecting(castle.mytest.com.au,san7.mytest.com.au)                                ┊ SyncTarget(0.00%) ┊ 2020-10-14 22:14:58 ┊
┊ jspiteriVM1  ┊ san7   ┊ 7011 ┊ Unused ┊ Connecting(castle.mytest.com.au),StandAlone(san6.mytest.com.au,san5.mytest.com.au) ┊      Inconsistent ┊ 2020-10-14 22:14:58 ┊
┊ jspiteriVM1  ┊ xen1   ┊ 7011 ┊ Unused ┊ Ok                                                                                 ┊          Diskless ┊ 2020-11-20 20:39:20 ┊
┊ ns2          ┊ castle ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-10-28 23:22:13 ┊
┊ ns2          ┊ flail  ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊          Diskless ┊ 2021-01-04 05:03:42 ┊
┊ ns2          ┊ san5   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-10-28 23:22:12 ┊
┊ ns2          ┊ san6   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-10-28 23:22:11 ┊
┊ ns2          ┊ xen1   ┊ 7000 ┊ Unused ┊ Ok                                                                                 ┊          Diskless ┊ 2020-10-28 23:30:20 ┊
┊ windows-wm   ┊ castle ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-09-30 00:03:41 ┊
┊ windows-wm   ┊ flail  ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊          Diskless ┊ 2021-01-04 05:03:48 ┊
┊ windows-wm   ┊ san5   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-09-30 00:03:40 ┊
┊ windows-wm   ┊ san6   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-09-30 00:03:39 ┊
┊ windows-wm   ┊ san7   ┊ 7001 ┊ Unused ┊ Ok                                                                                 ┊          UpToDate ┊ 2020-09-30 00:13:05 ┊
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Could anyone determine from this, or advise what additional logs I should examine, why this failed? I don't see anything obvious as to what caused LINSTOR/DRBD to fail here; all nodes were online and uninterrupted as far as I can tell. All physical storage is backed by MD RAID arrays, so there is some protection against disk failures (and I haven't noticed any, in any case).
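For reference, this is roughly what I have been running so far to gather state (resource name from my setup above); if there are better commands to capture what you'd need, please say so:

```shell
# LINSTOR's own error reports live on the controller node:
linstor error-reports list
# linstor error-reports show <report-id>   # substitute an id from the list above

# Low-level DRBD view of the failed resource, run on each diskful node:
drbdadm status jspiteriVM1
drbdsetup status jspiteriVM1 --verbose --statistics

# Kernel messages around the failure window (equivalent to grepping kern.log):
journalctl -k --since "2020-12-30 10:00" --until "2020-12-30 12:00" | grep -i drbd
```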

I've since upgraded to the latest versions of the DRBD/LINSTOR components on all nodes.

Finally, what could I do to recover the data? Has it been destroyed, or do I just need to pick a node and tell LINSTOR that that node has up-to-date data? Or can LINSTOR work that out somehow?
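From my reading of the DRBD 9 user's guide, I'd guess the manual recovery looks something like the sketch below (my choice of san5 as the "good" node is only an assumption, and I'm not sure how safe driving drbdadm by hand is on a LINSTOR-managed resource, so please correct me if this is the wrong approach):

```shell
# Sketch only -- assumes san5 holds the copy I want to keep.
# On each node whose copy is suspect, discard it and force a full
# resync from the remaining UpToDate peer(s):
drbdadm disconnect jspiteriVM1
drbdadm invalidate jspiteriVM1   # marks the local copy Inconsistent -> full sync target
drbdadm connect jspiteriVM1

# Or, from the node believed good, invalidate the peers instead:
# drbdadm invalidate-remote jspiteriVM1
```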

Regards,
Adam

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user
