[DRBD-user] linstor failure

2021-02-16 Thread Adam Goryachev
Reposting the below as I guess early January wasn't the best time to get 
any responses. I'd really appreciate any assistance as I'd prefer to 
avoid rebuilding the VM from scratch (wasted hours, not lost data), but 
also I'd like to know how to resolve or avoid the issue in the future 
when I actually have real data being stored.


Thanks,
Adam



[DRBD-user] linstor failure

2021-01-03 Thread Adam Goryachev
I have a small test setup with 2 x diskless linstor-satellite nodes, and 
4 x diskful linstor-satellite nodes, one of which is the linstor-controller.


The idea is that the diskless nodes are the compute nodes (Xen, running the 
VMs whose data is on linstor resources).


I have 2 x test VMs: one was (and still is) working OK (an older Debian 
Linux VM, crossbowold); the other (a Windows 10 VM, jspiterivm1) failed 
while I was attempting to install the Xen PV drivers (not sure whether 
that is relevant). The other two resources (ns2 and windows-wm) are 
unused.


I have nothing relevant in the linstor error logs, but the linstor 
controller node has this in its kern.log:


Dec 30 10:50:44 castle kernel: [4103630.414725] drbd windows-wm san6.mytest.com.au: sock was shut down by peer
Dec 30 10:50:44 castle kernel: [4103630.414752] drbd windows-wm san6.mytest.com.au: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
Dec 30 10:50:44 castle kernel: [4103630.414759] drbd windows-wm/0 drbd1001 san6.mytest.com.au: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
Dec 30 10:50:44 castle kernel: [4103630.414807] drbd windows-wm san6.mytest.com.au: ack_receiver terminated
Dec 30 10:50:44 castle kernel: [4103630.414810] drbd windows-wm san6.mytest.com.au: Terminating ack_recv thread
Dec 30 10:50:44 castle kernel: [4103630.445961] drbd windows-wm san6.mytest.com.au: Restarting sender thread
Dec 30 10:50:44 castle kernel: [4103630.479708] drbd windows-wm san6.mytest.com.au: Connection closed
Dec 30 10:50:44 castle kernel: [4103630.479739] drbd windows-wm san6.mytest.com.au: helper command: /sbin/drbdadm disconnected
Dec 30 10:50:44 castle kernel: [4103630.486479] drbd windows-wm san6.mytest.com.au: helper command: /sbin/drbdadm disconnected exit code 0
Dec 30 10:50:44 castle kernel: [4103630.486533] drbd windows-wm san6.mytest.com.au: conn( BrokenPipe -> Unconnected )
Dec 30 10:50:44 castle kernel: [4103630.486556] drbd windows-wm san6.mytest.com.au: Restarting receiver thread
Dec 30 10:50:44 castle kernel: [4103630.486566] drbd windows-wm san6.mytest.com.au: conn( Unconnected -> Connecting )
Dec 30 10:50:44 castle kernel: [4103631.006727] drbd windows-wm san6.mytest.com.au: Handshake to peer 2 successful: Agreed network protocol version 117
Dec 30 10:50:44 castle kernel: [4103631.006735] drbd windows-wm san6.mytest.com.au: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
Dec 30 10:50:44 castle kernel: [4103631.006918] drbd windows-wm san6.mytest.com.au: Peer authenticated using 20 bytes HMAC
Dec 30 10:50:44 castle kernel: [4103631.006943] drbd windows-wm san6.mytest.com.au: Starting ack_recv thread (from drbd_r_windows- [1164])
Dec 30 10:50:44 castle kernel: [4103631.041925] drbd windows-wm/0 drbd1001 san6.mytest.com.au: drbd_sync_handshake:
Dec 30 10:50:44 castle kernel: [4103631.041932] drbd windows-wm/0 drbd1001 san6.mytest.com.au: self CC647323743B5AE0::: bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041937] drbd windows-wm/0 drbd1001 san6.mytest.com.au: peer CC647323743B5AE0::: bits:0 flags:120
Dec 30 10:50:44 castle kernel: [4103631.041941] drbd windows-wm/0 drbd1001 san6.mytest.com.au: uuid_compare()=no-sync by rule 38
Dec 30 10:50:44 castle kernel: [4103631.229931] drbd windows-wm: Preparing cluster-wide state change 1880606796 (0->2 499/146)
Dec 30 10:50:44 castle kernel: [4103631.230424] drbd windows-wm: State change 1880606796: primary_nodes=0, weak_nodes=0
Dec 30 10:50:44 castle kernel: [4103631.230429] drbd windows-wm: Committing cluster-wide state change 1880606796 (0ms)
Dec 30 10:50:44 castle kernel: [4103631.230480] drbd windows-wm san6.mytest.com.au: conn( Connecting -> Connected ) peer( Unknown -> Secondary )
Dec 30 10:50:44 castle kernel: [4103631.230486] drbd windows-wm/0 drbd1001 san6.mytest.com.au: pdsk( DUnknown -> UpToDate ) repl( Off -> Established )
Dec 30 10:58:27 castle kernel: [4104093.577650] drbd jspiteriVM1 xen1.mytest.com.au: peer( Primary -> Secondary )
Dec 30 10:58:27 castle kernel: [4104093.790062] drbd jspiteriVM1/0 drbd1011: bitmap WRITE of 327 pages took 216 ms
Dec 30 10:58:39 castle kernel: [4104106.278699] drbd jspiteriVM1 xen1.mytest.com.au: Preparing remote state change 490644362
Dec 30 10:58:39 castle kernel: [4104106.278984] drbd jspiteriVM1 xen1.mytest.com.au: Committing remote state change 490644362 (primary_nodes=10)
Dec 30 10:58:39 castle kernel: [4104106.278999] drbd jspiteriVM1 xen1.mytest.com.au: peer( Secondary -> Primary )
Dec 30 10:58:40 castle kernel: [4104106.547178] drbd jspiteriVM1/0 drbd1011 xen1.mytest.com.au: resync-susp( no -> connection dependency )
Dec 30 10:58:40 castle kernel: [4104106.547191] drbd jspiteriVM1/0 drbd1011 san5.mytest.com.au: repl( PausedSyncT -> SyncTarget ) resync-susp( peer -> no )
Dec 30 10:58:40 castle kernel: [4104