Hello, to complement the previous logs (from xosan1), I am attaching the kern.log from xosan3, which may help in understanding the issue:
[27911.178915] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan1: meta connection shut down by peer.
[27911.178932] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan1: conn( Connected -> NetworkFailure ) peer( Secondary -> Unknown )
[27911.178934] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan1: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[27911.178945] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan1: ack_receiver terminated

Matias

On Tue, Jun 07, 2022 at 04:31:49PM +0200, Matias Ezequiel Vara Larsen wrote:
> Hello,
>
> I am observing the following behavior during the cluster-wide state change
> (kern.log):
>
> [27050.375909] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 2579620917 (0->-1 3/1)
> [27080.486801] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 2579620917 (30112ms) rv = -23
> [27081.671615] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 3642399026 (0->-1 3/1)
> [27113.253606] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 3642399026 (31584ms) rv = -23
>
> Here rv = -23 indicates SS_TIMEOUT. This continues for a while.
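Since the "Aborting cluster-wide state change ... rv = ..." lines repeat for a long stretch, a quick way to see how often each rv occurs and how long each attempt blocked is to tally them. A minimal sketch (the rv-to-name mapping below only covers the two codes discussed in this thread; check the values against the drbd.h of the running module):

```python
import re

# Matches e.g.:
# [27080.486801] drbd <vol>: Aborting cluster-wide state change 2579620917 (30112ms) rv = -23
ABORT_RE = re.compile(
    r"Aborting cluster-wide state change (\d+) \((\d+)ms\) rv = (-?\d+)"
)

# Names as discussed in this thread; assumed, not exhaustive.
RV_NAMES = {-23: "SS_TIMEOUT", -19: "SS_CONCURRENT_ST_CHG"}

def summarize(lines):
    """Return {rv: (count, total_ms)} over all aborted state changes."""
    stats = {}
    for line in lines:
        m = ABORT_RE.search(line)
        if not m:
            continue
        ms, rv = int(m.group(2)), int(m.group(3))
        count, total = stats.get(rv, (0, 0))
        stats[rv] = (count + 1, total + ms)
    return stats

if __name__ == "__main__":
    with open("/var/log/kern.log") as f:
        for rv, (count, total_ms) in sorted(summarize(f).items()):
            name = RV_NAMES.get(rv, "?")
            print(f"rv = {rv} ({name}): {count} aborts, {total_ms} ms total")
```

This makes the roughly 20-minute window of retries visible at a glance (each timed-out attempt blocks for ~30 s before being aborted and retried).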
> Then, the error changes to rv = -19, which indicates SS_CONCURRENT_ST_CHG:
>
> [27914.141426] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 3371883425 (0->-1 3/1)
> [27926.280093] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: sock_recvmsg returned -113
> [27926.280099] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( Connected -> BrokenPipe ) peer( Secondary -> Unknown )
> [27926.280101] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
> [27926.280167] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: ack_receiver terminated
> [27926.280168] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Terminating ack_recv thread
> [27926.280189] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 3371883425 (12140ms) rv = -19
> [27926.280252] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: State change failed: In transient state, retry after next state change
>
> The state change then continues failing, but with -23:
>
> [27926.328265] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 3899801398 (0->-1 3/1)
> [27926.329319] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Connection closed
> [27926.329326] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: helper command: /sbin/drbdadm disconnected
> [27926.332926] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: helper command: /sbin/drbdadm disconnected exit code 0
> [27926.332955] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( BrokenPipe -> Unconnected )
> [27926.332980] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: Restarting receiver thread
> [27956.998808] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Aborting cluster-wide state change 3899801398 (30672ms) rv = -23
> [27956.998973] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e xosan3: conn( Unconnected -> Connecting )
>
> Finally, the state changes:
>
> [27957.686777] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Preparing cluster-wide state change 2682176118 (0->2 499/146)
> [27957.710748] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002: discard_zeroes_data=0 and discard_zeroes_if_aligned=no: disabling discards
> [27957.710774] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: drbd_sync_handshake:
> [27957.710775] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: self 259E390F9A174E68:0000000000000000:78D7B860591E8E4E:0000000000000000 bits:0 flags:120
> [27957.710777] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: peer 259E390F9A174E68:0000000000000000:0000000000000000:0000000000000000 bits:0 flags:120
> [27957.710778] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e/0 drbd1002 xosan3: uuid_compare()=no-sync by rule=reconnected
> [27957.711207] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: State change 2682176118: primary_nodes=0, weak_nodes=0
> [27957.711208] drbd xcp-volume-ce0edf61-e363-4167-b904-da57d7615f3e: Committing cluster-wide state change 2682176118 (24ms)
>
> The state change finally succeeds after nearly 20 minutes. Do you know what is
> going on, or how I could debug this issue? I can provide more information if
> required.
>
> Thanks, Matias.

_______________________________________________
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
https://lists.linbit.com/mailman/listinfo/drbd-user