wow, my mails finally made it to the list ... forget it, it's redundant with today's thread.
Julien

Le 22/06/2018 à 14:39, Julien Escario a écrit :
> Hello,
> DRBD9 is really a great piece of software, but from time to time we end up
> stuck in a situation with no solution other than a reboot.
>
> For example, right now, when we run:
>
> # drbdadm status
>
> it displays some resources, then hangs on a specific resource and finally
> returns "Command 'drbdsetup status' did not terminate within 5 seconds".
>
> And the drbdsetup process is stuck. drbdmanage is completely out of order
> on both nodes (see below).
>
> Running drbdsetup status under strace runs up to the problematic resource
> and displays:
>
>> write(3, "4\0\0\0\34\0\1\3\227\251,[\330f\0\0\37\2\0\0\377\377\377\377\0\0\0\0\30\0\2\0"..., 52) = 52
>> poll([{fd=1, events=POLLHUP}, {fd=3, events=POLLIN}], 2, 120000) = 1 ([{fd=3, revents=POLLIN}])
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12,
>>   msg_iov=[{iov_base=[{{len=720, type=0x1c /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1529653655, pid=26328},
>>   "\37\2\0\0\244\0\0\0e\0\0\0 \0\2\0\10\0\1@\0\0\0\0\22\0\2@vm-1"...},
>>   {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}],
>>   msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 720
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12,
>>   msg_iov=[{iov_base=[{{len=720, type=0x1c /* NLMSG_??? */, flags=NLM_F_MULTI, seq=1529653655, pid=26328},
>>   "\37\2\0\0\244\0\0\0e\0\0\0 \0\2\0\10\0\1@\0\0\0\0\22\0\2@vm-1"...},
>>   {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}],
>>   msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 720
>> poll([{fd=1, events=POLLHUP}, {fd=3, events=POLLIN}], 2, 120000) = 1 ([{fd=3, revents=POLLIN}])
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12,
>>   msg_iov=[{iov_base=[{{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\0\0\0\0"},
>>   {{len=164, type=0x65 /* NLMSG_??? */, flags=0, seq=131104, pid=65544},
>>   "\0\0\0\0\22\0\2\0vm-145-disk-1\0\0\0\330\0\3\0(\0\1\0"...},
>>   {{len=6433, type=0x5 /* NLMSG_??? */, flags=NLM_F_DUMP_INTR, seq=0, pid=1114117},
>>   "\1\0\0\0\5\0\22\0\1\0\0\0\5\0\23\0\0\0\0\0\10\0\24\0\0\0\0\0\10\0\25\0"...},
>>   {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}],
>>   msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK) = 20
>> poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
>> recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12,
>>   msg_iov=[{iov_base=[{{len=20, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=1529653655, pid=26328}, "\0\0\0\0"},
>>   {{len=164, type=0x65 /* NLMSG_??? */, flags=0, seq=131104, pid=65544},
>>   "\0\0\0\0\22\0\2\0vm-145-disk-1\0\0\0\330\0\3\0(\0\1\0"...},
>>   {{len=6433, type=0x5 /* NLMSG_??? */, flags=NLM_F_DUMP_INTR, seq=0, pid=1114117},
>>   "\1\0\0\0\5\0\22\0\1\0\0\0\5\0\23\0\0\0\0\0\10\0\24\0\0\0\0\0\10\0\25\0"...},
>>   {{len=0, type=0 /* NLMSG_??? */, flags=0, seq=0, pid=0}}], iov_len=8192}],
>>   msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20
>> write(3, "4\0\0\0\34\0\1\3\230\251,[\330f\0\0 \2\0\0\377\377\377\377\0\0\0\0\30\0\2\0"..., 52
>
> drbdadm runs fine on node 2.
>
> I don't exactly see how to interpret this.
>
> Finally, I can see that node 1 is keeping the drbdctrl resource as
> primary: something must have gone wrong on this node.
>
> drbdtop actually runs correctly and shows, for the problematic resource:
>
> volume 0 (/dev/drbd164): UpToDate (normal disk state) Blocked: upper
>
> and:
>
> Connection to node2 (Unknown): NetworkFailure (lost connection to node2)
>
> How can I debug such a situation without rebooting node1?
>
> This is not the first time we have run into this situation, and rebooting
> each time is really a pain; we're talking about highly available clusters.
>
> Any other info I can provide?
>
> Thanks a lot!
>
> Best regards,
> Julien Escario
>
> P.S.: drbdmanage output
>
> On node 1 (the actual drbdctrl primary):
>
> # drbdmanage r
> ERROR:dbus.proxies:Introspect error on :1.53:/interface:
> dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not
> receive a reply. Possible causes include: the remote application did not
> send a reply, the message bus security policy blocked the reply, the reply
> timeout expired, or the network connection was broken.
>
> Error: Cannot connect to the drbdmanaged process using DBus
> The DBus subsystem returned the following error description:
> org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible
> causes include: the remote application did not send a reply, the message
> bus security policy blocked the reply, the reply timeout expired, or the
> network connection was broken.
>
> On node 2 (shows drbdctrl as secondary):
>
> # drbdmanage r
> Waiting for server: ...............
> Error: Satellite could not request control volume from leader
> No resources defined

_______________________________________________
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user
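For anyone digging into the strace quoted above later: the write()/recvmsg()
exchange is a generic netlink dump request, and the first 16 bytes of the
write() buffer are a plain little-endian nlmsghdr. A minimal decode sketch
(my own reading, not from the original post; the flag constants are the
standard values from <linux/netlink.h>):

```python
import struct

# First 16 bytes of the write() buffer shown in the strace, i.e. the
# netlink message header (struct nlmsghdr), transcribed from the escapes.
hdr = b"4\x00\x00\x00\x1c\x00\x01\x03\x97\xa9\x2c\x5b\xd8\x66\x00\x00"

# nlmsghdr layout: u32 len, u16 type, u16 flags, u32 seq, u32 pid
length, msg_type, flags, seq, pid = struct.unpack("<IHHII", hdr)

NLM_F_REQUEST = 0x0001
NLM_F_DUMP = 0x0100 | 0x0200  # NLM_F_ROOT | NLM_F_MATCH

print(length)    # 52 -- matches the write() return value
print(msg_type)  # 28 (0x1c) -- the same type the kernel echoes back
print(flags == NLM_F_REQUEST | NLM_F_DUMP)  # True: a dump request
print(seq)       # 1529653655 -- matches seq= in the NLM_F_MULTI replies
print(pid)       # 26328 -- matches pid= in the replies (drbdsetup's port)
```

So drbdsetup has sent a complete, well-formed dump request and is blocked
waiting on the kernel's reply stream; note also the NLM_F_DUMP_INTR flag in
the peeked data, which the kernel sets when a dump was interrupted.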