Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster
Hi!

> I'm not able to reproduce the bug under current sid.

Even fixed with the recent update in Jessie! :-) YEAH! Thanks for your
support!

> As ctdb in jessie was in another repository than samba, I suspect an
> API incompatibility.

Actually I am not quite sure if that really is an API incompatibility; from
what I found out, the issue would have been fixed with an update to ctdb
2.5.6, which includes a lot of fixes in general.

For the record: when trying to run ctdb under gdb, the issue did not occur --
but ctdb was painfully slow. Next I tried to read the messages on the
socket, like this:
| mv /var/run/ctdb/ctdbd.socket /var/run/ctdb/ctdbd.socket-orig
| socat -t100 -x -v \
|   UNIX-LISTEN:/var/run/ctdb/ctdbd.socket,mode=777,reuseaddr,fork \
|   UNIX-CONNECT:/var/run/ctdb/ctdbd.socket-orig
| mv /var/run/ctdb/ctdbd.socket-orig /var/run/ctdb/ctdbd.socket
That just slowed ctdb down a little, but everything worked like a charm. So
I suspect some kind of race condition to be the root cause of the issue.

> I'm tempted to mark this as fixed under sid, but can you setup a sid
> box and test yourself with a similar config?

You may even mark this as fixed in jessie with version 4.2.10+...

Thank you very much for your help!

-- Adi
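As an aside on what the EAGAIN errors in the strace output further down this
thread mean: a non-blocking write(2) returns EAGAIN once the peer stops
draining the socket buffer, which fits the race condition suspected above. A
minimal, self-contained sketch (our own illustration, not taken from the bug
report; the EAGAIN in the strace implies ctdbd's fd was non-blocking):

```shell
# Sketch only: a non-blocking write to a Unix socket whose peer never
# reads eventually fills the kernel buffer and fails with EAGAIN --
# the same errno ctdbd's write(2) kept returning in the strace output.
python3 - <<'EOF'
import socket, errno
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.setblocking(False)          # non-blocking, as the EAGAIN implies
try:
    while True:
        a.send(b'x' * 65536)  # peer (b) never drains, so this must fail
except BlockingIOError as e:
    assert e.errno == errno.EAGAIN
    print("EAGAIN")
EOF
```

This also suggests why interposing socat (or running under gdb) made the
problem disappear: anything that slows the writer down gives the reader time
to drain the buffer.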
Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster
Hello Adi,

I'm not able to reproduce the bug under current sid. As ctdb in jessie was
in another repository than samba, I suspect an API incompatibility.

I'm tempted to mark this as fixed under sid, but can you set up a sid box
and test yourself with a similar config?

Regards

Mathieu Parent
Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster
Hi!

Thanks for getting back to me! :)

> > I recently upgraded a samba cluster from Wheezy (with Kernel, ctdb, samba
> > and glusterfs from backports) to Jessie. The cluster itself is way older
> > and basically always worked. Since the upgrade to Jessie, 'smbstatus -b'
> > (almost always) just hangs the whole cluster; I need to interrupt the call
> > with ctrl+c (or run with 'timeout 2') to avoid a complete cluster lockup
> > leading to the other cluster nodes being banned and the node I run
> > smbstatus on to have ctdbd run at 100% load without being able to recover.
>
> How do you recover then? KILL-ing ctdbd?

Killing the loaded node is the easiest; manual unbanning of the other nodes
is still required. Combinations of enabling and disabling nodes may fix the
situation too.

> > Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.
>
> Have you tried which of --processes, --notify hangs? Does it hang
> with "-b --fast"?

Ah, I missed that: '--brief --fast' works just fine. So obviously the
validation does not work...

> > 'strace'ing ctdbd leads to a massive amount of these messages:
> > | write(58,"\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> > |   1184) = -1 EAGAIN (Resource temporarily unavailable)
>
> fd 58 is probably the ctdb socket. Can you confirm?

Right.

> To have more useful info, can you install gdb, ctdb-dbg and samba-dbg
> and send the stacktrace of ctdbd at the write?

Ok, I will report back the stack traces in a few days (I'm afraid I can
only do these during the weekend).

All the best,
Adi
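One possible way to capture the backtrace requested above is a gdb command
file like the following. This is a sketch only: the fd number 58 comes from
the strace output in the report, and `$rdi` assumes the x86-64 syscall
calling convention; it would be run as `gdb -p $(pidof ctdbd) -x <file>`
with the suggested gdb, ctdb-dbg and samba-dbg packages installed.

```
# gdb command file (sketch): break on the next write(2) to fd 58 in the
# running ctdbd, dump a full backtrace, then let the daemon continue.
catch syscall write
condition 1 $rdi == 58
continue
bt full
detach
quit
```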
Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster
2015-10-13 15:44 GMT+02:00 Adi Kriegisch:
> Package: ctdb
> Version: 2.5.4+debian0-4
>
> Dear maintainers,

Hello Adi,

Sorry for my late reply.

> I recently upgraded a samba cluster from Wheezy (with Kernel, ctdb, samba
> and glusterfs from backports) to Jessie. The cluster itself is way older
> and basically always worked. Since the upgrade to Jessie, 'smbstatus -b'
> (almost always) just hangs the whole cluster; I need to interrupt the call
> with ctrl+c (or run with 'timeout 2') to avoid a complete cluster lockup
> leading to the other cluster nodes being banned and the node I run
> smbstatus on to have ctdbd run at 100% load without being able to recover.

How do you recover then? KILL-ing ctdbd?

> The cluster itself consists of three nodes sharing three cluster IPs. The
> only service ctdb manages is Samba. The lock file is located on a mirrored
> glusterfs volume.
>
> Running and interrupting the hanging smbstatus leads to the following log
> messages in /var/log/ctdb/log.ctdb:
> | 2015/10/13 15:09:24.923002 [19378]: Starting traverse on DB
> |   smbXsrv_session_global.tdb (id 2592646)
> | 2015/10/13 15:09:25.505302 [19378]: server/ctdb_traverse.c:644 Traverse
> |   cancelled by client disconnect for database:0x6b06a26d
> | 2015/10/13 15:09:25.505492 [19378]: Could not find idr:2592646
> | [...]
> | 2015/10/13 15:09:25.507553 [19378]: Could not find idr:2592646
>
> 'ctdb getdbmap' lists that database, but also lists a second entry for
> smbXsrv_session_global.tdb:
> | dbid:0x521b7544 name:smbXsrv_version_global.tdb
> |   path:/var/lib/ctdb/smbXsrv_version_global.tdb.0
> | dbid:0x6b06a26d name:smbXsrv_session_global.tdb
> |   path:/var/lib/ctdb/smbXsrv_session_global.tdb.0
> (I have no idea if that has always been the case or if that happened after
> the upgrade.)
>
> Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.

Have you tried which of --processes, --notify hangs? Does it hang
with "-b --fast"?
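The reporter's `timeout 2` workaround quoted above relies on timeout(1)
exiting with status 124 when it has to kill the command. A small sketch of
that guard (`smbstatus -b` is the command from the report; the message text
is our own):

```shell
# Sketch of the reporter's workaround: bound a potentially hanging
# 'smbstatus -b' call. timeout(1) exits with 124 when it had to kill
# the command, so a hang becomes detectable instead of locking the shell.
timeout 2 smbstatus -b
status=$?
if [ "$status" -eq 124 ]; then
    echo "smbstatus -b hung; killed after 2 seconds" >&2
fi
```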
> 'strace'ing ctdbd leads to a massive amount of these messages:
> | write(58,"\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> |   1184) = -1 EAGAIN (Resource temporarily unavailable)

fd 58 is probably the ctdb socket. Can you confirm?

To have more useful info, can you install gdb, ctdb-dbg and samba-dbg
and send the stacktrace of ctdbd at the write?

> Running 'ctdb_diagnostics' is only possible shortly after the cluster is
> started (i.e. while smbstatus -b works) and yields the following messages:
> | ERROR[1]: /etc/krb5.conf is missing on node 0
> | ERROR[2]: File /etc/hosts is different on node 1
> | ERROR[3]: File /etc/hosts is different on node 2
> | ERROR[4]: File /etc/samba/smb.conf is different on node 1
> | ERROR[5]: File /etc/samba/smb.conf is different on node 2
> | ERROR[6]: File /etc/fstab is different on node 1
> | ERROR[7]: File /etc/fstab is different on node 2
> | ERROR[8]: /etc/multipath.conf is missing on node 0
> | ERROR[9]: /etc/pam.d/system-auth is missing on node 0
> | ERROR[10]: /etc/default/nfs is missing on node 0
> | ERROR[11]: /etc/exports is missing on node 0
> | ERROR[12]: /etc/vsftpd/vsftpd.conf is missing on node 0
> | ERROR[13]: Optional file /etc/ctdb/static-routes is not present on node 0
> '/etc/hosts' differs in some newlines and comments, while 'smb.conf' only
> has some different log levels on the nodes. The rest of the messages does
> not affect ctdb, as it only manages Samba.

Yes. Nothing relevant here.

> Feel free to ask if you need any more information.

Regards
--
Mathieu