Hi! Thanks for getting back to me! :)
> > I recently upgraded a samba cluster from Wheezy (with Kernel, ctdb, samba
> > and glusterfs from backports) to Jessie. The cluster itself is way older
> > and basically always worked. Since the upgrade to Jessie 'smbstatus -b'
> > (almost always) just hangs the whole cluster; I need to interrupt the call
> > with ctrl+c (or run with 'timeout 2') to avoid a complete cluster lockup
> > leading to the other cluster nodes being banned and the node I run smbstatus
> > on to have ctdbd run at 100% load but not being able to recover.
>
> How do you recover then? KILL-ing ctdbd?

Killing the loaded node is the easiest; manual unbanning of the other
nodes is still required. Combinations of enabling and disabling nodes
may fix the situation too.

> > Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.
>
> Have you tried which of --processes, --notify hangs? Does it hang
> with "-b --fast"?

Ah, I missed that: '--brief --fast' works just fine. So obviously the
validation does not work...

> > 'strace'ing ctdbd leads to a massive amount of these messages:
> > | write(58,"\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1184) = -1 EAGAIN (Resource temporarily unavailable)
>
> fd 58 is probably the ctdb socket. Can you confirm?

Right.

> To have more useful info, can you install gdb, ctdb-dbg and samba-dbg
> and send the stacktrace of ctdbd at the write?

Ok, I will report back with the stack traces in a few days (I'm afraid I
can only do this during the weekend).

All the best,
Adi