Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2016-04-15 Thread Adi Kriegisch
Hi!

> I'm not able to reproduce the bug under current sid.
It is even fixed with the recent update in Jessie! :-) YEAH! Thanks for your support!
 
> As ctdb in jessie was in another repository than samba, I suspect an
> API incompatibility.
Actually, I am not quite sure that it really is an API incompatibility; from
what I found out, the issue would already have been fixed by an update to ctdb
2.5.6, which includes a lot of fixes in general.

For the record: when running ctdb under gdb, the issue did not occur
-- but ctdb was painfully slow. Next, I tried to read the messages on the
socket like this:
  | mv /var/run/ctdb/ctdbd.socket /var/run/ctdb/ctdbd.socket-orig
  | socat -t100 -x -v \
  |   UNIX-LISTEN:/var/run/ctdb/ctdbd.socket,mode=777,reuseaddr,fork \
  |   UNIX-CONNECT:/var/run/ctdb/ctdbd.socket-orig
  | mv /var/run/ctdb/ctdbd.socket-orig /var/run/ctdb/ctdbd.socket
That just slowed ctdb down a little, but everything worked like a charm. So
I suspect some kind of race condition is the root cause of the issue.
 
> I'm tempted to mark this as fixed under sid, but can you setup a sid
> box and test yourself with a similar config?
You may even mark this as fixed in Jessie with version 4.2.10+...

Thank you very much for your help!

-- Adi




Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2016-04-02 Thread Mathieu Parent
Hello Adi,

I'm not able to reproduce the bug under current sid.

As ctdb in jessie was in another repository than samba, I suspect an
API incompatibility.

I'm tempted to mark this as fixed under sid, but can you setup a sid
box and test yourself with a similar config?

Regards

Mathieu Parent



Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2015-11-04 Thread Adi Kriegisch
Hi!

Thanks for getting back to me! :)

> > I recently upgraded a samba cluster from Wheezy (with kernel, ctdb, samba
> > and glusterfs from backports) to Jessie. The cluster itself is way older
> > and has basically always worked. Since the upgrade to Jessie, 'smbstatus -b'
> > (almost always) just hangs the whole cluster; I need to interrupt the call
> > with ctrl+c (or run it with 'timeout 2') to avoid a complete cluster lockup
> > that leaves the other cluster nodes banned and ctdbd on the node I ran
> > smbstatus on spinning at 100% load, unable to recover.
> 
> How do you recover then? KILL-ing ctdbd?
Killing ctdbd on the loaded node is the easiest way; manually unbanning the
other nodes is still required afterwards. Combinations of enabling and
disabling nodes may fix the situation too.
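Roughly, assuming the standard ctdb tooling (the exact service invocation may
differ on your setup):
  | # on the node where ctdbd spins at 100%:
  | pkill -9 ctdbd && service ctdb start
  | # afterwards, on each banned node:
  | ctdb unban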

> > Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.
> 
> Have you tried which of --processes, --notify hangs? Does it hang
> with "-b --fast"?
Ah, I missed that: '--brief --fast' works just fine. So obviously the
validation step that '--fast' skips is what breaks...

> > 'strace'ing ctdbd leads to a massive amount of these messages:
> >   | write(58, "\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1184) = -1 EAGAIN (Resource temporarily unavailable)
> 
> fd 58 is probably the ctdb socket. Can you confirm?
Right.

> To have more useful info, can you install gdb, ctdb-dbg and samba-dbg
> and send the stack trace of ctdbd at the write?
Ok, I will report back the stack traces in a few days (I'm afraid I can
only do these during the weekend).

All the best,
Adi




Bug#801690: [Pkg-samba-maint] Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

2015-11-01 Thread Mathieu Parent
2015-10-13 15:44 GMT+02:00 Adi Kriegisch:
> Package: ctdb
> Version: 2.5.4+debian0-4
>
> Dear maintainers,

Hello Adi,

Sorry for my late reply.

> I recently upgraded a samba cluster from Wheezy (with kernel, ctdb, samba
> and glusterfs from backports) to Jessie. The cluster itself is way older
> and has basically always worked. Since the upgrade to Jessie, 'smbstatus -b'
> (almost always) just hangs the whole cluster; I need to interrupt the call
> with ctrl+c (or run it with 'timeout 2') to avoid a complete cluster lockup
> that leaves the other cluster nodes banned and ctdbd on the node I ran
> smbstatus on spinning at 100% load, unable to recover.

How do you recover then? KILL-ing ctdbd?

> The cluster itself consists of three nodes sharing three cluster ips. The
> only service ctdb manages is Samba. The lock file is located on a mirrored
> glusterfs volume.
>
> running and interrupting the hanging smbstatus leads to the following log
> messages in /var/log/ctdb/log.ctdb:
>   | 2015/10/13 15:09:24.923002 [19378]: Starting traverse on DB
>   |  smbXsrv_session_global.tdb (id 2592646)
>   | 2015/10/13 15:09:25.505302 [19378]: server/ctdb_traverse.c:644 Traverse
>   |  cancelled by client disconnect for database:0x6b06a26d
>   | 2015/10/13 15:09:25.505492 [19378]: Could not find idr:2592646
>   | [...]
>   | 2015/10/13 15:09:25.507553 [19378]: Could not find idr:2592646
>
> 'ctdb getdbmap' lists that database, but also lists a second entry for
> smbXsrv_session_global.tdb:
>   | dbid:0x521b7544 name:smbXsrv_version_global.tdb path:/var/lib/ctdb/smbXsrv_version_global.tdb.0
>   | dbid:0x6b06a26d name:smbXsrv_session_global.tdb path:/var/lib/ctdb/smbXsrv_session_global.tdb.0
> (I have no idea if that has always been the case or if that happened after
> the upgrade).
>
> Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.

Have you tried which of --processes, --notify hangs? Does it hang
with "-b --fast"?


> 'strace'ing ctdbd leads to a massive amount of these messages:
>   | write(58, "\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1184) = -1 EAGAIN (Resource temporarily unavailable)

fd 58 is probably the ctdb socket. Can you confirm?

To have more useful info, can you install gdb, ctdb-dbg and samba-dbg
and send the stack trace of ctdbd at the write?
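Something along these lines should do (assuming the -dbg packages are
available and gdb is allowed to attach to the running daemon):
  | apt-get install gdb ctdb-dbg samba-dbg
  | gdb -p "$(pidof ctdbd)" -batch -ex 'thread apply all bt full'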

> Running 'ctdb_diagnostics' is only possible shortly after the cluster is
> started (i.e. while 'smbstatus -b' still works) and yields the following messages:
>   | ERROR[1]: /etc/krb5.conf is missing on node 0
>   | ERROR[2]: File /etc/hosts is different on node 1
>   | ERROR[3]: File /etc/hosts is different on node 2
>   | ERROR[4]: File /etc/samba/smb.conf is different on node 1
>   | ERROR[5]: File /etc/samba/smb.conf is different on node 2
>   | ERROR[6]: File /etc/fstab is different on node 1
>   | ERROR[7]: File /etc/fstab is different on node 2
>   | ERROR[8]: /etc/multipath.conf is missing on node 0
>   | ERROR[9]: /etc/pam.d/system-auth is missing on node 0
>   | ERROR[10]: /etc/default/nfs is missing on node 0
>   | ERROR[11]: /etc/exports is missing on node 0
>   | ERROR[12]: /etc/vsftpd/vsftpd.conf is missing on node 0
>   | ERROR[13]: Optional file /etc/ctdb/static-routes is not present on node 0
> '/etc/hosts' differs in some newlines and comments, while 'smb.conf' only
> has some different log levels on the nodes. The rest of the messages do
> not affect ctdb, as it only manages Samba.

Yes. Nothing relevant here.

> Feel free to ask if you need any more information.

Regards


-- 
Mathieu