Working on this problem further...

On Tue, May 21, 2013 at 5:14 PM, David Vossel <dvos...@redhat.com> wrote:
> I'd suggest this. Try running the pacemaker_remote regression test and see
> what happens. This will start up an instance of pacemaker_remote locally
> and issue client commands to it to test both the TLS connection and the
> ability to start/stop/monitor services.
>
> /usr/share/pacemaker/tests/lrmd/regression.py -R

But sadly SL 6.4 doesn't have the systemctl commands this is trying to use.
(Also, I am building RPMs and installing those, and the lrmd regression tests
aren't included in pacemaker-cts. No problem, I ran directly from the build
directory.)

It doesn't seem to make much progress. The stdout is:

sh: systemctl: command not found
sh: /lib/systemd/system/lrmd_dummy_daemon.service: No such file or directory
sh: systemctl: command not found
Starting ...

And the lrmd-regression.log has:

Set r/w permissions for uid=496, gid=494 on /tmp/lrmd-regression.log
May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted: info: qb_ipcs_us_publish: server name: lrmd
May 23 15:14:39 [3610] swbuildsl6 pacemaker_remoted: notice: lrmd_init_remote_tls_server: Starting a tls listener on port 3121.
May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_ro
May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_rw
May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted: info: qb_ipcs_us_publish: server name: cib_shm
May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted: info: qb_ipcs_us_publish: server name: attrd
May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted: info: qb_ipcs_us_publish: server name: stonith-ng
May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted: info: qb_ipcs_us_publish: server name: crmd
May 23 15:14:40 [3610] swbuildsl6 pacemaker_remoted: info: main: Starting

> By default, the connection should retry for 60 seconds after the vm resource
> starts. Like you've noticed, this can be extended to account for vms that
> take longer to boot.

But maybe the retry window should start only after the monitor method for the
VM first indicates success? Or does it already?

>> There have been a few segfaults of crmd during my testing of this, so perhaps
>> there is a memory smash somewhere. (A couple times the failure was at
>> remote_lrmd_ra.c:186,
>
> Please provide gdb backtrace. We need to get this resolved asap before the
> release of v.1.1.10 is complete.
> I believe there is a new rc in the works already.

So I've attached results from a few core dumps. All were triggered using
"crm resource cleanup swbuildsl6", where swbuildsl6 is the host name of the VM
(that I can still telnet to on port 3121).

>> > I doubt this will make a difference, but here's the key I use during
>> > testing,
>> > lrmd:ce9db0bc3cec583d3b3bf38b0ac9ff91

It makes no difference. I had wondered if the shorter key would matter.

Also, I've attached some patches I made to 1.1.10rc3 to try to resolve this
problem. So far no success. Some of these add logging; the others fix what
looks to me like fishy code with cases that aren't completely handled.

With the additional logging, I see these results being logged:

May 23 17:06:51 swbuildsl6 pacemaker_remoted[2326]: notice: lrmd_remote_listen: LRMD client connection established. 0x995250 id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097
May 23 17:06:53 cvmh02 crmd[18982]: warning: lrmd_tcp_connect_cb: Client tls handshake failed for server swbuildsl6:3121. Disconnecting
May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]: error: lrmd_remote_client_msg: Remote lrmd tls handshake failed: -9
May 23 17:06:52 swbuildsl6 pacemaker_remoted[2326]: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: <unknown> id: df04d8ee-7fcb-4025-8c8f-8a1555a4d097

Puzzling -- nothing is being logged from crm_initiate_client_tls_handshake --
is there something I need to add to somehow activate the crm_err and crm_info
calls?

/rlt

Using 1.1.10rc3:

Bus error
(gdb) where
#0  0x000000000042541f in retry_start_cmd_cb (data=0x82f090) at remote_lrmd_ra.c:186
#1  0x0000003bba03961b in ?? () from /lib64/libglib-2.0.so.0
#2  0x0000003bba038f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#3  0x0000003bba03c938 in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003bba03cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#5  0x000000000040530e in crmd_init () at main.c:154
#6  0x000000000040560c in main (argc=1, argv=0x7fff7b552368) at main.c:120
(gdb) list
181         lrm_state_t *lrm_state = data;
182         remote_ra_data_t *ra_data = lrm_state->remote_ra_data;
183         remote_ra_cmd_t *cmd = NULL;
184         int rc = -1;
185
186         if (!ra_data || !ra_data->cur_cmd) {
187             return FALSE;
188         }
189         cmd = ra_data->cur_cmd;
190         if (safe_str_neq(cmd->action, "start")) {

Using 1.1.10rc2

Segmentation fault:
(gdb) where
#0  0x00007f03c6603eed in lrmd_tls_connection_destroy (userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007f03c66046c0 in lrmd_tcp_connect_cb (userdata=0x9541b0, sock=-1) at lrmd_client.c:1079
#2  0x00007f03c6a3764a in check_connect_finished (userdata=0x97d350) at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff26ac1ab8) at main.c:120
(gdb) list
501         lrmd_t *lrmd = userdata;
502         lrmd_private_t *native = lrmd->private;
503
504         crm_info("TLS connection destroyed");
505
506         if (native->remote->tls_session) {
507             gnutls_bye(*native->remote->tls_session, GNUTLS_SHUT_RDWR);
508             gnutls_deinit(*native->remote->tls_session);
509             gnutls_free(native->remote->tls_session);
510         }

Segmentation fault:
(gdb) where
#0  0x00007f0e0751ceed in lrmd_tls_connection_destroy (userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007f0e0751d6c0 in lrmd_tcp_connect_cb (userdata=0x21b6ae0, sock=-110) at lrmd_client.c:1079
#2  0x00007f0e0795064a in check_connect_finished (userdata=0x21639d0) at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff567875b8) at main.c:120

Segmentation fault:
(gdb) where
#0  0x00007fac43d87eed in lrmd_tls_connection_destroy (userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007fac43d886c0 in lrmd_tcp_connect_cb (userdata=0x1bd5f60, sock=-110) at lrmd_client.c:1079
#2  0x00007fac441bb64a in check_connect_finished (userdata=0x1bc4360) at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff0a0ee808) at main.c:120

Segmentation fault:
(gdb) where
#0  0x00007f0242ebceed in lrmd_tls_connection_destroy (userdata=<value optimized out>) at lrmd_client.c:506
#1  0x00007f0242ebd6c0 in lrmd_tcp_connect_cb (userdata=0x1eec210, sock=-1) at lrmd_client.c:1079
#2  0x00007f02432f064a in check_connect_finished (userdata=0x1f17150) at remote.c:736
#3  0x0000003b67a3961b in ?? () from /lib64/libglib-2.0.so.0
#4  0x0000003b67a38f0e in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#5  0x0000003b67a3c938 in ?? () from /lib64/libglib-2.0.so.0
#6  0x0000003b67a3cd55 in g_main_loop_run () from /lib64/libglib-2.0.so.0
#7  0x000000000040530e in crmd_init () at main.c:154
#8  0x000000000040560c in main (argc=1, argv=0x7fff162351e8) at main.c:120

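All four 1.1.10rc2 traces fault at lrmd_client.c:506, the first dereference of native->remote->tls_session, after lrmd_tcp_connect_cb was entered with a negative sock (-1 or -110). One possible reading, which is an assumption and not something the traces confirm, is that a pointer in the chain lrmd -> private -> remote is NULL on the failed-connect path. Below is a minimal C sketch of a teardown that checks each link before dereferencing it. The mock_* types and mock_tls_connection_destroy are hypothetical stand-ins, not Pacemaker's definitions, and free() stands in for the gnutls_bye/gnutls_deinit/gnutls_free sequence shown in the listing.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real types. Only the pointer chain seen in
 * the gdb listing is modelled; "priv" mirrors lrmd->private. */
typedef struct { void *tls_session; } mock_remote_t;
typedef struct { mock_remote_t *remote; } mock_private_t;
typedef struct { mock_private_t *priv; } mock_lrmd_t;

/* Tear down the session only if every link in the chain is present.
 * Returns 1 if a session was released, 0 if there was nothing to release. */
int mock_tls_connection_destroy(void *userdata)
{
    mock_lrmd_t *lrmd = userdata;
    mock_remote_t *remote = NULL;

    if (lrmd == NULL || lrmd->priv == NULL || lrmd->priv->remote == NULL) {
        return 0;               /* connection was never fully set up */
    }
    remote = lrmd->priv->remote;

    if (remote->tls_session == NULL) {
        return 0;               /* handshake never produced a session */
    }

    /* Assumption: free() stands in for the gnutls teardown calls. */
    free(remote->tls_session);
    remote->tls_session = NULL; /* make a second call a harmless no-op */
    return 1;
}
```

A NULL check cannot detect a dangling pointer, so if the real cause is a use-after-free (for example, the object was already freed before the callback ran), a guard like this would not help; it only illustrates the kind of unhandled case described above.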
pacemaker-ccni.patch
Description: Binary data
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org