Hi,

On Wed, Jul 07, 2010 at 10:04:34AM +0200, th.schrei...@ndr.de wrote:
> Hello Andrew,
> yes, the lrmd is running, but it has defunct:
>
> root 6068 6044 0 Jul06 ? 00:00:00 [lrmd] <defunct>
> root 6076 6044 0 Jul06 ? 00:00:00 /usr/lib64/heartbeat/lrmd

The first instance of lrmd exited. We'd need the full logs to say
what happened. Since this is SLE11, you can open a call with Novell
for the incident.

BTW, it is strange that there's still a zombie; corosync should've
collected the status.

Thanks,

Dejan

> Thomas Schreiber
>
> Andrew Beekhof <and...@beekhof.net>
> 07.07.2010 08:38
>
> To
> th.schrei...@ndr.de
> Cc
> "Openais@lists.linux-foundation.org" <Openais@lists.linux-foundation.org>
> Subject
> Re: [Openais] corosync offline
>
> On Tue, Jul 6, 2010 at 1:53 PM, <th.schrei...@ndr.de> wrote:
> >
> > Hello,
> >
> > I've built a cluster with just two nodes, both of them see each other, but
> > they don't like to go online. This is my config:
> >
> > interface {
> >     bindnetaddr: 172.28.87.0
> >     mcastaddr: 226.94.1.1
> >     mcastport: 5420
> >     ringnumber: 0
> > }
> > Both nodes have the same config.
> > ..
> >
> > # crm_mon --one-shot
> > ============
> > Last updated: Tue Jul 6 13:38:39 2010
> > Stack: openais
> > Current DC: NONE
> > 2 Nodes configured, 2 expected votes
> > 1 Resources configured.
> > ============
> >
> > OFFLINE: [ lis01 lis11 ]
> > ..
> >
> > I made a tcpdump:
> > ...
> > 13:40:15.870996 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > 13:40:16.085725 IP 172.28.87.66.5419 > 226.94.1.1.5420: UDP, length 75
> > 13:40:16.086270 IP 172.28.87.66.5419 > 226.94.1.1.5420: UDP, length 919
> > 13:40:16.296619 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > 13:40:16.539215 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > 13:40:16.773796 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > ....
> >
> > Most of the time, just the .64 node is sending packets. Only this excerpt
> > shows, after a long time, the .66 node.
> > The tcpdump on the other node is nearly the same; there, too, .64 sends
> > most of the packets.
> >
> > When I stop openais (corosync) on .64, the other node sends all the time
> > until the .64 is online again.
> > It seems that both see each other.
> >
> >
> > The syslog output:
> >
> > # tail -f /var/log/messages
> > Jul 6 13:42:55 lis11 crmd: [13107]: WARN: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
> > Jul 6 13:42:57 lis11 crmd: [13107]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
> > Jul 6 13:42:57 lis11 crmd: [13107]: WARN: lrm_signon: can not initiate connection
> > Jul 6 13:42:57 lis11 crmd: [13107]: WARN: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
> > Jul 6 13:42:59 lis11 crmd: [13107]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
> > Jul 6 13:42:59 lis11 crmd: [13107]: WARN: lrm_signon: can not initiate connection
> > ... and so on
>
> So did you check if the lrmd was running (and if not, why not)?
>
> >
> > Jul 6 13:46:17 lis11 cib: [13507]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > Jul 6 13:46:17 lis11 corosync[13445]: [pcmk ] info: pcmk_ipc_exit: Client crmd (conn=0x68eba0, async-conn=0x68eba0) left
> > Jul 6 13:46:17 lis11 corosync[13445]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=15909, rc=2)
> > Jul 6 13:46:17 lis11 corosync[13445]: [pcmk ] ERROR: pcmk_wait_dispatch: Child respawn count exceeded by crmd
> > Jul 6 13:46:17 lis11 corosync[13445]: [pcmk ] info: update_member: Node hhloklis11 now has process list: 00000000000000000000000000111112 (1118482)
> > Jul 6 13:46:17 lis11 corosync[13445]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> > Jul 6 13:47:06 lis11 corosync[13445]: [pcmk ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> > Jul 6 13:47:54 lis11 cib: [13507]: info: cib_stats: Processed 28 operations (1071.00us average, 0% utilization) in the last 10min
> > ....
> >
> >
> > OS is SuSE SLES11 SP1
> >
> > pacemaker-1.1.2-0.2.1
> > pacemaker-mgmt-2.0.0-0.2.19
> > corosync-1.2.1-0.5.1
> > libcorosync4-1.2.1-0.5.1
> > openais-1.1.2-0.5.19
> > libopenais3-1.1.2-0.5.19
> >
> > openais config is empty.
> >
> >
> > Kernel: 2.6.32.12-0.7-default x86_64
> >
> >
> > Any help?
> >
> >
> > Thomas Schreiber
> > _______________________________________________
> > Openais mailing list
> > Openais@lists.linux-foundation.org
> > https://lists.linux-foundation.org/mailman/listinfo/openais
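
The defunct lrmd entry quoted at the top of this thread can also be picked out of a process listing programmatically. The following is only a minimal sketch, not part of the original exchange: it assumes the listing has the `ps -ef` column layout (UID PID PPID C STIME TTY TIME CMD), and the helper name `find_defunct` and the `SAMPLE` text are hypothetical, with the sample copied from the two lines quoted above.

```python
from typing import List, NamedTuple


class ProcEntry(NamedTuple):
    pid: int
    ppid: int
    command: str


def find_defunct(ps_output: str, name: str) -> List[ProcEntry]:
    """Return defunct entries for `name` from `ps -ef`-style text.

    Assumes the column order UID PID PPID C STIME TTY TIME CMD.
    """
    found = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its spaces.
        fields = line.split(None, 7)
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # skip the header row and malformed lines
        command = fields[7]
        if "<defunct>" in command and name in command:
            found.append(ProcEntry(int(fields[1]), int(fields[2]), command))
    return found


# Sample text: the two ps lines quoted in this thread.
SAMPLE = """\
root 6068 6044 0 Jul06 ? 00:00:00 [lrmd] <defunct>
root 6076 6044 0 Jul06 ? 00:00:00 /usr/lib64/heartbeat/lrmd
"""

if __name__ == "__main__":
    for entry in find_defunct(SAMPLE, "lrmd"):
        print(f"defunct {entry.command!r}: pid={entry.pid}, parent pid={entry.ppid}")
```

The parent PID it reports is the process that has not yet collected the exit status of the zombie, which is the first place to look when a defunct child lingers.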