Hi,

On Wed, Jul 07, 2010 at 10:04:34AM +0200, th.schrei...@ndr.de wrote:
> Hello Andrew,
> yes, lrmd is running, but one instance is defunct:
> 
> root      6068  6044  0 Jul06 ?        00:00:00 [lrmd] <defunct>
> root      6076  6044  0 Jul06 ?        00:00:00 /usr/lib64/heartbeat/lrmd

The first instance of lrmd exited. We'd need the full logs to say
what happened. Since this is SLE11, you can open a call with
Novell for the incident. BTW, it is strange that there's still a
zombie; corosync should've collected the exit status.
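
To illustrate what "collecting the status" means, here is a minimal
Python sketch. It is not corosync's code: the function name
spawn_and_reap and the exit code are made up for the example, and it
assumes Linux with os.waitid available. An exited child stays around as
a zombie (<defunct> in ps) until its parent waits for it.

```python
import os


def spawn_and_reap(exit_code=2):
    """Fork a child that exits at once, then collect its status as a parent should.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping Python cleanup handlers.
        os._exit(exit_code)

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    # At this point ps would list the child as <defunct>.
    info = os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    pending_status = info.si_status

    # Reaping: waitpid() collects the status and removes the zombie entry.
    _, status = os.waitpid(pid, 0)
    reaped_status = os.WEXITSTATUS(status)

    # A second wait fails because there is nothing left to collect.
    try:
        os.waitpid(pid, 0)
        gone = False
    except ChildProcessError:
        gone = True

    return pending_status, reaped_status, gone


if __name__ == "__main__":
    print(spawn_and_reap())
```

A parent that never makes the waitpid() call leaves the <defunct>
entry in the process table for as long as the parent itself runs.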

Thanks,

Dejan

> 
> Thomas Schreiber
> 
> Andrew Beekhof <and...@beekhof.net> 
> 07.07.2010 08:38
> 
> To: th.schrei...@ndr.de
> Cc: "Openais@lists.linux-foundation.org" <Openais@lists.linux-foundation.org>
> Subject: Re: [Openais] corosync offline
> 
> On Tue, Jul 6, 2010 at 1:53 PM,  <th.schrei...@ndr.de> wrote:
> >
> > Hello,
> >
> > I've built a cluster with just two nodes. Both of them see each other,
> > but they won't go online. This is my config:
> >
> > interface {
> >         bindnetaddr:    172.28.87.0
> >         mcastaddr:      226.94.1.1
> >         mcastport:      5420
> >         ringnumber:     0
> > }
> > Both nodes have the same config.
> > ..
> >
> > # crm_mon --one-shot
> > ============
> > Last updated: Tue Jul  6 13:38:39 2010
> > Stack: openais
> > Current DC: NONE
> > 2 Nodes configured, 2 expected votes
> > 1 Resources configured.
> > ============
> >
> > OFFLINE: [ lis01 lis11 ]
> > ..
> >
> >
> > I made a tcpdump:
> > ...
> > 13:40:15.870996 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > 13:40:16.085725 IP 172.28.87.66.5419 > 226.94.1.1.5420: UDP, length 75
> > 13:40:16.086270 IP 172.28.87.66.5419 > 226.94.1.1.5420: UDP, length 919
> > 13:40:16.296619 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > 13:40:16.539215 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > 13:40:16.773796 IP 172.28.87.64.5419 > 226.94.1.1.5420: UDP, length 119
> > ....
> >
> > Most of the time, only the .64 node is sending packets. Only this
> > excerpt shows the .66 node sending, after a long pause.
> > The tcpdump on the other node looks nearly the same: .64 also sends
> > most of the packets.
> >
> > When I stop openais (corosync) on .64, the other node sends continuously
> > until .64 is online again.
> > So it seems that both nodes see each other.
> >
> >
> > The syslog output:
> >
> >  # tail -f /var/log/messages
> > Jul  6 13:42:55 lis11 crmd: [13107]: WARN: do_lrm_control: Failed to sign on to the LRM 6 (30 max) times
> > Jul  6 13:42:57 lis11 crmd: [13107]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
> > Jul  6 13:42:57 lis11 crmd: [13107]: WARN: lrm_signon: can not initiate connection
> > Jul  6 13:42:57 lis11 crmd: [13107]: WARN: do_lrm_control: Failed to sign on to the LRM 7 (30 max) times
> > Jul  6 13:42:59 lis11 crmd: [13107]: info: crm_timer_popped: Wait Timer (I_NULL) just popped!
> > Jul  6 13:42:59 lis11 crmd: [13107]: WARN: lrm_signon: can not initiate connection
> > ... and so on
> 
> So did you check if the lrmd was running (and if not, why not)?
> 
> 
> > Jul  6 13:46:17 lis11 cib: [13507]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
> > Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] info: pcmk_ipc_exit: Client crmd (conn=0x68eba0, async-conn=0x68eba0) left
> > Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=15909, rc=2)
> > Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] ERROR: pcmk_wait_dispatch: Child respawn count exceeded by crmd
> > Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] info: update_member: Node hhloklis11 now has process list: 00000000000000000000000000111112 (1118482)
> > Jul  6 13:46:17 lis11 corosync[13445]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> > Jul  6 13:47:06 lis11 corosync[13445]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> > Jul  6 13:47:54 lis11 cib: [13507]: info: cib_stats: Processed 28 operations (1071.00us average, 0% utilization) in the last 10min
> > ....
> >
> >
> >
> > OS is SuSE SLES11 SP1
> >
> > pacemaker-1.1.2-0.2.1
> > pacemaker-mgmt-2.0.0-0.2.19
> > corosync-1.2.1-0.5.1
> > libcorosync4-1.2.1-0.5.1
> > openais-1.1.2-0.5.19
> > libopenais3-1.1.2-0.5.19
> >
> > openais config is empty.
> >
> >
> > Kernel: 2.6.32.12-0.7-default      x86_64
> >
> >
> > Any help?
> >
> >
> > Thomas Schreiber
> > _______________________________________________
> > Openais mailing list
> > Openais@lists.linux-foundation.org
> > https://lists.linux-foundation.org/mailman/listinfo/openais
> >
> 
