Hi,

On Fri, Dec 07, 2007 at 08:44:46AM +1100, Amos Shapira wrote:
> On 05/12/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > On Fri, Nov 30, 2007 at 05:16:38PM +1100, Amos Shapira wrote:
> > > On 30/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Nov 29, 2007 at 05:23:33PM +0000, Amos Shapira wrote:
> > > > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, Nov 29, 2007 at 10:25:47AM +0000, Amos Shapira wrote:
> > > > > > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > > > > > Yes, very much so. For some reason the MCP (master control
> > > > > > > > process) doesn't start the rest of the programs which are doing
> > > > > > > > the real work. I really can't say why. Can you please attach the
> > > > > > > > logs from this node?
> > > > > > >
> > > > > > > A pstree(1) on the better node visualizes the responsibility of
> > > > > > > starting the programs pretty vividly:
> > > > > > >
> > > > > > >   |-heartbeat,18449
> > > > > > >   |   |-attrd,18477
> > > > > > >   |   |-ccm,18473
> > > > > > >   |   |-cib,18474
> > > > > > >   |   |-crmd,18478
> > > > > > >   |   |   |-pengine,18505
> > > > > > >   |   |   `-tengine,18504
> > > > > > >   |   |-heartbeat,18452
> > > > > > >   |   |-heartbeat,18453
> > > > > > >   |   |-heartbeat,18454
> > > > > > >   |   |-heartbeat,18455
> > > > > > >   |   |-heartbeat,18456
> > > > > > >   |   |-lrmd,18475 -r
> > > > > > >   |   |-mgmtd,18479 -v
> > > > > > >   |   `-stonithd,18476
> > > > > > >
> > > > > > > Here they are again (from tonight):
> > > > > > >
> > > > > > >       1 heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp
> > > > > > > port 695 reserved for service "ieee-mms-ssl".
> > > > > > >       2 heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2
> > > > support:
> > > > > > yes
> > > > > > >       3 heartbeat[17481]: 2007/11/29_07:12:40 WARN: File
> > > > > > > /etc/ha.d/haresources exists.
> > > > > > >       4 heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is 
> > > > > > > not
> > > > > > > used because crm is enabled
> > > > > > >       5 heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon
> > > > is
> > > > > > > disabled --enabling logging daemon is recommended
> > > > > > >       6 heartbeat[17481]: 2007/11/29_07:12:40 info:
> > > > > > **************************
> > > > > > >       7 heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration
> > > > > > > validated. Starting heartbeat 2.1.2
> > > > > > >       8 heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat:
> > > > version
> > > > > > 2.1.2
> > > > > > >       9 heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat
> > > > > > > generation: 1196102397
> > > > > > >      10 heartbeat[17482]: 2007/11/29_07:12:40 info:
> > > > > > > G_main_add_TriggerHandler: Added signal manual handler
> > > > > > >      11 heartbeat[17482]: 2007/11/29_07:12:40 info:
> > > > > > > G_main_add_TriggerHandler: Added signal manual handler
> > > > > > >      12 heartbeat[17482]: 2007/11/29_07:12:40 info: Removing
> > > > > > > /var/run/heartbeat/rsctmp failed, recreating.
> > > > > > >      13 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > write
> > > > > > > socket priority set to IPTOS_LOWDELAY on eth0
> > > > > > >      14 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > bound
> > > > > > > send socket to device: eth0
> > > > > > >      15 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > bound
> > > > > > > receive socket to device: eth0
> > > > > > >      16 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > > > > started on port 695 interface eth0 to 192.168.0.248
> > > > > > >      17 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > write
> > > > > > > socket priority set to IPTOS_LOWDELAY on eth0
> > > > > > >      18 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > bound
> > > > > > > send socket to device: eth0
> > > > > > >      19 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > bound
> > > > > > > receive socket to device: eth0
> > > > > > >      20 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast:
> > > > > > > started on port 695 interface eth0 to 192.168.0.249
> > > > > > >      21 heartbeat[17482]: 2007/11/29_07:12:40 info:
> > > > > > > G_main_add_SignalHandler: Added signal handler for signal 17
> > > > > > >      22 heartbeat[17482]: 2007/11/29_07:12:40 info: Local status 
> > > > > > > now
> > > > > > > set to: 'up'
> > > > > > >      23 heartbeat[17482]: 2007/11/29_07:12:41 info: Link
> > > > > > > drbd01.test.spammatters.local:eth0 up.
> > > > > > >      24 heartbeat[17482]: 2007/11/29_07:12:41 info: Status update
> > > > for
> > > > > > > node drbd01.test.spammatters.local: status up
> > > > > > >      25 heartbeat[17482]: 2007/11/29_07:13:45 info: all clients 
> > > > > > > are
> > > > now
> > > > > > paused
> > > > > > >      26 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq 
> > > > > > > =0
> > > > > > >      27 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->lowseq
> > > > =0,
> > > > > > > hist->hiseq=101
> > > > > > >      28 heartbeat[17482]: 2007/11/29_07:13:45 debug: expecting 
> > > > > > > from
> > > > > > > drbd01.test.spammatters.local
> > > > > > >      29 heartbeat[17482]: 2007/11/29_07:13:45 debug: it's ackseq=0
> > > > > >
> > > > > > heartbeat is getting no packet acknowledgements from drbd01. It
> > > > > > must be a communication problem. Looks like drbd02 doesn't see
> > > > > > packets coming from drbd01, assuming that it's sending them,
> > > > > > which it does if there are no errors reported in drbd01.
> > > > >
> > > > >
> > > > > > Wouldn't this be the case if crmd crashes? Could this be related to
> > > > > > "stonith -h" seg-faulting and the missing processes (crmd, cib, attrd,
> > > > > > ccm, lrmd, mgmtd, stonithd) which I can see on the other node?
> > > >
> > > > No. There's an IPC layer which is used by heartbeat (the process)
> > > > only. If that doesn't work, it won't start other programs.
> > >
> > >
> > > I did some more experimentation - I installed a third machine identical to
> > > the second one but still get the same results.
> >
> > Then perhaps the problem is on the good host. Did you try to make
> > a cluster of only the second and the third host?
> 
> That's a good point but I already installed heartbeat 1.2.5 on drbd01
> and drbd02 and will try to move forward with that.
> 
> My co-worker made it work on a couple of other identical CentOS Xen
> guests with v1-style resource configuration (haresources), though.
> 
> >
> > > One thing that I managed to change (on both the new machine and the
> > > previous "secondary") is that by moving aside the content of
> > > /usr/lib64/stonith/plugins/stonith2 and leaving only the "null" plugin in
> > > there I could get rid of the "stonith -h" segmentation fault (and I don't
> > > have any of the devices these plugins talk to anyway).
> >
> > The stonith program problem is definitely annoying, but it is not
> > going to influence your cluster in any way.
> 
> Thanks for the clarification.
> 
> >
> > > But still I don't see crmd and friends on any machine except for the
> > > primary.
> > >
> > > > Anyway, you would definitely see error messages if a program
> > > > can't be started.
> > >
> > >
> > > Where should I look for them? The init.d script forwards almost
> > > everything to /dev/null.
> >
> > It's nothing to do with the init script. The heartbeat MCP
> > (master control process) starts all other processes itself. The
> > default syslog facility is daemon (2.0.x releases had local7).
> 
> BTW - how long should it take to start all programs in a normal
> situation? It seems that crm_mon takes at least 3 minutes to start
> getting info, even on what I thought in the beginning to be the "good"
> host.

Three minutes is way too long. Normally, the result comes
immediately on the DC (designated coordinator), and it takes a
second if run on another node.
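
If you want a number to compare between the nodes, something like
the following might help. It is only a rough sketch: time_command
and the 5 second limit are made up for illustration, not part of
heartbeat, and the suggested "crm_mon -1" call assumes that your
crm_mon is in PATH and that -1 prints the status once and exits.

```python
import subprocess
import time


def time_command(argv, limit_seconds=5.0):
    """Run argv once and return (elapsed_seconds, finished_within_limit).

    limit_seconds is an arbitrary illustration value, not a heartbeat
    setting.
    """
    start = time.monotonic()
    # The command's output is discarded; only the time it needs to
    # finish matters here.
    subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= limit_seconds
```

Calling time_command(["crm_mon", "-1"]) on each node would then give
you the elapsed time per node to compare.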

Thanks,

Dejan


> Thanks,
> 
> --Amos
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
