Hi,

On Fri, Dec 07, 2007 at 08:44:46AM +1100, Amos Shapira wrote:
> On 05/12/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > On Fri, Nov 30, 2007 at 05:16:38PM +1100, Amos Shapira wrote:
> > > On 30/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Nov 29, 2007 at 05:23:33PM +0000, Amos Shapira wrote:
> > > > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > On Thu, Nov 29, 2007 at 10:25:47AM +0000, Amos Shapira wrote:
> > > > > > > On 29/11/2007, Dejan Muhamedagic <[EMAIL PROTECTED]> wrote:
> > > > > > > > Yes, very much so. For some reason the MCP (master control
> > > > > > > > process) doesn't start the rest of the programs which are doing
> > > > > > > > the real work. I really can't say why. Can you please attach the
> > > > > > > > logs from this node?
> > > > > > >
> > > > > > > A pstree(1) on the better node visualizes the responsibility of
> > > > > > > starting the programs pretty vividly:
> > > > > > >
> > > > > > > |-heartbeat,18449
> > > > > > > |   |-attrd,18477
> > > > > > > |   |-ccm,18473
> > > > > > > |   |-cib,18474
> > > > > > > |   |-crmd,18478
> > > > > > > |   |   |-pengine,18505
> > > > > > > |   |   `-tengine,18504
> > > > > > > |   |-heartbeat,18452
> > > > > > > |   |-heartbeat,18453
> > > > > > > |   |-heartbeat,18454
> > > > > > > |   |-heartbeat,18455
> > > > > > > |   |-heartbeat,18456
> > > > > > > |   |-lrmd,18475 -r
> > > > > > > |   |-mgmtd,18479 -v
> > > > > > > |   `-stonithd,18476
> > > > > > >
> > > > > > > Here they are again (from tonight):
> > > > > > >
> > > > > > > 1 heartbeat[17481]: 2007/11/29_07:12:40 WARN: heartbeat: udp port 695 reserved for service "ieee-mms-ssl".
> > > > > > > 2 heartbeat[17481]: 2007/11/29_07:12:40 info: Version 2 support: yes
> > > > > > > 3 heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
> > > > > > > 4 heartbeat[17481]: 2007/11/29_07:12:40 WARN: This file is not used because crm is enabled
> > > > > > > 5 heartbeat[17481]: 2007/11/29_07:12:40 WARN: Logging daemon is disabled --enabling logging daemon is recommended
> > > > > > > 6 heartbeat[17481]: 2007/11/29_07:12:40 info: **************************
> > > > > > > 7 heartbeat[17481]: 2007/11/29_07:12:40 info: Configuration validated. Starting heartbeat 2.1.2
> > > > > > > 8 heartbeat[17482]: 2007/11/29_07:12:40 info: heartbeat: version 2.1.2
> > > > > > > 9 heartbeat[17482]: 2007/11/29_07:12:40 info: Heartbeat generation: 1196102397
> > > > > > > 10 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > > > > 11 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_TriggerHandler: Added signal manual handler
> > > > > > > 12 heartbeat[17482]: 2007/11/29_07:12:40 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
> > > > > > > 13 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > > > > > 14 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > > > > > 15 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > > > > > 16 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.248
> > > > > > > 17 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: write socket priority set to IPTOS_LOWDELAY on eth0
> > > > > > > 18 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound send socket to device: eth0
> > > > > > > 19 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: bound receive socket to device: eth0
> > > > > > > 20 heartbeat[17482]: 2007/11/29_07:12:40 info: glib: ucast: started on port 695 interface eth0 to 192.168.0.249
> > > > > > > 21 heartbeat[17482]: 2007/11/29_07:12:40 info: G_main_add_SignalHandler: Added signal handler for signal 17
> > > > > > > 22 heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'
> > > > > > > 23 heartbeat[17482]: 2007/11/29_07:12:41 info: Link drbd01.test.spammatters.local:eth0 up.
> > > > > > > 24 heartbeat[17482]: 2007/11/29_07:12:41 info: Status update for node drbd01.test.spammatters.local: status up
> > > > > > > 25 heartbeat[17482]: 2007/11/29_07:13:45 info: all clients are now paused
> > > > > > > 26 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq =0
> > > > > > > 27 heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->lowseq =0, hist->hiseq=101
> > > > > > > 28 heartbeat[17482]: 2007/11/29_07:13:45 debug: expecting from drbd01.test.spammatters.local
> > > > > > > 29 heartbeat[17482]: 2007/11/29_07:13:45 debug: it's ackseq=0
> > > > > >
> > > > > > heartbeat is getting no packet acknowledgements from drbd01. It
> > > > > > must be a communication problem. Looks like drbd02 doesn't see
> > > > > > packets coming from drbd01, assuming that it's sending them,
> > > > > > which it does if there are no errors reported in drbd01.
> > > > >
> > > > > Wouldn't this be the case if crmd crashes? Could this be related to
> > > > > "stonith -h" seg-faulting and the missing processes (crmd, cib,
> > > > > attrd, ccm, lrmd, mgmtd, stonithd) which I can see on the other node?
> > > >
> > > > No. There's an IPC layer which is used by heartbeat (the process)
> > > > only. If that doesn't work, it won't start other programs.
> > >
> > > I did some more experimentation - I installed a third machine identical
> > > to the second one but still get the same results.
> >
> > Then perhaps the problem is on the good host. Did you try to make
> > a cluster of only the second and the third host?
>
> That's a good point but I already installed heartbeat 1.2.5 on drbd01
> and drbd02 and will try to move forward with that.
>
> My co-worker made it work on a couple of other identical CentOS Xen
> guests with v1-style resource configuration (haresources), though.
> > >
> > > One thing that I managed to change (on both the new machine and the
> > > previous "secondary") is that by moving aside the content of
> > > /usr/lib64/stonith/plugins/stonith2 and leaving only the "null" plugin
> > > in there I could get rid of the "stonith -h" segmentation fault (and I
> > > don't have any of the devices these plugins talk to anyway).
> >
> > The stonith program problem is definitely annoying, but it is not
> > going to influence your cluster in any way.
>
> Thanks for the clarification.
>
> > > But still I don't see crmd and friends on any machine except for the
> > > primary.
> > >
> > > > Anyway, you would definitely see error messages if a program
> > > > can't be started.
> > >
> > > Where should I look for it? The init.d script forwards almost
> > > everything into /dev/null.
> >
> > It's nothing to do with the init script. The heartbeat MCP
> > (master control process) starts all other processes itself. The
> > default syslog facility is daemon (2.0.x releases had local7).
>
> BTW - how long should it take to start all programs in a normal
> situation? It seems that crm_mon takes at least 3 minutes to start
> getting info even on what I thought at the beginning to be the "good"
> host.

Three minutes is way too long. Normally, the result comes
immediately on the DC, and it takes a second if run on another
node.

Thanks,

Dejan

> Thanks,
>
> --Amos
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
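
Since the thread asks where to look for error messages when the MCP
fails to start a program, here is a minimal sketch (not part of
heartbeat) that picks the warning and error lines out of log text in
the format quoted above. The function name interesting_lines and the
regular expression are hypothetical; WARN appears in the quoted log,
while ERROR and CRIT are assumed level names.

```python
import re

# Matches lines shaped like the log excerpt quoted in this thread, e.g.
#   heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.
LOG_LINE = re.compile(
    r"^(?P<prog>[\w-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<stamp>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+):\s+(?P<msg>.*)$"
)

# WARN is seen in the excerpt; ERROR and CRIT are assumptions.
INTERESTING = {"WARN", "ERROR", "CRIT"}


def interesting_lines(lines):
    """Return (pid, level, message) for each line whose level is in INTERESTING."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and m.group("level") in INTERESTING:
            hits.append((int(m.group("pid")), m.group("level"), m.group("msg")))
    return hits


# Three lines copied from the quoted log, without the leading line numbers.
sample = [
    "heartbeat[17481]: 2007/11/29_07:12:40 WARN: File /etc/ha.d/haresources exists.",
    "heartbeat[17482]: 2007/11/29_07:12:40 info: Local status now set to: 'up'",
    "heartbeat[17482]: 2007/11/29_07:13:45 debug: hist->ackseq =0",
]

for pid, level, msg in interesting_lines(sample):
    print(pid, level, msg)
```

The same function could be fed the lines of whatever file the syslog
"daemon" facility is written to on the node; that path depends on the
local syslog configuration and is not assumed here.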