OK, I figured out why crm_mon didn't work and why crmd wasn't initialized 
correctly.

On Solaris, the heartbeat subprocesses take some time to spawn, so they do not 
come up in the same order as on other systems. This is the issue.

My first hack was to put a sleep in heartbeat.c before crmd is spawned, 
waiting 20 seconds so that all the other processes had already finished 
initializing. This worked, at least most of the time.

The real issue however was that you would see

crmd[29618]: 2009/03/05_01:34:24 ERROR: socket_client_channel_new: 
open(/var/run/heartbeat/crm/cib_rw, ...) failure: No such file or directory

which means that crmd was still sometimes initialized before cib or ccm.

Remember: on Solaris, pipes are used, not sockets.

What I did was simply hack a while loop into lib/clplumbing/ipcsocket.c that 
repeats
sockfd = open(path_name, O_RDWR|O_NONBLOCK);
until sockfd != -1.

This worked perfectly: I could see the number of times it tried to open the 
socket, and after a while it was created by another subprocess.
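
For anyone who wants to try the same idea, here is a rough sketch of such a 
retry loop. It is not the actual change to ipcsocket.c: the function name 
open_with_retry and the max_tries and delay_ms parameters are made up for 
illustration, and unlike my hack (which simply looped until the open 
succeeded) this version gives up after a bounded number of attempts and only 
retries when the path does not exist yet.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>

/*
 * Hypothetical helper, not part of heartbeat: try to open path_name
 * until it exists or max_tries attempts have been made.
 * Returns the file descriptor, or -1 with errno set.
 */
int open_with_retry(const char *path_name, int max_tries, long delay_ms)
{
    struct timespec delay;
    int attempt;

    assert(path_name != NULL);

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (attempt = 1; attempt <= max_tries; attempt++) {
        /* Same flags as the open() call quoted above. */
        int sockfd = open(path_name, O_RDWR | O_NONBLOCK);

        if (sockfd != -1)
            return sockfd;      /* the other subprocess has created it */

        if (errno != ENOENT)
            return -1;          /* a different error: retrying won't help */

        if (attempt < max_tries)
            nanosleep(&delay, NULL);    /* wait before the next attempt */
    }

    errno = ENOENT;             /* still missing after all attempts */
    return -1;
}
```

Something like open_with_retry(path_name, 100, 200) would wait up to roughly 
20 seconds, about the same as the sleep hack, but returns as soon as the pipe 
shows up.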

After that, crm_mon and cibadmin always worked. Stopping heartbeat also now 
always works under Solaris; it no longer hangs. The reason it hung was that 
crmd couldn't be stopped: when you killed crmd normally, heartbeat could shut 
down.

The final issue is that the pipes were sometimes not removed from 
/var/run/heartbeat, which meant that crmd wouldn't always work. A simple fix 
was to rm the run dir pipes in the init.d script before start.


--- On Wed, 3/4/09, Harakiri <harakiri...@yahoo.com> wrote:

> From: Harakiri <harakiri...@yahoo.com>
> Subject: Re: [Linux-HA] crm_mon vs cl_status
> To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>, "Andrew 
> Beekhof" <beek...@gmail.com>
> Date: Wednesday, March 4, 2009, 11:15 AM
> Thanks for answering, 
> 
> 
> --- On Wed, 3/4/09, Andrew Beekhof
> <beek...@gmail.com> wrote:
> 
> > 
> > crm_mon takes other things into account.
> > but without logs or the current cib its impossible to say for sure why
> > this is happening.
> 
> 
> After a reboot or restart, the following log information
> is found in ha-debug:
> 
> http://pastebin.com/m7d9c71f7
> 
> note the only error is :
> 
> mgmtd[5612]: 2009/03/04_16:58:25 ERROR:
> socket_client_channel_new:
> open(/var/lib/heartbeat/run/heartbeat/lrm_cmd_sock, ...)
> failure: No such file or directory
> 
> but it exists; it's probably a race condition and it is created
> later:
> 
> ls -la /var/lib/heartbeat/run/heartbeat/lrm_cmd_sock
> prwxrwxrwx   1 root     root           0 Mar  4 16:58
> /var/lib/heartbeat/run/heartbeat/lrm_cmd_sock|
> 
> At this point, cibadmin etc. will not work and will hang because
> they can't seem to connect to crmd; crm_mon will indicate
> the node as offline.
> 
> After killing crmd the following log information is found:
> 
> http://pastebin.com/m29a3ec9d
> 
> crmd[5644]: 2009/03/04_17:06:29 info: do_cib_control: CIB
> connection established
> 
> etc
> 
> So it seems that on the initial start crmd does not
> initialize correctly; maybe the cib process has to be
> started before crmd?
> 
> Maybe it's related to the fact that under Solaris SPARC,
> PIPES are used instead of sockets for communication.
> 
> PIPES were introduced because of this patch
> 
> http://www.mail-archive.com/linux-ha-...@lists.linux-ha.org/msg00307.html
> 
> Since I have Solaris 10 I tried to use streams, but I can't
> find ucred.h anywhere for Solaris.
> 
> Any ideas? How can I make the "Starting child
> client" steps run in a different order?
> 
> Thanks
> 
> 
>       
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems


      
