OK, I figured out why crm_mon didn't work and why crmd wasn't initialized correctly.

On Solaris, the heartbeat subprocesses are not spawned in the same order as on other systems, and they take some time to come up. That is the issue.

My first hack was to put a sleep into heartbeat.c before crmd is spawned, waiting 20 seconds so that all the other processes had already finished initializing. This worked most of the time. The real issue, however, was that you would still see

    crmd[29618]: 2009/03/05_01:34:24 ERROR: socket_client_channel_new: open(/var/run/heartbeat/crm/cib_rw, ...) failure: No such file or directory

which means that crmd was still sometimes initialized before cib or ccm. Remember: on Solaris, pipes are used, not sockets.

What I did was simply hack a while loop into lib/clplumbing/ipcsocket.c around

    sockfd = open(path_name, O_RDWR|O_NONBLOCK);

that repeats until sockfd != -1. This worked perfectly: I could see the number of times it tried to open the socket, and after a while it had been created by another subprocess.

After that, crm_mon and cibadmin always worked. Stopping heartbeat now also always works under Solaris and no longer hangs. It used to hang because crmd couldn't be stopped; once crmd was killed, heartbeat could shut down.

The final issue is that the pipes were sometimes not removed from /var/run/heartbeat, which meant that crmd wouldn't always work. A simple fix was to rm the pipes in the run dir from the init.d script before start.

--- On Wed, 3/4/09, Harakiri <harakiri...@yahoo.com> wrote:

> From: Harakiri <harakiri...@yahoo.com>
> Subject: Re: [Linux-HA] crm_mon vs cl_status
> To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>, "Andrew Beekhof" <beek...@gmail.com>
> Date: Wednesday, March 4, 2009, 11:15 AM
>
> Thanks for answering,
>
> --- On Wed, 3/4/09, Andrew Beekhof <beek...@gmail.com> wrote:
>
> > crm_mon takes other things into account.
> > but without logs or the current cib it's impossible to say for sure why
> > this is happening.
>
> After a reboot or restart, the following log information is found in ha-debug:
>
> http://pastebin.com/m7d9c71f7
>
> Note the only error is:
>
> mgmtd[5612]: 2009/03/04_16:58:25 ERROR: socket_client_channel_new: open(/var/lib/heartbeat/run/heartbeat/lrm_cmd_sock, ...) failure: No such file or directory
>
> but it exists. It's probably a race condition and it is created later:
>
> ls -la /var/lib/heartbeat/run/heartbeat/lrm_cmd_sock
> prwxrwxrwx 1 root root 0 Mar 4 16:58 /var/lib/heartbeat/run/heartbeat/lrm_cmd_sock|
>
> At this point, cibadmin etc. will not work and will hang because they can't seem to connect to the crmd; crm_mon will indicate the node as offline.
>
> After killing crmd, the following log information is found:
>
> http://pastebin.com/m29a3ec9d
>
> crmd[5644]: 2009/03/04_17:06:29 info: do_cib_control: CIB connection established
>
> etc.
>
> So it seems that on the initial start crmd does not initialize correctly. Maybe the cib process has to be started before crmd?
>
> Maybe it's related to the issue that under Solaris SPARC, PIPES are used instead of sockets for communication.
>
> PIPES were introduced because of this patch:
>
> http://www.mail-archive.com/linux-ha-...@lists.linux-ha.org/msg00307.html
>
> Since I have Solaris 10 I tried to use streams, but I can't find ucred.h anywhere for Solaris.
>
> Any ideas? How can I change the order of the "Starting child client" steps?
>
> Thanks
>
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
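
For reference, the retry idea described at the top of this message could be sketched roughly as follows. This is only an illustration, not the actual change made to lib/clplumbing/ipcsocket.c: the helper name open_with_retry and the parameters max_tries and delay_sec are hypothetical, and unlike the hack described above (which looped until sockfd != -1) this version gives up after a bounded number of attempts and only retries when the path does not exist yet.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of heartbeat: keep trying to open
 * path_name until another process has created it, or give up after
 * max_tries attempts. Returns the open descriptor, or -1 on failure.
 */
static int
open_with_retry(const char *path_name, int max_tries, unsigned int delay_sec)
{
    int tries;

    assert(path_name != NULL);

    for (tries = 1; tries <= max_tries; tries++) {
        /* Same open() call as quoted in the message above. */
        int sockfd = open(path_name, O_RDWR | O_NONBLOCK);

        if (sockfd != -1) {
            return sockfd;
        }
        if (errno != ENOENT) {
            /* Anything other than "not created yet" is treated as fatal. */
            return -1;
        }
        fprintf(stderr, "open(%s): attempt %d of %d, not there yet\n",
                path_name, tries, max_tries);
        sleep(delay_sec);
    }
    return -1;
}
```

A caller would use something like open_with_retry(path_name, 30, 1) in place of the bare open(). Bounding the number of attempts is a design choice of this sketch, so that a pipe which is never created produces an error instead of a permanent hang.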