A few more thoughts that occurred after I hit <return>:

1. This problem seems to only occur when "/etc/init.d/heartbeat start" is executed on two nodes at the same time. If I only do one at a time it does not seem to occur. (This may be related to the creation of master/slave resources in /etc/ha.d/resource.d/startstop when heartbeat starts.)

2. This problem seemed to occur most frequently when I went from 4 master/slave resources to 6 master/slave resources.
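Since the symptom only shows up when both nodes start heartbeat simultaneously, one possible workaround sketch is to serialize the startup. This is only an illustration, not a fix from this thread; the node names (node1, node2), the use of ssh, and the 60-second wait are all assumptions. The script defaults to a dry run (RUN=echo) that prints the commands instead of executing them; set RUN to empty on a real cluster.

```shell
# Dry-run sketch: start heartbeat on the two nodes one at a time instead
# of simultaneously. node1/node2 and the 60s wait are hypothetical.
RUN="${RUN:-echo}"
cmds=$(
  $RUN ssh node1 /etc/init.d/heartbeat start
  $RUN sleep 60   # in practice, wait until node1 has joined the cluster
  $RUN ssh node2 /etc/init.d/heartbeat start
)
printf '%s\n' "$cmds"
```

In practice you would replace the fixed sleep with a check that node1 has actually joined (e.g. by polling crm_mon) before starting node2.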
Thanks,

Bob

----- Original Message ----
From: Bob Schatz <bsch...@yahoo.com>
To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
Sent: Fri, March 25, 2011 4:22:39 PM
Subject: Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

After reading more threads, I noticed that I needed to include the PE outputs. Therefore, I have rerun the tests and included the PE outputs, the configuration file, and the logs for both nodes.

The test was rerun with max-children of 20.

Thanks,

Bob

----- Original Message ----
From: Bob Schatz <bsch...@yahoo.com>
To: pacemaker@oss.clusterlabs.org
Sent: Thu, March 24, 2011 7:35:54 PM
Subject: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg

I am getting these messages in the log:

2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message follows:
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 16 fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [lrm_rid=SSJ0000E02A2:0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [lrm_timeout=300000]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [lrm_copyparams=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [lrm_t_rcchange=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [lrm_queue_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [lrm_targetrc=-1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [(2)lrm_param=0x64c230(938 1098)]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 27 fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [CRM_meta_clone=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [CRM_meta_notify_slave_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [CRM_meta_notify_active_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [CRM_meta_notify_demote_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [CRM_meta_notify_inactive_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [ssconf=/var/omneon/config/config.J0000E02A2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [CRM_meta_master_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [CRM_meta_notify_stop_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [CRM_meta_notify_master_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [CRM_meta_clone_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [CRM_meta_clone_max=2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [CRM_meta_notify=true]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [CRM_meta_notify_start_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] : [CRM_meta_notify_stop_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] : [crm_feature_set=3.0.1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [CRM_meta_notify_master_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] : [CRM_meta_master_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] : [CRM_meta_globally_unique=false]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] : [CRM_meta_notify_promote_resource=SSJ0000E02A2:0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] : [CRM_meta_notify_promote_uname=mgraid-s0000e02a1-0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] : [CRM_meta_notify_active_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] : [CRM_meta_notify_start_uname=mgraid-s0000e02a1-0 mgraid-s0000e02a1-1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[22] : [CRM_meta_notify_slave_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[23] : [CRM_meta_name=start]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[24] : [ss_resource=SSJ0000E02A2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[25] : [CRM_meta_notify_demote_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[26] : [CRM_meta_timeout=300000]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [lrm_callid=15]

This results in the resources being stopped even though I can see from the logging that the agent START function returned $OCF_SUCCESS. (The agent start function prints "ss_start() START" and "ss_start() END" in the logging.) The START function can take anywhere from 30-60 seconds to complete due to our application.

I am running with Pacemaker 1.0.9 and heartbeat 3.0.3.

I have attached the configuration as a file to this email since I thought inlining it would make the email unreadable. (Summary: 6 master/slave resources.) I have also attached logs. The above messages are from the file n0-short.txt but also occur in n1-short.txt.

I thought that maybe I was running into a problem with the number of threads that lrmd had configured. I increased it to 40 and verified that it was in effect with:

# /sbin/lrmadmin -g max-children
max-children: 40

This problem is reproducible every time.
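When correlating these warnings across n0-short.txt and n1-short.txt, it can help to extract the dumped key=value fields mechanically rather than by eye. Below is a small illustrative helper (not part of Pacemaker; the field format is taken from the dump above) that counts the msg_to_op warnings in a log and collects the MSG[n] fields so the failing operation (lrm_op, lrm_rid) is easy to spot.

```python
# Illustrative log-scanning helper, assuming the crmd log format shown above.
import re

# Matches lines like: "... MSG[2] : [lrm_op=start]"
FIELD_RE = re.compile(r"MSG\[\d+\] : \[([^=\]]+)=([^\]]*)\]")
WARN_TEXT = "failed to get the value of field lrm_opstatus"

def scan_log(lines):
    """Return (warning_count, fields): how many msg_to_op warnings were
    seen, and a dict of the dumped message fields (last value wins)."""
    warnings = 0
    fields = {}
    for line in lines:
        if WARN_TEXT in line:
            warnings += 1
        m = FIELD_RE.search(line)
        if m:
            fields[m.group(1)] = m.group(2)
    return warnings, fields

# Sample lines taken from the dump above.
sample = [
    "2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): "
    "failed to get the value of field lrm_opstatus from a ha_msg",
    "2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] : [lrm_rid=SSJ0000E02A2:0]",
    "2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]",
]
count, fields = scan_log(sample)
print(count, fields["lrm_rid"], fields["lrm_op"])
```

Running this over the full attached logs (e.g. with `scan_log(open("n0-short.txt"))`) would show whether every warning corresponds to a start operation, which is what the dump above suggests.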
Thanks in advance,

Bob

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker