Re: [Pacemaker] help deciphering output

2014-10-09 Thread Alexandre
I have seen this behavior on several virtualised environments. When the VM
backup starts, the VM actually freezes for a (short?) period of time. I guess
it then stops responding to the other cluster nodes, thus triggering
unexpected failover and/or fencing. I have this kind of behavior on a VMware
environment using Veeam backup, as well as on Proxmox (I don't know what
backup tool it uses).
That's actually an interesting topic I never thought about raising here.
How can we avoid that? Increasing timeouts? I am afraid we would have to
reach unacceptably high timeout values, and I am not even sure that would fix
the problem.
I think not all VM snapshot strategies would trigger that problem; do you
guys have any feedback on which backup/snapshot method best suits corosync
clusters?
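(To be concrete, the kind of change I have in mind is raising the totem token
timeout, e.g. in corosync.conf as sketched below; the value is only
illustrative, and on CMAN-based clusters I believe the equivalent is set in
cluster.conf rather than here.)

    totem {
            version: 2
            # Tolerate a node being unresponsive for up to 10 seconds before
            # the other nodes declare it failed (illustrative value only).
            token: 10000
    }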

Regards
On 9 Oct 2014 at 01:24, Alex Samad - Yieldbroker alex.sa...@yieldbroker.com
wrote:

 One of my nodes died in a 2 node cluster

 I gather something went wrong, and it fenced/killed itself. But I am not
 sure what happened.

 I think maybe around that time the VM backups happened and snap of the VM
 could have happened

 But there is nothing for me to put my finger on

 Output from messages around that time

 This is on devrp1
 Oct  8 23:31:38 devrp1 corosync[1670]:   [TOTEM ] A processor failed,
 forming new configuration.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [CMAN  ] quorum lost, blocking
 activity
 Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] This node is within the
 non-primary component and will NOT provide any services.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] Members[1]: 1
 Oct  8 23:31:40 devrp1 corosync[1670]:   [TOTEM ] A processor joined or
 left the membership and a new membership was formed.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender
 r(0) ip(10.172.214.51) ; members(old:2 left:1)
 Oct  8 23:31:40 devrp1 corosync[1670]:   [MAIN  ] Completed service
 synchronization, ready to provide service.
 Oct  8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback:
 Membership 424: quorum lost
 Oct  8 23:31:42 devrp1 corosync[1670]:   [TOTEM ] A processor joined or
 left the membership and a new membership was formed.
 Oct  8 23:31:42 devrp1 corosync[1670]:   [CMAN  ] quorum regained,
 resuming activity
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] This node is within the
 primary component and will provide service.
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
 Oct  8 23:31:42 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender
 r(0) ip(10.172.214.51) ; members(old:1 left:0)
 Oct  8 23:31:42 devrp1 corosync[1670]:   [MAIN  ] Completed service
 synchronization, ready to provide service.
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state:
 cman_event_callback: Node devrp2[2] - state is now lost (was member)
 Oct  8 23:31:42 devrp1 crmd[2350]:  warning: reap_dead_nodes: Our DC node
 (devrp2) left the cluster
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback:
 Membership 428: quorum acquired
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state:
 cman_event_callback: Node devrp2[2] - state is now member (was lost)
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: do_state_transition: State
 transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL
 origin=reap_dead_nodes ]
 Oct  8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we
 were killed by cman_tool or other application
 Oct  8 23:31:42 devrp1 pacemakerd[2339]:error: pcmk_cpg_dispatch:
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 stonith-ng[2346]:error: pcmk_cpg_dispatch:
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 crmd[2350]:error: pcmk_cpg_dispatch: Connection
 to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 crmd[2350]:error: crmd_cs_destroy: connection
 terminated
 Oct  8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
 Oct  8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
 Oct  8 23:31:43 devrp1 attrd[2348]:error: pcmk_cpg_dispatch:
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost
 connection to Corosync service!
 Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Exiting...
 Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Disconnecting client
 0x18cf240, pid=2350...
 Oct  8 23:31:43 devrp1 pacemakerd[2339]:error: mcp_cpg_destroy:
 Connection destroyed
 Oct  8 23:31:43 devrp1 cib[2345]:error: pcmk_cpg_dispatch: Connection
 to the CPG API failed: Library error (2)
 Oct  8 23:31:43 devrp1 cib[2345]:error: cib_cs_destroy: Corosync
 connection lost!  Exiting.
 Oct  8 23:31:43 devrp1 stonith-ng[2346]:error:
 stonith_peer_cs_destroy: Corosync connection terminated
 Oct  8 23:31:43 devrp1 dlm_controld[1752]: 

[Pacemaker] Raid RA Changes to Enable ms configuration -- need some assistance plz.

2014-10-09 Thread Errol Neal
Hi all. I was hoping to get some help with my configuration. I'm kinda 
stuck at the moment. 

I've made some changes to the Raid1 RA to implement a ms style 
configuration (you can see a diff here http://pastebin.com/Q2nbF6Rg 
against the RA in the github repo)

I've modeled my ms implementation heavily on the SCST RA from the ESOS 
project (thanks Marc!). 

Here is my full pacemaker config:

http://pastebin.com/jw6WTpZz

I'm running on CentOS 6.5 using the Pacemaker and CMAN packages available
from the distro.

I'll explain a little bit about what I'm doing, then tell you folks where
I think I need a bit of assistance or a push in the right direction.

The Raid1 RA changes I made are because I want to assemble the same 
RAID1 members on two different hosts. The slave should assemble it in 
readonly mode. Using the SCST RA, I can offer two paths to the same LUN 
from two distinct systems, one standby, one active. In order to keep 
things good and consistent, the master of the MD resource also needs to 
be master of the SCST resource (I'm using the example here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-sets-colocation.html).
So if you can't already tell, this is YAA to create a highly available 
open storage system :)
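In crm shell terms, the constraint I am after boils down to something like
the simplified sketch below (my real config uses the resource-set syntax from
that document; this is just to show the intent):

    colocation scst_master_with_md_master inf: ms_scst1:Master ms_md0:Master
    order md_promote_before_scst_promote inf: ms_md0:promote ms_scst1:promote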

My understanding is that in order for N to start, N+1 must already be 
running. So my configuration (to me) reads that the ms_md0 master 
resource must be started and running before the ms_scst1 resource will 
be started (as master), and these services will be forced onto the same 
node. Please correct me if my understanding is incorrect. When both 
nodes are up and running, the master roles are not split so I *think* my 
configuration is being honored, which leads me to my next issue. 

In my modified RA, I'm not sure I understand how to promote/demote 
properly. For example, when I put a node on standby, the remaining node 
doesn't get promoted. I'm not sure why, so I'm asking the experts. 
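
One theory I have (not verified) is that my RA never advertises a promotion
preference, so the policy engine has nothing it is allowed to promote. The
SCST agent does this with crm_master from its monitor/notify actions; a rough
sketch of the pattern, not my actual code:

    # in the RA's monitor (or notify) action, on a node eligible to be master:
    crm_master -l reboot -v 100
    # and clear it again when the resource stops or is demoted:
    crm_master -l reboot -D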

I'd really appreciate any feedback, advice, etc you folks can give. 

Thanks, 

Errol Neal





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Time out issue while stopping resource in pacemaker

2014-10-09 Thread Lax
Hi All,

I ran into a timeout issue while failing over from the master to the peer
server. I have a 2-node setup with 2 resources. Though it had been working all
along, this is the first time I have seen this issue.

It fails with the following error: 'error: process_lrm_event: LRM operation
resourceB_stop_0 (40) Timed Out (timeout=2ms)'.
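
If the stop action genuinely needs that long (judging by the output below, the
stop script seems to be waiting on a curl transfer), would simply raising the
stop timeout on the resource be the right fix? For example, assuming pcs is in
use here (and 60s is just a placeholder value):

    pcs resource update resourceB op stop timeout=60s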



Here is the complete log snippet from pacemaker; I'd appreciate your help on this.


Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: Diff: +++ 0.3.1
4e9bfa03cf2fef61843c18e127044d81
Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: -- <cib
admin_epoch="0" epoch="2" num_updates="8"/>
Oct  9 14:57:38 server1 crmd[373]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++
<instance_attributes id="nodes-server1" >
Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++   <nvpair
id="nodes-server1-standby" name="standby" value="true" />
Oct  9 14:57:38 server1 cib[368]:   notice: cib:diff: ++
</instance_attributes>
Oct  9 14:57:38 server1 pengine[372]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Oct  9 14:57:38 server1 pengine[372]:   notice: LogActions: Move   
ClusterIP#011(Started server1 -> 172.28.0.64)
Oct  9 14:57:38 server1 pengine[372]:   notice: LogActions: Move   
resourceB#011(Started server1 -> 172.28.0.64)
Oct  9 14:57:38 server1 pengine[372]:   notice: process_pe_message:
Calculated Transition 11: /var/lib/pacemaker/pengine/pe-input-1710.bz2
Oct  9 14:57:58 server1 lrmd[370]:  warning: child_timeout_callback:
resourceB_stop_0 process (PID 17327) timed out
Oct  9 14:57:58 server1 lrmd[370]:  warning: operation_finished:
resourceB_stop_0:17327 - timed out after 2ms
Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
resourceB_stop_0:17327 [   % Total% Received % Xferd  Average Speed  
TimeTime Time  Current ]
Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
resourceB_stop_0:17327 [  Dload  Upload  
Total   SpentLeft  Speed ]
Oct  9 14:57:58 server1 lrmd[370]:   notice: operation_finished:
resourceB_stop_0:17327 [ #015  0 00 00 0  0  0
--:--:-- --:--:-- --:--:-- 0#015  0 00 00 0  0 
0 --:--:--  0:00:01 --:--:-- 0#015  0 00 00 0  
   0  0 --:--:--  0:00:02 --:--:-- 0#015  0 00 00  
  0  0  0 --:--:--  0:00:03 --:--:-- 0#015  0 00 0 
  0 0  0  0 --:--:--  0:00:04 --:--:-- 0#015  0 00 
   00 0  0  0 --:--:--  0:00:05 -
Oct  9 14:57:58 server1 crmd[373]:error: process_lrm_event: LRM
operation resourceB_stop_0 (40) Timed Out (timeout=2ms)
Oct  9 14:57:58 server1 crmd[373]:  warning: status_from_rc: Action 10
(resourceB_stop_0) on server1 failed (target: 0 vs. rc: 1): Error
Oct  9 14:57:58 server1 crmd[373]:  warning: update_failcount: Updating
failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
time=1412891878)
Oct  9 14:57:58 server1 attrd[371]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-resourceB (INFINITY)
Oct  9 14:57:58 server1 crmd[373]:  warning: update_failcount: Updating
failcount for resourceB on server1 after failed stop: rc=1 (update=INFINITY,
time=1412891878)
Oct  9 14:57:58 server1 crmd[373]:   notice: run_graph: Transition 11
(Complete=2, Pending=0, Fired=0, Skipped=9, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-1710.bz2): Stopped
Oct  9 14:57:58 server1 attrd[371]:   notice: attrd_perform_update: Sent
update 11: fail-count-resourceB=INFINITY


Thanks
Lax




Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

2014-10-09 Thread renayama19661014
Hi Andrew,

Okay!

I will test your patch and let you know the result.

Many thanks!
Hideo Yamauchi.



- Original Message -
 From: Andrew Beekhof and...@beekhof.net
 To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager 
 pacemaker@oss.clusterlabs.org
 Cc: 
 Date: 2014/10/10, Fri 10:47
 Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, 
 g_source_remove fails.
 
 Perfect!
 
 Can you try this:
 
 diff --git a/lib/services/services.c b/lib/services/services.c
 index 8590b56..cb0f0ae 100644
 --- a/lib/services/services.c
 +++ b/lib/services/services.c
 @@ -417,6 +417,7 @@ services_action_kick(const char *name, const char *action, int interval /* ms */
      free(id);
 
      if (op == NULL) {
 +        op->opaque->repeat_timer = 0;
          return FALSE;
      }
 
 @@ -425,6 +426,7 @@ services_action_kick(const char *name, const char *action, int interval /* ms */
      } else {
          if (op->opaque->repeat_timer) {
              g_source_remove(op->opaque->repeat_timer);
 +            op->opaque->repeat_timer = 0;
          }
          recurring_action_timer(op);
          return TRUE;
 @@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void (*action_callback) (svc_actio
          if (dup->pid != 0) {
              if (op->opaque->repeat_timer) {
                  g_source_remove(op->opaque->repeat_timer);
 +                op->opaque->repeat_timer = 0;
              }
              recurring_action_timer(dup);
          }
 
 
 On 10 Oct 2014, at 12:16 pm, renayama19661...@ybb.ne.jp wrote:
 
  Hi Andrew,
 
  Setting up gdb in the Ubuntu environment is not going well yet, so I 
 cannot capture a trace from lrmd.
  Please wait a little longer for that.


  But I made lrmd abort when g_source_remove() in 
 cancel_recurring_action() returned FALSE.
  -
  gboolean
  cancel_recurring_action(svc_action_t * op)
  {
      crm_info("Cancelling operation %s", op->id);

      if (recurring_actions) {
          g_hash_table_remove(recurring_actions, op->id);
      }

      if (op->opaque->repeat_timer) {
          if (g_source_remove(op->opaque->repeat_timer) == FALSE)  {
                  abort();
          }
  (snip)
  ---core
  #0  0x7f30aa60ff79 in __GI_raise (sig=sig@entry=6) at 
 ../nptl/sysdeps/unix/sysv/linux/raise.c:56
 
  56      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
  (gdb) where
  #0  0x7f30aa60ff79 in __GI_raise (sig=sig@entry=6) at 
 ../nptl/sysdeps/unix/sysv/linux/raise.c:56
  #1  0x7f30aa613388 in __GI_abort () at abort.c:89
  #2  0x7f30aadcde77 in crm_abort (file=file@entry=0x7f30aae0152b 
 logging.c, 
      function=function@entry=0x7f30aae028c0 __FUNCTION__.23262 
 crm_glib_handler, line=line@entry=73, 
      assert_condition=assert_condition@entry=0x19d2ad0 Source ID 63 
 was not found when attempting to remove it, do_core=do_core@entry=1, 
      do_fork=optimized out, do_fork@entry=1) at utils.c:1195
  #3  0x7f30aadf5ca7 in crm_glib_handler (log_domain=0x7f30aa35eb6e 
 GLib, flags=optimized out, 
      message=0x19d2ad0 Source ID 63 was not found when attempting to 
 remove it, user_data=optimized out) at logging.c:73
  #4  0x7f30aa320ae1 in g_logv () from 
 /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #5  0x7f30aa320d72 in g_log () from 
 /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #6  0x7f30aa318c5c in g_source_remove () from 
 /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #7  0x7f30aabb2b55 in cancel_recurring_action (op=op@entry=0x19caa90) 
 at services.c:363
  #8  0x7f30aabb2bee in services_action_cancel (name=name@entry=0x19d0530 
 dummy3, action=optimized out, interval=interval@entry=1)
      at services.c:385
  #9  0x0040405a in cancel_op (rsc_id=rsc_id@entry=0x19d0530 
 dummy3, action=action@entry=0x19cec10 monitor, 
 interval=1)
      at lrmd.c:1404
  #10 0x0040614f in process_lrmd_rsc_cancel (client=0x19c8290, id=74, 
 request=0x19ca8a0) at lrmd.c:1468
  #11 process_lrmd_message (client=client@entry=0x19c8290, id=74, 
 request=request@entry=0x19ca8a0) at lrmd.c:1507
  #12 0x00402bac in lrmd_ipc_dispatch (c=0x19c79c0, 
 data=optimized out, size=361) at main.c:148
  #13 0x7f30aa07b4d9 in qb_ipcs_dispatch_connection_request () from 
 /usr/lib/libqb.so.0
  #14 0x7f30aadf209d in gio_read_socket (gio=optimized out, 
 condition=G_IO_IN, data=0x19c68a8) at mainloop.c:437
  #15 0x7f30aa319ce5 in g_main_context_dispatch () from 
 /lib/x86_64-linux-gnu/libglib-2.0.so.0
  ---Type return to continue, or q return to quit---
  #16 0x7f30aa31a048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #17 0x7f30aa31a30a in g_main_loop_run () from 
 /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #18 0x00402774 in main (argc=optimized out, 
 argv=0x7fffcdd90b88) at main.c:344
  -
 
  Best Regards,
  Hideo Yamauchi.
 
 
 
  - Original Message -
  From: renayama19661...@ybb.ne.jp 
 

Re: [Pacemaker] help deciphering output

2014-10-09 Thread Andrew Beekhof

On 9 Oct 2014, at 5:06 pm, Alexandre alxg...@gmail.com wrote:

 I have seen this behavior on several virtualised environments. When the VM backup 
 starts, the VM actually freezes for a (short?) period of time. I guess it then 
 stops responding to the other cluster nodes, thus triggering unexpected 
 failover and/or fencing.

Alas the dlm is _really_ intolerant of any membership blips.
Once a node is marked failed the dlm wants it fenced.  Even if it comes back 
1ms later.

 I have this kind of behavior on a VMware environment using Veeam backup, as well 
 as on Proxmox (I don't know what backup tool it uses).
 That's actually an interesting topic I never thought about raising here.
 How can we avoid that? Increasing timeouts? I am afraid we would have to reach 
 unacceptably high timeout values, and I am not even sure that would fix the problem.
 I think not all VM snapshot strategies would trigger that problem; do you guys have 
 any feedback on which backup/snapshot method best suits corosync 
 clusters?
 
 Regards
 
 On 9 Oct 2014 at 01:24, Alex Samad - Yieldbroker alex.sa...@yieldbroker.com 
 wrote:
 One of my nodes died in a 2 node cluster
 
 I gather something went wrong, and it fenced/killed itself. But I am not sure 
 what happened.
 
 I think maybe around that time the VM backups happened and snap of the VM 
 could have happened
 
 But there is nothing for me to put my finger on
 
 Output from messages around that time
 
 This is on devrp1
 Oct  8 23:31:38 devrp1 corosync[1670]:   [TOTEM ] A processor failed, forming 
 new configuration.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [CMAN  ] quorum lost, blocking 
 activity
 Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] This node is within the 
 non-primary component and will NOT provide any services.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] Members[1]: 1
 Oct  8 23:31:40 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left 
 the membership and a new membership was formed.
 Oct  8 23:31:40 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender 
 r(0) ip(10.172.214.51) ; members(old:2 left:1)
 Oct  8 23:31:40 devrp1 corosync[1670]:   [MAIN  ] Completed service 
 synchronization, ready to provide service.
 Oct  8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 
 424: quorum lost
 Oct  8 23:31:42 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left 
 the membership and a new membership was formed.
 Oct  8 23:31:42 devrp1 corosync[1670]:   [CMAN  ] quorum regained, resuming 
 activity
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] This node is within the 
 primary component and will provide service.
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
 Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
 Oct  8 23:31:42 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender 
 r(0) ip(10.172.214.51) ; members(old:1 left:0)
 Oct  8 23:31:42 devrp1 corosync[1670]:   [MAIN  ] Completed service 
 synchronization, ready to provide service.
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: 
 cman_event_callback: Node devrp2[2] - state is now lost (was member)
 Oct  8 23:31:42 devrp1 crmd[2350]:  warning: reap_dead_nodes: Our DC node 
 (devrp2) left the cluster
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 
 428: quorum acquired
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: 
 cman_event_callback: Node devrp2[2] - state is now member (was lost)
 Oct  8 23:31:42 devrp1 crmd[2350]:   notice: do_state_transition: State 
 transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL 
 origin=reap_dead_nodes ]
 Oct  8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were 
 killed by cman_tool or other application
 Oct  8 23:31:42 devrp1 pacemakerd[2339]:error: pcmk_cpg_dispatch: 
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 stonith-ng[2346]:error: pcmk_cpg_dispatch: 
 Connection to the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 crmd[2350]:error: pcmk_cpg_dispatch: Connection to 
 the CPG API failed: Library error (2)
 Oct  8 23:31:42 devrp1 crmd[2350]:error: crmd_cs_destroy: connection 
 terminated
 Oct  8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
 Oct  8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
 Oct  8 23:31:43 devrp1 attrd[2348]:error: pcmk_cpg_dispatch: Connection 
 to the CPG API failed: Library error (2)
 Oct  8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost 
 connection to Corosync service!
 Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Exiting...
 Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Disconnecting client 
 0x18cf240, pid=2350...
 Oct  8 23:31:43 devrp1 pacemakerd[2339]:error: mcp_cpg_destroy: 
 Connection destroyed
 Oct  8 23:31:43 devrp1 cib[2345]:error: pcmk_cpg_dispatch: Connection to 
 the CPG API 

Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, g_source_remove fails.

2014-10-09 Thread renayama19661014
Hi Andrew,

I applied the three corrections you made and checked the behavior.
Just to be sure, I also guarded every g_source_remove() call in services.c
with an abort().
 * I added the following abort in the four places that call g_source_remove():

          if (g_source_remove(op->opaque->repeat_timer) == FALSE)  {
                  abort();
          }


As a result, the abort still occurred.


The problem does not seem to be fixed by your correction yet.


(gdb) where
#0  0x7fdd923e1f79 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x7fdd923e5388 in __GI_abort () at abort.c:89
#2  0x7fdd92b9fe77 in crm_abort (file=file@entry=0x7fdd92bd352b 
logging.c, 
    function=function@entry=0x7fdd92bd48c0 __FUNCTION__.23262 
crm_glib_handler, line=line@entry=73, 
    assert_condition=assert_condition@entry=0xe20b80 Source ID 40 was not 
found when attempting to remove it, do_core=do_core@entry=1, 
    do_fork=optimized out, do_fork@entry=1) at utils.c:1195
#3  0x7fdd92bc7ca7 in crm_glib_handler (log_domain=0x7fdd92130b6e GLib, 
flags=optimized out, 
    message=0xe20b80 Source ID 40 was not found when attempting to remove it, 
user_data=optimized out) at logging.c:73
#4  0x7fdd920f2ae1 in g_logv () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x7fdd920f2d72 in g_log () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x7fdd920eac5c in g_source_remove () from 
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x7fdd92984b55 in cancel_recurring_action (op=op@entry=0xe19b90) at 
services.c:365
#8  0x7fdd92984bee in services_action_cancel (name=name@entry=0xe1d2d0 
dummy2, action=optimized out, interval=interval@entry=1)
    at services.c:387
#9  0x0040405a in cancel_op (rsc_id=rsc_id@entry=0xe1d2d0 dummy2, 
action=action@entry=0xe10d90 monitor, interval=1)
    at lrmd.c:1404
#10 0x0040614f in process_lrmd_rsc_cancel (client=0xe17290, id=74, 
request=0xe1be10) at lrmd.c:1468
#11 process_lrmd_message (client=client@entry=0xe17290, id=74, 
request=request@entry=0xe1be10) at lrmd.c:1507
#12 0x00402bac in lrmd_ipc_dispatch (c=0xe169c0, data=optimized out, 
size=361) at main.c:148
#13 0x7fdd91e4d4d9 in qb_ipcs_dispatch_connection_request () from 
/usr/lib/libqb.so.0
#14 0x7fdd92bc409d in gio_read_socket (gio=optimized out, 
condition=G_IO_IN, data=0xe158a8) at mainloop.c:437
#15 0x7fdd920ebce5 in g_main_context_dispatch () from 
/lib/x86_64-linux-gnu/libglib-2.0.so.0
---Type return to continue, or q return to quit---
#16 0x7fdd920ec048 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#17 0x7fdd920ec30a in g_main_loop_run () from 
/lib/x86_64-linux-gnu/libglib-2.0.so.0
#18 0x00402774 in main (argc=optimized out, argv=0x7fff22cac268) at 
main.c:344

Best Regards,
Hideo Yamauchi.


- Original Message -
 From: renayama19661...@ybb.ne.jp renayama19661...@ybb.ne.jp
 To: Andrew Beekhof and...@beekhof.net; The Pacemaker cluster resource 
 manager pacemaker@oss.clusterlabs.org
 Cc: 
 Date: 2014/10/10, Fri 10:55
 Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of glib, 
 g_source_remove fails.
 
 Hi Andrew,
 
 Okay!
 
  I will test your patch and let you know the result.
 
 Many thanks!
 Hideo Yamauchi.
 
 
 
 - Original Message -
  From: Andrew Beekhof and...@beekhof.net
  To: renayama19661...@ybb.ne.jp; The Pacemaker cluster resource manager 
 pacemaker@oss.clusterlabs.org
  Cc: 
  Date: 2014/10/10, Fri 10:47
  Subject: Re: [Pacemaker] [Problem]When Pacemaker uses a new version of 
 glib, g_source_remove fails.
 
  Perfect!
 
  Can you try this:
 
  diff --git a/lib/services/services.c b/lib/services/services.c
  index 8590b56..cb0f0ae 100644
  --- a/lib/services/services.c
  +++ b/lib/services/services.c
  @@ -417,6 +417,7 @@ services_action_kick(const char *name, const char *action, int interval /* ms */
       free(id);
  
       if (op == NULL) {
  +        op->opaque->repeat_timer = 0;
           return FALSE;
       }
  
  @@ -425,6 +426,7 @@ services_action_kick(const char *name, const char *action, int interval /* ms */
       } else {
           if (op->opaque->repeat_timer) {
               g_source_remove(op->opaque->repeat_timer);
  +            op->opaque->repeat_timer = 0;
           }
           recurring_action_timer(op);
           return TRUE;
  @@ -459,6 +461,7 @@ handle_duplicate_recurring(svc_action_t * op, void (*action_callback) (svc_actio
           if (dup->pid != 0) {
               if (op->opaque->repeat_timer) {
                   g_source_remove(op->opaque->repeat_timer);
  +                op->opaque->repeat_timer = 0;
               }
               recurring_action_timer(dup);
           }
 
 
  On 10 Oct 2014, at 12:16 pm, renayama19661...@ybb.ne.jp wrote:
 
   Hi Andrew,
 
    Setting up gdb in the Ubuntu environment is not going well yet, so I 
  cannot capture a trace from lrmd.
    Please wait a little longer for that.
 
 
   But.. I let lrmd