Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-07 Thread Attila Megyeri
One more thing to add. I did an apt-get upgrade on one of the nodes, and then 
restarted the node. It resulted in this state on all other nodes again...

 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Friday, March 07, 2014 7:54 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
   Hello,
  
   We have a strange issue with Corosync/Pacemaker.
    From time to time, something unexpected happens and suddenly the
    crm_mon output remains static.
    When I check the CPU usage, I see that one of the cores is at 100%,
    but I cannot match that usage to either corosync or any of the
    pacemaker processes.
  
    In such a case, this high CPU usage is happening on all 7 nodes.
    I have to manually go to each node, stop pacemaker, restart corosync,
    then start pacemaker. Stopping pacemaker and corosync does not work in
    most cases; usually a kill -9 is needed.
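   For reference, that per-node recovery sequence amounts to roughly the
   following (a sketch only: the service names assume the stock Ubuntu trusty
   init scripts, and the pkill fallbacks stand in for the manual kill -9 when
   a clean stop hangs):

       # stop pacemaker, then restart corosync, then start pacemaker again
       service pacemaker stop || pkill -9 -f pacemakerd
       service corosync stop  || pkill -9 corosync
       service corosync start
       service pacemaker start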
  
   Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
  
    Using udpu as transport, two rings on Gigabit Ethernet, rrp_mode passive.
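
   For reference, a corosync.conf for this kind of setup (udpu transport, two
   rings, rrp_mode passive) looks roughly like the following; the addresses
   and the node entry below are placeholders, not the actual cluster values:

       totem {
               version: 2
               transport: udpu
               rrp_mode: passive
               interface {
                       ringnumber: 0
                       bindnetaddr: 10.0.0.0
               }
               interface {
                       ringnumber: 1
                       bindnetaddr: 10.0.1.0
               }
       }

       nodelist {
               node {
                       ring0_addr: 10.0.0.1
                       ring1_addr: 10.0.1.1
               }
               # ... one node entry per cluster member ...
       }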
  
   Logs are usually flooded with CPG related messages, such as:
  
    Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
    Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
    Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
    Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  
   OR
  
    Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
    Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
    Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly confused 
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
  
 
 As I wrote, I use Ubuntu trusty; the exact package versions are:
 
 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2
 
 There are no updates available. The only option would be to build from
 source, but that would be very difficult to maintain and I'm not sure it
 would get rid of this issue.
 
 What do you recommend?
 
 
  
    htop shows something like this (sorted by TIME+, descending):

      1  [100.0%]         Tasks: 59, 4 thr; 2 running
      2  [|  0.7%]        Load average: 1.00 0.99 1.02
      Mem[  165/994MB]    Uptime: 1 day, 10:22:03
      Swp[    0/509MB]

      PID USER       PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
      921 root        20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
     1277 snmp        20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
     1311 hacluster   20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
     1312 root        20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
     1611 root        -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
     1316 hacluster   20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
     1313 root        20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
     1314 hacluster   20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
     1309 root        20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
     1250 root        20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
     1315 hacluster   20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
     1252 root        20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
     1835 ntp         20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
      899 root        20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
     1642 root        20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
     4374 kamailio    20   0  

Re: [Pacemaker] Why is a node on which no failure has occurred reported as lost?

2014-03-07 Thread Kristoffer Grönlund
On Fri, 07 Mar 2014 10:30:13 +0300
Vladislav Bogdanov bub...@hoster-ok.com wrote:

  Andrew, current git master (ee094a2) almost works; the only issue
  is that crm_diff calculates an incorrect diff digest. If I replace the
  digest in the diff by hand with what the cib calculates as expected,
  it applies correctly. Otherwise: -206.
  
  More details?  
 
 Hmmm...
 It seems to be crmsh-specific; I cannot reproduce it with pure-XML editing.
 Kristoffer, does 
 http://hg.savannah.gnu.org/hgweb/crmsh/rev/c42d9361a310 address this?

No, that commit fixes an issue when importing the CIB into crmsh; the
diff calculation happens when going the other way. It seems strange
that crmsh would be causing such a problem, since all it does is call
crm_diff to generate the actual diff, so any problem with an incorrect
digest should be coming from crm_diff itself.
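
For anyone wanting to take crmsh out of the picture, the same code path can
be exercised directly from the shell (the file names below are just
placeholders):

    # dump the current CIB and edit a copy of it by hand
    cibadmin --query > cib-orig.xml
    cp cib-orig.xml cib-new.xml
    $EDITOR cib-new.xml

    # generate the patch the way crmsh does, then try to apply it
    crm_diff --original cib-orig.xml --new cib-new.xml > patch.xml
    cibadmin --patch --xml-file patch.xml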

This doesn't look like a known issue to me, and it doesn't sound like
the same problem I have been investigating. Could you file a
bug at https://savannah.nongnu.org/bugs/?group=crmsh with some more
details?

Thank you,

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker with cman and dbrd when primary node panics or poweroff

2014-03-07 Thread Gianluca Cecchi
So I fixed the problem regarding the hostname in drbd.conf and in the node
name from the cluster's point of view.
I also configured and verified the fence_vmware agent and enabled stonith.
I changed the drbd resource configuration to:

resource ovirt {
    disk {
        disk-flushes no;
        md-flushes no;
        fencing resource-and-stonith;
    }
    device minor 0;
    disk /dev/sdb;
    syncer {
        rate 30M;
        verify-alg md5;
    }
    handlers {
        fence-peer /usr/lib/drbd/crm-fence-peer.sh;
        after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
    }
}
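
As for the fence_vmware agent mentioned above, the stonith piece would
typically be a primitive along these lines (crm shell syntax, assuming the
fence_vmware_soap variant; every value below is a placeholder):

    crm configure primitive fence-vmware stonith:fence_vmware_soap \
        params ipaddr=vcenter.example.com login=fenceuser passwd=secret ssl=1 \
        pcmk_host_map="ovirteng01.localdomain.local:ovirteng01;ovirteng02.localdomain.local:ovirteng02" \
        op monitor interval=60s
    crm configure property stonith-enabled=true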

Put in cluster.conf:
<cman expected_votes="1" two_node="1"/>
and restarted pacemaker and cman on the nodes.

The service is active on ovirteng01.
I provoke a power-off of ovirteng01. The fencing agent works OK on
ovirteng02 and reboots ovirteng01.
I stop the boot of ovirteng01 at the grub prompt to simulate a problem in
boot (for example the system dropping to console mode due to a filesystem
problem).
In the meantime ovirteng02 becomes master of the drbd resource, but it
doesn't start the group.
This is in messages:

Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: PingAck did not arrive in time.
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: peer( Primary ->
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate ->
DUnknown )
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: asender terminated
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: Terminating drbd_a_ovirt
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: Connection closed
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: conn( NetworkFailure ->
Unconnected )
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: receiver terminated
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: Restarting receiver thread
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: receiver (re)started
Mar  8 01:08:00 ovirteng02 kernel: drbd ovirt: conn( Unconnected ->
WFConnection )
Mar  8 01:08:02 ovirteng02 corosync[12908]:   [TOTEM ] A processor
failed, forming new configuration.
Mar  8 01:08:04 ovirteng02 corosync[12908]:   [QUORUM] Members[1]: 2
Mar  8 01:08:04 ovirteng02 corosync[12908]:   [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Mar  8 01:08:04 ovirteng02 corosync[12908]:   [CPG   ] chosen
downlist: sender r(0) ip(192.168.33.46) ; members(old:2 left:1)
Mar  8 01:08:04 ovirteng02 corosync[12908]:   [MAIN  ] Completed
service synchronization, ready to provide service.
Mar  8 01:08:04 ovirteng02 kernel: dlm: closing connection to node 1
Mar  8 01:08:04 ovirteng02 crmd[13168]:   notice:
crm_update_peer_state: cman_event_callback: Node
ovirteng01.localdomain.local[1] - state is now lost (was member)
Mar  8 01:08:04 ovirteng02 crmd[13168]:  warning: reap_dead_nodes: Our
DC node (ovirteng01.localdomain.local) left the cluster
Mar  8 01:08:04 ovirteng02 crmd[13168]:   notice: do_state_transition:
State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Mar  8 01:08:04 ovirteng02 crmd[13168]:   notice: do_state_transition:
State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Mar  8 01:08:04 ovirteng02 fenced[12962]: fencing node
ovirteng01.localdomain.local
Mar  8 01:08:04 ovirteng02 attrd[13166]:   notice:
attrd_local_callback: Sending full refresh (origin=crmd)
Mar  8 01:08:04 ovirteng02 attrd[13166]:   notice:
attrd_trigger_update: Sending flush op to all hosts for:
master-OvirtData (1)
Mar  8 01:08:04 ovirteng02 attrd[13166]:   notice:
attrd_trigger_update: Sending flush op to all hosts for:
probe_complete (true)
Mar  8 01:08:04 ovirteng02 fence_pcmk[13733]: Requesting Pacemaker
fence ovirteng01.localdomain.local (reset)
Mar  8 01:08:04 ovirteng02 stonith_admin[13734]:   notice:
crm_log_args: Invoked: stonith_admin --reboot
ovirteng01.localdomain.local --tolerance 5s --tag cman
Mar  8 01:08:04 ovirteng02 stonith-ng[13164]:   notice:
handle_request: Client stonith_admin.cman.13734.5528351f wants to
fence (reboot) 'ovirteng01.localdomain.local' with device '(any)'
Mar  8 01:08:04 ovirteng02 stonith-ng[13164]:   notice:
initiate_remote_stonith_op: Initiating remote operation reboot for
ovirteng01.localdomain.local: 1e70a341-efbf-470a-bcaa-886a8acfa9d1 (0)
Mar  8 01:08:04 ovirteng02 stonith-ng[13164]:   notice:
can_fence_host_with_device: Fencing can fence
ovirteng01.localdomain.local (aka. 'ovirteng01'): static-list
Mar  8 01:08:04 ovirteng02 stonith-ng[13164]:   notice:
can_fence_host_with_device: Fencing can fence
ovirteng01.localdomain.local (aka. 'ovirteng01'): static-list
Mar  8 01:08:05 ovirteng02 pengine[13167]:   notice: unpack_config: On
loss of CCM Quorum: Ignore
Mar  8 01:08:05 ovirteng02 pengine[13167]:  warning: pe_fence_node:
Node ovirteng01.localdomain.local will be fenced because the node is
no longer part of the cluster
Mar  8 01:08:05 ovirteng02 pengine[13167]:  warning:
determine_online_status: Node ovirteng01.localdomain.local is unclean
Mar  8 01:08:05 ovirteng02 pengine[13167]:  warning: custom_action:
Action OvirtData:0_demote_0 on ovirteng01.localdomain.local is
unrunnable (offline)
Mar  8 01:08:05 ovirteng02