On 10/02/2015 01:47 PM, Pritam Kharat wrote:
> Hi,
>
> I have set up an ACTIVE/PASSIVE HA
>
> *Issue 1)*
>
> *corosync.conf* file is
>
> # Please read the openais.conf.5 manual page
> totem {
>         version: 2
>
>         # How long before declaring a token lost (ms)
>         token: 10000
>
>         # How many token retransmits before forming a new configuration
>         token_retransmits_before_loss_const: 20
>
>         # How long to wait for join messages in the membership protocol (ms)
>         join: 10000
>
>         # How long to wait for consensus to be achieved before starting
>         # a new round of membership configuration (ms)
>         consensus: 12000
>
>         # Turn off the virtual synchrony filter
>         vsftype: none
>
>         # Number of messages that may be sent by one processor on
>         # receipt of the token
>         max_messages: 20
>
>         # Limit generated nodeids to 31-bits (positive signed integers)
>         clear_node_high_bit: yes
>
>         # Disable encryption
>         secauth: off
>
>         # How many threads to use for encryption/decryption
>         threads: 0
>
>         # Optionally assign a fixed node id (integer)
>         # nodeid: 1234
>
>         # This specifies the mode of redundant ring, which may be none,
>         # active, or passive.
>         rrp_mode: none
>
>         interface {
>                 # The following values need to be set based on your
>                 # environment
>                 ringnumber: 0
>                 bindnetaddr: 192.168.101.0
>                 mcastport: 5405
>         }
>
>         transport: udpu
> }
>
> amf {
>         mode: disabled
> }
>
> quorum {
>         # Quorum for the Pacemaker Cluster Resource Manager
>         provider: corosync_votequorum
>         expected_votes: 1
If you're using a recent version of corosync, use "two_node: 1" instead
of "expected_votes: 1", and get rid of "no-quorum-policy: ignore" in the
pacemaker cluster options.

> }
>
> nodelist {
>         node {
>                 ring0_addr: 192.168.101.73
>         }
>
>         node {
>                 ring0_addr: 192.168.101.74
>         }
> }
>
> aisexec {
>         user: root
>         group: root
> }
>
> logging {
>         fileline: off
>         to_stderr: yes
>         to_logfile: yes
>         to_syslog: yes
>         syslog_facility: daemon
>         logfile: /var/log/corosync/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>                 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>         }
> }
>
> And I have added 5 resources - 1 is a VIP and 4 are upstart jobs.
> Node names are configured as -> sc-node-1 (ACTIVE) and sc-node-2 (PASSIVE)
> Resources are running on the ACTIVE node.
>
> Default cluster properties -
>
> <cluster_property_set id="cib-bootstrap-options">
>   <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>     value="1.1.10-42f2063"/>
>   <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>     name="cluster-infrastructure" value="corosync"/>
>   <nvpair name="no-quorum-policy" value="ignore"
>     id="cib-bootstrap-options-no-quorum-policy"/>
>   <nvpair name="stonith-enabled" value="false"
>     id="cib-bootstrap-options-stonith-enabled"/>
>   <nvpair name="cluster-recheck-interval" value="3min"
>     id="cib-bootstrap-options-cluster-recheck-interval"/>
>   <nvpair name="default-action-timeout" value="120s"
>     id="cib-bootstrap-options-default-action-timeout"/>
> </cluster_property_set>
>
> But sometimes after 2-3 migrations from ACTIVE to STANDBY and then from
> STANDBY back to ACTIVE, both nodes become OFFLINE and Current DC becomes
> None. I have disabled the stonith property and even quorum is ignored.

Disabling stonith isn't helping you. The cluster needs stonith to
recover from difficult situations, so it's easier to get into weird
states like this without it.
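For reference, a minimal sketch of what the quorum section could look
like on a recent corosync with the change suggested above (votequorum
computes the vote count itself in two-node mode, so expected_votes is
not needed; note that "two_node: 1" also enables wait_for_all by
default):

```
quorum {
        # Quorum for the Pacemaker Cluster Resource Manager
        provider: corosync_votequorum

        # Two-node mode: a single surviving node keeps quorum,
        # instead of pretending there is only one expected vote.
        # Implies wait_for_all by default.
        two_node: 1
}
```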
> root@sc-node-2:/usr/lib/python2.7/dist-packages/sc# crm status
> Last updated: Sat Oct 3 00:01:40 2015
> Last change: Fri Oct 2 23:38:28 2015 via crm_resource on sc-node-1
> Stack: corosync
> Current DC: NONE
> 2 Nodes configured
> 5 Resources configured
>
> OFFLINE: [ sc-node-1 sc-node-2 ]
>
> What is going wrong here? What is the reason for Current DC suddenly
> becoming NONE? Is corosync.conf okay? Are the default cluster
> properties fine? Help will be appreciated.

I'd recommend seeing how the problem behaves with stonith enabled, but
in any case you'll need to dive into the logs to figure out what starts
the chain of events.

> *Issue 2)*
>
> Command used to add an upstart job is
>
> crm configure primitive service upstart:service meta allow-migrate=true \
>   migration-threshold=5 failure-timeout=30s \
>   op monitor interval=15s timeout=60s
>
> But still sometimes I see the fail count going to INFINITY. Why? How
> can we avoid it? The resource should have migrated as soon as it
> reached the migration threshold.
>
> * Node sc-node-2:
>    service: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'
>    service1: migration-threshold=5 fail-count=1000000 last-failure='Fri Oct 2 23:38:53 2015'
>
> Failed actions:
>     service_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>         last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms
>     ): unknown error
>     service1_start_0 (node=sc-node-2, call=-1, rc=1, status=Timed Out,
>         last-rc-change=Fri Oct 2 23:38:53 2015, queued=0ms, exec=0ms

migration-threshold is used for monitor failures, not (by default) start
or stop failures. This is a start failure, which (by default) makes the
fail-count go to infinity. The rationale is that a monitor failure
indicates some sort of temporary error, but failing to start could well
mean that something is wrong with the installation or configuration.
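As a side note, once the underlying start failure is fixed, the INFINITY
fail-count has to expire (you have failure-timeout=30s set) or be
cleared before the resource will be allowed back on that node. A sketch
with crmsh, using the resource and node names from your output (exact
subcommand syntax may vary slightly between crmsh versions):

```shell
# Show the current fail count for "service" on sc-node-2
crm resource failcount service show sc-node-2

# Clear the fail count and the failed-operation history, so the
# cluster will consider starting "service" on sc-node-2 again
crm resource cleanup service sc-node-2
```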
You can tell the cluster to apply migration-threshold to start failures
too, by setting the start-failure-is-fatal=false cluster option (e.g.,
with crmsh: crm configure property start-failure-is-fatal=false).

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org