Initial version
Attachments:
- [2918.diff.gz](https://sourceforge.net/p/opensaf/tickets/_discuss/thread/cae26fce/4510/attachment/2918.diff.gz) (9.2 kB; application/x-gzip)
---
**[tickets:#2918] amf: delay node failover of nodes that are separated from
the main network partition**
**Status:** accepted
**Milestone:** 5.18.12
**Created:** Fri Aug 24, 2018 06:40 AM UTC by Gary Lee
**Last Updated:** Thu Oct 04, 2018 09:19 AM UTC
**Owner:** Gary Lee
Tickets [#64] and [#2795] added support to prevent multiple active controllers
in a split network scenario. However, nodes residing in the smaller network
partitions can remain running. Meanwhile the active SC residing in the largest
partition may fail over assignments from the unreachable nodes to other
reachable nodes, causing conflicts when the partitions are merged.
The original proposal involved two parts, a CLM part and an AMF part. CLM would
not announce a node has left the cluster until the fencing of the node has
completed successfully. However, some users rely on timely CLM notifications to
send out node related events and alarms. Thus the proposal has been changed to
be done in AMF only.
AMF should not perform a node failover until the node has been fenced.
When using remote fencing, this means that the fencing API has reported that
the fencing was completed. When remote fencing is disabled, we need to wait for
at least IMMSV_SC_ABSENCE_ALLOWED seconds (the configuration in immd.conf)
before considering the fencing to be completed.
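The rule above can be sketched as a small predicate. This is only an illustration, not the code in the attached patch: `NodeFenceState`, `fencing_complete` and the field names are hypothetical, and `sc_absence_allowed` stands in for the IMMSV_SC_ABSENCE_ALLOWED value.

```python
from dataclasses import dataclass


@dataclass
class NodeFenceState:
    """Hypothetical per-node bookkeeping; not taken from the AMF sources."""
    lost_at: float                        # time (seconds) when the node became unreachable
    remote_fencing_enabled: bool          # whether remote fencing is configured
    remote_fence_reported_ok: bool = False  # set once the fencing API reports success


def fencing_complete(state: NodeFenceState, now: float,
                     sc_absence_allowed: float) -> bool:
    """Return True when node failover may proceed for this node."""
    if state.remote_fencing_enabled:
        # With remote fencing, only the fencing API result counts.
        return state.remote_fence_reported_ok
    # Without remote fencing, wait at least sc_absence_allowed seconds.
    return now - state.lost_at >= sc_absence_allowed
```

With remote fencing disabled, the elapsed time alone decides; with remote fencing enabled, the elapsed time is ignored in this sketch and only the reported result matters.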
If MDS connectivity is re-established while waiting, AMF can wait a few seconds
for a node_up message (with leds_set == false) indicating that the node has
already been rebooted. Otherwise, AMF can send a message to the node asking it
to reboot itself. When AMF sees that the MDS connectivity is lost again, it can
consider the fencing to be complete without the need to wait the full
IMMSV_SC_ABSENCE_ALLOWED time.
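The reconnect handling described here can be read as a small state machine. The sketch below is one possible reading, not the implementation: the state and event names are hypothetical, and two transitions are assumptions not stated in the ticket (a node_up with leds_set == false is treated as fencing complete, and losing MDS again before a reboot request was sent falls back to the normal wait).

```python
from enum import Enum, auto


class State(Enum):
    WAITING = auto()           # node unreachable, waiting for fencing
    GRACE = auto()             # MDS is back, waiting a few seconds for node_up
    REBOOT_REQUESTED = auto()  # AMF asked the node to reboot itself
    FENCED = auto()            # fencing considered complete


class Event(Enum):
    MDS_UP = auto()              # MDS connectivity re-established
    NODE_UP_LEDS_UNSET = auto()  # node_up received with leds_set == false
    GRACE_EXPIRED = auto()       # the short wait for node_up ran out
    MDS_DOWN = auto()            # MDS connectivity lost again


def step(state: State, event: Event):
    """Return (next_state, action); action is a string or None."""
    if state is State.WAITING and event is Event.MDS_UP:
        return State.GRACE, None
    if state is State.GRACE and event is Event.NODE_UP_LEDS_UNSET:
        # Assumption: an already rebooted node counts as fenced.
        return State.FENCED, None
    if state is State.GRACE and event is Event.GRACE_EXPIRED:
        return State.REBOOT_REQUESTED, "send_reboot_request"
    if state is State.GRACE and event is Event.MDS_DOWN:
        # Assumption: no reboot was requested yet, so keep waiting as before.
        return State.WAITING, None
    if state is State.REBOOT_REQUESTED and event is Event.MDS_DOWN:
        # Node went away after the reboot request: no need for the full wait.
        return State.FENCED, None
    return state, None  # all other events leave the state unchanged
```

Feeding the events MDS_UP, GRACE_EXPIRED, MDS_DOWN in that order walks a node from WAITING to FENCED without waiting the full IMMSV_SC_ABSENCE_ALLOWED time.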
Waiting up to IMMSV_SC_ABSENCE_ALLOWED seconds will affect availability, so
this option must be configurable via IMM and take effect without a restart. It
is up to the user to turn it on if node disturbances are planned or expected in
the environment due to poor quality links between the nodes.
Additionally, we should allow the user to set this 'node failover' timer to a
smaller value than IMMSV_SC_ABSENCE_ALLOWED, with the understanding that this
introduces the risk of duplicate assignments.
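As a rough sketch of how the configured value could be resolved, assuming a hypothetical helper `effective_failover_delay` (the parameter names are illustrative and no IMM attribute name is implied):

```python
from typing import Optional, Tuple


def effective_failover_delay(enabled: bool,
                             configured_delay: Optional[float],
                             sc_absence_allowed: float) -> Tuple[float, bool]:
    """Return (delay_seconds, duplicate_assignment_risk).

    enabled            -- whether the user has turned the feature on
    configured_delay   -- user supplied 'node failover' timer, or None if unset
    sc_absence_allowed -- stands in for IMMSV_SC_ABSENCE_ALLOWED
    """
    if not enabled:
        # Feature off: fail over immediately, as before this ticket.
        return 0.0, False
    if configured_delay is None:
        # Assumption: with no explicit timer, fall back to the full wait.
        return float(sc_absence_allowed), False
    if configured_delay < 0:
        raise ValueError("configured_delay must not be negative")
    # A shorter timer is allowed, but flags the risk of duplicate assignments.
    return float(configured_delay), configured_delay < sc_absence_allowed
```

The second element of the result could be used to log a warning when the user picks a timer shorter than IMMSV_SC_ABSENCE_ALLOWED.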