[tickets] [opensaf:tickets] #1127 IMM: Failure to send to completed to PBE can cause cluster restart.

Anders Bjornerstedt Tue, 23 Sep 2014 05:04:15 -0700

- **summary**: IMM: Detached PBE just before a ccb-apply can cause cluster 
restart. --> IMM: Failure to send to completed to PBE can cause cluster restart.
- Description has changed:


Diff:

~~~~

--- old
+++ new
@@ -3,12 +3,26 @@
  http://sourceforge.net/p/opensaf/tickets/1096/
 
 The PBE detaches after having received the ccb-operations for a ccb but before
-having received the completed-callback. In this case there a re no OIs so
+having received the completed-callback. In this case there are no OIs so
 the completed-callback to PBE is to be sent directly when handling the apply
 downcall from the user.
 
-The problem is that if the PBE is detached here, the IMMNMDs will abort,
-causing a cluster restart. 
+Detachment itself (of the PBE or any imm client) arrives over fevs, so that
+is actually not the problem. The client node will only be removed in conjuction
+with clearing of the implementer in ImmModel. Thus the return from ImmModel of 
+a non-null pbeConn means the client-node must exist. This is an "invariant"
+i.e. an assertable condition. 
 
-The IMMNDs must not abort in this case, they should simply let the apply be
-handled by the PBE restart/recovery. 
+The problem that *does* exist in immnd_evt_proc_ccb_apply is that the send 
+itself over MDS may fail, due to a race with a PBE going down. In that case
+the code in immnd_evt_proc_ccb_apply will explititly abort, which will happen
+on all nodes, which will result in a cluster restart.
+
+It is this abort() on send failure which is wrong. The other abort on client
+node not found should be changed to an assert.
+
+So the problem that needs to be fixed is to remove the abort on send failure
+and instead "drop" the ccb apply to the recovery case, lettting the apply
+result be resolved by the PBE restart/recovery.
+Indeed, it is concewivable that the PBE may have received the completed&commit
+message even if the sending IMMND receives an error from MDS on the send. 

~~~~




---

** [tickets:#1127] IMM: Failure to send to completed to PBE can cause cluster 
restart.**

**Status:** accepted
**Milestone:** 4.3.3
**Created:** Tue Sep 23, 2014 07:58 AM UTC by Anders Bjornerstedt
**Last Updated:** Tue Sep 23, 2014 07:58 AM UTC
**Owner:** Anders Bjornerstedt

This ticket is similar to #1096:

 http://sourceforge.net/p/opensaf/tickets/1096/

The PBE detaches after having received the ccb-operations for a ccb but before
having received the completed-callback. In this case there are no OIs so
the completed-callback to PBE is to be sent directly when handling the apply
downcall from the user.

Detachment itself (of the PBE or any imm client) arrives over fevs, so that
is actually not the problem. The client node will only be removed in conjunction
with clearing of the implementer in ImmModel. Thus the return from ImmModel of
a non-null pbeConn means the client node must exist. This is an "invariant",
i.e. an assertable condition.

The problem that *does* exist in immnd_evt_proc_ccb_apply is that the send
itself over MDS may fail, due to a race with a PBE going down. In that case
the code in immnd_evt_proc_ccb_apply explicitly aborts. Because this happens
on all nodes, the result is a cluster restart.

It is this abort() on send failure which is wrong. The other abort on client
node not found should be changed to an assert.

So the fix is to remove the abort on send failure and instead "drop" the
ccb apply to the recovery case, letting the apply result be resolved by the
PBE restart/recovery.
Indeed, it is conceivable that the PBE may have received the completed & commit
message even if the sending IMMND receives an error from MDS on the send.
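
A minimal sketch of the intended control flow, not the actual IMMND code. All
names here (`client_node_t`, `send_fn_t`, `send_completed_to_pbe`, the enum
values) are hypothetical stand-ins invented for illustration; the real logic
lives in immnd_evt_proc_ccb_apply and uses MDS for the send. The sketch only
shows the two decisions described above: the missing client node is treated as
an asserted invariant, and a failed send defers to PBE restart/recovery
instead of calling abort().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real OpenSAF types or functions. */
typedef enum { SEND_OK, SEND_FAILED } send_rc_t;
typedef enum { CCB_COMPLETED_SENT, CCB_DEFERRED_TO_PBE_RECOVERY } apply_outcome_t;

typedef struct { unsigned conn; } client_node_t;

/* Abstracts the MDS send so that both outcomes can be simulated. */
typedef send_rc_t (*send_fn_t)(const client_node_t *pbe, unsigned ccb_id);

/* Simulated send that succeeds. */
static send_rc_t send_succeeds(const client_node_t *pbe, unsigned ccb_id)
{
    (void)pbe; (void)ccb_id;
    return SEND_OK;
}

/* Simulated send that fails, e.g. because the PBE is going down. */
static send_rc_t send_fails(const client_node_t *pbe, unsigned ccb_id)
{
    (void)pbe; (void)ccb_id;
    return SEND_FAILED;
}

static apply_outcome_t send_completed_to_pbe(const client_node_t *pbe_client,
                                             unsigned ccb_id, send_fn_t send)
{
    /* Invariant: a non-null pbeConn from the model implies the client node
     * exists, so this is an assert rather than a runtime abort path. */
    assert(pbe_client != NULL);

    if (send(pbe_client, ccb_id) != SEND_OK) {
        /* Do not abort. The PBE may be going down, or may even have received
         * the message despite the error; let PBE restart/recovery resolve
         * the outcome of the apply. */
        return CCB_DEFERRED_TO_PBE_RECOVERY;
    }
    return CCB_COMPLETED_SENT;
}
```

Calling `send_completed_to_pbe` with `send_fails` returns
`CCB_DEFERRED_TO_PBE_RECOVERY` on the calling node and leaves the process
running, which is the property that avoids the cluster restart.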


---



