[tickets] [opensaf:tickets] #2115 amfnd: loses sync with director if PG track action msg is sent during SC recovery
- **status**: review --> fixed
- **Comment**:

changeset: 8222:0fbc9742846a
branch: opensaf-5.0.x
tag: tip
parent: 8217:810f6dde01cb
user: Gary Lee
date: Fri Oct 14 17:41:00 2016 +1100
summary: amfnd: queue PG track action msgs [#2115]

changeset: 8221:362d30af4503
branch: opensaf-5.1.x
parent: 8219:d4c05e5a66e5
user: Gary Lee
date: Fri Oct 14 17:23:57 2016 +1100
summary: amfnd: queue PG track action msgs [#2115]

changeset: 8220:ae6805b74777
parent: 8218:3565b16a8f88
user: Gary Lee
date: Fri Oct 14 17:23:37 2016 +1100
summary: amfnd: queue PG track action msgs [#2115]

---

** [tickets:#2115] amfnd: loses sync with director if PG track action msg is sent during SC recovery**

**Status:** fixed
**Milestone:** 5.1.1
**Created:** Wed Oct 12, 2016 10:30 PM UTC by Gary Lee
**Last Updated:** Wed Oct 12, 2016 10:50 PM UTC
**Owner:** Gary Lee

After SC absence, active amfd will reject messages from 'veteran' amfnds until its local amfnd has started. During this period, if a PG track action msg is sent and rejected by amfd, it will cause the sending amfnd to lose sync with amfd. So we should also queue this message to be re-sent.

Oct 11 18:06:01 SC-1 osafamfd[12545]: WA avd_msg_sanity_chk: invalid msg id 3, msg type 8, from 2030f should be 2
Oct 11 18:06:01 SC-1 osafamfd[12545]: WA avd_msg_sanity_chk: invalid msg id 4, msg type 8, from 2030f should be 2
Oct 11 18:06:10 SC-1 osafamfd[12545]: WA avd_msg_sanity_chk: invalid msg id 5, msg type 6, from 2030f should be 2
Oct 11 18:06:20 SC-1 osafamfd[12545]: WA avd_msg_sanity_chk: invalid msg id 6, msg type 8, from 2030f should be 5
Oct 11 18:06:20 SC-1 osafamfd[12545]: WA avd_msg_sanity_chk: invalid msg id 7, msg type 8, from 2030f should be 5
Oct 11 18:06:20 SC-1 osafamfd[12545]: WA avd_msg_sanity_chk: invalid msg id 8, msg type 8, from 2030f should be 5
Oct 11 18:06:20 SC-1 osafamfd[12545]: WA avd_msg_sanity_chk: invalid msg id 9, msg type 8, from 2030f should be 5

After set_leds event is received by amfnd, it can be seen that msgs with id 3 and 4 are retransmitted, but 5 is not received by amfd.

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

--
Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
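The failure mode in #2115 can be illustrated with a minimal sketch. This is not the amfnd code from the changesets; `Director`, `NodeDirector`, the `ready` flag and the payload strings are hypothetical stand-ins. It assumes only what the ticket states: the director rejects messages until it is ready, it expects consecutive message ids, and a rejected message must be queued for re-sending or the ids drift apart as in the log.

```python
from collections import deque


class Director:
    """Toy stand-in for amfd: accepts only the next expected message id."""

    def __init__(self):
        self.ready = False      # False while the local node director is still starting
        self.expected_id = 1
        self.accepted = []

    def receive(self, msg_id, payload):
        if not self.ready or msg_id != self.expected_id:
            return False        # rejected; the sender has to try again later
        self.accepted.append((msg_id, payload))
        self.expected_id += 1
        return True


class NodeDirector:
    """Toy stand-in for amfnd: numbers messages and queues each one until accepted."""

    def __init__(self, director):
        self.director = director
        self.next_id = 1
        self.pending = deque()

    def send(self, payload):
        self.pending.append((self.next_id, payload))
        self.next_id += 1
        self.flush()

    def flush(self):
        # Re-send in order and stop at the first rejection, so ids never arrive out of sequence.
        while self.pending and self.director.receive(*self.pending[0]):
            self.pending.popleft()


amfd = Director()
amfnd = NodeDirector(amfd)
amfnd.send("su_oper_state")     # hypothetical payload names
amfnd.send("pg_track_action")   # queued as well, instead of being lost on rejection
amfd.ready = True               # director finished recovery
amfnd.flush()                   # both messages are now delivered in id order
```

If the second message were sent without being queued, the director would keep expecting id 2 while the node director moved on to higher ids, which is the pattern the `avd_msg_sanity_chk` warnings show.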
[tickets] [opensaf:tickets] #1828 AMF: Both director and node director hang if immnd dies in new SC reallocation scenario
Hi Ritu Raj,

The problem in #1828 happened in this scenario:

- Stop both Active/Standby controllers
- The Quiesced controller will become the new Active
- The services hang if IMMND is dead during new Active controller initialization.

I simulated IMMND's death by killing the IMMND process, which I did this way:

diff --git a/osaf/services/saf/cpsv/cpnd/cpnd_init.c b/osaf/services/saf/cpsv/cpnd/cpnd_init.c
--- a/osaf/services/saf/cpsv/cpnd/cpnd_init.c
+++ b/osaf/services/saf/cpsv/cpnd/cpnd_init.c
@@ -576,7 +576,7 @@ void cpnd_main_process(CPND_CB *cb)
 SaClmCallbacksT gen_cbk;
 LOG_NO("Bad CLM handle. Reinitializing.");
-
+ if (system("pkill osafimmnd"));
 // @TODO figure out why CLMS is not 'instantly' ready
 usleep(10);

Thanks,
Minh

---

** [tickets:#1828] AMF: Both director and node director hang if immnd dies in new SC reallocation scenario**

**Status:** fixed
**Milestone:** 5.1.FC
**Created:** Mon May 16, 2016 05:29 AM UTC by Minh Hon Chau
**Last Updated:** Thu Oct 13, 2016 10:04 AM UTC
**Owner:** nobody

Enable cloud & roaming feature. If both Active and Standby SC are stopped at the same time, new controllers will be allocated to be Active/Standby. During this new Active role allocation, if immnd dies there will be circular dependencies in the controller (the one that is going to be Active):

- clmd cannot use IMM services since immnd is dead
- immnd needs to be restarted by amfnd
- amfnd is hanging since amfnd is calling CLM services
- amfd is also hanging since amfd is calling CLM and NTF services
- ntfd is hanging due to logd's dependencies on IMM

The problem can be solved if amfd/amfnd are not blocked in the main thread, so that immnd can be restarted and the controller will not be rebooted due to heartbeat timeout.
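The circular wait described in #1828 can be made explicit with a small sketch. The `waits_on` table below is only a reading of the bullet list in the ticket (who is blocked on whom), and `find_cycle` is a generic depth-first search written for illustration; neither is OpenSAF code.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in graph.get(node, ()):
            state = colour.get(dep, WHITE)
            if state == GREY:             # dep is already on the path: cycle found
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(graph):
        if colour[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# "X waits on Y" edges, taken from the ticket description.
waits_on = {
    "clmd": ["immnd"],          # clmd cannot use IMM while immnd is dead
    "immnd": ["amfnd"],         # immnd has to be restarted by amfnd
    "amfnd": ["clmd"],          # amfnd is blocked in a CLM call
    "amfd": ["clmd", "ntfd"],   # amfd is blocked in CLM and NTF calls
    "ntfd": ["logd"],
    "logd": ["immnd"],
}
cycle = find_cycle(waits_on)

# If amfd/amfnd do not block in their main thread, their edges disappear.
unblocked = dict(waits_on, amfnd=[], amfd=[])
no_cycle = find_cycle(unblocked)
```

Here `cycle` is `['clmd', 'immnd', 'amfnd', 'clmd']` and `no_cycle` is `None`, which matches the ticket's proposal: once amfnd no longer blocks on CLM, it is free to restart immnd and the loop is broken.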
[tickets] [opensaf:tickets] Re: #2112 amfd: multiple SUs incorrectly assigned to single node
In (2), I actually also need to remove IMM update of saAmfSUHostedByNode in avd_susi_recreate.

---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** assigned
**Milestone:** 5.1.1
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Thu Oct 13, 2016 11:57 PM UTC
**Owner:** Minh Hon Chau

Multiple SUs are assigned to a single node after SC absence.

To reproduce:

0) load nwayactive demo
1) stop SCs
2) restart SCs

The following is observed:

root@SC-1:~# immlist safSu=SU4,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

root@SC-1:~# immlist safSu=SU2,safSg=AmfDemo,safApp=AmfDemo2
...
saAmfSUHostedByNode SA_NAME_T safAmfNode=PL-4,safAmfCluster=myAmfCluster (42)

SU2 is indeed assigned to PL-4, but SU4 was assigned to one of the SCs and is not assigned to PL-4. Operations on SU4 will lead to a crash of amfnd on PL-4.
[tickets] [opensaf:tickets] #2112 amfd: multiple SUs incorrectly assigned to single node
Hi Praveen,

I have tried (2). I thought it should work, but it doesn't. The reason is that after avd_su_config_get(), saAmfSUHostedByNode of all SUs has already been written to IMM incorrectly, so reading saAmfSUHostedByNode in the headless sync phase will still give an incorrect mapping. Any ideas?

Thanks,
Minh

---

** [tickets:#2112] amfd: multiple SUs incorrectly assigned to single node**

**Status:** assigned
**Milestone:** 5.1.1
**Created:** Tue Oct 11, 2016 11:56 PM UTC by Gary Lee
**Last Updated:** Thu Oct 13, 2016 04:43 AM UTC
**Owner:** Minh Hon Chau
[tickets] [opensaf:tickets] #2102 build: Make it possible to do a full installation of OpenSAF with RPMs
- **status**: accepted --> review

---

** [tickets:#2102] build: Make it possible to do a full installation of OpenSAF with RPMs**

**Status:** review
**Milestone:** 5.2.FC
**Created:** Fri Oct 07, 2016 02:08 PM UTC by Anders Widell
**Last Updated:** Fri Oct 07, 2016 02:08 PM UTC
**Owner:** Anders Widell

When installing OpenSAF from RPMs, you have to select to install either the opensaf-controller or the opensaf-payload package. If you have installed the opensaf-controller package, it is not possible to configure the node as a payload node, and vice versa.

The suggestion is to include everything needed to configure the node as either a controller or a payload in the opensaf-controller package, i.e. by selecting to install this package you will get a full installation of OpenSAF, in the same way as you do if you install from source using "make install".
[tickets] [opensaf:tickets] #2119 smf: Long DN setting in IMM is not correctly detected
---

** [tickets:#2119] smf: Long DN setting in IMM is not correctly detected**

**Status:** unassigned
**Milestone:** 5.1.1
**Created:** Thu Oct 13, 2016 03:07 PM UTC by elunlen
**Last Updated:** Thu Oct 13, 2016 03:07 PM UTC
**Owner:** nobody
**Attachments:**

- [long_dn_campaign.xml](https://sourceforge.net/p/opensaf/tickets/2119/attachment/long_dn_campaign.xml) (22.8 kB; text/xml)
- [osafsmfd](https://sourceforge.net/p/opensaf/tickets/2119/attachment/osafsmfd) (251.3 kB; application/octet-stream)
- [syslog](https://sourceforge.net/p/opensaf/tickets/2119/attachment/syslog) (31.9 kB; application/octet-stream)

After ticket [#2087] "SMF: smfnd asserted on active controller with long dn when executing the campaign" was fixed, the long DN setting in IMM is no longer read correctly. Campaigns with long DNs fail.

Attached: campaign, syslog and osafsmfd trace
[tickets] [opensaf:tickets] #2118 PLM: add ability to scale out EEs dynamically
---

** [tickets:#2118] PLM: add ability to scale out EEs dynamically**

**Status:** assigned
**Milestone:** future
**Created:** Thu Oct 13, 2016 01:26 PM UTC by Alex Jones
**Last Updated:** Thu Oct 13, 2016 01:26 PM UTC
**Owner:** Alex Jones

This ticket is for adding the ability of PLM to scale out EEs, like CLM currently does. This is needed when PLM is used, because if PLMS doesn't know about the EE it won't unlock the EE which starts opensafd. Thus, CLM won't be called to trigger the scale out.
[tickets] [opensaf:tickets] #2117 clm: Allow payloads with lower node-id than system controllers
---

** [tickets:#2117] clm: Allow payloads with lower node-id than system controllers**

**Status:** accepted
**Milestone:** 5.0.2
**Created:** Thu Oct 13, 2016 01:23 PM UTC by Anders Widell
**Last Updated:** Thu Oct 13, 2016 01:23 PM UTC
**Owner:** Anders Widell

The "SC absence" feature does not currently support configurations where payload nodes have a lower node_id than system controller nodes. This restriction was, however, not documented. The suggestion is to fix this by adding support for such configurations.
[tickets] [opensaf:tickets] #2114 smf: balanced upgrade, missing removal of exectrl copy
- **status**: assigned --> fixed
- **Comment**:

changeset: 8219:d4c05e5a66e5
branch: opensaf-5.1.x
summary: smf: balanced upgrade, missing removal of exectrl copy [#2114]
Branch opensaf-5.1.x
Node ID d4c05e5a66e5431f1edffd22d1f5bc2527df36f0
Parent f86e8509a000626632fac108c38a86950bf2ef96

changeset: 8218:3565b16a8f88
summary: smf: balanced upgrade, missing removal of exectrl copy [#2114]
Node ID 3565b16a8f889bd13e8b2ebc18ebce7d94beb2de
Parent 90192f4b8e9830dfb5cdbc61fc1a77b532494968

---

** [tickets:#2114] smf: balanced upgrade, missing removal of exectrl copy**

**Status:** fixed
**Milestone:** 5.1.1
**Created:** Wed Oct 12, 2016 11:01 AM UTC by Rafael
**Last Updated:** Wed Oct 12, 2016 11:01 AM UTC
**Owner:** Rafael

When doing several bisu upgrades, it was noticed that an IMM copy of the execControl object was not removed after an upgrade. Looking into the code, there is a bug which would cause SMF to never remove this execControl copy. The copy would then be reused in the next campaign if it did an SI swap.
[tickets] [opensaf:tickets] #2094 Standby controller goes for reboot on stopping openSaf with STONITH enabled cluster
Split brain may only happen between the system controllers.

---

** [tickets:#2094] Standby controller goes for reboot on stopping openSaf with STONITH enabled cluster**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Wed Oct 05, 2016 07:28 AM UTC by Chani Srivastava
**Last Updated:** Thu Oct 13, 2016 09:39 AM UTC
**Owner:** nobody

OS : Ubuntu 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 2-node cluster (both controllers)
Remote fencing enabled

Steps:

1. Bring up OpenSaf on two nodes
2. Enable STONITH
3. Stop opensaf on Standby

Active controller triggers reboot of standby

SC-1 Syslog

Oct 5 13:01:23 SC-1 osafimmd[5535]: NO MDS event from svc_id 25 (change:4, dest:565215202263055)
Oct 5 13:01:23 SC-1 osafimmnd[5545]: NO Global discard node received for nodeId:2020f pid:3579
Oct 5 13:01:23 SC-1 osafimmnd[5545]: NO Implementer disconnected 14 <0, 2020f(down)> (@safAmfService2020f)
Oct 5 13:01:24 SC-1 osafamfd[5592]: **NO Node 'SC-2' left the cluster**
Oct 5 13:01:24 SC-1 osaffmd[5526]: NO Node Down event for node id 2020f:
Oct 5 13:01:24 SC-1 osaffmd[5526]: NO Current role: ACTIVE
Oct 5 13:01:24 SC-1 osaffmd[5526]: **Rebooting OpenSAF NodeId = 131599 EE Name = SC-2, Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Oct 5 13:01:25 SC-1 external/libvirt[5893]: [5906]: notice: Domain SC-2 was stopped**
Oct 5 13:01:27 SC-1 kernel: [ 5355.132093] tipc: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Oct 5 13:01:27 SC-1 kernel: [ 5355.132123] tipc: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 5 13:01:27 SC-1 kernel: [ 5355.132126] tipc: Lost contact with <1.1.2>
Oct 5 13:01:27 SC-1 external/libvirt[5893]: [5915]: notice: Domain SC-2 was started
Oct 5 13:01:42 SC-1 kernel: [ 5370.557180] tipc: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO MDS event from svc_id 25 (change:3, dest:565217457979407)
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO New IMMND process is on STANDBY Controller at 2020f
Oct 5 13:01:42 SC-1 osafimmd[5535]: WA IMMND on controller (not currently coord) requests sync
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO Node 2020f request sync sync-pid:1176 epoch:0
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Announce sync, epoch:4
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO Successfully announced sync. New ruling epoch:4
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Oct 5 13:01:43 SC-1 osafimmloadd: NO Sync starting
Oct 5 13:01:43 SC-1 osafimmloadd: IN Synced 346 objects in total
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 18430
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Epoch set to 4 in ImmModel
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO ACT: New Epoch for IMMND process at node 2010f old epoch: 3 new epoch:4
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO ACT: New Epoch for IMMND process at node 2020f old epoch: 0 new epoch:4
Oct 5 13:01:43 SC-1 osafimmloadd: NO Sync ending normally
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Oct 5 13:01:43 SC-1 osafamfd[5592]: NO Received node_up from 2020f: msg_id 1
Oct 5 13:01:43 SC-1 osafamfd[5592]: NO Node 'SC-2' joined the cluster
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer connected: 16 (MsgQueueService131599) <467, 2010f>
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer locally disconnected. Marking it as doomed 16 <467, 2010f> (MsgQueueService131599)
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer disconnected 16 <467, 2010f> (MsgQueueService131599)
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Peer up on node 0x2020f
Oct 5 13:01:44 SC-1 osaffmd[5526]: NO clm init OK
Oct 5 13:01:44 SC-1 osafimmd[5535]: NO MDS event from svc_id 24 (change:5, dest:13)
Oct 5 13:01:44 SC-1 osaffmd[5526]: NO Peer clm node name: SC-2
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Got peer info request from node 0x2020f with role STANDBY
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Got peer info response from node 0x2020f with role STANDBY
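When reading the syslog above, note that the same node appears in two notations: osafamfd, osafimmnd and osafrded print the node id in hex (`2020f`, `0x2020f`), while osaffmd prints it in decimal (`NodeId = 131599`). A small helper for matching the two when grepping logs might look like this; the function names are made up for illustration.

```python
def node_id_to_hex(node_id: int) -> str:
    """Decimal node id as printed by osaffmd -> hex form used by amfd/immnd."""
    return format(node_id, "x")


def hex_to_node_id(text: str) -> int:
    """Hex node id ("2020f" or "0x2020f") -> decimal form."""
    return int(text, 16)


peer = node_id_to_hex(131599)     # NodeId from the "Rebooting OpenSAF" line
own = node_id_to_hex(131343)      # OwnNodeId from the same line
rded_peer = hex_to_node_id("0x2020f")
```

With the values from the log, `peer` is `"2020f"` (SC-2) and `own` is `"2010f"` (SC-1), so the reboot request is issued by SC-1 against the node that just left the cluster.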
[tickets] [opensaf:tickets] #2097 Both controllers went for reboot while recovering from split brain
To be able to handle split-brain with e.g. stonith, there needs to be a way for stonith to communicate with, in this case, the hypervisor; in some other configuration it may be e.g. IPMI. If the only interface has been brought down, stonith cannot fence the other node. The correct way is to add an additional interface to use for stonith with TCP. For testing purposes, and if using only one interface, you can use the command tipc-config -bd eth:eth0 instead; this should work.

---

** [tickets:#2097] Both controllers went for reboot while recovering from split brain**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 06, 2016 04:58 AM UTC by Chani Srivastava
**Last Updated:** Thu Oct 13, 2016 09:32 AM UTC
**Owner:** nobody
**Attachments:**

- [Fencing_logs.zip](https://sourceforge.net/p/opensaf/tickets/2097/attachment/Fencing_logs.zip) (43.3 kB; application/zip)

OS : Ubuntu 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 3-node cluster (2 controllers, 1 payload)
Remote fencing enabled

Steps:

1. Bring up OpenSaf on all nodes
2. Enable STONITH
3. Disconnect network from both controllers at the same time -- This will simulate split brain and both controllers become ACTIVE
4. Connect network to both controllers together --- Both controllers reboot

Expected: Controllers should join the cluster by rebooting only one of the controllers.

Syslog attached for both controllers
[tickets] [opensaf:tickets] #314 AMF looses alarms and notifications during switch-over
- **status**: assigned --> review
- **Milestone**: future --> 5.0.2

---

** [tickets:#314] AMF looses alarms and notifications during switch-over**

**Status:** review
**Milestone:** 5.0.2
**Created:** Fri May 24, 2013 08:34 AM UTC by Nagendra Kumar
**Last Updated:** Tue Oct 04, 2016 09:11 AM UTC
**Owner:** Praveen
**Attachments:**

- [messages](https://sourceforge.net/p/opensaf/tickets/314/attachment/messages) (41.9 kB; application/octet-stream)
- [osafamfd](https://sourceforge.net/p/opensaf/tickets/314/attachment/osafamfd) (5.7 MB; application/octet-stream)

Migrated from http://devel.opensaf.org/ticket/3051

Background: http://devel.opensaf.org/ticket/3028

If another node (payload) leaves the cluster in the middle of switch-over, amfd logs this:

Mar 8 10:18:21 SC-1 osafamfd[304]: ER sendStateChangeNotificationAvd: saNtfNotificationSend Failed (6)
Mar 8 10:18:21 SC-1 osafamfd[304]: ER sendAlarmNotificationAvd: saNtfNotificationSend Failed (6)

These logs mean that amfd failed to send an alarm and a notification because NTF returned TRYAGAIN (in NOACTIVE state).

AMF needs to store the alarms/notifications produced in the NOACTIVE state and send them at the end of the switch-over. Alternatively, it could use a separate thread that can block forever (?) on TRYAGAIN.

The problem exists in all opensaf releases.
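The first option proposed in #314 (store what NTF refuses with TRYAGAIN and send it once the switch-over is over) can be sketched as follows. This is an illustration only, not amfd code: `DeferredNotifier` and `FakeNtf` are hypothetical names, and the only values taken from the SA Forum AIS definitions are the numeric codes `SA_AIS_OK = 1` and `SA_AIS_ERR_TRY_AGAIN = 6`, the latter being the `(6)` in the log lines.

```python
from collections import deque

SA_AIS_OK = 1
SA_AIS_ERR_TRY_AGAIN = 6


class DeferredNotifier:
    """Queue notifications and keep them until the send callable stops returning TRY_AGAIN."""

    def __init__(self, send):
        self._send = send            # callable(notification) -> SaAisErrorT-style code
        self._pending = deque()

    def notify(self, notification):
        self._pending.append(notification)
        return self.flush()

    def flush(self):
        """Send queued notifications in order; return how many are still waiting."""
        while self._pending:
            status = self._send(self._pending[0])
            if status == SA_AIS_ERR_TRY_AGAIN:
                break                # keep it queued and retry on the next flush
            # Sent, or failed with an error that retrying would not fix (sketch choice).
            self._pending.popleft()
        return len(self._pending)


class FakeNtf:
    """Stand-in for the NTF service: refuses everything while there is no active server."""

    def __init__(self):
        self.active = False
        self.delivered = []

    def send(self, notification):
        if not self.active:
            return SA_AIS_ERR_TRY_AGAIN
        self.delivered.append(notification)
        return SA_AIS_OK


ntf = FakeNtf()
notifier = DeferredNotifier(ntf.send)
waiting = notifier.notify("state change")   # NOACTIVE: stays queued
waiting = notifier.notify("alarm")          # still queued behind the first one
ntf.active = True                           # switch-over finished
remaining = notifier.flush()
```

Calling `flush()` from the main loop (for example when the HA state change completes) keeps the main thread non-blocking, which is the difference from the second option in the ticket, a dedicated thread that blocks on TRYAGAIN.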
[tickets] [opensaf:tickets] #1828 AMF: Both director and node director hang if immnd dies in new SC reallocation scenario
Hi Minh,

Can we have a test scenario (how to test this)? I am testing in the following ways, with 3 controllers and the headless feature enabled:

1. Stop/kill OpenSAF on the Active followed by the Standby with a delay of 2 sec. In this case I am facing #1797.
2. Stop OpenSAF on Active/Standby simultaneously. In this case the QUIESCED controller becomes Active immediately (clmna logs "Starting to promote this node to a system controller"), but after a few seconds EDS faults on this node, as described in ticket #2116.

Thanks,
Ritu Raj

---

** [tickets:#1828] AMF: Both director and node director hang if immnd dies in new SC reallocation scenario**

**Status:** fixed
**Milestone:** 5.1.FC
**Created:** Mon May 16, 2016 05:29 AM UTC by Minh Hon Chau
**Last Updated:** Thu Aug 04, 2016 10:17 AM UTC
**Owner:** nobody
[tickets] [opensaf:tickets] #2116 EDS faulted on new Active controller after being promoted from QUIESCED to ACTIVE
- **Version**: 5.0.GA --> 5.1.GA

---

** [tickets:#2116] EDS faulted on new Active controller after being promoted from QUIESCED to ACTIVE**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 13, 2016 09:49 AM UTC by Ritu Raj
**Last Updated:** Thu Oct 13, 2016 09:49 AM UTC
**Owner:** nobody
**Attachments:**

- [messages](https://sourceforge.net/p/opensaf/tickets/2116/attachment/messages) (2.9 MB; application/octet-stream)
- [osafevtd](https://sourceforge.net/p/opensaf/tickets/2116/attachment/osafevtd) (102.4 kB; application/octet-stream)

# Environment details

OS : Suse 64bit
Changeset : 8190 ( 5.1.GA)
Setup : 3 nodes ( 3 controllers with headless feature enabled & PBE disabled)

# Summary

EDS faulted on new Active controller after being promoted from QUIESCED to ACTIVE

# Steps followed & Observed behaviour

1. Initially started OpenSAF on 3 controllers with HEADLESS feature enabled (SC-1 ACTIVE, SC-2 Standby, SC-3 QUIESCED)
2. Stop OpenSAF on both controllers (Active/Standby) simultaneously
3. The QUIESCED controller becomes Active, as clmna starts to promote this node to a system controller

Oct 13 14:29:05 SCALE_SLOT-73 osafclmna[3434]: NO Starting to promote this node to a system controller
Oct 13 14:29:05 SCALE_SLOT-73 osafrded[3443]: NO Requesting ACTIVE role
Oct 13 14:29:10 SCALE_SLOT-73 osafimmd[3462]: IN AMF HA ACTIVE request
Oct 13 14:29:10 SCALE_SLOT-73 osaffmd[3452]: NO Stopped activation supervision due to new AMF state 1
Oct 13 14:29:10 SCALE_SLOT-73 osafamfd[3513]: NO Received node_up from 2030f: msg_id 1
Oct 13 14:29:10 SCALE_SLOT-73 osafamfd[3513]: NO Node 'SC-3' joined the cluster

4. After a few seconds EDS faulted and the node went for reboot

Oct 13 14:30:11 SCALE_SLOT-73 osafamfnd[3523]: NO 'safComp=EDS,safSu=SC-3,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Oct 13 14:30:11 SCALE_SLOT-73 osafamfnd[3523]: ER safComp=EDS,safSu=SC-3,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Oct 13 14:30:11 SCALE_SLOT-73 osafamfnd[3523]: Rebooting OpenSAF NodeId = 131855 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131855, SupervisionTime = 60

# Notes

1. Syslog attached
2. osafevtd trace attached
[tickets] [opensaf:tickets] #2094 Standby controller goes for reboot on stopping openSaf with STONITH enabled cluster
Is STONITH applicable only to controllers? No reboot was observed while stopping OpenSAF on a payload.

---

** [tickets:#2094] Standby controller goes for reboot on stopping openSaf with STONITH enabled cluster**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Wed Oct 05, 2016 07:28 AM UTC by Chani Srivastava
**Last Updated:** Thu Oct 06, 2016 12:23 PM UTC
**Owner:** nobody

OS : Ubuntu 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 2-node cluster (both controllers)
Remote fencing enabled

Steps:

1. Bring up OpenSAF on two nodes
2. Enable STONITH
3. Stop OpenSAF on the Standby

The Active controller triggers a reboot of the Standby.

SC-1 Syslog

Oct 5 13:01:23 SC-1 osafimmd[5535]: NO MDS event from svc_id 25 (change:4, dest:565215202263055)
Oct 5 13:01:23 SC-1 osafimmnd[5545]: NO Global discard node received for nodeId:2020f pid:3579
Oct 5 13:01:23 SC-1 osafimmnd[5545]: NO Implementer disconnected 14 <0, 2020f(down)> (@safAmfService2020f)
Oct 5 13:01:24 SC-1 osafamfd[5592]: **NO Node 'SC-2' left the cluster**
Oct 5 13:01:24 SC-1 osaffmd[5526]: NO Node Down event for node id 2020f:
Oct 5 13:01:24 SC-1 osaffmd[5526]: NO Current role: ACTIVE
Oct 5 13:01:24 SC-1 osaffmd[5526]: **Rebooting OpenSAF NodeId = 131599 EE Name = SC-2, Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60**
**Oct 5 13:01:25 SC-1 external/libvirt[5893]: [5906]: notice: Domain SC-2 was stopped**
Oct 5 13:01:27 SC-1 kernel: [ 5355.132093] tipc: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Oct 5 13:01:27 SC-1 kernel: [ 5355.132123] tipc: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 5 13:01:27 SC-1 kernel: [ 5355.132126] tipc: Lost contact with <1.1.2>
Oct 5 13:01:27 SC-1 external/libvirt[5893]: [5915]: notice: Domain SC-2 was started
Oct 5 13:01:42 SC-1 kernel: [ 5370.557180] tipc: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO MDS event from svc_id 25 (change:3, dest:565217457979407)
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO New IMMND process is on STANDBY Controller at 2020f
Oct 5 13:01:42 SC-1 osafimmd[5535]: WA IMMND on controller (not currently coord) requests sync
Oct 5 13:01:42 SC-1 osafimmd[5535]: NO Node 2020f request sync sync-pid:1176 epoch:0
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Announce sync, epoch:4
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO Successfully announced sync. New ruling epoch:4
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Oct 5 13:01:43 SC-1 osafimmloadd: NO Sync starting
Oct 5 13:01:43 SC-1 osafimmloadd: IN Synced 346 objects in total
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 18430
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Epoch set to 4 in ImmModel
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO ACT: New Epoch for IMMND process at node 2010f old epoch: 3 new epoch:4
Oct 5 13:01:43 SC-1 osafimmd[5535]: NO ACT: New Epoch for IMMND process at node 2020f old epoch: 0 new epoch:4
Oct 5 13:01:43 SC-1 osafimmloadd: NO Sync ending normally
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Oct 5 13:01:43 SC-1 osafamfd[5592]: NO Received node_up from 2020f: msg_id 1
Oct 5 13:01:43 SC-1 osafamfd[5592]: NO Node 'SC-2' joined the cluster
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer connected: 16 (MsgQueueService131599) <467, 2010f>
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer locally disconnected. Marking it as doomed 16 <467, 2010f> (MsgQueueService131599)
Oct 5 13:01:43 SC-1 osafimmnd[5545]: NO Implementer disconnected 16 <467, 2010f> (MsgQueueService131599)
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Peer up on node 0x2020f
Oct 5 13:01:44 SC-1 osaffmd[5526]: NO clm init OK
Oct 5 13:01:44 SC-1 osafimmd[5535]: NO MDS event from svc_id 24 (change:5, dest:13)
Oct 5 13:01:44 SC-1 osaffmd[5526]: NO Peer clm node name: SC-2
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Got peer info request from node 0x2020f with role STANDBY
Oct 5 13:01:44 SC-1 osafrded[5518]: NO Got peer info response from node 0x2020f with role STANDBY
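To follow the timeline in a syslog excerpt like the one in this ticket, the entries can be split into fields and the gap between two events computed. The following is a rough Python sketch, not an OpenSAF tool: the regular expression, the `parse_line` helper, the hard-coded year, and the list of severity tags (only NO, WA, IN and ER, the ones visible in these tickets) are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> <proc>[<pid>]: [<sev>] <message>".
# The pid and the two-letter severity tag are optional (kernel lines have neither).
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<proc>[^\s\[:]+)(?:\[(?P<pid>\d+)\])?:\s+"
    r"(?:(?P<sev>NO|WA|IN|ER)\s+)?"
    r"(?P<msg>.*)$"
)


def parse_line(line, year=2016):
    """Return a dict of fields for one syslog line, or None if it does not match.

    Syslog timestamps carry no year, so one is supplied (assumed 2016 here).
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    stamp = record.pop("stamp")
    record["time"] = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return record


# Two lines from the SC-1 syslog (emphasis markers removed).
left = parse_line("Oct 5 13:01:24 SC-1 osafamfd[5592]: NO Node 'SC-2' left the cluster")
joined = parse_line("Oct 5 13:01:43 SC-1 osafamfd[5592]: NO Node 'SC-2' joined the cluster")

# Seconds between SC-2 leaving and rejoining the cluster.
print((joined["time"] - left["time"]).total_seconds())  # 19.0
```

With the excerpt above this gives 19 seconds between SC-2 leaving and rejoining the cluster, which covers the fencing stop/start reported by external/libvirt.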
[tickets] [opensaf:tickets] #2097 Both controllers went for reboot while recovering from split brain
To simulate a split-brain scenario we intentionally did not configure a redundant interface.

---

** [tickets:#2097] Both controllers went for reboot while recovering from split brain**

**Status:** unassigned
**Milestone:** 5.2.FC
**Created:** Thu Oct 06, 2016 04:58 AM UTC by Chani Srivastava
**Last Updated:** Wed Oct 12, 2016 10:46 AM UTC
**Owner:** nobody
**Attachments:**

- [Fencing_logs.zip](https://sourceforge.net/p/opensaf/tickets/2097/attachment/Fencing_logs.zip) (43.3 kB; application/zip)

OS : Ubuntu 64bit
Changeset : 7997 ( 5.1.FC)
Setup : 3-node cluster (2 controllers, 1 payload)
Remote fencing enabled

Steps:

1. Bring up OpenSAF on all nodes
2. Enable STONITH
3. Disconnect the network from both controllers at the same time. This simulates split brain, and both controllers become ACTIVE
4. Reconnect the network to both controllers together. Both controllers reboot

Expected: The controllers should rejoin the cluster with only one of the controllers rebooting.

Syslog attached for both controllers
[tickets] [opensaf:tickets] #2073 log: the usage of command saflogger is not correct
- **Part**: tests --> tools

---

** [tickets:#2073] log: the usage of command saflogger is not correct**

**Status:** review
**Milestone:** 4.7.2
**Created:** Tue Sep 27, 2016 11:43 AM UTC by Canh Truong
**Last Updated:** Thu Oct 13, 2016 07:29 AM UTC
**Owner:** Canh Truong

The usage text of the saflogger command is not correct:

EXAMPLES
saflogger -a safLgStrCfg=Test "Hello world"
saflogger -a safLgStrCfg=Test -f testLogFile "Hello world"
saflogger -s crit "I am going down"

*root@SC-1:~# saflogger -a safLgStrCfg=Test -f testLogFile "Hello world"*
*saLogStreamOpen2 FAILED: SA_AIS_ERR_INVALID_PARAM (7)*

The RDN of a runtime stream is "safLgStr=". So the examples should use "safLgStr=..." instead of "safLgStrCfg=..." in the saflogger command.
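The rule stated in this ticket (a runtime stream name must start with the RDN type "safLgStr", not "safLgStrCfg") can be expressed as a small check. This is only an illustrative Python sketch; `leading_rdn_type` and `is_runtime_stream_dn` are hypothetical names, not saflogger or OpenSAF functions, and the simple split below assumes the first RDN contains no escaped commas.

```python
def leading_rdn_type(dn: str) -> str:
    """Return the attribute type of the first RDN, e.g. 'safLgStr' for 'safLgStr=Test'.

    Assumption: the first RDN has no escaped ',' or '=' characters.
    """
    first_rdn = dn.split(",", 1)[0]
    return first_rdn.split("=", 1)[0]


def is_runtime_stream_dn(dn: str) -> bool:
    """True if the DN starts with the runtime stream RDN type named in the ticket."""
    return leading_rdn_type(dn) == "safLgStr"


print(is_runtime_stream_dn("safLgStrCfg=Test"))  # False: the form used in the current usage text
print(is_runtime_stream_dn("safLgStr=Test"))     # True: the form the ticket proposes
```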
[tickets] [opensaf:tickets] #2073 log: the usage of command saflogger is not correct
- **Component**: unknown --> log

---

** [tickets:#2073] log: the usage of command saflogger is not correct**

**Status:** review
**Milestone:** 4.7.2
**Created:** Tue Sep 27, 2016 11:43 AM UTC by Canh Truong
**Last Updated:** Mon Oct 03, 2016 09:39 AM UTC
**Owner:** Canh Truong

The usage text of the saflogger command is not correct:

EXAMPLES
saflogger -a safLgStrCfg=Test "Hello world"
saflogger -a safLgStrCfg=Test -f testLogFile "Hello world"
saflogger -s crit "I am going down"

*root@SC-1:~# saflogger -a safLgStrCfg=Test -f testLogFile "Hello world"*
*saLogStreamOpen2 FAILED: SA_AIS_ERR_INVALID_PARAM (7)*

The RDN of a runtime stream is "safLgStr=". So the examples should use "safLgStr=..." instead of "safLgStrCfg=..." in the saflogger command.