[tickets] [opensaf:tickets] #203 avsv: SG went to unstable state when active SU is locked after adding new SI in NWay RM
Attached 203.tgz contains:
- 203.xml: configuration to reproduce the issue
- new_si_csi_add.sh: script to add a new SI
- traces

Steps to reproduce:
1) immcfg -f 203.xml
2) Unlock-in and unlock SU1 and SU2.
3) ./new_si_csi_add.sh
4) Lock SU1.
5) Stop OpenSAF on the payload hosting SU2.

---

** [tickets:#203] avsv: SG went to unstable state when active SU is locked after adding new SI in NWay RM**

**Status:** assigned
**Created:** Wed May 15, 2013 04:32 AM UTC by Praveen
**Last Updated:** Fri Sep 06, 2013 01:14 PM UTC
**Owner:** Praveen

The issue is observed on SLES 64-bit VMs. Configuration: NWay redundancy model with 2 SUs, 2 SIs and 2 CSIs. PBE is enabled and OpenSAF is run as the root user. A new SI is added and then the active SU is locked. The following messages are seen in the syslog:

Oct 6 19:24:39 SLES-SLOT-1 osafamfnd[3703]: Assigning 'safSi=d_NWay_1Norm_3,safApp=N' ACTIVE to 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:39 SLES-SLOT-1 osafamfnd[3703]: Assigned 'safSi=d_NWay_1Norm_3,safApp=N' ACTIVE to 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigning 'safSi=d_NWay_1Norm_1,safApp=N' QUIESCED to 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigned 'safSi=d_NWay_1Norm_1,safApp=N' QUIESCED to 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigning 'safSi=d_NWay_1Norm_3,safApp=N' QUIESCED to 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigned 'safSi=d_NWay_1Norm_3,safApp=N' QUIESCED to 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Removing 'safSi=d_NWay_1Norm_3,safApp=N' from 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Removed 'safSi=d_NWay_1Norm_3,safApp=N' from 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Removing 'all SIs' from 'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
Oct 6 19:24:42 SLES-SLOT-1 osafamfd[3693]: SG state is not stable
Oct 6 19:24:43 SLES-SLOT-1 osafamfd[3693]: SG state is not stable
Oct 6 19:24:44 SLES-SLOT-1 osafamfd[3693]: SG state is not stable

Further operations failed since the SG is not stable. When PL-4, which was hosting the active SU, is brought down, the amfd on the active controller crashed, leading to a reboot of the node. The following messages are seen in the syslog:

Oct 6 19:43:59 SLES-SLOT-1 osafamfd[3693]: Node 'PL-4' left the cluster
Oct 6 19:44:00 SLES-SLOT-1 osafamfd[3693]: avd_su.c:1585: avd_su_dec_curr_stdby_si: Assertion 'su->saAmfSUNumCurrStandbySIs != 0' failed.
Oct 6 19:44:00 SLES-SLOT-1 osafamfnd[3703]: AMF director unexpectedly crashed
Oct 6 19:44:00 SLES-SLOT-1 osafamfnd[3703]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received
Oct 6 19:44:00 SLES-SLOT-1 osafimmnd[3628]: Implementer locally disconnected. Marking it as doomed 3 <17, 2010f> (safAmfService)
Oct 6 19:44:00 SLES-SLOT-1 osafimmnd[3628]: Implementer disconnected 3 <17, 2010f> (safAmfService)
Oct 6 19:44:00 SLES-SLOT-1 opensaf_reboot: Rebooting local node

Backtrace of the core file:

Core was generated by `/usr/lib64/opensaf/osafamfd --tracemask=0x'.
Program terminated with signal 6, Aborted.
#0 0x7f457decd645 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x7f457decd645 in raise () from /lib64/libc.so.6
#1 0x7f457decec33 in abort () from /lib64/libc.so.6
#2 0x7f457f4df095 in osafassert_fail (file=0x4ac5e5 "avd_su.c", line=1585, func=0x4ad590 "avd_su_dec_curr_stdby_si", assertion=0x4ad5b0 "su->saAmfSUNumCurrStandbySIs != 0") at sysf_def.c:399
#3 0x0048964f in avd_su_dec_curr_stdby_si (su=0x727f70) at avd_su.c:1585
#4 0x0048b244 in avd_susi_update_assignment_counters (susi=0x767bf0, action=AVSV_SUSI_ACT_DEL, current_ha_state=0, new_ha_state=0) at avd_siass.c:730
#5 0x0048aff7 in avd_susi_del_send (susi=0x767bf0) at avd_siass.c:663
#6 0x00474bbc in avd_sg_nway_node_fail_stable (cb=0x6bdbe0, su=0x732130, susi=0x0) at avd_sgNWayfsm.c:3191
#7 0x00476257 in avd_sg_nway_node_fail_sg_realign (cb=0x6bdbe0, su=0x732130) at avd_sgNWayfsm.c:3645
#8 0x0046c82c in avd_sg_nway_node_fail_func (cb=0x6bdbe0, su=0x732130) at avd_sgNWayfsm.c:657
#9 0x0047ad65 in avd_node_susi_fail_func (cb=0x6bdbe0, avnd=0x6fef50) at avd_sgproc.c:2126
#10 0x00434f72 in avd_node_failover (node=0x6fef50) at avd_ndproc.c:776
#11 0x00431a80 in avd_mds_avnd_down_evh (cb=0x6bdbe0, evt=0x7f4578000ae0) at avd_ndfsm.c:407
#12 0x0043b57e in avd_process_event (cb_now=0x6bdbe0, evt=0x7f4578000ae0) at avd_proc.c:589
#13 0x0043b305 in avd_main_proc () at avd_proc.c:505
#14 0x00409210 in main (argc=2, argv=0x7fff87968c08) at amfd_main.c:47

Traces from the active controller are attached.

Changed 7 months ago
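The assertion that brings down the director fires when avd_su_dec_curr_stdby_si() is asked to decrement a standby-assignment counter that is already zero, i.e. the node-failover path deletes a SUSI whose standby assignment was never counted (or was already decremented). A minimal sketch of the failing check, using a hypothetical su_model type rather than the real amfd structures, with the guard made explicit instead of aborting:

```c
#include <assert.h>

/* Hypothetical model of the SU standby-SI counter that underflows in
 * avd_su_dec_curr_stdby_si() -- not the OpenSAF code itself. */
struct su_model {
    unsigned num_curr_standby_sis;
};

/* Returns 0 on success, -1 on the underflow that trips the osafassert
 * in the ticket (a standby SI removed twice, or removed after the
 * counters diverged from the actual assignments). */
static int su_dec_curr_stdby_si(struct su_model *su)
{
    if (su->num_curr_standby_sis == 0)
        return -1; /* real code: osafassert(su->saAmfSUNumCurrStandbySIs != 0) */
    su->num_curr_standby_sis--;
    return 0;
}
```

Removing the same standby assignment twice returns -1 here, where the real director calls osafassert and aborts, which is what turns this bookkeeping divergence into a controller reboot.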
[tickets] [opensaf:tickets] #203 avsv: SG went to unstable state when active SU is locked after adding new SI in NWay RM
Attached traces and configuration.

Attachment: 203.tgz (628.1 kB; application/x-compressed)

---

** [tickets:#203] avsv: SG went to unstable state when active SU is locked after adding new SI in NWay RM**

**Status:** assigned
**Created:** Wed May 15, 2013 04:32 AM UTC by Praveen
**Last Updated:** Tue Oct 29, 2013 07:04 AM UTC
**Owner:** Praveen

Changed 7 months ago by nagendra

Can you please test it on 4.2.2? I suspect that 2832 may be solving the issue, as there have been many CSI add/del operations before this issue occurred.

Changed 7 months ago by allasirisha

This
[tickets] [opensaf:tickets] #152 LOG: remove dependency to shared file system
- **status**: unassigned -- accepted
- **Milestone**: future -- 4.4.FC

---

** [tickets:#152] LOG: remove dependency to shared file system**

**Status:** accepted
**Created:** Mon May 13, 2013 11:51 AM UTC by elunlen
**Last Updated:** Mon May 13, 2013 11:51 AM UTC
**Owner:** elunlen

Purpose: make OpenSAF not dependent on a shared file system.

Benefit: improved performance and robustness. OpenSAF will no longer need to rely on a shared file system like DRBD.

Drawback: hard to follow when there are holes of missing log records on one controller that exist on the other controller.

Suggestion:
* Use the same file names on both controllers.
* Files could be smaller than the rotation size on one or both controllers if downtime has occurred.
* No configuration; "no shared file system" will be the only option.

Implementation:
* Log records need to be forwarded to the standby.
* Both the active and the standby LOG server write to their local file.

Migrated from devel.opensaf.org #2416

---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
[tickets] [opensaf:tickets] Re: #528 order of service not guaranteed during failover
The cause of the problem is that the discard_node message from the active IMMD to all IMMNDs over FEVS (MDS broadcast) has not reached the IMMND at the new active, where the CLMD is trying to set the implementer. This is a timing issue due to the non-homogeneous communication mechanisms in OpenSAF. It can be fixed in various ways:

1) The IMMD could postpone its reply to the new-active order from AMF until the discard_node has reached the local IMMND and the local IMMND has sent a confirm message back to the IMMD. This is quite a complicated solution and will in general slow down failover a bit.

2) The AMFD at the new active would itself set the implementer for AMF, and it could postpone invoking new-active on the other directors until it has succeeded. The AMFD would then itself need to cope with getting ERR_EXIST and treat it the same way it treats TRY_AGAIN in this particular context.

3) If for some reason it is preferable for the AMFD implementation to do its implementer-set later, it could invoke new-active on one chosen service before the others (e.g. CLMD) and have the CLMD set the implementer, coping with ERR_EXIST in the new-active context, before replying to the AMFD.

4) All services could treat getting ERR_EXIST on implementerSet in the context of failover as getting TRY_AGAIN.

I would recommend (2) or (3) as the (from my perspective) simplest solutions.

In general, it is safe to treat ERR_EXIST (or any error that has the semantics of "nothing was done") as TRY_AGAIN. That is, nothing bad can happen simply because the request is tried again. Of course, the request may be futile in most other contexts: when there is another healthy OI occupying the implementer name, the retry loop would run all the way to completion, making the whole retry exercise pointless and delaying other meaningful tasks.
But in the particular case where a service knows "this is a failover and I am the new active", that service also knows that retrying ERR_EXIST on implementerSet will not be futile unless something is seriously wrong with the cluster.

/AndersBj

From: Mathi Naickan [mailto:mathi-naic...@users.sf.net]
Sent: den 28 oktober 2013 17:34
To: [opensaf:tickets]
Subject: [tickets] [opensaf:tickets] #528 order of service not guaranteed during failover

Well, before proceeding any further, there are questions... to identify a probably not yet uncovered real root cause (this would also help in prioritizing this ticket):

1) The log snippet in this ticket indicates that IMM was notified of a failover before the implementerClear reached IMM. So, when IMM has already received the failover indication, the IMM implementation ought to be able to handle the implementer clear. Why has that not happened? Is it because IMM waits for IMMA down? If so, is it because IMMA down has not yet reached this IMMND? This is not a situation where I would expect the IMM clients to be shown an ERR_EXIST.

2) Are #528 and #599 really the same scenario? Or is it that in the case of #599 the IMM has not received the failover indication before the implementerSet? In that case we could call this scenario born out of a timing delay created somewhere in the stack...

Questions apart, I am also thinking about how the dependencies (instantiation and CSI dependencies) among the middleware components exist, or can be changed, such that IMM is ready first, before a CSI set is delivered to IMM clients (middleware components)!

[tickets:#528] http://sourceforge.net/p/opensaf/tickets/528/ order of service not guaranteed during failover

Status: unassigned
Created: Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
Last Updated: Mon Oct 21, 2013 07:33 AM UTC
Owner: nobody

The issue is seen on changeset 4325 on SLES 4-node VMs. SC-1 is active, SC-2 is standby. Failover is triggered by killing FMD on SC-1.
Jul 29 21:28:18 SLES-64BIT-SLOT1 root: killing osaffmd from invoke_failover.sh
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: NO 'safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: ER safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Jul 29 21:28:18 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60

SC-2 tried becoming active but failed, since the CLMD reported ERR_EXIST on implementer set. The reason is that the IMMND had not yet disconnected the old implementer on 2010f. The following syslog shows the sequence:

Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408188] TIPC: Resetting link 1.1.2:eth1-1.1.1:eth0, peer not responding
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408194] TIPC: Lost link
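Options (2) and (4) above boil down to a retry loop that, in the failover context only, treats SA_AIS_ERR_EXIST on implementer-set exactly like SA_AIS_ERR_TRY_AGAIN. A self-contained sketch of that loop, where stub_implementer_set() is a stand-in for saImmOiImplementerSet() that keeps reporting ERR_EXIST until the IMMND has processed discard_node (the error-code values mirror saAis.h):

```c
#include <assert.h>

/* SAF-style return codes (numeric values as in saAis.h). */
enum ais_rc { SA_AIS_OK = 1, SA_AIS_ERR_TRY_AGAIN = 6, SA_AIS_ERR_EXIST = 14 };

/* Stub standing in for saImmOiImplementerSet(): reports ERR_EXIST
 * until the IMMND has discarded the old implementer of the failed
 * controller, then succeeds. */
static int calls_until_discard = 3;
static enum ais_rc stub_implementer_set(const char *name)
{
    (void)name;
    if (calls_until_discard-- > 0)
        return SA_AIS_ERR_EXIST; /* old implementer not yet discarded */
    return SA_AIS_OK;
}

/* During failover the new-active director knows ERR_EXIST is
 * transient, so it is retried exactly like TRY_AGAIN. */
static enum ais_rc implementer_set_failover(const char *name, int max_tries)
{
    enum ais_rc rc = SA_AIS_ERR_TRY_AGAIN;
    for (int i = 0; i < max_tries; i++) {
        rc = stub_implementer_set(name);
        if (rc != SA_AIS_ERR_TRY_AGAIN && rc != SA_AIS_ERR_EXIST)
            break;
        /* real code would sleep a short interval between attempts */
    }
    return rc;
}
```

Outside the failover context the same loop would spin uselessly against a healthy OI holding the name, which is exactly the caveat Anders raises; the retry is only known to terminate because the caller is the new active.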
[tickets] [opensaf:tickets] #182 operational state of NPI component is not cleared to ENABLED when SU admin repaired is performed
This issue is reproducible on changeset 4565:e8ae1895d8e3. Attached 182.tgz contains traces and 182.xml (reproducible configuration).

Steps to reproduce and observations:

1) immcfg -f 182.xml

2) amf-adm unlock-in and unlock safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI

error - saImmOmAdminOperationInvoke_2 admin-op RETURNED: SA_AIS_ERR_TIMEOUT (5)

States of the components after the unlock operation:

safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=DISABLED(2)
saAmfCompPresenceState=INSTANTIATION-FAILED(6)
saAmfCompReadinessState=OUT-OF-SERVICE(1)

safComp=AmfDemo1,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=ENABLED(1)
saAmfCompPresenceState=UNINSTANTIATED(1)
saAmfCompReadinessState=OUT-OF-SERVICE(1)

Now correct the instantiation script and perform repair.

3) amf-adm repaired safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI

Component processes:
root 17281 1 0 15:15 ? 00:00:00 /opt/amf_demo/npi/amf_comp_npi
root 17289 1 0 15:15 ? 00:00:00 /opt/amf_demo/npi/amf_comp_npi

safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=DISABLED(2)
saAmfCompPresenceState=INSTANTIATED(3)
saAmfCompReadinessState=OUT-OF-SERVICE(1)

safComp=AmfDemo1,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=ENABLED(1)
saAmfCompPresenceState=INSTANTIATED(3)
saAmfCompReadinessState=IN-SERVICE(2)

amf-state si all:

safSi=AmfDemo,safApp=AmfDemo_NPI
saAmfSIAdminState=UNLOCKED(1)
saAmfSIAssignmentState=PARTIALLY_ASSIGNED(3)

safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=ENABLED(1)
saAmfSUPresenceState=INSTANTIATED(3)
saAmfSUReadinessState=IN-SERVICE(2)

At AMFND, on a repair operation request, only PI components are enabled, in amfnd/su.cc:

if (m_AVND_COMP_TYPE_IS_PREINSTANTIABLE(comp)) {
    m_AVND_COMP_OPER_STATE_SET(comp, SA_AMF_OPERATIONAL_ENABLED);
    avnd_di_uns32_upd_send(AVSV_SA_AMF_COMP, saAmfCompOperState_ID, comp->name, comp->oper);
}

---

** [tickets:#182] operational state of NPI component is not cleared to ENABLED when SU admin repaired is performed**

**Status:** assigned
**Created:** Tue May 14, 2013 06:56 AM UTC by Nagendra Kumar
**Last Updated:** Fri Sep 06, 2013 01:17 PM UTC
**Owner:** Praveen

Migrated from http://devel.opensaf.org/ticket/2168

1. Brought the NPI component to instantiation-failed state. The SU and component moved to instantiation-failed state and the operational state was set to DISABLED.
2. After the appropriate action was taken, the admin repaired action was performed on the SU.
3. The SU's presence state moved to INSTANTIATED, its operational state to ENABLED and its readiness state to IN-SERVICE as the component was successfully spawned. But the component's operational state is not changed to ENABLED and its readiness state is not set to IN-SERVICE, which they should be.
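The quoted su.cc fragment only resets the operational state of pre-instantiable components, which is why the NPI component is left DISABLED after the repair. A sketch of the fix direction, with the PI guard dropped so every component is enabled, using plain stand-in structs rather than the amfnd types:

```c
#include <assert.h>
#include <stdbool.h>

enum oper_state { OPER_ENABLED = 1, OPER_DISABLED = 2 };

/* Simplified stand-in for an AMF component record. */
struct comp {
    const char *name;
    bool preinstantiable; /* PI vs NPI */
    enum oper_state oper;
};

/* Repair handling in the direction the ticket suggests: clear the
 * operational state of every component, not only the PI ones.
 * (The quoted amfnd code guards this loop body with
 * m_AVND_COMP_TYPE_IS_PREINSTANTIABLE(comp), skipping NPI comps.) */
static void su_repair_enable_comps(struct comp *comps, int n)
{
    for (int i = 0; i < n; i++) {
        comps[i].oper = OPER_ENABLED;
        /* real code would also push the update to the director via
         * avnd_di_uns32_upd_send(..., saAmfCompOperState_ID, ...) */
    }
}
```

With the guard removed, the NPI component in the reproduction above would leave the repair with saAmfCompOperState=ENABLED like its PI sibling, instead of staying DISABLED.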
[tickets] [opensaf:tickets] #76 AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway
changeset: 4573:90aa577fb3ef
tag: tip
user: Nagendra Kumar <nagendr...@oracle.com>
date: Tue Oct 29 16:23:58 2013 +0530
summary: amfd: Stop cluster startup timer, if all configured nodes have joined [#76]

[staging:90aa57]

---

** [tickets:#76] AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway**

**Status:** fixed
**Created:** Mon May 13, 2013 04:25 AM UTC by Nagendra Kumar
**Last Updated:** Wed Sep 25, 2013 09:24 AM UTC
**Owner:** Nagendra Kumar

Migrated from http://devel.opensaf.org/ticket/1791

If all nodes have joined the cluster there is no need to wait for the cluster startup timer to expire.
[tickets] [opensaf:tickets] #76 AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway
- **status**: review -- fixed

---

** [tickets:#76] AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway**

**Status:** fixed
**Created:** Mon May 13, 2013 04:25 AM UTC by Nagendra Kumar
**Last Updated:** Wed Sep 25, 2013 09:24 AM UTC
**Owner:** Nagendra Kumar

Migrated from http://devel.opensaf.org/ticket/1791

If all nodes have joined the cluster there is no need to wait for the cluster startup timer to expire.
[tickets] [opensaf:tickets] #76 AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway
changeset: 4573:90aa577fb3ef
tag: tip
user: Nagendra Kumar <nagendr...@oracle.com>
date: Tue Oct 29 16:23:58 2013 +0530
summary: amfd: Stop cluster startup timer, if all configured nodes have joined [#76]

[staging:90aa57]

---

** [tickets:#76] AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway**

**Status:** fixed
**Created:** Mon May 13, 2013 04:25 AM UTC by Nagendra Kumar
**Last Updated:** Tue Oct 29, 2013 10:54 AM UTC
**Owner:** Nagendra Kumar

Migrated from http://devel.opensaf.org/ticket/1791

If all nodes have joined the cluster there is no need to wait for the cluster startup timer to expire.
[tickets] [opensaf:tickets] #40 IMM: Protect OI-handles against rapid switching of OI role.
- **status**: assigned -- unassigned
- **Milestone**: 4.4.FC -- future

---

** [tickets:#40] IMM: Protect OI-handles against rapid switching of OI role.**

**Status:** unassigned
**Created:** Tue May 07, 2013 11:31 AM UTC by Anders Bjornerstedt
**Last Updated:** Tue Sep 17, 2013 10:34 AM UTC
**Owner:** Anders Bjornerstedt

Migrated from: http://devel.opensaf.org/ticket/3092

This is related to the issues in #3072 and #3086.

Repeated fast switching of the OI role using the same OI handle can in theory result in imma library crashes due to a message being delivered to the wrong implementer. That is, implementerSet, implementerClear, implementerSet, etc. can result in a callback arriving on the handle after it has changed its role from the one the callback was intended for. This ticket could be classed as a defect. The reason we class it as an enhancement is that we have not actually seen this happen yet, and the fix is quite complex and thus not without risk.

If a role switch occurs when there are messages backlogged for the old role, but arriving at the OI in the new role, then this will cause the same symptom as the race fix in #3072. The backlogged messages may actually reside in two places: either backlogged in the incoming MDS buffer and not yet processed by the MDS thread, or backlogged in the process-internal IPC queue between the MDS thread and the application thread. The crash actually observed was in the MDS thread, which indicates an MDS backlog for the old role. We have not seen an incident of the second case, but it is sure to happen sooner or later since we have seen the first case.

I propose to fix this issue by adding a generation counter to the client_node/handle. The generation counter is incremented each time a successful reply on implementerClear is received. Messages put on the IPC queue by the MDS thread will be stamped with the generation count at the time of MDS reception.
On the other side, saImmOiDispatch() will check that the generation count of the message matches the generation count of the handle at the time of reception in the application thread. If it does not match, the message is discarded.

On the server side, the IMMND will also have a generationCount associated with the connection. It will increment the generationCount when it replies OK on an implementerClear for that connection. The IMMND, when sending an OI callback message to the client, will stamp the message with the generationCount. The MDS thread in the client receiving the message will check that the generation count of the MDS message destined for the OI matches the generationCount of the handle. If not, the message is discarded.

There should be no need for the OI to send the generationCount to the server, since the counts are incremented on both sides by the same logical event: a successful implementerClear.

There is of course the issue of ERR_TIMEOUT on implementerClear. If this happens, the only practical solution would be to mark the handle as stale and exposed. But this case should be rare.
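The proposal reduces to three moves: bump a per-handle generation count on every successful implementerClear, stamp each callback message with the count current at reception, and discard at dispatch when the stamps disagree. A minimal model with hypothetical types (the real fix would live in the imma library and IMMND, on both sides of the connection):

```c
#include <assert.h>

/* Minimal model of the proposed generation-count guard. */
struct oi_handle {
    unsigned generation; /* bumped on every OK reply to implementerClear */
};

struct callback_msg {
    unsigned generation; /* stamped at MDS reception */
    const char *payload;
};

/* Called when implementerClear returns SA_AIS_OK. */
static void on_implementer_clear_ok(struct oi_handle *h)
{
    h->generation++;
}

/* saImmOiDispatch()-style check: deliver only messages stamped for
 * the handle's current role; discard stale ones destined for a
 * previous implementer role. Returns 1 = deliver, 0 = discard. */
static int dispatch(const struct oi_handle *h, const struct callback_msg *m)
{
    if (m->generation != h->generation)
        return 0; /* stale: backlogged from before the role switch */
    return 1;
}
```

A message queued before the clear carries the old stamp, so after a clear/set cycle it is silently dropped instead of reaching the wrong implementer, which is exactly the crash scenario described above.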
[tickets] [opensaf:tickets] #19 IMM: PBE should periodically audit the imm.db file
- **assigned_to**: Zoran Milinkovic -- Anders Bjornerstedt

---

** [tickets:#19] IMM: PBE should periodically audit the imm.db file**

**Status:** assigned
**Created:** Tue May 07, 2013 08:43 AM UTC by Anders Bjornerstedt
**Last Updated:** Tue May 07, 2013 08:43 AM UTC
**Owner:** Anders Bjornerstedt

The IMM Persistent Back-End (PBE) writes transactions/CCBs incrementally to an sqlite file, imm.db. This file resides on a replicated file system. The replicated file system guards against hardware problems such as failure of the disk or of the host where the disk resides. But there is always a risk of the imm.db file being corrupted accidentally. This could be due to bugs in the PBE; or due to network partitioning of the cluster causing two PBEs to concurrently write to the same file; or accidents with the backup and restore framework; or problems with the very complex communication stack that the shared file system is (drbd, journaling, nfs, sqlite recovery).

The problem is that the imm.db file is logically a single point of failure at cluster start. If the imm.db is corrupted, for whatever reason, this may not be discovered until the critical moment when it is needed for a cluster restart.

This enhancement proposes that the PBE shall have some form of periodic audit of the existing imm.db file. One possibility is for the PBE to periodically copy the imm.db file to a local tmp directory. During the copy the PBE will buffer/delay the regular user requests (CCBs, PRTA updates). As soon as the copy has been made, a pseudo-loading will be invoked using the copy of imm.db. In essence the immloader is invoked such that it reads the imm.db exactly the way it does during loading, but does not try to actually load anything towards the immsv.

Note that this level of audit will only catch consistency problems in the PBE/sqlite representation of the imm data. Loading may fail on higher levels, by failing checks inside the immsv or applications (failing validation by OIs).
The point of this is to discover an inconsistency earlier, when the problem has hopefully not yet impacted the executing cluster. If a problem is detected, the PBE will restart and generate a new version of the imm.db file.

Migrated from: http://devel.opensaf.org/ticket/2451

The audit could actually verify snapshot value equality between the sqlite representation in the PBE and the in-memory representation in the immsv. By initializing an iterator towards the immsv during the short stop period for mutations enforced during the file copy, the iterator will take a snapshot of the in-memory representation. That snapshot should reflect all committed CCBs and PRTA updates. The same values should be committed to the PBE representation.

- http://list.opensaf.org/pipermail/devel/2012-February/021139.html

The fix for this enhancement should be based on an improvement of verifyPbeState(..) in imm_pbe_dump.cc. That function is executed each time the PBE re-attaches to the imm.db file. Currently it is very weak. It should ideally verify the state of all persistent objects both ways: all objects that exist in the imm.db must exist in the imm and have the same state, and all persistent objects that exist in the imm must exist in the imm.db file and have the same state.

This same function could be periodically invoked by the immnd-coord using an admin-op towards the PBE. This should only be done during periods when there is a lull in persistence traffic. The frequency can be quite low, but could also be increased in relation to write traffic.

Finally, there is a point in closing and re-opening the imm.db file before performing the verification. This protects against accidental removal of the file (inode).
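The audit cycle described above (copy imm.db aside, pseudo-load the copy, regenerate the file on failure) could be skeletonized as follows. Both helpers are stubs standing in for the real file copy and for running the immloader in verify-only mode; all names here are illustrative, not OpenSAF functions:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical skeleton of the periodic PBE audit. A real
 * implementation would copy imm.db to a local tmp directory while
 * briefly buffering CCBs/PRTA updates, then run the immloader
 * against the copy without loading anything towards the immsv. */

static bool copy_to_tmp(const char *db, char *tmp, size_t n)
{
    /* stub: pretend the file copy to a local tmp dir succeeded */
    snprintf(tmp, n, "/tmp/%s.audit", db);
    return true;
}

static bool pseudo_load_ok(const char *tmp_copy)
{
    /* stub: a real audit parses the sqlite file exactly as the
     * immloader does at cluster start; here any file name containing
     * "corrupt" is treated as failing the parse */
    return strstr(tmp_copy, "corrupt") == NULL;
}

/* Returns true if imm.db passed the audit; on false the PBE would
 * restart and regenerate imm.db from the in-memory state. */
static bool pbe_audit(const char *db)
{
    char tmp[256];
    if (!copy_to_tmp(db, tmp, sizeof(tmp)))
        return false;
    return pseudo_load_ok(tmp);
}
```

The design point is that the expensive verification runs against a throwaway copy, so user traffic is only blocked for the duration of the copy, not the pseudo-load.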
Or, if this is a mailing list, you can unsubscribe from the mailing list.

--
Android is increasing in popularity, but the open development platform that developers love is also attractive to malware creators. Download this white paper to learn more about secure code signing practices that can help keep Android apps secure. http://pubads.g.doubleclick.net/gampad/clk?id=65839951iu=/4140/ostg.clktrk

___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
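The two-way check proposed for verifyPbeState(..) above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the actual OpenSAF code: `struct obj` with name/state pairs and the function names are hypothetical stand-ins for persistent IMM objects on one side and sqlite rows of imm.db on the other.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one persistent object: the real code would
 * compare full attribute state, not a single string. */
struct obj {
    const char *name;   /* e.g. an IMM DN */
    const char *state;  /* placeholder for the object's attribute state */
};

/* Linear lookup by name; illustrative only. */
static const struct obj *find_obj(const struct obj *set, size_t n,
                                  const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i].name, name) == 0)
            return &set[i];
    return NULL;
}

/* Returns 1 iff every object in 'db' exists in 'imm' with equal state
 * AND every object in 'imm' exists in 'db' with equal state -- the
 * "both ways" verification the ticket asks for. */
static int verify_both_ways(const struct obj *db, size_t ndb,
                            const struct obj *imm, size_t nimm)
{
    for (size_t i = 0; i < ndb; i++) {
        const struct obj *o = find_obj(imm, nimm, db[i].name);
        if (o == NULL || strcmp(o->state, db[i].state) != 0)
            return 0; /* in imm.db but missing/diverged in imm */
    }
    for (size_t i = 0; i < nimm; i++) {
        const struct obj *o = find_obj(db, ndb, imm[i].name);
        if (o == NULL || strcmp(o->state, imm[i].state) != 0)
            return 0; /* persistent in imm but missing/diverged in imm.db */
    }
    return 1;
}
```

The one-directional check only catches objects that are wrong or missing on one side; doing both passes is what catches stale extra rows in imm.db as well.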
[tickets] [opensaf:tickets] #605 immomtest: os_defs.c:447: ncs_os_lock: Assertion `0' failed.
- **status**: review -- fixed - **assigned_to**: Hans Feldt -- nobody --- ** [tickets:#605] immomtest: os_defs.c:447: ncs_os_lock: Assertion `0' failed.** **Status:** fixed **Created:** Wed Oct 23, 2013 11:33 AM UTC by Hans Feldt **Last Updated:** Wed Oct 23, 2013 01:16 PM UTC **Owner:** nobody

This happens from time to time when executing immomtest. Code in question:

    case NCS_OS_LOCK_UNLOCK:
        if ((rc = pthread_mutex_unlock(&lock->lock)) != 0) { /* unlock for all tasks */
            assert(0);
            return (NCSCC_RC_FAILURE);
        }
        break;

As a first step we could change this piece of code to use osaf_mutex_unlock_ordie instead, which will log the return code before abort. That could take us closer to the real problem. In imma (and other libraries) there is a mix of direct pthread calls and NCS macros. We should probably align the code base to use the new utility functions instead.
[tickets] [opensaf:tickets] #605 immomtest: os_defs.c:447: ncs_os_lock: Assertion `0' failed.
changeset:   4574:b424e8c4e643
branch:      opensaf-4.2.x
parent:      4569:dc118b8f1a07
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:24:35 2013 +0100
summary:     leap: change ncs_os_lock to syslog before abort [#605]

changeset:   4575:03c46e565f76
branch:      opensaf-4.3.x
parent:      4570:b350c2a0377e
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:30:55 2013 +0100
summary:     leap: change ncs_os_lock to syslog before abort [#605]

changeset:   4576:5a7622688312
tag:         tip
parent:      4573:90aa577fb3ef
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:30:55 2013 +0100
summary:     leap: change ncs_os_lock to syslog before abort [#605]

--- ** [tickets:#605] immomtest: os_defs.c:447: ncs_os_lock: Assertion `0' failed.** **Status:** fixed **Created:** Wed Oct 23, 2013 11:33 AM UTC by Hans Feldt **Last Updated:** Wed Oct 23, 2013 01:16 PM UTC **Owner:** nobody
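The first step suggested in the ticket, replacing the bare `assert(0)` with an "unlock or die" helper that records the actual failure code before aborting, might look roughly like this. This is a hedged sketch that logs to stderr; it is not the real osaf_mutex_unlock_ordie implementation, which per the changeset summaries above logs via syslog.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of an osaf_mutex_unlock_ordie-style helper: on failure, log
 * the return code from pthread_mutex_unlock (EPERM, EINVAL, ...) before
 * aborting, instead of a bare assert(0) that discards the diagnostic.
 * The real OpenSAF helper would log to syslog rather than stderr. */
static void mutex_unlock_ordie(pthread_mutex_t *mutex)
{
    int rc = pthread_mutex_unlock(mutex);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutex_unlock failed: %d (%s)\n",
                rc, strerror(rc));
        abort();
    }
}
```

With this in place, the next time the immomtest failure reproduces, the log would show whether the unlock failed with, say, EPERM (unlocking a mutex owned by another thread) rather than just an anonymous assertion.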
[tickets] [opensaf:tickets] #570 AMF: ava down event ignored in TERMINATING state
changeset:   4577:07e88841d268
branch:      opensaf-4.2.x
parent:      4574:b424e8c4e643
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:57:47 2013 +0100
summary:     amf: change saAmfResponse to be synchronous in TerminateCallback context [#570]

changeset:   4578:19f8ca3ee0e3
branch:      opensaf-4.3.x
parent:      4575:03c46e565f76
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:57:47 2013 +0100
summary:     amf: change saAmfResponse to be synchronous in TerminateCallback context [#570]

changeset:   4579:b26e27059891
tag:         tip
parent:      4576:5a7622688312
user:        Hans Feldt hans.fe...@ericsson.com
date:        Wed Oct 30 06:55:31 2013 +0100
summary:     amf: change saAmfResponse to be synchronous in TerminateCallback context [#570]

--- ** [tickets:#570] AMF: ava down event ignored in TERMINATING state** **Status:** fixed **Created:** Wed Sep 18, 2013 06:45 AM UTC by Hans Feldt **Last Updated:** Wed Oct 16, 2013 07:51 AM UTC **Owner:** nobody

If a component crashes or exits in the context of the terminate callback, AMF will not use the ava down event to trigger cleanup and finish component termination. Instead the CallbackTimeout is awaited, which can be very long. This is a problem if it happens during an upgrade: it will cause the upgrade to fail, potentially leading to a system restore.

Ignoring the event (in avnd_err.c) was added in:

changeset:   1646:92e6e65eefc0
user:        Nagendra Kumar nku...@emerson.com
date:        Thu Aug 26 18:43:31 2010 +0530
summary:     Ticket 1433: Allowing dynamic configuration changes for AMF logical entities

for reasons unclear. This has to be revisited. As far as I can tell there is no race between the ava down event and the response message. Normally a process calls saAmfResponse(OK) and then exit(0). saAmfResponse(OK) under the hood sends a message, and at least TIPC will do run-to-completion, meaning it will post it to the receiver's (amfnd) socket receive buffer. After that the process exits, and a topology event is created and written to another socket receive buffer (the MDS lib for amfnd). Eventually the MDS thread in amfnd context will receive both messages and write them to the amfnd mailbox with the same prio.
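The no-race argument above rests on one property: messages posted to the amfnd mailbox with the same priority are dequeued in FIFO order, so the saAmfResponse message is always processed before the ava down event that follows it. A toy mailbox illustrating that property (not the real MDS/amfnd mailbox; the struct and function names are invented for this sketch):

```c
#include <stdlib.h>

/* Toy priority mailbox: ordered by priority, FIFO within a priority.
 * This models the property the ticket relies on -- the "response"
 * message and the later "ava down" event arrive with the same prio,
 * so the response is always dequeued first. Names are hypothetical. */
struct msg {
    int prio;          /* lower value = higher priority */
    const char *what;  /* payload stand-in */
    struct msg *next;
};

struct mailbox {
    struct msg *head;
};

static void mbox_post(struct mailbox *mb, int prio, const char *what)
{
    struct msg *m = malloc(sizeof *m);
    m->prio = prio;
    m->what = what;
    m->next = NULL;
    /* Walk past every message of equal or higher priority, so equal
     * priorities keep their arrival (FIFO) order. */
    struct msg **pp = &mb->head;
    while (*pp != NULL && (*pp)->prio <= prio)
        pp = &(*pp)->next;
    m->next = *pp;
    *pp = m;
}

static const char *mbox_take(struct mailbox *mb)
{
    struct msg *m = mb->head;
    if (m == NULL)
        return NULL;
    mb->head = m->next;
    const char *what = m->what;
    free(m);
    return what;
}
```

If the down event were instead posted at a higher priority, it could overtake the pending response; with equal priority, FIFO order makes the ordering deterministic, which is why ignoring the down event in TERMINATING state is unnecessary.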