[tickets] [opensaf:tickets] #203 avsv: SG went to unstable state when active SU is locked after adding new SI in NWay RM

2013-10-29 Thread Praveen
Attached 203.tgz contains:
- 203.xml: configuration to reproduce the issue.
- new_si_csi_add.sh: script to add a new SI.
- Traces.

Steps to reproduce:
1) immcfg -f 203.xml
2) Unlock-in and unlock SU1 and SU2.
3) ./new_si_csi_add.sh
4) Lock SU1.
5) Stop OpenSAF on the payload hosting SU2.



---

** [tickets:#203] avsv: SG went to unstable state when active SU is locked 
after adding new SI in NWay RM**

**Status:** assigned
**Created:** Wed May 15, 2013 04:32 AM UTC by Praveen
**Last Updated:** Fri Sep 06, 2013 01:14 PM UTC
**Owner:** Praveen

The issue is observed on SLES 64bit VMs.
 

Configuration:
 NWay redundancy model with 2 SUs, 2 SIs and 2 CSIs. PBE is enabled and OpenSAF
runs as the root user.
 

A new SI is added and then the active SU is locked. The following messages are
seen in the syslog:
 

Oct 6 19:24:39 SLES-SLOT-1 osafamfnd[3703]: Assigning 
'safSi=d_NWay_1Norm_3,safApp=N' ACTIVE to 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:39 SLES-SLOT-1 osafamfnd[3703]: Assigned 
'safSi=d_NWay_1Norm_3,safApp=N' ACTIVE to 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigning 
'safSi=d_NWay_1Norm_1,safApp=N' QUIESCED to 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigned 
'safSi=d_NWay_1Norm_1,safApp=N' QUIESCED to 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigning 
'safSi=d_NWay_1Norm_3,safApp=N' QUIESCED to 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Assigned 
'safSi=d_NWay_1Norm_3,safApp=N' QUIESCED to 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Removing 
'safSi=d_NWay_1Norm_3,safApp=N' from 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Removed 
'safSi=d_NWay_1Norm_3,safApp=N' from 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfnd[3703]: Removing 'all SIs' from 
'safSu=d_NWay_1Norm_1,safSg=SG_d_n,safApp=N'
 Oct 6 19:24:42 SLES-SLOT-1 osafamfd[3693]: SG state is not stable
 Oct 6 19:24:43 SLES-SLOT-1 osafamfd[3693]: SG state is not stable
 Oct 6 19:24:44 SLES-SLOT-1 osafamfd[3693]: SG state is not stable
 

Further operations failed since the SG was not stable. When PL-4, which was
hosting the active SU, was brought down, amfd on the active controller crashed,
leading to a reboot of the node. The following messages are seen in the syslog:
 Oct 6 19:43:59 SLES-SLOT-1 osafamfd[3693]: Node 'PL-4' left the cluster
 Oct 6 19:44:00 SLES-SLOT-1 osafamfd[3693]: avd_su.c:1585: 
avd_su_dec_curr_stdby_si: Assertion 'su->saAmfSUNumCurrStandbySIs > 0' failed.
 Oct 6 19:44:00 SLES-SLOT-1 osafamfnd[3703]: AMF director unexpectedly crashed
 Oct 6 19:44:00 SLES-SLOT-1 osafamfnd[3703]: Rebooting OpenSAF NodeId = 131343 
EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received
 Oct 6 19:44:00 SLES-SLOT-1 osafimmnd[3628]: Implementer locally disconnected. 
Marking it as doomed 3 <17, 2010f> (safAmfService)
 Oct 6 19:44:00 SLES-SLOT-1 osafimmnd[3628]: Implementer disconnected 3 <17, 
2010f> (safAmfService)
 Oct 6 19:44:00 SLES-SLOT-1 opensaf_reboot: Rebooting local node
 

Backtrace from the core file:
 Core was generated by `/usr/lib64/opensaf/osafamfd --tracemask=0x'.
 Program terminated with signal 6, Aborted.
 #0 0x7f457decd645 in raise () from /lib64/libc.so.6
 (gdb) bt
 #0 0x7f457decd645 in raise () from /lib64/libc.so.6
 #1 0x7f457decec33 in abort () from /lib64/libc.so.6
 #2 0x7f457f4df095 in osafassert_fail (file=0x4ac5e5 "avd_su.c", line=1585, 
func=0x4ad590 "avd_su_dec_curr_stdby_si", 
assertion=0x4ad5b0 "su->saAmfSUNumCurrStandbySIs > 0") at sysf_def.c:399
#3 0x0048964f in avd_su_dec_curr_stdby_si (su=0x727f70) at avd_su.c:1585
 #4 0x0048b244 in avd_susi_update_assignment_counters (susi=0x767bf0, 
action=AVSV_SUSI_ACT_DEL, current_ha_state=0, new_ha_state=0)
at avd_siass.c:730
#5 0x0048aff7 in avd_susi_del_send (susi=0x767bf0) at avd_siass.c:663
 #6 0x00474bbc in avd_sg_nway_node_fail_stable (cb=0x6bdbe0, 
su=0x732130, susi=0x0) at avd_sgNWayfsm.c:3191
 #7 0x00476257 in avd_sg_nway_node_fail_sg_realign (cb=0x6bdbe0, 
su=0x732130) at avd_sgNWayfsm.c:3645
 #8 0x0046c82c in avd_sg_nway_node_fail_func (cb=0x6bdbe0, su=0x732130) 
at avd_sgNWayfsm.c:657
 #9 0x0047ad65 in avd_node_susi_fail_func (cb=0x6bdbe0, avnd=0x6fef50) 
at avd_sgproc.c:2126
 #10 0x00434f72 in avd_node_failover (node=0x6fef50) at avd_ndproc.c:776
 #11 0x00431a80 in avd_mds_avnd_down_evh (cb=0x6bdbe0, 
evt=0x7f4578000ae0) at avd_ndfsm.c:407
 #12 0x0043b57e in avd_process_event (cb_now=0x6bdbe0, 
evt=0x7f4578000ae0) at avd_proc.c:589
 #13 0x0043b305 in avd_main_proc () at avd_proc.c:505
 #14 0x00409210 in main (argc=2, argv=0x7fff87968c08) at amfd_main.c:47
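For context, the failed assertion is the underflow guard on the SU's
standby-assignment counter. A minimal sketch of the invariant (the names are
simplified stand-ins, not the actual OpenSAF structures):

```c
#include <assert.h>

/* Simplified stand-in for the AMF SU record (the real one is in avd_su.h). */
struct avd_su {
    unsigned saAmfSUNumCurrStandbySIs;
};

/* Mirrors the shape of avd_su_dec_curr_stdby_si(): the counter must be
 * positive before the decrement, otherwise the director aborts.  The crash
 * above means a SUSI delete was counted against an SU whose standby
 * counter was already zero after the earlier lock/realign sequence. */
static void su_dec_curr_stdby_si(struct avd_su *su)
{
    assert(su->saAmfSUNumCurrStandbySIs > 0);
    su->saAmfSUNumCurrStandbySIs--;
}
```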
 

Traces from the active controller are attached.


[tickets] [opensaf:tickets] #203 avsv: SG went to unstable state when active SU is locked after adding new SI in NWay RM

2013-10-29 Thread Praveen
Attached traces and configuration.


Attachment: 203.tgz (628.1 kB; application/x-compressed) 



Changed 7 months ago by nagendra 
Can you please test it on 4.2.2? I suspect that 2832 may solve the issue, 
as there have been many CSI add/del operations before this issue occurred.
 
Changed 7 months ago by allasirisha 
This 

[tickets] [opensaf:tickets] #152 LOG: remove dependency to shared file system

2013-10-29 Thread elunlen
- **status**: unassigned -- accepted
- **Milestone**: future -- 4.4.FC



---

** [tickets:#152] LOG: remove dependency to shared file system**

**Status:** accepted
**Created:** Mon May 13, 2013 11:51 AM UTC by elunlen
**Last Updated:** Mon May 13, 2013 11:51 AM UTC
**Owner:** elunlen

Purpose: Make OpenSAF not dependent on a shared filesystem.

Benefit: improved performance and robustness. OpenSAF will be able to control a 
shared file system like DRBD.

Drawback: hard to follow the log when there are holes of missing log records on 
one controller that exist on the other controller.

Suggestion:
* Have the same file names on both controllers.
* Files could be smaller than the rotation size on one or both controllers if 
down time has occurred.
* No configuration: "no shared file system" will be the only option.

Implementation:
* Log records need to be forwarded to the standby.
* Both the active and the standby LOG server write to their local files.

Migrated from devel.opensaf.org #2416


---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.--
Android is increasing in popularity, but the open development platform that
developers love is also attractive to malware creators. Download this white
paper to learn more about secure code signing practices that can help keep
Android apps secure.
http://pubads.g.doubleclick.net/gampad/clk?id=65839951iu=/4140/ostg.clktrk___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] Re: #528 order of service not guaranteed during failover

2013-10-29 Thread Anders Bjornerstedt
The cause of the problem is that the discard_node message from the active 
IMMD to all IMMNDs over FEVS (MDS broadcast)
has not reached the IMMND at the new active, where the CLMD is trying to set 
implementer.
This is a timing issue due to the non-homogeneous communication mechanisms in 
OpenSAF.
This can be fixed in various ways.

1) The IMMD could postpone its reply to the new-active order from AMF until 
the discard_node has reached the local IMMND and
the local IMMND has sent a confirm message back to the IMMD. This is quite 
a complicated solution and will in general slow
down failover a bit.

2) The AMFD at the new active would itself set implementer for AMF, and 
postpone invoking new-active on the other directors
until it has succeeded. The AMFD would then need to cope with 
getting ERR_EXIST and treat it the same way it treats TRY_AGAIN
in this particular context.

3) If for some reason it is preferable for the AMFD implementation to do its 
implementer set later, it could invoke new-active on one
chosen service before the others (e.g. CLMD) and have that service set 
implementer, coping with ERR_EXIST in the new-active
context, before replying to the AMFD.

4) All services could treat getting ERR_EXIST on implementerSet in the context 
of failover as getting TRY_AGAIN.

I would recommend (2) or (3) as the (from my perspective) simplest solutions.

In general, it is safe to treat ERR_EXIST (or any error that has the semantics 
of "nothing was done") as TRY_AGAIN.
That is, nothing bad can happen simply because the request is tried again.
Of course, the request may be futile in most other contexts. When there is 
another healthy OI occupying the implementer name,
the retry loop would run all the way to completion, making the whole retry 
exercise both pointless and delaying other meaningful
tasks. But in the particular case where a service knows "this is a 
failover and I am the new active", that service also
knows that retrying ERR_EXIST on implementerSet will not be futile unless 
something is seriously wrong with the cluster.
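A sketch of option (4), the ERR_EXIST-to-TRY_AGAIN mapping during failover. The
error values and the stubbed implementerSet are illustrative stand-ins, not the
real saImmOi API:

```c
/* Illustrative stand-ins for the AIS error codes from saAis.h. */
typedef enum {
    SA_AIS_OK = 1,
    SA_AIS_ERR_TRY_AGAIN = 6,
    SA_AIS_ERR_EXIST = 14
} SaAisErrorT;

/* Stub: the doomed old implementer still occupies the name for the first
 * two attempts, until the discard_node reaches the local IMMND. */
static SaAisErrorT implementer_set_stub(void)
{
    static int calls = 0;
    return (++calls < 3) ? SA_AIS_ERR_EXIST : SA_AIS_OK;
}

/* At new-active the service knows the old implementer is doomed, so
 * ERR_EXIST can safely be folded into the normal TRY_AGAIN loop. */
static SaAisErrorT implementer_set_retry(int in_failover)
{
    SaAisErrorT rc;
    do {
        rc = implementer_set_stub();
        if (in_failover && rc == SA_AIS_ERR_EXIST)
            rc = SA_AIS_ERR_TRY_AGAIN;
        /* a real loop would sleep between attempts and bound the retries */
    } while (rc == SA_AIS_ERR_TRY_AGAIN);
    return rc;
}
```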

/AndersBj


From: Mathi Naickan [mailto:mathi-naic...@users.sf.net]
Sent: den 28 oktober 2013 17:34
To: [opensaf:tickets]
Subject: [tickets] [opensaf:tickets] #528 order of service not guaranteed 
during failover


Well, before proceeding any further, there are questions to answer, to identify 
a possibly not-yet-uncovered real root cause (that would also help in 
prioritizing this ticket).

1) The log snippet in this ticket indicates that IMM was notified of the 
failover before the implementerClear reached IMM.
So, when IMM has already received the failover indication, the IMM 
implementation ought to be able to handle the implementer clear.
Why has that not happened? Is it because IMM waits for IMMA down? If so, is it 
because IMMA down has not yet reached this IMMND?

This is not a situation where I would expect the IMM clients to be shown an 
ERR_EXIST.

2) Are #528 and #599 really the same scenario? Or
is it that in the case of #599 the IMM had not received the failover indication 
before the implementerSet? In that case we could call this scenario born out 
of a timing delay created somewhere in the stack...

Questions apart, I am also thinking about how the dependencies (instantiation 
and CSI dependency) among the middleware components are set up, or can be 
changed, such that IMM is ready before a CSI set is delivered to IMM clients 
(middleware components).



[tickets:#528] http://sourceforge.net/p/opensaf/tickets/528/ order of service 
not guaranteed during failover

Status: unassigned
Created: Tue Jul 30, 2013 09:43 AM UTC by Sirisha Alla
Last Updated: Mon Oct 21, 2013 07:33 AM UTC
Owner: nobody

The issue is seen on changeset 4325 on SLES 4-node VMs.

SC-1 is active, SC-2 is standby. Failover is triggered by killing FMD on SC-1.

Jul 29 21:28:18 SLES-64BIT-SLOT1 root: killing osaffmd from invoke_failover.sh
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: NO 
'safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : 
Recovery is 'nodeFailfast'
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: ER 
safComp=FMS,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery 
is:nodeFailfast
Jul 29 21:28:18 SLES-64BIT-SLOT1 osafamfnd[2448]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131343, SupervisionTime = 60
Jul 29 21:28:18 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; 
timeout=60

SC-2 tried becoming active but failed since CLMD reported ERR_EXIST on 
implementer set. The reason is that the IMMND had not yet disconnected the old 
implementer on 2010f. The following syslog shows the sequence.

Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408188] TIPC: Resetting link 
1.1.2:eth1-1.1.1:eth0, peer not responding
Jul 29 21:28:31 SLES-64BIT-SLOT2 kernel: [ 101.408194] TIPC: Lost link 

[tickets] [opensaf:tickets] #182 operational state of NPI component is not cleared to ENABLED when SU admin repaired is performed

2013-10-29 Thread Praveen
This issue is reproducible on changeset 4565:e8ae1895d8e3.
Attached 182.tgz contains Traces and 182.xml (reproducible configuration).

Steps to reproduce and observation:
1) immcfg -f 182.xml
2) amf-adm unlock-in and unlock safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
   error: saImmOmAdminOperationInvoke_2 admin-op RETURNED: SA_AIS_ERR_TIMEOUT 
(5)

States of components after unlock operation:
safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=DISABLED(2)
saAmfCompPresenceState=INSTANTIATION-FAILED(6)
saAmfCompReadinessState=OUT-OF-SERVICE(1)
safComp=AmfDemo1,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=ENABLED(1)
saAmfCompPresenceState=UNINSTANTIATED(1)
saAmfCompReadinessState=OUT-OF-SERVICE(1)

Now correct the instantiation script and perform repair:
3) amf-adm repaired safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
Component Process :
root 17281 1  0 15:15 ?00:00:00 /opt/amf_demo/npi/amf_comp_npi
root 17289 1  0 15:15 ?00:00:00 /opt/amf_demo/npi/amf_comp_npi

safComp=AmfDemo,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=DISABLED(2)
saAmfCompPresenceState=INSTANTIATED(3)
saAmfCompReadinessState=OUT-OF-SERVICE(1)
safComp=AmfDemo1,safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfCompOperState=ENABLED(1)
saAmfCompPresenceState=INSTANTIATED(3)
saAmfCompReadinessState=IN-SERVICE(2)

amf-state si all:
safSi=AmfDemo,safApp=AmfDemo_NPI
saAmfSIAdminState=UNLOCKED(1)
saAmfSIAssignmentState=PARTIALLY_ASSIGNED(3)

safSu=SU1,safSg=AmfDemo,safApp=AmfDemo_NPI
saAmfSUAdminState=UNLOCKED(1)
saAmfSUOperState=ENABLED(1)
saAmfSUPresenceState=INSTANTIATED(3)
saAmfSUReadinessState=IN-SERVICE(2)


At AMFND, during the repair operation request, only PI components are enabled, 
in amfnd/su.cc:

if (m_AVND_COMP_TYPE_IS_PREINSTANTIABLE(comp)) {
        m_AVND_COMP_OPER_STATE_SET(comp, SA_AMF_OPERATIONAL_ENABLED);
        avnd_di_uns32_upd_send(AVSV_SA_AMF_COMP, saAmfCompOperState_ID,
                               &comp->name, comp->oper);
}
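A hedged sketch of what the fix could look like: drop the PI-only guard so that
repair clears the operational state of every component of the SU. The names are
simplified stand-ins for the amfnd code, not the actual patch:

```c
/* Simplified component record; in amfnd this is the avnd_comp structure. */
struct comp {
    int is_preinstantiable;  /* PI vs NPI */
    int oper_enabled;
};

/* Repair handler sketch: enable every component of the repaired SU,
 * not only the pre-instantiable ones. */
static void su_repair_enable_comps(struct comp *comps, int n)
{
    for (int i = 0; i < n; i++) {
        comps[i].oper_enabled = 1;
        /* the real fix would also push the new oper state to the
         * director (avnd_di_uns32_upd_send) for PI and NPI alike */
    }
}
```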




---

** [tickets:#182] operational state of NPI component is not cleared to ENABLED 
when SU admin repaired is performed**

**Status:** assigned
**Created:** Tue May 14, 2013 06:56 AM UTC by Nagendra Kumar
**Last Updated:** Fri Sep 06, 2013 01:17 PM UTC
**Owner:** Praveen

Migrated from http://devel.opensaf.org/ticket/2168

1. Brought the NPI component to the instantiation-failed state. The SU and 
component moved to the instantiation-failed state and the operational state was 
set to disabled.


2. After appropriate corrective action was taken, the admin repaired operation was performed on the SU.


3. The SU's presence state moved to INSTANTIATED, its operational state to 
ENABLED, and its readiness state to IN_SERVICE, as the component was 
successfully spawned.


But the component's operational state is not changed to ENABLED and its 
readiness state is not set to IN_SERVICE; they should be set accordingly.







[tickets] [opensaf:tickets] #76 AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway

2013-10-29 Thread Nagendra Kumar
changeset:   4573:90aa577fb3ef
tag:         tip
user:        Nagendra Kumar <nagendr...@oracle.com>
date:        Tue Oct 29 16:23:58 2013 +0530
summary:     amfd: Stop cluster startup timer, if all configured nodes have 
joined [#76]

[staging:90aa57]




---

** [tickets:#76] AMF: if all nodes have joined the cluster, cancel the 
assignment timer and assign anyway**

**Status:** fixed
**Created:** Mon May 13, 2013 04:25 AM UTC by Nagendra Kumar
**Last Updated:** Wed Sep 25, 2013 09:24 AM UTC
**Owner:** Nagendra Kumar

Migrated from http://devel.opensaf.org/ticket/1791

If all nodes have joined the cluster there is no need to wait for the cluster 
startup timer to expire.







[tickets] [opensaf:tickets] #76 AMF: if all nodes have joined the cluster, cancel the assignment timer and assign anyway

2013-10-29 Thread Nagendra Kumar
- **status**: review -- fixed







[tickets] [opensaf:tickets] #40 IMM: Protect OI-handles against rapid switching of OI role.

2013-10-29 Thread Anders Bjornerstedt
- **status**: assigned -- unassigned
- **Milestone**: 4.4.FC -- future



---

** [tickets:#40] IMM: Protect OI-handles against rapid switching of OI role.**

**Status:** unassigned
**Created:** Tue May 07, 2013 11:31 AM UTC by Anders Bjornerstedt
**Last Updated:** Tue Sep 17, 2013 10:34 AM UTC
**Owner:** Anders Bjornerstedt

Migrated from:
http://devel.opensaf.org/ticket/3092
--
This is related to the issues in #3072 and #3086.

Repeated fast switching of OI role using the same OI handle can
in theory result in imma library crashes due to a message being
delivered to the wrong implementer. That is, implementerSet,
implementerClear, implementerSet, etc. can result in a callback
arriving on the handle after it has changed its role from the one
the callback was intended for.

This ticket could be classed as a defect. The reason we class it
as an enhancement is that we have not actually seen this happen
yet, and the fix is quite complex and thus not without risk.

If a role switch occurs when there are messages backlogged that are
destined for the old role but arrive at the OI in its new role, then
this will cause the same symptom as the race addressed by the fix in #3072.

The backlogged messages may actually reside in two places:
either backlogged in the incoming MDS buffer and not yet
processed by the MDS thread, or backlogged in the process-internal
IPC queue between the MDS thread and the application thread.

The crash actually observed was in the MDS thread, which indicates
an MDS backlog for the old role. We have not seen an incident of the
second case, but it is sure to happen sooner or later, since
we have seen the first case.

I propose to fix this issue by adding a generation counter to
the client_node/handle. The generation counter is incremented
each time a successful reply on implementerClear is received.

Messages put to the IPC queue by the MDS thread will be stamped
with the generation count at the time of MDS reception. On
the other side, saImmOiDispatch() will check that the generation
count of the message matches the generation count of the handle
at the time of reception in the application thread. If it does not
match then the message is discarded.

On the server side, the IMMND will also have a generationCount
associated with the connection. It will increment the
generationCount when it is replying OK on an implementerClear
for that connection.

The IMMND sending an OI callback message to the client will stamp
the message with the generationCount. The MDS thread in the client
receiving the message will check that the generation count of
the MDS message destined for OI has the same generation count
as the generationCount of the handle. If not, the message is
discarded.

There should be no need for the OI to send the generationCount
to the server, since the counts are incremented on both sides
by the same logical event: a successful implementerClear.

There is of course the issue of ERR_TIMEOUT on implementerClear.
If this happens, the only practical solution would be to
mark the handle as stale and exposed. But this case should be rare.
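The proposal above can be condensed into a small sketch. The names are
illustrative, not the imma/IMMND code; only the stamping and the dispatch-side
comparison matter:

```c
/* Both the OI handle and every queued callback message carry a
 * generation count; the count is bumped on each successful
 * implementerClear, on the client and server side alike. */
struct oi_handle { unsigned generation; };
struct oi_msg    { unsigned generation; int callback_id; };

/* Called when an OK reply on implementerClear is received:
 * all callbacks stamped with the old generation become stale. */
static void on_implementer_clear_ok(struct oi_handle *h)
{
    h->generation++;
}

/* saImmOiDispatch()-side check: deliver only messages whose stamp
 * matches the handle's current generation; discard old-role callbacks. */
static int should_deliver(const struct oi_handle *h, const struct oi_msg *m)
{
    return m->generation == h->generation;
}
```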





[tickets] [opensaf:tickets] #19 IMM: PBE should periodically audit the imm.db file

2013-10-29 Thread Anders Bjornerstedt
- **assigned_to**: Zoran Milinkovic -- Anders Bjornerstedt



---

** [tickets:#19] IMM: PBE should periodically audit the imm.db file**

**Status:** assigned
**Created:** Tue May 07, 2013 08:43 AM UTC by Anders Bjornerstedt
**Last Updated:** Tue May 07, 2013 08:43 AM UTC
**Owner:** Anders Bjornerstedt

The IMM Persistent Back-End writes transactions/CCBs incrementally
to an sqlite file, imm.db. This file resides on a replicated file
system. The replicated file system guards against hardware problems
such as failure of the disk or of the host where the disk resides.

But there is always a risk of the imm.db file being corrupted
accidentally. This could be due to bugs in the PBE; or
network partitioning of the cluster causing two PBEs to
write concurrently to the same file; or accidents with the
backup and restore framework; or problems with the very complex
communication stack that the shared filesystem is (drbd,
journaling, nfs, sqlite recovery).

The problem is that the imm.db file is logically a single
point of failure at cluster start.

If the imm.db is corrupted, for whatever reason, this may not be
discovered until the critical moment when it is needed for a
cluster restart.

This enhancement proposes that the PBE have some form
of periodic audit of the existing imm.db file.

One possibility is for the PBE to periodically copy the imm.db
file to a local tmp directory. During the copy the PBE will
buffer and delay the regular user requests (CCBs and PRTA updates).
As soon as the copy has been made, a pseudo-loading will
be invoked using the copy of imm.db. In essence the immloader
is invoked such that it reads the imm.db in exactly the way
it does during loading, but does not actually try to load
anything towards the immsv.

Note that this level of audit will only catch consistency problems
in the PBE/sqlite representation of the imm data.
Loading may fail at higher levels, through failing checks inside
the immsv or applications (failing validation by OIs).

The point of this is to discover an inconsistency earlier,
when the problem has hopefully not impacted the executing
cluster. If a problem is detected, the PBE will restart
and generate a new version of the imm.db file.

Migrated from:
http://devel.opensaf.org/ticket/2451
--


The audit could actually verify snapshot value equality between the sqlite 
representation
in the PBE and the in-memory representation in the immsv. By initializing an 
iterator
towards the immsv during the short stop period for mutations enforced during the
file copy, the iterator will take a snapshot of the in-memory representation.

That snapshot should reflect all committed CCBs and PRTA updates. The same 
values
should be committed to the PBE representation.
-
 http://list.opensaf.org/pipermail/devel/2012-February/021139.html
-
The fix for this enhancement should be based on an improvement of
verifyPbeState(..) in imm_pbe_dump.cc.
That function is executed each time the PBE re-attaches to the imm.db
file. Currently it is very weak. It should ideally verify the state of
all persistent objects both ways: all objects that exist in the imm.db
file must exist in the imm and have the same state; and all persistent
objects that exist in the imm must exist in the imm.db file and have
the same state.

This same function could be periodically invoked by the immnd-coord
using an admin-op towards the PBE. This should only be done during
periods when there is a lull in persistence traffic. The frequency can
be quite low, but could also be increased in relation to write traffic.

Finally, there is a point in closing and re-opening the imm.db file
before performing the verification, to protect against accidental
removal of the file (inode).
-



---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets


[tickets] [opensaf:tickets] #605 immomtest: os_defs.c:447: ncs_os_lock: Assertion `0' failed.

2013-10-29 Thread Hans Feldt
- **status**: review --> fixed
- **assigned_to**: Hans Feldt --> nobody



---

** [tickets:#605] immomtest: os_defs.c:447: ncs_os_lock: Assertion `0' failed.**

**Status:** fixed
**Created:** Wed Oct 23, 2013 11:33 AM UTC by Hans Feldt
**Last Updated:** Wed Oct 23, 2013 01:16 PM UTC
**Owner:** nobody

This happens from time to time when executing immomtest.

Code in question:

    case NCS_OS_LOCK_UNLOCK:
        if ((rc = pthread_mutex_unlock(&lock->lock)) != 0) {  /* unlock for all tasks */
            assert(0);
            return (NCSCC_RC_FAILURE);
        }
        break;

As a first step we could change this piece of code to use
osaf_mutex_unlock_ordie instead, which will log the return code before
aborting. That could take us closer to the real problem.

In imma (and other libraries) there is a mix of direct pthread calls
and NCS macros. We should probably align the code base to use the new
utility functions instead.



---



[tickets] [opensaf:tickets] #605 immomtest: os_defs.c:447: ncs_os_lock: Assertion `0' failed.

2013-10-29 Thread Hans Feldt
changeset:   4574:b424e8c4e643
branch:      opensaf-4.2.x
parent:      4569:dc118b8f1a07
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:24:35 2013 +0100
summary:     leap: change ncs_os_lock to syslog before abort [#605]

changeset:   4575:03c46e565f76
branch:      opensaf-4.3.x
parent:      4570:b350c2a0377e
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:30:55 2013 +0100
summary:     leap: change ncs_os_lock to syslog before abort [#605]

changeset:   4576:5a7622688312
tag:         tip
parent:      4573:90aa577fb3ef
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:30:55 2013 +0100
summary:     leap: change ncs_os_lock to syslog before abort [#605]






[tickets] [opensaf:tickets] #570 AMF: ava down event ignored in TERMINATING state

2013-10-29 Thread Hans Feldt
changeset:   4577:07e88841d268
branch:      opensaf-4.2.x
parent:      4574:b424e8c4e643
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:57:47 2013 +0100
summary:     amf: change saAmfResponse to be synchronous in TerminateCallback context [#570]

changeset:   4578:19f8ca3ee0e3
branch:      opensaf-4.3.x
parent:      4575:03c46e565f76
user:        Hans Feldt hans.fe...@ericsson.com
date:        Tue Oct 29 21:57:47 2013 +0100
summary:     amf: change saAmfResponse to be synchronous in TerminateCallback context [#570]

changeset:   4579:b26e27059891
tag:         tip
parent:      4576:5a7622688312
user:        Hans Feldt hans.fe...@ericsson.com
date:        Wed Oct 30 06:55:31 2013 +0100
summary:     amf: change saAmfResponse to be synchronous in TerminateCallback context [#570]



---

** [tickets:#570] AMF: ava down event ignored in TERMINATING state**

**Status:** fixed
**Created:** Wed Sep 18, 2013 06:45 AM UTC by Hans Feldt
**Last Updated:** Wed Oct 16, 2013 07:51 AM UTC
**Owner:** nobody

If a component crashes or exits in the context of the terminate
callback, AMF will not use the ava down event to trigger cleanup and
finish component termination. Instead, the CallbackTimeout is awaited,
which can be very long.

This is a problem if it happens during an upgrade: it will cause the
upgrade to fail, potentially leading to a system restore.

Ignoring the event (in avnd_err.c) was added in:

changeset:   1646:92e6e65eefc0
user:        Nagendra Kumar nku...@emerson.com
date:        Thu Aug 26 18:43:31 2010 +0530
summary:     Ticket 1433: Allowing dynamic configuration changes for AMF logical entities

reasons unclear. This has to be revisited. As far as I can tell there
is no race between the ava down event and the response message.
Normally a process calls saAmfResponse(OK) and then exit(0). Under the
hood, saAmfResponse(OK) sends a message, and at least TIPC will do run
to completion, meaning it posts the message to the receiver's (amfnd)
socket receive buffer. After that the process exits, and a topology
event is created and written to another socket receive buffer (the MDS
lib for amfnd). Eventually the MDS thread in amfnd context will
receive both messages and write them to the amfnd mailbox with the
same priority.



---
