[tickets] [opensaf:tickets] #242 cpsv : ckptnd crashed while running multi thread application during section iteration get next
- **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#242] cpsv : ckptnd crashed while running multi thread application during section iteration get next** **Status:** assigned **Milestone:** 4.5.2 **Created:** Thu May 16, 2013 06:31 AM UTC by A V Mahesh (AVM) **Last Updated:** Thu Aug 06, 2015 04:23 AM UTC **Owner:** A V Mahesh (AVM) **Attachments:** - [checkpoint_app1.c](https://sourceforge.net/p/opensaf/tickets/242/attachment/checkpoint_app1.c) (12.9 kB; application/octet-stream)

from http://devel.opensaf.org/ticket/2864

The issue is seen on SLES 64-bit VMs. There are two threads in the application, a writer thread and a reader thread.

The writer thread does the following:
1) Creates the checkpoint
2) In a loop, opens the same checkpoint in write mode, creates a section, writes into the section and closes the checkpoint

The reader thread does the following:
1) In a loop, opens the checkpoint created by the writer thread, does a section iteration initialize, reads the section returned by the section descriptor of iterationNext() and closes the checkpoint

Backtrace observed:

(gdb) bt
#0 0x00417606 in cpnd_proc_fill_sec_desc (pTmpSecPtr=0x0, sec_des=0x7fffa9c28530) at cpnd_proc.c:1637
#1 0x00417b42 in cpnd_proc_getnext_section (cp_node=0x64a810, get_next=0x654bb0, sec_des=0x7fffa9c28530, n_secs_trav=0x7fffa9c2852c) at cpnd_proc.c:1756
#2 0x0040f680 in cpnd_evt_proc_ckpt_iter_getnext (cb=0x637f30, evt=0x654ba0, sinfo=0x6551f8) at cpnd_evt.c:4122
#3 0x004059df in cpnd_process_evt (evt=0x654b90) at cpnd_evt.c:241
#4 0x00411619 in cpnd_main_process (cb=0x637f30) at cpnd_init.c:544
#5 0x004118e3 in main (argc=1, argv=0x7fffa9c28e68) at cpnd_main.c:72
(gdb) fr 2
#2 0x0040f680 in cpnd_evt_proc_ckpt_iter_getnext (cb=0x637f30, evt=0x654ba0, sinfo=0x6551f8) at cpnd_evt.c:4122
4122 cpnd_evt.c: No such file or directory.
in cpnd_evt.c (gdb) p *evt $1 = {dont_free_me = false, error = 0, type = CPND_EVT_A2ND_CKPT_ITER_GETNEXT, info = {initReq = {version = {releaseCode = 51 '3', majorVersion = 0 '\0', minorVersion = 0 '\0'}}, finReq = {client_hdl = 51}, openReq = {client_hdl = 51, lcl_ckpt_hdl = 11, ckpt_name = {length = 61664, value = d\000\000\000\000\000�\202a\000\000\000\000\000\005\000\000\000\t, '\0' repeats 236 times}, ckpt_attrib = {creationFlags = 0, checkpointSize = 0, retentionDuration = 0, maxSections = 0, maxSectionSize = 0, maxSectionIdSize = 0}, ckpt_flags = 0, invocation = 0, timeout = 0}, closeReq = {client_hdl = 51, ckpt_id = 11, ckpt_flags = 6615264}, ulinkReq = {ckpt_name = {length = 51, value = \000\000\000\000\000\000\v\000\000\000\000\000\000\000��d\000\000\000\000\000�\202a\000\000\000\000\000\005\000\000\000\t, '\0' repeats 220 times}}, rdsetReq = {ckpt_id = 51, reten_time = 11}, arsetReq = {ckpt_id = 51}, statReq = {ckpt_id = 51}, refCntsetReq = {no_of_nodes = 51, ref_cnt_array = {{ckpt_id = 11, ckpt_ref_cnt = 6615264}, {ckpt_id = 6390432, ckpt_ref_cnt = 5}, { ckpt_id = 0, ckpt_ref_cnt = 0} repeats 98 times}}, sec_creatReq = {ckpt_id = 51, lcl_ckpt_id = 11, agent_mdest = 6615264, sec_attri = {sectionId = 0x6182a0, expirationTime = 38654705669}, init_data = 0x0, init_size = 0}, sec_delReq = {ckpt_id = 51, sec_id = {idLen = 11, id = 0x64f0e0 section_4_1}, lcl_ckpt_id = 6390432, agent_mdest = 38654705669}, sec_expset = {ckpt_id = 51, sec_id = {idLen = 11, id = 0x64f0e0 section_4_1}, exp_time = 6390432}, iter_getnext = {ckpt_id = 51, section_id = {idLen = 11, id = 0x64f0e0 section_4_1}, iter_id = 6390432, filter = SA_CKPT_SECTIONS_ANY, n_secs_trav = 9, exp_tmr = 0}, arr_ntfy = { client_hdl = 51}, ckpt_write = {type = 51, ckpt_id = 11, lcl_ckpt_id = 6615264, agent_mdest = 6390432, num_of_elmts = 5, all_repl_evt_flag = 9, data = 0x0, seqno = 0, last_seq = 0 '\0', ckpt_sync = {ckpt_id = 0, lcl_ckpt_hdl = 0, client_hdl = 0, invocation = 0, cpa_sinfo = {to_svc = 0, 
dest = 0, stype = MDS_SENDTYPE_SND, ctxt = {length = 0 '\0', data = '\0' repeats 11 times}}, is_ckpt_open = false}}, ckpt_read = {type = 51, ckpt_id = 11, lcl_ckpt_id = 6615264, agent_mdest = 6390432, num_of_elmts = 5, all_repl_evt_flag = 9, data = 0x0, seqno = 0, last_seq = 0 '\0', ckpt_sync = {ckpt_id = 0, lcl_ckpt_hdl = 0, client_hdl = 0, invocation = 0, cpa_sinfo = {to_svc = 0, dest = 0, stype = MDS_SENDTYPE_SND, ctxt = { length = 0 '\0', data = '\0' repeats 11 times}}, is_ckpt_open = false}}, ckpt_sync = {ckpt_id = 51, lcl_ckpt_hdl = 11, client_hdl = 6615264, invocation = 6390432, cpa_sinfo = {to_svc = 5, dest = 0, stype = MDS_SENDTYPE_SND, ctxt = {length = 0 '\0', data = '\0' repeats 11 times}}, is_ckpt_open = false}, ckpt_read_ack = {ckpt_id = 51, mds_dest = 11}, ckpt_info = {error = 51, ckpt_id = 11, is_active_exists = 224, active_dest = 6390432, dest_cnt = 5, dest_list = 0x0, attributes = {creationFlags = 0, checkpointSize = 0, retentionDuration = 0, maxSections = 0, maxSectionSize = 0,
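Frame #0 shows cpnd_proc_fill_sec_desc being entered with pTmpSecPtr=0x0: the reader's iteration getnext dereferences a section record that the concurrent writer thread has already deleted. A minimal sketch of the defensive shape, with hypothetical names (this is not the actual OpenSAF fix):

```c
#include <stddef.h>

/* Hypothetical stand-in for the section record behind pTmpSecPtr */
struct sec_info {
    int sec_id;
};

/* Sketch: guard against a section that a concurrent writer deleted
 * between lookup and fill; return an error instead of dereferencing
 * a NULL pointer (the caller could map this to SA_AIS_ERR_NO_SECTIONS). */
static int fill_sec_desc(const struct sec_info *sec, int *out_id)
{
    if (sec == NULL || out_id == NULL)
        return -1;          /* section vanished under us; report, don't crash */
    *out_id = sec->sec_id;  /* safe: sec is non-NULL here */
    return 0;
}
```

Whether the proper fix is a NULL guard or serializing the iteration against section deletion is a design choice for cpnd; the sketch only shows the failing dereference made safe.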
[tickets] [opensaf:tickets] #272 checkpoint overwrite returns timeout when controllers are running with different compatible versions
- **status**: unassigned -- assigned - **assigned_to**: A V Mahesh (AVM) - **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#272] checkpoint overwrite returns timeout when controllers are running with different compatible versions** **Status:** assigned **Milestone:** 4.5.2 **Created:** Fri May 17, 2013 11:40 AM UTC by Sirisha Alla **Last Updated:** Thu Aug 06, 2015 04:26 AM UTC **Owner:** A V Mahesh (AVM) **Attachments:** - [logs.tar.gz](https://sourceforge.net/p/opensaf/tickets/272/attachment/logs.tar.gz) (175.5 kB; application/x-gzip)

The issue is seen on an OEL6.4 TCP setup. The changeset being used is 4241 with patches 2794 and 3117. The active controller (SC-1) is running version 4.3 while the standby controller (SC-2) is running cs3533 (4.2.x).

A non-collocated checkpoint replica is created on the active controller, and a section is created in the checkpoint. The write and read APIs are successful, but the overwrite API keeps returning timeout; after 5 seconds the application times out and exits. No ckptnd or agent crashes were observed. When the same application is run on SC-2, it runs without any error. Attaching the journal and the traces of ckptnd and ckptd on both controllers.

--- Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is subscribed to https://sourceforge.net/p/opensaf/tickets/ To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.
--
___
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
[tickets] [opensaf:tickets] #241 cpsv : saCkptCheckpointOpen writes to const SaNameT
- **status**: assigned -- unassigned --- ** [tickets:#241] cpsv : saCkptCheckpointOpen writes to const SaNameT** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 06:28 AM UTC by A V Mahesh (AVM) **Last Updated:** Wed Jul 15, 2015 02:47 PM UTC **Owner:** A V Mahesh (AVM)

from http://devel.opensaf.org/ticket/1731

Problem: osaf/libs/agents/saf/cpa/cpa_api.c line 648:

m_CPSV_SET_SANAMET(checkpointName);

However, checkpointName is:

const SaNameT *checkpointName

and m_CPSV_SET_SANAMET does

memset((uns8 *)&name->value[name->length], 0, (SA_MAX_NAME_LENGTH - name->length));

This causes a segfault if the value passed in is in read-only memory. The bug is present in opensaf-staging/1057c1e6ebba; I'm not sure what version that is.

Example:

#define CKPT_NAME "safCkpt=My_Ckpt,safApp=safCkptService"
const SaNameT ckpt_name = { sizeof(CKPT_NAME) - 1, CKPT_NAME };

Then call saCkptCheckpointOpen on it.
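The failure mode is easy to reproduce outside OpenSAF: a `const`-qualified object with static initialization may be placed in a read-only segment, and an in-place memset into it then traps. A sketch using a hypothetical `name_t` that mirrors the role of SaNameT (not the real SAF header), with a copy-then-pad alternative:

```c
#include <string.h>

#define MAX_NAME_LENGTH 256

/* Hypothetical mirror of SaNameT: length-prefixed, fixed-size value buffer */
typedef struct {
    unsigned short length;
    char value[MAX_NAME_LENGTH];
} name_t;

/* What m_CPSV_SET_SANAMET effectively does: zero-pad in place.
 * Calling this on a const object placed in .rodata segfaults. */
static void set_name_in_place(name_t *name)
{
    memset(&name->value[name->length], 0, MAX_NAME_LENGTH - name->length);
}

/* Safe shape (sketch): copy the caller's const name, then pad the copy */
static name_t set_name_copy(const name_t *in)
{
    name_t out = *in;
    set_name_in_place(&out);
    return out;
}
```

The copy keeps the agent's internal normalization without ever writing through the caller's `const` pointer.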
[tickets] [opensaf:tickets] #265 mds : OpenSAF cannot start with mutex type PTHREAD_MUTEX_ERRORCHECK_NP
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#265] mds : OpenSAF cannot start with mutex type PTHREAD_MUTEX_ERRORCHECK_NP** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 08:48 AM UTC by A V Mahesh (AVM) **Last Updated:** Wed Jul 15, 2015 02:43 PM UTC **Owner:** nobody

http://devel.opensaf.org/ticket/759

In pursuit of the problem described in http://devel.opensaf.org/ticket/753 I changed the mutex type in the general code (ncs_os_lock in os_defs.c) to PTHREAD_MUTEX_ERRORCHECK_NP. Then I get recursive locking in MDS:

opensaf-staging$ cat /var/lib/opensaf/stdouts/ncs_rde
35: Resource deadlock avoided
ncs_rde: os_defs.c:783: ncs_os_lock: Assertion `0' failed.

(gdb) bt
#0 0x7f31603db4b5 in __GI_raise (sig=&lt;value optimized out&gt;) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x7f31603def50 in __GI_abort () at abort.c:92
#2 0x7f31603d4481 in __GI___assert_fail (assertion=0x7f3160b8dbf1 "0", file=&lt;value optimized out&gt;, line=783, function=0x7f3160b8f4b4 "ncs_os_lock") at assert.c:81
#3 0x7f3160b62313 in ncs_os_lock (lock=&lt;value optimized out&gt;, request=&lt;value optimized out&gt;, type=&lt;value optimized out&gt;) at os_defs.c:783
#4 0x7f3160b4c97c in ncs_spir_api (info=0x7fff8c182190) at ncs_sprr.c:360
#5 0x7f3160b88d2c in mda_lib_req (req=0x7fff8c182400) at ncs_mda.c:157
#6 0x7f3160b4cf77 in ncs_spir_api (info=0x7fff8c1826f0) at ncs_sprr.c:526
#7 0x7f3160b88ebb in mda_lib_req (req=0x7fff8c1829a0) at ncs_mda.c:105
#8 0x7f3160b4bab0 in ncs_mds_startup (argc=&lt;value optimized out&gt;, argv=0x7fff8c182b50) at ncs_main_pub.c:353
#9 0x7f3160b4c372 in ncs_core_agents_startup (argc=0, argv=0x7fff8c182b50) at ncs_main_pub.c:446
#10 0x7f3160b4c429 in ncs_agents_startup (argc=923, argv=0x39b) at ncs_main_pub.c:225
#11 0x00402d20 in rde_agents_startup () at rde_amf.c:425
#12 0x0040421f in main (argc=&lt;value optimized out&gt;, argv=0x7fff8c182fa8) at rde_main.c:122
[tickets] [opensaf:tickets] #266 mds : Error codes are not forwarded in ncsmds_api
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#266] mds : Error codes are not forwarded in ncsmds_api** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 08:50 AM UTC by A V Mahesh (AVM) **Last Updated:** Wed Jul 15, 2015 02:43 PM UTC **Owner:** nobody

http://devel.opensaf.org/ticket/2267

The return value from enc_full or flat, for example, is not forwarded all the way up to the caller of the MDS API:

rc = ncsmds_api(&mds_info)

For example, an invalid-parameter error from encoding could be useful to return back to the user. Now only NCSCC_RC_FAILURE is returned back for all errors. This pattern appears in other places in the MDS API as well. What is the reason for not forwarding return codes?

From mds_c_sndrcv.c, the status below is not forwarded:

m_MDS_LOG_DBG("MDS_SND_RCV : calling cb ptr enc or enc flat in mcm_msg_encode_full_or_flat_and_send\n");
status = svc_cb->cback_ptr(&cbinfo);
if (status != NCSCC_RC_SUCCESS) {
	m_MDS_LOG_ERR("MDS_SND_RCV: Encode callback of Dest =%d, Adest=%llx, svc-id=%d failed while sending to svc=%d", dest_vdest_id, adest, svc_cb->svc_id, to_svc_id);
	m_MDS_LOG_DBG("MDS_SND_RCV : Leaving mcm_msg_encode_full_or_flat_and_send\n");
	if (msg_send.msg.encoding == MDS_ENC_TYPE_FLAT) {
		m_MMGR_FREE_BUFR_LIST(msg_send.msg.data.flat_uba.start);
	} else if (msg_send.msg.encoding == MDS_ENC_TYPE_FULL) {
		m_MMGR_FREE_BUFR_LIST(msg_send.msg.data.fullenc_uba.start);
	}
	return NCSCC_RC_FAILURE;
}
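The requested change can be sketched in a few lines: keep the cleanup path, but return the status the callback produced instead of collapsing every failure to one code. The names below are hypothetical stand-ins for the NCSCC_RC_* codes and the encode callback:

```c
/* Hypothetical return codes mirroring the NCSCC_RC_* convention */
enum rc {
    RC_SUCCESS = 1,
    RC_FAILURE = 2,
    RC_INVALID_INPUT = 7,
};

/* Stand-in for the encode callback; fails with a specific code */
static enum rc encode_cb(int bad_param)
{
    return bad_param ? RC_INVALID_INPUT : RC_SUCCESS;
}

/* Sketch of the change the ticket asks for: free resources on failure,
 * then forward the callback's status instead of flattening it. */
static enum rc encode_and_send(int bad_param)
{
    enum rc status = encode_cb(bad_param);
    if (status != RC_SUCCESS) {
        /* ... free the buffer list here, as the original code does ... */
        return status;   /* was: return RC_FAILURE for every error */
    }
    return RC_SUCCESS;
}
```

With this shape the caller of the API can distinguish an invalid parameter from a transport failure at no extra cost.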
[tickets] [opensaf:tickets] #1423 ckptnd doesn't handle fault case when creating share memory at start up
Currently, I'm busy with other stuff. I'll fix this in the next release.

--- ** [tickets:#1423] ckptnd doesn't handle fault case when creating share memory at start up** **Status:** assigned **Milestone:** future **Created:** Tue Jul 21, 2015 06:33 AM UTC by Pham Hoang Nhat **Last Updated:** Tue Aug 11, 2015 06:10 AM UTC **Owner:** Pham Hoang Nhat

Observed behaviour
--
When installing a campaign with a test component, ckptnd triggers a core dump.

Error messages
--
Following is the message in the syslog:

Jun 17 07:50:41 SC-2-2 osafckptnd[11361]: ER cpnd open request fail for RDWR mode (null)
Jun 17 07:50:51 SC-2-2 kernel: [ 494.474214] osafckptnd[11361]: segfault at 0 ip 7f25cd609608 sp 7fffdb6290b8 error 4 in libc-2.19.so[7f25cd57f000+19e000]

Following is the bt:

(gdb) bt
#0 0x7fb733293608 in _wordcopy_fwd_dest_aligned () from /lib64/libc.so.6
#1 0x7fb73328db8a in __memmove_sse2 () from /lib64/libc.so.6
#2 0x7fb7343258cc in ncs_os_posix_shm (req=0x7fffe65e7090) at os_defs.c:836
#3 0x00415d1f in cpnd_find_free_loc ()
#4 0x00415f46 in cpnd_restart_shm_client_update ()
#5 0x00405a5b in cpnd_evt_proc_ckpt_init ()
#6 0x0040d532 in cpnd_process_evt ()
#7 0x0040e235 in cpnd_main_process ()
#8 0x0040edf7 in main ()
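Independent of the root cause of the failed open ("cpnd open request fail for RDWR mode"), the memmove through a NULL base address ("segfault at 0") could be turned into a reportable error with a guard of this shape (hypothetical names, not the actual cpnd code):

```c
#include <stddef.h>
#include <string.h>

/* Sketch: validate the shared-memory mapping before copying into it.
 * The backtrace shows memmove inside ncs_os_posix_shm() faulting at
 * address 0 after the shm open failed; checking the base address lets
 * the caller propagate an error instead of crashing. */
static int shm_write(void *shm_base, size_t offset, const void *rec, size_t len)
{
    if (shm_base == NULL || rec == NULL)  /* open/mmap failed earlier */
        return -1;                        /* report, don't dereference */
    memmove((char *)shm_base + offset, rec, len);
    return 0;
}
```

The real fix also has to handle the failed shm creation itself; the guard only keeps the fault case from escalating to a core dump.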
[tickets] [opensaf:tickets] #68 failover didnot succeed and cluster got reset due to MDS problems.
- **status**: unassigned -- assigned - **assigned_to**: A V Mahesh (AVM) - **Type**: enhancement -- defect - **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#68] failover didnot succeed and cluster got reset due to MDS problems.** **Status:** assigned **Milestone:** 4.5.2 **Created:** Sat May 11, 2013 05:22 PM UTC by surender khetavath **Last Updated:** Fri Aug 07, 2015 04:19 AM UTC **Owner:** A V Mahesh (AVM) **Attachments:** - [logs.tgz](https://sourceforge.net/p/opensaf/tickets/68/attachment/logs.tgz) (16.2 MB; application/x-compressed-tar)

Changeset: 4241 with patches 2794 and 3117
Model: TwoN
Configuration: 1 App, 1 SG, 4 SUs with 3 comps each, and 5 SIs with 3 CSIs each
Transport: TCP/ipv6-linklocal, PBE enabled

Scenario: SC-1 was active and SC-2 standby. The active SU on SC-1 was shut down and a component was made to reject the quiescing assignment. The component got restarted 10 times (as compRestartMax=10) and was then escalated to node failover following an SU failover. SC-2 did not become active and eventually rebooted, thus causing a cluster reset.

syslog on sc-1:
--
May 11 21:24:49 sc-1 osafimmnd[4683]: WA Error code 2 returned for message type 21 - ignoring
May 11 21:24:49 sc-1 osafamfnd[4790]: NO Received reboot order, ordering reboot now!
May 11 21:24:49 sc-1 osafamfnd[4790]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Received reboot order
May 11 21:24:49 sc-1 opensaf_reboot: Rebooting local node
May 11 21:24:49 sc-1 osafimmnd[4683]: WA MESSAGE:5319 OUT OF ORDER my highest processed:5317, exiting
May 11 21:24:49 sc-1 osafimmpbed: WA PBE lost contact with parent IMMND - Exiting
May 11 21:24:49 sc-1 osafntfimcnd[4734]: ER saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9)
May 11 21:24:49 sc-1 osafimmd[4668]: WA IMMND coordinator at 2010f apparently crashed =&gt; electing new coord
May 11 21:24:49 sc-1 osafimmd[4668]: ER Failed to find candidate for new IMMND coordinator
May 11 21:24:49 sc-1 osafimmd[4668]: ER Active IMMD has to restart the IMMSv. All IMMNDs will restart
May 11 21:24:49 sc-1 osafimmd[4668]: ER IMM RELOAD =&gt; ensure cluster restart by IMMD exit at both SCs, exiting

syslog on sc-2:
--
May 11 21:24:49 sc-2 osafimmd[3894]: WA IMMD not re-electing coord for switch-over (si-swap) coord at (2010f)
May 11 21:24:49 sc-2 osafntfimcnd[3969]: NO exiting on signal 15
May 11 21:24:49 sc-2 osafsmfd[4052]: ER amf_active_state_handler oi activate FAILED
May 11 21:24:49 sc-2 osafamfnd[4023]: NO 'safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 'csiSetcallbackFailed' : Recovery is 'nodeFailfast'
May 11 21:24:49 sc-2 osafamfnd[4023]: ER safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due to:csiSetcallbackFailed Recovery is:nodeFailfast
May 11 21:24:49 sc-2 osafamfnd[4023]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Component faulted: recovery is node failfast
May 11 21:24:49 sc-2 osafmsgd[4216]: ER mqd_imm_declare_implementer failed: err = 14
May 11 21:24:49 sc-2 osafckptd[4202]: ER cpd immOiImplmenterSet failed with err = 14
May 11 21:24:49 sc-2 opensaf_reboot: Rebooting local node
[tickets] [opensaf:tickets] #1338 mds : Optimized the mds_library_mutex locks for better core readability
--- ** [tickets:#1338] mds : Optimized the mds_library_mutex locks for better core readability** **Status:** assigned **Milestone:** 4.7-Tentative **Created:** Fri Apr 24, 2015 05:16 AM UTC by A V Mahesh (AVM) **Last Updated:** Fri Aug 07, 2015 03:49 AM UTC **Owner:** A V Mahesh (AVM)

Now in the MDS code the mds_library_mutex unlock/lock is taken before and after the function mds_mcm_time_wait(), and this is done across the code. If we move this mds_library_mutex unlock/lock inside the function mds_mcm_time_wait(), the code will have more readability and allows some code cleanup.

Example changes:

@@ -2435,9 +2438,7 @@ static uint32_t mcm_pvt_normal_svc_sndrs
 			fr_svc_id, to_svc_id, to_dest);
 		return status;
 	} else {
-		osaf_mutex_unlock_ordie(&gl_mds_library_mutex);
 		if (NCSCC_RC_SUCCESS != mds_mcm_time_wait(&sync_queue->sel_obj, req->info.sndrsp.i_time_to_wait)) {
-			osaf_mutex_lock_ordie(&gl_mds_library_mutex);
 			/* This is for response for local dest */
 			if (sync_queue->status == NCSCC_RC_SUCCESS) {
 				/* success case */
@@ -2458,7 +2459,6 @@ static uint32_t mcm_pvt_normal_svc_sndrs
 			mcm_pvt_del_sync_send_entry((MDS_PWE_HDL)env_hdl, fr_svc_id, xch_id, req->i_sendtype, 0);
 			return NCSCC_RC_REQ_TIMOUT;
 		} else {
-			osaf_mutex_lock_ordie(&gl_mds_library_mutex);
 			if (NCSCC_RC_SUCCESS != mds_check_for_mds_existence(&sync_queue->sel_obj, env_hdl, fr_svc_id, to_svc_id)) {
 				m_MDS_LOG_INFO("MDS_SND_RCV: MDS entry doesnt exist\n");
@@ -2549,15 +2549,18 @@ static uint32_t mds_await_active_tbl_del
 static uint32_t mds_mcm_time_wait(NCS_SEL_OBJ *sel_obj, uint32_t time_val)
 {
+	osaf_mutex_unlock_ordie(&gl_mds_library_mutex);
 	/* Now wait for the response to come */
 	int count = osaf_poll_one_fd(sel_obj->rmv_obj, time_val == 0 ? -1 : (time_val * 10));
+	osaf_mutex_lock_ordie(&gl_mds_library_mutex);
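The proposal above can be sketched with plain pthreads and poll(): the wait helper owns the unlock/wait/relock sequence, so every call site shrinks to a single call. This is a sketch of the shape only, not the OpenSAF implementation (osaf_poll_one_fd and the ordie wrappers are replaced by their POSIX equivalents):

```c
#include <poll.h>
#include <pthread.h>

static pthread_mutex_t gl_mds_library_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Sketch of the proposed shape: the helper drops the library mutex
 * for the duration of the blocking wait and re-acquires it before
 * returning, so callers no longer repeat unlock/lock around it. */
static int time_wait_locked(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };

    pthread_mutex_unlock(&gl_mds_library_mutex);  /* don't block while holding the lock */
    int count = poll(&pfd, 1, timeout_ms);
    pthread_mutex_lock(&gl_mds_library_mutex);    /* re-acquire before returning */
    return count;                                 /* 0 on timeout, 1 if readable */
}
```

The caller must hold gl_mds_library_mutex when calling the helper, exactly as the original call sites do; the invariant "lock held on entry and on exit" is what makes the cleanup safe.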
[tickets] [opensaf:tickets] #1317 ckpt : stale replicas observed in a 70 node cluster
- **status**: unassigned -- assigned - **assigned_to**: A V Mahesh (AVM) - **Milestone**: 4.4.2 -- 4.5.2 --- ** [tickets:#1317] ckpt : stale replicas observed in a 70 node cluster** **Status:** assigned **Milestone:** 4.5.2 **Created:** Wed Apr 15, 2015 10:16 AM UTC by Sirisha Alla **Last Updated:** Wed Apr 15, 2015 10:16 AM UTC **Owner:** A V Mahesh (AVM) **Attachments:** - [logs.tar.bz2](https://sourceforge.net/p/opensaf/tickets/1317/attachment/logs.tar.bz2) (6.5 MB; application/x-bzip)

This issue is observed on cs6377 (46FC Tag). The cluster is of 70 nodes and 2 checkpoint applications run on each node. The application running on the active controller creates the checkpoint, while the applications running on the other nodes open the same checkpoint and use it. After sections are created, written and read, all the applications finalize the handles used. The retention duration of the checkpoint is specified to a minimal value of 1000 nanoseconds.

/dev/shm on the active controller after the applications exited:

SLES-64BIT-SLOT1:~ # date;ls -lrt /dev/shm/
Wed Apr 15 14:25:09 IST 2015
total 1772
-rw-r--r-- 1 opensaf opensaf 1076040 Apr 15 13:38 opensaf_NCS_MQND_QUEUE_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf  328000 Apr 15 13:38 opensaf_NCS_GLND_RES_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf      16 Apr 15 13:38 opensaf_NCS_GLND_LCK_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf   88000 Apr 15 13:38 opensaf_NCS_GLND_EVT_CKPT_INFO
-rw-r--r-- 1 opensaf opensaf  704008 Apr 15 13:38 opensaf_CPND_CHECKPOINT_INFO_131343
-rw-r--r-- 1 opensaf opensaf   79848 Apr 15 13:55 opensaf_safCkpt=active_replica_ckpt_name_1_sysgrou_131343_4
-rw-r--r-- 1 opensaf opensaf   79848 Apr 15 13:56 opensaf_safCkpt=active_replica_ckpt_name_1_sysgrou_131343_9
-rw-r--r-- 1 opensaf opensaf   79848 Apr 15 13:57 opensaf_safCkpt=active_replica_ckpt_name_1_sysgrou_131343_16
SLES-64BIT-SLOT1:~ # date;immfind|grep -i ckpt
Wed Apr 15 14:25:11 IST 2015
safApp=safCkptService
SLES-64BIT-SLOT1:~ #

When a checkpoint with the same name is then created again, the checkpoint service does not create a new replica in the shared memory. cpd and cpnd traces are attached.
[tickets] [opensaf:tickets] #1305 cpsv: non-collocated ckpts are not receiving track changes if physical replica doesn't exist
- **Milestone**: 4.7-Tentative -- 4.6.1 --- ** [tickets:#1305] cpsv: non-collocated ckpts are not receiving track changes if physical replica doesn't exist** **Status:** assigned **Milestone:** 4.6.1 **Created:** Tue Apr 07, 2015 09:03 AM UTC by A V Mahesh (AVM) **Last Updated:** Thu Aug 06, 2015 04:28 AM UTC **Owner:** A V Mahesh (AVM) The track changes through a callback are not being received for non-collocated Checkpoints, if physical replica doesn't exist on that particular controller/payload blade. For the non-collocated Checkpoints, OpenSAF Checkpoint Service will specify the location of the checkpoint replicas as per the following policy: If a non-collocated checkpoint is opened for the first time by an application residing on a payload blade, the replicas will be created on the local payload blade and both the system controller nodes. In this case, the replica residing on the payload blade is designated as active replica. If a non-collocated checkpoint is opened for the first time by an application residing on the system controller nodes, the replica will be created only on the system controller blade. In this case, this replica on a system controller node will act as the active replica. If another application opens the same checkpoint from a payload node, the checkpoint service will not create the replica on that node.
[tickets] [opensaf:tickets] #1285 MDS TCP: zero bytes recvd results in application exit
- **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#1285] MDS TCP: zero bytes recvd results in application exit** **Status:** assigned **Milestone:** 4.5.2 **Created:** Thu Mar 26, 2015 09:49 AM UTC by Girish **Last Updated:** Fri Aug 07, 2015 04:03 AM UTC **Owner:** A V Mahesh (AVM)

Sometimes an application using OpenSAF exits with the below message:

Feb 20 15:24:59 fedvm1 RIB[28549]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err :Success
Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safSu=SU1,safSg=app-simplex,safApp=appos' component restart probation timer started (timeout: 40 ns)
Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO Restarting a component of 'safSu=SU1,safSg=app-simplex,safApp=appos' (comp restart count: 1)
Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safComp=App,safSu=SU1,safSg=app-simplex,safApp=appos' faulted due to 'avaDown' : Recovery is 'componentRestart'

It exits at osaf/libs/core/mds/mds_dt_trans.c::mdtm_process_poll_recv_data_tcp:

recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
if (recd_bytes < 0) {
	return;
} else if (0 == recd_bytes) {
	syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err :%d len:%d", recd_bytes, errno, local_len_buf);
	close(tcp_cb->DBSRsock);
	exit(0);
} else if (local_len_buf > recd_bytes) {

local_len_buf turns out to be 0.
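For comparison, a library-friendly shape for the zero-byte case: recv() returning 0 signals an orderly shutdown by the peer, which library code can report to its caller instead of terminating the whole process with exit(0). This is a hypothetical handler, not the proposed OpenSAF patch:

```c
#include <sys/socket.h>
#include <unistd.h>

/* Sketch: classify the recv() outcome and let the caller decide.
 * n > 0  -> data received
 * n == 0 -> peer closed the connection (orderly shutdown): close our
 *           end and return 0 so the caller can reconnect or tear down
 * n < 0  -> real error, errno is set */
static int recv_or_report(int sock, char *buf, size_t len)
{
    ssize_t n = recv(sock, buf, len, 0);
    if (n > 0)
        return (int)n;
    if (n == 0) {
        close(sock);     /* connection lost; do NOT exit() from a library */
        return 0;
    }
    return -1;
}
```

Moving the exit decision up to the application (or to a reconnect path in MDS) is what keeps AMF from seeing the component die with 'avaDown'.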
[tickets] [opensaf:tickets] #1442 log: unable to create new cfg/log files if openning files are corrupted
--- ** [tickets:#1442] log: unable to create new cfg/log files if openning files are corrupted** **Status:** unassigned **Milestone:** 4.5.2 **Created:** Tue Aug 11, 2015 07:20 AM UTC by Vu Minh Nguyen **Last Updated:** Tue Aug 11, 2015 07:20 AM UTC **Owner:** nobody

When something goes wrong with the currently open cfg/log files (e.g. the files on disk are deleted or moved), any action that should lead to creating new cfg/log files fails: logsv sees that it failed to rename the existing files (appending the close time to the file names) and then skips creating new ones.
[tickets] [opensaf:tickets] #1433 log: saflogger should use built-in default log file format in log server
- **status**: unassigned -- accepted - **assigned_to**: Vu Minh Nguyen --- ** [tickets:#1433] log: saflogger should use built-in default log file format in log server** **Status:** accepted **Milestone:** 5.0 **Created:** Wed Aug 05, 2015 06:33 AM UTC by Vu Minh Nguyen **Last Updated:** Wed Aug 05, 2015 06:33 AM UTC **Owner:** Vu Minh Nguyen

Currently, the saflogger tool uses its own defined log file format for the application stream instead of the built-in default format in the log server.
[tickets] [opensaf:tickets] #264 mds : Refactor MDS tests
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#264] mds : Refactor MDS tests** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 08:27 AM UTC by A V Mahesh (AVM) **Last Updated:** Thu May 16, 2013 08:28 AM UTC **Owner:** nobody

http://devel.opensaf.org/ticket/2848

MDS tests are designed for TETware. They should be ported to the simple unit test framework in OpenSAF.
[tickets] [opensaf:tickets] #263 mds : Suspicios comparison
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#263] mds : Suspicios comparison** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 08:25 AM UTC by A V Mahesh (AVM) **Last Updated:** Wed Jul 15, 2015 02:44 PM UTC **Owner:** nobody

http://devel.opensaf.org/ticket/2639

The following comparison is always true:

(recv->snd_type != MDS_SENDTYPE_ACK) || (recv->snd_type != MDS_SENDTYPE_RACK)

Was the intention perhaps:

(recv->snd_type != MDS_SENDTYPE_ACK) && (recv->snd_type != MDS_SENDTYPE_RACK)

The code is found in file osaf/libs/core/mds/mds_c_sndrcv.c, line 4024:

/* For the message loss indication */
if ((true == svccb->i_msg_loss_indication) &&
    ((recv->snd_type != MDS_SENDTYPE_ACK) || (recv->snd_type != MDS_SENDTYPE_RACK))) {
	/* Get the subscription table result table function pointer */
	MDS_SUBSCRIPTION_RESULTS_INFO *lcl_subtn_res = NULL;
	if (NCSCC_RC_SUCCESS == mds_get_subtn_res_tbl_by_adest(recv->dest_svc_hdl, recv->src_svc_id,
							       recv->src_vdest, recv->src_adest, &lcl_subtn_res)) {
		if (recv->src_seq_num != lcl_subtn_res->msg_rcv_cnt) {
			m_MDS_LOG_ERR("MDS_SND_RCV: msg loss detected, Src SVC=%d, Src vdest id= %d, Src adest=%llu, local svc id=%d msg num=%d, recvd cnt=%d\n",
				      recv->src_svc_id, recv->src_vdest, recv->src_adest, svccb->svc_id, recv->src_seq_num, lcl_subtn_res->msg_rcv_cnt);
			mds_mcm_msg_loss(recv->dest_svc_hdl, recv->src_adest, recv->src_svc_id, recv->src_vdest);
			lcl_subtn_res->msg_rcv_cnt = recv->src_seq_num;
			lcl_subtn_res->msg_rcv_cnt++;
		} else {
			lcl_subtn_res->msg_rcv_cnt++;
		}
	} else {
		m_MDS_LOG_INFO("MDS_SND_RCV: msg loss enabled but no subcription exists\n");
	}
}
[tickets] [opensaf:tickets] #258 mds: MDS should use TIPC importance
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#258] mds: MDS should use TIPC importance** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 08:11 AM UTC by A V Mahesh (AVM) **Last Updated:** Thu May 16, 2013 08:14 AM UTC **Owner:** nobody http://devel.opensaf.org/ticket/1772 If the application is using TIPC as its cluster communication protocol, OpenSAF control signalling can be blocked by the application. By using TIPC importance, OpenSAF, together with a well-behaved application, can avoid this scenario. Map MDS priority to TIPC importance.
[tickets] [opensaf:tickets] #238 cpsv : Write for asynchronous non collocated checkpoint returns SA_AIS_ERR_NOT_EXIST in some processes
- **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#238] cpsv : Write for asynchronous non collocated checkpoint returns SA_AIS_ERR_NOT_EXIST in some processes** **Status:** assigned **Milestone:** 4.5.2 **Created:** Thu May 16, 2013 06:17 AM UTC by A V Mahesh (AVM) **Last Updated:** Thu Aug 06, 2015 04:25 AM UTC **Owner:** A V Mahesh (AVM) From http://devel.opensaf.org/ticket/2384 Changeset: 3065 Setup: 70 node SLES11 VM setup. Problem Description: 70 processes are running the below test scenario, with each node hosting a single process. 1) The application running on SC-1 opens a non-collocated checkpoint and creates a section in the checkpoint. 2) The rest of the applications create the checkpoint and, once the section create is successful on SC-1, write into the same section. Some of the applications get SA_AIS_ERR_NOT_EXIST back from the write operation. Traces are not enabled on the setup; /var/log/messages for both controllers can be provided.
[tickets] [opensaf:tickets] #239 cpsv : section create returns ERR_EXIST after few try agains on 70 node cluster
- **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#239] cpsv : section create returns ERR_EXIST after few try agains on 70 node cluster** **Status:** assigned **Milestone:** 4.5.2 **Created:** Thu May 16, 2013 06:19 AM UTC by A V Mahesh (AVM) **Last Updated:** Thu Aug 06, 2015 04:24 AM UTC **Owner:** A V Mahesh (AVM) From http://devel.opensaf.org/ticket/3042 This is seen on a 70-node SLES VM setup. One checkpoint application runs on each node. 1) The checkpoint application on the active controller creates an asynchronous collocated checkpoint. The applications on other nodes open the same checkpoint. 2) The replica is set active on the active controller and a section is created. 3) The section create API returns TRY_AGAIN a few times and then returns ERR_EXIST. When the application gets TRY_AGAIN, the section should not have been created in the checkpoint. This is not always reproducible. Snippet from test journal: 520|0 15 00130961 1 21| FAILED : Section 11 created in active colloc ckpt 520|0 15 00130961 1 22| Return Value : SA_AIS_ERR_TRY_AGAIN 520|0 15 00130961 1 23| 520|0 15 00130961 1 24| Try again count : 8 520|0 15 00130961 1 25| 520|0 15 00130961 1 26| FAILED : Section 11 created in active colloc ckpt 520|0 15 00130961 1 27| Return Value : SA_AIS_ERR_EXIST Attaching CPD and CPND traces of both the controllers
[tickets] [opensaf:tickets] #1423 ckptnd doesn't handle fault case when creating shared memory at start up
- **Type**: defect -- enhancement - **Comment**: Are you targeting this ticket for 4.7? If so change the milestone to 4.7, else I will change it to enhancement. --- ** [tickets:#1423] ckptnd doesn't handle fault case when creating shared memory at start up** **Status:** assigned **Milestone:** future **Created:** Tue Jul 21, 2015 06:33 AM UTC by Pham Hoang Nhat **Last Updated:** Tue Jul 21, 2015 06:35 AM UTC **Owner:** Pham Hoang Nhat Observed behaviour -- While installing a campaign with a test component, ckptnd triggered a core dump. Error messages -- Following is the message in the syslog. Jun 17 07:50:41 SC-2-2 osafckptnd[11361]: ER cpnd open request fail for RDWR mode (null) Jun 17 07:50:51 SC-2-2 kernel: [ 494.474214] osafckptnd[11361]: segfault at 0 ip 7f25cd609608 sp 7fffdb6290b8 error 4 in libc-2.19.so[7f25cd57f000+19e000] Following is the bt: (gdb) bt #0 0x7fb733293608 in _wordcopy_fwd_dest_aligned () from /lib64/libc.so.6 #1 0x7fb73328db8a in __memmove_sse2 () from /lib64/libc.so.6 #2 0x7fb7343258cc in ncs_os_posix_shm (req=0x7fffe65e7090) at os_defs.c:836 #3 0x00415d1f in cpnd_find_free_loc () #4 0x00415f46 in cpnd_restart_shm_client_update () #5 0x00405a5b in cpnd_evt_proc_ckpt_init () #6 0x0040d532 in cpnd_process_evt () #7 0x0040e235 in cpnd_main_process () #8 0x0040edf7 in main ()
[tickets] [opensaf:tickets] #1436 MDS (TCP transport) fragment gets dropped, not received on standby node
- **status**: unassigned -- assigned - **assigned_to**: A V Mahesh (AVM) --- ** [tickets:#1436] MDS (TCP transport) fragment gets dropped, not received on standby node** **Status:** assigned **Milestone:** 4.6.1 **Created:** Thu Aug 06, 2015 06:47 AM UTC by Girish **Last Updated:** Mon Aug 10, 2015 10:49 AM UTC **Owner:** A V Mahesh (AVM) **Attachments:** - [cpsv_test_app.c](https://sourceforge.net/p/opensaf/tickets/1436/attachment/cpsv_test_app.c) (8.5 kB; text/x-csrc) Opensaf version: 4.6 Linux: Standard Fedora 22 release, no additional patches required default wmem_max/rmem_max values default buffer sizes for MDS_SOCK_SND_RCV_BUF_SIZE and DTM_SOCK_SND_RCV_BUF_SIZE Active-standby model opensaf run as root user/group Steps: 1. start opensaf on node1 (active) and node2 (standby) 2. start ckpt_demo (modified application attached) on active node, ./ckpt_demo 1 3. wait till all the data is checkpointed 4. start ckpt_demo on standby node, ./ckpt_demo 0 Notice the error messages in mds.log: MDTM: Some stale message recd, hence dropping adest= My investigation is that one of the fragments is lost: the active node sends it, whereas the standby node does not receive it.
mds log on standby: May 29 4:30:03.089974 8461 ERR| mdtm_process_poll_recv_data_tcp May 29 4:30:03.089995 8461 ERR|before mds_mdtm_process_recvdata fun-call 1, recd_bytes=1454, buff_toal_len=1454 May 29 4:30:03.090014 8461 ERR|MDTM: Recd message with Fragment Seqnum=18, frag_num=3049, from src_Tipc_id=0x0002020f:25826, pkt_type=35817 May 29 4:30:03.090032 8461 ERR|MDTM: Reassembling in FULL UB May 29 4:30:03.090174 8461 ERR|mdtm_process_recv_events_tcp: pollres=1 May 29 4:30:03.090198 8461 ERR|mdtm_process_recv_events_tcp: pfd[0].revents=1 May 29 4:30:03.090216 8461 ERR| mdtm_process_poll_recv_data_tcp May 29 4:30:03.090238 8461 ERR|before mds_mdtm_process_recvdata fun-call 1, recd_bytes=1454, buff_toal_len=1454 May 29 4:30:03.090257 8461 ERR|MDTM: Recd message with Fragment Seqnum=18, frag_num=3050, from src_Tipc_id=0x0002020f:25826, pkt_type=35818 May 29 4:30:03.090275 8461 ERR|MDTM: Reassembling in FULL UB May 29 4:30:03.090735 8461 ERR|mdtm_process_recv_events_tcp: pollres=1 May 29 4:30:03.090762 8461 ERR|mdtm_process_recv_events_tcp: pfd[0].revents=1 May 29 4:30:03.090780 8461 ERR| mdtm_process_poll_recv_data_tcp May 29 4:30:03.090801 8461 ERR|before mds_mdtm_process_recvdata fun-call 1, recd_bytes=1454, buff_toal_len=1454 May 29 4:30:03.090820 8461 ERR|MDTM: Recd message with Fragment Seqnum=18, frag_num=3051, from src_Tipc_id=0x0002020f:25826, pkt_type=35819 May 29 4:30:03.090838 8461 ERR|MDTM: Reassembling in FULL UB May 29 4:30:03.090978 8461 ERR|mdtm_process_recv_events_tcp: pollres=1 May 29 4:30:03.091028 8461 ERR|mdtm_process_recv_events_tcp: pfd[0].revents=1 May 29 4:30:03.091047 8461 ERR| mdtm_process_poll_recv_data_tcp May 29 4:30:03.091068 8461 ERR|before mds_mdtm_process_recvdata fun-call 1, recd_bytes=1454, buff_toal_len=1454 May 29 4:30:03.091087 8461 ERR|MDTM: Recd message with Fragment Seqnum=18, frag_num=3053, from src_Tipc_id=0x0002020f:25826, pkt_type=35821 May 29 4:30:03.091106 8461 ERR|MDTM: ERROR Frag recd is not next frag so dropping 
adest=0x0002020f64e2 May 29 4:30:03.091125 8461 ERR|mdtm_process_recv_events_tcp: pollres=1 May 29 4:30:03.091143 8461 ERR|mdtm_process_recv_events_tcp: pfd[0].revents=1 May 29 4:30:03.091160 8461 ERR| mdtm_process_poll_recv_data_tcp May 29 4:30:03.091180 8461 ERR|before mds_mdtm_process_recvdata fun-call 1, recd_bytes=1454, buff_toal_len=1454 May 29 4:30:03.091198 8461 ERR|MDTM: Recd message with Fragment Seqnum=18, frag_num=3054, from src_Tipc_id=0x0002020f:25826, pkt_type=35822 May 29 4:30:03.091216 8461 ERR|MDTM: Message is dropped as msg is out of seq TRANSPOR-ID=0x0002020f64e2 May 29 4:30:03.091235 8461 ERR|mdtm_process_recv_events_tcp: pollres=1 May 29 4:30:03.091283 8461 ERR|mdtm_process_recv_events_tcp: pfd[0].revents=1 May 29 4:30:03.091302 8461 ERR| mdtm_process_poll_recv_data_tcp mds log on active: May 29 4:29:36.021518 25826 ERR|before mds_mdtm_process_recvdata fun-call 1, recd_bytes=1454, buff_toal_len=1454 May 29 4:29:36.021537 25826 ERR|MDTM: Recd message with Fragment Seqnum=5, frag_num=3049, from src_Tipc_id=0x0002020f:25995, pkt_type=35817 May 29 4:29:36.021554 25826 ERR|MDTM: Reassembling in flat UB May 29 4:29:36.021702 25995 ERR|successfully sent message, send_len=1456 May 29 4:29:36.021729 25995 ERR|MDTM:2 Sending message with Service Seqno=4, Fragment Seqnum=5, frag_num=35818, TO Dest_Tipc_id=0x0002020f:25826 May 29 4:29:36.021778 25826 ERR|mdtm_process_recv_events_tcp: pollres=1 May 29 4:29:36.021800 25826 ERR|mdtm_process_recv_events_tcp: pfd[0].revents=1 May 29 4:29:36.021817 25826 ERR|
[tickets] [opensaf:tickets] #1440 Mds: application crashes with core dump in mds
- **status**: assigned -- review --- ** [tickets:#1440] Mds: application crashes with core dump in mds** **Status:** review **Milestone:** 4.5.2 **Created:** Mon Aug 10, 2015 08:35 AM UTC by Nagendra Kumar **Last Updated:** Mon Aug 10, 2015 08:39 AM UTC **Owner:** A V Mahesh (AVM) Application crashes in mds code at the below location: mds_c_sndrcv.c, line no: 4047: m_MDS_LOG_ERR("MDS_SND_RCV: msg loss detected, Src svc_id = %s(%d), Src vdest id= %d, Src Adest = %" PRIu64 ", local svc_id = %s(%d) msg num=%d, recvd cnt=%d\n", ncsmds_svc_names[recv->src_svc_id], recv->src_vdest, recv->src_adest, ncsmds_svc_names[svccb->svc_id], recv->src_seq_num, lcl_subtn_res->msg_rcv_cnt); The reason is a mismatch between the arguments and the format string: the log requires 8 arguments but only 6 were passed in.
[tickets] [opensaf:tickets] #343 ntf: Implement SAI-AIS-NTF-A.02.01
- **summary**: ntf: -- ntf: Implement SAI-AIS-NTF-A.02.01 --- ** [tickets:#343] ntf: Implement SAI-AIS-NTF-A.02.01** **Status:** unassigned **Milestone:** future **Created:** Mon May 27, 2013 08:33 AM UTC by Praveen **Last Updated:** Wed Jul 15, 2015 02:32 PM UTC **Owner:** nobody Migrated from http://devel.opensaf.org/ticket/680. New functions: saNtfVariableDataSizeGet SaNtfStaticSuppressionFilterSetCallbackT Changed functions: saNtfInitialize_2 * - due to suppression callbacks saNtfStateChangeNotificationFilter_2 saNtfStateChangeNotificationAllocateFilter_2 saNtfLocalizedMessageFree_2 - add ntfHandle saNtfNotificationUnsubscribe_2 - add ntfHandle saNtfNotificationReadInitialize_2 * saNtfCallbacksT_2 * - due to suppression callbacks (* = changed in A.03.01) Admin API - IMM integration Notifications
[tickets] [opensaf:tickets] #467 checkpoint with COLLOCATED flag forcing to register for arrival callback
- **status**: unassigned -- assigned - **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#467] checkpoint with COLLOCATED flag forcing to register for arrival callback** **Status:** assigned **Milestone:** 4.5.2 **Created:** Mon Jun 24, 2013 06:36 AM UTC by A V Mahesh (AVM) **Last Updated:** Thu Aug 06, 2015 04:30 AM UTC **Owner:** A V Mahesh (AVM) I am using OpenSAF 4.0.0 http://devel.opensaf.org/ticket/1866 I am running a simple AMF demo for counting which uses checkpoint. My checkpoint creation flags are: SA_CKPT_CHECKPOINT_COLLOCATED | SA_CKPT_WR_ALL_REPLICAS. I tested it on a 2-node cluster (both target hardware and UML nodes). The problem is that unless I register for the arrival callback, my standby component faults; AMF reports a healthcheck timeout. I tested with SA_CKPT_CHECKPOINT_COLLOCATED | SA_CKPT_WR_ACTIVE_REPLICA also, and am facing the same issue. If I remove the collocated flag, it works fine.
[tickets] [opensaf:tickets] #722 payloads did not go for reboot when both the controllers rebooted
- **status**: unassigned -- assigned - **assigned_to**: A V Mahesh (AVM) - **Milestone**: 4.7-Tentative -- 4.5.2 --- ** [tickets:#722] payloads did not go for reboot when both the controllers rebooted** **Status:** assigned **Milestone:** 4.5.2 **Created:** Thu Jan 16, 2014 07:36 AM UTC by Sirisha Alla **Last Updated:** Fri Aug 07, 2015 04:24 AM UTC **Owner:** A V Mahesh (AVM) **Attachments:** - [payloadnoreboot.tar.bz2](https://sourceforge.net/p/opensaf/tickets/722/attachment/payloadnoreboot.tar.bz2) (765.1 kB; application/x-bzip) The issue is seen on changeset 4733 plus the CLM patches corresponding to the changesets of #220. Continuous failovers are happening while some API invocations of the IMM application are ongoing. The IMMD has asserted on the new active, which is reported in ticket #721. When both controllers got rebooted, the payloads did not get rebooted; instead, the OpenSAF services are up and running. CLM shows that both payloads are not part of the cluster. When the payloads were restarted manually, they joined the cluster.
PL-3 syslog: Jan 15 18:23:09 SLES-64BIT-SLOT3 osafimmnd[3550]: NO implementer for class 'testMA_verifyObjApplNoResponseModCallback_101' is released = class extent is UNSAFE Jan 15 18:23:59 SLES-64BIT-SLOT3 logger: Invoking failover from invoke_failover.sh Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA DISCARD DUPLICATE FEVS message:92993 Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Error code 2 returned for message type 57 - ignoring Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA DISCARD DUPLICATE FEVS message:92994 Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Error code 2 returned for message type 57 - ignoring Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Director Service in NOACTIVE state - fevs replies pending:1 fevs highest processed:92994 Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: NO No IMMD service = cluster restart Jan 15 18:24:01 SLES-64BIT-SLOT3 osafamfnd[3572]: NO 'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart' Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[6827]: Started Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[6827]: NO Persistent Back-End capability configured, Pbe file:imm.db (suffix may get added) Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176901] TIPC: Resetting link 1.1.3:eth0-1.1.2:eth0, peer not responding Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176911] TIPC: Lost link 1.1.3:eth0-1.1.2:eth0 on network plane A Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176918] TIPC: Lost contact with 1.1.2 Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256091] TIPC: Resetting link 1.1.3:eth0-1.1.1:eth0, peer not responding Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256100] TIPC: Lost link 1.1.3:eth0-1.1.1:eth0 on network plane A Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256106] TIPC: Lost contact with 1.1.1 Jan 15 18:24:25 SLES-64BIT-SLOT3 kernel: [ 6361.425537] TIPC: Established link 1.1.3:eth0-1.1.2:eth0 on network plane A Jan 15 
18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: IMM_SERVER_ANONYMOUS -- IMM_SERVER_CLUSTER_WAITING Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING -- IMM_SERVER_LOADING_PENDING Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING -- IMM_SERVER_LOADING_CLIENT Jan 15 18:24:29 SLES-64BIT-SLOT3 osafimmnd[6827]: NO ERR_BAD_HANDLE: Admin owner 1 does not exist Jan 15 18:24:36 SLES-64BIT-SLOT3 kernel: [ 6372.473240] TIPC: Established link 1.1.3:eth0-1.1.1:eth0 on network plane A Jan 15 18:24:39 SLES-64BIT-SLOT3 osafimmnd[6827]: NO ERR_BAD_HANDLE: Admin owner 2 does not exist Jan 15 18:24:39 SLES-64BIT-SLOT3 osafimmnd[6827]: NO NODE STATE- IMM_NODE_LOADING Jan 15 18:24:45 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:5000 Jan 15 18:24:46 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:6000 Jan 15 18:24:47 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:7000 Jan 15 18:24:48 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:8000 Jan 15 18:24:49 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM is:9000 After both the controllers came up, the following is the status:
SLES-64BIT-SLOT1:~ # immlist safNode=PL-3,safCluster=myClmCluster
Name                          Type         Value(s)
safNode                       SA_STRING_T  safNode=PL-3
saClmNodeLockCallbackTimeout  SA_TIME_T    500 (0xba43b7400, Thu Jan 1 05:30:50 1970)
saClmNodeIsMember             SA_UINT32_T  Empty
saClmNodeInitialViewNumber    SA_UINT64_T  Empty
saClmNodeID                   SA_UINT32_T  Empty
saClmNodeEE
[tickets] [opensaf:tickets] #520 Mds: Tune MDS logging to minimal informative
- **Milestone**: 5.0 -- 4.7-Tentative --- ** [tickets:#520] Mds: Tune MDS logging to minimal informative ** **Status:** assigned **Milestone:** 4.7-Tentative **Created:** Thu Jul 25, 2013 01:19 PM UTC by hano **Last Updated:** Fri Mar 13, 2015 12:09 PM UTC **Owner:** A V Mahesh (AVM) Minimize MDS logging to only what is required, so that the 1 MB log rotation size is not reached too soon. An amfnd core dump is produced when the amfnd main thread (10720) is waiting for a pthread mutex, gl_mds_library_mutex, which is held by the mds thread (10723). The AMF watchdog detects this (no healthchecks received) and sends an abort signal to amfnd. Holding a mutex during file operations in MDS is not correct and should be corrected. (HR50165) #0 0x7f7830d70294 in __lll_lock_wait () from /lib64/libpthread.so.0 (gdb) p gl_mds_library_mutex $1 = {__data = {__lock = 2, __count = 1, __owner = 10723, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = ã ) , ' ' repeats 22 times, __align = 4294967298} (gdb) info thr Id Target Id Frame 4 Thread 0x7f7832263b00 (LWP 10723) 0x7f783083e20d in write () from /lib64/libc.so.6 3 Thread 0x7f7832283b00 (LWP 10722) 0x7f7830844f53 in select () from /lib64/libc.so.6 2 Thread 0x7f7832243b00 (LWP 10724) 0x7f7830d7076d in read () from /lib64/libpthread.so.0 * 1 Thread 0x7f7832286700 (LWP 10720) 0x7f7830d70294 in __lll_lock_wait () from /lib64/libpthread.so.0
[tickets] [opensaf:tickets] #249 mds : tipc Invalid read errors in mds
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#249] mds : tipc Invalid read errors in mds** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 06:44 AM UTC by A V Mahesh (AVM) **Last Updated:** Wed Jul 15, 2015 02:45 PM UTC **Owner:** nobody **Attachments:** - [valgrind.log](https://sourceforge.net/p/opensaf/tickets/249/attachment/valgrind.log) (23.1 kB; application/octet-stream) from http://devel.opensaf.org/ticket/1820 We are using OpenSAF 4.1 and, while running valgrind on one of our applications, we noticed a number of invalid read errors that seem to arise in the MDS library code. Attached is the valgrind report of the same.
[tickets] [opensaf:tickets] #250 mds : tipc Missing error Checking in mds_dt_tipc.c
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#250] mds : tipc Missing error Checking in mds_dt_tipc.c** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 06:45 AM UTC by A V Mahesh (AVM) **Last Updated:** Wed Jul 15, 2015 02:45 PM UTC **Owner:** nobody http://devel.opensaf.org/ticket/574 Missing error handling of calls to ncs_enc_init_space_pp() and ncs_encode_n_octets_in_uba() in mds_dt_tipc.c. This can cause a segmentation fault.
[tickets] [opensaf:tickets] #253 mds : 1.5 sec wait added in RSP send causes problems in MDS clients
- **status**: assigned -- unassigned - **assigned_to**: A V Mahesh (AVM) -- nobody --- ** [tickets:#253] mds : 1.5 sec wait added in RSP send causes problems in MDS clients** **Status:** unassigned **Milestone:** future **Created:** Thu May 16, 2013 06:54 AM UTC by A V Mahesh (AVM) **Last Updated:** Wed Jul 15, 2015 02:44 PM UTC **Owner:** nobody from http://devel.opensaf.org/ticket/2825 The single-threaded LOG server stalled waiting for the file system for longer than 10 sec, which is the sync timeout in the LOG library. This causes LOG clients (e.g. the NTF server) to time out and retry. This creates a backlog of outdated messages in the LOG server mailbox. When those eventually are handled, the 1.5 sec in MDS is added to each RSP send. Therefore the LOG server never catches up with received messages in the mailbox. The change introduced in #2611 introduced an unacceptable hidden delay when sending messages that can have consequences for any client with soft real time requirements, for example AMF HC timeouts. References: http://devel.opensaf.org/ticket/2611 http://list.opensaf.org/pipermail/devel/2012-April/022254.html Workaround: the LOG server throws away rotten messages that are older than 10 sec. Proposed long term solution: MDS should buffer incoming data messages until the corresponding SVC up message is received and potentially delivered to the client. Replying to hafe: Single threaded LOG server stalled waiting for file system for a longer time than 10 sec which is the sync tmo in the LOG library. This causes LOG clients (e.g. NTF server) to timeout and retry. LOG service or any other service (like dtsv) that does disk i/o is prone to these situations. This creates a backlog of outdated messages in the LOG server mailbox. When those eventually are handled, the 1.5 sec in MDS is added to each RSP send. Therefore the LOG server never catches up with received messages in the mailbox. This is a case of a slow receiver.
More in the next comment The change introduced in #2611 introduced an unacceptable hidden delay when sending messages that can have consequences for any client with soft real time requirements. For example AMF HC timeouts. I don't think that change (in MDS) can 'directly and always' result in making LOG a 'slow transmitter'! Because, the 1.5 seconds I believe is only when the MDS client starts up, like during a node bootup. Having said that, such services that are dependent on responses from external resources (modules), like disk i/o in this case, should be tuned to have generally bigger healthcheck timeouts. Surya, could you please comment on Hans' theory on the 1.5 seconds. References: http://devel.opensaf.org/ticket/2611 http://list.opensaf.org/pipermail/devel/2012-April/022254.html Workaround: LOG server throws away rotten messages that are older than 10 sec. Proposed long term solution: MDS should buffer incoming data messages until the corresponding SVC up message is received and potentially delivered to the client. Changed 8 months ago by mathi ¶ I mean, if we try to formulate and understand the problem: If the problem is health check timeouts, we should do the following: • increase the timeout for healthcheck, and • if necessary, introduce a separate healthcheck thread. If the problem is about clients' receiving retry, then these situations would occur typically when the shared filesystem is/was undergoing a role change or is in the process of some heavy sync operation, etc. In such situations, returning TRY_AGAIN is a genuine way of handling such situations (typically these situations can occur only during an upgrade kind of scenario that might involve role change or when some fault at the disk level and not during normal lifecycle when the healthchecks.)
If the problem is a timeout that is caused by slow processing, then we could think of introducing some protocol between the LGA and LGS to improve the congestion. I mean I'm tending to think along this angle; the end solution may involve LGA, LGS or even MDS, but I think the problems being described here would have occurred even without 2611, and as such 2611 cannot contribute much to the problem formulated in this ticket. Having said that, throwing away older messages shouldn't be a problem, but I'm trying to understand how that could improve the situation... Changed 7 months ago by nagendra ¶ ■owner changed from surya to nagendra ■status changed from new to accepted Changed 7 months ago by nagendra ¶ ■owner changed from nagendra to surya ■status changed from accepted to assigned Changed 7 months ago by surya ¶ ■status changed from assigned to accepted Changed 7 months ago by surya ¶ ■patch_waiting changed from no to yes Changed 7 months ago by mahesh ¶ Steps to test: 1) Pause osaflogd process (# kill -STOP osaflogd PID ) 2) Write to system stream using saflogger tool (#/usr/local/bin/saflogger -y Out of
[tickets] [opensaf:tickets] #1423 ckptnd doesn't handle fault case when creating shared memory at start up
- **status**: assigned -- unassigned - **assigned_to**: Pham Hoang Nhat -- nobody --- ** [tickets:#1423] ckptnd doesn't handle fault case when creating shared memory at start up** **Status:** unassigned **Milestone:** future **Created:** Tue Jul 21, 2015 06:33 AM UTC by Pham Hoang Nhat **Last Updated:** Tue Aug 11, 2015 06:36 AM UTC **Owner:** nobody Observed behaviour -- While installing a campaign with a test component, ckptnd triggered a core dump. Error messages -- Following is the message in the syslog. Jun 17 07:50:41 SC-2-2 osafckptnd[11361]: ER cpnd open request fail for RDWR mode (null) Jun 17 07:50:51 SC-2-2 kernel: [ 494.474214] osafckptnd[11361]: segfault at 0 ip 7f25cd609608 sp 7fffdb6290b8 error 4 in libc-2.19.so[7f25cd57f000+19e000] Following is the bt: (gdb) bt #0 0x7fb733293608 in _wordcopy_fwd_dest_aligned () from /lib64/libc.so.6 #1 0x7fb73328db8a in __memmove_sse2 () from /lib64/libc.so.6 #2 0x7fb7343258cc in ncs_os_posix_shm (req=0x7fffe65e7090) at os_defs.c:836 #3 0x00415d1f in cpnd_find_free_loc () #4 0x00415f46 in cpnd_restart_shm_client_update () #5 0x00405a5b in cpnd_evt_proc_ckpt_init () #6 0x0040d532 in cpnd_process_evt () #7 0x0040e235 in cpnd_main_process () #8 0x0040edf7 in main ()
[tickets] [opensaf:tickets] #1443 log: service is crashed if creating and deleting conf obj class continuously
--- ** [tickets:#1443] log: service is crashed if creating and deleting conf obj class continuously** **Status:** unassigned **Milestone:** 4.5.2 **Created:** Tue Aug 11, 2015 07:37 AM UTC by Vu Minh Nguyen **Last Updated:** Tue Aug 11, 2015 07:37 AM UTC **Owner:** nobody When creating an application object class and deleting it continuously, the log service can crash. To reproduce this case, perform the following command: for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20; do immcfg -c SaLogStreamConfig safLgStrCfg=TestLog -a saLogStreamPathName=. -a saLogStreamFileName=TestLog; echo "create ($i) - $?"; immcfg -d safLgStrCfg=TestLog; echo "Delete ($i) - $?"; done The output looks something like: create (1) - 0 Delete (1) - 0 create (2) - 0 error - saImmOmCcbObjectDelete for 'safLgStrCfg= TestLog' FAILED: SA_AIS_ERR_FAILED_OPERATION (21) error - saImmOmCcbApply FAILED: SA_AIS_ERR_FAILED_OPERATION (21) Delete (2) - 1 error - saImmOmCcbObjectCreate_2 FAILED with SA_AIS_ERR_EXIST (14) create (3) - 1 reboot: Restarting system Here is the analysis: 1. When creating the obj class is done by IMM, logsv has not finished the `apply callback` job yet. In this step, it needs to update a runtime attribute, `saLogStreamCreationTimestamp`. This is done in the main thread. 2. If deleting this obj class comes before the `apply callback` job finishes, IMM will mark that obj class as `IMM_DELETE_LOCK`, call the respective callbacks to logsv and *wait for the response*, but logsv is busy doing the `apply callback` in (1). When logsv requests `update runtime attribute` from IMM, IMM returns TRY_AGAIN. IMM waits for the logsv response to release `IMM_DELETE_LOCK`, while logsv is still stuck in `update rt attribute` getting TRY_AGAIN. Consequently, logsv might be terminated if the try-again limit is reached, or the delete action fails.
[tickets] [opensaf:tickets] #1224 ckpt: enhanced trace log and check of user parameters
- **status**: assigned -- unassigned
- **Milestone**: 5.0 -- future

--- ** [tickets:#1224] ckpt: enhanced trace log and check of user parameters** **Status:** unassigned **Milestone:** future **Created:** Tue Dec 02, 2014 06:32 AM UTC by Ingvar Bergström **Last Updated:** Thu Jul 09, 2015 02:55 AM UTC **Owner:** A V Mahesh (AVM)

The checkpoint service shall provide better trace logging. Checking of user parameters in the library user interface shall be enhanced.
[tickets] [opensaf:tickets] #787 Processes that use opensaf agents get killed if opensafd is stopped when TCP is the transport
- **Milestone**: 4.7-Tentative -- future

--- ** [tickets:#787] Processes that use opensaf agents get killed if opensafd is stopped when TCP is the transport** **Status:** unassigned **Milestone:** future **Created:** Fri Feb 14, 2014 06:36 AM UTC by manu **Last Updated:** Fri Aug 07, 2015 04:15 AM UTC **Owner:** nobody

This issue is seen when the transport is TCP. When opensafd is stopped, application processes that are using OpenSAF services exit. Since 4.4, with the implementation of #220, stopping opensafd is equivalent to a node down. Applications that use OpenSAF should not exit when opensafd is stopped and the transport is TCP.
[tickets] [opensaf:tickets] #1123 AMF: delete of attributes related to ticket #819 is not working properly
- **Priority**: major -- minor

--- ** [tickets:#1123] AMF: delete of attributes related to ticket #819 is not working properly** **Status:** assigned **Milestone:** 4.7-Tentative **Created:** Mon Sep 22, 2014 02:03 PM UTC by hano **Last Updated:** Wed Jul 15, 2015 01:20 PM UTC **Owner:** hano

The problem with ticket #819 related to attribute deletion: when the information that an attribute has been deleted is not known on the amfnd side, amfd uses the value_is_deleted flag to update the attribute with e.g. the global attribute. This needs to be solved by one of the following:

1) Re-introduce the applier.
2) Add a new field/variable to AVSV_PARAM_INFO to e.g. indicate deletion of an attribute value. There may be upgrade problems with this.
3) Do all processing regarding changing base and inherited attributes in amfd.
[tickets] [opensaf:tickets] #819 AMF: support immediate effect when changing comp/hc-type attributes
- **status**: review -- fixed

--- ** [tickets:#819] AMF: support immediate effect when changing comp/hc-type attributes** **Status:** fixed **Milestone:** 5.0 **Created:** Tue Mar 25, 2014 10:16 AM UTC by Hans Feldt **Last Updated:** Tue Jan 20, 2015 08:28 AM UTC **Owner:** hano

This is a continuation of ticket #539. The use case is changing, for example, an HC timeout, which should take effect immediately without the need for a restart. AMF should support the writable attributes of sutype, su, comptype, comp, hctype, hc and SaAmfCompGlobalAttributes -- basically everything related to the execution of components and error handling in the AMF node director.
[tickets] [opensaf:tickets] #1437 AMF: Enhance csi assignment/removal illustration in samples/amf/sa_aware/amf_demo
- **status**: unassigned -- assigned
- **assigned_to**: Minh Hon Chau
- **Milestone**: future -- 4.7-Tentative

--- ** [tickets:#1437] AMF: Enhance csi assignment/removal illustration in samples/amf/sa_aware/amf_demo** **Status:** assigned **Milestone:** 4.7-Tentative **Created:** Thu Aug 06, 2015 07:11 AM UTC by Minh Hon Chau **Last Updated:** Thu Aug 06, 2015 07:11 AM UTC **Owner:** Minh Hon Chau

There are 2 points in this enhancement ticket:

1. Currently, after loading the 2N amfdemo sample, if the command `amf-adm shutdown` is issued on the active SU, amfdemo crashes with this error: **Aug 6 15:04:54 PL-4 amf_demo[577]: saAmfHAStateGet FAILED - 7**. This is due to saAmfHAStateGet() being called with a null csiName (TARGET_ALL). As a sample app, this should be corrected to use saAmfHAStateGet() properly.

2. The sample does not show the csi HA state transition/life cycle (it currently accepts every csi_assign and csi_remove callback without regard to csi existence). This illustration could be added to the amf_demo sample to show how the state goes from STANDBY to ACTIVE, ACTIVE to QUIESCED, etc. It would also be useful for testing whether AMF behaves correctly with its application in terms of HA.