Hi Mathi,

The crash in CPA was caused by memory corruption in our application; there is nothing wrong in CPA. The issue is fixed now.
Sorry for the false report.

Regards,
Girish

-----Original Message-----
From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
Sent: Monday, February 23, 2015 5:36 PM
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: [users] Issues with CPSv

Girish,

It appears that the crash is happening in the CPA (CheckpointAgent) library linked to your application.
To enable the CPA traces, you could
export CPA_TRACE_PATHNAME=/tmp/myapp_cpadebug.log
before starting your application and share the traces.

Cheers,
Mathi.

----- [email protected] wrote:

> Hi Mathi/Mahesh,
>
> First of all thanks for helping me in resolving this issue.
>
> Do you require CPA(application) or traces of CPA? If it is traces,
> please let me know how to get it.
>
> Regards,
> Girish
>
> -----Original Message-----
> From: Mathivanan Naickan Palanivelu [mailto:[email protected]]
> Sent: Friday, February 20, 2015 3:55 PM
> To: [email protected]
> Cc: [email protected]; [email protected]
> Subject: Re: [users] Issues with CPSv
>
> Hi,
>
> Please raise a ticket for this crash and share the traces of CPND and
> CPA(your application).
> Also, you should specify a testcase or try to explain what the
> application is doing and at what point the crash is occurring.
>
> Thanks,
> Mathi.
>
> ----- [email protected] wrote:
>
> > Hi,
> >
> > I don’t get this issue with opensaf version 4.3, but I get segfault:
> >
> > application sometimes crashes, stack trace as below:
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > 94 patricia.c: No such file or directory.
> > (gdb) bt
> > #0 search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > #1 0xb76d0bef in ncs_patricia_tree_get (pTree=pTree@entry=0x8f733e4, pKey=pKey@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:434
> > #2 0xb7738493 in cpa_lcl_ckpt_node_get (lcl_ckpt_tree=lcl_ckpt_tree@entry=0x8f733e4, lc_hdl=lc_hdl@entry=0xbfa0cdf8, lc_node=lc_node@entry=0xbfa0ce10) at cpa_db.c:195
> > #3 0xb7734d76 in saCkptCheckpointWrite (checkpointHandle=150466120, ioVector=0x92c6d28, numberOfElements=1320, erroneousVectorIndex=erroneousVectorIndex@entry=0xbfa0d35c) at cpa_api.c:3134
> >
> > (gdb) p pNode
> > $2 = (NCS_PATRICIA_NODE *) 0x5e
> > (gdb) p *pTree
> > $4 = {root_node = {bit = -1, left = 0x8f7e9c0, right = 0x8f733e4, key_info = 0x8f734b8 ""}, params = {key_size = 8, info_size = 0, actual_key_size = 0, node_size = 0}, n_nodes = 3}
> >
> > Regards,
> > Girish
> >
> > *From:* Girish Nagaraj [mailto:[email protected]]
> > *Sent:* Friday, February 20, 2015 3:34 PM
> > *To:* 'A V Mahesh'; '[email protected]'
> > *Subject:* RE: [users] Issues with CPSv
> >
> > Hi,
> >
> > Yes, similar issue in TCP also: exits with message:
> >
> > Feb 20 15:24:59 fedvm1 RIB[28549]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err :Success
> > Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart probation timer started (timeout: 4000000000 ns)
> > Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO Restarting a component of 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp restart count: 1)
> > Feb 20 15:24:59 fedvm1 osafamfnd[28263]: NO 'safComp=ribd,safSu=SU1,safSg=zebos-simplex,safApp=zebos' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >
> > I experimented with code changes:
> >
> > recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->len_buff, 2, MSG_NOSIGNAL);
> > if (0 == recd_bytes) {
> >         syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 111:%d", recd_bytes, errno);
> >         close(tcp_cb->DBSRsock);
> >         exit(0);
> > } else if (2 == recd_bytes) {
> >         uint16_t local_len_buf = 0;
> >
> >         data = tcp_cb->len_buff;
> >         local_len_buf = ncs_decode_16bit(&data);
> >
> >         /* MY CHANGE START */
> >         if (0 == local_len_buf)
> >                 return;
> >         /* MY CHANGE END */
> >
> >         tcp_cb->buff_total_len = local_len_buf;
> >         tcp_cb->num_by_read_for_len_buff = 2;
> >
> >         if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
> >                 /* Length + 2 is done to reuse the same buffer
> >                    while sending to other nodes */
> >                 syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
> >                 return;
> >         }
> >         recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
> >         if (recd_bytes < 0) {
> >                 return;
> >         } else if (0 == recd_bytes) {
> >                 syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err 222:%d len:%d", recd_bytes, errno, local_len_buf);
> >                 close(tcp_cb->DBSRsock);
> >                 exit(0);
> >
> > This caused many other issues, so I think just returning won’t work.
> >
> > Regards,
> > Girish
> >
> > -----Original Message-----
> > From: A V Mahesh [mailto:[email protected] <[email protected]>]
> > Sent: Friday, February 20, 2015 1:38 PM
> > To: Girish Nagaraj; [email protected]
> > Subject: Re: [users] Issues with CPSv
> >
> > Hi,
> >
> > On 2/20/2015 1:19 PM, Girish Nagaraj wrote:
> > > Hi,
> > >
> > > I think this is not a connection loss; we are passing 0 (the number of
> > > bytes to be read) to the recv() function, which returns 0 received bytes.
> >
> > You mean you are seeing an issue similar to `TIPC ticket #1227 mds/tipc
> > : protect mds application form zero bytes hacking messages` for TCP as well?
> >
> > -AVM
> >
> > > local_len_buf = ncs_decode_16bit(&data);
> > >
> > > Is there a mistake in decoding local_len_buf?
> > >
> > > Regards,
> > > Girish
> > >
> > > -----Original Message-----
> > > From: A V Mahesh [mailto:[email protected] <[email protected]>]
> > > Sent: Friday, February 20, 2015 11:03 AM
> > > To: [email protected]
> > > Subject: Re: [users] Issues with CPSv
> > >
> > > Hi,
> > >
> > > On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
> > >> local_len_buf turns out be 0, this causes recv() to return 0 and
> > >> application exits. Is this programming bug??
> > > This is expected behavior: if a connection loss happens on a TCP
> > > socket, it receives zero bytes. This is not related to CPSv.
> > >
> > > -AVM
> > >
> > > On 2/19/2015 3:42 PM, Girish Nagaraj wrote:
> > >> Hi,
> > >>
> > >> *Background*:
> > >>
> > >> Opensaf version: 4.5
> > >> Number of checkpoints used: 2
> > >> In our application we use CPSv to save application data, and when the
> > >> application faults, it is restarted and its state is restored
> > >> by reading data from checkpoints.
> > >> Model: Simplex
> > >>
> > >> *Issue faced:*
> > >>
> > >> application sometimes crashes, stack trace as below:
> > >>
> > >> Program received signal SIGSEGV, Segmentation fault.
> > >> search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > >> 94 patricia.c: No such file or directory.
> > >> (gdb) bt
> > >> #0 search (pTree=pTree@entry=0x8f733e4, key=key@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:94
> > >> #1 0xb76d0bef in ncs_patricia_tree_get (pTree=pTree@entry=0x8f733e4, pKey=pKey@entry=0xbfa0cdf8 "H\356\367\b") at patricia.c:434
> > >> #2 0xb7738493 in cpa_lcl_ckpt_node_get (lcl_ckpt_tree=lcl_ckpt_tree@entry=0x8f733e4, lc_hdl=lc_hdl@entry=0xbfa0cdf8, lc_node=lc_node@entry=0xbfa0ce10) at cpa_db.c:195
> > >> #3 0xb7734d76 in saCkptCheckpointWrite (checkpointHandle=150466120, ioVector=0x92c6d28, numberOfElements=1320, erroneousVectorIndex=erroneousVectorIndex@entry=0xbfa0d35c) at cpa_api.c:3134
> > >>
> > >> (gdb) p pNode
> > >> $2 = (NCS_PATRICIA_NODE *) 0x5e
> > >> (gdb) p *pTree
> > >> $4 = {root_node = {bit = -1, left = 0x8f7e9c0, right = 0x8f733e4, key_info = 0x8f734b8 ""}, params = {key_size = 8, info_size = 0, actual_key_size = 0, node_size = 0}, n_nodes = 3}
> > >>
> > >> sometimes application exits with below message:
> > >>
> > >> Feb 19 15:13:31 controller2 RIB[28395]: MDTM:socket_recv() = 0, conn lost with dh server, exiting library err:0 len:0
> > >> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' component restart probation timer started (timeout: 4000000000 ns)
> > >> Feb 19 15:13:31 controller2 osafamfnd[28110]: NO Restarting a component of 'safSu=SU1,safSg=zebos-simplex,safApp=zebos' (comp restart count: 1)
> > >>
> > >> Below is the modified code snippet from file
> > >> osaf/libs/core/mds/mds_dt_trans.c
> > >>
> > >> } else if (2 == recd_bytes) {
> > >>         uint16_t local_len_buf = 0;
> > >>
> > >>         data = tcp_cb->len_buff;
> > >>         local_len_buf = ncs_decode_16bit(&data);
> > >>         tcp_cb->buff_total_len = local_len_buf;
> > >>         tcp_cb->num_by_read_for_len_buff = 2;
> > >>
> > >>         if (NULL == (tcp_cb->buffer = calloc(1, (local_len_buf + 1)))) {
> > >>                 /* Length + 2 is done to reuse the same buffer
> > >>                    while sending to other nodes */
> > >>                 syslog(LOG_ERR, "Memory allocation failed in dtm_intranode_processing");
> > >>                 return;
> > >>         }
> > >>         recd_bytes = recv(tcp_cb->DBSRsock, tcp_cb->buffer, local_len_buf, 0);
> > >>         if (recd_bytes < 0) {
> > >>                 return;
> > >>         } else if (0 == recd_bytes) {
> > >>                 syslog(LOG_ERR, "MDTM:socket_recv() = %d, conn lost with dh server, exiting library err:%d len:%d", recd_bytes, errno, local_len_buf);
> > >>                 close(tcp_cb->DBSRsock);
> > >>                 exit(0); *<<<<<<<EXITS HERE>>>>>>>>>>*
> > >>         } else if (local_len_buf > recd_bytes) {
> > >>                 /* can happen only in two cases, system call interrupt or half data, */
> > >>                 TRACE("less data recd, recd bytes = %d, actual len = %d", recd_bytes, local_len_buf);
> > >>                 tcp_cb->bytes_tb_read = tcp_cb->buff_total_len - recd_bytes;
> > >>                 return;
> > >>
> > >> local_len_buf turns out to be 0, this causes recv() to return 0 and
> > >> the application exits. Is this a programming bug??
> > >>
> > >> Could someone please help to resolve these issues.
> > >>
> > >> Regards,
> > >> Girish
> > >
> > > ------------------------------------------------------------------------------
> > > Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT
> > > Server from Actuate! Instantly Supercharge Your Business Reports and
> > > Dashboards with Interactivity, Sharing, Native Excel Exports, App
> > > Integration & more Get technology previously reserved for
> > > billion-dollar corporations, FREE
> > > http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
> > > _______________________________________________
> > > Opensaf-users mailing list
> > > [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/opensaf-users
