Hi Anders Widell / HansN,

On 9/16/2016 2:03 PM, Anders Widell wrote:
> The idea was to just log reception of error info messages, for
> trouble-shooting purposes.

After multiple attempts, I managed to simulate the TIPC_ERR_OVERLOAD error. Once TIPC_ERR_OVERLOAD is hit, the cluster goes into an unrecoverable state, because the send buffers are full. So we have two options:

1) Set TIPC_DEST_DROPPABLE to false, log the TIPC_ERR_OVERLOAD error and then exit the sender gracefully, which allows the remaining nodes to survive.
2) Keep the current configuration as it is (TIPC_DEST_DROPPABLE set to true).

=================================================================================================================
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Received node_up from 2040f: msg_id 1
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Node 'PL-4' joined the cluster
Sep 20 15:14:09 SC-1 osafimmnd[3695]: NO Implementer connected: 19 (MsgQueueService132111) <0, 2040f>
*Sep 20 15:16:59 SC-1 osafimmd[3684]: 77 MDTM: undelivered message condition ancillary data: TIPC_ERR_OVERLOAD*
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Director Service in NOACTIVE state - fevs replies pending:1 fevs highest processed:218744
Sep 20 15:17:00 SC-1 osafamfnd[3773]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'nodeFailfast'
Sep 20 15:17:00 SC-1 osafamfnd[3773]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery is:nodeFailfast
Sep 20 15:17:00 SC-1 osafamfnd[3773]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA DISCARD DUPLICATE FEVS message:218744
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Error code 2 returned for message type 82 - ignoring
Sep 20 15:17:00 SC-1 opensaf_reboot: Rebooting local node; timeout=60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA SC Absence IS allowed:900 IMMD service is DOWN
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO IMMD SERVICE IS DOWN, HYDRA IS CONFIGURED => UNREGISTERING IMMND form MDS
Sep 20 15:17:00 SC-1 osafntfimcnd[3742]: NO saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:20002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 1 <2, 2010f> (safLogService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:d0d0002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:100002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 2 <16, 2010f> (@safLogService_appl)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:130002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 3 <19, 2010f> (@OpenSafImmReplicatorA)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:140002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:150002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 4 <21, 2010f> (safClmService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1a0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 5 <26, 2010f> (safAmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1b0002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bc0002010f sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bd0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 6 <1469, 2010f> (MsgQueueService131343)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c00002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 10 <1472, 2010f> (safEvtService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c40002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 8 <1476, 2010f> (safSmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c60002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 9 <1478, 2010f> (safLckService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c70002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 7 <1479, 2010f> (safMsgGrpService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5cc0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5ce0002010f sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 12 <1486, 2010f> (safCheckPointService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 13 <0, 2020f(down)> (MsgQueueService131599)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 14 <0, 2020f(down)> (@OpenSafImmReplicatorB)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 15 <0, 2020f(down)> (@safAmfService2020f)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2020f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 16 <0, 2030f(down)> (MsgQueueService131855)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2030f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 19 <0, 2040f(down)> (MsgQueueService132111)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2040f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO MDS unregisterede. sleeping ...
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Sleep done registering IMMND with MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fe8fa0043 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60040 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6002e already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60037 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6003d already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6002b already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb6001c already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcba0012 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO SUCCESS IN REGISTERING IMMND WITH MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Re-introduce-me highestProcessed:218744 highestReceived:218744
Sep 20 15:17:03 SC-1 kernel: [ 1794.198381] md: stopping all md devices.
Sep 20 15:17:03 SC-1 osafntfimcnd[8997]: WA ntfimcn_imm_init saImmOiInitialize_2() returned SA_AIS_ERR_TIMEOUT (5)
Sep 20 15:18:00 SC-1 syslog-ng[1221]: syslog-ng starting up; version='2.0.9'
=================================================================================================================

-AVM

On 9/16/2016 2:03 PM, Anders Widell wrote:
>
> I don't think we need (or even should) inform the sender when MDS
> receives an error information message from TIPC. Note that these error
> information messages are received asynchronously, when the sender has
> already received an OK return code from the MDS send call. The idea
> was to just log reception of error info messages, for trouble-shooting
> purposes. We already have a mechanism in MDS that informs the receiver
> about lost MDS messages. If we wish to inform the sender we would need
> to introduce a second mechanism in MDS, and at this point I don't
> think it is needed. Another approach we could consider is that MDS
> retransmits the message transparently without informing the sender.
> This would require MDS to internally store sent messages for a while,
> so that they can be retransmitted. It would also require the receiver
> to re-order received messages, since a retransmitted message will be
> received out of sequence.
>
> regards,
>
> Anders Widell
>
>
> On 09/16/2016 06:40 AM, A V Mahesh wrote:
>> Hi HansN,
>>
>> I managed to create TIPC_ERRINFO/TIPC_RETDATA error cases (not the
>> TIPC_ERR_OVERLOAD error) with normal messages,
>> and I observed that with TIPC_DEST_DROPPABLE set to true even the
>> TIPC_ERRINFO error is NOT notified (it means TIPC_ERR_OVERLOAD);
>> with TIPC_DEST_DROPPABLE set to false the TIPC_ERRINFO/TIPC_RETDATA
>> errors are notified.
>> >> Now I will also check implication of TIPC_DEST_DROPPABLE set to false >> on multicast and broadcast messages, based on that >> we can re-arrange the TIPC_DEST_DROPPABLE setting to false >> conditions based on agent `i_msg_loss_indication = true` condition >> mds can return to agent the same error TIPC_ERR_OVERLOAD. >> >> TIPC_DEST_DROPPABLE to false: >> >> ================================================================== >> >> Sep 15 16:10:39 SC-1 osafimmnd[32051]: NO Implementer disconnected 13 >> <0, 2040f> (MsgQueueService132111) >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: NO MDS event from svc_id 25 >> (change:4, dest:567413369208836) >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 
16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message >> condition ancillary data: TIPC_ERRINFO abort err : 2 >> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message >> condition ancillary data: TIPC_RETDATA >> Sep 15 16:10:39 SC-1 osafamfd[32114]: NO Node 'PL-4' left the cluster >> >> ================================================================== >> >> TIPC_DEST_DROPPABLE to true: >> >> ================================================================== >> >> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Implementer disconnected 13 >> <0, 2040f> (MsgQueueService132111) >> Sep 15 15:59:55 SC-1 osafimmd[26450]: NO MDS event from svc_id 25 >> (change:4, dest:567412923957252) >> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Global discard node >> received for nodeId:2040f pid:410 >> Sep 15 15:59:55 SC-1 osafamfd[28810]: NO Node 'PL-4' left the cluster >> Sep 15 15:59:58 SC-1 kernel: [ 5147.648737] tipc: Resetting link >> <1.1.1:eth0-1.1.4:eth0>, peer not responding >> Sep 15 15:59:58 SC-1 kernel: [ 5147.648756] tipc: Lost link >> <1.1.1:eth0-1.1.4:eth0> on network plane A >> Sep 15 15:59:58 SC-1 kernel: [ 5147.648771] tipc: Lost 
contact with <1.1.4>
>>
>> ==================================================================
>>
>> -AVM
>>
>>
>> On 9/1/2016 10:59 AM, Hans Nordebäck wrote:
>>> Hi Mahesh,
>>>
>>> I have not tested this, but the following should work:
>>>
>>> - Set BSRsock TIPC_IMPORTANCE to TIPC_LOW_IMPORTANCE
>>>
>>> - set socket receive buffer to a small value:
>>>
>>> optval = "small socket receive buffer size", 5000?
>>>
>>> setsockopt(tipc_cb.BSRsock, SOL_SOCKET, SO_RCVBUF, &optval, optlen)
>>>
>>> - sysctl -w net.tipc.tipc_rmem="5000 40000000 68240400" (or smaller
>>> values)
>>>
>>> - add some delays when processing messages in
>>> mdtm_process_recv_events(), to provoke overloading the socket
>>> receive buffer.
>>>
>>> We experience dropped packets in a 75 node system, and as a
>>> workaround, increasing the default socket receive buffer size seems
>>> to work for that setup.
>>>
>>> /Thanks HansN
>>>
>>> On 09/01/2016 05:50 AM, A V Mahesh wrote:
>>>> Hi HansN,
>>>>
>>>> Do you have any tips for creating an overload case?
>>>>
>>>> I would like to test and observe the TIPC_DEST_DROPPABLE enabled &
>>>> disabled cases.
>>>>
>>>> -AVM
>>>>
>>>>
>>>> On 9/1/2016 9:12 AM, A V Mahesh wrote:
>>>>> Hi HansN,
>>>>>
>>>>> Sorry for the delay.
>>>>>
>>>>> I will test it and get back to you soon.
>>>>>
>>>>> -AVM
>>>>>
>>>>>
>>>>> On 8/31/2016 4:29 PM, Hans Nordebäck wrote:
>>>>>> Hi Mahesh,
>>>>>> Any updates on this?
>>>>>>
>>>>>> /Regards HansN
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell
>>>>>> Sent: 25 August 2016 13:11
>>>>>> To: A V Mahesh <mahesh.va...@oracle.com>; Hans Nordebäck
>>>>>> <hans.nordeb...@ericsson.com>; mathi.naic...@oracle.com
>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> This is what the TIPC user documentation says about
>>>>>> TIPC_DEST_DROPPABLE:
>>>>>> "This option governs the handling of messages sent by the socket
>>>>>> if the message cannot be delivered to its destination, either
>>>>>> because the receiver is congested or because the specified
>>>>>> receiver does not exist.
>>>>>> If enabled, the message is discarded; otherwise the message is
>>>>>> returned to the sender."
>>>>>>
>>>>>> This is what the TIPC user documentation says about the return
>>>>>> value from the recvmsg() system call: "When used with a
>>>>>> connectionless socket, a return value of 0 indicates the arrival
>>>>>> of a returned data message that was originally sent by this socket."
>>>>>>
>>>>>> I think the documentation is pretty clear. If you set
>>>>>> TIPC_DEST_DROPPABLE to true, the receiver can discard messages
>>>>>> e.g. when the receive buffer is full. The sender will not be
>>>>>> notified in this case. If TIPC_DEST_DROPPABLE is set to false,
>>>>>> the message will be returned to the sender in case of a full
>>>>>> receive buffer. The sender knows that it has received such a
>>>>>> returned message when the recvmsg() call returns zero.
>>>>>>
>>>>>> regards,
>>>>>> Anders Widell
>>>>>>
>>>>>> On 08/25/2016 11:30 AM, A V Mahesh wrote:
>>>>>>> Hi HansN,
>>>>>>>
>>>>>>>
>>>>>>> On 8/23/2016 5:22 PM, Hans Nordebäck wrote:
>>>>>>>
>>>>>>>> Hi Mahesh,
>>>>>>>>
>>>>>>>> Yes, this is my understanding too, if TIPC_DROPPABLE = true tipc may
>>>>>>>> drop messages silently, at receive sock buffer full condition, but
>>>>>>>> do not return any ancillary message.
>>>>>>>> If TIPC_DROPPABLE = false tipc may drop message but will send an
>>>>>>>> ancillary message to inform about TIPC_ERR_OVERLOAD.
>>>>>>> [AVM]
>>>>>>>
>>>>>>> My observation and understanding are different. Based on the TIPC
>>>>>>> code and the Linux TIPC 2.0 Programmer's Guide, the
>>>>>>> TIPC_ERR_OVERLOAD error is returned when TIPC is unable to enqueue
>>>>>>> an incoming message on the receiving socket's receive queue,
>>>>>>> irrespective of whether TIPC_DEST_DROPPABLE is enabled or disabled.
>>>>>>>
>>>>>>> The only difference between TIPC_DEST_DROPPABLE enabled and
>>>>>>> disabled is this: if TIPC_DEST_DROPPABLE is enabled, the message is
>>>>>>> discarded, the size returned by recvmsg() is ZERO and the
>>>>>>> application will get errors; if TIPC_DEST_DROPPABLE is disabled,
>>>>>>> the message is returned to the sender, which means the size
>>>>>>> returned by recvmsg() is the size of the data the user sent, and
>>>>>>> the application will get errors.
>>>>>>>
>>>>>>> I checked the TIPC code and documentation and have not found any
>>>>>>> evidence that the TIPC_ERR_OVERLOAD error code is sent only if
>>>>>>> TIPC_DEST_DROPPABLE = false.
>>>>>>>
>>>>>>> Even while testing #1227
>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/) my
>>>>>>> observation and understanding was that an individual TIPC socket is
>>>>>>> only allowed to queue up OVERLOAD_LIMIT_BASE/2 messages of the
>>>>>>> lowest importance level before it starts rejecting them.
>>>>>>> Once a socket's receive queue length exceeds the maximum limit
>>>>>>> value, the receiving socket will send out a reject message with
>>>>>>> the TIPC_ERR_OVERLOAD error code, with cmsg_type set to
>>>>>>> TIPC_ERRINFO/TIPC_RETDATA; the TIPC code and the Linux TIPC 2.0
>>>>>>> Programmer's Guide confirm the same.
>>>>>>>
>>>>>>> tipc/socket.c
>>>>>>> =======================================================
>>>>>>> /* Reject message if there isn't room to queue it */
>>>>>>>
>>>>>>> recv_q_len = (u32)atomic_read(&tipc_queue_size);
>>>>>>> if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
>>>>>>>     if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
>>>>>>>         return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> recv_q_len = skb_queue_len(&sk->sk_receive_queue);
>>>>>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
>>>>>>>     if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
>>>>>>>         return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> =======================================================
>>>>>>>
>>>>>>>
>>>>>>> 2.1.17. setsockopt() of TIPC 2.0 Programmer's Guide
>>>>>>> =======================================================
>>>>>>> TIPC_DEST_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket if the
>>>>>>> message cannot be delivered to its destination, either because the
>>>>>>> receiver is congested or because the specified receiver does not
>>>>>>> exist. If enabled, the message is discarded; otherwise the message is
>>>>>>> returned to the sender.
>>>>>>>
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET and SOCK_STREAM
>>>>>>> socket types, and enabled for SOCK_RDM and SOCK_DGRAM. This
>>>>>>> arrangement ensures proper teardown of failed connections when
>>>>>>> connection-oriented data transfer is used, without increasing the
>>>>>>> complexity of connectionless data transfer.
>>>>>>>
>>>>>>> TIPC_SRC_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket if
>>>>>>> link congestion occurs. If enabled, the message is discarded;
>>>>>>> otherwise the system queues the message for later transmission.
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET, SOCK_STREAM,
>>>>>>> and SOCK_RDM socket types (resulting in "reliable" data transfer), and
>>>>>>> enabled for SOCK_DGRAM (resulting in "unreliable" data transfer).
>>>>>>> =======================================================
>>>>>>>
>>>>>>> Now I will try to create the OVERLOAD case and update you soon with
>>>>>>> my latest observations.
>>>>>>>
>>>>>>> -AVM
>>>>>>>
>>>>>>>> Correcting this and adding an abort is not backward compatible, as
>>>>>>>> some services already handle flow control in some way; only log
>>>>>>>> when packets are dropped.
>>>>>>>> Regarding ticket #1960 there are other solutions than introducing
>>>>>>>> flow control in MDS, e.g. expose an option to the service to choose
>>>>>>>> connection-oriented or connectionless.
>>>>>>>> The problem with dropped messages seems in one case to be related
>>>>>>>> to intensive logging by MDS.
>>>>>>>>
>>>>>>>> /Thanks HansN
>>>>>>>> -----Original Message-----
>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>>>>>> Sent: 23 August 2016 11:27
>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders Widell
>>>>>>>> <anders.wid...@ericsson.com>; mathi.naic...@oracle.com
>>>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]
>>>>>>>>
>>>>>>>> Hi HansN,
>>>>>>>>
>>>>>>>> It seems I am missing something, please allow me to understand.
>>>>>>>>
>>>>>>>> If I correctly understand your observation:
>>>>>>>>
>>>>>>>> With the current OpenSAF code (this #1957 patch NOT applied), by
>>>>>>>> default TIPC_DROPPABLE=true. While running OpenSAF with that binary,
>>>>>>>> when TIPC_ERR_OVERLOAD occurs, TIPC does not deliver the
>>>>>>>> TIPC_ERRINFO or TIPC_RETDATA errors, and the following code in the
>>>>>>>> function recvfrom_connectionless() is not hit. Is my understanding
>>>>>>>> right?
>>>>>>>> >>>>>>>> ===================================================================== >>>>>>>> >>>>>>>> ======================================== >>>>>>>> >>>>>>>> >>>>>>>> *if (anc->cmsg_type == TIPC_ERRINFO) {* >>>>>>>> /* TIPC_ERRINFO - TIPC error code associated with a >>>>>>>> returned >>>>>>>> data message or a connection termination message so abort */ >>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition >>>>>>>> ancillary >>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) ); >>>>>>>> *abort();* >>>>>>>> *} else if (anc->cmsg_type == TIPC_RETDATA) {* >>>>>>>> /* If we set TIPC_DEST_DROPPABLE off messge (configure >>>>>>>> TIPC to >>>>>>>> return rejected messages to the sender ) >>>>>>>> we will hit this when we implement MDS retransmit lost >>>>>>>> messages abort can be replaced with flow control logic*/ >>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) { >>>>>>>> m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr); >>>>>>>> cptr++; >>>>>>>> } >>>>>>>> /* TIPC_RETDATA -The contents of a returned data message so >>>>>>>> abort */ >>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition >>>>>>>> ancillary >>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) ); >>>>>>>> *abort();* >>>>>>>> } >>>>>>>> >>>>>>>> ===================================================================== >>>>>>>> >>>>>>>> ======================================== >>>>>>>> >>>>>>>> >>>>>>>> -AVM >>>>>>>> >>>>>>>> >>>>>>>> On 8/23/2016 1:08 PM, Hans Nordebäck wrote: >>>>>>>>> Hi Mahesh, >>>>>>>>> >>>>>>>>> Please see response below with [HansN] /Thanks HansN >>>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com] >>>>>>>>> Sent: den 23 augusti 2016 08:25 >>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders Widell >>>>>>>>> <anders.wid...@ericsson.com>; mathi.naic...@oracle.com >>>>>>>>> Cc: opensaf-devel@lists.sourceforge.net >>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC 
dropped messages >>>>>>>>> [#1957] >>>>>>>>> >>>>>>>>> Hi HansN >>>>>>>>> >>>>>>>>> Please see response below with [AVM] >>>>>>>>> >>>>>>>>> -AVM >>>>>>>>> >>>>>>>>> On 8/23/2016 11:41 AM, Hans Nordebäck wrote: >>>>>>>>>> Hi Mahesh, >>>>>>>>>> >>>>>>>>>> please see comments below. >>>>>>>>>> >>>>>>>>>> /Thanks HansN >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 08/23/2016 07:21 AM, A V Mahesh wrote: >>>>>>>>>>> Hi HansN, >>>>>>>>>>> >>>>>>>>>>> Let us fist discuss the error handling and abort, then we >>>>>>>>>>> can come >>>>>>>>>>> back to interpretation of TIPC currently does permit OR does >>>>>>>>>>> not permit an application to send a multicast message with the >>>>>>>>>>> "destination droppable" setting disabled. >>>>>>>>>>> >>>>>>>>>>> Let us disable TIPC_DEST_DROPPABLE, so that TIPC will try to >>>>>>>>>>> return an undelivered multicast message to its sender and we >>>>>>>>>>> can >>>>>>>>>>> determine issue is because of TIPC_ERR_OVERLOAD, this helps in >>>>>>>>>>> debugging , so that application may increased >>>>>>>>>>> SO_SNDBUF/SO_RCVBUF >>>>>>>>>>> to reduce the problem. >>>>>>>>>>> >>>>>>>>>>> But still we need to abort(), the reason for that is current >>>>>>>>>>> MDS >>>>>>>>>>> implementations doesn't have flow control logic ( no retry >>>>>>>>>>> because >>>>>>>>>>> of error ) , so Application like AMF can go wrong and >>>>>>>>>>> cluster will >>>>>>>>>>> go into unstable/recoverble state. >>>>>>>>>>> >>>>>>>>>> [HansN] In the current implementation messages are dropped >>>>>>>>>> silently >>>>>>>>>> and no abort is done. >>>>>>>>> [AVM] I can see abort(); in current code , you mean abort(); is >>>>>>>>> not working and application(amf) is not existing ? >>>>>>>>> [HansN] In case of TIPC_DROPPABLE=true and messages are dropped, >>>>>>>>> (TIPC_ERR_OVERLOAD) no abort is be performed, e.g amfd >>>>>>>>> detects this >>>>>>>>> in the msg sanity chk and logs "invalid msg id ..." 
>>>>>>>>> ==================================================================== >>>>>>>>> >>>>>>>>> == >>>>>>>>> ====== >>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) { >>>>>>>>> /* TIPC_ERRINFO - TIPC error code associated with a >>>>>>>>> returned >>>>>>>>> data message or a connection termination message so abort */ >>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition >>>>>>>>> ancillary >>>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) ); >>>>>>>>> *abort();* >>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) { >>>>>>>>> /* If we set TIPC_DEST_DROPPABLE off messge (configure >>>>>>>>> TIPC >>>>>>>>> to return rejected messages to the sender ) >>>>>>>>> we will hit this when we implement MDS retransmit lost >>>>>>>>> messages abort can be replaced with flow control logic*/ >>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) { >>>>>>>>> m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr); >>>>>>>>> cptr++; >>>>>>>>> } >>>>>>>>> /* TIPC_RETDATA -The contents of a returned data >>>>>>>>> message so >>>>>>>>> abort */ >>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition >>>>>>>>> ancillary >>>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) ); >>>>>>>>> *abort();* >>>>>>>>> } >>>>>>>>> ==================================================================== >>>>>>>>> >>>>>>>>> == >>>>>>>>> ====== >>>>>>>>>> This patch enables logging >>>>>>>>>> when packages are dropped to help in debugging. I don't agree >>>>>>>>>> that >>>>>>>>>> we should also introduce abort, but instead: >>>>>>>>>> 1) Implement a solution to handle dropped packages, ticket #1960 >>>>>>>>> [AVM] This is nothing but flow control implementation in MDS, >>>>>>>>> this >>>>>>>>> is future enhancement >>>>>>>>> >>>>>>>>>> 2) Investigate why packages may be dropped, the receiving MDS >>>>>>>>>> thread is a real time thread and should be able to consume a >>>>>>>>>> large >>>>>>>>>> amount of incoming messages. >>>>>>>>>> E.g. 
is the receiving MDS thread "live hanging" due to locks, >>>>>>>>>> file >>>>>>>>>> I/O etc? >>>>>>>>>>> This was the reason we haven't gone for it while addressing >>>>>>>>>>> Ticket >>>>>>>>>>> #1227 >>>>>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/) >>>>>>>>>>> So currently we don't have any advantage of disabling >>>>>>>>>>> TIPC_DEST_DROPPABLE and not allowing multicast messages. >>>>>>>>>>> >>>>>>>>>>> -AVM >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 8/18/2016 2:43 PM, Hans Nordeback wrote: >>>>>>>>>>>> osaf/libs/core/mds/mds_dt_tipc.c | 32 >>>>>>>>>>>> +++++++++++++++++++++++++------- >>>>>>>>>>>> 1 files changed, 25 insertions(+), 7 deletions(-) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>> b/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>> --- a/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>> +++ b/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>> @@ -320,6 +320,15 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid, >>>>>>>>>>>> m_MDS_LOG_INFO("MDTM: Successfully set >>>>>>>>>>>> default socket option TIPC_IMP = %d", TIPCIMPORTANCE); >>>>>>>>>>>> } >>>>>>>>>>>> + int droppable = 0; >>>>>>>>>>>> + if (setsockopt(tipc_cb.BSRsock, SOL_TIPC, >>>>>>>>>>>> TIPC_DEST_DROPPABLE, &droppable, sizeof(droppable)) != 0) { >>>>>>>>>>>> + LOG_ER("MDTM: Can't set >>>>>>>>>>>> TIPC_DEST_DROPPABLE to >>>>>>>>>>>> + zero >>>>>>>>>>>> err :%s\n", strerror(errno)); >>>>>>>>>>>> + m_MDS_LOG_ERR("MDTM: Can't set >>>>>>>>>>>> + TIPC_DEST_DROPPABLE >>>>>>>>>>>> to zero err :%s\n", strerror(errno)); >>>>>>>>>>>> + osafassert(0); >>>>>>>>>>>> + } else { >>>>>>>>>>>> + m_MDS_LOG_NOTIFY("MDTM: Successfully set >>>>>>>>>>>> TIPC_DEST_DROPPABLE to zero"); >>>>>>>>>>>> + } >>>>>>>>>>>> + >>>>>>>>>>>> return NCSCC_RC_SUCCESS; >>>>>>>>>>>> } >>>>>>>>>>>> @@ -563,6 +572,8 @@ ssize_t recvfrom_connectionless >>>>>>>>>>>> (int sd, >>>>>>>>>>>> unsigned char *cptr; >>>>>>>>>>>> int i; >>>>>>>>>>>> int has_addr; >>>>>>>>>>>> + int anc_data[2]; 
>>>>>>>>>>>> + >>>>>>>>>>>> ssize_t sz; >>>>>>>>>>>> has_addr = (from != NULL) && (addrlen != NULL); @@ >>>>>>>>>>>> -591,19 >>>>>>>>>>>> +602,26 @@ ssize_t recvfrom_connectionless (int sd, >>>>>>>>>>>> if the message was sent using a TIPC >>>>>>>>>>>> name or >>>>>>>>>>>> name sequence as the >>>>>>>>>>>> destination rather than a TIPC port ID So >>>>>>>>>>>> abort for TIPC_ERRINFO and TIPC_RETDATA*/ >>>>>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) { >>>>>>>>>>>> - /* TIPC_ERRINFO - TIPC error code >>>>>>>>>>>> associated with a >>>>>>>>>>>> returned data message or a connection termination message so >>>>>>>>>>>> abort */ >>>>>>>>>>>> - m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err :%s", >>>>>>>>>>>> strerror(errno) ); >>>>>>>>>>>> - abort(); >>>>>>>>>>>> + anc_data[0] = *((unsigned >>>>>>>>>>>> int*)(CMSG_DATA(anc) + >>>>>>>>>>>> 0)); >>>>>>>>>>>> + if (anc_data[0] == TIPC_ERR_OVERLOAD) { >>>>>>>>>>>> + LOG_CR("MDTM: undelivered message >>>>>>>>>>>> condition >>>>>>>>>>>> ancillary data: TIPC_ERR_OVERLOAD"); >>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered >>>>>>>>>>>> + message >>>>>>>>>>>> condition ancillary data: TIPC_ERR_OVERLOAD"); >>>>>>>>>>>> + } else { >>>>>>>>>>>> + /* TIPC_ERRINFO - TIPC error code >>>>>>>>>>>> associated >>>>>>>>>>>> with a returned data message or a connection termination >>>>>>>>>>>> message >>>>>>>>>>>> so abort */ >>>>>>>>>>>> + LOG_CR("MDTM: undelivered message >>>>>>>>>>>> condition >>>>>>>>>>>> ancillary data: TIPC_ERRINFO abort err : %d", anc_data[0]); >>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered >>>>>>>>>>>> + message >>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err : %d", >>>>>>>>>>>> anc_data[0]); >>>>>>>>>>>> + } >>>>>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) { >>>>>>>>>>>> - /* If we set TIPC_DEST_DROPPABLE off messge >>>>>>>>>>>> (configure TIPC to return rejected messages to the sender ) >>>>>>>>>>>> + /* If we set 
TIPC_DEST_DROPPABLE off message >>>>>>>>>>>> (configure TIPC to return rejected messages to the sender ) >>>>>>>>>>>> we will hit this when we implement MDS >>>>>>>>>>>> retransmit lost messages abort can be replaced with flow >>>>>>>>>>>> control >>>>>>>>>>>> logic*/ >>>>>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); >>>>>>>>>>>> i > 0; >>>>>>>>>>>> i--) { >>>>>>>>>>>> - m_MDS_LOG_DBG("MDTM: returned byte >>>>>>>>>>>> 0x%02x\n", >>>>>>>>>>>> *cptr); >>>>>>>>>>>> + LOG_CR("MDTM: returned byte 0x%02x\n", >>>>>>>>>>>> *cptr); >>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: returned byte >>>>>>>>>>>> 0x%02x\n", *cptr); >>>>>>>>>>>> cptr++; >>>>>>>>>>>> } >>>>>>>>>>>> /* TIPC_RETDATA -The contents of a >>>>>>>>>>>> returned >>>>>>>>>>>> data message so abort */ >>>>>>>>>>>> - m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>>>> condition ancillary data: TIPC_RETDATA abort err :%s", >>>>>>>>>>>> strerror(errno) ); >>>>>>>>>>>> - abort(); >>>>>>>>>>>> + LOG_CR("MDTM: undelivered message condition >>>>>>>>>>>> ancillary data: TIPC_RETDATA"); >>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>>>> condition ancillary data: TIPC_RETDATA"); >>>>>>>>>>>> } else if (anc->cmsg_type == TIPC_DESTNAME) { >>>>>>>>>>>> if (sz == 0) { >>>>>>>>>>>> m_MDS_LOG_DBG("MDTM: recd bytes=0 on >>>>>>>>>>>> received on sock, abnormal/unknown condition. Ignoring"); >>>>>> >>>>> >>>> >>> >>> >> > ------------------------------------------------------------------------------ _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel
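
For reference, the non-aborting handling discussed in this thread (walk the ancillary data returned by recvmsg(), pick out TIPC_ERRINFO and TIPC_RETDATA, and log the error code instead of calling abort()) can be sketched roughly as below. This is only a minimal sketch, not the MDS code: the names anc_summary, summarize_tipc_anc() and tipc_err_name() are hypothetical, and it assumes the TIPC_ERRINFO object carries two 32-bit words (error code, then length of the returned data), as the TIPC documentation describes. Constants come from <linux/tipc.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Hypothetical summary of TIPC ancillary data; not an MDS or TIPC type. */
struct anc_summary {
	int has_errinfo;   /* a TIPC_ERRINFO object was present */
	uint32_t err_code; /* TIPC error code, e.g. TIPC_ERR_OVERLOAD */
	uint32_t ret_len;  /* length of the returned data, as reported by TIPC */
	int has_retdata;   /* a TIPC_RETDATA object was present */
};

/* Map a TIPC error code to a printable name, for logging only. */
static const char *tipc_err_name(uint32_t err)
{
	switch (err) {
	case TIPC_OK:            return "TIPC_OK";
	case TIPC_ERR_NO_NAME:   return "TIPC_ERR_NO_NAME";
	case TIPC_ERR_NO_PORT:   return "TIPC_ERR_NO_PORT";
	case TIPC_ERR_NO_NODE:   return "TIPC_ERR_NO_NODE";
	case TIPC_ERR_OVERLOAD:  return "TIPC_ERR_OVERLOAD";
	case TIPC_CONN_SHUTDOWN: return "TIPC_CONN_SHUTDOWN";
	default:                 return "unknown";
	}
}

/* Walk all control messages of a received msghdr and record what TIPC
 * reported. Nothing is aborted here; the caller decides how to react. */
static void summarize_tipc_anc(struct msghdr *msg, struct anc_summary *out)
{
	struct cmsghdr *anc;

	assert(msg != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	for (anc = CMSG_FIRSTHDR(msg); anc != NULL; anc = CMSG_NXTHDR(msg, anc)) {
		if (anc->cmsg_level != SOL_TIPC)
			continue; /* not a TIPC ancillary object */

		if (anc->cmsg_type == TIPC_ERRINFO) {
			uint32_t info[2];

			if (anc->cmsg_len < CMSG_LEN(sizeof(info)))
				continue; /* too short to hold both words */
			/* memcpy avoids unaligned access through a cast pointer */
			memcpy(info, CMSG_DATA(anc), sizeof(info));
			out->has_errinfo = 1;
			out->err_code = info[0];
			out->ret_len = info[1];
		} else if (anc->cmsg_type == TIPC_RETDATA) {
			out->has_retdata = 1;
		}
	}
}
```

A caller would presumably run this right after recvmsg() on the connectionless socket and, when has_errinfo is set, log tipc_err_name(err_code) at whatever severity is agreed on. Whether the sender should then exit is exactly the open question of options 1) and 2) at the top of this mail.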