Hi HansN,

On 9/20/2016 4:17 PM, Hans Nordebäck wrote:
> Hi Mahesh,
>
> I think only logging is needed, as proposed in the patch, since some services
> already handle dropped messages. This logging will help in
> troubleshooting. Keeping TIPC_DEST_DROPPABLE set to true will only make TIPC
> drop messages silently; the original problem persists and needs investigation,
> i.e. why the socket receive buffer is overloaded. One reason may be the
> MDS poll/receive loop together with the "big" mutex lock (ticket #520).

[AVM] One valid reason could be that, in the TIPC_ERR_OVERLOAD case, recd_bytes is NOT zero, so the buffer overload can occur at either the TIPC or the MDS level. I will investigate more and update.
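Since the open question is why the socket receive buffer gets overloaded, one thing worth checking first is the buffer size the kernel actually applied to the socket. Below is a minimal sketch, assuming a plain Linux socket descriptor; `set_and_get_rcvbuf` is a hypothetical helper for illustration only, not an existing MDS function. On Linux the value read back is normally larger than the requested one, because the kernel doubles it for bookkeeping and enforces a minimum.

```c
#include <sys/socket.h>

/* Hypothetical helper (not part of MDS): request a receive buffer size
 * for socket `sd` and return the size the kernel actually applied,
 * or -1 if either socket call fails. */
int set_and_get_rcvbuf(int sd, int requested)
{
    int effective = 0;
    socklen_t len = sizeof(effective);

    /* Ask the kernel for the new receive buffer size. */
    if (setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) != 0)
        return -1;

    /* Read back what was really configured; it may differ from the request. */
    if (getsockopt(sd, SOL_SOCKET, SO_RCVBUF, &effective, &len) != 0)
        return -1;

    return effective;
}
```

Logging the returned value once at start-up, next to the existing socket option logs, would show whether a configured size was really taken into use.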
> Did you check why the MDS message loss mechanism doesn't detect TIPC
> dropped messages? AMF
> does detect this via e.g. "out of sync", "msg id mismatch" and so on.

[AVM] Do you mean the IMMD message loss mechanism?

-AVM

>
> /Regards HansN
>
> -----Original Message-----
> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
> Sent: 20 September 2016 12:29
> To: Anders Widell <anders.wid...@ericsson.com>; Hans Nordebäck
> <hans.nordeb...@ericsson.com>
> Cc: opensaf-devel@lists.sourceforge.net; mathi.naic...@oracle.com
> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]
>
> Hi Anders Widell / HansN,
>
> On 9/16/2016 2:03 PM, Anders Widell wrote:
>> The idea was to just log reception of error info messages, for
>> trouble-shooting purposes.
> After multiple attempts, I managed to simulate the TIPC_ERR_OVERLOAD
> error. After the TIPC_ERR_OVERLOAD error is hit,
> the cluster goes into an unrecoverable state, because the send buffers are full.
>
> So we have two options:
>
> 1) Set TIPC_DEST_DROPPABLE to false, log the TIPC_ERR_OVERLOAD error and then
> gracefully exit the sender,
> which allows the remaining nodes to survive.
> > 2) keep the current configuration as it is ( TIPC_DEST_DROPPABLE to true ) > > ================================================================================================================= > Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Received node_up from 2040f: > msg_id 1 > Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Node 'PL-4' joined the cluster Sep 20 > 15:14:09 SC-1 osafimmnd[3695]: NO Implementer connected: 19 > (MsgQueueService132111) <0, 2040f> > *Sep 20 15:16:59 SC-1 osafimmd[3684]: 77 MDTM: undelivered message condition > ancillary data: TIPC_ERR_OVERLOAD* Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA > Director Service in NOACTIVE state - fevs replies pending:1 fevs highest > processed:218744 Sep 20 15:17:00 SC-1 osafamfnd[3773]: NO > 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' : > Recovery is 'nodeFailfast' > Sep 20 15:17:00 SC-1 osafamfnd[3773]: ER > safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown > Recovery is:nodeFailfast Sep 20 15:17:00 SC-1 osafamfnd[3773]: Rebooting > OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is > node failfast, OwnNodeId = 131343, SupervisionTime = 60 Sep 20 15:17:00 SC-1 > osafimmnd[3695]: WA DISCARD DUPLICATE FEVS > message:218744 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Error code 2 returned for message > type 82 - ignoring Sep 20 15:17:00 SC-1 opensaf_reboot: Rebooting local node; > timeout=60 Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA SC Absence IS allowed:900 > IMMD service is DOWN Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO IMMD SERVICE IS > DOWN, HYDRA IS CONFIGURED => UNREGISTERING IMMND form MDS Sep 20 15:17:00 > SC-1 osafntfimcnd[3742]: NO saImmOiDispatch() Fail SA_AIS_ERR_BAD_HANDLE (9) > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:20002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 1 <2, > 2010f> (safLogService) > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:d0d0002010f > 
sv_id:26 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:100002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 2 <16, > 2010f> (@safLogService_appl) > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:130002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 3 <19, > 2010f> (@OpenSafImmReplicatorA) > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:140002010f > sv_id:26 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:150002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 4 <21, > 2010f> (safClmService) > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1a0002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 5 <26, > 2010f> (safAmfService) > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1b0002010f > sv_id:26 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bc0002010f > sv_id:26 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bd0002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 6 <1469, > 2010f> (MsgQueueService131343) Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO > Removing client id:5c00002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 10 <1472, > 2010f> (safEvtService) Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing > client id:5c40002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 8 <1476, > 2010f> (safSmfService) Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing > client id:5c60002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 9 <1478, > 2010f> (safLckService) Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing > client id:5c70002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 7 <1479, > 2010f> (safMsgGrpService) Sep 20 15:17:00 SC-1 osafimmnd[3695]: 
NO Removing > client id:5cc0002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5ce0002010f > sv_id:27 > Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 12 <1486, > 2010f> (safCheckPointService) Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO > Implementer disconnected 13 <0, 2020f(down)> (MsgQueueService131599) Sep 20 > 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 14 <0, > 2020f(down)> (@OpenSafImmReplicatorB) Sep 20 15:17:00 SC-1 osafimmnd[3695]: > NO Implementer disconnected 15 <0, 2020f(down)> (@safAmfService2020f) Sep 20 > 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2020f Sep 20 15:17:00 > SC-1 osafimmnd[3695]: NO Implementer disconnected 16 <0, 2030f(down)> > (MsgQueueService131855) Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl > Discarded node 2030f Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer > disconnected 19 <0, 2040f(down)> (MsgQueueService132111) Sep 20 15:17:00 SC-1 > osafimmnd[3695]: NO Impl Discarded node 2040f Sep 20 15:17:00 SC-1 > osafimmnd[3695]: NO MDS unregisterede. sleeping ... 
> Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Sleep done registering IMMND with > MDS Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fe8fa0043 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb60040 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb6002e already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb60037 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb60028 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb6003d already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb6002b already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb6001c already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb60019 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcba0012 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb60028 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: > dest 2010fdcb60019 already exist > Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO SUCCESS IN REGISTERING IMMND WITH > MDS Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Re-introduce-me > highestProcessed:218744 highestReceived:218744 Sep 20 15:17:03 SC-1 kernel: [ > 1794.198381] md: stopping all md devices. 
> Sep 20 15:17:03 SC-1 osafntfimcnd[8997]: WA ntfimcn_imm_init > saImmOiInitialize_2() returned SA_AIS_ERR_TIMEOUT (5) Sep 20 15:18:00 SC-1 > syslog-ng[1221]: syslog-ng starting up; version='2.0.9' > ================================================================================================================= > > -AVM > > On 9/16/2016 2:03 PM, Anders Widell wrote: >> I don't think we need (or even should) inform the sender when MDS >> receives an error information message from TIPC. Note that these error >> information messages are received asynchronously, when the sender has >> already received an OK return code from the MDS send call. The idea >> was to just log reception of error info messages, for trouble-shooting >> purposes. We already have a mechanism in MDS that informs the receiver >> about lost MDS messages. If we wish to inform the sender we would need >> to introduce a second mechanism in MDS, and at this point I don't >> think it is needed. Another approach we could consider is that MDS >> retransmits the message transparently without informing the sender. >> This would require MDS to internally store sent messages for a while, >> so that they can be retransmitted. It would also require the receiver >> to re-order received messages, since a retransmitted message will be >> received out of sequence. >> >> regards, >> >> Anders Widell >> >> >> On 09/16/2016 06:40 AM, A V Mahesh wrote: >>> Hi HansN, >>> >>> I managed to create TIPC_ERRINFO/TIPC_RETDATA error cases ( not >>> TIPC_ERR_OVERLOAD error ) with normal messages and It is observed >>> that TIPC_DEST_DROPPABLE set to true even error TIPC_ERRINFO is NOT >>> notified ( it means TIPC_ERR_OVERLOAD ) , if TIPC_DEST_DROPPABLE set >>> to false TIPC_ERRINFO/TIPC_RETDATA errors are notified. 
>>> >>> Now I will also check implication of TIPC_DEST_DROPPABLE set to false >>> on multicast and broadcast messages, based on that we can re-arrange >>> the TIPC_DEST_DROPPABLE setting to false conditions based on agent >>> `i_msg_loss_indication = true` condition mds can return to agent the >>> same error TIPC_ERR_OVERLOAD. >>> >>> TIPC_DEST_DROPPABLE to false: >>> >>> ================================================================== >>> >>> Sep 15 16:10:39 SC-1 osafimmnd[32051]: NO Implementer disconnected 13 >>> <0, 2040f> (MsgQueueService132111) Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 777 MDTM: undelivered message condition ancillary >>> data: TIPC_ERRINFO abort err : 2 Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary >>> data: TIPC_RETDATA Sep 15 16:10:39 SC-1 osafimmd[32040]: NO MDS event >>> from svc_id 25 (change:4, dest:567413369208836) Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 777 MDTM: undelivered message condition ancillary >>> data: TIPC_ERRINFO abort err : 2 Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary >>> data: TIPC_RETDATA Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: >>> undelivered message condition ancillary data: TIPC_ERRINFO abort err >>> : 2 Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered >>> message condition ancillary data: TIPC_RETDATA Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 777 MDTM: undelivered message condition ancillary >>> data: TIPC_ERRINFO abort err : 2 Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary >>> data: TIPC_RETDATA Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: >>> undelivered message condition ancillary data: TIPC_ERRINFO abort err >>> : 2 Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered >>> message condition ancillary data: TIPC_RETDATA Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 777 MDTM: undelivered message condition ancillary >>> data: TIPC_ERRINFO abort 
err : 2 Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary >>> data: TIPC_RETDATA Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: >>> undelivered message condition ancillary data: TIPC_ERRINFO abort err >>> : 2 Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered >>> message condition ancillary data: TIPC_RETDATA Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 777 MDTM: undelivered message condition ancillary >>> data: TIPC_ERRINFO abort err : 2 Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary >>> data: TIPC_RETDATA Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: >>> undelivered message condition ancillary data: TIPC_ERRINFO abort err >>> : 2 Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered >>> message condition ancillary data: TIPC_RETDATA Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 777 MDTM: undelivered message condition ancillary >>> data: TIPC_ERRINFO abort err : 2 Sep 15 16:10:39 SC-1 >>> osafimmd[32040]: 7777 MDTM: undelivered message condition ancillary >>> data: TIPC_RETDATA Sep 15 16:10:39 SC-1 osafamfd[32114]: NO Node >>> 'PL-4' left the cluster >>> >>> ================================================================== >>> >>> TIPC_DEST_DROPPABLE to true: >>> >>> ================================================================== >>> >>> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Implementer disconnected 13 >>> <0, 2040f> (MsgQueueService132111) Sep 15 15:59:55 SC-1 >>> osafimmd[26450]: NO MDS event from svc_id 25 (change:4, >>> dest:567412923957252) Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO >>> Global discard node received for nodeId:2040f pid:410 Sep 15 15:59:55 >>> SC-1 osafamfd[28810]: NO Node 'PL-4' left the cluster Sep 15 15:59:58 >>> SC-1 kernel: [ 5147.648737] tipc: Resetting link >>> <1.1.1:eth0-1.1.4:eth0>, peer not responding Sep 15 15:59:58 SC-1 >>> kernel: [ 5147.648756] tipc: Lost link <1.1.1:eth0-1.1.4:eth0> on >>> network plane A Sep 15 15:59:58 
SC-1 kernel: [ 5147.648771] tipc:
>>> Lost contact with <1.1.4>
>>>
>>> ==================================================================
>>>
>>> -AVM
>>>
>>>
>>> On 9/1/2016 10:59 AM, Hans Nordebäck wrote:
>>>> Hi Mahesh,
>>>>
>>>> I have not tested this, but the following should work:
>>>>
>>>> - Set BSRsock TIPC_IMPORTANCE to TIPC_LOW_IMPORTANCE
>>>>
>>>> - Set the socket receive buffer to a small value:
>>>>
>>>> optval = "small socket receive buffer size", 5000?
>>>>
>>>> setsockopt(tipc_cb.BSRsock, SOL_SOCKET, SO_RCVBUF, &optval,
>>>> optlen)
>>>>
>>>> - sysctl -w net.tipc.tipc_rmem="5000 40000000 68240400" (or smaller
>>>> values)
>>>>
>>>> - Add some delays when processing messages in
>>>> mdtm_process_recv_events(), to provoke overloading the socket
>>>> receive buffer.
>>>>
>>>> We experience dropped packets in a 75 node system, and as a
>>>> workaround, increasing the default socket receive buffer size seems
>>>> to work for that setup.
>>>>
>>>> /Thanks HansN
>>>>
>>>> On 09/01/2016 05:50 AM, A V Mahesh wrote:
>>>>> Hi HansN,
>>>>>
>>>>> Do you have any tips for creating the overload case?
>>>>>
>>>>> I would like to test and observe the TIPC_DEST_DROPPABLE enabled &
>>>>> disabled cases.
>>>>>
>>>>> -AVM
>>>>>
>>>>>
>>>>> On 9/1/2016 9:12 AM, A V Mahesh wrote:
>>>>>> Hi HansN,
>>>>>>
>>>>>> Sorry for the delay.
>>>>>>
>>>>>> I will test it and get back to you soon.
>>>>>>
>>>>>> -AVM
>>>>>>
>>>>>>
>>>>>> On 8/31/2016 4:29 PM, Hans Nordebäck wrote:
>>>>>>> Hi Mahesh,
>>>>>>> Any updates on this?
>>>>>>>
>>>>>>> /Regards HansN
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Anders Widell
>>>>>>> Sent: 25 August 2016 13:11
>>>>>>> To: A V Mahesh <mahesh.va...@oracle.com>; Hans Nordebäck
>>>>>>> <hans.nordeb...@ericsson.com>; mathi.naic...@oracle.com
>>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages
>>>>>>> [#1957]
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> This is what the TIPC user documentation says about
>>>>>>> TIPC_DEST_DROPPABLE:
>>>>>>> "This option governs the handling of messages sent by the socket
>>>>>>> if the message cannot be delivered to its destination, either
>>>>>>> because the receiver is congested or because the specified
>>>>>>> receiver does not exist.
>>>>>>> If enabled, the message is discarded; otherwise the message is
>>>>>>> returned to the sender."
>>>>>>>
>>>>>>> This is what the TIPC user documentation says about the return
>>>>>>> value from the recvmsg() system call: "When used with a
>>>>>>> connectionless socket, a return value of 0 indicates the arrival
>>>>>>> of a returned data message that was originally sent by this socket."
>>>>>>>
>>>>>>> I think the documentation is pretty clear. If you set
>>>>>>> TIPC_DEST_DROPPABLE to true, the receiver can discard messages
>>>>>>> e.g. when the receive buffer is full. The sender will not be
>>>>>>> notified in this case. If TIPC_DEST_DROPPABLE is set to false,
>>>>>>> the message will be returned to the sender in case of a full
>>>>>>> receive buffer. The sender knows that it has received such a
>>>>>>> returned message when the recvmsg() call returns zero.
>>>>>>>
>>>>>>> regards,
>>>>>>> Anders Widell
>>>>>>>
>>>>>>> On 08/25/2016 11:30 AM, A V Mahesh wrote:
>>>>>>>> Hi HansN,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/23/2016 5:22 PM, Hans Nordebäck wrote:
>>>>>>>>
>>>>>>>>> Hi Mahesh,
>>>>>>>>>
>>>>>>>>> Yes, this is my understanding too: if TIPC_DROPPABLE = true,
>>>>>>>>> TIPC may drop messages silently when the receive socket buffer
>>>>>>>>> is full, and does not return any ancillary message.
>>>>>>>>> If TIPC_DROPPABLE = false, TIPC may drop the message but will send
>>>>>>>>> an ancillary message to inform about TIPC_ERR_OVERLOAD.
>>>>>>>> [AVM]
>>>>>>>>
>>>>>>>> My observation and understanding are different. Based on the TIPC
>>>>>>>> code and the Linux TIPC 2.0 Programmer's Guide, the
>>>>>>>> TIPC_ERR_OVERLOAD error is returned when TIPC is unable to enqueue
>>>>>>>> an incoming message on the receiving socket's receive queue,
>>>>>>>> irrespective of whether TIPC_DEST_DROPPABLE is enabled or disabled.
>>>>>>>>
>>>>>>>> The only difference between TIPC_DEST_DROPPABLE enabled and
>>>>>>>> disabled is this: if TIPC_DEST_DROPPABLE is enabled, the message is
>>>>>>>> discarded, the size returned by recvmsg() is ZERO and the
>>>>>>>> application will get errors; if TIPC_DEST_DROPPABLE is disabled,
>>>>>>>> the message is returned to the sender, which means the size
>>>>>>>> returned by recvmsg() is the size of the data the user sent, and
>>>>>>>> the application will get errors.
>>>>>>>>
>>>>>>>> I did check the TIPC code and documentation and I haven't found
>>>>>>>> any evidence that the TIPC_ERR_OVERLOAD error code is sent
>>>>>>>> only if TIPC_DEST_DROPPABLE = false.
>>>>>>>>
>>>>>>>> Even while testing #1227
>>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/) my
>>>>>>>> observation and understanding was that an individual TIPC socket is
>>>>>>>> only allowed to queue up
>>>>>>>> OVERLOAD_LIMIT_BASE/2 messages of the lowest importance level
>>>>>>>> before it starts rejecting them.
>>>>>>>> Once a socket's receive queue length exceeds the maximum limit
>>>>>>>> value, the receiving socket sends out a reject message with the
>>>>>>>> TIPC_ERR_OVERLOAD error code, with cmsg_type
>>>>>>>> TIPC_ERRINFO/TIPC_RETDATA; the TIPC code and the Linux TIPC 2.0
>>>>>>>> Programmer's Guide confirm the same.
>>>>>>>> >>>>>>>> tipc/socket.c >>>>>>>> ======================================================= >>>>>>>> /* Reject message if there isn't room to queue it */ >>>>>>>> >>>>>>>> recv_q_len = (u32)atomic_read(&tipc_queue_size); >>>>>>>> if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) { >>>>>>>> if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE)) >>>>>>>> return TIPC_ERR_OVERLOAD; } recv_q_len = >>>>>>>> skb_queue_len(&sk->sk_receive_queue); >>>>>>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) { >>>>>>>> if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2)) >>>>>>>> return TIPC_ERR_OVERLOAD; } >>>>>>>> ======================================================= >>>>>>>> >>>>>>>> >>>>>>>> 2.1.17. setsockopt() of TIPC 2.0 Programmer's Guide >>>>>>>> ======================================================= >>>>>>>> TIPC_DEST_DROPPABLE >>>>>>>> This option governs the handling of messages sent by the socket >>>>>>>> if the message cannot be delivered to its destination, either >>>>>>>> because the receiver is congested or because the specified >>>>>>>> receiver does not exist. If enabled, the message is discarded; >>>>>>>> otherwise the message is returned to the sender. >>>>>>>> >>>>>>>> By default, this option is disabled for SOCK_SEQPACKET and >>>>>>>> SOCK_STREAM socket types, and enabled for SOCK_RDM and >>>>>>>> SOCK_DGRAM, This arrangement ensures proper teardown of failed >>>>>>>> connections when connection-oriented data transfer is used, >>>>>>>> without increasing the complexity of connectionless data >>>>>>>> transfer. >>>>>>>> >>>>>>>> TIPC_SRC_DROPPABLE >>>>>>>> This option governs the handling of messages sent by the socket >>>>>>>> if link congestion occurs. If enabled, the message is discarded; >>>>>>>> otherwise the system queues the message for later transmission. 
>>>>>>>> By default, this option is disabled for SOCK_SEQPACKET, >>>>>>>> SOCK_STREAM, and SOCK_RDM socket types (resulting in "reliable" >>>>>>>> data transfer), and enabled for SOCK_DGRAM (resulting in >>>>>>>> "unreliable" data transfer). >>>>>>>> ======================================================= >>>>>>>> >>>>>>>> Now I will try to create OVERLOAD case and update you soon my >>>>>>>> latest observations. >>>>>>>> >>>>>>>> -AVM >>>>>>>> >>>>>>>>> Correcting this and adding an abort is not backward compatible >>>>>>>>> as some service already handle flow control in some way, only >>>>>>>>> log when packages are dropped. >>>>>>>>> Regarding ticket #1960 there are other solutions than >>>>>>>>> introducing flow control in MDS, e.g. expose an option to the >>>>>>>>> service to choose connection oriented or connection less. >>>>>>>>> The problem with dropped messages seems in one case related to, >>>>>>>>> (by MDS), intensive MDS logging. >>>>>>>>> >>>>>>>>> /Thanks HansN >>>>>>>>> -----Original Message----- >>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com] >>>>>>>>> Sent: den 23 augusti 2016 11:27 >>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders Widell >>>>>>>>> <anders.wid...@ericsson.com>; mathi.naic...@oracle.com >>>>>>>>> Cc: opensaf-devel@lists.sourceforge.net >>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages >>>>>>>>> [#1957] >>>>>>>>> >>>>>>>>> Hi HansN, >>>>>>>>> >>>>>>>>> It seems I am missing some thing , please allow me to under >>>>>>>>> stand >>>>>>>>> >>>>>>>>> If I currently understand you observation : >>>>>>>>> >>>>>>>>> With current Opensaf code ( this #1957 patch NOT applied ) , by >>>>>>>>> default TIPC_DROPPABLE=true ,while running Opensaf with that >>>>>>>>> binary when TIPC_ERR_OVERLOAD occurring, TIPC is not given >>>>>>>>> errors TIPC_ERRINFO or TIPC_RETDATA and following code is not >>>>>>>>> being get hit of function recvfrom_connectionless(), is my >>>>>>>>> understanding right ? 
>>>>>>>>> >>>>>>>>> =============================================================== >>>>>>>>> ====== >>>>>>>>> >>>>>>>>> ======================================== >>>>>>>>> >>>>>>>>> >>>>>>>>> *if (anc->cmsg_type == TIPC_ERRINFO) {* >>>>>>>>> /* TIPC_ERRINFO - TIPC error code associated with a >>>>>>>>> returned data message or a connection termination message so >>>>>>>>> abort */ >>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition >>>>>>>>> ancillary >>>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) ); >>>>>>>>> *abort();* >>>>>>>>> *} else if (anc->cmsg_type == TIPC_RETDATA) {* >>>>>>>>> /* If we set TIPC_DEST_DROPPABLE off messge (configure >>>>>>>>> TIPC to return rejected messages to the sender ) >>>>>>>>> we will hit this when we implement MDS retransmit >>>>>>>>> lost messages abort can be replaced with flow control logic*/ >>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) { >>>>>>>>> m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr); >>>>>>>>> cptr++; >>>>>>>>> } >>>>>>>>> /* TIPC_RETDATA -The contents of a returned data message >>>>>>>>> so abort */ >>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition >>>>>>>>> ancillary >>>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) ); >>>>>>>>> *abort();* >>>>>>>>> } >>>>>>>>> >>>>>>>>> =============================================================== >>>>>>>>> ====== >>>>>>>>> >>>>>>>>> ======================================== >>>>>>>>> >>>>>>>>> >>>>>>>>> -AVM >>>>>>>>> >>>>>>>>> >>>>>>>>> On 8/23/2016 1:08 PM, Hans Nordebäck wrote: >>>>>>>>>> Hi Mahesh, >>>>>>>>>> >>>>>>>>>> Please see response below with [HansN] /Thanks HansN >>>>>>>>>> >>>>>>>>>> -----Original Message----- >>>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com] >>>>>>>>>> Sent: den 23 augusti 2016 08:25 >>>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders >>>>>>>>>> Widell <anders.wid...@ericsson.com>; mathi.naic...@oracle.com >>>>>>>>>> Cc: 
opensaf-devel@lists.sourceforge.net >>>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages >>>>>>>>>> [#1957] >>>>>>>>>> >>>>>>>>>> Hi HansN >>>>>>>>>> >>>>>>>>>> Please see response below with [AVM] >>>>>>>>>> >>>>>>>>>> -AVM >>>>>>>>>> >>>>>>>>>> On 8/23/2016 11:41 AM, Hans Nordebäck wrote: >>>>>>>>>>> Hi Mahesh, >>>>>>>>>>> >>>>>>>>>>> please see comments below. >>>>>>>>>>> >>>>>>>>>>> /Thanks HansN >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 08/23/2016 07:21 AM, A V Mahesh wrote: >>>>>>>>>>>> Hi HansN, >>>>>>>>>>>> >>>>>>>>>>>> Let us fist discuss the error handling and abort, then we >>>>>>>>>>>> can come back to interpretation of TIPC currently does >>>>>>>>>>>> permit OR does not permit an application to send a >>>>>>>>>>>> multicast message with the "destination droppable" setting >>>>>>>>>>>> disabled. >>>>>>>>>>>> >>>>>>>>>>>> Let us disable TIPC_DEST_DROPPABLE, so that TIPC will try to >>>>>>>>>>>> return an undelivered multicast message to its sender and we >>>>>>>>>>>> can determine issue is because of TIPC_ERR_OVERLOAD, this >>>>>>>>>>>> helps in debugging , so that application may increased >>>>>>>>>>>> SO_SNDBUF/SO_RCVBUF to reduce the problem. >>>>>>>>>>>> >>>>>>>>>>>> But still we need to abort(), the reason for that is current >>>>>>>>>>>> MDS implementations doesn't have flow control logic ( no >>>>>>>>>>>> retry because of error ) , so Application like AMF can go >>>>>>>>>>>> wrong and cluster will go into unstable/recoverble state. >>>>>>>>>>>> >>>>>>>>>>> [HansN] In the current implementation messages are dropped >>>>>>>>>>> silently and no abort is done. >>>>>>>>>> [AVM] I can see abort(); in current code , you mean abort(); >>>>>>>>>> is not working and application(amf) is not existing ? >>>>>>>>>> [HansN] In case of TIPC_DROPPABLE=true and messages are >>>>>>>>>> dropped, >>>>>>>>>> (TIPC_ERR_OVERLOAD) no abort is be performed, e.g amfd >>>>>>>>>> detects this in the msg sanity chk and logs "invalid msg id >>>>>>>>>> ..." 
>>>>>>>>>> ============================================================== >>>>>>>>>> ====== >>>>>>>>>> >>>>>>>>>> == >>>>>>>>>> ====== >>>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) { >>>>>>>>>> /* TIPC_ERRINFO - TIPC error code associated with a >>>>>>>>>> returned data message or a connection termination message so >>>>>>>>>> abort */ >>>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>> condition ancillary >>>>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) ); >>>>>>>>>> *abort();* >>>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) { >>>>>>>>>> /* If we set TIPC_DEST_DROPPABLE off messge (configure >>>>>>>>>> TIPC to return rejected messages to the sender ) >>>>>>>>>> we will hit this when we implement MDS retransmit >>>>>>>>>> lost messages abort can be replaced with flow control logic*/ >>>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) { >>>>>>>>>> m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr); >>>>>>>>>> cptr++; >>>>>>>>>> } >>>>>>>>>> /* TIPC_RETDATA -The contents of a returned data >>>>>>>>>> message so abort */ >>>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>> condition ancillary >>>>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) ); >>>>>>>>>> *abort();* >>>>>>>>>> } >>>>>>>>>> ============================================================== >>>>>>>>>> ====== >>>>>>>>>> >>>>>>>>>> == >>>>>>>>>> ====== >>>>>>>>>>> This patch enables logging >>>>>>>>>>> when packages are dropped to help in debugging. I don't agree >>>>>>>>>>> that we should also introduce abort, but instead: >>>>>>>>>>> 1) Implement a solution to handle dropped packages, ticket >>>>>>>>>>> #1960 >>>>>>>>>> [AVM] This is nothing but flow control implementation in MDS, >>>>>>>>>> this is future enhancement >>>>>>>>>> >>>>>>>>>>> 2) Investigate why packages may be dropped, the receiving MDS >>>>>>>>>>> thread is a real time thread and should be able to consume a >>>>>>>>>>> large amount of incoming messages. 
>>>>>>>>>>> E.g. is the receiving MDS thread "live hanging" due to locks, >>>>>>>>>>> file I/O etc? >>>>>>>>>>>> This was the reason we haven't gone for it while addressing >>>>>>>>>>>> Ticket >>>>>>>>>>>> #1227 >>>>>>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/ >>>>>>>>>>>> ) So currently we don't have any advantage of disabling >>>>>>>>>>>> TIPC_DEST_DROPPABLE and not allowing multicast messages. >>>>>>>>>>>> >>>>>>>>>>>> -AVM >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 8/18/2016 2:43 PM, Hans Nordeback wrote: >>>>>>>>>>>>> osaf/libs/core/mds/mds_dt_tipc.c | 32 >>>>>>>>>>>>> +++++++++++++++++++++++++------- >>>>>>>>>>>>> 1 files changed, 25 insertions(+), 7 deletions(-) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> diff --git a/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>>> b/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>>> --- a/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>>> +++ b/osaf/libs/core/mds/mds_dt_tipc.c >>>>>>>>>>>>> @@ -320,6 +320,15 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid, >>>>>>>>>>>>> m_MDS_LOG_INFO("MDTM: Successfully set >>>>>>>>>>>>> default socket option TIPC_IMP = %d", TIPCIMPORTANCE); >>>>>>>>>>>>> } >>>>>>>>>>>>> + int droppable = 0; >>>>>>>>>>>>> + if (setsockopt(tipc_cb.BSRsock, SOL_TIPC, >>>>>>>>>>>>> TIPC_DEST_DROPPABLE, &droppable, sizeof(droppable)) != 0) { >>>>>>>>>>>>> + LOG_ER("MDTM: Can't set >>>>>>>>>>>>> TIPC_DEST_DROPPABLE to >>>>>>>>>>>>> + zero >>>>>>>>>>>>> err :%s\n", strerror(errno)); >>>>>>>>>>>>> + m_MDS_LOG_ERR("MDTM: Can't set >>>>>>>>>>>>> + TIPC_DEST_DROPPABLE >>>>>>>>>>>>> to zero err :%s\n", strerror(errno)); >>>>>>>>>>>>> + osafassert(0); >>>>>>>>>>>>> + } else { >>>>>>>>>>>>> + m_MDS_LOG_NOTIFY("MDTM: Successfully set >>>>>>>>>>>>> TIPC_DEST_DROPPABLE to zero"); >>>>>>>>>>>>> + } >>>>>>>>>>>>> + >>>>>>>>>>>>> return NCSCC_RC_SUCCESS; >>>>>>>>>>>>> } >>>>>>>>>>>>> @@ -563,6 +572,8 @@ ssize_t recvfrom_connectionless >>>>>>>>>>>>> (int sd, >>>>>>>>>>>>> unsigned char *cptr; >>>>>>>>>>>>> int i; 
>>>>>>>>>>>>> int has_addr; >>>>>>>>>>>>> + int anc_data[2]; >>>>>>>>>>>>> + >>>>>>>>>>>>> ssize_t sz; >>>>>>>>>>>>> has_addr = (from != NULL) && (addrlen != NULL); >>>>>>>>>>>>> @@ >>>>>>>>>>>>> -591,19 >>>>>>>>>>>>> +602,26 @@ ssize_t recvfrom_connectionless (int sd, >>>>>>>>>>>>> if the message was sent using a TIPC >>>>>>>>>>>>> name or name sequence as the >>>>>>>>>>>>> destination rather than a TIPC port ID >>>>>>>>>>>>> So abort for TIPC_ERRINFO and TIPC_RETDATA*/ >>>>>>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) { >>>>>>>>>>>>> - /* TIPC_ERRINFO - TIPC error code >>>>>>>>>>>>> associated with a >>>>>>>>>>>>> returned data message or a connection termination message >>>>>>>>>>>>> so abort */ >>>>>>>>>>>>> - m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err :%s", >>>>>>>>>>>>> strerror(errno) ); >>>>>>>>>>>>> - abort(); >>>>>>>>>>>>> + anc_data[0] = *((unsigned >>>>>>>>>>>>> int*)(CMSG_DATA(anc) + >>>>>>>>>>>>> 0)); >>>>>>>>>>>>> + if (anc_data[0] == TIPC_ERR_OVERLOAD) { >>>>>>>>>>>>> + LOG_CR("MDTM: undelivered message >>>>>>>>>>>>> condition >>>>>>>>>>>>> ancillary data: TIPC_ERR_OVERLOAD"); >>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>>>>> condition ancillary data: TIPC_ERR_OVERLOAD"); >>>>>>>>>>>>> + } else { >>>>>>>>>>>>> + /* TIPC_ERRINFO - TIPC error code >>>>>>>>>>>>> associated >>>>>>>>>>>>> with a returned data message or a connection termination >>>>>>>>>>>>> message so abort */ >>>>>>>>>>>>> + LOG_CR("MDTM: undelivered message >>>>>>>>>>>>> condition >>>>>>>>>>>>> ancillary data: TIPC_ERRINFO abort err : %d", anc_data[0]); >>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err : %d", >>>>>>>>>>>>> anc_data[0]); >>>>>>>>>>>>> + } >>>>>>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) { >>>>>>>>>>>>> - /* If we set TIPC_DEST_DROPPABLE off messge >>>>>>>>>>>>> (configure TIPC to return rejected 
messages to the sender ) >>>>>>>>>>>>> + /* If we set TIPC_DEST_DROPPABLE off >>>>>>>>>>>>> + message >>>>>>>>>>>>> (configure TIPC to return rejected messages to the sender ) >>>>>>>>>>>>> we will hit this when we implement >>>>>>>>>>>>> MDS retransmit lost messages abort can be replaced with >>>>>>>>>>>>> flow control logic*/ >>>>>>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); >>>>>>>>>>>>> i > 0; >>>>>>>>>>>>> i--) { >>>>>>>>>>>>> - m_MDS_LOG_DBG("MDTM: returned byte >>>>>>>>>>>>> 0x%02x\n", >>>>>>>>>>>>> *cptr); >>>>>>>>>>>>> + LOG_CR("MDTM: returned byte 0x%02x\n", >>>>>>>>>>>>> *cptr); >>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: returned byte >>>>>>>>>>>>> 0x%02x\n", *cptr); >>>>>>>>>>>>> cptr++; >>>>>>>>>>>>> } >>>>>>>>>>>>> /* TIPC_RETDATA -The contents of a >>>>>>>>>>>>> returned data message so abort */ >>>>>>>>>>>>> - m_MDS_LOG_CRITICAL("MDTM: undelivered message >>>>>>>>>>>>> condition ancillary data: TIPC_RETDATA abort err :%s", >>>>>>>>>>>>> strerror(errno) ); >>>>>>>>>>>>> - abort(); >>>>>>>>>>>>> + LOG_CR("MDTM: undelivered message >>>>>>>>>>>>> + condition >>>>>>>>>>>>> ancillary data: TIPC_RETDATA"); >>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered >>>>>>>>>>>>> + message >>>>>>>>>>>>> condition ancillary data: TIPC_RETDATA"); >>>>>>>>>>>>> } else if (anc->cmsg_type == TIPC_DESTNAME) { >>>>>>>>>>>>> if (sz == 0) { >>>>>>>>>>>>> m_MDS_LOG_DBG("MDTM: recd bytes=0 on received on sock, >>>>>>>>>>>>> abnormal/unknown condition. Ignoring");
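
The patch quoted above reads the first word of the TIPC_ERRINFO ancillary data to tell TIPC_ERR_OVERLOAD apart from other error codes. A minimal standalone sketch of that decoding step follows, assuming Linux with `<linux/tipc.h>` installed. `tipc_errinfo_code` is a hypothetical helper written for illustration and is not a function in mds_dt_tipc.c. It also assumes, as I understand the TIPC documentation, that the TIPC_ERRINFO payload is two 32-bit words (error code, then length of the returned data), and it copies them out with memcpy() instead of a pointer cast to avoid alignment assumptions.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Hypothetical helper (not part of MDS): scan the control messages of a
 * msghdr filled in by recvmsg() and return the TIPC error code carried in
 * a TIPC_ERRINFO entry, or -1 if no such entry is present.
 * If ret_len is not NULL it receives the length of the returned data. */
int tipc_errinfo_code(struct msghdr *msg, uint32_t *ret_len)
{
    struct cmsghdr *anc;

    for (anc = CMSG_FIRSTHDR(msg); anc != NULL; anc = CMSG_NXTHDR(msg, anc)) {
        uint32_t data[2];

        if (anc->cmsg_level != SOL_TIPC || anc->cmsg_type != TIPC_ERRINFO)
            continue;
        /* Assumed layout: data[0] = error code, data[1] = returned data length. */
        if (anc->cmsg_len < CMSG_LEN(sizeof(data)))
            continue;
        memcpy(data, CMSG_DATA(anc), sizeof(data));
        if (ret_len != NULL)
            *ret_len = data[1];
        return (int)data[0];
    }
    return -1;
}
```

A caller could compare the result with TIPC_ERR_OVERLOAD and log only that case at a higher severity, which is roughly what the patch does inline in recvfrom_connectionless().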