Hi Anders Widell / HansN,

On 9/16/2016 2:03 PM, Anders Widell wrote:
> The idea was to just log reception of error info messages, for 
> trouble-shooting purposes.

After multiple attempts, I managed to simulate the TIPC_ERR_OVERLOAD
error. Once TIPC_ERR_OVERLOAD is hit, the cluster goes into an
unrecoverable state, because the send buffers are full.

So we have two options:

1) Set TIPC_DEST_DROPPABLE to false, log the TIPC_ERR_OVERLOAD error,
   and then exit the sender gracefully, which allows the remaining
   nodes to survive.

2) Keep the current configuration as it is (TIPC_DEST_DROPPABLE set to
   true).
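
A minimal sketch of what option 1 could look like on the receive path. It
assumes the first 32-bit word of the TIPC_ERRINFO ancillary data carries the
TIPC error code, as the #1957 patch reads it. The enum and the function
mds_classify_errinfo() are hypothetical names for illustration only; they are
not part of MDS, and the actual logging and graceful exit are left to the
caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

/* Hypothetical action codes for this sketch; they do not exist in MDS. */
enum mds_errinfo_action {
	MDS_ACT_NONE = 0,     /* no TIPC_ERRINFO ancillary data present */
	MDS_ACT_LOG_ONLY,     /* TIPC error other than overload: log it */
	MDS_ACT_LOG_AND_EXIT  /* TIPC_ERR_OVERLOAD: log, then exit gracefully */
};

/*
 * Walk the ancillary data of a received message and decide what the
 * caller should do. When a TIPC_ERRINFO object is found, its first
 * 32-bit word (the TIPC error code) is copied out through err_out.
 */
enum mds_errinfo_action mds_classify_errinfo(struct msghdr *msg,
					     unsigned int *err_out)
{
	struct cmsghdr *anc;
	unsigned int err;

	assert(msg != NULL && err_out != NULL);
	for (anc = CMSG_FIRSTHDR(msg); anc != NULL; anc = CMSG_NXTHDR(msg, anc)) {
		if (anc->cmsg_level != SOL_TIPC || anc->cmsg_type != TIPC_ERRINFO)
			continue;
		if (anc->cmsg_len < CMSG_LEN(sizeof(err)))
			continue; /* too short to carry an error code */
		memcpy(&err, CMSG_DATA(anc), sizeof(err));
		*err_out = err;
		return (err == TIPC_ERR_OVERLOAD) ? MDS_ACT_LOG_AND_EXIT
						  : MDS_ACT_LOG_ONLY;
	}
	return MDS_ACT_NONE;
}
```

With something like this, recvfrom_connectionless() would log in both error
cases and start a controlled shutdown of the sender only for
MDS_ACT_LOG_AND_EXIT, instead of calling abort() unconditionally.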

=================================================================================================================
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Received node_up from 2040f: 
msg_id 1
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Node 'PL-4' joined the cluster
Sep 20 15:14:09 SC-1 osafimmnd[3695]: NO Implementer connected: 19 
(MsgQueueService132111) <0, 2040f>
*Sep 20 15:16:59 SC-1 osafimmd[3684]: 77 MDTM: undelivered message 
condition ancillary data: TIPC_ERR_OVERLOAD*
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Director Service in NOACTIVE 
state - fevs replies pending:1 fevs highest processed:218744
Sep 20 15:17:00 SC-1 osafamfnd[3773]: NO 
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 
'avaDown' : Recovery is 'nodeFailfast'
Sep 20 15:17:00 SC-1 osafamfnd[3773]: ER 
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown 
Recovery is:nodeFailfast
Sep 20 15:17:00 SC-1 osafamfnd[3773]: Rebooting OpenSAF NodeId = 131343 
EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131343, SupervisionTime = 60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA DISCARD DUPLICATE FEVS 
message:218744
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Error code 2 returned for 
message type 82 - ignoring
Sep 20 15:17:00 SC-1 opensaf_reboot: Rebooting local node; timeout=60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA SC Absence IS allowed:900 IMMD 
service is DOWN
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO IMMD SERVICE IS DOWN, HYDRA IS 
CONFIGURED => UNREGISTERING IMMND form MDS
Sep 20 15:17:00 SC-1 osafntfimcnd[3742]: NO saImmOiDispatch() Fail 
SA_AIS_ERR_BAD_HANDLE (9)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:20002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 1 <2, 
2010f> (safLogService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:d0d0002010f 
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:100002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 2 <16, 
2010f> (@safLogService_appl)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:130002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 3 <19, 
2010f> (@OpenSafImmReplicatorA)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:140002010f 
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:150002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 4 <21, 
2010f> (safClmService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1a0002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 5 <26, 
2010f> (safAmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1b0002010f 
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bc0002010f 
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bd0002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 6 
<1469, 2010f> (MsgQueueService131343)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c00002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 10 
<1472, 2010f> (safEvtService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c40002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 8 
<1476, 2010f> (safSmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c60002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 9 
<1478, 2010f> (safLckService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c70002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 7 
<1479, 2010f> (safMsgGrpService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5cc0002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5ce0002010f 
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 12 
<1486, 2010f> (safCheckPointService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 13 <0, 
2020f(down)> (MsgQueueService131599)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 14 <0, 
2020f(down)> (@OpenSafImmReplicatorB)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 15 <0, 
2020f(down)> (@safAmfService2020f)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2020f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 16 <0, 
2030f(down)> (MsgQueueService131855)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2030f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 19 <0, 
2040f(down)> (MsgQueueService132111)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2040f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO MDS unregisterede. sleeping ...
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Sleep done registering IMMND 
with MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fe8fa0043 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb60040 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb6002e already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb60037 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb6003d already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb6002b already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb6001c already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcba0012 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback: 
dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO SUCCESS IN REGISTERING IMMND 
WITH MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Re-introduce-me 
highestProcessed:218744 highestReceived:218744
Sep 20 15:17:03 SC-1 kernel: [ 1794.198381] md: stopping all md devices.
Sep 20 15:17:03 SC-1 osafntfimcnd[8997]: WA ntfimcn_imm_init 
saImmOiInitialize_2() returned SA_AIS_ERR_TIMEOUT (5)
Sep 20 15:18:00 SC-1 syslog-ng[1221]: syslog-ng starting up; version='2.0.9'
=================================================================================================================

-AVM

On 9/16/2016 2:03 PM, Anders Widell wrote:
>
> I don't think we need (or even should) inform the sender when MDS 
> receives an error information message from TIPC. Note that these error 
> information messages are received asynchronously, when the sender has 
> already received an OK return code from the MDS send call. The idea 
> was to just log reception of error info messages, for trouble-shooting 
> purposes. We already have a mechanism in MDS that informs the receiver 
> about lost MDS messages. If we wish to inform the sender we would need 
> to introduce a second mechanism in MDS, and at this point I don't 
> think it is needed. Another approach we could consider is that MDS 
> retransmits the message transparently without informing the sender. 
> This would require MDS to internally store sent messages for a while, 
> so that they can be retransmitted. It would also require the receiver 
> to re-order received messages, since a retransmitted message will be 
> received out of sequence.
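
To make the transparent-retransmission idea above more concrete, here is a
rough sketch of the two pieces it would need: a sender-side store of recently
sent messages keyed by sequence number, and a receiver-side window that puts
late arrivals back in order. Nothing like this exists in MDS today; all names
(rtx_window, rtx_store, rtx_lookup, reorder, reorder_accept) and both sizes
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RTX_SLOTS   8    /* hypothetical window size (power of two) */
#define RTX_MAX_LEN 256  /* hypothetical max payload kept per slot */

struct rtx_slot {
	uint32_t seq;
	size_t len;
	int used;
	unsigned char data[RTX_MAX_LEN];
};

struct rtx_window {
	struct rtx_slot slot[RTX_SLOTS];
};

/* Sender: remember a sent message. Overwrites whatever shared the slot. */
int rtx_store(struct rtx_window *w, uint32_t seq, const void *buf, size_t len)
{
	struct rtx_slot *s = &w->slot[seq % RTX_SLOTS];

	if (len > RTX_MAX_LEN)
		return -1;
	s->seq = seq;
	s->len = len;
	s->used = 1;
	memcpy(s->data, buf, len);
	return 0;
}

/* Sender: find the stored copy of seq, or NULL if it was overwritten. */
const struct rtx_slot *rtx_lookup(const struct rtx_window *w, uint32_t seq)
{
	const struct rtx_slot *s = &w->slot[seq % RTX_SLOTS];

	return (s->used && s->seq == seq) ? s : NULL;
}

struct reorder {
	uint32_t next_seq;           /* next sequence number to deliver */
	uint8_t pending[RTX_SLOTS];  /* 1 if that slot holds a received seq */
};

/*
 * Receiver: note the arrival of seq and return how many messages can now
 * be delivered in order. Old duplicates and messages too far ahead of the
 * window are ignored (returns 0).
 */
int reorder_accept(struct reorder *r, uint32_t seq)
{
	int delivered = 0;

	if ((uint32_t)(seq - r->next_seq) >= RTX_SLOTS)
		return 0;
	r->pending[seq % RTX_SLOTS] = 1;
	while (r->pending[r->next_seq % RTX_SLOTS]) {
		r->pending[r->next_seq % RTX_SLOTS] = 0;
		r->next_seq++;
		delivered++;
	}
	return delivered;
}
```

The sketch only tracks sequence numbers on the receiver side; a real
implementation would also have to hold the out-of-order payloads and decide
how long the sender keeps each message before giving up.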
>
> regards,
>
> Anders Widell
>
>
> On 09/16/2016 06:40 AM, A V Mahesh wrote:
>> Hi HansN,
>>
>> I managed to create TIPC_ERRINFO/TIPC_RETDATA error cases (not the
>> TIPC_ERR_OVERLOAD error) with normal messages. I observed that with
>> TIPC_DEST_DROPPABLE set to true, not even the TIPC_ERRINFO error is
>> notified (which implies the same for TIPC_ERR_OVERLOAD), whereas with
>> TIPC_DEST_DROPPABLE set to false, the TIPC_ERRINFO/TIPC_RETDATA errors
>> are notified.
>>
>> Now I will also check the implications of setting TIPC_DEST_DROPPABLE
>> to false on multicast and broadcast messages. Based on that, we can
>> make setting TIPC_DEST_DROPPABLE to false conditional on the agent's
>> `i_msg_loss_indication = true` setting, so that MDS can return the
>> same TIPC_ERR_OVERLOAD error to the agent.
>>
>> TIPC_DEST_DROPPABLE set to false:
>>
>> ==================================================================
>>
>> Sep 15 16:10:39 SC-1 osafimmnd[32051]: NO Implementer disconnected 13 
>> <0, 2040f> (MsgQueueService132111)
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: NO MDS event from svc_id 25 
>> (change:4, dest:567413369208836)
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]:  777 MDTM: undelivered message 
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message 
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafamfd[32114]: NO Node 'PL-4' left the cluster
>>
>> ==================================================================
>>
>> TIPC_DEST_DROPPABLE set to true:
>>
>> ==================================================================
>>
>> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Implementer disconnected 13 
>> <0, 2040f> (MsgQueueService132111)
>> Sep 15 15:59:55 SC-1 osafimmd[26450]: NO MDS event from svc_id 25 
>> (change:4, dest:567412923957252)
>> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Global discard node 
>> received for nodeId:2040f pid:410
>> Sep 15 15:59:55 SC-1 osafamfd[28810]: NO Node 'PL-4' left the cluster
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648737] tipc: Resetting link 
>> <1.1.1:eth0-1.1.4:eth0>, peer not responding
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648756] tipc: Lost link 
>> <1.1.1:eth0-1.1.4:eth0> on network plane A
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648771] tipc: Lost contact with 
>> <1.1.4>
>>
>> ==================================================================
>>
>> -AVM
>>
>>
>> On 9/1/2016 10:59 AM, Hans Nordebäck wrote:
>>> Hi Mahesh,
>>>
>>> I have not tested this, but the following should work:
>>>
>>> - Set BSRsock TIPC_IMPORTANCE to TIPC_LOW_IMPORTANCE
>>>
>>> - set socket receive buffer to a small value:
>>>
>>>   optval = "small socket receive buffer size" , 5000 ?
>>>
>>>   setsockopt(tipc_cb.BSRsock, SOL_SOCKET, SO_RCVBUF, &optval, optlen)
>>>
>>> -  sysctl -w net.tipc.tipc_rmem="5000 40000000 68240400" (or smaller 
>>> values)
>>>
>>> - add some delays when processing messages in 
>>> mdtm_process_recv_events(), to provoke overloading the socket 
>>> receive buffer.
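
The SO_RCVBUF step from the list above could be wrapped roughly as follows.
This is only a sketch: shrink_rcvbuf() is a hypothetical helper, not an MDS
function, and it works on any socket descriptor, so in MDS the argument would
be tipc_cb.BSRsock. On Linux, getsockopt() reports back a value larger than
the one requested, because the kernel doubles it for bookkeeping and enforces
a minimum.

```c
#include <assert.h>
#include <sys/socket.h>

/*
 * Ask the kernel for a small receive buffer on sock.
 * Returns the size reported back by getsockopt(), or -1 on error.
 */
int shrink_rcvbuf(int sock, int bytes)
{
	int optval = bytes;
	socklen_t optlen = sizeof(optval);

	if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &optval, optlen) != 0)
		return -1;
	if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &optval, &optlen) != 0)
		return -1;
	return optval;
}
```

Logging the returned value next to the requested one makes it easy to see
which limit the kernel actually applied while trying to provoke the overload.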
>>>
>>> We experience dropped packets in a 75-node system; as a
>>> workaround, increasing the default socket receive buffer size seems
>>> to work for that setup.
>>>
>>> /Thanks HansN
>>>
>>> On 09/01/2016 05:50 AM, A V Mahesh wrote:
>>>> Hi HansN,
>>>>
>>>> Do you have any tips for creating an overload case?
>>>>
>>>> I would like to test and observe the TIPC_DEST_DROPPABLE enabled
>>>> and disabled cases.
>>>>
>>>> -AVM
>>>>
>>>>
>>>> On 9/1/2016 9:12 AM, A V Mahesh wrote:
>>>>> Hi HansN,
>>>>>
>>>>> Sorry for the delay.
>>>>>
>>>>> I will test it and get back to you soon.
>>>>>
>>>>> -AVM
>>>>>
>>>>>
>>>>> On 8/31/2016 4:29 PM, Hans Nordebäck wrote:
>>>>>> Hi Mahesh,
>>>>>> Any updates on this?
>>>>>>
>>>>>> /Regards HansN
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell
>>>>>> Sent: den 25 augusti 2016 13:11
>>>>>> To: A V Mahesh <mahesh.va...@oracle.com>; Hans Nordebäck 
>>>>>> <hans.nordeb...@ericsson.com>; mathi.naic...@oracle.com
>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> This is what the TIPC user documentation says about 
>>>>>> TIPC_DEST_DROPPABLE:
>>>>>> "This option governs the handling of messages sent by the socket 
>>>>>> if the message cannot be delivered to its destination, either 
>>>>>> because the receiver is congested or because the specified 
>>>>>> receiver does not exist.
>>>>>> If enabled, the message is discarded; otherwise the message is 
>>>>>> returned to the sender."
>>>>>>
>>>>>> This is what the TIPC user documentation says about the return 
>>>>>> value from the recvmsg() system call: "When used with a 
>>>>>> connectionless socket, a return value of 0 indicates the arrival 
>>>>>> of a returned data message that was originally sent by this socket."
>>>>>>
>>>>>> I think the documentation is pretty clear. If you set 
>>>>>> TIPC_DEST_DROPPABLE to true, the receiver can discard messages 
>>>>>> e.g. when the receive buffer is full. The sender will not be 
>>>>>> notified in this case. If TIPC_DEST_DROPPABLE is set to false, 
>>>>>> the message will be returned to the sender in case of a full 
>>>>>> receive buffer. The sender knows that it has received such a 
>>>>>> returned message when the recvmsg() call returns zero.
>>>>>>
>>>>>> regards,
>>>>>> Anders Widell
>>>>>>
>>>>>> On 08/25/2016 11:30 AM, A V Mahesh wrote:
>>>>>>> Hi HansN,
>>>>>>>
>>>>>>>
>>>>>>> On 8/23/2016 5:22 PM, Hans Nordebäck wrote:
>>>>>>>
>>>>>>>> Hi Mahesh,
>>>>>>>>
>>>>>>>> Yes, this is my understanding too: if TIPC_DROPPABLE = true, TIPC
>>>>>>>> may drop messages silently when the receive socket buffer is full,
>>>>>>>> and does not return any ancillary message.
>>>>>>>> If TIPC_DROPPABLE = false, TIPC may drop the message but will send
>>>>>>>> an ancillary message to inform about TIPC_ERR_OVERLOAD.
>>>>>>> [AVM]
>>>>>>>
>>>>>>> My observations and understanding are different. Based on the TIPC
>>>>>>> code and the Linux TIPC 2.0 Programmer's Guide, the
>>>>>>> TIPC_ERR_OVERLOAD error is returned when TIPC is unable to enqueue
>>>>>>> an incoming message on the receiving socket's receive queue,
>>>>>>> irrespective of whether TIPC_DEST_DROPPABLE is enabled or disabled.
>>>>>>>
>>>>>>> The only difference between TIPC_DEST_DROPPABLE enabled and disabled
>>>>>>> is this: if TIPC_DEST_DROPPABLE is enabled, the message is discarded,
>>>>>>> the size returned by recvmsg() is zero, and the application gets
>>>>>>> errors; if TIPC_DEST_DROPPABLE is disabled, the message is returned
>>>>>>> to the sender, which means the size returned by recvmsg() is the
>>>>>>> size of the data the user sent, and the application gets errors.
>>>>>>>
>>>>>>> I checked the TIPC code and documentation and have not found any
>>>>>>> evidence that the TIPC_ERR_OVERLOAD error code is sent only if
>>>>>>> TIPC_DEST_DROPPABLE = false.
>>>>>>>
>>>>>>> Even while testing #1227
>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/), my
>>>>>>> observation and understanding was that an individual TIPC socket is
>>>>>>> only allowed to queue up OVERLOAD_LIMIT_BASE/2 messages of the
>>>>>>> lowest importance level before it starts rejecting them.
>>>>>>> Once a socket's receive queue length exceeds the maximum limit, the
>>>>>>> receiving socket sends out a reject message with the
>>>>>>> TIPC_ERR_OVERLOAD error code, with cmsg_type set to
>>>>>>> TIPC_ERRINFO/TIPC_RETDATA; the TIPC code and the Linux TIPC 2.0
>>>>>>> Programmer's Guide confirm the same.
>>>>>>>
>>>>>>> tipc/socket.c
>>>>>>> =======================================================
>>>>>>> /* Reject message if there isn't room to queue it */
>>>>>>>
>>>>>>> recv_q_len = (u32)atomic_read(&tipc_queue_size);
>>>>>>> if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
>>>>>>>      if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
>>>>>>>          return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> recv_q_len = skb_queue_len(&sk->sk_receive_queue);
>>>>>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
>>>>>>>      if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
>>>>>>>          return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> =======================================================
>>>>>>>
>>>>>>>
>>>>>>> 2.1.17. setsockopt() of  TIPC 2.0 Programmer's Guide
>>>>>>> =======================================================
>>>>>>> TIPC_DEST_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket 
>>>>>>> if the
>>>>>>> message cannot be delivered to its destination, either because the
>>>>>>> receiver is congested or because the specified receiver does not
>>>>>>> exist. If enabled, the message is discarded; otherwise the 
>>>>>>> message is
>>>>>>> returned to the sender.
>>>>>>>
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET and 
>>>>>>> SOCK_STREAM
>>>>>>> socket types, and enabled for SOCK_RDM and SOCK_DGRAM. This
>>>>>>> arrangement ensures proper teardown of failed connections when
>>>>>>> connection-oriented data transfer is used, without increasing the
>>>>>>> complexity of connectionless data transfer.
>>>>>>>
>>>>>>> TIPC_SRC_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket if
>>>>>>> link congestion occurs. If enabled, the message is discarded;
>>>>>>> otherwise the system queues the message for later transmission.
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET, 
>>>>>>> SOCK_STREAM,
>>>>>>> and SOCK_RDM socket types (resulting in "reliable" data 
>>>>>>> transfer), and
>>>>>>> enabled for SOCK_DGRAM (resulting in "unreliable" data transfer).
>>>>>>> =======================================================
>>>>>>>
>>>>>>> Now I will try to create an OVERLOAD case and update you soon with
>>>>>>> my latest observations.
>>>>>>>
>>>>>>> -AVM
>>>>>>>
>>>>>>>> Correcting this and adding an abort is not backward compatible, as
>>>>>>>> some services already handle flow control in some way; only log
>>>>>>>> when packets are dropped.
>>>>>>>> Regarding ticket #1960, there are other solutions than introducing
>>>>>>>> flow control in MDS, e.g. exposing an option that lets the service
>>>>>>>> choose connection-oriented or connectionless transfer.
>>>>>>>> In one case, the problem with dropped messages seems to be related
>>>>>>>> to intensive logging by MDS.
>>>>>>>>
>>>>>>>> /Thanks HansN
>>>>>>>> -----Original Message-----
>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>>>>>> Sent: den 23 augusti 2016 11:27
>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders Widell
>>>>>>>> <anders.wid...@ericsson.com>; mathi.naic...@oracle.com
>>>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]
>>>>>>>>
>>>>>>>> Hi HansN,
>>>>>>>>
>>>>>>>> It seems I am missing something; please help me understand.
>>>>>>>>
>>>>>>>> If I understand your observation correctly:
>>>>>>>>
>>>>>>>> With the current OpenSAF code (this #1957 patch NOT applied),
>>>>>>>> TIPC_DROPPABLE=true by default. While running OpenSAF with that
>>>>>>>> binary, when TIPC_ERR_OVERLOAD occurs, TIPC does not give the
>>>>>>>> TIPC_ERRINFO or TIPC_RETDATA errors, and the following code in the
>>>>>>>> function recvfrom_connectionless() is not hit. Is my understanding
>>>>>>>> right?
>>>>>>>>
>>>>>>>> =====================================================================
>>>>>>>>
>>>>>>>>
>>>>>>>> *if (anc->cmsg_type == TIPC_ERRINFO) {*
>>>>>>>>        /* TIPC_ERRINFO - TIPC error code associated with a 
>>>>>>>> returned
>>>>>>>> data message or a connection termination message  so abort */
>>>>>>>>        m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>> ancillary
>>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) );
>>>>>>>> *abort();*
>>>>>>>> *} else if (anc->cmsg_type == TIPC_RETDATA) {*
>>>>>>>>        /* If we set TIPC_DEST_DROPPABLE off messge (configure 
>>>>>>>> TIPC to
>>>>>>>> return rejected messages to the sender )
>>>>>>>>           we will hit this when we implement MDS retransmit lost
>>>>>>>> messages abort can be replaced with flow control logic*/
>>>>>>>>        for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) {
>>>>>>>>            m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>            cptr++;
>>>>>>>>        }
>>>>>>>>        /* TIPC_RETDATA -The contents of a returned data message so
>>>>>>>> abort */
>>>>>>>>        m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>> ancillary
>>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) );
>>>>>>>> *abort();*
>>>>>>>> }
>>>>>>>>
>>>>>>>> =====================================================================
>>>>>>>>
>>>>>>>>
>>>>>>>> -AVM
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/23/2016 1:08 PM, Hans Nordebäck wrote:
>>>>>>>>> Hi Mahesh,
>>>>>>>>>
>>>>>>>>> Please see response below with [HansN] /Thanks HansN
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: A V Mahesh [mailto:mahesh.va...@oracle.com]
>>>>>>>>> Sent: den 23 augusti 2016 08:25
>>>>>>>>> To: Hans Nordebäck <hans.nordeb...@ericsson.com>; Anders Widell
>>>>>>>>> <anders.wid...@ericsson.com>; mathi.naic...@oracle.com
>>>>>>>>> Cc: opensaf-devel@lists.sourceforge.net
>>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages 
>>>>>>>>> [#1957]
>>>>>>>>>
>>>>>>>>> Hi HansN
>>>>>>>>>
>>>>>>>>> Please see response below with [AVM]
>>>>>>>>>
>>>>>>>>> -AVM
>>>>>>>>>
>>>>>>>>> On 8/23/2016 11:41 AM, Hans Nordebäck wrote:
>>>>>>>>>> Hi Mahesh,
>>>>>>>>>>
>>>>>>>>>> please see comments below.
>>>>>>>>>>
>>>>>>>>>> /Thanks HansN
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 08/23/2016 07:21 AM, A V Mahesh wrote:
>>>>>>>>>>> Hi HansN,
>>>>>>>>>>>
>>>>>>>>>>> Let us first discuss the error handling and abort; then we can
>>>>>>>>>>> come back to the interpretation of whether TIPC currently does
>>>>>>>>>>> or does not permit an application to send a multicast message
>>>>>>>>>>> with the "destination droppable" setting disabled.
>>>>>>>>>>>
>>>>>>>>>>> Let us disable TIPC_DEST_DROPPABLE, so that TIPC will try to
>>>>>>>>>>> return an undelivered multicast message to its sender and we can
>>>>>>>>>>> determine whether the issue is caused by TIPC_ERR_OVERLOAD. This
>>>>>>>>>>> helps in debugging, so that the application may increase
>>>>>>>>>>> SO_SNDBUF/SO_RCVBUF to reduce the problem.
>>>>>>>>>>>
>>>>>>>>>>> But we still need to abort(). The reason is that the current MDS
>>>>>>>>>>> implementation doesn't have flow control logic (no retry on
>>>>>>>>>>> error), so applications like AMF can go wrong and the cluster
>>>>>>>>>>> will go into an unstable/unrecoverable state.
>>>>>>>>>>>
>>>>>>>>>> [HansN] In the current implementation messages are dropped 
>>>>>>>>>> silently
>>>>>>>>>> and no abort is done.
>>>>>>>>> [AVM]  I can see abort() in the current code. Do you mean abort() is
>>>>>>>>> not working and the application (AMF) is not exiting?
>>>>>>>>> [HansN] In the case of TIPC_DROPPABLE=true, when messages are dropped
>>>>>>>>> (TIPC_ERR_OVERLOAD), no abort is performed; e.g. amfd detects this
>>>>>>>>> in the msg sanity check and logs "invalid msg id ..."
>>>>>>>>> ====================================================================
>>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) {
>>>>>>>>>         /* TIPC_ERRINFO - TIPC error code associated with a 
>>>>>>>>> returned
>>>>>>>>> data message or a connection termination message so abort */
>>>>>>>>>         m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>>> ancillary
>>>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) );
>>>>>>>>> *abort();*
>>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) {
>>>>>>>>>         /* If we set TIPC_DEST_DROPPABLE off messge (configure 
>>>>>>>>> TIPC
>>>>>>>>> to return rejected messages to the sender )
>>>>>>>>>            we will hit this when we implement MDS retransmit lost
>>>>>>>>> messages abort can be replaced with flow control logic*/
>>>>>>>>>         for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) {
>>>>>>>>>             m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>>             cptr++;
>>>>>>>>>         }
>>>>>>>>>         /* TIPC_RETDATA -The contents of a returned data 
>>>>>>>>> message  so
>>>>>>>>> abort */
>>>>>>>>>         m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>>> ancillary
>>>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) );
>>>>>>>>> *abort();*
>>>>>>>>> }
>>>>>>>>> ====================================================================
>>>>>>>>>> This patch enables logging when packets are dropped, to help in
>>>>>>>>>> debugging. I don't agree that we should also introduce abort, but
>>>>>>>>>> instead:
>>>>>>>>>> 1) Implement a solution to handle dropped packets, ticket #1960
>>>>>>>>> [AVM]  This is nothing but a flow control implementation in MDS;
>>>>>>>>> this is a future enhancement.
>>>>>>>>>
>>>>>>>>>> 2) Investigate why packets may be dropped. The receiving MDS
>>>>>>>>>> thread is a real-time thread and should be able to consume a large
>>>>>>>>>> amount of incoming messages.
>>>>>>>>>> E.g. is the receiving MDS thread "live hanging" due to locks, file
>>>>>>>>>> I/O etc.?
>>>>>>>>>>> This was the reason we haven't gone for it while addressing 
>>>>>>>>>>> Ticket
>>>>>>>>>>> #1227
>>>>>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/)
>>>>>>>>>>> So currently we don't have any advantage of disabling
>>>>>>>>>>> TIPC_DEST_DROPPABLE and not allowing multicast messages.
>>>>>>>>>>>
>>>>>>>>>>> -AVM
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/18/2016 2:43 PM, Hans Nordeback wrote:
>>>>>>>>>>>> osaf/libs/core/mds/mds_dt_tipc.c |  32
>>>>>>>>>>>> +++++++++++++++++++++++++-------
>>>>>>>>>>>>      1 files changed, 25 insertions(+), 7 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> b/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> --- a/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> +++ b/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> @@ -320,6 +320,15 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid,
>>>>>>>>>>>>                      m_MDS_LOG_INFO("MDTM: Successfully set
>>>>>>>>>>>> default socket option TIPC_IMP = %d", TIPCIMPORTANCE);
>>>>>>>>>>>>              }
>>>>>>>>>>>>      +        int droppable = 0;
>>>>>>>>>>>> +        if (setsockopt(tipc_cb.BSRsock, SOL_TIPC,
>>>>>>>>>>>> TIPC_DEST_DROPPABLE, &droppable, sizeof(droppable)) != 0) {
>>>>>>>>>>>> +                LOG_ER("MDTM: Can't set 
>>>>>>>>>>>> TIPC_DEST_DROPPABLE to
>>>>>>>>>>>> + zero
>>>>>>>>>>>> err :%s\n", strerror(errno));
>>>>>>>>>>>> +                m_MDS_LOG_ERR("MDTM: Can't set
>>>>>>>>>>>> + TIPC_DEST_DROPPABLE
>>>>>>>>>>>> to zero err :%s\n", strerror(errno));
>>>>>>>>>>>> +                osafassert(0);
>>>>>>>>>>>> +        } else {
>>>>>>>>>>>> +                m_MDS_LOG_NOTIFY("MDTM: Successfully set
>>>>>>>>>>>> TIPC_DEST_DROPPABLE to zero");
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +
>>>>>>>>>>>>          return NCSCC_RC_SUCCESS;
>>>>>>>>>>>>      }
>>>>>>>>>>>>      @@ -563,6 +572,8 @@ ssize_t recvfrom_connectionless 
>>>>>>>>>>>> (int sd,
>>>>>>>>>>>>          unsigned char *cptr;
>>>>>>>>>>>>          int i;
>>>>>>>>>>>>          int has_addr;
>>>>>>>>>>>> +    int anc_data[2];
>>>>>>>>>>>> +
>>>>>>>>>>>>          ssize_t sz;
>>>>>>>>>>>>            has_addr = (from != NULL) && (addrlen != NULL); @@
>>>>>>>>>>>> -591,19
>>>>>>>>>>>> +602,26 @@ ssize_t recvfrom_connectionless (int sd,
>>>>>>>>>>>>                     if the message was sent using a TIPC 
>>>>>>>>>>>> name or
>>>>>>>>>>>> name sequence as the
>>>>>>>>>>>>                     destination rather than a TIPC port ID So
>>>>>>>>>>>> abort for TIPC_ERRINFO and TIPC_RETDATA*/
>>>>>>>>>>>>                  if (anc->cmsg_type == TIPC_ERRINFO) {
>>>>>>>>>>>> -                /* TIPC_ERRINFO - TIPC error code 
>>>>>>>>>>>> associated with a
>>>>>>>>>>>> returned data message or a connection termination message  so
>>>>>>>>>>>> abort */
>>>>>>>>>>>> -                m_MDS_LOG_CRITICAL("MDTM: undelivered message
>>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err :%s",
>>>>>>>>>>>> strerror(errno) );
>>>>>>>>>>>> -                abort();
>>>>>>>>>>>> +                anc_data[0] = *((unsigned 
>>>>>>>>>>>> int*)(CMSG_DATA(anc) +
>>>>>>>>>>>> 0));
>>>>>>>>>>>> +                if (anc_data[0] == TIPC_ERR_OVERLOAD) {
>>>>>>>>>>>> +                    LOG_CR("MDTM: undelivered message 
>>>>>>>>>>>> condition
>>>>>>>>>>>> ancillary data: TIPC_ERR_OVERLOAD");
>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered
>>>>>>>>>>>> + message
>>>>>>>>>>>> condition ancillary data: TIPC_ERR_OVERLOAD");
>>>>>>>>>>>> +                } else {
>>>>>>>>>>>> +                    /* TIPC_ERRINFO - TIPC error code 
>>>>>>>>>>>> associated
>>>>>>>>>>>> with a returned data message or a connection termination 
>>>>>>>>>>>> message
>>>>>>>>>>>> so abort */
>>>>>>>>>>>> +                    LOG_CR("MDTM: undelivered message 
>>>>>>>>>>>> condition
>>>>>>>>>>>> ancillary data: TIPC_ERRINFO abort err : %d", anc_data[0]);
>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered
>>>>>>>>>>>> + message
>>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err : %d",
>>>>>>>>>>>> anc_data[0]);
>>>>>>>>>>>> +                }
>>>>>>>>>>>>                  } else if (anc->cmsg_type == TIPC_RETDATA) {
>>>>>>>>>>>> -                /* If we set TIPC_DEST_DROPPABLE off messge
>>>>>>>>>>>> (configure TIPC to return rejected messages to the sender )
>>>>>>>>>>>> +                /* If we set TIPC_DEST_DROPPABLE off message
>>>>>>>>>>>> (configure TIPC to return rejected messages to the sender )
>>>>>>>>>>>>                         we will hit this when we implement MDS
>>>>>>>>>>>> retransmit lost messages  abort can be replaced with flow 
>>>>>>>>>>>> control
>>>>>>>>>>>> logic*/
>>>>>>>>>>>>                      for (i = anc->cmsg_len - sizeof(*anc); 
>>>>>>>>>>>> i > 0;
>>>>>>>>>>>> i--) {
>>>>>>>>>>>> -                    m_MDS_LOG_DBG("MDTM: returned byte 
>>>>>>>>>>>> 0x%02x\n",
>>>>>>>>>>>> *cptr);
>>>>>>>>>>>> +                    LOG_CR("MDTM: returned byte 0x%02x\n", 
>>>>>>>>>>>> *cptr);
>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: returned byte
>>>>>>>>>>>> 0x%02x\n", *cptr);
>>>>>>>>>>>>                          cptr++;
>>>>>>>>>>>>                      }
>>>>>>>>>>>>                      /* TIPC_RETDATA -The contents of a 
>>>>>>>>>>>> returned
>>>>>>>>>>>> data message  so abort */
>>>>>>>>>>>> -                m_MDS_LOG_CRITICAL("MDTM: undelivered message
>>>>>>>>>>>> condition ancillary data: TIPC_RETDATA abort err :%s",
>>>>>>>>>>>> strerror(errno) );
>>>>>>>>>>>> -                abort();
>>>>>>>>>>>> +                LOG_CR("MDTM: undelivered message condition
>>>>>>>>>>>> ancillary data: TIPC_RETDATA");
>>>>>>>>>>>> +                m_MDS_LOG_CRITICAL("MDTM: undelivered message
>>>>>>>>>>>> condition ancillary data: TIPC_RETDATA");
>>>>>>>>>>>>                  } else if (anc->cmsg_type == TIPC_DESTNAME) {
>>>>>>>>>>>>                      if (sz == 0) {
>>>>>>>>>>>> m_MDS_LOG_DBG("MDTM: recd bytes=0 on
>>>>>>>>>>>> received on sock, abnormal/unknown condition. Ignoring");
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>


------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
