Hi Anders Widell / HansN,
On 9/16/2016 2:03 PM, Anders Widell wrote:
> The idea was to just log reception of error info messages, for
> trouble-shooting purposes.
After multiple attempts, I managed to simulate the TIPC_ERR_OVERLOAD
error. Once TIPC_ERR_OVERLOAD is hit,
the cluster goes into an unrecoverable state, because the send buffers are
full.
So we have two options:
1) Set TIPC_DEST_DROPPABLE to false, log the TIPC_ERR_OVERLOAD error,
and then gracefully exit the sender,
which allows the remaining nodes to survive.
2) Keep the current configuration as it is (TIPC_DEST_DROPPABLE set to true).
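
To make option 1 concrete, here is a minimal sketch of the decision it implies on the receive path. It is not taken from mds_dt_tipc.c: the helper names and the SK_ prefix are hypothetical, and the numeric values are copied from the TIPC_ERRINFO codes in <linux/tipc.h> only so the sketch builds without kernel headers. By that table, the "err : 2" in the Sep 15 logs quoted further down would be TIPC_ERR_NO_PORT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same numeric values as the TIPC_ERRINFO codes in <linux/tipc.h>; redefined
 * with an SK_ prefix only so this sketch compiles without kernel headers. */
enum sk_tipc_err {
	SK_TIPC_OK = 0,
	SK_TIPC_ERR_NO_NAME = 1,
	SK_TIPC_ERR_NO_PORT = 2,
	SK_TIPC_ERR_NO_NODE = 3,
	SK_TIPC_ERR_OVERLOAD = 4,
	SK_TIPC_CONN_SHUTDOWN = 5
};

/* Hypothetical helper: readable name for a code carried in TIPC_ERRINFO. */
const char *sk_tipc_err_name(int code)
{
	switch (code) {
	case SK_TIPC_OK:            return "TIPC_OK";
	case SK_TIPC_ERR_NO_NAME:   return "TIPC_ERR_NO_NAME";
	case SK_TIPC_ERR_NO_PORT:   return "TIPC_ERR_NO_PORT";
	case SK_TIPC_ERR_NO_NODE:   return "TIPC_ERR_NO_NODE";
	case SK_TIPC_ERR_OVERLOAD:  return "TIPC_ERR_OVERLOAD";
	case SK_TIPC_CONN_SHUTDOWN: return "TIPC_CONN_SHUTDOWN";
	default:                    return "unknown";
	}
}

/* Hypothetical policy for option 1: the sender exits gracefully only when
 * rejected messages are returned to it (TIPC_DEST_DROPPABLE is false) and
 * the returned error code is TIPC_ERR_OVERLOAD. */
bool sk_sender_should_exit(bool dest_droppable, int errinfo_code)
{
	return !dest_droppable && errinfo_code == SK_TIPC_ERR_OVERLOAD;
}
```

With option 2 the first argument is always true, so the sketch never asks the sender to exit, which matches the current behaviour.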
=================================================================================================================
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Received node_up from 2040f:
msg_id 1
Sep 20 15:14:09 SC-1 osafamfd[3759]: NO Node 'PL-4' joined the cluster
Sep 20 15:14:09 SC-1 osafimmnd[3695]: NO Implementer connected: 19
(MsgQueueService132111) <0, 2040f>
*Sep 20 15:16:59 SC-1 osafimmd[3684]: 77 MDTM: undelivered message
condition ancillary data: TIPC_ERR_OVERLOAD*
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Director Service in NOACTIVE
state - fevs replies pending:1 fevs highest processed:218744
Sep 20 15:17:00 SC-1 osafamfnd[3773]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
'avaDown' : Recovery is 'nodeFailfast'
Sep 20 15:17:00 SC-1 osafamfnd[3773]: ER
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown
Recovery is:nodeFailfast
Sep 20 15:17:00 SC-1 osafamfnd[3773]: Rebooting OpenSAF NodeId = 131343
EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131343, SupervisionTime = 60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA DISCARD DUPLICATE FEVS
message:218744
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA Error code 2 returned for
message type 82 - ignoring
Sep 20 15:17:00 SC-1 opensaf_reboot: Rebooting local node; timeout=60
Sep 20 15:17:00 SC-1 osafimmnd[3695]: WA SC Absence IS allowed:900 IMMD
service is DOWN
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO IMMD SERVICE IS DOWN, HYDRA IS
CONFIGURED => UNREGISTERING IMMND form MDS
Sep 20 15:17:00 SC-1 osafntfimcnd[3742]: NO saImmOiDispatch() Fail
SA_AIS_ERR_BAD_HANDLE (9)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:20002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 1 <2,
2010f> (safLogService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:d0d0002010f
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:100002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 2 <16,
2010f> (@safLogService_appl)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:130002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 3 <19,
2010f> (@OpenSafImmReplicatorA)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:140002010f
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:150002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 4 <21,
2010f> (safClmService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1a0002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 5 <26,
2010f> (safAmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:1b0002010f
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bc0002010f
sv_id:26
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5bd0002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 6
<1469, 2010f> (MsgQueueService131343)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c00002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 10
<1472, 2010f> (safEvtService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c40002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 8
<1476, 2010f> (safSmfService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c60002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 9
<1478, 2010f> (safLckService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5c70002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 7
<1479, 2010f> (safMsgGrpService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5cc0002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Removing client id:5ce0002010f
sv_id:27
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 12
<1486, 2010f> (safCheckPointService)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 13 <0,
2020f(down)> (MsgQueueService131599)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 14 <0,
2020f(down)> (@OpenSafImmReplicatorB)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 15 <0,
2020f(down)> (@safAmfService2020f)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2020f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 16 <0,
2030f(down)> (MsgQueueService131855)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2030f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Implementer disconnected 19 <0,
2040f(down)> (MsgQueueService132111)
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO Impl Discarded node 2040f
Sep 20 15:17:00 SC-1 osafimmnd[3695]: NO MDS unregisterede. sleeping ...
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Sleep done registering IMMND
with MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fe8fa0043 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb60040 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb6002e already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb60037 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb6003d already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb6002b already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb6001c already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcba0012 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb60028 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO MDS: mds_register_callback:
dest 2010fdcb60019 already exist
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO SUCCESS IN REGISTERING IMMND
WITH MDS
Sep 20 15:17:01 SC-1 osafimmnd[3695]: NO Re-introduce-me
highestProcessed:218744 highestReceived:218744
Sep 20 15:17:03 SC-1 kernel: [ 1794.198381] md: stopping all md devices.
Sep 20 15:17:03 SC-1 osafntfimcnd[8997]: WA ntfimcn_imm_init
saImmOiInitialize_2() returned SA_AIS_ERR_TIMEOUT (5)
Sep 20 15:18:00 SC-1 syslog-ng[1221]: syslog-ng starting up; version='2.0.9'
=================================================================================================================
-AVM
On 9/16/2016 2:03 PM, Anders Widell wrote:
>
> I don't think we need (or even should) inform the sender when MDS
> receives an error information message from TIPC. Note that these error
> information messages are received asynchronously, when the sender has
> already received an OK return code from the MDS send call. The idea
> was to just log reception of error info messages, for trouble-shooting
> purposes. We already have a mechanism in MDS that informs the receiver
> about lost MDS messages. If we wish to inform the sender we would need
> to introduce a second mechanism in MDS, and at this point I don't
> think it is needed. Another approach we could consider is that MDS
> retransmits the message transparently without informing the sender.
> This would require MDS to internally store sent messages for a while,
> so that they can be retransmitted. It would also require the receiver
> to re-order received messages, since a retransmitted message will be
> received out of sequence.
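
[Note] A minimal sketch of the receiver-side re-ordering that the retransmission approach above would need, assuming every message carries a 32-bit sequence number. Nothing here exists in MDS today: the struct, the function names, the window size and the int payload are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REORDER_WINDOW 8u /* hypothetical window size, not an MDS constant */

/* Hypothetical receiver-side reorder buffer. */
struct reorder_buf {
	uint32_t next_seq;            /* next sequence number to deliver */
	bool present[REORDER_WINDOW]; /* slot holds an undelivered message */
	int payload[REORDER_WINDOW];  /* stand-in for the real message body */
};

void reorder_init(struct reorder_buf *rb, uint32_t first_seq)
{
	assert(rb != NULL);
	memset(rb, 0, sizeof(*rb));
	rb->next_seq = first_seq;
}

/* Store a received message. Returns false for duplicates, for messages
 * that were already delivered, and for messages beyond the window. */
bool reorder_push(struct reorder_buf *rb, uint32_t seq, int payload)
{
	assert(rb != NULL);
	/* Unsigned subtraction: an already delivered seq wraps to a large value. */
	uint32_t ahead = seq - rb->next_seq;
	if (ahead >= REORDER_WINDOW)
		return false;
	uint32_t slot = seq % REORDER_WINDOW;
	if (rb->present[slot])
		return false;
	rb->present[slot] = true;
	rb->payload[slot] = payload;
	return true;
}

/* Deliver the next in-sequence message if it has arrived. */
bool reorder_pop(struct reorder_buf *rb, int *payload)
{
	assert(rb != NULL && payload != NULL);
	uint32_t slot = rb->next_seq % REORDER_WINDOW;
	if (!rb->present[slot])
		return false;
	*payload = rb->payload[slot];
	rb->present[slot] = false;
	rb->next_seq++;
	return true;
}
```

A retransmitted message that arrives late is pushed like any other; reorder_pop() holds back later messages until the gap is filled. The sender-side store of sent messages is not covered by this sketch.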
>
> regards,
>
> Anders Widell
>
>
> On 09/16/2016 06:40 AM, A V Mahesh wrote:
>> Hi HansN,
>>
>> I managed to create TIPC_ERRINFO/TIPC_RETDATA error cases (not the
>> TIPC_ERR_OVERLOAD error) with normal messages.
>> I observed that with TIPC_DEST_DROPPABLE set to true, the
>> TIPC_ERRINFO error is NOT notified (which implies the same for
>> TIPC_ERR_OVERLOAD); with TIPC_DEST_DROPPABLE set to false, the
>> TIPC_ERRINFO/TIPC_RETDATA errors are notified.
>>
>> Next I will check the implications of setting TIPC_DEST_DROPPABLE to
>> false for multicast and broadcast messages. Based on that, we can
>> decide under which conditions to set TIPC_DEST_DROPPABLE to false;
>> for example, when the agent has set `i_msg_loss_indication = true`,
>> MDS can return the same TIPC_ERR_OVERLOAD error to the agent.
>>
>> TIPC_DEST_DROPPABLE to false:
>>
>> ==================================================================
>>
>> Sep 15 16:10:39 SC-1 osafimmnd[32051]: NO Implementer disconnected 13
>> <0, 2040f> (MsgQueueService132111)
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: NO MDS event from svc_id 25
>> (change:4, dest:567413369208836)
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 777 MDTM: undelivered message
>> condition ancillary data: TIPC_ERRINFO abort err : 2
>> Sep 15 16:10:39 SC-1 osafimmd[32040]: 7777 MDTM: undelivered message
>> condition ancillary data: TIPC_RETDATA
>> Sep 15 16:10:39 SC-1 osafamfd[32114]: NO Node 'PL-4' left the cluster
>>
>> ==================================================================
>>
>> TIPC_DEST_DROPPABLE to true:
>>
>> ==================================================================
>>
>> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Implementer disconnected 13
>> <0, 2040f> (MsgQueueService132111)
>> Sep 15 15:59:55 SC-1 osafimmd[26450]: NO MDS event from svc_id 25
>> (change:4, dest:567412923957252)
>> Sep 15 15:59:55 SC-1 osafimmnd[26461]: NO Global discard node
>> received for nodeId:2040f pid:410
>> Sep 15 15:59:55 SC-1 osafamfd[28810]: NO Node 'PL-4' left the cluster
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648737] tipc: Resetting link
>> <1.1.1:eth0-1.1.4:eth0>, peer not responding
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648756] tipc: Lost link
>> <1.1.1:eth0-1.1.4:eth0> on network plane A
>> Sep 15 15:59:58 SC-1 kernel: [ 5147.648771] tipc: Lost contact with
>> <1.1.4>
>>
>> ==================================================================
>>
>> -AVM
>>
>>
>> On 9/1/2016 10:59 AM, Hans Nordebäck wrote:
>>> Hi Mahesh,
>>>
>>> I have not tested this, but the following should work:
>>>
>>> - Set BSRsock TIPC_IMPORTANCE to TIPC_LOW_IMPORTANCE
>>>
>>> - set socket receive buffer to a small value:
>>>
>>> optval = "small socket receive buffer size" , 5000 ?
>>>
>>> setsockopt(tipc_cb.BSRsock, SOL_SOCKET, SO_RCVBUF, &optval, optlen)
>>>
>>> - sysctl -w net.tipc.tipc_rmem="5000 40000000 68240400" (or smaller
>>> values)
>>>
>>> - add some delays when processing messages in
>>> mdtm_process_recv_events(), to provoke overloading the socket
>>> receive buffer.
>>>
>>> We experience dropped packages in a 75-node system; as a
>>> workaround, increasing the default socket receive buffer size seems
>>> to work for that setup.
>>>
>>> /Thanks HansN
>>>
>>> On 09/01/2016 05:50 AM, A V Mahesh wrote:
>>>> Hi HansN,
>>>>
>>>> Do you have any tips on how to create an overload case?
>>>>
>>>> I would like to test and observe the TIPC_DEST_DROPPABLE enabled and
>>>> disabled cases.
>>>>
>>>> -AVM
>>>>
>>>>
>>>> On 9/1/2016 9:12 AM, A V Mahesh wrote:
>>>>> Hi HansN,
>>>>>
>>>>> Sorry for the delay.
>>>>>
>>>>> I will test it and get back to you soon.
>>>>>
>>>>> -AVM
>>>>>
>>>>>
>>>>> On 8/31/2016 4:29 PM, Hans Nordebäck wrote:
>>>>>> Hi Mahesh,
>>>>>> Any updates on this?
>>>>>>
>>>>>> /Regards HansN
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell
>>>>>> Sent: den 25 augusti 2016 13:11
>>>>>> To: A V Mahesh <[email protected]>; Hans Nordebäck
>>>>>> <[email protected]>; [email protected]
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> This is what the TIPC user documentation says about
>>>>>> TIPC_DEST_DROPPABLE:
>>>>>> "This option governs the handling of messages sent by the socket
>>>>>> if the message cannot be delivered to its destination, either
>>>>>> because the receiver is congested or because the specified
>>>>>> receiver does not exist.
>>>>>> If enabled, the message is discarded; otherwise the message is
>>>>>> returned to the sender."
>>>>>>
>>>>>> This is what the TIPC user documentation says about the return
>>>>>> value from the recvmsg() system call: "When used with a
>>>>>> connectionless socket, a return value of 0 indicates the arrival
>>>>>> of a returned data message that was originally sent by this socket."
>>>>>>
>>>>>> I think the documentation is pretty clear. If you set
>>>>>> TIPC_DEST_DROPPABLE to true, the receiver can discard messages
>>>>>> e.g. when the receive buffer is full. The sender will not be
>>>>>> notified in this case. If TIPC_DEST_DROPPABLE is set to false,
>>>>>> the message will be returned to the sender in case of a full
>>>>>> receive buffer. The sender knows that it has received such a
>>>>>> returned message when the recvmsg() call returns zero.
>>>>>>
>>>>>> regards,
>>>>>> Anders Widell
>>>>>>
>>>>>> On 08/25/2016 11:30 AM, A V Mahesh wrote:
>>>>>>> Hi HansN,
>>>>>>>
>>>>>>>
>>>>>>> On 8/23/2016 5:22 PM, Hans Nordebäck wrote:
>>>>>>>
>>>>>>>> Hi Mahesh,
>>>>>>>>
>>>>>>>> Yes, this is my understanding too, if TIPC_DROPPABLE = true
>>>>>>>> tipc may
>>>>>>>> drop messages silently, at receive sock buffer full
>>>>>>>> condition, but
>>>>>>>> do not return any ancillary message.
>>>>>>>> If TIPC_DROPPABLE = false tipc may drop message but will send an
>>>>>>>> ancillary message to inform about TIPC_ERR_OVERLOAD.
>>>>>>> [AVM]
>>>>>>>
>>>>>>> My observation and understanding are different: based on the TIPC
>>>>>>> code and the Linux TIPC 2.0 Programmer's Guide, the
>>>>>>> TIPC_ERR_OVERLOAD error is returned when TIPC is unable to enqueue
>>>>>>> an incoming message on the receiving socket's receive queue,
>>>>>>> irrespective of whether TIPC_DEST_DROPPABLE is enabled or disabled.
>>>>>>>
>>>>>>> The only difference between TIPC_DEST_DROPPABLE enabled and
>>>>>>> disabled is this: if TIPC_DEST_DROPPABLE is enabled, the message is
>>>>>>> discarded, the size returned by recvmsg() is zero, and the
>>>>>>> application gets errors; if TIPC_DEST_DROPPABLE is disabled, the
>>>>>>> message is returned to the sender, which means the size returned by
>>>>>>> recvmsg() is the size of the data the user sent, and the
>>>>>>> application gets errors.
>>>>>>>
>>>>>>> I checked the TIPC code and documentation and did not find any
>>>>>>> evidence that the TIPC_ERR_OVERLOAD error code is sent only if
>>>>>>> TIPC_DEST_DROPPABLE = false.
>>>>>>>
>>>>>>> Even while testing #1227
>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/) my
>>>>>>> observation and understanding was that an individual TIPC socket
>>>>>>> is only allowed to queue up
>>>>>>> OVERLOAD_LIMIT_BASE/2 messages of the lowest importance level
>>>>>>> before it starts rejecting them.
>>>>>>> Once a socket's receive queue length exceeds the maximum limit
>>>>>>> value, the receiving socket will send out a reject message with
>>>>>>> the TIPC_ERR_OVERLOAD error code and cmsg_type
>>>>>>> TIPC_ERRINFO/TIPC_RETDATA; the TIPC code and the Linux TIPC 2.0
>>>>>>> Programmer's Guide confirm the same.
>>>>>>>
>>>>>>> tipc/socket.c
>>>>>>> =======================================================
>>>>>>> /* Reject message if there isn't room to queue it */
>>>>>>>
>>>>>>> recv_q_len = (u32)atomic_read(&tipc_queue_size);
>>>>>>> if (unlikely(recv_q_len >= OVERLOAD_LIMIT_BASE)) {
>>>>>>>         if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE))
>>>>>>>                 return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> recv_q_len = skb_queue_len(&sk->sk_receive_queue);
>>>>>>> if (unlikely(recv_q_len >= (OVERLOAD_LIMIT_BASE / 2))) {
>>>>>>>         if (rx_queue_full(msg, recv_q_len, OVERLOAD_LIMIT_BASE / 2))
>>>>>>>                 return TIPC_ERR_OVERLOAD;
>>>>>>> }
>>>>>>> =======================================================
>>>>>>>
>>>>>>>
>>>>>>> 2.1.17. setsockopt() of TIPC 2.0 Programmer's Guide
>>>>>>> =======================================================
>>>>>>> TIPC_DEST_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket
>>>>>>> if the
>>>>>>> message cannot be delivered to its destination, either because the
>>>>>>> receiver is congested or because the specified receiver does not
>>>>>>> exist. If enabled, the message is discarded; otherwise the
>>>>>>> message is
>>>>>>> returned to the sender.
>>>>>>>
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET and
>>>>>>> SOCK_STREAM
>>>>>>> socket types, and enabled for SOCK_RDM and SOCK_DGRAM, This
>>>>>>> arrangement ensures proper teardown of failed connections when
>>>>>>> connection-oriented data transfer is used, without increasing the
>>>>>>> complexity of connectionless data transfer.
>>>>>>>
>>>>>>> TIPC_SRC_DROPPABLE
>>>>>>> This option governs the handling of messages sent by the socket if
>>>>>>> link congestion occurs. If enabled, the message is discarded;
>>>>>>> otherwise the system queues the message for later transmission.
>>>>>>> By default, this option is disabled for SOCK_SEQPACKET,
>>>>>>> SOCK_STREAM,
>>>>>>> and SOCK_RDM socket types (resulting in "reliable" data
>>>>>>> transfer), and
>>>>>>> enabled for SOCK_DGRAM (resulting in "unreliable" data transfer).
>>>>>>> =======================================================
>>>>>>>
>>>>>>> Now I will try to create an OVERLOAD case and will update you soon
>>>>>>> with my latest observations.
>>>>>>>
>>>>>>> -AVM
>>>>>>>
>>>>>>>> Correcting this and adding an abort is not backward compatible as
>>>>>>>> some service already handle flow control in some way, only log
>>>>>>>> when
>>>>>>>> packages are dropped.
>>>>>>>> Regarding ticket #1960 there are other solutions than introducing
>>>>>>>> flow control in MDS, e.g. expose an option to the service to
>>>>>>>> choose
>>>>>>>> connection oriented or connection less.
>>>>>>>> The problem with dropped messages seems in one case related to,
>>>>>>>> (by
>>>>>>>> MDS), intensive MDS logging.
>>>>>>>>
>>>>>>>> /Thanks HansN
>>>>>>>> -----Original Message-----
>>>>>>>> From: A V Mahesh [mailto:[email protected]]
>>>>>>>> Sent: den 23 augusti 2016 11:27
>>>>>>>> To: Hans Nordebäck <[email protected]>; Anders Widell
>>>>>>>> <[email protected]>; [email protected]
>>>>>>>> Cc: [email protected]
>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages [#1957]
>>>>>>>>
>>>>>>>> Hi HansN,
>>>>>>>>
>>>>>>>> It seems I am missing something; please help me understand.
>>>>>>>>
>>>>>>>> If I understand your observation correctly:
>>>>>>>>
>>>>>>>> With the current OpenSAF code (this #1957 patch NOT applied), by
>>>>>>>> default TIPC_DROPPABLE=true. While running OpenSAF with that
>>>>>>>> binary, when TIPC_ERR_OVERLOAD occurs, TIPC does not report the
>>>>>>>> TIPC_ERRINFO or TIPC_RETDATA errors, and the following code in the
>>>>>>>> function recvfrom_connectionless() is never hit. Is my
>>>>>>>> understanding right?
>>>>>>>>
>>>>>>>> =====================================================================
>>>>>>>>
>>>>>>>> ========================================
>>>>>>>>
>>>>>>>>
>>>>>>>> *if (anc->cmsg_type == TIPC_ERRINFO) {*
>>>>>>>> /* TIPC_ERRINFO - TIPC error code associated with a
>>>>>>>> returned
>>>>>>>> data message or a connection termination message so abort */
>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>> ancillary
>>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) );
>>>>>>>> *abort();*
>>>>>>>> *} else if (anc->cmsg_type == TIPC_RETDATA) {*
>>>>>>>> /* If we set TIPC_DEST_DROPPABLE off messge (configure
>>>>>>>> TIPC to
>>>>>>>> return rejected messages to the sender )
>>>>>>>> we will hit this when we implement MDS retransmit lost
>>>>>>>> messages abort can be replaced with flow control logic*/
>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) {
>>>>>>>> m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>> cptr++;
>>>>>>>> }
>>>>>>>> /* TIPC_RETDATA -The contents of a returned data message so
>>>>>>>> abort */
>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>> ancillary
>>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) );
>>>>>>>> *abort();*
>>>>>>>> }
>>>>>>>>
>>>>>>>> =====================================================================
>>>>>>>>
>>>>>>>> ========================================
>>>>>>>>
>>>>>>>>
>>>>>>>> -AVM
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/23/2016 1:08 PM, Hans Nordebäck wrote:
>>>>>>>>> Hi Mahesh,
>>>>>>>>>
>>>>>>>>> Please see response below with [HansN] /Thanks HansN
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: A V Mahesh [mailto:[email protected]]
>>>>>>>>> Sent: den 23 augusti 2016 08:25
>>>>>>>>> To: Hans Nordebäck <[email protected]>; Anders Widell
>>>>>>>>> <[email protected]>; [email protected]
>>>>>>>>> Cc: [email protected]
>>>>>>>>> Subject: Re: [PATCH 1 of 1] MDS: Log TIPC dropped messages
>>>>>>>>> [#1957]
>>>>>>>>>
>>>>>>>>> Hi HansN
>>>>>>>>>
>>>>>>>>> Please see response below with [AVM]
>>>>>>>>>
>>>>>>>>> -AVM
>>>>>>>>>
>>>>>>>>> On 8/23/2016 11:41 AM, Hans Nordebäck wrote:
>>>>>>>>>> Hi Mahesh,
>>>>>>>>>>
>>>>>>>>>> please see comments below.
>>>>>>>>>>
>>>>>>>>>> /Thanks HansN
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 08/23/2016 07:21 AM, A V Mahesh wrote:
>>>>>>>>>>> Hi HansN,
>>>>>>>>>>>
>>>>>>>>>>> Let us first discuss the error handling and abort; then we
>>>>>>>>>>> can come back to the interpretation of whether TIPC currently
>>>>>>>>>>> does or does not permit an application to send a multicast
>>>>>>>>>>> message with the "destination droppable" setting disabled.
>>>>>>>>>>>
>>>>>>>>>>> Let us disable TIPC_DEST_DROPPABLE, so that TIPC will try to
>>>>>>>>>>> return an undelivered multicast message to its sender and we
>>>>>>>>>>> can determine whether the issue is caused by TIPC_ERR_OVERLOAD.
>>>>>>>>>>> This helps in debugging, so that the application may increase
>>>>>>>>>>> SO_SNDBUF/SO_RCVBUF to reduce the problem.
>>>>>>>>>>>
>>>>>>>>>>> But we still need to abort(). The reason is that the current
>>>>>>>>>>> MDS implementation doesn't have flow control logic (no retry
>>>>>>>>>>> on error), so applications like AMF can go wrong and the
>>>>>>>>>>> cluster will go into an unstable/unrecoverable state.
>>>>>>>>>>>
>>>>>>>>>> [HansN] In the current implementation messages are dropped
>>>>>>>>>> silently
>>>>>>>>>> and no abort is done.
>>>>>>>>> [AVM] I can see abort(); in the current code. Do you mean abort(); is
>>>>>>>>> not working and the application (amf) is not exiting?
>>>>>>>>> [HansN] In case of TIPC_DROPPABLE=true and messages are dropped
>>>>>>>>> (TIPC_ERR_OVERLOAD), no abort is performed; e.g. amfd detects this
>>>>>>>>> in the msg sanity chk and logs "invalid msg id ..."
>>>>>>>>> ====================================================================
>>>>>>>>>
>>>>>>>>> ==
>>>>>>>>> ======
>>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) {
>>>>>>>>> /* TIPC_ERRINFO - TIPC error code associated with a
>>>>>>>>> returned
>>>>>>>>> data message or a connection termination message so abort */
>>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>>> ancillary
>>>>>>>>> data: TIPC_ERRINFO abort err :%s", strerror(errno) );
>>>>>>>>> *abort();*
>>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) {
>>>>>>>>> /* If we set TIPC_DEST_DROPPABLE off messge (configure
>>>>>>>>> TIPC
>>>>>>>>> to return rejected messages to the sender )
>>>>>>>>> we will hit this when we implement MDS retransmit lost
>>>>>>>>> messages abort can be replaced with flow control logic*/
>>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc); i > 0; i--) {
>>>>>>>>> m_MDS_LOG_DBG("MDTM: returned byte 0x%02x\n", *cptr);
>>>>>>>>> cptr++;
>>>>>>>>> }
>>>>>>>>> /* TIPC_RETDATA -The contents of a returned data
>>>>>>>>> message so
>>>>>>>>> abort */
>>>>>>>>> m_MDS_LOG_CRITICAL("MDTM: undelivered message condition
>>>>>>>>> ancillary
>>>>>>>>> data: TIPC_RETDATA abort err :%s", strerror(errno) );
>>>>>>>>> *abort();*
>>>>>>>>> }
>>>>>>>>> ====================================================================
>>>>>>>>>
>>>>>>>>> ==
>>>>>>>>> ======
>>>>>>>>>> This patch enables logging
>>>>>>>>>> when packages are dropped to help in debugging. I don't agree
>>>>>>>>>> that
>>>>>>>>>> we should also introduce abort, but instead:
>>>>>>>>>> 1) Implement a solution to handle dropped packages, ticket #1960
>>>>>>>>> [AVM] This is nothing but flow control implementation in MDS,
>>>>>>>>> this
>>>>>>>>> is future enhancement
>>>>>>>>>
>>>>>>>>>> 2) Investigate why packages may be dropped, the receiving MDS
>>>>>>>>>> thread is a real time thread and should be able to consume a
>>>>>>>>>> large
>>>>>>>>>> amount of incoming messages.
>>>>>>>>>> E.g. is the receiving MDS thread "live hanging" due to locks,
>>>>>>>>>> file
>>>>>>>>>> I/O etc?
>>>>>>>>>>> This was the reason we haven't gone for it while addressing
>>>>>>>>>>> Ticket
>>>>>>>>>>> #1227
>>>>>>>>>>> (https://sourceforge.net/p/opensaf/mailman/message/33207717/)
>>>>>>>>>>> So currently we don't have any advantage of disabling
>>>>>>>>>>> TIPC_DEST_DROPPABLE and not allowing multicast messages.
>>>>>>>>>>>
>>>>>>>>>>> -AVM
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/18/2016 2:43 PM, Hans Nordeback wrote:
>>>>>>>>>>>> osaf/libs/core/mds/mds_dt_tipc.c | 32
>>>>>>>>>>>> +++++++++++++++++++++++++-------
>>>>>>>>>>>> 1 files changed, 25 insertions(+), 7 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> b/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> --- a/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> +++ b/osaf/libs/core/mds/mds_dt_tipc.c
>>>>>>>>>>>> @@ -320,6 +320,15 @@ uint32_t mdtm_tipc_init(NODE_ID nodeid,
>>>>>>>>>>>> m_MDS_LOG_INFO("MDTM: Successfully set
>>>>>>>>>>>> default socket option TIPC_IMP = %d", TIPCIMPORTANCE);
>>>>>>>>>>>> }
>>>>>>>>>>>> + int droppable = 0;
>>>>>>>>>>>> + if (setsockopt(tipc_cb.BSRsock, SOL_TIPC,
>>>>>>>>>>>> TIPC_DEST_DROPPABLE, &droppable, sizeof(droppable)) != 0) {
>>>>>>>>>>>> + LOG_ER("MDTM: Can't set
>>>>>>>>>>>> TIPC_DEST_DROPPABLE to
>>>>>>>>>>>> + zero
>>>>>>>>>>>> err :%s\n", strerror(errno));
>>>>>>>>>>>> + m_MDS_LOG_ERR("MDTM: Can't set
>>>>>>>>>>>> + TIPC_DEST_DROPPABLE
>>>>>>>>>>>> to zero err :%s\n", strerror(errno));
>>>>>>>>>>>> + osafassert(0);
>>>>>>>>>>>> + } else {
>>>>>>>>>>>> + m_MDS_LOG_NOTIFY("MDTM: Successfully set
>>>>>>>>>>>> TIPC_DEST_DROPPABLE to zero");
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> return NCSCC_RC_SUCCESS;
>>>>>>>>>>>> }
>>>>>>>>>>>> @@ -563,6 +572,8 @@ ssize_t recvfrom_connectionless
>>>>>>>>>>>> (int sd,
>>>>>>>>>>>> unsigned char *cptr;
>>>>>>>>>>>> int i;
>>>>>>>>>>>> int has_addr;
>>>>>>>>>>>> + int anc_data[2];
>>>>>>>>>>>> +
>>>>>>>>>>>> ssize_t sz;
>>>>>>>>>>>> has_addr = (from != NULL) && (addrlen != NULL); @@
>>>>>>>>>>>> -591,19
>>>>>>>>>>>> +602,26 @@ ssize_t recvfrom_connectionless (int sd,
>>>>>>>>>>>> if the message was sent using a TIPC
>>>>>>>>>>>> name or
>>>>>>>>>>>> name sequence as the
>>>>>>>>>>>> destination rather than a TIPC port ID So
>>>>>>>>>>>> abort for TIPC_ERRINFO and TIPC_RETDATA*/
>>>>>>>>>>>> if (anc->cmsg_type == TIPC_ERRINFO) {
>>>>>>>>>>>> - /* TIPC_ERRINFO - TIPC error code
>>>>>>>>>>>> associated with a
>>>>>>>>>>>> returned data message or a connection termination message so
>>>>>>>>>>>> abort */
>>>>>>>>>>>> - m_MDS_LOG_CRITICAL("MDTM: undelivered message
>>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err :%s",
>>>>>>>>>>>> strerror(errno) );
>>>>>>>>>>>> - abort();
>>>>>>>>>>>> + anc_data[0] = *((unsigned
>>>>>>>>>>>> int*)(CMSG_DATA(anc) +
>>>>>>>>>>>> 0));
>>>>>>>>>>>> + if (anc_data[0] == TIPC_ERR_OVERLOAD) {
>>>>>>>>>>>> + LOG_CR("MDTM: undelivered message
>>>>>>>>>>>> condition
>>>>>>>>>>>> ancillary data: TIPC_ERR_OVERLOAD");
>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered
>>>>>>>>>>>> + message
>>>>>>>>>>>> condition ancillary data: TIPC_ERR_OVERLOAD");
>>>>>>>>>>>> + } else {
>>>>>>>>>>>> + /* TIPC_ERRINFO - TIPC error code
>>>>>>>>>>>> associated
>>>>>>>>>>>> with a returned data message or a connection termination
>>>>>>>>>>>> message
>>>>>>>>>>>> so abort */
>>>>>>>>>>>> + LOG_CR("MDTM: undelivered message
>>>>>>>>>>>> condition
>>>>>>>>>>>> ancillary data: TIPC_ERRINFO abort err : %d", anc_data[0]);
>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered
>>>>>>>>>>>> + message
>>>>>>>>>>>> condition ancillary data: TIPC_ERRINFO abort err : %d",
>>>>>>>>>>>> anc_data[0]);
>>>>>>>>>>>> + }
>>>>>>>>>>>> } else if (anc->cmsg_type == TIPC_RETDATA) {
>>>>>>>>>>>> - /* If we set TIPC_DEST_DROPPABLE off messge
>>>>>>>>>>>> (configure TIPC to return rejected messages to the sender )
>>>>>>>>>>>> + /* If we set TIPC_DEST_DROPPABLE off message
>>>>>>>>>>>> (configure TIPC to return rejected messages to the sender )
>>>>>>>>>>>> we will hit this when we implement MDS
>>>>>>>>>>>> retransmit lost messages abort can be replaced with flow
>>>>>>>>>>>> control
>>>>>>>>>>>> logic*/
>>>>>>>>>>>> for (i = anc->cmsg_len - sizeof(*anc);
>>>>>>>>>>>> i > 0;
>>>>>>>>>>>> i--) {
>>>>>>>>>>>> - m_MDS_LOG_DBG("MDTM: returned byte
>>>>>>>>>>>> 0x%02x\n",
>>>>>>>>>>>> *cptr);
>>>>>>>>>>>> + LOG_CR("MDTM: returned byte 0x%02x\n",
>>>>>>>>>>>> *cptr);
>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: returned byte
>>>>>>>>>>>> 0x%02x\n", *cptr);
>>>>>>>>>>>> cptr++;
>>>>>>>>>>>> }
>>>>>>>>>>>> /* TIPC_RETDATA -The contents of a
>>>>>>>>>>>> returned
>>>>>>>>>>>> data message so abort */
>>>>>>>>>>>> - m_MDS_LOG_CRITICAL("MDTM: undelivered message
>>>>>>>>>>>> condition ancillary data: TIPC_RETDATA abort err :%s",
>>>>>>>>>>>> strerror(errno) );
>>>>>>>>>>>> - abort();
>>>>>>>>>>>> + LOG_CR("MDTM: undelivered message condition
>>>>>>>>>>>> ancillary data: TIPC_RETDATA");
>>>>>>>>>>>> + m_MDS_LOG_CRITICAL("MDTM: undelivered message
>>>>>>>>>>>> condition ancillary data: TIPC_RETDATA");
>>>>>>>>>>>> } else if (anc->cmsg_type == TIPC_DESTNAME) {
>>>>>>>>>>>> if (sz == 0) {
>>>>>>>>>>>> m_MDS_LOG_DBG("MDTM: recd bytes=0 on
>>>>>>>>>>>> received on sock, abnormal/unknown condition. Ignoring");
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel