Hi Minh, ack, some minor comments below/Thanks Hans
On 2019-08-14 08:38, Minh Chau wrote: > --- > src/mds/README | 221 > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 221 insertions(+) > create mode 100644 src/mds/README > > diff --git a/src/mds/README b/src/mds/README > new file mode 100644 > index 0000000..1b94632 > --- /dev/null > +++ b/src/mds/README > @@ -0,0 +1,221 @@ > +/* -*- OpenSAF -*- > + * > + * (C) Copyright 2019 The OpenSAF Foundation > + * > + * This program is distributed in the hope that it will be useful, but > + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY > + * or FITNESS FOR A PARTICULAR PURPOSE. This file and program are licensed > + * under the GNU Lesser General Public License Version 2.1, February 1999. > + * The complete license can be accessed from the following location: > + * http://opensource.org/licenses/lgpl-license.php > + * See the Copying file included with the OpenSAF distribution for full > + * licensing terms. > + * > + * Author(s): Ericsson AB > + * > + */ > +Background > +========== > +If OpenSAF configures TIPC as transport, the MDS library today will use > +TIPC SOCK_RDM socket for message distribution in the cluster. The SOCK_RDM > +datagram socket possibly encounters buffer overflow at receiver ends which > +has been documented in tipc.io[1]. A temporary solution for this buffer > +overflow issue is that the socket buffer size can be increased to a larger > +number. However, if the cluster continues either scaling out or adding more > +components, the system will be under dimensioned, thus the TIPC buffer > +overflow can occur again. > + > +MDS's solution for TIPC buffer overflow > +======================================= > +If MDS disables TIPC_DEST_DROPPABLE, TIPC will return the ancillary message > +when the original message is failed to deliver. By this event, if the message > +has been saved in queue, MDS at sender sides can search and retransmit this > +message to the receivers. > +Once the messages in the sender's queue has been delivered successfully, MDS > +needs to remove them. MDS introduces its internal ACK message as an > +acknowledgment from receivers so that the senders can remove the messages > +out of the queue. > +Also, as such situation of buffer overflow at receivers, the retransmission > may > +not succeed or even become worse at receiver ends (the more retransmission, > +the more overflow to occur). MDS imitates the sliding window in TCP[2] to > +control the flow of data message towards the receivers. > + > +Legacy MDS data message, new (data + ACK) MDS message, and upgradability > +------------------------------------------------------------------------ > +Below is the MDS legacy message format that has been used till OpenSAF > 5.19.07 > + > +oct 0 message length > +oct 1 > +------------------------------------------ > +oct 2 sequence number: incremented for every message sent out to all > destined > +... tipc portid. > +oct 5 > +------------------------------------------ > +oct 6 fragment number: a message with same sequence number can be > fragmented, > +oct 7 identified by this fragment number. > +------------------------------------------ > +oct 8 length check: cross check with message length(oct0,1), NOT USED. > +oct 9 > +------------------------------------------ > +oct 10 protocol version: (MDS_PROT:0xA0 | MDS_VERSION:0x08) = 0xA8, NOT USED > +------------------------------------------ > +oct 11 mds length: length of mds header and mds data, starting from oct13 > +oct 12 > +------------------------------------------ > +oct 13 mds header and data > +... > +------------------------------------------ > + > +The current sequence number/fragment number are being used in MDS for all > +messages sent to all discovered tipc portid(s), meaning that every message > is sent > +to any tipc portid, the sequence/fragment number is increased. The flow > control > +needs its own sequence number sliding between two tipc porid(s) so that > receivers > +can detect message drop due to buffer overload. Therefore, the oct8 and oct9 > are > +now reused as flow control sequence number. The oct10, protocol version, has > new > +value of 0xB8. The format of new data message as below: > + > +oct 0 same > +... > +oct 7 > +------------------------------------------ > +oct 8 flow control sequence number > +oct 9 > +------------------------------------------ > +oct 10 protocol version: (MDS_PROT_TIPC_FCTRL:0xB0 | MDS_VERSION:0x08) = 0xB8 > +------------------------------------------ > +oct 11 same > +... > +------------------------------------------ > + > +The ACK message is introduced to acknowledge one data message or a chunk of > +accumulative data message. The ACK message format: > + > +oct 0 message length > +oct 1 > +------------------------------------------ > +oct 2 8 bytes, NOT USED > +.... > +oct 9 > +------------------------------------------ > +oct 10 protocol version: (MDS_PROT_TIPC_FCTRL:0xB0 | MDS_VERSION:0x08) = 0xB8 > +------------------------------------------ > +oct 11 protocol identifier: MDS_PROT_FCTRL_ID > +...... > +oct 14 > +------------------------------------------ > +oct 15 flow control message type: CHUNKACK > +------------------------------------------ > +oct 16 service id: service id of data messages to be acknowledged > +oct 17 > +------------------------------------------ > +oct 18 acknowledged sequence > +oct 19 > +------------------------------------------ > +oct 20 chunk size > +oct 21 > +------------------------------------------ > + > +Psuedo code illustrates the data message handling at MDS. > + > +if protocol version is 0xB8 then > + if protocol identifier is MDS_PROT_FCTRL_ID then > + this is mds flow control message. > + if message type is CHUNKACK, then > + this is ACK message for successfully delivered data message(s). > + else > + this is data message within flow control. > + decode oct8,9 as flow control sequence number. > +else > + this is legacy data message. > + > +Because the legacy MDS does not use oct8,9,10, so it can communicate > transparently > +to the new MDS in regard to the presence of flow control. Therefore, the > upgrade > +will not be affected. > + > +In case that the receiver end is at legacy MDS version, the new MDS has a > timer > +mechanism to recognize if the receiver has no flow control supported. This > timer > +is implemented at the sender so that MDS will stop message queuing for a > non-flow > +-control MDS at receivers, namely tx-probation timer. > + > +MDS's sliding window > +-------------------- > +One important factor that needs to be documented in the implementation of > MDS's > +sliding window, is that "TIPC's link layer delivery guarantee, the only > limiting > +factor for datagram delivery is the socket receive buffer size" [1][3][4]. > Therefore, > +MDS at sender side does not have to implement the retransmission timer. > Also, if > +MDS at sender side anticipates the buffer overflow at receiver ends, or > receives > +the first ancillary message, MDS starts queuing messages till the buffer > overflow > +is resolved to resume data message transmission. > + > +(1) Sender sequence window > + > +acked_: last sequence has been acked by receiver > +send_: next sequence to be sent > +nacked_space_: total bytes are sent but not acked > + > +Example: > + 1 2 3 4 5 6 7 8 > +|-----|-----|-----|-----|-----|-----|-----|-----| > + acked_ send_ > +If acked_:3, send_:7, then > +The message with sequence 1,2,3 have been acked. > +The 4,5,6 are sent but not acked, and are still in queue. > +The 7 is not sent yet. > +The nacked_space_: bytes_of(4,5,6) > + > +(2) Receiver sequence window > + > +acked_: last sequence has been acked to sender > +rcv_: last sequence has been received > +nacked_space_: total bytes has not been acked > +Example: > + 1 2 3 4 5 6 7 8 9 10 > +|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----| > + acked_ rcv_ > +If acked_:3, rcv_:8 > +The message with sequence 1,2,3: has been acked. > +The 4,5,6,7,8 are received by not acked, still in sender's queue. > +The 9,10 are not received yet > +The nacked_space_: bytes_of(4,5,6,7,8) > + > +TIPC portid state machine and its transition > +-------------------------------------------- > +kDisabled, // no flow control support at this state > +kStartup, // a newly published portid starts at this state > +kTxProb, // txprob timer is running to confirm if the flow control is > supported [HansN] +kTxProb, // the tx probation timer is running to confirm if the flow control is supported [HansN] > kFlowControlEnabled or kFCEnabled ? > +kRcvBuffOverflow // anticipating (or experienced) the receiver's buffer > overflow > + > + kDisabled <--- kStartup -------- > + /|\ | | > + | | | > + | V V > + -----------kTxProb ---> kEnabled <---> kRcvBuffOverflow > + > +At the kRcvBuffOverflow state, the messages are being requested to send by > MDS's > +users will be enqueued at sender sides. When the state returns back to > kEnabled, > +the queued messages will be transmitted, the transmission is open for MDS's > users. > + > +At this version, MDS changes to kRcvBuffOverflow state if the TIPC_RETDATA > event > +is returned, which is known as loss-based buffer overflow detection. Another > +approach is that MDS can utilize the TIPC_USED_RCV_BUFF TIPC socket option > +so that the senders can periodically get update of the receiver's TIPC sock > buffer > +utilization. In that way, the senders can anticipate the buffer overflow in > advance, > +which is called in MDS's context as a loss-less detection. > + > +Configuration > +============= > +ChunkAckTimeout timer: the receivers send the ACK message if this timer > expires. > +If this ChunkAckTimeout is set too large, the round trip of data message > +acknowledgment increases, data message stays too long in the queue. > + > +ChunkAckSize: The number of message to be acknowledged in an ACK message. If > +this ChunkAckSize is too small, there will be a plentiful number of ACK > messages > +sent across two ends, which causes the overhead cost to MDS's message > handling. > + > +References > +========== > +[1] > https://protect2.fireeye.com/url?k=2b8b607f-775f65c0-2b8b20e4-868f633dbf25-3d1a92d322e55200&q=1&u=http%3A%2F%2Ftipc.io%2Fprogramming.html, > 1.3.1. Datagram Messaging > +[2] https://tools.ietf.org/html/rfc793, page 20. > +[3] > https://protect2.fireeye.com/url?k=c3d9b20d-9f0db7b2-c3d9f296-868f633dbf25-c9790ae5e13cfcca&q=1&u=http%3A%2F%2Fwww.tipc.io%2Fprotocol.html%23anchor71, > 7.2.5. Sequence Control and Retransmission > +[4] http://tipc.sourceforge.net/protocol.html, 4.2. Link _______________________________________________ Opensaf-devel mailing list Opensaf-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/opensaf-devel