Re: [tipc-discussion] [PATCH net-next 1/1] tipc: add neighbor monitoring framework

2016-04-10 Thread Erik Hugne
On Sat, Apr 09, 2016 at 02:38:15PM -0400, Jon Maloy wrote:
> TIPC based clusters are by default set up with full-mesh link
> connectivity between all nodes. Those links are expected to provide
> a short failure detection time, by default set to 1500 ms. Because
> of this, the background load for neighbor monitoring in an N-node
> cluster increases with a factor N on each node, while the overall
> monitoring traffic through the network infrastructure increases at
> a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
> scale well beyond ~100 nodes unless we significantly increase failure
> discovery tolerance.
> 
> This commit introduces a framework and an algorithm that drastically
> reduces this background load, while basically maintaining the original
> failure detection times across the whole cluster. Using this algorithm,
> background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
> at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
> now have to actively monitor 38 neighbors in a 400-node cluster, instead
> of as before 399.
> 
> This "Overlapping Ring Supervision Algorithm" is completely distributed
> and employs no centralized state. It goes as follows:
> 
> - Each node makes up a linearly ascending, circular list of all its
>   N known neighbors, based on their TIPC node identity. This algorithm
>   must be the same on all nodes.
> 
> - The node then selects the next M = sqrt(N)-1 nodes downstream in the
>   list, and chooses to actively monitor those. This is called its
>   "local monitoring domain".

So 1.1.1 will actively monitor 1.1.2, 1.1.3,...,1.1.N
and 1.1.2 monitors 1.1.3, 1.1.4,...1.1.N+1 etc?
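
To make the quoted selection step concrete, here is a minimal Python sketch of the circular list and the local monitoring domain. It is not taken from the patch: the function name `local_domain`, the use of plain integers in place of TIPC node identities, and rounding sqrt(N) down with `math.isqrt` are all assumptions made for illustration.

```python
import math

def local_domain(self_id, known_ids):
    """Return the M = sqrt(N) - 1 nodes that follow self_id on the sorted ring."""
    # Every node must sort the same way, so the ring looks identical everywhere.
    ring = sorted(set(known_ids) | {self_id})
    n = len(ring)
    m = max(math.isqrt(n) - 1, 0)   # assumption: sqrt(N) is rounded down
    start = ring.index(self_id)
    # Walk downstream from our own position, wrapping at the end of the list.
    return [ring[(start + k) % n] for k in range(1, m + 1)]

nodes = list(range(1, 17))          # 16 nodes standing in for 1.1.1 .. 1.1.16
print(local_domain(1, nodes))       # [2, 3, 4]
print(local_domain(15, nodes))      # [16, 1, 2], wrapping around the ring
```

Under these assumptions, node 1 and node 2 get overlapping domains shifted by one position, which is the pattern the question above describes.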

> 
> - It creates a domain record describing the monitoring domain, and
>   piggy-backs this in the data area of all neighbor monitoring messages
>   (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
>   the cluster eventually (default within 400 ms) will learn about
>   its monitoring domain.
> 
> - Whenever a node discovers a change in its local domain, e.g., a node
>   has been added or has gone down, it creates and sends out a new
>   version of its node record to inform all neighbors about the change.
> 
> - A node receiving a domain record from anybody outside its local domain
>   matches this against its own list (which may not look the same), and
>   chooses to not actively monitor those members of the received domain
>   record that are also present in its own list. Instead, it relies on
>   indications from the direct monitoring nodes if an indirectly monitored
>   node has gone up or down. If a node is indicated lost, the receiving
>   node temporarily activates its own direct monitoring towards that node
>   in order to confirm, or not, that it is actually gone.

> 
> - Since each node is actively monitoring sqrt(N) downstream neighbors,
>   each node is also actively monitored by the same number of upstream
>   neighbors. This means that all non-direct monitoring nodes normally
>   will receive sqrt(N) indications that a node is gone.
> 
> - A major drawback with ring monitoring is how to handle failures that
>   cause massive network partitioning. If both a lost node and all its
>   direct monitoring neighbors are inside the lost partition, the nodes in
>   the remaining partition will never receive indications about the loss.
>   To overcome this, each node also chooses to actively monitor some
>   nodes outside its local domain. Those nodes are called remote domain
>   "heads", and are selected in such a way that no node in the cluster
>   is more than one indirect monitoring hop away. Because of this, each
>   node, apart from monitoring the members of its local domain, will also
>   typically monitor sqrt(N) remote head nodes.

What if each node includes its local domain record in the DSC_REQ messages
as well? These would be sent at 60 second intervals, meaning that a network
partitioning such as the one you described would be detected after 1 minute
in the worst case, but maybe that's too slow?
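
For reference, the head selection in the quoted bullet can be sketched roughly as follows. This is a simplified Python model, not the patch's code: it assumes every node's domain has the same size M = sqrt(N) - 1 (rounded down), that all domain records are already known, and the name `monitored_nodes` is made up for the example.

```python
import math

def monitored_nodes(self_id, ring_ids):
    """Split the nodes self_id actively monitors into (local domain, heads)."""
    ring = sorted(ring_ids)
    n = len(ring)
    m = max(math.isqrt(n) - 1, 0)   # assumed uniform domain size
    start = ring.index(self_id)
    local = [ring[(start + k) % n] for k in range(1, m + 1)]
    heads = []
    pos = m + 1                     # first node past the local domain
    while pos < n:
        heads.append(ring[(start + pos) % n])
        pos += m + 1                # skip the m members that head reports on
    return local, heads

local, heads = monitored_nodes(1, range(1, 401))
print(len(local), len(heads), len(local) + len(heads))   # 19 19 38
```

With 400 nodes this gives 19 local members plus 19 heads, which matches the 38 actively monitored neighbors quoted in the commit message. Every other node sits in the domain of exactly one head, so it is one indirect monitoring hop away.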

> 
> - As an optimization, local list status, domain status and domain
>   records are marked with a generation number. This saves senders from
>   unnecessarily conveying unchanged domain records, and receivers from
>   performing unneeded re-adaptations of their node monitoring list, such
>   as re-assigning domain heads.
> 
> Signed-off-by: Jon Maloy 
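
The generation number optimization in the last quoted bullet could look something like this on the receiving side. This is only a Python sketch under assumptions: `PeerState` and `receive_domain_record` are hypothetical names, and the comparison uses inequality rather than "greater than" on the assumption that a real counter may wrap around.

```python
from dataclasses import dataclass

@dataclass
class PeerState:
    applied_gen: int = -1   # generation of the last record applied for this peer
    domain: tuple = ()      # members listed in that record

def receive_domain_record(peer, gen, members):
    """Apply a received record only if its generation differs; return True if applied."""
    if gen == peer.applied_gen:
        return False        # unchanged record: skip re-adapting the monitor list
    peer.applied_gen = gen
    peer.domain = tuple(members)
    return True

peer = PeerState()
print(receive_domain_record(peer, 1, [2, 3]))     # True, first record is applied
print(receive_domain_record(peer, 1, [2, 3]))     # False, same generation is ignored
print(receive_domain_record(peer, 2, [2, 3, 4]))  # True, new generation is applied
```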

I like this, but I think it would be good if it were made configurable with
either a sysctl or a bearer parameter.

//Erik


Re: [tipc-discussion] kernel page allocation problem

2016-04-10 Thread Leon Pollak
On Friday 01 April 2016 15:02:05 Jon Maloy wrote:
> > I saw that the problem is in the lack of memory.
> > I also made a simple stress test (one side sends in a loop) and received 
> > this problem immediately.
> > But User's Manual states that if there is no room for the receiving side 
> > to accept message it prevents the sending side from sending, effectively
> > blocking the sender.
> 
> That is true only for connection oriented messaging (SOCK_STREAM,
> SOCK_SEQPACKET). Is that what you are running?
> If you are just doing a tight loop with 66k messages using TIPC_PORT_NAME 
> you will not have any flow control, and the messages might be dropped in the 
> receiving socket.
> But if you are really tight on memory they might also be dropped at the link
> layer, which is what you are seeing. 
> I have never seen that happening before, but it is fully possible in TIPC 
> versions before commit 40ba3cdf542a469aaa9  ("tipc: message reassembly using 
> fragment chain") from Nov 6th 2013. (I don't remember which Linux version 
> this corresponds to, but that should be easy to find out). Furthermore, this 
> might even happen with connection oriented messaging, but is less likely.
> 
> So, if you are running connectionless, I recommend that you move to a
> connection oriented mode. If you already are connection oriented, you can try
> to find out which version this fix was in, and re-adapt the code.

Thanks a lot, Jon, again.

After your answer I again looked through both manuals and did not find
anything saying that connectionless messaging allows drops. Vice versa, the
programmer's manual explicitly states that "TIPC is designed to be a reliable 
messaging mechanism, in which an application can send a message and assume 
that the message will be delivered to the specified destination as long as 
that destination is reachable."

Now, is it a bug, corrected on Nov 6th 2013? Or a feature?

I can't move to connection oriented methods, because I need to support
one-to-3 and 3-to-one.

Sorry.
-- 
Leon

___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] kernel page allocation problem

2016-04-10 Thread Erik Hugne
On Apr 10, 2016 15:27, "Leon Pollak"  wrote:
>
> After your answer I again looked through both manuals and did not find
> anything saying that connectionless messaging allows drops. Vice versa, the
> programmer's manual explicitly states that "TIPC is designed to be a reliable
> messaging mechanism, in which an application can send a message and assume
> that the message will be delivered to the specified destination as long as
> that destination is reachable."
>
> Now, is it a bug, corrected on Nov 6th 2013? Or a feature?
>
> I can't move to connection oriented methods, because I need to support
> one-to-3 and 3-to-one.
>

This sounds a lot like the problem when connectionless messages are
received, acked on the tipc link layer and passed to the socket, but
dropped there because the socket receive buffer is full.
If the receiving application does not drain the queue faster than messages
build up, there will be losses as there is no concept of flow control at
this level.
Try bumping the prio of the server, maybe run it as an rt thread?
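
To illustrate the effect, here is a toy Python model of a bounded socket receive queue with no flow control. It does not use TIPC or real sockets; the function `simulate`, the slot-based buffer and the per-tick rates are assumptions chosen only to show why a slow reader loses messages.

```python
def simulate(rcvbuf_slots, sent, produce_per_tick, drain_per_tick):
    """Return (delivered, dropped) for a sender that is never blocked.

    Assumes drain_per_tick > 0, otherwise the queue would never empty.
    """
    queued = delivered = dropped = 0
    remaining = sent
    while remaining or queued:
        burst = min(produce_per_tick, remaining)
        remaining -= burst
        # Messages that do not fit in the receive buffer are silently dropped.
        accepted = min(burst, rcvbuf_slots - queued)
        dropped += burst - accepted
        queued += accepted
        # The application reads at most drain_per_tick messages per tick.
        drained = min(drain_per_tick, queued)
        queued -= drained
        delivered += drained
    return delivered, dropped

print(simulate(8, 100, 10, 4))   # (44, 56): reader too slow, most messages lost
print(simulate(8, 100, 4, 4))    # (100, 0): reader keeps up, nothing lost
```

In this model the only ways to avoid drops are a faster reader or a larger buffer, which is why giving the server more CPU time is worth trying.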

//E

> Sorry.
> --
> Leon