Re: [tipc-discussion] [PATCH net-next 1/1] tipc: add neighbor monitoring framework

2016-06-15 Thread David Miller
From: Jon Maloy 
Date: Mon, 13 Jun 2016 20:46:22 -0400

> TIPC based clusters are by default set up with full-mesh link
> connectivity between all nodes. Those links are expected to provide
> a short failure detection time, by default set to 1500 ms. Because
> of this, the background load for neighbor monitoring in an N-node
> cluster increases with a factor N on each node, while the overall
> monitoring traffic through the network infrastructure increases at
> a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
> scale well beyond ~100 nodes unless we significantly increase failure
> discovery tolerance.
> 
> This commit introduces a framework and an algorithm that drastically
> reduces this background load, while basically maintaining the original
> failure detection times across the whole cluster. Using this algorithm,
> background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
> at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
> now have to actively monitor 38 neighbors in a 400-node cluster, instead
> of as before 399.
> 
> This "Overlapping Ring Supervision Algorithm" is completely distributed
> and employs no centralized or coordinated state. It goes as follows:
 ...
> Acked-by: Ying Xue 
> Signed-off-by: Jon Maloy 

Applied, thanks.

___
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion


Re: [tipc-discussion] [PATCH net-next 1/1] tipc: add neighbor monitoring framework

2016-04-13 Thread Xue, Ying
Hi Jon,

It's a very nice solution.

But I have a few questions about the algorithm:

1. Is it possible to set a minimum node count that determines whether we deploy 
the new algorithm or not? For instance, if the number of nodes in a cluster is 
less than a threshold, such as 16 or a smaller value, we still use the old 
algorithm to detect link failures. Of course, I am not sure whether it's worth 
doing this. Please advise.
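
A minimal sketch of the threshold idea in question 1. The name ACTIVATION_THRESHOLD and its value of 16 are hypothetical (16 is only the example value mentioned above), and the per-node counts are the approximations from the patch description, not code from the patch:

```python
import math

ACTIVATION_THRESHOLD = 16  # hypothetical; the example value from question 1


def monitored_neighbors(cluster_size, threshold=ACTIVATION_THRESHOLD):
    """Return how many neighbors one node actively monitors.

    Below the threshold every neighbor is monitored (full mesh). At or
    above it, roughly sqrt(N) - 1 local domain members plus as many
    remote domain heads are monitored.
    """
    if cluster_size < threshold:
        return cluster_size - 1
    return 2 * (math.isqrt(cluster_size) - 1)


for n in (8, 15, 16, 400):
    print(n, monitored_neighbors(n))
```

In small clusters the difference is modest (15 versus 6 neighbors at 16 nodes), which is the trade-off the question is asking about.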

2. Is it possible to set different tolerance values for nodes in the local 
domain and in remote domains? In other words, if both neighbors are in the same 
local domain, the tolerance can be set to a short time. In contrast, if one 
neighbor is in the local domain and the other in a remote domain, the tolerance 
of the link between them can be set a bit longer so that the link is not 
broken unnecessarily. Maybe the idea is not reasonable. Alternatively, as the 
number of nodes in a cluster grows, we could automatically increase the default 
link tolerance to avoid breaking links unnecessarily, which is 
a bit simpler than the former.

Regards,
Ying

-----Original Message-----
From: Jon Maloy [mailto:jon.ma...@ericsson.com] 
Sent: April 10, 2016 2:38
To: tipc-discussion@lists.sourceforge.net; 
parthasarathy.bhuvara...@ericsson.com; Xue, Ying; richard.a...@ericsson.com; 
jon.ma...@ericsson.com
Cc: ma...@donjonn.com
Subject: [PATCH net-next 1/1] tipc: add neighbor monitoring framework

TIPC based clusters are by default set up with full-mesh link connectivity 
between all nodes. Those links are expected to provide a short failure 
detection time, by default set to 1500 ms. Because of this, the background load 
for neighbor monitoring in an N-node cluster increases with a factor N on each 
node, while the overall monitoring traffic through the network infrastructure 
increases at a ~(N * (N - 1)) rate. Experience has shown that such clusters 
don't scale well beyond ~100 nodes unless we significantly increase failure 
discovery tolerance.

This commit introduces a framework and an algorithm that drastically reduces 
this background load, while basically maintaining the original failure 
detection times across the whole cluster. Using this algorithm, background 
load will now grow at a rate of ~(2 * sqrt(N)) per node, and at ~(2 * N * 
sqrt(N)) in traffic overhead. As an example, each node will now have to 
actively monitor 38 neighbors in a 400-node cluster, instead of as before 399.
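
As a rough check of the figures above, the following sketch compares the two schemes. It assumes the per-node count is 2 * (sqrt(N) - 1), i.e. sqrt(N) - 1 local domain members plus the same number of remote heads, which is one reading of the ~(2 * sqrt(N)) estimate that reproduces the 38 quoted for 400 nodes. The function names are illustrative only:

```python
import math


def full_mesh_load(n):
    """Per-node monitored neighbors and cluster-wide total for a full mesh."""
    per_node = n - 1
    return per_node, n * per_node


def ring_load(n):
    """Approximate per-node and cluster-wide totals for ring supervision."""
    per_node = 2 * (math.isqrt(n) - 1)
    return per_node, n * per_node


for n in (100, 400, 1600):
    print(n, full_mesh_load(n), ring_load(n))
```

For N = 400 this gives 38 actively monitored neighbors per node against 399 in a full mesh, and about a tenth of the cluster-wide monitoring relations.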

This "Overlapping Ring Supervision Algorithm" is completely distributed and 
employs no centralized state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its
  N known neighbors, based on their TIPC node identity. This algorithm
  must be the same on all nodes.

- The node then selects the next M = sqrt(N)-1 nodes downstream in the
  list, and chooses to actively monitor those. This is called its
  "local monitoring domain".

- It creates a domain record describing the monitoring domain, and
  piggy-backs this in the data area of all neighbor monitoring messages
  (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
  the cluster eventually (default within 400 ms) will learn about
  its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node
  has been added or has gone down, it creates and sends out a new
  version of its node record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain
  matches this against its own list (which may not look the same), and
  chooses to not actively monitor those members of the received domain
  record that are also present in its own list. Instead, it relies on
  indications from the direct monitoring nodes if an indirectly monitored
  node has gone up or down. If a node is indicated lost, the receiving
  node temporarily activates its own direct monitoring towards that node
  in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors,
  each node is also actively monitored by the same number of upstream
  neighbors. This means that all non-direct monitoring nodes normally
  will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how to handle failures that
  cause massive network partitionings. If both a lost node and all its
  direct monitoring neighbors are inside the lost partition, the nodes in
  the remaining partition will never receive indications about the loss.
  To overcome this, each node also chooses to actively monitor some
  nodes outside its local domain. Those nodes are called remote domain
  "heads", and are selected in such a way that no node in the cluster
  is more than one indirect monitoring hop away. Because of this, each
  node, apart from monitoring the members of its local domain, will also
  typically monitor sqrt(N) remote head nodes.

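- As an optimization, local list status, domain status and domain
  records are marked with a generation number. This saves senders from
  unnecessarily conveying unchanged domain records, and receivers from
  performing unneeded re-adaptations of their node monitoring list, such
  as re-assigning domain heads.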

Re: [tipc-discussion] [PATCH net-next 1/1] tipc: add neighbor monitoring framework

2016-04-12 Thread Richard Alpe
Nice functionality. Some minor comments inline.

On 2016-04-09 20:38, Jon Maloy wrote:
> TIPC based clusters are by default set up with full-mesh link
> connectivity between all nodes. Those links are expected to provide
> a short failure detection time, by default set to 1500 ms. Because
> of this, the background load for neighbor monitoring in an N-node
> cluster increases with a factor N on each node, while the overall
> monitoring traffic through the network infrastructure increases at
> a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
> scale well beyond ~100 nodes unless we significantly increase failure
> discovery tolerance.
> 
> This commit introduces a framework and an algorithm that drastically
> reduces this background load, while basically maintaining the original
> failure detection times across the whole cluster. Using this algorithm,
> background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
> at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
> now have to actively monitor 38 neighbors in a 400-node cluster, instead
> of as before 399.
> 
> This "Overlapping Ring Supervision Algorithm" is completely distributed
> and employs no centralized state. It goes as follows:
> 
> - Each node makes up a linearly ascending, circular list of all its
>   N known neighbors, based on their TIPC node identity. This algorithm
>   must be the same on all nodes.
> 
> - The node then selects the next M = sqrt(N)-1 nodes downstream in the
>   list, and chooses to actively monitor those. This is called its
>   "local monitoring domain".
> 
> - It creates a domain record describing the monitoring domain, and
>   piggy-backs this in the data area of all neighbor monitoring messages
>   (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
>   the cluster eventually (default within 400 ms) will learn about
>   its monitoring domain.
> 
> - Whenever a node discovers a change in its local domain, e.g., a node
>   has been added or has gone down, it creates and sends out a new
>   version of its node record to inform all neighbors about the change.
> 
> - A node receiving a domain record from anybody outside its local domain
>   matches this against its own list (which may not look the same), and
>   chooses to not actively monitor those members of the received domain
>   record that are also present in its own list. Instead, it relies on
>   indications from the direct monitoring nodes if an indirectly monitored
>   node has gone up or down. If a node is indicated lost, the receiving
>   node temporarily activates its own direct monitoring towards that node
>   in order to confirm, or not, that it is actually gone.
> 
> - Since each node is actively monitoring sqrt(N) downstream neighbors,
>   each node is also actively monitored by the same number of upstream
>   neighbors. This means that all non-direct monitoring nodes normally
>   will receive sqrt(N) indications that a node is gone.
> 
> - A major drawback with ring monitoring is how to handle failures that
>   cause massive network partitionings. If both a lost node and all its
>   direct monitoring neighbors are inside the lost partition, the nodes in
>   the remaining partition will never receive indications about the loss.
>   To overcome this, each node also chooses to actively monitor some
>   nodes outside its local domain. Those nodes are called remote domain
>   "heads", and are selected in such a way that no node in the cluster
>   is more than one indirect monitoring hop away. Because of this, each
>   node, apart from monitoring the members of its local domain, will also
>   typically monitor sqrt(N) remote head nodes.
> 
> - As an optimization, local list status, domain status and domain
>   records are marked with a generation number. This saves senders from
>   unnecessarily conveying unchanged domain records, and receivers from
>   performing unneeded re-adaptations of their node monitoring list, such
>   as re-assigning domain heads.
> 
> Signed-off-by: Jon Maloy 
> ---
>  net/tipc/Makefile  |   2 +-
>  net/tipc/addr.h    |   1 +
>  net/tipc/bearer.c  |   8 +-
>  net/tipc/bearer.h  |   2 +-
>  net/tipc/core.h    |  15 ++
>  net/tipc/link.c    |  32 +++-
>  net/tipc/monitor.c | 537 +
>  net/tipc/monitor.h |  65 +++
>  net/tipc/node.c    |  25 ++-
>  9 files changed, 664 insertions(+), 23 deletions(-)
>  create mode 100644 net/tipc/monitor.c
>  create mode 100644 net/tipc/monitor.h
> 
> diff --git a/net/tipc/Makefile b/net/tipc/Makefile
> index 57e460b..31b9f9c 100644
> --- a/net/tipc/Makefile
> +++ b/net/tipc/Makefile
> @@ -6,7 +6,7 @@ obj-$(CONFIG_TIPC) := tipc.o
>  
>  tipc-y   += addr.o bcast.o bearer.o \
>  core.o link.o discover.o msg.o  \
> -name_distr.o  subscr.o name_table.o net.o  \
> +name_distr.o  subscr.o monitor.o name_table.o net.o  \
>  

Re: [tipc-discussion] [PATCH net-next 1/1] tipc: add neighbor monitoring framework

2016-04-11 Thread Richard Alpe
Getting some conflicts applying this on net-next. Can
you rebase it or let us know what you based it on?

Regards
Richard


Re: [tipc-discussion] [PATCH net-next 1/1] tipc: add neighbor monitoring framework

2016-04-10 Thread Erik Hugne
On Sat, Apr 09, 2016 at 02:38:15PM -0400, Jon Maloy wrote:
> TIPC based clusters are by default set up with full-mesh link
> connectivity between all nodes. Those links are expected to provide
> a short failure detection time, by default set to 1500 ms. Because
> of this, the background load for neighbor monitoring in an N-node
> cluster increases with a factor N on each node, while the overall
> monitoring traffic through the network infrastructure increases at
> a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
> scale well beyond ~100 nodes unless we significantly increase failure
> discovery tolerance.
> 
> This commit introduces a framework and an algorithm that drastically
> reduces this background load, while basically maintaining the original
> failure detection times across the whole cluster. Using this algorithm,
> background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
> at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
> now have to actively monitor 38 neighbors in a 400-node cluster, instead
> of as before 399.
> 
> This "Overlapping Ring Supervision Algorithm" is completely distributed
> and employs no centralized state. It goes as follows:
> 
> - Each node makes up a linearly ascending, circular list of all its
>   N known neighbors, based on their TIPC node identity. This algorithm
>   must be the same on all nodes.
> 
> - The node then selects the next M = sqrt(N)-1 nodes downstream in the
>   list, and chooses to actively monitor those. This is called its
>   "local monitoring domain".

So 1.1.1 will actively monitor 1.1.2, 1.1.3,...,1.1.N
and 1.1.2 monitors 1.1.3, 1.1.4,...1.1.N+1 etc?
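
Going by the quoted description, the window is M = sqrt(N)-1 nodes wide rather than reaching all the way to 1.1.N, and it wraps around at the end of the sorted list. A small sketch, assuming node identities can be compared as plain integers (1 standing for 1.1.1 and so on) and that N counts the node itself; the function name is illustrative:

```python
import math


def local_domain(self_id, node_ids):
    """Return the sqrt(N) - 1 nodes that follow self_id in the sorted ring."""
    ring = sorted(node_ids)
    n = len(ring)
    m = math.isqrt(n) - 1
    start = ring.index(self_id)
    # Walk downstream from self_id, wrapping around at the end of the list.
    return [ring[(start + k) % n] for k in range(1, m + 1)]


cluster = range(1, 17)  # 16 nodes, so M = 3
for node in (1, 2, 15, 16):
    print(node, local_domain(node, cluster))
```

In this 16-node example node 1 monitors 2, 3 and 4, node 2 monitors 3, 4 and 5, and node 16 wraps around to 1, 2 and 3.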

> 
> - It creates a domain record describing the monitoring domain, and
>   piggy-backs this in the data area of all neighbor monitoring messages
>   (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
>   the cluster eventually (default within 400 ms) will learn about
>   its monitoring domain.
> 
> - Whenever a node discovers a change in its local domain, e.g., a node
>   has been added or has gone down, it creates and sends out a new
>   version of its node record to inform all neighbors about the change.
> 
> - A node receiving a domain record from anybody outside its local domain
>   matches this against its own list (which may not look the same), and
>   chooses to not actively monitor those members of the received domain
>   record that are also present in its own list. Instead, it relies on
>   indications from the direct monitoring nodes if an indirectly monitored
>   node has gone up or down. If a node is indicated lost, the receiving
>   node temporarily activates its own direct monitoring towards that node
>   in order to confirm, or not, that it is actually gone.
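
The confirm-before-declaring-lost step in the quoted bullet could be modelled roughly as below. This is only a sketch of the described behaviour; the class, its fields and its method names are hypothetical and do not come from the patch:

```python
class IndirectPeer:
    """Tracks one peer that is normally monitored only through other nodes."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.up = True
        self.probing = False  # True while we monitor the peer directly

    def on_indication(self, reported_up):
        """Handle an up/down indication relayed by a directly monitoring node."""
        if reported_up:
            self.probing = False
            self.up = True
        elif self.up:
            # Do not trust the loss report blindly: probe the peer ourselves.
            self.probing = True

    def on_probe_result(self, answered):
        """Handle the outcome of our own temporary direct monitoring."""
        if not self.probing:
            return
        self.probing = False
        self.up = answered


peer = IndirectPeer(7)
peer.on_indication(reported_up=False)   # someone reports node 7 as lost
peer.on_probe_result(answered=True)     # our own probe still reaches it
print(peer.up)
```

The point of the intermediate probing state is that a loss report from a single monitoring node only triggers a check and does not by itself take the link down.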

> 
> - Since each node is actively monitoring sqrt(N) downstream neighbors,
>   each node is also actively monitored by the same number of upstream
>   neighbors. This means that all non-direct monitoring nodes normally
>   will receive sqrt(N) indications that a node is gone.
> 
> - A major drawback with ring monitoring is how to handle failures that
>   cause massive network partitionings. If both a lost node and all its
>   direct monitoring neighbors are inside the lost partition, the nodes in
>   the remaining partition will never receive indications about the loss.
>   To overcome this, each node also chooses to actively monitor some
>   nodes outside its local domain. Those nodes are called remote domain
>   "heads", and are selected in such a way that no node in the cluster
>   is more than one indirect monitoring hop away. Because of this, each
>   node, apart from monitoring the members of its local domain, will also
>   typically monitor sqrt(N) remote head nodes.

What if each node includes its local domain record in the DSC_REQ messages
as well? These would be sent at 60 second intervals, meaning that a network
partitioning such as the one you described would be detected after 1 min
worst case, but maybe that's too slow?
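
For reference, the head selection described in the quoted bullet could look roughly like this. It is a sketch under stated assumptions: node identities are plain integers, N counts the node itself, and when no domain record has been received from a head, that head is assumed to cover the next sqrt(N) - 1 nodes. The function and parameter names are hypothetical:

```python
import math


def select_monitored(self_id, node_ids, domains=None):
    """Split the ring seen from self_id into local members and remote heads.

    domains optionally maps a head's id to the members it reported in its
    domain record; without a record, a head is assumed to cover the next
    sqrt(N) - 1 nodes.
    """
    ordered = sorted(node_ids)
    start = ordered.index(self_id)
    ring = ordered[start + 1:] + ordered[:start]  # all others, downstream order
    m = max(math.isqrt(len(ordered)) - 1, 0)
    local, heads = ring[:m], []
    i = m
    while i < len(ring):
        head = ring[i]
        heads.append(head)
        if domains and head in domains:
            covered = set(domains[head])
        else:
            covered = set(ring[i + 1:i + 1 + m])
        i += 1
        # Skip nodes that this head already monitors on our behalf.
        while i < len(ring) and ring[i] in covered:
            i += 1
    return local, heads


local, heads = select_monitored(1, range(1, 401))
print(len(local), len(heads))
```

Every node that is neither in the local domain nor a head sits in some head's domain, so it is at most one indirect monitoring hop away, and for 400 nodes the two lists together hold the 38 neighbors mentioned in the description.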

> 
> - As an optimization, local list status, domain status and domain
>   records are marked with a generation number. This saves senders from
>   unnecessarily conveying unchanged domain records, and receivers from
>   performing unneeded re-adaptations of their node monitoring list, such
>   as re-assigning domain heads.
> 
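
The generation number optimization in the quoted bullet can be illustrated with a small sketch. All names here are hypothetical and the patch may track generations differently; the sketch only shows how a counter lets both sides skip work when nothing has changed:

```python
class DomainRecord:
    """A node's monitoring domain plus a generation counter."""

    def __init__(self, members=()):
        self.members = tuple(members)
        self.gen = 0

    def update(self, members):
        """Bump the generation only when the member list really changed."""
        members = tuple(members)
        if members != self.members:
            self.members = members
            self.gen += 1


def payload_for(record, acked_gen):
    """Sender side: omit the record if the peer already has this generation."""
    return None if acked_gen == record.gen else record


class Receiver:
    """Receiver side: remembers the last generation applied per sender."""

    def __init__(self):
        self.seen = {}            # sender id -> last applied generation
        self.reassignments = 0

    def on_record(self, sender, record):
        if self.seen.get(sender) == record.gen:
            return False          # unchanged: keep the monitoring list as is
        self.seen[sender] = record.gen
        self.reassignments += 1   # stand-in for re-assigning domain heads
        return True


rec = DomainRecord([2, 3, 4])
rx = Receiver()
print(rx.on_record(1, rec), rx.on_record(1, rec))
```
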
> Signed-off-by: Jon Maloy 

I like this, but I think it would be good if it's made configurable with
either a sysctl or bearer parameter.

//Erik
