Hi Hoang,

After more pondering, I realize that my suggestion below was not a good solution either. In particular, we cannot have a transition rcast-to-bcast (bulk-to-publ/withdraw) without sending sync to all destinations, something we want to avoid.

I think our problem is that we (well, me...) are trying to solve two different problems at the same time, using a mechanism that was only made for one of them. So, what if we separate the problems again?

1) We send all bulk updates to all nodes, irrespective of whether they support broadcast binding table updates or not, as mandatory rcast. No bcast/rcast sync needs to be involved, and I don't think it is when we use 'mandatory'. tipc_node_broadcast() can be eliminated.

2) When we need to send out publish/withdraw, we just continue to send those as mandatory rcast to the nodes not supporting binding table bcast. I believe our current tipc_mcast_xmit(), given a list of legacy nodes, should do the trick even here.

3) To those nodes supporting broadcast updates we send all publish/withdraw as mandatory broadcast. (A sketch covering points 2) and 3) follows right after this list.)
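Roughly like this -- a sketch only, not tested. It reuses tipc_mcast_xmit(), struct tipc_mc_method and struct tipc_nlist as they exist today, assumes the code lives in node.c (like in Hoang's patch below) so the node list and capabilities are reachable, and assumes TIPC_MCAST_RBCTL is the capability bit separating the two groups; tipc_named_xmit_split() is just an illustrative name:

static int tipc_named_xmit_split(struct net *net, struct sk_buff *skb)
{
        /* 'mandatory' prevents the link layer from second-guessing us */
        struct tipc_mc_method rc_method = {.rcast = true, .mandatory = true};
        struct tipc_mc_method bc_method = {.rcast = false, .mandatory = true};
        struct tipc_nlist legacy, capable;
        struct sk_buff_head xmitq;
        struct tipc_node *n;
        struct sk_buff *_skb;
        u16 cong_link_cnt = 0;
        int rc1 = 0, rc2 = 0;

        tipc_nlist_init(&legacy, tipc_own_addr(net));
        tipc_nlist_init(&capable, tipc_own_addr(net));

        /* Sort the peers into legacy and bcast-capable destination lists */
        rcu_read_lock();
        list_for_each_entry_rcu(n, tipc_nodes(net), list) {
                if (in_own_node(net, n->addr) || !node_is_up(n))
                        continue;
                if (n->capabilities & TIPC_MCAST_RBCTL)
                        tipc_nlist_add(&capable, n->addr);
                else
                        tipc_nlist_add(&legacy, n->addr);
        }
        rcu_read_unlock();

        /* 2) Mandatory rcast to the legacy nodes */
        if (legacy.remote) {
                _skb = skb_clone(skb, GFP_ATOMIC);
                if (_skb) {
                        __skb_queue_head_init(&xmitq);
                        __skb_queue_tail(&xmitq, _skb);
                        rc1 = tipc_mcast_xmit(net, &xmitq, &rc_method,
                                              &legacy, &cong_link_cnt);
                        __skb_queue_purge(&xmitq);
                }
        }

        /* 3) Mandatory bcast to the nodes supporting binding table bcast */
        if (capable.remote) {
                __skb_queue_head_init(&xmitq);
                __skb_queue_tail(&xmitq, skb);
                rc2 = tipc_mcast_xmit(net, &xmitq, &bc_method,
                                      &capable, &cong_link_cnt);
                __skb_queue_purge(&xmitq);
        } else {
                kfree_skb(skb);
        }

        tipc_nlist_purge(&legacy);
        tipc_nlist_purge(&capable);
        return rc1 ?: rc2;
}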
Now it remains to solve the synchronization between 1) and 3) type messages. This we have discussed before, and we know the solution is a "NOT_LAST" bit.

This requires a slightly smarter tipc_named_rcv() function, which can sort between non-sequenced (broadcast) and unicast messages, but also refuses to dequeue any messages from the namedq until the 'not_last' bit is zero in an arrived bulk message. But no new synchronization queue should be needed.
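Something along these lines -- a sketch only. msg_is_bulk() and msg_not_last() are hypothetical accessors for the bulk message type and the proposed NOT_LAST bit, tipc_named_process() stands in for the existing per-item name table update, and the 'bulk_done' state would in reality have to live somewhere persistent, e.g. per peer node:

void tipc_named_rcv(struct net *net, struct sk_buff_head *namedq,
                    bool *bulk_done)
{
        struct sk_buff *skb;

        spin_lock_bh(&namedq->lock);
        /* Has the terminating bulk message (not_last == 0) been queued? */
        if (!*bulk_done) {
                skb_queue_walk(namedq, skb) {
                        struct tipc_msg *hdr = buf_msg(skb);

                        if (msg_is_bulk(hdr) && !msg_not_last(hdr)) {
                                *bulk_done = true;
                                break;
                        }
                }
        }
        /* Deliver nothing until then; no extra deferred queue is needed */
        while (*bulk_done && (skb = __skb_dequeue(namedq)))
                tipc_named_process(net, skb);
        spin_unlock_bh(&namedq->lock);
}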
I think we now have a viable solution that solves all our problems without too much new code.

Regards
///jon

Sent from my Samsung Galaxy smartphone.
-------- Original message --------
From: Jon Maloy <[email protected]>
Date: 2020-05-12 17:21 (GMT-05:00)
To: [email protected], [email protected], Xin Long <[email protected]>, Ying Xue <[email protected]>, Shuang Li <[email protected]>
Subject: [tipc-discussion] Fwd: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast

cc to tipc-discussion and others.
///jon

-------- Forwarded Message --------
Subject: Re: FW: [PATCH 2/2] tipc: update a binding service via broadcast
Date: Tue, 12 May 2020 14:38:29 -0400
From: Jon Maloy <[email protected]>
To: Hoang Huu Le <[email protected]>, [email protected] <[email protected]>

Hi Hoang,

I had to re-read the discussion we had back in November to refresh my memory about this. To me v2 looks like a better base for further development than the most recent version, so I am sorry to have wasted your time here. I am still convinced that the mcast/rcast sync protocol we developed can solve almost all our problems if we use it right.

What about the following:

Sending side:
- Even the initial bulk is sent via tipc_mcast_xmit(), but with only one destination node and as "mandatory rcast". The broadcast link may still choose to send it as bcast in case rcast has been manually disabled by the user, but the surplus messages will be dropped on the non-destination nodes. (In the case of legacy nodes this is certain; in the case of newer nodes we must make sure it happens by setting the 'destnode' field in the header and always checking for it in tipc_named_rcv().) I think this is a weird and rare case where we can accept the extra overhead of broadcast. Under all normal conditions rcast will be used here. (A sketch follows right after these two points.)

- Regular PUBLISH/WITHDRAW messages are sent as "mandatory bcast", or, if some nodes don't support TIPC_MCAST_RBCTL, as "mandatory rcast". The broadcast link may still choose to send it as rcast, but this is nothing we need to worry about as long as the synchronization mechanism is active. Under all normal conditions bcast will be selected when possible, since all nodes are destinations.
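For the bulk sending that could look roughly as follows -- a sketch only. msg_set_destnode() and tipc_mcast_xmit() exist today; tipc_named_bulk_xmit() is just an illustrative name, and 'pkts' is assumed to hold the ready-built bulk messages:

static int tipc_named_bulk_xmit(struct net *net, struct sk_buff_head *pkts,
                                u32 dnode)
{
        /* One destination, and rcast made obligatory */
        struct tipc_mc_method method = {.rcast = true, .mandatory = true};
        struct tipc_nlist dests;
        struct sk_buff *skb;
        u16 cong_link_cnt = 0;
        int rc;

        /* Stamp 'destnode' so receivers can drop surplus bcast copies */
        skb_queue_walk(pkts, skb)
                msg_set_destnode(buf_msg(skb), dnode);

        tipc_nlist_init(&dests, tipc_own_addr(net));
        tipc_nlist_add(&dests, dnode);
        rc = tipc_mcast_xmit(net, pkts, &method, &dests, &cong_link_cnt);
        tipc_nlist_purge(&dests);
        return rc;
}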
Receiving side:

- We need a deferred queue, but last time I missed that we don't need one per node; it is sufficient to have a global one, as you were suggesting. So to me the code is ok here.

- As mentioned earlier, tipc_named_rcv() only accepts messages where destnode is "self" (bulk) or zero (publish/withdraw). (See the sketch below.)
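I.e. a test like this -- sketch only, tipc_named_dest_ok() is an illustrative name:

static bool tipc_named_dest_ok(struct net *net, struct sk_buff *skb)
{
        u32 dnode = msg_destnode(buf_msg(skb));

        /* Zero: publish/withdraw meant for everybody; self: bulk meant
         * for us. Anything else is a surplus bcast copy meant for
         * another node, and must be dropped.
         */
        return !dnode || dnode == tipc_own_addr(net);
}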
The point with all this is that all messages will now be subject to the synchronization mechanism, so the bulk messages are guaranteed to be delivered in order and ahead of individual publications/withdraws, and even topology changes will almost never lead to a change of sending method and a need for synchronization.

We must make one change to the broadcast protocol though: even "mandatory" messages must be made subject to the synchronization mechanism, something that doesn't seem to be the case now. (Or maybe we don't need to set "mandatory" at all? This needs to be checked.)

The advantage of this method is simplicity:

- We don't need any separate synchronization mechanism for NAME_DISTR messages at all; we just use what is already there.

The disadvantage is that there will be a lot of SYN messages sent if we are not careful:

- Every time a new node comes up, the other nodes will send it a bcast SYN before they send the bulk, because they believe they have just switched from bcast (normal mode) to rcast (bulk mode). This one is in reality unnecessary, since we can be sure that the new node has not been sent any previous bcast that needs to be synched with.

- At the first publication/withdraw sent after the bulk, the neighbor nodes will send an rcast SYN to all other nodes, because they just "switched" back from rcast to bcast. In this case we really do need to send an rcast SYN to the new node that came up, since this is the only one where there is a risk of a race. This message in reality serves as an "end-of-bulk" message, but is handled like any other SYN by the receiving tipc_mcast_filter_msg().

So, to avoid this the broadcast protocol needs to be able to recognize NAME_DISTR messages and treat them slightly differently from the others, or it needs to be told what to do by the name_distr.c code via the API. Maybe a couple of new fields in struct tipc_mc_method? (A sketch follows below.)
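For instance something like this -- a sketch only. The first three fields are what struct tipc_mc_method contains today; the last two are purely hypothetical knobs that name_distr.c could use to suppress the redundant SYNs:

struct tipc_mc_method {
        bool rcast;             /* rcast (not bcast) was used at prev. xmit */
        bool mandatory;         /* method was forced by the user/caller */
        unsigned long expires;  /* re-evaluate method when past this time */
        bool no_sync;           /* hypothetical: never send bcast/rcast SYNs */
        bool named;             /* hypothetical: NAME_DISTR semantics apply */
};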
What do you think?

Regards
///jon

On 5/12/20 6:22 AM, Hoang Huu Le wrote:
> Just forward the patch I mentioned.
>
> -----Original Message-----
> From: Hoang Le <[email protected]>
> Sent: Tuesday, November 19, 2019 5:01 PM
> To: [email protected]; [email protected]; [email protected]
> Subject: [PATCH 2/2] tipc: update a binding service via broadcast
>
> Currently, updating the binding table (adding a service binding to the
> name table/withdrawing a service binding) is being sent over replicast.
> However, if we are scaling up clusters to more than 100 nodes/containers,
> this method is less efficient because of looping through the nodes in a
> cluster one by one.
>
> It is worth using broadcast to update a binding service. The binding
> table is then updated in all nodes in one shot.
>
> The mechanism is backwards compatible, as sync messages are silently
> dropped.
>
> v2: resolve synchronization problem when switching from unicast to
> broadcast
>
> Signed-off-by: Hoang Le <[email protected]>
> ---
>  net/tipc/bcast.c      |  3 ++-
>  net/tipc/link.c       |  6 ++++++
>  net/tipc/name_table.c | 33 ++++++++++++++++++++++++++++++---
>  net/tipc/name_table.h |  4 ++++
>  net/tipc/node.c       | 32 ++++++++++++++++++++++++++++++++
>  net/tipc/node.h       |  2 ++
>  6 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> index e06f05d55534..44ed481fec47 100644
> --- a/net/tipc/bcast.c
> +++ b/net/tipc/bcast.c
> @@ -324,7 +324,8 @@ static int tipc_mcast_send_sync(struct net *net, struct sk_buff *skb,
>          hdr = buf_msg(skb);
>          if (msg_user(hdr) == MSG_FRAGMENTER)
>                  hdr = msg_inner_hdr(hdr);
> -        if (msg_type(hdr) != TIPC_MCAST_MSG)
> +        if (msg_user(hdr) != NAME_DISTRIBUTOR &&
> +            msg_type(hdr) != TIPC_MCAST_MSG)
>                  return 0;
>
>          /* Allocate dummy message */
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index fb72031228c9..a2e9a64d5a0f 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1190,6 +1190,8 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>                              struct sk_buff_head *inputq)
>  {
>          struct sk_buff_head *mc_inputq = l->bc_rcvlink->inputq;
> +        struct name_table *nt = tipc_name_table(l->net);
> +        struct sk_buff_head *defnq = &nt->defer_namedq;
>          struct tipc_msg *hdr = buf_msg(skb);
>
>          switch (msg_user(hdr)) {
> @@ -1211,6 +1213,10 @@ static bool tipc_data_input(struct tipc_link *l, struct sk_buff *skb,
>          case NAME_DISTRIBUTOR:
>                  l->bc_rcvlink->state = LINK_ESTABLISHED;
>                  skb_queue_tail(l->namedq, skb);
> +
> +                spin_lock_bh(&defnq->lock);
> +                tipc_mcast_filter_msg(l->net, defnq, l->namedq);
> +                spin_unlock_bh(&defnq->lock);
>                  return true;
>          case MSG_BUNDLER:
>          case TUNNEL_PROTOCOL:
> diff --git a/net/tipc/name_table.c b/net/tipc/name_table.c
> index 66a65c2cdb23..593dcd11357f 100644
> --- a/net/tipc/name_table.c
> +++ b/net/tipc/name_table.c
> @@ -615,9 +615,11 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>          struct tipc_net *tn = tipc_net(net);
>          struct publication *p = NULL;
>          struct sk_buff *skb = NULL;
> +        bool rcast;
>
>          spin_lock_bh(&tn->nametbl_lock);
> +        rcast = nt->rcast;
>
>          if (nt->local_publ_count >= TIPC_MAX_PUBL) {
>                  pr_warn("Bind failed, max limit %u reached\n", TIPC_MAX_PUBL);
>                  goto exit;
> @@ -632,8 +634,18 @@ struct publication *tipc_nametbl_publish(struct net *net, u32 type, u32 lower,
>  exit:
>          spin_unlock_bh(&tn->nametbl_lock);
> -        if (skb)
> -                tipc_node_broadcast(net, skb);
> +        if (skb) {
> +                /* Use broadcast if all nodes support broadcast NAME_DISTR */
> +                if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +                        tipc_node_broadcast_named_publish(net, skb, &rcast);
> +                        spin_lock_bh(&tn->nametbl_lock);
> +                        nt->rcast = rcast;
> +                        spin_unlock_bh(&tn->nametbl_lock);
> +                } else {
> +                        /* Otherwise, be backwards compatible */
> +                        tipc_node_broadcast(net, skb);
> +                }
> +        }
>          return p;
>  }
> @@ -648,8 +660,10 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>          u32 self = tipc_own_addr(net);
>          struct sk_buff *skb = NULL;
>          struct publication *p;
> +        bool rcast;
>
>          spin_lock_bh(&tn->nametbl_lock);
> +        rcast = nt->rcast;
>          p = tipc_nametbl_remove_publ(net, type, lower, upper, self, key);
>          if (p) {
> @@ -664,7 +678,16 @@ int tipc_nametbl_withdraw(struct net *net, u32 type, u32 lower,
>          spin_unlock_bh(&tn->nametbl_lock);
>          if (skb) {
> -                tipc_node_broadcast(net, skb);
> +                /* Use broadcast if all nodes support broadcast NAME_DISTR */
> +                if (tipc_net(net)->capabilities & TIPC_MCAST_RBCTL) {
> +                        tipc_node_broadcast_named_publish(net, skb, &rcast);
> +                        spin_lock_bh(&tn->nametbl_lock);
> +                        nt->rcast = rcast;
> +                        spin_unlock_bh(&tn->nametbl_lock);
> +                } else {
> +                        /* Otherwise, be backwards compatible */
> +                        tipc_node_broadcast(net, skb);
> +                }
>                  return 1;
>          }
>          return 0;
> @@ -746,6 +769,9 @@ int tipc_nametbl_init(struct net *net)
>          INIT_LIST_HEAD(&nt->cluster_scope);
>          rwlock_init(&nt->cluster_scope_lock);
>          tn->nametbl = nt;
> +        /* 'bulk' update messages via unicast */
> +        nt->rcast = true;
> +        skb_queue_head_init(&nt->defer_namedq);
>          spin_lock_init(&tn->nametbl_lock);
>          return 0;
>  }
> @@ -784,6 +810,7 @@ void tipc_nametbl_stop(struct net *net)
>           * publications, then release the name table
>           */
>          spin_lock_bh(&tn->nametbl_lock);
> +        skb_queue_purge(&nt->defer_namedq);
>          for (i = 0; i < TIPC_NAMETBL_SIZE; i++) {
>                  if (hlist_empty(&nt->services[i]))
>                          continue;
> diff --git a/net/tipc/name_table.h b/net/tipc/name_table.h
> index f79066334cc8..b8cdf2a29d48 100644
> --- a/net/tipc/name_table.h
> +++ b/net/tipc/name_table.h
> @@ -95,6 +95,8 @@ struct publication {
>   *         - used by name_distr to send bulk updates to new nodes
>   *         - used by name_distr during re-init of name table
>   * @local_publ_count: number of publications issued by this node
> + * @defer_namedq: temporary queue for 'synching' updates
> + * @rcast: previous method used to publish/withdraw a service
>   */
>  struct name_table {
>          struct hlist_head services[TIPC_NAMETBL_SIZE];
> @@ -102,6 +104,8 @@ struct name_table {
>          struct list_head cluster_scope;
>          rwlock_t cluster_scope_lock;
>          u32 local_publ_count;
> +        struct sk_buff_head defer_namedq;
> +        bool rcast;
>  };
>
>  int tipc_nl_name_table_dump(struct sk_buff *skb, struct netlink_callback *cb);
> diff --git a/net/tipc/node.c b/net/tipc/node.c
> index aaf595613e6e..b058647fa78b 100644
> --- a/net/tipc/node.c
> +++ b/net/tipc/node.c
> @@ -2981,3 +2981,35 @@ void tipc_node_pre_cleanup_net(struct net *exit_net)
>          }
>          rcu_read_unlock();
>  }
> +
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +                                      bool *rcast)
> +{
> +        struct tipc_mc_method method = {.rcast = *rcast};
> +        struct sk_buff_head xmitq;
> +        struct tipc_nlist dests;
> +        struct tipc_node *n;
> +        u16 cong_link_cnt;
> +        int rc = 0;
> +
> +        __skb_queue_head_init(&xmitq);
> +        __skb_queue_tail(&xmitq, skb);
> +
> +        tipc_nlist_init(&dests, tipc_own_addr(net));
> +        rcu_read_lock();
> +        list_for_each_entry_rcu(n, tipc_nodes(net), list) {
> +                if (in_own_node(net, n->addr))
> +                        continue;
> +                if (!node_is_up(n))
> +                        continue;
> +                tipc_nlist_add(&dests, n->addr);
> +        }
> +        rcu_read_unlock();
> +
> +        rc = tipc_mcast_xmit(net, &xmitq, &method, &dests, &cong_link_cnt);
> +        *rcast = method.rcast;
> +
> +        tipc_nlist_purge(&dests);
> +        __skb_queue_purge(&xmitq);
> +        return rc;
> +}
> diff --git a/net/tipc/node.h b/net/tipc/node.h
> index a6803b449a2c..d7d19f9932b1 100644
> --- a/net/tipc/node.h
> +++ b/net/tipc/node.h
> @@ -124,4 +124,6 @@ int tipc_nl_node_set_key(struct sk_buff *skb, struct genl_info *info);
>  int tipc_nl_node_flush_key(struct sk_buff *skb, struct genl_info *info);
>  #endif
>  void tipc_node_pre_cleanup_net(struct net *exit_net);
> +int tipc_node_broadcast_named_publish(struct net *net, struct sk_buff *skb,
> +                                      bool *rcast);
>  #endif
[email protected]https://lists.sourceforge.net/lists/listinfo/tipc-discussion
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion