Re: [tipc-discussion] [PATCH net-next 3/3] tipc: reduce risk of user starvation during link congestion

2016-11-21 Thread Jon Maloy
I am having some new doubts about our current link congestion criteria. See 
below.

///jon


> -Original Message-
> From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> Sent: Monday, 21 November, 2016 09:57
> To: tipc-discussion@lists.sourceforge.net; Parthasarathy Bhuvaragan
> ; Ying Xue
> ; Jon Maloy 
> Cc: ma...@donjonn.com; thompa@gmail.com
> Subject: [PATCH net-next 3/3] tipc: reduce risk of user starvation during link
> congestion
> 
> The socket code currently handles link congestion by either blocking
> and trying to send again when the congestion has abated, or just
> returning -EAGAIN to the user and letting them retry later.
> 
> This mechanism is prone to starvation, because the wakeup algorithm is
> non-atomic. During the time the link issues a wakeup signal, until the
> socket wakes up and re-attempts sending, other senders may have come
> in between and occupied the free buffer space in the link. This in turn
> may lead to a socket having to make many send attempts before it is
> successful. In extremely loaded systems we have observed latency times
> of several seconds before a low-priority socket is able to send out a
> message.
> 
> In this commit, we simplify this mechanism and reduce the risk of the
> described scenario happening. When a message is sent to a congested
> link, we now let it keep the message in the wakeup-item that it has to
> create anyway, and immediately add it to the link's send queue when
> enough space has been freed up. Only when this is done do we issue a
> wakeup signal to the socket, which can now immediately go on and send
> the next message, if any.
> 
> The fact that a socket now can consider a message sent even when the
> link returns a congestion code means that the sending socket code can
> be simplified. Also, since this is a good opportunity to get rid of the
> obsolete 'mtu change' condition in the three socket send functions, we
> now choose to refactor those functions completely.
> 
> Signed-off-by: Jon Maloy 
> ---
>  net/tipc/bcast.c  |   2 +-
>  net/tipc/link.c   |  90 --
>  net/tipc/msg.h|   8 +-
>  net/tipc/node.c   |   2 +-
>  net/tipc/socket.c | 346 --
>  5 files changed, 209 insertions(+), 239 deletions(-)
> 
> diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
> index aa1babb..1a56cab 100644
> --- a/net/tipc/bcast.c
> +++ b/net/tipc/bcast.c
> @@ -174,7 +174,7 @@ static void tipc_bcbase_xmit(struct net *net, struct sk_buff_head *xmitq)
>   *and to identified node local sockets
>   * @net: the applicable net namespace
>   * @list: chain of buffers containing message
> - * Consumes the buffer chain, except when returning -ELINKCONG
> + * Consumes the buffer chain.
>   * Returns 0 if success, otherwise errno: -ELINKCONG,-EHOSTUNREACH,-EMSGSIZE
>   */
>  int tipc_bcast_xmit(struct net *net, struct sk_buff_head *list)
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 1055164..db3dcab 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -774,60 +774,77 @@ int tipc_link_timeout(struct tipc_link *l, struct sk_buff_head *xmitq)
>   return rc;
>  }
> 
> +static bool tipc_link_congested(struct tipc_link *l, int imp)
> +{
> + int i;
> +
> + if (skb_queue_empty(&l->backlogq))
> + return false;
> +
> + /* Match msg importance against this and all higher backlog limits: */
> + for (i = imp; i <= TIPC_SYSTEM_IMPORTANCE; i++) {
> + if (unlikely(l->backlog[i].len >= l->backlog[i].limit))
> + return true;
> + }
> + return false;
> +}

After introducing this patch I ran into a problem with spontaneous and 
premature connection disconnects in the ptty test program.
After some troubleshooting I realized that with the new algorithm, 
high-priority shutdown messages sometimes bypass messages which are reported 
"delivered" but are still waiting in the wakeup pending queue. This was easy 
to fix by adding a potentially blocking tipc_wait_for_cond() call in 
__tipc_shutdown().

But I also realized that we may have (and always have had) another starvation 
scenario, which is not addressed by this patch. Gergely's algorithm (above) 
ensures that a socket cannot be starved by lower- or equal-priority users, but 
starvation may still be caused by higher-priority users. While a lower-priority 
message is waiting in the wakeup queue, higher-priority users may drive their 
own levels into congestion, potentially preventing the lower-priority user 
from sending its message for a long time.
Given this, wouldn't it be better to go back to the algorithm I originally 
suggested, where we only check the sender's own level and nothing else?
At a given level, messages would then be delivered in strict 
first-come-first-served order, and no starvation could occur.

What do you think?

///jon

[tipc-discussion] [PATCH net-next 3/3] tipc: reduce risk of user starvation during link congestion

2016-11-21 Thread Jon Maloy
The socket code currently handles link congestion by either blocking
and trying to send again when the congestion has abated, or just
returning -EAGAIN to the user and letting them retry later.

This mechanism is prone to starvation, because the wakeup algorithm is
non-atomic. During the time the link issues a wakeup signal, until the
socket wakes up and re-attempts sending, other senders may have come
in between and occupied the free buffer space in the link. This in turn
may lead to a socket having to make many send attempts before it is
successful. In extremely loaded systems we have observed latency times
of several seconds before a low-priority socket is able to send out a
message.

In this commit, we simplify this mechanism and reduce the risk of the
described scenario happening. When a message is sent to a congested
link, we now let it keep the message in the wakeup-item that it has to
create anyway, and immediately add it to the link's send queue when
enough space has been freed up. Only when this is done do we issue a
wakeup signal to the socket, which can now immediately go on and send
the next message, if any.

The fact that a socket now can consider a message sent even when the
link returns a congestion code means that the sending socket code can
be simplified. Also, since this is a good opportunity to get rid of the
obsolete 'mtu change' condition in the three socket send functions, we
now choose to refactor those functions completely.

Signed-off-by: Jon Maloy 
---
 net/tipc/bcast.c  |   2 +-
 net/tipc/link.c   |  90 --
 net/tipc/msg.h|   8 +-
 net/tipc/node.c   |   2 +-
 net/tipc/socket.c | 346 --
 5 files changed, 209 insertions(+), 239 deletions(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index aa1babb..1a56cab 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -174,7 +174,7 @@ static void tipc_bcbase_xmit(struct net *net, struct sk_buff_head *xmitq)
  *and to identified node local sockets
  * @net: the applicable net namespace
  * @list: chain of buffers containing message
- * Consumes the buffer chain, except when returning -ELINKCONG
+ * Consumes the buffer chain.
  * Returns 0 if success, otherwise errno: -ELINKCONG,-EHOSTUNREACH,-EMSGSIZE
  */
 int tipc_bcast_xmit(struct net *net, struct sk_buff_head *list)
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 1055164..db3dcab 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -774,60 +774,77 @@ int tipc_link_timeout(struct tipc_link *l, struct sk_buff_head *xmitq)
return rc;
 }
 
+static bool tipc_link_congested(struct tipc_link *l, int imp)
+{
+   int i;
+
+   if (skb_queue_empty(&l->backlogq))
+   return false;
+
+   /* Match msg importance against this and all higher backlog limits: */
+   for (i = imp; i <= TIPC_SYSTEM_IMPORTANCE; i++) {
+   if (unlikely(l->backlog[i].len >= l->backlog[i].limit))
+   return true;
+   }
+   return false;
+}
+
 /**
  * link_schedule_user - schedule a message sender for wakeup after congestion
- * @link: congested link
+ * @l: congested link
  * @list: message that was attempted sent
  * Create pseudo msg to send back to user when congestion abates
- * Does not consume buffer list
+ * Consumes buffer list
  */
-static int link_schedule_user(struct tipc_link *link, struct sk_buff_head *list)
+static int link_schedule_user(struct tipc_link *l, struct sk_buff_head *list)
 {
-   struct tipc_msg *msg = buf_msg(skb_peek(list));
-   int imp = msg_importance(msg);
-   u32 oport = msg_origport(msg);
-   u32 addr = tipc_own_addr(link->net);
+   struct tipc_msg *hdr = buf_msg(skb_peek(list));
+   int imp = msg_importance(hdr);
+   u32 oport = msg_origport(hdr);
+   u32 dnode = tipc_own_addr(l->net);
struct sk_buff *skb;
+   struct sk_buff_head *pkts;
 
/* This really cannot happen...  */
if (unlikely(imp > TIPC_CRITICAL_IMPORTANCE)) {
-   pr_warn("%s<%s>, send queue full", link_rst_msg, link->name);
+   pr_warn("%s<%s>, send queue full", link_rst_msg, l->name);
return -ENOBUFS;
}
-   /* Non-blocking sender: */
-   if (TIPC_SKB_CB(skb_peek(list))->wakeup_pending)
-   return -ELINKCONG;
 
/* Create and schedule wakeup pseudo message */
skb = tipc_msg_create(SOCK_WAKEUP, 0, INT_H_SIZE, 0,
- addr, addr, oport, 0, 0);
+ dnode, l->addr, oport, 0, 0);
if (!skb)
return -ENOBUFS;
-   TIPC_SKB_CB(skb)->chain_sz = skb_queue_len(list);
-   TIPC_SKB_CB(skb)->chain_imp = imp;
-   skb_queue_tail(&link->wakeupq, skb);
-   link->stats.link_congs++;
+   msg_set_dest_droppable(buf_msg(skb), true);
+   skb_queue_tail(&l->wakeupq, skb);
+
+   /* Keep the packet chain until we can send it */