Re: [PATCH 15/28] netvm: network reserve infrastructure

2008-02-23 Thread Mike Snitzer
On Wed, Feb 20, 2008 at 9:46 AM, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> Provide the basic infrastructure to reserve and charge/account network memory.
...

>  Index: linux-2.6/net/core/sock.c
>  ===================================================================
>  --- linux-2.6.orig/net/core/sock.c
>  +++ linux-2.6/net/core/sock.c
...
>  +/**
>  + * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
>  + * @socks: number of new %SOCK_MEMALLOC sockets
>  + * @tx_reserve_pages: number of pages to (un)reserve for TX
>  + *
>  + * This function adjusts the memalloc reserve based on system demand.
>  + * The RX reserve is a limit, and only added once, not for each socket.
>  + *
>  + * NOTE:
>  + *    @tx_reserve_pages is an upper-bound of memory used for TX hence
>  + *    we need not account the pages like we do for RX pages.
>  + */
>  +int sk_adjust_memalloc(int socks, long tx_reserve_pages)
>  +{
>  +   int nr_socks;
>  +   int err;
>  +
>  +   err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
>  +   if (err)
>  +   return err;
>  +
>  +   nr_socks = atomic_read(&memalloc_socks);
>  +   if (!nr_socks && socks > 0)
>  +   err = mem_reserve_connect(&net_reserve, &mem_reserve_root);
>  +   nr_socks = atomic_add_return(socks, &memalloc_socks);
>  +   if (!nr_socks && socks)
>  +   err = mem_reserve_disconnect(&net_reserve);
>  +
>  +   if (err)
>  +   mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
>  +
>  +   return err;
>  +}

EXPORT_SYMBOL_GPL(sk_adjust_memalloc); is needed here to build sunrpc
as a module.
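
For reference, a sketch of the suggested one-liner in net/core/sock.c, placed
directly after the closing brace of sk_adjust_memalloc() (mirroring the
EXPORT_SYMBOL_GPL(sk_set_memalloc) already present in this patch):

	/* let modular users such as sunrpc call sk_adjust_memalloc() */
	EXPORT_SYMBOL_GPL(sk_adjust_memalloc);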

Mike


Re: [PATCH 15/28] netvm: network reserve infrastructure

2008-02-23 Thread Andrew Morton
On Wed, 20 Feb 2008 15:46:25 +0100 Peter Zijlstra <[EMAIL PROTECTED]> wrote:

> Provide the basic infrastructure to reserve and charge/account network memory.
> 
> We provide the following reserve tree:
> 
> 1)  total network reserve
> 2)    network TX reserve
> 3)      protocol TX pages
> 4)    network RX reserve
> 5)      SKB data reserve
> 
> [1] is used to make all the network reserves a single subtree, for easy
> manipulation.
> 
> [2] and [4] are merely for aesthetic reasons.
> 
> The TX pages reserve [3] is assumed to be bounded because it is the upper
> bound on the memory that can be used for sending pages (not quite true, but
> good enough).
> 
> The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
> against in the fallback path.
> 
> The consumers for these reserves are sockets marked with:
>   SOCK_MEMALLOC
> 
> Such sockets are to be used to service the VM (i.e. to swap over). They
> must be handled kernel-side; exposing such a socket to user-space is a BUG.
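
For illustration only: a kernel-side consumer, say a driver swapping over a
network block device, would mark its transport socket with the new helper.
The helper and flag come from this patch; the surrounding driver function and
its name are hypothetical.

	#include <linux/net.h>		/* struct socket */
	#include <net/sock.h>		/* sk_set_memalloc(), SOCK_MEMALLOC */

	static int swapdev_mark_socket(struct socket *sock)
	{
		struct sock *sk = sock->sk;
		int err;

		/* reserve memory, set SOCK_MEMALLOC and __GFP_MEMALLOC */
		err = sk_set_memalloc(sk);
		if (err < 0)
			return err;

		/*
		 * The socket may now dip into the network reserve under
		 * memory pressure; it must never be handed to user-space.
		 */
		return 0;
	}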
> 
> +/**
> + *   sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
> + *   @socks: number of new %SOCK_MEMALLOC sockets
> + *   @tx_reserve_pages: number of pages to (un)reserve for TX
> + *
> + *   This function adjusts the memalloc reserve based on system demand.
> + *   The RX reserve is a limit, and only added once, not for each socket.
> + *
> + *   NOTE:
> + *  @tx_reserve_pages is an upper-bound of memory used for TX hence
> + *  we need not account the pages like we do for RX pages.
> + */
> +int sk_adjust_memalloc(int socks, long tx_reserve_pages)
> +{
> + int nr_socks;
> + int err;
> +
> + err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
> + if (err)
> + return err;
> +
> + nr_socks = atomic_read(&memalloc_socks);
> + if (!nr_socks && socks > 0)
> + err = mem_reserve_connect(&net_reserve, &mem_reserve_root);

This looks like it should have some locking?

> + nr_socks = atomic_add_return(socks, &memalloc_socks);
> + if (!nr_socks && socks)
> + err = mem_reserve_disconnect(&net_reserve);

Or does that try to make up for it?  Still looks fishy.
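
For illustration, one way to close the check-then-connect window would be to
serialize the whole adjustment. The mutex below is invented for this sketch
and is not part of the patch; the error handling is kept exactly as in the
original, so only the serialization differs.

	static DEFINE_MUTEX(memalloc_socks_lock);	/* hypothetical */

	int sk_adjust_memalloc(int socks, long tx_reserve_pages)
	{
		int nr_socks, err;

		mutex_lock(&memalloc_socks_lock);

		err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);
		if (err)
			goto out;

		/* only the first SOCK_MEMALLOC socket connects the reserve */
		nr_socks = atomic_read(&memalloc_socks);
		if (!nr_socks && socks > 0)
			err = mem_reserve_connect(&net_reserve, &mem_reserve_root);

		/* and only dropping back to zero disconnects it again */
		nr_socks = atomic_add_return(socks, &memalloc_socks);
		if (!nr_socks && socks)
			err = mem_reserve_disconnect(&net_reserve);

		if (err)
			mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
	out:
		mutex_unlock(&memalloc_socks_lock);
		return err;
	}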

> + if (err)
> + mem_reserve_pages_add(&net_tx_pages, -tx_reserve_pages);
> +
> + return err;
> +}
> +
> +/**
> + *   sk_set_memalloc - sets %SOCK_MEMALLOC
> + *   @sk: socket to set it on
> + *
> + *   Set %SOCK_MEMALLOC on a socket and increase the memalloc reserve
> + *   accordingly.
> + */
> +int sk_set_memalloc(struct sock *sk)
> +{
> + int set = sock_flag(sk, SOCK_MEMALLOC);
> +#ifndef CONFIG_NETVM
> + BUG();
> +#endif

??  #error, maybe?
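
Presumably meaning a build-time failure rather than a run-time BUG(). A
minimal sketch of that reading, assuming this code is only compiled in
configurations where CONFIG_NETVM is expected to be set:

	#ifndef CONFIG_NETVM
	#error "sk_set_memalloc() requires CONFIG_NETVM"
	#endif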

> + if (!set) {
> + int err = sk_adjust_memalloc(1, 0);
> + if (err)
> + return err;
> +
> + sock_set_flag(sk, SOCK_MEMALLOC);
> + sk->sk_allocation |= __GFP_MEMALLOC;
> + }
> + return !set;
> +}
> +EXPORT_SYMBOL_GPL(sk_set_memalloc);



[PATCH 15/28] netvm: network reserve infrastructure

2008-02-20 Thread Peter Zijlstra
Provide the basic infrastructure to reserve and charge/account network memory.

We provide the following reserve tree:

1)  total network reserve
2)    network TX reserve
3)      protocol TX pages
4)    network RX reserve
5)      SKB data reserve

[1] is used to make all the network reserves a single subtree, for easy
manipulation.

[2] and [4] are merely for aesthetic reasons.

The TX pages reserve [3] is assumed to be bounded because it is the upper
bound on the memory that can be used for sending pages (not quite true, but
good enough).

The SKB reserve [5] is an aggregate reserve, which is used to charge SKB data
against in the fallback path.

The consumers for these reserves are sockets marked with:
  SOCK_MEMALLOC

Such sockets are to be used to service the VM (i.e. to swap over). They
must be handled kernel-side; exposing such a socket to user-space is a BUG.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
---
 include/net/sock.h |   35 +++-
 net/Kconfig        |    3 +
 net/core/sock.c    |  113 +
 3 files changed, 150 insertions(+), 1 deletion(-)

Index: linux-2.6/include/net/sock.h
===================================================================
--- linux-2.6.orig/include/net/sock.h
+++ linux-2.6/include/net/sock.h
@@ -51,6 +51,7 @@
 #include <linux/skbuff.h>	/* struct sk_buff */
 #include <linux/mm.h>
 #include <linux/security.h>
+#include <linux/reserve.h>
 
 #include <linux/filter.h>
 
@@ -405,6 +406,7 @@ enum sock_flags {
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+   SOCK_MEMALLOC, /* the VM depends on us - make sure we're serviced */
 };
 
 static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -427,9 +429,40 @@ static inline int sock_flag(struct sock 
return test_bit(flag, &sk->sk_flags);
 }
 
+static inline int sk_has_memalloc(struct sock *sk)
+{
+   return sock_flag(sk, SOCK_MEMALLOC);
+}
+
+/*
+ * Guestimate the per request queue TX upper bound.
+ *
+ * Max packet size is 64k, and we need to reserve that much since we might
+ * need to bounce the data. Double it to be on the safe side.
+ */
+#define TX_RESERVE_PAGES DIV_ROUND_UP(2*65536, PAGE_SIZE)
+
+extern atomic_t memalloc_socks;
+
+extern struct mem_reserve net_rx_reserve;
+extern struct mem_reserve net_skb_reserve;
+
+static inline int sk_memalloc_socks(void)
+{
+   return atomic_read(&memalloc_socks);
+}
+
+extern int rx_emergency_get(int bytes);
+extern int rx_emergency_get_overcommit(int bytes);
+extern void rx_emergency_put(int bytes);
+
+extern int sk_adjust_memalloc(int socks, long tx_reserve_pages);
+extern int sk_set_memalloc(struct sock *sk);
+extern int sk_clear_memalloc(struct sock *sk);
+
 static inline gfp_t sk_allocation(struct sock *sk, gfp_t gfp_mask)
 {
-   return gfp_mask;
+   return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
 }
 
 static inline void sk_acceptq_removed(struct sock *sk)
Index: linux-2.6/net/core/sock.c
===================================================================
--- linux-2.6.orig/net/core/sock.c
+++ linux-2.6/net/core/sock.c
@@ -112,6 +112,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -213,6 +214,111 @@ __u32 sysctl_rmem_default __read_mostly 
 /* Maximal space eaten by iovec or ancilliary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 
+atomic_t memalloc_socks;
+
+static struct mem_reserve net_reserve;
+struct mem_reserve net_rx_reserve;
+EXPORT_SYMBOL_GPL(net_rx_reserve); /* modular ipv6 only */
+struct mem_reserve net_skb_reserve;
+EXPORT_SYMBOL_GPL(net_skb_reserve); /* modular ipv6 only */
+static struct mem_reserve net_tx_reserve;
+static struct mem_reserve net_tx_pages;
+
+
+/*
+ * is there room for another emergency packet?
+ */
+static int __rx_emergency_get(int bytes, bool overcommit)
+{
+   return mem_reserve_kmalloc_charge(&net_skb_reserve, bytes, overcommit);
+}
+
+int rx_emergency_get(int bytes)
+{
+   return __rx_emergency_get(bytes, false);
+}
+
+int rx_emergency_get_overcommit(int bytes)
+{
+   return __rx_emergency_get(bytes, true);
+}
+
+void rx_emergency_put(int bytes)
+{
+   mem_reserve_kmalloc_charge(&net_skb_reserve, -bytes, 0);
+}
+
+/**
+ * sk_adjust_memalloc - adjust the global memalloc reserve for critical RX
+ * @socks: number of new %SOCK_MEMALLOC sockets
+ * @tx_reserve_pages: number of pages to (un)reserve for TX
+ *
+ * This function adjusts the memalloc reserve based on system demand.
+ * The RX reserve is a limit, and only added once, not for each socket.
+ *
+ * NOTE:
+ *@tx_reserve_pages is an upper-bound of memory used for TX hence
+ *we need not account the pages like we do for RX pages.
+ */
+int sk_adjust_memalloc(int socks, long tx_reserve_pages)
+{
+   int nr_socks;
+   int err;
+
+   err = mem_reserve_pages_add(&net_tx_pages, tx_reserve_pages);