Re: [PATCH 5/9] network namespaces: async socket operations

2006-09-23 Thread Andrey Savochkin
On Fri, Sep 22, 2006 at 05:33:56PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > Non-trivial part of socket namespaces: asynchronous events
> > should be run in proper context.
> > 
> > Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
> > ---
> >  af_inet.c            |   10 ++
> >  inet_timewait_sock.c |    8 
> >  tcp_timer.c          |    9 +
> >  3 files changed, 27 insertions(+)
> > 
> > --- ./net/ipv4/af_inet.c.venssock-asyn	Mon Aug 14 17:04:07 2006
> > +++ ./net/ipv4/af_inet.c	Tue Aug 15 13:45:44 2006
> > @@ -366,10 +366,17 @@ out_rcu_unlock:
> >  int inet_release(struct socket *sock)
> >  {
> >  	struct sock *sk = sock->sk;
> > +	struct net_namespace *ns, *orig_net_ns;
> >  
> >  	if (sk) {
> >  		long timeout;
> >  
> > +		/* Need to change context here since protocol ->close
> > +		 * operation may send packets.
> > +		 */
> > +		ns = get_net_ns(sk->sk_net_ns);
> > +		push_net_ns(ns, orig_net_ns);
> > +
> 
> Is it not a race condition here ? What happens if you have a packet 
> incoming during the namespace context switching ?

All asynchronous operations (RX softirq, timers) should set their context
explicitly, and can't rely on the current context being the right one
(or a valid pointer at all).
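
To illustrate the rule, here is a minimal user-space sketch of an asynchronous handler that sets its context from the socket instead of trusting the ambient one. The names mirror the patch, but the bodies are assumptions made for illustration: in the patch push_net_ns() takes the save variable directly (presumably as a macro), while this sketch passes a pointer to it, and the timer_ran_in field is purely hypothetical instrumentation.

```c
#include <stddef.h>

/* Hypothetical user-space stand-ins; only the names follow the patch. */
struct net_namespace { int id; };

static struct net_namespace init_ns = { 0 };
/* Models the "current" namespace pointer that synchronous code relies on. */
static struct net_namespace *current_net_ns = &init_ns;

/* Save the current context in *orig and switch to ns. */
static void push_net_ns(struct net_namespace *ns, struct net_namespace **orig)
{
	*orig = current_net_ns;
	current_net_ns = ns;
}

/* Restore the context saved by push_net_ns(). */
static void pop_net_ns(struct net_namespace *orig)
{
	current_net_ns = orig;
}

struct sock {
	struct net_namespace *sk_net_ns;
	int timer_ran_in;	/* illustration only: records the context seen */
};

/* A timer handler: it takes the context from the socket it was armed for. */
static void sock_timer(struct sock *sk)
{
	struct net_namespace *orig;

	push_net_ns(sk->sk_net_ns, &orig);
	sk->timer_ran_in = current_net_ns->id;	/* work runs in the socket's ns */
	pop_net_ns(orig);
}
```

Whatever context happened to be current when the timer fired is saved and restored untouched, so an interrupted task never observes the switch.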

Andrey
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/9] network namespaces: socket hashes

2006-09-20 Thread Andrey Savochkin
Hi,

On Mon, Sep 18, 2006 at 05:12:49PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > Socket hash lookups are made within namespace.
> > Hash tables are common for all namespaces, with
> > additional permutation of indexes.
> 
> Hi Andrey,
> 
> why is the hash table common and not instantiated multiple times for 
> each namespace like the routes ?

The main reason is that socket hash tables should be large enough to work
efficiently, but it isn't good to waste a lot of memory for each namespace.
Namespaces should be cheap enough that hundreds of them can exist at once.
Memory efficiency therefore takes priority, at least unless/until
socket hash tables start to resize automatically.
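
A minimal user-space sketch of the idea follows: one table shared by all namespaces, with the bucket index permuted by a per-namespace hash and the namespace compared on lookup. The bucket computation follows the shape of inet_bhashfn() in patch 4/9; the structure layouts, table size, and helper names are assumptions for illustration only.

```c
#include <stddef.h>

#define TABLE_SIZE 16	/* power of two; one table shared by all namespaces */

struct net_namespace { unsigned int hash; };

struct sock {
	unsigned short port;
	struct net_namespace *ns;
	struct sock *next;
};

static struct sock *table[TABLE_SIZE];

/* Permute the index with the namespace hash, as inet_bhashfn() does. */
static unsigned int bucket(unsigned short port, const struct net_namespace *ns)
{
	return (port ^ ns->hash) & (TABLE_SIZE - 1);
}

static void sock_insert(struct sock *sk)
{
	unsigned int b = bucket(sk->port, sk->ns);

	sk->next = table[b];
	table[b] = sk;
}

/* The namespace is part of the match, so equal ports never collide logically. */
static struct sock *sock_lookup(unsigned short port,
				const struct net_namespace *ns)
{
	struct sock *sk;

	for (sk = table[bucket(port, ns)]; sk; sk = sk->next)
		if (sk->port == port && sk->ns == ns)
			return sk;
	return NULL;
}
```

The permutation spreads sockets bound to the same port in different namespaces over different chains, while the explicit namespace comparison keeps lookups correct even when two namespaces land in the same bucket.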

Another point is that routing lookup is much more complicated than socket
lookup, which makes it harder to add another search key.
Routing also has additional routines for deleting entries matching some
patterns, and so on.
In short, routing is much more complicated, and it is already quite efficient
for routing tables of various sizes.

Andrey


Re: [PATCH 3/9] network namespaces: playing and debugging

2006-08-17 Thread Andrey Savochkin
On Wed, Aug 16, 2006 at 11:22:28AM -0600, Eric W. Biederman wrote:
> Stephen Hemminger [EMAIL PROTECTED] writes:
> 
> > On Tue, 15 Aug 2006 18:48:43 +0400
> > Andrey Savochkin [EMAIL PROTECTED] wrote:
> > 
> > > Temporary code to play with network namespaces in the simplest way.
> > > Do
> > > 	exec 7< /proc/net/net_ns
> > > in your bash shell and you'll get a brand new network namespace.
> > > There you can, for example, do
> > > 	ip link set lo up
> > > 	ip addr list
> > > 	ip addr add 1.2.3.4 dev lo
> > > 	ping -n 1.2.3.4
> > > 
> > > Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
> > 
> > NACK, new /proc interfaces are not acceptable.
> 
> The rule is that new /proc interfaces that are not process related
> are not acceptable.  If structured right a network namespace can
> arguably be process related.
> 
> I do agree that this interface is pretty ugly there.

This proc interface was a backdoor to play with namespaces without
compiling any user-space programs.

As you wish.
Do you want to have a new clone flag right away?

Andrey


[PATCH 7/9] net_device seq_file

2006-08-16 Thread Andrey Savochkin
Library function to create a seq_file in the proc filesystem,
showing some information for each netdevice.
This code is present in the kernel in about 10 instances, and
all of them can be converted to use the library function introduced here.
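
The pattern being factored out can be sketched in user space as follows: a single list walker owns the iteration, and each user supplies only a show() callback plus private data. This is an illustration under assumptions, not the kernel API: FILE stands in for struct seq_file, a NULL device stands in for SEQ_START_TOKEN, and netdev_for_each_show() and count_show() are hypothetical names.

```c
#include <stdio.h>
#include <stddef.h>

struct net_device {
	const char *name;
	struct net_device *next;
};

/* Per-file callback, shaped like the show() argument of netdev_proc_create(). */
typedef int (*netdev_show_t)(FILE *out, struct net_device *dev, void *data);

/*
 * Walk the device list once.  show() is first called with dev == NULL so
 * that it can print a header, then once per device.
 */
static int netdev_for_each_show(FILE *out, struct net_device *base,
				netdev_show_t show, void *data)
{
	struct net_device *dev;
	int err = show(out, NULL, data);

	for (dev = base; dev && !err; dev = dev->next)
		err = show(out, dev, data);
	return err;
}

/* Example callback: print the device name and count devices via data. */
static int count_show(FILE *out, struct net_device *dev, void *data)
{
	int *count = data;

	if (!dev) {
		fprintf(out, "Name\n");
		return 0;
	}
	fprintf(out, "%s\n", dev->name);
	++*count;
	return 0;
}
```

Each of the existing instances would then shrink to its own show() routine, with locking and position handling living in one place.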

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/netdevice.h |7 +++
 net/core/dev.c|   96 ++
 2 files changed, 103 insertions(+)

--- ./include/linux/netdevice.h.venetproc	Tue Aug 15 13:46:08 2006
+++ ./include/linux/netdevice.h	Tue Aug 15 13:46:08 2006
@@ -592,6 +592,13 @@ extern int register_netdevice(struct ne
 extern int		unregister_netdevice(struct net_device *dev);
 extern void		free_netdev(struct net_device *dev);
 extern void		synchronize_net(void);
+#ifdef CONFIG_PROC_FS
+extern int		netdev_proc_create(char *name,
+				int (*show)(struct seq_file *,
+					struct net_device *, void *),
+				void *data, struct module *mod);
+void			netdev_proc_remove(char *name);
+#endif
 extern int		register_netdevice_notifier(struct notifier_block *nb);
 extern int		unregister_netdevice_notifier(struct notifier_block *nb);
 extern int		call_netdevice_notifiers(unsigned long val, void *v);
--- ./net/core/dev.c.venetproc  Tue Aug 15 13:46:08 2006
+++ ./net/core/dev.c	Tue Aug 15 13:46:08 2006
@@ -2100,6 +2100,102 @@ static int dev_ifconf(char __user *arg)
 }
 
 #ifdef CONFIG_PROC_FS
+
+struct netdev_proc_data {
+	struct file_operations fops;
+	int (*show)(struct seq_file *, struct net_device *, void *);
+	void *data;
+};
+
+static void *netdev_proc_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct net_device *dev;
+	loff_t off;
+
+	read_lock(&dev_base_lock);
+	if (*pos == 0)
+		return SEQ_START_TOKEN;
+	for (dev = dev_base, off = 1; dev; dev = dev->next, off++) {
+		if (*pos == off)
+			return dev;
+	}
+	return NULL;
+}
+
+static void *netdev_proc_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return (v == SEQ_START_TOKEN) ? dev_base
+			: ((struct net_device *)v)->next;
+}
+
+static void netdev_proc_seq_stop(struct seq_file *seq, void *v)
+{
+	read_unlock(&dev_base_lock);
+}
+
+static int netdev_proc_seq_show(struct seq_file *seq, void *v)
+{
+	struct netdev_proc_data *p;
+
+	p = seq->private;
+	return (*p->show)(seq, v, p->data);
+}
+
+static struct seq_operations netdev_proc_seq_ops = {
+	.start = netdev_proc_seq_start,
+	.next  = netdev_proc_seq_next,
+	.stop  = netdev_proc_seq_stop,
+	.show  = netdev_proc_seq_show,
+};
+
+static int netdev_proc_open(struct inode *inode, struct file *file)
+{
+	int err;
+	struct seq_file *p;
+
+	err = seq_open(file, &netdev_proc_seq_ops);
+	if (!err) {
+		p = file->private_data;
+		p->private = (struct netdev_proc_data *)PDE(inode)->data;
+	}
+	return err;
+}
+
+int netdev_proc_create(char *name,
+		int (*show)(struct seq_file *, struct net_device *, void *),
+		void *data, struct module *mod)
+{
+	struct netdev_proc_data *p;
+	struct proc_dir_entry *ent;
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	p->fops.owner = mod;
+	p->fops.open = netdev_proc_open;
+	p->fops.read = seq_read;
+	p->fops.llseek = seq_lseek;
+	p->fops.release = seq_release;
+	p->show = show;
+	p->data = data;
+	ent = create_proc_entry(name, S_IRUGO, proc_net);
+	if (ent == NULL) {
+		kfree(p);
+		return -EINVAL;
+	}
+	ent->data = p;
+	ent->destructor = proc_data_destructor;
+	smp_wmb();
+	ent->proc_fops = &p->fops;
+	return 0;
+}
+EXPORT_SYMBOL(netdev_proc_create);
+
+void netdev_proc_remove(char *name)
+{
+	proc_net_remove(name);
+}
+EXPORT_SYMBOL(netdev_proc_remove);
+
 /*
  * This is invoked by the /proc filesystem handler to display a device
  * in detail.


[PATCH 5/9] network namespaces: async socket operations

2006-08-16 Thread Andrey Savochkin
Non-trivial part of socket namespaces: asynchronous events
should be run in proper context.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 af_inet.c|   10 ++
 inet_timewait_sock.c |8 
 tcp_timer.c  |9 +
 3 files changed, 27 insertions(+)

--- ./net/ipv4/af_inet.c.venssock-asyn  Mon Aug 14 17:04:07 2006
+++ ./net/ipv4/af_inet.c	Tue Aug 15 13:45:44 2006
@@ -366,10 +366,17 @@ out_rcu_unlock:
 int inet_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
+	struct net_namespace *ns, *orig_net_ns;
 
 	if (sk) {
 		long timeout;
 
+		/* Need to change context here since protocol ->close
+		 * operation may send packets.
+		 */
+		ns = get_net_ns(sk->sk_net_ns);
+		push_net_ns(ns, orig_net_ns);
+
 		/* Applications forget to leave groups before exiting */
 		ip_mc_drop_socket(sk);
 
@@ -386,6 +393,9 @@ int inet_release(struct socket *sock)
 			timeout = sk->sk_lingertime;
 		sock->sk = NULL;
 		sk->sk_prot->close(sk, timeout);
+
+		pop_net_ns(orig_net_ns);
+		put_net_ns(ns);
 	}
 	return 0;
 }
--- ./net/ipv4/inet_timewait_sock.c.venssock-asyn   Tue Aug 15 13:45:44 2006
+++ ./net/ipv4/inet_timewait_sock.c Tue Aug 15 13:45:44 2006
@@ -129,6 +129,7 @@ static int inet_twdr_do_twkill_work(stru
 {
 	struct inet_timewait_sock *tw;
 	struct hlist_node *node;
+	struct net_namespace *orig_net_ns;
 	unsigned int killed;
 	int ret;
 
@@ -140,8 +141,10 @@ static int inet_twdr_do_twkill_work(stru
 	 */
 	killed = 0;
 	ret = 0;
+	push_net_ns(current_net_ns, orig_net_ns);
 rescan:
 	inet_twsk_for_each_inmate(tw, node, &twdr->cells[slot]) {
+		switch_net_ns(tw->tw_net_ns);
 		__inet_twsk_del_dead_node(tw);
 		spin_unlock(&twdr->death_lock);
 		__inet_twsk_kill(tw, twdr->hashinfo);
@@ -164,6 +167,7 @@ rescan:
 
 	twdr->tw_count -= killed;
 	NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITED, killed);
+	pop_net_ns(orig_net_ns);
 
 	return ret;
 }
@@ -338,10 +342,12 @@ void inet_twdr_twcal_tick(unsigned long 
 	int n, slot;
 	unsigned long j;
 	unsigned long now = jiffies;
+	struct net_namespace *orig_net_ns;
 	int killed = 0;
 	int adv = 0;
 
 	twdr = (struct inet_timewait_death_row *)data;
+	push_net_ns(current_net_ns, orig_net_ns);
 
 	spin_lock(&twdr->death_lock);
 	if (twdr->twcal_hand < 0)
@@ -357,6 +363,7 @@ void inet_twdr_twcal_tick(unsigned long 
 
 			inet_twsk_for_each_inmate_safe(tw, node, safe,
 						       &twdr->twcal_row[slot]) {
+				switch_net_ns(tw->tw_net_ns);
 				__inet_twsk_del_dead_node(tw);
 				__inet_twsk_kill(tw, twdr->hashinfo);
 				inet_twsk_put(tw);
@@ -384,6 +391,7 @@ out:
 		del_timer(&twdr->tw_timer);
 	NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITKILLED, killed);
 	spin_unlock(&twdr->death_lock);
+	pop_net_ns(orig_net_ns);
 }
 
 EXPORT_SYMBOL_GPL(inet_twdr_twcal_tick);
--- ./net/ipv4/tcp_timer.c.venssock-asyn	Mon Aug 14 16:43:51 2006
+++ ./net/ipv4/tcp_timer.c	Tue Aug 15 13:45:44 2006
@@ -171,7 +171,9 @@ static void tcp_delack_timer(unsigned lo
 	struct sock *sk = (struct sock*)data;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net_namespace *orig_net_ns;
 
+	push_net_ns(sk->sk_net_ns, orig_net_ns);
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk)) {
 		/* Try again later. */
@@ -225,6 +227,7 @@ out:
 out_unlock:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	pop_net_ns(orig_net_ns);
 }
 
 static void tcp_probe_timer(struct sock *sk)
@@ -384,8 +387,10 @@ static void tcp_write_timer(unsigned lon
 {
 	struct sock *sk = (struct sock*)data;
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net_namespace *orig_net_ns;
 	int event;
 
+	push_net_ns(sk->sk_net_ns, orig_net_ns);
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk)) {
 		/* Try again later */
@@ -419,6 +424,7 @@ out:
 out_unlock:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	pop_net_ns(orig_net_ns);
 }
 
 /*
@@ -447,9 +453,11 @@ static void tcp_keepalive_timer (unsigne
 {
 	struct sock *sk = (struct sock *) data;
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net_namespace *orig_net_ns;
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 elapsed;
 
+	push_net_ns(sk->sk_net_ns, orig_net_ns);
 	/* Only process if socket is not in use. */
 	bh_lock_sock(sk

[RFC] network namespaces

2006-08-16 Thread Andrey Savochkin
Hi All,

I'd like to resurrect our discussion about network namespaces.
In our previous discussions it appeared that we have rather polar concepts
which seemed hard to reconcile.
Now I have an idea how to look at all discussed concepts to enable everyone's
usage scenario.

1. The most straightforward concept is complete separation of namespaces,
   covering device list, routing tables, netfilter tables, socket hashes, and
   everything else.

   On input path, each packet is tagged with namespace right from the
   place where it appears from a device, and is processed by each layer
   in the context of this namespace.
   Non-root namespaces communicate with the outside world in two ways: by
   owning hardware devices, or by receiving packets forwarded to them by
   their parent namespace via a pass-through device.

   This complete separation of namespaces is very useful for at least two
   purposes:
    - allowing users to create and manage various tunnels and VPNs on
      their own, and
    - enabling easier and more straightforward live migration of groups of
      processes with their environment.

2. People expressed concerns that complete separation of namespaces
   may introduce an undesired overhead in certain usage scenarios.
   The overhead comes from packets traversing input path, then output path,
   then input path again in the destination namespace if root namespace
   acts as a router.

   So, we may introduce short-cuts, where an input packet starts being
   processed in one namespace, but changes namespace at some upper layer.
   The places where a packet can change namespace are, for example:
   routing, post-routing netfilter hook, or even lookup in socket hash.

   The cleanest example among them is the post-routing netfilter hook.
   Tagging of input packets there means that the packet is checked against
   the root namespace's routing table, found to be local, and goes directly
   to the socket hash lookup in the destination namespace.
   In this scheme the ability to change routing tables or netfilter rules on
   a per-namespace basis is traded for lower overhead.

   All other optimized schemes where input packets do not travel
   input-output-input paths in the general case may be viewed as short-cuts
   in scheme (1).  The remaining question is exactly which short-cuts make
   the most sense, and how to make them consistent from the interface point
   of view.

My current idea is to reach some agreement on the basic concept, review
patches, and then move on to implementing feasible short-cuts.

Opinions?

Next in this thread are patches introducing namespaces to device list,
IPv4 routing, and socket hashes, and a pass-through device.
Patches are against 2.6.18-rc4-mm1.

Best regards,

Andrey


[PATCH 2/9] network namespaces: IPv4 routing

2006-08-16 Thread Andrey Savochkin
Structures related to IPv4 routing (FIB and routing cache)
are made per-namespace.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/net_ns.h   |   10 +++
 include/net/flow.h   |3 +
 include/net/ip_fib.h |   46 
 net/core/dev.c   |8 ++
 net/core/fib_rules.c |   43 ---
 net/ipv4/Kconfig |4 -
 net/ipv4/fib_frontend.c  |  132 +--
 net/ipv4/fib_hash.c  |   13 +++-
 net/ipv4/fib_rules.c |   86 +-
 net/ipv4/fib_semantics.c |   99 +++
 net/ipv4/route.c |   26 -
 11 files changed, 375 insertions(+), 95 deletions(-)

--- ./include/linux/net_ns.h.vensroute  Mon Aug 14 17:18:59 2006
+++ ./include/linux/net_ns.h	Mon Aug 14 19:19:14 2006
@@ -14,7 +14,17 @@ struct net_namespace {
 	atomic_t		active_ref, use_ref;
 	struct net_device	*dev_base_p, **dev_tail_p;
 	struct net_device	*loopback;
+#ifndef CONFIG_IP_MULTIPLE_TABLES
+	struct fib_table	*fib4_local_table, *fib4_main_table;
+#else
+	struct list_head	fib_rules_ops_list;
+	struct fib_rules_ops	*fib4_rules_ops;
+	struct hlist_head	*fib4_tables;
+#endif
+	struct hlist_head	*fib4_hash, *fib4_laddrhash;
+	unsigned		fib4_hash_size, fib4_info_cnt;
 	unsigned int		hash;
+	char			destroying;
 	struct work_struct	destroy_work;
 };
 
--- ./include/net/flow.h.vensroute	Mon Aug 14 17:04:04 2006
+++ ./include/net/flow.h	Mon Aug 14 17:18:59 2006
@@ -79,6 +79,9 @@ struct flowi {
 #define fl_icmp_code	uli_u.icmpt.code
 #define fl_ipsec_spi	uli_u.spi
 	__u32           secid;	/* used by xfrm; see secid.txt */
+#ifdef CONFIG_NET_NS
+	struct net_namespace *net_ns;
+#endif
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 #define FLOW_DIR_IN	0
--- ./include/net/ip_fib.h.vensroute	Mon Aug 14 17:04:04 2006
+++ ./include/net/ip_fib.h	Tue Aug 15 11:53:22 2006
@@ -18,6 +18,7 @@
 
 #include <net/flow.h>
 #include <linux/seq_file.h>
+#include <linux/net_ns.h>
 #include <net/fib_rules.h>
 
 /* WARNING: The ordering of these elements must match ordering
@@ -171,14 +172,21 @@ struct fib_table {
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
-extern struct fib_table *ip_fib_local_table;
-extern struct fib_table *ip_fib_main_table;
+#ifndef CONFIG_NET_NS
+extern struct fib_table *ip_fib_local_table_static;
+extern struct fib_table *ip_fib_main_table_static;
+#define ip_fib_local_table_ns()		ip_fib_local_table_static
+#define ip_fib_main_table_ns()		ip_fib_main_table_static
+#else
+#define ip_fib_local_table_ns()		(current_net_ns->fib4_local_table)
+#define ip_fib_main_table_ns()		(current_net_ns->fib4_main_table)
+#endif
 
 static inline struct fib_table *fib_get_table(u32 id)
 {
 	if (id != RT_TABLE_LOCAL)
-		return ip_fib_main_table;
-	return ip_fib_local_table;
+		return ip_fib_main_table_ns();
+	return ip_fib_local_table_ns();
 }
 
 static inline struct fib_table *fib_new_table(u32 id)
@@ -188,21 +196,29 @@ static inline struct fib_table *fib_new_
 
 static inline int fib_lookup(const struct flowi *flp, struct fib_result *res)
 {
-	if (ip_fib_local_table->tb_lookup(ip_fib_local_table, flp, res) &&
-	    ip_fib_main_table->tb_lookup(ip_fib_main_table, flp, res))
+	struct fib_table *tb;
+
+	tb = ip_fib_local_table_ns();
+	if (!tb->tb_lookup(tb, flp, res))
+		return 0;
+	tb = ip_fib_main_table_ns();
+	if (tb->tb_lookup(tb, flp, res))
 		return -ENETUNREACH;
 	return 0;
 }
 
 static inline void fib_select_default(const struct flowi *flp, struct fib_result *res)
 {
+	struct fib_table *tb;
+
+	tb = ip_fib_main_table_ns();
 	if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
-		ip_fib_main_table->tb_select_default(ip_fib_main_table, flp, res);
+		tb->tb_select_default(tb, flp, res);
 }
 
 #else /* CONFIG_IP_MULTIPLE_TABLES */
-#define ip_fib_local_table fib_get_table(RT_TABLE_LOCAL)
-#define ip_fib_main_table fib_get_table(RT_TABLE_MAIN)
+#define ip_fib_local_table_ns() fib_get_table(RT_TABLE_LOCAL)
+#define ip_fib_main_table_ns() fib_get_table(RT_TABLE_MAIN)
 
 extern int fib_lookup(struct flowi *flp, struct fib_result *res);
 
@@ -214,6 +230,10 @@ extern void fib_select_default(const str
 
 /* Exported by fib_frontend.c */
 extern void		ip_fib_init(void);
+#ifdef CONFIG_NET_NS
+extern int ip_fib_struct_init(void);
+extern void ip_fib_struct_cleanup(void);
+#endif
 extern int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
 extern int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
 extern int inet_rtm_getroute

[PATCH 1/9] network namespaces: core and device list

2006-08-16 Thread Andrey Savochkin
CONFIG_NET_NS and net_namespace structure are introduced.
List of network devices is made per-namespace.
Each namespace gets its own loopback device.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 drivers/net/loopback.c|   69 -
 include/linux/init_task.h |9 ++
 include/linux/net_ns.h|   82 +
 include/linux/netdevice.h |   13 +++
 include/linux/nsproxy.h   |3 
 include/linux/sched.h |3 
 kernel/nsproxy.c  |   14 
 net/Kconfig   |7 ++
 net/core/dev.c|  150 --
 net/core/net-sysfs.c  |   24 +++
 net/ipv4/devinet.c|2 
 net/ipv6/addrconf.c   |2 
 net/ipv6/route.c  |9 +-
 13 files changed, 349 insertions(+), 38 deletions(-)

--- ./drivers/net/loopback.c.vensdev	Mon Aug 14 17:02:18 2006
+++ ./drivers/net/loopback.c	Mon Aug 14 17:18:20 2006
@@ -196,42 +196,55 @@ static struct ethtool_ops loopback_ethto
 	.set_tso		= ethtool_op_set_tso,
 };
 
-struct net_device loopback_dev = {
-	.name			= "lo",
-	.mtu			= (16 * 1024) + 20 + 20 + 12,
-	.hard_start_xmit	= loopback_xmit,
-	.hard_header		= eth_header,
-	.hard_header_cache	= eth_header_cache,
-	.header_cache_update	= eth_header_cache_update,
-	.hard_header_len	= ETH_HLEN,	/* 14	*/
-	.addr_len		= ETH_ALEN,	/* 6	*/
-	.tx_queue_len		= 0,
-	.type			= ARPHRD_LOOPBACK,	/* 0x0001*/
-	.rebuild_header		= eth_rebuild_header,
-	.flags			= IFF_LOOPBACK,
-	.features		= NETIF_F_SG | NETIF_F_FRAGLIST
+struct net_device loopback_dev_static;
+EXPORT_SYMBOL(loopback_dev_static);
+
+void loopback_dev_dtor(struct net_device *dev)
+{
+	if (dev->priv) {
+		kfree(dev->priv);
+		dev->priv = NULL;
+	}
+	free_netdev(dev);
+}
+
+void loopback_dev_ctor(struct net_device *dev)
+{
+	struct net_device_stats *stats;
+
+	memset(dev, 0, sizeof(*dev));
+	strcpy(dev->name, "lo");
+	dev->mtu		= (16 * 1024) + 20 + 20 + 12;
+	dev->hard_start_xmit	= loopback_xmit;
+	dev->hard_header	= eth_header;
+	dev->hard_header_cache	= eth_header_cache;
+	dev->header_cache_update = eth_header_cache_update;
+	dev->hard_header_len	= ETH_HLEN;	/* 14	*/
+	dev->addr_len		= ETH_ALEN;	/* 6	*/
+	dev->tx_queue_len	= 0;
+	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001*/
+	dev->rebuild_header	= eth_rebuild_header;
+	dev->flags		= IFF_LOOPBACK;
+	dev->features		= NETIF_F_SG | NETIF_F_FRAGLIST
 #ifdef LOOPBACK_TSO
 				  | NETIF_F_TSO
 #endif
 				  | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA
-				  | NETIF_F_LLTX,
-	.ethtool_ops		= &loopback_ethtool_ops,
-};
-
-/* Setup and register the loopback device. */
-int __init loopback_init(void)
-{
-	struct net_device_stats *stats;
+				  | NETIF_F_LLTX;
+	dev->ethtool_ops	= &loopback_ethtool_ops;
 
 	/* Can survive without statistics */
 	stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
 	if (stats) {
 		memset(stats, 0, sizeof(struct net_device_stats));
-		loopback_dev.priv = stats;
-		loopback_dev.get_stats = get_stats;
+		dev->priv = stats;
+		dev->get_stats = get_stats;
 	}
-	
-	return register_netdev(&loopback_dev);
-};
+}
 
-EXPORT_SYMBOL(loopback_dev);
+/* Setup and register the loopback device. */
+int __init loopback_init(void)
+{
+	loopback_dev_ctor(&loopback_dev_static);
+	return register_netdev(&loopback_dev_static);
+};
--- ./include/linux/init_task.h.vensdev Mon Aug 14 17:04:04 2006
+++ ./include/linux/init_task.h Mon Aug 14 17:18:21 2006
@@ -87,6 +87,14 @@ extern struct nsproxy init_nsproxy;
 
 extern struct group_info init_groups;
 
+#ifdef CONFIG_NET_NS
+extern struct net_namespace init_net_ns;
+#define INIT_NET_NS \
+	.net_context	= &init_net_ns,
+#else
+#define INIT_NET_NS
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -129,6 +137,7 @@ extern struct group_info init_groups;
 	.signal		= &init_signals,				\
 	.sighand	= &init_sighand,				\
 	.nsproxy	= &init_nsproxy,				\
+	INIT_NET_NS							\
 	.pending	= {						\
 		.list = LIST_HEAD_INIT

[PATCH 3/9] network namespaces: playing and debugging

2006-08-16 Thread Andrey Savochkin
Temporary code to play with network namespaces in the simplest way.
Do
	exec 7< /proc/net/net_ns
in your bash shell and you'll get a brand new network namespace.
There you can, for example, do
	ip link set lo up
	ip addr list
	ip addr add 1.2.3.4 dev lo
	ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 dev.c |   20 
 1 files changed, 20 insertions(+)

--- ./net/core/dev.c.vensxdbg   Tue Aug 15 13:46:44 2006
+++ ./net/core/dev.c	Tue Aug 15 13:46:44 2006
@@ -3597,6 +3597,8 @@ int net_ns_start(void)
 	if (err)
 		goto out_register;
 	put_net_ns(orig_ns);
+	printk(KERN_DEBUG "NET_NS: created new netcontext %p for %s (pid=%d)\n",
+			ns, task->comm, task->tgid);
 	return 0;
 
 out_register:
@@ -3629,14 +3631,29 @@ static void net_ns_destroy(void *data)
 	ip_fib_struct_cleanup();
 	pop_net_ns(orig_ns);
 	kfree(ns);
+	printk(KERN_DEBUG "NET_NS: netcontext %p freed\n", ns);
 }
 
 void net_ns_stop(struct net_namespace *ns)
 {
+	printk(KERN_DEBUG "NET_NS: netcontext %p scheduled for stop\n", ns);
 	INIT_WORK(&ns->destroy_work, net_ns_destroy, ns);
 	schedule_work(&ns->destroy_work);
 }
 EXPORT_SYMBOL(net_ns_stop);
+
+static int net_ns_open(struct inode *i, struct file *f)
+{
+	return net_ns_start();
+}
+static struct file_operations net_ns_fops = {
+	.open	= net_ns_open,
+};
+static int net_ns_init(void)
+{
+	return proc_net_fops_create("net_ns", S_IRWXU, &net_ns_fops)
+			? 0 : -ENOMEM;
+}
 #endif
 
 /*
@@ -3701,6 +3718,9 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+#ifdef CONFIG_NET_NS
+	net_ns_init();
+#endif
 	rc = 0;
 out:
 	return rc;


[PATCH 4/9] network namespaces: socket hashes

2006-08-16 Thread Andrey Savochkin
Socket hash lookups are made within namespace.
Hash tables are common for all namespaces, with
additional permutation of indexes.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/ipv6.h |3 ++-
 include/net/inet6_hashtables.h   |6 --
 include/net/inet_hashtables.h|   38 +-
 include/net/inet_sock.h  |6 --
 include/net/inet_timewait_sock.h |2 ++
 include/net/sock.h   |4 
 include/net/udp.h|   12 +---
 net/core/sock.c  |5 +
 net/ipv4/inet_connection_sock.c  |   19 +++
 net/ipv4/inet_hashtables.c   |   29 ++---
 net/ipv4/inet_timewait_sock.c|8 ++--
 net/ipv4/raw.c   |2 ++
 net/ipv4/udp.c   |   20 +---
 net/ipv6/inet6_connection_sock.c |2 ++
 net/ipv6/inet6_hashtables.c  |   25 ++---
 net/ipv6/raw.c   |4 
 net/ipv6/udp.c   |   21 ++---
 17 files changed, 151 insertions(+), 55 deletions(-)

--- ./include/linux/ipv6.h.venssock Mon Aug 14 17:02:45 2006
+++ ./include/linux/ipv6.h  Tue Aug 15 13:38:47 2006
@@ -428,10 +428,11 @@ static inline struct raw6_sock *raw6_sk(
 #define inet_v6_ipv6only(__sk) 0
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
 
-#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)\
+#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif, __ns)\
 	(((__sk)->sk_hash == (__hash))				&& \
 	 ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports))	&& \
 	 ((__sk)->sk_family		== AF_INET6)		&& \
+	 net_ns_match((__sk)->sk_net_ns, __ns)			&& \
 	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&& \
 	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&& \
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif
--- ./include/net/inet6_hashtables.h.venssock	Mon Aug 14 17:02:47 2006
+++ ./include/net/inet6_hashtables.h	Tue Aug 15 13:38:47 2006
@@ -26,11 +26,13 @@ struct inet_hashinfo;
 
 /* I have no idea if this is a good hash for v6 or not. -DaveM */
 static inline unsigned int inet6_ehashfn(const struct in6_addr *laddr, const u16 lport,
-				const struct in6_addr *faddr, const u16 fport)
+				const struct in6_addr *faddr, const u16 fport,
+				struct net_namespace *ns)
 {
 	unsigned int hashent = (lport ^ fport);
 
 	hashent ^= (laddr->s6_addr32[3] ^ faddr->s6_addr32[3]);
+	hashent ^= net_ns_hash(ns);
 	hashent ^= hashent >> 16;
 	hashent ^= hashent >> 8;
 	return hashent;
@@ -44,7 +46,7 @@ static inline int inet6_sk_ehashfn(const
 	const struct in6_addr *faddr = &np->daddr;
 	const __u16 lport = inet->num;
 	const __u16 fport = inet->dport;
-	return inet6_ehashfn(laddr, lport, faddr, fport);
+	return inet6_ehashfn(laddr, lport, faddr, fport, current_net_ns);
 }
 
 extern void __inet6_hash(struct inet_hashinfo *hashinfo, struct sock *sk);
--- ./include/net/inet_hashtables.h.venssock	Mon Aug 14 17:04:04 2006
+++ ./include/net/inet_hashtables.h	Tue Aug 15 13:38:47 2006
@@ -74,6 +74,9 @@ struct inet_ehash_bucket {
  * ports are created in O(1) time?  I thought so. ;-)  -DaveM
  */
 struct inet_bind_bucket {
+#ifdef CONFIG_NET_NS
+	struct net_namespace	*net_ns;
+#endif
 	unsigned short		port;
 	signed short		fastreuse;
 	struct hlist_node	node;
@@ -142,30 +145,34 @@ extern struct inet_bind_bucket *
 extern void inet_bind_bucket_destroy(kmem_cache_t *cachep,
 				     struct inet_bind_bucket *tb);
 
-static inline int inet_bhashfn(const __u16 lport, const int bhash_size)
+static inline int inet_bhashfn(const __u16 lport,
+			       struct net_namespace *ns,
+			       const int bhash_size)
 {
-	return lport & (bhash_size - 1);
+	return (lport ^ net_ns_hash(ns)) & (bhash_size - 1);
 }
 
 extern void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 			   const unsigned short snum);
 
 /* These can have wildcards, don't try too hard. */
-static inline int inet_lhashfn(const unsigned short num)
+static inline int inet_lhashfn(const unsigned short num,
+			       struct net_namespace *ns)
 {
-	return num & (INET_LHTABLE_SIZE - 1);
+	return (num ^ net_ns_hash(ns)) & (INET_LHTABLE_SIZE - 1);
 }
 
 static inline int inet_sk_listen_hashfn(const struct sock *sk)
 {
-	return inet_lhashfn(inet_sk(sk)->num);
+	return inet_lhashfn(inet_sk(sk)->num, current_net_ns);
 }
 
 /* Caller must disable local BH processing. */
 static inline void

[PATCH 6/9] allow proc_dir_entries to have destructor

2006-08-16 Thread Andrey Savochkin
A destructor field is added to proc_dir_entry, and a standard
destructor that kfree()s the entry's data is introduced.
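
The idea can be sketched in user space as follows: the entry carries an optional destructor callback, and the generic free routine calls it instead of special-casing particular entry types. This is only an illustration under assumptions: malloc()/free() stand in for the kernel allocator, and struct proc_entry, data_destructor(), free_entry(), and the freed_count counter are hypothetical names.

```c
#include <stdlib.h>

struct proc_entry;
typedef void (*destroy_fn)(struct proc_entry *);

struct proc_entry {
	void *data;
	destroy_fn destructor;	/* optional; NULL means data is not owned */
};

static int freed_count;		/* illustration only: counts destructor calls */

/* Standard destructor: the entry owns data, so release it. */
static void data_destructor(struct proc_entry *ent)
{
	free(ent->data);
	ent->data = NULL;
	freed_count++;
}

/* Generic free path: no knowledge of what kind of entry this is. */
static void free_entry(struct proc_entry *ent)
{
	if (ent->destructor)
		ent->destructor(ent);
	free(ent);
}
```

With this shape, the symlink case becomes just one more user of the standard destructor rather than a check inside the free routine.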

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 fs/proc/generic.c   |   10 --
 fs/proc/root.c  |1 +
 include/linux/proc_fs.h |4 
 3 files changed, 13 insertions(+), 2 deletions(-)

--- ./fs/proc/generic.c.veprocdtor  Mon Aug 14 16:43:41 2006
+++ ./fs/proc/generic.c Tue Aug 15 13:45:51 2006
@@ -608,6 +608,11 @@ static struct proc_dir_entry *proc_creat
 	return ent;
 }
 
+void proc_data_destructor(struct proc_dir_entry *ent)
+{
+	kfree(ent->data);
+}
+
 struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent, const char *dest)
 {
@@ -620,6 +625,7 @@ struct proc_dir_entry *proc_symlink(cons
 		ent->data = kmalloc((ent->size=strlen(dest))+1, GFP_KERNEL);
 		if (ent->data) {
 			strcpy((char*)ent->data,dest);
+			ent->destructor = proc_data_destructor;
 			if (proc_register(parent, ent) < 0) {
 				kfree(ent->data);
 				kfree(ent);
@@ -698,8 +704,8 @@ void free_proc_entry(struct proc_dir_ent
 
 	release_inode_number(ino);
 
-	if (S_ISLNK(de->mode) && de->data)
-		kfree(de->data);
+	if (de->destructor)
+		de->destructor(de);
 	kfree(de);
 }
 
--- ./fs/proc/root.c.veprocdtor	Mon Aug 14 17:02:38 2006
+++ ./fs/proc/root.c	Tue Aug 15 13:45:51 2006
@@ -154,6 +154,7 @@ EXPORT_SYMBOL(proc_symlink);
 EXPORT_SYMBOL(proc_mkdir);
 EXPORT_SYMBOL(create_proc_entry);
 EXPORT_SYMBOL(remove_proc_entry);
+EXPORT_SYMBOL(proc_data_destructor);
 EXPORT_SYMBOL(proc_root);
 EXPORT_SYMBOL(proc_root_fs);
 EXPORT_SYMBOL(proc_net);
--- ./include/linux/proc_fs.h.veprocdtor	Mon Aug 14 17:02:47 2006
+++ ./include/linux/proc_fs.h	Tue Aug 15 13:45:51 2006
@@ -46,6 +46,8 @@ typedef   int (read_proc_t)(char *page, ch
 typedef	int (write_proc_t)(struct file *file, const char __user *buffer,
 			   unsigned long count, void *data);
 typedef int (get_info_t)(char *, char **, off_t, int);
+struct proc_dir_entry;
+typedef void (destroy_proc_t)(struct proc_dir_entry *);
 
 struct proc_dir_entry {
 	unsigned int low_ino;
@@ -65,6 +67,7 @@ struct proc_dir_entry {
 	read_proc_t *read_proc;
 	write_proc_t *write_proc;
 	atomic_t count;		/* use count */
+	destroy_proc_t *destructor;
 	int deleted;		/* delete flag */
 	void *set;
 };
@@ -109,6 +112,7 @@ char *task_mem(struct mm_struct *, char 
 extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
 						struct proc_dir_entry *parent);
 extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
+extern void proc_data_destructor(struct proc_dir_entry *);
 
 extern struct vfsmount *proc_mnt;
 extern int proc_fill_super(struct super_block *,void *,int);


[PATCH 8/9] network namespaces: device to pass packets between namespaces

2006-08-16 Thread Andrey Savochkin
A simple device to pass packets between a namespace and its child.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 Makefile |3 
 veth.c   |  327 +++
 2 files changed, 330 insertions(+)

--- ./drivers/net/Makefile.veveth   Mon Aug 14 17:03:45 2006
+++ ./drivers/net/Makefile  Tue Aug 15 13:46:15 2006
@@ -124,6 +124,9 @@ obj-$(CONFIG_SLIP) += slip.o
 obj-$(CONFIG_SLHC) += slhc.o
 
 obj-$(CONFIG_DUMMY) += dummy.o
+ifeq ($(CONFIG_NET_NS),y)
+obj-m += veth.o
+endif
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
--- ./drivers/net/veth.c.veveth Tue Aug 15 13:44:46 2006
+++ ./drivers/net/veth.cTue Aug 15 13:46:15 2006
@@ -0,0 +1,327 @@
+/*
+ * Copyright (C) 2006  SWsoft
+ *
+ * Written by Andrey Savochkin [EMAIL PROTECTED],
+ * reusing code by Andrey Mirkin [EMAIL PROTECTED].
+ */
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ctype.h>
+#include <asm/semaphore.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <net/dst.h>
+#include <net/xfrm.h>
+
+struct veth_struct
+{
+   struct net_device   *pair;
+   struct net_device_stats stats;
+};
+
+#define veth_from_netdev(dev) ((struct veth_struct *)(netdev_priv(dev)))
+
+/* --- *
+ *
+ * Device functions
+ *
+ * --- */
+
+static struct net_device_stats *get_stats(struct net_device *dev);
+static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct net_device_stats *stats;
+   struct veth_struct *entry;
+   struct net_device *rcv;
+   struct net_namespace *orig_net_ns;
+   int length;
+
+   stats = get_stats(dev);
+   entry = veth_from_netdev(dev);
+	rcv = entry->pair;
+
+	if (!(rcv->flags & IFF_UP))
+		/* Target namespace does not want to receive packets */
+		goto outf;
+
+	dst_release(skb->dst);
+	skb->dst = NULL;
+	secpath_reset(skb);
+	skb_orphan(skb);
+#ifdef CONFIG_NETFILTER
+	nf_conntrack_put(skb->nfct);
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+	nf_conntrack_put_reasm(skb->nfct_reasm);
+#endif
+#ifdef CONFIG_BRIDGE_NETFILTER
+	nf_bridge_put(skb->nf_bridge);
+#endif
+#endif
+
+	push_net_ns(rcv->net_ns, orig_net_ns);
+	skb->dev = rcv;
+	skb->pkt_type = PACKET_HOST;
+	skb->protocol = eth_type_trans(skb, rcv);
+
+	length = skb->len;
+	stats->tx_bytes += length;
+	stats->tx_packets++;
+	stats = get_stats(rcv);
+	stats->rx_bytes += length;
+	stats->rx_packets++;
+
+   netif_rx(skb);
+   pop_net_ns(orig_net_ns);
+   return 0;
+
+outf:
+	stats->tx_dropped++;
+   kfree_skb(skb);
+   return 0;
+}
+
+static int veth_open(struct net_device *dev)
+{
+   return 0;
+}
+
+static int veth_close(struct net_device *dev)
+{
+   return 0;
+}
+
+static void veth_destructor(struct net_device *dev)
+{
+   free_netdev(dev);
+}
+
+static struct net_device_stats *get_stats(struct net_device *dev)
+{
+	return &veth_from_netdev(dev)->stats;
+}
+
+int veth_init_dev(struct net_device *dev)
+{
+	dev->hard_start_xmit = veth_xmit;
+	dev->open = veth_open;
+	dev->stop = veth_close;
+	dev->destructor = veth_destructor;
+	dev->get_stats = get_stats;
+
+	ether_setup(dev);
+
+	dev->tx_queue_len = 0;
+   return 0;
+}
+
+static void veth_setup(struct net_device *dev)
+{
+	dev->init = veth_init_dev;
+}
+
+static inline int is_veth_dev(struct net_device *dev)
+{
+	return dev->init == veth_init_dev;
+}
+
+/* --- *
+ *
+ * Management interface
+ *
+ * --- */
+
+struct net_device *veth_dev_alloc(char *name, char *addr)
+{
+   struct net_device *dev;
+
+   dev = alloc_netdev(sizeof(struct veth_struct), name, veth_setup);
+	if (dev != NULL) {
+		memcpy(dev->dev_addr, addr, ETH_ALEN);
+		dev->addr_len = ETH_ALEN;
+	}
+   return dev;
+}
+
+int veth_entry_add(char *parent_name, char *parent_addr,
+   char *child_name, char *child_addr,
+   struct net_namespace *child_ns)
+{
+   struct net_device *parent_dev, *child_dev;
+   struct net_namespace *parent_ns;
+   int err;
+
+   err = -ENOMEM;
+   if ((parent_dev = veth_dev_alloc(parent_name, parent_addr)) == NULL)
+   goto out_alocp;
+   if ((child_dev = veth_dev_alloc(child_name, child_addr)) == NULL)
+   goto out_alocc;
+	veth_from_netdev(parent_dev)->pair = child_dev;
+	veth_from_netdev(child_dev)->pair = parent_dev;
+
+   /*
+* About serialization, see

[PATCH 9/9] network namespaces: playing with pass-through device

2006-08-16 Thread Andrey Savochkin
Temporary code to debug and play with the pass-through device.
Create a device pair with
	modprobe veth
	echo 'add veth1 0:1:2:3:4:1 eth0 0:1:2:3:4:2' > /proc/net/veth_ctl
and your shell will be moved into a new namespace containing the `eth0' device.
Configure the device in this namespace
	ip l s eth0 up
	ip a a 1.2.3.4/24 dev eth0
and in the root namespace
	ip l s veth1 up
	ip a a 1.2.3.1/24 dev veth1
to establish a communication channel between the root namespace and the newly
created one.
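
For readers who want to experiment with the address format used in the
command above, here is a minimal userspace sketch of a parser for
colon-separated MAC addresses with one or two hex digits per octet.  It is
not the code from this patch; the names hex_val() and parse_mac() are made
up for illustration, and ':' is assumed to be the only separator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAC_LEN 6

/* Value of one hex digit; the caller guarantees isxdigit(c). */
static int hex_val(unsigned char c)
{
	return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
}

/*
 * Parse a MAC address such as "0:1:2:3:4:1" or "00:16:3e:aa:bb:cc".
 * Each octet is one or two hex digits; octets are separated by ':'.
 * Returns a pointer just past the address, or NULL on malformed input.
 */
static const char *parse_mac(const char *s, unsigned char mac[MAC_LEN])
{
	int i, v;

	assert(s != NULL && mac != NULL);
	for (i = 0; i < MAC_LEN; i++) {
		if (!isxdigit((unsigned char)*s))
			return NULL;
		v = hex_val((unsigned char)*s++);
		if (isxdigit((unsigned char)*s))
			/* second digit: the first one becomes the high nibble */
			v = (v << 4) | hex_val((unsigned char)*s++);
		mac[i] = (unsigned char)v;
		if (i < MAC_LEN - 1) {
			if (*s != ':')
				return NULL;
			s++;
		}
	}
	return s;
}
```

The returned pointer lets the caller continue with the next token of the
command line, which is how a string of the form "name addr name addr" can be
consumed piece by piece.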

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 veth.c |  113 +
 1 files changed, 113 insertions(+)

--- ./drivers/net/veth.c.veveth-dbg Tue Aug 15 13:47:48 2006
+++ ./drivers/net/veth.cTue Aug 15 14:08:04 2006
@@ -251,6 +251,116 @@ void veth_entry_del_all(void)
 
 /* --- *
  *
+ * Temporary interface to create veth devices
+ *
+ * --- */
+
+#ifdef CONFIG_PROC_FS
+
+static int veth_debug_open(struct inode *inode, struct file *file)
+{
+   return 0;
+}
+
+static char *parse_addr(char *s, char *addr)
+{
+   int i, v;
+
+	for (i = 0; i < ETH_ALEN; i++) {
+		if (!isxdigit(*s))
+			return NULL;
+		*addr = 0;
+		v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+		s++;
+		if (isxdigit(*s)) {
+			/* first digit is the high nibble of the octet */
+			*addr += v << 4;
+			v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+			s++;
+		}
+		*addr++ += v;
+		if (i < ETH_ALEN - 1 && ispunct(*s))
+			s++;
+   }
+   return s;
+}
+
+extern int net_ns_start(void);
+static ssize_t veth_debug_write(struct file *file, const char __user *user_buf,
+   size_t size, loff_t *ppos)
+{
+   char buf[128], *s, *parent_name, *child_name;
+   char parent_addr[ETH_ALEN], child_addr[ETH_ALEN];
+   struct net_namespace *parent_ns, *child_ns;
+   int err;
+
+   s = buf;
+   err = -EINVAL;
+	if (size >= sizeof(buf))
+		goto out;
+	err = -EFAULT;
+	if (copy_from_user(buf, user_buf, size))
+		goto out;
+	buf[size] = 0;
+
+	err = -EBADRQC;
+	if (!strncmp(buf, "add ", 4)) {
+   parent_name = buf + 4;
+   if ((s = strchr(parent_name, ' ')) == NULL)
+   goto out;
+   *s = 0;
+   if ((s = parse_addr(s + 1, parent_addr)) == NULL)
+   goto out;
+   if (!*s)
+   goto out;
+   child_name = s + 1;
+   if ((s = strchr(child_name, ' ')) == NULL)
+   goto out;
+   *s = 0;
+   if ((s = parse_addr(s + 1, child_addr)) == NULL)
+   goto out;
+
+   parent_ns = get_net_ns(current_net_ns);
+   err = net_ns_start();
+   if (err)
+   goto out;
+   /* return to parent context */
+   push_net_ns(parent_ns, child_ns);
+   err = veth_entry_add(parent_name, parent_addr,
+   child_name, child_addr, child_ns);
+   pop_net_ns(child_ns);
+   put_net_ns(parent_ns);
+   if (!err)
+   err = size;
+   }
+out:
+   return err;
+}
+
+static struct file_operations veth_debug_ops = {
+   .open   = veth_debug_open,
+   .write  = veth_debug_write,
+};
+
+static int veth_debug_create(void)
+{
+	proc_net_fops_create("veth_ctl", 0200, &veth_debug_ops);
+   return 0;
+}
+
+static void veth_debug_remove(void)
+{
+	proc_net_remove("veth_ctl");
+}
+
+#else
+
+static int veth_debug_create(void) { return -1; }
+static void veth_debug_remove(void) { }
+
+#endif
+
+/* --- *
+ *
  * Information in proc
  *
  * --- */
@@ -310,12 +420,15 @@ static inline void veth_proc_remove(void
 
 int __init veth_init(void)
 {
+   if (veth_debug_create())
+   return -EINVAL;
veth_proc_create();
return 0;
 }
 
 void __exit veth_exit(void)
 {
+   veth_debug_remove();
veth_proc_remove();
veth_entry_del_all();
 }


Re: [patch 1/7] net_device list cleanup: core

2006-07-10 Thread Andrey Savochkin
On Sat, Jul 08, 2006 at 01:48:13AM +0900, YOSHIFUJI Hideaki / [EMAIL PROTECTED] wrote:
> In article [EMAIL PROTECTED] (at Fri, 7 Jul 2006 11:54:25 +0400), Andrey Savochkin [EMAIL PROTECTED] says:
> 
> > On Fri, Jul 07, 2006 at 01:34:34PM +0900, YOSHIFUJI Hideaki / [EMAIL PROTECTED] wrote:
> > > In article [EMAIL PROTECTED] (at Mon, 3 Jul 2006 12:18:51 +0400), Andrey Savochkin [EMAIL PROTECTED] says:
> > > 
> > > > @@ -3271,22 +3277,22 @@ int unregister_netdevice(struct net_devi
> > > >  
> > > > 	/* And unlink it from device chain. */
> > > > 	for (dp = &dev_base; (d = *dp) != NULL; dp = &d->next) {
> > > 
> > > Why not for_each_netdev?
> > 
> > it's a different list
> 
> Sorry, I still do not understand.
> In other words, why will we still have dev->next?
> After introducing net_device->dev_list, we do not need
> dev->next anymore, do we?

dev->next is removed in the last patch, so that the patch series stays
bisectable.


Re: [patch 1/7] net_device list cleanup: core

2006-07-07 Thread Andrey Savochkin
On Fri, Jul 07, 2006 at 01:34:34PM +0900, YOSHIFUJI Hideaki / [EMAIL PROTECTED] wrote:
> In article [EMAIL PROTECTED] (at Mon, 3 Jul 2006 12:18:51 +0400), Andrey Savochkin [EMAIL PROTECTED] says:
> 
> > @@ -3271,22 +3277,22 @@ int unregister_netdevice(struct net_devi
> >  
> > 	/* And unlink it from device chain. */
> > 	for (dp = &dev_base; (d = *dp) != NULL; dp = &d->next) {
> 
> Why not for_each_netdev?

it's a different list


Re: [patch 1/7] net_device list cleanup: core

2006-07-05 Thread Andrey Savochkin
On Tue, Jul 04, 2006 at 08:35:37PM +0400, A.N.Kuznetsov wrote:
> 
> > Different modules want different kinds of lookup.
> > So, I'm thinking about something like ilookup5.
> 
> > The next question: would people agree to review a patch doing this for
> > net_devices? :)
> 
> One not original suggestion, which did not sound nevertheless:
> to implement netdev_iterate_list() or whatever, update only core
> and a few of devices and deprecate dev_base_head
> with __deprecated_for_modules adding it to
> Documentation/feature-removal-schedule.txt

I like this idea

Andrey


Re: [patch 1/7] net_device list cleanup: core

2006-07-04 Thread Andrey Savochkin
Christoph,

On Mon, Jul 03, 2006 at 06:46:50PM +0100, Christoph Hellwig wrote:
> On Mon, Jul 03, 2006 at 12:18:51PM +0400, Andrey Savochkin wrote:
> > Cleanup of net_device list use in net_dev core and IP.
> > The cleanup consists of
> >  - converting the to list_head, to make the list double-linked (thus making
> >    remove operation O(1)), and list walks more readable;
> >  - introducing of for_each_netdev wrapper over list_for_each.
> 
> When you change all this please make sure dev_base_head is never directly
> accessed anymore, not even through macros and dev_base_head is not exported
> anymore.  That's the only way to keep drivers messing with it.
> 
> Yes, it's a little more work as you need to audit all drivers to see what
> they are doing and find suitable abstractions but it's a must have that
> should have been done a lot earlier.

Hiding dev_base_head can be done by converting first_netdev/next_netdev into
functions and implementing for_each_netdev loop through them.

Or are you talking about abstractions like functions
for_each_netdev/find_netdev with callbacks?
Do you think that hiding the list internals is worth the additional
complexity and substantial increase of the patch size?
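
To make the first option concrete, here is a minimal userspace sketch of what
"first_netdev/next_netdev as functions" might look like.  It is only an
illustration: the list types are simplified stand-ins for the kernel ones,
netdev_list_add() is a hypothetical helper, and locking and reference
counting are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel's list_head and net_device. */
struct list_head { struct list_head *next, *prev; };

struct net_device {
	const char *name;
	struct list_head dev_list;
};

#define dev_entry(ptr) \
	((struct net_device *)((char *)(ptr) - offsetof(struct net_device, dev_list)))

/* The list head is private to this file; callers never see it. */
static struct list_head dev_base_head = { &dev_base_head, &dev_base_head };

/* Hypothetical helper: append a device to the tail of the list. */
void netdev_list_add(struct net_device *dev)
{
	assert(dev != NULL);
	dev->dev_list.prev = dev_base_head.prev;
	dev->dev_list.next = &dev_base_head;
	dev_base_head.prev->next = &dev->dev_list;
	dev_base_head.prev = &dev->dev_list;
}

struct net_device *first_netdev(void)
{
	return dev_base_head.next == &dev_base_head ?
		NULL : dev_entry(dev_base_head.next);
}

struct net_device *next_netdev(struct net_device *dev)
{
	struct list_head *next = dev->dev_list.next;

	return next == &dev_base_head ? NULL : dev_entry(next);
}

/* The loop is built only from the two accessors above. */
#define for_each_netdev(p) \
	for ((p) = first_netdev(); (p) != NULL; (p) = next_netdev(p))
```

With this shape the cursor is NULL after a complete walk, and dev_base_head
would not need to be exported at all.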

Andrey


Re: [patch 1/7] net_device list cleanup: core

2006-07-04 Thread Andrey Savochkin
On Tue, Jul 04, 2006 at 10:10:03AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 04, 2006 at 11:24:05AM +0400, Andrey Savochkin wrote:
> > > Yes, it's a little more work as you need to audit all drivers to see what
> > > they are doing and find suitable abstractions but it's a must have that
> > > should have been done a lot earlier.
> > 
> > Hiding dev_base_head can be done by converting first_netdev/next_netdev into
> > functions and implementing for_each_netdev loop through them.
> > 
> > Or are you talking about abstractions like functions
> > for_each_netdev/find_netdev with callbacks?
> 
> an for_each_netdev with a callback makes sense and gives a cleaner
> abstraction, yes.  I don't think you should need a callback for the lookup
> structure.

Different modules want different kinds of lookup.
So, I'm thinking about something like ilookup5.
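
A rough userspace sketch of the two shapes under discussion, a callback
iterator and an ilookup5-style lookup that takes a caller-supplied test
function.  All names here (netdev_register, netdev_iterate, netdev_lookup,
match_type, count_one) are hypothetical, the device structure is reduced to
two fields, and locking and refcounting are omitted.

```c
#include <assert.h>
#include <stddef.h>

struct net_device {
	const char *name;
	unsigned short type;
	struct net_device *next;	/* private to the "core" below */
};

static struct net_device *dev_list;		/* hidden list head */
static struct net_device **dev_tail = &dev_list;

/* Hypothetical helper: append a device, keeping registration order. */
void netdev_register(struct net_device *dev)
{
	assert(dev != NULL);
	dev->next = NULL;
	*dev_tail = dev;
	dev_tail = &dev->next;
}

/* Call fn for every device; stop early when fn returns non-zero. */
int netdev_iterate(int (*fn)(struct net_device *, void *), void *arg)
{
	struct net_device *dev;
	int ret;

	for (dev = dev_list; dev; dev = dev->next) {
		ret = fn(dev, arg);
		if (ret)
			return ret;
	}
	return 0;
}

/* ilookup5-style lookup: the caller supplies the match predicate. */
struct net_device *netdev_lookup(int (*test)(struct net_device *, void *),
				 void *data)
{
	struct net_device *dev;

	for (dev = dev_list; dev; dev = dev->next)
		if (test(dev, data))
			return dev;
	return NULL;
}

/* Example predicate: match on hardware type. */
static int match_type(struct net_device *dev, void *data)
{
	return dev->type == *(unsigned short *)data;
}

/* Example callback: count devices. */
static int count_one(struct net_device *dev, void *arg)
{
	(void)dev;
	++*(int *)arg;
	return 0;
}
```

Each module brings its own predicate, so the core does not have to grow one
lookup function per kind of search.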

 
> > Do you think that hiding the list internals is worth the additional
> > complexity and substantial increase of the patch size?
> 
> Yes, absolutely.  We've converted scsi hosts and devices from a model
> where drivers could directly access the list to strict iterators in the
> 2.5 series.  It's quite a lot of work as you have to understand what
> the drivers actually do (and to at least 50% they were doing something
> really stupid) and convert them to the right abstractions.

The next question: would people agree to review a patch doing this for
net_devices? :)

Andrey


[patch] bridge: br_dump_ifinfo index fix

2006-07-03 Thread Andrey Savochkin
Fix for inability of br_dump_ifinfo to handle non-zero start index:
loop index never increases when entered with non-zero start.
Spotted by Kirill Korotaev.
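
A reduced userspace model of the problem, for illustration only.  The
function names dump_buggy() and dump_fixed() are made up, and the loop over
bridge ports is replaced by a plain counter; only the control flow around
idx mirrors the real function.

```c
#include <assert.h>

/* Buggy shape: "continue" jumps past the ++idx at the bottom of the body. */
int dump_buggy(int ndev, int s_idx)
{
	int i, idx = 0, dumped = 0;

	assert(s_idx >= 0);
	for (i = 0; i < ndev; i++) {
		if (idx < s_idx)
			continue;	/* idx is never incremented */
		dumped++;
		++idx;
	}
	return dumped;
}

/* Fixed shape: skipped entries still pass through the increment. */
int dump_fixed(int ndev, int s_idx)
{
	int i, idx = 0, dumped = 0;

	assert(s_idx >= 0);
	for (i = 0; i < ndev; i++) {
		if (idx < s_idx)
			goto cont;
		dumped++;
cont:
		++idx;
	}
	return dumped;
}
```

With five entries and a start index of 2, the buggy variant dumps nothing
while the fixed one dumps the remaining three.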

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
Cc: Kirill Korotaev [EMAIL PROTECTED]
---
Against 2.6.17-mm6

--- ./net/bridge/br_netlink.c.vebridge-dump Wed Jun 21 18:53:18 2006
+++ ./net/bridge/br_netlink.c   Mon Jul  3 14:31:03 2006
@@ -117,12 +117,13 @@ static int br_dump_ifinfo(struct sk_buff
continue;
 
 		if (idx < s_idx)
-			continue;
+			goto cont;
 
 		err = br_fill_ifinfo(skb, p, NETLINK_CB(cb->skb).pid,
 				     cb->nlh->nlmsg_seq, RTM_NEWLINK, NLM_F_MULTI);
 		if (err <= 0)
 			break;
+cont:
++idx;
}
 	read_unlock(&dev_base_lock);


[patch 1/7] net_device list cleanup: core

2006-07-03 Thread Andrey Savochkin
Cleanup of net_device list use in net_dev core and IP.
The cleanup consists of
 - converting the device list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.
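
For illustration, a small userspace sketch of why the conversion makes
removal O(1).  The list_head functions are simplified re-implementations of
the kernel helpers of the same name, and slist_del() is a made-up name for
the kind of walk a singly linked chain needs to find a node's predecessor.

```c
#include <assert.h>
#include <stddef.h>

/* Doubly linked, circular list in the style of the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* O(1): only the node's own neighbours are touched, no walk from the head. */
static void list_del(struct list_head *n)
{
	assert(n != NULL);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

/* Singly linked chain, like the old dev->next list. */
struct snode { struct snode *next; };

/*
 * O(n): walk from the head until the link pointing at n is found.
 * Returns the number of links stepped over, or -1 if n is not on the list.
 */
static int slist_del(struct snode **head, struct snode *n)
{
	struct snode **pp;
	int steps = 0;

	for (pp = head; *pp != NULL; pp = &(*pp)->next, steps++) {
		if (*pp == n) {
			*pp = n->next;
			return steps;
		}
	}
	return -1;
}
```

The walk in slist_del() has the same shape as the "unlink it from device
chain" loop in unregister_netdevice() that this series replaces.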

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
Signed-off-by: Kirill Korotaev [EMAIL PROTECTED]
---
 include/linux/netdevice.h |   29 ++-
 net/core/dev.c|   48 +-
 net/ipv4/devinet.c|6 ++---
 net/ipv6/addrconf.c   |8 +++
 net/ipv6/anycast.c|   10 +
 5 files changed, 68 insertions(+), 33 deletions(-)

--- ./include/linux/netdevice.h.vedevbase-core  Mon Jul  3 15:14:15 2006
+++ ./include/linux/netdevice.h Mon Jul  3 16:09:11 2006
@@ -290,7 +290,8 @@ struct net_device
unsigned long   state;
 
struct net_device   *next;
-   
+	struct list_head	dev_list;
+
/* The device initialization function. Called only once. */
int (*init)(struct net_device *dev);
 
@@ -558,8 +559,34 @@ struct packet_type {
 
 extern struct net_device		loopback_dev;		/* The loopback */
 extern struct net_device		*dev_base;		/* All devices */
+extern struct list_head			dev_base_head;		/* All devices */
 extern rwlock_t				dev_base_lock;		/* Device list lock */
 
+#define for_each_netdev(p)	list_for_each_entry(p, &dev_base_head, dev_list)
+
+/*
+ * When possible, it is preferrable to use for_each_netdev() loop
+ * defined above, rather than first_netdev()/next_netdev() macros.
+ * for_each_netdev() loop makes the intentions clearer, and gives more
+ * flexibility in device list implementation.
+ * While next_netdev() is unavoidable in seq_proc functions,
+ * first_netdev() should be needed quite rarely.
+ */
+#define first_netdev()		({ \
+			list_empty(&dev_base_head) ? NULL : \
+				list_entry(dev_base_head.next, \
+					struct net_device, \
+					dev_list); \
+			})
+#define next_netdev(dev)	({ \
+			struct list_head *__next; \
+			__next = (dev)->dev_list.next; \
+			__next == &dev_base_head ? NULL : \
+				list_entry(__next, \
+					struct net_device, \
+					dev_list); \
+			})
+
 extern int netdev_boot_setup_check(struct net_device *dev);
 extern unsigned long   netdev_boot_base(const char *prefix, int unit);
 extern struct net_device	*dev_getbyhwaddr(unsigned short type, char *hwaddr);
--- ./net/core/dev.c.vedevbase-core Mon Jul  3 15:14:19 2006
+++ ./net/core/dev.cMon Jul  3 16:09:11 2006
@@ -181,6 +181,9 @@ DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base);
 EXPORT_SYMBOL(dev_base_lock);
 
+LIST_HEAD(dev_base_head);
+EXPORT_SYMBOL(dev_base_head);
+
 #define NETDEV_HASHBITS	8
 static struct hlist_head dev_name_head[1<<NETDEV_HASHBITS];
 static struct hlist_head dev_index_head[1<<NETDEV_HASHBITS];
@@ -575,11 +578,11 @@ struct net_device *dev_getbyhwaddr(unsig
 
ASSERT_RTNL();
 
-	for (dev = dev_base; dev; dev = dev->next)
+	for_each_netdev(dev)
 		if (dev->type == type &&
 		    !memcmp(dev->dev_addr, ha, dev->addr_len))
-			break;
-	return dev;
+			return dev;
+	return NULL;
 }
 
 EXPORT_SYMBOL(dev_getbyhwaddr);
@@ -589,14 +592,15 @@ struct net_device *dev_getfirstbyhwtype(
struct net_device *dev;
 
rtnl_lock();
-	for (dev = dev_base; dev; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (dev->type == type) {
dev_hold(dev);
-   break;
+   rtnl_unlock();
+   return dev;
}
}
rtnl_unlock();
-   return dev;
+   return NULL;
 }
 
 EXPORT_SYMBOL(dev_getfirstbyhwtype);
@@ -617,14 +621,15 @@ struct net_device * dev_get_by_flags(uns
struct net_device *dev;
 
 	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev != NULL; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (((dev->flags ^ if_flags) & mask) == 0) {
 			dev_hold(dev);
-			break;
+			read_unlock(&dev_base_lock);
+			return dev;
}
}
read_unlock(dev_base_lock

[patch 5/7] net_device list cleanup: arch-dependent code and block devices

2006-07-03 Thread Andrey Savochkin
Cleanup of net_device list use in arch-dependent code and block devices.

The cleanup consists of
 - converting the device list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 arch/s390/appldata/appldata_net_sum.c |2 +-
 arch/sparc64/solaris/ioctl.c  |2 +-
 drivers/block/aoe/aoecmd.c|8 ++--
 drivers/parisc/led.c  |2 +-
 4 files changed, 9 insertions(+), 5 deletions(-)

--- ./arch/s390/appldata/appldata_net_sum.c.vedevbase-misc	Mon Jul  3 15:13:15 2006
+++ ./arch/s390/appldata/appldata_net_sum.c Mon Jul  3 16:16:05 2006
@@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo
tx_dropped = 0;
collisions = 0;
 	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev != NULL; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (dev->get_stats == NULL) {
continue;
}
--- ./arch/sparc64/solaris/ioctl.c.vedevbase-misc   Mon Mar 20 08:53:29 2006
+++ ./arch/sparc64/solaris/ioctl.c  Mon Jul  3 16:16:05 2006
@@ -686,7 +686,7 @@ static inline int solaris_i(unsigned int
int i = 0;

 	read_lock_bh(&dev_base_lock);
-	for (d = dev_base; d; d = d->next) i++;
+	for_each_netdev(d) i++;
 	read_unlock_bh(&dev_base_lock);
 
if (put_user (i, (int __user *)A(arg)))
--- ./drivers/block/aoe/aoecmd.c.vedevbase-misc Mon Jul  3 15:09:57 2006
+++ ./drivers/block/aoe/aoecmd.cMon Jul  3 16:16:05 2006
@@ -204,14 +204,17 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
sl = sl_tail = NULL;
 
 	read_lock(&dev_base_lock);
-	for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp->next) {
+	for_each_netdev(ifp) {
 		dev_hold(ifp);
-		if (!is_aoe_netif(ifp))
+		if (!is_aoe_netif(ifp)) {
+			dev_put(ifp);
 			continue;
+		}
 
 		skb = new_skb(ifp, sizeof *h + sizeof *ch);
 		if (skb == NULL) {
 			printk(KERN_INFO "aoe: aoecmd_cfg: skb alloc failure\n");
+			dev_put(ifp);
 			continue;
 		}
 		if (sl_tail == NULL)
@@ -229,6 +232,7 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
 
 		skb->next = sl;
 		sl = skb;
+		dev_put(ifp);
 	}
 	read_unlock(&dev_base_lock);
 
--- ./drivers/parisc/led.c.vedevbase-misc   Mon Jul  3 15:13:46 2006
+++ ./drivers/parisc/led.c  Mon Jul  3 16:16:05 2006
@@ -367,7 +367,7 @@ static __inline__ int led_get_net_activi
 * for reading should be OK */
 	read_lock(&dev_base_lock);
 	rcu_read_lock();
-	for (dev = dev_base; dev; dev = dev->next) {
+	for_each_netdev(dev) {
 		struct net_device_stats *stats;
 		struct in_device *in_dev = __in_dev_get_rcu(dev);
 		if (!in_dev || !in_dev->ifa_list)


[patch 2/7] net_device list cleanup: proc seq_file output

2006-07-03 Thread Andrey Savochkin
Cleanup of net_device list use in seq_file output routines in core networking
files.  Implementation of /proc/net/dev was copied from dev_mcast, since the
latter did the same in a more compact and cleaner way.

The cleanup consists of
 - converting the device list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
Note: functions covered by this patch are good candidates for further
restructuring by introduction of library routines for seq_file's showing some
information for each device.

 core/dev.c   |   23 +++
 core/dev_mcast.c |4 ++--
 ipv4/igmp.c  |   25 +++--
 ipv6/anycast.c   |   12 +++-
 ipv6/mcast.c |   25 +++--
 5 files changed, 50 insertions(+), 39 deletions(-)

--- ./net/core/dev.c.vedevbase-proc Mon Jul  3 16:09:54 2006
+++ ./net/core/dev.cMon Jul  3 16:09:54 2006
@@ -2072,26 +2072,25 @@ static int dev_ifconf(char __user *arg)
  * This is invoked by the /proc filesystem handler to display a device
  * in detail.
  */
-static __inline__ struct net_device *dev_get_idx(loff_t pos)
-{
-   struct net_device *dev;
-   loff_t i;
-
-	for (i = 0, dev = dev_base; dev && i < pos; ++i, dev = dev->next);
-
-   return i == pos ? dev : NULL;
-}
-
 void *dev_seq_start(struct seq_file *seq, loff_t *pos)
 {
+   struct net_device *dev;
+   loff_t off = 1;
 	read_lock(&dev_base_lock);
-   return *pos ? dev_get_idx(*pos - 1) : SEQ_START_TOKEN;
+   if (!*pos)
+   return SEQ_START_TOKEN;
+   for_each_netdev(dev) {
+   if (off++ == *pos)
+   return dev;
+   }
+   return NULL;
 }
 
 void *dev_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
+   struct net_device *dev = v;
++*pos;
-	return v == SEQ_START_TOKEN ? dev_base : ((struct net_device *)v)->next;
+   return v == SEQ_START_TOKEN ? first_netdev() : next_netdev(dev);
 }
 
 void dev_seq_stop(struct seq_file *seq, void *v)
--- ./net/core/dev_mcast.c.vedevbase-proc   Mon Jul  3 15:14:19 2006
+++ ./net/core/dev_mcast.c  Mon Jul  3 16:09:54 2006
@@ -225,7 +225,7 @@ static void *dev_mc_seq_start(struct seq
loff_t off = 0;
 
 	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev; dev = dev->next) {
+   for_each_netdev(dev) {
if (off++ == *pos) 
return dev;
}
@@ -236,7 +236,7 @@ static void *dev_mc_seq_next(struct seq_
 {
struct net_device *dev = v;
++*pos;
-	return dev->next;
+   return next_netdev(dev);
 }
 
 static void dev_mc_seq_stop(struct seq_file *seq, void *v)
--- ./net/ipv4/igmp.c.vedevbase-procMon Jul  3 15:14:20 2006
+++ ./net/ipv4/igmp.c   Mon Jul  3 16:09:54 2006
@@ -2254,19 +2254,21 @@ struct igmp_mc_iter_state {
 
 static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 {
+   struct net_device *dev;
struct ip_mc_list *im = NULL;
struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
 
-	for (state->dev = dev_base, state->in_dev = NULL;
-	     state->dev;
-	     state->dev = state->dev->next) {
+	state->dev = NULL;
+	state->in_dev = NULL;
+	for_each_netdev(dev) {
 		struct in_device *in_dev;
-		in_dev = in_dev_get(state->dev);
+		in_dev = in_dev_get(dev);
 		if (!in_dev)
 			continue;
 		read_lock(&in_dev->mc_list_lock);
 		im = in_dev->mc_list;
 		if (im) {
+			state->dev = dev;
 			state->in_dev = in_dev;
 			break;
 		}
@@ -2285,7 +2287,7 @@ static struct ip_mc_list *igmp_mc_get_ne
 			read_unlock(&state->in_dev->mc_list_lock);
 			in_dev_put(state->in_dev);
 		}
-		state->dev = state->dev->next;
+		state->dev = next_netdev(state->dev);
 		if (!state->dev) {
 			state->in_dev = NULL;
 			break;
@@ -2416,15 +2418,17 @@ struct igmp_mcf_iter_state {
 
 static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 {
+   struct net_device *dev;
struct ip_sf_list *psf = NULL;
struct ip_mc_list *im = NULL;
struct igmp_mcf_iter_state *state = igmp_mcf_seq_private(seq);
 
-	for (state->dev = dev_base, state->idev = NULL, state->im = NULL;
-	     state->dev;
-	     state->dev = state->dev->next) {
+	state->dev = NULL;
+	state->im = NULL;
+	state->idev = NULL;
+	for_each_netdev(dev) {
 		struct in_device *idev;
-		idev = in_dev_get(state->dev);
+		idev = in_dev_get(dev);
if (unlikely

[patch 7/7] net_device list cleanup: debugging

2006-07-03 Thread Andrey Savochkin
Optional code to catch cases when loop cursor is used after for_each_netdev
loop: often it's a sign of a bug, since it isn't guaranteed to point to a
device.
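
A userspace sketch of the pattern this check is meant to catch.  The list
helpers are simplified stand-ins for the kernel ones, for_each_netdev() here
takes the list head as an extra argument (an assumption made to keep the
example self-contained), and find_buggy()/find_ok() are made-up names.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct net_device {
	int ifindex;
	struct list_head dev_list;
};

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define for_each_netdev(pos, head) \
	for (pos = list_entry((head)->next, struct net_device, dev_list); \
	     &pos->dev_list != (head); \
	     pos = list_entry(pos->dev_list.next, struct net_device, dev_list))

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	assert(n != NULL && h != NULL);
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/*
 * Old habit from the dev->next days: break out of the loop and return
 * the cursor.  On a miss the cursor is not NULL; it is computed from
 * the list head itself and does not point to any device.
 */
struct net_device *find_buggy(struct list_head *head, int ifindex)
{
	struct net_device *dev;

	for_each_netdev(dev, head)
		if (dev->ifindex == ifindex)
			break;
	return dev;
}

/* Safe shape: return from inside the loop, NULL after it. */
struct net_device *find_ok(struct list_head *head, int ifindex)
{
	struct net_device *dev;

	for_each_netdev(dev, head)
		if (dev->ifindex == ifindex)
			return dev;
	return NULL;
}
```

With the singly linked list the first form happened to work, because the
cursor became NULL at the end of the chain; with list_head it silently
returns a bogus pointer, which is the case the debugging macro tries to turn
into a visible failure.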

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
If anyone wants to keep this under some debug config option,
let me know which one.

 netdevice.h |8 +++-
 1 files changed, 7 insertions(+), 1 deletion(-)

--- ./include/linux/netdevice.h.vedevbase-dbg   Mon Jul  3 16:16:51 2006
+++ ./include/linux/netdevice.h Mon Jul  3 16:16:51 2006
@@ -560,7 +560,13 @@ extern struct net_device   loopback_dev;   
 extern struct list_head			dev_base_head;		/* All devices */
 extern rwlock_t				dev_base_lock;		/* Device list lock */
 
-#define for_each_netdev(p)	list_for_each_entry(p, &dev_base_head, dev_list)
+#define for_each_netdev(pos)						\
+	for (pos = list_entry(dev_base_head.next, typeof(*pos), dev_list); \
+	     prefetch(pos->dev_list.next),				\
+		&pos->dev_list != &dev_base_head ? :			\
+		({ void *__check_dev_use_after_for_each_netdev;	\
+		   pos = __check_dev_use_after_for_each_netdev; 0; }); \
+	     pos = list_entry(pos->dev_list.next, typeof(*pos), dev_list))
 
 /*
  * When possible, it is preferrable to use for_each_netdev() loop


[patch 4/7] net_device list cleanup: drivers and non-IP protocols

2006-07-03 Thread Andrey Savochkin
Cleanup of net_device list use in network device drivers and protocols
other than IP.

The cleanup consists of
 - converting the device list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
Requires bridge: br_dump_ifinfo index fix

 drivers/net/wireless/strip.c |4 +---
 net/8021q/vlan.c |4 ++--
 net/8021q/vlanproc.c |   10 +-
 net/bridge/br_if.c   |4 ++--
 net/bridge/br_ioctl.c|4 +++-
 net/bridge/br_netlink.c  |3 ++-
 net/decnet/af_decnet.c   |   11 +++
 net/decnet/dn_dev.c  |   17 ++---
 net/decnet/dn_fib.c  |2 +-
 net/decnet/dn_route.c|   13 +++--
 net/llc/llc_core.c   |7 +--
 net/netrom/nr_route.c|5 +++--
 net/rose/rose_route.c|8 +---
 net/sctp/protocol.c  |2 +-
 net/tipc/eth_media.c |   11 +++
 15 files changed, 61 insertions(+), 44 deletions(-)

--- ./drivers/net/wireless/strip.c.vedevbase-onet   Mon Jul  3 15:13:46 2006
+++ ./drivers/net/wireless/strip.c  Mon Jul  3 16:12:11 2006
@@ -1969,8 +1969,7 @@ static struct net_device *get_strip_dev(
  sizeof(zero_address))) {
struct net_device *dev;
 		read_lock_bh(&dev_base_lock);
-		dev = dev_base;
-		while (dev) {
+		for_each_netdev(dev) {
 			if (dev->type == strip_info->dev->type &&
 			    !memcmp(dev->dev_addr,
 				    &strip_info->true_dev_addr,
@@ -1981,7 +1980,6 @@ static struct net_device *get_strip_dev(
 				read_unlock_bh(&dev_base_lock);
 				return (dev);
 			}
-			dev = dev->next;
 		}
 		read_unlock_bh(&dev_base_lock);
}
--- ./net/8021q/vlan.c.vedevbase-onet   Mon Jul  3 15:14:17 2006
+++ ./net/8021q/vlan.c  Mon Jul  3 16:12:11 2006
@@ -121,8 +121,8 @@ static void __exit vlan_cleanup_devices(
struct net_device *dev, *nxt;
 
rtnl_lock();
-	for (dev = dev_base; dev; dev = nxt) {
-		nxt = dev->next;
+	for (dev = first_netdev(); dev; dev = nxt) {
+		nxt = next_netdev(dev);
 		if (dev->priv_flags & IFF_802_1Q_VLAN) {
 			unregister_vlan_dev(VLAN_DEV_INFO(dev)->real_dev,
 					    VLAN_DEV_INFO(dev)->vlan_id);
--- ./net/8021q/vlanproc.c.vedevbase-onet   Mon Jul  3 15:14:17 2006
+++ ./net/8021q/vlanproc.c  Mon Jul  3 16:12:11 2006
@@ -241,7 +241,7 @@ int vlan_proc_rem_dev(struct net_device 
 static struct net_device *vlan_skip(struct net_device *dev) 
 {
 	while (dev && !(dev->priv_flags & IFF_802_1Q_VLAN))
-		dev = dev->next;
+		dev = next_netdev(dev);
 
return dev;
 }
@@ -257,8 +257,8 @@ static void *vlan_seq_start(struct seq_f
if (*pos == 0)
return SEQ_START_TOKEN;

-	for (dev = vlan_skip(dev_base); dev && i < *pos;
-	     dev = vlan_skip(dev->next), ++i);
+	for (dev = vlan_skip(first_netdev()); dev && i < *pos;
+	     dev = vlan_skip(next_netdev(dev)), ++i);

return  (i == *pos) ? dev : NULL;
 } 
@@ -268,8 +268,8 @@ static void *vlan_seq_next(struct seq_fi
++*pos;
 
return vlan_skip((v == SEQ_START_TOKEN)  
-   ? dev_base 
-			    : ((struct net_device *)v)->next);
+   ? first_netdev()
+   : next_netdev((struct net_device *)v));
 }
 
 static void vlan_seq_stop(struct seq_file *seq, void *v)
--- ./net/bridge/br_if.c.vedevbase-onet Mon Jul  3 15:14:19 2006
+++ ./net/bridge/br_if.cMon Jul  3 16:12:11 2006
@@ -474,8 +474,8 @@ void __exit br_cleanup_bridges(void)
struct net_device *dev, *nxt;
 
rtnl_lock();
-	for (dev = dev_base; dev; dev = nxt) {
-		nxt = dev->next;
+	for (dev = first_netdev(); dev; dev = nxt) {
+		nxt = next_netdev(dev);
 		if (dev->priv_flags & IFF_EBRIDGE)
 			del_br(dev->priv);
}
--- ./net/bridge/br_ioctl.c.vedevbase-onet  Mon Mar 20 08:53:29 2006
+++ ./net/bridge/br_ioctl.c Mon Jul  3 16:12:11 2006
@@ -27,7 +27,9 @@ static int get_bridge_ifindices(int *ind
struct net_device *dev;
int i = 0;
 
-	for (dev = dev_base; dev && i < num; dev = dev->next) {
+	for_each_netdev(dev) {
+		if (i >= num)
+			break;
 		if (dev->priv_flags & IFF_EBRIDGE)
 			indices[i++] = dev->ifindex;
}
--- ./net/bridge/br_netlink.c.vedevbase-onetMon Jul  3 16:12:11 2006

[patch 3/7] net_device list cleanup: netlink_dump

2006-07-03 Thread Andrey Savochkin
Cleanup of net_device list use in netlink_dump routines in core networking
files.

The cleanup consists of
 - converting the device list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 core/rtnetlink.c |   18 ++
 ipv4/devinet.c   |   14 --
 ipv6/addrconf.c  |   20 +---
 sched/sch_api.c  |8 ++--
 4 files changed, 37 insertions(+), 23 deletions(-)

--- ./net/core/rtnetlink.c.vedevbase-dump   Mon Jul  3 15:14:19 2006
+++ ./net/core/rtnetlink.c  Mon Jul  3 16:10:12 2006
@@ -319,14 +319,16 @@ static int rtnetlink_dump_ifinfo(struct 
struct net_device *dev;
 
read_lock(&dev_base_lock);
-   for (dev=dev_base, idx=0; dev; dev = dev->next, idx++) {
-   if (idx < s_idx)
-   continue;
-   if (rtnetlink_fill_ifinfo(skb, dev, RTM_NEWLINK,
- NETLINK_CB(cb->skb).pid,
- cb->nlh->nlmsg_seq, 0,
- NLM_F_MULTI) <= 0)
-   break;
+   idx = 0;
+   for_each_netdev(dev) {
+   if (idx >= s_idx) {
+   if (rtnetlink_fill_ifinfo(skb, dev, RTM_NEWLINK,
+ NETLINK_CB(cb->skb).pid,
+ cb->nlh->nlmsg_seq, 0,
+ NLM_F_MULTI) <= 0)
+   break;
+   }
+   idx++;
}
read_unlock(&dev_base_lock);
cb->args[0] = idx;
--- ./net/ipv4/devinet.c.vedevbase-dump Mon Jul  3 16:10:12 2006
+++ ./net/ipv4/devinet.c    Mon Jul  3 16:10:12 2006
@@ -1094,18 +1094,17 @@ static int inet_dump_ifaddr(struct sk_bu
struct in_ifaddr *ifa;
int s_ip_idx, s_idx = cb->args[0];
 
+   idx = 0;
s_ip_idx = ip_idx = cb->args[1];
read_lock(&dev_base_lock);
-   for (dev = dev_base, idx = 0; dev; dev = dev->next, idx++) {
+   for_each_netdev(dev) {
if (idx < s_idx)
-   continue;
+   goto cont;
if (idx > s_idx)
s_ip_idx = 0;
rcu_read_lock();
-   if ((in_dev = __in_dev_get_rcu(dev)) == NULL) {
-   rcu_read_unlock();
-   continue;
-   }
+   if ((in_dev = __in_dev_get_rcu(dev)) == NULL)
+   goto cont_unlock;
 
for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
 ifa = ifa->ifa_next, ip_idx++) {
@@ -1118,7 +1117,10 @@ static int inet_dump_ifaddr(struct sk_bu
goto done;
}
}
+cont_unlock:
rcu_read_unlock();
+cont:
+   idx++;
}
 
 done:
--- ./net/ipv6/addrconf.c.vedevbase-dump    Mon Jul  3 16:10:12 2006
+++ ./net/ipv6/addrconf.c   Mon Jul  3 16:10:12 2006
@@ -3013,18 +3013,19 @@ static int inet6_dump_addr(struct sk_buf
struct ifmcaddr6 *ifmca;
struct ifacaddr6 *ifaca;
 
+   idx = 0;
s_idx = cb->args[0];
s_ip_idx = ip_idx = cb->args[1];
read_lock(&dev_base_lock);

-   for (dev = dev_base, idx = 0; dev; dev = dev->next, idx++) {
+   for_each_netdev(dev) {
if (idx < s_idx)
-   continue;
+   goto cont;
if (idx > s_idx)
s_ip_idx = 0;
ip_idx = 0;
if ((idev = in6_dev_get(dev)) == NULL)
-   continue;
+   goto cont;
read_lock_bh(&idev->lock);
switch (type) {
case UNICAST_ADDR:
@@ -3071,6 +3072,8 @@ static int inet6_dump_addr(struct sk_buf
}
read_unlock_bh(&idev->lock);
in6_dev_put(idev);
+cont:
+   idx++;
}
 done:
if (err <= 0) {
@@ -3238,17 +3241,20 @@ static int inet6_dump_ifinfo(struct sk_b
struct net_device *dev;
struct inet6_dev *idev;
 
+   idx = 0;
read_lock(&dev_base_lock);
-   for (dev=dev_base, idx=0; dev; dev = dev->next, idx++) {
+   for_each_netdev(dev) {
if (idx < s_idx)
-   continue;
+   goto cont;
if ((idev = in6_dev_get(dev)) == NULL)
-   continue;
+   goto cont;
err = inet6_fill_ifinfo(skb, idev, NETLINK_CB(cb->skb).pid,
cb->nlh->nlmsg_seq, RTM_NEWLINK, NLM_F_MULTI);
in6_dev_put(idev);
if (err <= 0)
break;
+cont:
+   idx

Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-30 Thread Andrey Savochkin
Jamal,

On Fri, Jun 30, 2006 at 09:50:52AM -0400, jamal wrote:
 
 BTW - I was just looking at openvz, very impressive. To the other folks,

Thanks!

 I am not putting down any of your approaches - just havent
 had time to study them. Andrey, this is the same thing you guys have
 been working on for a few years now, you just changed the name, correct?

The relations are more complicated than just the change of name,
but yes, OpenVZ represents the result of our work for a few years.

 
 Ok, since you guys are encouraging me to speak, here goes ;-
 Hopefully this addresses the other email from Herbert et al.
 
[snip]
 // create the guest
 [host-node]# vzctl create 101 --ostemplate fedora-core-5-minimal 
 // create guest101::eth0, seems to only create config to boot up with 
 [host-node]# vzctl create 101 --netdev eth0
 // bootup guest101
 [host-node]# vzctl start 101
 
 As soon as bootup of guest101 happens, creating guest101::eth0 should 
 activate 
 creation of the host side netdevice. This could be triggered for example by
 the netlink event message seen on host which is a result of creating 
 guest101::eth0 
 Which means control sits purely in user space.

I'd like to clarify your idea: is this host-side device a real
device capable of receiving and transmitting packets (by moving them between
namespaces), or is it a fake device that only creates a view of another
namespace's devices?

[snip]
  However, I oppose the idea of automatic mirroring of _all_ devices appearing
  inside some namespaces (guests) to another namespace (the host).
  This clearly goes against the concept of namespaces as independent realms,
  and creates a lot of problems with applications running in the host, hotplug
  scripts and so on.
  
 
 I was thinking that the host side is the master i.e you can peek at
 namespaces in the guest from the host.

A host (master)-guest relationship is a valid and useful scheme.
However, I'm thinking about a broader application of network namespaces,
where they can form an arbitrary tree and need not be in a host-guest
relationship.

 Also note that having the pass through device allows for guests to be
 connected via standard linux schemes in the host side (bridge, point
 routes, tc redirect etc); so you dont need a speacial device to hook
 them together.

What do you mean by a pass-through device?
Do you mean using guest1-tun0 as a backdoor to talk to the guest?

 
Then the pragmatic question becomes how to correlate what you see from
`ip addr list' to guests.
   
   on the host ip addr and the one seen on the guest side are the same.
   Except one is seen (on the host) on guest0-eth0 and another is seen 
   on eth0 (on guest).
  
  Then what to do if the host system has 10.0.0.1 as a private address on 
  eth3,
  and then interfaces guest1-tun0 and guest2-tun0 both get address 10.0.0.1
  when each guest has added 10.0.0.1 to their tun0 device?
  
 
 Yes, that would be a conflict that needs to be resolved. If you look at
 ip addresses as also belonging to namespaces, then it should work, no?
 i am assuming a tag at the ifa table level.

I'm not sure, it's complicated.
You wouldn't want automatic local routes to be added for IP addresses on
the host-side interfaces, right?
Do you expect these IP addresses to act as local addresses in other places,
like answering ARP requests for these IPs on all physical devices?

But anyway, you'll have conflicts on the application level.
Many programs like ntpd, bind, and others fetch the device list using the
same ioctls as ifconfig, and make (un)intelligent decisions based on what
they see.
Mirroring may have some advantages if I am both host and guest administrator.
But if I create a namespace for my friend Joe to play with IPv6 and sit
tunnels, why should I face inconveniences because of what he does there?

Best regards

Andrey
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Network namespaces a path to mergable code.

2006-06-28 Thread Andrey Savochkin
Hi Eric,

On Tue, Jun 27, 2006 at 10:20:32PM -0600, Eric W. Biederman wrote:
 Andrey Savochkin [EMAIL PROTECTED] writes:
[snip]
  My first patchset covers devices but not sockets.
  The only difference from what you're suggesting is ipv4 routing.
  For me, it is not less important than devices and sockets.  May be even
  more important, since routing exposes design deficiencies less obvious at
  socket level.
 
 I agree we need to do it.  I mostly want a base that allows us to 
 not need to convert the whole network stack at once and still be able
 to merge code all the way to the stable kernel.
 
 The routing code is important for understanding design choices.  It
 isn't important for merging if that makes sense.   

Ok, fine.
Now I'm working on socket code.

We still have a question about implicit vs explicit function parameters.
This question becomes more important for sockets: if we want to allow the use
of sockets belonging to namespaces other than the current one, we need to do
something about it.

One possible option to resolve this question is to show 2 relatively short
patches just introducing namespaces for sockets in 2 ways: with explicit
function parameters and using implicit current context.
Then people can compare them and vote.
Do you think it's worth the effort?

 
 For everyone looking at routing choices the IPv6 routing table is
 interesting because it does not use a hash table, and seems quite
 possibly to be an equally fast structure that scales better.
 
 There is something to think about there.

Sure

[snip]
 
  Can you summarize you objections against my way of handling devices, please?
  And what was the typo you referred to in your letter to Kirill Korotaev?
 
 I have no fundamental objects to the content I have seen so far.
 
 Please read the first email Kirill responded too.  I quoted a couple
 of sections of code and described the bugs I saw with the patch.

I found your comments, thank you!

 
 All minor things.  The typo I was referring to was a section where the
 original iteration was on an ifp variable and you called it dev
 without changing the rest of the code in that section.  
 
 The only big issue was that the patch too big, and should be split
 into a patchset for better review.  One patch for the new functions,
 and the an additional patch for each driver/subsystem hunk describing
 why that chunk needed to be changed.

I'll split the patch.

 I'm still curious why many of those chunks can't use existing helper
 functions, to be cleaned up.

What helper functions are you referring to?

Best regards

Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-28 Thread Andrey Savochkin
Hi Jamal,

On Wed, Jun 28, 2006 at 09:53:23AM -0400, jamal wrote:
 
 On Wed, 2006-28-06 at 15:36 +0200, Herbert Poetzl wrote:
 
  note: personally I'm absolutely not against virtualizing
  the device names so that each guest can have a separate
  name space for devices, but there should be a way to
  'see' _and_ 'identify' the interfaces from outside
  (i.e. host or spectator context)
  
 
 Makes sense for the host side to have naming convention tied
 to the guest. Example as a prefix: guest0-eth0. Would it not
 be interesting to have the host also manage these interfaces
 via standard tools like ip or ifconfig etc? i.e if i admin up
 guest0-eth0, then the user in guest0 will see its eth0 going
 up.

Seeing guestXX-eth0 interfaces by standard tools has certain attractive
sides.  But it creates a lot of undesired side effects.

For example, ntpd queries all network devices by the same ioctls as ifconfig,
and creates separate sockets bound to IP addresses of each device, which is
certainly not desired with namespaces.

Or a more subtle question: do you want hotplug events to be generated when
the guest0-eth0 interface comes up in the root namespace, and standard scripts
to try to set some IP address on this interface?..

In my opinion, the downside of this scheme outweighs the possible advantages,
and I'm personally quite happy with running commands with a switched namespace,
like
vzctl exec guest0 ip addr list
vzctl exec guest0 ip link set eth0 up
and so on.

Best regards

Andrey


Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces

2006-06-28 Thread Andrey Savochkin
Daniel,

On Wed, Jun 28, 2006 at 03:51:32PM +0200, Daniel Lezcano wrote:
 Daniel Lezcano wrote:
  Andrey Savochkin wrote:
  
  Structures related to IPv4 routing (FIB and routing cache)
  are made per-namespace.
 
 Hi Andrey,
 
 if the ressources are private to the namespace, how do you will handle 
 NFS mounted before creating the network namespace ? Do you take care of 
 that or simply assume you can't access NFS anymore ?

This is a question that brings up another level of interaction between
networking and the rest of the kernel code.
The solution that I use now makes the NFS communication part always run in
the root namespace.  This is open to discussion, of course, but it's a far
more complicated matter than just device lists or routing :)

Best regards

Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-28 Thread Andrey Savochkin
On Wed, Jun 28, 2006 at 12:17:35PM -0400, jamal wrote:
 
 On Wed, 2006-28-06 at 18:19 +0400, Andrey Savochkin wrote:
  
  Seeing guestXX-eth0 interfaces by standard tools has certain attractive
  sides.  But it creates a lot of undesired side effects.
  
 
 I apologize because i butted into the discussion without perhaps reading
 the full thread. 

Your comments are quite welcome

 
  For example, ntpd queries all network devices by the same ioctls as 
  ifconfig,
  and creates separate sockets bound to IP addresses of each device, which is
  certainly not desired with namespaces.
  
 
 Ok, so the problem is that ntp in this case runs on the host side as

yes

 opposed to the guest? This would explain why Eric is reacting vehemently
 to the suggestion.

:)

And I actually do not want to distinguish host and guest sides much.
They are namespaces in the first place.
A parent namespace may have some capabilities to manipulate its child
namespaces, such as donating one of its own devices to one of its children.

But that comes second to having namespace isolation borders,
in particular because most cases of cross-namespace interaction lead to
failures of formal security models and an inability to migrate
namespaces between computers.

 
  Or more subtle question: do you want hotplug events to be generated when
  guest0-eth0 interface comes up in the root namespace, and standard scripts
  to try to set some IP address on this interface?..
  
 
 yes, thats what i was thinking. Even go further and actually create
 guestxx-eth0 on the host (which results in creating eth0 on the guest)
 and other things.

This actually goes in the opposite direction to what I keep in mind.
I want to offload as much as possible of network administration work to
guests.  Delegation of management is one of the motivating factors
behind covering not only sockets but devices, routes, and so on by the
namespace patches.

 
  In my opinion, the downside of this scheme overweights possible advantages,
  and I'm personally quite happy with running commands with switched 
  namespace,
  like
  vzctl exec guest0 ip addr list
  vzctl exec guest0 ip link set eth0 up
  and so on.
 
 Ok, above may be good enough and doesnt require any state it seems on
 the host side. 
 I got motivated when the word migration was mentioned. I understood it
 to be meaning that a guest may become inoperative for some reason and
 that its info will be transfered to another guest which may be local or
 even remote. In such a case, clearly one would need a protocol and the
 state of all guests sitting at the host. Maybe i am over-reaching. 

Migration will work inside the kernel, so it has full access
to whatever state information it needs.

Best regards

Andrey


Re: Network namespaces a path to mergable code.

2006-06-28 Thread Andrey Savochkin
Hi Eric,

On Wed, Jun 28, 2006 at 10:51:26AM -0600, Eric W. Biederman wrote:
 Andrey Savochkin [EMAIL PROTECTED] writes:
 
  One possible option to resolve this question is to show 2 relatively short
  patches just introducing namespaces for sockets in 2 ways: with explicit
  function parameters and using implicit current context.
  Then people can compare them and vote.
  Do you think it's worth the effort?
 
 Given that we have two strong opinions in different directions I think it
 is worth the effort to resolve this.

Do you have time to extract the necessary parts of your old patch?
Or are you not afraid to let me draft an alternative version of socket
namespaces based on your code? :)

 
 In a slightly different vein your second patch introduced a lot
 of #ifdef CONFIG_NET_NS in C files.  That is something we need to look closely
 at.
 
 So I think the abstraction that we use to access per network namespace
 variables needs some work if we are going to allow the ability to compile
 out all of the namespace code.  The explicit versus implicit lookup is just
 one dimension of that problem.

This is a good comment.

Those ifdef's mostly correspond to places where we walk over lists
and need to filter out entities not belonging to a specific namespace.
Those places are about the same in your implementation and mine.
We can think about what we can do with them.
One trick that I used on several occasions is the net_ns_same macro,
which doesn't evaluate its arguments if CONFIG_NET_NS is not defined,
and thus can be used without ifdef's.
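
A hedged sketch of how such a macro could look; the actual net_ns_same
definition in the patch may differ. struct demo_dev, struct demo_net_ns, and
all demo_* names are assumptions made up for this example.

```c
#include <assert.h>

/* Build with -DCONFIG_NET_NS to get the namespace-aware variant. */

struct demo_net_ns {
    int id;
};

struct demo_dev {
    const char *name;
#ifdef CONFIG_NET_NS
    struct demo_net_ns *ns; /* field exists only when namespaces are built in */
#endif
};

#ifdef CONFIG_NET_NS
#define demo_net_ns_same(a, b) ((a) == (b))
#else
/* The preprocessor drops both arguments, so they are never evaluated
 * and may even name fields that do not exist in this configuration. */
#define demo_net_ns_same(a, b) (1)
#endif

/* Count the devices visible from namespace cur, with no #ifdef in the body. */
static int demo_count_visible(const struct demo_dev *devs, int n,
                              struct demo_net_ns *cur)
{
    int i, count = 0;

    (void)cur; /* unused when CONFIG_NET_NS is off */
    for (i = 0; i < n; i++)
        if (demo_net_ns_same(devs[i].ns, cur))
            count++;
    return count;
}
```

With CONFIG_NET_NS undefined the loop body compiles even though struct
demo_dev has no ns field, and the filter folds to a constant.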

Returning to implicit vs explicit function arguments, I believe that implicit
arguments are more promising in having zero impact on the code when
CONFIG_NET_NS is disabled.
Functions like inet_addr_type will translate into exactly the same code as
they did without the net namespace patches.
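
The two calling conventions can be contrasted in a small user-space sketch.
Everything here is hypothetical: demo_current_ns stands in for the implicit
current context, and demo_push_ns/demo_pop_ns only imitate the
push_net_ns-style context switch used elsewhere in the patches.

```c
#include <assert.h>

struct demo_ns {
    unsigned int local_addr; /* one local address per namespace, for brevity */
};

static struct demo_ns demo_root_ns = { 0x0a000001u };
static struct demo_ns *demo_current_ns = &demo_root_ns;

/* Explicit variant: every caller must obtain and pass the namespace. */
static int demo_addr_is_local_explicit(const struct demo_ns *ns, unsigned int addr)
{
    return ns->local_addr == addr;
}

/* Implicit variant: same signature as pre-namespace code; the namespace
 * comes from the current context. */
static int demo_addr_is_local(unsigned int addr)
{
    return demo_current_ns->local_addr == addr;
}

/* Temporarily switch the current context, saving the old one. */
#define demo_push_ns(ns, saved) \
    do { (saved) = demo_current_ns; demo_current_ns = (ns); } while (0)
#define demo_pop_ns(saved) \
    do { demo_current_ns = (saved); } while (0)
```

The implicit form leaves existing call sites untouched, at the price of an
explicit push/pop wherever code acts on behalf of another namespace.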

 
  I'm still curious why many of those chunks can't use existing helper
  functions, to be cleaned up.
 
  What helper functions are you referring to?
 
 Basically most of the device list walker functions live in.
 net/core/dev.c 
 
 I don't know if the cases you fixed could have used any of those
 helper functions but it certainly has me asking that question.
 
 A general pattern that happens in cleanups is the discovery
 that code using an old interface in a problematic way really
 could be done much better another way.  I didn't dig enough
 to see if that was the case in any of the code that you changed.

Well, there is an obvious improvement of this kind: many protocols walk over
the device list to find devices with non-NULL protocol-specific pointers.
For example, IPv6, decnet and others do it on module unloading to clean up.
Those places just ask for some simpler standard way of doing it, but I wasn't
bold enough for such a radical change.
Do you think I should try?
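
One possible shape for such a standard helper, as a sketch only. No such
function is proposed in the patches; demo_for_each_proto_dev, its callback
types, and the reduced struct demo_netdev are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced, hypothetical device with two protocol-specific pointers. */
struct demo_netdev {
    int ifindex;
    void *ip6_ptr;
    void *dn_ptr;
    struct demo_netdev *next;
};

typedef void *(*demo_proto_ptr_fn)(struct demo_netdev *dev);
typedef void (*demo_cleanup_fn)(struct demo_netdev *dev);

/* Call fn on every device whose protocol pointer is non-NULL.
 * Returns the number of devices visited by fn. */
static int demo_for_each_proto_dev(struct demo_netdev *first,
                                   demo_proto_ptr_fn get, demo_cleanup_fn fn)
{
    struct demo_netdev *dev, *nxt;
    int n = 0;

    for (dev = first; dev; dev = nxt) {
        nxt = dev->next; /* fetched first: fn may unlink dev */
        if (get(dev) != NULL) {
            fn(dev);
            n++;
        }
    }
    return n;
}

/* Example accessor and cleanup for the IPv6-style pointer. */
static void *demo_get_ip6(struct demo_netdev *dev)
{
    return dev->ip6_ptr;
}

static void demo_clear_ip6(struct demo_netdev *dev)
{
    dev->ip6_ptr = NULL;
}
```

A protocol's module-unload path would then reduce to one call with its own
accessor and cleanup callbacks.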

Best regards

Andrey


Re: Network namespaces a path to mergable code.

2006-06-28 Thread Andrey Savochkin
On Wed, Jun 28, 2006 at 12:14:41PM -0600, Eric W. Biederman wrote:
 Andrey Savochkin [EMAIL PROTECTED] writes:
 
  On Wed, Jun 28, 2006 at 10:51:26AM -0600, Eric W. Biederman wrote:
  Andrey Savochkin [EMAIL PROTECTED] writes:
  
   One possible option to resolve this question is to show 2 relatively 
   short
   patches just introducing namespaces for sockets in 2 ways: with explicit
   function parameters and using implicit current context.
   Then people can compare them and vote.
   Do you think it's worth the effort?
  
  Given that we have two strong opinions in different directions I think it
  is worth the effort to resolve this.
 
  Do you have time to extract necessary parts of your old patch?
  Or you aren't afraid of letting me draft an alternative version of socket
  namespaces basing on your code? :)
 
 I'm not terribly afraid.  I can always say you did it wrong. :)

:)

 I don't think I am going to have time today.  But since this conversation
 is slowing down and we are to getting into the technical details.  
 I will try and find some time.

Good.
I'll focus on my part then.

Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Andrey Savochkin
Herbert,

On Mon, Jun 26, 2006 at 10:02:25PM +0200, Herbert Poetzl wrote:
 
 keep in mind that you actually have three kinds
 of network traffic on a typical host/guest system:
 
  - traffic between unit and outside
- host traffic should be quite minimal
- guest traffic will be quite high
 
  - traffic between host and guest
probably minimal too (only for shared services)
 
  - traffic between guests
can be as high (or even higher) than the
outbound traffic, just think web guest and
database guest

My experience with host-guest systems tells me the opposite:
outside traffic is way higher than traffic between guests.
People put a web server and a database in different guests no more often than
they put them on separate physical servers.
Unless people are building a really huge system where one server can't take
the whole load, web and database live together and benefit from
communication over UNIX sockets.

Guests usually consist of web-db pairs, and people place many such
guests on a single computer.

 
  The routing between network namespaces does have the potential to be
  more expensive than just a packet trivially coming off the wire into a
  socket.
 
 IMHO the routing between network namespaces should
 not require more than the current local traffic
 does (i.e. you should be able to achieve loopback
 speed within an insignificant tolerance) and not
 nearly the time required for on-wire stuff ...

I'd like to caution about over-optimizing communications between different
network namespaces.
Many optimizations of local traffic (such as high MTU) don't look so
appealing when you start to think about live migration of namespaces.

Regards
Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Andrey Savochkin
Daniel,

On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote:
 
  Then you lose the ability for each namespace to have its own routing 
  entries.
  Which implies that you'll have difficulties with devices that should exist
  and be visible in one namespace only (like tunnels), as they require IP
  addresses and route.
 
 I mean instead of having the route tables private to the namespace, the 
 routes have the information to which namespace they are associated.

I think I understand what you're talking about: you want to make routing
responsible for determining destination namespace ID in addition to route
type (local, unicast etc), nexthop information, and so on.  Right?

My point is that if you do namespace tagging at routing time, and
your packets are routed only once, you lose the ability
to have separate routing tables in each namespace.

Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Andrey Savochkin
On Tue, Jun 27, 2006 at 11:34:36AM +0200, Daniel Lezcano wrote:
 Andrey Savochkin wrote:
  Daniel,
  
  On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote:
  
 Then you lose the ability for each namespace to have its own routing 
 entries.
 Which implies that you'll have difficulties with devices that should exist
 and be visible in one namespace only (like tunnels), as they require IP
 addresses and route.
 
 I mean instead of having the route tables private to the namespace, the 
 routes have the information to which namespace they are associated.
  
  
  I think I understand what you're talking about: you want to make routing
  responsible for determining destination namespace ID in addition to route
  type (local, unicast etc), nexthop information, and so on.  Right?
 
 Yes.
 
  
  My point is that if you make namespace tagging at routing time, and
  your packets are being routed only once, you lose the ability
  to have separate routing tables in each namespace.
 
 Right. What is the advantage of having separate the routing tables ?

Routing is everything.
For example, I want namespaces to have their private tunnel devices.
It means that namespaces should be allowed to have private routes of local
type, private default routes, and so on...

Andrey


Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces

2006-06-27 Thread Andrey Savochkin
On Mon, Jun 26, 2006 at 10:05:14PM +0200, Herbert Poetzl wrote:
 On Mon, Jun 26, 2006 at 04:56:46PM +0200, Daniel Lezcano wrote:
  Andrey Savochkin wrote:
  Structures related to IPv4 routing (FIB and routing cache)
  are made per-namespace.
  
  How do you handle ICMP_REDIRECT ?
 
 and btw. how do you handle the beloved 'ping'
 (i.e. ICMP_ECHO_REQUEST/REPLY for and from
 guests?

I don't need to do anything special.  They are just IP packets.
If packets are local in the current net namespace, they are delivered to
a socket or handled by icmp_rcv.

Certainly, packet/raw sockets shouldn't see packets they aren't supposed to
see.  For raw sockets, it implies making socket lookup aware of namespaces,
exactly like for TCP or UDP.

Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Andrey Savochkin
Daniel,

On Tue, Jun 27, 2006 at 01:21:02PM +0200, Daniel Lezcano wrote:
 My point is that if you make namespace tagging at routing time, and
 your packets are being routed only once, you lose the ability
 to have separate routing tables in each namespace.
 
 Right. What is the advantage of having separate the routing tables ?
  
  
  Routing is everything.
  For example, I want namespaces to have their private tunnel devices.
  It means that namespaces should be allowed have private routes of local 
  type,
  private default routes, and so on...
  
 
 Ok, we are talking about the same things. We do it only in a different way:

We are not talking about the same things.

Whether route lookup is performed before or after the namespace change
isn't just a technical detail.
It is a fundamental question determining the functionality of network
namespaces.  We are talking about the capabilities namespaces provide.

Your proposal essentially prevents namespaces from having their own tunnel or
other devices.  There is no point in having a device inside a namespace if the
namespace owner can't route all or some specific outgoing packets through
that device.  You don't allow system administrators to completely delegate
management of network configuration to namespace owners.
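
To illustrate the capability in question, here is a sketch of per-namespace
route lookup. It is a toy longest-prefix match over an array, not the kernel
FIB; struct demo_route, struct demo_net_ns, and the ifindex values in the
usage note are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

struct demo_route {
    uint32_t prefix;     /* network address, host byte order */
    uint32_t mask;       /* contiguous netmask; 0 means default route */
    int out_ifindex;     /* non-negative index of the outgoing device */
};

/* Each namespace owns its table, so its owner can add private routes. */
struct demo_net_ns {
    const struct demo_route *routes;
    int nroutes;
};

/* Longest-prefix match inside one namespace.
 * Returns the outgoing ifindex, or -1 if no route matches. */
static int demo_route_lookup(const struct demo_net_ns *ns, uint32_t dst)
{
    int i, best = -1;
    uint32_t best_mask = 0;

    for (i = 0; i < ns->nroutes; i++) {
        const struct demo_route *r = &ns->routes[i];

        if ((dst & r->mask) != r->prefix)
            continue;
        /* For contiguous masks, a numerically larger mask is a longer prefix. */
        if (best < 0 || r->mask > best_mask) {
            best = r->out_ifindex;
            best_mask = r->mask;
        }
    }
    return best;
}
```

If a guest table holds a default route through its private tunnel device,
the same destination resolves to different devices in the host and in the
guest, which a single lookup in one shared table cannot express.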

Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-27 Thread Andrey Savochkin
Herbert,

On Tue, Jun 27, 2006 at 05:48:19PM +0200, Herbert Poetzl wrote:
 On Tue, Jun 27, 2006 at 01:09:11PM +0400, Andrey Savochkin wrote:
  
  On Mon, Jun 26, 2006 at 10:02:25PM +0200, Herbert Poetzl wrote:
   
- traffic between guests
  can be as high (or even higher) than the
  outbound traffic, just think web guest and
  database guest
  
  My experience with host-guest systems tells me the opposite: outside
  traffic is a way higher than traffic between guests. People put web
  server and database in different guests not more frequent than they
  put them on separate physical server. Unless people are building a
  really huge system when 1 server can't take the whole load, web and
  database live together and benefit from communications over UNIX
  sockets.
 
 well, that's probably because you (or your company)
 focuses on providers which simply (re)sell the entities
 to their customers, in which case it would be more
 expensive to put e.g. the database into a separate
 guest. but let me state here that this is not the only
 application for this technology

I'm just sharing my experience.
You have one experience, I have another, and your classification of traffic
importance is not the universal one.
My point was that we shouldn't overestimate the use of INET sockets vs. UNIX
ones in configurations where communication, rather than web/db operations,
plays a big role in overall performance.
And indeed I've talked with many different people, from universities to
large enterprises.

 
[snip]
  I'd like to caution about over-optimizing communications between
  different network namespaces. Many optimizations of local traffic
  (such as high MTU) don't look so appealing when you start to think
  about live migration of namespaces.
 
 I think the 'optimization' (or to be precise: desire
 not to sacrifice local/loopback traffic for some use
 case as you describe it) does not interfere with live
 migration at all, we still will have 'local' and 'remote'
 traffic, and personally I doubt that the live migration
 is a feature for the masses ...

Why not for the masses?

Andrey


Re: Network namespaces a path to mergable code.

2006-06-27 Thread Andrey Savochkin
Eric,

On Tue, Jun 27, 2006 at 11:20:40AM -0600, Eric W. Biederman wrote:
 
 Thinking about this I am going to suggest a slightly different direction
 for get a patchset we can merge.
 
 First we concentrate on the fundamentals.
 - How we mark a device as belonging to a specific network namespace.
 - How we mark a socket as belonging to a specific network namespace.

I agree with the direction of your thoughts.
I was trying to do a similar thing, define clear steps in network
namespace merging.

My first patchset covers devices but not sockets.
The only difference from what you're suggesting is ipv4 routing.
For me, it is no less important than devices and sockets.  Maybe even
more important, since routing exposes design deficiencies less obvious at
the socket level.

 
 As part of the fundamentals we add a patch to the generic socket code
 that by default will disable it for protocol families that do not indicate
 support for handling network namespaces, on a non-default network namespace.

Fine

Can you summarize your objections to my way of handling devices, please?
And what was the typo you referred to in your letter to Kirill Korotaev?

Regards
Andrey


Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-26 Thread Andrey Savochkin
Hi Daniel,

It's good that you kicked off the network namespace discussion,
although I wish you'd Cc'ed someone at OpenVZ so I could have noticed it
earlier :).

Indeed, the first point to agree on in this discussion is the device list.
In your patch, you essentially introduce a data structure parallel
to the main device list, creating a view of this list.
I see a fundamental problem with this approach.
When a device presents an skb to the protocol layer, it needs to know to which
namespace this skb belongs.
Otherwise you would never get rid of problems with bind: what to do if device
eth1 is visible in namespace1, namespace2, and root namespace, and each
namespace has a socket bound to 0.0.0.0:80?

We have to conclude that each device should be visible only in one namespace.
In this case, instead of introducing net_ns_dev and net_ns_dev_list
structures, we can simply have a separate dev_base list head in each namespace.
Moreover, a separate device list in each namespace will be in line with
making namespace isolation complete.  Complete isolation will allow each
namespace to set up its own tun/tap devices, have its own routes, netfilter
tables, and so on.
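
To make the "one device, one namespace" argument concrete, here is a minimal
user-space sketch, not kernel code.  All model_* names are hypothetical and
only stand in for net_namespace, net_device and the per-namespace dev_base
head discussed above.  It shows that with a separate list per namespace, name
lookup never crosses namespaces and a received packet gets its namespace from
the device without any ambiguity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_ns;

/* Hypothetical stand-in for a network device. */
struct model_dev {
	const char *name;
	struct model_ns *ns;		/* the single namespace owning the device */
	struct model_dev *next;		/* next device in the same namespace */
};

/* Hypothetical stand-in for a namespace with its own device list head. */
struct model_ns {
	struct model_dev *dev_base;
};

/* Name lookup walks only the given namespace's list. */
static struct model_dev *model_find(struct model_ns *ns, const char *name)
{
	struct model_dev *d;

	assert(ns != NULL);
	for (d = ns->dev_base; d != NULL; d = d->next)
		if (strcmp(d->name, name) == 0)
			return d;
	return NULL;
}

/* Register a device in exactly one namespace.  Returns 0 on success,
 * -1 if the device already has an owner or the name is taken there. */
static int model_register(struct model_ns *ns, struct model_dev *dev)
{
	if (dev->ns != NULL || model_find(ns, dev->name) != NULL)
		return -1;
	dev->ns = ns;
	dev->next = ns->dev_base;
	ns->dev_base = dev;
	return 0;
}

/* A received packet inherits its namespace from the device. */
static struct model_ns *model_rx_namespace(const struct model_dev *dev)
{
	return dev->ns;
}
```

In this model two namespaces can each register their own "lo", while an
attempt to register one device in a second namespace is refused.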

My follow-up messages will contain the first set of patches with network
namespaces implemented in the same way as network isolation in OpenVZ.
This patchset introduces namespaces for device list and IPv4 FIB/routing.
Two technical issues are omitted to make the patch idea clearer: device moving
between namespaces, and selective routing cache flush + garbage collection.

If this patchset is agreeable, the next patchset will finalize integration
with nsproxy, add namespaces to socket lookup code and neighbour
cache, and introduce a simple device to pass traffic between namespaces.
Then we will turn to less obvious matters including netlink messages,
network statistics, representation of network information in proc and sysfs,
tuning of parameters through sysctl, IPv6 and other protocols, and
per-namespace netfilters.

Best regards
Andrey


[patch 1/4] Network namespaces: cleanup of dev_base list use

2006-06-26 Thread Andrey Savochkin
Cleanup of dev_base list use, with the aim of making the device list
per-namespace.
In almost every case, use of the dev_base variable and the dev->next pointer
could easily be replaced by a for_each_netdev loop.
A few of the most complicated places were converted to use
first_netdev()/next_netdev().

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 arch/s390/appldata/appldata_net_sum.c |2 
 arch/sparc64/solaris/ioctl.c  |2 
 drivers/block/aoe/aoecmd.c|8 ++-
 drivers/net/wireless/strip.c  |4 -
 drivers/parisc/led.c  |2 
 include/linux/netdevice.h |   28 +++--
 net/8021q/vlan.c  |4 -
 net/8021q/vlanproc.c  |   10 ++--
 net/bridge/br_if.c|4 -
 net/bridge/br_ioctl.c |4 +
 net/bridge/br_netlink.c   |3 -
 net/core/dev.c|   70 --
 net/core/dev_mcast.c  |4 -
 net/core/rtnetlink.c  |   18 
 net/decnet/af_decnet.c|   11 +++--
 net/decnet/dn_dev.c   |   17 
 net/decnet/dn_fib.c   |2 
 net/decnet/dn_route.c |   12 ++---
 net/ipv4/devinet.c|   15 ---
 net/ipv4/igmp.c   |   25 +++-
 net/ipv6/addrconf.c   |   28 -
 net/ipv6/anycast.c|   22 ++
 net/ipv6/mcast.c  |   20 +
 net/llc/llc_core.c|7 ++-
 net/netrom/nr_route.c |4 -
 net/rose/rose_route.c |8 ++-
 net/sched/sch_api.c   |8 ++-
 net/sctp/protocol.c   |2 
 net/tipc/eth_media.c  |   12 +++--
 29 files changed, 200 insertions, 156 deletions

--- ./arch/s390/appldata/appldata_net_sum.c.vedevbase   Mon Mar 20 08:53:29 2006
+++ ./arch/s390/appldata/appldata_net_sum.c Thu Jun 22 12:03:07 2006
@@ -108,7 +108,7 @@ static void appldata_get_net_sum_data(vo
 	tx_dropped = 0;
 	collisions = 0;
 	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev != NULL; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (dev->get_stats == NULL) {
 			continue;
 		}
--- ./arch/sparc64/solaris/ioctl.c.vedevbaseMon Mar 20 08:53:29 2006
+++ ./arch/sparc64/solaris/ioctl.c  Thu Jun 22 12:03:07 2006
@@ -686,7 +686,7 @@ static inline int solaris_i(unsigned int
 	int i = 0;
 
 	read_lock_bh(&dev_base_lock);
-	for (d = dev_base; d; d = d->next) i++;
+	for_each_netdev(d) i++;
 	read_unlock_bh(&dev_base_lock);
 
 	if (put_user (i, (int __user *)A(arg)))
--- ./drivers/block/aoe/aoecmd.c.vedevbase  Wed Jun 21 18:50:28 2006
+++ ./drivers/block/aoe/aoecmd.cThu Jun 22 12:03:07 2006
@@ -204,14 +204,17 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
 	sl = sl_tail = NULL;
 
 	read_lock(&dev_base_lock);
-	for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp->next) {
+	for_each_netdev(ifp) {
 		dev_hold(ifp);
-		if (!is_aoe_netif(ifp))
+		if (!is_aoe_netif(ifp)) {
+			dev_put(ifp);
 			continue;
+		}
 
 		skb = new_skb(ifp, sizeof *h + sizeof *ch);
 		if (skb == NULL) {
 			printk(KERN_INFO "aoe: aoecmd_cfg: skb alloc failure\n");
+			dev_put(ifp);
 			continue;
 		}
 		if (sl_tail == NULL)
@@ -229,6 +232,7 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
 
 		skb->next = sl;
 		sl = skb;
+		dev_put(ifp);
 	}
 	read_unlock(&dev_base_lock);
 
--- ./drivers/net/wireless/strip.c.vedevbaseWed Jun 21 18:50:43 2006
+++ ./drivers/net/wireless/strip.c  Thu Jun 22 12:03:07 2006
@@ -1970,8 +1970,7 @@ static struct net_device *get_strip_dev(
 		   sizeof(zero_address))) {
 		struct net_device *dev;
 		read_lock_bh(&dev_base_lock);
-		dev = dev_base;
-		while (dev) {
+		for_each_netdev(dev) {
 			if (dev->type == strip_info->dev->type &&
 			    !memcmp(dev->dev_addr,
 				    &strip_info->true_dev_addr,
@@ -1982,7 +1981,6 @@ static struct net_device *get_strip_dev(
 				read_unlock_bh(&dev_base_lock);
 				return (dev);
 			}
-			dev = dev->next;
 		}
 		read_unlock_bh(&dev_base_lock);
 	}
--- ./drivers/parisc/led.c.vedevbaseWed Jun 21 18:52:58 2006

[patch 2/4] Network namespaces: cleanup of dev_base list use

2006-06-26 Thread Andrey Savochkin
CONFIG_NET_NS and net_namespace structure are introduced.
List of network devices is made per-namespace.
Each namespace gets its own loopback device.

Task's net_namespace pointer is not incorporated into nsproxy structure,
since current namespace changes temporarily for processing of packets
in softirq.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 drivers/net/loopback.c|   70 +++
 include/linux/init_task.h |9 ++
 include/linux/net_ns.h|   88 
 include/linux/netdevice.h |   20 -
 include/linux/nsproxy.h   |3 
 include/linux/sched.h |3 
 kernel/nsproxy.c  |   14 +++
 net/Kconfig   |7 +
 net/core/dev.c|  162 +-
 net/core/net-sysfs.c  |   24 ++
 net/ipv4/devinet.c|2 
 net/ipv6/addrconf.c   |2 
 net/ipv6/route.c  |3 
 13 files changed, 371 insertions, 36 deletions

--- ./drivers/net/loopback.c.venshd Wed Jun 21 18:50:39 2006
+++ ./drivers/net/loopback.cFri Jun 23 11:48:09 2006
@@ -196,42 +196,56 @@ static struct ethtool_ops loopback_ethto
 	.set_tso		= ethtool_op_set_tso,
 };
 
-struct net_device loopback_dev = {
-	.name			= "lo",
-	.mtu			= (16 * 1024) + 20 + 20 + 12,
-	.hard_start_xmit	= loopback_xmit,
-	.hard_header		= eth_header,
-	.hard_header_cache	= eth_header_cache,
-	.header_cache_update	= eth_header_cache_update,
-	.hard_header_len	= ETH_HLEN,	/* 14	*/
-	.addr_len		= ETH_ALEN,	/* 6	*/
-	.tx_queue_len		= 0,
-	.type			= ARPHRD_LOOPBACK,	/* 0x0001*/
-	.rebuild_header		= eth_rebuild_header,
-	.flags			= IFF_LOOPBACK,
-	.features		= NETIF_F_SG | NETIF_F_FRAGLIST
+struct net_device loopback_dev_static;
+EXPORT_SYMBOL(loopback_dev_static);
+
+void loopback_dev_dtor(struct net_device *dev)
+{
+	if (dev->priv) {
+		kfree(dev->priv);
+		dev->priv = NULL;
+	}
+	free_netdev(dev);
+}
+
+void loopback_dev_ctor(struct net_device *dev)
+{
+	struct net_device_stats *stats;
+
+	memset(dev, 0, sizeof(*dev));
+	strcpy(dev->name, "lo");
+	dev->mtu		= (16 * 1024) + 20 + 20 + 12;
+	dev->hard_start_xmit	= loopback_xmit;
+	dev->hard_header	= eth_header;
+	dev->hard_header_cache	= eth_header_cache;
+	dev->header_cache_update = eth_header_cache_update;
+	dev->hard_header_len	= ETH_HLEN;	/* 14	*/
+	dev->addr_len		= ETH_ALEN;	/* 6	*/
+	dev->tx_queue_len	= 0;
+	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001*/
+	dev->rebuild_header	= eth_rebuild_header;
+	dev->flags		= IFF_LOOPBACK;
+	dev->features		= NETIF_F_SG | NETIF_F_FRAGLIST
 #ifdef LOOPBACK_TSO
 				  | NETIF_F_TSO
 #endif
 				  | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA
-				  | NETIF_F_LLTX,
-	.ethtool_ops		= &loopback_ethtool_ops,
-};
-
-/* Setup and register the loopback device. */
-int __init loopback_init(void)
-{
-	struct net_device_stats *stats;
+				  | NETIF_F_LLTX
+				  | NETIF_F_NSOK;
+	dev->ethtool_ops	= &loopback_ethtool_ops;
 
 	/* Can survive without statistics */
 	stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
 	if (stats) {
 		memset(stats, 0, sizeof(struct net_device_stats));
-		loopback_dev.priv = stats;
-		loopback_dev.get_stats = &get_stats;
+		dev->priv = stats;
+		dev->get_stats = &get_stats;
 	}
-	
-	return register_netdev(&loopback_dev);
-};
+}
 
-EXPORT_SYMBOL(loopback_dev);
+/* Setup and register the loopback device. */
+int __init loopback_init(void)
+{
+	loopback_dev_ctor(&loopback_dev_static);
+	return register_netdev(&loopback_dev_static);
+};
--- ./include/linux/init_task.h.venshd  Wed Jun 21 18:53:16 2006
+++ ./include/linux/init_task.h Fri Jun 23 11:48:09 2006
@@ -87,6 +87,14 @@ extern struct nsproxy init_nsproxy;
 
 extern struct group_info init_groups;
 
+#ifdef CONFIG_NET_NS
+extern struct net_namespace init_net_ns;
+#define INIT_NET_NS \
+	.net_context	= &init_net_ns,
+#else
+#define INIT_NET_NS
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -129,6 +137,7 @@ extern struct group_info init_groups;
 	.signal		= &init_signals,				\
 	.sighand	= &init_sighand,				\
 	.nsproxy	= &init_nsproxy

[patch 4/4] Network namespaces: playing and debugging

2006-06-26 Thread Andrey Savochkin
Temporary code to play with network namespaces in the simplest way.
Do
	exec 7< /proc/net/net_ns
in your bash shell and you'll get a brand new network namespace.
There you can, for example, do
	ip link set lo up
	ip addr list
	ip addr add 1.2.3.4 dev lo
	ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 dev.c |   27 ++-
 1 files changed, 26 insertions, 1 deletion

--- ./net/core/dev.c.vensdbgFri Jun 23 11:50:16 2006
+++ ./net/core/dev.cFri Jun 23 11:50:40 2006
@@ -3444,6 +3444,8 @@ int net_ns_start(void)
 	if (err)
 		goto out_register;
 	put_net_ns(orig_ns);
+	printk(KERN_DEBUG "NET_NS: created new netcontext %p for %s (pid=%d)\n",
+			ns, task->comm, task->tgid);
 	return 0;
 
 out_register:
@@ -3461,6 +3463,7 @@ EXPORT_SYMBOL(net_ns_start);
 
 void net_ns_free(struct net_namespace *ns)
 {
+	printk(KERN_DEBUG "NET_NS: netcontext %p freed\n", ns);
 	kfree(ns);
 }
 EXPORT_SYMBOL(net_ns_free);
@@ -3473,8 +3476,13 @@ static void net_ns_destroy(void *data)
 	ns = data;
 	push_net_ns(ns, orig_ns);
 	unregister_netdev(ns->loopback);
+	if (!list_empty(&ns->dev_base)) {
+		printk("NET_NS: BUG: context %p has devices! ref %d\n",
+				ns, atomic_read(&ns->active_ref));
+		pop_net_ns(orig_ns);
+		return;
+	}
 	ip_fib_struct_fini();
-	BUG_ON(!list_empty(&ns->dev_base));
 	pop_net_ns(orig_ns);
 
 	/* drop (hopefully) final reference */
@@ -3483,9 +3491,23 @@ static void net_ns_destroy(void *data)
 
 void net_ns_stop(struct net_namespace *ns)
 {
+	printk(KERN_DEBUG "NET_NS: netcontext %p scheduled for stop\n", ns);
 	execute_in_process_context(net_ns_destroy, ns, &ns->destroy_work);
 }
 EXPORT_SYMBOL(net_ns_stop);
+
+static int net_ns_open(struct inode *i, struct file *f)
+{
+	return net_ns_start();
+}
+static struct file_operations net_ns_fops = {
+	.open	= net_ns_open,
+};
+static int net_ns_init(void)
+{
+	return proc_net_fops_create("net_ns", S_IRWXU, &net_ns_fops)
+			? 0 : -ENOMEM;
+}
 #endif
 
 /*
@@ -3550,6 +3572,9 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+#ifdef CONFIG_NET_NS
+	net_ns_init();
+#endif
 	rc = 0;
 out:
 	return rc;


[patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces

2006-06-26 Thread Andrey Savochkin
Structures related to IPv4 routing (FIB and routing cache)
are made per-namespace.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/net_ns.h   |9 +++
 include/net/flow.h   |3 +
 include/net/ip_fib.h |   62 -
 net/core/dev.c   |7 ++
 net/ipv4/Kconfig |4 -
 net/ipv4/fib_frontend.c  |   87 +--
 net/ipv4/fib_hash.c  |   13 -
 net/ipv4/fib_rules.c |  114 +--
 net/ipv4/fib_semantics.c |  104 +-
 net/ipv4/route.c |   26 ++
 10 files changed, 348 insertions, 81 deletions

--- ./include/linux/net_ns.h.vensrt Fri Jun 23 11:49:42 2006
+++ ./include/linux/net_ns.hFri Jun 23 11:50:16 2006
@@ -14,7 +14,16 @@ struct net_namespace {
 	atomic_t		active_ref, use_ref;
 	struct list_head	dev_base;
 	struct net_device	*loopback;
+#ifndef CONFIG_IP_MULTIPLE_TABLES
+	struct fib_table	*fib4_local_table, *fib4_main_table;
+#else
+	struct fib_table	**fib4_tables;
+	struct hlist_head	fib4_rules;
+#endif
+	struct hlist_head	*fib4_hash, *fib4_laddrhash;
+	unsigned		fib4_hash_size, fib4_info_cnt;
 	unsigned int		hash;
+	char			destroying;
 	struct execute_work	destroy_work;
 };
 
--- ./include/net/flow.h.vensrt Wed Jun 21 18:51:08 2006
+++ ./include/net/flow.hFri Jun 23 11:50:16 2006
@@ -78,6 +78,9 @@ struct flowi {
 #define fl_icmp_type   uli_u.icmpt.type
 #define fl_icmp_code   uli_u.icmpt.code
 #define fl_ipsec_spi   uli_u.spi
+#ifdef CONFIG_NET_NS
+   struct net_namespace *net_ns;
+#endif
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 #define FLOW_DIR_IN	0
--- ./include/net/ip_fib.h.vensrt   Wed Jun 21 18:53:17 2006
+++ ./include/net/ip_fib.h  Fri Jun 23 11:50:16 2006
@@ -18,6 +18,7 @@
 
 #include <net/flow.h>
 #include <linux/seq_file.h>
+#include <linux/net_ns.h>
 
 /* WARNING: The ordering of these elements must match ordering
  *          of RTA_* rtnetlink attribute numbers.
@@ -169,14 +170,21 @@ struct fib_table {
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
-extern struct fib_table *ip_fib_local_table;
-extern struct fib_table *ip_fib_main_table;
+#ifndef CONFIG_NET_NS
+extern struct fib_table *ip_fib_local_table_static;
+extern struct fib_table *ip_fib_main_table_static;
+#define ip_fib_local_table_ns()	ip_fib_local_table_static
+#define ip_fib_main_table_ns()		ip_fib_main_table_static
+#else
+#define ip_fib_local_table_ns()	(current_net_ns->fib4_local_table)
+#define ip_fib_main_table_ns()		(current_net_ns->fib4_main_table)
+#endif
 
 static inline struct fib_table *fib_get_table(int id)
 {
 	if (id != RT_TABLE_LOCAL)
-		return ip_fib_main_table;
-	return ip_fib_local_table;
+		return ip_fib_main_table_ns();
+	return ip_fib_local_table_ns();
 }
 
 static inline struct fib_table *fib_new_table(int id)
@@ -186,23 +194,36 @@ static inline struct fib_table *fib_new_
 
 static inline int fib_lookup(const struct flowi *flp, struct fib_result *res)
 {
-	if (ip_fib_local_table->tb_lookup(ip_fib_local_table, flp, res) &&
-	    ip_fib_main_table->tb_lookup(ip_fib_main_table, flp, res))
+	struct fib_table *tb;
+
+	tb = ip_fib_local_table_ns();
+	if (!tb->tb_lookup(tb, flp, res))
+		return 0;
+	tb = ip_fib_main_table_ns();
+	if (tb->tb_lookup(tb, flp, res))
 		return -ENETUNREACH;
 	return 0;
 }
 
 static inline void fib_select_default(const struct flowi *flp, struct fib_result *res)
 {
+	struct fib_table *tb;
+
+	tb = ip_fib_main_table_ns();
 	if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
-		ip_fib_main_table->tb_select_default(ip_fib_main_table, flp, res);
+		tb->tb_select_default(tb, flp, res);
 }
 
 #else /* CONFIG_IP_MULTIPLE_TABLES */
-#define ip_fib_local_table (fib_tables[RT_TABLE_LOCAL])
-#define ip_fib_main_table (fib_tables[RT_TABLE_MAIN])
+#define ip_fib_local_table_ns() (fib_tables_ns()[RT_TABLE_LOCAL])
+#define ip_fib_main_table_ns() (fib_tables_ns()[RT_TABLE_MAIN])
 
-extern struct fib_table * fib_tables[RT_TABLE_MAX+1];
+#ifndef CONFIG_NET_NS
+extern struct fib_table * fib_tables_static[RT_TABLE_MAX+1];
+#define fib_tables_ns() fib_tables_static
+#else
+#define fib_tables_ns() (current_net_ns->fib4_tables)
+#endif
 extern int fib_lookup(const struct flowi *flp, struct fib_result *res);
 extern struct fib_table *__fib_new_table(int id);
 extern void fib_rule_put(struct fib_rule *r);
@@ -212,7 +233,7 @@ static inline struct fib_table *fib_get_
 	if (id == 0)
 		id = RT_TABLE_MAIN;
 
-	return fib_tables[id];
+	return fib_tables_ns()[id];
 }
 
 static inline

Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-26 Thread Andrey Savochkin
Hi Herbert,

On Mon, Jun 26, 2006 at 03:02:03PM +0200, Herbert Poetzl wrote:
 On Mon, Jun 26, 2006 at 01:47:11PM +0400, Andrey Savochkin wrote:
 
  I see a fundamental problem with this approach. When a device presents
  an skb to the protocol layer, it needs to know to which namespace this
  skb belongs.
 
  Otherwise you would never get rid of problems with bind: what to do if
  device eth1 is visible in namespace1, namespace2, and root namespace,
  and each namespace has a socket bound to 0.0.0.0:80?
 
 this is something which isn't a fundamental problem at
 all, and IMHO there are at least three options here
 (probably more)
 
  - check at 'bind' time if the binding would overlap
and give the 'proper' error (as it happens right
now on the host)
(this is how Linux-VServer currently handles the
network isolation, and yes, it works quite fine :)

I'm not comfortable with this as a permanent mainstream solution.
It means that network namespaces are actually not namespaces: you can't run
some program (e.g., apache) with default configs in a new namespace without
regard to who runs what in other namespaces.
In other words, the name 0.0.0.0:80 creates a collision in your implementation,
so socket names do not form isolated spaces.
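
The difference between the two bind policies can be sketched in user space as
follows.  This is only an illustration: the model_* names are hypothetical,
the namespace is an opaque pointer, and the real socket hash code looks
nothing like this flat table.  With per_namespace set to 0 the table behaves
like a global bind-time check, with 1 it behaves like isolated name spaces.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_BINDS 16

/* One bound port together with the namespace that bound it. */
struct model_bind {
	const void *ns;			/* opaque namespace identity (assumption) */
	unsigned short port;
};

struct model_bind_table {
	struct model_bind slot[MODEL_MAX_BINDS];
	size_t used;
};

/* Bind a wildcard socket to a port.
 * per_namespace == 0: any holder of the port blocks everyone else.
 * per_namespace != 0: a conflict exists only inside the same namespace.
 * Returns 0 on success, -1 if the address is in use or the table is full. */
static int model_bind(struct model_bind_table *t, const void *ns,
		      unsigned short port, int per_namespace)
{
	size_t i;

	assert(t != NULL);
	for (i = 0; i < t->used; i++) {
		if (t->slot[i].port != port)
			continue;
		if (!per_namespace || t->slot[i].ns == ns)
			return -1;
	}
	if (t->used == MODEL_MAX_BINDS)
		return -1;
	t->slot[t->used].ns = ns;
	t->slot[t->used].port = port;
	t->used++;
	return 0;
}
```

With the global check, a second namespace binding port 80 fails even though
it never saw the first binder; with the per-namespace check both succeed.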

 
  - allow arbitrary binds and 'tag' the packets according
to some 'host' policy (e.g. iptables or tc)
(this is how the Linux-VServer ngnet was designed)
 
  - deliver packets to _all_ bound sockets/destinations
(this is probably a more unusable but quite thinkable
solution)

Deliver TCP packets to all sockets?
How many connections do you expect to be established in this case?

 
  We have to conclude that each device should be visible only in one
  namespace. 
 
 I disagree here, especially some supervisor context or
 the host context should be able to 'see' and probably
 manipulate _all_ of the devices

Right, manipulating all devices from some supervisor context is useful.

But this shouldn't necessarily be done by regular ip/ifconfig tools.
Besides, it could be quite confusing if in ifconfig output in the
supervisor context you see 325 tun0 devices coming from
different namespaces :)

So I'm all for network namespace management mechanisms not bound
to existing tools/APIs.

 
  Complete isolation will allow each namespace to set up own tun/tap
  devices, have own routes, netfilter tables, and so on.
 
 tun/tap devices are quite possible with this approach
 too, I see no problem here ...
 
 for iptables and routes, I'm worried about the required
 'policy' to make them secure, i.e. how do you ensure
 that the packets 'leaving' guest X do not contain
 'evil' packets and/or disrupt your host system?

Sorry, I don't get your point.
How do you ensure that packets leaving your neighbor's computer
do not disrupt your system?
From my point of view, network namespaces are just neighbors.

 
  My follow-up messages will contain the first set of patches with
  network namespaces implemented in the same way as network isolation 
  in OpenVZ. 
 
 hmm, you probably mean 'network virtualization' here

I meant isolation between different network contexts/namespaces.

 
  This patchset introduces namespaces for device list and IPv4
  FIB/routing. Two technical issues are omitted to make the patch idea
  clearer: device moving between namespaces, and selective routing cache
  flush + garbage collection.
 
  If this patchset is agreeable, the next patchset will finalize
  integration with nsproxy, add namespaces to socket lookup code and
  neighbour cache, and introduce a simple device to pass traffic between
  namespaces.
 
 passing traffic 'between' namespaces should happen via
 lo, no? what kind of 'device' is required there, and
 what overhead does it add to the networking?

OpenVZ provides 2 options.

 1) A packet appears right inside some namespace, without any additional
overhead.  Usually this implies that all packets from this device
belong to this namespace, i.e. simple device-to-namespace assignment.
However, there is nothing conceptually wrong with having
namespace-aware device drivers or netfilter modules selecting namespaces
for each incoming packet.  It all depends on how you want packets to go
through various network layers, and how many network management abilities
you want to have in non-root namespaces.
My point is that for network namespaces to be real namespaces, decision
making should be done somewhere before socket lookup.

 2) Parent network namespace acts as a router forwarding packets to child
namespaces.  This scheme is the preferred one in OpenVZ for various
reasons, most important being the simplicity of migration of network
namespaces.  In this case flexibility has the cost of going through
packet handling layers two times.
Technically, this is implemented via a simple netdevice doing
netif_rx in hard_xmit.
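
As a rough illustration of option 2, a pass-through device can be modelled as
a pair of ends where transmitting on one end is receiving on the other.  The
sketch below is user-space C with hypothetical model_* names; model_netif_rx
merely stands in for netif_rx and model_pair_xmit for the device's transmit
hook, so it shows the idea rather than the OpenVZ code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a namespace. */
struct model_ns {
	unsigned long rx_packets;	/* packets delivered into this namespace */
};

/* One end of a pass-through device pair. */
struct model_dev {
	struct model_ns *ns;		/* namespace that owns this end */
	struct model_dev *peer;		/* other end, owned by another namespace */
	unsigned long tx_packets;
};

/* Stand-in for netif_rx(): the packet enters the receive path of the
 * namespace that owns the receiving device. */
static void model_netif_rx(struct model_dev *dev)
{
	assert(dev->ns != NULL);
	dev->ns->rx_packets++;
}

/* Stand-in for the transmit hook: sending on one end is receiving on
 * the other.  Returns 0 on success, -1 if the device has no peer. */
static int model_pair_xmit(struct model_dev *dev)
{
	if (dev->peer == NULL)
		return -1;
	dev->tx_packets++;
	model_netif_rx(dev->peer);
	return 0;
}

/* Connect two ends that live in different namespaces. */
static void model_pair_link(struct model_dev *a, struct model_dev *b)
{
	a->peer = b;
	b->peer = a;
}
```

The packet is handled once by the sender's stack and once by the receiver's,
which is the cost of going through the packet handling layers two times.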

Regards

Andrey

Re: [patch 2/6] [Network namespace] Network device sharing by view

2006-06-26 Thread Andrey Savochkin
Daniel,

On Mon, Jun 26, 2006 at 04:56:32PM +0200, Daniel Lezcano wrote:
 Andrey Savochkin wrote:
  
  It's good that you kicked off network namespace discussion.
  Although I wish you'd Cc'ed someone at OpenVZ so I could notice it earlier 
  :).
 
 [EMAIL PROTECTED] ?

[EMAIL PROTECTED] is fine

 
  When a device presents an skb to the protocol layer, it needs to know to 
  which
  namespace this skb belongs.
  Otherwise you would never get rid of problems with bind: what to do if 
  device
  eth1 is visible in namespace1, namespace2, and root namespace, and each
  namespace has a socket bound to 0.0.0.0:80?
 
 Exact. But, the idea was to retrieve the namespace from the routes.

Then you lose the ability for each namespace to have its own routing entries.
Which implies that you'll have difficulties with devices that should exist
and be visible in one namespace only (like tunnels), as they require IP
addresses and routes.

 
 IMHO, I think there are roughly 2 network isolation implementations:
 
   - make all network resources private to the namespace
 
   - keep a flat model where network resources have a new identifier 
 which is the network namespace pointer. The idea is to move only some 
 network information private to the namespace (eg port range, stats, ...)

Sorry, I don't get the second idea with only some information private to
namespace.

How do you want the TCP_INC_STATS macro to look?
In my concept, it would be something like
#define TCP_INC_STATS(field) SNMP_INC_STATS(current_net_ns->tcp_stat, field)
where tcp_stat is a TCP statistics array inside net_namespace.
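
A user-space sketch of that idea, with hypothetical MODEL_*/model_* names in
place of the kernel macros and of current_net_ns, could look like this.  The
counter array lives in the namespace, and the macro picks the array through
the implicit current namespace pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter indices, loosely modelled on TCP MIB fields. */
enum {
	MODEL_TCP_ACTIVEOPENS,
	MODEL_TCP_PASSIVEOPENS,
	MODEL_TCP_MIB_MAX
};

/* Hypothetical namespace carrying its own TCP statistics array. */
struct model_ns {
	unsigned long tcp_stat[MODEL_TCP_MIB_MAX];
};

/* Stand-in for current_net_ns: the namespace the code runs in. */
static struct model_ns *model_current_ns;

#define MODEL_SNMP_INC_STATS(mib, field)	((mib)[(field)]++)
#define MODEL_TCP_INC_STATS(field) \
	MODEL_SNMP_INC_STATS(model_current_ns->tcp_stat, field)

/* Read one counter of a given namespace. */
static unsigned long model_tcp_stat(const struct model_ns *ns, int field)
{
	assert(field >= 0 && field < MODEL_TCP_MIB_MAX);
	return ns->tcp_stat[field];
}
```

Call sites stay unchanged, for example MODEL_TCP_INC_STATS(MODEL_TCP_ACTIVEOPENS),
and only the namespace that is current at that moment sees the increment.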

Regards

Andrey


Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces

2006-06-26 Thread Andrey Savochkin
On Mon, Jun 26, 2006 at 04:56:46PM +0200, Daniel Lezcano wrote:
 Andrey Savochkin wrote:
  Structures related to IPv4 routing (FIB and routing cache)
  are made per-namespace.
 
 How do you handle ICMP_REDIRECT ?

Are you talking about routing cache entries created on incoming redirects?
Or outgoing redirects?

Andrey


Re: [patch 4/4] Network namespaces: playing and debugging

2006-06-26 Thread Andrey Savochkin
On Mon, Jun 26, 2006 at 05:04:29PM +0200, Daniel Lezcano wrote:
 Andrey Savochkin wrote:
  Temporary code to play with network namespaces in the simplest way.
  Do
  exec 7< /proc/net/net_ns
  in your bash shell and you'll get a brand new network namespace.
  There you can, for example, do
  ip link set lo up
  ip addr list
  ip addr add 1.2.3.4 dev lo
  ping -n 1.2.3.4
  
 
 Is it possible to setup a network device to communicate with the outside ?

Such a device was planned for the second patchset :)
I can perhaps send the patch tomorrow.

Andrey


Re: [patch 1/4] Network namespaces: cleanup of dev_base list use

2006-06-26 Thread Andrey Savochkin
Hi Eric,

On Mon, Jun 26, 2006 at 09:13:52AM -0600, Eric W. Biederman wrote:
 Andrey Savochkin [EMAIL PROTECTED] writes:
 
  Cleanup of dev_base list use, with the aim to make device list 
  per-namespace.
  In almost every occasion, use of dev_base variable and dev->next pointer
  could be easily replaced by for_each_netdev loop.
  A few most complicated places were converted to using
  first_netdev()/next_netdev().
 
 As a proof of concept patch this is ok.
 
 As a real world patch this is much too big, which prevents review.
 Plus it takes a few actions that are more than replace just
 iterators through the device list.

dev_base list is historically not the cleanest part of Linux networking.
I've still spotted a place where the first device in dev_base list is assumed
to be loopback.  In early days we had more, now only one place or two...

 
 In addition I suspect several if not all of these iterators
 can be replaced with an appropriate helper function.
 
 The normal structure for a patch like this would be to
 introduce the new helper function.  for_each_netdev.
 And then to replace all of the users while cc'ing the
 maintainers of those drivers.  With each different
 driver being a different patch.
 
 There is another topic for discussion in this patch as well.
 How much of the context should be implicit and how much
 should be explicit.
 
 If the changes from netchannels had already been implemented, and all of
 the network processing was happening in a process context then I would
 trivially agree that implicit would be the way to go.

Why would we want all network processing to happen in a process context?

 
 However, short of always having code execute in the proper
 context I'm not comfortable with implicit parameters to functions.
 Not that the contents of this patch should address this but the
 later patches should.

We just have too many layers in networking code, and FIB/routing
illustrates it well.

 
 When I went through this, my patchset just added an explicit
 continue if the device was not in the appropriate namespace.
 I actually prefer the multiple list implementation but at
 the same time I think it is harder to get a clean implementation
 out of it.

Certainly, dev_base list reorganization is not the crucial point in network
namespaces.  But it has to be done some way or other.
If people vote for a single list with skipping devices from a wrong
namespace, it's fine with me, I can re-make this patch.

I personally prefer per-namespace device list since we have too many places
in the kernel where this list is walked in a linear fashion,
and with many namespaces this list may become quite long.
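
The trade-off between the two list layouts can be sketched in user space.
All names below are hypothetical and the structures are toy stand-ins: each
device is linked both into one global list and into its namespace's own list,
so the two walking strategies can be compared on the same data.

```c
#include <assert.h>
#include <stddef.h>

/* Toy device linked into a global list and into a per-namespace list. */
struct model_dev {
	const void *ns;			/* opaque namespace identity */
	struct model_dev *next_global;
	struct model_dev *next_in_ns;
};

/* Link n devices round-robin into nns namespaces.  Fills ns_heads[] with
 * the per-namespace list heads and returns the global list head. */
static struct model_dev *model_build(struct model_dev *devs, size_t n,
				     const int *ns_ids, size_t nns,
				     struct model_dev **ns_heads)
{
	size_t i;

	assert(nns > 0);
	for (i = 0; i < nns; i++)
		ns_heads[i] = NULL;
	for (i = n; i-- > 0; ) {
		size_t k = i % nns;

		devs[i].ns = &ns_ids[k];
		devs[i].next_global = (i + 1 < n) ? &devs[i + 1] : NULL;
		devs[i].next_in_ns = ns_heads[k];
		ns_heads[k] = &devs[i];
	}
	return n ? &devs[0] : NULL;
}

/* Single list: walk everything, skip devices of other namespaces. */
static size_t model_count_global(const struct model_dev *head,
				 const void *ns, size_t *visited)
{
	size_t found = 0;

	*visited = 0;
	for (; head != NULL; head = head->next_global) {
		(*visited)++;
		if (head->ns != ns)
			continue;
		found++;
	}
	return found;
}

/* Per-namespace list: walk only the namespace's own devices. */
static size_t model_count_per_ns(const struct model_dev *head, size_t *visited)
{
	size_t found = 0;

	*visited = 0;
	for (; head != NULL; head = head->next_in_ns) {
		(*visited)++;
		found++;
	}
	return found;
}
```

Both walks find the same devices, but the single-list walk visits every
device in the system while the per-namespace walk visits only its own.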

Regards

Andrey


Re: [patch 4/4] Network namespaces: playing and debugging

2006-06-26 Thread Andrey Savochkin
On Mon, Jun 26, 2006 at 07:29:57PM +0200, Daniel Lezcano wrote:
 Do
exec 7< /proc/net/net_ns
 in your bash shell and you'll get a brand new network namespace.
 There you can, for example, do
ip link set lo up
ip addr list
ip addr add 1.2.3.4 dev lo
ping -n 1.2.3.4
 
 
 Andrey,
 
 I began to play with your patchset. I am able to connect to 127.0.0.1 
 from different namespaces. Is it the expected behavior ?
 Furthermore, I am not able to have several programs, running in 
 different namespaces, to bind to the same INADDR_ANY:port.
 
 Will these features be included in the second patchset ?

Of course.
This patchset adds namespaces to routing code, which means that
you can define local IP addresses in each namespace independently.
But this first patchset doesn't include namespaces in socket lookup code.

Andrey


Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces

2006-06-26 Thread Andrey Savochkin
On Mon, Jun 26, 2006 at 05:57:01PM +0200, Daniel Lezcano wrote:
 Andrey Savochkin wrote:
  On Mon, Jun 26, 2006 at 04:56:46PM +0200, Daniel Lezcano wrote:
 
 How do you handle ICMP_REDIRECT ?
  
  
  Are you talking about routing cache entries created on incoming redirects?
  Or outgoing redirects?
  
 
 incoming redirects

They are inserted into routing cache with the current namespace tag, in
the same way as input routing cache entries.
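
In other words, the namespace is part of the cache key.  A toy user-space
model of that (hypothetical model_rt_* names, a flat array instead of the
real hashed routing cache) shows that a redirect learned in one namespace is
never returned by a lookup made in another one.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_RT_CACHE_SIZE 8

/* Cache entry tagged with the namespace it was created in. */
struct model_rt_entry {
	const void *ns;			/* opaque namespace identity */
	unsigned int daddr;		/* destination address */
	unsigned int gateway;		/* next hop learned for it */
};

struct model_rt_cache {
	struct model_rt_entry entry[MODEL_RT_CACHE_SIZE];
	size_t used;
};

/* Lookup matches on (namespace, destination), never on destination alone. */
static struct model_rt_entry *model_rt_find(struct model_rt_cache *c,
					    const void *ns, unsigned int daddr)
{
	size_t i;

	assert(c != NULL);
	for (i = 0; i < c->used; i++)
		if (c->entry[i].ns == ns && c->entry[i].daddr == daddr)
			return &c->entry[i];
	return NULL;
}

/* Record a redirect in the namespace that received it.  Updates an
 * existing entry or adds one; returns -1 when the toy cache is full. */
static int model_rt_redirect(struct model_rt_cache *c, const void *ns,
			     unsigned int daddr, unsigned int gateway)
{
	struct model_rt_entry *e = model_rt_find(c, ns, daddr);

	if (e == NULL) {
		if (c->used == MODEL_RT_CACHE_SIZE)
			return -1;
		e = &c->entry[c->used++];
		e->ns = ns;
		e->daddr = daddr;
	}
	e->gateway = gateway;
	return 0;
}
```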

Andrey


Re: [patch 1/4] Network namespaces: cleanup of dev_base list use

2006-06-26 Thread Andrey Savochkin
Eric,

On Mon, Jun 26, 2006 at 10:26:23AM -0600, Eric W. Biederman wrote:
 Andrey Savochkin [EMAIL PROTECTED] writes:
 
  On Mon, Jun 26, 2006 at 09:13:52AM -0600, Eric W. Biederman wrote:
  
  There is another topic for discussion in this patch as well.
  How much of the context should be implicit and how much
  should be explicit.
  
  If the changes from netchannels had already been implemented, and all of
  the network processing was happening in a process context then I would
  trivially agree that implicit would be the way to go.
 
 
[snip]
 It is a big enough problem that I don't think we want to gate on
 that development but we need to be ready to take advantage of it when
 it happens.

Well, ok, implicit namespace reference will take advantage of it
if it happens.

 
  However, short of always having code execute in the proper
  context I'm not comfortable with implicit parameters to functions.
  Not that the contents of this patch should address this but the
  later patches should.
 
  We just have too many layers in networking code, and FIB/routing
  illustrates it well.
 
 I don't follow this comment.  How does a lot of layers affect
 the choice of implicit or explicit parameters?  If you are maintaining
 a patch outside the kernel I could see how there could be a win for
 touching the least amount of code possible but for merged code that
 you only have to go through once I don't see how the number of layers
 affects things.

I agree that implicit vs explicit parameters is a topic for discussion.
As you can see from my patch, I vote for implicit ones in this case :)

I was talking about layers because they imply changing more code,
and usually imply adding more parameters to functions and passing these
additional parameters to next layers.
In routing code it goes from routing entry points, to routing cache, to
general FIB functions, to table-specific code (FIB hash).

These additional parameters bloat the code to some extent.
Sometimes it's possible to save here and there by fetching the parameter
(namespace pointer) indirectly from structures you already have at hand,
but it can't be done universally.

One of the properties of an implicit argument which I especially like
is that both input and output paths are absolutely symmetric in how
the namespace pointer is extracted.
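
A small user-space sketch of the two styles may help the discussion.  The
model_* names are hypothetical; in particular model_push_ns/model_pop_ns are
plain functions that only imitate the push_net_ns/pop_net_ns idea and are not
the macros from the patch.  The explicit variant threads the namespace through
every layer, the implicit one reads it from the current context, and the
receive path sets that context before entering the stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical namespace; id stands in for per-namespace routing state. */
struct model_ns {
	int id;
};

static struct model_ns model_init_ns = { 0 };

/* Stand-in for the implicit current namespace. */
static struct model_ns *model_current_ns = &model_init_ns;

/* Switch context and return the previous one so it can be restored. */
static struct model_ns *model_push_ns(struct model_ns *ns)
{
	struct model_ns *orig = model_current_ns;

	assert(ns != NULL);
	model_current_ns = ns;
	return orig;
}

static void model_pop_ns(struct model_ns *orig)
{
	model_current_ns = orig;
}

/* Explicit style: every layer takes and forwards the namespace. */
static int model_fib_lookup_explicit(const struct model_ns *ns)
{
	return ns->id;
}

static int model_route_explicit(const struct model_ns *ns)
{
	return model_fib_lookup_explicit(ns);
}

/* Implicit style: the lowest layer reads the current context. */
static int model_fib_lookup_implicit(void)
{
	return model_current_ns->id;
}

static int model_route_implicit(void)
{
	return model_fib_lookup_implicit();
}

/* Asynchronous input path: set the context from the receiving device's
 * namespace, run the stack, then restore the interrupted context. */
static int model_rx(struct model_ns *dev_ns)
{
	struct model_ns *orig = model_push_ns(dev_ns);
	int result = model_route_implicit();

	model_pop_ns(orig);
	return result;
}
```

On the output path the caller's own context is already current, so the same
model_route_implicit() call serves both directions without extra parameters.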

 
 As I recall for most of the FIB/routing code once you have removed
 the global variable accesses and introduce namespace checks in
 the hash table (because allocating hash tables at runtime isn't sane)
 the rest of the code was agnostic about what was going on.  So I think
 you have touched everything that needs touching.  So I don't see
 a code size or complexity argument there.

Andrey