Re: [PATCH 5/9] network namespaces: async socket operations
On Fri, Sep 22, 2006 at 05:33:56PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > Non-trivial part of socket namespaces: asynchronous events
> > should be run in proper context.
> >
> > Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
> > ---
> >  af_inet.c            |   10 ++
> >  inet_timewait_sock.c |    8
> >  tcp_timer.c          |    9 +
> >  3 files changed, 27 insertions(+)
> >
> > --- ./net/ipv4/af_inet.c.venssock-asyn	Mon Aug 14 17:04:07 2006
> > +++ ./net/ipv4/af_inet.c	Tue Aug 15 13:45:44 2006
> > @@ -366,10 +366,17 @@ out_rcu_unlock:
> >  int inet_release(struct socket *sock)
> >  {
> >  	struct sock *sk = sock->sk;
> > +	struct net_namespace *ns, *orig_net_ns;
> >
> >  	if (sk) {
> >  		long timeout;
> >
> > +		/* Need to change context here since protocol ->close
> > +		 * operation may send packets.
> > +		 */
> > +		ns = get_net_ns(sk->sk_net_ns);
> > +		push_net_ns(ns, orig_net_ns);
> > +
>
> Is it not a race condition here ? What happens if you have a packet
> incoming during the namespace context switching ?

All asynchronous operations (RX softirq, timers) should set their context
explicitly, and can't rely on the current context being the right one
(or a valid pointer at all).

	Andrey
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
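The rule stated in this reply (an asynchronous handler sets its own context and restores the previous one on exit) can be sketched in user-space C. This is only an illustration under assumptions: `push_ns()`, `pop_ns()`, `current_ns`, `struct sock_sketch` and `timer_handler()` are hypothetical stand-ins, not the definitions from the patch series, where `push_net_ns()`/`pop_net_ns()` take the saved variable directly rather than a pointer to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the namespace structure in the series. */
struct net_namespace { int id; };

static struct net_namespace init_ns = { 0 };
/* Stand-in for the "current" namespace that synchronous code relies on. */
static struct net_namespace *current_ns = &init_ns;

/* Save the current context into *saved, then switch to ns. */
static void push_ns(struct net_namespace *ns, struct net_namespace **saved)
{
	*saved = current_ns;
	current_ns = ns;
}

/* Restore the context saved by push_ns(). */
static void pop_ns(struct net_namespace *saved)
{
	current_ns = saved;
}

/* Hypothetical socket: remembers the namespace it belongs to. */
struct sock_sketch {
	struct net_namespace *sk_net_ns;
	int last_ns_seen;
};

/*
 * Timer-like handler. It may run on top of any task, so it does not
 * trust current_ns on entry: it takes the namespace from the socket.
 */
static void timer_handler(struct sock_sketch *sk)
{
	struct net_namespace *orig;

	push_ns(sk->sk_net_ns, &orig);
	sk->last_ns_seen = current_ns->id; /* work happens in the socket's ns */
	pop_ns(orig);
}
```

Whatever context the handler interrupts is left untouched once it returns, which is the property the reply relies on.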
Re: [PATCH 4/9] network namespaces: socket hashes
Hi,

On Mon, Sep 18, 2006 at 05:12:49PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > Socket hash lookups are made within namespace.
> > Hash tables are common for all namespaces, with
> > additional permutation of indexes.
>
> Hi Andrey,
>
> why is the hash table common and not instantiated multiple times for each
> namespace like the routes ?

The main reason is that socket hash tables should be large enough to work
efficiently, but it isn't good to waste a lot of memory on each namespace.
Namespaces should be cheap enough to allow having hundreds of them.
This memory-efficiency argument takes priority unless/until socket hash
tables start to resize automatically.

Another point is that routing lookup is much more complicated than socket
lookup, so adding another search key to it is harder.
Routing also has additional routines for deleting entries matching some
patterns, and so on.
In short, routing is much more complicated, and it is already quite
efficient for various sizes of routing tables.

	Andrey
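The "one shared table, index permuted by namespace" idea from this exchange can be sketched as follows. This is a user-space illustration under assumptions: `struct ns_sketch`, `struct sock_entry`, `bucket()`, `insert()` and `lookup()` are hypothetical names, and the table size is arbitrary. Only the shape of the index computation, `(port ^ ns_hash) & (size - 1)`, follows the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 16 /* power of two; one table shared by all namespaces */

/* Hypothetical namespace: only its per-namespace hash value matters here. */
struct ns_sketch { unsigned int hash; };

struct sock_entry {
	unsigned short port;
	const struct ns_sketch *ns;
	struct sock_entry *next;
};

static struct sock_entry *table[TABLE_SIZE];

/* Mix the namespace hash into the bucket index. */
static unsigned int bucket(unsigned short port, const struct ns_sketch *ns)
{
	return (port ^ ns->hash) & (TABLE_SIZE - 1);
}

static void insert(struct sock_entry *e)
{
	unsigned int b = bucket(e->port, e->ns);

	e->next = table[b];
	table[b] = e;
}

/*
 * The namespace is also compared on lookup: two namespaces can still
 * land in the same bucket, so the index permutation alone is not enough.
 */
static struct sock_entry *lookup(unsigned short port, const struct ns_sketch *ns)
{
	struct sock_entry *e;

	for (e = table[bucket(port, ns)]; e; e = e->next)
		if (e->port == port && e->ns == ns)
			return e;
	return NULL;
}
```

With this layout the memory cost of the table is paid once, and a new namespace only adds its own entries to the existing chains.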
Re: [PATCH 3/9] network namespaces: playing and debugging
On Wed, Aug 16, 2006 at 11:22:28AM -0600, Eric W. Biederman wrote:
> Stephen Hemminger [EMAIL PROTECTED] writes:
>
> > On Tue, 15 Aug 2006 18:48:43 +0400
> > Andrey Savochkin [EMAIL PROTECTED] wrote:
> >
> > > Temporary code to play with network namespaces in the simplest way.
> > > Do
> > > 	exec 7< /proc/net/net_ns
> > > in your bash shell and you'll get a brand new network namespace.
> > > There you can, for example, do
> > > 	ip link set lo up
> > > 	ip addr list
> > > 	ip addr add 1.2.3.4 dev lo
> > > 	ping -n 1.2.3.4
> > >
> > > Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
> >
> > NACK, new /proc interfaces are not acceptable.
>
> The rule is that new /proc interfaces that are not process related
> are not acceptable. If structured right a network namespace can
> arguably be process related.
>
> I do agree that this interface is pretty ugly there.

This proc interface was a backdoor to play with namespaces without compiling
any user-space programs.
As you wish.
Do you want to have a new clone flag right away?

	Andrey
[PATCH 7/9] net_device seq_file
Library function to create a seq_file in proc filesystem,
showing some information for each netdevice.
This code is present in the kernel in about 10 instances, and
all of them can be converted to use the introduced library function.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/netdevice.h |    7 +++
 net/core/dev.c            |   96 ++
 2 files changed, 103 insertions(+)

--- ./include/linux/netdevice.h.venetproc	Tue Aug 15 13:46:08 2006
+++ ./include/linux/netdevice.h	Tue Aug 15 13:46:08 2006
@@ -592,6 +592,13 @@ extern int		register_netdevice(struct ne
 extern int		unregister_netdevice(struct net_device *dev);
 extern void		free_netdev(struct net_device *dev);
 extern void		synchronize_net(void);
+#ifdef CONFIG_PROC_FS
+extern int		netdev_proc_create(char *name,
+				int (*show)(struct seq_file *,
+					struct net_device *, void *),
+				void *data, struct module *mod);
+void			netdev_proc_remove(char *name);
+#endif
 extern int		register_netdevice_notifier(struct notifier_block *nb);
 extern int		unregister_netdevice_notifier(struct notifier_block *nb);
 extern int		call_netdevice_notifiers(unsigned long val, void *v);
--- ./net/core/dev.c.venetproc	Tue Aug 15 13:46:08 2006
+++ ./net/core/dev.c	Tue Aug 15 13:46:08 2006
@@ -2100,6 +2100,102 @@ static int dev_ifconf(char __user *arg)
 }
 
 #ifdef CONFIG_PROC_FS
+
+struct netdev_proc_data {
+	struct file_operations fops;
+	int (*show)(struct seq_file *, struct net_device *, void *);
+	void *data;
+};
+
+static void *netdev_proc_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct net_device *dev;
+	loff_t off;
+
+	read_lock(&dev_base_lock);
+	if (*pos == 0)
+		return SEQ_START_TOKEN;
+	for (dev = dev_base, off = 1; dev; dev = dev->next, off++) {
+		if (*pos == off)
+			return dev;
+	}
+	return NULL;
+}
+
+static void *netdev_proc_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	++*pos;
+	return (v == SEQ_START_TOKEN) ? dev_base
+			: ((struct net_device *)v)->next;
+}
+
+static void netdev_proc_seq_stop(struct seq_file *seq, void *v)
+{
+	read_unlock(&dev_base_lock);
+}
+
+static int netdev_proc_seq_show(struct seq_file *seq, void *v)
+{
+	struct netdev_proc_data *p;
+
+	p = seq->private;
+	return (*p->show)(seq, v, p->data);
+}
+
+static struct seq_operations netdev_proc_seq_ops = {
+	.start = netdev_proc_seq_start,
+	.next  = netdev_proc_seq_next,
+	.stop  = netdev_proc_seq_stop,
+	.show  = netdev_proc_seq_show,
+};
+
+static int netdev_proc_open(struct inode *inode, struct file *file)
+{
+	int err;
+	struct seq_file *p;
+
+	err = seq_open(file, &netdev_proc_seq_ops);
+	if (!err) {
+		p = file->private_data;
+		p->private = (struct netdev_proc_data *)PDE(inode)->data;
+	}
+	return err;
+}
+
+int netdev_proc_create(char *name,
+		int (*show)(struct seq_file *, struct net_device *, void *),
+		void *data, struct module *mod)
+{
+	struct netdev_proc_data *p;
+	struct proc_dir_entry *ent;
+
+	p = kzalloc(sizeof(*p), GFP_KERNEL);
+	p->fops.owner = mod;
+	p->fops.open = netdev_proc_open;
+	p->fops.read = seq_read;
+	p->fops.llseek = seq_lseek;
+	p->fops.release = seq_release;
+	p->show = show;
+	p->data = data;
+	ent = create_proc_entry(name, S_IRUGO, proc_net);
+	if (ent == NULL) {
+		kfree(p);
+		return -EINVAL;
+	}
+	ent->data = p;
+	ent->destructor = proc_data_destructor;
+	smp_wmb();
+	ent->proc_fops = &p->fops;
+	return 0;
+}
+EXPORT_SYMBOL(netdev_proc_create);
+
+void netdev_proc_remove(char *name)
+{
+	proc_net_remove(name);
+}
+EXPORT_SYMBOL(netdev_proc_remove);
+
 /*
  * This is invoked by the /proc filesystem handler to display a device
  * in detail.
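The position handling in the start/next callbacks of this patch (position 0 is a header token, position N is the N-th device) can be tried out in user space. The sketch below is an assumption-laden model, not kernel code: `struct dev_sketch`, `START_TOKEN`, `it_start()`, `it_next()` and `walk()` are hypothetical names, locking is omitted, and `walk()` merely imitates how a caller might drive the iterator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device: a singly linked list. */
struct dev_sketch {
	const char *name;
	struct dev_sketch *next;
};

/* Distinguished non-NULL pointer meaning "header record". */
static char start_token_obj;
#define START_TOKEN ((void *)&start_token_obj)

static struct dev_sketch *dev_list;

/* Position 0 is the header; position N (N >= 1) is the N-th device. */
static void *it_start(long *pos)
{
	struct dev_sketch *d;
	long off;

	if (*pos == 0)
		return START_TOKEN;
	for (d = dev_list, off = 1; d; d = d->next, off++)
		if (*pos == off)
			return d;
	return NULL;
}

/* Advance the position and return the record that follows v. */
static void *it_next(void *v, long *pos)
{
	++*pos;
	return v == START_TOKEN ? (void *)dev_list
				: (void *)((struct dev_sketch *)v)->next;
}

/* Walk from position `from`, counting header and device records. */
static void walk(long from, int *headers, int *devices)
{
	long pos = from;
	void *v;

	*headers = *devices = 0;
	for (v = it_start(&pos); v; v = it_next(v, &pos)) {
		if (v == START_TOKEN)
			(*headers)++;
		else
			(*devices)++;
	}
}
```

Restarting at an arbitrary position matters because a reader may consume the file in several chunks, and each chunk begins with a fresh call to the start callback.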
[PATCH 5/9] network namespaces: async socket operations
Non-trivial part of socket namespaces: asynchronous events
should be run in proper context.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 af_inet.c            |   10 ++
 inet_timewait_sock.c |    8
 tcp_timer.c          |    9 +
 3 files changed, 27 insertions(+)

--- ./net/ipv4/af_inet.c.venssock-asyn	Mon Aug 14 17:04:07 2006
+++ ./net/ipv4/af_inet.c	Tue Aug 15 13:45:44 2006
@@ -366,10 +366,17 @@ out_rcu_unlock:
 int inet_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
+	struct net_namespace *ns, *orig_net_ns;
 
 	if (sk) {
 		long timeout;
 
+		/* Need to change context here since protocol ->close
+		 * operation may send packets.
+		 */
+		ns = get_net_ns(sk->sk_net_ns);
+		push_net_ns(ns, orig_net_ns);
+
 		/* Applications forget to leave groups before exiting */
 		ip_mc_drop_socket(sk);
 
@@ -386,6 +393,9 @@ int inet_release(struct socket *sock)
 			timeout = sk->sk_lingertime;
 		sock->sk = NULL;
 		sk->sk_prot->close(sk, timeout);
+
+		pop_net_ns(orig_net_ns);
+		put_net_ns(ns);
 	}
 	return 0;
 }
--- ./net/ipv4/inet_timewait_sock.c.venssock-asyn	Tue Aug 15 13:45:44 2006
+++ ./net/ipv4/inet_timewait_sock.c	Tue Aug 15 13:45:44 2006
@@ -129,6 +129,7 @@ static int inet_twdr_do_twkill_work(stru
 {
 	struct inet_timewait_sock *tw;
 	struct hlist_node *node;
+	struct net_namespace *orig_net_ns;
 	unsigned int killed;
 	int ret;
 
@@ -140,8 +141,10 @@ static int inet_twdr_do_twkill_work(stru
 	 */
 	killed = 0;
 	ret = 0;
+	push_net_ns(current_net_ns, orig_net_ns);
 rescan:
 	inet_twsk_for_each_inmate(tw, node, &twdr->cells[slot]) {
+		switch_net_ns(tw->tw_net_ns);
 		__inet_twsk_del_dead_node(tw);
 		spin_unlock(&twdr->death_lock);
 		__inet_twsk_kill(tw, twdr->hashinfo);
@@ -164,6 +167,7 @@ rescan:
 	twdr->tw_count -= killed;
 	NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITED, killed);
 
+	pop_net_ns(orig_net_ns);
 	return ret;
 }
 
@@ -338,10 +342,12 @@ void inet_twdr_twcal_tick(unsigned long
 	int n, slot;
 	unsigned long j;
 	unsigned long now = jiffies;
+	struct net_namespace *orig_net_ns;
 	int killed = 0;
 	int adv = 0;
 
 	twdr = (struct inet_timewait_death_row *)data;
+	push_net_ns(current_net_ns, orig_net_ns);
 
 	spin_lock(&twdr->death_lock);
 	if (twdr->twcal_hand < 0)
@@ -357,6 +363,7 @@ void inet_twdr_twcal_tick(unsigned long
 
 				inet_twsk_for_each_inmate_safe(tw, node, safe,
 							       &twdr->twcal_row[slot]) {
+					switch_net_ns(tw->tw_net_ns);
 					__inet_twsk_del_dead_node(tw);
 					__inet_twsk_kill(tw, twdr->hashinfo);
 					inet_twsk_put(tw);
@@ -384,6 +391,7 @@ out:
 		del_timer(&twdr->tw_timer);
 	NET_ADD_STATS_BH(LINUX_MIB_TIMEWAITKILLED, killed);
 	spin_unlock(&twdr->death_lock);
+	pop_net_ns(orig_net_ns);
 }
 
 EXPORT_SYMBOL_GPL(inet_twdr_twcal_tick);
--- ./net/ipv4/tcp_timer.c.venssock-asyn	Mon Aug 14 16:43:51 2006
+++ ./net/ipv4/tcp_timer.c	Tue Aug 15 13:45:44 2006
@@ -171,7 +171,9 @@ static void tcp_delack_timer(unsigned lo
 	struct sock *sk = (struct sock*)data;
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net_namespace *orig_net_ns;
 
+	push_net_ns(sk->sk_net_ns, orig_net_ns);
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk)) {
 		/* Try again later. */
@@ -225,6 +227,7 @@ out:
 out_unlock:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	pop_net_ns(orig_net_ns);
 }
 
 static void tcp_probe_timer(struct sock *sk)
@@ -384,8 +387,10 @@ static void tcp_write_timer(unsigned lon
 {
 	struct sock *sk = (struct sock*)data;
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net_namespace *orig_net_ns;
 	int event;
 
+	push_net_ns(sk->sk_net_ns, orig_net_ns);
 	bh_lock_sock(sk);
 	if (sock_owned_by_user(sk)) {
 		/* Try again later */
@@ -419,6 +424,7 @@ out:
 out_unlock:
 	bh_unlock_sock(sk);
 	sock_put(sk);
+	pop_net_ns(orig_net_ns);
 }
 
 /*
@@ -447,9 +453,11 @@ static void tcp_keepalive_timer (unsigne
 {
 	struct sock *sk = (struct sock *) data;
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net_namespace *orig_net_ns;
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 elapsed;
 
+	push_net_ns(sk->sk_net_ns, orig_net_ns);
 	/* Only process if socket is not in use. */
 	bh_lock_sock(sk
[RFC] network namespaces
Hi All,

I'd like to resurrect our discussion about network namespaces.
In our previous discussions it appeared that we have rather polar concepts
which seemed hard to reconcile.
Now I have an idea how to look at all discussed concepts to enable everyone's
usage scenario.

1. The most straightforward concept is complete separation of namespaces,
   covering device list, routing tables, netfilter tables, socket hashes, and
   everything else.

   On the input path, each packet is tagged with a namespace right at the
   point where it arrives from a device, and is processed by each layer in
   the context of this namespace.
   Non-root namespaces communicate with the outside world in two ways: by
   owning hardware devices, or by receiving packets forwarded to them by
   their parent namespace via a pass-through device.

   This complete separation of namespaces is very useful for at least two
   purposes:
    - allowing users to create and manage various tunnels and VPNs on their
      own, and
    - enabling easier and more straightforward live migration of groups of
      processes with their environment.

2. People expressed concerns that complete separation of namespaces
   may introduce an undesired overhead in certain usage scenarios.
   The overhead comes from packets traversing the input path, then the
   output path, then the input path again in the destination namespace if
   the root namespace acts as a router.

   So, we may introduce short-cuts, where an input packet starts to be
   processed in one namespace, but changes it at some upper layer.
   The places where a packet can change namespace are, for example:
   routing, post-routing netfilter hook, or even lookup in socket hash.

   The cleanest example among them is the post-routing netfilter hook.
   Tagging of input packets there means that the packet is checked against
   the root namespace's routing table, found to be local, and goes directly
   to the socket hash lookup in the destination namespace.
   In this scheme the ability to change routing tables or netfilter rules on
   a per-namespace basis is traded for lower overhead.

   All other optimized schemes where input packets do not travel
   input-output-input paths in the general case may be viewed as short-cuts
   in scheme (1). The remaining question is exactly which short-cuts make
   the most sense, and how to make them consistent from the interface point
   of view.

My current idea is to reach some agreement on the basic concept, review
patches, and then move on to implementing feasible short-cuts.

Opinions?

Next in this thread are patches introducing namespaces to device list,
IPv4 routing, and socket hashes, and a pass-through device.
Patches are against 2.6.18-rc4-mm1.

Best regards,

Andrey
[PATCH 2/9] network namespaces: IPv4 routing
Structures related to IPv4 routing (FIB and routing cache)
are made per-namespace.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/net_ns.h   |   10 +++
 include/net/flow.h       |    3 +
 include/net/ip_fib.h     |   46
 net/core/dev.c           |    8 ++
 net/core/fib_rules.c     |   43 ---
 net/ipv4/Kconfig         |    4 -
 net/ipv4/fib_frontend.c  |  132 +--
 net/ipv4/fib_hash.c      |   13 +++-
 net/ipv4/fib_rules.c     |   86 +-
 net/ipv4/fib_semantics.c |   99 +++
 net/ipv4/route.c         |   26 -
 11 files changed, 375 insertions(+), 95 deletions(-)

--- ./include/linux/net_ns.h.vensroute	Mon Aug 14 17:18:59 2006
+++ ./include/linux/net_ns.h	Mon Aug 14 19:19:14 2006
@@ -14,7 +14,17 @@ struct net_namespace {
 	atomic_t		active_ref, use_ref;
 	struct net_device	*dev_base_p, **dev_tail_p;
 	struct net_device	*loopback;
+#ifndef CONFIG_IP_MULTIPLE_TABLES
+	struct fib_table	*fib4_local_table, *fib4_main_table;
+#else
+	struct list_head	fib_rules_ops_list;
+	struct fib_rules_ops	*fib4_rules_ops;
+	struct hlist_head	*fib4_tables;
+#endif
+	struct hlist_head	*fib4_hash, *fib4_laddrhash;
+	unsigned		fib4_hash_size, fib4_info_cnt;
 	unsigned int		hash;
+	char			destroying;
 	struct work_struct	destroy_work;
 };
 
--- ./include/net/flow.h.vensroute	Mon Aug 14 17:04:04 2006
+++ ./include/net/flow.h	Mon Aug 14 17:18:59 2006
@@ -79,6 +79,9 @@ struct flowi {
 #define fl_icmp_code	uli_u.icmpt.code
 #define fl_ipsec_spi	uli_u.spi
 	__u32           secid;	/* used by xfrm; see secid.txt */
+#ifdef CONFIG_NET_NS
+	struct net_namespace *net_ns;
+#endif
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 #define FLOW_DIR_IN	0
--- ./include/net/ip_fib.h.vensroute	Mon Aug 14 17:04:04 2006
+++ ./include/net/ip_fib.h	Tue Aug 15 11:53:22 2006
@@ -18,6 +18,7 @@
 
 #include <net/flow.h>
 #include <linux/seq_file.h>
+#include <linux/net_ns.h>
 #include <net/fib_rules.h>
 
 /* WARNING: The ordering of these elements must match ordering
@@ -171,14 +172,21 @@ struct fib_table {
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
-extern struct fib_table *ip_fib_local_table;
-extern struct fib_table *ip_fib_main_table;
+#ifndef CONFIG_NET_NS
+extern struct fib_table *ip_fib_local_table_static;
+extern struct fib_table *ip_fib_main_table_static;
+#define ip_fib_local_table_ns()		ip_fib_local_table_static
+#define ip_fib_main_table_ns()		ip_fib_main_table_static
+#else
+#define ip_fib_local_table_ns()		(current_net_ns->fib4_local_table)
+#define ip_fib_main_table_ns()		(current_net_ns->fib4_main_table)
+#endif
 
 static inline struct fib_table *fib_get_table(u32 id)
 {
 	if (id != RT_TABLE_LOCAL)
-		return ip_fib_main_table;
-	return ip_fib_local_table;
+		return ip_fib_main_table_ns();
+	return ip_fib_local_table_ns();
 }
 
 static inline struct fib_table *fib_new_table(u32 id)
@@ -188,21 +196,29 @@ static inline struct fib_table *fib_new_
 
 static inline int fib_lookup(const struct flowi *flp, struct fib_result *res)
 {
-	if (ip_fib_local_table->tb_lookup(ip_fib_local_table, flp, res) &&
-	    ip_fib_main_table->tb_lookup(ip_fib_main_table, flp, res))
+	struct fib_table *tb;
+
+	tb = ip_fib_local_table_ns();
+	if (!tb->tb_lookup(tb, flp, res))
+		return 0;
+	tb = ip_fib_main_table_ns();
+	if (tb->tb_lookup(tb, flp, res))
 		return -ENETUNREACH;
 	return 0;
 }
 
 static inline void fib_select_default(const struct flowi *flp, struct fib_result *res)
 {
+	struct fib_table *tb;
+
+	tb = ip_fib_main_table_ns();
 	if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
-		ip_fib_main_table->tb_select_default(ip_fib_main_table, flp, res);
+		tb->tb_select_default(tb, flp, res);
 }
 
 #else /* CONFIG_IP_MULTIPLE_TABLES */
-#define ip_fib_local_table fib_get_table(RT_TABLE_LOCAL)
-#define ip_fib_main_table fib_get_table(RT_TABLE_MAIN)
+#define ip_fib_local_table_ns() fib_get_table(RT_TABLE_LOCAL)
+#define ip_fib_main_table_ns() fib_get_table(RT_TABLE_MAIN)
 
 extern int fib_lookup(struct flowi *flp, struct fib_result *res);
 
@@ -214,6 +230,10 @@ extern void fib_select_default(const str
 
 /* Exported by fib_frontend.c */
 extern void		ip_fib_init(void);
+#ifdef CONFIG_NET_NS
+extern int ip_fib_struct_init(void);
+extern void ip_fib_struct_cleanup(void);
+#endif
 extern int inet_rtm_delroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
 extern int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg);
 extern int inet_rtm_getroute
[PATCH 1/9] network namespaces: core and device list
CONFIG_NET_NS and net_namespace structure are introduced.
List of network devices is made per-namespace.
Each namespace gets its own loopback device.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 drivers/net/loopback.c    |   69 -
 include/linux/init_task.h |    9 ++
 include/linux/net_ns.h    |   82 +
 include/linux/netdevice.h |   13 +++
 include/linux/nsproxy.h   |    3
 include/linux/sched.h     |    3
 kernel/nsproxy.c          |   14
 net/Kconfig               |    7 ++
 net/core/dev.c            |  150 --
 net/core/net-sysfs.c      |   24 +++
 net/ipv4/devinet.c        |    2
 net/ipv6/addrconf.c       |    2
 net/ipv6/route.c          |    9 +-
 13 files changed, 349 insertions(+), 38 deletions(-)

--- ./drivers/net/loopback.c.vensdev	Mon Aug 14 17:02:18 2006
+++ ./drivers/net/loopback.c	Mon Aug 14 17:18:20 2006
@@ -196,42 +196,55 @@ static struct ethtool_ops loopback_ethto
 	.set_tso		= ethtool_op_set_tso,
 };
 
-struct net_device loopback_dev = {
-	.name			= "lo",
-	.mtu			= (16 * 1024) + 20 + 20 + 12,
-	.hard_start_xmit	= loopback_xmit,
-	.hard_header		= eth_header,
-	.hard_header_cache	= eth_header_cache,
-	.header_cache_update	= eth_header_cache_update,
-	.hard_header_len	= ETH_HLEN,	/* 14	*/
-	.addr_len		= ETH_ALEN,	/* 6	*/
-	.tx_queue_len		= 0,
-	.type			= ARPHRD_LOOPBACK,	/* 0x0001*/
-	.rebuild_header		= eth_rebuild_header,
-	.flags			= IFF_LOOPBACK,
-	.features		= NETIF_F_SG | NETIF_F_FRAGLIST
+struct net_device loopback_dev_static;
+EXPORT_SYMBOL(loopback_dev_static);
+
+void loopback_dev_dtor(struct net_device *dev)
+{
+	if (dev->priv) {
+		kfree(dev->priv);
+		dev->priv = NULL;
+	}
+	free_netdev(dev);
+}
+
+void loopback_dev_ctor(struct net_device *dev)
+{
+	struct net_device_stats *stats;
+
+	memset(dev, 0, sizeof(*dev));
+	strcpy(dev->name, "lo");
+	dev->mtu		= (16 * 1024) + 20 + 20 + 12;
+	dev->hard_start_xmit	= loopback_xmit;
+	dev->hard_header	= eth_header;
+	dev->hard_header_cache	= eth_header_cache;
+	dev->header_cache_update = eth_header_cache_update;
+	dev->hard_header_len	= ETH_HLEN;	/* 14	*/
+	dev->addr_len		= ETH_ALEN;	/* 6	*/
+	dev->tx_queue_len	= 0;
+	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001*/
+	dev->rebuild_header	= eth_rebuild_header;
+	dev->flags		= IFF_LOOPBACK;
+	dev->features		= NETIF_F_SG | NETIF_F_FRAGLIST
 #ifdef LOOPBACK_TSO
 				  | NETIF_F_TSO
 #endif
 				  | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA
-				  | NETIF_F_LLTX,
-	.ethtool_ops		= &loopback_ethtool_ops,
-};
-
-/* Setup and register the loopback device. */
-int __init loopback_init(void)
-{
-	struct net_device_stats *stats;
+				  | NETIF_F_LLTX;
+	dev->ethtool_ops	= &loopback_ethtool_ops;
 
 	/* Can survive without statistics */
 	stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
 	if (stats) {
 		memset(stats, 0, sizeof(struct net_device_stats));
-		loopback_dev.priv = stats;
-		loopback_dev.get_stats = &get_stats;
+		dev->priv = stats;
+		dev->get_stats = &get_stats;
 	}
-
-	return register_netdev(&loopback_dev);
-};
+}
 
-EXPORT_SYMBOL(loopback_dev);
+/* Setup and register the loopback device. */
+int __init loopback_init(void)
+{
+	loopback_dev_ctor(&loopback_dev_static);
+	return register_netdev(&loopback_dev_static);
+};
--- ./include/linux/init_task.h.vensdev	Mon Aug 14 17:04:04 2006
+++ ./include/linux/init_task.h	Mon Aug 14 17:18:21 2006
@@ -87,6 +87,14 @@ extern struct nsproxy init_nsproxy;
 
 extern struct group_info init_groups;
 
+#ifdef CONFIG_NET_NS
+extern struct net_namespace init_net_ns;
+#define INIT_NET_NS \
+	.net_context	= &init_net_ns,
+#else
+#define INIT_NET_NS
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -129,6 +137,7 @@ extern struct group_info init_groups;
 	.signal		= &init_signals,				\
 	.sighand	= &init_sighand,				\
 	.nsproxy	= &init_nsproxy,				\
+	INIT_NET_NS							\
 	.pending	= {						\
 		.list = LIST_HEAD_INIT
[PATCH 3/9] network namespaces: playing and debugging
Temporary code to play with network namespaces in the simplest way.
Do
	exec 7< /proc/net/net_ns
in your bash shell and you'll get a brand new network namespace.
There you can, for example, do
	ip link set lo up
	ip addr list
	ip addr add 1.2.3.4 dev lo
	ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 dev.c |   20
 1 files changed, 20 insertions(+)

--- ./net/core/dev.c.vensxdbg	Tue Aug 15 13:46:44 2006
+++ ./net/core/dev.c	Tue Aug 15 13:46:44 2006
@@ -3597,6 +3597,8 @@ int net_ns_start(void)
 	if (err)
 		goto out_register;
 	put_net_ns(orig_ns);
+	printk(KERN_DEBUG "NET_NS: created new netcontext %p for %s (pid=%d)\n",
+			ns, task->comm, task->tgid);
 	return 0;
 
 out_register:
@@ -3629,14 +3631,29 @@ static void net_ns_destroy(void *data)
 	ip_fib_struct_cleanup();
 	pop_net_ns(orig_ns);
 	kfree(ns);
+	printk(KERN_DEBUG "NET_NS: netcontext %p freed\n", ns);
 }
 
 void net_ns_stop(struct net_namespace *ns)
 {
+	printk(KERN_DEBUG "NET_NS: netcontext %p scheduled for stop\n", ns);
 	INIT_WORK(&ns->destroy_work, net_ns_destroy, ns);
 	schedule_work(&ns->destroy_work);
 }
 EXPORT_SYMBOL(net_ns_stop);
+
+static int net_ns_open(struct inode *i, struct file *f)
+{
+	return net_ns_start();
+}
+static struct file_operations net_ns_fops = {
+	.open	= net_ns_open,
+};
+static int net_ns_init(void)
+{
+	return proc_net_fops_create("net_ns", S_IRWXU, &net_ns_fops)
+			? 0 : -ENOMEM;
+}
 #endif
 
 /*
@@ -3701,6 +3718,9 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+#ifdef CONFIG_NET_NS
+	net_ns_init();
+#endif
 	rc = 0;
 out:
 	return rc;
[PATCH 4/9] network namespaces: socket hashes
Socket hash lookups are made within namespace.
Hash tables are common for all namespaces, with
additional permutation of indexes.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/ipv6.h             |    3 ++-
 include/net/inet6_hashtables.h   |    6 --
 include/net/inet_hashtables.h    |   38 +-
 include/net/inet_sock.h          |    6 --
 include/net/inet_timewait_sock.h |    2 ++
 include/net/sock.h               |    4
 include/net/udp.h                |   12 +---
 net/core/sock.c                  |    5 +
 net/ipv4/inet_connection_sock.c  |   19 +++
 net/ipv4/inet_hashtables.c       |   29 ++---
 net/ipv4/inet_timewait_sock.c    |    8 ++--
 net/ipv4/raw.c                   |    2 ++
 net/ipv4/udp.c                   |   20 +---
 net/ipv6/inet6_connection_sock.c |    2 ++
 net/ipv6/inet6_hashtables.c      |   25 ++---
 net/ipv6/raw.c                   |    4
 net/ipv6/udp.c                   |   21 ++---
 17 files changed, 151 insertions(+), 55 deletions(-)

--- ./include/linux/ipv6.h.venssock	Mon Aug 14 17:02:45 2006
+++ ./include/linux/ipv6.h	Tue Aug 15 13:38:47 2006
@@ -428,10 +428,11 @@ static inline struct raw6_sock *raw6_sk(
 #define inet_v6_ipv6only(__sk)		0
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
 
-#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)\
+#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif, __ns)\
 	(((__sk)->sk_hash == (__hash))				&& \
 	 ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports))	&& \
 	 ((__sk)->sk_family		== AF_INET6)		&& \
+	 net_ns_match((__sk)->sk_net_ns, __ns)			&& \
 	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&& \
 	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&& \
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
--- ./include/net/inet6_hashtables.h.venssock	Mon Aug 14 17:02:47 2006
+++ ./include/net/inet6_hashtables.h	Tue Aug 15 13:38:47 2006
@@ -26,11 +26,13 @@ struct inet_hashinfo;
 
 /* I have no idea if this is a good hash for v6 or not. -DaveM */
 static inline unsigned int inet6_ehashfn(const struct in6_addr *laddr, const u16 lport,
-				const struct in6_addr *faddr, const u16 fport)
+				const struct in6_addr *faddr, const u16 fport,
+				struct net_namespace *ns)
 {
 	unsigned int hashent = (lport ^ fport);
 
 	hashent ^= (laddr->s6_addr32[3] ^ faddr->s6_addr32[3]);
+	hashent ^= net_ns_hash(ns);
 	hashent ^= hashent >> 16;
 	hashent ^= hashent >> 8;
 	return hashent;
@@ -44,7 +46,7 @@ static inline int inet6_sk_ehashfn(const
 	const struct in6_addr *faddr = &np->daddr;
 	const __u16 lport = inet->num;
 	const __u16 fport = inet->dport;
-	return inet6_ehashfn(laddr, lport, faddr, fport);
+	return inet6_ehashfn(laddr, lport, faddr, fport, current_net_ns);
 }
 
 extern void __inet6_hash(struct inet_hashinfo *hashinfo, struct sock *sk);
--- ./include/net/inet_hashtables.h.venssock	Mon Aug 14 17:04:04 2006
+++ ./include/net/inet_hashtables.h	Tue Aug 15 13:38:47 2006
@@ -74,6 +74,9 @@ struct inet_ehash_bucket {
  * ports are created in O(1) time?  I thought so. ;-)	-DaveM
  */
 struct inet_bind_bucket {
+#ifdef CONFIG_NET_NS
+	struct net_namespace	*net_ns;
+#endif
 	unsigned short		port;
 	signed short		fastreuse;
 	struct hlist_node	node;
@@ -142,30 +145,34 @@ extern struct inet_bind_bucket *
 extern void inet_bind_bucket_destroy(kmem_cache_t *cachep,
 				     struct inet_bind_bucket *tb);
 
-static inline int inet_bhashfn(const __u16 lport, const int bhash_size)
+static inline int inet_bhashfn(const __u16 lport,
+			       struct net_namespace *ns,
+			       const int bhash_size)
 {
-	return lport & (bhash_size - 1);
+	return (lport ^ net_ns_hash(ns)) & (bhash_size - 1);
 }
 
 extern void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 			   const unsigned short snum);
 
 /* These can have wildcards, don't try too hard. */
-static inline int inet_lhashfn(const unsigned short num)
+static inline int inet_lhashfn(const unsigned short num,
+			       struct net_namespace *ns)
 {
-	return num & (INET_LHTABLE_SIZE - 1);
+	return (num ^ net_ns_hash(ns)) & (INET_LHTABLE_SIZE - 1);
 }
 
 static inline int inet_sk_listen_hashfn(const struct sock *sk)
 {
-	return inet_lhashfn(inet_sk(sk)->num);
+	return inet_lhashfn(inet_sk(sk)->num, current_net_ns);
 }
 
 /* Caller must disable local BH processing. */
 static inline void
[PATCH 6/9] allow proc_dir_entries to have destructor
Destructor field added to proc_dir_entries,
standard destructor kfree'ing data introduced.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 fs/proc/generic.c       |   10 --
 fs/proc/root.c          |    1 +
 include/linux/proc_fs.h |    4
 3 files changed, 13 insertions(+), 2 deletions(-)

--- ./fs/proc/generic.c.veprocdtor	Mon Aug 14 16:43:41 2006
+++ ./fs/proc/generic.c	Tue Aug 15 13:45:51 2006
@@ -608,6 +608,11 @@ static struct proc_dir_entry *proc_creat
 	return ent;
 }
 
+void proc_data_destructor(struct proc_dir_entry *ent)
+{
+	kfree(ent->data);
+}
+
 struct proc_dir_entry *proc_symlink(const char *name,
 		struct proc_dir_entry *parent, const char *dest)
 {
@@ -620,6 +625,7 @@ struct proc_dir_entry *proc_symlink(cons
 		ent->data = kmalloc((ent->size=strlen(dest))+1, GFP_KERNEL);
 		if (ent->data) {
 			strcpy((char*)ent->data,dest);
+			ent->destructor = proc_data_destructor;
 			if (proc_register(parent, ent) < 0) {
 				kfree(ent->data);
 				kfree(ent);
@@ -698,8 +704,8 @@ void free_proc_entry(struct proc_dir_ent
 
 	release_inode_number(ino);
 
-	if (S_ISLNK(de->mode) && de->data)
-		kfree(de->data);
+	if (de->destructor)
+		de->destructor(de);
 	kfree(de);
 }
 
--- ./fs/proc/root.c.veprocdtor	Mon Aug 14 17:02:38 2006
+++ ./fs/proc/root.c	Tue Aug 15 13:45:51 2006
@@ -154,6 +154,7 @@ EXPORT_SYMBOL(proc_symlink);
 EXPORT_SYMBOL(proc_mkdir);
 EXPORT_SYMBOL(create_proc_entry);
 EXPORT_SYMBOL(remove_proc_entry);
+EXPORT_SYMBOL(proc_data_destructor);
 EXPORT_SYMBOL(proc_root);
 EXPORT_SYMBOL(proc_root_fs);
 EXPORT_SYMBOL(proc_net);
--- ./include/linux/proc_fs.h.veprocdtor	Mon Aug 14 17:02:47 2006
+++ ./include/linux/proc_fs.h	Tue Aug 15 13:45:51 2006
@@ -46,6 +46,8 @@ typedef	int (read_proc_t)(char *page, ch
 typedef	int (write_proc_t)(struct file *file, const char __user *buffer,
 			   unsigned long count, void *data);
 typedef int (get_info_t)(char *, char **, off_t, int);
+struct proc_dir_entry;
+typedef void (destroy_proc_t)(struct proc_dir_entry *);
 
 struct proc_dir_entry {
 	unsigned int low_ino;
@@ -65,6 +67,7 @@ struct proc_dir_entry {
 	read_proc_t *read_proc;
 	write_proc_t *write_proc;
 	atomic_t count;		/* use count */
+	destroy_proc_t *destructor;
 	int deleted;		/* delete flag */
 	void *set;
 };
@@ -109,6 +112,7 @@ char *task_mem(struct mm_struct *, char
 extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
 						struct proc_dir_entry *parent);
 extern void remove_proc_entry(const char *name, struct proc_dir_entry *parent);
+extern void proc_data_destructor(struct proc_dir_entry *);
 
 extern struct vfsmount *proc_mnt;
 extern int proc_fill_super(struct super_block *,void *,int);
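The pattern this patch introduces (an optional per-entry destructor called from the common free path, with a stock destructor that frees the data pointer) can be modelled in user space. The sketch rests on assumptions: `struct entry_sketch`, `data_destructor()`, `free_entry()`, `make_entry()` and the `destroyed_count` counter are hypothetical names for illustration, with `malloc()`/`free()` standing in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entry_sketch;
typedef void (destroy_fn)(struct entry_sketch *);

/* Hypothetical stand-in for a proc entry: payload plus optional destructor. */
struct entry_sketch {
	void *data;
	destroy_fn *destructor;
};

/* Counter used only to observe that the destructor actually ran. */
static int destroyed_count;

/* Stock destructor: the entry owns its data and frees it. */
static void data_destructor(struct entry_sketch *ent)
{
	free(ent->data);
	ent->data = NULL;
	destroyed_count++;
}

/* Common free path: no knowledge of what kind of entry this is. */
static void free_entry(struct entry_sketch *ent)
{
	if (ent->destructor)
		ent->destructor(ent);
	free(ent);
}

/* Create an entry; only entries with a payload get a destructor. */
static struct entry_sketch *make_entry(const char *payload)
{
	struct entry_sketch *ent = calloc(1, sizeof(*ent));

	if (!ent)
		return NULL;
	if (payload) {
		size_t n = strlen(payload) + 1;

		ent->data = malloc(n);
		if (!ent->data) {
			free(ent);
			return NULL;
		}
		memcpy(ent->data, payload, n);
		ent->destructor = data_destructor;
	}
	return ent;
}
```

The free path no longer needs a type check such as "is this a symlink": whoever attaches data to an entry also decides how it is released.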
[PATCH 8/9] network namespaces: device to pass packets between namespaces
A simple device to pass packets between a namespace and its child.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 Makefile |    3
 veth.c   |  327 +++
 2 files changed, 330 insertions(+)

--- ./drivers/net/Makefile.veveth	Mon Aug 14 17:03:45 2006
+++ ./drivers/net/Makefile	Tue Aug 15 13:46:15 2006
@@ -124,6 +124,9 @@ obj-$(CONFIG_SLIP) += slip.o
 obj-$(CONFIG_SLHC) += slhc.o
 obj-$(CONFIG_DUMMY) += dummy.o
+ifeq ($(CONFIG_NET_NS),y)
+obj-m += veth.o
+endif
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
--- ./drivers/net/veth.c.veveth	Tue Aug 15 13:44:46 2006
+++ ./drivers/net/veth.c	Tue Aug 15 13:46:15 2006
@@ -0,0 +1,327 @@
+/*
+ *  Copyright (C) 2006  SWsoft
+ *
+ *  Written by Andrey Savochkin [EMAIL PROTECTED],
+ *  reusing code by Andrey Mirkin [EMAIL PROTECTED].
+ */
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ctype.h>
+#include <asm/semaphore.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/proc_fs.h>
+#include <linux/seq_file.h>
+#include <net/dst.h>
+#include <net/xfrm.h>
+
+struct veth_struct
+{
+	struct net_device	*pair;
+	struct net_device_stats stats;
+};
+
+#define veth_from_netdev(dev) ((struct veth_struct *)(netdev_priv(dev)))
+
+/* --- *
+ *
+ * Device functions
+ *
+ * --- */
+
+static struct net_device_stats *get_stats(struct net_device *dev);
+static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct net_device_stats *stats;
+	struct veth_struct *entry;
+	struct net_device *rcv;
+	struct net_namespace *orig_net_ns;
+	int length;
+
+	stats = get_stats(dev);
+	entry = veth_from_netdev(dev);
+	rcv = entry->pair;
+
+	if (!(rcv->flags & IFF_UP))
+		/* Target namespace does not want to receive packets */
+		goto outf;
+
+	dst_release(skb->dst);
+	skb->dst = NULL;
+	secpath_reset(skb);
+	skb_orphan(skb);
+#ifdef CONFIG_NETFILTER
+	nf_conntrack_put(skb->nfct);
+#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
+	nf_conntrack_put_reasm(skb->nfct_reasm);
+#endif
+#ifdef CONFIG_BRIDGE_NETFILTER
+	nf_bridge_put(skb->nf_bridge);
+#endif
+#endif
+
+	push_net_ns(rcv->net_ns, orig_net_ns);
+	skb->dev = rcv;
+	skb->pkt_type = PACKET_HOST;
+	skb->protocol = eth_type_trans(skb, rcv);
+
+	length = skb->len;
+	stats->tx_bytes += length;
+	stats->tx_packets++;
+	stats = get_stats(rcv);
+	stats->rx_bytes += length;
+	stats->rx_packets++;
+
+	netif_rx(skb);
+	pop_net_ns(orig_net_ns);
+	return 0;
+
+outf:
+	stats->tx_dropped++;
+	kfree_skb(skb);
+	return 0;
+}
+
+static int veth_open(struct net_device *dev)
+{
+	return 0;
+}
+
+static int veth_close(struct net_device *dev)
+{
+	return 0;
+}
+
+static void veth_destructor(struct net_device *dev)
+{
+	free_netdev(dev);
+}
+
+static struct net_device_stats *get_stats(struct net_device *dev)
+{
+	return &veth_from_netdev(dev)->stats;
+}
+
+int veth_init_dev(struct net_device *dev)
+{
+	dev->hard_start_xmit = veth_xmit;
+	dev->open = veth_open;
+	dev->stop = veth_close;
+	dev->destructor = veth_destructor;
+	dev->get_stats = get_stats;
+
+	ether_setup(dev);
+
+	dev->tx_queue_len = 0;
+	return 0;
+}
+
+static void veth_setup(struct net_device *dev)
+{
+	dev->init = veth_init_dev;
+}
+
+static inline int is_veth_dev(struct net_device *dev)
+{
+	return dev->init == veth_init_dev;
+}
+
+/* --- *
+ *
+ * Management interface
+ *
+ * --- */
+
+struct net_device *veth_dev_alloc(char *name, char *addr)
+{
+	struct net_device *dev;
+
+	dev = alloc_netdev(sizeof(struct veth_struct), name, veth_setup);
+	if (dev != NULL) {
+		memcpy(dev->dev_addr, addr, ETH_ALEN);
+		dev->addr_len = ETH_ALEN;
+	}
+	return dev;
+}
+
+int veth_entry_add(char *parent_name, char *parent_addr,
+		char *child_name, char *child_addr,
+		struct net_namespace *child_ns)
+{
+	struct net_device *parent_dev, *child_dev;
+	struct net_namespace *parent_ns;
+	int err;
+
+	err = -ENOMEM;
+	if ((parent_dev = veth_dev_alloc(parent_name, parent_addr)) == NULL)
+		goto out_alocp;
+	if ((child_dev = veth_dev_alloc(child_name, child_addr)) == NULL)
+		goto out_alocc;
+	veth_from_netdev(parent_dev)->pair = child_dev;
+	veth_from_netdev(child_dev)->pair = parent_dev;
+
+	/*
+	 * About serialization, see
[PATCH 9/9] network namespaces: playing with pass-through device
Temporary code to debug and play with the pass-through device.
Create a device pair by
	modprobe veth
	echo 'add veth1 0:1:2:3:4:1 eth0 0:1:2:3:4:2' >/proc/net/veth_ctl
and your shell will be placed into a new namespace with an `eth0' device.
Configure the device in this namespace
	ip l s eth0 up
	ip a a 1.2.3.4/24 dev eth0
and in the root namespace
	ip l s veth1 up
	ip a a 1.2.3.1/24 dev veth1
to establish a communication channel between the root namespace and the newly
created one.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 veth.c |  113 +
 1 files changed, 113 insertions(+)

--- ./drivers/net/veth.c.veveth-dbg	Tue Aug 15 13:47:48 2006
+++ ./drivers/net/veth.c	Tue Aug 15 14:08:04 2006
@@ -251,6 +251,116 @@ void veth_entry_del_all(void)
 /* --- *
  *
+ * Temporary interface to create veth devices
+ *
+ * --- */
+
+#ifdef CONFIG_PROC_FS
+
+static int veth_debug_open(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static char *parse_addr(char *s, char *addr)
+{
+	int i, v;
+
+	for (i = 0; i < ETH_ALEN; i++) {
+		if (!isxdigit(*s))
+			return NULL;
+		*addr = 0;
+		v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+		s++;
+		if (isxdigit(*s)) {
+			*addr += v * 16;
+			v = isdigit(*s) ? *s - '0' : toupper(*s) - 'A' + 10;
+			s++;
+		}
+		*addr++ += v;
+		if (i < ETH_ALEN - 1 && ispunct(*s))
+			s++;
+	}
+	return s;
+}
+
+extern int net_ns_start(void);
+static ssize_t veth_debug_write(struct file *file, const char __user *user_buf,
+		size_t size, loff_t *ppos)
+{
+	char buf[128], *s, *parent_name, *child_name;
+	char parent_addr[ETH_ALEN], child_addr[ETH_ALEN];
+	struct net_namespace *parent_ns, *child_ns;
+	int err;
+
+	s = buf;
+	err = -EINVAL;
+	if (size >= sizeof(buf))
+		goto out;
+	err = -EFAULT;
+	if (copy_from_user(buf, user_buf, size))
+		goto out;
+	buf[size] = 0;
+
+	err = -EBADRQC;
+	if (!strncmp(buf, "add ", 4)) {
+		parent_name = buf + 4;
+		if ((s = strchr(parent_name, ' ')) == NULL)
+			goto out;
+		*s = 0;
+		if ((s = parse_addr(s + 1, parent_addr)) == NULL)
+			goto out;
+		if (!*s)
+			goto out;
+		child_name = s + 1;
+		if ((s = strchr(child_name, ' ')) == NULL)
+			goto out;
+		*s = 0;
+		if ((s = parse_addr(s + 1, child_addr)) == NULL)
+			goto out;
+
+		parent_ns = get_net_ns(current_net_ns);
+		err = net_ns_start();
+		if (err)
+			goto out;
+		/* return to parent context */
+		push_net_ns(parent_ns, child_ns);
+		err = veth_entry_add(parent_name, parent_addr,
+				child_name, child_addr, child_ns);
+		pop_net_ns(child_ns);
+		put_net_ns(parent_ns);
+		if (!err)
+			err = size;
+	}
+out:
+	return err;
+}
+
+static struct file_operations veth_debug_ops = {
+	.open	= veth_debug_open,
+	.write	= veth_debug_write,
+};
+
+static int veth_debug_create(void)
+{
+	proc_net_fops_create("veth_ctl", 0200, &veth_debug_ops);
+	return 0;
+}
+
+static void veth_debug_remove(void)
+{
+	proc_net_remove("veth_ctl");
+}
+
+#else
+
+static int veth_debug_create(void) { return -1; }
+static void veth_debug_remove(void) { }
+
+#endif
+
+/* --- *
+ *
  * Information in proc
  *
  * --- */
@@ -310,12 +420,15 @@ static inline void veth_proc_remove(void

 int __init veth_init(void)
 {
+	if (veth_debug_create())
+		return -EINVAL;
 	veth_proc_create();
 	return 0;
 }

 void __exit veth_exit(void)
 {
+	veth_debug_remove();
 	veth_proc_remove();
 	veth_entry_del_all();
 }
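For readers who want to try the address format accepted by the `add` command outside the kernel, here is a minimal userspace sketch of the same parsing idea: each byte is one or two hex digits, optionally followed by one punctuation separator. It is only an illustration under assumptions, not the patch's code: the `hex_val` helper, the `const` qualifiers and the `unsigned char` buffer are choices made for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define ETH_ALEN 6	/* length of an Ethernet hardware address */

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hex_val(char c)
{
	unsigned char u = (unsigned char)c;

	return isdigit(u) ? u - '0' : toupper(u) - 'A' + 10;
}

/*
 * Parse text such as "0:1:2:3:4:1" into ETH_ALEN bytes.
 * Returns a pointer just past the parsed address, or NULL if a byte
 * does not start with a hex digit.
 */
static const char *parse_addr(const char *s, unsigned char *addr)
{
	int i, v;

	assert(s != NULL && addr != NULL);
	for (i = 0; i < ETH_ALEN; i++) {
		if (!isxdigit((unsigned char)*s))
			return NULL;
		v = hex_val(*s++);
		/* optional second digit: the first one becomes the high nibble */
		if (isxdigit((unsigned char)*s))
			v = v * 16 + hex_val(*s++);
		addr[i] = (unsigned char)v;
		/* skip one separator between bytes, but not after the last byte */
		if (i < ETH_ALEN - 1 && ispunct((unsigned char)*s))
			s++;
	}
	return s;
}
```

Called on `"0:1:2:3:4:1 eth0"`, the function fills the six bytes and returns a pointer to the space before `eth0`, which is how the write handler above finds the next field.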
Re: [patch 1/7] net_device list cleanup: core
On Sat, Jul 08, 2006 at 01:48:13AM +0900, YOSHIFUJI Hideaki wrote:
> In article [EMAIL PROTECTED] (at Fri, 7 Jul 2006 11:54:25 +0400), Andrey Savochkin [EMAIL PROTECTED] says:
>
> > On Fri, Jul 07, 2006 at 01:34:34PM +0900, YOSHIFUJI Hideaki wrote:
> > > In article [EMAIL PROTECTED] (at Mon, 3 Jul 2006 12:18:51 +0400), Andrey Savochkin [EMAIL PROTECTED] says:
> > >
> > > > @@ -3271,22 +3277,22 @@ int unregister_netdevice(struct net_devi
> > > >  	/* And unlink it from device chain. */
> > > >  	for (dp = &dev_base; (d = *dp) != NULL; dp = &d->next) {
> > >
> > > Why not for_each_netdev?
> >
> > it's a different list
>
> Sorry, I still do not understand.
> In other words, why will we still have dev->next?
> After introducing net_device->dev_list, we do not need dev->next anymore, do we?

dev->next is removed in the last patch, so that the patch series remains
bisectable.
Re: [patch 1/7] net_device list cleanup: core
On Fri, Jul 07, 2006 at 01:34:34PM +0900, YOSHIFUJI Hideaki wrote:
> In article [EMAIL PROTECTED] (at Mon, 3 Jul 2006 12:18:51 +0400), Andrey Savochkin [EMAIL PROTECTED] says:
>
> > @@ -3271,22 +3277,22 @@ int unregister_netdevice(struct net_devi
> >  	/* And unlink it from device chain. */
> >  	for (dp = &dev_base; (d = *dp) != NULL; dp = &d->next) {
>
> Why not for_each_netdev?

It's a different list.
Re: [patch 1/7] net_device list cleanup: core
On Tue, Jul 04, 2006 at 08:35:37PM +0400, A.N.Kuznetsov wrote:
> > Different modules want different kinds of lookup.
> > So, I'm thinking about something like ilookup5.
> >
> > The next question: would people agree to review a patch doing this for
> > net_devices? :)
>
> One not original suggestion, which did not sound nevertheless:
> to implement netdev_iterate_list() or whatever, update only core and
> a few of devices and deprecate dev_base_head with __deprecated_for_modules
> adding it to Documentation/feature-removal-schedule.txt

I like this idea.

	Andrey
Re: [patch 1/7] net_device list cleanup: core
Christoph,

On Mon, Jul 03, 2006 at 06:46:50PM +0100, Christoph Hellwig wrote:
> On Mon, Jul 03, 2006 at 12:18:51PM +0400, Andrey Savochkin wrote:
> > Cleanup of net_device list use in net_dev core and IP.
> > The cleanup consists of
> >  - converting the list to list_head, to make the list double-linked
> >    (thus making remove operation O(1)), and list walks more readable;
> >  - introducing of for_each_netdev wrapper over list_for_each.
>
> When you change all this please make sure dev_base_head is never directly
> accessed anymore, not even through macros and dev_base_head is not exported
> anymore.  That's the only way to keep drivers messing with it.  Yes, it's
> a little more work as you need to audit all drivers to see what they
> are doing and find suitable abstractions but it's a must have that should
> have been done a lot earlier.

Hiding dev_base_head can be done by converting first_netdev/next_netdev into
functions and implementing the for_each_netdev loop through them.
Or are you talking about abstractions like for_each_netdev/find_netdev
functions with callbacks?

Do you think that hiding the list internals is worth the additional
complexity and the substantial increase in patch size?

	Andrey
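The first option mentioned above (first_netdev/next_netdev as functions, with for_each_netdev built on top of them) can be sketched in userspace roughly as follows. This is a hedged illustration, not the posted patch: `struct list_node`, `struct netdev`, `netdev_link` and `netdev_unlink` are stand-ins invented for the sketch, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal doubly linked list node, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical stand-in for struct net_device. */
struct netdev {
	int ifindex;
	struct list_node dev_list;
};

/* The head is file-local, so callers can only go through the functions. */
static struct list_node dev_base_head = { &dev_base_head, &dev_base_head };

static struct netdev *node_to_netdev(struct list_node *n)
{
	return (struct netdev *)((char *)n - offsetof(struct netdev, dev_list));
}

/* Append a device at the tail of the list. */
static void netdev_link(struct netdev *dev)
{
	struct list_node *n;

	assert(dev != NULL);
	n = &dev->dev_list;
	n->prev = dev_base_head.prev;
	n->next = &dev_base_head;
	dev_base_head.prev->next = n;
	dev_base_head.prev = n;
}

/* O(1) removal: no walk from the head to find the predecessor. */
static void netdev_unlink(struct netdev *dev)
{
	struct list_node *n = &dev->dev_list;

	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = n;
}

static struct netdev *first_netdev(void)
{
	return dev_base_head.next == &dev_base_head ?
		NULL : node_to_netdev(dev_base_head.next);
}

static struct netdev *next_netdev(struct netdev *dev)
{
	struct list_node *n = dev->dev_list.next;

	return n == &dev_base_head ? NULL : node_to_netdev(n);
}

/* The loop macro no longer mentions the list head at all. */
#define for_each_netdev(p) \
	for ((p) = first_netdev(); (p) != NULL; (p) = next_netdev(p))
```

With this shape the head never has to be exported; the cost is a function call per step instead of an inlined pointer chase.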
Re: [patch 1/7] net_device list cleanup: core
On Tue, Jul 04, 2006 at 10:10:03AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 04, 2006 at 11:24:05AM +0400, Andrey Savochkin wrote:
> > > Yes, it's
> > > a little more work as you need to audit all drivers to see what they
> > > are doing and find suitable abstractions but it's a must have that should
> > > have been done a lot earlier.
> >
> > Hiding dev_base_head can be done by converting first_netdev/next_netdev into
> > functions and implementing for_each_netdev loop through them.
> > Or are you talking about abstractions like functions
> > for_each_netdev/find_netdev with callbacks?
>
> an for_each_netdev with a callback makes sense and gives a cleaner
> abstraction, yes.  I don't think you should need a callback for the
> lookup structure.

Different modules want different kinds of lookup.
So, I'm thinking about something like ilookup5.

> > Do you think that hiding the list internals is worth the additional
> > complexity and substantial increase of the patch size?
>
> Yes, absolutely.  We've converted scsi hosts and devices from a model
> where drivers could directly access the list to strict iterators in
> the 2.5 series.  It's quite a lot of work as you have to understand
> what the drivers actually do (and to at least 50% they were doing
> something really stupid) and convert them to the right abstractions.

The next question: would people agree to review a patch doing this for
net_devices? :)

	Andrey
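To make the two abstractions under discussion concrete, here is a small userspace sketch of a callback-based iterator plus an ilookup5-style lookup that takes a caller-supplied test function. Everything in it is hypothetical (`struct netdev`, `netdev_add`, `netdev_iterate`, `netdev_find` and the example callbacks are names made up for this sketch), and it ignores the locking and reference counting a kernel version would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device. */
struct netdev {
	const char *name;
	int type;
	struct netdev *next;
};

/* File-local list head: callers cannot reach it directly. */
static struct netdev *dev_list_head;

static void netdev_add(struct netdev *dev)
{
	assert(dev != NULL);
	dev->next = dev_list_head;
	dev_list_head = dev;
}

/*
 * Call fn(dev, arg) for every device.  A non-zero return value from fn
 * stops the walk and is passed back to the caller.
 */
static int netdev_iterate(int (*fn)(struct netdev *dev, void *arg), void *arg)
{
	struct netdev *dev;
	int ret;

	for (dev = dev_list_head; dev != NULL; dev = dev->next) {
		ret = fn(dev, arg);
		if (ret)
			return ret;
	}
	return 0;
}

/*
 * Lookup with a caller-supplied predicate, so that each module can
 * express its own kind of match.  Returns the first match or NULL.
 */
static struct netdev *netdev_find(int (*test)(struct netdev *dev, void *arg),
				  void *arg)
{
	struct netdev *dev;

	for (dev = dev_list_head; dev != NULL; dev = dev->next)
		if (test(dev, arg))
			return dev;
	return NULL;
}

/* Example callback for netdev_iterate(): count devices of one type. */
struct count_arg {
	int type;
	int count;
};

static int count_by_type(struct netdev *dev, void *arg)
{
	struct count_arg *c = arg;

	if (dev->type == c->type)
		c->count++;
	return 0;	/* keep walking */
}

/* Example predicate for netdev_find(): match by name. */
static int name_is(struct netdev *dev, void *arg)
{
	return strcmp(dev->name, arg) == 0;
}
```

The predicate form covers the "different modules want different kinds of lookup" case without exposing the list, at the price of one indirect call per device.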
[patch] bridge: br_dump_ifinfo index fix
Fix br_dump_ifinfo so that it handles a non-zero start index: the loop index
was never incremented when the dump was entered with a non-zero start.
Spotted by Kirill Korotaev.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
Cc: Kirill Korotaev [EMAIL PROTECTED]
---
Against 2.6.17-mm6

--- ./net/bridge/br_netlink.c.vebridge-dump	Wed Jun 21 18:53:18 2006
+++ ./net/bridge/br_netlink.c	Mon Jul  3 14:31:03 2006
@@ -117,12 +117,13 @@ static int br_dump_ifinfo(struct sk_buff
 			continue;

 		if (idx < s_idx)
-			continue;
+			goto cont;

 		err = br_fill_ifinfo(skb, p, NETLINK_CB(cb->skb).pid,
 				     cb->nlh->nlmsg_seq, RTM_NEWLINK, NLM_F_MULTI);
 		if (err <= 0)
 			break;
+cont:
 		++idx;
 	}
 	read_unlock(&dev_base_lock);
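The control-flow problem is easy to reproduce in isolation. The following userspace sketch is an illustration only: `dump_from` and its `buggy` switch are invented for the example, and a counter stands in for the call to br_fill_ifinfo(). It runs the same loop shape with the old `continue` and with the new `goto cont`.

```c
#include <assert.h>

/*
 * Walk nentries entries and "emit" those whose index is >= s_idx.
 * Returns the number of emitted entries.  With buggy != 0 the loop
 * uses the pre-patch control flow.
 */
static int dump_from(int nentries, int s_idx, int buggy)
{
	int i, idx = 0, emitted = 0;

	assert(nentries >= 0 && s_idx >= 0);
	for (i = 0; i < nentries; i++) {
		if (idx < s_idx) {
			if (buggy)
				continue;	/* jumps over ++idx, so idx never grows */
			goto cont;		/* patched: the entry is still counted */
		}
		emitted++;			/* stands in for br_fill_ifinfo() */
cont:
		++idx;
	}
	return emitted;
}
```

With five entries and a start index of 2, the old flow emits nothing because idx stays at 0, while the patched flow emits the remaining three entries.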
[patch 1/7] net_device list cleanup: core
Cleanup of net_device list use in net_dev core and IP.
The cleanup consists of
 - converting the list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
Signed-off-by: Kirill Korotaev [EMAIL PROTECTED]
---
 include/linux/netdevice.h |   29 ++-
 net/core/dev.c            |   48 +-
 net/ipv4/devinet.c        |    6 ++---
 net/ipv6/addrconf.c       |    8 +++
 net/ipv6/anycast.c        |   10 +
 5 files changed, 68 insertions(+), 33 deletions(-)

--- ./include/linux/netdevice.h.vedevbase-core	Mon Jul  3 15:14:15 2006
+++ ./include/linux/netdevice.h	Mon Jul  3 16:09:11 2006
@@ -290,7 +290,8 @@ struct net_device
 	unsigned long		state;

 	struct net_device	*next;
-
+	struct list_head	dev_list;
+
 	/* The device initialization function. Called only once. */
 	int			(*init)(struct net_device *dev);

@@ -558,8 +559,34 @@ struct packet_type {

 extern struct net_device	loopback_dev;		/* The loopback */
 extern struct net_device	*dev_base;		/* All devices */
+extern struct list_head		dev_base_head;		/* All devices */
 extern rwlock_t			dev_base_lock;		/* Device list lock */

+#define for_each_netdev(p)	list_for_each_entry(p, &dev_base_head, dev_list)
+
+/*
+ * When possible, it is preferrable to use for_each_netdev() loop
+ * defined above, rather than first_netdev()/next_netdev() macros.
+ * for_each_netdev() loop makes the intentions clearer, and gives more
+ * flexibility in device list implementation.
+ * While next_netdev() is unavoidable in seq_proc functions,
+ * first_netdev() should be needed quite rarely.
+ */
+#define first_netdev()		({ \
+					list_empty(&dev_base_head) ? NULL : \
+						list_entry(dev_base_head.next, \
+							struct net_device, \
+							dev_list); \
+				 })
+#define next_netdev(dev)	({ \
+					struct list_head *__next; \
+					__next = (dev)->dev_list.next; \
+					__next == &dev_base_head ? NULL : \
+						list_entry(__next, \
+							struct net_device, \
+							dev_list); \
+				 })
+
 extern int			netdev_boot_setup_check(struct net_device *dev);
 extern unsigned long		netdev_boot_base(const char *prefix, int unit);
 extern struct net_device	*dev_getbyhwaddr(unsigned short type, char *hwaddr);
--- ./net/core/dev.c.vedevbase-core	Mon Jul  3 15:14:19 2006
+++ ./net/core/dev.c	Mon Jul  3 16:09:11 2006
@@ -181,6 +181,9 @@ DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base);
 EXPORT_SYMBOL(dev_base_lock);

+LIST_HEAD(dev_base_head);
+EXPORT_SYMBOL(dev_base_head);
+
 #define NETDEV_HASHBITS	8
 static struct hlist_head dev_name_head[1<<NETDEV_HASHBITS];
 static struct hlist_head dev_index_head[1<<NETDEV_HASHBITS];
@@ -575,11 +578,11 @@ struct net_device *dev_getbyhwaddr(unsig

 	ASSERT_RTNL();

-	for (dev = dev_base; dev; dev = dev->next)
+	for_each_netdev(dev)
 		if (dev->type == type &&
 		    !memcmp(dev->dev_addr, ha, dev->addr_len))
-			break;
-	return dev;
+			return dev;
+	return NULL;
 }

 EXPORT_SYMBOL(dev_getbyhwaddr);
@@ -589,14 +592,15 @@ struct net_device *dev_getfirstbyhwtype(
 	struct net_device *dev;

 	rtnl_lock();
-	for (dev = dev_base; dev; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (dev->type == type) {
 			dev_hold(dev);
-			break;
+			rtnl_unlock();
+			return dev;
 		}
 	}
 	rtnl_unlock();
-	return dev;
+	return NULL;
 }

 EXPORT_SYMBOL(dev_getfirstbyhwtype);
@@ -617,14 +621,15 @@ struct net_device * dev_get_by_flags(uns
 	struct net_device *dev;

 	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev != NULL; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (((dev->flags ^ if_flags) & mask) == 0) {
 			dev_hold(dev);
-			break;
+			read_unlock(&dev_base_lock);
+			return dev;
 		}
 	}
 	read_unlock(&dev_base_lock
[patch 5/7] net_device list cleanup: arch-dependent code and block devices
Cleanup of net_device list use in arch-dependent code and block devices.
The cleanup consists of
 - converting the list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 arch/s390/appldata/appldata_net_sum.c |    2 +-
 arch/sparc64/solaris/ioctl.c          |    2 +-
 drivers/block/aoe/aoecmd.c            |    8 ++--
 drivers/parisc/led.c                  |    2 +-
 4 files changed, 9 insertions(+), 5 deletions(-)

--- ./arch/s390/appldata/appldata_net_sum.c.vedevbase-misc	Mon Jul  3 15:13:15 2006
+++ ./arch/s390/appldata/appldata_net_sum.c	Mon Jul  3 16:16:05 2006
@@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo
 	tx_dropped = 0;
 	collisions = 0;
 	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev != NULL; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (dev->get_stats == NULL) {
 			continue;
 		}
--- ./arch/sparc64/solaris/ioctl.c.vedevbase-misc	Mon Mar 20 08:53:29 2006
+++ ./arch/sparc64/solaris/ioctl.c	Mon Jul  3 16:16:05 2006
@@ -686,7 +686,7 @@ static inline int solaris_i(unsigned int
 			int i = 0;

 			read_lock_bh(&dev_base_lock);
-			for (d = dev_base; d; d = d->next) i++;
+			for_each_netdev(d) i++;
 			read_unlock_bh(&dev_base_lock);

 			if (put_user (i, (int __user *)A(arg)))
--- ./drivers/block/aoe/aoecmd.c.vedevbase-misc	Mon Jul  3 15:09:57 2006
+++ ./drivers/block/aoe/aoecmd.c	Mon Jul  3 16:16:05 2006
@@ -204,14 +204,17 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
 	sl = sl_tail = NULL;

 	read_lock(&dev_base_lock);
-	for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp->next) {
+	for_each_netdev(ifp) {
 		dev_hold(ifp);
-		if (!is_aoe_netif(ifp))
+		if (!is_aoe_netif(ifp)) {
+			dev_put(ifp);
 			continue;
+		}

 		skb = new_skb(ifp, sizeof *h + sizeof *ch);
 		if (skb == NULL) {
 			printk(KERN_INFO "aoe: aoecmd_cfg: skb alloc failure\n");
+			dev_put(ifp);
 			continue;
 		}
 		if (sl_tail == NULL)
@@ -229,6 +232,7 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne

 		skb->next = sl;
 		sl = skb;
+		dev_put(ifp);
 	}
 	read_unlock(&dev_base_lock);

--- ./drivers/parisc/led.c.vedevbase-misc	Mon Jul  3 15:13:46 2006
+++ ./drivers/parisc/led.c	Mon Jul  3 16:16:05 2006
@@ -367,7 +367,7 @@ static __inline__ int led_get_net_activi
 	 * for reading should be OK */
 	read_lock(&dev_base_lock);
 	rcu_read_lock();
-	for (dev = dev_base; dev; dev = dev->next) {
+	for_each_netdev(dev) {
 		struct net_device_stats *stats;
 		struct in_device *in_dev = __in_dev_get_rcu(dev);
 		if (!in_dev || !in_dev->ifa_list)
[patch 2/7] net_device list cleanup: proc seq_file output
Cleanup of net_device list use in seq_file output routines in core networking
files.
The implementation of /proc/net/dev was copied from dev_mcast, since the latter
did the same in a more compact and cleaner way.
The cleanup consists of
 - converting the list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
Note: functions covered by this patch are good candidates for further
restructuring by introducing library routines for seq_files that show some
information for each device.

 core/dev.c       |   23 +++
 core/dev_mcast.c |    4 ++--
 ipv4/igmp.c      |   25 +++--
 ipv6/anycast.c   |   12 +++-
 ipv6/mcast.c     |   25 +++--
 5 files changed, 50 insertions(+), 39 deletions(-)

--- ./net/core/dev.c.vedevbase-proc	Mon Jul  3 16:09:54 2006
+++ ./net/core/dev.c	Mon Jul  3 16:09:54 2006
@@ -2072,26 +2072,25 @@ static int dev_ifconf(char __user *arg)
  *	This is invoked by the /proc filesystem handler to display a device
  *	in detail.
  */
-static __inline__ struct net_device *dev_get_idx(loff_t pos)
-{
-	struct net_device *dev;
-	loff_t i;
-
-	for (i = 0, dev = dev_base; dev && i < pos; ++i, dev = dev->next);
-
-	return i == pos ? dev : NULL;
-}
-
 void *dev_seq_start(struct seq_file *seq, loff_t *pos)
 {
+	struct net_device *dev;
+	loff_t off = 1;
 	read_lock(&dev_base_lock);
-	return *pos ? dev_get_idx(*pos - 1) : SEQ_START_TOKEN;
+	if (!*pos)
+		return SEQ_START_TOKEN;
+	for_each_netdev(dev) {
+		if (off++ == *pos)
+			return dev;
+	}
+	return NULL;
 }

 void *dev_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 {
+	struct net_device *dev = v;
 	++*pos;
-	return v == SEQ_START_TOKEN ? dev_base : ((struct net_device *)v)->next;
+	return v == SEQ_START_TOKEN ? first_netdev() : next_netdev(dev);
 }

 void dev_seq_stop(struct seq_file *seq, void *v)
--- ./net/core/dev_mcast.c.vedevbase-proc	Mon Jul  3 15:14:19 2006
+++ ./net/core/dev_mcast.c	Mon Jul  3 16:09:54 2006
@@ -225,7 +225,7 @@ static void *dev_mc_seq_start(struct seq
 	loff_t off = 0;

 	read_lock(&dev_base_lock);
-	for (dev = dev_base; dev; dev = dev->next) {
+	for_each_netdev(dev) {
 		if (off++ == *pos)
 			return dev;
 	}
@@ -236,7 +236,7 @@ static void *dev_mc_seq_next(struct seq_
 {
 	struct net_device *dev = v;
 	++*pos;
-	return dev->next;
+	return next_netdev(dev);
 }

 static void dev_mc_seq_stop(struct seq_file *seq, void *v)
--- ./net/ipv4/igmp.c.vedevbase-proc	Mon Jul  3 15:14:20 2006
+++ ./net/ipv4/igmp.c	Mon Jul  3 16:09:54 2006
@@ -2254,19 +2254,21 @@ struct igmp_mc_iter_state {

 static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 {
+	struct net_device *dev;
 	struct ip_mc_list *im = NULL;
 	struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);

-	for (state->dev = dev_base, state->in_dev = NULL;
-	     state->dev;
-	     state->dev = state->dev->next) {
+	state->dev = NULL;
+	state->in_dev = NULL;
+	for_each_netdev(dev) {
 		struct in_device *in_dev;
-		in_dev = in_dev_get(state->dev);
+		in_dev = in_dev_get(dev);
 		if (!in_dev)
 			continue;
 		read_lock(&in_dev->mc_list_lock);
 		im = in_dev->mc_list;
 		if (im) {
+			state->dev = dev;
 			state->in_dev = in_dev;
 			break;
 		}
@@ -2285,7 +2287,7 @@ static struct ip_mc_list *igmp_mc_get_ne
 			read_unlock(&state->in_dev->mc_list_lock);
 			in_dev_put(state->in_dev);
 		}
-		state->dev = state->dev->next;
+		state->dev = next_netdev(state->dev);
 		if (!state->dev) {
 			state->in_dev = NULL;
 			break;
@@ -2416,15 +2418,17 @@ struct igmp_mcf_iter_state {

 static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 {
+	struct net_device *dev;
 	struct ip_sf_list *psf = NULL;
 	struct ip_mc_list *im = NULL;
 	struct igmp_mcf_iter_state *state = igmp_mcf_seq_private(seq);

-	for (state->dev = dev_base, state->idev = NULL, state->im = NULL;
-	     state->dev;
-	     state->dev = state->dev->next) {
+	state->dev = NULL;
+	state->im = NULL;
+	state->idev = NULL;
+	for_each_netdev(dev) {
 		struct in_device *idev;
-		idev = in_dev_get(state->dev);
+		idev = in_dev_get(dev);
 		if (unlikely
[patch 7/7] net_device list cleanup: debugging
Optional code to catch cases where the loop cursor is used after a
for_each_netdev loop. Such use is often a sign of a bug, since after the loop
the cursor isn't guaranteed to point to a device.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
If anyone wants to keep this under some debug config option, let me know
which one.

 netdevice.h |    8 +++-
 1 files changed, 7 insertions(+), 1 deletion(-)

--- ./include/linux/netdevice.h.vedevbase-dbg	Mon Jul  3 16:16:51 2006
+++ ./include/linux/netdevice.h	Mon Jul  3 16:16:51 2006
@@ -560,7 +560,13 @@ extern struct net_device		loopback_dev;
 extern struct list_head		dev_base_head;		/* All devices */
 extern rwlock_t			dev_base_lock;		/* Device list lock */

-#define for_each_netdev(p)	list_for_each_entry(p, &dev_base_head, dev_list)
+#define for_each_netdev(pos)						\
+for (pos = list_entry(dev_base_head.next, typeof(*pos), dev_list);	\
+		prefetch(pos->dev_list.next),				\
+		&pos->dev_list != &dev_base_head ? :			\
+		({ void *__check_dev_use_after_for_each_netdev;		\
+		   pos = __check_dev_use_after_for_each_netdev; 0; });	\
+		pos = list_entry(pos->dev_list.next, typeof(*pos), dev_list))

 /*
  * When possible, it is preferrable to use for_each_netdev() loop
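The hazard this patch targets can be shown with a portable userspace variant. Note that this sketch swaps in a different technique: instead of assigning from an uninitialized variable to provoke a compiler warning, it resets the cursor to NULL when the loop runs to completion, so that later misuse fails predictably. All names here (`struct list_node`, `struct netdev`, `to_netdev`, `netdev_register`, `find_by_index`) are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical stand-in for struct net_device. */
struct netdev {
	int ifindex;
	struct list_node dev_list;
};

static struct list_node dev_base_head = { &dev_base_head, &dev_base_head };

#define to_netdev(n) \
	((struct netdev *)((char *)(n) - offsetof(struct netdev, dev_list)))

/*
 * When the walk reaches the head again, the cursor would otherwise be
 * left pointing at memory computed from the list head, which is not a
 * device.  Here it is set to NULL on normal termination instead.
 */
#define for_each_netdev(pos) \
	for ((pos) = to_netdev(dev_base_head.next); \
	     &(pos)->dev_list != &dev_base_head ? 1 : ((pos) = NULL, 0); \
	     (pos) = to_netdev((pos)->dev_list.next))

/* Append a device at the tail of the list. */
static void netdev_register(struct netdev *dev)
{
	struct list_node *n;

	assert(dev != NULL);
	n = &dev->dev_list;
	n->prev = dev_base_head.prev;
	n->next = &dev_base_head;
	dev_base_head.prev->next = n;
	dev_base_head.prev = n;
}

/* Safe pattern: return from inside the loop, NULL when nothing matched. */
static struct netdev *find_by_index(int ifindex)
{
	struct netdev *dev;

	for_each_netdev(dev)
		if (dev->ifindex == ifindex)
			return dev;
	return NULL;
}
```

A `break` out of the loop leaves the cursor on the matching device, while running off the end leaves it NULL; the conversions in this series follow the same rule by returning from inside the loop rather than testing the cursor afterwards.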
[patch 4/7] net_device list cleanup: drivers and non-IP protocols
Cleanup of net_device list use in network device drivers and protocols other
than IP.
The cleanup consists of
 - converting the list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
Requires "bridge: br_dump_ifinfo index fix"

 drivers/net/wireless/strip.c |    4 +---
 net/8021q/vlan.c             |    4 ++--
 net/8021q/vlanproc.c         |   10 +-
 net/bridge/br_if.c           |    4 ++--
 net/bridge/br_ioctl.c        |    4 +++-
 net/bridge/br_netlink.c      |    3 ++-
 net/decnet/af_decnet.c       |   11 +++
 net/decnet/dn_dev.c          |   17 ++---
 net/decnet/dn_fib.c          |    2 +-
 net/decnet/dn_route.c        |   13 +++--
 net/llc/llc_core.c           |    7 +--
 net/netrom/nr_route.c        |    5 +++--
 net/rose/rose_route.c        |    8 +---
 net/sctp/protocol.c          |    2 +-
 net/tipc/eth_media.c         |   11 +++
 15 files changed, 61 insertions(+), 44 deletions(-)

--- ./drivers/net/wireless/strip.c.vedevbase-onet	Mon Jul  3 15:13:46 2006
+++ ./drivers/net/wireless/strip.c	Mon Jul  3 16:12:11 2006
@@ -1969,8 +1969,7 @@ static struct net_device *get_strip_dev(
 		    sizeof(zero_address))) {
 		struct net_device *dev;
 		read_lock_bh(&dev_base_lock);
-		dev = dev_base;
-		while (dev) {
+		for_each_netdev(dev) {
 			if (dev->type == strip_info->dev->type &&
 			    !memcmp(dev->dev_addr,
 				    &strip_info->true_dev_addr,
@@ -1981,7 +1980,6 @@ static struct net_device *get_strip_dev(
 				read_unlock_bh(&dev_base_lock);
 				return (dev);
 			}
-			dev = dev->next;
 		}
 		read_unlock_bh(&dev_base_lock);
 	}
--- ./net/8021q/vlan.c.vedevbase-onet	Mon Jul  3 15:14:17 2006
+++ ./net/8021q/vlan.c	Mon Jul  3 16:12:11 2006
@@ -121,8 +121,8 @@ static void __exit vlan_cleanup_devices(
 	struct net_device *dev, *nxt;

 	rtnl_lock();
-	for (dev = dev_base; dev; dev = nxt) {
-		nxt = dev->next;
+	for (dev = first_netdev(); dev; dev = nxt) {
+		nxt = next_netdev(dev);
 		if (dev->priv_flags & IFF_802_1Q_VLAN) {
 			unregister_vlan_dev(VLAN_DEV_INFO(dev)->real_dev,
 					    VLAN_DEV_INFO(dev)->vlan_id);
--- ./net/8021q/vlanproc.c.vedevbase-onet	Mon Jul  3 15:14:17 2006
+++ ./net/8021q/vlanproc.c	Mon Jul  3 16:12:11 2006
@@ -241,7 +241,7 @@ int vlan_proc_rem_dev(struct net_device
 static struct net_device *vlan_skip(struct net_device *dev)
 {
 	while (dev && !(dev->priv_flags & IFF_802_1Q_VLAN))
-		dev = dev->next;
+		dev = next_netdev(dev);

 	return dev;
 }
@@ -257,8 +257,8 @@ static void *vlan_seq_start(struct seq_f
 	if (*pos == 0)
 		return SEQ_START_TOKEN;

-	for (dev = vlan_skip(dev_base); dev && i < *pos;
-	     dev = vlan_skip(dev->next), ++i);
+	for (dev = vlan_skip(first_netdev()); dev && i < *pos;
+	     dev = vlan_skip(next_netdev(dev)), ++i);

 	return (i == *pos) ? dev : NULL;
 }
@@ -268,8 +268,8 @@ static void *vlan_seq_next(struct seq_fi
 	++*pos;

 	return vlan_skip((v == SEQ_START_TOKEN)
-			    ? dev_base
-			    : ((struct net_device *)v)->next);
+			    ? first_netdev()
+			    : next_netdev((struct net_device *)v));
 }

 static void vlan_seq_stop(struct seq_file *seq, void *v)
--- ./net/bridge/br_if.c.vedevbase-onet	Mon Jul  3 15:14:19 2006
+++ ./net/bridge/br_if.c	Mon Jul  3 16:12:11 2006
@@ -474,8 +474,8 @@ void __exit br_cleanup_bridges(void)
 	struct net_device *dev, *nxt;

 	rtnl_lock();
-	for (dev = dev_base; dev; dev = nxt) {
-		nxt = dev->next;
+	for (dev = first_netdev(); dev; dev = nxt) {
+		nxt = next_netdev(dev);
 		if (dev->priv_flags & IFF_EBRIDGE)
 			del_br(dev->priv);
 	}
--- ./net/bridge/br_ioctl.c.vedevbase-onet	Mon Mar 20 08:53:29 2006
+++ ./net/bridge/br_ioctl.c	Mon Jul  3 16:12:11 2006
@@ -27,7 +27,9 @@ static int get_bridge_ifindices(int *ind
 	struct net_device *dev;
 	int i = 0;

-	for (dev = dev_base; dev && i < num; dev = dev->next) {
+	for_each_netdev(dev) {
+		if (i >= num)
+			break;
 		if (dev->priv_flags & IFF_EBRIDGE)
 			indices[i++] = dev->ifindex;
 	}
--- ./net/bridge/br_netlink.c.vedevbase-onet	Mon Jul  3 16:12:11 2006
[patch 3/7] net_device list cleanup: netlink_dump
Cleanup of net_device list use in netlink_dump routines in core networking
files.
The cleanup consists of
 - converting the list to list_head, to make the list doubly linked
   (thus making the remove operation O(1)) and list walks more readable;
 - introducing a for_each_netdev wrapper over list_for_each.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 core/rtnetlink.c |   18 ++
 ipv4/devinet.c   |   14 --
 ipv6/addrconf.c  |   20 +---
 sched/sch_api.c  |    8 ++--
 4 files changed, 37 insertions(+), 23 deletions(-)

--- ./net/core/rtnetlink.c.vedevbase-dump	Mon Jul  3 15:14:19 2006
+++ ./net/core/rtnetlink.c	Mon Jul  3 16:10:12 2006
@@ -319,14 +319,16 @@ static int rtnetlink_dump_ifinfo(struct
 	struct net_device *dev;

 	read_lock(&dev_base_lock);
-	for (dev=dev_base, idx=0; dev; dev = dev->next, idx++) {
-		if (idx < s_idx)
-			continue;
-		if (rtnetlink_fill_ifinfo(skb, dev, RTM_NEWLINK,
-					  NETLINK_CB(cb->skb).pid,
-					  cb->nlh->nlmsg_seq, 0,
-					  NLM_F_MULTI) <= 0)
-			break;
+	idx = 0;
+	for_each_netdev(dev) {
+		if (idx >= s_idx) {
+			if (rtnetlink_fill_ifinfo(skb, dev, RTM_NEWLINK,
+						  NETLINK_CB(cb->skb).pid,
+						  cb->nlh->nlmsg_seq, 0,
+						  NLM_F_MULTI) <= 0)
+				break;
+		}
+		idx++;
 	}
 	read_unlock(&dev_base_lock);
 	cb->args[0] = idx;
--- ./net/ipv4/devinet.c.vedevbase-dump	Mon Jul  3 16:10:12 2006
+++ ./net/ipv4/devinet.c	Mon Jul  3 16:10:12 2006
@@ -1094,18 +1094,17 @@ static int inet_dump_ifaddr(struct sk_bu
 	struct in_ifaddr *ifa;
 	int s_ip_idx, s_idx = cb->args[0];

+	idx = 0;
 	s_ip_idx = ip_idx = cb->args[1];
 	read_lock(&dev_base_lock);
-	for (dev = dev_base, idx = 0; dev; dev = dev->next, idx++) {
+	for_each_netdev(dev) {
 		if (idx < s_idx)
-			continue;
+			goto cont;
 		if (idx > s_idx)
 			s_ip_idx = 0;
 		rcu_read_lock();
-		if ((in_dev = __in_dev_get_rcu(dev)) == NULL) {
-			rcu_read_unlock();
-			continue;
-		}
+		if ((in_dev = __in_dev_get_rcu(dev)) == NULL)
+			goto cont_unlock;

 		for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
 		     ifa = ifa->ifa_next, ip_idx++) {
@@ -1118,7 +1117,10 @@ static int inet_dump_ifaddr(struct sk_bu
 				goto done;
 			}
 		}
+cont_unlock:
 		rcu_read_unlock();
+cont:
+		idx++;
 	}

 done:
--- ./net/ipv6/addrconf.c.vedevbase-dump	Mon Jul  3 16:10:12 2006
+++ ./net/ipv6/addrconf.c	Mon Jul  3 16:10:12 2006
@@ -3013,18 +3013,19 @@ static int inet6_dump_addr(struct sk_buf
 	struct ifmcaddr6 *ifmca;
 	struct ifacaddr6 *ifaca;

+	idx = 0;
 	s_idx = cb->args[0];
 	s_ip_idx = ip_idx = cb->args[1];
 	read_lock(&dev_base_lock);

-	for (dev = dev_base, idx = 0; dev; dev = dev->next, idx++) {
+	for_each_netdev(dev) {
 		if (idx < s_idx)
-			continue;
+			goto cont;
 		if (idx > s_idx)
 			s_ip_idx = 0;
 		ip_idx = 0;
 		if ((idev = in6_dev_get(dev)) == NULL)
-			continue;
+			goto cont;
 		read_lock_bh(&idev->lock);
 		switch (type) {
 		case UNICAST_ADDR:
@@ -3071,6 +3072,8 @@ static int inet6_dump_addr(struct sk_buf
 		}
 		read_unlock_bh(&idev->lock);
 		in6_dev_put(idev);
+cont:
+		idx++;
 	}
 done:
 	if (err <= 0) {
@@ -3238,17 +3241,20 @@ static int inet6_dump_ifinfo(struct sk_b
 	struct net_device *dev;
 	struct inet6_dev *idev;

+	idx = 0;
 	read_lock(&dev_base_lock);
-	for (dev=dev_base, idx=0; dev; dev = dev->next, idx++) {
+	for_each_netdev(dev) {
 		if (idx < s_idx)
-			continue;
+			goto cont;
 		if ((idev = in6_dev_get(dev)) == NULL)
-			continue;
+			goto cont;
 		err = inet6_fill_ifinfo(skb, idev, NETLINK_CB(cb->skb).pid,
 				cb->nlh->nlmsg_seq, RTM_NEWLINK, NLM_F_MULTI);
 		in6_dev_put(idev);
 		if (err <= 0)
 			break;
+cont:
+		idx
Re: [patch 2/6] [Network namespace] Network device sharing by view
Jamal, On Fri, Jun 30, 2006 at 09:50:52AM -0400, jamal wrote: BTW - I was just looking at openvz, very impressive. To the other folks, Thanks! I am not putting down any of your approaches - just havent had time to study them. Andrey, this is the same thing you guys have been working on for a few years now, you just changed the name, correct? The relations are more complicated than just the change of name, but yes, OpenVZ represents the result of our work for a few years. Ok, since you guys are encouraging me to speak, here goes ;- Hopefully this addresses the other email from Herbert et al. [snip] // create the guest [host-node]# vzctl create 101 --ostemplate fedora-core-5-minimal // create guest101::eth0, seems to only create config to boot up with [host-node]# vzctl create 101 --netdev eth0 // bootup guest101 [host-node]# vzctl start 101 As soon as bootup of guest101 happens, creating guest101::eth0 should activate creation of the host side netdevice. This could be triggered for example by the netlink event message seen on host whic- which is a result of creating guest101::eth0 Which means control sits purely in user space. I'd like to clarify you idea: whether this host-side device is a real device capable of receiving and transmitting packets (by moving them between namespaces), or it's a fake device creating only a view of other namespace's devices? [snip] However, I oppose the idea of automatic mirroring of _all_ devices appearing inside some namespaces (guests) to another namespace (the host). This clearly goes against the concept of namespaces as independent realms, and creates a lot of problems with applications running in the host, hotplug scripts and so on. I was thinking that the host side is the master i.e you can peek at namespaces in the guest from the host. Host(master)-guest relations is a valid and useful scheme. 
However, I'm thinking about broader application of network namespaces, when they can form an arbitrary tree and may not be in host-guest relations. Also note that having the pass through device allows for guests to be connected via standard linux schemes in the host side (bridge, point routes, tc redirect etc); so you dont need a speacial device to hook them together. What do you mean under pass through device? Do you mean using guest1-tun0 as a backdoor to talk to the guest? Then the pragmatic question becomes how to correlate what you see from `ip addr list' to guests. on the host ip addr and the one seen on the guest side are the same. Except one is seen (on the host) on guest0-eth0 and another is seen on eth0 (on guest). Then what to do if the host system has 10.0.0.1 as a private address on eth3, and then interfaces guest1-tun0 and guest2-tun0 both get address 10.0.0.1 when each guest has added 10.0.0.1 to their tun0 device? Yes, that would be a conflict that needs to be resolved. If you look at ip addresses as also belonging to namespaces, then it should work, no? i am assuming a tag at the ifa table level. I'm not sure, it's complicated. You wouldn't want automatic local routes to be added for IP addresses on the host-side interfaces, right? Do you expect these IP addresses to act as local addresses in other places, like answering to arp requests about these IP on all physical devices? But anyway, you'll have conflicts on the application level. Many programs like ntpd, bind, and others fetch the device list using the same ioctls as ifconfig, and make (un)intelligent decisions basing on what they see. Mirroring may have some advantages if I am both host and guest administrator. But if I create a namespace for my friend Joe to play with IPv6 and sit tunnels, why should I face inconveniences because of what he does there? 
Best regards

Andrey
Re: Network namespaces a path to mergable code.
Hi Eric, On Tue, Jun 27, 2006 at 10:20:32PM -0600, Eric W. Biederman wrote: Andrey Savochkin [EMAIL PROTECTED] writes: [snip] My first patchset covers devices but not sockets. The only difference from what you're suggesting is ipv4 routing. For me, it is not less important than devices and sockets. May be even more important, since routing exposes design deficiencies less obvious at socket level. I agree we need to do it. I mostly want a base that allows us to not need to convert the whole network stack at once and still be able to merge code all the way to the stable kernel. The routing code is important for understanding design choices. It isn't important for merging if that makes sense. Ok, fine. Now I'm working on socket code. We still have a question about implicit vs explicit function parameters. This question becomes more important for sockets: if we want to allow to use sockets belonging to namespaces other than the current one, we need to do something about it. One possible option to resolve this question is to show 2 relatively short patches just introducing namespaces for sockets in 2 ways: with explicit function parameters and using implicit current context. Then people can compare them and vote. Do you think it's worth the effort? For everyone looking at routing choices the IPv6 routing table is interesting because it does not use a hash table, and seems quite possibly to be an equally fast structure that scales better. There is something to think about there. Sure [snip] Can you summarize you objections against my way of handling devices, please? And what was the typo you referred to in your letter to Kirill Korotaev? I have no fundamental objects to the content I have seen so far. Please read the first email Kirill responded too. I quoted a couple of sections of code and described the bugs I saw with the patch. I found your comments, thank you! All minor things. 
The typo I was referring to was a section where the original iteration was on an ifp variable and you called it dev without changing the rest of the code in that section. The only big issue was that the patch too big, and should be split into a patchset for better review. One patch for the new functions, and the an additional patch for each driver/subsystem hunk describing why that chunk needed to be changed. I'll split the patch. I'm still curious why many of those chunks can't use existing helper functions, to be cleaned up. What helper functions are you referring to? Best regards Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/6] [Network namespace] Network device sharing by view
Hi Jamal,

On Wed, Jun 28, 2006 at 09:53:23AM -0400, jamal wrote:
> On Wed, 2006-28-06 at 15:36 +0200, Herbert Poetzl wrote:
> > note: personally I'm absolutely not against virtualizing
> > the device names so that each guest can have a separate
> > name space for devices, but there should be a way to
> > 'see' _and_ 'identify' the interfaces from outside
> > (i.e. host or spectator context)
>
> Makes sense for the host side to have naming convention tied
> to the guest. Example as a prefix: guest0-eth0. Would it not
> be interesting to have the host also manage these interfaces
> via standard tools like ip or ifconfig etc? i.e if i admin up
> guest0-eth0, then the user in guest0 will see its eth0 going
> up.

Seeing guestXX-eth0 interfaces by standard tools has certain attractive
sides. But it creates a lot of undesired side effects.

For example, ntpd queries all network devices by the same ioctls as ifconfig,
and creates separate sockets bound to IP addresses of each device, which is
certainly not desired with namespaces.

Or more subtle question: do you want hotplug events to be generated when
guest0-eth0 interface comes up in the root namespace, and standard scripts
to try to set some IP address on this interface?..

In my opinion, the downside of this scheme outweighs possible advantages,
and I'm personally quite happy with running commands with switched namespace,
like
	vzctl exec guest0 ip addr list
	vzctl exec guest0 ip link set eth0 up
and so on.

Best regards

Andrey
Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces
Daniel,

On Wed, Jun 28, 2006 at 03:51:32PM +0200, Daniel Lezcano wrote:
> Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> Structures related to IPv4 routing (FIB and routing cache)
> are made per-namespace.
>
> Hi Andrey,
>
> if the resources are private to the namespace, how will you
> handle NFS mounted before creating the network namespace ?
> Do you take care of that or simply assume you can't access NFS anymore ?

This is a question that brings up another level of interaction between
networking and the rest of kernel code.
The solution that I use now makes the NFS communication part always run in
the root namespace. This is open to discussion, of course, but it's a far more
complicated matter than just device lists or routing :)

Best regards

Andrey
Re: [patch 2/6] [Network namespace] Network device sharing by view
On Wed, Jun 28, 2006 at 12:17:35PM -0400, jamal wrote: On Wed, 2006-28-06 at 18:19 +0400, Andrey Savochkin wrote: Seeing guestXX-eth0 interfaces by standard tools has certain attractive sides. But it creates a lot of undesired side effects. I apologize because i butted into the discussion without perhaps reading the full thread. Your comments are quite welcome For example, ntpd queries all network devices by the same ioctls as ifconfig, and creates separate sockets bound to IP addresses of each device, which is certainly not desired with namespaces. Ok, so the problem is that ntp in this case runs on the host side as yes opposed to the guest? This would explain why Eric is reacting vehemently to the suggestion. :) And I actually do not want to distinguish host and guest sides much. They are namespaces in the first place. Parent namespace may have some capabilities to manipulate its child namespaces, like donate its own device to one of its children. But it comes secondary to having namespace isolation borders. In particular, because most cases of cross-namespace interaction lead to failures of formal security models and inability to migrate namespaces between computers. Or more subtle question: do you want hotplug events to be generated when guest0-eth0 interface comes up in the root namespace, and standard scripts to try to set some IP address on this interface?.. yes, thats what i was thinking. Even go further and actually create guestxx-eth0 on the host (which results in creating eth0 on the guest) and other things. This actually goes in the opposite direction to what I keep in mind. I want to offload as much as possible of network administration work to guests. Delegation of management is one of the motivating factors behind covering not only sockets but devices, routes, and so on by the namespace patches. 
In my opinion, the downside of this scheme overweights possible advantages, and I'm personally quite happy with running commands with switched namespace, like vzctl exec guest0 ip addr list vzctl exec guest0 ip link set eth0 up and so on. Ok, above may be good enough and doesnt require any state it seems on the host side. I got motivated when the word migration was mentioned. I understood it to be meaning that a guest may become inoperative for some reason and that its info will be transfered to another guest which may be local or even remote. In such a case, clearly one would need a protocol and the state of all guests sitting at the host. Maybe i am over-reaching. Migration will work inside the kernel, so it has full access to whatever state information it needs. Best regards Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network namespaces a path to mergable code.
Hi Eric, On Wed, Jun 28, 2006 at 10:51:26AM -0600, Eric W. Biederman wrote: Andrey Savochkin [EMAIL PROTECTED] writes: One possible option to resolve this question is to show 2 relatively short patches just introducing namespaces for sockets in 2 ways: with explicit function parameters and using implicit current context. Then people can compare them and vote. Do you think it's worth the effort? Given that we have two strong opinions in different directions I think it is worth the effort to resolve this. Do you have time to extract necessary parts of your old patch? Or you aren't afraid of letting me draft an alternative version of socket namespaces basing on your code? :) In a slightly different vein your second patch introduced a lot of #ifdef CONFIG_NET_NS in C files. That is something we need to look closely at. So I think the abstraction that we use to access per network namespace variables needs some work if we are going to allow the ability to compile out all of the namespace code. The explicit versus implicit lookup is just one dimension of that problem. This is a good comment. Those ifdef's mostly correspond to places where we walk over lists and need to filter-out entities not belonging to a specific namespace. Those places about the same in your and my implementation. We can think what we can do with them. One trick that I used on several occasions is net_ns_same macro which doesn't evalute its arguments if CONFIG_NET_NS not defined, and thus can be used without ifdef's. Returning to implicit vs explicit function arguments, I belive that implicit arguments are more promising in having zero impact on the code when CONFIG_NET_NS is disabled. Functions like inet_addr_type will translate into exactly the same code as they did without net namespace patches. I'm still curious why many of those chunks can't use existing helper functions, to be cleaned up. What helper functions are you referring to? Basically most of the device list walker functions live in. 
net/core/dev.c I don't know if the cases you fixed could have used any of those helper functions but it certainly has me asking that question. A general pattern that happens in cleanups is the discovery that code using an old interface in a problematic way really could be done much better another way. I didn't dig enough to see if that was the case in any of the code that you changed. Well, there is obvious improvement of this kind: many protocols walk over device list to find devices with non-NULL protocol specific pointers. For example, IPv6, decnet and others do it on module unloading to clean up. Those places just ask for some simpler standard way of doing it, but I wasn't bold enough for such radical change. Do you think I should try? Best regards Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
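The net_ns_same trick mentioned above (a macro that does not evaluate its arguments when CONFIG_NET_NS is off) could look roughly like the sketch below. The real definition is not shown in this thread, so the macro body, `struct entity` and count_visible() are assumptions made purely for illustration.

```c
#include <stddef.h>

struct net_namespace { int id; };

/* Hypothetical object that carries a namespace tag only when enabled. */
struct entity {
#ifdef CONFIG_NET_NS
	struct net_namespace *ns;
#endif
	int value;
};

#ifdef CONFIG_NET_NS
#define net_ns_same(a, b)	((a) == (b))
#else
/*
 * The preprocessor drops both arguments, so they are never compiled.
 * That is why e[i].ns below is fine even though the field does not
 * exist in this configuration.
 */
#define net_ns_same(a, b)	(1)
#endif

/* Walk a list and count the entries visible from namespace cur. */
static int count_visible(const struct entity *e, int n,
			 const struct net_namespace *cur)
{
	int i, seen = 0;

	(void)cur;	/* unused when CONFIG_NET_NS is not defined */
	for (i = 0; i < n; i++) {
		if (!net_ns_same(e[i].ns, cur))
			continue;
		seen++;
	}
	return seen;
}
```

The list walker contains no #ifdef of its own, and with the option disabled the filter collapses to a constant that the compiler can remove.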
Re: Network namespaces a path to mergable code.
On Wed, Jun 28, 2006 at 12:14:41PM -0600, Eric W. Biederman wrote: Andrey Savochkin [EMAIL PROTECTED] writes: On Wed, Jun 28, 2006 at 10:51:26AM -0600, Eric W. Biederman wrote: Andrey Savochkin [EMAIL PROTECTED] writes: One possible option to resolve this question is to show 2 relatively short patches just introducing namespaces for sockets in 2 ways: with explicit function parameters and using implicit current context. Then people can compare them and vote. Do you think it's worth the effort? Given that we have two strong opinions in different directions I think it is worth the effort to resolve this. Do you have time to extract necessary parts of your old patch? Or you aren't afraid of letting me draft an alternative version of socket namespaces basing on your code? :) I'm not terribly afraid. I can always say you did it wrong. :) :) I don't think I am going to have time today. But since this conversation is slowing down and we are to getting into the technical details. I will try and find some time. Good. I'll focus on my part then. Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/6] [Network namespace] Network device sharing by view
Herbert, On Mon, Jun 26, 2006 at 10:02:25PM +0200, Herbert Poetzl wrote: keep in mind that you actually have three kinds of network traffic on a typical host/guest system: - traffic between unit and outside - host traffic should be quite minimal - guest traffic will be quite high - traffic between host and guest probably minimal too (only for shared services) - traffic between guests can be as high (or even higher) than the outbound traffic, just think web guest and database guest My experience with host-guest systems tells me the opposite: outside traffic is a way higher than traffic between guests. People put web server and database in different guests not more frequent than they put them on separate physical server. Unless people are building a really huge system when 1 server can't take the whole load, web and database live together and benefit from communications over UNIX sockets. Guests are usually comprised of web-db pairs, and people place many such guests on a single computer. The routing between network namespaces does have the potential to be more expensive than just a packet trivially coming off the wire into a socket. IMHO the routing between network namespaces should not require more than the current local traffic does (i.e. you should be able to achieve loopback speed within an insignificant tolerance) and not nearly the time required for on-wire stuff ... I'd like to caution about over-optimizing communications between different network namespaces. Many optimizations of local traffic (such as high MTU) don't look so appealing when you start to think about live migration of namespaces. Regards Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/6] [Network namespace] Network device sharing by view
Daniel,

On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote:
> > Then you lose the ability for each namespace to have its own routing entries.
> > Which implies that you'll have difficulties with devices that should exist
> > and be visible in one namespace only (like tunnels), as they require IP
> > addresses and route.
>
> I mean instead of having the route tables private to the namespace, the
> routes have the information to which namespace they are associated.

I think I understand what you're talking about: you want to make routing
responsible for determining destination namespace ID in addition to route
type (local, unicast etc), nexthop information, and so on. Right?

My point is that if you make namespace tagging at routing time, and
your packets are being routed only once, you lose the ability
to have separate routing tables in each namespace.

Andrey
Re: [patch 2/6] [Network namespace] Network device sharing by view
On Tue, Jun 27, 2006 at 11:34:36AM +0200, Daniel Lezcano wrote: Andrey Savochkin wrote: Daniel, On Mon, Jun 26, 2006 at 05:49:41PM +0200, Daniel Lezcano wrote: Then you lose the ability for each namespace to have its own routing entries. Which implies that you'll have difficulties with devices that should exist and be visible in one namespace only (like tunnels), as they require IP addresses and route. I mean instead of having the route tables private to the namespace, the routes have the information to which namespace they are associated. I think I understand what you're talking about: you want to make routing responsible for determining destination namespace ID in addition to route type (local, unicast etc), nexthop information, and so on. Right? Yes. My point is that if you make namespace tagging at routing time, and your packets are being routed only once, you lose the ability to have separate routing tables in each namespace. Right. What is the advantage of having separate the routing tables ? Routing is everything. For example, I want namespaces to have their private tunnel devices. It means that namespaces should be allowed have private routes of local type, private default routes, and so on... Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces
On Mon, Jun 26, 2006 at 10:05:14PM +0200, Herbert Poetzl wrote:
> On Mon, Jun 26, 2006 at 04:56:46PM +0200, Daniel Lezcano wrote:
> > Andrey Savochkin wrote:
> > > Structures related to IPv4 routing (FIB and routing cache)
> > > are made per-namespace.
> >
> > How do you handle ICMP_REDIRECT ?
>
> and btw. how do you handle the beloved 'ping' (i.e.
> ICMP_ECHO_REQUEST/REPLY) for and from guests?

I don't need to do anything special. They are just IP packets.
If packets are local in the current net namespace, they are delivered to
a socket or handled by icmp_rcv.

Certainly, packet/raw sockets shouldn't see packets they aren't supposed to
see. For raw sockets, it implies making socket lookup aware of namespaces,
exactly like for TCP or UDP.

Andrey
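A namespace-aware socket lookup, as described above for raw sockets, amounts to adding the namespace as one more match key. The sketch below is a deliberately simplified user-space model using a linear table; `struct sock` here and lookup_sock() are hypothetical and do not reflect the kernel's hash-based lookup.

```c
#include <stddef.h>

struct net_namespace { int id; };

/* Hypothetical, heavily reduced socket: just the lookup keys. */
struct sock {
	unsigned short port;		/* local port (or protocol for raw sockets) */
	struct net_namespace *ns;	/* namespace that owns the socket */
};

/* Return the first socket matching both port and namespace, or NULL. */
static struct sock *lookup_sock(struct sock *tbl, size_t n,
				unsigned short port,
				struct net_namespace *ns)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].port == port && tbl[i].ns == ns)
			return &tbl[i];
	return NULL;
}
```

Two sockets bound to the same port in different namespaces never match each other's packets, and a namespace with no matching socket simply gets no result.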
Re: [patch 2/6] [Network namespace] Network device sharing by view
Daniel,

On Tue, Jun 27, 2006 at 01:21:02PM +0200, Daniel Lezcano wrote:
> > > > My point is that if you make namespace tagging at routing time, and
> > > > your packets are being routed only once, you lose the ability
> > > > to have separate routing tables in each namespace.
> > >
> > > Right. What is the advantage of having separate the routing tables ?
> >
> > Routing is everything.
> > For example, I want namespaces to have their private tunnel devices.
> > It means that namespaces should be allowed have private routes of local type,
> > private default routes, and so on...
>
> Ok, we are talking about the same things. We do it only in a different way:

We are not talking about the same things.

It isn't a technical thing whether route lookup is performed before or after
namespace change.
It is a fundamental question determining functionality of network namespaces.
We are talking about the capabilities namespaces provide.

Your proposal essentially denies namespaces to have their own tunnel or other
devices. There is no point in having a device inside a namespace if the
namespace owner can't route all or some specific outgoing packets through
that device. You don't allow system administrators to completely delegate
management of network configuration to namespace owners.

Andrey
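To make the argument about private routes concrete, here is a minimal sketch of per-namespace routing tables: the same destination resolves differently depending on which namespace does the lookup. It is only an illustration under simplifying assumptions (a flat array, host-order addresses, contiguous netmasks); the structures, route_lookup() and the device names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct route {
	uint32_t prefix;	/* network address, host byte order */
	uint32_t mask;		/* contiguous netmask, host byte order */
	const char *dev;	/* output device name */
};

/* Each namespace owns its table, including its own default route. */
struct net_namespace {
	const struct route *routes;
	size_t nroutes;
};

/* Longest-prefix match within one namespace; NULL if nothing matches. */
static const char *route_lookup(const struct net_namespace *ns,
				uint32_t daddr)
{
	const struct route *best = NULL;
	size_t i;

	for (i = 0; i < ns->nroutes; i++) {
		const struct route *r = &ns->routes[i];

		if ((daddr & r->mask) != r->prefix)
			continue;
		/* a numerically larger contiguous mask is a longer prefix */
		if (best == NULL || r->mask > best->mask)
			best = r;
	}
	return best ? best->dev : NULL;
}
```

A guest whose table has a default route through its own tunnel sends outside traffic there, while the host's table keeps pointing at the physical device. A single shared table tagged with namespace IDs would not let the guest owner express that choice.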
Re: [patch 2/6] [Network namespace] Network device sharing by view
Herbert, On Tue, Jun 27, 2006 at 05:48:19PM +0200, Herbert Poetzl wrote: On Tue, Jun 27, 2006 at 01:09:11PM +0400, Andrey Savochkin wrote: On Mon, Jun 26, 2006 at 10:02:25PM +0200, Herbert Poetzl wrote: - traffic between guests can be as high (or even higher) than the outbound traffic, just think web guest and database guest My experience with host-guest systems tells me the opposite: outside traffic is a way higher than traffic between guests. People put web server and database in different guests not more frequent than they put them on separate physical server. Unless people are building a really huge system when 1 server can't take the whole load, web and database live together and benefit from communications over UNIX sockets. well, that's probably because you (or your company) focuses on providers which simply (re)sell the entities to their customers, in which case it would be more expensive to put e.g. the database into a separate guest. but let me state here that this is not the only application for this technology I'm just sharing my experience. You have one experience, I have another, and your classification of traffic importance is not the universal one. My point was that we shouldn't overestimate the use of INET sockets vs. UNIX ones in configurations where communications but not web/db operations play a big role in overall performance. And indeed I've talked with many different people, from universities to large enterprises. [snip] I'd like to caution about over-optimizing communications between different network namespaces. Many optimizations of local traffic (such as high MTU) don't look so appealing when you start to think about live migration of namespaces. 
I think the 'optimization' (or to be precise: desire not to sacrifice local/loopback traffic for some use case as you describe it) does not interfere with live migration at all, we still will have 'local' and 'remote' traffic, and personally I doubt that the live migration is a feature for the masses ... Why not for the masses? Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network namespaces a path to mergable code.
Eric, On Tue, Jun 27, 2006 at 11:20:40AM -0600, Eric W. Biederman wrote: Thinking about this I am going to suggest a slightly different direction for get a patchset we can merge. First we concentrate on the fundamentals. - How we mark a device as belonging to a specific network namespace. - How we mark a socket as belonging to a specific network namespace. I agree with the direction of your thoughts. I was trying to do a similar thing, define clear steps in network namespace merging. My first patchset covers devices but not sockets. The only difference from what you're suggesting is ipv4 routing. For me, it is not less important than devices and sockets. May be even more important, since routing exposes design deficiencies less obvious at socket level. As part of the fundamentals we add a patch to the generic socket code that by default will disable it for protocol families that do not indicate support for handling network namespaces, on a non-default network namespace. Fine Can you summarize you objections against my way of handling devices, please? And what was the typo you referred to in your letter to Kirill Korotaev? Regards Andrey - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 2/6] [Network namespace] Network device sharing by view
Hi Daniel,

It's good that you kicked off network namespace discussion.
Although I wish you'd Cc'ed someone at OpenVZ so I could notice it earlier :).

Indeed, the first point to agree in this discussion is device list.
In your patch, you essentially introduce a data structure parallel
to the main device list, creating a view of this list.

I see a fundamental problem with this approach.
When a device presents an skb to the protocol layer, it needs to know to which
namespace this skb belongs.
Otherwise you would never get rid of problems with bind: what to do if device
eth1 is visible in namespace1, namespace2, and root namespace, and each
namespace has a socket bound to 0.0.0.0:80?

We have to conclude that each device should be visible only in one namespace.
In this case, instead of introducing net_ns_dev and net_ns_dev_list
structures, we can simply have a separate dev_base list head in each namespace.
Moreover, separate device list in each namespace will be in line with
making namespace isolation complete.
Complete isolation will allow each namespace to set up own tun/tap devices,
have own routes, netfilter tables, and so on.

My follow-up messages will contain the first set of patches with network
namespaces implemented in the same way as network isolation in OpenVZ.
This patchset introduces namespaces for device list and IPv4 FIB/routing.
Two technical issues are omitted to make the patch idea clearer: device moving
between namespaces, and selective routing cache flush + garbage collection.

If this patchset is agreeable, the next patchset will finalize integration
with nsproxy, add namespaces to socket lookup code and neighbour cache,
and introduce a simple device to pass traffic between namespaces.
Then we will turn to less obvious matters including netlink messages,
network statistics, representation of network information in proc and sysfs,
tuning of parameters through sysctl, IPv6 and other protocols, and
per-namespace netfilters.

Best regards

Andrey
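The "separate dev_base list head in each namespace" idea from the message above can be sketched in a few lines of user-space C. Each device belongs to exactly one namespace, so the namespace of a received skb follows from the receiving device, and name lookups never cross namespace borders. All names below (ns_register_dev, ns_find_dev, the struct fields) are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

struct net_namespace;

/* Hypothetical reduced device: name, owner, and list linkage. */
struct net_device {
	const char *name;
	struct net_namespace *ns;	/* the single namespace owning the device */
	struct net_device *next;
};

struct net_namespace {
	struct net_device *dev_base;	/* per-namespace list head */
};

/* Link a device into one namespace's private list. */
static void ns_register_dev(struct net_namespace *ns, struct net_device *dev)
{
	dev->ns = ns;
	dev->next = ns->dev_base;
	ns->dev_base = dev;
}

/* Look a device up by name, searching only the given namespace. */
static struct net_device *ns_find_dev(const struct net_namespace *ns,
				      const char *name)
{
	struct net_device *dev;

	for (dev = ns->dev_base; dev != NULL; dev = dev->next)
		if (strcmp(dev->name, name) == 0)
			return dev;
	return NULL;
}
```

In this model two namespaces can each hold a device called "lo" without conflict, and a device registered in the guest is simply absent from the root namespace's list.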
[patch 1/4] Network namespaces: cleanup of dev_base list use
Cleanup of dev_base list use, with the aim to make device list per-namespace. In almost every occasion, use of dev_base variable and dev-next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Signed-off-by: Andrey Savochkin [EMAIL PROTECTED] --- arch/s390/appldata/appldata_net_sum.c |2 arch/sparc64/solaris/ioctl.c |2 drivers/block/aoe/aoecmd.c|8 ++- drivers/net/wireless/strip.c |4 - drivers/parisc/led.c |2 include/linux/netdevice.h | 28 +++-- net/8021q/vlan.c |4 - net/8021q/vlanproc.c | 10 ++-- net/bridge/br_if.c|4 - net/bridge/br_ioctl.c |4 + net/bridge/br_netlink.c |3 - net/core/dev.c| 70 -- net/core/dev_mcast.c |4 - net/core/rtnetlink.c | 18 net/decnet/af_decnet.c| 11 +++-- net/decnet/dn_dev.c | 17 net/decnet/dn_fib.c |2 net/decnet/dn_route.c | 12 ++--- net/ipv4/devinet.c| 15 --- net/ipv4/igmp.c | 25 +++- net/ipv6/addrconf.c | 28 - net/ipv6/anycast.c| 22 ++ net/ipv6/mcast.c | 20 + net/llc/llc_core.c|7 ++- net/netrom/nr_route.c |4 - net/rose/rose_route.c |8 ++- net/sched/sch_api.c |8 ++- net/sctp/protocol.c |2 net/tipc/eth_media.c | 12 +++-- 29 files changed, 200 insertions, 156 deletions --- ./arch/s390/appldata/appldata_net_sum.c.vedevbase Mon Mar 20 08:53:29 2006 +++ ./arch/s390/appldata/appldata_net_sum.c Thu Jun 22 12:03:07 2006 @@ -108,7 +108,7 @@ static void appldata_get_net_sum_data(vo tx_dropped = 0; collisions = 0; read_lock(dev_base_lock); - for (dev = dev_base; dev != NULL; dev = dev-next) { + for_each_netdev(dev) { if (dev-get_stats == NULL) { continue; } --- ./arch/sparc64/solaris/ioctl.c.vedevbaseMon Mar 20 08:53:29 2006 +++ ./arch/sparc64/solaris/ioctl.c Thu Jun 22 12:03:07 2006 @@ -686,7 +686,7 @@ static inline int solaris_i(unsigned int int i = 0; read_lock_bh(dev_base_lock); - for (d = dev_base; d; d = d-next) i++; + for_each_netdev(d) i++; read_unlock_bh(dev_base_lock); if (put_user (i, (int __user *)A(arg))) --- ./drivers/block/aoe/aoecmd.c.vedevbase 
Wed Jun 21 18:50:28 2006 +++ ./drivers/block/aoe/aoecmd.cThu Jun 22 12:03:07 2006 @@ -204,14 +204,17 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne sl = sl_tail = NULL; read_lock(dev_base_lock); - for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp-next) { + for_each_netdev(dev) { dev_hold(ifp); - if (!is_aoe_netif(ifp)) + if (!is_aoe_netif(ifp)) { + dev_put(ifp); continue; + } skb = new_skb(ifp, sizeof *h + sizeof *ch); if (skb == NULL) { printk(KERN_INFO aoe: aoecmd_cfg: skb alloc failure\n); + dev_put(ifp); continue; } if (sl_tail == NULL) @@ -229,6 +232,7 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne skb-next = sl; sl = skb; + dev_put(ifp); } read_unlock(dev_base_lock); --- ./drivers/net/wireless/strip.c.vedevbaseWed Jun 21 18:50:43 2006 +++ ./drivers/net/wireless/strip.c Thu Jun 22 12:03:07 2006 @@ -1970,8 +1970,7 @@ static struct net_device *get_strip_dev( sizeof(zero_address))) { struct net_device *dev; read_lock_bh(dev_base_lock); - dev = dev_base; - while (dev) { + for_each_netdev(dev) { if (dev-type == strip_info-dev-type !memcmp(dev-dev_addr, strip_info-true_dev_addr, @@ -1982,7 +1981,6 @@ static struct net_device *get_strip_dev( read_unlock_bh(dev_base_lock); return (dev); } - dev = dev-next; } read_unlock_bh(dev_base_lock); } --- ./drivers/parisc/led.c.vedevbaseWed Jun 21 18:52:58 2006
[patch 2/4] Network namespaces: cleanup of dev_base list use
CONFIG_NET_NS and net_namespace structure are introduced.
List of network devices is made per-namespace.
Each namespace gets its own loopback device.

Task's net_namespace pointer is not incorporated into nsproxy structure,
since current namespace changes temporarily for processing of packets
in softirq.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 drivers/net/loopback.c    |   70 +++
 include/linux/init_task.h |    9 ++
 include/linux/net_ns.h    |   88 
 include/linux/netdevice.h |   20 -
 include/linux/nsproxy.h   |    3 
 include/linux/sched.h     |    3 
 kernel/nsproxy.c          |   14 +++
 net/Kconfig               |    7 +
 net/core/dev.c            |  162 +-
 net/core/net-sysfs.c      |   24 ++
 net/ipv4/devinet.c        |    2 
 net/ipv6/addrconf.c       |    2 
 net/ipv6/route.c          |    3 
 13 files changed, 371 insertions, 36 deletions

--- ./drivers/net/loopback.c.venshd	Wed Jun 21 18:50:39 2006
+++ ./drivers/net/loopback.c	Fri Jun 23 11:48:09 2006
@@ -196,42 +196,56 @@ static struct ethtool_ops loopback_ethto
 	.set_tso		= ethtool_op_set_tso,
 };
 
-struct net_device loopback_dev = {
-	.name			= "lo",
-	.mtu			= (16 * 1024) + 20 + 20 + 12,
-	.hard_start_xmit	= loopback_xmit,
-	.hard_header		= eth_header,
-	.hard_header_cache	= eth_header_cache,
-	.header_cache_update	= eth_header_cache_update,
-	.hard_header_len	= ETH_HLEN,	/* 14	*/
-	.addr_len		= ETH_ALEN,	/* 6	*/
-	.tx_queue_len		= 0,
-	.type			= ARPHRD_LOOPBACK,	/* 0x0001*/
-	.rebuild_header		= eth_rebuild_header,
-	.flags			= IFF_LOOPBACK,
-	.features		= NETIF_F_SG | NETIF_F_FRAGLIST
+struct net_device loopback_dev_static;
+EXPORT_SYMBOL(loopback_dev_static);
+
+void loopback_dev_dtor(struct net_device *dev)
+{
+	if (dev->priv) {
+		kfree(dev->priv);
+		dev->priv = NULL;
+	}
+	free_netdev(dev);
+}
+
+void loopback_dev_ctor(struct net_device *dev)
+{
+	struct net_device_stats *stats;
+
+	memset(dev, 0, sizeof(*dev));
+	strcpy(dev->name, "lo");
+	dev->mtu		= (16 * 1024) + 20 + 20 + 12;
+	dev->hard_start_xmit	= loopback_xmit;
+	dev->hard_header	= eth_header;
+	dev->hard_header_cache	= eth_header_cache;
+	dev->header_cache_update = eth_header_cache_update;
+	dev->hard_header_len	= ETH_HLEN;	/* 14	*/
+	dev->addr_len		= ETH_ALEN;	/* 6	*/
+	dev->tx_queue_len	= 0;
+	dev->type		= ARPHRD_LOOPBACK;	/* 0x0001*/
+	dev->rebuild_header	= eth_rebuild_header;
+	dev->flags		= IFF_LOOPBACK;
+	dev->features		= NETIF_F_SG | NETIF_F_FRAGLIST
 #ifdef LOOPBACK_TSO
 				  | NETIF_F_TSO
 #endif
 				  | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA
-				  | NETIF_F_LLTX,
-	.ethtool_ops		= &loopback_ethtool_ops,
-};
-
-/* Setup and register the loopback device. */
-int __init loopback_init(void)
-{
-	struct net_device_stats *stats;
+				  | NETIF_F_LLTX
+				  | NETIF_F_NSOK;
+	dev->ethtool_ops	= &loopback_ethtool_ops;
 
 	/* Can survive without statistics */
 	stats = kmalloc(sizeof(struct net_device_stats), GFP_KERNEL);
 	if (stats) {
 		memset(stats, 0, sizeof(struct net_device_stats));
-		loopback_dev.priv = stats;
-		loopback_dev.get_stats = get_stats;
+		dev->priv = stats;
+		dev->get_stats = get_stats;
 	}
-
-	return register_netdev(&loopback_dev);
-};
+}
 
-EXPORT_SYMBOL(loopback_dev);
+/* Setup and register the loopback device. */
+int __init loopback_init(void)
+{
+	loopback_dev_ctor(&loopback_dev_static);
+	return register_netdev(&loopback_dev_static);
+};
--- ./include/linux/init_task.h.venshd	Wed Jun 21 18:53:16 2006
+++ ./include/linux/init_task.h	Fri Jun 23 11:48:09 2006
@@ -87,6 +87,14 @@ extern struct nsproxy init_nsproxy;
 
 extern struct group_info init_groups;
 
+#ifdef CONFIG_NET_NS
+extern struct net_namespace init_net_ns;
+#define INIT_NET_NS \
+	.net_context	= &init_net_ns,
+#else
+#define INIT_NET_NS
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1f (=2MB)
@@ -129,6 +137,7 @@ extern struct group_info init_groups;
 	.signal		= &init_signals,				\
 	.sighand	= &init_sighand,				\
 	.nsproxy	= &init_nsproxy
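The reason for turning the static loopback_dev initializer into a constructor is that every namespace needs its own "lo" created at run time. A minimal user-space sketch of that idea follows. All model_* names are hypothetical stand-ins for illustration, not the kernel structures or the functions from this patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device (illustration only). */
struct model_dev {
    char name[16];
    unsigned mtu;
    struct model_dev *next;
};

/* Hypothetical stand-in for struct net_namespace. */
struct model_ns {
    struct model_dev *dev_base;   /* per-namespace device list */
    struct model_dev *loopback;   /* this namespace's private "lo" */
};

/* Fill in a loopback device at run time, in the spirit of loopback_dev_ctor(). */
void model_loopback_ctor(struct model_dev *dev)
{
    assert(dev != NULL);
    memset(dev, 0, sizeof(*dev));
    strcpy(dev->name, "lo");
    dev->mtu = (16 * 1024) + 20 + 20 + 12;
}

/* Create a namespace that owns its own loopback; NULL on allocation failure. */
struct model_ns *model_ns_create(void)
{
    struct model_ns *ns = calloc(1, sizeof(*ns));
    struct model_dev *lo = malloc(sizeof(*lo));

    if (!ns || !lo) {
        free(ns);
        free(lo);
        return NULL;
    }
    model_loopback_ctor(lo);
    lo->next = ns->dev_base;
    ns->dev_base = lo;
    ns->loopback = lo;
    return ns;
}

/* Free every device on the namespace's list, then the namespace itself. */
void model_ns_destroy(struct model_ns *ns)
{
    struct model_dev *d = ns->dev_base;

    while (d) {
        struct model_dev *next = d->next;
        free(d);
        d = next;
    }
    free(ns);
}
```

Two namespaces created this way hold two distinct devices that are both named "lo", which a single statically initialized structure cannot provide.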
[patch 4/4] Network namespaces: playing and debugging
Temporary code to play with network namespaces in the simplest way.
Do
	exec 7< /proc/net/net_ns
in your bash shell and you'll get a brand new network namespace.
There you can, for example, do
	ip link set lo up
	ip addr list
	ip addr add 1.2.3.4 dev lo
	ping -n 1.2.3.4

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 dev.c |   27 ++-
 1 files changed, 26 insertions, 1 deletion

--- ./net/core/dev.c.vensdbg	Fri Jun 23 11:50:16 2006
+++ ./net/core/dev.c	Fri Jun 23 11:50:40 2006
@@ -3444,6 +3444,8 @@ int net_ns_start(void)
 	if (err)
 		goto out_register;
 	put_net_ns(orig_ns);
+	printk(KERN_DEBUG "NET_NS: created new netcontext %p for %s (pid=%d)\n",
+			ns, task->comm, task->tgid);
 	return 0;
 
 out_register:
@@ -3461,6 +3463,7 @@ EXPORT_SYMBOL(net_ns_start);
 
 void net_ns_free(struct net_namespace *ns)
 {
+	printk(KERN_DEBUG "NET_NS: netcontext %p freed\n", ns);
 	kfree(ns);
 }
 EXPORT_SYMBOL(net_ns_free);
@@ -3473,8 +3476,13 @@ static void net_ns_destroy(void *data)
 	ns = data;
 	push_net_ns(ns, orig_ns);
 	unregister_netdev(ns->loopback);
+	if (!list_empty(&ns->dev_base)) {
+		printk("NET_NS: BUG: context %p has devices! ref %d\n",
+				ns, atomic_read(&ns->active_ref));
+		pop_net_ns(orig_ns);
+		return;
+	}
 	ip_fib_struct_fini();
-	BUG_ON(!list_empty(&ns->dev_base));
 	pop_net_ns(orig_ns);
 
 	/* drop (hopefully) final reference */
@@ -3483,9 +3491,23 @@ static void net_ns_destroy(void *data)
 
 void net_ns_stop(struct net_namespace *ns)
 {
+	printk(KERN_DEBUG "NET_NS: netcontext %p scheduled for stop\n", ns);
 	execute_in_process_context(net_ns_destroy, ns, &ns->destroy_work);
 }
 EXPORT_SYMBOL(net_ns_stop);
+
+static int net_ns_open(struct inode *i, struct file *f)
+{
+	return net_ns_start();
+}
+static struct file_operations net_ns_fops = {
+	.open	= net_ns_open,
+};
+static int net_ns_init(void)
+{
+	return proc_net_fops_create("net_ns", S_IRWXU, &net_ns_fops)
+			? 0 : -ENOMEM;
+}
 #endif
 
 /*
@@ -3550,6 +3572,9 @@ static int __init net_dev_init(void)
 	hotcpu_notifier(dev_cpu_callback, 0);
 	dst_init();
 	dev_mcast_init();
+#ifdef CONFIG_NET_NS
+	net_ns_init();
+#endif
 	rc = 0;
 out:
 	return rc;
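net_ns_destroy() above brackets its work with push_net_ns()/pop_net_ns(). The sketch below models that bracket in user space, assuming (as the call sites suggest, but the macro bodies are not shown in this message) that push saves the previous context into its second argument and pop restores it. The model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_namespace. */
struct model_ns {
    int id;
};

/* Stand-in for the current task's namespace pointer. */
struct model_ns *model_current_ns;

/* Assumed semantics: remember the old context, then switch to the new one. */
#define model_push_ns(ns, orig) \
    do { (orig) = model_current_ns; model_current_ns = (ns); } while (0)

/* Assumed semantics: switch back to the remembered context. */
#define model_pop_ns(orig) \
    do { model_current_ns = (orig); } while (0)

/* Report which namespace the caller is currently running in (-1 if none). */
int model_current_id(void)
{
    return model_current_ns ? model_current_ns->id : -1;
}

/* Run fn inside ns and restore the previous context afterwards. */
int model_run_in_ns(struct model_ns *ns, int (*fn)(void))
{
    struct model_ns *orig;
    int ret;

    assert(fn != NULL);
    model_push_ns(ns, orig);
    ret = fn();
    model_pop_ns(orig);
    return ret;
}
```

Any early return between push and pop has to restore the context first, which is why the debugging branch added to net_ns_destroy() calls pop_net_ns() before returning.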
[patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces
Structures related to IPv4 routing (FIB and routing cache)
are made per-namespace.

Signed-off-by: Andrey Savochkin [EMAIL PROTECTED]
---
 include/linux/net_ns.h   |    9 +++
 include/net/flow.h       |    3 +
 include/net/ip_fib.h     |   62 -
 net/core/dev.c           |    7 ++
 net/ipv4/Kconfig         |    4 -
 net/ipv4/fib_frontend.c  |   87 +--
 net/ipv4/fib_hash.c      |   13 -
 net/ipv4/fib_rules.c     |  114 +--
 net/ipv4/fib_semantics.c |  104 +-
 net/ipv4/route.c         |   26 ++
 10 files changed, 348 insertions, 81 deletions

--- ./include/linux/net_ns.h.vensrt	Fri Jun 23 11:49:42 2006
+++ ./include/linux/net_ns.h	Fri Jun 23 11:50:16 2006
@@ -14,7 +14,16 @@ struct net_namespace {
 	atomic_t		active_ref, use_ref;
 	struct list_head	dev_base;
 	struct net_device	*loopback;
+#ifndef CONFIG_IP_MULTIPLE_TABLES
+	struct fib_table	*fib4_local_table, *fib4_main_table;
+#else
+	struct fib_table	**fib4_tables;
+	struct hlist_head	fib4_rules;
+#endif
+	struct hlist_head	*fib4_hash, *fib4_laddrhash;
+	unsigned		fib4_hash_size, fib4_info_cnt;
 	unsigned int		hash;
+	char			destroying;
 	struct execute_work	destroy_work;
 };
 
--- ./include/net/flow.h.vensrt	Wed Jun 21 18:51:08 2006
+++ ./include/net/flow.h	Fri Jun 23 11:50:16 2006
@@ -78,6 +78,9 @@ struct flowi {
 #define fl_icmp_type	uli_u.icmpt.type
 #define fl_icmp_code	uli_u.icmpt.code
 #define fl_ipsec_spi	uli_u.spi
+#ifdef CONFIG_NET_NS
+	struct net_namespace *net_ns;
+#endif
 } __attribute__((__aligned__(BITS_PER_LONG/8)));
 
 #define FLOW_DIR_IN	0
--- ./include/net/ip_fib.h.vensrt	Wed Jun 21 18:53:17 2006
+++ ./include/net/ip_fib.h	Fri Jun 23 11:50:16 2006
@@ -18,6 +18,7 @@
 
 #include <net/flow.h>
 #include <linux/seq_file.h>
+#include <linux/net_ns.h>
 
 /* WARNING: The ordering of these elements must match ordering
  *          of RTA_* rtnetlink attribute numbers.
@@ -169,14 +170,21 @@ struct fib_table {
 
 #ifndef CONFIG_IP_MULTIPLE_TABLES
 
-extern struct fib_table *ip_fib_local_table;
-extern struct fib_table *ip_fib_main_table;
+#ifndef CONFIG_NET_NS
+extern struct fib_table *ip_fib_local_table_static;
+extern struct fib_table *ip_fib_main_table_static;
+#define ip_fib_local_table_ns()	ip_fib_local_table_static
+#define ip_fib_main_table_ns()	ip_fib_main_table_static
+#else
+#define ip_fib_local_table_ns()	(current_net_ns->fib4_local_table)
+#define ip_fib_main_table_ns()	(current_net_ns->fib4_main_table)
+#endif
 
 static inline struct fib_table *fib_get_table(int id)
 {
 	if (id != RT_TABLE_LOCAL)
-		return ip_fib_main_table;
-	return ip_fib_local_table;
+		return ip_fib_main_table_ns();
+	return ip_fib_local_table_ns();
 }
 
 static inline struct fib_table *fib_new_table(int id)
@@ -186,23 +194,36 @@ static inline struct fib_table *fib_new_
 
 static inline int fib_lookup(const struct flowi *flp, struct fib_result *res)
 {
-	if (ip_fib_local_table->tb_lookup(ip_fib_local_table, flp, res) &&
-	    ip_fib_main_table->tb_lookup(ip_fib_main_table, flp, res))
+	struct fib_table *tb;
+
+	tb = ip_fib_local_table_ns();
+	if (!tb->tb_lookup(tb, flp, res))
+		return 0;
+	tb = ip_fib_main_table_ns();
+	if (tb->tb_lookup(tb, flp, res))
 		return -ENETUNREACH;
 	return 0;
 }
 
 static inline void fib_select_default(const struct flowi *flp, struct fib_result *res)
 {
+	struct fib_table *tb;
+
+	tb = ip_fib_main_table_ns();
 	if (FIB_RES_GW(*res) && FIB_RES_NH(*res).nh_scope == RT_SCOPE_LINK)
-		ip_fib_main_table->tb_select_default(ip_fib_main_table, flp, res);
+		tb->tb_select_default(tb, flp, res);
 }
 
 #else /* CONFIG_IP_MULTIPLE_TABLES */
-#define ip_fib_local_table (fib_tables[RT_TABLE_LOCAL])
-#define ip_fib_main_table (fib_tables[RT_TABLE_MAIN])
+#define ip_fib_local_table_ns() (fib_tables_ns()[RT_TABLE_LOCAL])
+#define ip_fib_main_table_ns() (fib_tables_ns()[RT_TABLE_MAIN])
 
-extern struct fib_table * fib_tables[RT_TABLE_MAX+1];
+#ifndef CONFIG_NET_NS
+extern struct fib_table * fib_tables_static[RT_TABLE_MAX+1];
+#define fib_tables_ns() fib_tables_static
+#else
+#define fib_tables_ns() (current_net_ns->fib4_tables)
+#endif
 extern int fib_lookup(const struct flowi *flp, struct fib_result *res);
 extern struct fib_table *__fib_new_table(int id);
 extern void fib_rule_put(struct fib_rule *r);
@@ -212,7 +233,7 @@ static inline struct fib_table *fib_get_
 	if (id == 0)
 		id = RT_TABLE_MAIN;
 
-	return fib_tables[id];
+	return fib_tables_ns()[id];
 }
 
 static inline
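The rewritten fib_lookup() consults the namespace's local table first and falls back to its main table. A user-space sketch of that order, assuming an exact-match table in place of the real FIB, might look like this. The model_* types are hypothetical; only the "0 means found" return convention and -ENETUNREACH are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical exact-match route, standing in for a real FIB entry. */
struct model_route {
    unsigned dst;
    unsigned gw;
};

struct model_table {
    const struct model_route *routes;
    size_t n;
};

/* Each namespace carries its own pair of tables. */
struct model_ns {
    struct model_table local_tb;
    struct model_table main_tb;
};

/* Returns 0 and fills *gw on a hit, nonzero on a miss (tb_lookup convention). */
int model_tb_lookup(const struct model_table *tb, unsigned dst, unsigned *gw)
{
    assert(tb != NULL && gw != NULL);
    for (size_t i = 0; i < tb->n; i++) {
        if (tb->routes[i].dst == dst) {
            *gw = tb->routes[i].gw;
            return 0;
        }
    }
    return 1;
}

/* Local table first, then main; unreachable if both miss. */
int model_fib_lookup(const struct model_ns *ns, unsigned dst, unsigned *gw)
{
    if (!model_tb_lookup(&ns->local_tb, dst, gw))
        return 0;
    if (model_tb_lookup(&ns->main_tb, dst, gw))
        return -ENETUNREACH;
    return 0;
}
```

With tables held per namespace, the same destination can be a local address in one namespace and a routed destination in another. The sketch passes the namespace explicitly only to stay self-contained; the patch reaches it through current_net_ns.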
Re: [patch 2/6] [Network namespace] Network device sharing by view
Hi Herbert,

On Mon, Jun 26, 2006 at 03:02:03PM +0200, Herbert Poetzl wrote:
> On Mon, Jun 26, 2006 at 01:47:11PM +0400, Andrey Savochkin wrote:
> > I see a fundamental problem with this approach. When a device
> > presents an skb to the protocol layer, it needs to know to which
> > namespace this skb belongs.
> > Otherwise you would never get rid of problems with bind: what to do
> > if device eth1 is visible in namespace1, namespace2, and root
> > namespace, and each namespace has a socket bound to 0.0.0.0:80?
>
> this is something which isn't a fundamental problem at
> all, and IMHO there are at least three options here
> (probably more)
>
>  - check at 'bind' time if the binding would overlap
>    and give the 'proper' error (as it happens right
>    now on the host)
>    (this is how Linux-VServer currently handles the
>    network isolation, and yes, it works quite fine :)

I'm not comfortable with this as a permanent mainstream solution.
It means that network namespaces are actually not namespaces: you can't run
some program (e.g., apache) with default configs in a new namespace without
regard to who runs what in other namespaces.
In other words, name 0.0.0.0:80 creates a collision in your implementation,
so socket names do not form isolated spaces.

>  - allow arbitrary binds and 'tag' the packets according
>    to some 'host' policy (e.g. iptables or tc)
>    (this is how the Linux-VServer ngnet was designed)
>
>  - deliver packets to _all_ bound sockets/destinations
>    (this is probably a more unusable but quite thinkable
>    solution)

Deliver TCP packets to all sockets?
How many connections do you expect to be established in this case?

> > We have to conclude that each device should be visible only in one
> > namespace.
>
> I disagree here, especially some supervisor context or
> the host context should be able to 'see' and probably
> manipulate _all_ of the devices

Right, manipulating all devices from some supervisor context is useful.

But this shouldn't necessarily be done by regular ip/ifconfig tools.
Besides, it could be quite confusing if in ifconfig output in the
supervisor context you see 325 tun0 devices coming from
different namespaces :)

So I'm all for network namespace management mechanisms not bound
to existing tools/APIs.

> > Complete isolation will allow each namespace to set up own tun/tap
> > devices, have own routes, netfilter tables, and so on.
>
> tun/tap devices are quite possible with this approach
> too, I see no problem here ...
>
> for iptables and routes, I'm worried about the required
> 'policy' to make them secure, i.e. how do you ensure
> that the packets 'leaving' guest X do not contain
> 'evil' packets and/or disrupt your host system?

Sorry, I don't get your point.
How do you ensure that packets leaving your neighbor's computer
do not disrupt your system?
From my point of view, network namespaces are just neighbors.

> > My follow-up messages will contain the first set of patches with
> > network namespaces implemented in the same way as network isolation
> > in OpenVZ.
>
> hmm, you probably mean 'network virtualization' here

I meant isolation between different network contexts/namespaces.

> > This patchset introduces namespaces for device list and IPv4
> > FIB/routing. Two technical issues are omitted to make the patch idea
> > clearer: device moving between namespaces, and selective routing cache
> > flush + garbage collection.
> >
> > If this patchset is agreeable, the next patchset will finalize
> > integration with nsproxy, add namespaces to socket lookup code and
> > neighbour cache, and introduce a simple device to pass traffic between
> > namespaces.
>
> passing traffic 'between' namespaces should happen via
> lo, no? what kind of 'device' is required there, and
> what overhead does it add to the networking?

OpenVZ provides 2 options.

 1) A packet appears right inside some namespace, without any additional
    overhead. Usually this implies that all packets from this device
    belong to this namespace, i.e. simple device-namespace assignment.
    However, there is nothing conceptually wrong with having
    namespace-aware device drivers or netfilter modules selecting
    namespaces for each incoming packet. It all depends on how you want
    packets to go through various network layers, and how much network
    management ability you want to have in non-root namespaces.
    My point is that for network namespaces to be real namespaces,
    decision making should be done somewhere before socket lookup.

 2) Parent network namespace acts as a router forwarding packets to child
    namespaces. This scheme is the preferred one in OpenVZ for various
    reasons, most important being the simplicity of migration of network
    namespaces. In this case flexibility has the cost of going through
    packet handling layers two times.
    Technically, this is implemented via a simple netdevice doing
    netif_rx in hard_xmit.

Regards

Andrey
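The point that the namespace decision has to be made before socket lookup can be illustrated with a toy bind table in which the namespace is part of the key. This is only a sketch under that assumption: the table layout and all model_* names are hypothetical and do not reflect the kernel's socket hashes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_SOCKS 8
#define MODEL_INADDR_ANY 0u

/* Hypothetical bound socket; ns is an opaque namespace identity. */
struct model_sock {
    const void *ns;
    unsigned addr;
    unsigned short port;
    int in_use;
};

struct model_sock model_socks[MODEL_MAX_SOCKS];

/* Bind inside ns. Conflicts are checked only against sockets of the same
 * namespace. Returns the slot index, or a negative errno value. */
int model_bind(const void *ns, unsigned addr, unsigned short port)
{
    int free_slot = -1;

    for (int i = 0; i < MODEL_MAX_SOCKS; i++) {
        const struct model_sock *s = &model_socks[i];

        if (!s->in_use) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (s->ns != ns || s->port != port)
            continue;
        if (s->addr == addr || s->addr == MODEL_INADDR_ANY ||
            addr == MODEL_INADDR_ANY)
            return -EADDRINUSE;
    }
    if (free_slot < 0)
        return -ENOMEM;
    model_socks[free_slot] = (struct model_sock){ ns, addr, port, 1 };
    return free_slot;
}

/* Demultiplex a packet whose namespace was already chosen by the device. */
int model_lookup(const void *ns, unsigned daddr, unsigned short dport)
{
    for (int i = 0; i < MODEL_MAX_SOCKS; i++) {
        const struct model_sock *s = &model_socks[i];

        if (s->in_use && s->ns == ns && s->port == dport &&
            (s->addr == daddr || s->addr == MODEL_INADDR_ANY))
            return i;
    }
    return -1;
}
```

Because the packet arrives with its namespace already decided, two sockets bound to 0.0.0.0:80 in different namespaces neither collide at bind time nor compete at delivery time.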
Re: [patch 2/6] [Network namespace] Network device sharing by view
Daniel,

On Mon, Jun 26, 2006 at 04:56:32PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > It's good that you kicked off network namespace discussion.
> > Although I wish you'd Cc'ed someone at OpenVZ so I could notice it
> > earlier :).
>
> [EMAIL PROTECTED] ?

[EMAIL PROTECTED] is fine

> > When a device presents an skb to the protocol layer, it needs to know
> > to which namespace this skb belongs.
> > Otherwise you would never get rid of problems with bind: what to do if
> > device eth1 is visible in namespace1, namespace2, and root namespace,
> > and each namespace has a socket bound to 0.0.0.0:80?
>
> Exact. But, the idea was to retrieve the namespace from the routes.

Then you lose the ability for each namespace to have its own routing entries.
Which implies that you'll have difficulties with devices that should exist
and be visible in one namespace only (like tunnels), as they require IP
addresses and route.

> IMHO, I think there are roughly 2 network isolation implementation:
>
> 	- make all network ressources private to the namespace
>
> 	- keep a flat model where network ressources have a new identifier
> which is the network namespace pointer. The idea is to move only some
> network informations private to the namespace (eg port range, stats, ...)

Sorry, I don't get the second idea with only some information private to
namespace.

How do you want the TCP_INC_STATS macro to look?
In my concept, it would be something like
#define TCP_INC_STATS(field) SNMP_INC_STATS(current_net_ns->tcp_stat, field)
where tcp_stat is a TCP statistics array inside net_namespace.

Regards

Andrey
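The TCP_INC_STATS idea can be tried out in user space with a plain counter array per namespace. This is a simplified sketch: the real SNMP_INC_STATS works on per-CPU MIB structures, which are left out here, and every MODEL_*/model_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of TCP MIB fields. */
enum {
    MODEL_TCP_ACTIVEOPENS,
    MODEL_TCP_PASSIVEOPENS,
    MODEL_TCP_MIB_MAX
};

/* Each namespace owns its own statistics array. */
struct model_ns {
    unsigned long tcp_stat[MODEL_TCP_MIB_MAX];
};

/* Stand-in for current_net_ns. */
struct model_ns *model_current_ns;

/* Simplified: a plain increment instead of the kernel's per-CPU counters. */
#define MODEL_SNMP_INC_STATS(mib, field) ((mib)[(field)]++)

/* The namespace is picked up implicitly from the current context. */
#define MODEL_TCP_INC_STATS(field) \
    MODEL_SNMP_INC_STATS(model_current_ns->tcp_stat, field)

/* A protocol-level caller does not mention the namespace at all. */
void model_tcp_connect(void)
{
    assert(model_current_ns != NULL);
    MODEL_TCP_INC_STATS(MODEL_TCP_ACTIVEOPENS);
}
```

Call sites stay as they are today; only the macro definition changes, and each namespace ends up with independent counters.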
Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces
On Mon, Jun 26, 2006 at 04:56:46PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > Structures related to IPv4 routing (FIB and routing cache)
> > are made per-namespace.
>
> How do you handle ICMP_REDIRECT ?

Are you talking about routing cache entries created on incoming redirects?
Or outgoing redirects?

	Andrey
Re: [patch 4/4] Network namespaces: playing and debugging
On Mon, Jun 26, 2006 at 05:04:29PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > Temporary code to play with network namespaces in the simplest way.
> > Do
> > 	exec 7< /proc/net/net_ns
> > in your bash shell and you'll get a brand new network namespace.
> > There you can, for example, do
> > 	ip link set lo up
> > 	ip addr list
> > 	ip addr add 1.2.3.4 dev lo
> > 	ping -n 1.2.3.4
>
> Is it possible to setup a network device to communicate with the outside ?

Such a device was planned for the second patchset :)
I can perhaps send the patch tomorrow.

	Andrey
Re: [patch 1/4] Network namespaces: cleanup of dev_base list use
Hi Eric,

On Mon, Jun 26, 2006 at 09:13:52AM -0600, Eric W. Biederman wrote:
> Andrey Savochkin [EMAIL PROTECTED] writes:
>
> > Cleanup of dev_base list use, with the aim to make device list
> > per-namespace. In almost every occasion, use of dev_base variable and
> > dev->next pointer could be easily replaced by for_each_netdev loop.
> > A few most complicated places were converted to using
> > first_netdev()/next_netdev().
>
> As a proof of concept patch this is ok.
>
> As a real world patch this is much too big, which prevents review.
> Plus it takes a few actions that are more than replace just
> iterators through the device list.

dev_base list is historically not the cleanest part of Linux networking.
I've still spotted a place where the first device in dev_base list is assumed
to be loopback. In early days we had more, now only one place or two...

> In addition I suspect several if not all of these iterators
> can be replaced with the an appropriate helper function.
>
> The normal structure for a patch like this would be to
> introduce the new helper function. for_each_netdev.
> And then to replace all of the users while cc'ing the
> maintainers of those drivers. With each different
> driver being a different patch.
>
> There is another topic for discussion in this patch as well.
> How much of the context should be implicit and how much
> should be explicit.
>
> If the changes from netchannels had already been implemented, and all of
> the network processing was happening in a process context then I would
> trivially agree that implicit would be the way to go.

Why would we want all network processing to happen in a process context?

> However short of always having code always execute in the proper
> context I'm not comfortable with implicit parameters to functions.
> Not that this the contents of this patch should address this but the
> later patches should.

We just have too many layers in networking code, and FIB/routing
illustrates it well.

> When I went through this, my patchset just added an explicit
> continue if the devices was not in the appropriate namespace.
> I actually prefer the multiple list implementation but at
> the same time I think it is harder to get a clean implementation
> out of it.

Certainly, dev_base list reorganization is not the crucial point in network
namespaces. But it has to be done some way or other.
If people vote for a single list with skipping devices from a wrong
namespace, it's fine with me, I can re-make this patch.

I personally prefer per-namespace device list since we have too many places
in the kernel where this list is walked in a linear fashion,
and with many namespaces this list may become quite long.

Regards
	Andrey
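The two designs discussed above (one global list with an explicit "continue" for foreign devices, versus one list per namespace) can be compared with a small user-space sketch. The model_* names and the "visited" counter are hypothetical and exist only to make the cost of each walk visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device node; ns_id is used only by the single-list variant. */
struct model_dev {
    const char *name;
    int ns_id;
    struct model_dev *next;
};

/* Iterator in the spirit of for_each_netdev (sketch, not the kernel macro). */
#define model_for_each_netdev(head, d) \
    for ((d) = (head); (d) != NULL; (d) = (d)->next)

/* Single global list: every walker must skip devices of other namespaces.
 * *visited counts how many nodes were touched along the way. */
size_t model_count_global(const struct model_dev *dev_base, int ns_id,
                          size_t *visited)
{
    const struct model_dev *d;
    size_t n = 0;

    assert(visited != NULL);
    *visited = 0;
    model_for_each_netdev(dev_base, d) {
        (*visited)++;
        if (d->ns_id != ns_id)
            continue;
        n++;
    }
    return n;
}

/* Per-namespace list: the walker only ever sees its own devices. */
size_t model_count_private(const struct model_dev *ns_dev_base,
                           size_t *visited)
{
    const struct model_dev *d;
    size_t n = 0;

    assert(visited != NULL);
    *visited = 0;
    model_for_each_netdev(ns_dev_base, d) {
        (*visited)++;
        n++;
    }
    return n;
}
```

Both variants report the same devices for a namespace. They differ in how many nodes each walk touches, which is the concern raised above about linear walks once many namespaces exist.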
Re: [patch 4/4] Network namespaces: playing and debugging
On Mon, Jun 26, 2006 at 07:29:57PM +0200, Daniel Lezcano wrote:
> > Do
> > 	exec 7< /proc/net/net_ns
> > in your bash shell and you'll get a brand new network namespace.
> > There you can, for example, do
> > 	ip link set lo up
> > 	ip addr list
> > 	ip addr add 1.2.3.4 dev lo
> > 	ping -n 1.2.3.4
>
> Andrey,
>
> I began to play with your patchset. I am able to connect to 127.0.0.1
> from different namespaces. Is it the expected behavior ?
> Furthermore, I am not able to have several programs, running in
> different namespaces, to bind to the same INADDR_ANY:port.
>
> Will these features be included in the second patchset ?

Of course.
This patchset adds namespaces to routing code, which means that you can
define local IP addresses in each namespace independently.
But this first patchset doesn't include namespaces in socket lookup code.

	Andrey
Re: [patch 3/4] Network namespaces: IPv4 FIB/routing in namespaces
On Mon, Jun 26, 2006 at 05:57:01PM +0200, Daniel Lezcano wrote:
> Andrey Savochkin wrote:
> > On Mon, Jun 26, 2006 at 04:56:46PM +0200, Daniel Lezcano wrote:
> > > How do you handle ICMP_REDIRECT ?
> >
> > Are you talking about routing cache entries created on incoming
> > redirects? Or outgoing redirects?
>
> incoming redirects

They are inserted into routing cache with the current namespace tag, in the
same way as input routing cache entries.

	Andrey
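One way to picture "inserted into routing cache with the current namespace tag" is a shared hash table whose entries carry a namespace pointer that lookups compare along with the destination. The sketch below assumes that reading; the actual routing cache code is not part of this message, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_RT_HASH_SIZE 16u

/* Hypothetical cache entry tagged with the namespace that created it. */
struct model_rt_entry {
    const void *ns;
    unsigned daddr;
    unsigned gw;
    struct model_rt_entry *next;
};

/* One table shared by all namespaces. */
struct model_rt_entry *model_rt_hash[MODEL_RT_HASH_SIZE];

unsigned model_rt_hashfn(unsigned daddr)
{
    return (daddr ^ (daddr >> 16)) & (MODEL_RT_HASH_SIZE - 1u);
}

/* Insert an entry (e.g. learned from an incoming redirect) for namespace ns.
 * The caller provides the storage for the entry. */
void model_rt_insert(struct model_rt_entry *e, const void *ns,
                     unsigned daddr, unsigned gw)
{
    unsigned h = model_rt_hashfn(daddr);

    assert(e != NULL);
    e->ns = ns;
    e->daddr = daddr;
    e->gw = gw;
    e->next = model_rt_hash[h];
    model_rt_hash[h] = e;
}

/* A lookup only matches entries carrying the caller's namespace tag. */
const struct model_rt_entry *model_rt_lookup(const void *ns, unsigned daddr)
{
    const struct model_rt_entry *e;

    for (e = model_rt_hash[model_rt_hashfn(daddr)]; e != NULL; e = e->next) {
        if (e->ns == ns && e->daddr == daddr)
            return e;
    }
    return NULL;
}
```

A redirect learned in one namespace therefore never changes the route another namespace uses for the same destination.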
Re: [patch 1/4] Network namespaces: cleanup of dev_base list use
Eric,

On Mon, Jun 26, 2006 at 10:26:23AM -0600, Eric W. Biederman wrote:
> Andrey Savochkin [EMAIL PROTECTED] writes:
>
> > On Mon, Jun 26, 2006 at 09:13:52AM -0600, Eric W. Biederman wrote:
> > >
> > > There is another topic for discussion in this patch as well.
> > > How much of the context should be implicit and how much
> > > should be explicit.
> > >
> > > If the changes from netchannels had already been implemented, and
> > > all of the network processing was happening in a process context
> > > then I would trivially agree that implicit would be the way to go.

[snip]

> It is a big enough problem that I don't think we want to gate on
> that development but we need to be ready to take advantage of it when
> it happens.

Well, ok, implicit namespace reference will take advantage of it
if it happens.

> > > However short of always having code always execute in the proper
> > > context I'm not comfortable with implicit parameters to functions.
> > > Not that this the contents of this patch should address this but the
> > > later patches should.
> >
> > We just have too many layers in networking code, and FIB/routing
> > illustrates it well.
>
> I don't follow this comment. How does a lot of layers affect
> the choice of implicit or explicit parameters? If you are maintaining
> a patch outside the kernel I could see how there could be a win for
> touching the least amount of code possible but for merged code that
> you only have to go through once I don't see how the number of layers
> affects things.

I agree that implicit vs explicit parameters is a topic for discussion.
As you can see from my patch, I vote for implicit ones in this case :)

I was talking about layers because they imply changing more code,
and usually imply adding more parameters to functions and passing these
additional parameters to next layers.
In routing code it goes from routing entry points, to routing cache, to
general FIB functions, to table-specific code (FIB hash).

These additional parameters bloat the code to some extent.
Sometimes it's possible to save here and there by fetching the parameter
(namespace pointer) indirectly from structures you already have at hand,
but it can't be done universally.

One of the properties of an implicit argument which I especially like
is that both input and output paths are absolutely symmetric in how
the namespace pointer is extracted.

> As I recall for most of the FIB/routing code once you have removed
> the global variable accesses and introduce namespace checks in
> the hash table (because allocating hash tables at runtime isn't sane)
> the rest of the code was agnostic about what was going on. So I think
> you have touched everything that needs touching. So I don't see
> a code size or complexity argument there.

	Andrey
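The trade-off argued in this thread can be made concrete with a toy three-layer call chain written both ways. In the sketch the layers are only named after the ones listed above (entry point, general FIB function, table-specific code), the "result" is an arbitrary number derived from the namespace, and all model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical namespace carrying a single distinguishing value. */
struct model_ns {
    int table_id;
};

/* Stand-in for an implicit current-namespace pointer. */
struct model_ns *model_current_ns;

/* Explicit style: every layer takes the namespace and hands it down. */
int model_fib_hash_explicit(const struct model_ns *ns, unsigned dst)
{
    return ns->table_id * 1000 + (int)(dst & 0xffu);
}

int model_fib_explicit(const struct model_ns *ns, unsigned dst)
{
    return model_fib_hash_explicit(ns, dst);
}

int model_route_explicit(const struct model_ns *ns, unsigned dst)
{
    return model_fib_explicit(ns, dst);
}

/* Implicit style: only the innermost layer touches the namespace. */
int model_fib_hash_implicit(unsigned dst)
{
    assert(model_current_ns != NULL);
    return model_current_ns->table_id * 1000 + (int)(dst & 0xffu);
}

int model_fib_implicit(unsigned dst)
{
    return model_fib_hash_implicit(dst);
}

int model_route_implicit(unsigned dst)
{
    return model_fib_implicit(dst);
}
```

The explicit chain makes the dependency visible in every signature, at the price of one more parameter per layer. The implicit chain keeps signatures unchanged but is only correct if the caller has set the current context properly, which is exactly the concern raised about asynchronous paths.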