Re: Please pull 'upstream' branch of wireless-2.6
On Thu, Jan 18, 2007 at 10:10:47PM -0500, Jeff Garzik wrote:
> John W. Linville wrote:
> >The following changes since commit
> >10764889c6355cbb335cf0578ce12427475d1a65:
> >  Larry Finger (1):
> >bcm43xx: Fix failure to deliver PCI-E interrupts
> >
> >are found in the git repository at:
> >
> >  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git
> >  upstream
>
> ACK.  Open question of parentage, though:  I just rebased
> netdev-2.6.git#upstream.  Is your wireless-2.6 affected by this rebase?
>
> If not, I will go ahead and pull.

Right now it looks like this:

	Linus's tree -> my upstream-fixes branch -> my upstream branch

So, I think it should be fine for you to pull.

Thanks,

John
--
John W. Linville
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/12] L2 network namespace (v3)
On Friday 19 January 2007 10:27, Eric W. Biederman wrote:
> YOSHIFUJI Hideaki / 吉藤英明 <[EMAIL PROTECTED]> writes:
>
> > In article <[EMAIL PROTECTED]> (at Wed, 17 Jan 2007 18:51:14
> > +0300), Dmitry Mishin <[EMAIL PROTECTED]> says:
> >
> >> ===
> >> L2 network namespaces
> >>
> >> The most straightforward concept of network virtualization is complete
> >> separation of namespaces, covering device list, routing tables, netfilter
> >> tables, socket hashes, and everything else.
> >>
> >> On input path, each packet is tagged with namespace right from the
> >> place where it appears from a device, and is processed by each layer
> >> in the context of this namespace.
> >> Non-root namespaces communicate with the outside world in two ways: by
> >> owning hardware devices, or receiving packets forwarded to them by their
> >> parent namespace via pass-through device.
> >
> > Can you handle multicast / broadcast and IPv6, which are very important?
>
> The basic idea here is very simple.
>
> Each network namespace appears to user space as a separate network stack,
> with its own set of routing tables etc.
>
> All sockets and all network devices (the sources of packets) belong
> to exactly one network namespace.
>
> From the socket or the network device a packet enters the network stack
> you can infer the network namespace that it will be processed in.
> Each network namespace should get its own complement of the data structures
> necessary to process packets, and everything should work.
>
> Talking between namespaces is accomplished either through an external network,
> or through a special pseudo network device.  The simplest to implement
> is two network devices where all packets transmitted on one are received
> on the other.  Then by placing one network device in one namespace and
> the other in another interface it looks like two machines connected by
> a cross over cable.
>
> Once you have that in one namespace you can connect other namespaces
> with the existing ethernet bridging or by configuring one of the
> namespaces as a router and routing traffic between them.
>
>
> Supporting IPv6 is roughly as difficult as supporting IPv4.
>
> What needs to happen to convert code is all variables either need
> a per network namespace instance or the data structures need to be
> modified to have a network namespace tag.  For hash tables which
> are hard to allocate dynamically tagging is the preferred conversion
> method, for anything that is small enough duplication is preferred
> as it allows the existing logic to be kept.
>
> In the fast path the impact of all of the conversions should be very light,
> to non-existent.  In network stack initialization and cleanup there
> is work to do because you are initializing and cleaning up variables more
> often than at module insertion and removal.
>
> So my expectation is that once we get a framework established and merged
> to allow network namespaces eventually the entire network stack will be
> converted.  Not just ipv4 and ipv6 but decnet, ipx, iptables, fair scheduling,
> ethernet bridging and all of the other weird and twisty bits of the
> linux network stack.

Thanks, Eric, for such a descriptive comment. I can only sign off on it :)

>
> The primary practical hurdle is there is a lot of networking code in
> the kernel.
>
> I think I know a path by which we can incrementally merge support for
> network namespaces without breaking anything.  More to come on this
> when I finish up my demonstration patchset in a week or so that
> is complete enough to show what I am talking about.
>
> I hope this helps put the concept into perspective.

I'll be waiting for it.

>
> As for Dmitry's patchset in particular it currently does not support
> IPv6 and I don't know where it is with respect to the broadcast and
> multicast but I don't see any immediate problems that would preclude
> those from working.  But any incompleteness is exactly that
> incompleteness and an implementation problem not a fundamental design
> issue.

Broadcasts/multicasts are supported.

--
Thanks,
Dmitry.
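
Editorial illustration of the "tagging" conversion Eric describes for shared hash tables: a minimal userspace sketch, not kernel code. All names here (struct net_ns, struct sock_entry, sock_table_lookup and so on) are hypothetical. The only idea taken from the thread is that each entry carries a namespace tag and lookups compare it alongside the normal key.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a network namespace; not a kernel type. */
struct net_ns { int id; };

/* One entry in a shared table, tagged with the namespace that owns it. */
struct sock_entry {
	const struct net_ns *ns;	/* namespace tag */
	uint16_t port;			/* the ordinary lookup key */
	struct sock_entry *next;	/* hash chain */
};

#define HASH_SIZE 16

struct sock_table {
	struct sock_entry *bucket[HASH_SIZE];
};

static unsigned int port_hash(uint16_t port)
{
	return port % HASH_SIZE;
}

static void sock_table_insert(struct sock_table *t, struct sock_entry *e)
{
	unsigned int h = port_hash(e->port);

	e->next = t->bucket[h];
	t->bucket[h] = e;
}

/* A match needs both the key and the namespace tag to agree. */
static struct sock_entry *sock_table_lookup(const struct sock_table *t,
					    const struct net_ns *ns,
					    uint16_t port)
{
	struct sock_entry *e;

	for (e = t->bucket[port_hash(port)]; e; e = e->next)
		if (e->port == port && e->ns == ns)
			return e;
	return NULL;
}
```

With this layout two namespaces can both hold an entry for port 80 in the same table, and neither sees the other's. The table itself stays a single allocation, which is the reason given above for preferring tags over duplication for large hash tables.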
[PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?
On 17-01-2007 15:12, Michael Tokarev wrote:
> Herbert Xu wrote:
>> On Tue, Jan 16, 2007 at 11:08:51AM +0300, Michael Tokarev wrote:
>>> Ok.  Here's another trace, from that remote network that triggers
>>> this thing more-or-less reliably (every 2nd transfer at least) --
>>> http://www.corpit.ru/mjt/bh-bad-cksum-dmp.bin .  It's a full session
>>> between 216.168.29.244 - the requesting/receiving side -- and
>>> 81.13.94.6 -- our sending side (the file being transferred is some
>>> trojan horse I found on a friend's PC, so be careful ;)
>> I'll have a look at this tomorrow.
>>
>> Since you're certain that this is being seen on the wire, one
>> possibility is that we've got a bug somewhere that's zeroing
>> skb->ip_summed on a packet with a partial checksum.
>
> Here's another sample, which may be more useful.  I've seen quite
> a lot of very similar stuff while running tcpdump.
>
>  http://www.corpit.ru/mjt/bad-cksum-session3-dmp.bin
>
> The scenario looks like this.
>
> A client (82.84.172.37 -- a zombie machine trying to send us spam
> in this case) connects to a port 25 here (81.13.94.6:25).  SYN+ACK
> sequence completes.  Next, our server sends an initial SMTP greeting
> message, but almost right after that, the client sends a FIN packet,
> WITHOUT acknowledging that it received the (first and only) data
> packet.  So some time later our machine re-sends the data, AND adds
> FIN flag to the packet (also replying to the FIN received from the
> client).  And *that* packet - original data packet which is modified
> to also include FIN - has incorrect checksum.
>
> So it looks like the checksum isn't being updated WHEN ADDING MORE
> FLAGS to the original data packet.
>

Hi,

Here is my patch proposal. If I'm not totally wrong,
there is a possibility that, during collapsing, empty
skb with FIN is added to "normal" packet and changes
its ip_summed field to CHECKSUM_NONE.

Regards,
Jarek P.

PS: probably there are also other possibilities...

---

[PATCH][NET] tcp_output: rare bad TCP checksum with 2.6.19

The patch "Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE"
changed to unconditional copying of ip_summed field from collapsed
skb. This patch reverts this change.

All substantial work including heavy testing and diagnosing by:
Michael Tokarev <[EMAIL PROTECTED]>

Signed-off-by: Jarek Poplawski <[EMAIL PROTECTED]>
---

diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c linux-2.6.19/net/ipv4/tcp_output.c
--- linux-2.6.19-/net/ipv4/tcp_output.c	2006-11-29 22:57:37.0 +0100
+++ linux-2.6.19/net/ipv4/tcp_output.c	2007-01-19 07:58:39.0 +0100
@@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
 
 		memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
 
-		skb->ip_summed = next_skb->ip_summed;
+		if (next_skb->ip_summed == CHECKSUM_PARTIAL)
+			skb->ip_summed = CHECKSUM_PARTIAL;
 
 		if (skb->ip_summed != CHECKSUM_PARTIAL)
 			skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
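
Editorial illustration of the one-line change in the patch: a small userspace model, not kernel code. The enum and both function names are hypothetical stand-ins. It only contrasts the two assignment rules shown in the diff; it does not model the rest of the checksum path, so it does not by itself confirm that this is the mechanism behind the bad checksums.

```c
/* Hypothetical stand-ins for the two ip_summed states in the diff. */
enum csum_state { CSUM_NONE, CSUM_PARTIAL };

/* Rule removed by the patch: always take the state of the skb being
 * merged in, whatever the first skb had. */
static enum csum_state collapse_unconditional(enum csum_state skb,
					      enum csum_state next_skb)
{
	(void)skb;
	return next_skb;
}

/* Rule added by the patch: only ever promote to PARTIAL, never let the
 * merged-in skb downgrade the first one. */
static enum csum_state collapse_patched(enum csum_state skb,
					enum csum_state next_skb)
{
	if (next_skb == CSUM_PARTIAL)
		return CSUM_PARTIAL;
	return skb;
}
```

The case Jarek describes is a data skb marked PARTIAL followed by an empty FIN skb marked NONE. Under the first rule the merged skb ends up NONE; under the second it stays PARTIAL.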
Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?
Jarek Poplawski wrote:
> Here is my patch proposal. If I'm not totally wrong,
> there is a possibility that, during collapsing, empty
> skb with FIN is added to "normal" packet and changes
> its ip_summed field to CHECKSUM_NONE.
>
> diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c
> linux-2.6.19/net/ipv4/tcp_output.c
> --- linux-2.6.19-/net/ipv4/tcp_output.c	2006-11-29 22:57:37.0 +0100
> +++ linux-2.6.19/net/ipv4/tcp_output.c	2007-01-19 07:58:39.0 +0100
> @@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
>
>  		memcpy(skb_put(skb, next_skb_size), next_skb->data,
> next_skb_size);
>
> -		skb->ip_summed = next_skb->ip_summed;
> +		if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> +			skb->ip_summed = CHECKSUM_PARTIAL;
>
>  		if (skb->ip_summed != CHECKSUM_PARTIAL)
>  			skb->csum = csum_block_add(skb->csum, next_skb->csum,
> skb_size);
>

I noticed this too, but I can't see how it could lead to
a partial checksum on the wire since the checksumming is
done after changing ip_summed to CHECKSUM_NONE. Is this
patch verified to fix Michael's problem?
Re: [PATCH REPOST 1/2] NET: Accurate packet scheduling for ATM/ADSL (kernel)
Russell Stuart wrote:
> On Thu, 2007-01-18 at 12:37 +0100, Patrick McHardy wrote:
>
>>>Or are you proposing tc behave differently on different
>>>kernel versions.  (I have no problem with that, but
>>>isn't it officially frowned upon?)
>>
>>Yes. There is no way you can make this work on old kernels,
>>nobody expects that. The important part is that everything
>>continues to work as before and that both old and new iproute
>>binaries work properly on both old and new kernels (new
>>iproute on old kernels without STABs obviously).
>
>
> I thought that some degree of compatibility was
> expected.  At the very least the newest version
> of "tc" must work on _any_ kernel at least as
> well as the version it replaces did.
>
> I also thought newer kernels should work with older
> versions of iproute2, albeit without the features
> added in the newer versions.
>
> Are you saying this is not so?

No, that's exactly what I'm saying.
Re: TKIP encryption should allocate enough tailroom
On Thu, Jan 18, 2007 at 08:55:37AM -0500, Brandon Craig Rhodes wrote:
> to debugging messages!  In some circumstances, debug messages are
> always produced; in several others, net_ratelimit() is called to
> decide whether to print an error (but why in these cases and not
> others?); and in many cases, nothing is printed at all (is this
> because convention would dictate that the caller discover the error
> and print something out?).
>
> If I want to generate a patch that festoons the ieee80211 functions
> with informative error messages, what are the guidelines?

My understanding is:

BUG_ON() / BUG() if it's a clear "impossible" condition ("function
calling me was wrong"): null pointers, buffer lengths being
inconsistent. Might even be justified in this case?

net_ratelimit() says:

/*
 * All net warning printk()s should be guarded by this function.
 */
int net_ratelimit(void)
{
	return __printk_ratelimit(net_msg_cost, net_msg_burst);
}

Especially important if the code path can be triggered by anyone
(local user or arbitrary packet from the network). Otherwise it's not
that big a deal if it's buggy code elsewhere in the kernel that causes
the message to be printed. You fix the code and you stop getting
thousands of lines of debug messages/second (which is why
net_ratelimit() exists).

If it's an arbitrary packet from the network, there probably should
even be a sysctl to enable/disable debug output completely. IPv4 has:

static void ip_handle_martian_source(struct net_device *dev,
				     struct in_device *in_dev,
				     struct sk_buff *skb,
				     __be32 daddr,
				     __be32 saddr)
{
	RT_CACHE_STAT_INC(in_martian_src);
#ifdef CONFIG_IP_ROUTE_VERBOSE
	if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit()) {
		/*
		 *	RFC1812 recommendation, if source is martian,
		 *	the only hint is MAC header.
		 */
		printk(KERN_WARNING "martian source %u.%u.%u.%u from "
			"%u.%u.%u.%u, on dev %s\n",
			NIPQUAD(daddr), NIPQUAD(saddr), dev->name);
...

(so there's a #ifdef _and_ a log_martians sysctl to see debug output).

In general #ifdefs should be avoided.
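
Editorial illustration of what a rate-limit guard of this kind does: a minimal userspace token-bucket sketch. This is not the kernel's __printk_ratelimit() source; struct ratelimit, ratelimit_init() and ratelimit_allow() are hypothetical names. Passing the time in as a plain tick counter and starting with a full bucket are both assumptions made to keep the example deterministic.

```c
/* Hypothetical token-bucket state. Credit is measured in ticks. */
struct ratelimit {
	unsigned long cost;	/* ticks of credit one message consumes */
	unsigned long burst;	/* messages allowed back to back */
	unsigned long tokens;	/* credit accumulated so far */
	unsigned long last;	/* tick of the previous call */
};

static void ratelimit_init(struct ratelimit *rl, unsigned long cost,
			   unsigned long burst, unsigned long now)
{
	rl->cost = cost;
	rl->burst = burst;
	rl->tokens = cost * burst;	/* assumption: start with a full bucket */
	rl->last = now;
}

/* Returns 1 if the caller may print, 0 if the message should be dropped. */
static int ratelimit_allow(struct ratelimit *rl, unsigned long now)
{
	/* Earn one tick of credit per elapsed tick, capped at a full burst. */
	rl->tokens += now - rl->last;
	rl->last = now;
	if (rl->tokens > rl->cost * rl->burst)
		rl->tokens = rl->cost * rl->burst;

	if (rl->tokens >= rl->cost) {
		rl->tokens -= rl->cost;
		return 1;
	}
	return 0;
}
```

A caller would wrap each warning as "if (ratelimit_allow(&rl, now)) print(...)". A flood of bad packets then produces at most one burst of messages followed by one message per cost interval, which is the property that matters when the path can be triggered from the network.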
[ANNOUNCE] FYI: MultiTCP for linux-2.6.19
MultiTCP[1] is (yet another) Linux TCP patch intended for
researchers/developers, which can report TCP events in the kernel logs
in order to watch TCP internal variables. Furthermore, it includes TCP
Pacing and Hoe's initial ssthresh estimation[2]. The TCP-Hybla authors
strongly recommend both on satellite links: Pacing to mitigate
congestion episodes, and the ssthresh estimation to limit the initial
cwnd overshoot.

A new version for linux-2.6.19 has been released at
http://www.sf.net/projects/multitcp/

The Hybla authors' future goal is to produce official patches for both
Pacing and initial ssthresh estimation, by decoupling the latter's
implementation from the TCP kernel logs engine and slimming down the
current tcp_sock structure.

Meanwhile, we would appreciate it if comparative tests of Linux
congestion control included the two algorithms described above when
using TCP-Hybla.

[1] C. Caini, R. Firrincieli and D. Lacamera, "A Linux Based Multi TCP
    Implementation for Experimental Evaluation of TCP Enhancements",
    SPECTS 2005, Philadelphia, July 2005.
    C. Caini, R. Firrincieli, D. Lacamera, "An emulation approach for
    the evaluation of enhanced transport protocols performance in
    satellite networks", IEEE Globecom 2006 - Satellite and Space
    Communications, 27 November-1 December 2006, San Francisco, CA, USA.
[2] J. C. Hoe, "Improving the Start-up Behavior of a Congestion Control
    Scheme for TCP", ACM SIGCOMM 1996, pp. 270-280

Regards,

--
Daniele Lacamera
Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?
Jarek Poplawski wrote:
> On 17-01-2007 15:12, Michael Tokarev wrote:
[]
>> Here's another sample, which may be more useful.  I've seen quite
>> a lot of very similar stuff while running tcpdump.
>>
>>  http://www.corpit.ru/mjt/bad-cksum-session3-dmp.bin
>>
>> The scenario looks like this.
>>
>> A client (82.84.172.37 -- a zombie machine trying to send us spam
>> in this case) connects to a port 25 here (81.13.94.6:25).  SYN+ACK
>> sequence completes.  Next, our server sends an initial SMTP greeting
>> message, but almost right after that, the client sends a FIN packet,
>> WITHOUT acknowledging that it received the (first and only) data
>> packet.  So some time later our machine re-sends the data, AND adds
>> FIN flag to the packet (also replying to the FIN received from the
>> client).  And *that* packet - original data packet which is modified
>> to also include FIN - has incorrect checksum.
>>
>> So it looks like the checksum isn't being updated WHEN ADDING MORE
>> FLAGS to the original data packet.
>>
>
> Hi,
>
> Here is my patch proposal. If I'm not totally wrong,
> there is a possibility that, during collapsing, empty
> skb with FIN is added to "normal" packet and changes
> its ip_summed field to CHECKSUM_NONE.
>
> Regards,
> Jarek P.
>
> PS: probably there are also other possibilities...

Well.. I just tried it - with this patch applied, no more bad checksums
are shown.  Tried from the network that triggers it most reliably - and
wasn't able to reproduce the bad behavior.

I'm running a tcpdump right now, and so far it only captured a few bad-cksum
packets from other hosts (which are also running 2.6.19 ;)

Thanks Jarek!

/mjt
Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?
Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> Here is my patch proposal. If I'm not totally wrong,
>> there is a possibility that, during collapsing, empty
>> skb with FIN is added to "normal" packet and changes
>> its ip_summed field to CHECKSUM_NONE.
>>
>> diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c
>> linux-2.6.19/net/ipv4/tcp_output.c
>> --- linux-2.6.19-/net/ipv4/tcp_output.c	2006-11-29 22:57:37.0 +0100
>> +++ linux-2.6.19/net/ipv4/tcp_output.c	2007-01-19 07:58:39.0 +0100
>> @@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
>>
>>  		memcpy(skb_put(skb, next_skb_size), next_skb->data,
>> next_skb_size);
>>
>> -		skb->ip_summed = next_skb->ip_summed;
>> +		if (next_skb->ip_summed == CHECKSUM_PARTIAL)
>> +			skb->ip_summed = CHECKSUM_PARTIAL;
>>
>>  		if (skb->ip_summed != CHECKSUM_PARTIAL)
>>  			skb->csum = csum_block_add(skb->csum, next_skb->csum,
>> skb_size);
>>
>
> I noticed this too, but I can't see how it could lead to
> a partial checksum on the wire since the checksumming is
> done after changing ip_summed to CHECKSUM_NONE. Is this
> patch verified to fix Michael's problem?

It seems to fix this "my" problem, yes - at least I can't reproduce
it anymore.  Tcpdump is running however - let's see... :)

/mjt
bonding: sysfs patch broke module renaming
The sysfs patch broke using multiple instances of the bonding module
through module renaming (modprobe -o). In recent kernels it fails
with -EEXIST when trying to add the bonding_masters file for the
second time; in older kernels (where sysfs_add_file didn't check
for duplicates) it crashes when unloading the modules.

I don't see a good way to fix it; can someone please look into this?
Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?
On Fri, Jan 19, 2007 at 04:20:01PM +0300, Michael Tokarev wrote:
...
> Well.. I just tried it - with this patch applied, no more bad checksums
> are shown.  Tried from the network that triggers it most reliably - and
> wasn't able to reproduce the bad behavior.
>
> I'm running a tcpdump right now, and so far it only captured a few bad-cksum
> packets from other hosts (which are also running 2.6.19 ;)
>
> Thanks Jarek!

You are welcome! But you probably didn't read this carefully:
if it works, you should mainly thank that other guy...

Btw, I can't remember ever seeing such ferocious testing!

Cheers,
Jarek P.
Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?
On Fri, Jan 19, 2007 at 01:14:52PM +0100, Patrick McHardy wrote:
> Jarek Poplawski wrote:
> > Here is my patch proposal. If I'm not totally wrong,
> > there is a possibility that, during collapsing, empty
> > skb with FIN is added to "normal" packet and changes
> > its ip_summed field to CHECKSUM_NONE.
> >
> > diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c
> > linux-2.6.19/net/ipv4/tcp_output.c
> > --- linux-2.6.19-/net/ipv4/tcp_output.c	2006-11-29 22:57:37.0 +0100
> > +++ linux-2.6.19/net/ipv4/tcp_output.c	2007-01-19 07:58:39.0 +0100
> > @@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
> >
> >  		memcpy(skb_put(skb, next_skb_size), next_skb->data,
> > next_skb_size);
> >
> > -		skb->ip_summed = next_skb->ip_summed;
> > +		if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> > +			skb->ip_summed = CHECKSUM_PARTIAL;
> >
> >  		if (skb->ip_summed != CHECKSUM_PARTIAL)
> >  			skb->csum = csum_block_add(skb->csum, next_skb->csum,
> > skb_size);
> >
>
> I noticed this too, but I can't see how it could lead to
> a partial checksum on the wire since the checksumming is
> done after changing ip_summed to CHECKSUM_NONE. Is this
> patch verified to fix Michael's problem?

No, this was intended as a proposal for testing.
I didn't verify the whole checksum path here, but I guessed
such a change during the summing could matter (probably
for skb_copy_and_csum_dev and maybe earlier) and I couldn't
find a more suspicious change since 2.6.17 near these FINs.

But if it really works, it shouldn't be so hard to
verify the mechanism, I hope.

Jarek P.
[patch 01/12] net namespace : initialize init process to level 2
From: Daniel Lezcano <[EMAIL PROTECTED]>

Initialize the init's network namespace to level 2

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 net/core/net_namespace.c |    1 +
 1 file changed, 1 insertion(+)

Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -21,6 +21,7 @@
 	.dev_tail_p = &init_net_ns.dev_base_p,
 	.loopback_dev_p = NULL,
 	.pcpu_lstats_p = NULL,
+	.level = NET_NS_LEVEL2,
 };
 
 #ifdef CONFIG_NET_NS

--
[patch 00/12] net namespace : L3 namespace - introduction
This patchset provides network isolation similar to what
Linux-Vserver provides. It is based on the L2 namespaces and relies
on the mechanisms they provide.

The L3 namespace does not aim to bring full virtualization of the
network; it provides IP isolation which can be reused for
Linux-Vserver, jailed applications or application containers.

L3 namespaces are always children of an L2 namespace, and they cannot
create more network namespaces; furthermore, they lose their NET_ADMIN
capability. They share their parent's network resources.

From the parent namespace, IP addresses are created and assigned to
the different L3 children. From this point, L3 namespaces can use their
assigned IP address and all computed broadcast addresses.

Because the L3 namespace relies on the L2 virtualization mechanisms,
it is possible to have several L3 namespaces listening on
INADDR_ANY:port without conflict, which allows running several servers
without modifying the network configuration.

The loopback is a device shared between all L3 namespaces. To ensure
isolation of the 127.0.0.1 address, the sender stores its namespace in
the packet, so when the packet arrives the destination namespace is
already set, because "source" == "destination". This also makes it
easy to disable the loopback isolation and let an application talk
to applications outside of the namespace via 127.0.0.1 when we
consider them trusted (like portmap).

The ifconfig / ip commands will only show IP addresses assigned to
the L3 namespace.

When an L3 namespace dies, the assigned IP address is released to its
parent.

At the IP level, when a packet arrives, the destination L3 network
namespace is retrieved from the destination address.

At bind time, the address is checked against the assigned IP
address.

--
[patch 02/12] net namespace : store L2 parent namespace
From: Daniel Lezcano <[EMAIL PROTECTED]>

All L3 namespaces are the final nodes of the L2 namespaces tree.
Because they share some resources coming from the L2 namespace, the
L2 parent namespace should be stored into the L3 child when it is
created.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |    1 +
 net/core/net_namespace.c      |   11 +++
 2 files changed, 12 insertions(+)

Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -27,6 +27,7 @@
 #define NET_NS_LEVEL2	1
 #define NET_NS_LEVEL3	2
 	unsigned int		level;
+	struct net_namespace	*parent;
 };
 
 extern struct net_namespace init_net_ns;
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -22,6 +22,7 @@
 	.loopback_dev_p = NULL,
 	.pcpu_lstats_p = NULL,
 	.level = NET_NS_LEVEL2,
+	.parent = NULL,
 };
 
 #ifdef CONFIG_NET_NS
@@ -62,6 +63,12 @@
 		if (ip_fib_struct_init())
 			goto out_fib4;
 	}
+
+	if (level == NET_NS_LEVEL3) {
+		get_net_ns(old_ns);
+		ns->parent = old_ns;
+	}
+
 	ns->level = level;
 	if (loopback_init())
 		goto out_loopback;
@@ -126,8 +133,12 @@
 		       ns, atomic_read(&ns->kref.refcount));
 		return;
 	}
+
 	if (ns->level == NET_NS_LEVEL2)
 		ip_fib_struct_cleanup(ns);
+	if (ns->level == NET_NS_LEVEL3)
+		put_net_ns(ns->parent);
+
 	printk(KERN_DEBUG "NET_NS: net namespace %p destroyed\n", ns);
 	kfree(ns);
 }

--
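
Editorial illustration of the reference-counting pattern in this patch: a minimal userspace sketch, not kernel code. struct ns, ns_get(), ns_put() and ns_create() are hypothetical names, and a plain int counter replaces the kref the real structure uses. It shows the one rule the patch adds: an L3 child takes a reference on its L2 parent when created and drops it when destroyed.

```c
#include <stdlib.h>

/* The level values mirror the NET_NS_LEVEL2/NET_NS_LEVEL3 defines. */
enum { NS_LEVEL2 = 1, NS_LEVEL3 = 2 };

/* Hypothetical namespace object with a plain, non-atomic refcount. */
struct ns {
	int refcount;
	int level;
	struct ns *parent;	/* only set for L3 namespaces */
};

static struct ns *ns_get(struct ns *ns)
{
	ns->refcount++;
	return ns;
}

static void ns_put(struct ns *ns)
{
	if (--ns->refcount > 0)
		return;
	/* Last reference gone: release the hold on the parent first. */
	if (ns->level == NS_LEVEL3)
		ns_put(ns->parent);
	free(ns);
}

/* An L3 child pins the namespace that created it for its whole life. */
static struct ns *ns_create(int level, struct ns *creator)
{
	struct ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	ns->refcount = 1;
	ns->level = level;
	if (level == NS_LEVEL3)
		ns->parent = ns_get(creator);
	return ns;
}
```

The design point is that the parent cannot be freed while any child still points at it, so a child can follow its parent pointer without checking whether the parent is still alive.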
[patch 12/12] net namespace : Add broadcasting
From: Daniel Lezcano <[EMAIL PROTECTED]>

Broadcast packets should be delivered to l2 and all l3 childs

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |   11 +++
 net/core/net_namespace.c      |   27 +++
 net/ipv4/udp.c                |    3 ++-
 3 files changed, 40 insertions(+), 1 deletion(-)

Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -9,6 +9,7 @@
 
 struct in_ifaddr;
 struct sk_buff;
+struct sock;
 
 struct net_namespace {
 	struct kref		kref;
@@ -109,6 +110,9 @@
 
 extern void net_ns_tag_sk_buff(struct sk_buff *skb);
 
+extern int net_ns_sock_is_visible(const struct sock *sk,
+				  const struct net_namespace *net_ns);
+
 #define SELECT_SRC_ADDR net_ns_select_source_address
 
 #else /* CONFIG_NET_NS */
@@ -192,6 +196,13 @@
 {
 	;
 }
+
+static inline int net_ns_sock_is_visible(const struct sock *sk,
+					 const struct net_namespace *net_ns)
+{
+	return 1;
+}
+
 #define SELECT_SRC_ADDR inet_select_addr
 
 #endif /* !CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -17,6 +17,7 @@
 #include
 #include
+#include
 
 struct net_namespace init_net_ns = {
 	.kref = {
@@ -464,4 +465,30 @@
 	struct net_namespace *net_ns = current_net_ns;
 	skb->net_ns = net_ns;
 }
+
+/*
+ * This function checks if the socket is visible from the specified
+ * namespace. This is needed to ensure the broadcast and the multicast
+ * for multiple network namespace l2 and l3 to have the packets to be
+ * delivered. If we have a l3 namespace and its parent (l2 namespace)
+ * listening on a broadcast address, we should deliver the packet to
+ * both. That is done by the udp_v4_mcast_next function. But we should
+ * find a common point between sockets which are relatives to a
+ * namespace. The common point is they have the same parent in case
+ * of l3 network namespace.
+ * @sk : the socket to be checked
+ * @net_ns : the receiving network namespace
+ * Returns: 1 if the socket is visible by the namespace, 0 otherwise.
+ */
+int net_ns_sock_is_visible(const struct sock *sk,
+			   const struct net_namespace *net_ns)
+{
+	if (net_ns->level == NET_NS_LEVEL3)
+		net_ns = net_ns->parent;
+
+	if (sk->sk_net_ns->level == NET_NS_LEVEL3)
+		return sk->sk_net_ns->parent == net_ns;
+	else
+		return sk->sk_net_ns == net_ns;
+}
 #endif /* CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/ipv4/udp.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/udp.c
+++ 2.6.20-rc4-mm1/net/ipv4/udp.c
@@ -309,9 +309,10 @@
 		    (inet->dport != rmt_port && inet->dport)		||
 		    (inet->rcv_saddr && inet->rcv_saddr != loc_addr)	||
 		    ipv6_only_sock(s)					||
-		    !net_ns_match(sk->sk_net_ns, ns)			||
 		    (s->sk_bound_dev_if && s->sk_bound_dev_if != dif))
 			continue;
+		if (!net_ns_sock_is_visible(sk, ns))
+			continue;
 		if (!ip_mc_sf_allow(s, loc_addr, rmt_addr, dif))
 			continue;
 		goto found;

--
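
Editorial illustration of the visibility rule in net_ns_sock_is_visible(): a userspace model, not kernel code. struct ns, l2_owner() and sock_is_visible() are hypothetical names. The check in the patch amounts to reducing both namespaces to their owning L2 namespace and comparing the results.

```c
#include <stddef.h>

/* The level values mirror the NET_NS_LEVEL2/NET_NS_LEVEL3 defines. */
enum ns_level { NS_LEVEL2 = 1, NS_LEVEL3 = 2 };

/* Hypothetical, stripped-down namespace: just a level and a parent. */
struct ns {
	enum ns_level level;
	const struct ns *parent;	/* NULL for L2 namespaces */
};

/* The L2 namespace a namespace belongs to: itself if it is L2,
 * otherwise its parent. */
static const struct ns *l2_owner(const struct ns *ns)
{
	return ns->level == NS_LEVEL3 ? ns->parent : ns;
}

/* A socket is visible to a receiving namespace when both reduce to
 * the same L2 namespace. */
static int sock_is_visible(const struct ns *sock_ns, const struct ns *rx_ns)
{
	return l2_owner(sock_ns) == l2_owner(rx_ns);
}
```

Under this rule an L2 namespace and all of its L3 children see each other's sockets for broadcast delivery, while sockets under a different L2 namespace stay hidden.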
[patch 08/12] net namespace : find namespace by addr
From: Daniel Lezcano <[EMAIL PROTECTED]>

Switch to the l3 namespace using the destination address.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |    7 +++
 net/core/net_namespace.c      |   35 +++
 net/ipv4/ip_input.c           |   16 +++-
 3 files changed, 57 insertions(+), 1 deletion(-)

Index: 2.6.20-rc4-mm1/net/ipv4/ip_input.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/ip_input.c
+++ 2.6.20-rc4-mm1/net/ipv4/ip_input.c
@@ -374,6 +374,9 @@
 {
 	struct iphdr *iph;
 	u32 len;
+	int err;
+	struct net_namespace *net_ns = current_net_ns;
+	struct net_namespace *dst_net_ns = NULL;
 
 	/* When the interface is in promisc. mode, drop all the crap
 	 * that it receives, do not try to analyse it.
@@ -393,6 +396,9 @@
 
 	iph = skb->nh.iph;
 
+	dst_net_ns = net_ns_find_from_dest_addr(iph->daddr);
+	if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
+		push_net_ns(dst_net_ns);
 	/*
 	 *	RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails the checksum.
 	 *
@@ -431,10 +437,18 @@
 	/* Remove any debris in the socket control block */
 	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
 
-	return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
+	err = NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
 		       ip_rcv_finish);
 
+	if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
+		pop_net_ns(net_ns);
+
+	return err;
+
 inhdr_error:
+	if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
+		pop_net_ns(net_ns);
+
 	IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
 drop:
 	kfree_skb(skb);
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -99,6 +99,8 @@
 extern __be32 net_ns_select_source_address(const struct net_device *dev,
 					   u32 dst, int scope);
 
+extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr);
+
 #define SELECT_SRC_ADDR net_ns_select_source_address
 
 #else /* CONFIG_NET_NS */
@@ -167,6 +169,11 @@
 	return 0;
 }
 
+static inline struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
+{
+	return NULL;
+}
+
 #define SELECT_SRC_ADDR inet_select_addr
 
 #endif /* !CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -385,4 +385,39 @@
 out:
 	return addr;
 }
+
+/*
+ * This function finds the network namespace destination deduced from
+ * the destination address. The network namespace is retrieved from
+ * the ifaddr owned by a network namespace
+ * @daddr : destination
+ * Returns : the network namespace destination or NULL if not found
+ */
+struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
+{
+	struct net_namespace *net_ns = NULL;
+	struct net_device *dev;
+	struct in_device *in_dev;
+
+	if (LOOPBACK(daddr))
+		return current_net_ns;
+
+	read_lock(&dev_base_lock);
+	rcu_read_lock();
+	for (dev = dev_base; dev; dev = dev->next) {
+		if ((in_dev = __in_dev_get_rcu(dev)) == NULL)
+			continue;
+		for_ifa(in_dev) {
+			if (ifa->ifa_local == daddr) {
+				net_ns = ifa->ifa_net_ns;
+				goto out_unlock_both;
+			}
+		} endfor_ifa(in_dev);
+	}
+out_unlock_both:
+	read_unlock(&dev_base_lock);
+	rcu_read_unlock();
+
+	return net_ns;
+}
 #endif /* CONFIG_NET_NS */

--
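
Editorial illustration of the lookup performed by net_ns_find_from_dest_addr(): a userspace model, not kernel code. All type and function names are hypothetical, and locking is left out entirely. Addresses are plain host-byte-order integers here, which is an assumption of the sketch; the kernel keeps them in network byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the namespace, address and device types. */
struct ns { int id; };

struct ifaddr {
	uint32_t local;			/* address, host byte order here */
	const struct ns *owner;		/* namespace the address is assigned to */
	struct ifaddr *next;
};

struct dev {
	struct ifaddr *ifa_list;
	struct dev *next;
};

/* 127.0.0.0/8, with the address in host byte order. */
static int is_loopback(uint32_t addr)
{
	return (addr & 0xff000000u) == 0x7f000000u;
}

/* Walk every address on every device; the first address equal to the
 * destination decides the namespace. Loopback destinations stay in the
 * caller's namespace. Returns NULL when no address matches. */
static const struct ns *find_ns_by_daddr(const struct dev *devs,
					 const struct ns *current,
					 uint32_t daddr)
{
	const struct dev *d;
	const struct ifaddr *ifa;

	if (is_loopback(daddr))
		return current;

	for (d = devs; d; d = d->next)
		for (ifa = d->ifa_list; ifa; ifa = ifa->next)
			if (ifa->local == daddr)
				return ifa->owner;
	return NULL;
}
```

The lookup is a linear scan over every address on every device for each incoming packet, so its cost grows with the number of addresses handed out to L3 namespaces.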
[patch 09/12] net namespace : make loopback address always visible
From: Daniel Lezcano <[EMAIL PROTECTED]>

Add a specific condition when doing inet interface listing in order to
always see the loopback address.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |    9 +
 net/core/net_namespace.c      |   22 ++
 net/ipv4/devinet.c            |   12 +---
 3 files changed, 36 insertions(+), 7 deletions(-)

Index: 2.6.20-rc4-mm1/net/ipv4/devinet.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/devinet.c
+++ 2.6.20-rc4-mm1/net/ipv4/devinet.c
@@ -695,8 +695,7 @@
 			for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 			     ifap = &ifa->ifa_next) {
 				if (!strcmp(ifr.ifr_name, ifa->ifa_label) &&
-				    net_ns_match(ifa->ifa_net_ns,
-						 current_net_ns) &&
+				    net_ns_ifa_is_visible(ifa) &&
 				    sin_orig.sin_addr.s_addr ==
 							ifa->ifa_address) {
 					break; /* found */
@@ -710,13 +709,12 @@
 			for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 			     ifap = &ifa->ifa_next)
 				if (!strcmp(ifr.ifr_name, ifa->ifa_label) &&
-				    net_ns_match(ifa->ifa_net_ns,
-						 current_net_ns))
+				    net_ns_ifa_is_visible(ifa))
 					break;
 		}
 	}

-	if (ifa && !net_ns_match(ifa->ifa_net_ns, current_net_ns))
+	if (ifa && !net_ns_ifa_is_visible(ifa))
 		goto done;

 	ret = -EADDRNOTAVAIL;
@@ -868,7 +866,7 @@
 		goto out;

 	for (; ifa; ifa = ifa->ifa_next) {
-		if (!net_ns_match(ifa->ifa_net_ns, current_net_ns))
+		if (!net_ns_ifa_is_visible(ifa))
 			continue;
 		if (!buf) {
 			done += sizeof(ifr);
@@ -1216,7 +1214,7 @@

 		for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
 		     ifa = ifa->ifa_next, ip_idx++) {
-			if (!net_ns_match(ifa->ifa_net_ns, current_net_ns))
+			if (!net_ns_ifa_is_visible(ifa))
 				continue;
 			if (ip_idx < s_ip_idx)
 				continue;
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -7,6 +7,8 @@
 #include
 #include

+struct in_ifaddr;
+
 struct net_namespace {
 	struct kref		kref;
 	struct net_device	*dev_base_p, **dev_tail_p;
@@ -101,6 +103,8 @@

 extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr);

+extern int net_ns_ifa_is_visible(const struct in_ifaddr *ifa);
+
 #define SELECT_SRC_ADDR net_ns_select_source_address

 #else /* CONFIG_NET_NS */
@@ -174,6 +178,11 @@
 	return NULL;
 }

+static inline int net_ns_ifa_is_visible(const struct in_ifaddr *ifa)
+{
+	return 1;
+}
+
 #define SELECT_SRC_ADDR inet_select_addr

 #endif /* !CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -420,4 +420,26 @@

 	return net_ns;
 }
+
+/*
+ * This function checks if the ifaddr is visible from the
+ * current network namespace. This is true if the ifaddr is
+ * the loopback address or if the ifaddr is owned by the network
+ * namespace.
+ * @ifa : the ifaddr
+ * Returns : 1 if visible, 0 otherwise
+ */
+int net_ns_ifa_is_visible(const struct in_ifaddr *ifa)
+{
+	struct net_namespace *net_ns = current_net_ns;
+
+	if (LOOPBACK(ifa->ifa_local))
+		return 1;
+
+	if (net_ns_match(ifa->ifa_net_ns, net_ns))
+		return 1;
+
+	return 0;
+}
+
 #endif /* CONFIG_NET_NS */
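The visibility rule from this patch, reduced to a userspace sketch. Everything here is illustrative: `struct ns`, `struct ifaddr_entry`, `ifa_is_visible()` and `count_visible()` are hypothetical names, addresses are in host byte order, and a plain pointer comparison is assumed in place of `net_ns_match()`, whose definition is not part of this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net_namespace. */
struct ns {
	int id;
};

/* Hypothetical stand-in for struct in_ifaddr. */
struct ifaddr_entry {
	uint32_t local;         /* IPv4 address, host byte order */
	const struct ns *owner; /* namespace owning the address */
};

/*
 * An address is visible when it is a loopback address (127.0.0.0/8)
 * or when it is owned by the namespace doing the listing.
 */
static int ifa_is_visible(const struct ifaddr_entry *ifa,
			  const struct ns *current)
{
	assert(ifa != NULL);

	if ((ifa->local & 0xff000000u) == 0x7f000000u)
		return 1;

	return ifa->owner == current; /* assumption: match == same pointer */
}

/* Count the entries an interface listing would report for 'current'. */
static size_t count_visible(const struct ifaddr_entry *tbl, size_t n,
			    const struct ns *current)
{
	size_t i, shown = 0;

	for (i = 0; i < n; i++)
		if (ifa_is_visible(&tbl[i], current))
			shown++;

	return shown;
}
```

With one loopback address owned by the initial namespace and one ordinary address per child namespace, each child would list two entries: its own address plus the loopback one.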
[patch 04/12] net namespace : isolate the inet device.
From: Daniel Lezcano <[EMAIL PROTECTED]> ip and ifconfig commands will not show ip addr not belonging to the current network namespace. Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/inetdevice.h |1 + net/ipv4/devinet.c | 22 +- 2 files changed, 22 insertions(+), 1 deletion(-) Index: 2.6.20-rc4-mm1/include/linux/inetdevice.h === --- 2.6.20-rc4-mm1.orig/include/linux/inetdevice.h +++ 2.6.20-rc4-mm1/include/linux/inetdevice.h @@ -99,6 +99,7 @@ unsigned char ifa_flags; unsigned char ifa_prefixlen; charifa_label[IFNAMSIZ]; + struct net_namespace*ifa_net_ns; }; extern int register_inetaddr_notifier(struct notifier_block *nb); Index: 2.6.20-rc4-mm1/net/ipv4/devinet.c === --- 2.6.20-rc4-mm1.orig/net/ipv4/devinet.c +++ 2.6.20-rc4-mm1/net/ipv4/devinet.c @@ -53,6 +53,7 @@ #include #include #include +#include #ifdef CONFIG_SYSCTL #include #endif @@ -269,6 +270,7 @@ if (!(ifa->ifa_flags & IFA_F_SECONDARY) || ifa1->ifa_mask != ifa->ifa_mask || + !net_ns_match(ifa->ifa_net_ns, ifa1->ifa_net_ns) || !inet_ifa_match(ifa1->ifa_address, ifa)) { ifap1 = &ifa->ifa_next; prev_prom = ifa; @@ -471,6 +473,9 @@ for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL; ifap = &ifa->ifa_next) { + if (!net_ns_match(ifa->ifa_net_ns, current_net_ns)) + continue; + if (tb[IFA_LOCAL] && ifa->ifa_local != nla_get_be32(tb[IFA_LOCAL])) continue; @@ -544,6 +549,7 @@ ifa->ifa_flags = ifm->ifa_flags; ifa->ifa_scope = ifm->ifa_scope; ifa->ifa_dev = in_dev; + ifa->ifa_net_ns = current_net_ns; ifa->ifa_local = nla_get_be32(tb[IFA_LOCAL]); ifa->ifa_address = nla_get_be32(tb[IFA_ADDRESS]); @@ -689,6 +695,8 @@ for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL; ifap = &ifa->ifa_next) { if (!strcmp(ifr.ifr_name, ifa->ifa_label) && + net_ns_match(ifa->ifa_net_ns, +current_net_ns) && sin_orig.sin_addr.s_addr == ifa->ifa_address) { break; /* found */ @@ -701,11 +709,16 @@ if (!ifa) { for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL; ifap = &ifa->ifa_next) - if (!strcmp(ifr.ifr_name, 
ifa->ifa_label)) + if (!strcmp(ifr.ifr_name, ifa->ifa_label) && +net_ns_match(ifa->ifa_net_ns, + current_net_ns)) break; } } + if (ifa && !net_ns_match(ifa->ifa_net_ns, current_net_ns)) + goto done; + ret = -EADDRNOTAVAIL; if (!ifa && cmd != SIOCSIFADDR && cmd != SIOCSIFFLAGS) goto done; @@ -749,6 +762,8 @@ ret = -ENOBUFS; if ((ifa = inet_alloc_ifa()) == NULL) break; + + ifa->ifa_net_ns = current_net_ns; if (colon) memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ); else @@ -853,6 +868,8 @@ goto out; for (; ifa; ifa = ifa->ifa_next) { + if (!net_ns_match(ifa->ifa_net_ns, current_net_ns)) + continue; if (!buf) { done += sizeof(ifr); continue; @@ -1086,6 +1103,7 @@ in_dev_hold(in_dev); ifa->ifa_dev = in_dev; ifa->ifa_scope = RT_SCOPE_HOST; + ifa->ifa_net_ns = current_net_ns; memcpy(ifa->ifa_label, dev->name, IFNAMSIZ); inet_insert_ifa(ifa); } @@ -1198,6 +1216,8 @@ for (ifa = in_dev->ifa_list, ip_idx = 0; ifa; ifa = ifa->ifa_next, ip_idx++) { + if (!net_ns_match(ifa->ifa_net_ns, current_net_ns)) + continue; if (ip_idx < s_ip_idx) continue; if (inet_fill_ifaddr(skb, ifa, NETLINK_
[patch 03/12] net namespace : share network ressources L2 with L3
From: Daniel Lezcano <[EMAIL PROTECTED]> L3 namespace will use routes and devices belonging to its parent, so the old network namespace structure is copied when allocating a new one. By this way, hash value, dev list, routes are accessible from the L3 namespaces. In case of L2 namespace, these values are overwritten by the newly allocated values. Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/net_namespace.h | 14 ++ net/core/dev.c|4 ++-- net/core/net_namespace.c | 33 ++--- 3 files changed, 34 insertions(+), 17 deletions(-) Index: 2.6.20-rc4-mm1/net/core/net_namespace.c === --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c +++ 2.6.20-rc4-mm1/net/core/net_namespace.c @@ -37,7 +37,7 @@ * Return ERR_PTR on error, new ns otherwise */ static struct net_namespace *clone_net_ns(unsigned int level, - struct net_namespace *old_ns) + struct net_namespace *old_ns) { struct net_namespace *ns; @@ -45,23 +45,26 @@ if (current_net_ns->level == NET_NS_LEVEL3) return ERR_PTR(-EPERM); - ns = kzalloc(sizeof(struct net_namespace), GFP_KERNEL); + ns = kmemdup(old_ns, sizeof(struct net_namespace), GFP_KERNEL); if (!ns) return NULL; kref_init(&ns->kref); - ns->dev_base_p = NULL; - ns->dev_tail_p = &ns->dev_base_p; - ns->hash = net_random(); - if ((push_net_ns(ns)) != old_ns) + BUG(); if (level == NET_NS_LEVEL2) { + ns->dev_base_p = NULL; + ns->dev_tail_p = &ns->dev_base_p; + ns->hash = net_random(); + #ifdef CONFIG_IP_MULTIPLE_TABLES INIT_LIST_HEAD(&ns->fib_rules_ops_list); #endif if (ip_fib_struct_init()) goto out_fib4; + if (loopback_init()) + goto out_loopback; } if (level == NET_NS_LEVEL3) { @@ -70,8 +73,6 @@ } ns->level = level; - if (loopback_init()) - goto out_loopback; pop_net_ns(old_ns); printk(KERN_DEBUG "NET_NS: created new netcontext %p, level %u, " "for %s (pid=%d)\n", ns, (ns->level == NET_NS_LEVEL2) ? 
@@ -127,15 +128,17 @@ struct net_namespace *ns; ns = container_of(kref, struct net_namespace, kref); - unregister_netdev(ns->loopback_dev_p); - if (ns->dev_base_p != NULL) { - printk("NET_NS: BUG: namespace %p has devices! ref %d\n", - ns, atomic_read(&ns->kref.refcount)); - return; - } - if (ns->level == NET_NS_LEVEL2) + if (ns->level == NET_NS_LEVEL2) { ip_fib_struct_cleanup(ns); + unregister_netdev(ns->loopback_dev_p); + if (ns->dev_base_p != NULL) { + printk("NET_NS: BUG: namespace %p has devices! ref %d\n", + ns, atomic_read(&ns->kref.refcount)); + return; + } + } + if (ns->level == NET_NS_LEVEL3) put_net_ns(ns->parent); Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h === --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h @@ -56,6 +56,15 @@ DECLARE_PER_CPU(struct net_namespace *, exec_net_ns); #define current_net_ns (__get_cpu_var(exec_net_ns)) +static inline struct net_namespace *net_ns_l2(void) +{ + struct net_namespace *net_ns = current_net_ns; + + if (net_ns->level == NET_NS_LEVEL3) + return net_ns->parent; + return net_ns; +} + static inline void init_current_net_ns(int cpu) { get_net_ns(&init_net_ns); @@ -110,6 +119,11 @@ #define current_net_ns NULL +static inline struct net_namespace *net_ns_l2(void) +{ + return NULL; +} + static inline void init_current_net_ns(int cpu) { } Index: 2.6.20-rc4-mm1/net/core/dev.c === --- 2.6.20-rc4-mm1.orig/net/core/dev.c +++ 2.6.20-rc4-mm1/net/core/dev.c @@ -485,7 +485,7 @@ struct net_device *__dev_get_by_name(const char *name) { struct hlist_node *p; - struct net_namespace *ns = current_net_ns; + struct net_namespace *ns = net_ns_l2(); hlist_for_each(p, dev_name_hash(name, ns)) { struct net_device *dev @@ -768,7 +768,7 @@ if (!err) { hlist_del(&dev->name_hlist); hlist_add_head(&dev->name_hlist, dev_name_hash(dev->name, - current_net_ns)); + net_ns
[patch 06/12] net namespace : check bind address
From: Daniel Lezcano <[EMAIL PROTECTED]>

Check the bind address is allowed. It must match ifaddr assigned to the
namespace and all derivative addresses.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |    7 +
 net/core/net_namespace.c      |   54 ++
 net/ipv4/af_inet.c            |    2 +
 net/ipv4/raw.c                |    3 ++
 4 files changed, 66 insertions(+)

Index: 2.6.20-rc4-mm1/net/ipv4/af_inet.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/af_inet.c
+++ 2.6.20-rc4-mm1/net/ipv4/af_inet.c
@@ -433,6 +433,8 @@
 	 *  is temporarily down)
 	 */
 	err = -EADDRNOTAVAIL;
+	if (net_ns_check_bind(chk_addr_ret, addr->sin_addr.s_addr))
+		goto out;
 	if (!sysctl_ip_nonlocal_bind &&
 	    !inet->freebind &&
 	    addr->sin_addr.s_addr != INADDR_ANY &&
Index: 2.6.20-rc4-mm1/net/ipv4/raw.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/raw.c
+++ 2.6.20-rc4-mm1/net/ipv4/raw.c
@@ -559,7 +559,10 @@

 	if (sk->sk_state != TCP_CLOSE || addr_len < sizeof(struct sockaddr_in))
 		goto out;
 	chk_addr_ret = inet_addr_type(addr->sin_addr.s_addr);
+	ret = -EADDRNOTAVAIL;
+	if (net_ns_check_bind(chk_addr_ret, addr->sin_addr.s_addr))
+		goto out;
 	if (addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL &&
 	    chk_addr_ret != RTN_MULTICAST && chk_addr_ret != RTN_BROADCAST)
 		goto out;
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -93,6 +93,8 @@

 extern int net_ns_ioctl(unsigned int cmd, void __user *arg);

+extern int net_ns_check_bind(int addr_type, u32 addr);
+
 #else /* CONFIG_NET_NS */

 #define INIT_NET_NS(net_ns)
@@ -148,6 +150,11 @@
 	return -ENOSYS;
 }

+static inline int net_ns_check_bind(int addr_type, u32 addr)
+{
+	return 0;
+}
+
 #endif /* !CONFIG_NET_NS */

 #endif /* _LINUX_NET_NAMESPACE_H */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -263,4 +263,58 @@
 	return err;
 }

+/*
+ * This function check if the specified bind address is allowed.
+ * The bind is allowed if the address is:
+ *   - 127.0.0.1
+ *   - INADDR_ANY
+ *   - INADDR_BROADCAST
+ *   - a multicast address
+ *   - the specified address match an ifaddr owned by the current
+ *     network namespace. That implies the local address and the
+ *     computed address from the netmask
+ * @addr_type : an addr type
+ * @addr : the requested bind address
+ * Returns: -EPERM on failure, 0 on success
+ */
+int net_ns_check_bind(int addr_type, u32 addr)
+{
+	int ret = -EPERM;
+	struct net_device *dev;
+	struct in_device *in_dev;
+	struct net_namespace *net_ns = current_net_ns;
+
+	if (LOOPBACK(addr) ||
+	    MULTICAST(addr) ||
+	    INADDR_ANY == addr ||
+	    INADDR_BROADCAST == addr)
+		return 0;
+
+	read_lock(&dev_base_lock);
+	rcu_read_lock();
+	for (dev = dev_base; dev; dev = dev->next) {
+		in_dev = __in_dev_get_rcu(dev);
+		if (!in_dev)
+			continue;
+
+		for_ifa(in_dev) {
+			if (ifa->ifa_net_ns != net_ns)
+				continue;
+			if (addr == ifa->ifa_local ||
+			    addr == ifa->ifa_broadcast ||
+			    addr == (ifa->ifa_local & ifa->ifa_mask) ||
+			    addr == ((ifa->ifa_address & ifa->ifa_mask)|
+				     ~ifa->ifa_mask)) {
+				ret = 0;
+				goto out;
+			}
+		} endfor_ifa(in_dev);
+	}
+out:
+	read_unlock(&dev_base_lock);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 #endif /* CONFIG_NET_NS */
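To make the "derivative addresses" concrete, here is a small userspace sketch of the same test. The names (`struct ifaddr_entry`, `bind_allowed()`, the integer `ns_id`) are hypothetical, addresses are in host byte order, and the function returns a boolean rather than the 0 / -EPERM convention of `net_ns_check_bind()`. For an ifaddr 192.168.1.10/24 the accepted addresses would be 192.168.1.10 itself, the configured broadcast, the network address 192.168.1.0 (`local & mask`) and the directed broadcast 192.168.1.255 (`(address & mask) | ~mask`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of struct in_ifaddr (host byte order). */
struct ifaddr_entry {
	uint32_t local;     /* ifa_local */
	uint32_t address;   /* ifa_address */
	uint32_t mask;      /* ifa_mask */
	uint32_t broadcast; /* ifa_broadcast */
	int ns_id;          /* stands in for ifa_net_ns */
};

static int is_loopback(uint32_t a)  { return (a & 0xff000000u) == 0x7f000000u; }
static int is_multicast(uint32_t a) { return (a & 0xf0000000u) == 0xe0000000u; }

/* Return 1 when namespace ns_id may bind to addr, 0 otherwise. */
static int bind_allowed(const struct ifaddr_entry *tbl, size_t n,
			int ns_id, uint32_t addr)
{
	size_t i;

	assert(tbl != NULL || n == 0);

	/* Addresses every namespace may use. */
	if (is_loopback(addr) || is_multicast(addr) ||
	    addr == 0x00000000u /* INADDR_ANY */ ||
	    addr == 0xffffffffu /* INADDR_BROADCAST */)
		return 1;

	for (i = 0; i < n; i++) {
		const struct ifaddr_entry *ifa = &tbl[i];

		if (ifa->ns_id != ns_id)
			continue;
		if (addr == ifa->local ||
		    addr == ifa->broadcast ||
		    addr == (ifa->local & ifa->mask) ||
		    addr == ((ifa->address & ifa->mask) | ~ifa->mask))
			return 1;
	}

	return 0;
}
```

Note that the lookup is keyed on ownership first: an address that exists on the machine but belongs to another namespace is rejected just like an address that does not exist at all.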
[patch 07/12] net namespace: set source addresse
From: Daniel Lezcano <[EMAIL PROTECTED]> When no source address is specified, search from the dev list the ifaddr allowed to be used as source address. Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/net_namespace.h | 14 net/core/net_namespace.c | 68 ++ net/ipv4/route.c | 28 +++-- 3 files changed, 100 insertions(+), 10 deletions(-) Index: 2.6.20-rc4-mm1/net/ipv4/route.c === --- 2.6.20-rc4-mm1.orig/net/ipv4/route.c +++ 2.6.20-rc4-mm1/net/ipv4/route.c @@ -2475,17 +2475,17 @@ if (LOCAL_MCAST(oldflp->fl4_dst) || oldflp->fl4_dst == htonl(0x)) { if (!fl.fl4_src) - fl.fl4_src = inet_select_addr(dev_out, 0, - RT_SCOPE_LINK); + fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0, +RT_SCOPE_LINK); goto make_route; } if (!fl.fl4_src) { if (MULTICAST(oldflp->fl4_dst)) - fl.fl4_src = inet_select_addr(dev_out, 0, - fl.fl4_scope); + fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0, +fl.fl4_scope); else if (!oldflp->fl4_dst) - fl.fl4_src = inet_select_addr(dev_out, 0, - RT_SCOPE_HOST); + fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0, +RT_SCOPE_HOST); } } @@ -2525,8 +2525,8 @@ */ if (fl.fl4_src == 0) - fl.fl4_src = inet_select_addr(dev_out, 0, - RT_SCOPE_LINK); + fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0, +RT_SCOPE_LINK); res.type = RTN_UNICAST; goto make_route; } @@ -2539,7 +2539,13 @@ if (res.type == RTN_LOCAL) { if (!fl.fl4_src) +#ifdef CONFIG_NET_NS + fl.fl4_src = net_ns_select_source_address(dev_out, + fl.fl4_dst, + RT_SCOPE_LINK); +#else fl.fl4_src = fl.fl4_dst; +#endif if (dev_out) dev_put(dev_out); dev_out = &loopback_dev; @@ -2561,8 +2567,10 @@ fib_select_default(&fl, &res); if (!fl.fl4_src) - fl.fl4_src = FIB_RES_PREFSRC(res); - + fl.fl4_src = res.fi->fib_prefsrc ? 
: + SELECT_SRC_ADDR(FIB_RES_DEV(res), + FIB_RES_GW(res), + res.scope); if (dev_out) dev_put(dev_out); dev_out = FIB_RES_DEV(res); Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h === --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h @@ -5,6 +5,7 @@ #include #include #include +#include struct net_namespace { struct kref kref; @@ -95,6 +96,11 @@ extern int net_ns_check_bind(int addr_type, u32 addr); +extern __be32 net_ns_select_source_address(const struct net_device *dev, + u32 dst, int scope); + +#define SELECT_SRC_ADDR net_ns_select_source_address + #else /* CONFIG_NET_NS */ #define INIT_NET_NS(net_ns) @@ -155,6 +161,14 @@ return 0; } +static inline __be32 net_ns_select_source_address(struct net_device *dev, + u32 dst, int scope) +{ + return 0; +} + +#define SELECT_SRC_ADDR inet_select_addr + #endif /* !CONFIG_NET_NS */ #endif /* _LINUX_NET_NAMESPACE_H */ Index: 2.6.20-rc4-mm1/net/core/net_namespace.c === --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c +++ 2.6.20-rc4-mm1/net/core/net_namespace.c @@ -317,4 +317,72 @@ return ret; } +/* + * This function choose the source address from the network device, + * destination and the scope. The function will browse the ifaddr + * owned by network namespace and choose the most adapted for the + * dst address and dev. + * @dev : the network device where the traffic will go + * @dst : the destination a
[patch 10/12] net namespace : add the loopback isolation
From: Daniel Lezcano <[EMAIL PROTECTED]> When a packet is outgoing, the namespace source is stored into the skbuff. Because it is the loopback address, the source == destination, so when the packet is incoming, it has already the namespace destination set into the packet. Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/net_namespace.h | 13 +++-- include/linux/skbuff.h|5 - net/core/net_namespace.c | 32 +++- net/ipv4/ip_input.c |2 +- net/ipv4/ip_output.c |1 + 5 files changed, 44 insertions(+), 9 deletions(-) Index: 2.6.20-rc4-mm1/include/linux/skbuff.h === --- 2.6.20-rc4-mm1.orig/include/linux/skbuff.h +++ 2.6.20-rc4-mm1/include/linux/skbuff.h @@ -225,6 +225,7 @@ * @dma_cookie: a cookie to one of several possible DMA operations * done by skb DMA functions * @secmark: security marking + * @net_ns: namespace destination */ struct sk_buff { @@ -309,7 +310,9 @@ #ifdef CONFIG_NETWORK_SECMARK __u32 secmark; #endif - +#ifdef CONFIG_NET_NS + struct net_namespace*net_ns; +#endif __u32 mark; /* These elements must be at the end, see alloc_skb() for details. 
*/ Index: 2.6.20-rc4-mm1/net/ipv4/ip_input.c === --- 2.6.20-rc4-mm1.orig/net/ipv4/ip_input.c +++ 2.6.20-rc4-mm1/net/ipv4/ip_input.c @@ -396,7 +396,7 @@ iph = skb->nh.iph; - dst_net_ns = net_ns_find_from_dest_addr(iph->daddr); + dst_net_ns = net_ns_find_from_dest_addr(skb); if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns)) push_net_ns(dst_net_ns); /* Index: 2.6.20-rc4-mm1/net/ipv4/ip_output.c === --- 2.6.20-rc4-mm1.orig/net/ipv4/ip_output.c +++ 2.6.20-rc4-mm1/net/ipv4/ip_output.c @@ -272,6 +272,7 @@ IP_INC_STATS(IPSTATS_MIB_OUTREQUESTS); + net_ns_tag_sk_buff(skb); skb->dev = dev; skb->protocol = htons(ETH_P_IP); Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h === --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h @@ -8,6 +8,7 @@ #include struct in_ifaddr; +struct sk_buff; struct net_namespace { struct kref kref; @@ -101,10 +102,13 @@ extern __be32 net_ns_select_source_address(const struct net_device *dev, u32 dst, int scope); -extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr); +extern struct net_namespace +*net_ns_find_from_dest_addr(const struct sk_buff *skb); extern int net_ns_ifa_is_visible(const struct in_ifaddr *ifa); +extern void net_ns_tag_sk_buff(struct sk_buff *skb); + #define SELECT_SRC_ADDR net_ns_select_source_address #else /* CONFIG_NET_NS */ @@ -173,7 +177,8 @@ return 0; } -static inline struct net_namespace *net_ns_find_from_dest_addr(u32 daddr) +static inline struct net_namespace +*net_ns_find_from_dest_addr(const struct sk_buff *skb) { return NULL; } @@ -183,6 +188,10 @@ return 1; } +static inline void net_ns_tag_sk_buff(struct sk_buff *skb) +{ + ; +} #define SELECT_SRC_ADDR inet_select_addr #endif /* !CONFIG_NET_NS */ Index: 2.6.20-rc4-mm1/net/core/net_namespace.c === --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c +++ 2.6.20-rc4-mm1/net/core/net_namespace.c @@ -13,6 +13,9 @@ #include #include #include +#include +#include + #include struct net_namespace init_net_ns = { @@ 
-389,18 +392,25 @@ /* * This function finds the network namespace destination deduced from * the destination address. The network namespace is retrieved from - * the ifaddr owned by a network namespace - * @daddr : destination + * the ifaddr owned by a network namespace. If the packet is for the + * loopback address so we assume the destination address is already filled + * by the sender which is the same as the receiver. + * @skb : the packet to be delivered * Returns : the network namespace destination or NULL if not found */ -struct net_namespace *net_ns_find_from_dest_addr(u32 daddr) +struct net_namespace *net_ns_find_from_dest_addr(const struct sk_buff *skb) { struct net_namespace *net_ns = NULL; struct net_device *dev; struct in_device *in_dev; + struct iphdr *iph; + __be32 daddr; + + iph = skb->nh.iph; + daddr = iph->daddr; - if (LOOPBACK(daddr)) - return current_net_ns; + if (LOOPBACK(daddr)) + return skb->net_ns; read_lock(&dev_base_lock); rcu_read_lock(); @@ -442,4 +452,16 @@ return 0; } +/* + * This fun
[patch 05/12] net namespace : ioctl to push ifa to net namespace l3
From: Daniel Lezcano <[EMAIL PROTECTED]> New ioctl to "push" ifaddr to a container. Actually, the push is done from the current namespace, so the right word is "pull". That will be changed to move ifaddr from l2 network namespace to l3. Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/net_namespace.h |7 ++ include/linux/sockios.h |4 + net/core/net_namespace.c | 118 +- net/ipv4/af_inet.c|4 + 4 files changed, 132 insertions(+), 1 deletion(-) Index: 2.6.20-rc4-mm1/include/linux/sockios.h === --- 2.6.20-rc4-mm1.orig/include/linux/sockios.h +++ 2.6.20-rc4-mm1/include/linux/sockios.h @@ -122,6 +122,10 @@ #define SIOCBRADDIF0x89a2 /* add interface to bridge */ #define SIOCBRDELIF0x89a3 /* remove interface from bridge */ +/* Container calls */ +#define SIOCNETNSPUSHIF 0x89b0 /* add ifaddr to namespace */ +#define SIOCNETNSPULLIF 0x89b1 /* remove ifaddr to namespace */ + /* Device private ioctl calls */ /* Index: 2.6.20-rc4-mm1/net/ipv4/af_inet.c === --- 2.6.20-rc4-mm1.orig/net/ipv4/af_inet.c +++ 2.6.20-rc4-mm1/net/ipv4/af_inet.c @@ -789,6 +789,10 @@ case SIOCSIFFLAGS: err = devinet_ioctl(cmd, (void __user *)arg); break; + case SIOCNETNSPUSHIF: + case SIOCNETNSPULLIF: + err = net_ns_ioctl(cmd, (void __user *)arg); + break; default: if (sk->sk_prot->ioctl) err = sk->sk_prot->ioctl(sk, cmd, arg); Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h === --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h @@ -91,6 +91,8 @@ #define net_ns_hash(ns)((ns)->hash) +extern int net_ns_ioctl(unsigned int cmd, void __user *arg); + #else /* CONFIG_NET_NS */ #define INIT_NET_NS(net_ns) @@ -141,6 +143,11 @@ #define net_ns_hash(ns)(0) +static inline int net_ns_ioctl(unsigned int cmd, void __user *arg) +{ + return -ENOSYS; +} + #endif /* !CONFIG_NET_NS */ #endif /* _LINUX_NET_NAMESPACE_H */ Index: 2.6.20-rc4-mm1/net/core/net_namespace.c === --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c +++ 
2.6.20-rc4-mm1/net/core/net_namespace.c @@ -10,7 +10,9 @@ #include #include #include +#include #include +#include #include struct net_namespace init_net_ns = { @@ -123,6 +125,33 @@ return err; } +/* + * The function will move the ifaddr to the l2 network namespace + * parent. + * @net_ns: the related network namespace + */ +static void release_ifa_to_parent(const struct net_namespace* net_ns) +{ + struct net_device *dev; + struct in_device *in_dev; + + read_lock(&dev_base_lock); + rcu_read_lock(); + for (dev = dev_base; dev; dev = dev->next) { + in_dev = __in_dev_get_rcu(dev); + if (!in_dev) + continue; + + for_ifa(in_dev) { + if (ifa->ifa_net_ns != net_ns) + continue; + ifa->ifa_net_ns = net_ns->parent; + } endfor_ifa(in_dev); + } + read_unlock(&dev_base_lock); + rcu_read_unlock(); +} + void free_net_ns(struct kref *kref) { struct net_namespace *ns; @@ -139,12 +168,99 @@ } } - if (ns->level == NET_NS_LEVEL3) + if (ns->level == NET_NS_LEVEL3) { + release_ifa_to_parent(ns); put_net_ns(ns->parent); + } printk(KERN_DEBUG "NET_NS: net namespace %p destroyed\n", ns); kfree(ns); } EXPORT_SYMBOL_GPL(free_net_ns); +/* + * This function allows to assign an IP address from a l2 network + * namespace to one of his l3 child or to release from an l3 network + * namespace to his l2 network namespace parent. + * @cmd: a "push" / "pull" command + * @arg: an userspace buffer containing an ifreq structure + * Returns: + * - EPERM : if caller has no CAP_NET_ADMIN capabilities or the + * current level of network namespace is not layer 2 + * - EFAULT : if arg is an invalid buffer + * - EADDRNOTAVAIL : if the specified ifaddr does not exists + * - EINVAL : if cmd is unknown + * - zero on success + */ +int net_ns_ioctl(unsigned int cmd, void __user *arg) +{ + struct ifreq ifr; + struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr; + struct net_namespace *net_ns = current_net_ns; + struct net_device *dev; + struct in_device *in_dev; + struct in_ifad
[patch 11/12] net namespace : debugfs - add net_ns debugfs
From: Daniel Lezcano <[EMAIL PROTECTED]> For debug purpose only, this is not intended to be included. Add /sys/kernel/debug/net_ns. Creation of network namespace: echo > /sys/kernel/debug/net_ns/start Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- fs/debugfs/Makefile |2 fs/debugfs/net_ns.c | 335 net/Kconfig |4 3 files changed, 340 insertions(+), 1 deletion(-) Index: 2.6.20-rc4-mm1/fs/debugfs/Makefile === --- 2.6.20-rc4-mm1.orig/fs/debugfs/Makefile +++ 2.6.20-rc4-mm1/fs/debugfs/Makefile @@ -1,4 +1,4 @@ debugfs-objs := inode.o file.o obj-$(CONFIG_DEBUG_FS) += debugfs.o - +obj-$(CONFIG_NET_NS_DEBUG) += net_ns.o Index: 2.6.20-rc4-mm1/fs/debugfs/net_ns.c === --- /dev/null +++ 2.6.20-rc4-mm1/fs/debugfs/net_ns.c @@ -0,0 +1,335 @@ +/* + * net_ns.c - adds a net_ns/ directory to debug NET namespaces + * + * Author: Daniel Lezcano <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static struct dentry *net_ns_dentry; +static struct dentry *net_ns_dentry_dev; +static struct dentry *net_ns_dentry_start; +static struct dentry *net_ns_dentry_info; + +static ssize_t net_ns_dev_read_file(struct file *file, char __user *user_buf, + size_t count, loff_t *ppos) +{ + return 0; +} + +static ssize_t net_ns_dev_write_file(struct file *file, +const char __user *user_buf, +size_t count, loff_t *ppos) +{ + return 0; +} + +static int net_ns_dev_open_file(struct inode *inode, struct file *file) +{ + return 0; +} + +static int net_ns_start_open_file(struct inode *inode, struct file *file) +{ + return 0; +} + +static ssize_t net_ns_start_read_file(struct file *file, char __user *user_buf, + size_t count, loff_t *ppos) +{ + return 0; +} + +static ssize_t net_ns_start_write_file(struct file *file, + const char __user *user_buf, + size_t count, loff_t *ppos) +{ + int err; + size_t len; + const char __user *p; + char c; + unsigned long flags; + struct net_namespace *net, *new_net; + struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL; + + if (current_net_ns != &init_net_ns) + return -EBUSY; + + len = 0; + p = user_buf; + while (len < count) { + if (get_user(c, p++)) + return -EFAULT; + if (c == 0 || c == '\n') + break; + len++; + } + + if (len > 1) + return -EINVAL; + + if (copy_from_user(&c, user_buf, sizeof(c))) + return -EFAULT; + + if (c != '2' && c != '3') + return -EINVAL; + + flags = (c=='2'?CLONE_NEWNET2:CLONE_NEWNET3); + err = unshare_net_ns(flags, &new_net); + if (err) + return err; + + old_nsproxy = current->nsproxy; + new_nsproxy = dup_namespaces(old_nsproxy); + + if (!new_nsproxy) { + put_net_ns(new_net); + task_unlock(current); + return -ENOMEM; + } + + task_lock(current); + + if (new_nsproxy) { + current->nsproxy = new_nsproxy; + new_nsproxy = old_nsproxy; + } + + net = current->nsproxy->net_ns; + current->nsproxy->net_ns = new_net; + 
pop_net_ns(new_net); + new_net = net; + + task_unlock(current); + + put_nsproxy(new_nsproxy); + put_net_ns(new_net); + + return count; +} + +static int net_ns_info_open_file(struct inode *inode, struct file *file) +{ + return 0; +} + +static ssize_t net_ns_info_read_file(struct file *file, char __user *user_buf, +size_t count, loff_t *ppos) +{ + const unsigned int length = 256; + size_t len; + char buff[length]; + char *level; + struct net_namespace *net_ns = current_net_ns; + struct nsproxy *ns = current->nsproxy; + + if (*ppos < 0) + return -EINVAL; + if (*ppos >= count) + return 0; + if (!count) + return 0; + + switch (net_ns->level) { + case NET_NS_LEVEL2: + level = "layer 2"; + break; + case NET_NS_LEVEL3: + level = "layer 3"; + break; + default: + level = "unknown";
RE: [PATCH 2.6.20 1/5] s2io: updates for s2io driver.
Hi Jeff,

Thanks for the comments and references. As per your suggestion, we have
resubmitted the patches with the required changes.

Thanks,
~Siva

-----Original Message-----
From: Jeff Garzik [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 18, 2007 10:32 PM
To: Ananda Raju
Cc: netdev@vger.kernel.org; Leonid Grossman; Sivakumar Subramani; Alicia Pena; [EMAIL PROTECTED]; Ramkrishna Vepa
Subject: Re: [PATCH 2.6.20 1/5] s2io: updates for s2io driver.

Ananda Raju wrote:
> Hello,
>
> List of changes in this patch:
>
> This patch adds two load parameters napi and ufo. Previously NAPI was
> compilation option with these changes wan enable disable NAPI using
> load parameter. Also we are introducing ufo load parameter to
> enable/disable ufo feature
>
> Signed-off-by: Sivakumar Subramani <[EMAIL PROTECTED]>

OK, you're getting closer :)

Problems that need correcting:

1) Your email subject line is a one-line summary of the patch.
"s2io: updates for s2io driver" is useless, because it tells us nothing
about the patch itself. When applied in a series,

	git log master..upstream-fixes | git shortlog

will produce

	Ananda Raju (5):
	      s2io: updates for s2io driver
	      s2io: updates for s2io driver
	      s2io: updates for s2io driver
	      s2io: updates for s2io driver
	      s2io: updates for s2io driver

which clearly makes it impossible to distinguish between changesets.

Please re-read Rule #1 of http://linux.yyz.us/patch-format.html

Also, re-read Rule #2. Everything in your email body before the "---"
terminator is copied DIRECTLY into the kernel changelog. As such,
comments like "Hello," and "List of changes in this patch:" must be
hand-edited out of your email, before applying the patch.

Please fix these problems and resubmit.

	Jeff
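The "---" rule above can be illustrated with a few lines of C. This is a simplified sketch, not the logic of any real tool: `changelog_length()` is a hypothetical helper, and it assumes the terminator is a line consisting of exactly three dashes, whereas the actual patch-applying tools recognise a few more patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the changelog part of a patch email body, that
 * is, everything before the first line that is exactly "---". When no
 * such line exists, the whole body counts as changelog.
 */
static size_t changelog_length(const char *body)
{
	const char *line = body;

	assert(body != NULL);

	while (*line) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		if (len == 3 && memcmp(line, "---", 3) == 0)
			return (size_t)(line - body);
		if (!eol)
			break;
		line = eol + 1;
	}

	return strlen(body);
}
```

Anything counted by this function, greetings included, would end up in the changelog, which is why lines such as "Hello," have to be removed before the patch is sent.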
Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).
On Fri, Jan 19, 2007 at 11:57:00AM +0530, Suparna Bhattacharya ([EMAIL PROTECTED]) wrote:
> > > Since you are implementing new APIs here, have you considered doing an
> > > aio_sendfilev to be able to send a header with the data ?
> >
> > It is doable, but why people do not like corking?
> > With Linux less than microsecond syscall overhead it is better and more
> > flexible solution, doesn't it?
>
> That is what I used to think as well. However ...
>
> The problem as I understand it now is not about bunching data together, but
> of ensuring some sort of atomicity between the header and the data, when
> there can be multiple outstanding aio requests on the same socket - i.e
> ensuring strict ordering without other data coming in between, when data
> to be sent is not already in cache, and in the meantime another sendfile
> or aio write requests comes in for the same socket. Without having to lock
> the socket when reading data from disk.

No, socket locking is not a solution at all here. But the same applies to the header: it will be copied into the socket queue, then the socket will be unlocked and the populated VFS data will be put into that queue too, but there is a window between the socket unlock after the header copy and the file data copy. If we hold the socket lock after the header is copied, it is possible to hold it for too long: with bad sectors on disk, reading might take forever.

> There are alternate ways to address this, aio_sendfilev is one of the options
> I have heard people requesting.

I bet those people worked with different Unix systems, which have much slower syscalls, so they combine several operations into one call. Only from this perspective do I see any benefit in having a header in the syscall related to file transfer. Since I already "optimized" the open() syscall into file sending, things cannot become worse if I put a header pointer there too.

I will schedule a new kevent release with this change sometime after the current work on the M-on-N threading model.

> Regards
> Suparna

--
Evgeniy Polyakov
- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
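For readers unfamiliar with the corking approach discussed in this message, here is a minimal userspace sketch. It is an illustration under stated assumptions, not part of the kevent patch: send_header_and_file() is a hypothetical helper name, and it uses the existing blocking TCP_CORK socket option and sendfile() rather than the proposed aio_sendfile()/aio_sendfile_path(). It also does nothing about the ordering concern raised for multiple outstanding aio requests on one socket.

```c
#include <assert.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Turn TCP_CORK on or off; while corked, partial frames are held back. */
static int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/*
 * Hypothetical helper (not from the patch): send a text header followed by
 * the whole file at `path` over a connected TCP socket.
 * Returns 0 on success, -1 on error.
 */
int send_header_and_file(int sock, const char *hdr, const char *path)
{
    struct stat st;
    size_t left;
    off_t off = 0;
    int rc = -1;
    int fd;

    assert(hdr != NULL && path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || set_cork(sock, 1) < 0)
        goto out_close;

    /* The header is copied into the socket queue first... */
    for (left = strlen(hdr); left > 0; ) {
        ssize_t n = send(sock, hdr, left, 0);
        if (n < 0)
            goto out_uncork;
        hdr += n;
        left -= (size_t)n;
    }

    /* ...then the file data follows without a copy through userspace. */
    for (left = (size_t)st.st_size; left > 0; ) {
        ssize_t n = sendfile(sock, fd, &off, left);
        if (n <= 0)
            goto out_uncork;
        left -= (size_t)n;
    }
    rc = 0;

out_uncork:
    /* Removing the cork flushes whatever is still queued. */
    if (set_cork(sock, 0) < 0)
        rc = -1;
out_close:
    close(fd);
    return rc;
}
```

The two extra setsockopt() calls are the syscall overhead the message argues is cheap on Linux; folding the header into the file-sending call would remove them.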
Re: Possible ways of dealing with OOM conditions.
> Let me briefly describe your approach and possible drawbacks in it. > You start reserving some memory when systems is under memory pressure. > when system is in real trouble, you start using that reserve for special > tasks mainly for network path to allocate packets and process them in > order to get committed some memory swapping. > > So, the problems I see here, are following: > 1. it is possible that when you are starting to create a reserve, there > will not be enough memeory at all. So the solution is to reserve in > advance. Swap is usually enabled at startup, but sure, if you want you can mess this up. > 2. You differentiate by hand between critical and non-critical > allocations by specifying some kernel users as potentially possible to > allocate from reserve. True, all sockets that are needed for swap, no-one else. > This does not prevent from NVIDIA module to > allocate from that reserve too, does it? All users of the NVidiot crap deserve all the pain they get. If it breaks they get to keep both pieces. > And you artificially limit > system to process only tiny bits of what it must do, thus potentially > leaking pathes which must use reserve too. How so? I cover pretty much every allocation needed to process an skb by setting PF_MEMALLOC - the only drawback there is that the reserve might not actually be large enough because it covers more allocations that were considered. (thats one of the TODO items, validate the reserve functions parameters) > So, solution is to have a reserve in advance, and manage it using > special path when system is in OOM. So you will have network memory > reserve, which will be used when system is in trouble. It is very > similar to what you had. > > But the whole reserve can never be used at all, so it should be used, > but not by those who can create OOM condition, thus it should be > exported to, for example, network only, and when system is in trouble, > network would be still functional (although only critical pathes). 
But the network can create OOM conditions for itself just fine. Consider the remote storage disappearing for a while (it got rebooted, someone tripped over the wire etc..). Now the rest of the network traffic keeps coming and will queue up - because user-space is stalled, waiting for more memory - and we run out of memory. There must be a point where we start dropping packets that are not critical to the survival of the machine. > Even further development of such idea is to prevent such OOM condition > at all - by starting swapping early (but wisely) and reduce memory > usage. These just postpone execution but will not avoid it. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
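The reserve idea debated in this thread can be modelled in a few lines. This is a toy sketch, not the actual patch (which, per the message, marks skb processing with PF_MEMALLOC); struct toy_pool and toy_alloc_page() are made-up names. Non-critical allocations fail once only the reserve is left, which is the point where non-critical packets would be dropped, while critical ones may drain the reserve.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a page pool with a floor that only critical users may cross. */
struct toy_pool {
    size_t free_pages;     /* pages currently available */
    size_t reserve_pages;  /* pages held back for critical allocations */
};

/* Returns 1 if a page was handed out, 0 if the caller must drop or wait. */
int toy_alloc_page(struct toy_pool *pool, int critical)
{
    size_t floor;

    assert(pool != NULL);

    /* Critical callers may use everything; others stop at the reserve. */
    floor = critical ? 0 : pool->reserve_pages;
    if (pool->free_pages <= floor)
        return 0;

    pool->free_pages--;
    return 1;
}
```

The sketch says nothing about how large the reserve must be, which the message lists as an open TODO item.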
Re: Kernel headers - linux-atm userspace build broken by recent change; __be16 undefined
On Thu, Jan 18, 2007 at 09:22:52PM, Andrew Walrond wrote:
> Don't know exactly when this change went in, but it's not in 2.6.18.3
> and is in 2.6.19.2+
>
> $ diff linux/include/linux/if_arp.h linux-2.6/include/linux/if_arp.h
> 133,134c133,134
> <       unsigned short  ar_hrd;         /* format of hardware address   */
> <       unsigned short  ar_pro;         /* format of protocol address   */
> ---
> >       __be16          ar_hrd;         /* format of hardware address   */
> >       __be16          ar_pro;         /* format of protocol address   */
> 137c137
> <       unsigned short  ar_op;          /* ARP opcode (command)         */
> ---
> >       __be16          ar_op;          /* ARP opcode (command)         */
>
>
> This causes the linux-atm userspace compile to fail like this:
>
> In file included from arp.c:19:
> /usr/include/linux/if_arp.h:133: error: expected
> specifier-qualifier-list before '__be16'
>
> I guess if_arp.h needs to include include/linux/byteorder/big_endian.h?

No, linux/types.h

But what bothers me more about if_arp.h is that it is one of the headers using "struct sockaddr" in userspace, but as far as I can see we aren't exporting it in any header.

This seems to work since glibc is providing the struct, but this looks a bit fishy.

> Andrew Walrond

cu
Adrian

--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
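Until the header itself is fixed, a userspace program can work around the error by pulling in the missing definitions first. The sketch below assumes a Linux system with kernel headers installed and a libc that provides struct sockaddr through <sys/socket.h>; make_arp_request_header() is a hypothetical helper, not linux-atm code. With a corrected if_arp.h the explicit <linux/types.h> include is redundant but harmless.

```c
#include <assert.h>
#include <arpa/inet.h>       /* htons(), ntohs() */
#include <sys/socket.h>      /* struct sockaddr, which if_arp.h relies on */
#include <linux/types.h>     /* __be16, must come before if_arp.h */
#include <linux/if_arp.h>    /* struct arphdr, ARPHRD_ETHER, ARPOP_REQUEST */
#include <linux/if_ether.h>  /* ETH_P_IP, ETH_ALEN */

/* Hypothetical helper: fill in the fixed part of an IPv4-over-Ethernet ARP request. */
struct arphdr make_arp_request_header(void)
{
    struct arphdr h;

    /* __be16 only annotates byte order; in userspace it is a 16-bit integer. */
    assert(sizeof(h.ar_op) == 2);

    h.ar_hrd = htons(ARPHRD_ETHER);   /* format of hardware address */
    h.ar_pro = htons(ETH_P_IP);       /* format of protocol address */
    h.ar_hln = ETH_ALEN;              /* hardware address length */
    h.ar_pln = 4;                     /* IPv4 address length */
    h.ar_op  = htons(ARPOP_REQUEST);  /* ARP opcode (command) */
    return h;
}
```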
Re: Possible ways of dealing with OOM conditions.
On Thu, 18 Jan 2007, Peter Zijlstra wrote:

> > Cache misses for small packet flow due to the fact, that the same data
> > is allocated and freed and accessed on different CPUs will become an
> > issue soon, not right now, since two-four core CPUs are not yet to be
> > very popular and price for the cache miss is not _that_ high.
>
> SGI does networking too, right?

Slab deals with those issues the right way. We have per-processor queues that attempt to keep objects cache-hot. A special shared queue exists between neighboring processors to facilitate the exchange of objects between them.
Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?
On Fri, Jan 19, 2007 at 12:06:41PM +0100, Jarek Poplawski wrote:
>
> [PATCH][NET] tcp_output: rare bad TCP checksum with 2.6.19
>
> The patch "Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE"
> changed to unconditional copying of ip_summed field from collapsed
> skb. This patch reverts this change.
>
> All substantial work including heavy testing and diagnosing by:
> Michael Tokarev <[EMAIL PROTECTED]>
>
> Signed-off-by: Jarek Poplawski <[EMAIL PROTECTED]>

Acked-by: Herbert Xu <[EMAIL PROTECTED]>

Thanks for catching this! I'll take the credit for adding this bug :)

Dave, we'll need this fix for 2.6.20 as well as 2.6.19.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
Patch to Implement IPv6 RFC 4429 (Optimistic Duplicate Address Detection). In short, this is a feature whereby a node with a Tentative address can begin to make use of that address almost immediately after its configured. To enable this, extra rules need to be followed during the Duplicate address detection phase of the addresses configuration, so that in the event of a collision, neighboring nodes do not have thier neighbor caches affected adversely by the optimistic node. This patch implements those rules as outlined in the RFC. I have a fairly limited testing environment here, but from the testing I've done, this patch appears to conform to the rules as outlined in RFC 4429, causes no adverse affects on normal IPv6 operation when in use, and doesn't seem to break anything when disabled via the sysctl. Comments and Reviews appreciated. Thanks and Regards Neil Signed-Off-By: Neil Horman <[EMAIL PROTECTED]> include/linux/if_addr.h|1 include/linux/sysctl.h |1 include/net/addrconf.h |4 ++- include/net/ipv6.h |1 net/ipv6/addrconf.c| 50 +++--- net/ipv6/mcast.c |4 +-- net/ipv6/ndisc.c | 59 ++--- net/ipv6/sysctl_net_ipv6.c |8 ++ 8 files changed, 107 insertions(+), 21 deletions(-) diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h index d557e4c..43f3bed 100644 --- a/include/linux/if_addr.h +++ b/include/linux/if_addr.h @@ -39,6 +39,7 @@ enum #define IFA_F_TEMPORARYIFA_F_SECONDARY #defineIFA_F_NODAD 0x02 +#define IFA_F_OPTIMISTIC 0x04 #defineIFA_F_HOMEADDRESS 0x10 #define IFA_F_DEPRECATED 0x20 #define IFA_F_TENTATIVE0x40 diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index 81480e6..62034c3 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -531,6 +531,7 @@ enum { NET_IPV6_IP6FRAG_TIME=23, NET_IPV6_IP6FRAG_SECRET_INTERVAL=24, NET_IPV6_MLD_MAX_MSF=25, + NET_IPV6_OPT_DAD_ENABLE=26, }; enum { diff --git a/include/net/addrconf.h b/include/net/addrconf.h index 88df8fc..d248a19 100644 --- a/include/net/addrconf.h +++ 
b/include/net/addrconf.h @@ -73,7 +73,9 @@ extern intipv6_get_saddr(struct dst_entry *dst, extern int ipv6_dev_get_saddr(struct net_device *dev, struct in6_addr *daddr, struct in6_addr *saddr); -extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *); +extern int ipv6_get_lladdr(struct net_device *dev, + struct in6_addr *, + unsigned char banned_flags); extern int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2); extern voidaddrconf_join_solict(struct net_device *dev, diff --git a/include/net/ipv6.h b/include/net/ipv6.h index 00328b7..dd16169 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -110,6 +110,7 @@ struct frag_hdr { /* sysctls */ extern int sysctl_ipv6_bindv6only; extern int sysctl_mld_max_msf; +extern int sysctl_optimistic_dad; /* MIBs */ DECLARE_SNMP_STAT(struct ipstats_mib, ipv6_statistics); diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 2a7e461..f7afb2a 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -206,6 +206,8 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .proxy_ndp = 0, }; +int sysctl_optimistic_dad = 1; + /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */ #if 0 const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT; @@ -830,7 +832,8 @@ retry: ift = !max_addresses || ipv6_count_addresses(idev) < max_addresses ? 
ipv6_add_addr(idev, &addr, tmp_plen, - ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL; + ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, + IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL; if (!ift || IS_ERR(ift)) { in6_ifa_put(ifp); in6_dev_put(idev); @@ -1174,7 +1177,8 @@ int ipv6_get_saddr(struct dst_entry *dst, } -int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr) +int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, + unsigned char banned_flags) { struct inet6_dev *idev; int err = -EADDRNOTAVAIL; @@ -1185,7 +1189,7 @@ int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr) read_lock_bh(&idev->lock); for (ifp=idev->addr_list; ifp; ifp=ifp->if_next) { -
Re: [PATCH 10/12] forcedeth: tx max work
Jeff Garzik wrote:
> Ayaz Abdulla wrote:
> > This patch adds a limit to how much tx work can be done in each
> > iteration of tx processing.
> >
> > Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>
>
> What about the "tail end" of the work, when the limit is reached?
> Remember that delaying the completion of TX's too long increases latency.
>
> It seems to me that this patch needs a timer or somesuch, to guarantee
> that TX completions are not delayed too long in the worst case.

Yes, you are right. There is a timer interrupt that fires every 10ms in throughput mode (in cpu mode it fires approximately every 130us). I can use that to clean out any uncompleted TXs. Let me know whether 10ms is acceptable for worst-case tx completion.

> Jeff
Re: bonding: sysfs patch broke module renaming
Patrick McHardy <[EMAIL PROTECTED]> wrote:

>The sysfs patch broke using multiple instances of the bonding module
>through module renaming (modprobe -o). In recent kernels it fails
>with -EEXIST when trying to add the bonding_masters file for the
>second time, in older kernels (where sysfs_add_file didn't check
>for duplicates) it will crash when unloading the modules.

Ok, I see what the problem is; it has to do with how device creation was changed at some point for the sysfs stuff, which broke the multiple load logic.

I don't think it has to do with the sysfs_add_file duplicate check business; I can see the error in how bond_create() is called in the new (post-sysfs) stuff, although I haven't tracked it down to a particular changeset. There's also a separate error handling bug I see in bond_create() that I don't even get to because it bails out first.

Anyway, let me see what I can work out to fix this up.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
e1000: update device ID table for register dumps
e1000: update device ID table for register dumps with new devices From: Auke Kok <[EMAIL PROTECTED]> The register dump routine of e1000 was missing several newer chipsets. I reimported the mac detection code from the linux e1000 driver. This fixes newer NIC's reporting that their bus type is PCI instead of PCI-e. Signed-off-by: Auke Kok <[EMAIL PROTECTED]> --- e1000.c | 154 ++- 1 files changed, 103 insertions(+), 51 deletions(-) diff --git a/e1000.c b/e1000.c index 6741323..d67947a 100644 --- a/e1000.c +++ b/e1000.c @@ -111,42 +111,66 @@ #define E1000_TCTL_NRTU 0x0200/* No Re-transmit on underrun */ /* PCI Device IDs */ -#define E1000_DEV_ID_82542 0x1000 -#define E1000_DEV_ID_82543GC_FIBER 0x1001 -#define E1000_DEV_ID_82543GC_COPPER 0x1004 -#define E1000_DEV_ID_82544EI_COPPER 0x1008 -#define E1000_DEV_ID_82544EI_FIBER 0x1009 -#define E1000_DEV_ID_82544GC_COPPER 0x100C -#define E1000_DEV_ID_82544GC_LOM 0x100D -#define E1000_DEV_ID_82540EM 0x100E -#define E1000_DEV_ID_82540EM_LOM 0x1015 -#define E1000_DEV_ID_82540EP_LOM 0x1016 -#define E1000_DEV_ID_82540EP 0x1017 -#define E1000_DEV_ID_82540EP_LP 0x101E -#define E1000_DEV_ID_82545EM_COPPER 0x100F -#define E1000_DEV_ID_82545EM_FIBER 0x1011 -#define E1000_DEV_ID_82545GM_COPPER 0x1026 -#define E1000_DEV_ID_82545GM_FIBER 0x1027 -#define E1000_DEV_ID_82545GM_SERDES 0x1028 -#define E1000_DEV_ID_82546EB_COPPER 0x1010 -#define E1000_DEV_ID_82546EB_FIBER 0x1012 -#define E1000_DEV_ID_82546EB_QUAD_COPPER 0x101D -#define E1000_DEV_ID_82541EI 0x1013 -#define E1000_DEV_ID_82541EI_MOBILE 0x1018 -#define E1000_DEV_ID_82541ER 0x1078 -#define E1000_DEV_ID_82547GI 0x1075 -#define E1000_DEV_ID_82541GI 0x1076 -#define E1000_DEV_ID_82541GI_MOBILE 0x1077 -#define E1000_DEV_ID_82541GI_LF 0x107C -#define E1000_DEV_ID_82546GB_COPPER 0x1079 -#define E1000_DEV_ID_82546GB_FIBER 0x107A -#define E1000_DEV_ID_82546GB_SERDES 0x107B -#define E1000_DEV_ID_82546GB_PCIE0x108A -#define E1000_DEV_ID_82547EI 0x1019 -#define E1000_DEV_ID_82573E 0x108B 
-#define E1000_DEV_ID_82573E_IAMT 0x108C - -#define E1000_DEV_ID_82546GB_QUAD_COPPER 0x1099 +#define E1000_DEV_ID_825420x1000 +#define E1000_DEV_ID_82543GC_FIBER0x1001 +#define E1000_DEV_ID_82543GC_COPPER 0x1004 +#define E1000_DEV_ID_82544EI_COPPER 0x1008 +#define E1000_DEV_ID_82544EI_FIBER0x1009 +#define E1000_DEV_ID_82544GC_COPPER 0x100C +#define E1000_DEV_ID_82544GC_LOM 0x100D +#define E1000_DEV_ID_82540EM 0x100E +#define E1000_DEV_ID_82540EM_LOM 0x1015 +#define E1000_DEV_ID_82540EP_LOM 0x1016 +#define E1000_DEV_ID_82540EP 0x1017 +#define E1000_DEV_ID_82540EP_LP 0x101E +#define E1000_DEV_ID_82545EM_COPPER 0x100F +#define E1000_DEV_ID_82545EM_FIBER0x1011 +#define E1000_DEV_ID_82545GM_COPPER 0x1026 +#define E1000_DEV_ID_82545GM_FIBER0x1027 +#define E1000_DEV_ID_82545GM_SERDES 0x1028 +#define E1000_DEV_ID_82546EB_COPPER 0x1010 +#define E1000_DEV_ID_82546EB_FIBER0x1012 +#define E1000_DEV_ID_82546EB_QUAD_COPPER 0x101D +#define E1000_DEV_ID_82546GB_COPPER 0x1079 +#define E1000_DEV_ID_82546GB_FIBER0x107A +#define E1000_DEV_ID_82546GB_SERDES 0x107B +#define E1000_DEV_ID_82546GB_PCIE 0x108A +#define E1000_DEV_ID_82546GB_QUAD_COPPER 0x1099 +#define E1000_DEV_ID_82546GB_QUAD_COPPER_KSP3 0x10B5 +#define E1000_DEV_ID_82541EI 0x1013 +#define E1000_DEV_ID_82541EI_MOBILE 0x1018 +#define E1000_DEV_ID_82541ER_LOM 0x1014 +#define E1000_DEV_ID_82541ER 0x1078 +#define E1000_DEV_ID_82541GI 0x1076 +#define E1000_DEV_ID_82541GI_LF 0x107C +#define E1000_DEV_ID_82541GI_MOBILE 0x1077 +#define E1000_DEV_ID_82547EI 0x1019 +#define E1000_DEV_ID_82547EI_MOBILE 0x101A +#define E1000_DEV_ID_82547GI 0x1075 +#define E1000_DEV_ID_82571EB_COPPER 0x105E +#define E1000_DEV_ID_82571EB_FIBER0x105F +#define E1000_DEV_ID_82571EB_SERDES 0x1060 +#define E1000_DEV_ID_82571EB_QUAD_COPPER 0x10A4 +#define E1000_DEV_ID_82571EB_QUAD_FIBER 0x10A5 +#define E1000_DEV_ID_82571EB_QUAD_COPPER_LP 0x10BC +#define E1000_DEV_ID_82572EI_COPPER 0x107D +#define E1000_DEV_ID_82572EI_FIBER0x107E +#define 
E1000_DEV_ID_82572EI_SERDES 0x107F +#define E1000_DEV_ID_82572EI 0x10B9 +#define E1000_DEV_ID_82573E 0x108B +#define E1000_DEV_ID_82573E_IAMT
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
Hello. In article <[EMAIL PROTECTED]> (at Fri, 19 Jan 2007 16:23:14 -0500), Neil Horman <[EMAIL PROTECTED]> says: > Patch to Implement IPv6 RFC 4429 (Optimistic Duplicate Address Detection). In Good work. We will see if this would break core and basic ipv6 code. Dave, please hold on. Some quick comments. > --- a/include/net/ipv6.h > +++ b/include/net/ipv6.h > @@ -110,6 +110,7 @@ struct frag_hdr { > /* sysctls */ > extern int sysctl_ipv6_bindv6only; > extern int sysctl_mld_max_msf; > +extern int sysctl_optimistic_dad; > : > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > index 2a7e461..f7afb2a 100644 > --- a/net/ipv6/addrconf.c > +++ b/net/ipv6/addrconf.c > @@ -206,6 +206,8 @@ static struct ipv6_devconf ipv6_devconf_dflt > __read_mostly = { > .proxy_ndp = 0, > }; > > +int sysctl_optimistic_dad = 1; > + Please put this into ipv6_devconf{} and make it per-interface variable. And I think default should be kept off (0). > /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */ > #if 0 > const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT; > @@ -830,7 +832,8 @@ retry: > ift = !max_addresses || > ipv6_count_addresses(idev) < max_addresses ? > ipv6_add_addr(idev, &addr, tmp_plen, > - ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, > IFA_F_TEMPORARY) : NULL; > + ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, > + IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL; > if (!ift || IS_ERR(ift)) { > in6_ifa_put(ifp); > in6_dev_put(idev); Please align ipv6_addr_type and IFA_F_TEMPORARY > @@ -1174,7 +1177,8 @@ int ipv6_get_saddr(struct dst_entry *dst, > } > > > -int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr) > +int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, > + unsigned char banned_flags) > { > struct inet6_dev *idev; > int err = -EADDRNOTAVAIL; Please align "struct net_device" and "unsigned char". 
> @@ -1185,7 +1189,7 @@ int ipv6_get_lladdr(struct net_device *dev, struct > in6_addr *addr) > > read_lock_bh(&idev->lock); > for (ifp=idev->addr_list; ifp; ifp=ifp->if_next) { > - if (ifp->scope == IFA_LINK && > !(ifp->flags&IFA_F_TENTATIVE)) { > + if (ifp->scope == IFA_LINK && > !(ifp->flags&banned_flags)) { > ipv6_addr_copy(addr, &ifp->addr); > err = 0; > break; > @@ -1742,7 +1746,7 @@ ok: It is not your fault, but please put a space around "&". > if (!max_addresses || > ipv6_count_addresses(in6_dev) < max_addresses) > ifp = ipv6_add_addr(in6_dev, &addr, > pinfo->prefix_len, > - > addr_type&IPV6_ADDR_SCOPE_MASK, 0); > + > addr_type&IPV6_ADDR_SCOPE_MASK,0); > > if (!ifp || IS_ERR(ifp)) { > in6_dev_put(in6_dev); Please do no kill space after ",". > @@ -2123,7 +2132,8 @@ static void addrconf_add_linklocal(struct inet6_dev > *idev, struct in6_addr *addr > { > struct inet6_ifaddr * ifp; > > - ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, IFA_F_PERMANENT); > + ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, > + IFA_F_PERMANENT|IFA_F_OPTIMISTIC); > if (!IS_ERR(ifp)) { > addrconf_dad_start(ifp, 0); > in6_ifa_put(ifp); Please align idev and IFA_F_PERMANENT. > @@ -542,7 +556,8 @@ void ndisc_send_ns(struct net_device *dev, struct > neighbour *neigh, > int send_llinfo; > > if (saddr == NULL) { > - if (ipv6_get_lladdr(dev, &addr_buf)) > + if (ipv6_get_lladdr(dev, &addr_buf, > + (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC))) > return; > saddr = &addr_buf; > } ditto... ("dev" and "(") > +and optimistic) are false then we can just fail > +dad now. > + */ > + type = ipv6_addr_type(saddr); > + if (!((ifp->flags & IFA_F_OPTIMISTIC) && > + (type & IPV6_ADDR_UNICAST))) { > + addrconf_dad_failure(ifp); > + return; > + } > } > > idev = ifp->idev; hmm? Here, is saddr always unicast, isn't it?! --yoshfuji - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
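The banned_flags parameter reviewed above can be illustrated with a small userspace model of the address-list walk. This is a sketch and not the kernel function: struct toy_ifaddr and toy_get_lladdr() are made-up names, and only the two flag values are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_IFA_F_OPTIMISTIC 0x04  /* value the patch adds to if_addr.h */
#define TOY_IFA_F_TENTATIVE  0x40  /* value shown in the patch context */

/* Toy address entry: scope reduced to a boolean, plus the IFA_F_* flags. */
struct toy_ifaddr {
    int is_link_local;
    unsigned char flags;
};

/* Return the index of the first link-local address with no banned flag, or -1. */
int toy_get_lladdr(const struct toy_ifaddr *addrs, size_t n,
                   unsigned char banned_flags)
{
    size_t i;

    assert(addrs != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (addrs[i].is_link_local && !(addrs[i].flags & banned_flags))
            return (int)i;
    }
    return -1;
}
```

Passing only the tentative flag lets an optimistic address be selected, while passing both flags, as the ndisc_send_ns() hunk does, excludes it.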
Re: e1000: update device ID table for register dumps [Is an *ethtool* patch]
Auke Kok wrote:
> e1000: update device ID table for register dumps with new devices
>
> From: Auke Kok <[EMAIL PROTECTED]>
>
> The register dump routine of e1000 was missing several newer chipsets.
> I reimported the mac detection code from the linux e1000 driver. This
> fixes newer NIC's reporting that their bus type is PCI instead of PCI-e.
>
> Signed-off-by: Auke Kok <[EMAIL PROTECTED]>

It's a patch to ethtool, of course. Apologies for any confusion; I didn't fix the mail subject.

Auke
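The symptom described in the patch, newer NICs reported as PCI, follows naturally from a lookup that falls back to a default for unknown device IDs. The sketch below is a simplified model, not the ethtool code: toy_bus_type() is a hypothetical function, the device IDs are copied from the patch's table, and grouping the 82571/82572/82573 parts as PCI-e is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Device IDs as listed in the patch's table. */
#define E1000_DEV_ID_82540EM        0x100E
#define E1000_DEV_ID_82571EB_COPPER 0x105E
#define E1000_DEV_ID_82572EI_COPPER 0x107D
#define E1000_DEV_ID_82573E         0x108B

enum toy_bus { TOY_BUS_PCI, TOY_BUS_PCIE };

/* Hypothetical lookup: classify a device ID by bus type. */
enum toy_bus toy_bus_type(uint16_t device_id)
{
    /* 0xFFFF is what a config read returns when no device responds. */
    assert(device_id != 0xFFFF);

    switch (device_id) {
    case E1000_DEV_ID_82571EB_COPPER:
    case E1000_DEV_ID_82572EI_COPPER:
    case E1000_DEV_ID_82573E:
        return TOY_BUS_PCIE;
    default:
        /* Any ID missing from the table silently ends up here. */
        return TOY_BUS_PCI;
    }
}
```

Removing one of the case labels reproduces the reported behaviour for that chipset, which is why the table has to track the driver's device list.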
Re: Possible ways of dealing with OOM conditions.
On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra ([EMAIL PROTECTED]) wrote:
> > 2. You differentiate by hand between critical and non-critical
> > allocations by specifying some kernel users as potentially possible to
> > allocate from reserve.
>
> True, all sockets that are needed for swap, no-one else.
>
> > This does not prevent from NVIDIA module to
> > allocate from that reserve too, does it?
>
> All users of the NVidiot crap deserve all the pain they get.
> If it breaks they get to keep both pieces.

I meant that pretty much anyone can be such a user: they can just add a bit to their own gfp_flags used for the allocation.

> > And you artificially limit
> > system to process only tiny bits of what it must do, thus potentially
> > leaking pathes which must use reserve too.
>
> How so? I cover pretty much every allocation needed to process an skb by
> setting PF_MEMALLOC - the only drawback there is that the reserve might
> not actually be large enough because it covers more allocations that
> were considered. (thats one of the TODO items, validate the reserve
> functions parameters)

You only covered ipv4/v6 and arp, maybe some route updates. But it is quite possible that some allocations, like multicast/broadcast, are missed. Selecting only special paths out of all possible network allocations tends to create a situation where something is missed or is cross-dependent on other paths.

> > So, solution is to have a reserve in advance, and manage it using
> > special path when system is in OOM. So you will have network memory
> > reserve, which will be used when system is in trouble. It is very
> > similar to what you had.
> >
> > But the whole reserve can never be used at all, so it should be used,
> > but not by those who can create OOM condition, thus it should be
> > exported to, for example, network only, and when system is in trouble,
> > network would be still functional (although only critical pathes).
>
> But the network can create OOM conditions for itself just fine.
>
> Consider the remote storage disappearing for a while (it got rebooted,
> someone tripped over the wire etc..). Now the rest of the network
> traffic keeps coming and will queue up - because user-space is stalled,
> waiting for more memory - and we run out of memory.

Hmm... Neither UDP nor TCP works that way, actually.

> There must be a point where we start dropping packets that are not
> critical to the survival of the machine.

You can still drop them; the main point is that network allocations do not depend on other allocations.

> > Even further development of such idea is to prevent such OOM condition
> > at all - by starting swapping early (but wisely) and reduce memory
> > usage.
>
> These just postpone execution but will not avoid it.

No. If the system allows such a condition to occur, then something is broken. It must be prevented, instead of creating special hacks to recover from it.

--
Evgeniy Polyakov
Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)
Auke Kok wrote: Adam Kropelin wrote: I am experiencing the no-link issue on a 82572EI single port copper PCI-E card. I've only tried 2.6.20-rc5, so I cannot tell if this is a regression or not yet. Will test older kernel soon. Can provide details/logs if you want 'em. we've already established that Allen's issue is not due to the driver and caused by interrupts being mal-assigned on his system, possibly a pci subsystem bug. You also have a completely different board (82572EI instead of 82571EB), so I'd like to see the usual debugging info as well as hearing from you whether 2.6.19.any works correctly. On 2.6.19 the link status is working (follows cable plug/unplug), but no tx or rx packets get thru. Attempts to transmit occasionally result in tx timed out errors in dmesg, but I cannot seem to generate these at will. On 2.6.20-rc5, the link status does not work (link is always down), and as expected no tx or rx. No tx timed out errors this time, presumably because it thinks the link is down. Note that both the switch and the LEDs on the NIC indicate a good 1000 Mbps link. dmesg, 'cat /proc/interrupts', and 'lspci -vvv' attached for 2.6.20-rc5. The data from 2.6.19 is essentially the same. On top of that I posted a patch to rc5-mm yesterday that fixes a few significant bugs in the rc5-mm driver, so please apply that patch too before trying, so we're not wasting our time finding old bugs ;) I haven't been able to test rc5-mm yet because it won't boot on this box. Applying git-e1000 directly to -rc4 or -rc5 results in a number of rejects that I'm not sure how to fix. Some are obvious, but the others I'm unsure of. --Adam dmesg-2.6.20-rc5 Description: Binary data lspci-2.6.20-rc5 Description: Binary data proc-irq-2.6.20-rc5 Description: Binary data
Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)
Adam Kropelin wrote: Auke Kok wrote: Adam Kropelin wrote: I am experiencing the no-link issue on a 82572EI single port copper PCI-E card. I've only tried 2.6.20-rc5, so I cannot tell if this is a regression or not yet. Will test older kernel soon. Can provide details/logs if you want 'em. we've already established that Allen's issue is not due to the driver and caused by interrupts being mal-assigned on his system, possibly a pci subsystem bug. You also have a completely different board (82572EI instead of 82571EB), so I'd like to see the usual debugging info as well as hearing from you whether 2.6.19.any works correctly. On 2.6.19 the link status is working (follows cable plug/unplug), but no tx or rx packets get thru. Attempts to transmit occasionally result in tx timed out errors in dmesg, but I cannot seem to generate these at will. On 2.6.20-rc5, the link status does not work (link is always down), and as expected no tx or rx. No tx timed out errors this time, presumably because it thinks the link is down. Note that both the switch and the LEDs on the NIC indicate a good 1000 Mbps link. dmesg, 'cat /proc/interrupts', and 'lspci -vvv' attached for 2.6.20-rc5. The data from 2.6.19 is essentially the same. at least your interrupts look sane. I see you are using MSI, but no interrupts arrive at neither OS nor driver. On top of that I posted a patch to rc5-mm yesterday that fixes a few significant bugs in the rc5-mm driver, so please apply that patch too before trying, so we're not wasting our time finding old bugs ;) I haven't been able to test rc5-mm yet because it won't boot on this box. Applying git-e1000 directly to -rc4 or -rc5 results in a number of rejects that I'm not sure how to fix. Some are obvious, but the others I'm unsure of. that won't work. You either need to start with 2.6.20-rc5 (and pull the changes pending merge in netdev-2.6 from Jeff Garzik), or start with 2.6.20-rc4-mm1 and manually apply that patch I sent out on monday. 
A different combination of either of these two will not work, as they are completely different drivers. can you include `ethtool ethX` output of the link down message and `ethtool -d ethX` as well? I'll need to dig up an 82572 and see what's up with that, I've not seen that problem before. More importantly, I suspect that *again* the issue is caused by interrupts not arriving or getting lost. Can you try running with MSI disabled in your kernel config? FYI the driver gives an interrupt to signal to the driver that link is up. no interrupt == no link detected. So that explains the symptom. Auke - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)
Auke Kok wrote:
> Adam Kropelin wrote:
> > I haven't been able to test rc5-mm yet because it won't boot on this
> > box. Applying git-e1000 directly to -rc4 or -rc5 results in a number of
> > rejects that I'm not sure how to fix. Some are obvious, but the others
> > I'm unsure of.
>
> that won't work. You either need to start with 2.6.20-rc5 (and pull the
> changes pending merge in netdev-2.6 from Jeff Garzik),

I thought that's what I was doing when I applied git-e1000 to 2.6.20-rc5, but I guess not.

> or start with 2.6.20-rc4-mm1 and manually apply that patch I sent out on
> monday. A different combination of either of these two will not work, as
> they are completely different drivers.

I'll try to work something out.

> can you include `ethtool ethX` output of the link down message and
> `ethtool -d ethX` as well? I'll need to dig up an 82572 and see what's
> up with that, I've not seen that problem before.

ethtool output attached.

> More importantly, I suspect that *again* the issue is caused by
> interrupts not arriving or getting lost.

Smells that way to me, too.

> Can you try running with MSI disabled in your kernel config?

That fixes it! The link comes up and tx/rx works well. I get about 300 Mbps using default iperf settings with a nearby windows box.

> FYI the driver gives an interrupt to signal to the driver that link is
> up. no interrupt == no link detected. So that explains the symptom.

Yep, makes sense. I've worked with a number of PHYs like that.

--Adam

[Attachments: ethtool-eth1, ethtool-d-eth1 (binary data)]
Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)
Adam Kropelin wrote: Auke Kok wrote: Adam Kropelin wrote: I haven't been able to test rc5-mm yet because it won't boot on this box. Applying git-e1000 directly to -rc4 or -rc5 results in a number of rejects that I'm not sure how to fix. Some are obvious, but the others I'm unsure of. that won't work. You either need to start with 2.6.20-rc5 (and pull the changes pending merge in netdev-2.6 from Jeff Garzik), I thought that's what I was doing when I applied git-e1000 to 2.6.20-rc5, but I guess not. or start with 2.6.20-rc4-mm1 and manually apply that patch I sent out on monday. A different combination of either of these two will not work, as they are completely different drivers. I'll try to work something out. can you include `ethtool ethX` output of the link down message and `ethtool -d ethX` as well? I'll need to dig up an 82572 and see what's up with that, I've not seen that problem before. ethtool output attached. that clearly shows that the PHY detected link up status and that all is well as far as the driver and NIC is concerned. This bug really needs to be moved to linux-pci where the folks who know interrupt handling best can handle it. Auke - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection
On Sat, Jan 20, 2007 at 08:05:07AM +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote: > Hello. > > In article <[EMAIL PROTECTED]> (at Fri, 19 Jan 2007 16:23:14 -0500), Neil > Horman <[EMAIL PROTECTED]> says: > > > Patch to Implement IPv6 RFC 4429 (Optimistic Duplicate Address Detection). > > In > > Good work. We will see if this would break core and basic ipv6 code. > Dave, please hold on. > Thank you. I'll implement your requested changes and repost monday afternoon Regards Neil > Some quick comments. > > > --- a/include/net/ipv6.h > > +++ b/include/net/ipv6.h > > @@ -110,6 +110,7 @@ struct frag_hdr { > > /* sysctls */ > > extern int sysctl_ipv6_bindv6only; > > extern int sysctl_mld_max_msf; > > +extern int sysctl_optimistic_dad; > > > : > > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c > > index 2a7e461..f7afb2a 100644 > > --- a/net/ipv6/addrconf.c > > +++ b/net/ipv6/addrconf.c > > @@ -206,6 +206,8 @@ static struct ipv6_devconf ipv6_devconf_dflt > > __read_mostly = { > > .proxy_ndp = 0, > > }; > > > > +int sysctl_optimistic_dad = 1; > > + > > Please put this into ipv6_devconf{} and make it per-interface variable. > And I think default should be kept off (0). > > > /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */ > > #if 0 > > const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT; > > @@ -830,7 +832,8 @@ retry: > > ift = !max_addresses || > > ipv6_count_addresses(idev) < max_addresses ? 
> > ipv6_add_addr(idev, &addr, tmp_plen, > > - ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, > > IFA_F_TEMPORARY) : NULL; > > + ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, > > + IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL; > > if (!ift || IS_ERR(ift)) { > > in6_ifa_put(ifp); > > in6_dev_put(idev); > > Please align ipv6_addr_type and IFA_F_TEMPORARY > > > @@ -1174,7 +1177,8 @@ int ipv6_get_saddr(struct dst_entry *dst, > > } > > > > > > -int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr) > > +int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, > > + unsigned char banned_flags) > > { > > struct inet6_dev *idev; > > int err = -EADDRNOTAVAIL; > > Please align "struct net_device" and "unsigned char". > > > @@ -1185,7 +1189,7 @@ int ipv6_get_lladdr(struct net_device *dev, struct > > in6_addr *addr) > > > > read_lock_bh(&idev->lock); > > for (ifp=idev->addr_list; ifp; ifp=ifp->if_next) { > > - if (ifp->scope == IFA_LINK && > > !(ifp->flags&IFA_F_TENTATIVE)) { > > + if (ifp->scope == IFA_LINK && > > !(ifp->flags&banned_flags)) { > > ipv6_addr_copy(addr, &ifp->addr); > > err = 0; > > break; > > @@ -1742,7 +1746,7 @@ ok: > > It is not your fault, but please put a space around "&". > > > if (!max_addresses || > > ipv6_count_addresses(in6_dev) < max_addresses) > > ifp = ipv6_add_addr(in6_dev, &addr, > > pinfo->prefix_len, > > - > > addr_type&IPV6_ADDR_SCOPE_MASK, 0); > > + > > addr_type&IPV6_ADDR_SCOPE_MASK,0); > > > > if (!ifp || IS_ERR(ifp)) { > > in6_dev_put(in6_dev); > > Please do no kill space after ",". > > > @@ -2123,7 +2132,8 @@ static void addrconf_add_linklocal(struct inet6_dev > > *idev, struct in6_addr *addr > > { > > struct inet6_ifaddr * ifp; > > > > - ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, IFA_F_PERMANENT); > > + ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, > > + IFA_F_PERMANENT|IFA_F_OPTIMISTIC); > > if (!IS_ERR(ifp)) { > > addrconf_dad_start(ifp, 0); > > in6_ifa_put(ifp); > > Please align idev and IFA_F_PERMANENT. 
> > > @@ -542,7 +556,8 @@ void ndisc_send_ns(struct net_device *dev, struct > > neighbour *neigh, > > int send_llinfo; > > > > if (saddr == NULL) { > > - if (ipv6_get_lladdr(dev, &addr_buf)) > > + if (ipv6_get_lladdr(dev, &addr_buf, > > + (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC))) > > return; > > saddr = &addr_buf; > > } > > ditto... ("dev" and "(") > > > + and optimistic) are false then we can just fail > > + dad now. > > + */ > > + type = ipv6_addr_type(saddr); > > + if (!((ifp->flags & IFA_F_OPTIMISTIC) && > > + (type & IPV6_ADDR_UNICAST))) { > > + addrconf_dad_failure(ifp); > > + return; > > + } > > } > > > > idev = ifp->idev;
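For readers following the review above: the ipv6_get_lladdr() hunk replaces the fixed IFA_F_TENTATIVE test with a caller-supplied banned_flags mask. A minimal userspace sketch of that selection rule follows. The struct, the DEMO_F_* values and pick_lladdr() are made up for illustration and are not the kernel's definitions; only the masking idea is taken from the quoted patch.

```c
#include <stddef.h>

/* Illustrative flag bits; the kernel's IFA_F_* values may differ. */
#define DEMO_F_OPTIMISTIC 0x04
#define DEMO_F_TENTATIVE  0x40

/* Hypothetical stand-in for an interface address entry. */
struct demo_ifaddr {
	unsigned char flags;		/* state bits for this address */
	int link_scope;			/* non-zero if the address is link-local */
	struct demo_ifaddr *next;
};

/*
 * Return the first link-local address whose flags share no bit with
 * banned_flags, or NULL if no address qualifies.
 */
static const struct demo_ifaddr *
pick_lladdr(const struct demo_ifaddr *list, unsigned char banned_flags)
{
	const struct demo_ifaddr *ifp;

	for (ifp = list; ifp; ifp = ifp->next) {
		if (ifp->link_scope && !(ifp->flags & banned_flags))
			return ifp;
	}
	return NULL;
}
```

With a mask of DEMO_F_TENTATIVE alone an optimistic address is still eligible; a caller that passes both bits, as the ndisc_send_ns() hunk does, skips it.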
[PATCH 1/4] bonding: fix device name allocation error
The code to select names for the bonding interfaces was, for the non-sysfs creation case, always using a hard-coded set of bond0, bond1, etc, up to max_bonds. This caused conflicts for the second or subsequent loads of the module. Changed the code to obtain device names from dev_alloc_name().

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6482aed..07b9d1f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4704,6 +4704,7 @@ static int bond_check_params(struct bond
 static struct lock_class_key bonding_netdev_xmit_lock_key;
 
 /* Create a new bond based on the specified name and bonding parameters.
+ * If name is NULL, obtain a suitable "bond%d" name for us.
  * Caller must NOT hold rtnl_lock; we need to release it here before we
  * set up our sysfs entries.
  */
@@ -4713,7 +4714,8 @@ int bond_create(char *name, struct bond_
 	int res;
 
 	rtnl_lock();
-	bond_dev = alloc_netdev(sizeof(struct bonding), name, ether_setup);
+	bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : "",
+				ether_setup);
 	if (!bond_dev) {
 		printk(KERN_ERR DRV_NAME
 		       ": %s: eek! can't alloc netdev!\n",
@@ -4722,6 +4724,12 @@ int bond_create(char *name, struct bond_
 		goto out_rtnl;
 	}
 
+	if (!name) {
+		res = dev_alloc_name(bond_dev, "bond%d");
+		if (res < 0)
+			goto out_netdev;
+	}
+
 	/* bond_init() must be called after dev_alloc_name() (for the
 	 * /proc files), but before register_netdevice(), because we
 	 * need to set function pointers.
@@ -4763,7 +4771,6 @@ static int __init bonding_init(void)
 {
 	int i;
 	int res;
-	char new_bond_name[8];  /* Enough room for 999 bonds at init. */
 
 	printk(KERN_INFO "%s", version);
 
@@ -4776,8 +4783,7 @@ #ifdef CONFIG_PROC_FS
 	bond_create_proc_dir();
 #endif
 	for (i = 0; i < max_bonds; i++) {
-		sprintf(new_bond_name, "bond%d",i);
-		res = bond_create(new_bond_name,&bonding_defaults, NULL);
+		res = bond_create(NULL, &bonding_defaults, NULL);
 		if (res)
 			goto err;
 	}
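The idea behind asking for a "bond%d" name instead of hard-coding bond0, bond1, ... can be sketched in userspace as below. This is only an illustration of "pick the lowest index not already taken", which is the behaviour the patch appears to rely on; demo_alloc_name(), DEMO_NAME_LEN and the taken[] list are hypothetical and say nothing about how dev_alloc_name() is actually implemented.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_NAME_LEN 16

/*
 * Write the first "bond<N>" name that does not appear in taken[] into
 * out. Returns N on success, or -1 if every index below max_index is
 * already in use.
 */
static int demo_alloc_name(const char *const taken[], size_t ntaken,
			   int max_index, char out[DEMO_NAME_LEN])
{
	int i;
	size_t j;

	for (i = 0; i < max_index; i++) {
		int used = 0;

		snprintf(out, DEMO_NAME_LEN, "bond%d", i);
		for (j = 0; j < ntaken; j++) {
			if (strcmp(taken[j], out) == 0) {
				used = 1;
				break;
			}
		}
		if (!used)
			return i;
	}
	out[0] = '\0';
	return -1;
}
```

A second load of the module then gets the next free name rather than colliding with a bond0 that an earlier load already registered.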
[PATCH 2/4]: bonding: fix error check in sysfs creation
The existing code did not correctly handle failures to create the per-interface sysfs group for bonding. Modified code to notice errors, and correctly unwind.

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 07b9d1f..d3801a0 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4756,14 +4756,19 @@ int bond_create(char *name, struct bond_
 	rtnl_unlock(); /* allows sysfs registration of net device */
 	res = bond_create_sysfs_entry(bond_dev->priv);
-	goto done;
+	if (res < 0) {
+		rtnl_lock();
+		goto out_bond;
+	}
+
+	return 0;
+
 out_bond:
 	bond_deinit(bond_dev);
 out_netdev:
 	free_netdev(bond_dev);
 out_rtnl:
 	rtnl_unlock();
-done:
 	return res;
 }
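The unwind order this patch restores can be shown with a small self-contained model. Everything here (struct demo_state, demo_create(), the fail_at parameter) is hypothetical; it is a sketch of the goto-label pattern, assuming each label undoes exactly one earlier step, and not the bonding code itself.

```c
/* Tracks which resources are currently held. */
struct demo_state {
	int locked;		/* 1 while the lock is held */
	int allocated;		/* 1 while the device is allocated */
	int initialised;	/* 1 while the device is set up */
};

/*
 * fail_at selects the step that fails: 1 = allocation,
 * 2 = initialisation, 3 = sysfs entry, 0 = nothing fails.
 * Returns 0 on success and -1 on failure, with every earlier step
 * undone in reverse order.
 */
static int demo_create(struct demo_state *s, int fail_at)
{
	int res = -1;

	s->locked = 1;
	if (fail_at == 1)		/* allocation failed */
		goto out_lock;
	s->allocated = 1;

	if (fail_at == 2)		/* initialisation failed */
		goto out_netdev;
	s->initialised = 1;

	s->locked = 0;			/* drop the lock for the sysfs step */
	if (fail_at == 3) {		/* sysfs step failed */
		s->locked = 1;		/* retake it so the labels below apply */
		goto out_bond;
	}
	return 0;

out_bond:
	s->initialised = 0;
out_netdev:
	s->allocated = 0;
out_lock:
	s->locked = 0;
	return res;
}
```

The point of retaking the lock before jumping is that the shared labels can then assume the lock is held, which matches the rtnl_lock() call added before "goto out_bond" in the patch.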
[PATCH 4/4]: bonding: update version
Update version number to reflect recent changes.

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>

diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index dc434fb..6123b90 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -22,8 +22,8 @@ #include
 #include
 #include "bond_3ad.h"
 #include "bond_alb.h"
-#define DRV_VERSION	"3.1.1"
-#define DRV_RELDATE	"September 26, 2006"
+#define DRV_VERSION	"3.1.2"
+#define DRV_RELDATE	"January 20, 2007"
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
[PATCH 0/4] bonding: fix module multiple load issues
Patch 1: fix device name allocation error
Patch 2: fix error check in sysfs creation
Patch 3: modify sysfs support to permit multiple loads
Patch 4: update version number

This patch series should resolve whatever problems there are with the logic to load the module multiple times. This code changed during the introduction of sysfs, and some recent tightening of the sysfs creation code (checking for duplicates) broke that.

The multiple load logic is used primarily by the initscripts and sysconfig packages, to automatically configure multiple bonding interfaces at boot time.

Originally reported by Patrick McHardy <[EMAIL PROTECTED]>.

Patches generated against netdev-2.6 (hope that's ok).

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
[PATCH 3/4] bonding: modify sysfs support to permit multiple loads
The existing code would blindly attempt to create the bonding_masters file (in /sys/class/net) every time the module was loaded. When the module is loaded multiple times (which is the historical method used by initscripts and sysconfig to create multiple bonding interfaces), this caused load failure of the second module load attempt, as the creation request would fail.

This changes the code to note the failure, arrange to not remove the bonding_masters file upon module exit, and then return success.

Bonding interfaces created by the second or subsequent loads of the module will not exist in bonding_masters. This is not a significant change, as previously only the interfaces from the most recent load of the module would be listed. Both situations are less than optimal, but this case permits compatibility with existing distro configuration scripts, and is consistent.

Note that previously, the sysfs create request would overwrite the existing bonding_masters file and succeed, allowing multiple loads of the module. The sysfs code has recently changed to return an error if the file being created already exists.

Patrick McHardy <[EMAIL PROTECTED]>, who reported this problem, observed crashes on the old kernel (before sysfs checked for duplicates). I did not experience such crashes, but this change should resolve them.

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index ced9ed8..8e317e1 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1372,6 +1372,21 @@ int bond_create_sysfs(void)
 		return -ENODEV;
 
 	ret = class_create_file(netdev_class, &class_attr_bonding_masters);
+	/*
+	 * Permit multiple loads of the module by ignoring failures to
+	 * create the bonding_masters sysfs file.  Bonding devices
+	 * created by second or subsequent loads of the module will
+	 * not be listed in, or controllable by, bonding_masters, but
+	 * will have the usual "bonding" sysfs directory.
+	 *
+	 * This is done to preserve backwards compatibility for
+	 * initscripts/sysconfig, which load bonding multiple times to
+	 * configure multiple bonding devices.
+	 */
+	if (ret == -EEXIST) {
+		netdev_class = NULL;
+		return 0;
+	}
 
 	return ret;
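The "treat -EEXIST as success, but remember not to clean up" rule from this patch can be modelled in a few lines. This is a sketch under stated assumptions: demo_create_file() stands in for the sysfs create call, and the owns_file flag plays the role that setting netdev_class to NULL appears to play in the patch. Both names are hypothetical.

```c
#include <errno.h>

/* Stand-in for the sysfs create call: fails if the file already exists. */
static int demo_create_file(int *file_exists)
{
	if (*file_exists)
		return -EEXIST;
	*file_exists = 1;
	return 0;
}

/*
 * Returns 0 on success. *owns_file is set to 1 only when this load
 * created the file, so only that load should remove it on exit.
 */
static int demo_create_sysfs(int *file_exists, int *owns_file)
{
	int ret = demo_create_file(file_exists);

	if (ret == -EEXIST) {
		*owns_file = 0;		/* an earlier load owns the file */
		return 0;
	}
	*owns_file = (ret == 0);
	return ret;
}
```

The first load creates the file and owns it; later loads succeed without touching it, which is the compatibility behaviour the commit message describes.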
Re: [PATCH REPOST 1/2] NET: Accurate packet scheduling for ATM/ADSL (kernel)
On Fri, 2007-01-19 at 13:19 +0100, Patrick McHardy wrote:
> Russell Stuart wrote:
> > I thought that some degree of compatibility was
> > expected. At the very least the newest version
> > of "tc" must work on _any_ kernel as least as
> > well as the version it replaces did.
> >
> > I also though newer kernels should work older
> > version of iproute2, albeit without the features
> > added in the newer versions.
> >
> > Are you saying this is not so?
>
> No, thats exactly what I'm saying.

I don't understand - too many negates here without parens. Are you saying:

a. Backward / Forward compatibility between the kernel and its user space tools isn't an issue, or
b. There is no compatibility problem.
Re: [patch 08/12] net namespace : find namespace by addr
On Fri, Jan 19, 2007 at 04:47:22PM +0100, [EMAIL PROTECTED] wrote: > From: Daniel Lezcano <[EMAIL PROTECTED]> > > Switch to the the l3 namespace using the destination address. > > Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> > > --- > include/linux/net_namespace.h |7 +++ > net/core/net_namespace.c | 35 +++ > net/ipv4/ip_input.c | 16 +++- > 3 files changed, 57 insertions(+), 1 deletion(-) > > Index: 2.6.20-rc4-mm1/net/ipv4/ip_input.c > === > --- 2.6.20-rc4-mm1.orig/net/ipv4/ip_input.c > +++ 2.6.20-rc4-mm1/net/ipv4/ip_input.c > @@ -374,6 +374,9 @@ > { > struct iphdr *iph; > u32 len; > + int err; > + struct net_namespace *net_ns = current_net_ns; > + struct net_namespace *dst_net_ns = NULL; > > /* When the interface is in promisc. mode, drop all the crap >* that it receives, do not try to analyse it. > @@ -393,6 +396,9 @@ > > iph = skb->nh.iph; > > + dst_net_ns = net_ns_find_from_dest_addr(iph->daddr); > + if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns)) > + push_net_ns(dst_net_ns); > /* >* RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails > the checksum. 
>* > @@ -431,10 +437,18 @@ > /* Remove any debris in the socket control block */ > memset(IPCB(skb), 0, sizeof(struct inet_skb_parm)); > > - return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, > + err = NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL, > ip_rcv_finish); > > + if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns)) > + pop_net_ns(net_ns); > + > + return err; > + > inhdr_error: > + if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns)) > + pop_net_ns(net_ns); > + > IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS); > drop: > kfree_skb(skb); > Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h > === > --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h > +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h > @@ -99,6 +99,8 @@ > extern __be32 net_ns_select_source_address(const struct net_device *dev, > u32 dst, int scope); > > +extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr); > + > #define SELECT_SRC_ADDR net_ns_select_source_address > > #else /* CONFIG_NET_NS */ > @@ -167,6 +169,11 @@ > return 0; > } > > +static inline struct net_namespace *net_ns_find_from_dest_addr(u32 daddr) > +{ > + return NULL; > +} > + > #define SELECT_SRC_ADDR inet_select_addr > > #endif /* !CONFIG_NET_NS */ > Index: 2.6.20-rc4-mm1/net/core/net_namespace.c > === > --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c > +++ 2.6.20-rc4-mm1/net/core/net_namespace.c > @@ -385,4 +385,39 @@ > out: > return addr; > } > + > +/* > + * This function finds the network namespace destination deduced from > + * the destination address. The network namespace is retrieved from > + * the ifaddr owned by a network namespace this basically disallows to 'share' IPs between namespaces, as it is permitted in Linux-VServer right now, or am I misinterpreting this? 
TIA, Herbert > + * @daddr : destination > + * Returns : the network namespace destination or NULL if not found > + */ > +struct net_namespace *net_ns_find_from_dest_addr(u32 daddr) > +{ > + struct net_namespace *net_ns = NULL; > + struct net_device *dev; > + struct in_device *in_dev; > + > + if (LOOPBACK(daddr)) > + return current_net_ns; > + > + read_lock(&dev_base_lock); > + rcu_read_lock(); > + for (dev = dev_base; dev; dev = dev->next) { > + if ((in_dev = __in_dev_get_rcu(dev)) == NULL) > + continue; > + for_ifa(in_dev) { > + if (ifa->ifa_local == daddr) { > + net_ns = ifa->ifa_net_ns; > + goto out_unlock_both; > + } > + } endfor_ifa(in_dev); > + } > +out_unlock_both: > + read_unlock(&dev_base_lock); > + rcu_read_unlock(); > + > + return net_ns; > +} > #endif /* CONFIG_NET_NS */ > > -- > ___ > Containers mailing list > [EMAIL PROTECTED] > https://lists.osdl.org/mailman/listinfo/containers - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
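The lookup quoted above walks every device and every address on it, returning the namespace of the first address that equals the destination. A userspace sketch of that walk is below. The demo_* types, the integer namespace ids and the host-byte-order addresses are simplifications made up for illustration; locking and the loopback special case are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an interface address with an owner. */
struct demo_ifa {
	uint32_t local;			/* address, host byte order for simplicity */
	int ns_id;			/* id of the owning namespace */
	struct demo_ifa *next;
};

/* Hypothetical stand-in for a network device and its address list. */
struct demo_dev {
	struct demo_ifa *ifa_list;
	struct demo_dev *next;
};

/*
 * Return the namespace id of the first address equal to daddr,
 * or -1 if no device carries that address.
 */
static int demo_ns_from_daddr(const struct demo_dev *devs, uint32_t daddr)
{
	const struct demo_dev *dev;
	const struct demo_ifa *ifa;

	for (dev = devs; dev; dev = dev->next)
		for (ifa = dev->ifa_list; ifa; ifa = ifa->next)
			if (ifa->local == daddr)
				return ifa->ns_id;
	return -1;
}
```

Because the walk stops at the first match, an address configured a second time under another namespace is never reached, which is the property the question about sharing IPs between namespaces is pointing at.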
Re: [patch 12/12] net namespace : Add broadcasting
On Fri, Jan 19, 2007 at 04:47:26PM +0100, [EMAIL PROTECTED] wrote: > From: Daniel Lezcano <[EMAIL PROTECTED]> > > Broadcast packets should be delivered to l2 and all l3 childs hmm, really? shouldn't it only reach those which actually have related addresses assigned? best, Herbert > Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> > > --- > include/linux/net_namespace.h | 11 +++ > net/core/net_namespace.c | 27 +++ > net/ipv4/udp.c|3 ++- > 3 files changed, 40 insertions(+), 1 deletion(-) > > Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h > === > --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h > +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h > @@ -9,6 +9,7 @@ > > struct in_ifaddr; > struct sk_buff; > +struct sock; > > struct net_namespace { > struct kref kref; > @@ -109,6 +110,9 @@ > > extern void net_ns_tag_sk_buff(struct sk_buff *skb); > > +extern int net_ns_sock_is_visible(const struct sock *sk, > + const struct net_namespace *net_ns); > + > #define SELECT_SRC_ADDR net_ns_select_source_address > > #else /* CONFIG_NET_NS */ > @@ -192,6 +196,13 @@ > { > ; > } > + > +static inline int net_ns_sock_is_visible(const struct sock *sk, > + const struct net_namespace *net_ns) > +{ > + return 1; > +} > + > #define SELECT_SRC_ADDR inet_select_addr > > #endif /* !CONFIG_NET_NS */ > Index: 2.6.20-rc4-mm1/net/core/net_namespace.c > === > --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c > +++ 2.6.20-rc4-mm1/net/core/net_namespace.c > @@ -17,6 +17,7 @@ > #include > > #include > +#include > > struct net_namespace init_net_ns = { > .kref = { > @@ -464,4 +465,30 @@ > struct net_namespace *net_ns = current_net_ns; > skb->net_ns = net_ns; > } > + > +/* > + * This function checks if the socket is visible from the specified > + * namespace. This is needed to ensure the broadcast and the multicast > + * for multiple network namespace l2 and l3 to have the packets to be > + * delivered. 
If we have a l3 namespace and its parent (l2 namespace) > + * listening on a broadcast address, we should deliver the packet to > + * both. That is done by the udp_v4_mcast_next function. But we should > + * find a common point between sockets which are relatives to a > + * namespace. The common point is they have the same parent in case > + * of l3 network namespace. > + * @sk : the socket to be checked > + * @net_ns : the receiving network namespace > + * Returns: 1 if the socket is visible by the namespace, 0 otherwise. > + */ > +int net_ns_sock_is_visible(const struct sock *sk, > +const struct net_namespace *net_ns) > +{ > + if (net_ns->level == NET_NS_LEVEL3) > + net_ns = net_ns->parent; > + > + if (sk->sk_net_ns->level == NET_NS_LEVEL3) > + return sk->sk_net_ns->parent == net_ns; > + else > + return sk->sk_net_ns == net_ns; > +} > #endif /* CONFIG_NET_NS */ > Index: 2.6.20-rc4-mm1/net/ipv4/udp.c > === > --- 2.6.20-rc4-mm1.orig/net/ipv4/udp.c > +++ 2.6.20-rc4-mm1/net/ipv4/udp.c > @@ -309,9 +309,10 @@ > (inet->dport != rmt_port && inet->dport)|| > (inet->rcv_saddr && inet->rcv_saddr != loc_addr)|| > ipv6_only_sock(s) || > - !net_ns_match(sk->sk_net_ns, ns)|| > (s->sk_bound_dev_if && s->sk_bound_dev_if != dif)) > continue; > + if (!net_ns_sock_is_visible(sk, ns)) > + continue; > if (!ip_mc_sf_allow(s, loc_addr, rmt_addr, dif)) > continue; > goto found; > > -- > ___ > Containers mailing list > [EMAIL PROTECTED] > https://lists.osdl.org/mailman/listinfo/containers - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
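The visibility rule in net_ns_sock_is_visible() reduces to "map each side to its L2 namespace and compare". A self-contained sketch of that reduction follows; the demo_* names are hypothetical, and it assumes, as the quoted comment states, that every L3 namespace has an L2 parent.

```c
#include <stddef.h>

enum demo_level { DEMO_L2, DEMO_L3 };

/* Hypothetical stand-in for a network namespace. */
struct demo_ns {
	enum demo_level level;
	struct demo_ns *parent;		/* the L2 parent of an L3 namespace */
};

/* Map an L3 namespace to its L2 parent; an L2 namespace maps to itself. */
static const struct demo_ns *demo_l2_of(const struct demo_ns *ns)
{
	return ns->level == DEMO_L3 ? ns->parent : ns;
}

/* Return 1 if a socket owned by sock_ns is visible from rx_ns, else 0. */
static int demo_sock_is_visible(const struct demo_ns *sock_ns,
				const struct demo_ns *rx_ns)
{
	return demo_l2_of(sock_ns) == demo_l2_of(rx_ns);
}
```

Under this rule two L3 siblings see each other's sockets for broadcast delivery whether or not they hold a related address, which seems to be the behaviour being questioned in the reply above.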
Re: [patch 00/12] net namespace : L3 namespace - introduction
On Fri, Jan 19, 2007 at 04:47:14PM +0100, [EMAIL PROTECTED] wrote: > This patchset provide a network isolation similar at what > Linux-Vserver provides. It is based on the L2 namespaces and relies on > the mechanisms provided by the namespace. This L3 namespaces does not > aim to bring full virtualization for the network, it provides an IP > isolation which can be reused for Linux-Vserver, jailed application or > application containers. > > A L3 namespace are always L2 s' childs and they can not create more > network namespaces, furthermore, they lose their NET_ADMIN > capability. They share their parent's network ressources. From the > parent namespace, IP addresses are created and assigned to the > different L3 childs. From this point, L3 namespaces can use their > assigned IP address and all computed broadcast addresses. ~~~ okay, I conclude that this only handles a single address for now. what are your plans to handle entire sets? TIA, Herbert > Because the L3 namespace relies on the L2 virtualization mechanisms, > it is possible to have several L3 namespaces listening on > INADDR_ANY:port without conflict, that's allow to run several server > without modifying the network configuration. > > The loopback is a shared device between all L3 namespaces. To ensure > the 127.0.0.1 address isolation, the sender store its namespace into > the packet, so when the packet arrives, the destination namespace is > already set, because "source" == "destination". By this way, it is > easy to disable the loopback isolation and let the application to talk > with application outside of the namespace via the 127.0.0.1 because we > consider them trusted (like portmap). > > The ifconfig / ip commands will only show IP addresses assigned to the > L3 namespace. When a L3 namespace dies, the assigned IP address is > released to its parent. > > At the IP level, when a packet arrives, the L3 network namespace > destination is retrieved from the destination address. 
> > At the bind time, the address is checked against the assigned IP > address. > > -- > ___ > Containers mailing list > [EMAIL PROTECTED] > https://lists.osdl.org/mailman/listinfo/containers - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 05/12] net namespace : ioctl to push ifa to net namespace l3
On Fri, Jan 19, 2007 at 04:47:19PM +0100, [EMAIL PROTECTED] wrote: > From: Daniel Lezcano <[EMAIL PROTECTED]> > > New ioctl to "push" ifaddr to a container. Actually, the push is done > from the current namespace, so the right word is "pull". That will be > changed to move ifaddr from l2 network namespace to l3. > > Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> > > --- > include/linux/net_namespace.h |7 ++ > include/linux/sockios.h |4 + > net/core/net_namespace.c | 118 > +- > net/ipv4/af_inet.c|4 + > 4 files changed, 132 insertions(+), 1 deletion(-) > > Index: 2.6.20-rc4-mm1/include/linux/sockios.h > === > --- 2.6.20-rc4-mm1.orig/include/linux/sockios.h > +++ 2.6.20-rc4-mm1/include/linux/sockios.h > @@ -122,6 +122,10 @@ > #define SIOCBRADDIF 0x89a2 /* add interface to bridge */ > #define SIOCBRDELIF 0x89a3 /* remove interface from bridge */ > > +/* Container calls */ > +#define SIOCNETNSPUSHIF 0x89b0 /* add ifaddr to namespace */ > +#define SIOCNETNSPULLIF 0x89b1 /* remove ifaddr to namespace */ ~~~ from > + > /* Device private ioctl calls */ > > /* > Index: 2.6.20-rc4-mm1/net/ipv4/af_inet.c > === > --- 2.6.20-rc4-mm1.orig/net/ipv4/af_inet.c > +++ 2.6.20-rc4-mm1/net/ipv4/af_inet.c > @@ -789,6 +789,10 @@ > case SIOCSIFFLAGS: > err = devinet_ioctl(cmd, (void __user *)arg); > break; > + case SIOCNETNSPUSHIF: > + case SIOCNETNSPULLIF: > + err = net_ns_ioctl(cmd, (void __user *)arg); > + break; > default: > if (sk->sk_prot->ioctl) > err = sk->sk_prot->ioctl(sk, cmd, arg); > Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h > === > --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h > +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h > @@ -91,6 +91,8 @@ > > #define net_ns_hash(ns) ((ns)->hash) > > +extern int net_ns_ioctl(unsigned int cmd, void __user *arg); > + > #else /* CONFIG_NET_NS */ > > #define INIT_NET_NS(net_ns) > @@ -141,6 +143,11 @@ > > #define net_ns_hash(ns) (0) > > +static inline int net_ns_ioctl(unsigned int cmd, void __user *arg) > +{ > + return 
-ENOSYS; > +} > + > #endif /* !CONFIG_NET_NS */ > > #endif /* _LINUX_NET_NAMESPACE_H */ > Index: 2.6.20-rc4-mm1/net/core/net_namespace.c > === > --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c > +++ 2.6.20-rc4-mm1/net/core/net_namespace.c > @@ -10,7 +10,9 @@ > #include > #include > #include > +#include > #include > +#include > #include > > struct net_namespace init_net_ns = { > @@ -123,6 +125,33 @@ > return err; > } > > +/* > + * The function will move the ifaddr to the l2 network namespace > + * parent. > + * @net_ns: the related network namespace > + */ > +static void release_ifa_to_parent(const struct net_namespace* net_ns) > +{ > + struct net_device *dev; > + struct in_device *in_dev; > + > + read_lock(&dev_base_lock); > + rcu_read_lock(); > + for (dev = dev_base; dev; dev = dev->next) { > + in_dev = __in_dev_get_rcu(dev); > + if (!in_dev) > + continue; > + > + for_ifa(in_dev) { > + if (ifa->ifa_net_ns != net_ns) > + continue; > + ifa->ifa_net_ns = net_ns->parent; > + } endfor_ifa(in_dev); > + } > + read_unlock(&dev_base_lock); > + rcu_read_unlock(); > +} > + > void free_net_ns(struct kref *kref) > { > struct net_namespace *ns; > @@ -139,12 +168,99 @@ > } > } > > - if (ns->level == NET_NS_LEVEL3) > + if (ns->level == NET_NS_LEVEL3) { > + release_ifa_to_parent(ns); > put_net_ns(ns->parent); > + } > > printk(KERN_DEBUG "NET_NS: net namespace %p destroyed\n", ns); > kfree(ns); > } > EXPORT_SYMBOL_GPL(free_net_ns); > > +/* > + * This function allows to assign an IP address from a l2 network > + * namespace to one of his l3 child or to release from an l3 network > + * namespace to his l2 network namespace parent. hmm, sounds like the address is moved between the namespaces? does that mean that the 'parent' will not see the 'isolated' ip anymore? TIA, Herbert > + * @cmd: a "push" / "pull" command > + * @arg: an userspace buffer containing an ifreq structure > + * Returns: > + * - EPERM : if caller has no CAP_NET_ADMIN capabilities or the > + * current level of networ
[Fwd: Re: [PATCH 1/10] cxgb3 - main header files]
Hey Roland,

Jeff has pulled in the Chelsio Ethernet driver. If you are ready to merge in the RDMA driver, you can pull it from

  git://staging.openfabrics.org/~swise/cxgb3.git for-roland

Thanks,
Steve.

Forwarded Message
From: Jeff Garzik <[EMAIL PROTECTED]>
To: Divy Le Ray <[EMAIL PROTECTED]>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, [EMAIL PROTECTED]
Subject: Re: [PATCH 1/10] cxgb3 - main header files
Date: Thu, 18 Jan 2007 22:05:02 -0500

Divy Le Ray wrote:
> Jeff Garzik wrote:
>> Divy Le Ray wrote:
>>> From: Divy Le Ray <[EMAIL PROTECTED]>
>>>
>>> This patch implements the main header files of
>>> the Chelsio T3 network driver.
>>>
>>> Signed-off-by: Divy Le Ray <[EMAIL PROTECTED]>
>>
>> Once you think it's ready, email me a URL to a single patch that adds
>> the driver to the latest linux-2.6.git kernel. Include in the email a
>> description of the driver and signed-off-by line, which will get
>> directly included in the git changelog.
>>
>> Adding new drivers is a bit special, because we want to merge it as a
>> single changeset, but that would create a patch too large to review on
>> the common kernel mailing lists.
> Jeff,
>
> You can grab the monolithic patch at this URL:
> http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

applied to netdev-2.6.git#upstream

I'm really counting on Chelsio to actively maintain this driver, unlike the abandonware you guys first submitted.

Jeff
Re: [Fwd: Re: [PATCH 1/10] cxgb3 - main header files]
> Jeff has pulled in the Chelsio Ethernet driver. If you are ready to
> merge in the RDMA driver, you can pull it from

Yes, I saw that... OK, I'll get serious about reviewing the RDMA stuff.