date:20070119

Re: Please pull 'upstream' branch of wireless-2.6

2007-01-19 Thread John W. Linville

On Thu, Jan 18, 2007 at 10:10:47PM -0500, Jeff Garzik wrote:
> John W. Linville wrote:
> >The following changes since commit 
> >10764889c6355cbb335cf0578ce12427475d1a65:
> >  Larry Finger (1):
> >bcm43xx: Fix failure to deliver PCI-E interrupts
> >
> >are found in the git repository at:
> >
> >  git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git 
> >  upstream
> 
> ACK.  Open question of parentage, though:  I just rebased 
> netdev-2.6.git#upstream.  Is your wireless-2.6 affected by this rebase?
> 
> If not, I will go ahead and pull.

Right now it looks like this:

Linus's tree -> my upstream-fixes branch -> my upstream branch

So, I think it should be fine for you to pull.

Thanks,

John
-- 
John W. Linville
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/12] L2 network namespace (v3)

2007-01-19 Thread Dmitry Mishin

On Friday 19 January 2007 10:27, Eric W. Biederman wrote:
> YOSHIFUJI Hideaki / 吉藤英明 <[EMAIL PROTECTED]> writes:
> 
> > In article <[EMAIL PROTECTED]> (at Wed, 17 Jan 2007 18:51:14
> > +0300), Dmitry Mishin <[EMAIL PROTECTED]> says:
> >
> >> ===
> >> L2 network namespaces
> >> 
> >> The most straightforward concept of network virtualization is complete
> >> separation of namespaces, covering device list, routing tables, netfilter
> >> tables, socket hashes, and everything else.
> >> 
> >> On input path, each packet is tagged with namespace right from the
> >> place where it appears from a device, and is processed by each layer
> >> in the context of this namespace.
> >> Non-root namespaces communicate with the outside world in two ways: by
> >> owning hardware devices, or by receiving packets forwarded to them by
> >> their parent namespace via a pass-through device.
> >
> > Can you handle multicast / broadcast and IPv6, which are very important?
> 
> The basic idea here is very simple.
> 
> Each network namespace appears to user space as a separate network stack,
> with its own set of routing tables etc.
> 
> All sockets and all network devices (the sources of packets) belong
> to exactly one network namespace.  
> 
> From the socket or the network device through which a packet enters the
> network stack, you can infer the network namespace that it will be
> processed in.  Each network namespace should get its own complement of
> the data structures necessary to process packets, and everything should work.
> 
> Talking between namespaces is accomplished either through an external network,
> or through a special pseudo network device.  The simplest to implement
> is two network devices where all packets transmitted on one are received
> on the other.  Then by placing one network device in one namespace and
> the other in another namespace it looks like two machines connected by
> a crossover cable.
> 
> Once you have that in one namespace you can connect other namespaces
> with the existing ethernet bridging or by configuring one of the
> namespaces as a router and routing traffic between them.
> 
> 
> Supporting IPv6 is roughly as difficult as supporting IPv4.  
> 
> What needs to happen to convert code is that every variable either needs
> a per network namespace instance or its data structure needs to be
> modified to carry a network namespace tag.  For hash tables, which
> are hard to allocate dynamically, tagging is the preferred conversion
> method; for anything that is small enough, duplication is preferred,
> as it allows the existing logic to be kept.
> 
> In the fast path the impact of all of the conversions should be very light
> to non-existent.  In network stack initialization and cleanup there
> is work to do, because you are initializing and cleaning up variables more
> often than at module insertion and removal.
> 
> So my expectation is that once we get a framework established and merged
> to allow network namespaces eventually the entire network stack will be
> converted.  Not just ipv4 and ipv6 but decnet, ipx, iptables, fair scheduling,
> ethernet bridging and all of the other weird and twisty bits of the
> linux network stack.
Thanks, Eric, for such a descriptive comment. I can only sign off on it :)

> 
> The primary practical hurdle is there is a lot of networking code in
> the kernel.
> 
> I think I know a path by which we can incrementally merge support for
> network namespaces without breaking anything.  More to come on this
> when I finish up my demonstration patchset in a week or so that
> is complete enough to show what I am talking about.
> 
> I hope this helps put the concept into perspective.
I'll be waiting for it.

> 
> As for Dmitry's patchset in particular, it currently does not support
> IPv6, and I don't know where it is with respect to broadcast and
> multicast, but I don't see any immediate problems that would preclude
> those from working.  Any incompleteness is exactly that: incompleteness,
> an implementation problem and not a fundamental design issue.
Broadcasts/multicasts are supported.

-- 
Thanks,
Dmitry.
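
The two conversion strategies Eric describes above (duplicating small
state per namespace, tagging entries of a shared hash table) can be
sketched in user-space C as follows. This is only an illustration under
stated assumptions: `struct net_ns`, `struct sock_entry`, `sock_insert()`
and `sock_lookup()` are hypothetical names and are not taken from any of
the posted patchsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical namespace object: small state is simply duplicated here. */
struct net_ns {
	int id;
	int ip_forward;			/* per-namespace copy of a small variable */
};

/* Hypothetical socket entry living in ONE hash table shared by all namespaces. */
struct sock_entry {
	unsigned short port;
	const struct net_ns *ns;	/* namespace tag */
	struct sock_entry *next;
};

#define HASH_SIZE 16
static struct sock_entry *sock_hash[HASH_SIZE];

static unsigned int port_hash(unsigned short port)
{
	return port % HASH_SIZE;
}

static void sock_insert(struct sock_entry *e)
{
	unsigned int h = port_hash(e->port);

	e->next = sock_hash[h];
	sock_hash[h] = e;
}

/*
 * The lookup compares the tag as well as the key, so two namespaces can
 * use the same port without seeing each other's sockets.
 */
static struct sock_entry *sock_lookup(const struct net_ns *ns,
				      unsigned short port)
{
	struct sock_entry *e;

	for (e = sock_hash[port_hash(port)]; e; e = e->next)
		if (e->port == port && e->ns == ns)
			return e;
	return NULL;
}
```

With two namespaces that both own a socket on port 25, `sock_lookup()`
returns a different entry depending on the namespace passed in, while
each namespace keeps its own `ip_forward` value.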

[PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?

2007-01-19 Thread Jarek Poplawski

On 17-01-2007 15:12, Michael Tokarev wrote:
> Herbert Xu wrote:
>> On Tue, Jan 16, 2007 at 11:08:51AM +0300, Michael Tokarev wrote:
>>> Ok.  Here's another trace, from that remote network that triggers
>>> this thing more-or-less reliably (every 2nd transfer at least) --
>>> http://www.corpit.ru/mjt/bh-bad-cksum-dmp.bin . It's a full session
>>> between 216.168.29.244 - the requesting/receiving side -- and
>>> 81.13.94.6 -- our sending side (the file being transferred is some
>>> trojan horse I found on a friend's PC, so be careful ;)
>> I'll have a look at this tomorrow.
>>
>> Since you're certain that this is being seen on the wire, one
>> possibility is that we've got a bug somewhere that's zeroing
>> skb->ip_summed on a packet with a partial checksum.
> 
> Here's another sample, which may be more useful.  I've seen quite
> a lot of very similar stuff while running tcpdump.
> 
>   http://www.corpit.ru/mjt/bad-cksum-session3-dmp.bin
> 
> The scenario looks like this.
> 
> A client (82.84.172.37 -- a zombie machine trying to send us spam
> in this case) connects to port 25 here (81.13.94.6:25).  The SYN+ACK
> sequence completes.  Next, our server sends an initial SMTP greeting
> message, but almost right after that, the client sends a FIN packet,
> WITHOUT acknowledging that it received the (first and only) data
> packet.  So some time later our machine re-sends the data, AND adds
> the FIN flag to the packet (also replying to the FIN received from the
> client).  And *that* packet - the original data packet, modified
> to also include FIN - has an incorrect checksum.
> 
> So it looks like the checksum isn't being updated WHEN ADDING MORE
> FLAGS to the original data packet.
> 

Hi,

Here is my patch proposal. If I'm not totally wrong,
there is a possibility that, during collapsing, an empty
skb with FIN is added to a "normal" packet and changes
its ip_summed field to CHECKSUM_NONE.

Regards,
Jarek P.

PS: probably there are also other possibilities...
---

[PATCH][NET] tcp_output: rare bad TCP checksum with 2.6.19

The patch "Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE"
changed tcp_retrans_try_collapse() to copy the ip_summed field from the
collapsed skb unconditionally. This patch reverts that change.

All substantial work including heavy testing and diagnosing by:
Michael Tokarev <[EMAIL PROTECTED]>

Signed-off-by: Jarek Poplawski <[EMAIL PROTECTED]>
---

diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c linux-2.6.19/net/ipv4/tcp_output.c
--- linux-2.6.19-/net/ipv4/tcp_output.c 2006-11-29 22:57:37.0 +0100
+++ linux-2.6.19/net/ipv4/tcp_output.c  2007-01-19 07:58:39.0 +0100
@@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
 
 		memcpy(skb_put(skb, next_skb_size), next_skb->data, next_skb_size);
 
-		skb->ip_summed = next_skb->ip_summed;
+		if (next_skb->ip_summed == CHECKSUM_PARTIAL)
+			skb->ip_summed = CHECKSUM_PARTIAL;
 
 		if (skb->ip_summed != CHECKSUM_PARTIAL)
 			skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
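
To make the one-line change easier to follow, here is a minimal
user-space model of the flag handling only, assuming the data skb
carries CHECKSUM_PARTIAL and the empty FIN skb carries CHECKSUM_NONE.
`struct fake_skb`, the `CSUM_*` values and both helpers are hypothetical
stand-ins, not kernel definitions, and the model says nothing about how
the checksum is later computed.

```c
#include <assert.h>

/* Illustrative stand-ins for two of the kernel's ip_summed states. */
enum csum_state { CSUM_NONE, CSUM_PARTIAL };

struct fake_skb {
	enum csum_state ip_summed;
};

/* Behaviour before the patch: always copy the state of next_skb. */
static void collapse_unconditional(struct fake_skb *skb,
				   const struct fake_skb *next_skb)
{
	skb->ip_summed = next_skb->ip_summed;
}

/* Behaviour with the patch: only ever switch to CSUM_PARTIAL. */
static void collapse_patched(struct fake_skb *skb,
			     const struct fake_skb *next_skb)
{
	if (next_skb->ip_summed == CSUM_PARTIAL)
		skb->ip_summed = CSUM_PARTIAL;
}
```

In this model, collapsing a CSUM_NONE skb into a CSUM_PARTIAL one
downgrades the result to CSUM_NONE with the unconditional copy and
leaves it at CSUM_PARTIAL with the patched logic.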

Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?

2007-01-19 Thread Patrick McHardy

Jarek Poplawski wrote:
> Here is my patch proposal. If I'm not totally wrong,
> there is a possibility that, during collapsing, an empty
> skb with FIN is added to a "normal" packet and changes
> its ip_summed field to CHECKSUM_NONE.
> 
> diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c 
> linux-2.6.19/net/ipv4/tcp_output.c
> --- linux-2.6.19-/net/ipv4/tcp_output.c   2006-11-29 22:57:37.0 
> +0100
> +++ linux-2.6.19/net/ipv4/tcp_output.c2007-01-19 07:58:39.0 
> +0100
> @@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
>  
>   memcpy(skb_put(skb, next_skb_size), next_skb->data, 
> next_skb_size);
>  
> - skb->ip_summed = next_skb->ip_summed;
> + if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> + skb->ip_summed = CHECKSUM_PARTIAL;
>  
>   if (skb->ip_summed != CHECKSUM_PARTIAL)
>   skb->csum = csum_block_add(skb->csum, next_skb->csum, 
> skb_size);
> 

I noticed this too, but I can't see how it could lead to
a partial checksum on the wire since the checksumming is
done after changing ip_summed to CHECKSUM_NONE. Is this
patch verified to fix Michael's problem?
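
For context on the last lines of the quoted hunk: when ip_summed is not
CHECKSUM_PARTIAL, the software checksums of the two skbs are merged with
csum_block_add(). A minimal user-space sketch of that kind of merge on
16-bit folded one's-complement sums is shown below. `fold16()`,
`csum_partial16()` and `csum_block_add16()` are hypothetical helpers
written for this illustration; they are not the kernel functions, which
work on 32-bit partial sums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * One's-complement sum of a (small) buffer read as big-endian 16-bit
 * words; a trailing odd byte is padded with a zero low byte.
 */
static uint16_t csum_partial16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return fold16(sum);
}

/*
 * Append the sum of a second block that starts at byte `offset` of the
 * combined data. At an odd offset its bytes land in the other half of
 * each word, which is equivalent to byte-swapping its sum.
 */
static uint16_t csum_block_add16(uint16_t sum, uint16_t sum2, size_t offset)
{
	if (offset & 1)
		sum2 = (uint16_t)((sum2 << 8) | (sum2 >> 8));
	return fold16((uint32_t)sum + sum2);
}
```

Summing two blocks separately and merging them this way gives the same
result as summing the concatenated data, which is why a collapsed skb
does not need to be re-summed from scratch in the software case.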


Re: [PATCH REPOST 1/2] NET: Accurate packet scheduling for ATM/ADSL (kernel)

2007-01-19 Thread Patrick McHardy

Russell Stuart wrote:
> On Thu, 2007-01-18 at 12:37 +0100, Patrick McHardy wrote:
> 
>>>Or are you proposing tc behave differently on different
>>>kernel versions.  (I have no problem with that, but
>>>isn't it officially frowned upon?)
>>
>>Yes. There is no way you can make this work on old kernels,
>>nobody expects that. The important part is that everything
>>continues to work as before and that both old and new iproute
>>binaries work properly on both old and new kernels (new
>>iproute on old kernels without STABs obviously).
> 
> 
> I thought that some degree of compatibility was
> expected.  At the very least the newest version
> of "tc" must work on _any_ kernel at least as
> well as the version it replaces did.
> 
> I also thought newer kernels should work with older
> versions of iproute2, albeit without the features
> added in the newer versions.
> 
> Are you saying this is not so?

No, that's exactly what I'm saying.

Re: TKIP encryption should allocate enough tailroom

2007-01-19 Thread Pekka Pietikainen

On Thu, Jan 18, 2007 at 08:55:37AM -0500, Brandon Craig Rhodes wrote:
> to debugging messages!  In some circumstances, debug messages are
> always produced; in several others, net_ratelimit() is called to
> decide whether to print an error (but why in these cases and not
> others?); and in many cases, nothing is printed at all (is this
> because convention would dictate that the caller discover the error
> and print something out?).
> 
> If I want to generate a patch that festoons the ieee80211 functions
> with informative error messages, what are the guidelines?
My understanding is:

BUG_ON() / BUG() if it's a clearly "impossible" condition ("the function
calling me was wrong"), such as null pointers or inconsistent buffer
lengths. Might even be justified in this case?

net_ratelimit() says:
/*
 * All net warning printk()s should be guarded by this function.
 */
int net_ratelimit(void)
{
	return __printk_ratelimit(net_msg_cost, net_msg_burst);
}

Especially important if the code path can be triggered by anyone (local user
or arbitrary packet from the network). Otherwise not that big a deal if it's
buggy code elsewhere in the kernel that causes the message to be printed.
You fix the code and you stop getting thousands of lines of debug
messages/second (which is why net_ratelimit() exists).
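
As a rough illustration of why such a guard bounds the log volume, here
is a simplified user-space token-bucket limiter. It is a sketch only:
`struct ratelimit`, `ratelimit_init()` and `ratelimit_ok()` are
hypothetical names, time is passed in explicitly as abstract ticks, and
this is not the kernel's __printk_ratelimit() code.

```c
#include <assert.h>

/* Hypothetical token-bucket limiter; times are in arbitrary ticks. */
struct ratelimit {
	unsigned long cost;	/* ticks of credit one message consumes */
	unsigned long burst;	/* messages allowed back to back */
	unsigned long tokens;	/* current credit, in ticks */
	unsigned long last;	/* time of the previous call */
	unsigned long missed;	/* messages suppressed so far */
};

static void ratelimit_init(struct ratelimit *rl, unsigned long cost,
			   unsigned long burst, unsigned long now)
{
	rl->cost = cost;
	rl->burst = burst;
	rl->tokens = cost * burst;	/* start with a full bucket */
	rl->last = now;
	rl->missed = 0;
}

/* Returns 1 if the caller may print, 0 if the message must be dropped. */
static int ratelimit_ok(struct ratelimit *rl, unsigned long now)
{
	rl->tokens += now - rl->last;	/* earn credit for elapsed time */
	rl->last = now;
	if (rl->tokens > rl->cost * rl->burst)
		rl->tokens = rl->cost * rl->burst;
	if (rl->tokens >= rl->cost) {
		rl->tokens -= rl->cost;
		return 1;
	}
	rl->missed++;
	return 0;
}
```

A caller would wrap each warning as `if (ratelimit_ok(&rl, now)) ...`,
so a flood of triggering packets produces at most `burst` messages at
once and then roughly one message per `cost` ticks.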

If it's an arbitrary packet from the network, there probably should even
be a sysctl to enable/disable debug output completely. IPv4 has:

static void ip_handle_martian_source(struct net_device *dev,
				     struct in_device *in_dev,
				     struct sk_buff *skb,
				     __be32 daddr,
				     __be32 saddr)
{
	RT_CACHE_STAT_INC(in_martian_src);
#ifdef CONFIG_IP_ROUTE_VERBOSE
	if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit()) {
		/*
		 *	RFC1812 recommendation, if source is martian,
		 *	the only hint is MAC header.
		 */
		printk(KERN_WARNING "martian source %u.%u.%u.%u from "
			"%u.%u.%u.%u, on dev %s\n",
			NIPQUAD(daddr), NIPQUAD(saddr), dev->name);
...

(so there's a #ifdef _and_ a log_martians sysctl to see debug output).
In general #ifdefs should be avoided.


[ANNOUNCE] FYI: MultiTCP for linux-2.6.19

2007-01-19 Thread Daniele Lacamera

MultiTCP[1] is (yet another) Linux TCP patch intended for
researchers/developers, which can report TCP events in the kernel logs
in order to watch TCP internal variables.

Furthermore, it includes TCP Pacing and Hoe's initial ssthresh
estimation[2]. Their use in satellite links is strongly recommended by
the TCP-Hybla authors in order to mitigate congestion episodes and limit
the initial cwnd overshoot phenomenon respectively.

A new version for linux-2.6.19 has been released at
http://www.sf.net/projects/multitcp/

The future goal of Hybla's authors is to produce official patches for
both Pacing and initial ssthresh estimation, by decoupling the latter's
implementation from the TCP kernel log engine and slimming down the
current tcp_sock structure.

Meanwhile, we would appreciate it if comparative tests of Linux
congestion control included the two algorithms described above when
using TCP-Hybla.

[1]
C. Caini, R. Firrincieli and D. Lacamera, "A Linux Based Multi TCP
Implementation for Experimental Evaluation of TCP Enhancements", SPECTS
2005, Philadelphia, July 2005.

C. Caini, R. Firrincieli, D. Lacamera, "An emulation approach for the
evaluation of enhanced transport protocols performance in satellite
networks", IEEE Globecom 2006 - Satellite and Space Communications, 27
November-1 December 2006, San Francisco, CA, USA.

[2]
J. C. Hoe, "Improving the Start-up Behavior of a Congestion Control
Scheme for TCP", ACM SIGCOMM 1996, pp. 270-280


Regards,

-- 
Daniele Lacamera


Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?

2007-01-19 Thread Michael Tokarev

Jarek Poplawski wrote:
> On 17-01-2007 15:12, Michael Tokarev wrote:
[]
>> Here's another sample, which may be more useful.  I've seen quite
>> a lot of very similar stuff while running tcpdump.
>>
>>   http://www.corpit.ru/mjt/bad-cksum-session3-dmp.bin
>>
>> The scenario looks like this.
>>
>> A client (82.84.172.37 -- a zombie machine trying to send us spam
>> in this case) connects to port 25 here (81.13.94.6:25).  The SYN+ACK
>> sequence completes.  Next, our server sends an initial SMTP greeting
>> message, but almost right after that, the client sends a FIN packet,
>> WITHOUT acknowledging that it received the (first and only) data
>> packet.  So some time later our machine re-sends the data, AND adds
>> the FIN flag to the packet (also replying to the FIN received from the
>> client).  And *that* packet - the original data packet, modified
>> to also include FIN - has an incorrect checksum.
>>
>> So it looks like the checksum isn't being updated WHEN ADDING MORE
>> FLAGS to the original data packet.
>>
> 
> Hi,
> 
> Here is my patch proposal. If I'm not totally wrong,
> there is a possibility that, during collapsing, an empty
> skb with FIN is added to a "normal" packet and changes
> its ip_summed field to CHECKSUM_NONE.
> 
> Regards,
> Jarek P.
> 
> PS: probably there are also other possibilities...

Well..  I just tried it - with this patch applied, no more bad checksums
are shown.  Tried from the network that triggers it most reliably - and
wasn't able to reproduce the bad behavior.

I'm running a tcpdump right now, and so far it only captured a few bad-cksum
packets from other hosts (which are also running 2.6.19 ;)

Thanks Jarek!

/mjt

Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?

2007-01-19 Thread Michael Tokarev

Patrick McHardy wrote:
> Jarek Poplawski wrote:
>> Here is my patch proposal. If I'm not totally wrong,
>> there is a possibility that, during collapsing, an empty
>> skb with FIN is added to a "normal" packet and changes
>> its ip_summed field to CHECKSUM_NONE.
>>
>> diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c 
>> linux-2.6.19/net/ipv4/tcp_output.c
>> --- linux-2.6.19-/net/ipv4/tcp_output.c  2006-11-29 22:57:37.0 
>> +0100
>> +++ linux-2.6.19/net/ipv4/tcp_output.c   2007-01-19 07:58:39.0 
>> +0100
>> @@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
>>  
>>  memcpy(skb_put(skb, next_skb_size), next_skb->data, 
>> next_skb_size);
>>  
>> -skb->ip_summed = next_skb->ip_summed;
>> +if (next_skb->ip_summed == CHECKSUM_PARTIAL)
>> +skb->ip_summed = CHECKSUM_PARTIAL;
>>  
>>  if (skb->ip_summed != CHECKSUM_PARTIAL)
>>  skb->csum = csum_block_add(skb->csum, next_skb->csum, 
>> skb_size);
>>
> 
> I noticed this too, but I can't see how it could lead to
> a partial checksum on the wire since the checksumming is
> done after changing ip_summed to CHECKSUM_NONE. Is this
> patch verified to fix Michael's problem?

It seems to fix this "my" problem, yes - at least I can't reproduce it anymore.
Tcpdump is running however - let's see... :)

/mjt

bonding: sysfs patch broke module renaming

2007-01-19 Thread Patrick McHardy

The sysfs patch broke using multiple instances of the bonding module
through module renaming (modprobe -o). In recent kernels it fails
with -EEXIST when trying to add the bonding_masters file for the
second time, in older kernels (where sysfs_add_file didn't check
for duplicates) it will crash when unloading the modules.

I don't see a good way to fix it, can someone please look into this?

Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?

2007-01-19 Thread Jarek Poplawski

On Fri, Jan 19, 2007 at 04:20:01PM +0300, Michael Tokarev wrote:
...
> Well..  I just tried it - with this patch applied, no more bad checksums
> are shown.  Tried from the network that triggers it most reliably - and
> wasn't able to reproduce the bad behavior.
> 
> I'm running a tcpdump right now, and so far it only captured a few bad-cksum
> packets from other hosts (which are also running 2.6.19 ;)
> 
> Thanks Jarek!

You are welcome! But you probably didn't read this
carefully: if it works, you should mainly thank that
other guy...

Btw, I can't remember ever seeing such ferocious
testing!

Cheers,
Jarek P.

Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?

2007-01-19 Thread Jarek Poplawski

On Fri, Jan 19, 2007 at 01:14:52PM +0100, Patrick McHardy wrote:
> Jarek Poplawski wrote:
> > Here is my patch proposal. If I'm not totally wrong,
> > there is a possibility that, during collapsing, an empty
> > skb with FIN is added to a "normal" packet and changes
> > its ip_summed field to CHECKSUM_NONE.
> > 
> > diff -Nurp linux-2.6.19-/net/ipv4/tcp_output.c 
> > linux-2.6.19/net/ipv4/tcp_output.c
> > --- linux-2.6.19-/net/ipv4/tcp_output.c 2006-11-29 22:57:37.0 
> > +0100
> > +++ linux-2.6.19/net/ipv4/tcp_output.c  2007-01-19 07:58:39.0 
> > +0100
> > @@ -1590,7 +1590,8 @@ static void tcp_retrans_try_collapse(str
> >  
> > memcpy(skb_put(skb, next_skb_size), next_skb->data, 
> > next_skb_size);
> >  
> > -   skb->ip_summed = next_skb->ip_summed;
> > +   if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> > +   skb->ip_summed = CHECKSUM_PARTIAL;
> >  
> > if (skb->ip_summed != CHECKSUM_PARTIAL)
> > skb->csum = csum_block_add(skb->csum, next_skb->csum, 
> > skb_size);
> > 
> 
> I noticed this too, but I can't see how it could lead to
> a partial checksum on the wire since the checksumming is
> done after changing ip_summed to CHECKSUM_NONE. Is this
> patch verified to fix Michael's problem?

No, this was intended as a proposal for testing.
I didn't verify the whole checksum path here, but I
guessed such a change during the summing could matter
(probably for skb_copy_and_csum_dev and maybe earlier),
and I couldn't find a more suspicious change since 2.6.17
near these FINs. But if it really works, it shouldn't be
so hard to verify the mechanism, I hope.

Jarek P.

[patch 01/12] net namespace : initialize init process to level 2

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

Initialize the init's network namespace to level 2

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 net/core/net_namespace.c |1 +
 1 file changed, 1 insertion(+)

Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -21,6 +21,7 @@
.dev_tail_p = &init_net_ns.dev_base_p,
.loopback_dev_p = NULL,
.pcpu_lstats_p  = NULL,
+   .level  = NET_NS_LEVEL2,
 };
 
 #ifdef CONFIG_NET_NS


[patch 00/12] net namespace : L3 namespace - introduction

2007-01-19 Thread dlezcano

This patchset provides network isolation similar to what
Linux-Vserver provides. It is based on the L2 namespaces and relies on
the mechanisms they provide. The L3 namespace does not aim to bring
full virtualization to the network; it provides IP isolation that can
be reused for Linux-Vserver, jailed applications or application
containers.

L3 namespaces are always children of an L2 namespace and cannot create
further network namespaces; furthermore, they lose their NET_ADMIN
capability. They share their parent's network resources. IP addresses
are created in the parent namespace and assigned to the different L3
children. From that point on, an L3 namespace can use its assigned IP
address and all computed broadcast addresses.

Because the L3 namespace relies on the L2 virtualization mechanisms,
several L3 namespaces can listen on INADDR_ANY:port without conflict,
which allows several servers to run without modifying the network
configuration.

The loopback device is shared between all L3 namespaces. To ensure
isolation of the 127.0.0.1 address, the sender stores its namespace in
the packet, so when the packet arrives the destination namespace is
already set, because "source" == "destination". This makes it easy to
disable the loopback isolation and let applications talk to
applications outside of the namespace via 127.0.0.1 because we
consider them trusted (like portmap).

The ifconfig / ip commands will only show the IP addresses assigned to
the L3 namespace. When an L3 namespace dies, its assigned IP address is
released to its parent.

At the IP level, when a packet arrives, the destination L3 network
namespace is retrieved from the destination address.

At bind time, the address is checked against the assigned IP
address.
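
The last two rules (destination lookup on input, address check at bind
time) can be sketched in user-space C as below. This is a simplified
model under stated assumptions: `struct l3_ns`, `struct addr_owner`,
`ns_from_daddr()` and `bind_allowed()` are hypothetical names, the
address table is a flat array, and accepting INADDR_ANY unconditionally
is an assumption drawn from the INADDR_ANY paragraph above rather than
from the patch code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct l3_ns {
	int id;
};

/* Hypothetical table of addresses the L2 parent assigned to its L3 children. */
struct addr_owner {
	uint32_t addr;		/* IPv4 address, host byte order in this sketch */
	struct l3_ns *owner;
};

/* Input path: map a destination address to the namespace that owns it. */
static struct l3_ns *ns_from_daddr(const struct addr_owner *tbl, size_t n,
				   uint32_t daddr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].addr == daddr)
			return tbl[i].owner;
	return NULL;
}

/*
 * bind(): the wildcard address (0) is accepted; a specific address is
 * accepted only if it was assigned to the calling namespace.
 */
static int bind_allowed(const struct addr_owner *tbl, size_t n,
			const struct l3_ns *ns, uint32_t addr)
{
	if (addr == 0)
		return 1;
	return ns_from_daddr(tbl, n, addr) == ns;
}
```

With 10.0.0.1 assigned to one namespace and 10.0.0.2 to another, a
packet for 10.0.0.2 resolves to the second namespace, and the first
namespace is refused when it tries to bind to 10.0.0.2.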


[patch 02/12] net namespace : store L2 parent namespace

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

All L3 namespaces are the leaf nodes of the L2 namespace tree.
Because they share some resources coming from the L2 namespace,
the L2 parent namespace should be stored in the L3 child
when it is created.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |1 +
 net/core/net_namespace.c  |   11 +++
 2 files changed, 12 insertions(+)

Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -27,6 +27,7 @@
 #define NET_NS_LEVEL2  1
 #define NET_NS_LEVEL3  2
unsigned intlevel;
+   struct net_namespace*parent;
 };
 
 extern struct net_namespace init_net_ns;
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -22,6 +22,7 @@
.loopback_dev_p = NULL,
.pcpu_lstats_p  = NULL,
.level  = NET_NS_LEVEL2,
+   .parent = NULL,
 };
 
 #ifdef CONFIG_NET_NS
@@ -62,6 +63,12 @@
if (ip_fib_struct_init())
goto out_fib4;
}
+
+   if (level == NET_NS_LEVEL3) {
+   get_net_ns(old_ns);
+   ns->parent = old_ns;
+   }
+
ns->level = level;
if (loopback_init())
goto out_loopback;
@@ -126,8 +133,12 @@
ns, atomic_read(&ns->kref.refcount));
return;
}
+
if (ns->level == NET_NS_LEVEL2)
ip_fib_struct_cleanup(ns);
+   if (ns->level == NET_NS_LEVEL3)
+   put_net_ns(ns->parent);
+
printk(KERN_DEBUG "NET_NS: net namespace %p destroyed\n", ns);
kfree(ns);
 }
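
The reference-counting pattern in this patch (take a reference on the
parent when an L3 child is created, drop it when the child is freed) can
be modelled in user space as follows. `struct ns`, `ns_get()`,
`ns_create()` and `ns_put()` are hypothetical names for this sketch; a
plain int stands in for the kernel's kref.

```c
#include <assert.h>
#include <stdlib.h>

enum { NS_LEVEL2 = 1, NS_LEVEL3 = 2 };

struct ns {
	int refcount;		/* stands in for the kref */
	int level;
	struct ns *parent;	/* only set for level-3 namespaces */
};

static struct ns *ns_get(struct ns *ns)
{
	ns->refcount++;
	return ns;
}

/* Create a namespace; a level-3 child pins its level-2 parent. */
static struct ns *ns_create(struct ns *old_ns, int level)
{
	struct ns *ns = calloc(1, sizeof(*ns));

	if (!ns)
		return NULL;
	ns->refcount = 1;
	ns->level = level;
	if (level == NS_LEVEL3)
		ns->parent = ns_get(old_ns);
	return ns;
}

/* Drop a reference; the last put of a child also releases the parent. */
static void ns_put(struct ns *ns)
{
	if (--ns->refcount > 0)
		return;
	if (ns->level == NS_LEVEL3)
		ns_put(ns->parent);
	free(ns);
}
```

The point of the pairing is that a parent can never be freed while one
of its L3 children still points at it.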


[patch 12/12] net namespace : Add broadcasting

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

Broadcast packets should be delivered to the l2 namespace and all of its l3 children

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |   11 +++
 net/core/net_namespace.c  |   27 +++
 net/ipv4/udp.c|3 ++-
 3 files changed, 40 insertions(+), 1 deletion(-)

Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -9,6 +9,7 @@
 
 struct in_ifaddr;
 struct sk_buff;
+struct sock;
 
 struct net_namespace {
struct kref kref;
@@ -109,6 +110,9 @@
 
 extern void net_ns_tag_sk_buff(struct sk_buff *skb);
 
+extern int net_ns_sock_is_visible(const struct sock *sk,
+ const struct net_namespace *net_ns);
+
 #define SELECT_SRC_ADDR net_ns_select_source_address
 
 #else /* CONFIG_NET_NS */
@@ -192,6 +196,13 @@
 {
;
 }
+
+static inline int net_ns_sock_is_visible(const struct sock *sk,
+const struct net_namespace *net_ns)
+{
+   return 1;
+}
+
 #define SELECT_SRC_ADDR inet_select_addr
 
 #endif /* !CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -17,6 +17,7 @@
 #include 
 
 #include 
+#include 
 
 struct net_namespace init_net_ns = {
.kref = {
@@ -464,4 +465,30 @@
struct net_namespace *net_ns = current_net_ns;
skb->net_ns = net_ns;
 }
+
+/*
+ * This function checks if the socket is visible from the specified
+ * namespace. This is needed to ensure the broadcast and the multicast
+ * for multiple network namespace l2 and l3 to have the packets to be
+ * delivered. If we have a l3 namespace and its parent (l2 namespace)
+ * listening on a broadcast address, we should deliver the packet to
+ * both. That is done by the udp_v4_mcast_next function. But we should
+ * find a common point between sockets which are relatives to a
+ * namespace.  The common point is they have the same parent in case
+ * of l3 network namespace.
+ * @sk : the socket to be checked
+ * @net_ns : the receiving network namespace
+ * Returns: 1 if the socket is visible by the namespace, 0 otherwise.
+ */
+int net_ns_sock_is_visible(const struct sock *sk,
+  const struct net_namespace *net_ns)
+{
+   if (net_ns->level == NET_NS_LEVEL3)
+   net_ns = net_ns->parent;
+
+   if (sk->sk_net_ns->level == NET_NS_LEVEL3)
+   return sk->sk_net_ns->parent == net_ns;
+   else
+   return sk->sk_net_ns == net_ns;
+}
 #endif /* CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/ipv4/udp.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/udp.c
+++ 2.6.20-rc4-mm1/net/ipv4/udp.c
@@ -309,9 +309,10 @@
(inet->dport != rmt_port && inet->dport)||
(inet->rcv_saddr && inet->rcv_saddr != loc_addr)||
ipv6_only_sock(s)   ||
-   !net_ns_match(sk->sk_net_ns, ns)||
(s->sk_bound_dev_if && s->sk_bound_dev_if != dif))
continue;
+   if (!net_ns_sock_is_visible(sk, ns))
+   continue;
if (!ip_mc_sf_allow(s, loc_addr, rmt_addr, dif))
continue;
goto found;
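
The visibility rule added by this patch can be tried out in user space
with the following sketch, which mirrors the logic of
net_ns_sock_is_visible() on stand-in types. `struct ns`,
`struct fake_sock` and `sock_is_visible()` are hypothetical names for
this illustration and carry only the fields the check needs.

```c
#include <assert.h>
#include <stddef.h>

enum { NS_LEVEL2 = 1, NS_LEVEL3 = 2 };

struct ns {
	int level;
	const struct ns *parent;	/* level-2 parent of a level-3 namespace */
};

struct fake_sock {
	const struct ns *sk_net_ns;	/* namespace the socket belongs to */
};

/*
 * Both sides are first reduced to their level-2 namespace; the socket is
 * visible when those level-2 namespaces are the same object.
 */
static int sock_is_visible(const struct fake_sock *sk, const struct ns *net_ns)
{
	if (net_ns->level == NS_LEVEL3)
		net_ns = net_ns->parent;

	if (sk->sk_net_ns->level == NS_LEVEL3)
		return sk->sk_net_ns->parent == net_ns;
	return sk->sk_net_ns == net_ns;
}
```

Under this rule a broadcast received in one L3 child is visible to
sockets of its L2 parent and of its sibling L3 namespaces, but not to
sockets of an unrelated L2 namespace.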


[patch 08/12] net namespace : find namespace by addr

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

Switch to the l3 namespace using the destination address.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |7 +++
 net/core/net_namespace.c  |   35 +++
 net/ipv4/ip_input.c   |   16 +++-
 3 files changed, 57 insertions(+), 1 deletion(-)

Index: 2.6.20-rc4-mm1/net/ipv4/ip_input.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/ip_input.c
+++ 2.6.20-rc4-mm1/net/ipv4/ip_input.c
@@ -374,6 +374,9 @@
 {
struct iphdr *iph;
u32 len;
+   int err;
+   struct net_namespace *net_ns = current_net_ns;
+   struct net_namespace *dst_net_ns = NULL;
 
/* When the interface is in promisc. mode, drop all the crap
 * that it receives, do not try to analyse it.
@@ -393,6 +396,9 @@
 
iph = skb->nh.iph;
 
+   dst_net_ns = net_ns_find_from_dest_addr(iph->daddr);
+   if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
+   push_net_ns(dst_net_ns);
/*
 *  RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails 
the checksum.
 *
@@ -431,10 +437,18 @@
/* Remove any debris in the socket control block */
memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
 
-   return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
+   err =  NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
   ip_rcv_finish);
 
+   if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
+   pop_net_ns(net_ns);
+
+   return err;
+
 inhdr_error:
+   if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
+   pop_net_ns(net_ns);
+
IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
 drop:
 kfree_skb(skb);
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -99,6 +99,8 @@
 extern __be32 net_ns_select_source_address(const struct net_device *dev,
   u32 dst, int scope);
 
+extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr);
+
 #define SELECT_SRC_ADDR net_ns_select_source_address
 
 #else /* CONFIG_NET_NS */
@@ -167,6 +169,11 @@
return 0;
 }
 
+static inline struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
+{
+   return NULL;
+}
+
 #define SELECT_SRC_ADDR inet_select_addr
 
 #endif /* !CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -385,4 +385,39 @@
 out:
return addr;
 }
+
+/*
+ * This function finds the network namespace destination deduced from
+ * the destination address. The network namespace is retrieved from
+ * the ifaddr owned by a network namespace
+ * @daddr  : destination
+ * Returns : the network namespace destination or NULL if not found
+ */
+struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
+{
+   struct net_namespace *net_ns = NULL;
+   struct net_device *dev;
+   struct in_device *in_dev;
+
+   if (LOOPBACK(daddr))
+   return current_net_ns;
+
+   read_lock(&dev_base_lock);
+   rcu_read_lock();
+   for (dev = dev_base; dev; dev = dev->next) {
+   if ((in_dev = __in_dev_get_rcu(dev)) == NULL)
+   continue;
+   for_ifa(in_dev) {
+   if (ifa->ifa_local == daddr) {
+   net_ns = ifa->ifa_net_ns;
+   goto out_unlock_both;
+   }
+   } endfor_ifa(in_dev);
+   }
+out_unlock_both:
+   read_unlock(&dev_base_lock);
+   rcu_read_unlock();
+
+   return net_ns;
+}
 #endif /* CONFIG_NET_NS */
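
For readers who want to try the lookup outside the kernel, here is a minimal user-space sketch in C of what net_ns_find_from_dest_addr() does. The names toy_ns, toy_ifaddr and toy_find_ns_by_daddr are hypothetical stand-ins, addresses are kept in host byte order for simplicity, and the device list is flattened into a plain array; locking (dev_base_lock, RCU) is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct toy_ns {
    int id;
};

struct toy_ifaddr {
    uint32_t local;        /* plays the role of ifa->ifa_local (host order here) */
    struct toy_ns *owner;  /* plays the role of ifa->ifa_net_ns */
};

/* 127.0.0.0/8 test on a host-order address. */
static int toy_is_loopback(uint32_t addr)
{
    return (addr & 0xff000000u) == 0x7f000000u;
}

/*
 * Return the namespace owning daddr, the current namespace for a
 * loopback destination, or NULL when no ifaddr matches.
 */
static struct toy_ns *toy_find_ns_by_daddr(const struct toy_ifaddr *ifas,
                                           size_t n,
                                           struct toy_ns *current_ns,
                                           uint32_t daddr)
{
    size_t i;

    if (toy_is_loopback(daddr))
        return current_ns;

    for (i = 0; i < n; i++)
        if (ifas[i].local == daddr)
            return ifas[i].owner;

    return NULL;
}
```

The caller would then compare the result with the current namespace and only switch when the two differ, which is what the push_net_ns()/pop_net_ns() pair in ip_rcv() does.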


[patch 09/12] net namespace : make loopback address always visible

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

Add a specific condition to the inet interface listing so that the
loopback address is always visible.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |9 +
 net/core/net_namespace.c  |   22 ++
 net/ipv4/devinet.c|   12 +---
 3 files changed, 36 insertions(+), 7 deletions(-)

Index: 2.6.20-rc4-mm1/net/ipv4/devinet.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/devinet.c
+++ 2.6.20-rc4-mm1/net/ipv4/devinet.c
@@ -695,8 +695,7 @@
for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 ifap = &ifa->ifa_next) {
if (!strcmp(ifr.ifr_name, ifa->ifa_label) &&
-   net_ns_match(ifa->ifa_net_ns,
-current_net_ns) &&
+   net_ns_ifa_is_visible(ifa) &&
sin_orig.sin_addr.s_addr ==
ifa->ifa_address) {
break; /* found */
@@ -710,13 +709,12 @@
for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 ifap = &ifa->ifa_next)
if (!strcmp(ifr.ifr_name, ifa->ifa_label) &&
-net_ns_match(ifa->ifa_net_ns,
- current_net_ns))
+net_ns_ifa_is_visible(ifa))
break;
}
}
 
-   if (ifa && !net_ns_match(ifa->ifa_net_ns, current_net_ns))
+   if (ifa && !net_ns_ifa_is_visible(ifa))
goto done;
 
ret = -EADDRNOTAVAIL;
@@ -868,7 +866,7 @@
goto out;
 
for (; ifa; ifa = ifa->ifa_next) {
-   if (!net_ns_match(ifa->ifa_net_ns, current_net_ns))
+   if (!net_ns_ifa_is_visible(ifa))
continue;
if (!buf) {
done += sizeof(ifr);
@@ -1216,7 +1214,7 @@
 
for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
 ifa = ifa->ifa_next, ip_idx++) {
-   if (!net_ns_match(ifa->ifa_net_ns, current_net_ns))
+   if (!net_ns_ifa_is_visible(ifa))
continue;
if (ip_idx < s_ip_idx)
continue;
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -7,6 +7,8 @@
 #include 
 #include 
 
+struct in_ifaddr;
+
 struct net_namespace {
struct kref kref;
struct net_device   *dev_base_p, **dev_tail_p;
@@ -101,6 +103,8 @@
 
 extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr);
 
+extern int net_ns_ifa_is_visible(const struct in_ifaddr *ifa);
+
 #define SELECT_SRC_ADDR net_ns_select_source_address
 
 #else /* CONFIG_NET_NS */
@@ -174,6 +178,11 @@
return NULL;
 }
 
+static inline int net_ns_ifa_is_visible(const struct in_ifaddr *ifa)
+{
+   return 1;
+}
+
 #define SELECT_SRC_ADDR inet_select_addr
 
 #endif /* !CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -420,4 +420,26 @@
 
return net_ns;
 }
+
+/*
+ * This function checks if the ifaddr is visible from the
+ * current network namespace. This is true if the ifaddr is
+ * the loopback address or if the ifaddr is owned by the network
+ * namespace.
+ * @ifa : the ifaddr
+ * Returns : 1 if visible, 0 otherwise
+ */
+int net_ns_ifa_is_visible(const struct in_ifaddr *ifa)
+{
+   struct net_namespace *net_ns = current_net_ns;
+
+   if (LOOPBACK(ifa->ifa_local))
+   return 1;
+
+   if (net_ns_match(ifa->ifa_net_ns, net_ns))
+   return 1;
+
+   return 0;
+}
+
 #endif /* CONFIG_NET_NS */
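
The visibility rule is small enough to sketch in user space. The C below is only an illustration: toy_ns, toy_ifaddr and toy_ifa_is_visible are hypothetical names, addresses are in host byte order, and net_ns_match() is assumed here to behave like a plain pointer comparison, which the patch itself does not show.

```c
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct toy_ns {
    int id;
};

struct toy_ifaddr {
    uint32_t local;              /* ifa_local, host byte order here */
    const struct toy_ns *owner;  /* ifa_net_ns */
};

/*
 * An ifaddr is visible if it is a loopback address (127.0.0.0/8) or if
 * it is owned by the namespace doing the listing.
 * Assumption: namespace matching is a pointer comparison.
 */
static int toy_ifa_is_visible(const struct toy_ifaddr *ifa,
                              const struct toy_ns *current_ns)
{
    if ((ifa->local & 0xff000000u) == 0x7f000000u)
        return 1;

    return ifa->owner == current_ns;
}
```

Keeping the test in one helper is what lets devinet.c replace each open-coded net_ns_match() call with a single predicate.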


[patch 04/12] net namespace : isolate the inet device.

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

The ip and ifconfig commands will no longer show IP addresses
that do not belong to the current network namespace.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/inetdevice.h |1 +
 net/ipv4/devinet.c |   22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

Index: 2.6.20-rc4-mm1/include/linux/inetdevice.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/inetdevice.h
+++ 2.6.20-rc4-mm1/include/linux/inetdevice.h
@@ -99,6 +99,7 @@
unsigned char   ifa_flags;
unsigned char   ifa_prefixlen;
charifa_label[IFNAMSIZ];
+   struct net_namespace*ifa_net_ns;
 };
 
 extern int register_inetaddr_notifier(struct notifier_block *nb);
Index: 2.6.20-rc4-mm1/net/ipv4/devinet.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/devinet.c
+++ 2.6.20-rc4-mm1/net/ipv4/devinet.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef CONFIG_SYSCTL
 #include 
 #endif
@@ -269,6 +270,7 @@
 
if (!(ifa->ifa_flags & IFA_F_SECONDARY) ||
ifa1->ifa_mask != ifa->ifa_mask ||
+   !net_ns_match(ifa->ifa_net_ns, ifa1->ifa_net_ns) ||
!inet_ifa_match(ifa1->ifa_address, ifa)) {
ifap1 = &ifa->ifa_next;
prev_prom = ifa;
@@ -471,6 +473,9 @@
 
for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 ifap = &ifa->ifa_next) {
+   if (!net_ns_match(ifa->ifa_net_ns, current_net_ns))
+   continue;
+
if (tb[IFA_LOCAL] &&
ifa->ifa_local != nla_get_be32(tb[IFA_LOCAL]))
continue;
@@ -544,6 +549,7 @@
ifa->ifa_flags = ifm->ifa_flags;
ifa->ifa_scope = ifm->ifa_scope;
ifa->ifa_dev = in_dev;
+   ifa->ifa_net_ns = current_net_ns;
 
ifa->ifa_local = nla_get_be32(tb[IFA_LOCAL]);
ifa->ifa_address = nla_get_be32(tb[IFA_ADDRESS]);
@@ -689,6 +695,8 @@
for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 ifap = &ifa->ifa_next) {
if (!strcmp(ifr.ifr_name, ifa->ifa_label) &&
+   net_ns_match(ifa->ifa_net_ns,
+current_net_ns) &&
sin_orig.sin_addr.s_addr ==
ifa->ifa_address) {
break; /* found */
@@ -701,11 +709,16 @@
if (!ifa) {
for (ifap = &in_dev->ifa_list; (ifa = *ifap) != NULL;
 ifap = &ifa->ifa_next)
-   if (!strcmp(ifr.ifr_name, ifa->ifa_label))
+   if (!strcmp(ifr.ifr_name, ifa->ifa_label) &&
+net_ns_match(ifa->ifa_net_ns,
+ current_net_ns))
break;
}
}
 
+   if (ifa && !net_ns_match(ifa->ifa_net_ns, current_net_ns))
+   goto done;
+
ret = -EADDRNOTAVAIL;
if (!ifa && cmd != SIOCSIFADDR && cmd != SIOCSIFFLAGS)
goto done;
@@ -749,6 +762,8 @@
ret = -ENOBUFS;
if ((ifa = inet_alloc_ifa()) == NULL)
break;
+
+   ifa->ifa_net_ns = current_net_ns;
if (colon)
memcpy(ifa->ifa_label, ifr.ifr_name, IFNAMSIZ);
else
@@ -853,6 +868,8 @@
goto out;
 
for (; ifa; ifa = ifa->ifa_next) {
+   if (!net_ns_match(ifa->ifa_net_ns, current_net_ns))
+   continue;
if (!buf) {
done += sizeof(ifr);
continue;
@@ -1086,6 +1103,7 @@
in_dev_hold(in_dev);
ifa->ifa_dev = in_dev;
ifa->ifa_scope = RT_SCOPE_HOST;
+   ifa->ifa_net_ns = current_net_ns;
memcpy(ifa->ifa_label, dev->name, IFNAMSIZ);
inet_insert_ifa(ifa);
}
@@ -1198,6 +1216,8 @@
 
for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
 ifa = ifa->ifa_next, ip_idx++) {
+   if (!net_ns_match(ifa->ifa_net_ns, current_net_ns))
+   continue;
if (ip_idx < s_ip_idx)
continue;
if (inet_fill_ifaddr(skb, ifa, NETLINK_

[patch 03/12] net namespace : share network resources L2 with L3

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

An L3 namespace uses the routes and devices belonging to its parent, so
the old network namespace structure is copied when a new one is
allocated. This way, the hash value, device list, and routes are
accessible from the L3 namespace. For an L2 namespace, these values are
overwritten with newly allocated ones.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |   14 ++
 net/core/dev.c|4 ++--
 net/core/net_namespace.c  |   33 ++---
 3 files changed, 34 insertions(+), 17 deletions(-)

Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -37,7 +37,7 @@
  * Return ERR_PTR on error, new ns otherwise
  */
 static struct net_namespace *clone_net_ns(unsigned int level,
-   struct net_namespace *old_ns)
+ struct net_namespace *old_ns)
 {
struct net_namespace *ns;
 
@@ -45,23 +45,26 @@
if (current_net_ns->level == NET_NS_LEVEL3)
return ERR_PTR(-EPERM);
 
-   ns = kzalloc(sizeof(struct net_namespace), GFP_KERNEL);
+   ns = kmemdup(old_ns, sizeof(struct net_namespace), GFP_KERNEL);
if (!ns)
return NULL;
 
kref_init(&ns->kref);
-   ns->dev_base_p = NULL;
-   ns->dev_tail_p = &ns->dev_base_p;
-   ns->hash = net_random();
-
if ((push_net_ns(ns)) != old_ns)
+
BUG();
if (level ==  NET_NS_LEVEL2) {
+   ns->dev_base_p = NULL;
+   ns->dev_tail_p = &ns->dev_base_p;
+   ns->hash = net_random();
+
 #ifdef CONFIG_IP_MULTIPLE_TABLES
INIT_LIST_HEAD(&ns->fib_rules_ops_list);
 #endif
if (ip_fib_struct_init())
goto out_fib4;
+   if (loopback_init())
+   goto out_loopback;
}
 
if (level == NET_NS_LEVEL3) {
@@ -70,8 +73,6 @@
}
 
ns->level = level;
-   if (loopback_init())
-   goto out_loopback;
pop_net_ns(old_ns);
printk(KERN_DEBUG "NET_NS: created new netcontext %p, level %u, "
"for %s (pid=%d)\n", ns, (ns->level == NET_NS_LEVEL2) ?
@@ -127,15 +128,17 @@
struct net_namespace *ns;
 
ns = container_of(kref, struct net_namespace, kref);
-   unregister_netdev(ns->loopback_dev_p);
-   if (ns->dev_base_p != NULL) {
-   printk("NET_NS: BUG: namespace %p has devices! ref %d\n",
-   ns, atomic_read(&ns->kref.refcount));
-   return;
-   }
 
-   if (ns->level == NET_NS_LEVEL2)
+   if (ns->level == NET_NS_LEVEL2) {
ip_fib_struct_cleanup(ns);
+   unregister_netdev(ns->loopback_dev_p);
+   if (ns->dev_base_p != NULL) {
+   printk("NET_NS: BUG: namespace %p has devices! ref %d\n",
+  ns, atomic_read(&ns->kref.refcount));
+   return;
+   }
+   }
+
if (ns->level == NET_NS_LEVEL3)
put_net_ns(ns->parent);
 
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -56,6 +56,15 @@
 DECLARE_PER_CPU(struct net_namespace *, exec_net_ns);
 #define current_net_ns (__get_cpu_var(exec_net_ns))
 
+static inline struct net_namespace *net_ns_l2(void)
+{
+   struct net_namespace *net_ns = current_net_ns;
+
+   if (net_ns->level == NET_NS_LEVEL3)
+   return net_ns->parent;
+   return net_ns;
+}
+
 static inline void init_current_net_ns(int cpu)
 {
get_net_ns(&init_net_ns);
@@ -110,6 +119,11 @@
 
 #define current_net_ns NULL
 
+static inline struct net_namespace *net_ns_l2(void)
+{
+   return NULL;
+}
+
 static inline void init_current_net_ns(int cpu)
 {
 }
Index: 2.6.20-rc4-mm1/net/core/dev.c
===
--- 2.6.20-rc4-mm1.orig/net/core/dev.c
+++ 2.6.20-rc4-mm1/net/core/dev.c
@@ -485,7 +485,7 @@
 struct net_device *__dev_get_by_name(const char *name)
 {
struct hlist_node *p;
-   struct net_namespace *ns = current_net_ns;
+   struct net_namespace *ns = net_ns_l2();
 
hlist_for_each(p, dev_name_hash(name, ns)) {
struct net_device *dev
@@ -768,7 +768,7 @@
if (!err) {
hlist_del(&dev->name_hlist);
hlist_add_head(&dev->name_hlist, dev_name_hash(dev->name,
-   current_net_ns));
+  net_ns
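
To make the copy-then-override idea concrete, here is a minimal user-space sketch in C. Everything in it is hypothetical (toy_ns, toy_clone_ns, toy_ns_l2), malloc()/memcpy() stand in for kmemdup(), and the assumption that an L3 namespace records the namespace it was cloned from as its parent is mine; that part of clone_net_ns() is not visible in the hunks above.

```c
#include <stdlib.h>
#include <string.h>

enum toy_level { TOY_LEVEL2 = 2, TOY_LEVEL3 = 3 };

/* Hypothetical stand-in for struct net_namespace. */
struct toy_ns {
    enum toy_level level;
    struct toy_ns *parent;
    unsigned int hash;
    void *dev_list;
};

/* Same idea as net_ns_l2(): an L3 namespace resolves to its L2 parent. */
static struct toy_ns *toy_ns_l2(struct toy_ns *ns)
{
    return ns->level == TOY_LEVEL3 ? ns->parent : ns;
}

/*
 * Copy the old namespace, then override the private parts only for L2.
 * An L3 clone keeps sharing the hash and device list of the original.
 */
static struct toy_ns *toy_clone_ns(struct toy_ns *old, enum toy_level level,
                                   unsigned int new_hash)
{
    struct toy_ns *ns = malloc(sizeof(*ns));

    if (!ns)
        return NULL;
    memcpy(ns, old, sizeof(*ns));

    if (level == TOY_LEVEL2) {
        ns->dev_list = NULL;   /* fresh, empty device list */
        ns->hash = new_hash;   /* fresh hash value */
    }
    if (level == TOY_LEVEL3)
        ns->parent = old;      /* assumption: L3 points back at its source */

    ns->level = level;
    return ns;
}
```

With this layout, a device lookup by name only has to call toy_ns_l2() first, which mirrors the change made to __dev_get_by_name().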

[patch 06/12] net namespace : check bind address

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

Check that the bind address is allowed. It must match an ifaddr assigned
to the namespace or one of the addresses derived from it.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |7 +
 net/core/net_namespace.c  |   54 ++
 net/ipv4/af_inet.c|2 +
 net/ipv4/raw.c|3 ++
 4 files changed, 66 insertions(+)

Index: 2.6.20-rc4-mm1/net/ipv4/af_inet.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/af_inet.c
+++ 2.6.20-rc4-mm1/net/ipv4/af_inet.c
@@ -433,6 +433,8 @@
 *  is temporarily down)
 */
err = -EADDRNOTAVAIL;
+   if (net_ns_check_bind(chk_addr_ret, addr->sin_addr.s_addr))
+   goto out;
if (!sysctl_ip_nonlocal_bind &&
!inet->freebind &&
addr->sin_addr.s_addr != INADDR_ANY &&
Index: 2.6.20-rc4-mm1/net/ipv4/raw.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/raw.c
+++ 2.6.20-rc4-mm1/net/ipv4/raw.c
@@ -559,7 +559,10 @@
if (sk->sk_state != TCP_CLOSE || addr_len < sizeof(struct sockaddr_in))
goto out;
chk_addr_ret = inet_addr_type(addr->sin_addr.s_addr);
+
ret = -EADDRNOTAVAIL;
+   if (net_ns_check_bind(chk_addr_ret, addr->sin_addr.s_addr))
+   goto out;
if (addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL &&
chk_addr_ret != RTN_MULTICAST && chk_addr_ret != RTN_BROADCAST)
goto out;
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -93,6 +93,8 @@
 
 extern int net_ns_ioctl(unsigned int cmd, void __user *arg);
 
+extern int net_ns_check_bind(int addr_type, u32 addr);
+
 #else /* CONFIG_NET_NS */
 
 #define INIT_NET_NS(net_ns)
@@ -148,6 +150,11 @@
return -ENOSYS;
 }
 
+static inline int net_ns_check_bind(int addr_type, u32 addr)
+{
+   return 0;
+}
+
 #endif /* !CONFIG_NET_NS */
 
 #endif /* _LINUX_NET_NAMESPACE_H */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -263,4 +263,58 @@
return err;
 }
 
+/*
+ * This function checks if the specified bind address is allowed.
+ * The bind is allowed if the address is:
+ * - 127.0.0.1
+ * - INADDR_ANY
+ * - INADDR_BROADCAST
+ * - a multicast address
+ * - the specified address match an ifaddr owned by the current
+ *   network namespace. That implies the local address and the
+ *   computed address from the netmask
+ * @addr_type : an addr type
+ * @addr : the requested bind address
+ * Returns: -EPERM on failure, 0 on success
+ */
+int net_ns_check_bind(int addr_type, u32 addr)
+{
+   int ret = -EPERM;
+struct net_device *dev;
+struct in_device *in_dev;
+   struct net_namespace *net_ns = current_net_ns;
+
+   if (LOOPBACK(addr) ||
+   MULTICAST(addr) ||
+   INADDR_ANY == addr ||
+   INADDR_BROADCAST == addr)
+   return 0;
+
+read_lock(&dev_base_lock);
+rcu_read_lock();
+for (dev = dev_base; dev; dev = dev->next) {
+in_dev = __in_dev_get_rcu(dev);
+if (!in_dev)
+continue;
+
+for_ifa(in_dev) {
+if (ifa->ifa_net_ns != net_ns)
+   continue;
+   if (addr == ifa->ifa_local ||
+   addr == ifa->ifa_broadcast ||
+   addr == (ifa->ifa_local & ifa->ifa_mask) ||
+   addr == ((ifa->ifa_address & ifa->ifa_mask)|
+ ~ifa->ifa_mask)) {
+   ret = 0;
+   goto out;
+   }
+} endfor_ifa(in_dev);
+}
+out:
+read_unlock(&dev_base_lock);
+rcu_read_unlock();
+
+   return ret;
+}
+
 #endif /* CONFIG_NET_NS */
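
The address arithmetic in net_ns_check_bind() is easy to get wrong, so here is a minimal user-space sketch in C that can be compiled and tested on its own. The names toy_ifaddr, toy_addr_matches_ifa and toy_check_bind are hypothetical, the namespace is reduced to an integer id, and addresses are in host byte order rather than network order.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct in_ifaddr, host byte order. */
struct toy_ifaddr {
    uint32_t local;      /* ifa_local */
    uint32_t address;    /* ifa_address */
    uint32_t mask;       /* ifa_mask */
    uint32_t broadcast;  /* ifa_broadcast */
    int ns_id;           /* owning namespace, stands in for ifa_net_ns */
};

/* Local address, configured broadcast, network address, computed broadcast. */
static int toy_addr_matches_ifa(uint32_t addr, const struct toy_ifaddr *ifa)
{
    return addr == ifa->local ||
           addr == ifa->broadcast ||
           addr == (ifa->local & ifa->mask) ||
           addr == ((ifa->address & ifa->mask) | ~ifa->mask);
}

/* Returns 0 if the bind is allowed, -EPERM otherwise. */
static int toy_check_bind(const struct toy_ifaddr *ifas, size_t n,
                          int ns_id, uint32_t addr)
{
    size_t i;

    if ((addr & 0xff000000u) == 0x7f000000u ||  /* loopback, 127.0.0.0/8 */
        (addr & 0xf0000000u) == 0xe0000000u ||  /* multicast, 224.0.0.0/4 */
        addr == 0x00000000u ||                  /* INADDR_ANY */
        addr == 0xffffffffu)                    /* INADDR_BROADCAST */
        return 0;

    for (i = 0; i < n; i++) {
        if (ifas[i].ns_id != ns_id)
            continue;
        if (toy_addr_matches_ifa(addr, &ifas[i]))
            return 0;
    }

    return -EPERM;
}
```

For an ifaddr 192.168.1.10/24 owned by namespace 1, this accepts 192.168.1.10, 192.168.1.0 and 192.168.1.255 from namespace 1 and rejects the same addresses from any other namespace.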


[patch 07/12] net namespace: set source address

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

When no source address is specified, search the device list for an
ifaddr that is allowed to be used as the source address.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |   14 
 net/core/net_namespace.c  |   68 ++
 net/ipv4/route.c  |   28 +++--
 3 files changed, 100 insertions(+), 10 deletions(-)

Index: 2.6.20-rc4-mm1/net/ipv4/route.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/route.c
+++ 2.6.20-rc4-mm1/net/ipv4/route.c
@@ -2475,17 +2475,17 @@
 
if (LOCAL_MCAST(oldflp->fl4_dst) || oldflp->fl4_dst == htonl(0x)) {
if (!fl.fl4_src)
-   fl.fl4_src = inet_select_addr(dev_out, 0,
- RT_SCOPE_LINK);
+   fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0,
+RT_SCOPE_LINK);
goto make_route;
}
if (!fl.fl4_src) {
if (MULTICAST(oldflp->fl4_dst))
-   fl.fl4_src = inet_select_addr(dev_out, 0,
- fl.fl4_scope);
+   fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0,
+fl.fl4_scope);
else if (!oldflp->fl4_dst)
-   fl.fl4_src = inet_select_addr(dev_out, 0,
- RT_SCOPE_HOST);
+   fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0,
+RT_SCOPE_HOST);
}
}
 
@@ -2525,8 +2525,8 @@
 */
 
if (fl.fl4_src == 0)
-   fl.fl4_src = inet_select_addr(dev_out, 0,
- RT_SCOPE_LINK);
+   fl.fl4_src = SELECT_SRC_ADDR(dev_out, 0,
+RT_SCOPE_LINK);
res.type = RTN_UNICAST;
goto make_route;
}
@@ -2539,7 +2539,13 @@
 
if (res.type == RTN_LOCAL) {
if (!fl.fl4_src)
+#ifdef CONFIG_NET_NS
+   fl.fl4_src = net_ns_select_source_address(dev_out,
+ fl.fl4_dst,
+ RT_SCOPE_LINK);
+#else
fl.fl4_src = fl.fl4_dst;
+#endif
if (dev_out)
dev_put(dev_out);
dev_out = &loopback_dev;
@@ -2561,8 +2567,10 @@
fib_select_default(&fl, &res);
 
if (!fl.fl4_src)
-   fl.fl4_src = FIB_RES_PREFSRC(res);
-
+   fl.fl4_src = res.fi->fib_prefsrc ? :
+   SELECT_SRC_ADDR(FIB_RES_DEV(res),
+   FIB_RES_GW(res),
+   res.scope);
if (dev_out)
dev_put(dev_out);
dev_out = FIB_RES_DEV(res);
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct net_namespace {
struct kref kref;
@@ -95,6 +96,11 @@
 
 extern int net_ns_check_bind(int addr_type, u32 addr);
 
+extern __be32 net_ns_select_source_address(const struct net_device *dev,
+  u32 dst, int scope);
+
+#define SELECT_SRC_ADDR net_ns_select_source_address
+
 #else /* CONFIG_NET_NS */
 
 #define INIT_NET_NS(net_ns)
@@ -155,6 +161,14 @@
return 0;
 }
 
+static inline __be32 net_ns_select_source_address(struct net_device *dev,
+ u32 dst, int scope)
+{
+   return 0;
+}
+
+#define SELECT_SRC_ADDR inet_select_addr
+
 #endif /* !CONFIG_NET_NS */
 
 #endif /* _LINUX_NET_NAMESPACE_H */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -317,4 +317,72 @@
return ret;
 }
 
+/*
+ * This function choose the source address from the network device,
+ * destination and the scope. The function will browse the ifaddr
+ * owned by network namespace and choose the most adapted for the
+ * dst address and dev.
+ * @dev : the network device where the traffic will go
+ * @dst : the destination a

[patch 10/12] net namespace : add the loopback isolation

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

When a packet is outgoing, the source namespace is stored in the
skbuff. Because the address is the loopback address, source ==
destination, so when the packet comes back in, the destination
namespace is already set in the packet.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |   13 +++--
 include/linux/skbuff.h|5 -
 net/core/net_namespace.c  |   32 +++-
 net/ipv4/ip_input.c   |2 +-
 net/ipv4/ip_output.c  |1 +
 5 files changed, 44 insertions(+), 9 deletions(-)

Index: 2.6.20-rc4-mm1/include/linux/skbuff.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/skbuff.h
+++ 2.6.20-rc4-mm1/include/linux/skbuff.h
@@ -225,6 +225,7 @@
  * @dma_cookie: a cookie to one of several possible DMA operations
  * done by skb DMA functions
  * @secmark: security marking
+ *  @net_ns: namespace destination
  */
 
 struct sk_buff {
@@ -309,7 +310,9 @@
 #ifdef CONFIG_NETWORK_SECMARK
__u32   secmark;
 #endif
-
+#ifdef CONFIG_NET_NS
+   struct net_namespace*net_ns;
+#endif
__u32   mark;
 
/* These elements must be at the end, see alloc_skb() for details.  */
Index: 2.6.20-rc4-mm1/net/ipv4/ip_input.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/ip_input.c
+++ 2.6.20-rc4-mm1/net/ipv4/ip_input.c
@@ -396,7 +396,7 @@
 
iph = skb->nh.iph;
 
-   dst_net_ns = net_ns_find_from_dest_addr(iph->daddr);
+   dst_net_ns = net_ns_find_from_dest_addr(skb);
if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
push_net_ns(dst_net_ns);
/*
Index: 2.6.20-rc4-mm1/net/ipv4/ip_output.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/ip_output.c
+++ 2.6.20-rc4-mm1/net/ipv4/ip_output.c
@@ -272,6 +272,7 @@
 
IP_INC_STATS(IPSTATS_MIB_OUTREQUESTS);
 
+   net_ns_tag_sk_buff(skb);
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -8,6 +8,7 @@
 #include 
 
 struct in_ifaddr;
+struct sk_buff;
 
 struct net_namespace {
struct kref kref;
@@ -101,10 +102,13 @@
 extern __be32 net_ns_select_source_address(const struct net_device *dev,
   u32 dst, int scope);
 
-extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr);
+extern struct net_namespace
+*net_ns_find_from_dest_addr(const struct sk_buff *skb);
 
 extern int net_ns_ifa_is_visible(const struct in_ifaddr *ifa);
 
+extern void net_ns_tag_sk_buff(struct sk_buff *skb);
+
 #define SELECT_SRC_ADDR net_ns_select_source_address
 
 #else /* CONFIG_NET_NS */
@@ -173,7 +177,8 @@
return 0;
 }
 
-static inline struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
+static inline struct net_namespace
+*net_ns_find_from_dest_addr(const struct sk_buff *skb)
 {
return NULL;
 }
@@ -183,6 +188,10 @@
return 1;
 }
 
+static inline void net_ns_tag_sk_buff(struct sk_buff *skb)
+{
+   ;
+}
 #define SELECT_SRC_ADDR inet_select_addr
 
 #endif /* !CONFIG_NET_NS */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+
 #include 
 
 struct net_namespace init_net_ns = {
@@ -389,18 +392,25 @@
 /*
  * This function finds the network namespace destination deduced from
  * the destination address. The network namespace is retrieved from
- * the ifaddr owned by a network namespace
- * @daddr  : destination
+ * the ifaddr owned by a network namespace. If the packet is for the
+ * loopback address so we assume the destination address is already filled
+ * by the sender which is the same as the receiver.
+ * @skb : the packet to be delivered
  * Returns : the network namespace destination or NULL if not found
  */
-struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
+struct net_namespace *net_ns_find_from_dest_addr(const struct sk_buff *skb)
 {
struct net_namespace *net_ns = NULL;
struct net_device *dev;
struct in_device *in_dev;
+   struct iphdr *iph;
+   __be32 daddr;
+
+   iph = skb->nh.iph;
+   daddr = iph->daddr;
 
-   if (LOOPBACK(daddr))
-   return current_net_ns;
+   if (LOOPBACK(daddr))
+   return skb->net_ns;
 
read_lock(&dev_base_lock);
rcu_read_lock();
@@ -442,4 +452,16 @@
return 0;
 }
 
+/*
+ * This fun
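
The tag-on-output, read-on-input idea can be sketched in user space as follows. This is only an illustration under stated assumptions: toy_pkt, toy_tag_pkt and toy_find_ns_for_pkt are hypothetical names, addresses are in host byte order, and since the body of net_ns_tag_sk_buff() is cut off above, the tagging helper simply stores the sender's namespace as the patch description says.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct toy_ns {
    int id;
};

struct toy_ifaddr {
    uint32_t local;        /* ifa_local, host byte order here */
    struct toy_ns *owner;  /* ifa_net_ns */
};

struct toy_pkt {
    uint32_t daddr;        /* destination address */
    struct toy_ns *ns_tag; /* plays the role of skb->net_ns */
};

/* Output path: remember which namespace sent the packet. */
static void toy_tag_pkt(struct toy_pkt *pkt, struct toy_ns *sender_ns)
{
    pkt->ns_tag = sender_ns;
}

/*
 * Input path: a loopback packet goes back to the namespace that sent
 * it; any other packet is matched against the ifaddr owners.
 */
static struct toy_ns *toy_find_ns_for_pkt(const struct toy_ifaddr *ifas,
                                          size_t n,
                                          const struct toy_pkt *pkt)
{
    size_t i;

    if ((pkt->daddr & 0xff000000u) == 0x7f000000u)
        return pkt->ns_tag;

    for (i = 0; i < n; i++)
        if (ifas[i].local == pkt->daddr)
            return ifas[i].owner;

    return NULL;
}
```

The tag only matters for loopback traffic; for every other destination the owner of the matching ifaddr still decides.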

[patch 05/12] net namespace : ioctl to push ifa to net namespace l3

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

New ioctl to "push" an ifaddr to a container. Actually, the push is done
from the current namespace, so the right word is "pull". That will be
changed to move the ifaddr from the l2 network namespace to l3.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 include/linux/net_namespace.h |7 ++
 include/linux/sockios.h   |4 +
 net/core/net_namespace.c  |  118 +-
 net/ipv4/af_inet.c|4 +
 4 files changed, 132 insertions(+), 1 deletion(-)

Index: 2.6.20-rc4-mm1/include/linux/sockios.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/sockios.h
+++ 2.6.20-rc4-mm1/include/linux/sockios.h
@@ -122,6 +122,10 @@
 #define SIOCBRADDIF0x89a2  /* add interface to bridge  */
 #define SIOCBRDELIF0x89a3  /* remove interface from bridge */
 
+/* Container calls */
+#define SIOCNETNSPUSHIF  0x89b0 /* add ifaddr to namespace  */
+#define SIOCNETNSPULLIF  0x89b1 /* remove ifaddr from namespace */
+
 /* Device private ioctl calls */
 
 /*
Index: 2.6.20-rc4-mm1/net/ipv4/af_inet.c
===
--- 2.6.20-rc4-mm1.orig/net/ipv4/af_inet.c
+++ 2.6.20-rc4-mm1/net/ipv4/af_inet.c
@@ -789,6 +789,10 @@
case SIOCSIFFLAGS:
err = devinet_ioctl(cmd, (void __user *)arg);
break;
+   case SIOCNETNSPUSHIF:
+   case SIOCNETNSPULLIF:
+   err = net_ns_ioctl(cmd, (void __user *)arg);
+   break;
default:
if (sk->sk_prot->ioctl)
err = sk->sk_prot->ioctl(sk, cmd, arg);
Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
===
--- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
+++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
@@ -91,6 +91,8 @@
 
 #define net_ns_hash(ns)((ns)->hash)
 
+extern int net_ns_ioctl(unsigned int cmd, void __user *arg);
+
 #else /* CONFIG_NET_NS */
 
 #define INIT_NET_NS(net_ns)
@@ -141,6 +143,11 @@
 
 #define net_ns_hash(ns)(0)
 
+static inline int net_ns_ioctl(unsigned int cmd, void __user *arg)
+{
+   return -ENOSYS;
+}
+
 #endif /* !CONFIG_NET_NS */
 
 #endif /* _LINUX_NET_NAMESPACE_H */
Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
===
--- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
+++ 2.6.20-rc4-mm1/net/core/net_namespace.c
@@ -10,7 +10,9 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 
 struct net_namespace init_net_ns = {
@@ -123,6 +125,33 @@
return err;
 }
 
+/*
+ * The function will move the ifaddr to the l2 network namespace
+ * parent.
+ * @net_ns: the related network namespace
+ */
+static void release_ifa_to_parent(const struct net_namespace* net_ns)
+{
+   struct net_device *dev;
+   struct in_device *in_dev;
+
+   read_lock(&dev_base_lock);
+   rcu_read_lock();
+   for (dev = dev_base; dev; dev = dev->next) {
+   in_dev = __in_dev_get_rcu(dev);
+   if (!in_dev)
+   continue;
+
+   for_ifa(in_dev) {
+   if (ifa->ifa_net_ns != net_ns)
+   continue;
+   ifa->ifa_net_ns = net_ns->parent;
+   } endfor_ifa(in_dev);
+   }
+   read_unlock(&dev_base_lock);
+   rcu_read_unlock();
+}
+
 void free_net_ns(struct kref *kref)
 {
struct net_namespace *ns;
@@ -139,12 +168,99 @@
}
}
 
-   if (ns->level == NET_NS_LEVEL3)
+   if (ns->level == NET_NS_LEVEL3) {
+   release_ifa_to_parent(ns);
put_net_ns(ns->parent);
+   }
 
printk(KERN_DEBUG "NET_NS: net namespace %p destroyed\n", ns);
kfree(ns);
 }
 EXPORT_SYMBOL_GPL(free_net_ns);
 
+/*
+ * This function assigns an IP address from an l2 network
+ * namespace to one of its l3 children, or releases it from an l3
+ * network namespace to its l2 parent network namespace.
+ * @cmd: a "push" / "pull" command
+ * @arg: a userspace buffer containing an ifreq structure
+ * Returns:
+ * - EPERM : if caller has no CAP_NET_ADMIN capabilities or the
+ *   current level of network namespace is not layer 2
+ * - EFAULT : if arg is an invalid buffer
+ * - EADDRNOTAVAIL : if the specified ifaddr does not exist
+ * - EINVAL : if cmd is unknown
+ * - zero on success
+ */
+int net_ns_ioctl(unsigned int cmd, void __user *arg)
+{
+   struct ifreq ifr;
+   struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;
+   struct net_namespace *net_ns = current_net_ns;
+   struct net_device *dev;
+   struct in_device *in_dev;
+   struct in_ifad
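
Since the body of net_ns_ioctl() is cut off here, the following user-space C is only a guess at the shape of the ownership change, not the patch's actual code. The names toy_move_ifa and toy_release_ifa_to_parent are hypothetical, the device list is flattened into an array, and the capability and level checks listed in the comment above are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct toy_ns {
    struct toy_ns *parent;
};

struct toy_ifaddr {
    uint32_t local;        /* ifa_local, host byte order here */
    struct toy_ns *owner;  /* ifa_net_ns */
};

/*
 * Hand the ifaddr with the given address over from one namespace to
 * another. Returns 0 on success, -EADDRNOTAVAIL if "from" does not own
 * such an address.
 */
static int toy_move_ifa(struct toy_ifaddr *ifas, size_t n, uint32_t addr,
                        const struct toy_ns *from, struct toy_ns *to)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (ifas[i].local == addr && ifas[i].owner == from) {
            ifas[i].owner = to;
            return 0;
        }
    }
    return -EADDRNOTAVAIL;
}

/* Same idea as release_ifa_to_parent(): give everything back on teardown. */
static void toy_release_ifa_to_parent(struct toy_ifaddr *ifas, size_t n,
                                      const struct toy_ns *ns)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (ifas[i].owner == ns)
            ifas[i].owner = ns->parent;
}
```

Only the owner pointer changes; the address stays on the same device, which is why an l3 namespace can share its parent's devices.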

[patch 11/12] net namespace : debugfs - add net_ns debugfs

2007-01-19 Thread dlezcano

From: Daniel Lezcano <[EMAIL PROTECTED]>

For debugging purposes only; this is not intended for inclusion.
Add /sys/kernel/debug/net_ns.

Creation of network namespace:

echo  > /sys/kernel/debug/net_ns/start

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>

---
 fs/debugfs/Makefile |2 
 fs/debugfs/net_ns.c |  335 
 net/Kconfig |4 
 3 files changed, 340 insertions(+), 1 deletion(-)

Index: 2.6.20-rc4-mm1/fs/debugfs/Makefile
===
--- 2.6.20-rc4-mm1.orig/fs/debugfs/Makefile
+++ 2.6.20-rc4-mm1/fs/debugfs/Makefile
@@ -1,4 +1,4 @@
 debugfs-objs   := inode.o file.o
 
 obj-$(CONFIG_DEBUG_FS) += debugfs.o
-
+obj-$(CONFIG_NET_NS_DEBUG) += net_ns.o
Index: 2.6.20-rc4-mm1/fs/debugfs/net_ns.c
===
--- /dev/null
+++ 2.6.20-rc4-mm1/fs/debugfs/net_ns.c
@@ -0,0 +1,335 @@
+/*
+ *  net_ns.c - adds a net_ns/ directory to debug NET namespaces
+ *
+ *  Author: Daniel Lezcano <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation, version 2 of the
+ * License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static struct dentry *net_ns_dentry;
+static struct dentry *net_ns_dentry_dev;
+static struct dentry *net_ns_dentry_start;
+static struct dentry *net_ns_dentry_info;
+
+static ssize_t net_ns_dev_read_file(struct file *file, char __user *user_buf,
+   size_t count, loff_t *ppos)
+{
+   return 0;
+}
+
+static ssize_t net_ns_dev_write_file(struct file *file,
+const char __user *user_buf,
+size_t count, loff_t *ppos)
+{
+   return 0;
+}
+
+static int net_ns_dev_open_file(struct inode *inode, struct file *file)
+{
+   return 0;
+}
+
+static int net_ns_start_open_file(struct inode *inode, struct file *file)
+{
+   return 0;
+}
+
+static ssize_t net_ns_start_read_file(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+   return 0;
+}
+
+static ssize_t net_ns_start_write_file(struct file *file,
+  const char __user *user_buf,
+  size_t count, loff_t *ppos)
+{
+   int err;
+   size_t len;
+   const char __user *p;
+   char c;
+   unsigned long flags;
+   struct net_namespace *net, *new_net;
+   struct nsproxy *new_nsproxy = NULL, *old_nsproxy = NULL;
+
+   if (current_net_ns != &init_net_ns)
+   return -EBUSY;
+
+   len = 0;
+   p = user_buf;
+   while (len < count) {
+   if (get_user(c, p++))
+   return -EFAULT;
+   if (c == 0 || c == '\n')
+   break;
+   len++;
+   }
+
+   if (len > 1)
+   return -EINVAL;
+
+   if (copy_from_user(&c, user_buf, sizeof(c)))
+   return -EFAULT;
+
+   if (c != '2' && c != '3')
+   return -EINVAL;
+
+   flags = (c=='2'?CLONE_NEWNET2:CLONE_NEWNET3);
+   err = unshare_net_ns(flags, &new_net);
+   if (err)
+   return err;
+
+   old_nsproxy = current->nsproxy;
+   new_nsproxy = dup_namespaces(old_nsproxy);
+
+   if (!new_nsproxy) {
+   put_net_ns(new_net);
+   return -ENOMEM;
+   }
+
+   task_lock(current);
+
+   if (new_nsproxy) {
+   current->nsproxy = new_nsproxy;
+   new_nsproxy = old_nsproxy;
+   }
+
+   net = current->nsproxy->net_ns;
+   current->nsproxy->net_ns = new_net;
+   pop_net_ns(new_net);
+   new_net = net;
+
+   task_unlock(current);
+
+   put_nsproxy(new_nsproxy);
+   put_net_ns(new_net);
+
+   return count;
+}
+
+static int net_ns_info_open_file(struct inode *inode, struct file *file)
+{
+   return 0;
+}
+
+static ssize_t net_ns_info_read_file(struct file *file, char __user *user_buf,
+size_t count, loff_t *ppos)
+{
+   const unsigned int length = 256;
+   size_t len;
+   char buff[length];
+   char *level;
+   struct net_namespace *net_ns = current_net_ns;
+   struct nsproxy *ns = current->nsproxy;
+
+   if (*ppos < 0)
+   return -EINVAL;
+   if (*ppos >= count)
+   return 0;
+   if (!count)
+   return 0;
+
+   switch (net_ns->level) {
+   case NET_NS_LEVEL2:
+   level = "layer 2";
+   break;
+   case NET_NS_LEVEL3:
+   level = "layer 3";
+   break;
+   default:
+   level = "unknown";

RE: [PATCH 2.6.20 1/5] s2io: updates for s2io driver.

2007-01-19 Thread Sivakumar Subramani

Hi Jeff,

Thanks for the comments and references.

As per your suggestion, we have resubmitted the patches with the
required changes.

Thanks,
~Siva 

-Original Message-
From: Jeff Garzik [mailto:[EMAIL PROTECTED] 
Sent: Thursday, January 18, 2007 10:32 PM
To: Ananda Raju
Cc: netdev@vger.kernel.org; Leonid Grossman; Sivakumar Subramani; Alicia
Pena; [EMAIL PROTECTED]; Ramkrishna Vepa
Subject: Re: [PATCH 2.6.20 1/5] s2io: updates for s2io driver.

Ananda Raju wrote:
> Hello,
> 
> List of changes in this patch:
> 
>   This patch adds two load parameters, napi and ufo. Previously NAPI was
> a compilation option; with these changes we can enable or disable NAPI
> using a load parameter. We are also introducing a ufo load parameter to
> enable/disable the ufo feature.
> 
> Signed-off-by: Sivakumar Subramani <[EMAIL PROTECTED]>

OK, you're getting closer :)

Problems that need correcting:

1) Your email subject line is a one-line summary of the patch.  "s2io: 
updates for s2io driver" is useless, because it tells us nothing about
the patch itself.  When applied in a series,

git log master..upstream-fixes | git shortlog

will produce

Ananda Raju (5):
s2io: updates for s2io driver
s2io: updates for s2io driver
s2io: updates for s2io driver
s2io: updates for s2io driver
s2io: updates for s2io driver

which clearly makes it impossible to distinguish between changesets. 
Please re-read Rule #1 of http://linux.yyz.us/patch-format.html

Also, re-read Rule #2.  Everything in your email body before the "---" 
terminator is copied DIRECTLY into the kernel changelog.  As such,
comments like "Hello," and "List of changes in this patch:" must be
hand-edited out of your email, before applying the patch.

Please fix these problems and resubmit.

Jeff


Re: [take33 10/10] kevent: Kevent based AIO (aio_sendfile()/aio_sendfile_path()).

2007-01-19 Thread Evgeniy Polyakov

On Fri, Jan 19, 2007 at 11:57:00AM +0530, Suparna Bhattacharya ([EMAIL PROTECTED]) wrote:
> > > Since you are implementing new APIs here, have you considered doing an
> > > aio_sendfilev to be able to send a header with the data ?
> > 
> > It is doable, but why do people not like corking?
> > With Linux's sub-microsecond syscall overhead it is the better and more
> > flexible solution, isn't it?
> 
> That is what I used to think as well. However ...
> 
> The problem as I understand it now is not about bunching data together, but
> of ensuring some sort of atomicity between the header and the data, when
> there can be multiple outstanding aio requests on the same socket - i.e.
> ensuring strict ordering without other data coming in between, when data
> to be sent is not already in cache, and in the meantime another sendfile
> or aio write request comes in for the same socket. Without having to lock
> the socket when reading data from disk.

No, socket locking is not a solution at all here.
But the same applies to the header - it will be copied into the socket queue,
then the socket will be unlocked and the populated VFS data will be put into
that queue too, but there is a window between the socket unlock after the
header copy and the file data copy. If we hold the socket lock after the
header is copied, we may hold it for too long - with bad sectors on disk,
reading might take forever.
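
For reference, the corking approach discussed above might look like the following minimal sketch. The helper names set_cork() and send_header_and_file() are made up for illustration; only setsockopt(TCP_CORK), write() and sendfile() are real interfaces. Note that corking only lets the stack coalesce the header and the file data into full segments. It does not close the window described above: another writer on the same socket can still get in between the two calls.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set or clear TCP_CORK on a connected TCP socket. */
static int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/*
 * Hypothetical helper: send hdr, then count bytes of fd starting at offset.
 * Returns 0 on success, -1 on error. EINTR/EAGAIN handling is left out to
 * keep the sketch short.
 */
static int send_header_and_file(int sock, const void *hdr, size_t hdr_len,
                                int fd, off_t offset, size_t count)
{
    const char *p = hdr;
    int ret = -1;

    assert(hdr != NULL || hdr_len == 0);

    if (set_cork(sock, 1) < 0) {
        perror("TCP_CORK on");
        return -1;
    }

    /* Queue the header; nothing short of a full segment is sent yet. */
    while (hdr_len > 0) {
        ssize_t n = write(sock, p, hdr_len);
        if (n < 0)
            goto uncork;
        p += n;
        hdr_len -= (size_t)n;
    }

    /* The socket is NOT locked here; other writers may interleave. */
    while (count > 0) {
        ssize_t n = sendfile(sock, fd, &offset, count);
        if (n <= 0)     /* error, or file shorter than count */
            goto uncork;
        count -= (size_t)n;
    }
    ret = 0;

uncork:
    /* Clearing the cork flushes whatever is still queued. */
    if (set_cork(sock, 0) < 0)
        ret = -1;
    return ret;
}
```

That is three extra syscalls per response (cork, write, uncork) compared with a single call that takes the header pointer as well.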

> There are alternate ways to address this, aio_sendfilev is one of the options
> I have heard people requesting.

I bet those people worked with different Unix systems, which have much
slower syscalls, so they combine several operations into one call.

Only from this perspective do I see any benefit in having the header in the
syscall related to file transfer. Since I already "optimized" the open()
syscall into file sending, things cannot become worse if I put the header
pointer there too. I will schedule a new kevent release with this change
sometime after the current work on the M-on-N threading model.

> Regards
> Suparna

-- 
Evgeniy Polyakov

Re: Possible ways of dealing with OOM conditions.

2007-01-19 Thread Peter Zijlstra


> Let me briefly describe your approach and possible drawbacks in it.
> You start reserving some memory when the system is under memory pressure.
> When the system is in real trouble, you start using that reserve for special
> tasks, mainly for the network path to allocate packets and process them in
> order to get some memory swapping committed.
> 
> So, the problems I see here are the following:
> 1. It is possible that when you start to create a reserve, there
> will not be enough memory at all. So the solution is to reserve in
> advance.

Swap is usually enabled at startup, but sure, if you want you can mess
this up.

> 2. You differentiate by hand between critical and non-critical
> allocations by specifying some kernel users as potentially possible to
> allocate from reserve. 

True, all sockets that are needed for swap, no-one else.

> This does not prevent from NVIDIA module to
> allocate from that reserve too, does it?

All users of the NVidiot crap deserve all the pain they get.
If it breaks they get to keep both pieces.

> And you artificially limit
> system to process only tiny bits of what it must do, thus potentially
> leaking paths which must use reserve too.

How so? I cover pretty much every allocation needed to process an skb by
setting PF_MEMALLOC - the only drawback there is that the reserve might
not actually be large enough because it covers more allocations than
were considered. (That's one of the TODO items: validate the reserve
functions' parameters.)

> So, solution is to have a reserve in advance, and manage it using
> special path when system is in OOM. So you will have network memory
> reserve, which will be used when system is in trouble. It is very
> similar to what you had.
> 
> But the whole reserve can never be used at all, so it should be used,
> but not by those who can create OOM condition, thus it should be
> exported to, for example, network only, and when system is in trouble,
> network would be still functional (although only critical pathes).

But the network can create OOM conditions for itself just fine. 

Consider the remote storage disappearing for a while (it got rebooted,
someone tripped over the wire etc..). Now the rest of the network
traffic keeps coming and will queue up - because user-space is stalled,
waiting for more memory - and we run out of memory.

There must be a point where we start dropping packets that are not
critical to the survival of the machine.

> Even further development of such idea is to prevent such OOM condition
> at all - by starting swapping early (but wisely) and reduce memory
> usage.

These just postpone execution but will not avoid it.



Re: Kernel headers - linux-atm userspace build broken by recent change; __be16 undefined

2007-01-19 Thread Adrian Bunk

On Thu, Jan 18, 2007 at 09:22:52PM +, Andrew Walrond wrote:
> Don't know exactly when this change went in, but it's not in 2.6.18.3 
> and is in 2.6.19.2+
> 
>  $ diff linux/include/linux/if_arp.h linux-2.6/include/linux/if_arp.h
> 133,134c133,134
> <   unsigned short  ar_hrd; /* format of hardware address   */
> <   unsigned short  ar_pro; /* format of protocol address   */
> ---
> >   __be16  ar_hrd; /* format of hardware address   */
> >   __be16  ar_pro; /* format of protocol address   */
> 137c137
> <   unsigned short  ar_op;  /* ARP opcode (command) */
> ---
> >   __be16  ar_op;  /* ARP opcode (command) */
> 
> 
> This causes the linux-atm userspace compile to fail like this:
> 
> In file included from arp.c:19:
> /usr/include/linux/if_arp.h:133: error: expected 
> specifier-qualifier-list before '__be16'
> 
> I guess if_arp.h needs to include include/linux/byteorder/big_endian.h?

No, linux/types.h

But what bothers me more about if_arp.h is that it is one of the
headers using "struct sockaddr" in userspace, but as far as I can see we 
aren't exporting it in any header.

This seems to work since glibc is providing the struct, but this looks 
a bit fishy.

> Andrew Walrond

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: Possible ways of dealing with OOM conditions.

2007-01-19 Thread Christoph Lameter

On Thu, 18 Jan 2007, Peter Zijlstra wrote:

> 
> > Cache misses for small packet flow due to the fact, that the same data
> > is allocated and freed  and accessed on different CPUs will become an
> > issue soon, not right now, since two-four core CPUs are not yet to be
> > very popular and price for the cache miss is not _that_ high.
> 
> SGI does networking too, right?

Slab deals with those issues the right way. We have per-processor
queues that attempt to keep the cache-hot state. A special shared queue
exists between neighboring processors to facilitate exchange of objects
between them.

Re: [PATCH] tcp_output: Re: rare bad TCP checksum with 2.6.19?

2007-01-19 Thread Herbert Xu

On Fri, Jan 19, 2007 at 12:06:41PM +0100, Jarek Poplawski wrote:
> 
> [PATCH][NET] tcp_output: rare bad TCP checksum with 2.6.19
> 
> The patch "Replace CHECKSUM_HW by CHECKSUM_PARTIAL/CHECKSUM_COMPLETE"
> changed to unconditional copying of ip_summed field from collapsed
> skb. This patch reverts this change.   
> 
> All substantial work including heavy testing and diagnosing by:
> Michael Tokarev <[EMAIL PROTECTED]>
> 
> Signed-off-by: Jarek Poplawski <[EMAIL PROTECTED]>

Acked-by: Herbert Xu <[EMAIL PROTECTED]>

Thanks for catching this! I'll take the credit for adding this bug :)

Dave, we'll need this fix for 2.6.20 as well as 2.6.19.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

[PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-19 Thread Neil Horman

Patch to Implement IPv6 RFC 4429 (Optimistic Duplicate Address Detection).  In
short, this is a feature whereby a node with a Tentative address can begin to
make use of that address almost immediately after it's configured.  To enable
this, extra rules need to be followed during the Duplicate Address Detection
phase of the address's configuration, so that in the event of a collision,
neighboring nodes do not have their neighbor caches affected adversely by the
optimistic node.  This patch implements those rules as outlined in the RFC.

I have a fairly limited testing environment here, but from the testing I've
done, this patch appears to conform to the rules as outlined in RFC 4429, causes
no adverse effects on normal IPv6 operation when in use, and doesn't seem to
break anything when disabled via the sysctl.

Comments and Reviews appreciated.

Thanks and Regards
Neil


Signed-Off-By: Neil Horman <[EMAIL PROTECTED]>


 include/linux/if_addr.h|1 
 include/linux/sysctl.h |1 
 include/net/addrconf.h |4 ++-
 include/net/ipv6.h |1 
 net/ipv6/addrconf.c|   50 +++---
 net/ipv6/mcast.c   |4 +--
 net/ipv6/ndisc.c   |   59 ++---
 net/ipv6/sysctl_net_ipv6.c |8 ++
 8 files changed, 107 insertions(+), 21 deletions(-)


diff --git a/include/linux/if_addr.h b/include/linux/if_addr.h
index d557e4c..43f3bed 100644
--- a/include/linux/if_addr.h
+++ b/include/linux/if_addr.h
@@ -39,6 +39,7 @@ enum
 #define IFA_F_TEMPORARY    IFA_F_SECONDARY
 
 #define IFA_F_NODAD        0x02
+#define IFA_F_OPTIMISTIC   0x04
 #define IFA_F_HOMEADDRESS  0x10
 #define IFA_F_DEPRECATED   0x20
 #define IFA_F_TENTATIVE    0x40
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 81480e6..62034c3 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -531,6 +531,7 @@ enum {
NET_IPV6_IP6FRAG_TIME=23,
NET_IPV6_IP6FRAG_SECRET_INTERVAL=24,
NET_IPV6_MLD_MAX_MSF=25,
+   NET_IPV6_OPT_DAD_ENABLE=26,
 };
 
 enum {
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 88df8fc..d248a19 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -73,7 +73,9 @@ extern int ipv6_get_saddr(struct dst_entry *dst,
 extern int ipv6_dev_get_saddr(struct net_device *dev, 
    struct in6_addr *daddr,
    struct in6_addr *saddr);
-extern int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *);
+extern int ipv6_get_lladdr(struct net_device *dev, 
+   struct in6_addr *,
+   unsigned char banned_flags);
 extern int ipv6_rcv_saddr_equal(const struct sock *sk, 
   const struct sock *sk2);
 extern void addrconf_join_solict(struct net_device *dev,
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 00328b7..dd16169 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -110,6 +110,7 @@ struct frag_hdr {
 /* sysctls */
 extern int sysctl_ipv6_bindv6only;
 extern int sysctl_mld_max_msf;
+extern int sysctl_optimistic_dad;
 
 /* MIBs */
 DECLARE_SNMP_STAT(struct ipstats_mib, ipv6_statistics);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2a7e461..f7afb2a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -206,6 +206,8 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
.proxy_ndp  = 0,
 };
 
+int sysctl_optimistic_dad = 1;
+
 /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */
 #if 0
 const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT;
@@ -830,7 +832,8 @@ retry:
ift = !max_addresses ||
  ipv6_count_addresses(idev) < max_addresses ? 
ipv6_add_addr(idev, &addr, tmp_plen,
- ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL;
+ ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, 
+   IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL;
if (!ift || IS_ERR(ift)) {
in6_ifa_put(ifp);
in6_dev_put(idev);
@@ -1174,7 +1177,8 @@ int ipv6_get_saddr(struct dst_entry *dst,
 }
 
 
-int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr)
+int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, 
+   unsigned char banned_flags)
 {
struct inet6_dev *idev;
int err = -EADDRNOTAVAIL;
@@ -1185,7 +1189,7 @@ int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr)
 
read_lock_bh(&idev->lock);
for (ifp=idev->addr_list; ifp; ifp=ifp->if_next) {
-

Re: [PATCH 10/12] forcedeth: tx max work

2007-01-19 Thread Ayaz Abdulla

Jeff Garzik wrote:
> Ayaz Abdulla wrote:
> > This patch adds a limit to how much tx work can be done in each
> > iteration of tx processing.
> >
> > Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]>
>
> What about the "tail end" of the work, when the limit is reached?
>
> Remember that delaying the completion of TX's too long increases latency.
>
> It seems to me that this patch needs a timer or somesuch, to guarantee
> that TX completions are not delayed too long in the worst case.

Yes, you are right.
There is a timer interrupt that fires in throughput mode every 10ms (in 
cpu mode it fires at approx every 130us). I can use that to clean out 
any uncompleted TXs. Let me know if 10ms is not too late for worst case 
tx completion.
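
To make the idea concrete, here is a rough, self-contained sketch of a work-limited tx completion loop whose leftovers are picked up by a periodic timer. It is not forcedeth code: struct tx_ring, tx_done() and tx_timer_tick() are hypothetical names, and the ring is a plain array standing in for the hardware descriptors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_RING_SIZE 8

/* Hypothetical stand-in for a driver's tx descriptor ring. */
struct tx_ring {
    bool done[TX_RING_SIZE]; /* set by "hardware" when a descriptor completes */
    size_t clean;            /* count of descriptors reclaimed so far */
    size_t used;             /* count of descriptors handed to hardware */
    bool more_pending;       /* limit was hit while completed work remained */
};

/*
 * Reclaim at most limit completed descriptors. Returns how many were
 * reclaimed and records in more_pending whether the limit cut the pass short.
 */
static size_t tx_done(struct tx_ring *r, size_t limit)
{
    size_t n = 0;

    r->more_pending = false;
    while (r->clean != r->used && r->done[r->clean % TX_RING_SIZE]) {
        if (n == limit) {
            r->more_pending = true; /* the "tail end" is left for later */
            break;
        }
        r->done[r->clean % TX_RING_SIZE] = false;
        r->clean++;
        n++;
    }
    return n;
}

/* Periodic timer path: only does work if the interrupt path left some. */
static size_t tx_timer_tick(struct tx_ring *r, size_t limit)
{
    if (!r->more_pending)
        return 0;
    return tx_done(r, limit);
}
```

With this shape the worst-case completion delay for the tail end is one timer period (10ms in throughput mode, per the above), plus further periods if the backlog exceeds the limit more than once.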

> Jeff


Re: bonding: sysfs patch broke module renaming

2007-01-19 Thread Jay Vosburgh

Patrick McHardy <[EMAIL PROTECTED]> wrote:

>The sysfs patch broke using multiple instances of the bonding module
>through module renaming (modprobe -o). In recent kernels it fails
>with -EEXIST when trying to add the bonding_masters file for the
>second time, in older kernels (where sysfs_add_file didn't check
>for duplicates) it will crash when unloading the modules.

Ok, I see what the problem is; it's got to do with how device
creation was changed at some point for the sysfs stuff, which broke the
multiple load logic.  I don't think it has to do with the sysfs_add_file
duplicate check business; I can see the error in how bond_create() is
called in the new (post-sysfs) stuff, although I haven't tracked it down
to a particular changeset.

There's also a separate error handling bug I see in
bond_create() that I don't even get to because it bails out first.

Anyway, let me see what I can work out to fix this up.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

e1000: update device ID table for register dumps

2007-01-19 Thread Auke Kok

e1000: update device ID table for register dumps with new devices

From: Auke Kok <[EMAIL PROTECTED]>

The register dump routine of e1000 was missing several newer chipsets. I
reimported the mac detection code from the linux e1000 driver. This fixes
newer NIC's reporting that their bus type is PCI instead of PCI-e.

Signed-off-by: Auke Kok <[EMAIL PROTECTED]>
---

 e1000.c |  154 ++-
 1 files changed, 103 insertions(+), 51 deletions(-)

diff --git a/e1000.c b/e1000.c
index 6741323..d67947a 100644
--- a/e1000.c
+++ b/e1000.c
@@ -111,42 +111,66 @@
 #define E1000_TCTL_NRTU   0x0200/* No Re-transmit on underrun */
 
 /* PCI Device IDs */
-#define E1000_DEV_ID_82542   0x1000
-#define E1000_DEV_ID_82543GC_FIBER   0x1001
-#define E1000_DEV_ID_82543GC_COPPER  0x1004
-#define E1000_DEV_ID_82544EI_COPPER  0x1008
-#define E1000_DEV_ID_82544EI_FIBER   0x1009
-#define E1000_DEV_ID_82544GC_COPPER  0x100C
-#define E1000_DEV_ID_82544GC_LOM 0x100D
-#define E1000_DEV_ID_82540EM 0x100E
-#define E1000_DEV_ID_82540EM_LOM 0x1015
-#define E1000_DEV_ID_82540EP_LOM 0x1016
-#define E1000_DEV_ID_82540EP 0x1017
-#define E1000_DEV_ID_82540EP_LP  0x101E
-#define E1000_DEV_ID_82545EM_COPPER  0x100F
-#define E1000_DEV_ID_82545EM_FIBER   0x1011
-#define E1000_DEV_ID_82545GM_COPPER  0x1026
-#define E1000_DEV_ID_82545GM_FIBER   0x1027
-#define E1000_DEV_ID_82545GM_SERDES  0x1028
-#define E1000_DEV_ID_82546EB_COPPER  0x1010
-#define E1000_DEV_ID_82546EB_FIBER   0x1012
-#define E1000_DEV_ID_82546EB_QUAD_COPPER 0x101D
-#define E1000_DEV_ID_82541EI 0x1013
-#define E1000_DEV_ID_82541EI_MOBILE  0x1018
-#define E1000_DEV_ID_82541ER 0x1078
-#define E1000_DEV_ID_82547GI 0x1075
-#define E1000_DEV_ID_82541GI 0x1076
-#define E1000_DEV_ID_82541GI_MOBILE  0x1077
-#define E1000_DEV_ID_82541GI_LF  0x107C
-#define E1000_DEV_ID_82546GB_COPPER  0x1079
-#define E1000_DEV_ID_82546GB_FIBER   0x107A
-#define E1000_DEV_ID_82546GB_SERDES  0x107B
-#define E1000_DEV_ID_82546GB_PCIE 0x108A
-#define E1000_DEV_ID_82547EI 0x1019
-#define E1000_DEV_ID_82573E  0x108B
-#define E1000_DEV_ID_82573E_IAMT 0x108C
-
-#define E1000_DEV_ID_82546GB_QUAD_COPPER 0x1099
+#define E1000_DEV_ID_82542                     0x1000
+#define E1000_DEV_ID_82543GC_FIBER             0x1001
+#define E1000_DEV_ID_82543GC_COPPER            0x1004
+#define E1000_DEV_ID_82544EI_COPPER            0x1008
+#define E1000_DEV_ID_82544EI_FIBER             0x1009
+#define E1000_DEV_ID_82544GC_COPPER            0x100C
+#define E1000_DEV_ID_82544GC_LOM               0x100D
+#define E1000_DEV_ID_82540EM                   0x100E
+#define E1000_DEV_ID_82540EM_LOM               0x1015
+#define E1000_DEV_ID_82540EP_LOM               0x1016
+#define E1000_DEV_ID_82540EP                   0x1017
+#define E1000_DEV_ID_82540EP_LP                0x101E
+#define E1000_DEV_ID_82545EM_COPPER            0x100F
+#define E1000_DEV_ID_82545EM_FIBER             0x1011
+#define E1000_DEV_ID_82545GM_COPPER            0x1026
+#define E1000_DEV_ID_82545GM_FIBER             0x1027
+#define E1000_DEV_ID_82545GM_SERDES            0x1028
+#define E1000_DEV_ID_82546EB_COPPER            0x1010
+#define E1000_DEV_ID_82546EB_FIBER             0x1012
+#define E1000_DEV_ID_82546EB_QUAD_COPPER       0x101D
+#define E1000_DEV_ID_82546GB_COPPER            0x1079
+#define E1000_DEV_ID_82546GB_FIBER             0x107A
+#define E1000_DEV_ID_82546GB_SERDES            0x107B
+#define E1000_DEV_ID_82546GB_PCIE              0x108A
+#define E1000_DEV_ID_82546GB_QUAD_COPPER       0x1099
+#define E1000_DEV_ID_82546GB_QUAD_COPPER_KSP3  0x10B5
+#define E1000_DEV_ID_82541EI                   0x1013
+#define E1000_DEV_ID_82541EI_MOBILE            0x1018
+#define E1000_DEV_ID_82541ER_LOM               0x1014
+#define E1000_DEV_ID_82541ER                   0x1078
+#define E1000_DEV_ID_82541GI                   0x1076
+#define E1000_DEV_ID_82541GI_LF                0x107C
+#define E1000_DEV_ID_82541GI_MOBILE            0x1077
+#define E1000_DEV_ID_82547EI                   0x1019
+#define E1000_DEV_ID_82547EI_MOBILE            0x101A
+#define E1000_DEV_ID_82547GI                   0x1075
+#define E1000_DEV_ID_82571EB_COPPER            0x105E
+#define E1000_DEV_ID_82571EB_FIBER             0x105F
+#define E1000_DEV_ID_82571EB_SERDES            0x1060
+#define E1000_DEV_ID_82571EB_QUAD_COPPER       0x10A4
+#define E1000_DEV_ID_82571EB_QUAD_FIBER        0x10A5
+#define E1000_DEV_ID_82571EB_QUAD_COPPER_LP    0x10BC
+#define E1000_DEV_ID_82572EI_COPPER            0x107D
+#define E1000_DEV_ID_82572EI_FIBER             0x107E
+#define E1000_DEV_ID_82572EI_SERDES            0x107F
+#define E1000_DEV_ID_82572EI                   0x10B9
+#define E1000_DEV_ID_82573E                    0x108B
+#define E1000_DEV_ID_82573E_IAMT

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-19 Thread YOSHIFUJI Hideaki / 吉藤英明

Hello.

In article <[EMAIL PROTECTED]> (at Fri, 19 Jan 2007 16:23:14 -0500), Neil Horman <[EMAIL PROTECTED]> says:

> Patch to Implement IPv6 RFC 4429 (Optimistic Duplicate Address Detection).  In

Good work.  We will see if this would break core and basic ipv6 code.
Dave, please hold on.

Some quick comments.

> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -110,6 +110,7 @@ struct frag_hdr {
>  /* sysctls */
>  extern int sysctl_ipv6_bindv6only;
>  extern int sysctl_mld_max_msf;
> +extern int sysctl_optimistic_dad;
>  
:
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 2a7e461..f7afb2a 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -206,6 +206,8 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
>   .proxy_ndp  = 0,
>  };
>  
> +int sysctl_optimistic_dad = 1;
> +

Please put this into ipv6_devconf{} and make it a per-interface variable.
And I think the default should be kept off (0).

>  /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */
>  #if 0
>  const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT;
> @@ -830,7 +832,8 @@ retry:
>   ift = !max_addresses ||
> ipv6_count_addresses(idev) < max_addresses ? 
>   ipv6_add_addr(idev, &addr, tmp_plen,
> -   ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, IFA_F_TEMPORARY) : NULL;
> +   ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, 
> + IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL;
>   if (!ift || IS_ERR(ift)) {
>   in6_ifa_put(ifp);
>   in6_dev_put(idev);

Please align ipv6_addr_type and IFA_F_TEMPORARY

> @@ -1174,7 +1177,8 @@ int ipv6_get_saddr(struct dst_entry *dst,
>  }
>  
>  
> -int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr)
> +int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, 
> + unsigned char banned_flags)
>  {
>   struct inet6_dev *idev;
>   int err = -EADDRNOTAVAIL;

Please align "struct net_device" and "unsigned char".

> @@ -1185,7 +1189,7 @@ int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr)
>  
>   read_lock_bh(&idev->lock);
>   for (ifp=idev->addr_list; ifp; ifp=ifp->if_next) {
> - if (ifp->scope == IFA_LINK && !(ifp->flags&IFA_F_TENTATIVE)) {
> + if (ifp->scope == IFA_LINK && !(ifp->flags&banned_flags)) {
>   ipv6_addr_copy(addr, &ifp->addr);
>   err = 0;
>   break;
> @@ -1742,7 +1746,7 @@ ok:

It is not your fault, but please put a space around "&".
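
For readers following along, the banned_flags filtering in the hunk above boils down to the following standalone sketch. The list type and function name (struct addr_entry, first_usable_lladdr()) and the SCOPE_LINK value are made up for illustration; the two flag values are the ones from the if_addr.h hunk of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define IFA_F_OPTIMISTIC 0x04   /* value proposed by the patch */
#define IFA_F_TENTATIVE  0x40
#define SCOPE_LINK       0x20   /* illustrative stand-in for IFA_LINK */

/* Hypothetical, simplified stand-in for struct inet6_ifaddr. */
struct addr_entry {
    unsigned int scope;
    unsigned char flags;
    int id;                     /* stands in for the address itself */
    struct addr_entry *next;
};

/*
 * Return the first link-scope entry that has none of the banned flags set,
 * or NULL if there is none. The caller decides which states are unusable.
 */
static const struct addr_entry *
first_usable_lladdr(const struct addr_entry *list, unsigned char banned_flags)
{
    for (const struct addr_entry *e = list; e; e = e->next) {
        if (e->scope == SCOPE_LINK && !(e->flags & banned_flags))
            return e;
    }
    return NULL;
}
```

A caller that may use optimistic addresses passes IFA_F_TENTATIVE only; a caller that must not (as in the ndisc_send_ns() hunk) passes IFA_F_TENTATIVE|IFA_F_OPTIMISTIC.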

>   if (!max_addresses ||
>   ipv6_count_addresses(in6_dev) < max_addresses)
>   ifp = ipv6_add_addr(in6_dev, &addr, pinfo->prefix_len,
> -   addr_type&IPV6_ADDR_SCOPE_MASK, 0);
> +   addr_type&IPV6_ADDR_SCOPE_MASK,0);
>  
>   if (!ifp || IS_ERR(ifp)) {
>   in6_dev_put(in6_dev);

Please do not kill the space after ",".

> @@ -2123,7 +2132,8 @@ static void addrconf_add_linklocal(struct inet6_dev *idev, struct in6_addr *addr
>  {
>   struct inet6_ifaddr * ifp;
>  
> - ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, IFA_F_PERMANENT);
> + ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, 
> + IFA_F_PERMANENT|IFA_F_OPTIMISTIC);
>   if (!IS_ERR(ifp)) {
>   addrconf_dad_start(ifp, 0);
>   in6_ifa_put(ifp);

Please align idev and IFA_F_PERMANENT.

> @@ -542,7 +556,8 @@ void ndisc_send_ns(struct net_device *dev, struct neighbour *neigh,
>   int send_llinfo;
>  
>   if (saddr == NULL) {
> - if (ipv6_get_lladdr(dev, &addr_buf))
> + if (ipv6_get_lladdr(dev, &addr_buf,
> + (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)))
>   return;
>   saddr = &addr_buf;
>   }

ditto... ("dev" and "(")

> +and optimistic) are false then we can just fail
> +dad now.
> + */
> + type = ipv6_addr_type(saddr);   
> + if (!((ifp->flags & IFA_F_OPTIMISTIC) && 
> + (type & IPV6_ADDR_UNICAST))) {
> + addrconf_dad_failure(ifp); 
> + return;
> + }
>   }
>  
>   idev = ifp->idev;

Hmm? Here, saddr is always unicast, isn't it?

--yoshfuji

Re: e1000: update device ID table for register dumps [Is an ethtool patch]

2007-01-19 Thread Auke Kok


Auke Kok wrote:
> e1000: update device ID table for register dumps with new devices
>
> From: Auke Kok <[EMAIL PROTECTED]>
>
> The register dump routine of e1000 was missing several newer chipsets. I
> reimported the mac detection code from the linux e1000 driver. This fixes
> newer NIC's reporting that their bus type is PCI instead of PCI-e.
>
> Signed-off-by: Auke Kok <[EMAIL PROTECTED]>

It's a patch to ethtool, of course. Apologies for any confusion. I didn't
fix the mail subject.


Auke

Re: Possible ways of dealing with OOM conditions.

2007-01-19 Thread Evgeniy Polyakov

On Fri, Jan 19, 2007 at 01:53:15PM +0100, Peter Zijlstra ([EMAIL PROTECTED]) wrote:
> > 2. You differentiate by hand between critical and non-critical
> > allocations by specifying some kernel users as potentially possible to
> > allocate from reserve. 
> 
> True, all sockets that are needed for swap, no-one else.
> 
> > This does not prevent from NVIDIA module to
> > allocate from that reserve too, does it?
> 
> All users of the NVidiot crap deserve all the pain they get.
> If it breaks they get to keep both pieces.

I meant that pretty much anyone can be such a user: anyone can just add a
bit to their own gfp_flags which are used for allocation.

> > And you artificially limit
> > system to process only tiny bits of what it must do, thus potentially
> > leaking paths which must use reserve too.
> 
> How so? I cover pretty much every allocation needed to process an skb by
> setting PF_MEMALLOC - the only drawback there is that the reserve might
> not actually be large enough because it covers more allocations than
> were considered. (That's one of the TODO items: validate the reserve
> functions' parameters.)

You only covered ipv4/v6 and arp, maybe some route updates.
But it is very possible that some allocations are missed, like
multicast/broadcast. Selecting only special paths out of all the
possible network allocations tends to create a situation where something
is missed or cross-dependent on other paths.

> > So, solution is to have a reserve in advance, and manage it using
> > special path when system is in OOM. So you will have network memory
> > reserve, which will be used when system is in trouble. It is very
> > similar to what you had.
> > 
> > But the whole reserve can never be used at all, so it should be used,
> > but not by those who can create OOM condition, thus it should be
> > exported to, for example, network only, and when system is in trouble,
> > network would be still functional (although only critical pathes).
> 
> But the network can create OOM conditions for itself just fine. 
> 
> Consider the remote storage disappearing for a while (it got rebooted,
> someone tripped over the wire etc..). Now the rest of the network
> traffic keeps coming and will queue up - because user-space is stalled,
> waiting for more memory - and we run out of memory.

Hmm... Neither UDP, nor TCP work that way actually.

> There must be a point where we start dropping packets that are not
> critical to the survival of the machine.

You still can drop them, the main point is that network allocations do
not depend on other allocations.

> > Even further development of such idea is to prevent such OOM condition
> > at all - by starting swapping early (but wisely) and reduce memory
> > usage.
> 
> These just postpone execution but will not avoid it.

No. If the system allows such a condition to occur, then
something is broken. It must be prevented, instead of creating special
hacks to recover from it.

-- 
Evgeniy Polyakov

Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)

2007-01-19 Thread Adam Kropelin


Auke Kok wrote:
> Adam Kropelin wrote:
> > I am experiencing the no-link issue on a 82572EI single port copper
> > PCI-E card. I've only tried 2.6.20-rc5, so I cannot tell if this is a
> > regression or not yet. Will test older kernel soon.
> >
> > Can provide details/logs if you want 'em.
>
> we've already established that Allen's issue is not due to the driver
> and caused by interrupts being mal-assigned on his system, possibly a
> pci subsystem bug. You also have a completely different board
> (82572EI instead of 82571EB), so I'd like to see the usual debugging
> info as well as hearing from you whether 2.6.19.any works correctly.

On 2.6.19 the link status is working (follows cable plug/unplug), but no 
tx or rx packets get thru. Attempts to transmit occasionally result in 
tx timed out errors in dmesg, but I cannot seem to generate these at 
will.

On 2.6.20-rc5, the link status does not work (link is always down), and 
as expected no tx or rx. No tx timed out errors this time, presumably 
because it thinks the link is down. Note that both the switch and the 
LEDs on the NIC indicate a good 1000 Mbps link.

dmesg, 'cat /proc/interrupts', and 'lspci -vvv' attached for 2.6.20-rc5. 
The data from 2.6.19 is essentially the same.

> On top of that I posted a patch to rc5-mm yesterday that fixes a few
> significant bugs in the rc5-mm driver, so please apply that patch too
> before trying, so we're not wasting our time finding old bugs ;)

I haven't been able to test rc5-mm yet because it won't boot on this 
box. Applying git-e1000 directly to -rc4 or -rc5 results in a number of 
rejects that I'm not sure how to fix. Some are obvious, but the others 
I'm unsure of.


--Adam


dmesg-2.6.20-rc5
Description: Binary data


lspci-2.6.20-rc5
Description: Binary data


proc-irq-2.6.20-rc5
Description: Binary data

Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)

2007-01-19 Thread Auke Kok


Adam Kropelin wrote:

Auke Kok wrote:

Adam Kropelin wrote:

I am experiencing the no-link issue on a 82572EI single port copper
PCI-E card. I've only tried 2.6.20-rc5, so I cannot tell if this is a
regression or not yet. Will test older kernel soon.

Can provide details/logs if you want 'em.


we've already established that Allen's issue is not due to the driver
but is caused by interrupts being mis-assigned on his system, possibly a
pci subsystem bug. You also have a completely different board
(82572EI instead of 82571EB), so I'd like to see the usual debugging
info as well as hearing from you whether 2.6.19.any works correctly.


On 2.6.19 the link status is working (follows cable plug/unplug), but no 
tx or rx packets get thru. Attempts to transmit occasionally result in 
tx timed out errors in dmesg, but I cannot seem to generate these at will.


On 2.6.20-rc5, the link status does not work (link is always down), and 
as expected no tx or rx. No tx timed out errors this time, presumably 
because it thinks the link is down. Note that both the switch and the 
LEDs on the NIC  indicate a good 1000 Mbps link.


dmesg, 'cat /proc/interrupts', and 'lspci -vvv' attached for 2.6.20-rc5. 
The data from 2.6.19 is essentially the same.


at least your interrupts look sane. I see you are using MSI, but no interrupts arrive at 
either the OS or the driver.



On top of that I posted a patch to rc5-mm yesterday that fixes a few
significant bugs in the rc5-mm driver, so please apply that patch too
before trying, so we're not wasting our time finding old bugs ;)


I haven't been able to test rc5-mm yet because it won't boot on this 
box. Applying git-e1000 directly to -rc4 or -rc5 results in a number of 
rejects that I'm not sure how to fix. Some are obvious, but the others 
I'm unsure of.


that won't work. You either need to start with 2.6.20-rc5 (and pull the changes pending 
merge in netdev-2.6 from Jeff Garzik), or start with 2.6.20-rc4-mm1 and manually apply 
that patch I sent out on monday. A different combination of either of these two will not 
work, as they are completely different drivers.


can you include `ethtool ethX` output of the link down message and `ethtool -d ethX` as 
well? I'll need to dig up an 82572 and see what's up with that, I've not seen that 
problem before.


More importantly, I suspect that *again* the issue is caused by interrupts not arriving 
or getting lost. Can you try running with MSI disabled in your kernel config?


FYI the NIC raises an interrupt to signal to the driver that link is up. no interrupt 
== no link detected. So that explains the symptom.
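
A rough userspace model of that dependency, for illustration only. It is not taken from the e1000 driver: `struct model_nic`, `CAUSE_LINK_CHANGE` and both functions are hypothetical names, and the single cause bit does not reflect the real register layout. It only shows why the PHY and the LEDs can report a good link while the driver still believes the link is down when interrupts get lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interrupt-cause bit; not the real e1000 register layout. */
#define CAUSE_LINK_CHANGE 0x1u

struct model_nic {
	bool phy_link_up;        /* what the PHY (and the LEDs) report */
	bool irq_delivery_works; /* false models lost MSI interrupts */
	unsigned int pending;    /* latched interrupt causes */
	bool driver_link_up;     /* what the driver believes */
};

/* Interrupt handler: the only place the driver updates its link state. */
static void model_isr(struct model_nic *nic)
{
	unsigned int cause = nic->pending;

	nic->pending = 0;
	if (cause & CAUSE_LINK_CHANGE)
		nic->driver_link_up = nic->phy_link_up;
}

/* Hardware side: the PHY changes state and the NIC raises an interrupt. */
static void model_phy_event(struct model_nic *nic, bool up)
{
	assert(nic != NULL);
	nic->phy_link_up = up;
	nic->pending |= CAUSE_LINK_CHANGE;
	if (nic->irq_delivery_works)
		model_isr(nic); /* the interrupt reaches the driver */
	/* otherwise the cause stays latched and the driver never learns */
}
```

With `irq_delivery_works` set to false, `phy_link_up` becomes true after a cable plug event while `driver_link_up` stays false, which matches the reported symptom.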


Auke

Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)

2007-01-19 Thread Adam Kropelin


Auke Kok wrote:

Adam Kropelin wrote:

I haven't been able to test rc5-mm yet because it won't boot on this
box. Applying git-e1000 directly to -rc4 or -rc5 results in a number
of rejects that I'm not sure how to fix. Some are obvious, but the
others I'm unsure of.


that won't work. You either need to start with 2.6.20-rc5 (and pull
the changes pending merge in netdev-2.6 from Jeff Garzik),


I thought that's what I was doing when I applied git-e1000 to 
2.6.20-rc5, but I guess not.



or start
with 2.6.20-rc4-mm1 and manually apply that patch I sent out on
monday. A different combination of either of these two will not work,
as they are completely different drivers.


I'll try to work something out.


can you include `ethtool ethX` output of the link down message and
`ethtool -d ethX` as well? I'll need to dig up an 82572 and see
what's up with that, I've not seen that problem before.


ethtool output attached.


More importantly, I suspect that *again* the issue is caused by
interrupts not arriving or getting lost.


Smells that way to me, too.


Can you try running with MSI disabled in your kernel config?


That fixes it! The link comes up and tx/rx works well. I get about 300 
Mbps using default iperf settings with a nearby windows box.



FYI the NIC raises an interrupt to signal to the driver that link
is up. no interrupt == no link detected. So that explains the symptom.


Yep, makes sense. I've worked with a number of PHYs like that.

--Adam


ethtool-eth1
Description: Binary data


ethtool-d-eth1
Description: Binary data

Re: intel 82571EB gigabit fails to see link on 2.6.20-rc5 in-tree e1000 driver (regression)

2007-01-19 Thread Auke Kok


Adam Kropelin wrote:

Auke Kok wrote:

Adam Kropelin wrote:

I haven't been able to test rc5-mm yet because it won't boot on this
box. Applying git-e1000 directly to -rc4 or -rc5 results in a number
of rejects that I'm not sure how to fix. Some are obvious, but the
others I'm unsure of.


that won't work. You either need to start with 2.6.20-rc5 (and pull
the changes pending merge in netdev-2.6 from Jeff Garzik),


I thought that's what I was doing when I applied git-e1000 to 
2.6.20-rc5, but I guess not.



or start
with 2.6.20-rc4-mm1 and manually apply that patch I sent out on
monday. A different combination of either of these two will not work,
as they are completely different drivers.


I'll try to work something out.


can you include `ethtool ethX` output of the link down message and
`ethtool -d ethX` as well? I'll need to dig up an 82572 and see
what's up with that, I've not seen that problem before.


ethtool output attached.


that clearly shows that the PHY detected link up status and that all is well as far as 
the driver and NIC are concerned. This bug really needs to be moved to linux-pci where 
the folks who know interrupt handling best can handle it.


Auke

Re: [PATCH] IPv6: Implement RFC 4429 Optimistic Duplicate Address Detection

2007-01-19 Thread Neil Horman

On Sat, Jan 20, 2007 at 08:05:07AM +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> Hello.
> 
> In article <[EMAIL PROTECTED]> (at Fri, 19 Jan 2007 16:23:14 -0500), Neil 
> Horman <[EMAIL PROTECTED]> says:
> 
> > Patch to Implement IPv6 RFC 4429 (Optimistic Duplicate Address Detection).  
> > In
> 
> Good work.  We will see if this would break core and basic ipv6 code.
> Dave, please hold on.
> 
Thank you.  I'll implement your requested changes and repost Monday afternoon.
Regards
Neil

> Some quick comments.
> 
> > --- a/include/net/ipv6.h
> > +++ b/include/net/ipv6.h
> > @@ -110,6 +110,7 @@ struct frag_hdr {
> >  /* sysctls */
> >  extern int sysctl_ipv6_bindv6only;
> >  extern int sysctl_mld_max_msf;
> > +extern int sysctl_optimistic_dad;
> >  
> :
> > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> > index 2a7e461..f7afb2a 100644
> > --- a/net/ipv6/addrconf.c
> > +++ b/net/ipv6/addrconf.c
> > @@ -206,6 +206,8 @@ static struct ipv6_devconf ipv6_devconf_dflt 
> > __read_mostly = {
> > .proxy_ndp  = 0,
> >  };
> >  
> > +int sysctl_optimistic_dad = 1;
> > +
> 
> Please put this into ipv6_devconf{} and make it per-interface variable.
> And I think default should be kept off (0).
> 
> >  /* IPv6 Wildcard Address and Loopback Address defined by RFC2553 */
> >  #if 0
> >  const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT;
> > @@ -830,7 +832,8 @@ retry:
> > ift = !max_addresses ||
> >   ipv6_count_addresses(idev) < max_addresses ? 
> > ipv6_add_addr(idev, &addr, tmp_plen,
> > - ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, 
> > IFA_F_TEMPORARY) : NULL;
> > + ipv6_addr_type(&addr)&IPV6_ADDR_SCOPE_MASK, 
> > +   IFA_F_TEMPORARY|IFA_F_OPTIMISTIC) : NULL;
> > if (!ift || IS_ERR(ift)) {
> > in6_ifa_put(ifp);
> > in6_dev_put(idev);
> 
> Please align ipv6_addr_type and IFA_F_TEMPORARY
> 
> > @@ -1174,7 +1177,8 @@ int ipv6_get_saddr(struct dst_entry *dst,
> >  }
> >  
> >  
> > -int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr)
> > +int ipv6_get_lladdr(struct net_device *dev, struct in6_addr *addr, 
> > +   unsigned char banned_flags)
> >  {
> > struct inet6_dev *idev;
> > int err = -EADDRNOTAVAIL;
> 
> Please align "struct net_device" and "unsigned char".
> 
> > @@ -1185,7 +1189,7 @@ int ipv6_get_lladdr(struct net_device *dev, struct 
> > in6_addr *addr)
> >  
> > read_lock_bh(&idev->lock);
> > for (ifp=idev->addr_list; ifp; ifp=ifp->if_next) {
> > -   if (ifp->scope == IFA_LINK && 
> > !(ifp->flags&IFA_F_TENTATIVE)) {
> > +   if (ifp->scope == IFA_LINK && 
> > !(ifp->flags&banned_flags)) {
> > ipv6_addr_copy(addr, &ifp->addr);
> > err = 0;
> > break;
> > @@ -1742,7 +1746,7 @@ ok:
> 
> It is not your fault, but please put a space around "&".
> 
> > if (!max_addresses ||
> > ipv6_count_addresses(in6_dev) < max_addresses)
> > ifp = ipv6_add_addr(in6_dev, &addr, 
> > pinfo->prefix_len,
> > -   
> > addr_type&IPV6_ADDR_SCOPE_MASK, 0);
> > +   
> > addr_type&IPV6_ADDR_SCOPE_MASK,0);
> >  
> > if (!ifp || IS_ERR(ifp)) {
> > in6_dev_put(in6_dev);
> 
> Please do not kill the space after ",".
> 
> > @@ -2123,7 +2132,8 @@ static void addrconf_add_linklocal(struct inet6_dev 
> > *idev, struct in6_addr *addr
> >  {
> > struct inet6_ifaddr * ifp;
> >  
> > -   ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, IFA_F_PERMANENT);
> > +   ifp = ipv6_add_addr(idev, addr, 64, IFA_LINK, 
> > +   IFA_F_PERMANENT|IFA_F_OPTIMISTIC);
> > if (!IS_ERR(ifp)) {
> > addrconf_dad_start(ifp, 0);
> > in6_ifa_put(ifp);
> 
> Please align idev and IFA_F_PERMANENT.
> 
> > @@ -542,7 +556,8 @@ void ndisc_send_ns(struct net_device *dev, struct 
> > neighbour *neigh,
> > int send_llinfo;
> >  
> > if (saddr == NULL) {
> > -   if (ipv6_get_lladdr(dev, &addr_buf))
> > +   if (ipv6_get_lladdr(dev, &addr_buf,
> > +   (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)))
> > return;
> > saddr = &addr_buf;
> > }
> 
> ditto... ("dev" and "(")
> 
> > +  and optimistic) are false then we can just fail
> > +  dad now.
> > +   */
> > +   type = ipv6_addr_type(saddr);   
> > +   if (!((ifp->flags & IFA_F_OPTIMISTIC) && 
> > +   (type & IPV6_ADDR_UNICAST))) {
> > +   addrconf_dad_failure(ifp); 
> > +   return;
> > +   }
> > }
> >  
> > idev = ifp->idev;
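
The `banned_flags` argument added to ipv6_get_lladdr() above can be pictured with a small standalone model. This is only a sketch: `struct model_addr`, `model_get_lladdr()` and the flag values are made up for illustration, and how the real patch combines the tentative and optimistic flags on a given address is not visible in the quoted hunks.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values for this sketch only. */
#define F_OPTIMISTIC 0x04u
#define F_TENTATIVE  0x40u

struct model_addr {
	const char *name;
	int link_local;      /* non-zero for a link-local address */
	unsigned char flags;
};

/*
 * Return the first link-local address whose flags do not intersect
 * banned_flags, or NULL if none qualifies.
 */
static const struct model_addr *
model_get_lladdr(const struct model_addr *list, size_t n,
		 unsigned char banned_flags)
{
	size_t i;

	assert(list != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (list[i].link_local && !(list[i].flags & banned_flags))
			return &list[i];
	}
	return NULL;
}
```

A caller that may use an optimistic address bans only the tentative flag; a caller that must not (as ndisc_send_ns() does in the hunk above) bans both.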

[PATCH 1/4] bonding: fix device name allocation error

2007-01-19 Thread Jay Vosburgh


The code to select names for the bonding interfaces was, for the
non-sysfs creation case, always using a hard-coded set of bond0, bond1,
etc, up to max_bonds.  This caused conflicts for the second or
subsequent loads of the module.

Changed the code to obtain device names from dev_alloc_name().

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>


diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6482aed..07b9d1f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4704,6 +4704,7 @@ static int bond_check_params(struct bond
 static struct lock_class_key bonding_netdev_xmit_lock_key;
 
 /* Create a new bond based on the specified name and bonding parameters.
+ * If name is NULL, obtain a suitable "bond%d" name for us.
  * Caller must NOT hold rtnl_lock; we need to release it here before we
  * set up our sysfs entries.
  */
@@ -4713,7 +4714,8 @@ int bond_create(char *name, struct bond_
int res;
 
rtnl_lock();
-   bond_dev = alloc_netdev(sizeof(struct bonding), name, ether_setup);
+   bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : "",
+   ether_setup);
if (!bond_dev) {
printk(KERN_ERR DRV_NAME
   ": %s: eek! can't alloc netdev!\n",
@@ -4722,6 +4724,12 @@ int bond_create(char *name, struct bond_
goto out_rtnl;
}
 
+   if (!name) {
+   res = dev_alloc_name(bond_dev, "bond%d");
+   if (res < 0)
+   goto out_netdev;
+   }
+
/* bond_init() must be called after dev_alloc_name() (for the
 * /proc files), but before register_netdevice(), because we
 * need to set function pointers.
@@ -4763,7 +4771,6 @@ static int __init bonding_init(void)
 {
int i;
int res;
-   char new_bond_name[8];  /* Enough room for 999 bonds at init. */
 
printk(KERN_INFO "%s", version);
 
@@ -4776,8 +4783,7 @@ #ifdef CONFIG_PROC_FS
bond_create_proc_dir();
 #endif
for (i = 0; i < max_bonds; i++) {
-   sprintf(new_bond_name, "bond%d",i);
-   res = bond_create(new_bond_name,&bonding_defaults, NULL);
+   res = bond_create(NULL, &bonding_defaults, NULL);
if (res)
goto err;
}
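
For readers unfamiliar with pattern-based naming, here is a userspace sketch of the behaviour the patch relies on: pick the first "bond%d" name that is not already taken, instead of assuming bond0 is free. It is a model under stated assumptions, not the kernel's dev_alloc_name() implementation; `model_name_in_use()`, `model_alloc_bond_name()` and `MODEL_NAME_LEN` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_NAME_LEN 16

/* Return 1 if name is already present in the table of existing names. */
static int model_name_in_use(const char *const *existing, size_t n,
			     const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(existing[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Fill buf with the first "bond<N>" name not in existing[].
 * Returns the chosen index, or -1 if none below max_index is free.
 */
static int model_alloc_bond_name(const char *const *existing, size_t n,
				 int max_index, char *buf)
{
	int i;

	assert(buf != NULL);
	for (i = 0; i < max_index; i++) {
		snprintf(buf, MODEL_NAME_LEN, "bond%d", i);
		if (!model_name_in_use(existing, n, buf))
			return i;
	}
	buf[0] = '\0';
	return -1;
}
```

With bond0 and bond1 already present (say, from a first load of the module), the model hands out bond2, which is the conflict-free result the second load needs.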

[PATCH 2/4]: bonding: fix error check in sysfs creation

2007-01-19 Thread Jay Vosburgh


The existing code did not correctly handle failures to create
the per-interface sysfs group for bonding.

Modified code to notice errors, and correctly unwind.

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 07b9d1f..d3801a0 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4756,14 +4756,19 @@ int bond_create(char *name, struct bond_
 
rtnl_unlock(); /* allows sysfs registration of net device */
res = bond_create_sysfs_entry(bond_dev->priv);
-   goto done;
+   if (res < 0) {
+   rtnl_lock();
+   goto out_bond;
+   }
+
+   return 0;
+
 out_bond:
bond_deinit(bond_dev);
 out_netdev:
free_netdev(bond_dev);
 out_rtnl:
rtnl_unlock();
-done:
return res;
 }
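
The unwind structure used above (labels in reverse order of setup, with a late failure jumping to the label that undoes everything completed so far) can be sketched outside the kernel as follows. All names here are hypothetical, and the three steps merely stand in for allocation, bond_init() and the registration/sysfs step.

```c
#include <assert.h>
#include <stdlib.h>

struct model_dev {
	int initialised;
	int registered;
};

/* Counters so a caller can observe which cleanup steps ran. */
static int model_frees;
static int model_deinits;

/*
 * Create a device in three steps. fail_step selects which step fails
 * (0 = none). Each label undoes the steps completed before the failure,
 * in reverse order.
 */
static int model_create(int fail_step, struct model_dev **out)
{
	struct model_dev *dev;
	int res = -1;

	assert(out != NULL);
	*out = NULL;

	dev = calloc(1, sizeof(*dev));
	if (!dev || fail_step == 1)
		goto out_free;

	dev->initialised = 1;
	if (fail_step == 2)
		goto out_deinit;

	dev->registered = 1;
	if (fail_step == 3)	/* e.g. a late step such as sysfs setup */
		goto out_unregister;

	*out = dev;
	return 0;

out_unregister:
	dev->registered = 0;
out_deinit:
	dev->initialised = 0;
	model_deinits++;
out_free:
	free(dev);
	model_frees++;
	return res;
}
```

The point of the fix is the same as in the sketch: the success path returns before the labels, and a failure in the last step no longer falls through as if it had succeeded.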
 

[PATCH 4/4]: bonding: update version

2007-01-19 Thread Jay Vosburgh


Update version number to reflect recent changes.

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>


diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index dc434fb..6123b90 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -22,8 +22,8 @@ #include 
 #include "bond_3ad.h"
 #include "bond_alb.h"
 
-#define DRV_VERSION	"3.1.1"
-#define DRV_RELDATE	"September 26, 2006"
+#define DRV_VERSION	"3.1.2"
+#define DRV_RELDATE	"January 20, 2007"
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
 

[PATCH 0/4] bonding: fix module multiple load issues

2007-01-19 Thread Jay Vosburgh


Patch 1: fix device name allocation error
Patch 2: fix error check in sysfs creation
Patch 3: modify sysfs support to permit multiple loads
Patch 4: update version number

This patch series should resolve whatever problems there are
with the logic to load the module multiple times.  This code changed
during the introduction of sysfs, and some recent tightening of the
sysfs creation code (checking for duplicates) broke that.

The multiple load logic is used primarily by the initscripts and
sysconfig packages, to automatically configure multiple bonding
interfaces at boot time.

Originally reported by Patrick McHardy <[EMAIL PROTECTED]>.

Patches generated against netdev-2.6 (hope that's ok).

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

[PATCH 3/4] bonding: modify sysfs support to permit multiple loads

2007-01-19 Thread Jay Vosburgh


The existing code would blindly attempt to create the
bonding_masters file (in /sys/class/net) every time the module was
loaded.  When the module is loaded multiple times (which is the
historical method used by initscripts and sysconfig to create multiple
bonding interfaces), this caused load failure of the second module load
attempt, as the creation request would fail.

This changes the code to note the failure, arrange to not remove
the bonding_masters file upon module exit, and then return success.

Bonding interfaces created by the second or subsequent loads of
the module will not exist in bonding_masters.  This is not a significant
change, as previously only the interfaces from the most recent load of
the module would be listed.  Both situations are less than optimal, but
this case permits compatibility with existing distro configuration
scripts, and is consistent.

Note that previously, the sysfs create request would overwrite
the existing bonding_masters file and succeed, allowing multiple loads of
the module.  The sysfs code has recently changed to return an error if
the file being created already exists.

Patrick McHardy <[EMAIL PROTECTED]>, who reported this problem,
observed crashes on the old kernel (before sysfs checked for
duplicates).  I did not experience such crashes, but this change should
resolve them.

Signed-off-by: Jay Vosburgh <[EMAIL PROTECTED]>



diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index ced9ed8..8e317e1 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1372,6 +1372,21 @@ int bond_create_sysfs(void)
return -ENODEV;
 
ret = class_create_file(netdev_class, &class_attr_bonding_masters);
+   /*
+* Permit multiple loads of the module by ignoring failures to
+* create the bonding_masters sysfs file.  Bonding devices
+* created by second or subsequent loads of the module will
+* not be listed in, or controllable by, bonding_masters, but
+* will have the usual "bonding" sysfs directory.
+*
+* This is done to preserve backwards compatibility for
+* initscripts/sysconfig, which load bonding multiple times to
+* configure multiple bonding devices.
+*/
+   if (ret == -EEXIST) {
+   netdev_class = NULL;
+   return 0;
+   }
 
return ret;
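
The "tolerate -EEXIST and remember not to clean up" idea can be shown with a small in-memory model. This is a sketch only: `model_create_file()` stands in for a create call that rejects duplicates, and the `owns_file` flag is a hypothetical replacement for the patch's trick of clearing netdev_class.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MODEL_MAX_FILES 8

static const char *model_files[MODEL_MAX_FILES];
static int model_nfiles;

/* Stand-in for a create call that refuses duplicates with -EEXIST. */
static int model_create_file(const char *name)
{
	int i;

	for (i = 0; i < model_nfiles; i++)
		if (strcmp(model_files[i], name) == 0)
			return -EEXIST;
	if (model_nfiles == MODEL_MAX_FILES)
		return -ENOSPC;
	model_files[model_nfiles++] = name;
	return 0;
}

/*
 * Module-load model: create the shared file, but treat "already there"
 * as success. *owns_file records whether this load created the file and
 * must therefore remove it on unload.
 */
static int model_module_load(int *owns_file)
{
	int ret;

	assert(owns_file != NULL);
	ret = model_create_file("bonding_masters");
	if (ret == -EEXIST) {
		*owns_file = 0;
		return 0;
	}
	*owns_file = (ret == 0);
	return ret;
}
```

Only the first load owns the file; later loads succeed without it, which mirrors the trade-off described in the changelog.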
 

Re: [PATCH REPOST 1/2] NET: Accurate packet scheduling for ATM/ADSL (kernel)

2007-01-19 Thread Russell Stuart

On Fri, 2007-01-19 at 13:19 +0100, Patrick McHardy wrote:
> Russell Stuart wrote:
> > I thought that some degree of compatibility was 
> > expected.  At the very least the newest version 
> > of "tc" must work on _any_ kernel at least as 
> > well as the version it replaces did.
> > 
> > I also thought newer kernels should work with older
> > versions of iproute2, albeit without the features
> > added in the newer versions.
> > 
> > Are you saying this is not so?
> 
> No, that's exactly what I'm saying.

I don't understand - too many negates here 
without parens.  Are you saying:

a.  Backward / Forward compatibility between the kernel
and its user space tools isn't an issue, or

b.  There is no compatibility problem.


Re: [patch 08/12] net namespace : find namespace by addr

2007-01-19 Thread Herbert Poetzl

On Fri, Jan 19, 2007 at 04:47:22PM +0100, [EMAIL PROTECTED] wrote:
> From: Daniel Lezcano <[EMAIL PROTECTED]>
> 
> Switch to the l3 namespace using the destination address.
> 
> Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
> 
> ---
>  include/linux/net_namespace.h |7 +++
>  net/core/net_namespace.c  |   35 +++
>  net/ipv4/ip_input.c   |   16 +++-
>  3 files changed, 57 insertions(+), 1 deletion(-)
> 
> Index: 2.6.20-rc4-mm1/net/ipv4/ip_input.c
> ===
> --- 2.6.20-rc4-mm1.orig/net/ipv4/ip_input.c
> +++ 2.6.20-rc4-mm1/net/ipv4/ip_input.c
> @@ -374,6 +374,9 @@
>  {
>   struct iphdr *iph;
>   u32 len;
> + int err;
> + struct net_namespace *net_ns = current_net_ns;
> + struct net_namespace *dst_net_ns = NULL;
>  
>   /* When the interface is in promisc. mode, drop all the crap
>* that it receives, do not try to analyse it.
> @@ -393,6 +396,9 @@
>  
>   iph = skb->nh.iph;
>  
> + dst_net_ns = net_ns_find_from_dest_addr(iph->daddr);
> + if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
> + push_net_ns(dst_net_ns);
>   /*
>*  RFC1122: 3.1.2.2 MUST silently discard any IP frame that fails 
> the checksum.
>*
> @@ -431,10 +437,18 @@
>   /* Remove any debris in the socket control block */
>   memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
>  
> - return NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
> + err =  NF_HOOK(PF_INET, NF_IP_PRE_ROUTING, skb, dev, NULL,
>  ip_rcv_finish);
>  
> + if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
> + pop_net_ns(net_ns);
> +
> + return err;
> +
>  inhdr_error:
> + if (dst_net_ns && !net_ns_match(net_ns, dst_net_ns))
> + pop_net_ns(net_ns);
> +
>   IP_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
>  drop:
>  kfree_skb(skb);
> Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
> ===
> --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
> +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
> @@ -99,6 +99,8 @@
>  extern __be32 net_ns_select_source_address(const struct net_device *dev,
>  u32 dst, int scope);
>  
> +extern struct net_namespace *net_ns_find_from_dest_addr(u32 daddr);
> +
>  #define SELECT_SRC_ADDR net_ns_select_source_address
>  
>  #else /* CONFIG_NET_NS */
> @@ -167,6 +169,11 @@
>   return 0;
>  }
>  
> +static inline struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
> +{
> + return NULL;
> +}
> +
>  #define SELECT_SRC_ADDR inet_select_addr
>  
>  #endif /* !CONFIG_NET_NS */
> Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
> ===
> --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
> +++ 2.6.20-rc4-mm1/net/core/net_namespace.c
> @@ -385,4 +385,39 @@
>  out:
>   return addr;
>  }
> +
> +/*
> + * This function finds the network namespace destination deduced from
> + * the destination address. The network namespace is retrieved from
> + * the ifaddr owned by a network namespace

this basically disallows to 'share' IPs between
namespaces, as it is permitted in Linux-VServer
right now, or am I misinterpreting this?

TIA,
Herbert

> + * @daddr  : destination
> + * Returns : the network namespace destination or NULL if not found
> + */
> +struct net_namespace *net_ns_find_from_dest_addr(u32 daddr)
> +{
> + struct net_namespace *net_ns = NULL;
> + struct net_device *dev;
> + struct in_device *in_dev;
> +
> + if (LOOPBACK(daddr))
> + return current_net_ns;
> +
> + read_lock(&dev_base_lock);
> + rcu_read_lock();
> + for (dev = dev_base; dev; dev = dev->next) {
> + if ((in_dev = __in_dev_get_rcu(dev)) == NULL)
> + continue;
> + for_ifa(in_dev) {
> + if (ifa->ifa_local == daddr) {
> + net_ns = ifa->ifa_net_ns;
> + goto out_unlock_both;
> + }
> + } endfor_ifa(in_dev);
> + }
> +out_unlock_both:
> + read_unlock(&dev_base_lock);
> + rcu_read_unlock();
> +
> + return net_ns;
> +}
>  #endif /* CONFIG_NET_NS */
> 
> -- 
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.osdl.org/mailman/listinfo/containers
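
To make the question about shared IPs concrete, here is a standalone model of a lookup keyed on destination address alone. It is a sketch, not the quoted kernel code: `struct model_ns`, `struct model_ifaddr` and `model_find_ns()` are hypothetical, and addresses are plain host-order integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_ns {
	const char *name;
};

struct model_ifaddr {
	uint32_t local;           /* IPv4 address, host order in this sketch */
	struct model_ns *owner;   /* exactly one owning namespace */
};

/*
 * Return the namespace owning daddr, or NULL if no address matches.
 * The first match wins, so an address listed twice can never reach
 * the second owner.
 */
static struct model_ns *
model_find_ns(const struct model_ifaddr *ifas, size_t n, uint32_t daddr)
{
	size_t i;

	assert(ifas != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (ifas[i].local == daddr)
			return ifas[i].owner;
	return NULL;
}
```

Because the lookup returns a single owner per address, two namespaces that both list the same address cannot both receive traffic for it in this model, which is the limitation the question points at.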

Re: [patch 12/12] net namespace : Add broadcasting

2007-01-19 Thread Herbert Poetzl

On Fri, Jan 19, 2007 at 04:47:26PM +0100, [EMAIL PROTECTED] wrote:
> From: Daniel Lezcano <[EMAIL PROTECTED]>
> 
> Broadcast packets should be delivered to l2 and all l3 children

hmm, really? shouldn't it only reach those which
actually have related addresses assigned?

best,
Herbert

> Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
> 
> ---
>  include/linux/net_namespace.h |   11 +++
>  net/core/net_namespace.c  |   27 +++
>  net/ipv4/udp.c|3 ++-
>  3 files changed, 40 insertions(+), 1 deletion(-)
> 
> Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
> ===
> --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
> +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
> @@ -9,6 +9,7 @@
>  
>  struct in_ifaddr;
>  struct sk_buff;
> +struct sock;
>  
>  struct net_namespace {
>   struct kref kref;
> @@ -109,6 +110,9 @@
>  
>  extern void net_ns_tag_sk_buff(struct sk_buff *skb);
>  
> +extern int net_ns_sock_is_visible(const struct sock *sk,
> +   const struct net_namespace *net_ns);
> +
>  #define SELECT_SRC_ADDR net_ns_select_source_address
>  
>  #else /* CONFIG_NET_NS */
> @@ -192,6 +196,13 @@
>  {
>   ;
>  }
> +
> +static inline int net_ns_sock_is_visible(const struct sock *sk,
> +  const struct net_namespace *net_ns)
> +{
> + return 1;
> +}
> +
>  #define SELECT_SRC_ADDR inet_select_addr
>  
>  #endif /* !CONFIG_NET_NS */
> Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
> ===
> --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
> +++ 2.6.20-rc4-mm1/net/core/net_namespace.c
> @@ -17,6 +17,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  struct net_namespace init_net_ns = {
>   .kref = {
> @@ -464,4 +465,30 @@
>   struct net_namespace *net_ns = current_net_ns;
>   skb->net_ns = net_ns;
>  }
> +
> +/*
> + * This function checks if the socket is visible from the specified
> + * namespace. This is needed to ensure the broadcast and the multicast
> + * for multiple network namespace l2 and l3 to have the packets to be
> + * delivered. If we have a l3 namespace and its parent (l2 namespace)
> + * listening on a broadcast address, we should deliver the packet to
> + * both. That is done by the udp_v4_mcast_next function. But we should
> + * find a common point between sockets which are relatives to a
> + * namespace.  The common point is they have the same parent in case
> + * of l3 network namespace.
> + * @sk : the socket to be checked
> + * @net_ns : the receiving network namespace
> + * Returns: 1 if the socket is visible by the namespace, 0 otherwise.
> + */
> +int net_ns_sock_is_visible(const struct sock *sk,
> +const struct net_namespace *net_ns)
> +{
> + if (net_ns->level == NET_NS_LEVEL3)
> + net_ns = net_ns->parent;
> +
> + if (sk->sk_net_ns->level == NET_NS_LEVEL3)
> + return sk->sk_net_ns->parent == net_ns;
> + else
> + return sk->sk_net_ns == net_ns;
> +}
>  #endif /* CONFIG_NET_NS */
> Index: 2.6.20-rc4-mm1/net/ipv4/udp.c
> ===
> --- 2.6.20-rc4-mm1.orig/net/ipv4/udp.c
> +++ 2.6.20-rc4-mm1/net/ipv4/udp.c
> @@ -309,9 +309,10 @@
>   (inet->dport != rmt_port && inet->dport)||
>   (inet->rcv_saddr && inet->rcv_saddr != loc_addr)||
>   ipv6_only_sock(s)   ||
> - !net_ns_match(sk->sk_net_ns, ns)||
>   (s->sk_bound_dev_if && s->sk_bound_dev_if != dif))
>   continue;
> + if (!net_ns_sock_is_visible(sk, ns))
> + continue;
>   if (!ip_mc_sf_allow(s, loc_addr, rmt_addr, dif))
>   continue;
>   goto found;
> 
> -- 
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.osdl.org/mailman/listinfo/containers
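
The visibility rule in the quoted net_ns_sock_is_visible() reduces to "both sides resolve to the same L2 namespace". A standalone restatement, assuming only the two levels shown in the patch (the `model_*` names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

enum model_level { MODEL_L2, MODEL_L3 };

struct model_netns {
	enum model_level level;
	const struct model_netns *parent; /* NULL for an L2 namespace here */
};

/* Map an L3 namespace to its L2 parent; an L2 namespace maps to itself. */
static const struct model_netns *model_l2_of(const struct model_netns *ns)
{
	assert(ns != NULL);
	return ns->level == MODEL_L3 ? ns->parent : ns;
}

/* A socket is visible when both sides resolve to the same L2 namespace. */
static int model_sock_is_visible(const struct model_netns *sock_ns,
				 const struct model_netns *rx_ns)
{
	return model_l2_of(sock_ns) == model_l2_of(rx_ns);
}
```

Note that in this model two sibling L3 namespaces see each other's broadcast sockets regardless of which addresses they hold, which is the behaviour questioned above.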

Re: [patch 00/12] net namespace : L3 namespace - introduction

2007-01-19 Thread Herbert Poetzl

On Fri, Jan 19, 2007 at 04:47:14PM +0100, [EMAIL PROTECTED] wrote:
> This patchset provides network isolation similar to what
> Linux-Vserver provides. It is based on the L2 namespaces and relies on
> the mechanisms provided by the namespace. The L3 namespace does not
> aim to bring full virtualization for the network; it provides IP
> isolation which can be reused for Linux-Vserver, jailed applications or
> application containers.
> 
> L3 namespaces are always children of an L2 namespace and cannot create
> more network namespaces; furthermore, they lose their NET_ADMIN
> capability. They share their parent's network resources. From the
> parent namespace, IP addresses are created and assigned to the
> different L3 children. From this point, L3 namespaces can use their
> assigned IP address and all computed broadcast addresses.
  ~~~

okay, I conclude that this only handles a single address
for now. what are your plans to handle entire sets?

TIA,
Herbert

> Because the L3 namespace relies on the L2 virtualization mechanisms,
> it is possible to have several L3 namespaces listening on
> INADDR_ANY:port without conflict, which allows running several servers
> without modifying the network configuration.
> 
> The loopback is a shared device between all L3 namespaces. To ensure
> the 127.0.0.1 address isolation, the sender stores its namespace in
> the packet, so when the packet arrives, the destination namespace is
> already set, because "source" == "destination". This way, it is
> easy to disable the loopback isolation and let an application talk
> with applications outside of the namespace via 127.0.0.1 because we
> consider them trusted (like portmap).
> 
> The ifconfig / ip commands will only show IP addresses assigned to the
> L3 namespace. When a L3 namespace dies, the assigned IP address is
> released to its parent.
> 
> At the IP level, when a packet arrives, the L3 network namespace
> destination is retrieved from the destination address.
> 
> At the bind time, the address is checked against the assigned IP
> address.
> 
> -- 
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.osdl.org/mailman/listinfo/containers

Re: [patch 05/12] net namespace : ioctl to push ifa to net namespace l3

2007-01-19 Thread Herbert Poetzl

On Fri, Jan 19, 2007 at 04:47:19PM +0100, [EMAIL PROTECTED] wrote:
> From: Daniel Lezcano <[EMAIL PROTECTED]>
> 
> New ioctl to "push" ifaddr to a container. Actually, the push is done
> from the current namespace, so the right word is "pull". That will be
> changed to move ifaddr from l2 network namespace to l3.
> 
> Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
> 
> ---
>  include/linux/net_namespace.h |7 ++
>  include/linux/sockios.h   |4 +
>  net/core/net_namespace.c  |  118 
> +-
>  net/ipv4/af_inet.c|4 +
>  4 files changed, 132 insertions(+), 1 deletion(-)
> 
> Index: 2.6.20-rc4-mm1/include/linux/sockios.h
> ===
> --- 2.6.20-rc4-mm1.orig/include/linux/sockios.h
> +++ 2.6.20-rc4-mm1/include/linux/sockios.h
> @@ -122,6 +122,10 @@
>  #define SIOCBRADDIF  0x89a2  /* add interface to bridge  */
>  #define SIOCBRDELIF  0x89a3  /* remove interface from bridge */
>  
> +/* Container calls */
> +#define SIOCNETNSPUSHIF  0x89b0 /* add ifaddr to namespace  */
> +#define SIOCNETNSPULLIF  0x89b1 /* remove ifaddr to namespace   */
   ~~~ from
> +
>  /* Device private ioctl calls */
>  
>  /*
> Index: 2.6.20-rc4-mm1/net/ipv4/af_inet.c
> ===
> --- 2.6.20-rc4-mm1.orig/net/ipv4/af_inet.c
> +++ 2.6.20-rc4-mm1/net/ipv4/af_inet.c
> @@ -789,6 +789,10 @@
>   case SIOCSIFFLAGS:
>   err = devinet_ioctl(cmd, (void __user *)arg);
>   break;
> + case SIOCNETNSPUSHIF:
> + case SIOCNETNSPULLIF:
> + err = net_ns_ioctl(cmd, (void __user *)arg);
> + break;
>   default:
>   if (sk->sk_prot->ioctl)
>   err = sk->sk_prot->ioctl(sk, cmd, arg);
> Index: 2.6.20-rc4-mm1/include/linux/net_namespace.h
> ===
> --- 2.6.20-rc4-mm1.orig/include/linux/net_namespace.h
> +++ 2.6.20-rc4-mm1/include/linux/net_namespace.h
> @@ -91,6 +91,8 @@
>  
>  #define net_ns_hash(ns)  ((ns)->hash)
>  
> +extern int net_ns_ioctl(unsigned int cmd, void __user *arg);
> +
>  #else /* CONFIG_NET_NS */
>  
>  #define INIT_NET_NS(net_ns)
> @@ -141,6 +143,11 @@
>  
>  #define net_ns_hash(ns)  (0)
>  
> +static inline int net_ns_ioctl(unsigned int cmd, void __user *arg)
> +{
> + return -ENOSYS;
> +}
> +
>  #endif /* !CONFIG_NET_NS */
>  
>  #endif /* _LINUX_NET_NAMESPACE_H */
> Index: 2.6.20-rc4-mm1/net/core/net_namespace.c
> ===
> --- 2.6.20-rc4-mm1.orig/net/core/net_namespace.c
> +++ 2.6.20-rc4-mm1/net/core/net_namespace.c
> @@ -10,7 +10,9 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> +#include 
>  #include 
>  
>  struct net_namespace init_net_ns = {
> @@ -123,6 +125,33 @@
>   return err;
>  }
>  
> +/*
> + * The function will move the ifaddr to the l2 network namespace
> + * parent.
> + * @net_ns: the related network namespace
> + */
> +static void release_ifa_to_parent(const struct net_namespace* net_ns)
> +{
> + struct net_device *dev;
> + struct in_device *in_dev;
> +
> + read_lock(&dev_base_lock);
> + rcu_read_lock();
> + for (dev = dev_base; dev; dev = dev->next) {
> + in_dev = __in_dev_get_rcu(dev);
> + if (!in_dev)
> + continue;
> +
> + for_ifa(in_dev) {
> + if (ifa->ifa_net_ns != net_ns)
> + continue;
> + ifa->ifa_net_ns = net_ns->parent;
> + } endfor_ifa(in_dev);
> + }
> + read_unlock(&dev_base_lock);
> + rcu_read_unlock();
> +}
> +
>  void free_net_ns(struct kref *kref)
>  {
>   struct net_namespace *ns;
> @@ -139,12 +168,99 @@
>   }
>   }
>  
> - if (ns->level == NET_NS_LEVEL3)
> + if (ns->level == NET_NS_LEVEL3) {
> + release_ifa_to_parent(ns);
>   put_net_ns(ns->parent);
> + }
>  
>   printk(KERN_DEBUG "NET_NS: net namespace %p destroyed\n", ns);
>   kfree(ns);
>  }
>  EXPORT_SYMBOL_GPL(free_net_ns);
>  
> +/*
> + * This function allows to assign an IP address from a l2 network
> + * namespace to one of his l3 child or to release from an l3 network
> + * namespace to his l2 network namespace parent.

hmm, sounds like the address is moved between the
namespaces? does that mean that the 'parent' will
not see the 'isolated' ip anymore?

TIA,
Herbert
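
For reference, the ownership change the quoted release_ifa_to_parent() hunk performs can be modelled in userspace as below. This is only a sketch under assumptions: the structs are stripped-down stand-ins (the real kernel structures carry many more fields), the flat address list replaces the per-device walk, and the kernel version's locking and void return are omitted. It shows that each address tagged with the child namespace is retagged with the parent; it does not show, and this excerpt of the patch does not settle, whether the parent can still see an address while the child owns it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-ins; the kernel structs have many more fields. */
struct net_namespace {
	struct net_namespace *parent;	/* NULL for the root namespace */
};

struct in_ifaddr {
	struct in_ifaddr *ifa_next;
	const struct net_namespace *ifa_net_ns;	/* namespace owning the address */
	unsigned int ifa_address;	/* illustrative value only */
};

/*
 * Hand every address owned by ns back to ns->parent.
 * Returns the number of addresses moved (an addition for this sketch;
 * the function in the patch returns void).
 */
static int release_ifa_to_parent(struct in_ifaddr *list,
				 const struct net_namespace *ns)
{
	struct in_ifaddr *ifa;
	int moved = 0;

	assert(ns->parent != NULL);
	for (ifa = list; ifa; ifa = ifa->ifa_next) {
		if (ifa->ifa_net_ns != ns)
			continue;	/* owned by some other namespace */
		ifa->ifa_net_ns = ns->parent;
		moved++;
	}
	return moved;
}
```

With one l2 namespace, one l3 child, and one address tagged for each, a call on the child moves exactly one address, and a second call moves none.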

> + * @cmd: a "push" / "pull" command
> + * @arg: an userspace buffer containing an ifreq structure
> + * Returns:
> + * - EPERM : if caller has no CAP_NET_ADMIN capabilities or the
> + *   current level of networ

[Fwd: Re: [PATCH 1/10] cxgb3 - main header files]

2007-01-19 Thread Steve Wise

Hey Roland, 

Jeff has pulled in the Chelsio Ethernet driver.  If you are ready to
merge in the RDMA driver, you can pull it from 

git://staging.openfabrics.org/~swise/cxgb3.git for-roland

Thanks,

Steve.


-------- Forwarded Message --------
From: Jeff Garzik <[EMAIL PROTECTED]>
To: Divy Le Ray <[EMAIL PROTECTED]>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
[EMAIL PROTECTED]
Subject: Re: [PATCH 1/10] cxgb3 - main header files
Date: Thu, 18 Jan 2007 22:05:02 -0500

Divy Le Ray wrote:
> Jeff Garzik wrote:
>> Divy Le Ray wrote:
>>> From: Divy Le Ray <[EMAIL PROTECTED]>
>>>
>>> This patch implements the main header files of
>>> the Chelsio T3 network driver.
>>>
>>> Signed-off-by: Divy Le Ray <[EMAIL PROTECTED]>
>>
>> Once you think it's ready, email me a URL to a single patch that adds 
>> the driver to the latest linux-2.6.git kernel.  Include in the email a 
>> description of the driver and signed-off-by line, which will get 
>> directly included in the git changelog.
>>
>> Adding new drivers is a bit special, because we want to merge it as a 
>> single changeset, but that would create a patch too large to review on 
>> the common kernel mailing lists.
> Jeff,
> 
> You can grab the monolithic patch at this URL:
> http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

applied to netdev-2.6.git#upstream

I'm really counting on Chelsio to actively maintain this driver, unlike 
the abandonware you guys first submitted.

Jeff





Re: [Fwd: Re: [PATCH 1/10] cxgb3 - main header files]

2007-01-19 Thread Roland Dreier

 > Jeff has pulled in the Chelsio Ethernet driver.  If you are ready to
 > merge in the RDMA driver, you can pull it from 

Yes, I saw that... OK, I'll get serious about reviewing the RDMA stuff.